Linux Documentation
 help / color / mirror / Atom feed
* Re: [RFC PATCH v2 04/14] kcov: reject enable on multiple dataflow fds simultaneously
From: Alexander Potapenko @ 2026-06-12  7:32 UTC (permalink / raw)
  To: Yunseong Kim
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Andrey Konovalov,
	Dmitry Vyukov, Andrew Morton, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Nathan Chancellor, Nicolas Schier,
	Nick Desaulniers, Bill Wendling, Justin Stitt, Kees Cook,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Jonathan Corbet, Shuah Khan, linux-kernel, kasan-dev,
	rust-for-linux, linux-kbuild, llvm, linux-mm, linux-kselftest,
	workflows, linux-doc, Yeoreum Yun, sashiko-bot
In-Reply-To: <20260611-b4-kcov-dataflow-v2-v2-4-0a261da3987c@est.tech>

On Thu, Jun 11, 2026 at 6:21 PM Yunseong Kim <yunseong.kim@est.tech> wrote:
>
> A task could enable tracing on multiple kcov_dataflow file descriptors,
> corrupting the internal tracking state when one is subsequently closed.
>
> Check current->kcov_df_enabled before allowing KCOV_DF_ENABLE and
> return -EBUSY if already active. This matches kcov's check of
> t->kcov != NULL in the KCOV_ENABLE path.
>
> Reported-by: sashiko-bot <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260603-kcov-dataflow-next-20260603-v2-0-fee0939de2c4%40est.tech
> Signed-off-by: Yunseong Kim <yunseong.kim@est.tech>
> ---
>  kernel/kcov_dataflow.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/kcov_dataflow.c b/kernel/kcov_dataflow.c
> index 5248293280d5..27587b8ceeab 100644
> --- a/kernel/kcov_dataflow.c
> +++ b/kernel/kcov_dataflow.c

Please merge this patch into the one introducing kcov_dataflow.c

^ permalink raw reply

* Re: [RFC PATCH v2 0/6] kcov: per-task dataflow extraction at kernel function boundaries
From: Yunseong Kim @ 2026-06-12  7:33 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Dmitry Vyukov,
	Andrey Konovalov, Andrew Morton, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt, Nicolas Schier,
	Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Jonathan Corbet, Shuah Khan, linux-kernel,
	kasan-dev, llvm, linux-kbuild, rust-for-linux, workflows,
	linux-doc, Yunseong Kim
In-Reply-To: <CAG_fn=W1++qPJWQk1+4MtRfe6n1iUKF2O5pddnqKGwSq85CuqA@mail.gmail.com>

Hi Alexander,

> On Wed, Jun 3, 2026 at 7:43 PM Yunseong Kim <yunseong.kim@est.tech> wrote:
>>
>> Introduces a new KCOV exetened feature that captures function arguments and
>> return values at kernel function boundaries, enabling per-process visibility
>> into runtime dataflow.
> 
> Some high-level comments:
> - Make sure your code can run on every platform supported by kcov (namely ARM64)
> - Check out Sashiko findings:
> https://sashiko.dev/#/patchset/20260603-kcov-dataflow-next-20260603-v2-0-fee0939de2c4%40est.tech,

I handled those parts that seemed problematic from sashiko's review.

> at least some of them seem to make sense
> - Please consolidate changes to the same file into a single patch
> - There seem to be two tools (one in C and one in Python) with
> overlapping functionality, can you keep only one?

I revised this part in v2.

> - The test modules seem to be used only in manual testing. Can you
> convert them to kselftests or remove them?

Thanks again for yout guide, I've updated it to v2.

> - At this point, long dashes in the kernel codebase are quite rare,
> and I don't see a reason to add more.

I checked that the v2 series patchset was removed, using long dashes.

Best regards,
Yunseong


^ permalink raw reply

* Re: [RFC PATCH v2 03/14] kcov: add barriers to recursion guard in kcov_df_write
From: Alexander Potapenko @ 2026-06-12  7:30 UTC (permalink / raw)
  To: Yunseong Kim
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Andrey Konovalov,
	Dmitry Vyukov, Andrew Morton, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Nathan Chancellor, Nicolas Schier,
	Nick Desaulniers, Bill Wendling, Justin Stitt, Kees Cook,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Jonathan Corbet, Shuah Khan, linux-kernel, kasan-dev,
	rust-for-linux, linux-kbuild, llvm, linux-mm, linux-kselftest,
	workflows, linux-doc, Yeoreum Yun
In-Reply-To: <20260611-b4-kcov-dataflow-v2-v2-3-0a261da3987c@est.tech>

On Thu, Jun 11, 2026 at 6:21 PM Yunseong Kim <yunseong.kim@est.tech> wrote:
>
> The recursion guard (bit-31 of kcov_df_seq) prevents reentry when
> copy_from_kernel_nofault() or other called functions are instrumented
> with INSTRUMENT_ALL. Without compiler barriers, the guard set/clear
> can be reordered relative to the function body, making the protection
> ineffective under optimization.
>
> Add barrier() after setting the guard and before clearing it, ensuring
> the compiler does not move instrumented operations outside the guarded
> region.
>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Yunseong Kim <yunseong.kim@est.tech>
> ---
>  kernel/kcov_dataflow.c | 2 ++

Please merge this patch into the one introducing kcov_dataflow.c


>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/kcov_dataflow.c b/kernel/kcov_dataflow.c
> index df7e8bf70bfa..5248293280d5 100644
> --- a/kernel/kcov_dataflow.c
> +++ b/kernel/kcov_dataflow.c
> @@ -86,6 +86,7 @@ kcov_df_write(u64 type_marker, u64 pc, u64 meta, void *ptr,
>         if (t->kcov_df_seq & (1U << 31))
>                 return;
>         t->kcov_df_seq |= (1U << 31);
> +       barrier();

Please make sure barriers have comments explaining which barriers they
pair with (see kernel/kcov.c)

^ permalink raw reply

* Re: [PATCH v3 2/4] mm/zswap: Implement proactive writeback
From: YoungJun Park @ 2026-06-12  7:27 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yosry Ahmed, Hao Jia, Johannes Weiner, mhocko, tj, mkoutny,
	roman.gushchin, Nhat Pham, akpm, chengming.zhou, muchun.song,
	cgroups, linux-mm, linux-kernel, linux-doc, Hao Jia, chrisl,
	kasong, baoquan.he
In-Reply-To: <aisEWnb3pzmVC4dl@linux.dev>

On Thu, Jun 11, 2026 at 12:12:40PM -0700, Shakeel Butt wrote:
> On Thu, Jun 11, 2026 at 05:45:04PM +0000, Yosry Ahmed wrote:
> > On Tue, Jun 09, 2026 at 01:19:13PM +0900, YoungJun Park wrote:
> > > On Mon, Jun 08, 2026 at 03:27:07PM -0700, Yosry Ahmed wrote:
> > > 
> > > +Chris +Kairui +Baoquan
> > > 
> > > Hello
> > > 
> > > Thanks for inviting me to the discussion, Shakeel.
> > > 
> > > > > > > Youngjun is working on swap tiers. At the moment he is more interested in
> > > > > > > allowing a specific swap device to a memcg or not. I can imagine in future there
> > > > > > > will be use-cases where there will be a need to demote data on higher tier swap
> > > > > > > to lower tier swap. What would be the appropriate interface?
> > > 
> > > Speaking of my work on swap tiers, I recently submitted a patch and am
> > > currently considering memcg integration:
> > > https://lore.kernel.org/linux-mm/20260527062247.3440692-1-youngjun.park@lge.com/
> > > 
> > > The future use-cases imagined above seem to align with this
> > > direction. (BTW, I am currently waiting for reviews/feedback from the memcg
> > > folks on this patch. Any reviews would be highly appreciated!)
> > > 
> > > We could potentially assign a target tier
> > > for writeback within the existing memory.zswap.writeback interface. 
> > > 
> > > For instance, '0' could mean disabled, while non-zero values could represent
> > > specific tiers, which would maintain backward compatibility with the current
> > > version. Alternatively, if zswap is treated as the default top tier, 
> > > the `memory.swap.tiers` interface could potentially replace `memory.zswap.writeback`.
> > > 
> > > Furthermore, this could be expanded so that each swap tier can demote data
> > > user-triggered demotion between swap tiers.
> > > 
> > > Based on the current patch's ideas combined with my swap tiers concept:
> > > 
> > > Assuming a hierarchy like:
> > > zswap -> tier1 (SSD swap) -> tier2 (HDD swap) -> tier3 (Network swap)
> > > 
> > > We could configure the active tiers via a setting like `memory.swap.tiers`
> > > (tier2 enabled, tier3 enabled).
> > > 
> > > For example, the concept of `echo "100M zswap_writeback_only > memory.reclaim"`
> > > could be extended. A user could run `echo "100M tier2 > memory.reclaim"`
> > > to explicitly trigger demotion from tier2 to tier3.
> > > (BTW, if we combine these features, my personal preference for the keyword
> > > format would be `<size> <demote_prefix><tier_name>`. I think it would be
> > > better to explicitly indicate that it is a swap demotion by using a specific
> > > prefix followed by the tier name. 
> > > Or make demote prefix another key is also possible)
> > 
> > I am not sure if proactive demotion between swap tiers would be driven
> > by memory.reclaim, I am guessing a new interface might be more suitable.
> > But yes, you are right that it's very possible that
> > 'zswap_writeback_only' with memory.reclaim will become obsolete once
> > swap tiering matures and starts supporting things like proactive
> > demotion.
> > 
> > Part of me wants to wait until the swap tiering interfaces are figured
> > out so that we don't end up with redundant interfaces, but I also don't
> > want to hold Hao's work since it doesn't directly depend on swap
> > tiering.
> However I would need zswap folks (Yosry & Nhat) help in figuring out swap tiers
> interfaces. Zswap is the current top tier swap usage in real world. I want
> zswap users to eaily (and hopefully transparently) migrate to swap tiers.

> > Shakeel, how do you want to handle this? I think there's a few options:
> > 
> > 1. Add zswap_writeback_only now, and when we have swap tiering demotion
> > it becomes a redundant interface, like memory.zswap.writeback -- or
> > maybe we try to deprecate both of them at that point. It's difficult to
> > remove interfaces tho, but maybe easier to stop supporting
> > zswap_writeback_only.
> > 
> > 2. Add zswap_writeback_only behind an experimental config option, to
> > unblock development but have a line of sight to dropping support once we
> > have a swap tiering interface.
> > 
> > 3. Wait until we figure out the swap tiering interfaces and then add
> > the proactive zswap writeback as part of it.
> > 
> > WDYT?
> 
> Is Hao's work needed for some followup work/development? The earliest Hao's
> work can is 7.3, so if we aim to figure out swap tiering interfaces in next
> couple of weeks then option 3 is the way to go. If swap tiers take more time
> then we can discuss other options as well.
> However I would need zswap folks (Yosry & Nhat) help in figuring out swap tiers
> interfaces. Zswap is the current top tier swap usage in real world. I want
> zswap users to eaily (and hopefully transparently) migrate to swap tiers.

I am looking forward to the discussion on this interface!

To help boost the discussion and progress, I would like to share a few of my thoughts.
We could either introduce a new interface to trigger demotion/promotion,
or we could reuse the existing one (using tier just internally)

Based on the memcg interface currently proposed in swap_tier
(memory.swap.tiers, memory.swap.tiers.effective), I think it aligns well
with the current direction. It provides a foundation for selectively
targeting devices in tier order.

To summarize the discussions so far, the following points align well.

- Per-cgroup swap control, as I suggested.
- Proactive zswap writeback (Hao's usecase)
- Swap device target demotion(if it wants selective, then it is more better), as you mentioned:
  https://lore.kernel.org/linux-mm/aicZ-5GX9De3MAU7@linux.dev/
- Virtual Swap on/off in the future, as Nhat mentioned:
  https://lore.kernel.org/linux-mm/20260528212955.1912856-1-nphamcs@gmail.com/
- The memory.zswap.writeback alternative (no hierarchy model conflict)
- zswap is first swap tier.
- Promotion. (Also better for selectve usage)
- tier based swap policy (e.g round-robin...)

To accelerate this work, I believe we should reach a consensus and
merge the currently proposed swap_tier interface :)

If the above approach is difficult, I would like to suggest an
alternative for progress with the memcg interfaces removed:

1) We could make zswap the first tier and create
a use case where memory.zswap.writeback internally is handled by tier logic.

2) Or simply merge the swap_tier infrastructure itself first.

This would allow the swap_tier infrastructure to be merged and discussed
more easily.

If it takes longer to adopt swap_tier anyway, by doing so we progress next step
as a experimental feature.

- Apply per-cgroup swap as an experimental (debugfs) feature.
- Apply Hao's use case experimentally or as it is as Yosry suggested.
(future migration to swap tier)

How do you think?

(FYI: My emails to kernel.org are failing due to internal server issues.)

Thank you 
Youngjun Park

^ permalink raw reply

* Re: [RFC PATCH v2 02/14] kcov: fix INIT_TRACK race in kcov_dataflow
From: Yunseong Kim @ 2026-06-12  7:25 UTC (permalink / raw)
  To: Alexander Potapenko
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Andrey Konovalov,
	Dmitry Vyukov, Andrew Morton, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Nathan Chancellor, Nicolas Schier,
	Nick Desaulniers, Bill Wendling, Justin Stitt, Kees Cook,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Jonathan Corbet, Shuah Khan, linux-kernel, kasan-dev,
	rust-for-linux, linux-kbuild, llvm, linux-mm, linux-kselftest,
	workflows, linux-doc, Yeoreum Yun, sashiko-bot
In-Reply-To: <CAG_fn=V1+_xLgCZgdLnT7Y-muRO0CXkrNKkC8AzrqzWoL4eR8w@mail.gmail.com>

Hi Alexander,

> On Thu, Jun 11, 2026 at 6:21 PM Yunseong Kim <yunseong.kim@est.tech> wrote:
>>
>> [snip...]
>> Reported-by: sashiko-bot <sashiko-bot@kernel.org>
>> Closes: https://sashiko.dev/#/patchset/20260603-kcov-dataflow-next-20260603-v2-0-fee0939de2c4%40est.tech
>> Signed-off-by: Yunseong Kim <yunseong.kim@est.tech>
> 
> Can we please avoid this?
> kcov_dataflow.c is being introduced in the same series, there is no
> need to send a buggy commit and a follow-up fix - just squash the two
> together and note the changes after Signed-off-by: separated by a
> triple dash.

Thank you for your guide. I'll remove it in the next patch set.

Best regards,
Yunseong

^ permalink raw reply

* Re: [PATCH v2 0/4] mm: split the file's i_mmap tree for NUMA
From: Huang Shijie @ 2026-06-12  7:02 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, vbabka, shakeel.butt, rppt, mhocko, corbet, skhan,
	linux, dinguyen, schuster.simon, James.Bottomley, deller, djbw,
	willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei
In-Reply-To: <airY5q_SspdbQDbi@lucifer>

On Thu, Jun 11, 2026 at 05:00:49PM +0100, Lorenzo Stoakes wrote:
> Hi Huang,
> 
> You seem to be replacing the file rmap altogether here, so you really ought
> to have sent this as an RFC so we could discuss it as a community first.
No problem.

> 
> Especially so as Pedro had publicly mentioned his plans to implement
> something similar here, so coordination would have been appreciated.
Yes. I am very happy to work with Pedro.

> 
> Anyway, as Pedro has pointed out, the code is overly complicated, it's far
> too configurable (not always a good thing), and the locking implementation
> is questionable.
I can make the code more simple. :)

> 
> You seem to be adding a whole bunch of open-coded complexity too, which is
> not something we want. Abstraction is key for the rmap.
> 
> You're also not adding any kdoc comments or really many comments at all,
> and you've not added any tests (though perhaps it's difficult given how
> core this is).
Got it.

> 
> So I would suggest that perhaps any respin should be sent as an RFC so we
> can engage in that conversation and ensure we're all on the same page?
> 
> Especially since Pedro plans to send an alternative, simpler, solution I
> believe.
> 
> It's also not helpful that you haven't examined the non-NUMA case :)
> perhaps your particular server behaves a certain way that this approach
> aids, but regresses other NUMA configurations?

emm. I ever hoped someone can help me to test this patch set on the non-NUMA
server.

It seems I should find some non-NUMA server before I send out the patch set. :)

> 
> We'd really need to be sure of this before accepting invasive changes like
> this.
Okay.

Thanks
Huang Shijie


^ permalink raw reply

* Re: [RFC PATCH v2 02/14] kcov: fix INIT_TRACK race in kcov_dataflow
From: Alexander Potapenko @ 2026-06-12  6:55 UTC (permalink / raw)
  To: Yunseong Kim
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak, Andrey Konovalov,
	Dmitry Vyukov, Andrew Morton, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Nathan Chancellor, Nicolas Schier,
	Nick Desaulniers, Bill Wendling, Justin Stitt, Kees Cook,
	David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Shuah Khan, Jonathan Corbet, Shuah Khan, linux-kernel, kasan-dev,
	rust-for-linux, linux-kbuild, llvm, linux-mm, linux-kselftest,
	workflows, linux-doc, Yeoreum Yun, sashiko-bot
In-Reply-To: <20260611-b4-kcov-dataflow-v2-v2-2-0a261da3987c@est.tech>

On Thu, Jun 11, 2026 at 6:21 PM Yunseong Kim <yunseong.kim@est.tech> wrote:
>
> Two threads calling KCOV_DF_INIT_TRACK concurrently could both observe
> df->area == NULL, drop the lock to allocate, and then both assign their
> allocation to df->area, leaking one buffer.
>
> Fix by rechecking df->area after re-acquiring the lock. If another
> thread won the race, free the allocation and return -EBUSY. This
> matches the pattern used by KCOV_INIT_TRACE in kernel/kcov.c.
>
> Reported-by: sashiko-bot <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260603-kcov-dataflow-next-20260603-v2-0-fee0939de2c4%40est.tech
> Signed-off-by: Yunseong Kim <yunseong.kim@est.tech>

Can we please avoid this?
kcov_dataflow.c is being introduced in the same series, there is no
need to send a buggy commit and a follow-up fix - just squash the two
together and note the changes after Signed-off-by: separated by a
triple dash.

^ permalink raw reply

* Re: [PATCH v6 01/12] PCI: liveupdate: Set up FLB handler for the PCI core
From: Mike Rapoport @ 2026-06-12  6:54 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: David Matlack, kexec, linux-doc, linux-kernel, linux-mm,
	linux-pci, Adithya Jayachandran, Alexander Graf, Alex Williamson,
	Bjorn Helgaas, Chris Li, David Rientjes, Jacob Pan,
	Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
	Lukas Wunner, Parav Pandit, Pranjal Shrivastava, Pratyush Yadav,
	Saeed Mahameed, Samiullah Khawaja, Shuah Khan, Vipin Sharma,
	William Tu, Yi Liu
In-Reply-To: <178124130274.908199.14827357870284807134.b4-review@b4>

On Fri, Jun 12, 2026 at 05:15:02AM +0000, Pasha Tatashin wrote:
> On Fri, 22 May 2026 20:23:59 +0000, David Matlack <dmatlack@google.com> wrote:
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 2fb1c75afd16..6c618830cf61 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -20530,6 +20530,16 @@ L:	linux-pci@vger.kernel.org
> >  S:	Supported
> >  F:	Documentation/PCI/pci-error-recovery.rst
> >  
> > +PCI LIVE UPDATE
> > +M:	David Matlack <dmatlack@google.com>
> 
> Please add Pratyush, Mike, and myself so we are notified directly of 
> incoming patches, the same as with other areas where the liveupdate/ 
> tree is specified.

Or we can add PCI liveupdate files to LIVEUPDATE entry.

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v2 3/4] mm/fs: split the file's i_mmap tree
From: Huang Shijie @ 2026-06-12  6:44 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, ljs, vbabka, shakeel.butt, rppt, mhocko, corbet,
	skhan, linux, dinguyen, schuster.simon, James.Bottomley, deller,
	djbw, willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, riel, harry,
	will, brian.ruley, rmk+kernel, dave.anglin, linux-mm, linux-doc,
	linux-kernel, linux-arm-kernel, linux-parisc, linux-fsdevel,
	nvdimm, linux-perf-users, linux-trace-kernel, zhongyuan,
	fangbaoshun, yingzhiwei
In-Reply-To: <aiqFgGbIo1Psy3pI@pedro-suse.lan>

On Thu, Jun 11, 2026 at 12:11:27PM +0100, Pedro Falcato wrote:
> Hi,
> 
> On Thu, Jun 11, 2026 at 02:18:59PM +0800, Huang Shijie wrote:
> > In the UnixBench tests, there is a test "execl" which tests
> > the execve system call.
> >   For example, a Hygon's server has 12 NUMA nodes, and 384 CPUs.
> > When we test our server with "./Run -c 384 execl",
> > the test result is not good enough. The i_mmap locks contended heavily on
> > "libc.so" and "ld.so". The i_mmap tree for "libc.so" can be
> > over 6000 VMAs, all the VMAs can be in different NUMA mode. The insert/remove
> > operations do not run quickly enough.
> 
> I _really_ would have appreciated some coordination here, because I said I was
> going to take a look at it. I have something that I think is much simpler
Okay, no problem. 

I waited for more then a month, I thought you are busy at other
things. So I spent more then a week to finish the patch set v2.


> in practice. These patches are also way too complex to be dropped just before
> the merge window.
> 
> Some comments:
> 
> > 
> >  In order to reduce the competition of the i_mmap lock, this patch does
> > following:
> >    1.) Split the single i_mmap tree into several sibling trees:
> >        Each tree has a lock. The CONFIG_SPLIT_I_MMAP is used to
> >        turn on/off this feature.
> 
> There is no need for a config option. This needs to Just Work.
> 
> >    2.) Introduce a new field "tree_idx" for vm_area_struct to save the
> >        sibling tree index for this VMA.
> 
> This is possibly contentious, but there are holes in vm_area_struct.
> So I think this is fine.
> 
> >    3.) Introduce a new field "vma_count" for address_space.
> >        The new mapping_mapped() will use it.
> >    4.) Rewrite the vma_interval_tree_foreach()
> >    5.) Rewrite the lock functions.	
> > 
> >  After this patch, the VMA insert/remove operations will work faster,
> > and we can get over 400% performance improvement with the above test.
> > 
> > Signed-off-by: Huang Shijie <huangsj@hygon.cn>
> > ---
> >  fs/Kconfig               |   8 ++
> >  fs/hugetlbfs/inode.c     |  20 ++++-
> >  fs/inode.c               |  75 ++++++++++++++++-
> >  include/linux/fs.h       | 174 ++++++++++++++++++++++++++++++++++++++-
> >  include/linux/mm.h       |  80 ++++++++++++++++++
> >  include/linux/mm_types.h |   3 +
> >  mm/internal.h            |   3 +-
> >  mm/mmap.c                |  11 ++-
> >  mm/nommu.c               |  23 ++++--
> >  mm/pagewalk.c            |   2 +-
> >  mm/vma.c                 |  72 +++++++++++-----
> >  mm/vma_init.c            |   3 +
> >  12 files changed, 436 insertions(+), 38 deletions(-)
> > 
> > diff --git a/fs/Kconfig b/fs/Kconfig
> > index 43cb06de297f..e24804f70432 100644
> > --- a/fs/Kconfig
> > +++ b/fs/Kconfig
> > @@ -9,6 +9,14 @@ menu "File systems"
> >  config DCACHE_WORD_ACCESS
> >         bool
> >  
> > +config SPLIT_I_MMAP
> > +	bool "Split the file's i_mmap to several trees"
> > +	default n
> > +	help
> > +	  Split the file's i_mmap to several trees, each tree has a separate
> > +	  lock. This will reduce the lock contention of file's i_mmap tree,
> > +	  but it will cost more memory for per inode.
> > +
> >  config VALIDATE_FS_PARSER
> >  	bool "Validate filesystem parameter description"
> >  	help
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index da5b41ea5bdd..68d8308418dd 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -891,6 +891,23 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
> >   */
> >  static struct lock_class_key hugetlbfs_i_mmap_rwsem_key;
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		lockdep_set_class(&mapping->i_mmap[i].rwsem,
> > +				&hugetlbfs_i_mmap_rwsem_key);
> > +	}
> > +}
> > +#else
> > +static void hugetlbfs_lockdep_set_class(struct address_space *mapping)
> > +{
> > +	lockdep_set_class(&mapping->i_mmap_rwsem, &hugetlbfs_i_mmap_rwsem_key);
> > +}
> > +#endif
> > +
> >  static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> >  					struct mnt_idmap *idmap,
> >  					struct inode *dir,
> > @@ -915,8 +932,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> >  
> >  		inode->i_ino = get_next_ino();
> >  		inode_init_owner(idmap, inode, dir, mode);
> > -		lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
> > -				&hugetlbfs_i_mmap_rwsem_key);
> > +		hugetlbfs_lockdep_set_class(inode->i_mapping);
> >  		inode->i_mapping->a_ops = &hugetlbfs_aops;
> >  		simple_inode_init_ts(inode);
> >  		info->resv_map = resv_map;
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 62c579a0cf7d..cb67ae83f5b3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -214,6 +214,70 @@ static int no_open(struct inode *inode, struct file *file)
> >  	return -ENXIO;
> >  }
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +int split_tree_num;
> > +static int split_tree_align __maybe_unused = 32;
> > +
> > +static void __init init_split_tree_num(void)
> > +{
> > +#ifdef CONFIG_NUMA
> > +	split_tree_num = nr_node_ids;
> > +#else
> > +	split_tree_num = ALIGN(nr_cpu_ids, split_tree_align);
> > +#endif
> > +}
> 
> Again, too configurable. I think you're too stuck up on the NUMA case -

If you do not care about the NUMA. The performance will _NOT_ get improved
in our NUMA server. I had ever tested code which do not care about the NUMA,
and I got a bad performance. Avoid the remote access is a very important
thing for the NUMA server.

> which does not matter for many people - and may actively harm NUMA users. If
> I have a 128 core 2 NUMA node system, what should I shard by?
It is easy to extend the tree number for NUMA. :)

For the 128 core 2 NUMA, we can extend to more trees, such as:
   Two trees for each NUMA node.

> 
> > +
> > +static void free_mapping_i_mmap(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	if (!mapping->i_mmap)
> > +		return;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		kfree(mapping->i_mmap[i]);
> > +
> > +	kfree(mapping->i_mmap);
> > +	mapping->i_mmap = NULL;
> > +}
> > +
> > +static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
> > +{
> > +	struct i_mmap_tree *tree;
> > +	int i;
> > +
> > +	/* The extra one is used as terminator in vma_interval_tree_foreach() */
> > +	mapping->i_mmap = kzalloc(sizeof(tree) * (split_tree_num + 1), gfp);
> > +	if (!mapping->i_mmap)
> > +		return -ENOMEM;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		tree = kzalloc_node(sizeof(*tree), gfp, i);
> > +		if (!tree)
> > +			goto nomem;
> > +
> > +		tree->root = RB_ROOT_CACHED;
> > +		init_rwsem(&tree->rwsem);
> 
> This (as-is) should blow up with lockdep + the locking loops down there.
okay, I will check it later.

thanks a lot.
> 
> > +
> > +		mapping->i_mmap[i] = tree;
> > +	}
> > +	return 0;
> > +nomem:
> > +	free_mapping_i_mmap(mapping);
> > +	return -ENOMEM;
> > +}
> 
> Honestly, it's likely that a simple static array in struct address_space
The array size is not fixed, so we cannot add a static array in address_space.

> suffices. I would not go through the trouble of getting everything very
> tight and NUMA correct.
> 
> > +#else
> > +static int init_mapping_i_mmap(struct address_space *mapping, gfp_t gfp)
> > +{
> > +	mapping->i_mmap = RB_ROOT_CACHED;
> > +	init_rwsem(&mapping->i_mmap_rwsem);
> > +	return 0;
> > +}
> > +
> > +static void free_mapping_i_mmap(struct address_space *mapping) { }
> > +static void __init init_split_tree_num(void) {}
> > +#endif
> > +
> >  /**
> >   * inode_init_always_gfp - perform inode structure initialisation
> >   * @sb: superblock inode belongs to
> > @@ -302,9 +366,14 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp
> >  #endif
> >  	inode->i_flctx = NULL;
> >  
> > -	if (unlikely(security_inode_alloc(inode, gfp)))
> > +	if (init_mapping_i_mmap(mapping, gfp))
> >  		return -ENOMEM;
> >  
> > +	if (unlikely(security_inode_alloc(inode, gfp))) {
> > +		free_mapping_i_mmap(mapping);
> > +		return -ENOMEM;
> > +	}
> > +
> >  	this_cpu_inc(nr_inodes);
> >  
> >  	return 0;
> > @@ -380,6 +449,7 @@ void __destroy_inode(struct inode *inode)
> >  	if (inode->i_default_acl && !is_uncached_acl(inode->i_default_acl))
> >  		posix_acl_release(inode->i_default_acl);
> >  #endif
> > +	free_mapping_i_mmap(&inode->i_data);
> >  	this_cpu_dec(nr_inodes);
> >  }
> >  EXPORT_SYMBOL(__destroy_inode);
> > @@ -480,9 +550,7 @@ EXPORT_SYMBOL(inc_nlink);
> >  static void __address_space_init_once(struct address_space *mapping)
> >  {
> >  	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
> > -	init_rwsem(&mapping->i_mmap_rwsem);
> >  	spin_lock_init(&mapping->i_private_lock);
> > -	mapping->i_mmap = RB_ROOT_CACHED;
> >  }
> >  
> >  void address_space_init_once(struct address_space *mapping)
> > @@ -2619,6 +2687,7 @@ void __init inode_init(void)
> >  					&i_hash_mask,
> >  					0,
> >  					0);
> > +	init_split_tree_num();
> >  }
> >  
> >  void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index cd46615b8f53..f4b3645b61df 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -450,6 +450,25 @@ struct mapping_metadata_bhs {
> >  	struct list_head list;	/* The list of bhs (b_assoc_buffers) */
> >  };
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +/*
> > + * struct i_mmap_tree - A single sibling tree of the file's split i_mmap.
> > + * @root: The red/black interval tree root.
> > + * @rwsem: Protects insert/remove operations on this sibling tree.
> > + * @vma_count: Number of VMAs in this sibling tree.
> > + *
> > + * When CONFIG_SPLIT_I_MMAP is enabled, the file's single i_mmap tree is
> > + * split into split_tree_num sibling trees, each with its own lock. This
> > + * reduces lock contention by allowing concurrent VMA insert/remove
> > + * operations on different sibling trees.
> > + */
> > +struct i_mmap_tree {
> > +	struct rb_root_cached	root;
> > +	struct rw_semaphore	rwsem;
> > +	atomic_t		vma_count;
> 
> I don't see what you need this vma_count for? I get the one in address_space,
> but this one does not seem useful.
For non-NUMA case, we can use it to determine which tree we should put the new
VMA.
Round-robin is not good enough for a dynamic system.

> 
> > +};
> > +#endif
> > +
> >  /**
> >   * struct address_space - Contents of a cacheable, mappable object.
> >   * @host: Owner, either the inode or the block_device.
> > @@ -461,8 +480,13 @@ struct mapping_metadata_bhs {
> >   * @gfp_mask: Memory allocation flags to use for allocating pages.
> >   * @i_mmap_writable: Number of VM_SHARED, VM_MAYWRITE mappings.
> >   * @nr_thps: Number of THPs in the pagecache (non-shmem only).
> > - * @i_mmap: Tree of private and shared mappings.
> > - * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
> > + * @i_mmap: Tree of private and shared mappings. When CONFIG_SPLIT_I_MMAP
> > + *   is enabled, this is an array of split_tree_num struct i_mmap_tree
> > + *   pointers (plus a NULL terminator).
> 
> NULL terminator wastes more memory, so I would really strongly avoid it as
> well.
any better idea?

> 
> > + * @vma_count: Total number of VMAs across all sibling trees (only when
> > + *   CONFIG_SPLIT_I_MMAP is enabled). Used by mapping_mapped().
> > + * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable (only when
> > + *   CONFIG_SPLIT_I_MMAP is disabled; otherwise per-tree rwsem is used).
> 
> So, there are very good reasons why you still need an i_mmap_rwsem protecting
> state, even with split mmap trees. Which I'll go into later.
> 
> >   * @nrpages: Number of page entries, protected by the i_pages lock.
> >   * @writeback_index: Writeback starts here.
> >   * @a_ops: Methods.
> > @@ -480,14 +504,19 @@ struct address_space {
> >  	/* number of thp, only for non-shmem files */
> >  	atomic_t		nr_thps;
> >  #endif
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +	struct i_mmap_tree	**i_mmap;
> > +	atomic_t		vma_count;
> > +#else
> >  	struct rb_root_cached	i_mmap;
> > +	struct rw_semaphore	i_mmap_rwsem;
> > +#endif
> >  	unsigned long		nrpages;
> >  	pgoff_t			writeback_index;
> >  	const struct address_space_operations *a_ops;
> >  	unsigned long		flags;
> >  	errseq_t		wb_err;
> >  	spinlock_t		i_private_lock;
> > -	struct rw_semaphore	i_mmap_rwsem;
> 
> See d3b1a9a778e1 ("fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.")
Got it.
> 
> >  } __attribute__((aligned(sizeof(long)))) __randomize_layout;
> >  	/*
> >  	 * On most architectures that alignment is already the case; but
> > @@ -508,6 +537,133 @@ static inline bool mapping_tagged(const struct address_space *mapping, xa_mark_t
> >  	return xa_marked(&mapping->i_pages, tag);
> >  }
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +static inline int mapping_mapped(const struct address_space *mapping)
> > +{
> > +	return	atomic_read(&mapping->vma_count);
> 
> Now that I think of it, I don't think we need atomic_t, only unsigned long +
> READ_ONCE() suffices. Increments can race just fine, we don't expect any 
> consistency there - if you want consistency you probably hold the i_mmap lock.
> 
okay. I will check it.

> > +}
> > +
> > +static inline void inc_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	atomic_inc(&tree->vma_count);
> > +	atomic_inc(&mapping->vma_count);
> > +}
> > +
> > +static inline void dec_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	atomic_dec(&tree->vma_count);
> > +	atomic_dec(&mapping->vma_count);
> > +}
> 
> This probably shouldn't be in linux/fs.h.
> 
> > +
> > +static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mapping)
> > +{
> > +	return (struct rb_root_cached *)mapping->i_mmap;
> > +}
> > +
> > +static inline void i_mmap_tree_lock_write(struct address_space *mapping,
> > +					struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	down_write(&tree->rwsem);
> > +}
> > +
> > +static inline void i_mmap_tree_unlock_write(struct address_space *mapping,
> > +					struct vm_area_struct *vma)
> > +{
> > +	struct i_mmap_tree *tree = mapping->i_mmap[vma->tree_idx];
> > +
> > +	up_write(&tree->rwsem);
> > +}
> > +
> > +#define i_mmap_lock_write_prepare(mapping)
> > +#define i_mmap_unlock_write_complete(mapping)
> 
> It's unclear to me why you added write_prepare() and write_complete().
> 
> > +
> > +extern int split_tree_num;
> > +static inline void i_mmap_lock_write(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		down_write(&mapping->i_mmap[i]->rwsem);
> 
> Oof, this is an incredibly large hammer. This is basically why I think keeping
> i_mmap_rwsem (in a different form) is required. You do not want to take $nr_cpus
> locks (read _or_ write). For my design, I keep i_mmap_rwsem, but I invert its
> meaning - taking it in write = I'm reading from the tree; taking it in read =
> I'm writing to the tree. This provides some lighter-weight exclusion between
> rmap walks and rmap tree manipulation.
okay, it seem your method is better. I am waiting for your patch.

> 
> _Technically_, you shouldn't need to always take a lock when manipulating the
> tree. A pattern like mnt_hold_writers()/mnt_get_write_access() can probably
> work well. But it may be too complex ATM.
> 
> 
> Also, note that you pretty much do not want i_mmap_lock_write() users after
> the conversion is done.
> 
> > +}
> > +
> > +static inline int i_mmap_trylock_write(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		if (!down_write_trylock(&mapping->i_mmap[i]->rwsem)) {
> > +			while (i--)
> > +				up_write(&mapping->i_mmap[i]->rwsem);
> > +			return 0;
> > +		}
> > +	}
> > +	return 1;
> > +}
> > +
> > +static inline void i_mmap_unlock_write(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		up_write(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline int i_mmap_trylock_read(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		if (!down_read_trylock(&mapping->i_mmap[i]->rwsem)) {
> > +			while (i--)
> > +				up_read(&mapping->i_mmap[i]->rwsem);
> > +			return 0;
> > +		}
> > +	}
> > +	return 1;
> > +}
> > +
> > +static inline void i_mmap_lock_read(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		down_read(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline void i_mmap_unlock_read(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		up_read(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline void i_mmap_assert_locked(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		lockdep_assert_held(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +static inline void i_mmap_assert_write_locked(struct address_space *mapping)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < split_tree_num; i++)
> > +		lockdep_assert_held_write(&mapping->i_mmap[i]->rwsem);
> > +}
> > +
> > +#else
> > +
> >  static inline void i_mmap_lock_write(struct address_space *mapping)
> >  {
> >  	down_write(&mapping->i_mmap_rwsem);
> > @@ -561,6 +717,18 @@ static inline struct rb_root_cached *get_i_mmap_root(struct address_space *mappi
> >  	return &mapping->i_mmap;
> >  }
> >  
> > +static inline void inc_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma) { }
> > +static inline void dec_mapping_vma(struct address_space *mapping,
> > +				struct vm_area_struct *vma) { }
> > +
> > +#define i_mmap_lock_write_prepare(mapping)	i_mmap_lock_write(mapping)
> > +#define i_mmap_unlock_write_complete(mapping)	i_mmap_unlock_write(mapping)
> > +#define i_mmap_tree_lock_write(mapping, vma)
> > +#define i_mmap_tree_unlock_write(mapping, vma)
> > +
> > +#endif
> > +
> >  /*
> >   * Might pages of this file have been modified in userspace?
> >   * Note that i_mmap_writable counts all VM_SHARED, VM_MAYWRITE vmas: do_mmap
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0a45c6a8b9f2..9aa8119fa9bf 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4041,11 +4041,91 @@ struct vm_area_struct *vma_interval_tree_iter_first(struct rb_root_cached *root,
> >  struct vm_area_struct *vma_interval_tree_iter_next(struct vm_area_struct *node,
> >  				unsigned long start, unsigned long last);
> >  
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +extern int split_tree_num;
> > +
> > +static inline int smallest_tree_idx(struct file *file)
> > +{
> > +	struct address_space *mapping = file->f_mapping;
> > +	int tmp = INT_MAX, count;
> > +	int i, j = 0;
> > +
> > +	/*
> > +	 * Since a not 100% accurate value is still okay,
> > +	 * we do not need any lock here.
> > +	 */
> > +	for (i = 0; i < split_tree_num; i++) {
> > +		count = atomic_read(&mapping->i_mmap[i]->vma_count);
> > +		if (count < tmp) {
> > +			j = i;
> > +			tmp = count;
> > +			if (!tmp)
> > +				break;
> > +		}
> > +	}
> 
> Ohh, I see why you want the per-subtree vma_count now. But is this a net-win?
It keep the trees as even as possible.

> I think doing something like vma-pointer-hashing or just smp_processor_id()
> would work a-ok.
> 
> > +	return j;
> > +}
> > +
> > +static inline void vma_set_tree_idx(struct vm_area_struct *vma)
> > +{
> > +#ifdef CONFIG_NUMA
> > +	vma->tree_idx = numa_node_id();
> > +#else
> > +	vma->tree_idx = smallest_tree_idx(vma->vm_file);
> > +#endif
> > +}
> > +
> > +static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
> > +					struct address_space *mapping)
> > +{
> > +	return &mapping->i_mmap[vma->tree_idx]->root;
> > +}
> > +
> > +/* Find the first valid VMA in the sibling trees */
> > +static inline struct vm_area_struct *first_vma(struct i_mmap_tree ***__r,
> > +				unsigned long start, unsigned long last)
> > +{
> > +	struct vm_area_struct *vma = NULL;
> > +	struct i_mmap_tree **tree = *__r;
> > +	struct rb_root_cached *root;
> > +
> > +	while (*tree) {
> > +		root = &(*tree)->root;
> > +		tree++;
> > +		vma = vma_interval_tree_iter_first(root, start, last);
> > +		if (vma)
> > +			break;
> > +	}
> > +
> > +	/* Save for the next loop */
> > +	*__r = tree;
> > +	return vma;
> > +}
> > +
> > +/*
> > + * Please use get_i_mmap_root() to get the @root.
> > + * @_tmp is referenced to avoid unused variable warning.
> > + */
> > +#define vma_interval_tree_foreach(vma, root, start, last)		\
> > +	for (struct i_mmap_tree **_r = (struct i_mmap_tree **)(root),	\
> > +		**_tmp = (vma = first_vma(&_r, start, last)) ? _r : NULL;\
> > +	     ((_tmp && vma) || (vma = first_vma(&_r, start, last)));	\
> > +		vma = vma_interval_tree_iter_next(vma, start, last))
> > +#else
> >  /* Please use get_i_mmap_root() to get the @root */
> >  #define vma_interval_tree_foreach(vma, root, start, last)		\
> >  	for (vma = vma_interval_tree_iter_first(root, start, last);	\
> >  	     vma; vma = vma_interval_tree_iter_next(vma, start, last))
> >  
> > +static inline void vma_set_tree_idx(struct vm_area_struct *vma) { }
> > +
> > +static inline struct rb_root_cached *get_rb_root(struct vm_area_struct *vma,
> > +					struct address_space *mapping)
> > +{
> > +	return &mapping->i_mmap;
> > +}
> > +#endif
> > +
> >  void anon_vma_interval_tree_insert(struct anon_vma_chain *node,
> >  				   struct rb_root_cached *root);
> >  void anon_vma_interval_tree_remove(struct anon_vma_chain *node,
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index a308e2c23b82..8d6aab3346ce 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1072,6 +1072,9 @@ struct vm_area_struct {
> >  #ifdef __HAVE_PFNMAP_TRACKING
> >  	struct pfnmap_track_ctx *pfnmap_track_ctx;
> >  #endif
> > +#ifdef CONFIG_SPLIT_I_MMAP
> > +	int tree_idx;			/* The sibling tree index for the VMA */
> > +#endif
> 
> FTR the struct hole isn't here, but right after vm_lock_seq or vm_refcnt in
> most configs.
okay, thanks.
I did not notice the struct hole issue.
> 
> >  } __randomize_layout;
> >  
> >  /* Clears all bits in the VMA flags bitmap, non-atomically. */
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 5a2ddcf68e0b..2d35cacffd19 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1888,7 +1888,8 @@ static inline void maybe_rmap_unlock_action(struct vm_area_struct *vma,
> >  
> >  	VM_WARN_ON_ONCE(vma_is_anonymous(vma));
> >  	file = vma->vm_file;
> > -	i_mmap_unlock_write(file->f_mapping);
> > +	i_mmap_tree_unlock_write(file->f_mapping, vma);
> > +	i_mmap_unlock_write_complete(file->f_mapping);
> >  	action->hide_from_rmap_until_complete = false;
> >  }
> >  
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index d714fdb357e5..70036ec9dcaa 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1825,15 +1825,20 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> >  			struct address_space *mapping = file->f_mapping;
> >  
> >  			get_file(file);
> > -			i_mmap_lock_write(mapping);
> > +			i_mmap_lock_write_prepare(mapping);
> > +			i_mmap_tree_lock_write(mapping, mpnt);
> > +
> >  			if (vma_is_shared_maywrite(tmp))
> >  				mapping_allow_writable(mapping);
> >  			flush_dcache_mmap_lock(mapping);
> >  			/* insert tmp into the share list, just after mpnt */
> >  			vma_interval_tree_insert_after(tmp, mpnt,
> > -					get_i_mmap_root(mapping));
> > +					get_rb_root(mpnt, mapping));
> > +			inc_mapping_vma(mapping, tmp);
> 
> Honestly, would prefer to hide all of these details from mmap.
yes, we can. 

But we need to change the functions in mm/interval_tree.c

> 
> >  			flush_dcache_mmap_unlock(mapping);
> > -			i_mmap_unlock_write(mapping);
> > +
> > +			i_mmap_tree_unlock_write(mapping, mpnt);
> > +			i_mmap_unlock_write_complete(mapping);
> >  		}
> >  
> >  		if (!(tmp->vm_flags & VM_WIPEONFORK))
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 0f18ffc658e9..1f2c60a220f6 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -567,11 +567,16 @@ static void setup_vma_to_mm(struct vm_area_struct *vma, struct mm_struct *mm)
> >  	if (vma->vm_file) {
> >  		struct address_space *mapping = vma->vm_file->f_mapping;
> >  
> > -		i_mmap_lock_write(mapping);
> > +		i_mmap_lock_write_prepare(mapping);
> > +		i_mmap_tree_lock_write(mapping, vma);
> > +
> >  		flush_dcache_mmap_lock(mapping);
> > -		vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
> > +		vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
> > +		inc_mapping_vma(mapping, vma);
> >  		flush_dcache_mmap_unlock(mapping);
> > -		i_mmap_unlock_write(mapping);
> > +
> > +		i_mmap_tree_unlock_write(mapping, vma);
> > +		i_mmap_unlock_write_complete(mapping);
> >  	}
> >  }
> >  
> > @@ -583,11 +588,16 @@ static void cleanup_vma_from_mm(struct vm_area_struct *vma)
> >  		struct address_space *mapping;
> >  		mapping = vma->vm_file->f_mapping;
> >  
> > -		i_mmap_lock_write(mapping);
> > +		i_mmap_lock_write_prepare(mapping);
> > +		i_mmap_tree_lock_write(mapping, vma);
> > +
> >  		flush_dcache_mmap_lock(mapping);
> > -		vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
> > +		vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
> > +		dec_mapping_vma(mapping, vma);
> >  		flush_dcache_mmap_unlock(mapping);
> > -		i_mmap_unlock_write(mapping);
> > +
> > +		i_mmap_tree_unlock_write(mapping, vma);
> > +		i_mmap_unlock_write_complete(mapping);
> >  	}
> >  }
> >  
> > @@ -1063,6 +1073,7 @@ unsigned long do_mmap(struct file *file,
> >  	if (file) {
> >  		region->vm_file = get_file(file);
> >  		vma->vm_file = get_file(file);
> > +		vma_set_tree_idx(vma);
> 
> This is unrelated, shouldn't be done here.
> 
> >  	}
> >  
> >  	down_write(&nommu_region_sem);
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index 8df1b5077951..d5745519d95a 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -809,7 +809,7 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
> >  	if (!check_ops_safe(ops))
> >  		return -EINVAL;
> >  
> > -	lockdep_assert_held(&mapping->i_mmap_rwsem);
> > +	i_mmap_assert_locked(mapping);
> 
> This kind of conversion should be done in a separate step.
> 
> >  	vma_interval_tree_foreach(vma, get_i_mmap_root(mapping), first_index,
> >  				  first_index + nr - 1) {
> >  		/* Clip to the vma */
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 6159650c1b42..2055758064a9 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -234,22 +234,23 @@ static void __vma_link_file(struct vm_area_struct *vma,
> >  		mapping_allow_writable(mapping);
> >  
> >  	flush_dcache_mmap_lock(mapping);
> > -	vma_interval_tree_insert(vma, get_i_mmap_root(mapping));
> > +	vma_interval_tree_insert(vma, get_rb_root(vma, mapping));
> > +	inc_mapping_vma(mapping, vma);
> 
> inc_mapping_vma() should probably be done implicitly by insertion?
Yes, we can. 
It is more grace to hide it in vma_interval_tree_insert.

> 
> >  	flush_dcache_mmap_unlock(mapping);
> >  }
> >  
> > -/*
> > - * Requires inode->i_mapping->i_mmap_rwsem
> > - */
> >  static void __remove_shared_vm_struct(struct vm_area_struct *vma,
> >  				      struct address_space *mapping)
> >  {
> > +	i_mmap_tree_lock_write(mapping, vma);
> >  	if (vma_is_shared_maywrite(vma))
> >  		mapping_unmap_writable(mapping);
> >  
> >  	flush_dcache_mmap_lock(mapping);
> > -	vma_interval_tree_remove(vma, get_i_mmap_root(mapping));
> > +	vma_interval_tree_remove(vma, get_rb_root(vma, mapping));
> > +	dec_mapping_vma(mapping, vma);
> >  	flush_dcache_mmap_unlock(mapping);
> > +	i_mmap_tree_unlock_write(mapping, vma);
> >  }
> >  
> >  /*
> > @@ -297,8 +298,9 @@ static void vma_prepare(struct vma_prepare *vp)
> >  			uprobe_munmap(vp->adj_next, vp->adj_next->vm_start,
> >  				      vp->adj_next->vm_end);
> >  
> > -		i_mmap_lock_write(vp->mapping);
> > +		i_mmap_lock_write_prepare(vp->mapping);
> >  		if (vp->insert && vp->insert->vm_file) {
> > +			i_mmap_tree_lock_write(vp->mapping, vp->insert);
> >  			/*
> >  			 * Put into interval tree now, so instantiated pages
> >  			 * are visible to arm/parisc __flush_dcache_page
> > @@ -307,6 +309,7 @@ static void vma_prepare(struct vma_prepare *vp)
> >  			 */
> >  			__vma_link_file(vp->insert,
> >  					vp->insert->vm_file->f_mapping);
> > +			i_mmap_tree_unlock_write(vp->mapping, vp->insert);
> >  		}
> >  	}
> >  
> > @@ -318,12 +321,17 @@ static void vma_prepare(struct vma_prepare *vp)
> >  	}
> >  
> >  	if (vp->file) {
> > +		i_mmap_tree_lock_write(vp->mapping, vp->vma);
> >  		flush_dcache_mmap_lock(vp->mapping);
> >  		vma_interval_tree_remove(vp->vma,
> > -					get_i_mmap_root(vp->mapping));
> > -		if (vp->adj_next)
> > +					get_rb_root(vp->vma, vp->mapping));
> > +		dec_mapping_vma(vp->mapping, vp->vma);
> > +		if (vp->adj_next) {
> > +			i_mmap_tree_lock_write(vp->mapping, vp->adj_next);
> >  			vma_interval_tree_remove(vp->adj_next,
> > -					get_i_mmap_root(vp->mapping));
> > +					get_rb_root(vp->adj_next, vp->mapping));
> > +			dec_mapping_vma(vp->mapping, vp->adj_next);
> > +		}
> >  	}
> >  
> >  }
> > @@ -340,12 +348,17 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> >  			 struct mm_struct *mm)
> >  {
> >  	if (vp->file) {
> > -		if (vp->adj_next)
> > +		if (vp->adj_next) {
> >  			vma_interval_tree_insert(vp->adj_next,
> > -					get_i_mmap_root(vp->mapping));
> > +					get_rb_root(vp->adj_next, vp->mapping));
> > +			inc_mapping_vma(vp->mapping, vp->adj_next);
> > +			i_mmap_tree_unlock_write(vp->mapping, vp->adj_next);
> > +		}
> >  		vma_interval_tree_insert(vp->vma,
> > -					get_i_mmap_root(vp->mapping));
> > +					get_rb_root(vp->vma, vp->mapping));
> > +		inc_mapping_vma(vp->mapping, vp->vma);
> >  		flush_dcache_mmap_unlock(vp->mapping);
> > +		i_mmap_tree_unlock_write(vp->mapping, vp->vma);
> >  	}
> >  
> >  	if (vp->remove && vp->file) {
> > @@ -370,7 +383,7 @@ static void vma_complete(struct vma_prepare *vp, struct vma_iterator *vmi,
> >  	}
> >  
> >  	if (vp->file) {
> > -		i_mmap_unlock_write(vp->mapping);
> > +		i_mmap_unlock_write_complete(vp->mapping);
> >  
> >  		if (!vp->skip_vma_uprobe) {
> >  			uprobe_mmap(vp->vma);
> > @@ -1799,12 +1812,12 @@ static void unlink_file_vma_batch_process(struct unlink_vma_file_batch *vb)
> >  	int i;
> >  
> >  	mapping = vb->vmas[0]->vm_file->f_mapping;
> > -	i_mmap_lock_write(mapping);
> > +	i_mmap_lock_write_prepare(mapping);
> >  	for (i = 0; i < vb->count; i++) {
> >  		VM_WARN_ON_ONCE(vb->vmas[i]->vm_file->f_mapping != mapping);
> >  		__remove_shared_vm_struct(vb->vmas[i], mapping);
> >  	}
> > -	i_mmap_unlock_write(mapping);
> > +	i_mmap_unlock_write_complete(mapping);
> >  
> >  	unlink_file_vma_batch_init(vb);
> >  }
> > @@ -1836,10 +1849,13 @@ static void vma_link_file(struct vm_area_struct *vma, bool hold_rmap_lock)
> >  
> >  	if (file) {
> >  		mapping = file->f_mapping;
> > -		i_mmap_lock_write(mapping);
> > +		i_mmap_lock_write_prepare(mapping);
> > +		i_mmap_tree_lock_write(mapping, vma);
> >  		__vma_link_file(vma, mapping);
> > -		if (!hold_rmap_lock)
> > -			i_mmap_unlock_write(mapping);
> > +		if (!hold_rmap_lock) {
> > +			i_mmap_tree_unlock_write(mapping, vma);
> > +			i_mmap_unlock_write_complete(mapping);
> > +		}
> >  	}
> >  }
> >  
> > @@ -2164,6 +2180,23 @@ static void vm_lock_anon_vma(struct mm_struct *mm, struct anon_vma *anon_vma)
> >  	}
> >  }
> 
> I can but hope that all of the above is quite simplified before we get to the
> "making file rmap more complicated" bit.
:(
If we can do not care about the ARM device, we can make it simple.

Thanks
Huang Shijie


^ permalink raw reply

* Re: [PATCH v2 1/4] mm: use mapping_mapped to simplify the code
From: Huang Shijie @ 2026-06-12  6:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, viro, brauner, jack, muchun.song, osalvador, david, surenb,
	mjguzik, liam, vbabka, shakeel.butt, rppt, mhocko, corbet, skhan,
	linux, dinguyen, schuster.simon, James.Bottomley, deller, djbw,
	willy, peterz, mingo, acme, namhyung, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, james.clark,
	mhiramat, oleg, ziy, baolin.wang, npache, ryan.roberts, dev.jain,
	baohua, lance.yang, linmiaohe, nao.horiguchi, jannh, pfalcato,
	riel, harry, will, brian.ruley, rmk+kernel, dave.anglin, linux-mm,
	linux-doc, linux-kernel, linux-arm-kernel, linux-parisc,
	linux-fsdevel, nvdimm, linux-perf-users, linux-trace-kernel,
	zhongyuan, fangbaoshun, yingzhiwei
In-Reply-To: <airZn524Ip8VsWra@lucifer>

Hi Lorenzo & Pedro,
On Thu, Jun 11, 2026 at 04:52:54PM +0100, Lorenzo Stoakes wrote:
> On Thu, Jun 11, 2026 at 02:18:57PM +0800, Huang Shijie wrote:
> > Use mapping_mapped() to simplify the code, make
> > the code tidy and clean.
> >
> > Signed-off-by: Huang Shijie <huangsj@hygon.cn>
> 
> Yeah as Pedro said this one could just be sent separately, and I in fact
> suggest you do that :) So:
> 
Thank you Pedro and Lorenzo.
I can send a separate patch later.

Thanks
Huang Shijie


^ permalink raw reply

* Re: [PATCH v6 01/12] PCI: liveupdate: Set up FLB handler for the PCI core
From: Pasha Tatashin @ 2026-06-12  5:15 UTC (permalink / raw)
  To: David Matlack
  Cc: kexec, linux-doc, linux-kernel, linux-mm, linux-pci,
	Adithya Jayachandran, Alexander Graf, Alex Williamson,
	Bjorn Helgaas, Chris Li, David Rientjes, Jacob Pan,
	Jason Gunthorpe, Jonathan Corbet, Josh Hilke, Leon Romanovsky,
	Lukas Wunner, Mike Rapoport, Parav Pandit, Pasha Tatashin,
	Pranjal Shrivastava, Pratyush Yadav, Saeed Mahameed,
	Samiullah Khawaja, Shuah Khan, Vipin Sharma, William Tu, Yi Liu
In-Reply-To: <20260522202410.3104264-2-dmatlack@google.com>

On Fri, 22 May 2026 20:23:59 +0000, David Matlack <dmatlack@google.com> wrote:
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2fb1c75afd16..6c618830cf61 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -20530,6 +20530,16 @@ L:	linux-pci@vger.kernel.org
>  S:	Supported
>  F:	Documentation/PCI/pci-error-recovery.rst
>  
> +PCI LIVE UPDATE
> +M:	David Matlack <dmatlack@google.com>

Please add Pratyush, Mike, and myself so we are notified directly of 
incoming patches, the same as with other areas where the liveupdate/ 
tree is specified.

>
> diff --git a/drivers/pci/liveupdate.c b/drivers/pci/liveupdate.c
> new file mode 100644
> index 000000000000..737e7b9366db
> --- /dev/null
> +++ b/drivers/pci/liveupdate.c
> @@ -0,0 +1,145 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright (c) 2026, Google LLC.
> + * David Matlack <dmatlack@google.com>
> + */
> +
> +/**
> + * DOC: PCI Live Update
> + *
> + * The PCI subsystem participates in the Live Update process to enable drivers
> + * to preserve their PCI devices across kexec.
> + *
> + * File-Lifecycle-Bound (FLB) Data
> + * ===============================

...

> + *
> + * PCI device preservation across Live Update is built on top of the Live Update
> + * Orchestrator's (LUO) support for file preservation across kexec. Drivers

I prefer to just use acronyms FLB, and LUO, but have links to the actual 
documentations about them.

So, something like this:

  * :ref:`FLB <flb>` Data
  * =====================
  *
  * PCI device preservation across Live Update is built on top of the
  * :ref:`LUO <luo>` support for file preservation across kexec. Drivers

And also add _luo and _flb to Documentation/core-api/liveupdate.rst

.. _luo:

 ========================
 Live Update Orchestrator
 ========================

.. _flb:

 LUO File Lifecycle Bound Global Data
 ====================================

> [ ... skip 17 lines ... ]
> + *
> + *  * ``pci_liveupdate_register_flb(driver_file_handler)``
> + *  * ``pci_liveupdate_unregister_flb(driver_file_handler)``
> + */
> +
> +#define pr_fmt(fmt) "PCI: liveupdate: " fmt

Nit, may be:

> +
> +#include <linux/io.h>
> +#include <linux/kexec_handover.h>
> +#include <linux/kho/abi/pci.h>
> +#include <linux/liveupdate.h>
> +#include <linux/mutex.h>
> +#include <linux/mm.h>

Please sort alphabetically.

> [ ... skip 12 lines ... ]
> +	 * future to increase the chances that there is enough room to preserve
> +	 * devices that are not yet present on the system (e.g. VFs, hot-plugged
> +	 * devices).
> +	 */
> +	for_each_pci_dev(dev)
> +		max_nr_devices++;

I think, we want to use kho_block [1] (it is in liveupdate/next branch) 
to allow number of supported devices to be dynamic.

To support this, we would redefine the ABI and tracking structures like 
so:

/* include/linux/kho/abi/pci.h */
struct pci_ser {
	u64 devices;      /* Phys address of the first block header of kho_block_set */
	u64 nr_devices;   /* Total count of active preserved devices */
} __packed;

/* drivers/pci/liveupdate.c */
struct pci_flb_outgoing {
	struct pci_ser *ser;            /* Points to the FDT/KHO-allocated ABI struct */
	struct kho_block_set block_set;  /* Controls the active blocks on the fly */
};

In  __pci_liveupdate_preserve_device() , we would search for 
and reuse any inactive  pci_dev_ser  slot first, and only call 
kho_block_set_grow() to expand if no inactive slots are available.

In pci_liveupdate_unpreserve_device(), we would simply 
mark the  pci_dev_ser as inactive.

>
> diff --git a/include/linux/pci_liveupdate.h b/include/linux/pci_liveupdate.h
> new file mode 100644
> index 000000000000..8ec98beefcb4
> --- /dev/null
> +++ b/include/linux/pci_liveupdate.h
> @@ -0,0 +1,30 @@
> [ ... skip 24 lines ... ]
> +static inline void pci_liveupdate_unregister_flb(struct liveupdate_file_handler *fh)
> +{
> +}
> +#endif
> +
> +#endif /* LINUX_PCI_LIVEUPDATE_H */

[1] https://lore.kernel.org/all/20260603154402.468928-1-pasha.tatashin@soleen.com/

Preserving: In  __pci_liveupdate_preserve_device() , we would search for 
Unpreserving: In  pci_liveupdate_unpreserve_device(), we would simply

Preserving: In  __pci_liveupdate_preserve_device() , we would search for 
Unpreserving: In  pci_liveupdate_unpreserve_device(), we would simply 

-- 
Pasha Tatashin <pasha.tatashin@soleen.com>

^ permalink raw reply

* [RFC PATCH 2/2] kasan: hw_tags: Add boot option to elide free time poisoning
From: Dev Jain @ 2026-06-12  4:44 UTC (permalink / raw)
  To: ryabinin.a.a, akpm, corbet
  Cc: Dev Jain, glider, andreyknvl, dvyukov, vincenzo.frascino,
	kasan-dev, linux-mm, linux-kernel, skhan, workflows, linux-doc,
	linux-arm-kernel, ryan.roberts, anshuman.khandual, kaleshsingh,
	21cnbao, david, will, catalin.marinas
In-Reply-To: <20260612044425.763060-1-dev.jain@arm.com>

Introduce a boot option to tag only at allocation time of the objects. This
reduces KASAN MTE overhead, the tradeoff being reduced ability
of catching bugs.

Now, when a memory object will be freed, it will retain the random tag it
had at allocation time. This compromises on catching UAF bugs, till the
time the object is not reallocated.

Hence, not catching "use-after-free-before-reallocation" and not catching
"double-free" will be the compromise for reduced KASAN overhead.

Keep this as a boot time feature to prevent building two kernel images.

To implement the feature, we need to effectively render kasan_poison()
redundant for hw tags case, but keep it working in the case where it is
used not in an object-freeing code path, but the redzoning path (which
means, poisoning the tail end of a vmalloc or kmalloc allocation).

We achieve this by overloading the poison values for the hw tags case: we
define the four poison values as 0x0E, 0x1E, 0x2E, 0x3E. In kasan_poison(),
if we arrive with KASAN_SLAB_REDZONE or KASAN_PAGE_REDZONE, do a bitwise
OR on the value of the tag to make it equal to KASAN_TAG_INVALID.

If not, then, if init is true, zero out the memory and bail out.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 Documentation/dev-tools/kasan.rst |  4 +++
 mm/kasan/hw_tags.c                | 43 ++++++++++++++++++++++++++++++-
 mm/kasan/kasan.h                  | 23 ++++++++++++++++-
 3 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/Documentation/dev-tools/kasan.rst b/Documentation/dev-tools/kasan.rst
index 4968b2aa60c80..b0c30584b5062 100644
--- a/Documentation/dev-tools/kasan.rst
+++ b/Documentation/dev-tools/kasan.rst
@@ -146,6 +146,10 @@ disabling KASAN altogether or controlling its features:
 - ``kasan.vmalloc=off`` or ``=on`` disables or enables tagging of vmalloc
   allocations (default: ``on``).
 
+- ``kasan.tag_only_on_alloc=off`` or ``=on`` disables or enables skipping
+  free-time tagging (poisoning) while keeping allocation-time tagging enabled
+  (default: ``off``).
+
 - ``kasan.page_alloc.sample=<sampling interval>`` makes KASAN tag only every
   Nth page_alloc allocation with the order equal or greater than
   ``kasan.page_alloc.sample.order``, where N is the value of the ``sample``
diff --git a/mm/kasan/hw_tags.c b/mm/kasan/hw_tags.c
index c1a2b48808ed7..a392e34d11e3a 100644
--- a/mm/kasan/hw_tags.c
+++ b/mm/kasan/hw_tags.c
@@ -41,9 +41,16 @@ enum kasan_arg_vmalloc {
 	KASAN_ARG_VMALLOC_ON,
 };
 
+enum kasan_arg_tag_only_on_alloc {
+	KASAN_ARG_TAG_ONLY_ON_ALLOC_DEFAULT,
+	KASAN_ARG_TAG_ONLY_ON_ALLOC_OFF,
+	KASAN_ARG_TAG_ONLY_ON_ALLOC_ON,
+};
+
 static enum kasan_arg kasan_arg __ro_after_init;
 static enum kasan_arg_mode kasan_arg_mode __ro_after_init;
 static enum kasan_arg_vmalloc kasan_arg_vmalloc __initdata;
+static enum kasan_arg_tag_only_on_alloc kasan_arg_tag_only_on_alloc __initdata;
 
 /*
  * Whether the selected mode is synchronous, asynchronous, or asymmetric.
@@ -63,6 +70,10 @@ EXPORT_SYMBOL_GPL(kasan_flag_vmalloc);
 /* Whether to check write accesses only. */
 static bool kasan_flag_write_only = false;
 
+/* Whether to skip free-time tagging. */
+DEFINE_STATIC_KEY_FALSE(kasan_flag_tag_only_on_alloc);
+EXPORT_SYMBOL_GPL(kasan_flag_tag_only_on_alloc);
+
 #define PAGE_ALLOC_SAMPLE_DEFAULT	1
 #define PAGE_ALLOC_SAMPLE_ORDER_DEFAULT	3
 
@@ -154,6 +165,23 @@ static int __init early_kasan_flag_write_only(char *arg)
 }
 early_param("kasan.write_only", early_kasan_flag_write_only);
 
+/* kasan.tag_only_on_alloc=off/on */
+static int __init early_kasan_flag_tag_only_on_alloc(char *arg)
+{
+	if (!arg)
+		return -EINVAL;
+
+	if (!strcmp(arg, "off"))
+		kasan_arg_tag_only_on_alloc = KASAN_ARG_TAG_ONLY_ON_ALLOC_OFF;
+	else if (!strcmp(arg, "on"))
+		kasan_arg_tag_only_on_alloc = KASAN_ARG_TAG_ONLY_ON_ALLOC_ON;
+	else
+		return -EINVAL;
+
+	return 0;
+}
+early_param("kasan.tag_only_on_alloc", early_kasan_flag_tag_only_on_alloc);
+
 static inline const char *kasan_mode_info(void)
 {
 	if (kasan_mode == KASAN_MODE_ASYNC)
@@ -270,14 +298,27 @@ void __init kasan_init_hw_tags(void)
 		break;
 	}
 
+	switch (kasan_arg_tag_only_on_alloc) {
+	case KASAN_ARG_TAG_ONLY_ON_ALLOC_DEFAULT:
+		/* Default is specified by kasan_flag_tag_only_on_alloc. */
+		break;
+	case KASAN_ARG_TAG_ONLY_ON_ALLOC_OFF:
+		static_branch_disable(&kasan_flag_tag_only_on_alloc);
+		break;
+	case KASAN_ARG_TAG_ONLY_ON_ALLOC_ON:
+		static_branch_enable(&kasan_flag_tag_only_on_alloc);
+		break;
+	}
+
 	kasan_init_tags();
 
 	/* KASAN is now initialized, enable it. */
 	kasan_enable();
 
-	pr_info("KernelAddressSanitizer initialized (hw-tags, mode=%s, vmalloc=%s, stacktrace=%s, write_only=%s)\n",
+	pr_info("KernelAddressSanitizer initialized (hw-tags, mode=%s, vmalloc=%s, tag_only_on_alloc=%s, stacktrace=%s, write_only=%s)\n",
 		kasan_mode_info(),
 		str_on_off(kasan_vmalloc_enabled()),
+		str_on_off(kasan_tag_only_on_alloc_enabled()),
 		str_on_off(kasan_stack_collection_enabled()),
 		str_on_off(kasan_flag_write_only));
 }
diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index fc9169a547662..4fa8abb312faa 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -33,6 +33,7 @@ static inline bool kasan_stack_collection_enabled(void)
 #include "../slab.h"
 
 DECLARE_STATIC_KEY_TRUE(kasan_flag_vmalloc);
+DECLARE_STATIC_KEY_FALSE(kasan_flag_tag_only_on_alloc);
 
 enum kasan_mode {
 	KASAN_MODE_SYNC,
@@ -52,6 +53,11 @@ static inline bool kasan_vmalloc_enabled(void)
 	return static_branch_likely(&kasan_flag_vmalloc);
 }
 
+static inline bool kasan_tag_only_on_alloc_enabled(void)
+{
+	return static_branch_unlikely(&kasan_flag_tag_only_on_alloc);
+}
+
 static inline bool kasan_async_fault_possible(void)
 {
 	return kasan_mode == KASAN_MODE_ASYNC || kasan_mode == KASAN_MODE_ASYMM;
@@ -145,12 +151,17 @@ static inline bool kasan_requires_meta(void)
 #define KASAN_SLAB_REDZONE	0xFC  /* redzone for slab object */
 #define KASAN_SLAB_FREE		0xFB  /* freed slab object */
 #define KASAN_VMALLOC_INVALID	0xF8  /* inaccessible space in vmap area */
+#elif defined(CONFIG_KASAN_HW_TAGS)
+#define KASAN_PAGE_FREE		0x0E
+#define KASAN_PAGE_REDZONE	0x1E
+#define KASAN_SLAB_REDZONE	0x2E
+#define KASAN_SLAB_FREE		0x3E
 #else
 #define KASAN_PAGE_FREE		KASAN_TAG_INVALID
 #define KASAN_PAGE_REDZONE	KASAN_TAG_INVALID
 #define KASAN_SLAB_REDZONE	KASAN_TAG_INVALID
 #define KASAN_SLAB_FREE		KASAN_TAG_INVALID
-#define KASAN_VMALLOC_INVALID	KASAN_TAG_INVALID /* only used for SW_TAGS */
+#define KASAN_VMALLOC_INVALID	KASAN_TAG_INVALID
 #endif
 
 #ifdef CONFIG_KASAN_GENERIC
@@ -478,6 +489,16 @@ static inline u8 kasan_random_tag(void) { return 0; }
 
 static inline void kasan_poison(const void *addr, size_t size, u8 value, bool init)
 {
+	if (kasan_tag_only_on_alloc_enabled()) {
+		if ((value != KASAN_SLAB_REDZONE) && (value != KASAN_PAGE_REDZONE)) {
+			if (init)
+				memset((void *)kasan_reset_tag(addr), 0, size);
+			return;
+		}
+	}
+
+	value |= 0xF0;
+
 	if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK))
 		return;
 	if (WARN_ON(size & KASAN_GRANULE_MASK))
-- 
2.43.0


^ permalink raw reply related

* [RFC PATCH 1/2] kasan: hw_tags: Use KASAN_PAGE_REDZONE for vmalloc redzoning
From: Dev Jain @ 2026-06-12  4:44 UTC (permalink / raw)
  To: ryabinin.a.a, akpm, corbet
  Cc: Dev Jain, glider, andreyknvl, dvyukov, vincenzo.frascino,
	kasan-dev, linux-mm, linux-kernel, skhan, workflows, linux-doc,
	linux-arm-kernel, ryan.roberts, anshuman.khandual, kaleshsingh,
	21cnbao, david, will, catalin.marinas
In-Reply-To: <20260612044425.763060-1-dev.jain@arm.com>

In preparation for adding "tag only on alloc" boot time option, use
KASAN_PAGE_REDZONE instead of KASAN_TAG_INVALID for poisoning the tail end
of the vmalloc allocation.

Although both values are the same for hw tags, KASAN_SLAB_REDZONE is used
for poisoning the tail end of a kmalloc object allocation, so maintain
the pattern.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/kasan/hw_tags.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/kasan/hw_tags.c b/mm/kasan/hw_tags.c
index cbef5e450954e..c1a2b48808ed7 100644
--- a/mm/kasan/hw_tags.c
+++ b/mm/kasan/hw_tags.c
@@ -375,7 +375,7 @@ void *__kasan_unpoison_vmalloc(const void *start, unsigned long size,
 	redzone_start = round_up((unsigned long)start + size,
 				 KASAN_GRANULE_SIZE);
 	redzone_size = round_up(redzone_start, PAGE_SIZE) - redzone_start;
-	kasan_poison((void *)redzone_start, redzone_size, KASAN_TAG_INVALID,
+	kasan_poison((void *)redzone_start, redzone_size, KASAN_PAGE_REDZONE,
 		     flags & KASAN_VMALLOC_INIT);
 
 	/*
-- 
2.43.0


^ permalink raw reply related

* [RFC PATCH 0/2] kasan: hw_tags: Add option to tag only at allocation time
From: Dev Jain @ 2026-06-12  4:44 UTC (permalink / raw)
  To: ryabinin.a.a, akpm, corbet
  Cc: Dev Jain, glider, andreyknvl, dvyukov, vincenzo.frascino,
	kasan-dev, linux-mm, linux-kernel, skhan, workflows, linux-doc,
	linux-arm-kernel, ryan.roberts, anshuman.khandual, kaleshsingh,
	21cnbao, david, will, catalin.marinas

Introduce a boot option to tag only at allocation time of the objects. This
reduces KASAN MTE overhead, the tradeoff being reduced ability of
catching bugs.

Now, when a memory object will be freed, it will retain the random tag it
had at allocation time. This compromises on catching UAF bugs, till the
time the object is not reallocated, at which point it will have a new
random tag.

Hence, not catching "use-after-free-before-reallocation" and not catching
"double-free" will be the compromise for reduced KASAN overhead.

This is an RFC because we are not clear about the performance benefit.

Android folks, please help with testing!

---
Applies on Linus master (9716c086c8e8).

Dev Jain (2):
  kasan: hw_tags: Use KASAN_PAGE_REDZONE for vmalloc redzoning
  kasan: hw_tags: Add boot option to elide free time poisoning

 Documentation/dev-tools/kasan.rst |  4 +++
 mm/kasan/hw_tags.c                | 45 +++++++++++++++++++++++++++++--
 mm/kasan/kasan.h                  | 23 +++++++++++++++-
 3 files changed, 69 insertions(+), 3 deletions(-)

-- 
2.43.0


^ permalink raw reply

* Re: [PATCH v2 3/7] seg6: add End.M.GTP6.E behavior
From: Yuya Kusakabe @ 2026-06-12  3:14 UTC (permalink / raw)
  To: andrea
  Cc: Yuya Kusakabe, andrea.mayer, davem, edumazet, dsahern, kuba,
	pabeni, horms, justin.iurman, shuah, corbet, skhan, linux-kernel,
	netdev, linux-kselftest, linux-doc, stefano.salsano, ahabdels
In-Reply-To: <20260605032001.2f46e6a55f69896d29da69df@common-net.org>

From: Yuya Kusakabe <yuya.kusakabe@gmail.com>

Hi Andrea,

Thank you for the review. The points shared with patch 2 (NF_HOOK
split removal, drop reasons via your prep series, reverse christmas
tree, the missing frag_off check, BAD_INNER scoping, the repeated
size-selection ternary, iptunnel_handle_offloads(), the fixed source
port, and the RFC 6040 wording) will be addressed as described in my
patch 2 reply and apply here the same way. Below are the
End.M.GTP6.E-specific points.

> SEG6_LOCAL_MOBILE_SRC_ADDR (the "src" attribute) is copied verbatim into
> the outer IPv6 source address. In patch 2 (End.M.GTP4.E) the same
> attribute is used as a template from which bits are extracted to form
> the IPv4 source address, and may be entirely unused depending on
> v4_mask_len.
> This UAPI overload needs revision.

Agreed. With v4_mask_len gone, End.M.GTP4.E will not take src at all
(the IPv4 SA will be recovered purely from the inbound IPv6 SA, see
the patch 2 reply), which removes the verbatim-vs-template overload.
In the new SEG6_MOBILE_* namespace I plan to give SEG6_MOBILE_SRC_ADDR
a single meaning for the IPv6-emitting behaviors
(End.M.GTP6.E/D/D.Di): the outer IPv6 source address, used verbatim.
The one remaining non-verbatim consumer would be H.M.GTP4.D, where the
configured address acts as the RFC 9433 Figure 12 "Source UPF Prefix"
template with exactly the 32 IPv4 SA bits overlaid at
v6_src_prefix_len. H.M.GTP4.D posts last in the per-behavior order, so
if you prefer the two semantics not to share one attribute name, I can
give the template a distinctly named attribute in that series.

> udp6_set_csum() already handles the CHECKSUM_PARTIAL + pseudo-header seed
> setup and also covers the GSO case. Using it would avoid open-coding this
> sequence.

Will switch to udp6_set_csum(), thanks. It is also more correct than
the open-coded sequence: for a non-GSO inner that arrives
CHECKSUM_PARTIAL it resolves the inner checksum via local checksum
offload instead of clobbering csum_start.

> seg6_lookup_any_nexthop() already calls skb_dst_drop() internally. The
> explicit call above is redundant.

Will remove.

> Nit: fc_dst_len is int in struct fib6_config (IPv6 prefix length, range
> 0..128); the (unsigned int) cast is not needed.

This check will move into the attribute parser of the new explicit
locator-length attribute (see the patch 2 reply), so the fib6_config
peek and the cast both go away.

Thanks,
Yuya

^ permalink raw reply

* [PATCH v4 3/3] hwmon: Add documentation for SQ24860
From: Ziming Zhu @ 2026-06-12  3:03 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	Shuah Khan, linux-hwmon, devicetree, linux-kernel, linux-doc,
	Ziming Zhu
In-Reply-To: <20260612030304.5165-1-zmzhu0630@163.com>

From: Ziming Zhu <ziming.zhu@silergycorp.com>

Document the supported sysfs attributes for the Silergy SQ24860 PMBus
hwmon driver.

Signed-off-by: Ziming Zhu <ziming.zhu@silergycorp.com>
---
 Documentation/hwmon/index.rst   |  1 +
 Documentation/hwmon/sq24860.rst | 96 +++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)
 create mode 100644 Documentation/hwmon/sq24860.rst

diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst
index 8b655e5d6b68..6184b88e2095 100644
--- a/Documentation/hwmon/index.rst
+++ b/Documentation/hwmon/index.rst
@@ -243,6 +243,7 @@ Hardware Monitoring Kernel Drivers
    smsc47m1
    sparx5-temp
    spd5118
+   sq24860
    stpddc60
    surface_fan
    sy7636a-hwmon
diff --git a/Documentation/hwmon/sq24860.rst b/Documentation/hwmon/sq24860.rst
new file mode 100644
index 000000000000..f0182b955d8a
--- /dev/null
+++ b/Documentation/hwmon/sq24860.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Kernel driver sq24860
+=====================
+
+Supported chips:
+
+  * Silergy SQ24860
+
+    Prefix: 'sq24860'
+
+Author:
+
+	Ziming Zhu <ziming.zhu@silergycorp.com>
+
+Description
+------------
+
+This driver implements support for the Silergy SQ24860 eFuse. The device is an
+integrated circuit protection and power management device with a PMBus
+interface.
+
+The device supports direct format for reading input voltage, output voltage,
+auxiliary voltage, input current, input power, and temperature.
+
+The current and power measurement scale depends on the resistor connected
+between the IMON pin and ground. The resistor value can be configured with the
+``silergy,rimon-micro-ohms`` device tree property. See
+``Documentation/devicetree/bindings/hwmon/pmbus/silergy,sq24860.yaml`` for details.
+
+Due to the specificities of the chip, all history reset attributes are tied
+together. Resetting the history of one sensor resets the history of all sensors.
+
+Sysfs entries
+-------------
+
+The following attributes are supported. Limits are read-write; all other
+attributes are read-only.
+
+======================= ======================================================
+in1_label               "vin"
+in1_input               Measured input voltage.
+in1_average             Average measured input voltage.
+in1_min                 Minimum input voltage limit.
+in1_lcrit               Critical low input voltage limit.
+in1_max                 Maximum input voltage limit.
+in1_crit                Critical high input voltage limit.
+in1_min_alarm           Input voltage low warning alarm.
+in1_lcrit_alarm         Input voltage low fault alarm.
+in1_max_alarm           Input voltage high warning alarm.
+in1_crit_alarm          Input voltage high fault alarm.
+in1_highest             Historical maximum input voltage.
+in1_lowest              Historical minimum input voltage.
+in1_reset_history       Write any value to reset history.
+
+in2_label               "vmon"
+in2_input               Measured auxiliary input voltage.
+
+in3_label               "vout1"
+in3_input               Measured output voltage.
+in3_average             Average measured output voltage.
+in3_min                 Minimum output voltage limit.
+in3_min_alarm           Output voltage low alarm.
+in3_lowest              Historical minimum output voltage.
+in3_reset_history       Write any value to reset history.
+
+curr1_label             "iin"
+curr1_input             Measured input current.
+curr1_average           Average measured input current.
+curr1_max               Maximum input current warning limit.
+curr1_crit              Critical input over-current fault limit.
+curr1_max_alarm         Input current warning alarm.
+curr1_crit_alarm        Input over-current fault alarm.
+curr1_highest           Historical maximum input current.
+curr1_reset_history     Write any value to reset history.
+
+power1_label            "pin"
+power1_input            Measured input power.
+power1_average          Average measured input power.
+power1_max              Maximum input power warning limit.
+power1_alarm            Input power warning alarm.
+power1_input_highest    Historical maximum input power.
+power1_reset_history    Write any value to reset history.
+
+temp1_input             Measured temperature.
+temp1_average           Average measured temperature.
+temp1_max               Maximum temperature warning limit.
+temp1_crit              Critical temperature fault limit.
+temp1_max_alarm         Temperature warning alarm.
+temp1_crit_alarm        Temperature fault alarm.
+temp1_highest           Historical maximum temperature.
+temp1_reset_history     Write any value to reset history.
+
+samples                 Number of samples used for average values.
+======================= ======================================================
+
-- 
2.25.1


^ permalink raw reply related

* [PATCH v4 0/3] Add Silergy SQ24860 support
From: Ziming Zhu @ 2026-06-12  3:03 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	Shuah Khan, linux-hwmon, devicetree, linux-kernel, linux-doc,
	Ziming Zhu

From: Ziming Zhu <ziming.zhu@silergycorp.com>

Changes in v4:
- dt-bindings: Collected Reviewed-by tag from Conor Dooley.
- hwmon: pmbus: sq24860: Fixed signedness issue on PMBus limits where
  negative user inputs were silently parsed as large positive unsigned 
  values. Now casting limit values to s16 to properly intercept negative
  bounds.
- hwmon: pmbus: sq24860: Fixed PMBUS_IIN_OC_FAULT_LIMIT handling to 
  silently clamp out-of-range lower limits to the nearest supported
  hardware value (SQ24860_IIN_OCF_OFF) instead of returning -EINVAL, 
  complying with hwmon ABI conventions.
- Fixed function parenthesis alignments reported by checkpatch.

Changes in v3:
- fix remaining checkpatch issues in the SQ24860 driver
- use C comments consistently in the driver
- drop unused header files
- make GIMON a constant in the gain calculation helper
- use proper 64-bit division for the calibration gain calculation
- return -EINVAL when the calculated gain does not fit
- reject PMBUS_IIN_OC_FAULT_LIMIT values outside the hardware range
- treat malformed silergy,rimon-micro-ohms as an error
- sort sq24860 correctly in Documentation/hwmon/index.rst

Ziming Zhu (3):
  dt-bindings: hwmon: pmbus: Add bindings for Silergy SQ24860
  hwmon: pmbus: Add support for Silergy SQ24860
  hwmon: Add documentation for SQ24860

 .../bindings/hwmon/pmbus/silergy,sq24860.yaml |  74 +++
 Documentation/hwmon/index.rst                 |   1 +
 Documentation/hwmon/sq24860.rst               |  96 ++++
 drivers/hwmon/pmbus/Kconfig                   |  19 +
 drivers/hwmon/pmbus/Makefile                  |   1 +
 drivers/hwmon/pmbus/sq24860.c                 | 430 ++++++++++++++++++
 6 files changed, 621 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/hwmon/pmbus/silergy,sq24860.yaml
 create mode 100644 Documentation/hwmon/sq24860.rst
 create mode 100644 drivers/hwmon/pmbus/sq24860.c

-- 
2.25.1


^ permalink raw reply

* [PATCH v4 2/3] hwmon: pmbus: Add support for Silergy SQ24860
From: Ziming Zhu @ 2026-06-12  3:03 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	Shuah Khan, linux-hwmon, devicetree, linux-kernel, linux-doc,
	Ziming Zhu
In-Reply-To: <20260612030304.5165-1-zmzhu0630@163.com>

From: Ziming Zhu <ziming.zhu@silergycorp.com>

Add PMBus hwmon support for the Silergy SQ24860 eFuse.

The driver reports input voltage, output voltage, auxiliary voltage,
input current, input power, and temperature. It also exposes peak,
average, and minimum history attributes, sample count configuration,
and maps the manufacturer-specific VIREF register to the generic input
over-current fault limit attribute.

The IMON resistor value is read from the silergy,rimon-micro-ohms device
property and used to configure the input current calibration gain.

Signed-off-by: Ziming Zhu <ziming.zhu@silergycorp.com>
---
 drivers/hwmon/pmbus/Kconfig   |  19 ++
 drivers/hwmon/pmbus/Makefile  |   1 +
 drivers/hwmon/pmbus/sq24860.c | 430 ++++++++++++++++++++++++++++++++++
 3 files changed, 450 insertions(+)
 create mode 100644 drivers/hwmon/pmbus/sq24860.c

diff --git a/drivers/hwmon/pmbus/Kconfig b/drivers/hwmon/pmbus/Kconfig
index 8f4bff375ecb..a905b5af137c 100644
--- a/drivers/hwmon/pmbus/Kconfig
+++ b/drivers/hwmon/pmbus/Kconfig
@@ -612,6 +612,25 @@ config SENSORS_STEF48H28
 	  This driver can also be built as a module. If so, the module will
 	  be called stef48h28.
 
+config SENSORS_SQ24860
+	tristate "Silergy SQ24860"
+	help
+	  If you say yes here you get hardware monitoring support for Silergy
+	  SQ24860 eFuse.
+
+	  This driver can also be built as a module. If so, the module will
+	  be called sq24860.
+
+config SENSORS_SQ24860_REGULATOR
+	bool "Regulator support for SQ24860"
+	depends on SENSORS_SQ24860 && REGULATOR
+	default SENSORS_SQ24860
+	help
+	  If you say yes here you get regulator support for Silergy SQ24860.
+	  The regulator is registered through the PMBus regulator framework and
+	  can be used to control the output exposed by the device.
+	  This option is only useful if regulator framework support is needed.
+
 config SENSORS_STPDDC60
 	tristate "ST STPDDC60"
 	help
diff --git a/drivers/hwmon/pmbus/Makefile b/drivers/hwmon/pmbus/Makefile
index 7129b62bc00f..86bc93c6c091 100644
--- a/drivers/hwmon/pmbus/Makefile
+++ b/drivers/hwmon/pmbus/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_SENSORS_PM6764TR)	+= pm6764tr.o
 obj-$(CONFIG_SENSORS_PXE1610)	+= pxe1610.o
 obj-$(CONFIG_SENSORS_Q54SJ108A2)	+= q54sj108a2.o
 obj-$(CONFIG_SENSORS_STEF48H28)	+= stef48h28.o
+obj-$(CONFIG_SENSORS_SQ24860)	+= sq24860.o
 obj-$(CONFIG_SENSORS_STPDDC60)	+= stpddc60.o
 obj-$(CONFIG_SENSORS_TDA38640)	+= tda38640.o
 obj-$(CONFIG_SENSORS_TPS25990)	+= tps25990.o
diff --git a/drivers/hwmon/pmbus/sq24860.c b/drivers/hwmon/pmbus/sq24860.c
new file mode 100644
index 000000000000..30202a4b34cf
--- /dev/null
+++ b/drivers/hwmon/pmbus/sq24860.c
@@ -0,0 +1,430 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Author: Ziming Zhu <ziming.zhu@silergycorp.com>
+ */
+
+#include <linux/bitfield.h>
+#include <linux/err.h>
+#include <linux/i2c.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/math64.h>
+
+#include "pmbus.h"
+
+#define SQ24860_IIN_CAL_GAIN		0x38
+#define SQ24860_READ_VAUX		0xd0
+#define SQ24860_READ_VIN_MIN		0xd1
+#define SQ24860_READ_VIN_PEAK		0xd2
+#define SQ24860_READ_IIN_PEAK		0xd4
+#define SQ24860_READ_PIN_PEAK		0xd5
+#define SQ24860_READ_TEMP_AVG		0xd6
+#define SQ24860_READ_TEMP_PEAK		0xd7
+#define SQ24860_READ_VOUT_MIN		0xda
+#define SQ24860_READ_VIN_AVG		0xdc
+#define SQ24860_READ_VOUT_AVG		0xdd
+#define SQ24860_READ_IIN_AVG		0xde
+#define SQ24860_READ_PIN_AVG		0xdf
+#define SQ24860_VIREF			0xe0
+#define SQ24860_PK_MIN_AVG		0xea
+#define PK_MIN_AVG_RST_PEAK		BIT(7)
+#define PK_MIN_AVG_RST_AVG		BIT(6)
+#define PK_MIN_AVG_RST_MIN		BIT(5)
+#define PK_MIN_AVG_AVG_CNT		GENMASK(2, 0)
+#define SQ24860_MFR_WRITE_PROTECT	0xf8
+#define SQ24860_UNLOCKED		BIT(7)
+
+#define SQ24860_8B_SHIFT		2
+#define SQ24860_IIN_OCF_NUM		1000000
+#define SQ24860_IIN_OCF_DIV		129278
+#define SQ24860_IIN_OCF_OFF		165
+
+#define PK_MIN_AVG_RST_MASK		(PK_MIN_AVG_RST_PEAK | \
+					 PK_MIN_AVG_RST_AVG  | \
+					 PK_MIN_AVG_RST_MIN)
+#define SQ24860_MAX_SAMPLES		BIT(FIELD_MAX(PK_MIN_AVG_AVG_CNT))
+/*
+ * Arbitrary default Rimon value: 1.6kOhm
+ */
+#define SQ24860_DEFAULT_RIMON		1600000000
+#define SQ24860_GIMON			18180
+
+#define SQ24860_VAUX_DIV		20
+
+static int sq24860_write_iin_cal_gain(struct i2c_client *client, u32 rimon)
+{
+	u64 temp = 6400ULL * 1000000000ULL * 1000ULL;
+	u64 denom;
+	u64 word;
+
+	if (!rimon)
+		return -EINVAL;
+
+	denom = (u64)rimon * SQ24860_GIMON;
+	word = div64_u64(temp, denom);
+	if (!word || word > U16_MAX)
+		return -EINVAL;
+
+	return i2c_smbus_write_word_data(client, SQ24860_IIN_CAL_GAIN,
+					(u16)word);
+}
+
+static int sq24860_mfr_write_protect_set(struct i2c_client *client,
+					 u8 protect)
+{
+	u8 val;
+
+	switch (protect) {
+	case 0:
+		val = 0xa2;
+		break;
+	case PB_WP_ALL:
+		val = 0x0;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return pmbus_write_byte_data(client, -1, SQ24860_MFR_WRITE_PROTECT,
+				     val);
+}
+
+static int sq24860_mfr_write_protect_get(struct i2c_client *client)
+{
+	int ret = pmbus_read_byte_data(client, -1, SQ24860_MFR_WRITE_PROTECT);
+
+	if (ret < 0)
+		return ret;
+
+	return (ret & SQ24860_UNLOCKED) ? 0 : PB_WP_ALL;
+}
+
+static int sq24860_read_word_data(struct i2c_client *client,
+				  int page, int phase, int reg)
+{
+	int ret;
+
+	switch (reg) {
+	case PMBUS_VIRT_READ_VIN_MAX:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_VIN_PEAK);
+		break;
+
+	case PMBUS_VIRT_READ_VIN_MIN:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_VIN_MIN);
+		break;
+
+	case PMBUS_VIRT_READ_VIN_AVG:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_VIN_AVG);
+		break;
+
+	case PMBUS_VIRT_READ_VOUT_MIN:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_VOUT_MIN);
+		break;
+
+	case PMBUS_VIRT_READ_VOUT_AVG:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_VOUT_AVG);
+		break;
+
+	case PMBUS_VIRT_READ_IIN_AVG:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_IIN_AVG);
+		break;
+
+	case PMBUS_VIRT_READ_IIN_MAX:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_IIN_PEAK);
+		break;
+
+	case PMBUS_VIRT_READ_TEMP_AVG:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_TEMP_AVG);
+		break;
+
+	case PMBUS_VIRT_READ_TEMP_MAX:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_TEMP_PEAK);
+		break;
+
+	case PMBUS_VIRT_READ_PIN_AVG:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_PIN_AVG);
+		break;
+
+	case PMBUS_VIRT_READ_PIN_MAX:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_PIN_PEAK);
+		break;
+
+	case PMBUS_VIRT_READ_VMON:
+		ret = pmbus_read_word_data(client, page, phase,
+					   SQ24860_READ_VAUX);
+		if (ret < 0)
+			break;
+		ret = DIV_ROUND_CLOSEST(ret, SQ24860_VAUX_DIV);
+		break;
+
+	case PMBUS_VIN_UV_WARN_LIMIT:
+	case PMBUS_VIN_UV_FAULT_LIMIT:
+	case PMBUS_VIN_OV_WARN_LIMIT:
+	case PMBUS_VIN_OV_FAULT_LIMIT:
+	case PMBUS_VOUT_UV_WARN_LIMIT:
+	case PMBUS_IIN_OC_WARN_LIMIT:
+	case PMBUS_OT_WARN_LIMIT:
+	case PMBUS_OT_FAULT_LIMIT:
+	case PMBUS_PIN_OP_WARN_LIMIT:
+		/*
+		 * These registers provide an 8 bits value instead of a
+		 * 10bits one. Just shifting twice the register value is
+		 * enough to make the sensor type conversion work, even
+		 * if the datasheet provides different m, b and R for
+		 * those.
+		 */
+		ret = pmbus_read_word_data(client, page, phase, reg);
+		if (ret < 0)
+			break;
+		ret <<= SQ24860_8B_SHIFT;
+		break;
+
+	case PMBUS_IIN_OC_FAULT_LIMIT:
+		/*
+		 * VIREF directly sets the over-current limit at which the eFuse
+		 * will turn the FET off and trigger a fault. Expose it through
+		 * this generic property instead of a manufacturer specific one.
+		 */
+		ret = pmbus_read_byte_data(client, page, SQ24860_VIREF);
+		if (ret < 0)
+			break;
+		ret = DIV_ROUND_CLOSEST(ret * SQ24860_IIN_OCF_NUM,
+					SQ24860_IIN_OCF_DIV);
+		ret += SQ24860_IIN_OCF_OFF;
+		break;
+
+	case PMBUS_VIRT_SAMPLES:
+		ret = pmbus_read_byte_data(client, page, SQ24860_PK_MIN_AVG);
+		if (ret < 0)
+			break;
+		ret = BIT(FIELD_GET(PK_MIN_AVG_AVG_CNT, ret));
+		break;
+
+	case PMBUS_VIRT_RESET_TEMP_HISTORY:
+	case PMBUS_VIRT_RESET_VIN_HISTORY:
+	case PMBUS_VIRT_RESET_IIN_HISTORY:
+	case PMBUS_VIRT_RESET_PIN_HISTORY:
+	case PMBUS_VIRT_RESET_VOUT_HISTORY:
+		ret = 0;
+		break;
+
+	default:
+		ret = -ENODATA;
+		break;
+	}
+
+	return ret;
+}
+
+static int sq24860_write_word_data(struct i2c_client *client,
+				   int page, int reg, u16 value)
+{
+	int ret;
+
+	switch (reg) {
+	case PMBUS_VIN_UV_WARN_LIMIT:
+	case PMBUS_VIN_UV_FAULT_LIMIT:
+	case PMBUS_VIN_OV_WARN_LIMIT:
+	case PMBUS_VIN_OV_FAULT_LIMIT:
+	case PMBUS_VOUT_UV_WARN_LIMIT:
+	case PMBUS_IIN_OC_WARN_LIMIT:
+	case PMBUS_OT_WARN_LIMIT:
+	case PMBUS_OT_FAULT_LIMIT:
+	case PMBUS_PIN_OP_WARN_LIMIT:
+		value = max_t(s16, (s16)value, 0);
+		value >>= SQ24860_8B_SHIFT;
+		value = clamp_val(value, 0, 0xff);
+		ret = pmbus_write_word_data(client, page, reg, value);
+		break;
+
+	case PMBUS_IIN_OC_FAULT_LIMIT:
+		value = max_t(s16, (s16)value, SQ24860_IIN_OCF_OFF);
+		value -= SQ24860_IIN_OCF_OFF;
+		value = DIV_ROUND_CLOSEST(((unsigned int)value) * SQ24860_IIN_OCF_DIV,
+					  SQ24860_IIN_OCF_NUM);
+		value = clamp_val(value, 0, 0x3f);
+		ret = pmbus_write_byte_data(client, page, SQ24860_VIREF, value);
+		break;
+
+	case PMBUS_VIRT_SAMPLES:
+		value = clamp_val(value, 1, SQ24860_MAX_SAMPLES);
+		value = ilog2(value);
+		ret = pmbus_update_byte_data(client, page, SQ24860_PK_MIN_AVG,
+					     PK_MIN_AVG_AVG_CNT,
+					     FIELD_PREP(PK_MIN_AVG_AVG_CNT, value));
+		break;
+
+	case PMBUS_VIRT_RESET_TEMP_HISTORY:
+	case PMBUS_VIRT_RESET_VIN_HISTORY:
+	case PMBUS_VIRT_RESET_IIN_HISTORY:
+	case PMBUS_VIRT_RESET_PIN_HISTORY:
+	case PMBUS_VIRT_RESET_VOUT_HISTORY:
+		/*
+		 * SQ24860 has history resets based on MIN/AVG/PEAK instead of per
+		 * sensor type. Exposing this quirk in hwmon is not desirable so
+		 * reset MIN, AVG and PEAK together. Even is there effectively only
+		 * one reset, which resets everything, expose the 5 entries so
+		 * userspace is not required map a sensor type to another to trigger
+		 * a reset
+		 */
+		ret = pmbus_update_byte_data(client, 0, SQ24860_PK_MIN_AVG,
+					     PK_MIN_AVG_RST_MASK,
+					     PK_MIN_AVG_RST_MASK);
+		break;
+
+	default:
+		ret = -ENODATA;
+		break;
+	}
+
+	return ret;
+}
+
+static int sq24860_read_byte_data(struct i2c_client *client,
+				  int page, int reg)
+{
+	int ret;
+
+	switch (reg) {
+	case PMBUS_WRITE_PROTECT:
+		ret = sq24860_mfr_write_protect_get(client);
+		break;
+
+	default:
+		ret = -ENODATA;
+		break;
+	}
+
+	return ret;
+}
+
+static int sq24860_write_byte_data(struct i2c_client *client,
+				   int page, int reg, u8 byte)
+{
+	int ret;
+
+	switch (reg) {
+	case PMBUS_WRITE_PROTECT:
+		ret = sq24860_mfr_write_protect_set(client, byte);
+		break;
+
+	default:
+		ret = -ENODATA;
+		break;
+	}
+
+	return ret;
+}
+
+#if IS_ENABLED(CONFIG_SENSORS_SQ24860_REGULATOR)
+static const struct regulator_desc sq24860_reg_desc[] = {
+	PMBUS_REGULATOR_ONE_NODE("vout"),
+};
+#endif
+
+static const struct pmbus_driver_info sq24860_base_info = {
+	.pages = 1,
+	.format[PSC_VOLTAGE_IN] = direct,
+	.m[PSC_VOLTAGE_IN] = 64,
+	.b[PSC_VOLTAGE_IN] = 0,
+	.R[PSC_VOLTAGE_IN] = 0,
+	.format[PSC_VOLTAGE_OUT] = direct,
+	.m[PSC_VOLTAGE_OUT] = 64,
+	.b[PSC_VOLTAGE_OUT] = 0,
+	.R[PSC_VOLTAGE_OUT] = 0,
+	.format[PSC_TEMPERATURE] = direct,
+	.m[PSC_TEMPERATURE] = 1,
+	.b[PSC_TEMPERATURE] = 0,
+	.R[PSC_TEMPERATURE] = 0,
+	/*
+	 * Current and power measurements depend on the calibration gain
+	 * programmed from the board-specific IMON resistor value.
+	 */
+	.format[PSC_CURRENT_IN] = direct,
+	.m[PSC_CURRENT_IN] = 16,
+	.b[PSC_CURRENT_IN] = 0,
+	.R[PSC_CURRENT_IN] = 0,
+	.format[PSC_POWER] = direct,
+	.m[PSC_POWER] = 2,
+	.b[PSC_POWER] = 0,
+	.R[PSC_POWER] = 0,
+	.func[0] = PMBUS_HAVE_VIN |
+		   PMBUS_HAVE_VOUT |
+		   PMBUS_HAVE_VMON |
+		   PMBUS_HAVE_IIN |
+		   PMBUS_HAVE_PIN |
+		   PMBUS_HAVE_TEMP |
+		   PMBUS_HAVE_STATUS_VOUT |
+		   PMBUS_HAVE_STATUS_IOUT |
+		   PMBUS_HAVE_STATUS_INPUT |
+		   PMBUS_HAVE_STATUS_TEMP |
+		   PMBUS_HAVE_SAMPLES,
+	.read_word_data = sq24860_read_word_data,
+	.write_word_data = sq24860_write_word_data,
+	.read_byte_data = sq24860_read_byte_data,
+	.write_byte_data = sq24860_write_byte_data,
+
+#if IS_ENABLED(CONFIG_SENSORS_SQ24860_REGULATOR)
+	.reg_desc = sq24860_reg_desc,
+	.num_regulators = ARRAY_SIZE(sq24860_reg_desc),
+#endif
+};
+
+static const struct i2c_device_id sq24860_i2c_id[] = {
+	{ "sq24860" },
+	{}
+};
+MODULE_DEVICE_TABLE(i2c, sq24860_i2c_id);
+
+static const struct of_device_id sq24860_of_match[] = {
+	{ .compatible = "silergy,sq24860" },
+	{}
+};
+MODULE_DEVICE_TABLE(of, sq24860_of_match);
+
+static int sq24860_probe(struct i2c_client *client)
+{
+	struct device *dev = &client->dev;
+	struct pmbus_driver_info *info;
+	u32 rimon;
+	int ret;
+
+	if (device_property_read_u32(dev, "silergy,rimon-micro-ohms", &rimon))
+		rimon = SQ24860_DEFAULT_RIMON;
+	ret = sq24860_write_iin_cal_gain(client, rimon);
+	if (ret < 0)
+		return dev_err_probe(&client->dev, ret,
+					     "Failed to set gain\n");
+	info = devm_kmemdup(dev, &sq24860_base_info, sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
+
+	return pmbus_do_probe(client, info);
+}
+
+static struct i2c_driver sq24860_driver = {
+	.driver = {
+		.name = "sq24860",
+		.of_match_table = sq24860_of_match,
+	},
+	.probe = sq24860_probe,
+	.id_table = sq24860_i2c_id,
+};
+module_i2c_driver(sq24860_driver);
+
+MODULE_AUTHOR("Ziming Zhu <ziming.zhu@silergycorp.com>");
+MODULE_DESCRIPTION("PMBUS driver for SQ24860 eFuse");
+MODULE_LICENSE("GPL");
+MODULE_IMPORT_NS("PMBUS");
-- 
2.25.1


^ permalink raw reply related

* [PATCH v4 1/3] dt-bindings: hwmon: pmbus: Add bindings for Silergy SQ24860
From: Ziming Zhu @ 2026-06-12  3:03 UTC (permalink / raw)
  To: Guenter Roeck
  Cc: Rob Herring, Krzysztof Kozlowski, Conor Dooley, Jonathan Corbet,
	Shuah Khan, linux-hwmon, devicetree, linux-kernel, linux-doc,
	Ziming Zhu, Conor Dooley
In-Reply-To: <20260612030304.5165-1-zmzhu0630@163.com>

From: Ziming Zhu <ziming.zhu@silergycorp.com>

Add devicetree binding documentation for the Silergy SQ24860 eFuse.

The device is a PMBus hardware monitoring device which reports voltage,
current, power, and temperature telemetry. The board-specific IMON
resistor value is described with silergy,rimon-micro-ohms.

Signed-off-by: Ziming Zhu <ziming.zhu@silergycorp.com>

Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
---
 .../bindings/hwmon/pmbus/silergy,sq24860.yaml | 74 +++++++++++++++++++
 1 file changed, 74 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/hwmon/pmbus/silergy,sq24860.yaml

diff --git a/Documentation/devicetree/bindings/hwmon/pmbus/silergy,sq24860.yaml b/Documentation/devicetree/bindings/hwmon/pmbus/silergy,sq24860.yaml
new file mode 100644
index 000000000000..03ef82c11e1a
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/pmbus/silergy,sq24860.yaml
@@ -0,0 +1,74 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+
+$id: http://devicetree.org/schemas/hwmon/pmbus/silergy,sq24860.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Silergy SQ24860 eFuse
+
+maintainers:
+  - Ziming Zhu <ziming.zhu@silergycorp.com>
+
+description:
+  The Silergy SQ24860 is an integrated, high-current circuit protection and
+  power management device with PMBus interface.
+
+properties:
+  compatible:
+    const: silergy,sq24860
+
+  reg:
+    maxItems: 1
+
+  silergy,rimon-micro-ohms:
+    description:
+      Micro-ohms value of the resistance installed between the IMON pin and
+      the ground reference.
+
+  interrupts:
+    description: PMBus SMBAlert interrupt.
+    maxItems: 1
+
+  regulators:
+    type: object
+    description:
+      List of regulators provided by this controller.
+
+    properties:
+      vout:
+        $ref: /schemas/regulator/regulator.yaml#
+        type: object
+        unevaluatedProperties: false
+
+    additionalProperties: false
+
+required:
+  - compatible
+  - reg
+  - silergy,rimon-micro-ohms
+
+additionalProperties: false
+
+examples:
+  - |
+
+    i2c {
+        #address-cells = <1>;
+        #size-cells = <0>;
+
+        hw-monitor@40 {
+            compatible = "silergy,sq24860";
+            reg = <0x40>;
+
+            interrupt-parent = <&gpio>;
+            interrupts = <42 8>;
+            silergy,rimon-micro-ohms = <1600000000>;
+
+            regulators {
+                cpu0_vout: vout {
+                    regulator-name = "main_cpu0";
+                };
+            };
+        };
+    };
-- 
2.25.1


^ permalink raw reply related

* [nsa:xlnx/fix/buf-mmap-multibuffer 27173/27391] htmldocs: Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/i3c/adi,i3c-master.yaml
From: kernel test robot @ 2026-06-12  2:58 UTC (permalink / raw)
  To: Jorge Marques
  Cc: oe-kbuild-all, Nuno Sa, Frank Li, Alexandre Belloni, linux-doc

tree:   https://github.com/nunojsa/linux xlnx/fix/buf-mmap-multibuffer
head:   a26a8baba71e866951f6abf4fc6c0504770c272e
commit: c54362d581209a7421fe0d0caf70ff172289da75 [27173/27391] i3c: master: Add driver for Analog Devices I3C Controller IP
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project f43d6834093b19baf79beda8c0337ab020ac5f17)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260612/202606120453.2rP7tq7s-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202606120453.2rP7tq7s-lkp@intel.com/

All warnings (new ones prefixed by >>):

   from /zdci/src/kernel-tests/bisect-test-build-error.sh:102: main
   Warning: Documentation/devicetree/bindings/iio/adc/adi,ltc2308.yaml references a file that doesn't exist: Documentation/devicetree/bindings/iio/adc/adc.txt
   Warning: Documentation/devicetree/bindings/iio/adc/adi,ltc2308.yaml references a file that doesn't exist: Documentation/devicetree/bindings/iio/adc/adc.txt
   Warning: Documentation/devicetree/bindings/regulator/siliconmitus,sm5703-regulator.yaml references a file that doesn't exist: Documentation/devicetree/bindings/mfd/siliconmitus,sm5703.yaml
   Warning: Documentation/hwmon/g762.rst references a file that doesn't exist: Documentation/devicetree/bindings/hwmon/g762.txt
>> Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/i3c/adi,i3c-master.yaml
   Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt
   Using alabaster theme
--
     from /zdci/src/kernel-tests/bisect-test-build-error.sh:102: main
   Warning: Documentation/devicetree/bindings/iio/adc/adi,ltc2308.yaml references a file that doesn't exist: Documentation/devicetree/bindings/iio/adc/adc.txt
   Warning: Documentation/devicetree/bindings/iio/adc/adi,ltc2308.yaml references a file that doesn't exist: Documentation/devicetree/bindings/iio/adc/adc.txt
   Warning: Documentation/devicetree/bindings/regulator/siliconmitus,sm5703-regulator.yaml references a file that doesn't exist: Documentation/devicetree/bindings/mfd/siliconmitus,sm5703.yaml
   Warning: Documentation/hwmon/g762.rst references a file that doesn't exist: Documentation/devicetree/bindings/hwmon/g762.txt
>> Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/i3c/adi,i3c-master.yaml
   Warning: MAINTAINERS references a file that doesn't exist: Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
From: Shanker Donthineni @ 2026-06-12  1:13 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Vladimir Murzin, Jason Gunthorpe,
	linux-arm-kernel@lists.infradead.org, Mark Rutland,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	Vikram Sethi, Jason Sequeira, Shanker Donthineni
In-Reply-To: <IA1PR12MB6089049028A73A2078FC6831C71B2@IA1PR12MB6089.namprd12.prod.outlook.com>

Hi Will,

On 6/11/2026 8:39 AM, sdonthineni@nvidia.com wrote:
>
> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: Thursday, June 11, 2026 8:34 AM
> To: Shanker Donthineni <sdonthineni@nvidia.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>; Vladimir Murzin <vladimir.murzin@arm.com>; Jason Gunthorpe <jgg@nvidia.com>; linux-arm-kernel@lists.infradead.org; Mark Rutland <mark.rutland@arm.com>; linux-kernel@vger.kernel.org; linux-doc@vger.kernel.org; Vikram Sethi <vsethi@nvidia.com>; Jason Sequeira <jsequeira@nvidia.com>
> Subject: Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
>
> External email: Use caution opening links or attachments
>
>
> On Wed, Jun 10, 2026 at 11:48:22AM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>>    - A PE executes a Device-nGnR* store followed by a younger
>>      Device-nGnR* load.
>>    - The store is not a store-release.
>>    - The accesses target the same peripheral and do not overlap in bytes.
>>    - There is at most one intervening Device-nGnR* store in program
>>      order, and there are no intervening Device-nGnR* loads.
>>    - There is no DSB, and no DMB that orders loads, between the store and
>>      the load.
>>    - Specific micro-architectural and timing conditions occur.
>>
>> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain
>> str* to stlr* (Store-Release), which removes the "store is not a
>> store-release" condition for every device write the kernel issues.
>> Because writel() and writel_relaxed() are both built on __raw_writel()
>> in asm-generic/io.h, patching the raw variants covers both the
>> non-relaxed and relaxed APIs without touching the higher layers. Note
>> that writel()'s own barrier sits before the store, so it does not
>> order the store against a subsequent readl(); the store-release
>> promotion is what provides that ordering.
>>
>> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a
>> new ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only
>> activated on parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs
>> continue to use the plain str* sequence.
>>
>> Note: stlr* only supports base-register addressing, so affected CPUs
>> use a base-register stlr* path. Unaffected CPUs keep the original
>> offset-addressed str* sequence introduced by commit d044d6ba6f02
>> ("arm64: io: permit offset addressing").
>>
>> The __const_memcpy_toio_aligned32() and
>> __const_memcpy_toio_aligned64() helpers are left unchanged. These
>> helpers are intended for write-combining mappings, which are Normal-NC
>> on arm64. Replacing their contiguous str* groups would defeat the
>> write-combining behavior used to improve store performance.
>>
>> Co-developed-by: Vikram Sethi <vsethi@nvidia.com>
>> Signed-off-by: Vikram Sethi <vsethi@nvidia.com>
>> Signed-off-by: Shanker Donthineni <sdonthineni@nvidia.com>
>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>> ---
>> Changes since v2:
>>    - Reworked the raw MMIO write helpers so unaffected CPUs keep the
>>      existing offset-addressed STR sequence, while affected CPUs use the
>>      base-register STLR path.
>>    - Updated the commit message to match the code changes.
>>    - Rebased on top of the arm64 for-next/errata branch:
>>      
>> https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h
>> =for-next/errata
>>
>> Changes since v1:
>>    - Updated the commit message based on feedback from Vladimir Murzin.
>>
>>   Documentation/arch/arm64/silicon-errata.rst |  2 ++
>>   arch/arm64/Kconfig                          | 23 ++++++++++++++++
>>   arch/arm64/include/asm/io.h                 | 30 +++++++++++++++++++++
>>   arch/arm64/kernel/cpu_errata.c              |  8 ++++++
>>   arch/arm64/tools/cpucaps                    |  1 +
>>   5 files changed, 64 insertions(+)
>>
>> diff --git a/Documentation/arch/arm64/silicon-errata.rst
>> b/Documentation/arch/arm64/silicon-errata.rst
>> index ad09bbb10da80..fc45125dc2f80 100644
>> --- a/Documentation/arch/arm64/silicon-errata.rst
>> +++ b/Documentation/arch/arm64/silicon-errata.rst
>> @@ -298,6 +298,8 @@ stable kernels.
>>   +----------------+-----------------+-----------------+-----------------------------+
>>   | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
>>   
>> +----------------+-----------------+-----------------+----------------
>> -------------+
>> +| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
>> ++----------------+-----------------+-----------------+-----------------------------+
>>   | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
>>   +----------------+-----------------+-----------------+-----------------------------+
>>   | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index
>> c65cef81be86a..d633eb70de1ac 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
>>
>>          If unsure, say Y.
>>
>> +config NVIDIA_OLYMPUS_1027_ERRATUM
>> +     bool "NVIDIA Olympus: device store/load ordering erratum"
>> +     default y
>> +     help
>> +       This option adds an alternative code sequence to work around an
>> +       NVIDIA Olympus core erratum where a Device-nGnR* store can be
>> +       observed by a peripheral after a younger Device-nGnR* load to the
>> +       same peripheral. This breaks the program order that drivers rely
>> +       on for MMIO and can leave a device in an incorrect state.
>> +
>> +       The workaround promotes the raw MMIO store helpers
>> +       (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
>> +       required ordering. Because writel() and writel_relaxed() are built
>> +       on __raw_writel(), both are covered without changes to the higher
>> +       layers.
>> +
>> +       The fix is applied through the alternatives framework, so enabling
>> +       this option does not by itself activate the workaround: it is
>> +       patched in only when an affected CPU is detected, and is a no-op on
>> +       unaffected CPUs.
>> +
>> +       If unsure, say Y.
>> +
>>   config ARM64_ERRATUM_834220
>>        bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
>>        depends on KVM
>> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
>> index 8cbd1e96fd50b..801223e754c90 100644
>> --- a/arch/arm64/include/asm/io.h
>> +++ b/arch/arm64/include/asm/io.h
>> @@ -22,10 +22,22 @@
>>   /*
>>    * Generic IO read/write.  These perform native-endian accesses.
>>    */
>> +static __always_inline bool arm64_needs_device_store_release(void)
>> +{
>> +     return alternative_has_cap_unlikely(
>> +                             ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
>> +}
>> +
>>   #define __raw_writeb __raw_writeb
>>   static __always_inline void __raw_writeb(u8 val, volatile void
>> __iomem *addr)  {
>>        volatile u8 __iomem *ptr = addr;
>> +
>> +     if (arm64_needs_device_store_release()) {
>> +             asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
>> +             return;
>> +     }
>> +
>>        asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));  }
> Use an 'else' clause instead of the early return? (similarly for the other changes).
>
> I still reckon you should do something with the memcpy-to-io routines.
> A simple option could be to make dgh() a dmb on parts with the erratum?
> That at least moves the barrier out of the loop.

Thanks Will. I looked again at both the arm64 comments and the generic iomap_copy.c
contract, and I’m not convinced that making dgh() a dmb is the right fit for this
path. Based on the documented comments, callers should not assume ordering from
these helpers; if ordering is required around a memcpy, the call site should already
be providing the necessary barriers.

Related data point in generic lib/iomap_copy.c:

/**
  * __iowrite32_copy - copy data to MMIO space, in 32-bit units
  * @to: destination, in MMIO space (must be 32-bit aligned)
  * @from: source (must be 32-bit aligned)
  * @count: number of 32-bit quantities to copy
  *
  * Copy data from kernel space to MMIO space, in units of 32 bits at a
  * time.  Order of access is not guaranteed, nor is a memory barrier
  * performed afterwards.
  */
#ifndef __iowrite32_copy
void __iowrite32_copy(void __iomem *to, const void *from, size_t count)

/**
  * __iowrite64_copy - copy data to MMIO space, in 64-bit or 32-bit units
  * @to: destination, in MMIO space (must be 64-bit aligned)
  * @from: source (must be 64-bit aligned)
  * @count: number of 64-bit quantities to copy
  *
  * Copy data from kernel space to MMIO space, in units of 32 or 64 bits at a
  * time.  Order of access is not guaranteed, nor is a memory barrier
  * performed afterwards.
  */
#ifndef __iowrite64_copy
void __iowrite64_copy(void __iomem *to, const void *from, size_t count)

/**
  * __iowrite32_copy - copy data to MMIO space, in 32-bit units
  * @to: destination, in MMIO space (must be 32-bit aligned)
  * @from: source (must be 32-bit aligned)
  * @count: number of 32-bit quantities to copy
  *
  * Copy data from kernel space to MMIO space, in units of 32 bits at a
  * time.  Order of access is not guaranteed, nor is a memory barrier
  * performed afterwards.
  */
#ifndef __iowrite32_copy
void __iowrite32_copy(void __iomem *to, const void *from, size_t count)


The arm64 comment says in arch/arm64/asm/io.h:

/*
  * The ARM64 iowrite implementation is intended to support drivers that want to
  * use write combining. For instance PCI drivers using write combining with a 64
  * byte __iowrite64_copy() expect to get a 64 byte MemWr TLP on the PCIe bus.
  *
  * Newer ARM core have sensitive write combining buffers, it is important that
  * the stores be contiguous blocks of store instructions. Normal memcpy
  * approaches have a very low chance to generate write combining.
  *
  * Since this is the only API on ARM64 that should be used with write combining
  * it also integrates the DGH hint which is supposed to lower the latency to
  * emit the large TLP from the CPU.
  */

So my reading is that dgh() in the arm64 implementation is there for the
write-combining/gathering behavior. Replacing it with dmb would make this
path stronger than the generic API contract and could penalize performance
of the WC use case.

For the scalar MMIO helpers, the workaround promotes the raw writes to
store-release on affected CPUs as v1/v2 shown below. For the memcpy-toIO
helpers, could you please clarify the specific reason for adding a dmb despite
the documented no-ordering contract? Is the concern that some drivers may
be relying on ordering across memcpy_toio_*() today even though the API
does not guarantee it, and that we should cover those cases defensively?

Would prefer to avoid replacing DGH() with DMB unless there is a strong
reason to do so. Please let me know if I can post the v4 patch with
the change below, while keeping DGH() as-is in the memcpy-toIO path.

  #define __raw_writeb __raw_writeb
  static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
  {
-       volatile u8 __iomem *ptr = addr;
-       asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+       asm volatile(ALTERNATIVE("strb %w0, [%1]",
+                                "stlrb %w0, [%1]",
+                                ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+                    : : "rZ" (val), "r" (addr));
  }

  #define __raw_writew __raw_writew
  static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
  {
-       volatile u16 __iomem *ptr = addr;
-       asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+       asm volatile(ALTERNATIVE("strh %w0, [%1]",
+                                "stlrh %w0, [%1]",
+                                ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+                    : : "rZ" (val), "r" (addr));
  }

  #define __raw_writel __raw_writel
  static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
  {
-       volatile u32 __iomem *ptr = addr;
-       asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+       asm volatile(ALTERNATIVE("str %w0, [%1]",
+                                "stlr %w0, [%1]",
+                                ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+                    : : "rZ" (val), "r" (addr));
  }

  #define __raw_writeq __raw_writeq
  static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
  {
-       volatile u64 __iomem *ptr = addr;
-       asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
+       asm volatile(ALTERNATIVE("str %x0, [%1]",
+                                "stlr %x0, [%1]",
+                                ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+                    : : "rZ" (val), "r" (addr));
  }


-Shanker


^ permalink raw reply

* Re: [PATCH v4 6/6] kselftest: alloc_tag: extend the allocinfo ioctl kselftest
From: Abhishek Bapat @ 2026-06-12  0:23 UTC (permalink / raw)
  To: Hao Ge
  Cc: Shuah Khan, Jonathan Corbet, linux-doc, linux-kernel, linux-mm,
	Sourav Panda, Suren Baghdasaryan, Andrew Morton, Kent Overstreet
In-Reply-To: <c7ae2aa9-a1fb-4965-a213-f9cfb2aa101e@linux.dev>

On Wed, Jun 10, 2026 at 2:34 AM Hao Ge <hao.ge@linux.dev> wrote:
>
> Hi Abhishek
>
>
> On 2026/6/10 08:12, Abhishek Bapat wrote:
> > Add the following 2 scenarios to the allocinfo ioctl kselftest:
> > 1. Validate size based filtering
> > 2. Validate lineno based filtering
> >
> > The first test uses "do_init_module" as the candidate function for the
> > test. This is because the associated site will only allocate memory when
> > a kernel module is loaded. The return value of get_content_id() changes
> > every time modules are loaded or unloaded. Hence, as long as
> > get_content_id() values at the start and the end of the test are the
> > same, the memory allocated by the do_init_module call site should also
> > remain the same. Consequently, the test can assume consistency between
> > the value returned by the ioctl and the procfs resulting in less
> > flakiness.
> >
> > Signed-off-by: Abhishek Bapat <abhishekbapat@google.com>
> > ---
> >   .../alloc_tag/allocinfo_ioctl_test.c          | 204 +++++++++++++++++-
> >   1 file changed, 203 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> > index cd9cf229ae1f..5d2f13900a47 100644
> > --- a/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> > +++ b/tools/testing/selftests/alloc_tag/allocinfo_ioctl_test.c
> > @@ -311,11 +311,201 @@ static int test_function_filter(void)
> >       return run_filter_test(&filter);
> >   }
> >
> > +static int test_size_filter(void)
> > +{
> > +     int fd;
> > +     struct allocinfo_tag_data_vec *tags = malloc(sizeof(*tags));
> > +     struct allocinfo_tag_data_vec *procfs_entries = malloc(sizeof(*procfs_entries));
> > +     struct allocinfo_filter filter;
> > +     int ret = KSFT_PASS;
> > +     __u64 target_size, i, pos;
> > +     bool found;
> > +     const char *target_function = "do_init_module";
> > +     struct allocinfo_content_id start_cont_id, end_cont_id;
> > +     int retry = 0;
> > +     const int max_retries = 10;
> > +
> > +     if (!tags || !procfs_entries) {
> > +             ksft_print_msg("Memory allocation failed.\n");
> > +             ret = KSFT_FAIL;
> > +             goto freemem;
> > +     }
> > +
> > +     fd = open(ALLOCINFO_PROC, O_RDONLY);
> > +     if (fd < 0) {
> > +             ksft_exit_skip("Failed to open " ALLOCINFO_PROC ": %s\n", strerror(errno));
> > +             ret = KSFT_FAIL;
> > +             goto freemem;
> > +     }
> > +
> > +     do {
> > +             found = false;
> > +             pos = 0;
> > +
> > +             if (__allocinfo_get_content_id(fd, &start_cont_id)) {
> > +                     ksft_print_msg("allocinfo_get_content_id failed\n");
> > +                     ret = KSFT_FAIL;
> > +                     goto exit;
> > +             }
> > +
> > +             memset(&filter, 0, sizeof(filter));
> > +             filter.mask |= ALLOCINFO_FILTER_MASK_FUNCTION;
> > +             strncpy(filter.fields.function, target_function, ALLOCINFO_STR_SIZE);
> > +
> > +             if (get_filtered_procfs_entries(procfs_entries, &filter, fd)) {
> > +                     ksft_print_msg("Error retrieving entries from " ALLOCINFO_PROC "\n");
> > +                     ret = KSFT_FAIL;
> > +                     goto exit;
> > +             }
> > +
>
>
> As I mentioned for patch 5, the retry loop in test_size_filter calls
>
> get_filtered_procfs_entries() which reads fd to EOF via fdopen/fgets.
>
> If a module load triggers a retry, the second call to
> get_filtered_procfs_entries() gets EOF
>
> immediately.
>
> And Sashiko has also reported several minor issues.
>
>
> Thanks
>
> Best Regards
>
> Hao
>
Hi All,

Please note that I am moving the file descriptor definitions inside
each of the functions. That makes the code much cleaner and avoids the
weird scenarios related to using the same file descriptor everywhere.
I will include this change along with fixes for some of the other
issues Sashiko identified.

> > +             if (procfs_entries->count == 0) {
> > +                     ksft_print_msg("Function %s not found in procfs\n", target_function);
> > +                     ret = KSFT_SKIP;
> > +                     goto exit;
> > +             }
> > +
> > +             target_size = procfs_entries->tag[0].counter.bytes;
> > +
> > +             memset(&filter, 0, sizeof(filter));
> > +             filter.mask |= ALLOCINFO_FILTER_MASK_MIN_SIZE | ALLOCINFO_FILTER_MASK_MAX_SIZE;
> > +             filter.min_size = target_size;
> > +             filter.max_size = target_size;
> > +
> > +             while (1) {
> > +                     struct allocinfo_get_at get_at_params;
> > +
> > +                     memset(&get_at_params, 0, sizeof(get_at_params));
> > +                     memcpy(&get_at_params.filter, &filter, sizeof(filter));
> > +                     get_at_params.pos = pos;
> > +
> > +                     if (__allocinfo_get_at(fd, &get_at_params))
> > +                             break;
> > +
> > +                     tags->count = 0;
> > +                     memcpy(&tags->tag[tags->count++], &get_at_params.data,
> > +                            sizeof(get_at_params.data));
> > +
> > +                     while (tags->count < VEC_MAX_ENTRIES &&
> > +                            __allocinfo_get_next(fd, &tags->tag[tags->count]) == 0)
> > +                             tags->count++;
> > +
> > +                     for (i = 0; i < tags->count; i++) {
> > +                             if (strcmp(tags->tag[i].tag.function, target_function) == 0) {
> > +                                     found = true;
> > +                                     break;
> > +                             }
> > +                     }
> > +
> > +                     if (found || tags->count < VEC_MAX_ENTRIES)
> > +                             break;
> > +
> > +                     pos += tags->count;
> > +             }
> > +
> > +             if (__allocinfo_get_content_id(fd, &end_cont_id)) {
> > +                     ksft_print_msg("allocinfo_get_content_id failed\n");
> > +                     ret = KSFT_FAIL;
> > +                     goto exit;
> > +             }
> > +
> > +             if (start_cont_id.id == end_cont_id.id)
> > +                     break;
> > +
> > +             ksft_print_msg("Module load detected during size verification, retrying...\n");
> > +     } while (retry++ < max_retries);
> > +
> > +     if (start_cont_id.id == end_cont_id.id && !found) {
> > +             ksft_print_msg("Entry with function %s not found in IOCTL results\n",
> > +                            target_function);
> > +             ret = KSFT_FAIL;
> > +     }
> > +
> > +exit:
> > +     close(fd);
> > +freemem:
> > +     free(tags);
> > +     free(procfs_entries);
> > +     return ret;
> > +}
> > +
> > +static int test_lineno_filter(void)
> > +{
> > +     int fd;
> > +     struct allocinfo_tag_data_vec *tags = malloc(sizeof(*tags));
> > +     struct allocinfo_tag_data_vec *procfs_entries = malloc(sizeof(*procfs_entries));
> > +     struct allocinfo_filter filter;
> > +     enum ioctl_ret ioctl_status;
> > +     int ret = KSFT_PASS;
> > +     __u64 target_lineno, i;
> > +
> > +     if (!tags || !procfs_entries) {
> > +             ksft_print_msg("Memory allocation failed.\n");
> > +             ret = KSFT_FAIL;
> > +             goto freemem;
> > +     }
> > +
> > +     fd = open(ALLOCINFO_PROC, O_RDONLY);
> > +     if (fd < 0) {
> > +             ksft_exit_skip("Failed to open " ALLOCINFO_PROC ": %s\n", strerror(errno));
> > +             ret = KSFT_FAIL;
> > +             goto freemem;
> > +     }
> > +
> > +     memset(&filter, 0, sizeof(filter));
> > +
> > +     if (get_filtered_procfs_entries(procfs_entries, &filter, fd)) {
> > +             ksft_print_msg("Error retrieving entries from " ALLOCINFO_PROC "\n");
> > +             ret = KSFT_FAIL;
> > +             goto exit;
> > +     }
> > +     if (procfs_entries->count == 0) {
> > +             ksft_print_msg("Could not retrieve procfs entries\n");
> > +             ret = KSFT_SKIP;
> > +             goto exit;
> > +     }
> > +     /*
> > +      * We depend on the result of procfs entries to create the ioctl_filter. Hence we
> > +      * cannot recycle the run_filter_test function here.
> > +      */
> > +     target_lineno = procfs_entries->tag[0].tag.lineno;
> > +
> > +     filter.mask |= ALLOCINFO_FILTER_MASK_LINENO;
> > +     filter.fields.lineno = target_lineno;
> > +
> > +     ioctl_status = get_filtered_ioctl_entries(tags, &filter, fd, 0);
> > +     if (ioctl_status == IOCTL_INVALID_DATA) {
> > +             ksft_print_msg("Trouble retrieving valid IOCTL entries, skipping.\n");
> > +             ret = KSFT_SKIP;
> > +             goto exit;
> > +     }
> > +     if (ioctl_status == IOCTL_FAILURE) {
> > +             ksft_print_msg("Error retrieving IOCTL entries.\n");
> > +             ret = KSFT_FAIL;
> > +             goto exit;
> > +     }
> > +
> > +     for (i = 0; i < tags->count; i++) {
> > +             if (tags->tag[i].tag.lineno != target_lineno) {
> > +                     ksft_print_msg("IOCTL entry %llu has incorrect lineno %llu.\n",
> > +                                    i, tags->tag[i].tag.lineno);
> > +                     ret = KSFT_FAIL;
> > +                     goto exit;
> > +             }
> > +     }
> > +
> > +exit:
> > +     close(fd);
> > +freemem:
> > +     free(tags);
> > +     free(procfs_entries);
> > +     return ret;
> > +}
> > +
> >   int main(int argc, char *argv[])
> >   {
> >       int ret;
> >
> > -     ksft_set_plan(2);
> > +     ksft_set_plan(4);
> >
> >       ret = test_filename_filter();
> >       if (ret == KSFT_SKIP)
> > @@ -329,5 +519,17 @@ int main(int argc, char *argv[])
> >       else
> >               ksft_test_result(ret == KSFT_PASS, "test_function_filter\n");
> >
> > +     ret = test_size_filter();
> > +     if (ret == KSFT_SKIP)
> > +             ksft_test_result_skip("Skipping test_size_filter\n");
> > +     else
> > +             ksft_test_result(ret == KSFT_PASS, "test_size_filter\n");
> > +
> > +     ret = test_lineno_filter();
> > +     if (ret == KSFT_SKIP)
> > +             ksft_test_result_skip("Skipping test_lineno_filter\n");
> > +     else
> > +             ksft_test_result(ret == KSFT_PASS, "test_lineno_filter\n");
> > +
> >       ksft_finished();
> >   }

^ permalink raw reply

* Re: [PATCH v3 02/12] x86/resctrl: Add data structures and definitions for PLZA configuration
From: Reinette Chatre @ 2026-06-11 23:40 UTC (permalink / raw)
  To: Babu Moger, corbet, tony.luck, Dave.Martin, james.morse, tglx, bp,
	dave.hansen
  Cc: skhan, x86, mingo, hpa, akpm, rdunlap, pawan.kumar.gupta,
	feng.tang, dapeng1.mi, kees, elver, lirongqing, paulmck, bhelgaas,
	seanjc, alexandre.chartre, yazen.ghannam, peterz, chang.seok.bae,
	kim.phillips, xin, naveen, thomas.lendacky, linux-doc,
	linux-kernel, eranian, peternewman
In-Reply-To: <e84fdbc324b312ff137d279ec154e3827c0aed81.1777591497.git.babu.moger@amd.com>

Hi Babu,

On 4/30/26 4:24 PM, Babu Moger wrote:
> Privilege Level Zero Association (PLZA) is configured per logical processor
> via MSR_IA32_PQR_PLZA_ASSOC (0xc00003fc). Software must program RMID and
> CLOSID association fields and their enable bits using the layout defined
> for the MSR.
> 
> Define MSR_IA32_PQR_PLZA_ASSOC and the RMID_EN, CLOSID_EN, and PLZA_EN bit
> masks in asm/msr-index.h. Add union msr_pqr_plza_assoc in arch resctrl
> internal.h

Above paragraph captures what can be seen from the patch. Please check entire
series for this since many changelogs in this series verbatim describes the code
changes in patch without helping reader understand why those changes are made.


> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---

> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 9dc6b610e4e2..623628d3c643 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1287,10 +1287,17 @@
>  /* - AMD: */
>  #define MSR_IA32_MBA_BW_BASE		0xc0000200
>  #define MSR_IA32_SMBA_BW_BASE		0xc0000280
> +#define MSR_IA32_PQR_PLZA_ASSOC		0xc00003fc
>  #define MSR_IA32_L3_QOS_ABMC_CFG	0xc00003fd
>  #define MSR_IA32_L3_QOS_EXT_CFG		0xc00003ff
>  #define MSR_IA32_EVT_CFG_BASE		0xc0000400
>  
> +/* Lower 32 bits of MSR_IA32_PQR_PLZA_ASSOC */
> +#define RMID_EN				BIT(31)
> +/* Upper 32 bits of MSR_IA32_PQR_PLZA_ASSOC */
> +#define CLOSID_EN			BIT(15)
> +#define PLZA_EN				BIT(31)
> +

This is unexpected. So far resctrl has only defined the MSR numbers in this file, not
the individual fields. This seems a legitimate use of msr-index.h but creates inconsistency
with how the fields of the other resctrl registers are defined. This may be ok so I am
looking past this for now. Since I am not familiar with this use I am looking at other
patterns of this and it seems that the register fields are usually defined right after
the register to make this relationship clear and also use more verbose naming to establish
this relationship ... I do not think such cryptic names should be used without context
in such a global scope. Please compare with how other fields are defined at this scope.

> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index e3cfa0c10e92..1c2f87ffb0ea 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -222,6 +222,33 @@ union l3_qos_abmc_cfg {
>  	unsigned long full;
>  };
>  
> +/*
> + * PLZA is programmed by writing to MSR_IA32_PQR_PLZA_ASSOC. Bitfield
> + * layout for MSR_IA32_PQR_PLZA_ASSOC (Privilege Level Zero Association).

These comments are valuable to describe how resctrl should interact with
this register so it would help to be specific and document any and all
constraints.

For example, I seem to remember that all fields except PLZA_EN are required
to be identical on all CPUs. Please document that and any other constraints here.

> + *
> + * @rmid		: The RMID to be configured for PLZA.

What does "to be configured" mean? It seems to imply that when resctrl
writes to @rmid then the setting does not take immediate effect but would
take effect at some future "configure" time?

> + * @reserved1		: Reserved.
> + * @rmid_en		: Associate RMID or not.

Please elaborate ... what is RMID associated with? What does "or not" imply? 
Here it will help to document relationship with MSR_IA32_PQR_ASSOC.

> + * @closid		: The CLOSID to be configured for PLZA.
> + * @reserved2		: Reserved.
> + * @closid_en		: Associate CLOSID or not.

Same comments as for RMID

> + * @reserved3		: Reserved.
> + * @plza_en		: Configure PLZA or not.

plza_en implies "enable" but the comment mentions "configure". Considering
the other fields are "to be configured" there seems to be relationship but
that is not documented at all. For example, if @plza_en is 1 and resctrl modifies
@rmid should resctrl write "1" to @plza_en again to "configure" the new RMID?

Please add specific detail to help understand how best to interact with this
register. 

> + */
> +union msr_pqr_plza_assoc {
> +	struct {
> +		unsigned long rmid	:12,
> +			      reserved1	:19,
> +			      rmid_en	: 1,
> +			      closid	: 4,
> +			      reserved2	:11,
> +			      closid_en	: 1,
> +			      reserved3	:15,
> +			      plza_en	: 1;
> +	} split;
> +	unsigned long full;
> +};
> +
>  void rdt_ctrl_update(void *arg);
>  
>  int rdt_get_l3_mon_config(struct rdt_resource *r);

Reinette

^ permalink raw reply

* Re: [PATCH v3 01/12] x86/resctrl: Support Privilege-Level Zero Association (PLZA)
From: Reinette Chatre @ 2026-06-11 23:23 UTC (permalink / raw)
  To: Babu Moger, corbet, tony.luck, Dave.Martin, james.morse, tglx, bp,
	dave.hansen
  Cc: skhan, x86, mingo, hpa, akpm, rdunlap, pawan.kumar.gupta,
	feng.tang, dapeng1.mi, kees, elver, lirongqing, paulmck, bhelgaas,
	seanjc, alexandre.chartre, yazen.ghannam, peterz, chang.seok.bae,
	kim.phillips, xin, naveen, thomas.lendacky, linux-doc,
	linux-kernel, eranian, peternewman, sos-linux-ext-patches
In-Reply-To: <f59c7f5404f29b2901af68d8032ee615b7f0efea.1777591496.git.babu.moger@amd.com>

Hi Babu,

On 4/30/26 4:24 PM, Babu Moger wrote:
> Customers have identified an issue while using the QoS resource Control

"Control" -> "control"?

> feature. If a memory bandwidth associated with a CLOSID is aggressively

"a memory bandwidth" -> "memory bandwidth"?

> throttled, and it moves into Kernel mode, the Kernel operations are also

What does "it" refer to here? From text it seems to be the "CLOSID" but that
does not sound right? Should "it" instead be something like "a task with that
CLOSID"?

"Kernel" -> "kernel"?

> aggressively throttled. This can stall forward progress and eventually
> degrade overall system performance. AMD hardware supports a feature
> Privilege-Level Zero Association (PLZA) to change the association of the
> thread as soon as it begins executing.

"change the association of the thread as soon as it begins executing." I am
not able to parse this.

> 
> Privilege-Level Zero Association (PLZA) allows the user to specify a CLOSID
> and/or RMID associated with execution in Privilege-Level Zero. When enabled
> on a HW thread, when the thread enters Privilege-Level Zero, transactions

Could you please use consistent terminology throughout this series? This patch
uses "HW thread"/"thread", the next patch then switches to "logical processor",
and then by patch #4 the term seems to settle on "CPU". Could this just be
"CPU" from here and throughout series to be consistent and easier to read?

What is meant with "transactions"?  Is this just about memory transactions?
Using this term combined with earlier "memory bandwidth" related problem description
hints that this feature just impacts memory bandwidth allocation but from what
I understand this impacts all allocation (CLOSID of all resources) and monitoring.

Could "transactions" be replaced with "allocation and monitoring" and be
more accurate?

> associated with that thread will be associated with the PLZA CLOSID and/or
> RMID. Otherwise, the HW thread will be associated with the CLOSID and RMID
> identified by PQR_ASSOC.
> 
> Add PLZA support to resctrl and introduce a kernel parameter that allows
> enabling or disabling the feature at boot time.
> 
> The GLBE feature details are documented in:

"GLBE" -> "PLZA"?

> 
>   AMD64 Zen6 Platform Quality of Service (PQOS) Extensions:
>   Publication # 69193 Revision: 1.00, Issue Date: March 2026
> 
> available at https://bugzilla.kernel.org/show_bug.cgi?id=206537

Please follow same style as what you used in the assignable counter enabling where
this URL is provided via a "Link:" tag and then the text can refer to it. Specifically,
	Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 # [1]

> 
> Signed-off-by: Babu Moger <babu.moger@amd.com>
> ---
> v3: Code did not change. Patch order cahnged.
>     Added documentation link.
> 
> v2: Rebased on top of the latest tip.
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 2 +-
>  arch/x86/include/asm/cpufeatures.h              | 1 +
>  arch/x86/kernel/cpu/resctrl/core.c              | 2 ++
>  arch/x86/kernel/cpu/scattered.c                 | 1 +

Please split changes to other subsystems and make these changes
obvious with their own subject prefix to avoid sneaking changes into
other subsystems via resctrl.

Reinette

^ permalink raw reply

* Re: [PATCH net-next v2 00/15] mptcp: pm: drop TCP TS with ADD_ADDRv6 + port
From: patchwork-bot+netdevbpf @ 2026-06-11 22:50 UTC (permalink / raw)
  To: Matthieu Baerts
  Cc: martineau, geliang, davem, edumazet, kuba, pabeni, horms, netdev,
	mptcp, linux-kernel, corbet, skhan, linux-doc, linux-kselftest,
	ncardwell, kuniyu, shuah
In-Reply-To: <20260605-net-next-mptcp-add-addr6-port-ts-v2-0-758e7ca73f4d@kernel.org>

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 05 Jun 2026 19:21:44 +1000 you wrote:
> Up to this series, it was possible to add a "signal" MPTCP endpoint with
> an IPv6 address and a port, or to directly request to send an ADD_ADDR
> with a v6 address and a port, but the expected ADD_ADDR wasn't sent when
> TCP timestamps was used for the connection.
> 
> In fact, such signalling option cannot be sent when TCP timestamps is
> used due to a lack of option space: the limit is at 40 bytes, and, with
> padding, TCP timestamps is taking 12 bytes, while an ADD_ADDR IPv6 +
> port is taking 30 bytes. The selected solution here is to simply drop
> the TCP timestamps option when such ADD_ADDR of 30 bytes needs to be
> sent.
> 
> [...]

Here is the summary with links:
  - [net-next,v2,01/15] mptcp: options: suboptions sizes can be negative
    https://git.kernel.org/netdev/net-next/c/f4a58ffbd4cf
  - [net-next,v2,02/15] mptcp: pm: avoid computing rm_addr size twice
    https://git.kernel.org/netdev/net-next/c/a8bffec089d5
  - [net-next,v2,03/15] mptcp: pm: avoid computing add_addr size twice
    https://git.kernel.org/netdev/net-next/c/06c62385be85
  - [net-next,v2,04/15] mptcp: introduce add_addr_v6_port_drop_ts sysctl knob
    https://git.kernel.org/netdev/net-next/c/30ff28fdc4da
  - [net-next,v2,05/15] tcp: allow mptcp to drop TS for some packets
    https://git.kernel.org/netdev/net-next/c/1c3e7e043977
  - [net-next,v2,06/15] mptcp: pm: drop TCP TS with ADD_ADDRv6 + port
    https://git.kernel.org/netdev/net-next/c/23eeaad0d89d
  - [net-next,v2,07/15] selftests: mptcp: validate ADD_ADDRv6 + TS + port
    https://git.kernel.org/netdev/net-next/c/dd7fb53c21c3
  - [net-next,v2,08/15] selftests: mptcp: always check sent/dropped ADD_ADDRs
    https://git.kernel.org/netdev/net-next/c/5558517b0001
  - [net-next,v2,09/15] mptcp: pm: use for_each_subflow helper
    https://git.kernel.org/netdev/net-next/c/f81689172429
  - [net-next,v2,10/15] mptcp: pm: rename add_entry structure to add_addr
    https://git.kernel.org/netdev/net-next/c/350d76dd6e79
  - [net-next,v2,11/15] mptcp: pm: uniform announced addresses helpers
    https://git.kernel.org/netdev/net-next/c/7d4dacc8ccca
  - [net-next,v2,12/15] mptcp: pm: remove add_ prefix from timer
    https://git.kernel.org/netdev/net-next/c/938490767e37
  - [net-next,v2,13/15] mptcp: pm: make mptcp_pm_add_addr_send_ack static
    https://git.kernel.org/netdev/net-next/c/d0f866e64897
  - [net-next,v2,14/15] mptcp: pm: avoid using del_timer directly
    https://git.kernel.org/netdev/net-next/c/6ea199a938da
  - [net-next,v2,15/15] mptcp: options: rst: drop unused skb parameter
    https://git.kernel.org/netdev/net-next/c/6545a8c34703

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox