Linux Documentation

Linux Documentation
 help / color / mirror / Atom feed

* Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
From: Kiryl Shutsemau @ 2026-04-14 17:10 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Andrew Morton, Peter Xu, Lorenzo Stoakes, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
	Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
	linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm
In-Reply-To: <55019037-4f1c-4d9c-83ee-3a844d8f3d5e@kernel.org>

On Tue, Apr 14, 2026 at 05:37:50PM +0200, David Hildenbrand (Arm) wrote:
> On 4/14/26 16:23, Kiryl Shutsemau (Meta) wrote:
> > This series adds userfaultfd support for tracking the working set of
> > VM guest memory, enabling VMMs to identify cold pages and evict them
> > to tiered or remote storage.
> > 
> > == Problem ==
> > 
> > VMMs managing guest memory need to:
> > 1. Track which pages are actively used (working set detection)
> > 2. Safely evict cold pages to slower storage
> > 3. Fetch pages back on demand when accessed again
> > 
> > For shmem-backed guest memory, working set tracking partially works
> > today: MADV_DONTNEED zaps PTEs while pages stay in page cache, and
> > re-access auto-resolves from cache. But safe eviction still requires
> > synchronous fault interception to prevent data loss races.
> > 
> > For anonymous guest memory (needed for KSM cross-VM deduplication),
> > there is no mechanism at all — clearing a PTE loses the page.
> > 
> > == Solution ==
> > 
> > The series introduces a unified userfaultfd interface that works
> > across both anonymous and shmem-backed memory:
> > 
> > UFFD_FEATURE_MINOR_ANON: extends MODE_MINOR registration to anonymous
> > private memory. Uses the PROT_NONE hinting mechanism (same as NUMA
> > balancing) to make pages inaccessible without freeing them.
> 
> I would rather tackle this from the other direction: it's another form
> of protection (like WP), not really a "minor" mode.
> 
> Could we add a UFFDIO_REGISTER_MODE_RWP (or however we would call it)
> and support it for anon+shmem, avoiding the zapping for shmem completely?

I like this idea.

It should be functionally equivalent, but your interface idea fits
better with the rest.

Thanks! Will give it a try.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
From: Kiryl Shutsemau @ 2026-04-14 17:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
	Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
	linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm,
	James Houghton, Andrea Arcangeli
In-Reply-To: <ad5dIUpAMs4MuBvV@x1.local>

On Tue, Apr 14, 2026 at 11:28:33AM -0400, Peter Xu wrote:
> Hi, Kiryl,
> 
> On Tue, Apr 14, 2026 at 03:23:34PM +0100, Kiryl Shutsemau (Meta) wrote:
> > This series adds userfaultfd support for tracking the working set of
> > VM guest memory, enabling VMMs to identify cold pages and evict them
> > to tiered or remote storage.
> 
> Thanks for sharing this work, it looks very interesting to me.
> 
> Personally I am also looking at some kind of VMM memtiering issues.  I'm
> not sure if you saw my lsfmm proposal, it mentioned the challenge we're
> facing, it's slightly different but still a bit relevant:
> 
> https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/

Thanks will read up. I didn't follow userfultfd work until recently.

> Unfortunately, that proposal was rejected upstream.

Sorry about that. We can chat about in hall track, if you are there :)

> > == VMM Workflow ==
> 
> AFAIU, this workflow provides two functionalities:
> 
> > 
> >     UFFDIO_DEACTIVATE(all)            -- async, no vCPU stalls
> >     sleep(interval)
> >     PAGEMAP_SCAN                      -- find cold pages
> 
> Until here it's only about page hotness tracking.  I am curious whether you
> evaluated idle page tracking.  Is it because of perf overheads on rmap?

I didn't gave idle page tracking much thought. I needed uffd faults to
serialize reclaim against memory accesses. If use it for one thing we
can as well try to use it for tracking as well. And it seems to be
fitting together nicely with sync/async mode flipping.

> To
> me, your solution (until here.. on the hotness sampling) reads more like a
> more efficient way to do idle page tracking but only per-mm, not per-folio.
> 
> That will also be something I would like to benefit if QEMU will decide to
> do full userspace swap.  I think that's our last resort, I'll likely start
> with something that makes QEMU work together with Linux on swapping
> (e.g. we're happy to make MGLRU or any reclaim logic that Linux mm
> currently uses, as long as efficient) then QEMU only cares about the rest,
> which is what the migration problem is about.
> 
> The other issue about idle page tracking to us is, I believe MGLRU
> currently doesn't work well with it (due to ignoring IDLE bits) where the
> old LRU algo works.  I'm not sure how much you evaluated above, so it'll be
> great to share from that perspective too.  I also mentioned some of these
> challenges in the lsfmm proposal link above.
> 
> >     UFFDIO_SET_MODE(sync)             -- block faults for eviction
> >     pwrite + MADV_DONTNEED cold pages -- safe, faults block
> >     UFFDIO_SET_MODE(async)            -- resume tracking
> 
> These operations are the 2nd function.  It's, IMHO, a full userspace swap
> system based on userfaultfd.

Right. And we want to decide where to put cold pages from userspace. 

> Have you thought about directly relying on userfaultfd-wp to do this work?
> The relevant question is, why do we need to block guest reads on pages
> being evicted by the userapp?  Can we still allow that to happen, which
> seems to be more efficient?  IIUC, only writes / updates matters in such
> swap system.

But we do care about about read accesses. We don't want to swap out
pages that got read-touched. And we cannot in practice switch to WP mode
after PAGEMAP_SCAN: it would require a lot of UFFDIO_WRITEPROTECT calls
with TLB flushing each.

With my approach switching tracking and reclaiming is single bit flip
under mmap lock.

> Also, I'm not sure if you're aware of LLNL's umap library:
> 
> https://github.com/llnl/umap
> 
> That implemnted the swap system using userfaultfd wr-protect mode only, so
> no new kernel API needed.

Will look into it. Thanks.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [GIT PULL] Documentation for 7.1
From: pr-tracker-bot @ 2026-04-14 16:46 UTC (permalink / raw)
  To: Jonathan Corbet; +Cc: Linus Torvalds, linux-kernel, linux-doc, Shuah Khan
In-Reply-To: <87bjfnzw2x.fsf@trenco.lwn.net>

The pull request you sent on Sun, 12 Apr 2026 15:51:18 -0600:

> git://git.kernel.org/pub/scm/linux/kernel/git/docs/linux.git tags/docs-7.1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/5181afcdf99527dd92a88f80fc4d0d8013e1b510

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [PATCH v5 00/21] Virtual Swap Space
From: Nhat Pham @ 2026-04-14 16:35 UTC (permalink / raw)
  To: Kairui Song
  Cc: YoungJun Park, Liam.Howlett, akpm, apopple, axelrasmussen, baohua,
	baolin.wang, bhe, byungchul, cgroups, chengming.zhou, chrisl,
	corbet, david, dev.jain, gourry, hannes, hughd, jannh,
	joshua.hahnjy, lance.yang, lenb, linux-doc, linux-kernel,
	linux-mm, linux-pm, lorenzo.stoakes, matthew.brost, mhocko,
	muchun.song, npache, pavel, peterx, peterz, pfalcato, rafael,
	rakie.kim, roman.gushchin, rppt, ryan.roberts, shakeel.butt,
	shikemeng, surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed,
	yuanchu, zhengqi.arch, ziy, kernel-team, riel
In-Reply-To: <CAMgjq7BO6SLZPfNXDh1F-7RAOqDAfqMQ4PM=qjAq1mCsWyD0LQ@mail.gmail.com>

On Mon, Apr 13, 2026 at 8:29 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Apr 14, 2026 at 11:05 AM YoungJun Park <youngjun.park@lge.com> wrote:
> >
>
> Hi All,
>
> > On Sat, Apr 11, 2026 at 06:40:44PM -0700, Nhat Pham wrote:
> > > > 1. Modularization
> > > >
> > > > You removed CONFIG_* and went with a unified approach. I recall
> > > > you were also considering a module-based structure at some point.
> > > > What are your thoughts on that direction?
> > > >
> > >
> > > The CONFIG-based approach was a huge mess. It makes me not want to
> > > look at the code, and I'm the author :)
> > >
> > > > If we take that approach, we could extend the recent swap ops
> > > > patchset (https://lore.kernel.org/linux-mm/20260302104016.163542-1-bhe@redhat.com/)
> > > > as follows:
> > > > - Make vswap a swap module
> > > > - Have cluster allocation functions reside in swapops
> > > > - Enable vswap through swapon
> > >
> > > Hmmmmm.
> >
> > I think this would be a happy world, but I wonder what others think.
> > Anyway, I'm looking forward to the future direction.
> >
>
> Yeah, I agree with this.
>
> And I do think swapoff of the virtual space itself is also necessary,
> we really need a failsafe, e.g. a clean way to drop the swap
> cache and data, kind of like drop_caches or shrinker fs are
> commonly used.
>
> > > > 2. Flash-friendly swap integration (for my use case)
> > > >
> > > > I've been thinking about the flash-friendly swap concept that
> > > > I mentioned before and recently proposed:
> > > > (https://lore.kernel.org/linux-mm/aZW0voL4MmnMQlaR@yjaykim-PowerEdge-T330/)
> > > >
> > > > One of its core functions requires buffering RAM-swapped pages
> > > > and writing them sequentially at an appropriate time -- not
> > > > immediately, but in proper block-sized units, sequentially.
> > > >
> > > > This means allocated offsets must essentially be virtual, and
> > > > physical offsets need to be managed separately at the actual
> > > > write time.
> > > >
> > > > If we integrate this into the current vswap, we would either
> > > > need vswap itself to handle the sequential writes (bypassing
> > > > the physical device and receiving pages directly), or swapon
> > > > a swap device and have vswap obtain physical offsets from it.
> > > > But since those offsets cannot be used directly (due to
> > > > buffering and sequential write requirements), they become
> > > > virtual too, resulting in:
> > > >
> > > >   virtual -> virtual -> physical
> > > >
> > > > This triple indirection is not ideal.
> > > >
> > > > However, if the modularization from point 1 is achieved and
> > > > vswap acts as a swap device itself, then we can cleanly
> > > > establish a:
> > > >
> > > >   virtual -> physical
> > >
> > > I read that thread sometimes ago. Some remarks:
> > >
> > > 1. I think Christoph has a point. Seems like some of your ideas ( are
> > > broadly applicable to swap in general. Maybe fixing swap infra
> > > generally would make a lot of sense?
> >
> > Broadly speaking, there are two main ideas:
> > 1. Swap I/O buffering (which is also tied to cluster management issues)
> > 2. Deduplication
> >
> > Are you leaning towards the view that these two should be placed in a
> > higher layer?
>
> IMHO the swap infra should be doing less, not more, so we can have
> more flexible design, and different backends can implement their own
> way to manage the data and layer. e.g. Having one backend being
> flash friendly and it can do this without caring or affecting other devices
> or backends.

I think that's what Youngjun already has, unless I misunderstand his
descriptions.

>
> > If it goes into ZSWAP, there would definitely be a clear advantage of
> > seeing dedup benefits across all swap devices. It's a technically
> > interesting area, and I'd like to discuss it in a separate thread if
> > I have more ideas or thoughts.
>
> Just branstorm... Why don't we just merge these identical pages like
> KSM? Maybe at least zero folios might benefit a lot if we keep them
> mapped as RO instead of recording them in swap, seems better in the
> long term?

That's our preferred approach too. We just didn't manage to get that
to work (yet). :)

^ permalink raw reply

* [PATCH] docs: proc.rst: update description of VmallocUsed and VmallocChunk
From: Herve Vico @ 2026-04-14 15:38 UTC (permalink / raw)
  To: Jonathan Corbet, Shuah Khan
  Cc: Herve Vico, linux-kernel, linux-fsdevel, linux-doc

A long time ago the behavior of two /proc/meminfo Vmalloc<...> counters
has been modified twice without updating the doc:

- v4.4 removes the expensive 'vmalloc_info' bookkeeping behind VmallocUsed
  and VmallocChunk, and makes both counters return zero:

  commit a5ad88ce8c7f ("mm: get rid of 'vmalloc_info' from /proc/meminfo")

-  v5.3 reintroduces VmallocUsed, making it now report the physical memory
  allocated by vmalloc() calls rather than the vmalloc VA space:

  commit 97105f0ab7b8 ("mm: vmalloc: show number of vmalloc pages in
                        /proc/meminfo")

Let's update the doc to reflect the current behavior.

Signed-off-by: Herve Vico <herve.vico@sipearl.com>
---
 Documentation/filesystems/proc.rst | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index b0c0d1b45b99..4a36ac960417 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1233,9 +1233,12 @@ Committed_AS
 VmallocTotal
               total size of vmalloc virtual address space
 VmallocUsed
-              amount of vmalloc area which is used
+              Amount of memory allocated by vmalloc() calls.
+              Note that VmallocTotal is a constant that refers to the size of
+              the vmalloc VA space, while VmallocUsed reports the amount of
+              memory allocated by vmalloc() calls.
 VmallocChunk
-              largest contiguous block of vmalloc area which is free
+              Deprecated, hardcoded to zero.
 Percpu
               Memory allocated to the percpu allocator used to back percpu
               allocations. This stat excludes the cost of metadata.

base-commit: d60bc140158342716e13ff0f8aa65642f43ba053
-- 
2.51.2


^ permalink raw reply related

* htmldocs: Documentation/core-api/mm-api:104: ./include/linux/mm_inline.h:577: WARNING: Inline emphasis start-string without end-string. [docutils]
From: kernel test robot @ 2026-04-14 15:52 UTC (permalink / raw)
  To: Dev Jain; +Cc: oe-kbuild-all, 0day robot, linux-doc

Hi Dev,

FYI, the error/warning was bisected to this commit, please ignore it if it's irrelevant.

tree:   https://github.com/intel-lab-lkp/linux/commits/Dev-Jain/mm-rmap-initialize-nr_pages-to-1-at-loop-start-in-try_to_unmap_one/20260414-033035
head:   77dacfde3a6afac7fa3c015671d2452b524b37ad
commit: 1202b576c2e876c8cab1c41f1816a3c15bdf79d3 mm/memory: Batch set uffd-wp markers during zapping
date:   20 hours ago
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
docutils: docutils (Docutils 0.21.2, Python 3.13.5, on linux)
reproduce: (https://download.01.org/0day-ci/archive/20260414/202604141701.JY7pVkau-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202604141701.JY7pVkau-lkp@intel.com/

All warnings (new ones prefixed by >>):

   Documentation/core-api/kref:328: ./include/linux/kref.h:94: WARNING: Invalid C declaration: Expected end of definition. [error at 92]
   int kref_put_lock (struct kref *kref, void (*release)(struct kref *kref), spinlock_t *lock) __cond_acquires(true# lock)
   --------------------------------------------------------------------------------------------^
   WARNING: ./include/linux/mm_inline.h:591 function parameter 'pte' not described in 'install_uffd_wp_ptes_if_needed'
   WARNING: ./include/linux/mm_inline.h:591 function parameter 'pte' not described in 'install_uffd_wp_ptes_if_needed'
>> Documentation/core-api/mm-api:104: ./include/linux/mm_inline.h:577: WARNING: Inline emphasis start-string without end-string. [docutils]
   WARNING: ./include/crypto/skcipher.h:166 struct member 'SKCIPHER_ALG_COMMON' not described in 'skcipher_alg'
   Documentation/driver-api/basics:42: ./kernel/time/time.c:370: WARNING: Duplicate C declaration, also defined at driver-api/basics:440.
   Declaration is '.. c:function:: unsigned int jiffies_to_msecs (const unsigned long j)'. [duplicate_declaration.c]
   Documentation/driver-api/basics:42: ./kernel/time/time.c:393: WARNING: Duplicate C declaration, also defined at driver-api/basics:457.
   Declaration is '.. c:function:: unsigned int jiffies_to_usecs (const unsigned long j)'. [duplicate_declaration.c]

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply

* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
From: mkf @ 2026-04-14 15:37 UTC (permalink / raw)
  To: Jiayuan Chen, bpf
  Cc: Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, netdev, linux-doc, linux-kernel
In-Reply-To: <20260414105702.248310-1-jiayuan.chen@linux.dev>

On Tue, 2026-04-14 at 18:57 +0800, Jiayuan Chen wrote:
> A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
> to inject custom TCP header options. When the kernel builds a TCP packet,
> it calls tcp_established_options() to calculate the header size, which
> invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
> callback.
> 
> If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
> __tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
> tcp_current_mss(), which calls tcp_established_options() again,
> re-triggering the same BPF callback. This creates an infinite recursion
> that exhausts the kernel stack and causes a panic.
> 
> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>   -> bpf_setsockopt(TCP_NODELAY)
> 	-> tcp_push_pending_frames()
> 	  -> tcp_current_mss()
> 		-> tcp_established_options()
> 		  -> bpf_skops_hdr_opt_len()
>                            /* infinite recursion */
> 			-> BPF_SOCK_OPS_HDR_OPT_LEN_CB
> 
> A similar reentrancy issue exists for TCP congestion control, which is
> guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
> tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
> bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
> bpf_setsockopt(TCP_NODELAY) calls that would trigger
> tcp_push_pending_frames() and cause the recursion.
> 
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> Fixes: 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>  Documentation/networking/net_cachelines/tcp_sock.rst |  1 +
>  include/linux/tcp.h                                  | 11 ++++++++++-
>  net/core/filter.c                                    |  4 ++++
>  net/ipv4/tcp_minisocks.c                             |  1 +
>  net/ipv4/tcp_output.c                                |  3 +++
>  5 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst
> b/Documentation/networking/net_cachelines/tcp_sock.rst
> index 563daea10d6c..07d3226d90cc 100644
> --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> @@ -152,6 +152,7 @@ unsigned_int                  keepalive_intvl
>  int                           linger2
>  u8                            bpf_sock_ops_cb_flags
>  u8:1                          bpf_chg_cc_inprogress
> +u8:1                          bpf_hdr_opt_len_cb_inprogress
>  u16                           timeout_rehash
>  u32                           rcv_ooopack
>  u32                           rcv_rtt_last_tsecr
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f72eef31fa23..2bfb73cf922e 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -475,12 +475,21 @@ struct tcp_sock {
>  	u8	bpf_sock_ops_cb_flags;  /* Control calling BPF programs
>  					 * values defined in uapi/linux/tcp.h
>  					 */
> -	u8	bpf_chg_cc_inprogress:1; /* In the middle of
> +	u8	bpf_chg_cc_inprogress:1, /* In the middle of
>  					  * bpf_setsockopt(TCP_CONGESTION),
>  					  * it is to avoid the bpf_tcp_cc->init()
>  					  * to recur itself by calling
>  					  * bpf_setsockopt(TCP_CONGESTION, "itself").
>  					  */
> +		bpf_hdr_opt_len_cb_inprogress:1; /* It is set before invoking the
> +						  * callback so that a nested
> +						  * bpf_setsockopt(TCP_NODELAY) or
> +						  * bpf_setsockopt(TCP_CORK) cannot
> +						  * trigger tcp_push_pending_frames(),
> +						  * which would call tcp_current_mss()
> +						  * -> bpf_skops_hdr_opt_len(), causing
> +						  * infinite recursion.
> +						  */
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
>  #else
>  #define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 78b548158fb0..518699429a7a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5483,6 +5483,10 @@ static int sol_tcp_sockopt(struct sock *sk, int optname,
>  	if (sk->sk_protocol != IPPROTO_TCP)
>  		return -EINVAL;
>  
> +	if ((optname == TCP_NODELAY || optname == TCP_CORK) &&
> +	    tcp_sk(sk)->bpf_hdr_opt_len_cb_inprogress)
> +		return -EBUSY;
> +
TCP_CORK is not support in sol_tcp_sockopt(), return -EINVAL by default. and put the check here
could also prevent us from calling getsockopt(TCP_NODELAY) below.

>  	switch (optname) {
>  	case TCP_NODELAY:
>  	case TCP_MAXSEG:
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index dafb63b923d0..fb06c464ac16 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -663,6 +663,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
>  	RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
>  
>  	newtp->bpf_chg_cc_inprogress = 0;
> +	newtp->bpf_hdr_opt_len_cb_inprogress = 0;
>  	tcp_bpf_clone(sk, newsk);
>  
>  	__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 326b58ff1118..c9654e690e1a 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -475,6 +475,7 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>  				  unsigned int *remaining)
>  {
>  	struct bpf_sock_ops_kern sock_ops;
> +	struct tcp_sock *tp = tcp_sk(sk);
>  	int err;
>  
>  	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
> @@ -519,7 +520,9 @@ static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb,
>  	if (skb)
>  		bpf_skops_init_skb(&sock_ops, skb, 0);
>  
> +	tp->bpf_hdr_opt_len_cb_inprogress = 1;
we check the BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG before calling BPF_CGROUP_RUN_PROG_SOCK_OPS_SK,
could this flag use for the same purpose? so we don't need to add an extra field.

	if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk),
					   BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) ||
	    !*remaining)
		return;
>  	err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk);
> +	tp->bpf_hdr_opt_len_cb_inprogress = 0;
>  
>  	if (err || sock_ops.remaining_opt_len == *remaining)
>  		return;

-- 
Thanks,
KaFai


^ permalink raw reply

* Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
From: David Hildenbrand (Arm) @ 2026-04-14 15:37 UTC (permalink / raw)
  To: Kiryl Shutsemau (Meta), Andrew Morton
  Cc: Peter Xu, Lorenzo Stoakes, Mike Rapoport, Suren Baghdasaryan,
	Vlastimil Babka, Liam R . Howlett, Zi Yan, Jonathan Corbet,
	Shuah Khan, Sean Christopherson, Paolo Bonzini, linux-mm,
	linux-kernel, linux-doc, linux-kselftest, kvm
In-Reply-To: <20260414142354.1465950-1-kas@kernel.org>

On 4/14/26 16:23, Kiryl Shutsemau (Meta) wrote:
> This series adds userfaultfd support for tracking the working set of
> VM guest memory, enabling VMMs to identify cold pages and evict them
> to tiered or remote storage.
> 
> == Problem ==
> 
> VMMs managing guest memory need to:
> 1. Track which pages are actively used (working set detection)
> 2. Safely evict cold pages to slower storage
> 3. Fetch pages back on demand when accessed again
> 
> For shmem-backed guest memory, working set tracking partially works
> today: MADV_DONTNEED zaps PTEs while pages stay in page cache, and
> re-access auto-resolves from cache. But safe eviction still requires
> synchronous fault interception to prevent data loss races.
> 
> For anonymous guest memory (needed for KSM cross-VM deduplication),
> there is no mechanism at all — clearing a PTE loses the page.
> 
> == Solution ==
> 
> The series introduces a unified userfaultfd interface that works
> across both anonymous and shmem-backed memory:
> 
> UFFD_FEATURE_MINOR_ANON: extends MODE_MINOR registration to anonymous
> private memory. Uses the PROT_NONE hinting mechanism (same as NUMA
> balancing) to make pages inaccessible without freeing them.

I would rather tackle this from the other direction: it's another form
of protection (like WP), not really a "minor" mode.

Could we add a UFFDIO_REGISTER_MODE_RWP (or however we would call it)
and support it for anon+shmem, avoiding the zapping for shmem completely?

-- 
Cheers,

David

^ permalink raw reply

* Re: [RFC, PATCH 00/12] userfaultfd: working set tracking for VM guest memory
From: Peter Xu @ 2026-04-14 15:28 UTC (permalink / raw)
  To: Kiryl Shutsemau (Meta)
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
	Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
	linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm,
	James Houghton, Andrea Arcangeli
In-Reply-To: <20260414142354.1465950-1-kas@kernel.org>

Hi, Kiryl,

On Tue, Apr 14, 2026 at 03:23:34PM +0100, Kiryl Shutsemau (Meta) wrote:
> This series adds userfaultfd support for tracking the working set of
> VM guest memory, enabling VMMs to identify cold pages and evict them
> to tiered or remote storage.

Thanks for sharing this work, it looks very interesting to me.

Personally I am also looking at some kind of VMM memtiering issues.  I'm
not sure if you saw my lsfmm proposal, it mentioned the challenge we're
facing, it's slightly different but still a bit relevant:

https://lore.kernel.org/all/aYuad2k75iD9bnBE@x1.local/

Unfortunately, that proposal was rejected upstream.

For us, it's so far more about migration and how migration process
introduce zero impact to guest workloads especially on hotness.  I'm not
sure if we have any shared goals over that aspect.

> 
> == Problem ==
> 
> VMMs managing guest memory need to:
> 1. Track which pages are actively used (working set detection)
> 2. Safely evict cold pages to slower storage
> 3. Fetch pages back on demand when accessed again
> 
> For shmem-backed guest memory, working set tracking partially works
> today: MADV_DONTNEED zaps PTEs while pages stay in page cache, and
> re-access auto-resolves from cache. But safe eviction still requires
> synchronous fault interception to prevent data loss races.
> 
> For anonymous guest memory (needed for KSM cross-VM deduplication),
> there is no mechanism at all — clearing a PTE loses the page.
> 
> == Solution ==
> 
> The series introduces a unified userfaultfd interface that works
> across both anonymous and shmem-backed memory:
> 
> UFFD_FEATURE_MINOR_ANON: extends MODE_MINOR registration to anonymous
> private memory. Uses the PROT_NONE hinting mechanism (same as NUMA
> balancing) to make pages inaccessible without freeing them.
> 
> UFFD_FEATURE_MINOR_ASYNC: auto-resolves minor faults without handler
> involvement. The kernel restores PTE permissions immediately and the
> faulting thread continues. Works for anonymous, shmem, and hugetlbfs.
> 
> UFFDIO_DEACTIVATE: marks pages as deactivated. For anonymous memory,
> sets PROT_NONE on PTEs (pages stay resident). For shmem/hugetlbfs,
> zaps PTEs (pages stay in page cache).
> 
> UFFDIO_SET_MODE: toggles MINOR_ASYNC at runtime, synchronized via
> mmap_write_lock. Enables the VMM workflow: async mode for lightweight
> detection, sync mode for race-free eviction.
> 
> PAGE_IS_UFFD_DEACTIVATED: PAGEMAP_SCAN category flag for efficient
> batch detection of cold (still-deactivated) anonymous pages.
> 
> == VMM Workflow ==

AFAIU, this workflow provides two functionalities:

> 
>     UFFDIO_DEACTIVATE(all)            -- async, no vCPU stalls
>     sleep(interval)
>     PAGEMAP_SCAN                      -- find cold pages

Until here it's only about page hotness tracking.  I am curious whether you
evaluated idle page tracking.  Is it because of perf overheads on rmap?  To
me, your solution (until here.. on the hotness sampling) reads more like a
more efficient way to do idle page tracking but only per-mm, not per-folio.

That will also be something I would like to benefit if QEMU will decide to
do full userspace swap.  I think that's our last resort, I'll likely start
with something that makes QEMU work together with Linux on swapping
(e.g. we're happy to make MGLRU or any reclaim logic that Linux mm
currently uses, as long as efficient) then QEMU only cares about the rest,
which is what the migration problem is about.

The other issue about idle page tracking to us is, I believe MGLRU
currently doesn't work well with it (due to ignoring IDLE bits) where the
old LRU algo works.  I'm not sure how much you evaluated above, so it'll be
great to share from that perspective too.  I also mentioned some of these
challenges in the lsfmm proposal link above.

>     UFFDIO_SET_MODE(sync)             -- block faults for eviction
>     pwrite + MADV_DONTNEED cold pages -- safe, faults block
>     UFFDIO_SET_MODE(async)            -- resume tracking

These operations are the 2nd function.  It's, IMHO, a full userspace swap
system based on userfaultfd.

Have you thought about directly relying on userfaultfd-wp to do this work?
The relevant question is, why do we need to block guest reads on pages
being evicted by the userapp?  Can we still allow that to happen, which
seems to be more efficient?  IIUC, only writes / updates matters in such
swap system.

Also, I'm not sure if you're aware of LLNL's umap library:

https://github.com/llnl/umap

That implemnted the swap system using userfaultfd wr-protect mode only, so
no new kernel API needed.

Thanks,

> 
> The same workflow applies to shmem, with a different PAGEMAP_SCAN mask
> (!PAGE_IS_PRESENT instead of PAGE_IS_UFFD_DEACTIVATED).
> 
> == NUMA Balancing ==
> 
> NUMA balancing scanning is skipped on anonymous VM_UFFD_MINOR VMAs to
> avoid protnone conflicts. NUMA locality stats are fed from the uffd
> fault path via task_numa_fault() so the scheduler retains placement
> data. Shmem VMAs are unaffected (UFFDIO_DEACTIVATE zaps PTEs there,
> no protnone involved).
> 
> == Testing ==
> 
> The series includes 6 new selftests covering async/sync modes,
> PAGEMAP_SCAN cold detection, GUP through protnone, UFFDIO_SET_MODE
> toggling, and cleanup on close. All 73 uffd unit tests pass
> (including hugetlb) across defconfig, allnoconfig, allmodconfig,
> and randomized configs.
> 
> Kiryl Shutsemau (Meta) (12):
>   userfaultfd: define UAPI constants for anonymous minor faults
>   userfaultfd: add UFFD_FEATURE_MINOR_ANON registration support
>   userfaultfd: implement UFFDIO_DEACTIVATE ioctl
>   userfaultfd: UFFDIO_CONTINUE for anonymous memory
>   mm: intercept protnone faults on VM_UFFD_MINOR anonymous VMAs
>   userfaultfd: auto-resolve shmem and hugetlbfs minor faults in async
>     mode
>   sched/numa: skip scanning anonymous VM_UFFD_MINOR VMAs
>   userfaultfd: enable UFFD_FEATURE_MINOR_ANON
>   mm/pagemap: add PAGE_IS_UFFD_DEACTIVATED to PAGEMAP_SCAN
>   userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
>   selftests/mm: add userfaultfd anonymous minor fault tests
>   Documentation/userfaultfd: document working set tracking
> 
>  Documentation/admin-guide/mm/userfaultfd.rst | 141 ++++-
>  fs/proc/task_mmu.c                           |  11 +-
>  fs/userfaultfd.c                             | 184 +++++-
>  include/linux/huge_mm.h                      |   6 +
>  include/linux/mm.h                           |   2 +
>  include/linux/sched/numa_balancing.h         |   1 +
>  include/linux/userfaultfd_k.h                |  21 +-
>  include/trace/events/sched.h                 |   3 +-
>  include/uapi/linux/fs.h                      |   1 +
>  include/uapi/linux/userfaultfd.h             |  40 +-
>  kernel/sched/fair.c                          |  13 +
>  mm/huge_memory.c                             |  33 +-
>  mm/hugetlb.c                                 |   3 +-
>  mm/memory.c                                  |  51 +-
>  mm/mprotect.c                                |   9 +-
>  mm/shmem.c                                   |   3 +-
>  mm/userfaultfd.c                             | 164 +++++-
>  tools/testing/selftests/mm/uffd-unit-tests.c | 458 +++++++++++++++
>  18 files changed, 1096 insertions(+), 48 deletions(-)
> 
> Kiryl Shutsemau (Meta) (12):
>   userfaultfd: define UAPI constants for anonymous minor faults
>   userfaultfd: add UFFD_FEATURE_MINOR_ANON registration support
>   userfaultfd: implement UFFDIO_DEACTIVATE ioctl
>   userfaultfd: UFFDIO_CONTINUE for anonymous memory
>   mm: intercept protnone faults on VM_UFFD_MINOR anonymous VMAs
>   userfaultfd: auto-resolve shmem and hugetlbfs minor faults in async
>     mode
>   sched/numa: skip scanning anonymous VM_UFFD_MINOR VMAs
>   userfaultfd: enable UFFD_FEATURE_MINOR_ANON
>   mm/pagemap: add PAGE_IS_UFFD_DEACTIVATED to PAGEMAP_SCAN
>   userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
>   selftests/mm: add userfaultfd anonymous minor fault tests
>   Documentation/userfaultfd: document working set tracking
> 
>  Documentation/admin-guide/mm/userfaultfd.rst | 141 +++++-
>  fs/proc/task_mmu.c                           |  11 +-
>  fs/userfaultfd.c                             | 184 +++++++-
>  include/linux/huge_mm.h                      |   6 +
>  include/linux/mm.h                           |   2 +
>  include/linux/sched/numa_balancing.h         |   1 +
>  include/linux/userfaultfd_k.h                |  21 +-
>  include/trace/events/sched.h                 |   3 +-
>  include/uapi/linux/fs.h                      |   1 +
>  include/uapi/linux/userfaultfd.h             |  40 +-
>  kernel/sched/fair.c                          |  13 +
>  mm/huge_memory.c                             |  33 +-
>  mm/hugetlb.c                                 |   3 +-
>  mm/memory.c                                  |  51 ++-
>  mm/mprotect.c                                |   9 +-
>  mm/shmem.c                                   |   3 +-
>  mm/userfaultfd.c                             | 164 ++++++-
>  tools/testing/selftests/mm/uffd-unit-tests.c | 458 +++++++++++++++++++
>  18 files changed, 1096 insertions(+), 48 deletions(-)
> 
> -- 
> 2.51.2
> 
> 

-- 
Peter Xu


^ permalink raw reply

* Re: [PATCH V10 00/10] famfs: port into fuse
From: John Groves @ 2026-04-14 15:23 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Joanne Koong, Bernd Schubert, John Groves, Dan Williams,
	Bernd Schubert, Alison Schofield, John Groves, Jonathan Corbet,
	Shuah Khan, Vishal Verma, Dave Jiang, Matthew Wilcox, Jan Kara,
	Alexander Viro, David Hildenbrand, Christian Brauner,
	Darrick J . Wong, Randy Dunlap, Jeff Layton, Amir Goldstein,
	Jonathan Cameron, Stefan Hajnoczi, Josef Bacik, Bagas Sanjaya,
	Chen Linxuan, James Morse, Fuad Tabba, Sean Christopherson,
	Shivank Garg, Ackerley Tng, Gregory Price, Aravind Ramesh,
	Ajay Joshi, venkataravis@micron.com, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, nvdimm@lists.linux.dev,
	linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org, djbw
In-Reply-To: <CAJfpegsCoMMg-Ux3CbBh0d1uqDNg3Fu_8YE-LubwrQ6A-2Cggw@mail.gmail.com>

On 26/04/14 04:18PM, Miklos Szeredi wrote:
> On Tue, 14 Apr 2026 at 15:41, John Groves <John@groves.net> wrote:
> 
> > My short response: Noooooooooo!!!!!!
> 
> :) Seems like this is a highly emotional topic...  I suggest that we
> go ahead with bpf experiments, then discuss results and path forward
> at LSM.
> 
> Thanks,
> Miklos

I think we need to try to emergency-add a session at LSFMM on this, with
fs/mm/bpf people. Any ideas on how to do this?

John


^ permalink raw reply

* Re: [PATCH v7 5/6] iio: adc: ad4691: add oversampling support
From: David Lechner @ 2026-04-14 15:02 UTC (permalink / raw)
  To: Sabau, Radu bogdan, Jonathan Cameron
  Cc: Lars-Peter Clausen, Hennerich, Michael, Sa, Nuno, Andy Shevchenko,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Uwe Kleine-König, Liam Girdwood, Mark Brown, Linus Walleij,
	Bartosz Golaszewski, Philipp Zabel, Jonathan Corbet, Shuah Khan,
	linux-iio@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pwm@vger.kernel.org,
	linux-gpio@vger.kernel.org, linux-doc@vger.kernel.org
In-Reply-To: <LV9PR03MB8414E0A68C5676302909E220F7252@LV9PR03MB8414.namprd03.prod.outlook.com>

On 4/14/26 9:25 AM, Sabau, Radu bogdan wrote:
> 
> 
>> -----Original Message-----
>> From: Jonathan Cameron <jic23@kernel.org>
>> Sent: Sunday, April 12, 2026 8:58 PM
>> To: David Lechner <dlechner@baylibre.com>
>> Cc: Sabau, Radu bogdan <Radu.Sabau@analog.com>; Lars-Peter Clausen
>> <lars@metafoo.de>; Hennerich, Michael <Michael.Hennerich@analog.com>;
>> Sa, Nuno <Nuno.Sa@analog.com>; Andy Shevchenko <andy@kernel.org>;
>> Rob Herring <robh@kernel.org>; Krzysztof Kozlowski <krzk+dt@kernel.org>;
>> Conor Dooley <conor+dt@kernel.org>; Uwe Kleine-König
>> <ukleinek@kernel.org>; Liam Girdwood <lgirdwood@gmail.com>; Mark Brown
>> <broonie@kernel.org>; Linus Walleij <linusw@kernel.org>; Bartosz
>> Golaszewski <brgl@kernel.org>; Philipp Zabel <p.zabel@pengutronix.de>;
>> Jonathan Corbet <corbet@lwn.net>; Shuah Khan
>> <skhan@linuxfoundation.org>; linux-iio@vger.kernel.org;
>> devicetree@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
>> pwm@vger.kernel.org; linux-gpio@vger.kernel.org; linux-doc@vger.kernel.org
>> Subject: Re: [PATCH v7 5/6] iio: adc: ad4691: add oversampling support
>>
>> [External]
>>
>> On Fri, 10 Apr 2026 16:15:20 -0500
>> David Lechner <dlechner@baylibre.com> wrote:
>>
>>> On 4/9/26 10:28 AM, Radu Sabau via B4 Relay wrote:
>>>> From: Radu Sabau <radu.sabau@analog.com>
>>>>
>>>> Add per-channel oversampling ratio (OSR) support for CNV burst mode.
>>>> The accumulator depth register (ACC_DEPTH_IN) is programmed with the
>>>> selected OSR at buffer enable time and before each single-shot read.
>>>>
>>>> Supported OSR values: 1, 2, 4, 8, 16, 32.
>>>>
>>>> Introduce AD4691_MANUAL_CHANNEL() for manual mode channels,
>> which do
>>>> not expose the oversampling ratio attribute since OSR is not applicable
>>>> in that mode. A separate manual_channels array is added to
>>>> struct ad4691_channel_info and selected at probe time; offload paths
>>>> reuse the same arrays with num_channels capping access before the soft
>>>> timestamp entry.
>>>>
>>>> The reported sampling frequency accounts for the active OSR:
>>>> effective_freq = oscillator_freq / osr
>>>
>>> Technically, the way this is implemented is fine according to IIO ABI
>>> rules. Writing any attribute can cause others to change. It does
>>> introduce a potential pitfall though. Currently, changing the OSR will
>>> change the sampling frequency, so you have to always write
>> oversampling_ratio
>>> first, then write sampling_frequency to get what you asked for. If you want
>>> to change the OSR and keep the same sample rate, you still have to write
>> both
>>> attributes again.
>>>
>>> In other drivers, I've implemented it so that the requested sampling
>> frequency
>>> is stored any you always get the closest sampling frequency available based
>> on
>>> the oversampling ratio. This way, it doesn't matter which order you write
>>> the attributes. In that case, the actual periodic trigger source isn't set up
>>> until we actually start sampling.
>>>
>> Agreed. This is more intuitive. Now generally the userspace should
>> be sanity checking the value anyway as limitations may mean the new
>> sampling frequency is not particularly close to the original one but
>> at least it increases the chances of getting the expected value somewhat!
>>
>> So to me this is a nice useability improvement given the code to implement
>> it tends not to be too complex.
>>
> 
> Hi David, Jonathan,
> 
> What I understand from this is that the osr should be taken into account when writing
> the sampling frequency as well, right? Here's what I understand:
> 
> If the user wants a 125kHz freq with 4 OSR, then when internal osc will be written
> to 500kHz before single-shot read, buffer preenable/postenable.
> However, if the user wants a 500kHz frequency with 4 OSR, that would mean a 2MHz
> Internal osc freq, which is impossible.

It is up to the user to request something that is legal. They should know this
from reading the datasheet.

> 
> More than this, if the OSR is 32 the maximum effective rate would be 31250, so 25kHz
> would make it the closes available one. If the user would select 1MHz from the available
> list it would be weird I would say. So perhaps a solution for this is to display the avail list
> depending on the set OSR value.

Yes, the available list should reflect the current state of any other attributes
that affect it.

> 
> Linking the two together is perhaps wrong to begin with from my end, since in this
> driver's case, the per-channel sampling frequency is controlled by the internal oscillator
> which has static available values. So perhaps sampling frequency should be separate, and
> OSR separate as well, which would make everything cleaner.
> 
> Indeed, the effective rate is changed by OSR, but perhaps that is something the user
> should be aware of, since the sampling frequency is the rate at which the channel samples
> (1 sample per period) and OSR is how many times the channel samples upon a final sample
> is to be read. The user already has to take this into account when setting the buffer
> sampling frequency, so it would make sense to take this into account here too.

We can't change the definition of the IIO ABI just to make one driver simpler
to implement. The OSR and sample rate can't be completely independent.

If you want to leave it the way it is currently implemented though, that is fine.

> 
> Please let me know you thoughts on this,
> Radu


^ permalink raw reply

* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
From: Alexei Starovoitov @ 2026-04-14 14:33 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, Network Development, open list:DOCUMENTATION, LKML
In-Reply-To: <20260414105702.248310-1-jiayuan.chen@linux.dev>

On Tue, Apr 14, 2026 at 3:57 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
> to inject custom TCP header options. When the kernel builds a TCP packet,
> it calls tcp_established_options() to calculate the header size, which
> invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
> callback.
>
> If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
> __tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
> tcp_current_mss(), which calls tcp_established_options() again,
> re-triggering the same BPF callback. This creates an infinite recursion
> that exhausts the kernel stack and causes a panic.
>
> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>   -> bpf_setsockopt(TCP_NODELAY)
>         -> tcp_push_pending_frames()
>           -> tcp_current_mss()
>                 -> tcp_established_options()
>                   -> bpf_skops_hdr_opt_len()
>                            /* infinite recursion */
>                         -> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>
> A similar reentrancy issue exists for TCP congestion control, which is
> guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
> tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
> bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
> bpf_setsockopt(TCP_NODELAY) calls that would trigger
> tcp_push_pending_frames() and cause the recursion.
>
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> Fixes: 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>  Documentation/networking/net_cachelines/tcp_sock.rst |  1 +
>  include/linux/tcp.h                                  | 11 ++++++++++-
>  net/core/filter.c                                    |  4 ++++
>  net/ipv4/tcp_minisocks.c                             |  1 +
>  net/ipv4/tcp_output.c                                |  3 +++
>  5 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> index 563daea10d6c..07d3226d90cc 100644
> --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> @@ -152,6 +152,7 @@ unsigned_int                  keepalive_intvl
>  int                           linger2
>  u8                            bpf_sock_ops_cb_flags
>  u8:1                          bpf_chg_cc_inprogress
> +u8:1                          bpf_hdr_opt_len_cb_inprogress
>  u16                           timeout_rehash
>  u32                           rcv_ooopack
>  u32                           rcv_rtt_last_tsecr
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f72eef31fa23..2bfb73cf922e 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -475,12 +475,21 @@ struct tcp_sock {
>         u8      bpf_sock_ops_cb_flags;  /* Control calling BPF programs
>                                          * values defined in uapi/linux/tcp.h
>                                          */
> -       u8      bpf_chg_cc_inprogress:1; /* In the middle of
> +       u8      bpf_chg_cc_inprogress:1, /* In the middle of
>                                           * bpf_setsockopt(TCP_CONGESTION),
>                                           * it is to avoid the bpf_tcp_cc->init()
>                                           * to recur itself by calling
>                                           * bpf_setsockopt(TCP_CONGESTION, "itself").
>                                           */
> +               bpf_hdr_opt_len_cb_inprogress:1; /* It is set before invoking the
> +                                                 * callback so that a nested
> +                                                 * bpf_setsockopt(TCP_NODELAY) or
> +                                                 * bpf_setsockopt(TCP_CORK) cannot
> +                                                 * trigger tcp_push_pending_frames(),
> +                                                 * which would call tcp_current_mss()
> +                                                 * -> bpf_skops_hdr_opt_len(), causing
> +                                                 * infinite recursion.

Let's not add new bits.
Reuse existing and test/check all in one place,
like commit 061ff040710e9 did.

pw-bot: cr

^ permalink raw reply

* Re: maintainer profiles
From: Mauro Carvalho Chehab @ 2026-04-14 14:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jonathan Corbet, Randy Dunlap, Linux Documentation,
	Linux Kernel Mailing List, Linux Kernel Workflows
In-Reply-To: <20260414143733.6cbd6d62@localhost>

On Tue, 14 Apr 2026 14:37:33 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> On Mon, 13 Apr 2026 14:39:37 -0700
> Dan Williams <djbw@kernel.org> wrote:
> 
> > Jonathan Corbet wrote:  
> > > Randy Dunlap <rdunlap@infradead.org> writes:
> > >     
> > > > Hi,
> > > >
> > > > Is there supposed to be a difference (or distinction) in the contents of
> > > >
> > > > Documentation/process/maintainer-handbooks.rst
> > > > and
> > > > Documentation/maintainer/maintainer-entry-profile.rst
> > > > ?
> > > >
> > > > Can they be combined into one location?    
> > > 
> > > Late to the party, sorry ... the original idea, I believe, was that
> > > maintainer-handbooks.rst would be for developers looking for a guidebook
> > > for a specific subsystem, while maintainer-entry-profile.rst was about
> > > how maintainers themselves should write their subsystem guide.
> > > Doubtless things have drifted since then...  But the intended audiences
> > > were different, so it might be good to think about bringing them back
> > > into focus.    
> > 
> > Right, I think something (roughly / hand-wavy) like the below is the
> > intent. However, as I write that I notice that the combined list is a
> > bit of a mess. I also notice that there are more "P:" entries in
> > MAINTAINERS than there are entries in this maintainer-handbooks.rst
> > list.
> > 
> > So this probably wants to be a script that can build Documentation links
> > from MAINTAINERS, or otherwise provide a script for developers to query
> > a kernel tree for additional submission guides. It is probably not as
> > important for the built docs to link all guides as it is for developers
> > (or their agents) to live query a tree they are developing against.  
> 
> There is already a Python script which parses MAINTAINERS file
> (Documentation/sphinx/maintainers_include.py).
> 
> Currently, it expects a Sphinx meta-tag inside
> Documentation/process/maintainers.rst:
> 
> 	.. maintainers-include::
> 
> I guess it shouldn't be hard to add support there for a
> 
> 	.. maintainers-profile::
> 
> Making it creating a set of cross-references is probably easy. Not
> sure how easy/hard would be to create a TOC tree, though.

It was actually easier than what I would expect ;-)

Just submitted a patch series doing that:

https://lore.kernel.org/linux-doc/cover.1776176108.git.mchehab+huawei@kernel.org/T/#t

> > diff --git a/Documentation/maintainer/maintainer-entry-profile.rst b/Documentation/maintainer/maintainer-entry-profile.rst
> > index 6020d188e13d..58e2af333692 100644

...

If you transform this diff into a patch, it would make sense to
add together with the next version of my RFC ;-)

-- 
Thanks,
Mauro

^ permalink raw reply

* [PATCH RFC 4/4] docs: auto-generate maintainer entry profile links
From: Mauro Carvalho Chehab @ 2026-04-14 14:29 UTC (permalink / raw)
  To: Jonathan Corbet, Linux Doc Mailing List
  Cc: Mauro Carvalho Chehab, linux-kernel, linux-riscv, workflows,
	Albert Ou, Alexandre Ghiti, Dan Williams, Palmer Dabbelt,
	Paul Walmsley, Randy Dunlap, Shuah Khan
In-Reply-To: <cover.1776176108.git.mchehab+huawei@kernel.org>

Instead of manually creating a TOC tree for them, use the new
tag to auto-generate its TOC.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 .../maintainer/maintainer-entry-profile.rst     | 17 ++---------------
 Documentation/process/maintainer-handbooks.rst  | 10 +---------
 2 files changed, 3 insertions(+), 24 deletions(-)

diff --git a/Documentation/maintainer/maintainer-entry-profile.rst b/Documentation/maintainer/maintainer-entry-profile.rst
index 6020d188e13d..48ecabd4ce13 100644
--- a/Documentation/maintainer/maintainer-entry-profile.rst
+++ b/Documentation/maintainer/maintainer-entry-profile.rst
@@ -98,18 +98,5 @@ Existing profiles
 For now, existing maintainer profiles are listed here; we will likely want
 to do something different in the near future.
 
-.. toctree::
-   :maxdepth: 1
-
-   ../doc-guide/maintainer-profile
-   ../nvdimm/maintainer-entry-profile
-   ../arch/riscv/patch-acceptance
-   ../process/maintainer-soc
-   ../process/maintainer-soc-clean-dts
-   ../driver-api/media/maintainer-entry-profile
-   ../process/maintainer-netdev
-   ../driver-api/vfio-pci-device-specific-driver-acceptance
-   ../nvme/feature-and-quirk-policy
-   ../filesystems/nfs/nfsd-maintainer-entry-profile
-   ../filesystems/xfs/xfs-maintainer-entry-profile
-   ../mm/damon/maintainer-profile
+See Documentation/process/maintainer-handbooks.rst for subsystem-specific
+profiles.
diff --git a/Documentation/process/maintainer-handbooks.rst b/Documentation/process/maintainer-handbooks.rst
index 3d72ad25fc6a..d3d74c719018 100644
--- a/Documentation/process/maintainer-handbooks.rst
+++ b/Documentation/process/maintainer-handbooks.rst
@@ -9,12 +9,4 @@ which is supplementary to the general development process handbook
 
 Contents:
 
-.. toctree::
-   :numbered:
-   :maxdepth: 2
-
-   maintainer-netdev
-   maintainer-soc
-   maintainer-soc-clean-dts
-   maintainer-tip
-   maintainer-kvm-x86
+.. maintainers-profile-toc::
-- 
2.52.0


^ permalink raw reply related

* [PATCH RFC 3/4] MAINTAINERS: add maintainer-tip.rst to X86
From: Mauro Carvalho Chehab @ 2026-04-14 14:29 UTC (permalink / raw)
  To: Jonathan Corbet, Linux Doc Mailing List
  Cc: Mauro Carvalho Chehab, linux-kernel, linux-riscv, workflows,
	Dan Williams, Randy Dunlap
In-Reply-To: <cover.1776176108.git.mchehab+huawei@kernel.org>

While the maintainer's profile for tip is there, it is not
at X86 maintainer's entry.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 620219e48f98..a85fcae5f56e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -28560,6 +28560,7 @@ M:	Ingo Molnar <mingo@redhat.com>
 M:	Borislav Petkov <bp@alien8.de>
 M:	Dave Hansen <dave.hansen@linux.intel.com>
 M:	x86@kernel.org
+P:	Documentation/process/maintainer-tip.rst
 R:	"H. Peter Anvin" <hpa@zytor.com>
 L:	linux-kernel@vger.kernel.org
 S:	Maintained
-- 
2.52.0


^ permalink raw reply related

* [PATCH RFC 2/4] MAINTAINERS: add an entry for media maintainers profile
From: Mauro Carvalho Chehab @ 2026-04-14 14:29 UTC (permalink / raw)
  To: Jonathan Corbet, Linux Doc Mailing List
  Cc: Mauro Carvalho Chehab, linux-kernel, linux-riscv, workflows,
	Dan Williams, Randy Dunlap
In-Reply-To: <cover.1776176108.git.mchehab+huawei@kernel.org>

While media has a maintainers entry profile, its entry is
missing at MAINTAINERS.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index f0b106a4dd96..620219e48f98 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16115,6 +16115,7 @@ S:	Maintained
 W:	https://linuxtv.org
 Q:	http://patchwork.kernel.org/project/linux-media/list/
 T:	git git://linuxtv.org/media.git
+P:	Documentation/driver-api/media/maintainer-entry-profile.rst
 F:	Documentation/admin-guide/media/
 F:	Documentation/devicetree/bindings/media/
 F:	Documentation/driver-api/media/
-- 
2.52.0


^ permalink raw reply related

* [PATCH RFC 1/4] docs: maintainers_include: auto-generate maintainer profile TOC
From: Mauro Carvalho Chehab @ 2026-04-14 14:29 UTC (permalink / raw)
  To: Jonathan Corbet, Linux Doc Mailing List, Mauro Carvalho Chehab
  Cc: Mauro Carvalho Chehab, linux-kernel, linux-riscv, workflows,
	Dan Williams, Randy Dunlap, Shuah Khan
In-Reply-To: <cover.1776176108.git.mchehab+huawei@kernel.org>

Add a feature to allow auto-generating media entry profiles from the
corresponding field inside MAINTAINERS file(s).

Suggested-by: Dan Williams <djbw@kernel.org>
Closes: https://lore.kernel.org/linux-doc/69dd6299440be_147c801005b@djbw-dev.notmuch/
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
---
 Documentation/sphinx/maintainers_include.py | 93 +++++++++++++++++----
 1 file changed, 76 insertions(+), 17 deletions(-)

diff --git a/Documentation/sphinx/maintainers_include.py b/Documentation/sphinx/maintainers_include.py
index 519ad18685b2..1dac83bf1a65 100755
--- a/Documentation/sphinx/maintainers_include.py
+++ b/Documentation/sphinx/maintainers_include.py
@@ -21,6 +21,8 @@ import sys
 import re
 import os.path
 
+from textwrap import indent
+
 from docutils import statemachine
 from docutils.parsers.rst import Directive
 from docutils.parsers.rst.directives.misc import Include
@@ -30,20 +32,11 @@ def ErrorString(exc):  # Shamelessly stolen from docutils
 
 __version__  = '1.0'
 
-def setup(app):
-    app.add_directive("maintainers-include", MaintainersInclude)
-    return dict(
-        version = __version__,
-        parallel_read_safe = True,
-        parallel_write_safe = True
-    )
+class MaintainersParser:
+    """Parse MAINTAINERS file(s) content"""
 
-class MaintainersInclude(Include):
-    """MaintainersInclude (``maintainers-include``) directive"""
-    required_arguments = 0
-
-    def parse_maintainers(self, path):
-        """Parse all the MAINTAINERS lines into ReST for human-readability"""
+    def __init__(self, base_path, path):
+        self.profiles = list()
 
         result = list()
         result.append(".. _maintainers:")
@@ -78,6 +71,12 @@ class MaintainersInclude(Include):
             # Drop needless input whitespace.
             line = line.rstrip()
 
+            match = re.match(r"P:\s*(Documentation/\S+)\.rst", line)
+            if match:
+                fname = os.path.relpath(match.group(1), base_path)
+                if fname not in self.profiles:
+                    self.profiles.append(fname)
+
             # Linkify all non-wildcard refs to ReST files in Documentation/.
             pat = r'(Documentation/([^\s\?\*]*)\.rst)'
             m = re.search(pat, line)
@@ -165,12 +164,23 @@ class MaintainersInclude(Include):
             for separated in field_content.split('\n'):
                 result.append(separated)
 
-        output = "\n".join(result)
+        self.output = "\n".join(result)
+
+        # Create a TOC class
+
+class MaintainersInclude(Include):
+    """MaintainersInclude (``maintainers-include``) directive"""
+    required_arguments = 0
+
+    def emit(self, base_path, path):
+        """Parse all the MAINTAINERS lines into ReST for human-readability"""
+
+        output = MaintainersParser(base_path, path).output
+
         # For debugging the pre-rendered results...
         #print(output, file=open("/tmp/MAINTAINERS.rst", "w"))
 
-        self.state_machine.insert_input(
-          statemachine.string2lines(output), path)
+        self.state_machine.insert_input(statemachine.string2lines(output), path)
 
     def run(self):
         """Include the MAINTAINERS file as part of this reST file."""
@@ -186,12 +196,61 @@ class MaintainersInclude(Include):
 
         # Append "MAINTAINERS"
         path = os.path.join(path, "MAINTAINERS")
+        base_path = os.path.dirname(self.state.document.document.current_source)
 
         try:
             self.state.document.settings.record_dependencies.add(path)
-            lines = self.parse_maintainers(path)
+            lines = self.emit(base_path, path)
         except IOError as error:
             raise self.severe('Problems with "%s" directive path:\n%s.' %
                       (self.name, ErrorString(error)))
 
         return []
+
+class MaintainersProfile(Include):
+    required_arguments = 0
+
+    def emit(self, base_path, path):
+        """Parse all the MAINTAINERS lines looking for profile entries"""
+
+        profiles = MaintainersParser(base_path, path).profiles
+
+        output  = ".. toctree::\n"
+        output += "   :maxdepth: 2\n\n"
+        output += indent("\n".join(profiles), "   ")
+
+        self.state_machine.insert_input(statemachine.string2lines(output), path)
+
+    def run(self):
+        """Include the MAINTAINERS file as part of this reST file."""
+        if not self.state.document.settings.file_insertion_enabled:
+            raise self.warning('"%s" directive disabled.' % self.name)
+
+        # Walk up source path directories to find Documentation/../
+        path = self.state_machine.document.attributes['source']
+        path = os.path.realpath(path)
+        tail = path
+        while tail != "Documentation" and tail != "":
+            (path, tail) = os.path.split(path)
+
+        # Append "MAINTAINERS"
+        path = os.path.join(path, "MAINTAINERS")
+        base_path = os.path.dirname(self.state.document.document.current_source)
+
+        try:
+            self.state.document.settings.record_dependencies.add(path)
+            lines = self.emit(base_path, path)
+        except IOError as error:
+            raise self.severe('Problems with "%s" directive path:\n%s.' %
+                      (self.name, ErrorString(error)))
+
+        return []
+
+def setup(app):
+    app.add_directive("maintainers-include", MaintainersInclude)
+    app.add_directive("maintainers-profile-toc", MaintainersProfile)
+    return dict(
+        version = __version__,
+        parallel_read_safe = True,
+        parallel_write_safe = True
+    )
-- 
2.52.0


^ permalink raw reply related

* [PATCH RFC 0/4] Auto-generate maintainer profile entries
From: Mauro Carvalho Chehab @ 2026-04-14 14:29 UTC (permalink / raw)
  To: Albert Ou, Jonathan Corbet, Dan Williams, Mauro Carvalho Chehab,
	Palmer Dabbelt, Paul Walmsley
  Cc: Mauro Carvalho Chehab, Randy Dunlap, linux-doc, linux-kernel,
	linux-riscv, workflows, Alexandre Ghiti, Shuah Khan

Hi Dan/Jon,

This small patch series change the way maintainer entry profile links
are added to the documentation. Instead of having an entry for
each of them at an ReST file, get them from MAINTAINERS content.

That should likely make easier to maintain, as there will be a single
point to place all such profiles.

I made this as an RFC. The goal is mostly to be a start of discussions
about how this is implemented.

Also, it should be noticed that  I'm not incorporating the diff
content from Dan's sugggestion, as it was just an e-mail reply without
a proper patch title/description/SoB.

Some points on this RFC:

1. some P: entries are links to web pages. The current approach
   ignores them;

2. the current logic doesn't use glob. So, if one would add an
   entry like:

	P: Documentation/foo/profiles-*.rst

   it will generate an entry like "../foo/profiles-*".

   This probably works, as toc trees accept glob.

3. entries are placed at the order they occur at MAINTAINERS
   file (but duplication is properly handled);

4. as Randy mentioned, if an entry there is inside another TOC
   using numeration, those entries will have numeration as well;

5. the approach I took on patch 1 was a little bit lazy, as it
   ends processing MAINTAINERS two times, and there are some code
   duplication on different classes to handle path. I opted to do
   this way to minimize the differences, but it makes sense to 
   clean the code later on newer versions of this series or after
   applying it;

6. patches 2 and 3 can be applied independently of this approach.
   They just add two missing "P:" entries to MAINTAINERS.

Suggested-by: Dan Williams <djbw@kernel.org>
Closes: https://lore.kernel.org/linux-doc/69dd6299440be_147c801005b@djbw-dev.notmuch/

Mauro Carvalho Chehab (4):
  docs: maintainers_include: auto-generate maintainer profile TOC
  MAINTAINERS: add an entry for media maintainers profile
  MAINTAINERS: add maintainer-tip.rst to X86
  docs: auto-generate maintainer entry profile links

 .../maintainer/maintainer-entry-profile.rst   | 17 +---
 .../process/maintainer-handbooks.rst          | 10 +-
 Documentation/sphinx/maintainers_include.py   | 93 +++++++++++++++----
 MAINTAINERS                                   |  2 +
 4 files changed, 81 insertions(+), 41 deletions(-)

-- 
2.52.0

^ permalink raw reply

* RE: [PATCH v7 5/6] iio: adc: ad4691: add oversampling support
From: Sabau, Radu bogdan @ 2026-04-14 14:25 UTC (permalink / raw)
  To: Jonathan Cameron, David Lechner
  Cc: Lars-Peter Clausen, Hennerich, Michael, Sa, Nuno, Andy Shevchenko,
	Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Uwe Kleine-König, Liam Girdwood, Mark Brown, Linus Walleij,
	Bartosz Golaszewski, Philipp Zabel, Jonathan Corbet, Shuah Khan,
	linux-iio@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pwm@vger.kernel.org,
	linux-gpio@vger.kernel.org, linux-doc@vger.kernel.org
In-Reply-To: <20260412185821.739e477f@jic23-huawei>



> -----Original Message-----
> From: Jonathan Cameron <jic23@kernel.org>
> Sent: Sunday, April 12, 2026 8:58 PM
> To: David Lechner <dlechner@baylibre.com>
> Cc: Sabau, Radu bogdan <Radu.Sabau@analog.com>; Lars-Peter Clausen
> <lars@metafoo.de>; Hennerich, Michael <Michael.Hennerich@analog.com>;
> Sa, Nuno <Nuno.Sa@analog.com>; Andy Shevchenko <andy@kernel.org>;
> Rob Herring <robh@kernel.org>; Krzysztof Kozlowski <krzk+dt@kernel.org>;
> Conor Dooley <conor+dt@kernel.org>; Uwe Kleine-König
> <ukleinek@kernel.org>; Liam Girdwood <lgirdwood@gmail.com>; Mark Brown
> <broonie@kernel.org>; Linus Walleij <linusw@kernel.org>; Bartosz
> Golaszewski <brgl@kernel.org>; Philipp Zabel <p.zabel@pengutronix.de>;
> Jonathan Corbet <corbet@lwn.net>; Shuah Khan
> <skhan@linuxfoundation.org>; linux-iio@vger.kernel.org;
> devicetree@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> pwm@vger.kernel.org; linux-gpio@vger.kernel.org; linux-doc@vger.kernel.org
> Subject: Re: [PATCH v7 5/6] iio: adc: ad4691: add oversampling support
> 
> [External]
> 
> On Fri, 10 Apr 2026 16:15:20 -0500
> David Lechner <dlechner@baylibre.com> wrote:
> 
> > On 4/9/26 10:28 AM, Radu Sabau via B4 Relay wrote:
> > > From: Radu Sabau <radu.sabau@analog.com>
> > >
> > > Add per-channel oversampling ratio (OSR) support for CNV burst mode.
> > > The accumulator depth register (ACC_DEPTH_IN) is programmed with the
> > > selected OSR at buffer enable time and before each single-shot read.
> > >
> > > Supported OSR values: 1, 2, 4, 8, 16, 32.
> > >
> > > Introduce AD4691_MANUAL_CHANNEL() for manual mode channels,
> which do
> > > not expose the oversampling ratio attribute since OSR is not applicable
> > > in that mode. A separate manual_channels array is added to
> > > struct ad4691_channel_info and selected at probe time; offload paths
> > > reuse the same arrays with num_channels capping access before the soft
> > > timestamp entry.
> > >
> > > The reported sampling frequency accounts for the active OSR:
> > > effective_freq = oscillator_freq / osr
> >
> > Technically, the way this is implemented is fine according to IIO ABI
> > rules. Writing any attribute can cause others to change. It does
> > introduce a potential pitfall though. Currently, changing the OSR will
> > change the sampling frequency, so you have to always write
> oversampling_ratio
> > first, then write sampling_frequency to get what you asked for. If you want
> > to change the OSR and keep the same sample rate, you still have to write
> both
> > attributes again.
> >
> > In other drivers, I've implemented it so that the requested sampling
> frequency
> > is stored any you always get the closest sampling frequency available based
> on
> > the oversampling ratio. This way, it doesn't matter which order you write
> > the attributes. In that case, the actual periodic trigger source isn't set up
> > until we actually start sampling.
> >
> Agreed. This is more intuitive. Now generally the userspace should
> be sanity checking the value anyway as limitations may mean the new
> sampling frequency is not particularly close to the original one but
> at least it increases the chances of getting the expected value somewhat!
> 
> So to me this is a nice useability improvement given the code to implement
> it tends not to be too complex.
> 

Hi David, Jonathan,

What I understand from this is that the osr should be taken into account when writing
the sampling frequency as well, right? Here's what I understand:

If the user wants a 125kHz freq with 4 OSR, then when internal osc will be written
to 500kHz before single-shot read, buffer preenable/postenable.
However, if the user wants a 500kHz frequency with 4 OSR, that would mean a 2MHz
Internal osc freq, which is impossible.

More than this, if the OSR is 32 the maximum effective rate would be 31250, so 25kHz
would make it the closes available one. If the user would select 1MHz from the available
list it would be weird I would say. So perhaps a solution for this is to display the avail list
depending on the set OSR value.

Linking the two together is perhaps wrong to begin with from my end, since in this
driver's case, the per-channel sampling frequency is controlled by the internal oscillator
which has static available values. So perhaps sampling frequency should be separate, and
OSR separate as well, which would make everything cleaner.

Indeed, the effective rate is changed by OSR, but perhaps that is something the user
should be aware of, since the sampling frequency is the rate at which the channel samples
(1 sample per period) and OSR is how many times the channel samples upon a final sample
is to be read. The user already has to take this into account when setting the buffer
sampling frequency, so it would make sense to take this into account here too.

Please let me know you thoughts on this,
Radu

^ permalink raw reply

* [RFC, PATCH 12/12] Documentation/userfaultfd: document working set tracking
From: Kiryl Shutsemau (Meta) @ 2026-04-14 14:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Xu, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
	Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
	linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm,
	Kiryl Shutsemau (Meta)
In-Reply-To: <20260414142354.1465950-1-kas@kernel.org>

Document the new userfaultfd capabilities for VM working set tracking:

- UFFD_FEATURE_MINOR_ANON and UFFD_FEATURE_MINOR_ASYNC for anonymous
  minor fault interception using the PROT_NONE hinting mechanism.
- UFFDIO_DEACTIVATE for marking pages as inaccessible while keeping
  them resident.
- Sync and async fault resolution modes, and UFFDIO_SET_MODE for
  runtime toggling between them.
- PAGEMAP_SCAN with PAGE_IS_UFFD_DEACTIVATED for cold page detection.
- Cleanup semantics on unregister and close.
- NUMA balancing interaction on anonymous VMAs.
- Complete VMM workflow example for the cold page eviction lifecycle,
  with a note on shmem applicability.

Update the feature flag descriptions at the top of the guide to
reference the new section.

Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
---
 Documentation/admin-guide/mm/userfaultfd.rst | 141 ++++++++++++++++++-
 1 file changed, 140 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index e5cc8848dcb3..fc89e029060c 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -111,7 +111,11 @@ events, except page fault notifications, may be generated:
 - ``UFFD_FEATURE_MINOR_HUGETLBFS`` indicates that the kernel supports
   ``UFFDIO_REGISTER_MODE_MINOR`` registration for hugetlbfs virtual memory
   areas. ``UFFD_FEATURE_MINOR_SHMEM`` is the analogous feature indicating
-  support for shmem virtual memory areas.
+  support for shmem virtual memory areas. ``UFFD_FEATURE_MINOR_ANON``
+  extends minor fault support to anonymous private memory using
+  PROT_NONE hinting; see the `Anonymous Minor Faults`_ section.
+  ``UFFD_FEATURE_MINOR_ASYNC`` enables asynchronous auto-resolution for
+  anonymous minor faults (requires ``UFFD_FEATURE_MINOR_ANON``).
 
 - ``UFFD_FEATURE_MOVE`` indicates that the kernel supports moving an
   existing page contents from userspace.
@@ -297,6 +301,141 @@ transparent to the guest, we want that same address range to act as if it was
 still poisoned, even though it's on a new physical host which ostensibly
 doesn't have a memory error in the exact same spot.
 
+Anonymous Minor Faults
+----------------------
+
+``UFFD_FEATURE_MINOR_ANON`` enables ``UFFDIO_REGISTER_MODE_MINOR`` on
+anonymous private memory. Unlike shmem/hugetlbfs minor faults (where a page
+exists in the page cache but has no PTE), anonymous minor faults use the
+PROT_NONE hinting mechanism: pages remain resident in memory with their PFNs
+preserved in the PTEs, but access permissions are removed so the next access
+triggers a fault.
+
+This is designed for VM memory managers that need to track the working set of
+anonymous guest memory for cold page eviction to tiered or remote storage.
+
+**Setup:**
+
+1. Open a userfaultfd and enable ``UFFD_FEATURE_MINOR_ANON`` (and optionally
+   ``UFFD_FEATURE_MINOR_ASYNC``) via ``UFFDIO_API``.
+
+2. Register the guest memory range with ``UFFDIO_REGISTER_MODE_MINOR``
+   (and ``UFFDIO_REGISTER_MODE_MISSING`` if evicted pages will need to be
+   fetched back from storage).
+
+**Deactivation:**
+
+Use ``UFFDIO_DEACTIVATE`` to mark pages as inaccessible. This ioctl takes a
+``struct uffdio_range`` and sets PROT_NONE on all present PTEs in the range,
+using the same mechanism as NUMA balancing. Pages stay resident and their
+physical frames are preserved — only access permissions are removed.
+
+**Fault Handling:**
+
+When a deactivated page is accessed:
+
+- **Sync mode** (default): The faulting thread blocks and a
+  ``UFFD_PAGEFAULT_FLAG_MINOR`` message is delivered to the userfaultfd
+  handler. The handler resolves the fault with ``UFFDIO_CONTINUE``, which
+  restores the PTE permissions and wakes the faulting thread.
+
+- **Async mode** (``UFFD_FEATURE_MINOR_ASYNC``): The kernel automatically
+  restores PTE permissions and the thread continues without blocking. No
+  message is delivered to the handler.
+
+**Cold Page Detection with PAGEMAP_SCAN:**
+
+After deactivating a range and letting the application run, use the
+``PAGEMAP_SCAN`` ioctl on ``/proc/pid/pagemap`` with the
+``PAGE_IS_UFFD_DEACTIVATED`` category flag to efficiently find pages that were
+never re-accessed (cold pages)::
+
+    struct pm_scan_arg arg = {
+        .size = sizeof(arg),
+        .start = guest_mem_start,
+        .end = guest_mem_end,
+        .vec = (uint64_t)regions,
+        .vec_len = regions_len,
+        .category_mask = PAGE_IS_UFFD_DEACTIVATED,
+        .return_mask = PAGE_IS_UFFD_DEACTIVATED,
+    };
+    long n = ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
+
+The returned ``page_region`` array contains contiguous cold ranges that can
+then be evicted.
+
+**Cleanup:**
+
+When the userfaultfd is closed or the range is unregistered, all protnone
+PTEs are automatically restored to their normal VMA permissions. This
+prevents pages from becoming permanently inaccessible.
+
+**Interaction with NUMA Balancing:**
+
+NUMA balancing is automatically disabled on anonymous VMAs registered with
+``UFFDIO_REGISTER_MODE_MINOR``, since both mechanisms use PROT_NONE PTEs
+as access hints and would interfere with each other. Shmem VMAs are not
+affected since ``UFFDIO_DEACTIVATE`` zaps PTEs there instead of using
+PROT_NONE.
+
+**VMM Working Set Tracking Workflow:**
+
+A typical VMM lifecycle for cold page eviction to tiered storage::
+
+    /* One-time setup */
+    uffd = userfaultfd(O_CLOEXEC | O_NONBLOCK);
+    ioctl(uffd, UFFDIO_API, &(struct uffdio_api){
+        .api = UFFD_API,
+        .features = UFFD_FEATURE_MINOR_ANON |
+                    UFFD_FEATURE_MINOR_ASYNC,
+    });
+    ioctl(uffd, UFFDIO_REGISTER, &(struct uffdio_register){
+        .range = { guest_mem, guest_size },
+        .mode = UFFDIO_REGISTER_MODE_MINOR |
+                UFFDIO_REGISTER_MODE_MISSING,
+    });
+
+    /* Tracking loop */
+    while (vm_running) {
+        /* 1. Detection phase (async — no vCPU stalls) */
+        ioctl(uffd, UFFDIO_DEACTIVATE, &full_range);
+        sleep(tracking_interval);
+
+        /* 2. Find cold pages */
+        ioctl(pagemap_fd, PAGEMAP_SCAN, &(struct pm_scan_arg){
+            .category_mask = PAGE_IS_UFFD_DEACTIVATED,
+            ...
+        });
+
+        /* 3. Switch to sync for safe eviction */
+        ioctl(uffd, UFFDIO_SET_MODE,
+              &(struct uffdio_set_mode){
+                  .disable = UFFD_FEATURE_MINOR_ASYNC });
+
+        /* 4. Evict cold pages (vCPU faults block in handler) */
+        for each cold range:
+            pwrite(storage_fd, cold_addr, len, offset);
+            madvise(cold_addr, len, MADV_DONTNEED);
+
+        /* 5. Resume async tracking */
+        ioctl(uffd, UFFDIO_SET_MODE,
+              &(struct uffdio_set_mode){
+                  .enable = UFFD_FEATURE_MINOR_ASYNC });
+    }
+
+During step 4, if a vCPU accesses a cold page being evicted, it blocks
+with a ``UFFD_PAGEFAULT_FLAG_MINOR`` fault. The handler can either let it
+wait (the eviction completes, ``MADV_DONTNEED`` fires, the fault retries as
+``MISSING`` and is resolved with ``UFFDIO_COPY`` from storage) or resolve
+it immediately with ``UFFDIO_CONTINUE``.
+
+The same workflow applies to shmem-backed guest memory
+(``UFFD_FEATURE_MINOR_SHMEM``). The only difference is the
+``PAGEMAP_SCAN`` mask for cold page detection: use
+``!PAGE_IS_PRESENT`` instead of ``PAGE_IS_UFFD_DEACTIVATED``, since
+``UFFDIO_DEACTIVATE`` zaps PTEs on shmem (pages stay in page cache)
+rather than setting PROT_NONE.
+
 QEMU/KVM
 ========
 
-- 
2.51.2


^ permalink raw reply related

* [RFC, PATCH 11/12] selftests/mm: add userfaultfd anonymous minor fault tests
From: Kiryl Shutsemau (Meta) @ 2026-04-14 14:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Xu, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
	Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
	linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm,
	Kiryl Shutsemau (Meta)
In-Reply-To: <20260414142354.1465950-1-kas@kernel.org>

Add tests for UFFD_FEATURE_MINOR_ANON, UFFD_FEATURE_MINOR_ASYNC,
UFFDIO_DEACTIVATE, UFFDIO_SET_MODE, and PAGE_IS_UFFD_DEACTIVATED:

- minor-anon-async: populate pages, register MODE_MINOR with
  MINOR_ASYNC, deactivate via UFFDIO_DEACTIVATE, re-access and verify
  content is preserved with no faults delivered to the handler.

- minor-anon-sync: same setup but without MINOR_ASYNC. Verify that
  each deactivated page access delivers a MINOR fault to the handler,
  and UFFDIO_CONTINUE resolves it. Exercises both PTE and THP paths.

- minor-anon-pagemap: deactivate a range, touch first half, use
  PAGEMAP_SCAN with PAGE_IS_UFFD_DEACTIVATED to verify the untouched
  second half is reported as cold.

- minor-anon-gup: write() from a deactivated page into a pipe to
  exercise GUP resolution through protnone PTEs via async auto-restore.

- minor-anon-async-toggle: full detection-to-eviction cycle using
  UFFDIO_SET_MODE. Start async (detection), flip to sync (eviction
  of cold pages), flip back to async.

- minor-anon-close: deactivate pages, close the uffd fd, verify all
  pages are accessible again (protnone PTEs restored on cleanup).

Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
---
 tools/testing/selftests/mm/uffd-unit-tests.c | 458 +++++++++++++++++++
 1 file changed, 458 insertions(+)

diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/selftests/mm/uffd-unit-tests.c
index 6f5e404a446c..8bd5a642bd5a 100644
--- a/tools/testing/selftests/mm/uffd-unit-tests.c
+++ b/tools/testing/selftests/mm/uffd-unit-tests.c
@@ -7,6 +7,7 @@
 
 #include "uffd-common.h"
 
+#include <linux/fs.h>
 #include "../../../../mm/gup_test.h"
 
 #ifdef __NR_userfaultfd
@@ -623,6 +624,423 @@ void uffd_minor_collapse_test(uffd_global_test_opts_t *gopts, uffd_test_args_t *
 	uffd_minor_test_common(gopts, true, false);
 }
 
+static void deactivate_range(int uffd, __u64 start, __u64 len)
+{
+	struct uffdio_range range = { .start = start, .len = len };
+
+	if (ioctl(uffd, UFFDIO_DEACTIVATE, &range))
+		err("UFFDIO_DEACTIVATE failed");
+}
+
+static void set_async_mode(int uffd, bool enable)
+{
+	struct uffdio_set_mode mode = { };
+
+	if (enable)
+		mode.enable = UFFD_FEATURE_MINOR_ASYNC;
+	else
+		mode.disable = UFFD_FEATURE_MINOR_ASYNC;
+
+	if (ioctl(uffd, UFFDIO_SET_MODE, &mode))
+		err("UFFDIO_SET_MODE failed");
+}
+
+/*
+ * Test async minor faults on anonymous memory.
+ * Populate pages, register MODE_MINOR with MINOR_ASYNC,
+ * deactivate, re-access, verify content preserved and no faults delivered.
+ */
+static void uffd_minor_anon_async_test(uffd_global_test_opts_t *gopts,
+				       uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	unsigned long p;
+
+	/* Populate all pages with known content */
+	for (p = 0; p < nr_pages; p++)
+		memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+	/* Register MODE_MINOR (uffd was opened with MINOR_ANON | MINOR_ASYNC) */
+	if (uffd_register(gopts->uffd, gopts->area_dst,
+			  nr_pages * page_size,
+			  false, false, true))
+		err("register failure");
+
+	/* Deactivate all pages — sets protnone */
+	deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			 nr_pages * page_size);
+
+	/* Access all pages — should auto-resolve, no faults */
+	for (p = 0; p < nr_pages; p++) {
+		unsigned char *page = (unsigned char *)gopts->area_dst +
+				      p * page_size;
+		unsigned char expected = p % 255 + 1;
+
+		if (page[0] != expected) {
+			uffd_test_fail("page %lu content mismatch: %u != %u",
+				       p, page[0], expected);
+			return;
+		}
+	}
+
+	uffd_test_pass();
+}
+
+/*
+ * Custom fault handler for anon minor — just UFFDIO_CONTINUE, no content
+ * modification (the page is protnone so we can't access it from here).
+ */
+static void uffd_handle_minor_anon(uffd_global_test_opts_t *gopts,
+				   struct uffd_msg *msg,
+				   struct uffd_args *uargs)
+{
+	struct uffdio_continue req;
+
+	if (!(msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR))
+		err("expected minor fault, got 0x%llx",
+		    msg->arg.pagefault.flags);
+
+	req.range.start = msg->arg.pagefault.address;
+	req.range.len = gopts->page_size;
+	req.mode = 0;
+	if (ioctl(gopts->uffd, UFFDIO_CONTINUE, &req)) {
+		/*
+		 * THP races with khugepaged collapse/split:
+		 * EAGAIN: PMD changed under us
+		 * EEXIST: THP present but already resolved
+		 * In both cases the page is accessible — the faulting
+		 * thread retries and succeeds.
+		 */
+		if (errno != EEXIST && errno != EAGAIN)
+			err("UFFDIO_CONTINUE failed");
+	}
+
+	uargs->minor_faults++;
+}
+
+/*
+ * Test sync minor faults on anonymous memory.
+ * Populate pages, register MODE_MINOR (sync), deactivate,
+ * access from worker thread, verify fault delivered, UFFDIO_CONTINUE resolves.
+ */
+static void uffd_minor_anon_sync_test(uffd_global_test_opts_t *gopts,
+				      uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	pthread_t uffd_mon;
+	struct uffd_args uargs = { };
+	char c = '\0';
+	unsigned long p;
+
+	uargs.gopts = gopts;
+	uargs.handle_fault = uffd_handle_minor_anon;
+
+	/* Populate all pages */
+	for (p = 0; p < nr_pages; p++)
+		memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+	/* Register MODE_MINOR (uffd opened with MINOR_ANON, no MINOR_ASYNC) */
+	if (uffd_register(gopts->uffd, gopts->area_dst,
+			  nr_pages * page_size,
+			  false, false, true))
+		err("register failure");
+
+	/* Deactivate all pages */
+	deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			 nr_pages * page_size);
+
+	/* Start fault handler thread */
+	if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &uargs))
+		err("uffd_poll_thread create");
+
+	/* Access all pages — triggers sync minor faults, handler does CONTINUE */
+	for (p = 0; p < nr_pages; p++) {
+		unsigned char *page = (unsigned char *)gopts->area_dst +
+				      p * page_size;
+
+		if (page[0] != (p % 255 + 1)) {
+			uffd_test_fail("page %lu content mismatch", p);
+			goto out;
+		}
+	}
+
+	if (uargs.minor_faults == 0) {
+		uffd_test_fail("expected minor faults, got 0");
+		goto out;
+	}
+
+	uffd_test_pass();
+out:
+	if (write(gopts->pipefd[1], &c, sizeof(c)) != sizeof(c))
+		err("pipe write");
+	if (pthread_join(uffd_mon, NULL))
+		err("join() failed");
+}
+
+/*
+ * Test PAGEMAP_SCAN detection of deactivated (cold) pages.
+ */
+static void uffd_minor_anon_pagemap_test(uffd_global_test_opts_t *gopts,
+					  uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	unsigned long p;
+	struct page_region regions[16];
+	struct pm_scan_arg pm_arg;
+	int pagemap_fd;
+	long ret;
+
+	/* Need at least 4 pages */
+	if (nr_pages < 4) {
+		uffd_test_skip("need at least 4 pages");
+		return;
+	}
+
+	/* Populate all pages */
+	for (p = 0; p < nr_pages; p++)
+		memset(gopts->area_dst + p * page_size, 0xab, page_size);
+
+	/* Register and deactivate */
+	if (uffd_register(gopts->uffd, gopts->area_dst,
+			  nr_pages * page_size,
+			  false, false, true))
+		err("register failure");
+
+	deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			 nr_pages * page_size);
+
+	/* Touch first half of pages to re-activate them (async auto-resolve) */
+	for (p = 0; p < nr_pages / 2; p++) {
+		volatile char *page = gopts->area_dst + p * page_size;
+		(void)*page;
+	}
+
+	/* Scan for cold (still deactivated) pages */
+	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+	if (pagemap_fd < 0)
+		err("open pagemap");
+
+	memset(&pm_arg, 0, sizeof(pm_arg));
+	pm_arg.size = sizeof(pm_arg);
+	pm_arg.start = (uint64_t)gopts->area_dst;
+	pm_arg.end = (uint64_t)gopts->area_dst + nr_pages * page_size;
+	pm_arg.vec = (uint64_t)regions;
+	pm_arg.vec_len = 16;
+	pm_arg.category_mask = PAGE_IS_UFFD_DEACTIVATED;
+	pm_arg.return_mask = PAGE_IS_UFFD_DEACTIVATED;
+
+	ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &pm_arg);
+	close(pagemap_fd);
+
+	if (ret < 0) {
+		uffd_test_fail("PAGEMAP_SCAN failed: %s", strerror(errno));
+		return;
+	}
+
+	/*
+	 * The second half of pages should be reported as deactivated.
+	 * They may be coalesced into one region.
+	 */
+	if (ret < 1) {
+		uffd_test_fail("expected cold pages, got %ld regions", ret);
+		return;
+	}
+
+	/* Verify the cold region covers the second half */
+	uint64_t cold_start = regions[0].start;
+	uint64_t expected_start = (uint64_t)gopts->area_dst +
+				  (nr_pages / 2) * page_size;
+
+	if (cold_start != expected_start) {
+		uffd_test_fail("cold region starts at 0x%lx, expected 0x%lx",
+			       (unsigned long)cold_start,
+			       (unsigned long)expected_start);
+		return;
+	}
+
+	uffd_test_pass();
+}
+
+/*
+ * Test that GUP resolves through protnone PTEs (async mode).
+ * Deactivate pages, then use a pipe to exercise GUP on the deactivated
+ * memory. write() from deactivated pages triggers GUP which must fault
+ * through the protnone PTE.
+ */
+static void uffd_minor_anon_gup_test(uffd_global_test_opts_t *gopts,
+				     uffd_test_args_t *args)
+{
+	unsigned long page_size = gopts->page_size;
+	char *buf;
+	int pipefd[2];
+
+	buf = malloc(page_size);
+	if (!buf)
+		err("malloc");
+
+	/* Populate first page with known content */
+	memset(gopts->area_dst, 0xCD, page_size);
+
+	if (uffd_register(gopts->uffd, gopts->area_dst, page_size,
+			  false, false, true))
+		err("register failure");
+
+	deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst, page_size);
+
+	if (pipe(pipefd))
+		err("pipe");
+
+	/*
+	 * write() from the deactivated page into the pipe.
+	 * This triggers GUP on the protnone PTE. In async mode the
+	 * kernel auto-restores permissions and GUP succeeds.
+	 */
+	if (write(pipefd[1], gopts->area_dst, page_size) != page_size) {
+		uffd_test_fail("write from deactivated page failed: %s",
+			       strerror(errno));
+		goto out;
+	}
+
+	if (read(pipefd[0], buf, page_size) != page_size) {
+		uffd_test_fail("read from pipe failed");
+		goto out;
+	}
+
+	if (memcmp(buf, "\xCD", 1) != 0) {
+		uffd_test_fail("content mismatch: got 0x%02x, expected 0xCD",
+			       (unsigned char)buf[0]);
+		goto out;
+	}
+
+	uffd_test_pass();
+out:
+	close(pipefd[0]);
+	close(pipefd[1]);
+	free(buf);
+}
+
+/*
+ * Test runtime toggle between async and sync modes.
+ * Start in async mode (detection), flip to sync (eviction), verify faults
+ * block, resolve them, flip back to async.
+ */
+static void uffd_minor_anon_async_toggle_test(uffd_global_test_opts_t *gopts,
+					      uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	struct uffd_args uargs = { };
+	pthread_t uffd_mon;
+	char c = '\0';
+	unsigned long p;
+
+	uargs.gopts = gopts;
+	uargs.handle_fault = uffd_handle_minor_anon;
+
+	/* Populate */
+	for (p = 0; p < nr_pages; p++)
+		memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+	if (uffd_register(gopts->uffd, gopts->area_dst,
+			  nr_pages * page_size,
+			  false, false, true))
+		err("register failure");
+
+	/* Phase 1: async detection — deactivate, access first half */
+	deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			 nr_pages * page_size);
+
+	for (p = 0; p < nr_pages / 2; p++) {
+		volatile char *page = gopts->area_dst + p * page_size;
+		(void)*page;  /* auto-resolves in async mode */
+	}
+
+	/* Phase 2: flip to sync for eviction */
+	set_async_mode(gopts->uffd, false);
+
+	/* Start handler — will receive faults for cold pages */
+	if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &uargs))
+		err("uffd_poll_thread create");
+
+	/* Access second half (cold pages) — should trigger sync faults */
+	for (p = nr_pages / 2; p < nr_pages; p++) {
+		unsigned char *page = (unsigned char *)gopts->area_dst +
+				      p * page_size;
+		if (page[0] != (p % 255 + 1)) {
+			uffd_test_fail("page %lu content mismatch", p);
+			goto out;
+		}
+	}
+
+	if (uargs.minor_faults == 0) {
+		uffd_test_fail("expected sync faults, got 0");
+		goto out;
+	}
+
+	/* Phase 3: flip back to async */
+	set_async_mode(gopts->uffd, true);
+
+	/* Deactivate and access again — should auto-resolve */
+	deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			 nr_pages * page_size);
+
+	for (p = 0; p < nr_pages; p++) {
+		volatile char *page = gopts->area_dst + p * page_size;
+		(void)*page;
+	}
+
+	uffd_test_pass();
+out:
+	if (write(gopts->pipefd[1], &c, sizeof(c)) != sizeof(c))
+		err("pipe write");
+	if (pthread_join(uffd_mon, NULL))
+		err("join() failed");
+}
+
+/*
+ * Test that deactivated pages become accessible after closing uffd.
+ */
+static void uffd_minor_anon_close_test(uffd_global_test_opts_t *gopts,
+				       uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	unsigned long p;
+
+	/* Populate */
+	for (p = 0; p < nr_pages; p++)
+		memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+	if (uffd_register(gopts->uffd, gopts->area_dst,
+			  nr_pages * page_size,
+			  false, false, true))
+		err("register failure");
+
+	deactivate_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			 nr_pages * page_size);
+
+	/* Close uffd — should restore protnone PTEs */
+	close(gopts->uffd);
+	gopts->uffd = -1;
+
+	/* All pages should be accessible with original content */
+	for (p = 0; p < nr_pages; p++) {
+		unsigned char *page = (unsigned char *)gopts->area_dst +
+				      p * page_size;
+		unsigned char expected = p % 255 + 1;
+
+		if (page[0] != expected) {
+			uffd_test_fail("page %lu not accessible after close", p);
+			return;
+		}
+	}
+
+	uffd_test_pass();
+}
+
 static sigjmp_buf jbuf, *sigbuf;
 
 static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
@@ -1625,6 +2043,46 @@ uffd_test_case_t uffd_tests[] = {
 		/* We can't test MADV_COLLAPSE, so try our luck */
 		.uffd_feature_required = UFFD_FEATURE_MINOR_SHMEM,
 	},
+	{
+		.name = "minor-anon-async",
+		.uffd_fn = uffd_minor_anon_async_test,
+		.mem_targets = MEM_ANON,
+		.uffd_feature_required =
+		UFFD_FEATURE_MINOR_ANON | UFFD_FEATURE_MINOR_ASYNC,
+	},
+	{
+		.name = "minor-anon-sync",
+		.uffd_fn = uffd_minor_anon_sync_test,
+		.mem_targets = MEM_ANON,
+		.uffd_feature_required = UFFD_FEATURE_MINOR_ANON,
+	},
+	{
+		.name = "minor-anon-pagemap",
+		.uffd_fn = uffd_minor_anon_pagemap_test,
+		.mem_targets = MEM_ANON,
+		.uffd_feature_required =
+		UFFD_FEATURE_MINOR_ANON | UFFD_FEATURE_MINOR_ASYNC,
+	},
+	{
+		.name = "minor-anon-gup",
+		.uffd_fn = uffd_minor_anon_gup_test,
+		.mem_targets = MEM_ANON,
+		.uffd_feature_required =
+		UFFD_FEATURE_MINOR_ANON | UFFD_FEATURE_MINOR_ASYNC,
+	},
+	{
+		.name = "minor-anon-async-toggle",
+		.uffd_fn = uffd_minor_anon_async_toggle_test,
+		.mem_targets = MEM_ANON,
+		.uffd_feature_required =
+		UFFD_FEATURE_MINOR_ANON | UFFD_FEATURE_MINOR_ASYNC,
+	},
+	{
+		.name = "minor-anon-close",
+		.uffd_fn = uffd_minor_anon_close_test,
+		.mem_targets = MEM_ANON,
+		.uffd_feature_required = UFFD_FEATURE_MINOR_ANON,
+	},
 	{
 		.name = "sigbus",
 		.uffd_fn = uffd_sigbus_test,
-- 
2.51.2


^ permalink raw reply related

* [RFC, PATCH 10/12] userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle
From: Kiryl Shutsemau (Meta) @ 2026-04-14 14:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Xu, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
	Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
	linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm,
	Kiryl Shutsemau (Meta)
In-Reply-To: <20260414142354.1465950-1-kas@kernel.org>

Add UFFDIO_SET_MODE ioctl to toggle UFFD_FEATURE_MINOR_ASYNC at
runtime. Takes mmap_write_lock for serialization against all in-flight
faults. On sync-to-async transition, wake threads blocked in
handle_userfault() so they retry and auto-resolve.

Since ctx->features can now be modified concurrently, add
userfaultfd_features() helper that wraps READ_ONCE() and convert
all ctx->features reads to use it.

Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
---
 fs/userfaultfd.c                 | 95 ++++++++++++++++++++++++++++----
 include/uapi/linux/userfaultfd.h | 13 +++++
 2 files changed, 96 insertions(+), 12 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 43064238fd8d..0edb33599491 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -79,24 +79,33 @@ struct userfaultfd_wake_range {
 /* internal indication that UFFD_API ioctl was successfully executed */
 #define UFFD_FEATURE_INITIALIZED		(1u << 31)
 
+/*
+ * Read ctx->features with READ_ONCE() since UFFDIO_SET_MODE can
+ * modify it concurrently.
+ */
+static unsigned int userfaultfd_features(struct userfaultfd_ctx *ctx)
+{
+	return READ_ONCE(ctx->features);
+}
+
 static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx)
 {
-	return ctx->features & UFFD_FEATURE_INITIALIZED;
+	return userfaultfd_features(ctx) & UFFD_FEATURE_INITIALIZED;
 }
 
 static bool userfaultfd_wp_async_ctx(struct userfaultfd_ctx *ctx)
 {
-	return ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC);
+	return ctx && (userfaultfd_features(ctx) & UFFD_FEATURE_WP_ASYNC);
 }
 
 static bool userfaultfd_minor_anon_ctx(struct userfaultfd_ctx *ctx)
 {
-	return ctx && (ctx->features & UFFD_FEATURE_MINOR_ANON);
+	return ctx && (userfaultfd_features(ctx) & UFFD_FEATURE_MINOR_ANON);
 }
 
 static bool userfaultfd_minor_async_ctx(struct userfaultfd_ctx *ctx)
 {
-	return ctx && (ctx->features & UFFD_FEATURE_MINOR_ASYNC);
+	return ctx && (userfaultfd_features(ctx) & UFFD_FEATURE_MINOR_ASYNC);
 }
 
 static unsigned int userfaultfd_ctx_flags(struct userfaultfd_ctx *ctx)
@@ -122,7 +131,7 @@ bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma)
 	if (!ctx)
 		return false;
 
-	return ctx->features & UFFD_FEATURE_WP_UNPOPULATED;
+	return userfaultfd_features(ctx) & UFFD_FEATURE_WP_UNPOPULATED;
 }
 
 static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode,
@@ -435,7 +444,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	/* 0 or > 1 flags set is a bug; we expect exactly 1. */
 	VM_WARN_ON_ONCE(!reason || (reason & (reason - 1)));
 
-	if (ctx->features & UFFD_FEATURE_SIGBUS)
+	if (userfaultfd_features(ctx) & UFFD_FEATURE_SIGBUS)
 		goto out;
 	if (!(vmf->flags & FAULT_FLAG_USER) && (ctx->flags & UFFD_USER_MODE_ONLY))
 		goto out;
@@ -506,7 +515,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
 	uwq.wq.private = current;
 	uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
-				reason, ctx->features);
+				reason, userfaultfd_features(ctx));
 	uwq.ctx = ctx;
 	uwq.waken = false;
 
@@ -668,7 +677,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 	if (!octx)
 		return 0;
 
-	if (!(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+	if (!(userfaultfd_features(octx) & UFFD_FEATURE_EVENT_FORK)) {
 		userfaultfd_reset_ctx(vma);
 		return 0;
 	}
@@ -774,7 +783,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
 	if (!ctx)
 		return;
 
-	if (ctx->features & UFFD_FEATURE_EVENT_REMAP) {
+	if (userfaultfd_features(ctx) & UFFD_FEATURE_EVENT_REMAP) {
 		vm_ctx->ctx = ctx;
 		userfaultfd_ctx_get(ctx);
 		down_write(&ctx->map_changing_lock);
@@ -824,7 +833,7 @@ bool userfaultfd_remove(struct vm_area_struct *vma,
 	struct userfaultfd_wait_queue ewq;
 
 	ctx = vma->vm_userfaultfd_ctx.ctx;
-	if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_REMOVE))
+	if (!ctx || !(userfaultfd_features(ctx) & UFFD_FEATURE_EVENT_REMOVE))
 		return true;
 
 	userfaultfd_ctx_get(ctx);
@@ -863,7 +872,7 @@ int userfaultfd_unmap_prep(struct vm_area_struct *vma, unsigned long start,
 	struct userfaultfd_unmap_ctx *unmap_ctx;
 	struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
 
-	if (!ctx || !(ctx->features & UFFD_FEATURE_EVENT_UNMAP) ||
+	if (!ctx || !(userfaultfd_features(ctx) & UFFD_FEATURE_EVENT_UNMAP) ||
 	    has_unmap_ctx(ctx, unmaps, start, end))
 		return 0;
 
@@ -1826,6 +1835,65 @@ static int userfaultfd_deactivate(struct userfaultfd_ctx *ctx,
 	return ret;
 }
 
+/*
+ * Features that can be toggled at runtime via UFFDIO_SET_MODE.
+ * Only async features that were enabled at UFFDIO_API time may be toggled.
+ */
+#define UFFD_FEATURE_TOGGLEABLE	(UFFD_FEATURE_MINOR_ASYNC)
+
+static int userfaultfd_set_mode(struct userfaultfd_ctx *ctx,
+				  unsigned long arg)
+{
+	struct uffdio_set_mode mode;
+	struct mm_struct *mm = ctx->mm;
+
+	if (copy_from_user(&mode, (void __user *)arg, sizeof(mode)))
+		return -EFAULT;
+
+	/* enable and disable must not overlap */
+	if (mode.enable & mode.disable)
+		return -EINVAL;
+
+	/* only toggleable features are allowed */
+	if ((mode.enable | mode.disable) & ~UFFD_FEATURE_TOGGLEABLE)
+		return -EINVAL;
+
+	if (!mmget_not_zero(mm))
+		return -ESRCH;
+
+	/*
+	 * mmap_write_lock serializes against all page faults.
+	 * After we release, no in-flight faults from the old mode exist.
+	 */
+	{
+		unsigned int new_features;
+
+		mmap_write_lock(mm);
+		new_features = userfaultfd_features(ctx);
+		new_features |= mode.enable;
+		new_features &= ~mode.disable;
+		WRITE_ONCE(ctx->features, new_features);
+		mmap_write_unlock(mm);
+	}
+
+	/*
+	 * If switching to async, wake threads blocked in handle_userfault().
+	 * They will retry the fault and auto-resolve under the new mode.
+	 * len=0 means wake all pending faults on this context.
+	 */
+	if (mode.enable & UFFD_FEATURE_MINOR_ASYNC) {
+		struct userfaultfd_wake_range range = { .len = 0 };
+
+		spin_lock_irq(&ctx->fault_pending_wqh.lock);
+		__wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL,
+				     &range);
+		__wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, &range);
+		spin_unlock_irq(&ctx->fault_pending_wqh.lock);
+	}
+
+	mmput(mm);
+	return 0;
+}
 
 static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 {
@@ -2150,6 +2218,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_DEACTIVATE:
 		ret = userfaultfd_deactivate(ctx, arg);
 		break;
+	case UFFDIO_SET_MODE:
+		ret = userfaultfd_set_mode(ctx, arg);
+		break;
 	}
 	return ret;
 }
@@ -2177,7 +2248,7 @@ static void userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
 	 *	protocols: aa:... bb:...
 	 */
 	seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n",
-		   pending, total, UFFD_API, ctx->features,
+		   pending, total, UFFD_API, userfaultfd_features(ctx),
 		   UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS);
 }
 #endif
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 775825da2596..f0f14f9db06c 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -84,6 +84,7 @@
 #define _UFFDIO_CONTINUE		(0x07)
 #define _UFFDIO_POISON			(0x08)
 #define _UFFDIO_DEACTIVATE		(0x09)
+#define _UFFDIO_SET_MODE		(0x0A)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -110,6 +111,8 @@
 				      struct uffdio_poison)
 #define UFFDIO_DEACTIVATE	_IOR(UFFDIO, _UFFDIO_DEACTIVATE,	\
 				     struct uffdio_range)
+#define UFFDIO_SET_MODE		_IOW(UFFDIO, _UFFDIO_SET_MODE,	\
+				     struct uffdio_set_mode)
 
 /* read() structure */
 struct uffd_msg {
@@ -395,6 +398,16 @@ struct uffdio_move {
 	__s64 move;
 };
 
+struct uffdio_set_mode {
+	/*
+	 * Toggle async mode for features at runtime.
+	 * Supported: UFFD_FEATURE_MINOR_ASYNC.
+	 * Setting a bit in both enable and disable is invalid.
+	 */
+	__u64 enable;
+	__u64 disable;
+};
+
 /*
  * Flags for the userfaultfd(2) system call itself.
  */
-- 
2.51.2


^ permalink raw reply related

* [RFC, PATCH 09/12] mm/pagemap: add PAGE_IS_UFFD_DEACTIVATED to PAGEMAP_SCAN
From: Kiryl Shutsemau (Meta) @ 2026-04-14 14:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Xu, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
	Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
	linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm,
	Kiryl Shutsemau (Meta)
In-Reply-To: <20260414142354.1465950-1-kas@kernel.org>

Report deactivated anonymous pages in PAGEMAP_SCAN results.
Only set on anonymous VMAs (shmem cold = !PAGE_IS_PRESENT).
Both PTE and PMD (THP) levels handled.

Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
---
 fs/proc/task_mmu.c      | 11 ++++++++++-
 include/uapi/linux/fs.h |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e091931d7ca1..fc42cfd5720a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2329,7 +2329,7 @@ static int pagemap_release(struct inode *inode, struct file *file)
 				 PAGE_IS_FILE |	PAGE_IS_PRESENT |	\
 				 PAGE_IS_SWAPPED | PAGE_IS_PFNZERO |	\
 				 PAGE_IS_HUGE | PAGE_IS_SOFT_DIRTY |	\
-				 PAGE_IS_GUARD)
+				 PAGE_IS_GUARD | PAGE_IS_UFFD_DEACTIVATED)
 #define PM_SCAN_FLAGS		(PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC)
 
 struct pagemap_scan_private {
@@ -2354,6 +2354,10 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
 
 		categories = PAGE_IS_PRESENT;
 
+		if (pte_protnone(pte) && vma_is_accessible(vma) &&
+		    vma_is_anonymous(vma) && userfaultfd_minor(vma))
+			categories |= PAGE_IS_UFFD_DEACTIVATED;
+
 		if (!pte_uffd_wp(pte))
 			categories |= PAGE_IS_WRITTEN;
 
@@ -2422,6 +2426,11 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
 		struct page *page;
 
 		categories |= PAGE_IS_PRESENT;
+
+		if (pmd_protnone(pmd) && vma_is_accessible(vma) &&
+		    vma_is_anonymous(vma) && userfaultfd_minor(vma))
+			categories |= PAGE_IS_UFFD_DEACTIVATED;
+
 		if (!pmd_uffd_wp(pmd))
 			categories |= PAGE_IS_WRITTEN;
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 70b2b661f42c..af5b28901800 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -455,6 +455,7 @@ typedef int __bitwise __kernel_rwf_t;
 #define PAGE_IS_HUGE		(1 << 6)
 #define PAGE_IS_SOFT_DIRTY	(1 << 7)
 #define PAGE_IS_GUARD		(1 << 8)
+#define PAGE_IS_UFFD_DEACTIVATED (1 << 9)
 
 /*
  * struct page_region - Page region with flags
-- 
2.51.2


^ permalink raw reply related

* [RFC, PATCH 08/12] userfaultfd: enable UFFD_FEATURE_MINOR_ANON
From: Kiryl Shutsemau (Meta) @ 2026-04-14 14:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Xu, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
	Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
	linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm,
	Kiryl Shutsemau (Meta)
In-Reply-To: <20260414142354.1465950-1-kas@kernel.org>

Add UFFD_FEATURE_MINOR_ANON, UFFD_FEATURE_MINOR_ASYNC to
UFFD_API_FEATURES and UFFDIO_DEACTIVATE to UFFD_API_RANGE_IOCTLS.
The feature is now available to userspace.

Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
---
 include/uapi/linux/userfaultfd.h | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 336d07e1b6de..775825da2596 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -42,7 +42,9 @@
 			   UFFD_FEATURE_WP_UNPOPULATED |	\
 			   UFFD_FEATURE_POISON |		\
 			   UFFD_FEATURE_WP_ASYNC |		\
-			   UFFD_FEATURE_MOVE)
+			   UFFD_FEATURE_MOVE |			\
+			   UFFD_FEATURE_MINOR_ANON |		\
+			   UFFD_FEATURE_MINOR_ASYNC)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -54,13 +56,15 @@
 	 (__u64)1 << _UFFDIO_MOVE |		\
 	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
 	 (__u64)1 << _UFFDIO_CONTINUE |		\
-	 (__u64)1 << _UFFDIO_POISON)
+	 (__u64)1 << _UFFDIO_POISON |		\
+	 (__u64)1 << _UFFDIO_DEACTIVATE)
 #define UFFD_API_RANGE_IOCTLS_BASIC		\
 	((__u64)1 << _UFFDIO_WAKE |		\
 	 (__u64)1 << _UFFDIO_COPY |		\
 	 (__u64)1 << _UFFDIO_WRITEPROTECT |	\
 	 (__u64)1 << _UFFDIO_CONTINUE |		\
-	 (__u64)1 << _UFFDIO_POISON)
+	 (__u64)1 << _UFFDIO_POISON |		\
+	 (__u64)1 << _UFFDIO_DEACTIVATE)
 
 /*
  * Valid ioctl command number range with this API is from 0x00 to
-- 
2.51.2


^ permalink raw reply related

* [RFC, PATCH 07/12] sched/numa: skip scanning anonymous VM_UFFD_MINOR VMAs
From: Kiryl Shutsemau (Meta) @ 2026-04-14 14:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Xu, David Hildenbrand, Lorenzo Stoakes, Mike Rapoport,
	Suren Baghdasaryan, Vlastimil Babka, Liam R . Howlett, Zi Yan,
	Jonathan Corbet, Shuah Khan, Sean Christopherson, Paolo Bonzini,
	linux-mm, linux-kernel, linux-doc, linux-kselftest, kvm,
	Kiryl Shutsemau (Meta)
In-Reply-To: <20260414142354.1465950-1-kas@kernel.org>

Avoid protnone conflict on anonymous VMAs. Shmem unaffected.
NUMA stats fed from uffd fault path instead.
Add NUMAB_SKIP_UFFD_MINOR trace reason.

Signed-off-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Assisted-by: Claude:claude-opus-4-6
---
 include/linux/sched/numa_balancing.h |  1 +
 include/trace/events/sched.h         |  3 ++-
 kernel/sched/fair.c                  | 13 +++++++++++++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/numa_balancing.h b/include/linux/sched/numa_balancing.h
index 52b22c5c396d..5668074a4271 100644
--- a/include/linux/sched/numa_balancing.h
+++ b/include/linux/sched/numa_balancing.h
@@ -23,6 +23,7 @@ enum numa_vmaskip_reason {
 	NUMAB_SKIP_PID_INACTIVE,
 	NUMAB_SKIP_IGNORE_PID,
 	NUMAB_SKIP_SEQ_COMPLETED,
+	NUMAB_SKIP_UFFD_MINOR,
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7b2645b50e78..02e79b56db28 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -728,7 +728,8 @@ DEFINE_EVENT(sched_numa_pair_template, sched_swap_numa,
 	EM( NUMAB_SKIP_SCAN_DELAY,		"scan_delay" )	\
 	EM( NUMAB_SKIP_PID_INACTIVE,		"pid_inactive" )	\
 	EM( NUMAB_SKIP_IGNORE_PID,		"ignore_pid_inactive" )		\
-	EMe(NUMAB_SKIP_SEQ_COMPLETED,		"seq_completed" )
+	EM( NUMAB_SKIP_SEQ_COMPLETED,		"seq_completed" )	\
+	EMe(NUMAB_SKIP_UFFD_MINOR,		"uffd_minor" )
 
 /* Redefine for export. */
 #undef EM
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4114712be7..57beb04562cf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -25,6 +25,7 @@
 #include <linux/hugetlb_inline.h>
 #include <linux/jiffies.h>
 #include <linux/mm_api.h>
+#include <linux/userfaultfd_k.h>
 #include <linux/highmem.h>
 #include <linux/spinlock_api.h>
 #include <linux/cpumask_api.h>
@@ -3459,6 +3460,18 @@ static void task_numa_work(struct callback_head *work)
 			continue;
 		}
 
+		/*
+		 * Skip anonymous VMAs registered for userfaultfd minor faults.
+		 * Both NUMA balancing and uffd use protnone PTEs on anonymous
+		 * memory — let uffd own the hinting. For shmem, UFFDIO_DEACTIVATE
+		 * zaps PTEs entirely (no protnone conflict), so NUMA scanning
+		 * can proceed normally.
+		 */
+		if (vma_is_anonymous(vma) && userfaultfd_minor(vma)) {
+			trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UFFD_MINOR);
+			continue;
+		}
+
 		/*
 		 * Shared library pages mapped by multiple processes are not
 		 * migrated as it is expected they are cache replicated. Avoid
-- 
2.51.2


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox