Netdev List
 help / color / mirror / Atom feed
* [PATCH bpf v4 5/5] bpf, sockmap: Take state lock for af_unix iter
From: Michal Luczaj @ 2026-04-14 14:13 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-0-2af6fe97918e@rbox.co>

When a BPF iterator program updates a sockmap, there is a race condition in
unix_stream_bpf_update_proto() where the `peer` pointer can become stale[1]
during a state transition TCP_ESTABLISHED -> TCP_CLOSE.

        CPU0 bpf                          CPU1 close
        --------                          ----------
// unix_stream_bpf_update_proto()
sk_pair = unix_peer(sk)
if (unlikely(!sk_pair))
   return -EINVAL;
                                     // unix_release_sock()
                                     skpair = unix_peer(sk);
                                     unix_peer(sk) = NULL;
                                     sock_put(skpair)
sock_hold(sk_pair) // UaF

More practically, this fix guarantees that the iterator program is
consistently provided with a unix socket that remains stable during
iterator execution.

[1]:
BUG: KASAN: slab-use-after-free in unix_stream_bpf_update_proto+0x155/0x490
Write of size 4 at addr ffff8881178c9a00 by task test_progs/2231
Call Trace:
 dump_stack_lvl+0x5d/0x80
 print_report+0x170/0x4f3
 kasan_report+0xe4/0x1c0
 kasan_check_range+0x125/0x200
 unix_stream_bpf_update_proto+0x155/0x490
 sock_map_link+0x71c/0xec0
 sock_map_update_common+0xbc/0x600
 sock_map_update_elem+0x19a/0x1f0
 bpf_prog_bbbf56096cdd4f01_selective_dump_unix+0x20c/0x217
 bpf_iter_run_prog+0x21e/0xae0
 bpf_iter_unix_seq_show+0x1e0/0x2a0
 bpf_seq_read+0x42c/0x10d0
 vfs_read+0x171/0xb20
 ksys_read+0xff/0x200
 do_syscall_64+0xf7/0x5e0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Allocated by task 2236:
 kasan_save_stack+0x30/0x50
 kasan_save_track+0x14/0x30
 __kasan_slab_alloc+0x63/0x80
 kmem_cache_alloc_noprof+0x1d5/0x680
 sk_prot_alloc+0x59/0x210
 sk_alloc+0x34/0x470
 unix_create1+0x86/0x8a0
 unix_stream_connect+0x318/0x15b0
 __sys_connect+0xfd/0x130
 __x64_sys_connect+0x72/0xd0
 do_syscall_64+0xf7/0x5e0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Freed by task 2236:
 kasan_save_stack+0x30/0x50
 kasan_save_track+0x14/0x30
 kasan_save_free_info+0x3b/0x70
 __kasan_slab_free+0x47/0x70
 kmem_cache_free+0x11c/0x590
 __sk_destruct+0x432/0x6e0
 unix_release_sock+0x9b3/0xf60
 unix_release+0x8a/0xf0
 __sock_release+0xb0/0x270
 sock_close+0x18/0x20
 __fput+0x36e/0xac0
 fput_close_sync+0xe5/0x1a0
 __x64_sys_close+0x7d/0xd0
 do_syscall_64+0xf7/0x5e0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/unix/af_unix.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 590a30d3b2f7..15b48cc6e9b0 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -3737,6 +3737,7 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
 		return 0;
 
 	lock_sock(sk);
+	unix_state_lock(sk);
 
 	if (unlikely(sock_flag(sk, SOCK_DEAD))) {
 		ret = SEQ_SKIP;
@@ -3748,6 +3749,7 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
 	prog = bpf_iter_get_info(&meta, false);
 	ret = unix_prog_seq_show(prog, &meta, v, uid);
 unlock:
+	unix_state_unlock(sk);
 	release_sock(sk);
 	return ret;
 }

-- 
2.53.0


^ permalink raw reply related

* [PATCH bpf v4 4/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update
From: Michal Luczaj @ 2026-04-14 14:13 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj,
	钱一铭
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-0-2af6fe97918e@rbox.co>

unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
socket is properly set up, which would include having a defined peer. IOW,
there's a window when unix_stream_bpf_update_proto() can be called on
socket which still has unix_peer(sk) == NULL.

         CPU0 bpf                            CPU1 connect
         --------                            ------------

                                WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
                                sock_hold(newsk)
                                smp_mb__after_atomic()
                                unix_peer(sk) = newsk

BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
  sock_map_link+0x564/0x8b0
  sock_map_update_common+0x6e/0x340
  sock_map_update_elem_sys+0x17d/0x240
  __sys_bpf+0x26db/0x3250
  __x64_sys_bpf+0x21/0x30
  do_syscall_64+0x6b/0x3a0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Initial idea was to move peer assignment _before_ the sk_state update[1],
but that involved an additional memory barrier, and changing the hot path
was rejected.
Then a NULL check during proto update in unix_stream_bpf_update_proto() was
considered[2], but the follow-up discussion[3] focused on the root cause,
i.e. sockmap update taking a wrong lock. Or, more specifically, missing
unix_state_lock()[4].
In the end it was concluded that teaching sockmap about the af_unix locking
would be unnecessarily complex[5].
Complexity aside, since BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT
are allowed to update sockmaps, sock_map_update_elem() taking the unix
lock, as it is currently implemented in unix_state_lock():
spin_lock(&unix_sk(s)->lock), would be problematic. unix_state_lock() taken
in a process context, followed by a softirq-context TC BPF program
attempting to take the same spinlock -- deadlock[6].
This way we circled back to the peer check idea[2].

[1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
[2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
[3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
[4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
[5]: https://lore.kernel.org/bpf/CAAVpQUAHijOMext28Gi10dSLuMzGYh+jK61Ujn+fZ-wvcODR2A@mail.gmail.com/
[6]: https://lore.kernel.org/bpf/dd043c69-4d03-46fe-8325-8f97101435cf@linux.dev/

Summary of scenarios where af_unix/stream connect() may race a sockmap
update:

1. connect() vs. bpf(BPF_MAP_UPDATE_ELEM), i.e. sock_map_update_elem_sys()

   Implemented NULL check is sufficient. Once assigned, socket peer won't
   be released until socket fd is released. And that's not an issue because
   sock_map_update_elem_sys() bumps fd refcnf.

2. connect() vs BPF program doing update

   Update restricted per verifier.c:may_update_sockmap() to

      BPF_PROG_TYPE_TRACING/BPF_TRACE_ITER
      BPF_PROG_TYPE_SOCK_OPS (bpf_sock_map_update() only)
      BPF_PROG_TYPE_SOCKET_FILTER
      BPF_PROG_TYPE_SCHED_CLS
      BPF_PROG_TYPE_SCHED_ACT
      BPF_PROG_TYPE_XDP
      BPF_PROG_TYPE_SK_REUSEPORT
      BPF_PROG_TYPE_FLOW_DISSECTOR
      BPF_PROG_TYPE_SK_LOOKUP

   Plus one more race to consider:

            CPU0 bpf                            CPU1 connect
            --------                            ------------

                                   WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
   sock_map_sk_state_allowed(sk)
                                   sock_hold(newsk)
                                   smp_mb__after_atomic()
                                   unix_peer(sk) = newsk
   sk_pair = unix_peer(sk)
   if (unlikely(!sk_pair))
      return -EINVAL;

                                                 CPU1 close
                                                 ----------

                                   skpair = unix_peer(sk);
                                   unix_peer(sk) = NULL;
                                   sock_put(skpair)
   // use after free?
   sock_hold(sk_pair)

   2.1 BPF program invoking helper function bpf_sock_map_update() ->
       BPF_CALL_4(bpf_sock_map_update(), ...)

       Helper limited to BPF_PROG_TYPE_SOCK_OPS. Nevertheless, a unix sock
       might be accessible via bpf_map_lookup_elem(). Which implies sk
       already having psock, which in turn implies sk already having
       sk_pair. Since sk_psock_destroy() is queued as RCU work, sk_pair
       won't go away while BPF executes the update.

   2.2 BPF program invoking helper function bpf_map_update_elem() ->
       sock_map_update_elem()

       2.2.1 Unix sock accessible to BPF prog only via sockmap lookup in
             BPF_PROG_TYPE_SOCKET_FILTER, BPF_PROG_TYPE_SCHED_CLS,
             BPF_PROG_TYPE_SCHED_ACT, BPF_PROG_TYPE_XDP,
             BPF_PROG_TYPE_SK_REUSEPORT, BPF_PROG_TYPE_FLOW_DISSECTOR,
             BPF_PROG_TYPE_SK_LOOKUP.

             Pretty much the same as case 2.1.

       2.2.2 Unix sock accessible to BPF program directly:
             BPF_PROG_TYPE_TRACING, narrowed down to BPF_TRACE_ITER.

             Sockmap iterator (sock_map_seq_ops) is safe: unix sock
             residing in a sockmap means that the sock already went through
             the proto update step.

             Unix sock iterator (bpf_iter_unix_seq_ops), on the other hand,
             gives access to socks that may still be unconnected. Which
             means iterator prog can race sockmap/proto update against
             connect().

             BUG: KASAN: null-ptr-deref in unix_stream_bpf_update_proto+0x253/0x4d0
             Write of size 4 at addr 0000000000000080 by task test_progs/3140
             Call Trace:
              dump_stack_lvl+0x5d/0x80
              kasan_report+0xe4/0x1c0
              kasan_check_range+0x125/0x200
              unix_stream_bpf_update_proto+0x253/0x4d0
              sock_map_link+0x71c/0xec0
              sock_map_update_common+0xbc/0x600
              sock_map_update_elem+0x19a/0x1f0
              bpf_prog_bbbf56096cdd4f01_selective_dump_unix+0x20c/0x217
              bpf_iter_run_prog+0x21e/0xae0
              bpf_iter_unix_seq_show+0x1e0/0x2a0
              bpf_seq_read+0x42c/0x10d0
              vfs_read+0x171/0xb20
              ksys_read+0xff/0x200
              do_syscall_64+0xf7/0x5e0
              entry_SYSCALL_64_after_hwframe+0x76/0x7e

             While the introduced NULL check prevents null-ptr-deref in the
             BPF program path as well, it is insufficient to guard against
             a poorly timed close() leading to a use-after-free. This will
             be addressed in a subsequent patch.

Reported-by: Michal Luczaj <mhal@rbox.co>
Closes: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
Reported-by: 钱一铭 <yimingqian591@gmail.com>
Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/unix/unix_bpf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index e0d30d6d22ac..57f3124c9d8d 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -185,6 +185,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
 	 */
 	if (!psock->sk_pair) {
 		sk_pair = unix_peer(sk);
+		if (unlikely(!sk_pair))
+			return -EINVAL;
+
 		sock_hold(sk_pair);
 		psock->sk_pair = sk_pair;
 	}

-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v2] wireguard: device: use exit_rtnl callback instead of manual rtnl_lock in pre_exit
From: Jakub Kicinski @ 2026-04-14 15:18 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Shardul Bankar, kuniyu, andrew+netdev, davem, edumazet, pabeni,
	wireguard, netdev, linux-kernel, janak, kalpan.jani, shardulsb08,
	syzbot+f2fbf7478a35a94c8b7c
In-Reply-To: <CAHmME9oXoXykXq_emkA3v8nG2VR28CmRP2+WmhrvGJc0ZbPfpA@mail.gmail.com>

On Tue, 14 Apr 2026 15:28:37 +0200 Jason A. Donenfeld wrote:
> Thanks. Applied to the wireguard tree, and also added the missing
> __net_exit and __read_mostly annotations in the process.

Hi Jason, while we have you - do you have a PR for us for wireguard?
We're going to be sending the net-next PR later today..

^ permalink raw reply

* [PATCH net-next 2/2] net: mana: Use kvmalloc for large RX queue and buffer allocations
From: Aditya Garg @ 2026-04-14 15:13 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
	gargaditya
In-Reply-To: <20260414151456.687506-1-gargaditya@linux.microsoft.com>

The RX path allocations for rxbufs_pre, das_pre, and rxq scale with
queue count and queue depth. With high queue counts and depth, these can
exceed what kmalloc can reliably provide from physically contiguous
memory under fragmentation.

Switch these from kmalloc to kvmalloc variants so the allocator
transparently falls back to vmalloc when contiguous memory is scarce,
and update the corresponding frees to kvfree.

Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49ee77b0939a..585d891bbbac 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -685,11 +685,11 @@ void mana_pre_dealloc_rxbufs(struct mana_port_context *mpc)
 		put_page(virt_to_head_page(mpc->rxbufs_pre[i]));
 	}
 
-	kfree(mpc->das_pre);
+	kvfree(mpc->das_pre);
 	mpc->das_pre = NULL;
 
 out2:
-	kfree(mpc->rxbufs_pre);
+	kvfree(mpc->rxbufs_pre);
 	mpc->rxbufs_pre = NULL;
 
 out1:
@@ -806,11 +806,11 @@ int mana_pre_alloc_rxbufs(struct mana_port_context *mpc, int new_mtu, int num_qu
 	num_rxb = num_queues * mpc->rx_queue_size;
 
 	WARN(mpc->rxbufs_pre, "mana rxbufs_pre exists\n");
-	mpc->rxbufs_pre = kmalloc_array(num_rxb, sizeof(void *), GFP_KERNEL);
+	mpc->rxbufs_pre = kvmalloc_array(num_rxb, sizeof(void *), GFP_KERNEL);
 	if (!mpc->rxbufs_pre)
 		goto error;
 
-	mpc->das_pre = kmalloc_objs(dma_addr_t, num_rxb);
+	mpc->das_pre = kvmalloc_objs(dma_addr_t, num_rxb);
 	if (!mpc->das_pre)
 		goto error;
 
@@ -2527,7 +2527,7 @@ static void mana_destroy_rxq(struct mana_port_context *apc,
 	if (rxq->gdma_rq)
 		mana_gd_destroy_queue(gc, rxq->gdma_rq);
 
-	kfree(rxq);
+	kvfree(rxq);
 }
 
 static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
@@ -2667,7 +2667,7 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 
 	gc = gd->gdma_context;
 
-	rxq = kzalloc_flex(*rxq, rx_oobs, apc->rx_queue_size);
+	rxq = kvzalloc_flex(*rxq, rx_oobs, apc->rx_queue_size);
 	if (!rxq)
 		return NULL;
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH net-next 0/2] net: mana: Avoid queue struct allocation failure under memory fragmentation
From: Aditya Garg @ 2026-04-14 15:13 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
	gargaditya

The MANA driver can fail to load on systems with high memory
utilization because several allocations in the queue setup paths
require large physically contiguous blocks via kmalloc. Under memory
fragmentation these high-order allocations may fail, preventing the
driver from creating queues at probe time or when reconfiguring
channels, ring parameters or MTU at runtime.

Allocation sizes that are problematic:

  mana_create_txq -> tx_qp flat array (sizeof(mana_tx_qp) = 35528):
    16 queues (default): 35528 * 16 =  ~555 KB contiguous
    64 queues (max):     35528 * 64 = ~2220 KB contiguous

  mana_create_rxq -> rxq struct with flex array
  (sizeof(mana_rxq) = 35712, rx_oobs=296 per entry):
    depth 1024 (default): 35712 + 296 * 1024 =  ~331 KB per queue
    depth 8192 (max):     35712 + 296 * 8192 = ~2403 KB per queue

  mana_pre_alloc_rxbufs -> rxbufs_pre and das_pre arrays:
    16 queues, depth 1024 (default): 16 * 1024 * 8 =  128 KB each
    64 queues, depth 8192 (max):     64 * 8192 * 8 = 4096 KB each

This series addresses the issue by:
  1. Converting the tx_qp flat array into an array of pointers with
     per-queue kvzalloc (~35 KB each), replacing a single contiguous
     allocation that can reach ~2.2 MB at 64 queues.
  2. Switching rxbufs_pre, das_pre, and rxq allocations to
     kvmalloc/kvzalloc so the allocator can fall back to vmalloc
     when contiguous memory is unavailable.

Throughput testing confirms no regression. Since kvmalloc falls
back to vmalloc under memory fragmentation, all kvmalloc calls
were temporarily replaced with vmalloc to simulate the fallback
path (iperf3, GBits/sec):

                 Physically contiguous         vmalloc region
  Connections      TX          RX              TX          RX
  --------------------------------------------------------------
  1                47.2        46.9            46.8        46.6
  16               181         181             181         181
  32               181         181             181         181
  64               181         181             181         181

Aditya Garg (2):
  net: mana: Use per-queue allocation for tx_qp to reduce allocation
    size
  net: mana: Use kvmalloc for large RX queue and buffer allocations

 .../net/ethernet/microsoft/mana/mana_bpf.c    |  2 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++--------
 .../ethernet/microsoft/mana/mana_ethtool.c    |  2 +-
 include/net/mana/mana.h                       |  2 +-
 4 files changed, 39 insertions(+), 28 deletions(-)

-- 
2.43.0


^ permalink raw reply

* [PATCH net-next 1/2] net: mana: Use per-queue allocation for tx_qp to reduce allocation size
From: Aditya Garg @ 2026-04-14 15:13 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
	edumazet, kuba, pabeni, kotaranov, horms, ssengar, jacob.e.keller,
	dipayanroy, ernis, shirazsaleem, kees, sbhatta, leitao, netdev,
	linux-hyperv, linux-kernel, linux-rdma, bpf, gargaditya,
	gargaditya
In-Reply-To: <20260414151456.687506-1-gargaditya@linux.microsoft.com>

Convert tx_qp from a single contiguous array allocation to per-queue
individual allocations. Each mana_tx_qp struct is approximately 35KB.
With many queues (e.g., 32/64), the flat array requires a single
contiguous allocation that can fail under memory fragmentation.

Change mana_tx_qp *tx_qp to mana_tx_qp **tx_qp (array of pointers),
allocating each queue's mana_tx_qp individually via kvzalloc. This
reduces each allocation to ~35KB and provides vmalloc fallback,
avoiding allocation failure due to fragmentation.

Signed-off-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
---
 .../net/ethernet/microsoft/mana/mana_bpf.c    |  2 +-
 drivers/net/ethernet/microsoft/mana/mana_en.c | 49 ++++++++++++-------
 .../ethernet/microsoft/mana/mana_ethtool.c    |  2 +-
 include/net/mana/mana.h                       |  2 +-
 4 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index 7697c9b52ed3..b5e9bb184a1d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -68,7 +68,7 @@ int mana_xdp_xmit(struct net_device *ndev, int n, struct xdp_frame **frames,
 		count++;
 	}
 
-	tx_stats = &apc->tx_qp[q_idx].txq.stats;
+	tx_stats = &apc->tx_qp[q_idx]->txq.stats;
 
 	u64_stats_update_begin(&tx_stats->syncp);
 	tx_stats->xdp_xmit += count;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 09a53c977545..49ee77b0939a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -355,9 +355,9 @@ netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
 	if (skb_cow_head(skb, MANA_HEADROOM))
 		goto tx_drop_count;
 
-	txq = &apc->tx_qp[txq_idx].txq;
+	txq = &apc->tx_qp[txq_idx]->txq;
 	gdma_sq = txq->gdma_sq;
-	cq = &apc->tx_qp[txq_idx].tx_cq;
+	cq = &apc->tx_qp[txq_idx]->tx_cq;
 	tx_stats = &txq->stats;
 
 	BUILD_BUG_ON(MAX_TX_WQE_SGL_ENTRIES != MANA_MAX_TX_WQE_SGL_ENTRIES);
@@ -614,7 +614,7 @@ static void mana_get_stats64(struct net_device *ndev,
 	}
 
 	for (q = 0; q < num_queues; q++) {
-		tx_stats = &apc->tx_qp[q].txq.stats;
+		tx_stats = &apc->tx_qp[q]->txq.stats;
 
 		do {
 			start = u64_stats_fetch_begin(&tx_stats->syncp);
@@ -2284,21 +2284,26 @@ static void mana_destroy_txq(struct mana_port_context *apc)
 		return;
 
 	for (i = 0; i < apc->num_queues; i++) {
-		debugfs_remove_recursive(apc->tx_qp[i].mana_tx_debugfs);
-		apc->tx_qp[i].mana_tx_debugfs = NULL;
+		if (!apc->tx_qp[i])
+			continue;
+
+		debugfs_remove_recursive(apc->tx_qp[i]->mana_tx_debugfs);
+		apc->tx_qp[i]->mana_tx_debugfs = NULL;
 
-		napi = &apc->tx_qp[i].tx_cq.napi;
-		if (apc->tx_qp[i].txq.napi_initialized) {
+		napi = &apc->tx_qp[i]->tx_cq.napi;
+		if (apc->tx_qp[i]->txq.napi_initialized) {
 			napi_synchronize(napi);
 			napi_disable_locked(napi);
 			netif_napi_del_locked(napi);
-			apc->tx_qp[i].txq.napi_initialized = false;
+			apc->tx_qp[i]->txq.napi_initialized = false;
 		}
-		mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i].tx_object);
+		mana_destroy_wq_obj(apc, GDMA_SQ, apc->tx_qp[i]->tx_object);
 
-		mana_deinit_cq(apc, &apc->tx_qp[i].tx_cq);
+		mana_deinit_cq(apc, &apc->tx_qp[i]->tx_cq);
 
-		mana_deinit_txq(apc, &apc->tx_qp[i].txq);
+		mana_deinit_txq(apc, &apc->tx_qp[i]->txq);
+
+		kvfree(apc->tx_qp[i]);
 	}
 
 	kfree(apc->tx_qp);
@@ -2307,7 +2312,7 @@ static void mana_destroy_txq(struct mana_port_context *apc)
 
 static void mana_create_txq_debugfs(struct mana_port_context *apc, int idx)
 {
-	struct mana_tx_qp *tx_qp = &apc->tx_qp[idx];
+	struct mana_tx_qp *tx_qp = apc->tx_qp[idx];
 	char qnum[32];
 
 	sprintf(qnum, "TX-%d", idx);
@@ -2346,7 +2351,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 	int err;
 	int i;
 
-	apc->tx_qp = kzalloc_objs(struct mana_tx_qp, apc->num_queues);
+	apc->tx_qp = kzalloc_objs(struct mana_tx_qp *, apc->num_queues);
 	if (!apc->tx_qp)
 		return -ENOMEM;
 
@@ -2366,10 +2371,16 @@ static int mana_create_txq(struct mana_port_context *apc,
 	gc = gd->gdma_context;
 
 	for (i = 0; i < apc->num_queues; i++) {
-		apc->tx_qp[i].tx_object = INVALID_MANA_HANDLE;
+		apc->tx_qp[i] = kvzalloc_obj(*apc->tx_qp[i]);
+		if (!apc->tx_qp[i]) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		apc->tx_qp[i]->tx_object = INVALID_MANA_HANDLE;
 
 		/* Create SQ */
-		txq = &apc->tx_qp[i].txq;
+		txq = &apc->tx_qp[i]->txq;
 
 		u64_stats_init(&txq->stats.syncp);
 		txq->ndev = net;
@@ -2387,7 +2398,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 			goto out;
 
 		/* Create SQ's CQ */
-		cq = &apc->tx_qp[i].tx_cq;
+		cq = &apc->tx_qp[i]->tx_cq;
 		cq->type = MANA_CQ_TYPE_TX;
 
 		cq->txq = txq;
@@ -2416,7 +2427,7 @@ static int mana_create_txq(struct mana_port_context *apc,
 
 		err = mana_create_wq_obj(apc, apc->port_handle, GDMA_SQ,
 					 &wq_spec, &cq_spec,
-					 &apc->tx_qp[i].tx_object);
+					 &apc->tx_qp[i]->tx_object);
 
 		if (err)
 			goto out;
@@ -3242,7 +3253,7 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	 */
 
 	for (i = 0; i < apc->num_queues; i++) {
-		txq = &apc->tx_qp[i].txq;
+		txq = &apc->tx_qp[i]->txq;
 		tsleep = 1000;
 		while (atomic_read(&txq->pending_sends) > 0 &&
 		       time_before(jiffies, timeout)) {
@@ -3261,7 +3272,7 @@ static int mana_dealloc_queues(struct net_device *ndev)
 	}
 
 	for (i = 0; i < apc->num_queues; i++) {
-		txq = &apc->tx_qp[i].txq;
+		txq = &apc->tx_qp[i]->txq;
 		while ((skb = skb_dequeue(&txq->pending_skbs))) {
 			mana_unmap_skb(skb, apc);
 			dev_kfree_skb_any(skb);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index f2d220b371b5..f5901e4c9816 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -251,7 +251,7 @@ static void mana_get_ethtool_stats(struct net_device *ndev,
 	}
 
 	for (q = 0; q < num_queues; q++) {
-		tx_stats = &apc->tx_qp[q].txq.stats;
+		tx_stats = &apc->tx_qp[q]->txq.stats;
 
 		do {
 			start = u64_stats_fetch_begin(&tx_stats->syncp);
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index a078af283bdd..60b4a4146ea2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -505,7 +505,7 @@ struct mana_port_context {
 	bool tx_shortform_allowed;
 	u16 tx_vp_offset;
 
-	struct mana_tx_qp *tx_qp;
+	struct mana_tx_qp **tx_qp;
 
 	/* Indirection Table for RX & TX. The values are queue indexes */
 	u32 *indir_table;
-- 
2.43.0


^ permalink raw reply related

* [PATCH net] net: pse-pd: fix out-of-bounds bitmap access in pse_isr() on 32-bit
From: Kory Maincent @ 2026-04-14 15:13 UTC (permalink / raw)
  To: Kory Maincent (Dent Project), Jakub Kicinski, netdev,
	linux-kernel
  Cc: Carlo Szelinsky, thomas.petazzoni, Oleksij Rempel, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni

In pse_isr(), notifs_mask was declared as a single unsigned long on the
stack (32 bits on 32-bit architectures). For PSE controllers with more
than 32 ports, this causes two problems:

- map_event callbacks could wrote bit positions >= 32 via
  *notifs_mask |= BIT(i), which is undefined behaviour on a 32-bit
  unsigned long and corrupts adjacent stack memory.

- for_each_set_bit(i, &notifs_mask, pcdev->nr_lines) treats
  &notifs_mask as a multi-word bitmap and reads beyond the single
  unsigned long when nr_lines > BITS_PER_LONG.

Fix this by moving notifs_mask out of the stack and into struct pse_irq
as a dynamically allocated bitmap. It is sized with
BITS_TO_LONGS(pcdev->nr_lines) words in devm_pse_irq_helper(), so it
is always wide enough regardless of the host word size.

Fixes: fc0e6db30941a ("net: pse-pd: Add support for reporting events")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
---
 drivers/net/pse-pd/pse_core.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index 3beaaaeec9e1f..2ced837f375d2 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -1170,6 +1170,7 @@ struct pse_irq {
 	struct pse_controller_dev *pcdev;
 	struct pse_irq_desc desc;
 	unsigned long *notifs;
+	unsigned long *notifs_mask;
 };
 
 /**
@@ -1247,7 +1248,6 @@ static int pse_set_config_isr(struct pse_controller_dev *pcdev, int id,
 static irqreturn_t pse_isr(int irq, void *data)
 {
 	struct pse_controller_dev *pcdev;
-	unsigned long notifs_mask = 0;
 	struct pse_irq_desc *desc;
 	struct pse_irq *h = data;
 	int ret, i;
@@ -1257,14 +1257,15 @@ static irqreturn_t pse_isr(int irq, void *data)
 
 	/* Clear notifs mask */
 	memset(h->notifs, 0, pcdev->nr_lines * sizeof(*h->notifs));
+	bitmap_zero(h->notifs_mask, pcdev->nr_lines);
 	mutex_lock(&pcdev->lock);
-	ret = desc->map_event(irq, pcdev, h->notifs, &notifs_mask);
-	if (ret || !notifs_mask) {
+	ret = desc->map_event(irq, pcdev, h->notifs, h->notifs_mask);
+	if (ret || bitmap_empty(h->notifs_mask, pcdev->nr_lines)) {
 		mutex_unlock(&pcdev->lock);
 		return IRQ_NONE;
 	}
 
-	for_each_set_bit(i, &notifs_mask, pcdev->nr_lines) {
+	for_each_set_bit(i, h->notifs_mask, pcdev->nr_lines) {
 		unsigned long notifs, rnotifs;
 		struct pse_ntf ntf = {};
 
@@ -1340,6 +1341,11 @@ int devm_pse_irq_helper(struct pse_controller_dev *pcdev, int irq,
 	if (!h->notifs)
 		return -ENOMEM;
 
+	h->notifs_mask = devm_kcalloc(dev, BITS_TO_LONGS(pcdev->nr_lines),
+				      sizeof(*h->notifs_mask), GFP_KERNEL);
+	if (!h->notifs_mask)
+		return -ENOMEM;
+
 	ret = devm_request_threaded_irq(dev, irq, NULL, pse_isr,
 					IRQF_ONESHOT | irq_flags,
 					irq_name, h);
-- 
2.43.0


^ permalink raw reply related

* Re: [PATCH iwl-next v2 2/2] idpf: implement pci error handlers
From: Lukas Wunner @ 2026-04-14 15:13 UTC (permalink / raw)
  To: Emil Tantilov
  Cc: intel-wired-lan, netdev, przemyslaw.kitszel, jay.bhat,
	ivan.d.barrera, aleksandr.loktionov, larysa.zaremba,
	anthony.l.nguyen, andrew+netdev, davem, edumazet, kuba, pabeni,
	aleksander.lobakin, linux-pci, madhu.chittim, decot, willemb,
	sheenamo
In-Reply-To: <20260414031631.2107-3-emil.s.tantilov@intel.com>

On Mon, Apr 13, 2026 at 08:16:31PM -0700, Emil Tantilov wrote:
> +static pci_ers_result_t
> +idpf_pci_err_slot_reset(struct pci_dev *pdev)
> +{
> +	struct idpf_adapter *adapter = pci_get_drvdata(pdev);
> +
> +	pci_restore_state(pdev);
> +	pci_set_master(pdev);
> +	pci_wake_from_d3(pdev, false);
> +	if (readl(adapter->reset_reg.rstat) != 0xFFFFFFFF)
> +		return PCI_ERS_RESULT_RECOVERED;

FWIW, there's a PCI_POSSIBLE_ERROR() helper that you may find useful
to check for an "all ones" MMIO read.

Thanks,

Lukas

^ permalink raw reply

* [PATCH bpf v4 2/5] bpf, sockmap: Fix af_unix iter deadlock
From: Michal Luczaj @ 2026-04-14 14:13 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj,
	Jiayuan Chen
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-0-2af6fe97918e@rbox.co>

bpf_iter_unix_seq_show() may deadlock when lock_sock_fast() takes the fast
path and the iter prog attempts to update a sockmap. Which ends up spinning
at sock_map_update_elem()'s bh_lock_sock():

WARNING: possible recursive locking detected
test_progs/1393 is trying to acquire lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0

but task is already holding lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(slock-AF_UNIX);
  lock(slock-AF_UNIX);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by test_progs/1393:
 #0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
 #1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
 #2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
 #3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00

Call Trace:
 dump_stack_lvl+0x5d/0x80
 print_deadlock_bug.cold+0xc0/0xce
 __lock_acquire+0x130f/0x2590
 lock_acquire+0x14e/0x2b0
 _raw_spin_lock+0x30/0x40
 sock_map_update_elem+0xdb/0x1f0
 bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
 bpf_iter_run_prog+0x5b9/0xb00
 bpf_iter_unix_seq_show+0x1f7/0x2e0
 bpf_seq_read+0x42c/0x10d0
 vfs_read+0x171/0xb20
 ksys_read+0xff/0x200
 do_syscall_64+0x6b/0x3a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/unix/af_unix.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index b23c33df8b46..590a30d3b2f7 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -3731,15 +3731,14 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
 	struct bpf_prog *prog;
 	struct sock *sk = v;
 	uid_t uid;
-	bool slow;
 	int ret;
 
 	if (v == SEQ_START_TOKEN)
 		return 0;
 
-	slow = lock_sock_fast(sk);
+	lock_sock(sk);
 
-	if (unlikely(sk_unhashed(sk))) {
+	if (unlikely(sock_flag(sk, SOCK_DEAD))) {
 		ret = SEQ_SKIP;
 		goto unlock;
 	}
@@ -3749,7 +3748,7 @@ static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
 	prog = bpf_iter_get_info(&meta, false);
 	ret = unix_prog_seq_show(prog, &meta, v, uid);
 unlock:
-	unlock_sock_fast(sk, slow);
+	release_sock(sk);
 	return ret;
 }
 

-- 
2.53.0


^ permalink raw reply related

* Re: [RFC] Proposal: Add sysfs interface for PCIe TPH Steering Tag retrieval and configuration
From: Jason Gunthorpe @ 2026-04-14 15:11 UTC (permalink / raw)
  To: fengchengwen
  Cc: Leon Romanovsky, Bjorn Helgaas, linux-rdma, linux-pci, netdev,
	dri-devel, Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang
In-Reply-To: <11eaea26-ec10-264a-db1e-951f6b46078d@huawei.com>

On Tue, Apr 14, 2026 at 10:46:00PM +0800, fengchengwen wrote:
>    We have a real platform requirement:
> 
>      * 1. Devices in TPH Device-Specific Mode with no standard ST table
>      * 2. Steering Tags must be obtained from ACPI _DSM (kernel-only)
>      * 3. Devices are fully managed by userspace drivers (VFIO/UIO)
>      * 4. Userspace must program STs into vendor-specific registers

No, this is nonsenscial too.

If you want to control the steering tags for MMIO BAR memory exposed
by VFIO then the DMABUF mechanism Keith & co has been working on is
the correct approach.

If the VFIO user needs to control steering tags for the device it is
directly controling then it must do that through VFIO ioctls.

Nobody messes around with other devices under the covers of the
operating kernel driver. Stop proposing that.

Jason

^ permalink raw reply

* Re: [PATCH iwl-next v2 2/2] idpf: implement pci error handlers
From: Lukas Wunner @ 2026-04-14 15:10 UTC (permalink / raw)
  To: Loktionov, Aleksandr
  Cc: Tantilov, Emil S, intel-wired-lan@lists.osuosl.org,
	netdev@vger.kernel.org, Kitszel, Przemyslaw, Bhat, Jay,
	Barrera, Ivan D, Zaremba, Larysa, Nguyen, Anthony L,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Lobakin, Aleksander,
	linux-pci@vger.kernel.org, Chittim, Madhu, decot@google.com,
	willemb@google.com, sheenamo@google.com
In-Reply-To: <IA3PR11MB8986C6EC840268F14C44B28CE5252@IA3PR11MB8986.namprd11.prod.outlook.com>

On Tue, Apr 14, 2026 at 11:09:05AM +0000, Loktionov, Aleksandr wrote:
> > From: Tantilov, Emil S <emil.s.tantilov@intel.com>
> > .slot_reset is the callback attempting to restore the device, provided
> > a PCI reset was initiated by the AER driver.

Just for clarity, those callbacks are invoked by PCI core error handling
code and are shared by EEH, AER, DPC as well as s390 error recovery flows.
So it's not only AER.

> > +/**
> > + * idpf_pci_err_resume - Resume operations after PCI error recovery
> > + * @pdev: PCI device struct
> > + */
> > +static void idpf_pci_err_resume(struct pci_dev *pdev) {
> > +	struct idpf_adapter *adapter = pci_get_drvdata(pdev);
> > +
> > +	/* Force a PFR when resuming from PCI error. */
> > +	if (test_and_set_bit(IDPF_PCI_CB_RESET, adapter->flags))
> > +		adapter->dev_ops.reg_ops.trigger_reset(adapter,
> > IDPF_HR_FUNC_RESET);
> 
> You say "Force a PFR", but PFR is only triggered on the AER path,
> not on the FLR path.

And?  idpf_pci_err_resume() is only invoked in the error recovery path
(aka AER path), not FLR path AFAICS.

Thanks,

Lukas

^ permalink raw reply

* Re: [PATCH RFC bpf-next 1/8] kasan: expose generic kasan helpers
From: Andrey Konovalov @ 2026-04-14 15:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexis Lothoré, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	John Fastabend, David S. Miller, David Ahern, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin,
	Shuah Khan, Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, Vincenzo Frascino,
	Andrew Morton, ebpf, Bastien Curutchet, Thomas Petazzoni,
	Xu Kuohai, bpf, LKML, Network Development,
	open list:KERNEL SELFTEST FRAMEWORK, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm
In-Reply-To: <CAADnVQLJ=fJ7t1i2+_RYqU1gqYqiLP9Zrwo4vdZsgzjK_yzJTQ@mail.gmail.com>

On Tue, Apr 14, 2026 at 4:36 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> > ACK, I'll try to use those kasan_check_read and kasan_check_write rather
> > than __asan_{load,store}X.
>
> No. The performance penalty will be too high.

With using __asan_load/storeX(), it will be one function call to get
to check_region_inline(): __asan_load/storeX->check_region_inline.

With kasan_check_read/write(), right now, it would be two function
calls: __kasan_check_read->kasan_check_range->check_region_inline.

I doubt an extra function call would make a difference in terms of
performance: the shadow checking itself is also expensive.

But if the second call is a concern, we can move kasan_check_range()
and lower-level functions into mm/kasan/generic.h and include it into
shadow.c, and then it will be just one function call.

To improve performance further, the JIT compiler could emit inlined
shadow checking instructions, same as the C compiler does with
KASAN_INLINE=y.

> hw_tags won't work without corresponding JIT work.

You probably meant SW_TAGS here.

HW_TAGS will likely just work without any JIT changes (even the
kasan_check_byte() thing I mentioned should not be required), assuming
JIT'ed BPF code just accesses kernel-returned pointers as is.

> I see no point sacrificing performance for aesthetics.

With the change I suggested above, there would be no performance
difference. And the code stays cleaner.

> __asan_load/storeX is what compilers emit.

For Generic mode. For SW_TAGS, the function names are different.
Keeping this detail within the KASAN code is cleaner.


> In that sense JIT is a compiler it should emit exactly the same.

^ permalink raw reply

* [PATCH net] net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
From: Kory Maincent @ 2026-04-14 15:09 UTC (permalink / raw)
  To: Kory Maincent (Dent Project), Jakub Kicinski, netdev,
	linux-kernel
  Cc: thomas.petazzoni, Oleksij Rempel, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni

The kernel-doc comment header incorrectly referenced the function
name pse_control_find_net_by_id() instead of the actual function name
pse_control_find_by_id(). Correct the function name in the documentation
to match the implementation.

Fixes: fc0e6db30941a ("net: pse-pd: Add support for reporting events")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
---
 drivers/net/pse-pd/pse_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index 2ced837f375d2..0848097ce7bf3 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -234,7 +234,7 @@ static int of_load_pse_pis(struct pse_controller_dev *pcdev)
 }
 
 /**
- * pse_control_find_net_by_id - Find net attached to the pse control id
+ * pse_control_find_by_id - Find pse_control from an id
  * @pcdev: a pointer to the PSE
  * @id: index of the PSE control
  *
-- 
2.43.0


^ permalink raw reply related

* Re: [net,PATCH v3 1/2] net: ks8851: Reinstate disabling of BHs around IRQ handler
From: Jakub Kicinski @ 2026-04-14 15:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Marek Vasut, netdev, stable, David S. Miller, Andrew Lunn,
	Eric Dumazet, Nicolai Buchwitz, Paolo Abeni, Ronald Wahl,
	Yicong Hui, linux-kernel
In-Reply-To: <20260414125753.Im6GAIHn@linutronix.de>

On Tue, 14 Apr 2026 14:57:53 +0200 Sebastian Andrzej Siewior wrote:
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Maybe I'm not being forceful enough.

Putting workarounds in the drivers is unacceptable.
__netdev_alloc_skb() must be legal to call under an _irq spin lock.

^ permalink raw reply

* Re: [PATCH iwl-next v2 2/2] idpf: implement pci error handlers
From: Tantilov, Emil S @ 2026-04-14 15:01 UTC (permalink / raw)
  To: Loktionov, Aleksandr, intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org, Kitszel, Przemyslaw, Bhat, Jay,
	Barrera, Ivan D, Zaremba, Larysa, Nguyen, Anthony L,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Lobakin, Aleksander,
	linux-pci@vger.kernel.org, Chittim, Madhu, decot@google.com,
	willemb@google.com, sheenamo@google.com, lukas@wunner.de
In-Reply-To: <IA3PR11MB8986C6EC840268F14C44B28CE5252@IA3PR11MB8986.namprd11.prod.outlook.com>



On 4/14/2026 4:09 AM, Loktionov, Aleksandr wrote:
> 
> 
>> -----Original Message-----
>> From: Tantilov, Emil S <emil.s.tantilov@intel.com>
>> Sent: Tuesday, April 14, 2026 5:17 AM
>> To: intel-wired-lan@lists.osuosl.org
>> Cc: netdev@vger.kernel.org; Kitszel, Przemyslaw
>> <przemyslaw.kitszel@intel.com>; Bhat, Jay <jay.bhat@intel.com>;
>> Barrera, Ivan D <ivan.d.barrera@intel.com>; Loktionov, Aleksandr
>> <aleksandr.loktionov@intel.com>; Zaremba, Larysa
>> <larysa.zaremba@intel.com>; Nguyen, Anthony L
>> <anthony.l.nguyen@intel.com>; andrew+netdev@lunn.ch;
>> davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
>> pabeni@redhat.com; Lobakin, Aleksander <aleksander.lobakin@intel.com>;
>> linux-pci@vger.kernel.org; Chittim, Madhu <madhu.chittim@intel.com>;
>> decot@google.com; willemb@google.com; sheenamo@google.com;
>> lukas@wunner.de
>> Subject: [PATCH iwl-next v2 2/2] idpf: implement pci error handlers
>>
>> Add callbacks to handle PCI errors and FLR reset. When preparing to
>> handle reset on the bus, the driver must stop all operations that can
>> lead to MMIO access in order to prevent HW errors. To accomplish this
>> introduce helper
>> idpf_reset_prepare() that gets called prior to FLR or when PCI error
>> is detected. Upon resume the recovery is done through the existing
>> reset path by starting the event task.
>>
>> The following callbacks are implemented:
>> .reset_prepare runs the first portion of the generic reset path
>> leading up to the part where we wait for the reset to complete.
>> .reset_done/resume runs the recovery part of the reset handling.
>> .error_detected is the callback dealing with PCI errors, similar to
>> the prepare call, we stop all operations, prior to attempting a
>> recovery.
>> .slot_reset is the callback attempting to restore the device, provided
>> a PCI reset was initiated by the AER driver.
>>
>> Whereas previously the init logic guaranteed netdevs during reset, the
>> addition of idpf_detach_and_close() to the PCI callbacks flow makes it
>> possible for the function to be called without netdevs. Add check to
>> avoid NULL pointer dereference in that case.
>>
>> Co-developed-by: Alan Brady <alan.brady@intel.com>
>> Signed-off-by: Alan Brady <alan.brady@intel.com>
>> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
>> Reviewed-by: Jay Bhat <jay.bhat@intel.com>
>> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
>> ---
>>   drivers/net/ethernet/intel/idpf/idpf.h      |   3 +
>>   drivers/net/ethernet/intel/idpf/idpf_lib.c  |  13 ++-
>> drivers/net/ethernet/intel/idpf/idpf_main.c | 112 ++++++++++++++++++++
>>   3 files changed, 126 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/idpf/idpf.h
>> b/drivers/net/ethernet/intel/idpf/idpf.h
>> index 1d0e32e47e87..164d2f3e233a 100644
>> --- a/drivers/net/ethernet/intel/idpf/idpf.h
>> +++ b/drivers/net/ethernet/intel/idpf/idpf.h
>> @@ -88,6 +88,7 @@ enum idpf_state {
>>    * @IDPF_REMOVE_IN_PROG: Driver remove in progress
>>    * @IDPF_MB_INTR_MODE: Mailbox in interrupt mode
>>    * @IDPF_VC_CORE_INIT: virtchnl core has been init
>> + * @IDPF_PCI_CB_RESET: Reset via the PCI callbacks
>>    * @IDPF_FLAGS_NBITS: Must be last
>>    */
>>   enum idpf_flags {
>> @@ -97,6 +98,7 @@ enum idpf_flags {
>>   	IDPF_REMOVE_IN_PROG,
>>   	IDPF_MB_INTR_MODE,
>>   	IDPF_VC_CORE_INIT,
> 
> ...
> 
>> +/**
>> + * idpf_pci_err_resume - Resume operations after PCI error recovery
>> + * @pdev: PCI device struct
>> + */
>> +static void idpf_pci_err_resume(struct pci_dev *pdev) {
>> +	struct idpf_adapter *adapter = pci_get_drvdata(pdev);
>> +
>> +	/* Force a PFR when resuming from PCI error. */
>> +	if (test_and_set_bit(IDPF_PCI_CB_RESET, adapter->flags))
>> +		adapter->dev_ops.reg_ops.trigger_reset(adapter,
>> IDPF_HR_FUNC_RESET);
> You say "Force a PFR", but PFR is only triggered on the AER path, not on the FLR path.

Hence the "force" - the call to `trigger_reset` results in a PFR and is
only needed in the case of a PCI error. If this function was called
because a user issued an FLR, the kernel will trigger it for us. This
way we can reuse the reset handling path to restore the operation of the
netdevs.

Though I may be misunderstanding - are you referring to the wording or
the logic?

Thanks,
Emil

> 
> Everything else looks fine
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> 
>> +
>> +	queue_delayed_work(adapter->vc_event_wq,
>> +			   &adapter->vc_event_task,
>> +			   msecs_to_jiffies(300));
>> +}
> 
> ...
> 
>>   };
>>   module_pci_driver(idpf_driver);
>> --
>> 2.37.3
> 


^ permalink raw reply

* Re: [PATCH net v2 1/1] net: hsr: avoid learning unknown senders for local delivery
From: Sebastian Andrzej Siewior @ 2026-04-14 14:59 UTC (permalink / raw)
  To: Felix Maurer, Ao Zhou
  Cc: netdev, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Murali Karicheri, Shaurya Rane,
	Ingo Molnar, Kees Cook, Yifan Wu, Juefei Pu, Yuan Tan, Xin Liu,
	Yuqi Xu, Haoze Xie
In-Reply-To: <adYwjxLBBaLY52Wb@thinkpad>

On 2026-04-08 12:40:15 [+0200], Felix Maurer wrote:
> IMHO, the only real way to prevent excessive resource use on our side is
> to put a limit on these resources. In this case, limit the size of the
> node table (bonus: make that limit configurable as Paolo suggested).

I am slowly catching up. There was no follow-up on this one, right?

> Thanks,
>    Felix

Sebastian

^ permalink raw reply

* Re: [PATCH iwl-next v2 1/2] idpf: remove conditonal MBX deinit from idpf_vc_core_deinit()
From: Tantilov, Emil S @ 2026-04-14 14:56 UTC (permalink / raw)
  To: Loktionov, Aleksandr, intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org, Kitszel, Przemyslaw, Bhat, Jay,
	Barrera, Ivan D, Zaremba, Larysa, Nguyen, Anthony L,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Lobakin, Aleksander,
	linux-pci@vger.kernel.org, Chittim, Madhu, decot@google.com,
	willemb@google.com, sheenamo@google.com, lukas@wunner.de
In-Reply-To: <IA3PR11MB8986EDBC27D4267AA0F48BFBE5252@IA3PR11MB8986.namprd11.prod.outlook.com>



On 4/14/2026 4:07 AM, Loktionov, Aleksandr wrote:
> 
> 
>> -----Original Message-----
>> From: Tantilov, Emil S <emil.s.tantilov@intel.com>
>> Sent: Tuesday, April 14, 2026 5:17 AM
>> To: intel-wired-lan@lists.osuosl.org
>> Cc: netdev@vger.kernel.org; Kitszel, Przemyslaw
>> <przemyslaw.kitszel@intel.com>; Bhat, Jay <jay.bhat@intel.com>;
>> Barrera, Ivan D <ivan.d.barrera@intel.com>; Loktionov, Aleksandr
>> <aleksandr.loktionov@intel.com>; Zaremba, Larysa
>> <larysa.zaremba@intel.com>; Nguyen, Anthony L
>> <anthony.l.nguyen@intel.com>; andrew+netdev@lunn.ch;
>> davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
>> pabeni@redhat.com; Lobakin, Aleksander <aleksander.lobakin@intel.com>;
>> linux-pci@vger.kernel.org; Chittim, Madhu <madhu.chittim@intel.com>;
>> decot@google.com; willemb@google.com; sheenamo@google.com;
>> lukas@wunner.de
>> Subject: [PATCH iwl-next v2 1/2] idpf: remove conditonal MBX deinit
>> from idpf_vc_core_deinit()
> "conditional" -> "conditional"

Doh. I will make sure it is corrected.

> 
> Everything else looks fine
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

Thanks,
Emil

> 
>>
>> Previously it was assumed that idpf_vc_core_deinit() is always being
>> called during reset handling, with remove being an exception. Ideally
>> the driver needs to communicate the changes to FW in all instances
>> where the MBX is not already disabled. Remove the remove_in_prog check
>> from
>> idpf_vc_core_deinit() as the MBX was already disabled while handling
>> the reset via libie_ctlq_xn_shutdown() by the service task. This is
>> also needed by the following patch, introducing PCI callbacks support.
>>
>> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
>> Reviewed-by: Jay Bhat <jay.bhat@intel.com>
>> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
>> ---
>>   drivers/net/ethernet/intel/idpf/idpf_virtchnl.c | 11 +----------
>>   1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
>> b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
>> index 129c8f6b0faa..fceaf3ec1cd4 100644
>> --- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
>> +++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
>> @@ -3229,24 +3229,15 @@ int idpf_vc_core_init(struct idpf_adapter
>> *adapter)
>>    */
>>   void idpf_vc_core_deinit(struct idpf_adapter *adapter)  {
>> -	bool remove_in_prog;
>> -
>>   	if (!test_bit(IDPF_VC_CORE_INIT, adapter->flags))
>>   		return;
>>
>> -	/* Avoid transaction timeouts when called during reset */
>> -	remove_in_prog = test_bit(IDPF_REMOVE_IN_PROG, adapter->flags);
>> -	if (!remove_in_prog)
>> -		idpf_deinit_dflt_mbx(adapter);
>> -
>>   	idpf_ptp_release(adapter);
>>   	idpf_deinit_task(adapter);
>>   	idpf_idc_deinit_core_aux_device(adapter);
>>   	idpf_rel_rx_pt_lkup(adapter);
>>   	idpf_intr_rel(adapter);
>> -
>> -	if (remove_in_prog)
>> -		idpf_deinit_dflt_mbx(adapter);
>> +	idpf_deinit_dflt_mbx(adapter);
>>
>>   	cancel_delayed_work_sync(&adapter->serv_task);
>>
>> --
>> 2.37.3
> 


^ permalink raw reply

* Re: [PATCH v3 1/3] net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Andrew Lunn @ 2026-04-14 14:54 UTC (permalink / raw)
  To: Fidelio LAWSON
  Cc: Marek Vasut, Woojung Huh, UNGLinuxDriver, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Marek Vasut, Maxime Chevallier, Simon Horman, Heiner Kallweit,
	Russell King, netdev, linux-kernel, Fidelio Lawson
In-Reply-To: <264667a8-bbb2-44ac-84e7-df6c506ae6fa@gmail.com>

On Tue, Apr 14, 2026 at 03:48:33PM +0200, Fidelio LAWSON wrote:
> On 4/14/26 14:40, Andrew Lunn wrote:
> > On Tue, Apr 14, 2026 at 01:05:49PM +0200, Marek Vasut wrote:
> > > On 4/14/26 11:12 AM, Fidelio Lawson wrote:
> > > > Implement the "Module 3: Equalizer fix for short cables" erratum from
> > > > Microchip document DS80000687C for KSZ87xx switches.
> > > > 
> > > > The issue affects short or low-loss cable links (e.g. CAT5e/CAT6),
> > > > where the PHY receiver equalizer may amplify high-amplitude signals
> > > > excessively, resulting in internal distortion and link establishment
> > > > failures.
> > > > 
> > > > KSZ87xx devices require a workaround for the Module 3 low-loss cable
> > > > condition, controlled through the switch TABLE_LINK_MD_V indirect
> > > > registers.
> > > > 
> > > > The affected registers are part of the switch address space and are not
> > > > directly accessible from the PHY driver. To keep the PHY-facing API
> > > > clean and avoid leaking switch-specific details, model this errata
> > > > control as vendor-specific Clause 22 PHY registers.
> > > > 
> > > > A vendor-specific Clause 22 PHY register is introduced as a mode
> > > > selector in PHY_REG_LOW_LOSS_CTRL, and ksz8_r_phy() / ksz8_w_phy()
> > > > translate accesses to these bits into the appropriate indirect
> > > > TABLE_LINK_MD_V accesses.
> > > > 
> > > > The control register defines the following modes:
> > > > 0: disabled (default behavior)
> > > > 1: EQ training workaround
> > > > 2: LPF 90 MHz
> > > > 3: LPF 62 MHz
> > > > 4: LPF 55 MHz
> > > > 5: LPF 44 MHz
> > > I may not fully understand this, but aren't the EQ and LPF settings
> > > orthogonal ?
> > 
> > What is the real life experience using this feature? Is it needed for
> > 1cm cables, but most > 1m cables are O.K with the defaults? Do we need
> > all these configuration options? How is a user supposed to discover
> > the different options? Can we simplify it down to a Boolean?
> We were seeing random link dropouts with the default settings, and since
> enabling the workaround 2, the link has remained stable and we have not
> observed any further issues.

So for you, a boolean which enables workaround 2 would be sufficient.

Marek, what is your experience?

       Andrew

^ permalink raw reply

* Re: [net,PATCH v3 1/2] net: ks8851: Reinstate disabling of BHs around IRQ handler
From: Sebastian Andrzej Siewior @ 2026-04-14 14:52 UTC (permalink / raw)
  To: Marek Vasut
  Cc: netdev, stable, David S. Miller, Andrew Lunn, Eric Dumazet,
	Jakub Kicinski, Nicolai Buchwitz, Paolo Abeni, Ronald Wahl,
	Yicong Hui, linux-kernel
In-Reply-To: <2fcfb84f-69f6-493e-94d6-95d85d8000f6@nabladev.com>

On 2026-04-14 16:20:46 [+0200], Marek Vasut wrote:
> > This is what happens since commit 0913ec336a6c0 ("net: ks8851: Fix
> > deadlock with the SPI chip variant"). Before that commit the softirq
> > execution will be picked up by netdev_alloc_skb_ip_align() and requires
> > PREEMPT_RT and a RX packet in #1 to trigger the deadlock.
> 
> Do you want me to add this into the V4 commit message ?

The description does not match the code since the commit mentioned
above.

> > > Fix the problem by disabling BH around critical sections, including the
> > > IRQ handler, thus preventing the net_tx_action() softirq from triggering
> > > during these critical sections. The net_tx_action() softirq is triggered
> > > at the end of the IRQ handler, once all the other IRQ handler actions have
> > > been completed.
> > > 
> > >   __schedule from schedule_rtlock+0x1c/0x34
> > >   schedule_rtlock from rtlock_slowlock_locked+0x548/0x904
> > >   rtlock_slowlock_locked from rt_spin_lock+0x60/0x9c
> > >   rt_spin_lock from ks8851_start_xmit_par+0x74/0x1a8
> > >   ks8851_start_xmit_par from netdev_start_xmit+0x20/0x44
> > >   netdev_start_xmit from dev_hard_start_xmit+0xd0/0x188
> > >   dev_hard_start_xmit from sch_direct_xmit+0xb8/0x25c
> > >   sch_direct_xmit from __qdisc_run+0x1f8/0x4ec
> > >   __qdisc_run from qdisc_run+0x1c/0x28
> > >   qdisc_run from net_tx_action+0x1f0/0x268
> > >   net_tx_action from handle_softirqs+0x1a4/0x270
> > >   handle_softirqs from __local_bh_enable_ip+0xcc/0xe0
> > >   __local_bh_enable_ip from __alloc_skb+0xd8/0x128
> > >   __alloc_skb from __netdev_alloc_skb+0x3c/0x19c
> > >   __netdev_alloc_skb from ks8851_irq+0x388/0x4d4
> > >   ks8851_irq from irq_thread_fn+0x24/0x64
> > >   irq_thread_fn from irq_thread+0x178/0x28c
> > >   irq_thread from kthread+0x12c/0x138
> > >   kthread from ret_from_fork+0x14/0x28
> > 
> > The backtrace here and the description is based on an older kernel.
> > However
> I actually did update the backtrace in V3 with the one from current next
> 20260413 .

That would be from yesterday and the change is merged since v6.10. But
why is the softirq starting from __netdev_alloc_skb() instead of
spin_unlock_bh(&ks->statelock)? After that unlock, the softirq must be
processed and __netdev_alloc_skb() _could_ observe pending softirqs but
not from ks8851.

Sebastian

^ permalink raw reply

* Re: [PATCH RFC bpf-next 3/8] bpf: add BPF_JIT_KASAN for KASAN instrumentation of JITed programs
From: Alexei Starovoitov @ 2026-04-14 14:38 UTC (permalink / raw)
  To: Alexis Lothoré
  Cc: Andrey Konovalov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	John Fastabend, David S. Miller, David Ahern, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin,
	Shuah Khan, Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, Vincenzo Frascino,
	Andrew Morton, ebpf, Bastien Curutchet, Thomas Petazzoni,
	Xu Kuohai, bpf, LKML, Network Development,
	open list:KERNEL SELFTEST FRAMEWORK, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm
In-Reply-To: <DHSWSSYRPUVC.2W3G3OU27L3HG@bootlin.com>

On Tue, Apr 14, 2026 at 6:24 AM Alexis Lothoré
<alexis.lothore@bootlin.com> wrote:
>
> On Tue Apr 14, 2026 at 12:20 AM CEST, Andrey Konovalov wrote:
> > On Mon, Apr 13, 2026 at 8:29 PM Alexis Lothoré (eBPF Foundation)
> > <alexis.lothore@bootlin.com> wrote:
> >>
> >> Add a new Kconfig option CONFIG_BPF_JIT_KASAN that automatically enables
> >> KASAN (Kernel Address Sanitizer) memory access checks for JIT-compiled
> >> BPF programs, when both KASAN and JIT compiler are enabled. When
> >> enabled, the JIT compiler will emit shadow memory checks before memory
> >> loads and stores to detect use-after-free, out-of-bounds, and other
> >> memory safety bugs at runtime. The option is gated behind
> >> HAVE_EBPF_JIT_KASAN, as it needs proper arch-specific implementation.
> >>
> >> Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
> >> ---
> >>  kernel/bpf/Kconfig | 9 +++++++++
> >>  1 file changed, 9 insertions(+)
> >>
> >> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> >> index eb3de35734f0..28392adb3d7e 100644
> >> --- a/kernel/bpf/Kconfig
> >> +++ b/kernel/bpf/Kconfig
> >> @@ -17,6 +17,10 @@ config HAVE_CBPF_JIT
> >>  config HAVE_EBPF_JIT
> >>         bool
> >>
> >> +# KASAN support for JIT compiler
> >> +config HAVE_EBPF_JIT_KASAN
> >> +       bool
> >> +
> >>  # Used by archs to tell that they want the BPF JIT compiler enabled by
> >>  # default for kernels that were compiled with BPF JIT support.
> >>  config ARCH_WANT_DEFAULT_BPF_JIT
> >> @@ -101,4 +105,9 @@ config BPF_LSM
> >>
> >>           If you are unsure how to answer this question, answer N.
> >>
> >> +config BPF_JIT_KASAN
> >> +       bool
> >> +       depends on HAVE_EBPF_JIT_KASAN
> >> +       default y if BPF_JIT && KASAN_GENERIC
> >
> > Should this be "depends on KASAN && KASAN_GENERIC"?
>
> Meaning, making it an explicit user-selectable option ?
>
> If so, the current design choice is voluntary and based on the feedback
> received on the original RFC, where I have been suggested to
> automatically enable the KASAN instrumentation in BPF programs if KASAN
> support is enabled in the kernel ([1]). But if a user-selectable toggle
> is eventually a better solution, I'm fine with changing it.

Let's not add more config knobs.
Even this patch looks redundant.
Inside JIT do instrumentation when KASAN_GENERIC is set.

^ permalink raw reply

* Re: linux-next: manual merge of the bpf-next tree with the origin tree
From: Mark Brown @ 2026-04-14 14:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko, bpf,
	Networking, Joel Fernandes, Kumar Kartikeya Dwivedi,
	Linux Kernel Mailing List, Linux Next Mailing List,
	Paul E. McKenney
In-Reply-To: <CAADnVQLz6-LK4+qad_XqEZXrspttpe4b49jRZ6wUCgEhJeTvgw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 737 bytes --]

On Tue, Apr 14, 2026 at 07:09:44AM -0700, Alexei Starovoitov wrote:

> But how come you're saying it was discovered "today" ?

> Paul's commit ad6ef775cbeff was committed to rcu tree on Mar 30,
> while Kumar's 57b23c0f612dc was committed to bpf-next on Apr 7.

> "today" is April 14.

> My only explanation is that rcu tree was not in linux-next until today?!

We're in the merge window, this means things get sent to Linus and end
up in his tree.  This in turn means that they are seen in different
orders, and sometimes depending on context in different forms.  -next
is merged sequentially, it's not an octopus merge of all the trees at
once.  Some variation of this will likely have been seen before, but not
this exact combination.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH RFC bpf-next 1/8] kasan: expose generic kasan helpers
From: Alexei Starovoitov @ 2026-04-14 14:36 UTC (permalink / raw)
  To: Alexis Lothoré
  Cc: Andrey Konovalov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	John Fastabend, David S. Miller, David Ahern, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin,
	Shuah Khan, Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, Vincenzo Frascino,
	Andrew Morton, ebpf, Bastien Curutchet, Thomas Petazzoni,
	Xu Kuohai, bpf, LKML, Network Development,
	open list:KERNEL SELFTEST FRAMEWORK, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm
In-Reply-To: <DHSWK17EZUDP.GIJ6BX2NFR6U@bootlin.com>

On Tue, Apr 14, 2026 at 6:13 AM Alexis Lothoré
<alexis.lothore@bootlin.com> wrote:
>
> Hi Andrey, thanks for the prompt review !
>
> On Tue Apr 14, 2026 at 12:19 AM CEST, Andrey Konovalov wrote:
> > On Mon, Apr 13, 2026 at 8:29 PM Alexis Lothoré (eBPF Foundation)
> > <alexis.lothore@bootlin.com> wrote:
> >>
>
> [...]
>
> >> +#ifdef CONFIG_KASAN_GENERIC
> >> +void __asan_load1(void *p);
> >> +void __asan_store1(void *p);
> >> +void __asan_load2(void *p);
> >> +void __asan_store2(void *p);
> >> +void __asan_load4(void *p);
> >> +void __asan_store4(void *p);
> >> +void __asan_load8(void *p);
> >> +void __asan_store8(void *p);
> >> +void __asan_load16(void *p);
> >> +void __asan_store16(void *p);
> >> +#endif /* CONFIG_KASAN_GENERIC */
> >
> > This looks ugly, let's not do this unless it's really required.
> >
> > You can just use kasan_check_read/write() instead - these are public
> > wrappers around the same shadow memory checking functions. And they
> > also work with the SW_TAGS mode, in case the BPF would want to use
> > that mode at some point. (For HW_TAGS, we only have kasan_check_byte()
> > that checks a single byte, but it can be extended in the future if
> > required to be used by BPF.)
>
> ACK, I'll try to use those kasan_check_read and kasan_check_write rather
> than __asan_{load,store}X.

No. The performance penalty will be too high.
hw_tags won't work without corresponding JIT work.
I see no point sacrificing performance for aesthetics.
__asan_load/storeX is what compilers emit.
In that sense JIT is a compiler it should emit exactly the same.

^ permalink raw reply

* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Petr Pavlu @ 2026-04-14 14:33 UTC (permalink / raw)
  To: chensong_2000
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe, jikos, mbenes,
	pmladek, joe.lawrence, rostedt, mhiramat, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260413080701.180976-1-chensong_2000@189.cn>

On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
> From: Song Chen <chensong_2000@189.cn>
> 
> ftrace and livepatch currently have their module load/unload callbacks
> hard-coded in the module loader as direct function calls to
> ftrace_module_enable(), klp_module_coming(), klp_module_going()
> and ftrace_release_mod(). This tight coupling was originally introduced
> to enforce strict call ordering that could not be guaranteed by the
> module notifier chain, which only supported forward traversal. Their
> notifiers were moved in and out back and forth. see [1] and [2].

I'm unclear about what is meant by the notifiers being moved back and
forth. The links point to patches that converted ftrace+klp from using
module notifiers to explicit callbacks due to ordering issues, but this
switch occurred only once. Have there been other attempts to use
notifiers again?

> 
> Now that the notifier chain supports reverse traversal via
> blocking_notifier_call_chain_reverse(), the ordering can be enforced
> purely through notifier priority. As a result, the module loader is now
> decoupled from the implementation details of ftrace and livepatch.
> What's more, adding a new subsystem with symmetric setup/teardown ordering
> requirements during module load/unload no longer requires modifying
> kernel/module/main.c; it only needs to register a notifier_block with an
> appropriate priority.
> 
> [1]:https://lore.kernel.org/all/
> 	alpine.LNX.2.00.1602172216491.22700@cbobk.fhfr.pm/
> [2]:https://lore.kernel.org/all/
> 	20160301030034.GC12120@packer-debian-8-amd64.digitalocean.com/

Nit: Avoid wrapping URLs, as it breaks autolinking and makes the links
harder to copy.

Better links would be:
[1] https://lore.kernel.org/all/1455661953-15838-1-git-send-email-jeyu@redhat.com/
[2] https://lore.kernel.org/all/1458176139-17455-1-git-send-email-jeyu@redhat.com/

The first link is the final version of what landed as commit
7dcd182bec27 ("ftrace/module: remove ftrace module notifier"). The
second is commit 7e545d6eca20 ("livepatch/module: remove livepatch
module notifier").

> 
> Signed-off-by: Song Chen <chensong_2000@189.cn>
> ---
>  include/linux/module.h  |  8 ++++++++
>  kernel/livepatch/core.c | 29 ++++++++++++++++++++++++++++-
>  kernel/module/main.c    | 34 +++++++++++++++-------------------
>  kernel/trace/ftrace.c   | 38 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 89 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 14f391b186c6..0bdd56f9defd 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -308,6 +308,14 @@ enum module_state {
>  	MODULE_STATE_COMING,	/* Full formed, running module_init. */
>  	MODULE_STATE_GOING,	/* Going away. */
>  	MODULE_STATE_UNFORMED,	/* Still setting it up. */
> +	MODULE_STATE_FORMED,

I don't see a reason to add a new module state. Why is it necessary and
how does it fit with the existing states?

> +};
> +
> +enum module_notifier_prio {
> +	MODULE_NOTIFIER_PRIO_LOW = INT_MIN,	/* Low prioroty, coming last, going first */
> +	MODULE_NOTIFIER_PRIO_MID = 0,	/* Normal priority. */
> +	MODULE_NOTIFIER_PRIO_SECOND_HIGH = INT_MAX - 1,	/* Second high priorigy, coming second*/
> +	MODULE_NOTIFIER_PRIO_HIGH = INT_MAX,	/* High priorigy, coming first, going late. */

I suggest being explicit about how the notifiers are ordered. For
example:

enum module_notifier_prio {
	MODULE_NOTIFIER_PRIO_NORMAL,	/* Normal priority, coming last, going first. */
	MODULE_NOTIFIER_PRIO_LIVEPATCH,
	MODULE_NOTIFIER_PRIO_FTRACE,	/* High priority, coming first, going late. */
};

>  };
>  
>  struct mod_tree_node {
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index 28d15ba58a26..ce78bb23e24b 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -1375,13 +1375,40 @@ void *klp_find_section_by_name(const struct module *mod, const char *name,
>  }
>  EXPORT_SYMBOL_GPL(klp_find_section_by_name);
>  
> +static int klp_module_callback(struct notifier_block *nb, unsigned long op,
> +			void *module)
> +{
> +	struct module *mod = module;
> +	int err = 0;
> +
> +	switch (op) {
> +	case MODULE_STATE_COMING:
> +		err = klp_module_coming(mod);
> +		break;
> +	case MODULE_STATE_LIVE:
> +		break;
> +	case MODULE_STATE_GOING:
> +		klp_module_going(mod);
> +		break;
> +	default:
> +		break;
> +	}

klp_module_coming() and klp_module_going() are now used only in
kernel/livepatch/core.c where they are also defined. This means the
functions can be static and their declarations removed from
include/linux/livepatch.h.

Nit: The MODULE_STATE_LIVE and default cases in the switch can be
removed.

> +
> +	return notifier_from_errno(err);
> +}
> +
> +static struct notifier_block klp_module_nb = {
> +	.notifier_call = klp_module_callback,
> +	.priority = MODULE_NOTIFIER_PRIO_SECOND_HIGH
> +};
> +
>  static int __init klp_init(void)
>  {
>  	klp_root_kobj = kobject_create_and_add("livepatch", kernel_kobj);
>  	if (!klp_root_kobj)
>  		return -ENOMEM;
>  
> -	return 0;
> +	return register_module_notifier(&klp_module_nb);
>  }
>  
>  module_init(klp_init);
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index c3ce106c70af..226dd5b80997 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -833,10 +833,8 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
>  	/* Final destruction now no one is using it. */
>  	if (mod->exit != NULL)
>  		mod->exit();
> -	blocking_notifier_call_chain(&module_notify_list,
> +	blocking_notifier_call_chain_reverse(&module_notify_list,
>  				     MODULE_STATE_GOING, mod);
> -	klp_module_going(mod);
> -	ftrace_release_mod(mod);
>  
>  	async_synchronize_full();
>  
> @@ -3135,10 +3133,8 @@ static noinline int do_init_module(struct module *mod)
>  	mod->state = MODULE_STATE_GOING;
>  	synchronize_rcu();
>  	module_put(mod);
> -	blocking_notifier_call_chain(&module_notify_list,
> +	blocking_notifier_call_chain_reverse(&module_notify_list,
>  				     MODULE_STATE_GOING, mod);
> -	klp_module_going(mod);
> -	ftrace_release_mod(mod);
>  	free_module(mod);
>  	wake_up_all(&module_wq);
>  

The patch unexpectedly leaves a call to ftrace_free_mem() in
do_init_module().

> @@ -3281,20 +3277,14 @@ static int complete_formation(struct module *mod, struct load_info *info)
>  	return err;
>  }
>  
> -static int prepare_coming_module(struct module *mod)
> +static int prepare_module_state_transaction(struct module *mod,
> +			unsigned long val_up, unsigned long val_down)
>  {
>  	int err;
>  
> -	ftrace_module_enable(mod);
> -	err = klp_module_coming(mod);
> -	if (err)
> -		return err;
> -
>  	err = blocking_notifier_call_chain_robust(&module_notify_list,
> -			MODULE_STATE_COMING, MODULE_STATE_GOING, mod);
> +			val_up, val_down, mod);
>  	err = notifier_to_errno(err);
> -	if (err)
> -		klp_module_going(mod);
>  
>  	return err;
>  }
> @@ -3468,14 +3458,21 @@ static int load_module(struct load_info *info, const char __user *uargs,
>  	init_build_id(mod, info);
>  
>  	/* Ftrace init must be called in the MODULE_STATE_UNFORMED state */
> -	ftrace_module_init(mod);
> +	err = prepare_module_state_transaction(mod,
> +				MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);

I believe val_down should be MODULE_STATE_GOING to reverse the
operation. Why is the new state MODULE_STATE_FORMED needed here?

> +	if (err)
> +		goto ddebug_cleanup;
>  
>  	/* Finally it's fully formed, ready to start executing. */
>  	err = complete_formation(mod, info);
> -	if (err)
> +	if (err) {
> +		blocking_notifier_call_chain_reverse(&module_notify_list,
> +				MODULE_STATE_FORMED, mod);
>  		goto ddebug_cleanup;
> +	}
>  
> -	err = prepare_coming_module(mod);
> +	err = prepare_module_state_transaction(mod,
> +				MODULE_STATE_COMING, MODULE_STATE_GOING);
>  	if (err)
>  		goto bug_cleanup;
>  
> @@ -3522,7 +3519,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
>  	destroy_params(mod->kp, mod->num_kp);
>  	blocking_notifier_call_chain(&module_notify_list,
>  				     MODULE_STATE_GOING, mod);

My understanding is that all notifier chains for MODULE_STATE_GOING
should be reversed.

> -	klp_module_going(mod);
>   bug_cleanup:
>  	mod->state = MODULE_STATE_GOING;
>  	/* module_bug_cleanup needs module_mutex protection */

The patch removes the klp_module_going() cleanup call in load_module().
Similarly, the ftrace_release_mod() call under the ddebug_cleanup label
should be removed and appropriately replaced with a cleanup via
a notifier.

> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 8df69e702706..efedb98d3db4 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -5241,6 +5241,44 @@ static int __init ftrace_mod_cmd_init(void)
>  }
>  core_initcall(ftrace_mod_cmd_init);
>  
> +static int ftrace_module_callback(struct notifier_block *nb, unsigned long op,
> +			void *module)
> +{
> +	struct module *mod = module;
> +
> +	switch (op) {
> +	case MODULE_STATE_UNFORMED:
> +		ftrace_module_init(mod);
> +		break;
> +	case MODULE_STATE_COMING:
> +		ftrace_module_enable(mod);
> +		break;
> +	case MODULE_STATE_LIVE:
> +		ftrace_free_mem(mod, mod->mem[MOD_INIT_TEXT].base,
> +				mod->mem[MOD_INIT_TEXT].base + mod->mem[MOD_INIT_TEXT].size);
> +		break;
> +	case MODULE_STATE_GOING:
> +	case MODULE_STATE_FORMED:
> +		ftrace_release_mod(mod);
> +		break;
> +	default:
> +		break;
> +	}

ftrace_module_init(), ftrace_module_enable(), ftrace_free_mem() and
ftrace_release_mod() should be newly used only in kernel/trace/ftrace.c
where they are also defined. The functions can then be made static and
removed from include/linux/ftrace.h.

Nit: The default case in the switch can be removed.

> +
> +	return notifier_from_errno(0);

Nit: This can be simply "return NOTIFY_OK;".

> +}
> +
> +static struct notifier_block ftrace_module_nb = {
> +	.notifier_call = ftrace_module_callback,
> +	.priority = MODULE_NOTIFIER_PRIO_HIGH
> +};
> +
> +static int __init ftrace_register_module_notifier(void)
> +{
> +	return register_module_notifier(&ftrace_module_nb);
> +}
> +core_initcall(ftrace_register_module_notifier);
> +
>  static void function_trace_probe_call(unsigned long ip, unsigned long parent_ip,
>  				      struct ftrace_ops *op, struct ftrace_regs *fregs)
>  {

-- 
Thanks,
Petr

^ permalink raw reply

* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
From: Alexei Starovoitov @ 2026-04-14 14:33 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, Network Development, open list:DOCUMENTATION, LKML
In-Reply-To: <20260414105702.248310-1-jiayuan.chen@linux.dev>

On Tue, Apr 14, 2026 at 3:57 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
> to inject custom TCP header options. When the kernel builds a TCP packet,
> it calls tcp_established_options() to calculate the header size, which
> invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
> callback.
>
> If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
> __tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
> tcp_current_mss(), which calls tcp_established_options() again,
> re-triggering the same BPF callback. This creates an infinite recursion
> that exhausts the kernel stack and causes a panic.
>
> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>   -> bpf_setsockopt(TCP_NODELAY)
>         -> tcp_push_pending_frames()
>           -> tcp_current_mss()
>                 -> tcp_established_options()
>                   -> bpf_skops_hdr_opt_len()
>                            /* infinite recursion */
>                         -> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>
> A similar reentrancy issue exists for TCP congestion control, which is
> guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
> tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
> bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
> bpf_setsockopt(TCP_NODELAY) calls that would trigger
> tcp_push_pending_frames() and cause the recursion.
>
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> Fixes: 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>  Documentation/networking/net_cachelines/tcp_sock.rst |  1 +
>  include/linux/tcp.h                                  | 11 ++++++++++-
>  net/core/filter.c                                    |  4 ++++
>  net/ipv4/tcp_minisocks.c                             |  1 +
>  net/ipv4/tcp_output.c                                |  3 +++
>  5 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> index 563daea10d6c..07d3226d90cc 100644
> --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> @@ -152,6 +152,7 @@ unsigned_int                  keepalive_intvl
>  int                           linger2
>  u8                            bpf_sock_ops_cb_flags
>  u8:1                          bpf_chg_cc_inprogress
> +u8:1                          bpf_hdr_opt_len_cb_inprogress
>  u16                           timeout_rehash
>  u32                           rcv_ooopack
>  u32                           rcv_rtt_last_tsecr
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f72eef31fa23..2bfb73cf922e 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -475,12 +475,21 @@ struct tcp_sock {
>         u8      bpf_sock_ops_cb_flags;  /* Control calling BPF programs
>                                          * values defined in uapi/linux/tcp.h
>                                          */
> -       u8      bpf_chg_cc_inprogress:1; /* In the middle of
> +       u8      bpf_chg_cc_inprogress:1, /* In the middle of
>                                           * bpf_setsockopt(TCP_CONGESTION),
>                                           * it is to avoid the bpf_tcp_cc->init()
>                                           * to recur itself by calling
>                                           * bpf_setsockopt(TCP_CONGESTION, "itself").
>                                           */
> +               bpf_hdr_opt_len_cb_inprogress:1; /* It is set before invoking the
> +                                                 * callback so that a nested
> +                                                 * bpf_setsockopt(TCP_NODELAY) or
> +                                                 * bpf_setsockopt(TCP_CORK) cannot
> +                                                 * trigger tcp_push_pending_frames(),
> +                                                 * which would call tcp_current_mss()
> +                                                 * -> bpf_skops_hdr_opt_len(), causing
> +                                                 * infinite recursion.

Let's not add new bits.
Reuse existing and test/check all in one place,
like commit 061ff040710e9 did.

pw-bot: cr

^ permalink raw reply

* Re: [RFC PATCH v4 00/19] Support socket access-control
From: Mickaël Salaün @ 2026-04-14 14:27 UTC (permalink / raw)
  To: Mikhail Ivanov
  Cc: gnoack, willemdebruijn.kernel, matthieu, linux-security-module,
	netdev, netfilter-devel, yusongping, artem.kuzin,
	konstantin.meskhidze
In-Reply-To: <ca9b74f3-ce72-1d7f-c922-be1b276b69a8@huawei-partners.com>

On Mon, Apr 13, 2026 at 08:11:48PM +0300, Mikhail Ivanov wrote:
> On 4/8/2026 1:26 PM, Mickaël Salaün wrote:
> > Hi Mikhail,
> 
> Hi!
> 
> > 
> > On Tue, Nov 18, 2025 at 09:46:20PM +0800, Mikhail Ivanov wrote:
> > > Hello! This is v4 RFC patch dedicated to socket protocols restriction.
> > > 
> > > It is based on the landlock's mic-next branch on top of Linux 6.16-rc2
> > > kernel version.
> > > 
> > > Objective
> > > =========
> > > Extend Landlock with a mechanism to restrict any set of protocols in
> > > a sandboxed process.
> > > 
> > > Closes: https://github.com/landlock-lsm/linux/issues/6
> > > 
> > > Motivation
> > > ==========
> > > Landlock implements the `LANDLOCK_RULE_NET_PORT` rule type, which provides
> > > fine-grained control of actions for a specific protocol. Any action or
> > > protocol that is not supported by this rule can not be controlled. As a
> > > result, protocols for which fine-grained control is not supported can be
> > > used in a sandboxed system and lead to vulnerabilities or unexpected
> > > behavior.
> > > 
> > > Controlling the protocols used will allow to use only those that are
> > > necessary for the system and/or which have fine-grained Landlock control
> > > through others types of rules (e.g. TCP bind/connect control with
> > > `LANDLOCK_RULE_NET_PORT`, UNIX bind control with
> > > `LANDLOCK_RULE_PATH_BENEATH`).
> > > 
> > > Consider following examples:
> > > * Server may want to use only TCP sockets for which there is fine-grained
> > >    control of bind(2) and connect(2) actions [1].
> > > * System that does not need a network or that may want to disable network
> > >    for security reasons (e.g. [2]) can achieve this by restricting the use
> > >    of all possible protocols.
> > > 
> > > [1] https://lore.kernel.org/all/ZJvy2SViorgc+cZI@google.com/
> > > [2] https://cr.yp.to/unix/disablenetwork.html
> > > 
> > > Implementation
> > > ==============
> > > This patchset adds control over the protocols used by implementing a
> > > restriction of socket creation. This is possible thanks to the new type
> > > of rule - `LANDLOCK_RULE_SOCKET`, that allows to restrict actions on
> > > sockets, and a new access right - `LANDLOCK_ACCESS_SOCKET_CREATE`, that
> > > corresponds to user space sockets creation. The key in this rule
> > > corresponds to communication protocol signature from socket(2) syscall.
> > 
> > FYI, I sent a new patch series that adds a handled_perm field to
> > rulesets:
> > https://lore.kernel.org/all/20260312100444.2609563-6-mic@digikod.net/
> > See also the rationale:
> > https://lore.kernel.org/all/20260312100444.2609563-12-mic@digikod.net/
> > 
> > I think that would work well with the socket creation permission.  WDYT?
> 
> Agreed. AFAICS restrictions of protocols used for communication (eg.TCP)
> will complement restriction of network namespace which sandboxed process
> is pinned by LANDLOCK_PERM_NAMESPACE_ENTER permission.

I mean that socket creation restriction should use the same handled_perm
interface e.g. add a LANDLOCK_PERM_SOCKET_CREATE right with related
LANDLOCK_RULE_SOCKET rule type.

With the first RFC for handled_perm, the related rules (e.g. struct
landlock_socket_attr) don't have an allowed_access field but an
allowed_perm one instead.  The related permission would then be
LANDLOCK_PERM_SOCKET_CREATE.  WDYT?

> 
> > 
> > Do you think you'll be able to continue this work or would you like me
> > or Günther to complete the remaining last bits (while of course keeping
> > you as the main author)?
> 
> Sorry for the delay. I will finish and send patch series ASAP.

This new version should then be on top of the Landlock namespace and
capability patchset to reuse the handled_perm interface.  I plan to send
a new version by the end of the month, but this should not change the
handled_perm interface.

> 
> > 
> > 
> > > 
> > > The right to create a socket is checked in the LSM hook which is called
> > > in the __sock_create method. The following user space operations are
> > > subject to this check: socket(2), socketpair(2), io_uring(7).
> > > 
> > > `LANDLOCK_ACCESS_SOCKET_CREATE` does not restrict socket creation
> > > performed by accept(2), because created socket is used for messaging
> > > between already existing endpoints.
> > > 
> > > Design discussion
> > > ===================
> > > 1. Should `SCTP_SOCKOPT_PEELOFF` and socketpair(2) be restricted?
> > > 
> > > SCTP socket can be connected to a multiple endpoints (one-to-many
> > > relation). Calling setsockopt(2) on such socket with option
> > > `SCTP_SOCKOPT_PEELOFF` detaches one of existing connections to a separate
> > > UDP socket. This detach is currently restrictable.
> > > 
> > > Same applies for the socketpair(2) syscall. It was noted that denying
> > > usage of socketpair(2) in sandboxed environment may be not meaninful [1].
> > > 
> > > Currently both operations use general socket interface to create sockets.
> > > Therefore it's not possible to distinguish between socket(2) and those
> > > operations inside security_socket_create LSM hook which is currently
> > > used for protocols restriction. Providing such separation may require
> > > changes in socket layer (eg. in __sock_create) interface which may not be
> > > acceptable.
> > > 
> > > [1] https://lore.kernel.org/all/ZurZ7nuRRl0Zf2iM@google.com/
> > > 
> > > Code coverage
> > > =============
> > > Code coverage(gcov) report with the launch of all the landlock selftests:
> > > * security/landlock:
> > > lines......: 94.0% (1200 of 1276 lines)
> > > functions..: 95.0% (134 of 141 functions)
> > > 
> > > * security/landlock/socket.c:
> > > lines......: 100.0% (56 of 56 lines)
> > > functions..: 100.0% (5 of 5 functions)
> > > 
> > > Currently landlock-test-tools fails on mini.kernel_socket test due to lack
> > > of SMC protocol support.
> > > 
> > > General changes v3->v4
> > > ======================
> > > * Implementation
> > >    * Adds protocol field to landlock_socket_attr.
> > >    * Adds protocol masks support via wildcards values in
> > >      landlock_socket_attr.
> > >    * Changes LSM hook used from socket_post_create to socket_create.
> > >    * Changes protocol ranges acceptable by socket rules.
> > >    * Adds audit support.
> > >    * Changes ABI version to 8.
> > > * Tests
> > >    * Adds 5 new tests:
> > >      * mini.rule_with_wildcard, protocol_wildcard.access,
> > >        mini.ruleset_with_wildcards_overlap:
> > >        verify rulesets containing rules with wildcard values.
> > >      * tcp_protocol.alias_restriction: verify that Landlock doesn't
> > >        perform protocol mappings.
> > >      * audit.socket_create: tests audit denial logging.
> > >    * Squashes tests corresponding to Landlock rule adding to a single commit.
> > > * Documentation
> > >    * Refactors Documentation/userspace-api/landlock.rst.
> > > * Commits
> > >    * Rebases on mic-next.
> > >    * Refactors commits.
> > > 
> > > Previous versions
> > > =================
> > > v3: https://lore.kernel.org/all/20240904104824.1844082-1-ivanov.mikhail1@huawei-partners.com/
> > > v2: https://lore.kernel.org/all/20240524093015.2402952-1-ivanov.mikhail1@huawei-partners.com/
> > > v1: https://lore.kernel.org/all/20240408093927.1759381-1-ivanov.mikhail1@huawei-partners.com/
> > > 
> > > Mikhail Ivanov (19):
> > >    landlock: Support socket access-control
> > >    selftests/landlock: Test creating a ruleset with unknown access
> > >    selftests/landlock: Test adding a socket rule
> > >    selftests/landlock: Testing adding rule with wildcard value
> > >    selftests/landlock: Test acceptable ranges of socket rule key
> > >    landlock: Add hook on socket creation
> > >    selftests/landlock: Test basic socket restriction
> > >    selftests/landlock: Test network stack error code consistency
> > >    selftests/landlock: Test overlapped rulesets with rules of protocol
> > >      ranges
> > >    selftests/landlock: Test that kernel space sockets are not restricted
> > >    selftests/landlock: Test protocol mappings
> > >    selftests/landlock: Test socketpair(2) restriction
> > >    selftests/landlock: Test SCTP peeloff restriction
> > >    selftests/landlock: Test that accept(2) is not restricted
> > >    lsm: Support logging socket common data
> > >    landlock: Log socket creation denials
> > >    selftests/landlock: Test socket creation denial log for audit
> > >    samples/landlock: Support socket protocol restrictions
> > >    landlock: Document socket rule type support
> > > 
> > >   Documentation/userspace-api/landlock.rst      |   48 +-
> > >   include/linux/lsm_audit.h                     |    8 +
> > >   include/uapi/linux/landlock.h                 |   60 +-
> > >   samples/landlock/sandboxer.c                  |  118 +-
> > >   security/landlock/Makefile                    |    2 +-
> > >   security/landlock/access.h                    |    3 +
> > >   security/landlock/audit.c                     |   12 +
> > >   security/landlock/audit.h                     |    1 +
> > >   security/landlock/limits.h                    |    4 +
> > >   security/landlock/ruleset.c                   |   37 +-
> > >   security/landlock/ruleset.h                   |   46 +-
> > >   security/landlock/setup.c                     |    2 +
> > >   security/landlock/socket.c                    |  198 +++
> > >   security/landlock/socket.h                    |   20 +
> > >   security/landlock/syscalls.c                  |   61 +-
> > >   security/lsm_audit.c                          |    4 +
> > >   tools/testing/selftests/landlock/base_test.c  |    2 +-
> > >   tools/testing/selftests/landlock/common.h     |   14 +
> > >   tools/testing/selftests/landlock/config       |   47 +
> > >   tools/testing/selftests/landlock/net_test.c   |   11 -
> > >   .../selftests/landlock/protocols_define.h     |  169 +++
> > >   .../testing/selftests/landlock/socket_test.c  | 1169 +++++++++++++++++
> > >   22 files changed, 1990 insertions(+), 46 deletions(-)
> > >   create mode 100644 security/landlock/socket.c
> > >   create mode 100644 security/landlock/socket.h
> > >   create mode 100644 tools/testing/selftests/landlock/protocols_define.h
> > >   create mode 100644 tools/testing/selftests/landlock/socket_test.c
> > > 
> > > 
> > > base-commit: 6dde339a3df80a57ac3d780d8cfc14d9262e2acd
> > > -- 
> > > 2.34.1
> > > 
> > > 
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox