Netdev List


* Re: [PATCH net-next v4 1/2] net: pse-pd: add devm_pse_poll_helper()
From: Kory Maincent @ 2026-04-14 14:11 UTC
  To: Jakub Kicinski
  Cc: github, o.rempel, andrew+netdev, davem, edumazet, pabeni, krzk,
	krzk+dt, conor+dt, robh, netdev, linux-kernel, linux-leds, lkp
In-Reply-To: <20260414160506.4ce4d543@kmaincent-XPS-13-7390>

On Tue, 14 Apr 2026 16:05:06 +0200
Kory Maincent <kory.maincent@bootlin.com> wrote:

> On Mon, 13 Apr 2026 15:50:21 -0700
> Jakub Kicinski <kuba@kernel.org> wrote:
> 
> > This is an AI-generated review of your patch. The human sending this
> > email has considered the AI review valid, or at least plausible.
> > ---
> > net: pse-pd: add devm_pse_poll_helper()
> > 
> > Extract the common event handling loop from pse_isr() into a shared
> > pse_handle_events() function, and add a generic poll-based alternative
> > to the IRQ path for PSE controllers that lack interrupt support.
> >   
> > > Work cancellation on teardown is handled via devm_add_action_or_reset()
> > > to ensure the delayed work is cancelled before poll_notifs is freed
> > > by devres, avoiding a use-after-free when devm_pse_poll_helper() is
> > > called after devm_pse_controller_register() (devres LIFO ordering).    
> > 
> > Is it safe to rely on this devm registration order to prevent a
> > use-after-free?
> > 
> > Standard Linux driver initialization typically sets up interrupts and
> > pollers before device registration to prevent race conditions where an
> > unready device is exposed to userspace.
> > 
> > If a driver follows this standard ordering by calling the helper before
> > registration, devres will unregister the controller first. This will
> > tear down resources in pse_controller_unregister() while the poll work
> > remains active.
> > 
> > If the worker executes during this race window, could it access the
> > already-freed pcdev->pi array or push to the destroyed ntf_fifo? Would
> > it be safer to explicitly cancel asynchronous event sources within
> > pse_controller_unregister() instead?
> >   
> > > --- a/drivers/net/pse-pd/pse_core.c
> > > +++ b/drivers/net/pse-pd/pse_core.c    
> > [ ... ]  
> > > +static void pse_handle_events(struct pse_controller_dev *pcdev,
> > > +			      unsigned long *notifs,
> > > +			      unsigned long notifs_mask)
> > >  {
> > > -	struct pse_controller_dev *pcdev;
> > > -	unsigned long notifs_mask = 0;
> > > -	struct pse_irq_desc *desc;
> > > -	struct pse_irq *h = data;
> > > -	int ret, i;
> > > -
> > > -	desc = &h->desc;
> > > -	pcdev = h->pcdev;
> > > -
> > > -	/* Clear notifs mask */
> > > -	memset(h->notifs, 0, pcdev->nr_lines * sizeof(*h->notifs));
> > > -	mutex_lock(&pcdev->lock);
> > > -	ret = desc->map_event(irq, pcdev, h->notifs, &notifs_mask);
> > > -	if (ret || !notifs_mask) {
> > > -		mutex_unlock(&pcdev->lock);
> > > -		return IRQ_NONE;
> > > -	}
> > > +	int i;
> > >  
> > >  	for_each_set_bit(i, &notifs_mask, pcdev->nr_lines) {    
> > 
> > Can this lead to an out-of-bounds read on the stack if a controller
> > supports more than BITS_PER_LONG ports?
> > 
> > The notifs_mask variable is passed by value to pse_handle_events(), but
> > for_each_set_bit() takes its address. If pcdev->nr_lines is greater than
> > BITS_PER_LONG (e.g., a 48-port switch on a 32-bit architecture), the
> > macro will read past the function argument on the stack into uninitialized
> > memory.  
> 
> It seems there is a possible out-of-bounds issue in my code :/ Oops.
> Carlo, could you take a look and propose a fix? Otherwise, I'll handle it.

But currently it can't be reached, as the only driver that supports
interrupts is the TPS23881 with 8 ports.
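
For illustration only, a minimal userspace sketch of the bound in
question: when the mask is a single unsigned long, the scan has to stop
at the word width no matter how large nr_lines is. count_events() and
BITS_PER_ULONG are hypothetical names, not taken from pse_core.c.

```c
#include <limits.h>

/* Width of one unsigned long in bits (stand-in for BITS_PER_LONG). */
#define BITS_PER_ULONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/*
 * Count the lines flagged in a single-word mask. The scan is clamped to
 * the word width, so a large nr_lines never reads past the mask.
 */
unsigned int count_events(unsigned long mask, unsigned int nr_lines)
{
	unsigned int limit = nr_lines < BITS_PER_ULONG ? nr_lines : BITS_PER_ULONG;
	unsigned int i, count = 0;

	for (i = 0; i < limit; i++)
		if (mask & (1UL << i))
			count++;

	return count;
}
```

With 8 lines, as on the TPS23881, the clamp never triggers; it only
matters once nr_lines exceeds the width of the mask.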

Regards,
-- 
Köry Maincent, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com


* Re: [PATCH net-next] net: stmmac: enable RPS and RBU interrupts
From: Russell King (Oracle) @ 2026-04-14 14:13 UTC
  To: Sam Edwards
  Cc: Jakub Kicinski, Andrew Lunn, Alexandre Torgue, Andrew Lunn,
	David S. Miller, Eric Dumazet,
	moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE,
	linux-stm32, Linux Network Development Mailing List, Paolo Abeni
In-Reply-To: <CAH5Ym4i7VV53hQGY3AjAUW3B8g_ffgmw69kPhPrk2CmcRbguuQ@mail.gmail.com>

Hi Sam,

Most of this email was written this morning, but I didn't have a chance
to finish nor send it due to how busy I am.

I had also written a separate reply last night with detailed results of
what I was seeing but didn't/haven't got around to sending it. Not
currently sure whether I saved it as draft or got rid of it yet.

On Mon, Apr 13, 2026 at 02:54:30PM -0700, Sam Edwards wrote:
> On Mon, Apr 13, 2026, 11:49 Russell King (Oracle) <linux@armlinux.org.uk> wrote:
> >
> > On Mon, Apr 13, 2026 at 11:02:22AM -0700, Jakub Kicinski wrote:
> > > On Fri, 10 Apr 2026 14:07:51 +0100 Russell King (Oracle) wrote:
> > > > Since we are seeing receive buffer exhaustion on several platforms,
> > > > let's enable the interrupts so the statistics we publish via ethtool -S
> > > > actually work to aid diagnosis. I've been in two minds about whether
> > > > to send this patch, but given the problems with stmmac at the moment,
> > > > I think it should be merged.
> > >
> > > Sorry for an under-researched response but wasn't there a person trying
> > > to fix the OOM starvation issue? Who was supposed to add a timer?
> > > Is your problem also OOM related or do you suspect something else?
> >
> > It is not OOM related. I have this patch applied:
> >
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > index 131ea887bedc..614d0e10e3e6 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > @@ -5095,14 +5095,18 @@ static inline void stmmac_rx_refill(struct stmmac_priv *priv, u32 queue)
> >
> >                 if (!buf->page) {
> >                         buf->page = page_pool_alloc_pages(rx_q->page_pool, gfp);
> > -                       if (!buf->page)
> > +                       if (!buf->page) {
> > +                               netdev_err(priv->dev, "q%u: no buffer 1\n", queue);
> >                                 break;
> > +                       }
> >                 }
> >
> >                 if (priv->sph_active && !buf->sec_page) {
> >                         buf->sec_page = page_pool_alloc_pages(rx_q->page_pool, gfp);
> > -                       if (!buf->sec_page)
> > +                       if (!buf->sec_page) {
> > +                               netdev_err(priv->dev, "q%u: no buffer 2\n", queue);
> >                                 break;
> > +                       }
> >
> >                         buf->sec_addr = page_pool_get_dma_addr(buf->sec_page);
> >                 }
> >
> > and it is silent, so we are not suffering starvation of buffers.
> >
> > However, the hardware hangs during iperf3, and because it triggers the
> > MAC to stream PAUSE frames, and my network uses Netgear GS108 and GS116
> > unmanaged switches that always use flow-control between them (there's no
> > way not to) it takes down the entire network - as we've discussed
> > before. So, this problem is pretty fatal to the *entire* network.
> >
> > With this patch, the existing statistical counters for this condition
> > are incremented, and thus users can use ethtool -S to see what happened
> > and report whether they are seeing the same issue.
> >
> > Without this patch applied, there are no diagnostics from stmmac that
> > report what the state is. ethtool -d doesn't list the appropriate
> > registers (as I suspect part of the problem is the number of queues
> > is somewhat dynamic - userspace can change that configuration through
> > ethtool).
> >
> > Thus, one has to resort to using devmem2 to find out what's happened.
> > That's not user friendly.
> >
> > For me, devmem2 shows:
> >
> > Channel 0 status register:
> > Value at address 0x02491160: 0x00000484
> > bit 10: ETI early transmit interrupt - set
> > bit 9 : RWT receive watchdog - clear
> > bit 8 : RPS receive process stopped - clear
> > bit 7 : RBU receive buffer unavailable - set
> > bit 6 : RI  receive interrupt - clear
> > bit 2 : TBU transmit buffer unavailable - set
> > bit 1 : TPS transmit process stopped - clear
> > bit 0 : TI  transmit interrupt - clear
> 
> Should that reset trigger be RPS, not RBU? My understanding of these
> status bits is RBU is just "RxDMA has failed to take a frame from the
> RxFIFO" while RPS is "the RxFIFO is full." That would make RBU our
> critical threshold to start proactively refilling, and RPS the "too
> late, we lose" threshold.

That's a fine theory, but look at the channel 0 status register above,
noting that any interrupts that are raised but not enabled remain set.
RPS is not set, so RPS is not being raised, only RBU when this
condition occurs.
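
A small userspace sketch of decoding that status word, using only the
bit positions listed in the dump above. The macro and function names
are hypothetical, not the driver's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as listed in the devmem2 dump of the channel status. */
#define CH_STATUS_ETI (1u << 10) /* early transmit interrupt */
#define CH_STATUS_RWT (1u << 9)  /* receive watchdog */
#define CH_STATUS_RPS (1u << 8)  /* receive process stopped */
#define CH_STATUS_RBU (1u << 7)  /* receive buffer unavailable */
#define CH_STATUS_RI  (1u << 6)  /* receive interrupt */
#define CH_STATUS_TBU (1u << 2)  /* transmit buffer unavailable */
#define CH_STATUS_TPS (1u << 1)  /* transmit process stopped */
#define CH_STATUS_TI  (1u << 0)  /* transmit interrupt */

/* Return true if the given status bit is raised in the register value. */
bool ch_status_test(uint32_t status, uint32_t bit)
{
	return (status & bit) != 0;
}
```

For the value in the dump, 0x00000484, only ETI, RBU and TBU test as
set, and RPS tests as clear.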

> Thinking aloud: Do you suppose the RxDMA waits for a wakeup signal
> sent whenever a frame is added to RxFIFO? That might explain why the
> former never recovers once the latter is full: a manual wakeup needs
> to be sent whenever we resolve RBU. Does the .enable_dma_reception()
> op need to be implemented for dwmac5, or have you tried that already?

I've not found anything in the closest documentation I have. The Xavier
is Synopsys IP v5.0, whereas i.MX8M is v5.1 - and v5.1 compared to
previous versions reads the same for statements concerning recovering
from a RBU condition:

"In ring mode, the application should advance the Receive Descriptor
Tail Pointer register of a channel. This bit is set only when the DMA
owns the previous Rx descriptor."

I've tried expanding what happens when RBU fires, dumping some of the
receive state and the receive ring:

[   55.766199] dwc-eth-dwmac 2490000.ethernet eth0: q0: receive buffer unavailable: cur_rx=309 dirty_rx=309 last_cur_rx=245 last_cur_rx_post=309 last_dirty_rx=245 count=64 budget=64

cur_rx == dirty_rx _should_ mean that we fully refilled the ring. These
are their values at the point the RBU interrupt fires.

last_cur_rx and last_dirty_rx are the values of cur_rx/dirty_rx when
stmmac_rx() was last entered.

last_cur_rx_post is the value of cur_rx when stmmac_rx() finished
looping but before we have refilled the ring.

count is the value of count just before stmmac_rx() returns, budget is
the limit at that point.

The patch that prints errors should we fail to allocate a buffer is in
place, none of those errors fire, so we are fully repopulating the ring
each time stmmac_rx() runs.
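
To make the arithmetic explicit, here is a userspace sketch of how the
number of entries awaiting refill follows from cur_rx and dirty_rx on a
ring with wrap-around. rx_dirty() is a hypothetical helper modelled on
the description above, and the 512-entry ring size is an assumption
inferred from the descriptor dump (entries 000 to 511).

```c
/*
 * Number of ring entries the driver has consumed but not yet refilled.
 * cur_rx is the next entry to process, dirty_rx the next one to refill.
 */
unsigned int rx_dirty(unsigned int cur_rx, unsigned int dirty_rx,
		      unsigned int ring_size)
{
	if (dirty_rx <= cur_rx)
		return cur_rx - dirty_rx;

	/* cur_rx has wrapped past the end of the ring. */
	return ring_size - dirty_rx + cur_rx;
}
```

Plugging in the logged values, rx_dirty(309, 245, 512) gives 64, which
matches count=64, and rx_dirty(309, 309, 512) gives 0 after the refill.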

[   55.766785] RX descriptor ring:
[   55.766802] 000 [0x0000007fffffe000]: 0x0 0x12 0x0 0x340105ee
[   55.766826] 001 [0x0000007fffffe010]: 0x0 0x12 0x0 0x340105ee
[   55.766843] 002 [0x0000007fffffe020]: 0x0 0x12 0x0 0x340105ee
[   55.766860] 003 [0x0000007fffffe030]: 0x0 0x12 0x0 0x340105ee
...
[   55.772205] 308 [0x0000007ffffff340]: 0x0 0x12 0x0 0x340105ee
[   55.772221] 309 [0x0000007ffffff350]: 0x0 0x12 0x0 0x340105ee
[   55.772237] 310 [0x0000007ffffff360]: 0x0 0x12 0x0 0x340105ee
[   55.772253] 311 [0x0000007ffffff370]: 0x0 0x12 0x0 0x340105ee
[   55.772268] 312 [0x0000007ffffff380]: 0x0 0x12 0x0 0x340105ee
[   55.772284] 313 [0x0000007ffffff390]: 0x0 0x12 0x0 0x340105ee
[   55.772300] 314 [0x0000007ffffff3a0]: 0x0 0x12 0x0 0x340105ee
[   55.772315] 315 [0x0000007ffffff3b0]: 0x0 0x12 0x0 0x340105ee
...
[   55.775539] 511 [0x0000007ffffffff0]: 0x0 0x12 0x0 0x340105ee

Every ring entry contains the same RDES3 value, so it really is
completely full at the point RBU fires (bit 31 clear means software
owns the descriptor, and it's basically saying first/last segment,
RDES1 valid, buffer 1 length of 1518).
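
For reference, a userspace sketch of that decode. The bit positions are
assumptions (a dwmac4-style write-back descriptor layout) and should be
checked against the databook; the names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed RDES3 write-back bit layout; verify against the databook. */
#define RDES3_OWN          (1u << 31) /* set: hardware owns the descriptor */
#define RDES3_FIRST_DESC   (1u << 29)
#define RDES3_LAST_DESC    (1u << 28)
#define RDES3_RDES1_VALID  (1u << 26)
#define RDES3_PKT_LEN_MASK 0x7fffu    /* bits 14:0 */

struct rdes3_info {
	bool hw_owned;
	bool first;
	bool last;
	bool rdes1_valid;
	unsigned int len;
};

/* Split a raw RDES3 word into the fields discussed in the text. */
struct rdes3_info rdes3_decode(uint32_t rdes3)
{
	struct rdes3_info info = {
		.hw_owned    = (rdes3 & RDES3_OWN) != 0,
		.first       = (rdes3 & RDES3_FIRST_DESC) != 0,
		.last        = (rdes3 & RDES3_LAST_DESC) != 0,
		.rdes1_valid = (rdes3 & RDES3_RDES1_VALID) != 0,
		.len         = rdes3 & RDES3_PKT_LEN_MASK,
	};

	return info;
}
```

Under those assumptions, 0x340105ee decodes as software-owned, first
and last segment, RDES1 valid, with a length of 1518.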

The Rx tail pointer register contains 0xfffff3a0 which is entry 314.
The current receive descriptor address is also 0xfffff3a0. Note that
these values were obtained some time after the RBU interrupt fired
(due to the time taken for devmem2 to access every stmmac register -
I have a script that dumps the entire stmmac register state via
devmem2.)
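
The entry number follows from the register value by simple arithmetic,
sketched here in userspace C. It assumes 16-byte descriptors and a ring
whose low 32 address bits start at 0xffffe000, both read off the ring
dump above; desc_index() is a hypothetical helper.

```c
#include <stdint.h>

#define DESC_SIZE 16u /* bytes per descriptor, per the 0x10 stride in the dump */

/* Map the low 32 bits of a descriptor address to its index in the ring. */
unsigned int desc_index(uint32_t ring_base_lo, uint32_t desc_addr_lo)
{
	return (desc_addr_lo - ring_base_lo) / DESC_SIZE;
}
```

desc_index(0xffffe000, 0xfffff3a0) evaluates to 314, i.e. five entries
past cur_rx/dirty_rx (309) as logged when RBU fired.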

The other thing to note is that when looking at debugfs
stmmaceth/eth0/descriptor* (or whatever it's called, I don't have the
NX powered to look at the moment, and I didn't take a copy of it last
night) all the descriptor entries are fully repopulated with buffers
and owned by the hardware.

I've tried using devmem2 to write to the rx tail pointer to kick it
back into action, but that changes nothing. I've tried writing the
next descriptor value and previous descriptor value, but that appears
to have no effect, it steadfastly remains stuck - and as that is the
documented recovery from RBU and there's no "receive demand" register
listed in dwmac v4 or v5 documentation, there seems to be no other
documented way.

The debug registers that I provided in my previous email suggest that
the MAC is waiting for a packet, and MTL's descriptor reader is idle
(I'm guessing it would only briefly change when the tail pointer is
updated.)

Note that I have augmented the driver with more dma_rmb() + dma_wmb()
in stmmac_rx(), dwmac4_wrback_get_rx_status(), and stmmac_rx_refill()
to ensure that reads and writes to the descriptor ring are correctly
ordered. While this generally allows iperf3 to run for a few more
seconds, it doesn't solve the problem - it is very rare for iperf3
to actually complete before stmmac has taken down my entire network.

I have noticed that on some occasions I see a small number of RBU
interrupts before it falls over.

I'm not going to have much time to look at this today due to further
appointments (I also didn't yesterday - only an hour in the morning
and a bit more time late in the evening/night.) I should have more
time during the rest of the week... but that may change.

From the above, it looks like NAPI/stmmac driver isn't keeping up with
the packet flow coming from an i.MX6 platform (which is limited to
around 470Mbps due to internal SoC bus limitations.)

I'll also mention that stmmac falls apart even more if I run iperf3 -c
-R against an x86 machine that is capable of saturating the network,
so much so that the arm-smmu IOMMU throws errors even after the stmmac
hardware has been soft-reset for addresses that were in the ring
*prior* to the soft-reset occurring (stmmac is soft-reset each time the
netdev is brought up.) The only recovery from that is to reboot -
down/up the interface just spews more IOMMU errors. I don't have the
details of that to hand and I don't have enough time to re-run that
test this morning. From what I remember, the transmit side also stops
processing descriptors (one can see them accumulate in the debugfs
file,) which eventually leads to the netdev watchdog firing.

It currently looks like the stmmac v5 EQoS IP works fine only under
light packet loads. If one puts any stress on it, then the hardware
totally falls apart. This may point to an issue with the AXI bus
configuration that is specific to this platform, but that requires
further investigation.

I'll mention again, in case anyone's forgotten, that these problems
pre-date any of the cleanups I've made to stmmac. From what I remember
they are reproducible with the kernels that are supplied as part of
the nVidia BSP. Again, as I don't have access to the nVidia platform
at the moment, I can't include the details in this email.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


* [PATCH bpf v4 0/5] bpf, sockmap: Fix af_unix null-ptr-deref in proto update
From: Michal Luczaj @ 2026-04-14 14:13 UTC
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj,
	Jiayuan Chen, 钱一铭

Updating sockmap/sockhash using a unix sock races unix_stream_connect():
when sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED),
unix_peer(sk) in unix_stream_bpf_update_proto() may still return NULL.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Changes in v4:
- Circle back to v1 approach
- More details in commit messages [Martin]
- Make unix iter take the state lock [Kuniyuki]
- Link to v3: https://lore.kernel.org/r/20260306-unix-proto-update-null-ptr-deref-v3-0-2f0c7410c523@rbox.co

Changes in v3:
- Drop sparse annotations [Martin]
- Keep lock_sock() along the unix_state_lock() [Kuniyuki]
- Unify BPF iter af_unix locking [Kuniyuki, Martin]
- Link to v2: https://lore.kernel.org/r/20260207-unix-proto-update-null-ptr-deref-v2-0-9f091330e7cd@rbox.co

Changes in v2:
- Instead of probing for unix peer, make sockmap take the right lock [Martin]
- Annotate data races [Kuniyuki, Martin]
- Extend bpf unix iter selftest to attempt a deadlock
- Link to v1: https://lore.kernel.org/r/20260129-unix-proto-update-null-ptr-deref-v1-1-e1daeb7012fd@rbox.co

To: John Fastabend <john.fastabend@gmail.com>
To: Jakub Sitnicki <jakub@cloudflare.com>
To: Eric Dumazet <edumazet@google.com>
To: Kuniyuki Iwashima <kuniyu@google.com>
To: Paolo Abeni <pabeni@redhat.com>
To: Willem de Bruijn <willemb@google.com>
To: "David S. Miller" <davem@davemloft.net>
To: Jakub Kicinski <kuba@kernel.org>
To: Simon Horman <horms@kernel.org>
To: Yonghong Song <yhs@fb.com>
To: Andrii Nakryiko <andrii@kernel.org>
To: Eduard Zingerman <eddyz87@gmail.com>
To: Alexei Starovoitov <ast@kernel.org>
To: Daniel Borkmann <daniel@iogearbox.net>
To: Martin KaFai Lau <martin.lau@linux.dev>
To: Song Liu <song@kernel.org>
To: Yonghong Song <yonghong.song@linux.dev>
To: KP Singh <kpsingh@kernel.org>
To: Stanislav Fomichev <sdf@fomichev.me>
To: Hao Luo <haoluo@google.com>
To: Jiri Olsa <jolsa@kernel.org>
To: Shuah Khan <shuah@kernel.org>
To: Cong Wang <cong.wang@bytedance.com>
Cc: netdev@vger.kernel.org
Cc: bpf@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org

---
Michal Luczaj (5):
      bpf, sockmap: Annotate af_unix sock::sk_state data-races
      bpf, sockmap: Fix af_unix iter deadlock
      selftests/bpf: Extend bpf_iter_unix to attempt deadlocking
      bpf, sockmap: Fix af_unix null-ptr-deref in proto update
      bpf, sockmap: Take state lock for af_unix iter

 net/core/sock_map.c                               |  4 ++--
 net/unix/af_unix.c                                |  9 +++++----
 net/unix/unix_bpf.c                               |  3 +++
 tools/testing/selftests/bpf/progs/bpf_iter_unix.c | 10 ++++++++++
 4 files changed, 20 insertions(+), 6 deletions(-)
---
base-commit: 0f00132132937ca01a99feaf8985109a9087c9ff
change-id: 20260129-unix-proto-update-null-ptr-deref-6a2733bcbbf8

Best regards,
--  
Michal Luczaj <mhal@rbox.co>


* [PATCH bpf v4 3/5] selftests/bpf: Extend bpf_iter_unix to attempt deadlocking
From: Michal Luczaj @ 2026-04-14 14:13 UTC
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj,
	Jiayuan Chen
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-0-2af6fe97918e@rbox.co>

Updating a sockmap from a unix iterator prog may lead to a deadlock.
Piggyback on the original selftest.

Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 tools/testing/selftests/bpf/progs/bpf_iter_unix.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_unix.c b/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
index fea275df9e22..a2652c8c3616 100644
--- a/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_unix.c
@@ -7,6 +7,13 @@
 
 char _license[] SEC("license") = "GPL";
 
+SEC(".maps") struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u64);
+} sockmap;
+
 static long sock_i_ino(const struct sock *sk)
 {
 	const struct socket *sk_socket = sk->sk_socket;
@@ -76,5 +83,8 @@ int dump_unix(struct bpf_iter__unix *ctx)
 
 	BPF_SEQ_PRINTF(seq, "\n");
 
+	/* Test for deadlock. */
+	bpf_map_update_elem(&sockmap, &(int){0}, sk, 0);
+
 	return 0;
 }

-- 
2.53.0



* [PATCH bpf v4 1/5] bpf, sockmap: Annotate af_unix sock::sk_state data-races
From: Michal Luczaj @ 2026-04-14 14:13 UTC
  To: John Fastabend, Jakub Sitnicki, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, David S. Miller, Jakub Kicinski,
	Simon Horman, Yonghong Song, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj,
	Jiayuan Chen
In-Reply-To: <20260414-unix-proto-update-null-ptr-deref-v4-0-2af6fe97918e@rbox.co>

sock_map_sk_state_allowed() and sock_map_redirect_allowed() read af_unix
socket sk_state locklessly.

Use READ_ONCE(). Note that the sock_map_redirect_allowed() change affects
not only af_unix, but all non-TCP sockets (UDP, af_vsock).
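
As a rough userspace illustration of the annotation (not the kernel
implementation): the READ_ONCE() stand-in below relies on the GNU
__typeof__ extension, struct fake_sock and unix_state_allowed() are
hypothetical, and the state values mirror include/net/tcp_states.h.

```c
/* Userspace stand-in for the kernel's READ_ONCE(); needs GNU __typeof__. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* State values mirroring include/net/tcp_states.h. */
enum { TCP_ESTABLISHED = 1, TCP_LISTEN = 10 };
#define TCPF_ESTABLISHED (1 << TCP_ESTABLISHED)

/* Hypothetical cut-down socket carrying only the state byte. */
struct fake_sock {
	unsigned char sk_state;
};

/* Same test shape as the af_unix branch of sock_map_sk_state_allowed(). */
int unix_state_allowed(const struct fake_sock *sk)
{
	/* One volatile load of a field another context may be writing. */
	return ((1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED) != 0;
}
```

The volatile access only forces the compiler to perform a single load;
it does not order the read against anything else.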

Suggested-by: Kuniyuki Iwashima <kuniyu@google.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/core/sock_map.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index b0e96337a269..02a68be3002a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -530,7 +530,7 @@ static bool sock_map_redirect_allowed(const struct sock *sk)
 	if (sk_is_tcp(sk))
 		return sk->sk_state != TCP_LISTEN;
 	else
-		return sk->sk_state == TCP_ESTABLISHED;
+		return READ_ONCE(sk->sk_state) == TCP_ESTABLISHED;
 }
 
 static bool sock_map_sk_is_suitable(const struct sock *sk)
@@ -543,7 +543,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
 	if (sk_is_tcp(sk))
 		return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
 	if (sk_is_stream_unix(sk))
-		return (1 << sk->sk_state) & TCPF_ESTABLISHED;
+		return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
 	if (sk_is_vsock(sk) &&
 	    (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET))
 		return (1 << sk->sk_state) & TCPF_ESTABLISHED;

-- 
2.53.0



* Re: [syzbot] [lvs?] BUG: sleeping function called from invalid context in ip_vs_conn_expire
From: Julian Anastasov @ 2026-04-14 14:18 UTC
  To: Jiayuan Chen
  Cc: syzbot, coreteam, davem, edumazet, fw, horms, kuba, linux-kernel,
	lvs-devel, netdev, netfilter-devel, pabeni, pablo, phil,
	syzkaller-bugs
In-Reply-To: <927be094-315b-48ab-8e89-45bbe9183d5b@linux.dev>

[-- Attachment #1: Type: text/plain, Size: 4838 bytes --]


	Hello,

On Tue, 14 Apr 2026, Jiayuan Chen wrote:

> 
> On 4/14/26 6:30 PM, syzbot wrote:
> 
> [...]
> 
> > if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+504e778ddaecd36fdd17@syzkaller.appspotmail.com
> >
> > BUG: sleeping function called from invalid context at
> > kernel/locking/spinlock_rt.c:48
> 
> 
> 
> The problem occurs under PREEMPT_RT, where pairing conn_tab_lock with
> spin_lock is what breaks:
> 
>     conn_tab_lock(...) -> hlist_bl_lock -> preempt_disable()
>         ==> disables preemption
>     spin_lock(&cp->lock) -> rt_mutex
>         ==> sleepable under RT, but preemption is already disabled by
>             conn_tab_lock

	I guess spin_lock(&cp->lock), which sleeps under
PREEMPT_RT, should not be called under a bit spinlock.
I'll check it soon...
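
A toy userspace model of the rule being violated, purely as an
illustration: a lock that may sleep must not be entered while
preemption is disabled. All names here are hypothetical and none of
this is kernel code.

```c
#include <stdbool.h>

struct toy_ctx {
	int preempt_count; /* > 0 means preemption is disabled */
	bool bug_reported; /* set when a sleeping lock is taken atomically */
};

/* Stand-in for a bit spinlock: disables preemption while held. */
void toy_bit_spin_lock(struct toy_ctx *c)
{
	c->preempt_count++;
}

void toy_bit_spin_unlock(struct toy_ctx *c)
{
	c->preempt_count--;
}

/*
 * Stand-in for spin_lock() under PREEMPT_RT: it may sleep, so taking it
 * with preemption disabled is flagged, much like the might-sleep check.
 */
void toy_rt_spin_lock(struct toy_ctx *c)
{
	if (c->preempt_count > 0)
		c->bug_reported = true;
}
```

Calling toy_rt_spin_lock() between toy_bit_spin_lock() and
toy_bit_spin_unlock() sets bug_reported, which is the nesting the
report above complains about.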

> > in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 16, name: ktimers/0
> > preempt_count: 2, expected: 0
> > RCU nest depth: 3, expected: 3
> > 8 locks held by ktimers/0/16:
> >   #0: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at:
> >   __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
> >   #1: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at:
> >   __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
> >   #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: spin_lock
> >   include/linux/spinlock_rt.h:45 [inline]
> >   #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at:
> >   timer_base_lock_expiry kernel/time/timer.c:1502 [inline]
> >   #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at:
> >   __run_timer_base+0x120/0x9f0 kernel/time/timer.c:2384
> >   #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire
> >   include/linux/rcupdate.h:300 [inline]
> >   #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock
> >   include/linux/rcupdate.h:838 [inline]
> >   #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: __rt_spin_lock
> >   kernel/locking/spinlock_rt.c:50 [inline]
> >   #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at:
> >   rt_spin_lock+0x1e0/0x400 kernel/locking/spinlock_rt.c:57
> >   #4: ffffc90000157a80 ((&cp->timer)){+...}-{0:0}, at:
> >   call_timer_fn+0xd4/0x5e0 kernel/time/timer.c:1745
> >   #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire
> >   include/linux/rcupdate.h:300 [inline]
> >   #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock
> >   include/linux/rcupdate.h:838 [inline]
> >   #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: ip_vs_conn_unlink
> >   net/netfilter/ipvs/ip_vs_conn.c:315 [inline]
> >   #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at:
> >   ip_vs_conn_expire+0x257/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
> >   #6: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at:
> >   __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
> >   #7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: spin_lock
> >   include/linux/spinlock_rt.h:45 [inline]
> >   #7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: ip_vs_conn_unlink
> >   net/netfilter/ipvs/ip_vs_conn.c:324 [inline]
> >   #7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at:
> >   ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
> > Preemption disabled at:
> > [<ffffffff898a6358>] bit_spin_lock include/linux/bit_spinlock.h:38 [inline]
> > [<ffffffff898a6358>] hlist_bl_lock+0x18/0x110 include/linux/list_bl.h:149
> > CPU: 0 UID: 0 PID: 16 Comm: ktimers/0 Tainted: G        W    L
> > syzkaller #0 PREEMPT_{RT,(full)}
> > Tainted: [W]=WARN, [L]=SOFTLOCKUP
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 03/18/2026
> > Call Trace:
> >   <TASK>
> >   dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
> >   __might_resched+0x329/0x480 kernel/sched/core.c:9162
> >   __rt_spin_lock kernel/locking/spinlock_rt.c:48 [inline]
> >   rt_spin_lock+0xc2/0x400 kernel/locking/spinlock_rt.c:57
> >   spin_lock include/linux/spinlock_rt.h:45 [inline]
> >   ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:324 [inline]
> >   ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
> >   call_timer_fn+0x192/0x5e0 kernel/time/timer.c:1748
> >   expire_timers kernel/time/timer.c:1799 [inline]
> >   __run_timers kernel/time/timer.c:2374 [inline]
> >   __run_timer_base+0x6a3/0x9f0 kernel/time/timer.c:2386
> >   run_timer_base kernel/time/timer.c:2395 [inline]
> >   run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2405
> >   handle_softirqs+0x1de/0x6d0 kernel/softirq.c:622
> >   __do_softirq kernel/softirq.c:656 [inline]
> >   run_ktimerd+0x69/0x100 kernel/softirq.c:1151
> >   smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
> >   kthread+0x388/0x470 kernel/kthread.c:436
> >   ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
> >   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> >   </TASK>

Regards

--
Julian Anastasov <ja@ssi.bg>


* Re: [PATCH bpf v3 5/5] bpf, sockmap: Adapt for af_unix-specific lock
From: Michal Luczaj @ 2026-04-14 14:19 UTC
  To: Martin KaFai Lau
  Cc: Jiayuan Chen, John Fastabend, Jakub Sitnicki, Eric Dumazet,
	Kuniyuki Iwashima, Paolo Abeni, Willem de Bruijn, David S. Miller,
	Jakub Kicinski, Simon Horman, Yonghong Song, Andrii Nakryiko,
	Alexei Starovoitov, Daniel Borkmann, Eduard Zingerman, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Cong Wang, netdev, bpf, linux-kernel, linux-kselftest
In-Reply-To: <ac2z6IqYyIxSZFPS@devbig1721.ftw5.facebook.com>

On 4/2/26 03:34, Martin KaFai Lau wrote:
> On Wed, 01 Apr 2026 00:43:58 +0200, Michal Luczaj wrote:
>> On 3/31/26 02:20, Martin KaFai Lau wrote:
>>> On 3/30/26 4:03 PM, Michal Luczaj wrote:
>>>> On 3/26/26 07:26, Martin KaFai Lau wrote:
>>>>> On 3/15/26 4:58 PM, Michal Luczaj wrote:
>>>>>>> Besides, from looking at the may_update_sockmap(), I don't know if it is
>>>>>>> even doable (or useful) to bpf_map_update_elem(unix_sk) in
>>>>>>> tc/flow_dissector/xdp. One possible path is the SOCK_FILTER when looking
>>>>>>> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
>>>>>>> when the bpf_map_update_elem(sockmap) support was added iirc.
>>>>>>
>>>>>> What about a situation when unix_sk is stored in a sockmap, then tc prog
>>>>>> looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure it's
>>>>>> useful, but seems doable.
>>>>>
>>>>> [ Sorry for the late reply ]
>>>>>
>>>>> It is a bummer that the bpf_map_update_elem(unix_sk) path is possible
>>>>> from tc :(
>>>>>
>>>>> Then unix_state_lock() in its current form cannot be safely acquired in
>>>>> sock_map_update_elem(). It is currently a spin_lock() instead of
>>>>> spin_lock_bh().
>>>>
>>>> Is there a specific deadlock you have in your mind?
>>>
>>> e.g. unix_stream_connect() is taking unix_state_lock(). Can a tc's 
>>> ingress bpf prog call unix_state_lock()?
>>
>> Ah, right, that's the problem, thanks for explaining.
>>
>> But, as I've asked in the parallel thread, do we really need to take the
>> unix_state_lock() in sock_map_update_elem()? Taking it in
>> sock_map_update_elem_sys() fixes the null-ptr-deref and does not lead to a
>> deadlock. Taking unix_state_lock() in sock_map_update_elem() seems
>> unnecessary. Well, at least under the assumption progs can only access
>> unix_sk via the sockmap lookup.
> 
> right, sock_map_update_elem_sys() should be safe to take
> unix_state_lock().
> 
> If it is fixed by testing unix_peer(), is the TCPF_ESTABLISHED test
> in sock_map_sk_state_allowed() still useful and needed?

I don't think it's necessary. Although removing it may slightly mask the
fact that we're interested in TCP_ESTABLISHED sockets (we watch the sock's
life cycle and invoke sock_map_close() as it transitions to TCP_CLOSE).
Removing this check will also mean listening socks will be rejected not
early in sock_map_sk_state_allowed(), but deeper in
unix_stream_bpf_update_proto() (and with a different error code?).

> Also,
> please explain in detail in the commit message why testing for NULL
> without unix_state_lock() is enough.

OK, will do.

> For example, for the BPF iterator on
> sock_map, my understanding is that unix_release_sock() can still happen
> while the BPF iterator is iterating over a unix_sock. I guess a future
> unix_state_lock() in the iterator's seq_show() should be useful.

That's right. That's also why, I think, Kuniyuki was asking for
"lock_sock() + unix_state_lock() + SOCK_DEAD check" in a parallel thread.

> It will also be useful to mention what was discovered about TC + lookup
> + update_elem(&sock_map, ...) and why it is not safe to take
> unix_state_lock() in that path. Thanks.

The softirq vs. process context? Sure, I'll mention that.

Took a while (sorry), but here's v4:
https://lore.kernel.org/netdev/20260414-unix-proto-update-null-ptr-deref-v4-0-2af6fe97918e@rbox.co/



* Re: [net,PATCH v3 1/2] net: ks8851: Reinstate disabling of BHs around IRQ handler
From: Marek Vasut @ 2026-04-14 14:20 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: netdev, stable, David S. Miller, Andrew Lunn, Eric Dumazet,
	Jakub Kicinski, Nicolai Buchwitz, Paolo Abeni, Ronald Wahl,
	Yicong Hui, linux-kernel
In-Reply-To: <20260414125753.Im6GAIHn@linutronix.de>

On 4/14/26 2:57 PM, Sebastian Andrzej Siewior wrote:
> On 2026-04-14 12:32:52 [+0200], Marek Vasut wrote:
>> If CONFIG_PREEMPT_RT=y is set AND the driver executes ks8851_irq() AND
>> KSZ_ISR register bit IRQ_RXI is set AND ks8851_rx_pkts() detects that
>> there are packets in the RX FIFO, then netdev_alloc_skb_ip_align() is
>> called to allocate SKBs. If netdev_alloc_skb_ip_align() is called with
>> BH enabled, local_bh_enable() at the end of netdev_alloc_skb_ip_align()
>> will call __local_bh_enable_ip(), which will call __do_softirq(), which
>> may trigger net_tx_action() softirq, which may ultimately call the xmit
>> callback ks8851_start_xmit_par(). The ks8851_start_xmit_par() will try
>> to lock struct ks8851_net_par .lock spinlock, which is already locked
>> by ks8851_irq() from which ks8851_start_xmit_par() was called. This
>> leads to a deadlock, which is reported by the kernel, including a trace
>> listed below.
> 
> #1 [received RX packet and a] TX packet has been sent
> #2 Driver enables TX queue via netif_wake_queue() which schedules TX
>     softirq to queue packets for this device.
> #3 After spin_unlock_bh(&ks->statelock) the pending softirqs will be
>     processed
> #4 This deadlocks because of recursive locking via ks8851_net::lock in
>     ks8851_irq() and ks8851_start_xmit_par().
> 
> This is what happens since commit 0913ec336a6c0 ("net: ks8851: Fix
> deadlock with the SPI chip variant"). Before that commit the softirq
> execution will be picked up by netdev_alloc_skb_ip_align() and requires
> PREEMPT_RT and a RX packet in #1 to trigger the deadlock.

Do you want me to add this into the V4 commit message?

>> Fix the problem by disabling BH around critical sections, including the
>> IRQ handler, thus preventing the net_tx_action() softirq from triggering
>> during these critical sections. The net_tx_action() softirq is triggered
>> at the end of the IRQ handler, once all the other IRQ handler actions have
>> been completed.
>>
>>   __schedule from schedule_rtlock+0x1c/0x34
>>   schedule_rtlock from rtlock_slowlock_locked+0x548/0x904
>>   rtlock_slowlock_locked from rt_spin_lock+0x60/0x9c
>>   rt_spin_lock from ks8851_start_xmit_par+0x74/0x1a8
>>   ks8851_start_xmit_par from netdev_start_xmit+0x20/0x44
>>   netdev_start_xmit from dev_hard_start_xmit+0xd0/0x188
>>   dev_hard_start_xmit from sch_direct_xmit+0xb8/0x25c
>>   sch_direct_xmit from __qdisc_run+0x1f8/0x4ec
>>   __qdisc_run from qdisc_run+0x1c/0x28
>>   qdisc_run from net_tx_action+0x1f0/0x268
>>   net_tx_action from handle_softirqs+0x1a4/0x270
>>   handle_softirqs from __local_bh_enable_ip+0xcc/0xe0
>>   __local_bh_enable_ip from __alloc_skb+0xd8/0x128
>>   __alloc_skb from __netdev_alloc_skb+0x3c/0x19c
>>   __netdev_alloc_skb from ks8851_irq+0x388/0x4d4
>>   ks8851_irq from irq_thread_fn+0x24/0x64
>>   irq_thread_fn from irq_thread+0x178/0x28c
>>   irq_thread from kthread+0x12c/0x138
>>   kthread from ret_from_fork+0x14/0x28
> 
> The backtrace here and the description is based on an older kernel.
> However

I actually did update the backtrace in V3 with the one from current next
20260413.
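
The locking sequence discussed in this message can be modeled in plain userspace C. The sketch below is a toy single-CPU model, not driver code: every name in it (struct model, model_irq() and so on) is hypothetical, and it only assumes what the thread describes, namely that a pending TX softirq runs when the outermost BH-disabled section ends, and that the xmit callback takes the same lock as the IRQ handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-CPU model; all names are hypothetical, not ks8851 code. */
struct model {
	bool lock_held;          /* stands in for the driver lock */
	bool tx_softirq_pending; /* set when the TX queue is woken */
	bool deadlocked;         /* a recursive lock attempt was observed */
	int bh_disable_depth;    /* nesting depth of BH-disabled sections */
	int xmit_calls;          /* xmit callbacks that completed */
};

/* Models the xmit callback, which needs the driver lock. */
static void model_xmit(struct model *m)
{
	if (m->lock_held) {
		m->deadlocked = true; /* the lock owner is waiting on itself */
		return;
	}
	m->lock_held = true;
	m->xmit_calls++;
	m->lock_held = false;
}

static void model_bh_disable(struct model *m)
{
	m->bh_disable_depth++;
}

/* Pending softirqs run only when the outermost BH section ends. */
static void model_bh_enable(struct model *m)
{
	assert(m->bh_disable_depth > 0);
	if (--m->bh_disable_depth > 0)
		return;
	if (m->tx_softirq_pending) {
		m->tx_softirq_pending = false;
		model_xmit(m);
	}
}

/* Models one IRQ handler run, with or without BH disabled around it. */
static struct model model_irq(bool bh_disabled_around_handler)
{
	struct model m = { 0 };

	if (bh_disabled_around_handler)
		model_bh_disable(&m);
	m.lock_held = true;
	m.tx_softirq_pending = true; /* TX queue woken inside the handler */
	model_bh_disable(&m);        /* inner BH section inside the handler */
	model_bh_enable(&m);         /* without outer protection: softirq runs here */
	m.lock_held = false;
	if (bh_disabled_around_handler)
		model_bh_enable(&m); /* softirq runs here, after the unlock */
	return m;
}
```

In this model, model_irq(false) ends with deadlocked set, while model_irq(true) defers the xmit callback until after the lock is dropped.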


* Re: [PATCH v3 net] vsock: fix buffer size clamping order
From: Michal Luczaj @ 2026-04-14 14:22 UTC (permalink / raw)
  To: Norbert Szetei, Stefano Garzarella
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, virtualization, netdev, linux-kernel
In-Reply-To: <180118C5-8BCF-4A63-A305-4EE53A34AB9C@doyensec.com>

On 4/9/26 18:34, Norbert Szetei wrote:
> In vsock_update_buffer_size(), the buffer size was being clamped to the
> maximum first, and then to the minimum. If a user sets a minimum buffer
> size larger than the maximum, the minimum check overrides the maximum
> check, inverting the constraint.
> 
> This breaks the intended socket memory boundaries by allowing the
> vsk->buffer_size to grow beyond the configured vsk->buffer_max_size.
> 
> Fix this by checking the minimum first, and then the maximum. This
> ensures the buffer size never exceeds the buffer_max_size.
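
The effect of the clamping order described above can be reproduced outside the kernel. The following is a minimal sketch, assuming a hypothetical struct buf_limits in place of the vsock_sock fields; the two helper names are made up for illustration and are not the kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the min/max buffer size fields. */
struct buf_limits {
	uint64_t min_size;
	uint64_t max_size;
};

/* Order described as buggy: clamp to the maximum first, then the minimum. */
static uint64_t clamp_max_then_min(const struct buf_limits *l, uint64_t val)
{
	assert(l != NULL);
	if (val > l->max_size)
		val = l->max_size;
	if (val < l->min_size)
		val = l->min_size; /* can push val back above max_size */
	return val;
}

/* Order described as the fix: clamp to the minimum first, then the maximum. */
static uint64_t clamp_min_then_max(const struct buf_limits *l, uint64_t val)
{
	assert(l != NULL);
	if (val < l->min_size)
		val = l->min_size;
	if (val > l->max_size)
		val = l->max_size; /* the maximum is applied last, so it wins */
	return val;
}
```

With min_size larger than max_size, the first helper returns min_size and the second returns max_size for the same input.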

Something may be missing. After adding another setsockopt() call to your
reproducer, I still see crashes.

     SYSCHK(setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_MIN_SIZE, &min,
                       sizeof(min)));
+    SYSCHK(setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE, &min,
+                      sizeof(min)));
 }

[*] Setting buffer_min_size to 0x400000000.
[socket][0] sending...

refcount_t: saturated; leaking memory.
WARNING: lib/refcount.c:22 at refcount_warn_saturate+0x7d/0xb0, CPU#2: a.out/1478
...
refcount_t: underflow; use-after-free.
WARNING: lib/refcount.c:28 at refcount_warn_saturate+0x50/0xb0, CPU#12: kworker/12:0/80
Workqueue: vsock-loopback vsock_loopback_work
...



* Re: [RFC PATCH v4 00/19] Support socket access-control
From: Mickaël Salaün @ 2026-04-14 14:27 UTC (permalink / raw)
  To: Mikhail Ivanov
  Cc: gnoack, willemdebruijn.kernel, matthieu, linux-security-module,
	netdev, netfilter-devel, yusongping, artem.kuzin,
	konstantin.meskhidze
In-Reply-To: <ca9b74f3-ce72-1d7f-c922-be1b276b69a8@huawei-partners.com>

On Mon, Apr 13, 2026 at 08:11:48PM +0300, Mikhail Ivanov wrote:
> On 4/8/2026 1:26 PM, Mickaël Salaün wrote:
> > Hi Mikhail,
> 
> Hi!
> 
> > 
> > On Tue, Nov 18, 2025 at 09:46:20PM +0800, Mikhail Ivanov wrote:
> > > Hello! This is v4 RFC patch dedicated to socket protocols restriction.
> > > 
> > > It is based on the landlock's mic-next branch on top of Linux 6.16-rc2
> > > kernel version.
> > > 
> > > Objective
> > > =========
> > > Extend Landlock with a mechanism to restrict any set of protocols in
> > > a sandboxed process.
> > > 
> > > Closes: https://github.com/landlock-lsm/linux/issues/6
> > > 
> > > Motivation
> > > ==========
> > > Landlock implements the `LANDLOCK_RULE_NET_PORT` rule type, which provides
> > > fine-grained control of actions for a specific protocol. Any action or
> > > protocol that is not supported by this rule can not be controlled. As a
> > > result, protocols for which fine-grained control is not supported can be
> > > used in a sandboxed system and lead to vulnerabilities or unexpected
> > > behavior.
> > > 
> > > Controlling the protocols used will allow using only those that are
> > > necessary for the system and/or which have fine-grained Landlock control
> > > through other types of rules (e.g. TCP bind/connect control with
> > > `LANDLOCK_RULE_NET_PORT`, UNIX bind control with
> > > `LANDLOCK_RULE_PATH_BENEATH`).
> > > 
> > > Consider the following examples:
> > > * Server may want to use only TCP sockets for which there is fine-grained
> > >    control of bind(2) and connect(2) actions [1].
> > > * System that does not need a network or that may want to disable network
> > >    for security reasons (e.g. [2]) can achieve this by restricting the use
> > >    of all possible protocols.
> > > 
> > > [1] https://lore.kernel.org/all/ZJvy2SViorgc+cZI@google.com/
> > > [2] https://cr.yp.to/unix/disablenetwork.html
> > > 
> > > Implementation
> > > ==============
> > > This patchset adds control over the protocols used by implementing a
> > > restriction of socket creation. This is possible thanks to the new type
> > > of rule - `LANDLOCK_RULE_SOCKET`, that allows restricting actions on
> > > sockets, and a new access right - `LANDLOCK_ACCESS_SOCKET_CREATE`, that
> > > corresponds to user space socket creation. The key in this rule
> > > corresponds to the communication protocol signature from the socket(2)
> > > syscall.
> > 
> > FYI, I sent a new patch series that adds a handled_perm field to
> > rulesets:
> > https://lore.kernel.org/all/20260312100444.2609563-6-mic@digikod.net/
> > See also the rationale:
> > https://lore.kernel.org/all/20260312100444.2609563-12-mic@digikod.net/
> > 
> > I think that would work well with the socket creation permission.  WDYT?
> 
> Agreed. AFAICS restricting the protocols used for communication (e.g. TCP)
> will complement restricting the network namespace to which a sandboxed
> process is pinned by the LANDLOCK_PERM_NAMESPACE_ENTER permission.

I mean that socket creation restriction should use the same handled_perm
interface e.g. add a LANDLOCK_PERM_SOCKET_CREATE right with related
LANDLOCK_RULE_SOCKET rule type.

With the first RFC for handled_perm, the related rules (e.g. struct
landlock_socket_attr) don't have an allowed_access field but an
allowed_perm one instead.  The related permission would then be
LANDLOCK_PERM_SOCKET_CREATE.  WDYT?

> 
> > 
> > Do you think you'll be able to continue this work or would you like me
> > or Günther to complete the remaining last bits (while of course keeping
> > you as the main author)?
> 
> Sorry for the delay. I will finish and send patch series ASAP.

This new version should then be on top of the Landlock namespace and
capability patchset to reuse the handled_perm interface.  I plan to send
a new version by the end of the month, but this should not change the
handled_perm interface.

> 
> > 
> > 
> > > 
> > > The right to create a socket is checked in the LSM hook which is called
> > > in the __sock_create method. The following user space operations are
> > > subject to this check: socket(2), socketpair(2), io_uring(7).
> > > 
> > > `LANDLOCK_ACCESS_SOCKET_CREATE` does not restrict socket creation
> > > performed by accept(2), because the created socket is used for messaging
> > > between already existing endpoints.
> > > 
> > > Design discussion
> > > ===================
> > > 1. Should `SCTP_SOCKOPT_PEELOFF` and socketpair(2) be restricted?
> > > 
> > > An SCTP socket can be connected to multiple endpoints (one-to-many
> > > relation). Calling setsockopt(2) on such a socket with option
> > > `SCTP_SOCKOPT_PEELOFF` detaches one of the existing connections to a separate
> > > UDP socket. This detach is currently restrictable.
> > > 
> > > The same applies to the socketpair(2) syscall. It was noted that denying
> > > usage of socketpair(2) in a sandboxed environment may not be meaningful [1].
> > > 
> > > Currently both operations use the general socket interface to create sockets.
> > > Therefore it's not possible to distinguish between socket(2) and those
> > > operations inside the security_socket_create LSM hook which is currently
> > > used for protocols restriction. Providing such separation may require
> > > changes in the socket layer (e.g. in __sock_create) interface which may not be
> > > acceptable.
> > > 
> > > [1] https://lore.kernel.org/all/ZurZ7nuRRl0Zf2iM@google.com/
> > > 
> > > Code coverage
> > > =============
> > > Code coverage (gcov) report with the launch of all the landlock selftests:
> > > * security/landlock:
> > > lines......: 94.0% (1200 of 1276 lines)
> > > functions..: 95.0% (134 of 141 functions)
> > > 
> > > * security/landlock/socket.c:
> > > lines......: 100.0% (56 of 56 lines)
> > > functions..: 100.0% (5 of 5 functions)
> > > 
> > > Currently landlock-test-tools fails on mini.kernel_socket test due to lack
> > > of SMC protocol support.
> > > 
> > > General changes v3->v4
> > > ======================
> > > * Implementation
> > >    * Adds protocol field to landlock_socket_attr.
> > >    * Adds protocol masks support via wildcards values in
> > >      landlock_socket_attr.
> > >    * Changes LSM hook used from socket_post_create to socket_create.
> > >    * Changes protocol ranges acceptable by socket rules.
> > >    * Adds audit support.
> > >    * Changes ABI version to 8.
> > > * Tests
> > >    * Adds 5 new tests:
> > >      * mini.rule_with_wildcard, protocol_wildcard.access,
> > >        mini.ruleset_with_wildcards_overlap:
> > >        verify rulesets containing rules with wildcard values.
> > >      * tcp_protocol.alias_restriction: verify that Landlock doesn't
> > >        perform protocol mappings.
> > >      * audit.socket_create: tests audit denial logging.
> > >    * Squashes tests corresponding to Landlock rule adding to a single commit.
> > > * Documentation
> > >    * Refactors Documentation/userspace-api/landlock.rst.
> > > * Commits
> > >    * Rebases on mic-next.
> > >    * Refactors commits.
> > > 
> > > Previous versions
> > > =================
> > > v3: https://lore.kernel.org/all/20240904104824.1844082-1-ivanov.mikhail1@huawei-partners.com/
> > > v2: https://lore.kernel.org/all/20240524093015.2402952-1-ivanov.mikhail1@huawei-partners.com/
> > > v1: https://lore.kernel.org/all/20240408093927.1759381-1-ivanov.mikhail1@huawei-partners.com/
> > > 
> > > Mikhail Ivanov (19):
> > >    landlock: Support socket access-control
> > >    selftests/landlock: Test creating a ruleset with unknown access
> > >    selftests/landlock: Test adding a socket rule
> > >    selftests/landlock: Testing adding rule with wildcard value
> > >    selftests/landlock: Test acceptable ranges of socket rule key
> > >    landlock: Add hook on socket creation
> > >    selftests/landlock: Test basic socket restriction
> > >    selftests/landlock: Test network stack error code consistency
> > >    selftests/landlock: Test overlapped rulesets with rules of protocol
> > >      ranges
> > >    selftests/landlock: Test that kernel space sockets are not restricted
> > >    selftests/landlock: Test protocol mappings
> > >    selftests/landlock: Test socketpair(2) restriction
> > >    selftests/landlock: Test SCTP peeloff restriction
> > >    selftests/landlock: Test that accept(2) is not restricted
> > >    lsm: Support logging socket common data
> > >    landlock: Log socket creation denials
> > >    selftests/landlock: Test socket creation denial log for audit
> > >    samples/landlock: Support socket protocol restrictions
> > >    landlock: Document socket rule type support
> > > 
> > >   Documentation/userspace-api/landlock.rst      |   48 +-
> > >   include/linux/lsm_audit.h                     |    8 +
> > >   include/uapi/linux/landlock.h                 |   60 +-
> > >   samples/landlock/sandboxer.c                  |  118 +-
> > >   security/landlock/Makefile                    |    2 +-
> > >   security/landlock/access.h                    |    3 +
> > >   security/landlock/audit.c                     |   12 +
> > >   security/landlock/audit.h                     |    1 +
> > >   security/landlock/limits.h                    |    4 +
> > >   security/landlock/ruleset.c                   |   37 +-
> > >   security/landlock/ruleset.h                   |   46 +-
> > >   security/landlock/setup.c                     |    2 +
> > >   security/landlock/socket.c                    |  198 +++
> > >   security/landlock/socket.h                    |   20 +
> > >   security/landlock/syscalls.c                  |   61 +-
> > >   security/lsm_audit.c                          |    4 +
> > >   tools/testing/selftests/landlock/base_test.c  |    2 +-
> > >   tools/testing/selftests/landlock/common.h     |   14 +
> > >   tools/testing/selftests/landlock/config       |   47 +
> > >   tools/testing/selftests/landlock/net_test.c   |   11 -
> > >   .../selftests/landlock/protocols_define.h     |  169 +++
> > >   .../testing/selftests/landlock/socket_test.c  | 1169 +++++++++++++++++
> > >   22 files changed, 1990 insertions(+), 46 deletions(-)
> > >   create mode 100644 security/landlock/socket.c
> > >   create mode 100644 security/landlock/socket.h
> > >   create mode 100644 tools/testing/selftests/landlock/protocols_define.h
> > >   create mode 100644 tools/testing/selftests/landlock/socket_test.c
> > > 
> > > 
> > > base-commit: 6dde339a3df80a57ac3d780d8cfc14d9262e2acd
> > > -- 
> > > 2.34.1
> > > 
> > > 
> 


* Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB
From: Alexei Starovoitov @ 2026-04-14 14:33 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: bpf, Quan Sun, Yinhao Hu, Kaiyan Mei, Dongliang Mu, Eric Dumazet,
	Neal Cardwell, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David Ahern, Network Development, open list:DOCUMENTATION, LKML
In-Reply-To: <20260414105702.248310-1-jiayuan.chen@linux.dev>

On Tue, Apr 14, 2026 at 3:57 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
> to inject custom TCP header options. When the kernel builds a TCP packet,
> it calls tcp_established_options() to calculate the header size, which
> invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
> callback.
>
> If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
> __tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
> tcp_current_mss(), which calls tcp_established_options() again,
> re-triggering the same BPF callback. This creates an infinite recursion
> that exhausts the kernel stack and causes a panic.
>
> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>   -> bpf_setsockopt(TCP_NODELAY)
>         -> tcp_push_pending_frames()
>           -> tcp_current_mss()
>                 -> tcp_established_options()
>                   -> bpf_skops_hdr_opt_len()
>                            /* infinite recursion */
>                         -> BPF_SOCK_OPS_HDR_OPT_LEN_CB
>
> A similar reentrancy issue exists for TCP congestion control, which is
> guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
> tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
> bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
> bpf_setsockopt(TCP_NODELAY) calls that would trigger
> tcp_push_pending_frames() and cause the recursion.
>
> Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
> Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
> Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
> Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@std.uestc.edu.cn/
> Fixes: 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>  Documentation/networking/net_cachelines/tcp_sock.rst |  1 +
>  include/linux/tcp.h                                  | 11 ++++++++++-
>  net/core/filter.c                                    |  4 ++++
>  net/ipv4/tcp_minisocks.c                             |  1 +
>  net/ipv4/tcp_output.c                                |  3 +++
>  5 files changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
> index 563daea10d6c..07d3226d90cc 100644
> --- a/Documentation/networking/net_cachelines/tcp_sock.rst
> +++ b/Documentation/networking/net_cachelines/tcp_sock.rst
> @@ -152,6 +152,7 @@ unsigned_int                  keepalive_intvl
>  int                           linger2
>  u8                            bpf_sock_ops_cb_flags
>  u8:1                          bpf_chg_cc_inprogress
> +u8:1                          bpf_hdr_opt_len_cb_inprogress
>  u16                           timeout_rehash
>  u32                           rcv_ooopack
>  u32                           rcv_rtt_last_tsecr
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index f72eef31fa23..2bfb73cf922e 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -475,12 +475,21 @@ struct tcp_sock {
>         u8      bpf_sock_ops_cb_flags;  /* Control calling BPF programs
>                                          * values defined in uapi/linux/tcp.h
>                                          */
> -       u8      bpf_chg_cc_inprogress:1; /* In the middle of
> +       u8      bpf_chg_cc_inprogress:1, /* In the middle of
>                                           * bpf_setsockopt(TCP_CONGESTION),
>                                           * it is to avoid the bpf_tcp_cc->init()
>                                           * to recur itself by calling
>                                           * bpf_setsockopt(TCP_CONGESTION, "itself").
>                                           */
> +               bpf_hdr_opt_len_cb_inprogress:1; /* It is set before invoking the
> +                                                 * callback so that a nested
> +                                                 * bpf_setsockopt(TCP_NODELAY) or
> +                                                 * bpf_setsockopt(TCP_CORK) cannot
> +                                                 * trigger tcp_push_pending_frames(),
> +                                                 * which would call tcp_current_mss()
> +                                                 * -> bpf_skops_hdr_opt_len(), causing
> +                                                 * infinite recursion.

Let's not add new bits.
Reuse existing and test/check all in one place,
like commit 061ff040710e9 did.
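
The recursion in the quoted commit message, and the in-progress flag it proposes, can be modeled in userspace C. This is only a sketch of the general reentrancy-guard idea: struct sock_model and both functions are hypothetical, the -1 return value is a placeholder rather than the errno the patch uses, and the model says nothing about which bit in struct tcp_sock should carry the flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the socket state involved; not kernel code. */
struct sock_model {
	bool cb_inprogress; /* guard: header-option-length callback is running */
	bool nodelay;       /* whether the NODELAY-style option was applied */
	int cb_calls;       /* how many times the callback ran */
	int rejected;       /* setsockopt calls refused by the guard */
};

static int model_hdr_opt_len(struct sock_model *sk);

/*
 * Models bpf_setsockopt(TCP_NODELAY): applying the option pushes pending
 * frames, which recomputes the header length and re-enters the callback.
 */
static int model_set_nodelay(struct sock_model *sk)
{
	if (sk->cb_inprogress) {
		sk->rejected++;
		return -1; /* placeholder error code, an assumption */
	}
	sk->nodelay = true;
	return model_hdr_opt_len(sk);
}

/* Models the length callback whose BPF program calls setsockopt. */
static int model_hdr_opt_len(struct sock_model *sk)
{
	assert(!sk->cb_inprogress); /* the guard must prevent re-entry */
	sk->cb_calls++;
	sk->cb_inprogress = true;
	(void)model_set_nodelay(sk); /* nested call is rejected by the guard */
	sk->cb_inprogress = false;
	return 0;
}
```

Without the cb_inprogress check, model_set_nodelay() and model_hdr_opt_len() would call each other until the stack is exhausted; with it, the callback runs once and the nested call is refused.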

pw-bot: cr


* Re: [RFC PATCH 2/2] kernel/module: Decouple klp and ftrace from load_module
From: Petr Pavlu @ 2026-04-14 14:33 UTC (permalink / raw)
  To: chensong_2000
  Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
	mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
	dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
	mcgrof, da.gomez, samitolvanen, atomlin, jpoimboe, jikos, mbenes,
	pmladek, joe.lawrence, rostedt, mhiramat, mark.rutland,
	mathieu.desnoyers, linux-modules, linux-kernel,
	linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
	live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260413080701.180976-1-chensong_2000@189.cn>

On 4/13/26 10:07 AM, chensong_2000@189.cn wrote:
> From: Song Chen <chensong_2000@189.cn>
> 
> ftrace and livepatch currently have their module load/unload callbacks
> hard-coded in the module loader as direct function calls to
> ftrace_module_enable(), klp_module_coming(), klp_module_going()
> and ftrace_release_mod(). This tight coupling was originally introduced
> to enforce strict call ordering that could not be guaranteed by the
> module notifier chain, which only supported forward traversal. Their
> notifiers were moved in and out back and forth. see [1] and [2].

I'm unclear about what is meant by the notifiers being moved back and
forth. The links point to patches that converted ftrace+klp from using
module notifiers to explicit callbacks due to ordering issues, but this
switch occurred only once. Have there been other attempts to use
notifiers again?

> 
> Now that the notifier chain supports reverse traversal via
> blocking_notifier_call_chain_reverse(), the ordering can be enforced
> purely through notifier priority. As a result, the module loader is now
> decoupled from the implementation details of ftrace and livepatch.
> What's more, adding a new subsystem with symmetric setup/teardown ordering
> requirements during module load/unload no longer requires modifying
> kernel/module/main.c; it only needs to register a notifier_block with an
> appropriate priority.
> 
> [1]:https://lore.kernel.org/all/
> 	alpine.LNX.2.00.1602172216491.22700@cbobk.fhfr.pm/
> [2]:https://lore.kernel.org/all/
> 	20160301030034.GC12120@packer-debian-8-amd64.digitalocean.com/

Nit: Avoid wrapping URLs, as it breaks autolinking and makes the links
harder to copy.

Better links would be:
[1] https://lore.kernel.org/all/1455661953-15838-1-git-send-email-jeyu@redhat.com/
[2] https://lore.kernel.org/all/1458176139-17455-1-git-send-email-jeyu@redhat.com/

The first link is the final version of what landed as commit
7dcd182bec27 ("ftrace/module: remove ftrace module notifier"). The
second is commit 7e545d6eca20 ("livepatch/module: remove livepatch
module notifier").

> 
> Signed-off-by: Song Chen <chensong_2000@189.cn>
> ---
>  include/linux/module.h  |  8 ++++++++
>  kernel/livepatch/core.c | 29 ++++++++++++++++++++++++++++-
>  kernel/module/main.c    | 34 +++++++++++++++-------------------
>  kernel/trace/ftrace.c   | 38 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 89 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 14f391b186c6..0bdd56f9defd 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -308,6 +308,14 @@ enum module_state {
>  	MODULE_STATE_COMING,	/* Full formed, running module_init. */
>  	MODULE_STATE_GOING,	/* Going away. */
>  	MODULE_STATE_UNFORMED,	/* Still setting it up. */
> +	MODULE_STATE_FORMED,

I don't see a reason to add a new module state. Why is it necessary and
how does it fit with the existing states?

> +};
> +
> +enum module_notifier_prio {
> +	MODULE_NOTIFIER_PRIO_LOW = INT_MIN,	/* Low priority, coming last, going first */
> +	MODULE_NOTIFIER_PRIO_MID = 0,	/* Normal priority. */
> +	MODULE_NOTIFIER_PRIO_SECOND_HIGH = INT_MAX - 1,	/* Second high priority, coming second. */
> +	MODULE_NOTIFIER_PRIO_HIGH = INT_MAX,	/* High priority, coming first, going late. */

I suggest being explicit about how the notifiers are ordered. For
example:

enum module_notifier_prio {
	MODULE_NOTIFIER_PRIO_NORMAL,	/* Normal priority, coming last, going first. */
	MODULE_NOTIFIER_PRIO_LIVEPATCH,
	MODULE_NOTIFIER_PRIO_FTRACE,	/* High priority, coming first, going late. */
};
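
To make the intended ordering concrete, here is a small userspace sketch of a priority-sorted chain walked forward for "coming" and backward for "going". It assumes the enum suggested above; struct nb_model and the three helpers are hypothetical, and the recursive reverse walk is only a model, not how blocking_notifier_call_chain_reverse() is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Priorities as in the suggestion above: a larger value is called earlier. */
enum module_notifier_prio {
	MODULE_NOTIFIER_PRIO_NORMAL,
	MODULE_NOTIFIER_PRIO_LIVEPATCH,
	MODULE_NOTIFIER_PRIO_FTRACE,
};

/* Hypothetical, simplified notifier block. */
struct nb_model {
	int priority;
	struct nb_model *next;
};

/* Keep the list sorted by descending priority; ties go after existing ones. */
static void chain_register(struct nb_model **head, struct nb_model *nb)
{
	assert(nb != NULL);
	while (*head && (*head)->priority >= nb->priority)
		head = &(*head)->next;
	nb->next = *head;
	*head = nb;
}

/* "Coming" order: highest priority first. Records priorities into out[]. */
static size_t call_forward(const struct nb_model *nb, int *out, size_t n)
{
	for (; nb; nb = nb->next)
		out[n++] = nb->priority;
	return n;
}

/* "Going" order: lowest priority first, the mirror of call_forward(). */
static size_t call_reverse(const struct nb_model *nb, int *out, size_t n)
{
	if (!nb)
		return n;
	n = call_reverse(nb->next, out, n);
	out[n++] = nb->priority;
	return n;
}
```

Registering three blocks in any order and walking the chain gives ftrace, livepatch, normal on the way up and normal, livepatch, ftrace on the way down, which matches the order of the explicit calls the patch removes.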

>  };
>  
>  struct mod_tree_node {
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index 28d15ba58a26..ce78bb23e24b 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -1375,13 +1375,40 @@ void *klp_find_section_by_name(const struct module *mod, const char *name,
>  }
>  EXPORT_SYMBOL_GPL(klp_find_section_by_name);
>  
> +static int klp_module_callback(struct notifier_block *nb, unsigned long op,
> +			void *module)
> +{
> +	struct module *mod = module;
> +	int err = 0;
> +
> +	switch (op) {
> +	case MODULE_STATE_COMING:
> +		err = klp_module_coming(mod);
> +		break;
> +	case MODULE_STATE_LIVE:
> +		break;
> +	case MODULE_STATE_GOING:
> +		klp_module_going(mod);
> +		break;
> +	default:
> +		break;
> +	}

klp_module_coming() and klp_module_going() are now used only in
kernel/livepatch/core.c where they are also defined. This means the
functions can be static and their declarations removed from
include/linux/livepatch.h.

Nit: The MODULE_STATE_LIVE and default cases in the switch can be
removed.

> +
> +	return notifier_from_errno(err);
> +}
> +
> +static struct notifier_block klp_module_nb = {
> +	.notifier_call = klp_module_callback,
> +	.priority = MODULE_NOTIFIER_PRIO_SECOND_HIGH
> +};
> +
>  static int __init klp_init(void)
>  {
>  	klp_root_kobj = kobject_create_and_add("livepatch", kernel_kobj);
>  	if (!klp_root_kobj)
>  		return -ENOMEM;
>  
> -	return 0;
> +	return register_module_notifier(&klp_module_nb);
>  }
>  
>  module_init(klp_init);
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index c3ce106c70af..226dd5b80997 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -833,10 +833,8 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
>  	/* Final destruction now no one is using it. */
>  	if (mod->exit != NULL)
>  		mod->exit();
> -	blocking_notifier_call_chain(&module_notify_list,
> +	blocking_notifier_call_chain_reverse(&module_notify_list,
>  				     MODULE_STATE_GOING, mod);
> -	klp_module_going(mod);
> -	ftrace_release_mod(mod);
>  
>  	async_synchronize_full();
>  
> @@ -3135,10 +3133,8 @@ static noinline int do_init_module(struct module *mod)
>  	mod->state = MODULE_STATE_GOING;
>  	synchronize_rcu();
>  	module_put(mod);
> -	blocking_notifier_call_chain(&module_notify_list,
> +	blocking_notifier_call_chain_reverse(&module_notify_list,
>  				     MODULE_STATE_GOING, mod);
> -	klp_module_going(mod);
> -	ftrace_release_mod(mod);
>  	free_module(mod);
>  	wake_up_all(&module_wq);
>  

The patch unexpectedly leaves a call to ftrace_free_mem() in
do_init_module().

> @@ -3281,20 +3277,14 @@ static int complete_formation(struct module *mod, struct load_info *info)
>  	return err;
>  }
>  
> -static int prepare_coming_module(struct module *mod)
> +static int prepare_module_state_transaction(struct module *mod,
> +			unsigned long val_up, unsigned long val_down)
>  {
>  	int err;
>  
> -	ftrace_module_enable(mod);
> -	err = klp_module_coming(mod);
> -	if (err)
> -		return err;
> -
>  	err = blocking_notifier_call_chain_robust(&module_notify_list,
> -			MODULE_STATE_COMING, MODULE_STATE_GOING, mod);
> +			val_up, val_down, mod);
>  	err = notifier_to_errno(err);
> -	if (err)
> -		klp_module_going(mod);
>  
>  	return err;
>  }
> @@ -3468,14 +3458,21 @@ static int load_module(struct load_info *info, const char __user *uargs,
>  	init_build_id(mod, info);
>  
>  	/* Ftrace init must be called in the MODULE_STATE_UNFORMED state */
> -	ftrace_module_init(mod);
> +	err = prepare_module_state_transaction(mod,
> +				MODULE_STATE_UNFORMED, MODULE_STATE_FORMED);

I believe val_down should be MODULE_STATE_GOING to reverse the
operation. Why is the new state MODULE_STATE_FORMED needed here?

> +	if (err)
> +		goto ddebug_cleanup;
>  
>  	/* Finally it's fully formed, ready to start executing. */
>  	err = complete_formation(mod, info);
> -	if (err)
> +	if (err) {
> +		blocking_notifier_call_chain_reverse(&module_notify_list,
> +				MODULE_STATE_FORMED, mod);
>  		goto ddebug_cleanup;
> +	}
>  
> -	err = prepare_coming_module(mod);
> +	err = prepare_module_state_transaction(mod,
> +				MODULE_STATE_COMING, MODULE_STATE_GOING);
>  	if (err)
>  		goto bug_cleanup;
>  
> @@ -3522,7 +3519,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
>  	destroy_params(mod->kp, mod->num_kp);
>  	blocking_notifier_call_chain(&module_notify_list,
>  				     MODULE_STATE_GOING, mod);

My understanding is that all notifier chains for MODULE_STATE_GOING
should be reversed.

> -	klp_module_going(mod);
>   bug_cleanup:
>  	mod->state = MODULE_STATE_GOING;
>  	/* module_bug_cleanup needs module_mutex protection */

The patch removes the klp_module_going() cleanup call in load_module().
Similarly, the ftrace_release_mod() call under the ddebug_cleanup label
should be removed and appropriately replaced with a cleanup via
a notifier.

> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 8df69e702706..efedb98d3db4 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -5241,6 +5241,44 @@ static int __init ftrace_mod_cmd_init(void)
>  }
>  core_initcall(ftrace_mod_cmd_init);
>  
> +static int ftrace_module_callback(struct notifier_block *nb, unsigned long op,
> +			void *module)
> +{
> +	struct module *mod = module;
> +
> +	switch (op) {
> +	case MODULE_STATE_UNFORMED:
> +		ftrace_module_init(mod);
> +		break;
> +	case MODULE_STATE_COMING:
> +		ftrace_module_enable(mod);
> +		break;
> +	case MODULE_STATE_LIVE:
> +		ftrace_free_mem(mod, mod->mem[MOD_INIT_TEXT].base,
> +				mod->mem[MOD_INIT_TEXT].base + mod->mem[MOD_INIT_TEXT].size);
> +		break;
> +	case MODULE_STATE_GOING:
> +	case MODULE_STATE_FORMED:
> +		ftrace_release_mod(mod);
> +		break;
> +	default:
> +		break;
> +	}

ftrace_module_init(), ftrace_module_enable(), ftrace_free_mem() and
ftrace_release_mod() should now be used only in kernel/trace/ftrace.c,
where they are also defined. The functions can then be made static and
removed from include/linux/ftrace.h.

Nit: The default case in the switch can be removed.

> +
> +	return notifier_from_errno(0);

Nit: This can be simply "return NOTIFY_OK;".
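
For reference, a user-space sketch of why the two are equivalent. The
constants and helpers below are reproduced from memory of
include/linux/notifier.h, so treat the exact values as an assumption to be
checked against the tree:

```c
/* Assumed to mirror include/linux/notifier.h; values quoted from memory. */
#define NOTIFY_DONE		0x0000	/* don't care */
#define NOTIFY_OK		0x0001	/* suits me */
#define NOTIFY_STOP_MASK	0x8000	/* don't call further */

/* Encode a negative errno (or 0) into a notifier return value. */
static inline int notifier_from_errno(int err)
{
	if (err)
		return NOTIFY_STOP_MASK | (NOTIFY_OK - err);

	return NOTIFY_OK;
}

/* Decode a notifier return value back into an errno (or 0). */
static inline int notifier_to_errno(int ret)
{
	ret &= ~NOTIFY_STOP_MASK;
	return ret > NOTIFY_OK ? NOTIFY_OK - ret : 0;
}
```

Under these definitions notifier_from_errno(0) takes the second return
and yields NOTIFY_OK, so the plain constant says the same thing directly.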

> +}
> +
> +static struct notifier_block ftrace_module_nb = {
> +	.notifier_call = ftrace_module_callback,
> +	.priority = MODULE_NOTIFIER_PRIO_HIGH
> +};
> +
> +static int __init ftrace_register_module_notifier(void)
> +{
> +	return register_module_notifier(&ftrace_module_nb);
> +}
> +core_initcall(ftrace_register_module_notifier);
> +
>  static void function_trace_probe_call(unsigned long ip, unsigned long parent_ip,
>  				      struct ftrace_ops *op, struct ftrace_regs *fregs)
>  {

-- 
Thanks,
Petr

* Re: [PATCH RFC bpf-next 1/8] kasan: expose generic kasan helpers
From: Alexei Starovoitov @ 2026-04-14 14:36 UTC (permalink / raw)
  To: Alexis Lothoré
  Cc: Andrey Konovalov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	John Fastabend, David S. Miller, David Ahern, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin,
	Shuah Khan, Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, Vincenzo Frascino,
	Andrew Morton, ebpf, Bastien Curutchet, Thomas Petazzoni,
	Xu Kuohai, bpf, LKML, Network Development,
	open list:KERNEL SELFTEST FRAMEWORK, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm
In-Reply-To: <DHSWK17EZUDP.GIJ6BX2NFR6U@bootlin.com>

On Tue, Apr 14, 2026 at 6:13 AM Alexis Lothoré
<alexis.lothore@bootlin.com> wrote:
>
> Hi Andrey, thanks for the prompt review !
>
> On Tue Apr 14, 2026 at 12:19 AM CEST, Andrey Konovalov wrote:
> > On Mon, Apr 13, 2026 at 8:29 PM Alexis Lothoré (eBPF Foundation)
> > <alexis.lothore@bootlin.com> wrote:
> >>
>
> [...]
>
> >> +#ifdef CONFIG_KASAN_GENERIC
> >> +void __asan_load1(void *p);
> >> +void __asan_store1(void *p);
> >> +void __asan_load2(void *p);
> >> +void __asan_store2(void *p);
> >> +void __asan_load4(void *p);
> >> +void __asan_store4(void *p);
> >> +void __asan_load8(void *p);
> >> +void __asan_store8(void *p);
> >> +void __asan_load16(void *p);
> >> +void __asan_store16(void *p);
> >> +#endif /* CONFIG_KASAN_GENERIC */
> >
> > This looks ugly, let's not do this unless it's really required.
> >
> > You can just use kasan_check_read/write() instead - these are public
> > wrappers around the same shadow memory checking functions. And they
> > also work with the SW_TAGS mode, in case the BPF would want to use
> > that mode at some point. (For HW_TAGS, we only have kasan_check_byte()
> > that checks a single byte, but it can be extended in the future if
> > required to be used by BPF.)
>
> ACK, I'll try to use those kasan_check_read and kasan_check_write rather
> than __asan_{load,store}X.

No. The performance penalty will be too high.
hw_tags won't work without corresponding JIT work.
I see no point sacrificing performance for aesthetics.
__asan_load/storeX is what compilers emit.
In that sense the JIT is a compiler; it should emit exactly the same.

* Re: linux-next: manual merge of the bpf-next tree with the origin tree
From: Mark Brown @ 2026-04-14 14:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko, bpf,
	Networking, Joel Fernandes, Kumar Kartikeya Dwivedi,
	Linux Kernel Mailing List, Linux Next Mailing List,
	Paul E. McKenney
In-Reply-To: <CAADnVQLz6-LK4+qad_XqEZXrspttpe4b49jRZ6wUCgEhJeTvgw@mail.gmail.com>

On Tue, Apr 14, 2026 at 07:09:44AM -0700, Alexei Starovoitov wrote:

> But how come you're saying it was discovered "today" ?

> Paul's commit ad6ef775cbeff was committed to rcu tree on Mar 30,
> while Kumar's 57b23c0f612dc was committed to bpf-next on Apr 7.

> "today" is April 14.

> My only explanation is that rcu tree was not in linux-next until today?!

We're in the merge window, this means things get sent to Linus and end
up in his tree.  This in turn means that they are seen in different
orders, and sometimes depending on context in different forms.  -next
is merged sequentially, it's not an octopus merge of all the trees at
once.  Some variation of this will likely have been seen before, but not
this exact combination.

* Re: [PATCH RFC bpf-next 3/8] bpf: add BPF_JIT_KASAN for KASAN instrumentation of JITed programs
From: Alexei Starovoitov @ 2026-04-14 14:38 UTC (permalink / raw)
  To: Alexis Lothoré
  Cc: Andrey Konovalov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	John Fastabend, David S. Miller, David Ahern, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin,
	Shuah Khan, Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, Vincenzo Frascino,
	Andrew Morton, ebpf, Bastien Curutchet, Thomas Petazzoni,
	Xu Kuohai, bpf, LKML, Network Development,
	open list:KERNEL SELFTEST FRAMEWORK, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm
In-Reply-To: <DHSWSSYRPUVC.2W3G3OU27L3HG@bootlin.com>

On Tue, Apr 14, 2026 at 6:24 AM Alexis Lothoré
<alexis.lothore@bootlin.com> wrote:
>
> On Tue Apr 14, 2026 at 12:20 AM CEST, Andrey Konovalov wrote:
> > On Mon, Apr 13, 2026 at 8:29 PM Alexis Lothoré (eBPF Foundation)
> > <alexis.lothore@bootlin.com> wrote:
> >>
> >> Add a new Kconfig option CONFIG_BPF_JIT_KASAN that automatically enables
> >> KASAN (Kernel Address Sanitizer) memory access checks for JIT-compiled
> >> BPF programs, when both KASAN and JIT compiler are enabled. When
> >> enabled, the JIT compiler will emit shadow memory checks before memory
> >> loads and stores to detect use-after-free, out-of-bounds, and other
> >> memory safety bugs at runtime. The option is gated behind
> >> HAVE_EBPF_JIT_KASAN, as it needs proper arch-specific implementation.
> >>
> >> Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com>
> >> ---
> >>  kernel/bpf/Kconfig | 9 +++++++++
> >>  1 file changed, 9 insertions(+)
> >>
> >> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> >> index eb3de35734f0..28392adb3d7e 100644
> >> --- a/kernel/bpf/Kconfig
> >> +++ b/kernel/bpf/Kconfig
> >> @@ -17,6 +17,10 @@ config HAVE_CBPF_JIT
> >>  config HAVE_EBPF_JIT
> >>         bool
> >>
> >> +# KASAN support for JIT compiler
> >> +config HAVE_EBPF_JIT_KASAN
> >> +       bool
> >> +
> >>  # Used by archs to tell that they want the BPF JIT compiler enabled by
> >>  # default for kernels that were compiled with BPF JIT support.
> >>  config ARCH_WANT_DEFAULT_BPF_JIT
> >> @@ -101,4 +105,9 @@ config BPF_LSM
> >>
> >>           If you are unsure how to answer this question, answer N.
> >>
> >> +config BPF_JIT_KASAN
> >> +       bool
> >> +       depends on HAVE_EBPF_JIT_KASAN
> >> +       default y if BPF_JIT && KASAN_GENERIC
> >
> > Should this be "depends on KASAN && KASAN_GENERIC"?
>
> Meaning, making it an explicit user-selectable option ?
>
> If so, the current design choice is deliberate and based on the feedback
> received on the original RFC, where it was suggested that I
> automatically enable the KASAN instrumentation in BPF programs if KASAN
> support is enabled in the kernel ([1]). But if a user-selectable toggle
> is eventually a better solution, I'm fine with changing it.

Let's not add more config knobs.
Even this patch looks redundant.
Do the instrumentation inside the JIT when KASAN_GENERIC is set.

* Re: [net,PATCH v3 1/2] net: ks8851: Reinstate disabling of BHs around IRQ handler
From: Sebastian Andrzej Siewior @ 2026-04-14 14:52 UTC (permalink / raw)
  To: Marek Vasut
  Cc: netdev, stable, David S. Miller, Andrew Lunn, Eric Dumazet,
	Jakub Kicinski, Nicolai Buchwitz, Paolo Abeni, Ronald Wahl,
	Yicong Hui, linux-kernel
In-Reply-To: <2fcfb84f-69f6-493e-94d6-95d85d8000f6@nabladev.com>

On 2026-04-14 16:20:46 [+0200], Marek Vasut wrote:
> > This is what happens since commit 0913ec336a6c0 ("net: ks8851: Fix
> > deadlock with the SPI chip variant"). Before that commit the softirq
> > execution will be picked up by netdev_alloc_skb_ip_align() and requires
> > PREEMPT_RT and a RX packet in #1 to trigger the deadlock.
> 
> Do you want me to add this into the V4 commit message ?

The description does not match the code since the commit mentioned
above.

> > > Fix the problem by disabling BH around critical sections, including the
> > > IRQ handler, thus preventing the net_tx_action() softirq from triggering
> > > during these critical sections. The net_tx_action() softirq is triggered
> > > at the end of the IRQ handler, once all the other IRQ handler actions have
> > > been completed.
> > > 
> > >   __schedule from schedule_rtlock+0x1c/0x34
> > >   schedule_rtlock from rtlock_slowlock_locked+0x548/0x904
> > >   rtlock_slowlock_locked from rt_spin_lock+0x60/0x9c
> > >   rt_spin_lock from ks8851_start_xmit_par+0x74/0x1a8
> > >   ks8851_start_xmit_par from netdev_start_xmit+0x20/0x44
> > >   netdev_start_xmit from dev_hard_start_xmit+0xd0/0x188
> > >   dev_hard_start_xmit from sch_direct_xmit+0xb8/0x25c
> > >   sch_direct_xmit from __qdisc_run+0x1f8/0x4ec
> > >   __qdisc_run from qdisc_run+0x1c/0x28
> > >   qdisc_run from net_tx_action+0x1f0/0x268
> > >   net_tx_action from handle_softirqs+0x1a4/0x270
> > >   handle_softirqs from __local_bh_enable_ip+0xcc/0xe0
> > >   __local_bh_enable_ip from __alloc_skb+0xd8/0x128
> > >   __alloc_skb from __netdev_alloc_skb+0x3c/0x19c
> > >   __netdev_alloc_skb from ks8851_irq+0x388/0x4d4
> > >   ks8851_irq from irq_thread_fn+0x24/0x64
> > >   irq_thread_fn from irq_thread+0x178/0x28c
> > >   irq_thread from kthread+0x12c/0x138
> > >   kthread from ret_from_fork+0x14/0x28
> > 
> > The backtrace here and the description is based on an older kernel.
> > However
> I actually did update the backtrace in V3 with the one from current next
> 20260413 .

That would be from yesterday and the change has been merged since v6.10. But
why is the softirq starting from __netdev_alloc_skb() instead of
spin_unlock_bh(&ks->statelock)? After that unlock, the softirq must be
processed and __netdev_alloc_skb() _could_ observe pending softirqs but
not from ks8851.

Sebastian

* Re: [PATCH v3 1/3] net: dsa: microchip: implement KSZ87xx Module 3 low-loss cable errata
From: Andrew Lunn @ 2026-04-14 14:54 UTC (permalink / raw)
  To: Fidelio LAWSON
  Cc: Marek Vasut, Woojung Huh, UNGLinuxDriver, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Marek Vasut, Maxime Chevallier, Simon Horman, Heiner Kallweit,
	Russell King, netdev, linux-kernel, Fidelio Lawson
In-Reply-To: <264667a8-bbb2-44ac-84e7-df6c506ae6fa@gmail.com>

On Tue, Apr 14, 2026 at 03:48:33PM +0200, Fidelio LAWSON wrote:
> On 4/14/26 14:40, Andrew Lunn wrote:
> > On Tue, Apr 14, 2026 at 01:05:49PM +0200, Marek Vasut wrote:
> > > On 4/14/26 11:12 AM, Fidelio Lawson wrote:
> > > > Implement the "Module 3: Equalizer fix for short cables" erratum from
> > > > Microchip document DS80000687C for KSZ87xx switches.
> > > > 
> > > > The issue affects short or low-loss cable links (e.g. CAT5e/CAT6),
> > > > where the PHY receiver equalizer may amplify high-amplitude signals
> > > > excessively, resulting in internal distortion and link establishment
> > > > failures.
> > > > 
> > > > KSZ87xx devices require a workaround for the Module 3 low-loss cable
> > > > condition, controlled through the switch TABLE_LINK_MD_V indirect
> > > > registers.
> > > > 
> > > > The affected registers are part of the switch address space and are not
> > > > directly accessible from the PHY driver. To keep the PHY-facing API
> > > > clean and avoid leaking switch-specific details, model this errata
> > > > control as vendor-specific Clause 22 PHY registers.
> > > > 
> > > > A vendor-specific Clause 22 PHY register is introduced as a mode
> > > > selector in PHY_REG_LOW_LOSS_CTRL, and ksz8_r_phy() / ksz8_w_phy()
> > > > translate accesses to these bits into the appropriate indirect
> > > > TABLE_LINK_MD_V accesses.
> > > > 
> > > > The control register defines the following modes:
> > > > 0: disabled (default behavior)
> > > > 1: EQ training workaround
> > > > 2: LPF 90 MHz
> > > > 3: LPF 62 MHz
> > > > 4: LPF 55 MHz
> > > > 5: LPF 44 MHz
> > > I may not fully understand this, but aren't the EQ and LPF settings
> > > orthogonal ?
> > 
> > What is the real life experience using this feature? Is it needed for
> > 1cm cables, but most > 1m cables are O.K with the defaults? Do we need
> > all these configuration options? How is a user supposed to discover
> > the different options? Can we simplify it down to a Boolean?
> We were seeing random link dropouts with the default settings, and since
> enabling the workaround 2, the link has remained stable and we have not
> observed any further issues.

So for you, a boolean which enables workaround 2 would be sufficient.

Marek, what is your experience?

       Andrew

* Re: [PATCH iwl-next v2 1/2] idpf: remove conditonal MBX deinit from idpf_vc_core_deinit()
From: Tantilov, Emil S @ 2026-04-14 14:56 UTC (permalink / raw)
  To: Loktionov, Aleksandr, intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org, Kitszel, Przemyslaw, Bhat, Jay,
	Barrera, Ivan D, Zaremba, Larysa, Nguyen, Anthony L,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Lobakin, Aleksander,
	linux-pci@vger.kernel.org, Chittim, Madhu, decot@google.com,
	willemb@google.com, sheenamo@google.com, lukas@wunner.de
In-Reply-To: <IA3PR11MB8986EDBC27D4267AA0F48BFBE5252@IA3PR11MB8986.namprd11.prod.outlook.com>



On 4/14/2026 4:07 AM, Loktionov, Aleksandr wrote:
> 
> 
>> -----Original Message-----
>> From: Tantilov, Emil S <emil.s.tantilov@intel.com>
>> Sent: Tuesday, April 14, 2026 5:17 AM
>> To: intel-wired-lan@lists.osuosl.org
>> Cc: netdev@vger.kernel.org; Kitszel, Przemyslaw
>> <przemyslaw.kitszel@intel.com>; Bhat, Jay <jay.bhat@intel.com>;
>> Barrera, Ivan D <ivan.d.barrera@intel.com>; Loktionov, Aleksandr
>> <aleksandr.loktionov@intel.com>; Zaremba, Larysa
>> <larysa.zaremba@intel.com>; Nguyen, Anthony L
>> <anthony.l.nguyen@intel.com>; andrew+netdev@lunn.ch;
>> davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
>> pabeni@redhat.com; Lobakin, Aleksander <aleksander.lobakin@intel.com>;
>> linux-pci@vger.kernel.org; Chittim, Madhu <madhu.chittim@intel.com>;
>> decot@google.com; willemb@google.com; sheenamo@google.com;
>> lukas@wunner.de
>> Subject: [PATCH iwl-next v2 1/2] idpf: remove conditonal MBX deinit
>> from idpf_vc_core_deinit()
> "conditonal" -> "conditional"

Doh. I will make sure it is corrected.

> 
> Everything else looks fine
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>

Thanks,
Emil

> 
>>
>> Previously it was assumed that idpf_vc_core_deinit() is always being
>> called during reset handling, with remove being an exception. Ideally
>> the driver needs to communicate the changes to FW in all instances
>> where the MBX is not already disabled. Remove the remove_in_prog check
>> from idpf_vc_core_deinit() as the MBX was already disabled while handling
>> the reset via libie_ctlq_xn_shutdown() by the service task. This is
>> also needed by the following patch, introducing PCI callbacks support.
>>
>> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
>> Reviewed-by: Jay Bhat <jay.bhat@intel.com>
>> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
>> ---
>>   drivers/net/ethernet/intel/idpf/idpf_virtchnl.c | 11 +----------
>>   1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
>> index 129c8f6b0faa..fceaf3ec1cd4 100644
>> --- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
>> +++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
>> @@ -3229,24 +3229,15 @@ int idpf_vc_core_init(struct idpf_adapter *adapter)
>>    */
>>   void idpf_vc_core_deinit(struct idpf_adapter *adapter)
>>   {
>> -	bool remove_in_prog;
>> -
>>   	if (!test_bit(IDPF_VC_CORE_INIT, adapter->flags))
>>   		return;
>>
>> -	/* Avoid transaction timeouts when called during reset */
>> -	remove_in_prog = test_bit(IDPF_REMOVE_IN_PROG, adapter->flags);
>> -	if (!remove_in_prog)
>> -		idpf_deinit_dflt_mbx(adapter);
>> -
>>   	idpf_ptp_release(adapter);
>>   	idpf_deinit_task(adapter);
>>   	idpf_idc_deinit_core_aux_device(adapter);
>>   	idpf_rel_rx_pt_lkup(adapter);
>>   	idpf_intr_rel(adapter);
>> -
>> -	if (remove_in_prog)
>> -		idpf_deinit_dflt_mbx(adapter);
>> +	idpf_deinit_dflt_mbx(adapter);
>>
>>   	cancel_delayed_work_sync(&adapter->serv_task);
>>
>> --
>> 2.37.3
> 


* Re: [PATCH net v2 1/1] net: hsr: avoid learning unknown senders for local delivery
From: Sebastian Andrzej Siewior @ 2026-04-14 14:59 UTC (permalink / raw)
  To: Felix Maurer, Ao Zhou
  Cc: netdev, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Murali Karicheri, Shaurya Rane,
	Ingo Molnar, Kees Cook, Yifan Wu, Juefei Pu, Yuan Tan, Xin Liu,
	Yuqi Xu, Haoze Xie
In-Reply-To: <adYwjxLBBaLY52Wb@thinkpad>

On 2026-04-08 12:40:15 [+0200], Felix Maurer wrote:
> IMHO, the only real way to prevent excessive resource use on our side is
> to put a limit on these resources. In this case, limit the size of the
> node table (bonus: make that limit configurable as Paolo suggested).

I am slowly catching up. There was no follow-up on this one, right?

> Thanks,
>    Felix

Sebastian

* Re: [PATCH iwl-next v2 2/2] idpf: implement pci error handlers
From: Tantilov, Emil S @ 2026-04-14 15:01 UTC (permalink / raw)
  To: Loktionov, Aleksandr, intel-wired-lan@lists.osuosl.org
  Cc: netdev@vger.kernel.org, Kitszel, Przemyslaw, Bhat, Jay,
	Barrera, Ivan D, Zaremba, Larysa, Nguyen, Anthony L,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Lobakin, Aleksander,
	linux-pci@vger.kernel.org, Chittim, Madhu, decot@google.com,
	willemb@google.com, sheenamo@google.com, lukas@wunner.de
In-Reply-To: <IA3PR11MB8986C6EC840268F14C44B28CE5252@IA3PR11MB8986.namprd11.prod.outlook.com>



On 4/14/2026 4:09 AM, Loktionov, Aleksandr wrote:
> 
> 
>> -----Original Message-----
>> From: Tantilov, Emil S <emil.s.tantilov@intel.com>
>> Sent: Tuesday, April 14, 2026 5:17 AM
>> To: intel-wired-lan@lists.osuosl.org
>> Cc: netdev@vger.kernel.org; Kitszel, Przemyslaw
>> <przemyslaw.kitszel@intel.com>; Bhat, Jay <jay.bhat@intel.com>;
>> Barrera, Ivan D <ivan.d.barrera@intel.com>; Loktionov, Aleksandr
>> <aleksandr.loktionov@intel.com>; Zaremba, Larysa
>> <larysa.zaremba@intel.com>; Nguyen, Anthony L
>> <anthony.l.nguyen@intel.com>; andrew+netdev@lunn.ch;
>> davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
>> pabeni@redhat.com; Lobakin, Aleksander <aleksander.lobakin@intel.com>;
>> linux-pci@vger.kernel.org; Chittim, Madhu <madhu.chittim@intel.com>;
>> decot@google.com; willemb@google.com; sheenamo@google.com;
>> lukas@wunner.de
>> Subject: [PATCH iwl-next v2 2/2] idpf: implement pci error handlers
>>
>> Add callbacks to handle PCI errors and FLR reset. When preparing to
>> handle reset on the bus, the driver must stop all operations that can
>> lead to MMIO access in order to prevent HW errors. To accomplish this
>> introduce helper idpf_reset_prepare() that gets called prior to FLR or
>> when PCI error is detected. Upon resume the recovery is done through
>> the existing reset path by starting the event task.
>>
>> The following callbacks are implemented:
>> .reset_prepare runs the first portion of the generic reset path
>> leading up to the part where we wait for the reset to complete.
>> .reset_done/resume runs the recovery part of the reset handling.
>> .error_detected is the callback dealing with PCI errors, similar to
>> the prepare call, we stop all operations, prior to attempting a
>> recovery.
>> .slot_reset is the callback attempting to restore the device, provided
>> a PCI reset was initiated by the AER driver.
>>
>> Whereas previously the init logic guaranteed netdevs during reset, the
>> addition of idpf_detach_and_close() to the PCI callbacks flow makes it
>> possible for the function to be called without netdevs. Add check to
>> avoid NULL pointer dereference in that case.
>>
>> Co-developed-by: Alan Brady <alan.brady@intel.com>
>> Signed-off-by: Alan Brady <alan.brady@intel.com>
>> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
>> Reviewed-by: Jay Bhat <jay.bhat@intel.com>
>> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
>> ---
>>   drivers/net/ethernet/intel/idpf/idpf.h      |   3 +
>>   drivers/net/ethernet/intel/idpf/idpf_lib.c  |  13 ++-
>>   drivers/net/ethernet/intel/idpf/idpf_main.c | 112 ++++++++++++++++++++
>>   3 files changed, 126 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/idpf/idpf.h b/drivers/net/ethernet/intel/idpf/idpf.h
>> index 1d0e32e47e87..164d2f3e233a 100644
>> --- a/drivers/net/ethernet/intel/idpf/idpf.h
>> +++ b/drivers/net/ethernet/intel/idpf/idpf.h
>> @@ -88,6 +88,7 @@ enum idpf_state {
>>    * @IDPF_REMOVE_IN_PROG: Driver remove in progress
>>    * @IDPF_MB_INTR_MODE: Mailbox in interrupt mode
>>    * @IDPF_VC_CORE_INIT: virtchnl core has been init
>> + * @IDPF_PCI_CB_RESET: Reset via the PCI callbacks
>>    * @IDPF_FLAGS_NBITS: Must be last
>>    */
>>   enum idpf_flags {
>> @@ -97,6 +98,7 @@ enum idpf_flags {
>>   	IDPF_REMOVE_IN_PROG,
>>   	IDPF_MB_INTR_MODE,
>>   	IDPF_VC_CORE_INIT,
> 
> ...
> 
>> +/**
>> + * idpf_pci_err_resume - Resume operations after PCI error recovery
>> + * @pdev: PCI device struct
>> + */
>> +static void idpf_pci_err_resume(struct pci_dev *pdev)
>> +{
>> +	struct idpf_adapter *adapter = pci_get_drvdata(pdev);
>> +
>> +	/* Force a PFR when resuming from PCI error. */
>> +	if (test_and_set_bit(IDPF_PCI_CB_RESET, adapter->flags))
>> +		adapter->dev_ops.reg_ops.trigger_reset(adapter,
>> +						       IDPF_HR_FUNC_RESET);
> You say "Force a PFR", but PFR is only triggered on the AER path, not on the FLR path.

Hence the "force" - the call to `trigger_reset` results in a PFR and is
only needed in the case of a PCI error. If this function was called
because a user issued an FLR, the kernel will trigger it for us. This
way we can reuse the reset handling path to restore the operation of the
netdevs.

Though I may be misunderstanding - are you referring to the wording or
the logic?

Thanks,
Emil

> 
> Everything else looks fine
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> 
>> +
>> +	queue_delayed_work(adapter->vc_event_wq,
>> +			   &adapter->vc_event_task,
>> +			   msecs_to_jiffies(300));
>> +}
> 
> ...
> 
>>   };
>>   module_pci_driver(idpf_driver);
>> --
>> 2.37.3
> 


* Re: [net,PATCH v3 1/2] net: ks8851: Reinstate disabling of BHs around IRQ handler
From: Jakub Kicinski @ 2026-04-14 15:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Marek Vasut, netdev, stable, David S. Miller, Andrew Lunn,
	Eric Dumazet, Nicolai Buchwitz, Paolo Abeni, Ronald Wahl,
	Yicong Hui, linux-kernel
In-Reply-To: <20260414125753.Im6GAIHn@linutronix.de>

On Tue, 14 Apr 2026 14:57:53 +0200 Sebastian Andrzej Siewior wrote:
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Maybe I'm not being forceful enough.

Putting workarounds in the drivers is unacceptable.
__netdev_alloc_skb() must be legal to call under an _irq spin lock.

* [PATCH net] net: pse-pd: fix kernel-doc function name for pse_control_find_by_id()
From: Kory Maincent @ 2026-04-14 15:09 UTC (permalink / raw)
  To: Kory Maincent (Dent Project), Jakub Kicinski, netdev,
	linux-kernel
  Cc: thomas.petazzoni, Oleksij Rempel, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni

The kernel-doc comment header incorrectly referenced the function
name pse_control_find_net_by_id() instead of the actual function name
pse_control_find_by_id(). Correct the function name in the documentation
to match the implementation.

Fixes: fc0e6db30941a ("net: pse-pd: Add support for reporting events")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
---
 drivers/net/pse-pd/pse_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/pse-pd/pse_core.c b/drivers/net/pse-pd/pse_core.c
index 2ced837f375d2..0848097ce7bf3 100644
--- a/drivers/net/pse-pd/pse_core.c
+++ b/drivers/net/pse-pd/pse_core.c
@@ -234,7 +234,7 @@ static int of_load_pse_pis(struct pse_controller_dev *pcdev)
 }
 
 /**
- * pse_control_find_net_by_id - Find net attached to the pse control id
+ * pse_control_find_by_id - Find pse_control from an id
  * @pcdev: a pointer to the PSE
  * @id: index of the PSE control
  *
-- 
2.43.0


* Re: [PATCH RFC bpf-next 1/8] kasan: expose generic kasan helpers
From: Andrey Konovalov @ 2026-04-14 15:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexis Lothoré, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	John Fastabend, David S. Miller, David Ahern, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin,
	Shuah Khan, Maxime Coquelin, Alexandre Torgue, Andrey Ryabinin,
	Alexander Potapenko, Dmitry Vyukov, Vincenzo Frascino,
	Andrew Morton, ebpf, Bastien Curutchet, Thomas Petazzoni,
	Xu Kuohai, bpf, LKML, Network Development,
	open list:KERNEL SELFTEST FRAMEWORK, linux-stm32,
	linux-arm-kernel, kasan-dev, linux-mm
In-Reply-To: <CAADnVQLJ=fJ7t1i2+_RYqU1gqYqiLP9Zrwo4vdZsgzjK_yzJTQ@mail.gmail.com>

On Tue, Apr 14, 2026 at 4:36 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> > ACK, I'll try to use those kasan_check_read and kasan_check_write rather
> > than __asan_{load,store}X.
>
> No. The performance penalty will be too high.

With using __asan_load/storeX(), it will be one function call to get
to check_region_inline(): __asan_load/storeX->check_region_inline.

With kasan_check_read/write(), right now, it would be two function
calls: __kasan_check_read->kasan_check_range->check_region_inline.

I doubt an extra function call would make a difference in terms of
performance: the shadow checking itself is also expensive.

But if the second call is a concern, we can move kasan_check_range()
and lower-level functions into mm/kasan/generic.h and include it into
shadow.c, and then it will be just one function call.

To improve performance further, the JIT compiler could emit inlined
shadow checking instructions, same as the C compiler does with
KASAN_INLINE=y.
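
For readers unfamiliar with the check, a rough user-space model of a
1-byte Generic-mode shadow lookup, which is approximately what an inlined
check would have to compute. It assumes the usual Generic KASAN encoding
(one signed shadow byte per 8-byte granule; 0 means all 8 bytes are
accessible, 1..7 means only that many leading bytes are, negative means
poisoned). is_poisoned_1() and the flat shadow array indexed by offset are
illustrative stand-ins, not the kernel's helpers:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE_SHIFT	3				/* 8-byte granules */
#define GRANULE_MASK	((1u << GRANULE_SHIFT) - 1)

/*
 * shadow[i] describes bytes [8*i, 8*i + 7] of a hypothetical region;
 * off is the byte offset of a 1-byte access into that region.
 */
static bool is_poisoned_1(const int8_t *shadow, size_t off)
{
	int8_t shadow_value = shadow[off >> GRANULE_SHIFT];

	/* Fast path: the whole granule is accessible. */
	if (shadow_value == 0)
		return false;

	/*
	 * Partially accessible (1..7) or poisoned (negative): the access
	 * is bad if it touches a byte at or beyond the accessible prefix.
	 */
	int8_t last_accessible_byte = (int8_t)(off & GRANULE_MASK);

	return last_accessible_byte >= shadow_value;
}
```

The common case is a single load and compare against zero, which is why
the cost of reaching the check (one call, two calls, or inlined) is the
part under discussion here.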

> hw_tags won't work without corresponding JIT work.

You probably meant SW_TAGS here.

HW_TAGS will likely just work without any JIT changes (even the
kasan_check_byte() thing I mentioned should not be required), assuming
JIT'ed BPF code just accesses kernel-returned pointers as is.

> I see no point sacrificing performance for aesthetics.

With the change I suggested above, there would be no performance
difference. And the code stays cleaner.

> __asan_load/storeX is what compilers emit.

For Generic mode. For SW_TAGS, the function names are different.
Keeping this detail within the KASAN code is cleaner.

> In that sense the JIT is a compiler; it should emit exactly the same.

* Re: [PATCH iwl-next v2 2/2] idpf: implement pci error handlers
From: Lukas Wunner @ 2026-04-14 15:10 UTC (permalink / raw)
  To: Loktionov, Aleksandr
  Cc: Tantilov, Emil S, intel-wired-lan@lists.osuosl.org,
	netdev@vger.kernel.org, Kitszel, Przemyslaw, Bhat, Jay,
	Barrera, Ivan D, Zaremba, Larysa, Nguyen, Anthony L,
	andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, Lobakin, Aleksander,
	linux-pci@vger.kernel.org, Chittim, Madhu, decot@google.com,
	willemb@google.com, sheenamo@google.com
In-Reply-To: <IA3PR11MB8986C6EC840268F14C44B28CE5252@IA3PR11MB8986.namprd11.prod.outlook.com>

On Tue, Apr 14, 2026 at 11:09:05AM +0000, Loktionov, Aleksandr wrote:
> > From: Tantilov, Emil S <emil.s.tantilov@intel.com>
> > .slot_reset is the callback attempting to restore the device, provided
> > a PCI reset was initiated by the AER driver.

Just for clarity, those callbacks are invoked by PCI core error handling
code and are shared by EEH, AER, DPC as well as s390 error recovery flows.
So it's not only AER.

> > +/**
> > + * idpf_pci_err_resume - Resume operations after PCI error recovery
> > + * @pdev: PCI device struct
> > + */
> > +static void idpf_pci_err_resume(struct pci_dev *pdev)
> > +{
> > +	struct idpf_adapter *adapter = pci_get_drvdata(pdev);
> > +
> > +	/* Force a PFR when resuming from PCI error. */
> > +	if (test_and_set_bit(IDPF_PCI_CB_RESET, adapter->flags))
> > +		adapter->dev_ops.reg_ops.trigger_reset(adapter,
> > +						       IDPF_HR_FUNC_RESET);
> 
> You say "Force a PFR", but PFR is only triggered on the AER path,
> not on the FLR path.

And?  idpf_pci_err_resume() is only invoked in the error recovery path
(aka AER path), not FLR path AFAICS.

Thanks,

Lukas

* Re: [RFC] Proposal: Add sysfs interface for PCIe TPH Steering Tag retrieval and configuration
From: Jason Gunthorpe @ 2026-04-14 15:11 UTC (permalink / raw)
  To: fengchengwen
  Cc: Leon Romanovsky, Bjorn Helgaas, linux-rdma, linux-pci, netdev,
	dri-devel, Keith Busch, Yochai Cohen, Yishai Hadas, Zhiping Zhang
In-Reply-To: <11eaea26-ec10-264a-db1e-951f6b46078d@huawei.com>

On Tue, Apr 14, 2026 at 10:46:00PM +0800, fengchengwen wrote:
>    We have a real platform requirement:
> 
>      * 1. Devices in TPH Device-Specific Mode with no standard ST table
>      * 2. Steering Tags must be obtained from ACPI _DSM (kernel-only)
>      * 3. Devices are fully managed by userspace drivers (VFIO/UIO)
>      * 4. Userspace must program STs into vendor-specific registers

No, this is nonsensical too.

If you want to control the steering tags for MMIO BAR memory exposed
by VFIO then the DMABUF mechanism Keith & co has been working on is
the correct approach.

If the VFIO user needs to control steering tags for the device it is
directly controlling then it must do that through VFIO ioctls.

Nobody messes around with other devices under the covers of the
operating kernel driver. Stop proposing that.

Jason
