Netdev List
* Re: [PATCH net v2 2/2] sctp: add INIT verification after cookie unpacking
From: Xin Long @ 2026-06-23 14:45 UTC
  To: Simon Horman
  Cc: netdev, linux-sctp, davem, kuba, edumazet, pabeni,
	marcelo.leitner
In-Reply-To: <20260622165200.1050676-1-horms@kernel.org>

On Mon, Jun 22, 2026 at 12:52 PM Simon Horman <horms@kernel.org> wrote:
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://netdev-ai.bots.linux.dev/sashiko/
> ---
> sctp: add INIT verification after cookie unpacking
>
> This commit adds sctp_verify_init() checks in the COOKIE-ECHO processing
> paths sctp_sf_do_5_1D_ce() and sctp_sf_do_5_2_4_dupcook() when
> cookie_auth_enable is disabled, and updates sctp_verify_init() to bound
> the parameter walk against peer_init length rather than chunk->chunk_end.
> On verification failure the new association is freed and an ABORT is
> generated via sctp_abort_on_init_err().
>
> > diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> > index 8c636f045e45..6967e889d1bd 100644
> > --- a/net/sctp/sm_statefuns.c
> > +++ b/net/sctp/sm_statefuns.c
> > @@ -650,11 +650,12 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
> >                                        struct sctp_cmd_seq *commands)
> >  {
> >       struct sctp_ulpevent *ev, *ai_ev = NULL, *auth_ev = NULL;
> > +     struct sctp_chunk *err_chk_p = NULL;
> >       struct sctp_association *new_asoc;
> >       struct sctp_init_chunk *peer_init;
> >       struct sctp_chunk *chunk = arg;
> > -     struct sctp_chunk *err_chk_p;
> >       struct sctp_chunk *repl;
> > +     enum sctp_cid cid;
> >       struct sock *sk;
> >       int error = 0;
> >
> > @@ -728,6 +729,18 @@ enum sctp_disposition sctp_sf_do_5_1D_ce(struct net *net,
> >               }
> >       }
> >
> > +     peer_init = (struct sctp_init_chunk *)(chunk->subh.cookie_hdr + 1);
> > +     cid = peer_init->chunk_hdr.type;
> > +     if (!sctp_sk(sk)->cookie_auth_enable &&
> > +         !sctp_verify_init(net, ep, asoc, cid, peer_init, chunk,
> > +                           &err_chk_p)) {
> > +             sctp_association_free(new_asoc);
> > +             return sctp_abort_on_init_err(net, ep, asoc, cid, arg, commands,
> > +                                           err_chk_p);
> > +     }
>
> [Critical]
> When cookie_auth_enable is disabled the cookie HMAC is not validated by
> sctp_unpack_cookie(), so every byte of the embedded INIT (including
> peer_init->chunk_hdr.type) is attacker-controlled.  Is it safe to feed
> that byte directly into both sctp_verify_init() and
> sctp_abort_on_init_err() as cid?
>
> Inside the COOKIE-ECHO body the embedded chunk is by RFC always an INIT.
> Could the cid argument be hard-coded to SCTP_CID_INIT (or
> peer_init->chunk_hdr.type be validated against SCTP_CID_INIT before use)
> instead of trusting the wire byte?
>
> Two attacker-reachable consequences look possible if cid is forged to
> SCTP_CID_INIT_ACK (2) and the STATE_COOKIE parameter is omitted:
>
> sctp_verify_init() then returns 0 via this branch with *errp populated:
>
>         if ((SCTP_CID_INIT_ACK == cid) && !has_cookie)
>                 return sctp_process_missing_param(asoc, SCTP_PARAM_STATE_COOKIE,
>                                                   chunk, errp);
>
> control transfers to sctp_abort_on_init_err(net, ep, asoc, cid, ...) with
> cid == SCTP_CID_INIT_ACK and a non-NULL err_chunk.  After sending the
> ABORT packet, that helper falls through to its out: label because
> cid == SCTP_CID_INIT_ACK:
>
>         if (cid != SCTP_CID_INIT_ACK) {
>                 if (!packet)
>                         return SCTP_DISPOSITION_NOMEM;
>                 return SCTP_DISPOSITION_CONSUME;
>         }
>
> out:
>         ...
>         return sctp_stop_t1_and_abort(net, commands, error, ECONNREFUSED,
>                                       asoc, chunk->transport);
>
> In the sctp_sf_do_5_1D_ce() context, asoc is typically NULL (the function
> itself opens with if (asoc && !sctp_vtag_verify(chunk, asoc))).  Can the
> SCTP_CMD_NEW_STATE / SCTP_CMD_TIMER_STOP / SCTP_CMD_INIT_FAILED commands
> queued by sctp_stop_t1_and_abort() against a NULL asoc be dispatched
> without the sideeffect interpreter dereferencing asoc?
>
> > @@ -2205,6 +2219,19 @@ enum sctp_disposition sctp_sf_do_5_2_4_dupcook(
> >       switch (action) {
> >       case 'A': /* Association restart. */
> >       case 'B': /* Collision case B. */
> > +             peer_init = (struct sctp_init_chunk *)
> > +                             (chunk->subh.cookie_hdr + 1);
> > +             cid = peer_init->chunk_hdr.type;
> > +             if (!sctp_sk(ep->base.sk)->cookie_auth_enable &&
> > +                 !sctp_verify_init(net, ep, asoc, cid, peer_init, chunk,
> > +                                   &err_chk_p)) {
> > +                     sctp_association_free(new_asoc);
> > +                     return sctp_abort_on_init_err(net, ep, asoc, cid, arg,
> > +                                                   commands, err_chk_p);
> > +             }
> > +             if (err_chk_p)
> > +                     sctp_chunk_free(err_chk_p);
> > +             fallthrough;
>
> [Critical]
> The same attacker-controlled cid is propagated here as well.  In
> sctp_sf_do_5_2_4_dupcook() asoc is the existing established association.
> With a forged cid of SCTP_CID_INIT_ACK and missing STATE_COOKIE, can a
> single unauthenticated COOKIE-ECHO drive sctp_abort_on_init_err() into
> its out: path and have sctp_stop_t1_and_abort() tear down the established
> asoc and report ECONNREFUSED to the application?
>
> Would constraining the verification to the well-defined INIT case (for
> example, hard-coding SCTP_CID_INIT or rejecting any cid !=
> SCTP_CID_INIT before calling sctp_verify_init()/sctp_abort_on_init_err())
> avoid both of these paths?

I will place a check in sctp_unpack_cookie() for such malformed packets:

diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index d5ee81934d93..8adac9e0cd66 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -1761,6 +1761,8 @@ struct sctp_association *sctp_unpack_cookie(
        bear_cookie = &cookie->c;

        ch = (struct sctp_chunkhdr *)(bear_cookie + 1);
+       if (ch->type != SCTP_CID_INIT)
+               goto malformed;
        chlen = ntohs(ch->length);
        if (chlen < sizeof(struct sctp_init_chunk))
                goto malformed;

Thanks.
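
For illustration only, here is a minimal userspace sketch (not the kernel code) of the idea behind that check: reject an embedded chunk whose type byte is not INIT before its length field is trusted. The function name check_embedded_init(), the constants and the flat byte-buffer layout are assumptions made for this example; the type values 1 and 2 and the 20-byte minimum follow the chunk layout in RFC 9260.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Chunk type values as assigned in RFC 9260 (stand-ins for SCTP_CID_*). */
enum { CID_INIT = 1, CID_INIT_ACK = 2 };

/* 4-byte chunk header plus 16 bytes of fixed INIT fields. */
#define INIT_MIN_LEN 20u

/*
 * Hypothetical helper: returns 0 if buf starts with a plausible INIT
 * chunk header, -1 if the embedded chunk is malformed.
 */
static int check_embedded_init(const uint8_t *buf, size_t avail)
{
	uint16_t chlen;

	assert(buf != NULL);

	if (avail < 4)			/* not even a chunk header */
		return -1;
	if (buf[0] != CID_INIT)		/* type byte is attacker-controlled */
		return -1;

	/* The length field is big-endian on the wire. */
	chlen = (uint16_t)((buf[2] << 8) | buf[3]);
	if (chlen < INIT_MIN_LEN || chlen > avail)
		return -1;

	return 0;
}
```

With the type rejected up front, later code that derives cid from the same byte can only ever see INIT.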


* [PATCH] qede: fix out-of-bounds check for cqe->len_list[]
From: Matvey Kovalev @ 2026-06-23 14:45 UTC
  To: Andrew Lunn
  Cc: Matvey Kovalev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Pavel Zhigulin, netdev, linux-kernel, lvc-project

Move index check before element access.

Fixes: 896f1a2493b5 ("net: qlogic/qede: fix potential out-of-bounds read in qede_tpa_cont() and qede_tpa_end()")
Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Matvey Kovalev <matvey.kovalev@ispras.ru>
---
 drivers/net/ethernet/qlogic/qede/qede_fp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index e338bfc8b7b2..33e18bb69774 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -961,7 +961,7 @@ static inline void qede_tpa_cont(struct qede_dev *edev,
 {
 	int i;
 
-	for (i = 0; cqe->len_list[i] && i < ARRAY_SIZE(cqe->len_list); i++)
+	for (i = 0; i < ARRAY_SIZE(cqe->len_list) && cqe->len_list[i]; i++)
 		qede_fill_frag_skb(edev, rxq, cqe->tpa_agg_index,
 				   le16_to_cpu(cqe->len_list[i]));
 
@@ -986,7 +986,7 @@ static int qede_tpa_end(struct qede_dev *edev,
 		dma_unmap_page(rxq->dev, tpa_info->buffer.mapping,
 			       PAGE_SIZE, rxq->data_direction);
 
-	for (i = 0; cqe->len_list[i] && i < ARRAY_SIZE(cqe->len_list); i++)
+	for (i = 0; i < ARRAY_SIZE(cqe->len_list) && cqe->len_list[i]; i++)
 		qede_fill_frag_skb(edev, rxq, cqe->tpa_agg_index,
 				   le16_to_cpu(cqe->len_list[i]));
 	if (unlikely(i > 1))
-- 
2.54.0
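
The reordering in this patch relies on && short-circuit evaluation: the right operand is evaluated only when the left one is true, so the index must be tested first. A minimal standalone sketch of the same loop shape, assuming a hypothetical helper count_frags() and a caller-supplied element count in place of ARRAY_SIZE():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: count leading non-zero entries of len_list,
 * never reading past the n elements the caller says exist.
 */
static size_t count_frags(const uint16_t *len_list, size_t n)
{
	size_t i;

	assert(len_list != NULL);

	/* Bounds check first; len_list[i] is read only when i < n. */
	for (i = 0; i < n && len_list[i]; i++)
		;

	return i;
}
```

With the operands in the original order, a list with no zero terminator would cause one read of len_list[n] before the loop stops.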



* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: K Prateek Nayak @ 2026-06-23 14:54 UTC
  To: Sebastian Andrzej Siewior, linux-arch, linux-kernel, sched-ext,
	netdev
  Cc: David S . Miller, Andrea Righi, Andrew Morton, Arnd Bergmann,
	Ben Segall, Breno Leitao, Changwoo Min, David Vernet,
	Dietmar Eggemann, Eric Dumazet, Ingo Molnar, Jakub Kicinski,
	John Ogness, Juri Lelli, Paolo Abeni, Peter Zijlstra, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623142650.265721-2-bigeasy@linutronix.de>

Hello Sebastian,

On 6/23/2026 7:56 PM, Sebastian Andrzej Siewior wrote:
> --- a/lib/bug.c
> +++ b/lib/bug.c
> @@ -196,7 +196,7 @@ void __warn_printf(const char *fmt, struct pt_regs *regs)
>  
>  static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long bugaddr, struct pt_regs *regs)
>  {
> -	bool warning, once, done, no_cut, has_args;
> +	bool warning, once, done, no_cut, has_args, deferred;
>  	const char *file, *fmt;
>  	unsigned line;
>  
> @@ -219,6 +219,7 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
>  	done     = bug->flags & BUGFLAG_DONE;
>  	no_cut   = bug->flags & BUGFLAG_NO_CUT_HERE;
>  	has_args = bug->flags & BUGFLAG_ARGS;
> +	deferred = bug->flags & BUGFLAG_DEFERRED;
>  
>  	if (warning && once) {
>  		if (done)
> @@ -229,7 +230,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
>  		 */
>  		bug->flags |= BUGFLAG_DONE;
>  	}
> -
> +	if (deferred) {
> +		preempt_disable_notrace();
> +		printk_deferred_enter();
> +	}
>  	/*
>  	 * BUG() and WARN_ON() families don't print a custom debug message
>  	 * before triggering the exception handler, so we must add the
> @@ -245,6 +249,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
>  		/* this is a WARN_ON rather than BUG/BUG_ON */
>  		__warn(file, line, (void *)bugaddr, BUG_GET_TAINT(bug), regs,
>  		       NULL);
> +		if (deferred) {
> +			printk_deferred_exit();
> +			preempt_enable_notrace();
> +		}
>  		return BUG_TRAP_TYPE_WARN;

nit.

Instead of replicating these bits, can we replace that return with a
"goto out" ...

>  	}
>  
> @@ -254,6 +262,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
>  		pr_crit("kernel BUG at %pB [verbose debug info unavailable]\n",
>  			(void *)bugaddr);
>  

out:

> +	if (deferred) {
> +		printk_deferred_exit();
> +		preempt_enable_notrace();
> +	}
>  	return BUG_TRAP_TYPE_BUG;

... and replace this return with a:

    return (warning) ? BUG_TRAP_TYPE_WARN : BUG_TRAP_TYPE_BUG;

Looks a tad bit cleaner to my eyes. Thoughts?
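
A rough userspace sketch of the single-exit shape suggested above. Everything here is a stand-in: report(), the trap_type enum and the deferred_depth counter only model the enter/exit pairing and are not the lib/bug.c code.

```c
#include <assert.h>
#include <stdbool.h>

enum trap_type { TRAP_BUG, TRAP_WARN };

/* Models the nesting that the deferred enter/exit calls must keep balanced. */
static int deferred_depth;

static void deferred_enter(void)
{
	deferred_depth++;
}

static void deferred_exit(void)
{
	assert(deferred_depth > 0);
	deferred_depth--;
}

static enum trap_type report(bool warning, bool deferred)
{
	if (deferred)
		deferred_enter();

	if (warning) {
		/* ... warning output would go here ... */
		goto out;
	}

	/* ... BUG output would go here ... */
out:
	/* One cleanup site serves both paths. */
	if (deferred)
		deferred_exit();

	return warning ? TRAP_WARN : TRAP_BUG;
}
```

The cleanup is written once, so a later change to the enter/exit pairing cannot be applied to one return path and forgotten on the other.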

>  }
>  

-- 
Thanks and Regards,
Prateek



* Re: [PATCH net-next 0/3] selftests/xsk: stabilize timeout test behavior
From: Maciej Fijalkowski @ 2026-06-23 14:58 UTC
  To: Jakub Kicinski
  Cc: Jason Xing, Tushar Vyavahare, netdev, magnus.karlsson, stfomichev,
	kernelxing, davem, pabeni, ast, daniel, tirthendu.sarkar, bpf
In-Reply-To: <ajpLuDNCu2PHS78l@boxer>

On Tue, Jun 23, 2026 at 11:02:48AM +0200, Maciej Fijalkowski wrote:
> On Mon, Jun 22, 2026 at 04:07:06PM -0700, Jakub Kicinski wrote:
> > On Wed, 17 Jun 2026 11:43:14 +0200 Maciej Fijalkowski wrote:
> > > > On Tue, Jun 16, 2026 at 11:50 PM Tushar Vyavahare
> > > > <tushar.vyavahare@intel.com> wrote:  
> > > > >
> > > > > This series improves AF_XDP selftests by making timeout handling
> > > > > explicit and fixing sources of non-determinism in xsk timeout tests.
> > > > >
> > > > > Patch 1 introduces test_spec::poll_tmout and removes implicit
> > > > > dependence on RX UMEM setup state for timeout behavior.
> > > > >
> > > > > Patch 2 fixes thread harness sequencing by attaching XDP programs
> > > > > before worker startup, removing signal-based termination, and using
> > > > > barrier synchronization only for dual-thread runs.
> > > > >
> > > > > Patch 3 restores shared_umem after POLL_TXQ_FULL so test-local
> > > > > configuration does not leak into subsequent cases on shared-netdev
> > > > > runs.
> > > > >
> > > > > Together these changes make timeout handling easier to follow and
> > > > > improve selftest stability, especially on real NIC runs.  
> > > > 
> > > > net-next is closed, but in the meantime I'll review the series ASAP.
> > > > 
> > > > BTW, another thing about selftests I had in my mind is that are you
> > > > planning to work on this [1]?  
> > > 
> > > This one is on me. I took your changes Jason and aligned ZC batching side
> > > to this behavior, followed by xskxceiver adjustment. I am planning to send
> > > this today EOD, however let's see how badly internal Sashiko will kick my
> > > ass.
> > 
> > Hi Maciej, do you want these applied? If they help make the tests less
> > flaky I think that it's fine to take them during the merge window.
> 
> Hi Jakub,
> 
> last refactor from Tushar broke BIDIRECTIONAL test case when HW is test
> target, but not on veth, so let me test these changes locally and then get
> back to you.
> 
> BPF CI runs xskxceiver on veth so this has not been caught. Seems my/our
> focus should be to enable xskxceiver HW tests on any kind of
> environment/infrastructure.
> 
> Gonna get back to you by the EOD.
> Maciej

Ah, I replied on the other thread I guess, so let me repeat:

Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>


* Re: [PATCH net-next 0/3] selftests/xsk: stabilize timeout test behavior
From: Maciej Fijalkowski @ 2026-06-23 14:56 UTC
  To: Jason Xing
  Cc: Tushar Vyavahare, netdev, magnus.karlsson, stfomichev, kernelxing,
	davem, kuba, pabeni, ast, daniel, tirthendu.sarkar, bpf
In-Reply-To: <ajJsMj0QMOF5I8qq@boxer>

On Wed, Jun 17, 2026 at 11:43:14AM +0200, Maciej Fijalkowski wrote:
> On Wed, Jun 17, 2026 at 07:39:06AM +0800, Jason Xing wrote:
> > Hi Tushar,
> > 
> > On Tue, Jun 16, 2026 at 11:50 PM Tushar Vyavahare
> > <tushar.vyavahare@intel.com> wrote:
> > >
> > > This series improves AF_XDP selftests by making timeout handling
> > > explicit and fixing sources of non-determinism in xsk timeout tests.
> > >
> > > Patch 1 introduces test_spec::poll_tmout and removes implicit
> > > dependence on RX UMEM setup state for timeout behavior.
> > >
> > > Patch 2 fixes thread harness sequencing by attaching XDP programs
> > > before worker startup, removing signal-based termination, and using
> > > barrier synchronization only for dual-thread runs.
> > >
> > > Patch 3 restores shared_umem after POLL_TXQ_FULL so test-local
> > > configuration does not leak into subsequent cases on shared-netdev
> > > runs.
> > >
> > > Together these changes make timeout handling easier to follow and
> > > improve selftest stability, especially on real NIC runs.
> > 
> > net-next is closed, but in the meantime I'll review the series ASAP.
> > 
> > BTW, another thing about selftests I had in my mind is that are you
> > planning to work on this [1]?
> 
> This one is on me. I took your changes Jason and aligned ZC batching side
> to this behavior, followed by xskxceiver adjustment. I am planning to send
> this today EOD, however let's see how badly internal Sashiko will kick my
> ass.

Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

> 
> > 
> > [1]: https://lore.kernel.org/all/20260520004244.55663-1-kerneljasonxing@gmail.com/
> > 
> > Thanks,
> > Jason
> > 
> > >
> > > Tushar Vyavahare (3):
> > >   selftests/xsk: make poll timeout mode explicit
> > >   selftests/xsk: fix timeout thread harness sequencing
> > >   selftests/xsk: restore shared_umem after POLL_TXQ_FULL
> > >
> > >  .../selftests/bpf/prog_tests/test_xsk.c       | 96 +++++++++++--------
> > >  .../selftests/bpf/prog_tests/test_xsk.h       |  2 +
> > >  2 files changed, 56 insertions(+), 42 deletions(-)
> > >
> > > --
> > > 2.43.0
> > >
> > >
> > 


* Re: [PATCH net-next v2] Documentation: net/smc: correct old value of smcr_max_recv_wr
From: Breno Leitao @ 2026-06-23 15:12 UTC
  To: Mahanta Jambigi
  Cc: andrew+netdev, davem, edumazet, kuba, pabeni, alibuda, dust.li,
	sidraya, wenjia, wintera, pasic, horms, tonylu, guwen, netdev,
	linux-s390
In-Reply-To: <20260424052336.3262350-1-mjambigi@linux.ibm.com>

On Fri, Apr 24, 2026 at 07:23:36AM +0200, Mahanta Jambigi wrote:
> The smc-sysctl.rst documentation incorrectly stated that the previous
> hardcoded maximum number of WR buffers on the receive path (smcr_max_recv_wr)
> was 16. The correct historical value used before the introduction of the sysctl
> control was 48. Update the documentation to reflect the accurate historical
> value. Also fix a couple of minor typos.
> 
> Fixes: aef3cdb47bbb net/smc: make wr buffer count configurable

This Fixes tag is broken. You probably want:

	Fixes: aef3cdb47bbb ("net/smc: make wr buffer count configurable")

Other than that, it looks good; the corrected value checks out.


* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Andrew Morton @ 2026-06-23 15:12 UTC
  To: Sebastian Andrzej Siewior
  Cc: linux-arch, linux-kernel, sched-ext, netdev, David S . Miller,
	Andrea Righi, Arnd Bergmann, Ben Segall, Breno Leitao,
	Changwoo Min, David Vernet, Dietmar Eggemann, Eric Dumazet,
	Ingo Molnar, Jakub Kicinski, John Ogness, Juri Lelli,
	K Prateek Nayak, Paolo Abeni, Peter Zijlstra, Petr Mladek,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623142650.265721-2-bigeasy@linutronix.de>

On Tue, 23 Jun 2026 16:26:49 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:

> Provide a deferred version of the WARN_ON() macro. It will delay
> flushing the console until a later context. It is needed in a context
> where the caller holds locks which can lead to a deadlock if content is
> flushed to the console driver.
> An example would be a warning from within the scheduler resulting in a
> wake-up of a task.
> 
> Deferring the output works by using printk_deferred_enter()/exit() around
> the printing output. This must be used in a context where the task can't
> migrate to another CPU. This should be the case usually, since the
> scheduler would acquire the rq lock with disabled interrupts, but to be
> safe preemption is disabled to guarantee this.
> 
> In order not to bloat the code on architectures which provide an
> optimized __WARN_FLAGS() define BUGFLAG_DEFERRED which is handled by
> __report_bug() and does not increase the code size.
> 
> Provide the DEFERRED macros based on __WARN_FLAGS and __WARN_FLAGS
> macros. Extend __report_bug() to handle the deferred case.
> 
> ...
>
> --- a/include/asm-generic/bug.h
> +++ b/include/asm-generic/bug.h
> @@ -229,7 +230,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
>  		 */
>  		bug->flags |= BUGFLAG_DONE;
>  	}
> -
> +	if (deferred) {
> +		preempt_disable_notrace();
> +		printk_deferred_enter();
> +	}

For some reason the comment over printk_deferred_enter() says
"Interrupts must be disabled for the deferred duration".  Is that the
case for all the printk_deferred_enter() calls which this patch adds?




* [PATCH net-next v3] vsock/virtio: rewrite MSG_ZEROCOPY flag handling
From: Arseniy Krasnov @ 2026-06-23 15:38 UTC
  To: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Michael S. Tsirkin,
	Jason Wang, Bobby Eshleman, Xuan Zhuo, Eugenio Pérez,
	Simon Horman
  Cc: kvm, virtualization, netdev, linux-kernel, oxffffaa, rulkc,
	Arseniy Krasnov

Logically it was based on the TCP implementation, so to make further support
easier, rewrite it in the TCP way (like in 'tcp_sendmsg_locked()'). This
patch only rewrites flag handling (i.e. it doesn't change logic).

Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>
---
 Changelog v1->v2:
 * Rebase on last 'net-next'. Don't need 'skb_zcopy_set()' now - it was
   already added.
 Changelog v2->v3:
 * Update commit message.
 * Remove one empty line.

 net/vmw_vsock/virtio_transport_common.c | 47 ++++++++++++-------------
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 09475007165b..41c2a0b82a8e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -328,38 +328,35 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
 		return pkt_len;
 
-	if (info->msg) {
-		/* If zerocopy is not enabled by 'setsockopt()', we behave as
-		 * there is no MSG_ZEROCOPY flag set.
+	if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
+		/* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
+		 * 'MSG_ZEROCOPY' flag handling here is based on the same flag
+		 * handling from 'tcp_sendmsg_locked()'.
 		 */
-		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
-			info->msg->msg_flags &= ~MSG_ZEROCOPY;
+		if (info->msg->msg_ubuf) {
+			uarg = info->msg->msg_ubuf;
+			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
+		} else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
+						    NULL, false);
+			if (!uarg) {
+				virtio_transport_put_credit(vvs, pkt_len);
+				return -ENOMEM;
+			}
 
-		if (info->msg->msg_flags & MSG_ZEROCOPY)
 			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
+			if (!can_zcopy)
+				uarg_to_msgzc(uarg)->zerocopy = 0;
 
+			have_uref = true;
+		}
+
+		/* 'can_zcopy' means that this transmission will be
+		 * in zerocopy way (e.g. using 'frags' array).
+		 */
 		if (can_zcopy)
 			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
 					    (MAX_SKB_FRAGS * PAGE_SIZE));
-
-		if (info->msg->msg_flags & MSG_ZEROCOPY &&
-		    info->op == VIRTIO_VSOCK_OP_RW) {
-			uarg = info->msg->msg_ubuf;
-
-			if (!uarg) {
-				uarg = msg_zerocopy_realloc(sk_vsock(vsk),
-							    pkt_len, NULL, false);
-				if (!uarg) {
-					virtio_transport_put_credit(vvs, pkt_len);
-					return -ENOMEM;
-				}
-
-				if (!can_zcopy)
-					uarg_to_msgzc(uarg)->zerocopy = 0;
-
-				have_uref = true;
-			}
-		}
 	}
 
 	rest_len = pkt_len;
-- 
2.25.1



* Re: [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Petr Mladek @ 2026-06-23 15:49 UTC
  To: Andrew Morton
  Cc: Sebastian Andrzej Siewior, linux-arch, linux-kernel, sched-ext,
	netdev, David S . Miller, Andrea Righi, Arnd Bergmann, Ben Segall,
	Breno Leitao, Changwoo Min, David Vernet, Dietmar Eggemann,
	Eric Dumazet, Ingo Molnar, Jakub Kicinski, John Ogness,
	Juri Lelli, K Prateek Nayak, Paolo Abeni, Peter Zijlstra,
	Sergey Senozhatsky, Simon Horman, Steven Rostedt, Tejun Heo,
	Vincent Guittot, Vlad Poenaru
In-Reply-To: <20260623081258.580e034fdb5b98f4f8dba44a@linux-foundation.org>

On Tue 2026-06-23 08:12:58, Andrew Morton wrote:
> On Tue, 23 Jun 2026 16:26:49 +0200 Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> > Provide a deferred version of the WARN_ON() macro. It will delay
> > flushing the console until a later context. It is needed in a context
> > where the caller holds locks which can lead to a deadlock if content is
> > flushed to the console driver.
> > An example would be a warning from within the scheduler resulting in a
> > wake-up of a task.
> > 
> > Deferring the output works by using printk_deferred_enter()/exit() around
> > the printing output. This must be used in a context where the task can't
> > migrate to another CPU. This should be the case usually, since the
> > scheduler would acquire the rq lock with disabled interrupts, but to be
> > safe preemption is disabled to guarantee this.
> > 
> > In order not to bloat the code on architectures which provide an
> > optimized __WARN_FLAGS() define BUGFLAG_DEFERRED which is handled by
> > __report_bug() and does not increase the code size.
> > 
> > Provide the DEFERRED macros based on __WARN_FLAGS and __WARN_FLAGS
> > macros. Extend __report_bug() to handle the deferred case.
> > 
> > ...
> >
> > --- a/include/asm-generic/bug.h
> > +++ b/include/asm-generic/bug.h
> > @@ -229,7 +230,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
> >  		 */
> >  		bug->flags |= BUGFLAG_DONE;
> >  	}
> > -
> > +	if (deferred) {
> > +		preempt_disable_notrace();
> > +		printk_deferred_enter();
> > +	}
> 
> For some reason the comment over printk_deferred_enter() says
> "Interrupts must be disabled for the deferred duration".  Is that the
> case for all the printk_deferred_enter() calls which this patch adds?

Strictly speaking, "only" CPU migration must be disabled around
printk_deferred_enter()/exit() call because the state is stored
in a per-CPU variable.

It means that preempt_disable() would work.

I do not recall whether we mentioned interrupts by mistake or
on purpose. It is possible that we suggested disabling interrupts
because we did not want to defer messages from unrelated (interrupt)
context.
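
To make the per-CPU point concrete, here is a toy userspace model. It is an assumption-laden sketch, not the printk implementation: the two-entry defer_count[] array, cur_cpu and the sim_*() helpers are invented for illustration. It shows why a migration between enter and exit leaves the counters unbalanced.

```c
#include <assert.h>

#define NR_CPUS 2			/* hypothetical CPU count */

static int cur_cpu;			/* CPU the simulated task runs on */
static int defer_count[NR_CPUS];	/* per-CPU "deferred" nesting state */

static void sim_deferred_enter(void)
{
	defer_count[cur_cpu]++;
}

static void sim_deferred_exit(void)
{
	defer_count[cur_cpu]--;
}

/* Models the scheduler moving the task to another CPU. */
static void sim_migrate(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	cur_cpu = cpu;
}

/* Returns 1 if every per-CPU counter is back at zero. */
static int sim_balanced(void)
{
	for (int i = 0; i < NR_CPUS; i++)
		if (defer_count[i] != 0)
			return 0;
	return 1;
}
```

If sim_migrate() runs between enter and exit, the first CPU is left with a stuck count and the second goes negative. Disabling preemption (or interrupts) around the pair rules this out.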

Best Regards,
Petr


* Re: [PATCH] net: ipa: fix SMEM state handle leaks in SMP2P init
From: Alex Elder @ 2026-06-23 15:53 UTC
  To: Haoxiang Li, elder, andrew+netdev, davem, edumazet, kuba, pabeni
  Cc: netdev, linux-kernel, stable
In-Reply-To: <20260623031831.1788454-1-haoxiang_li2024@163.com>

On 6/22/26 10:18 PM, Haoxiang Li wrote:
> ipa_smp2p_init() acquires two Qualcomm SMEM state handles with
> qcom_smem_state_get(). However, neither the init error paths
> nor ipa_smp2p_exit() release them.
> 
> Use devm_qcom_smem_state_get() for both state handles so the
> references are released automatically when the platform device
> is removed.
> 
> Fixes: 530f9216a953 ("soc: qcom: ipa: AP/modem communications")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>

So I guess they were never "put" before?

This looks OK, but I'll just mention that the IPA code
doesn't use devm_*() (managed) interfaces.  So it would
be more consistent to just call qcom_smem_state_put()
at the end of ipa_smp2p_exit() for both ipa->enabled_state
and ipa->valid_state.
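
A toy model of the explicit get/put pairing described above, for illustration only. The struct state_handle refcount, state_get()/state_put() and the smp2p_init()/smp2p_exit() functions are hypothetical stand-ins, not the qcom_smem_state API or the actual IPA functions. The error-path release is included because the commit message notes that those paths leak too.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a reference-counted state handle. */
struct state_handle {
	int refs;
};

static struct state_handle *state_get(struct state_handle *s)
{
	s->refs++;
	return s;
}

static void state_put(struct state_handle *s)
{
	assert(s != NULL && s->refs > 0);
	s->refs--;
}

struct smp2p {
	struct state_handle *valid_state;
	struct state_handle *enabled_state;
};

/* fail_second simulates the second lookup failing. */
static int smp2p_init(struct smp2p *p, struct state_handle *valid,
		      struct state_handle *enabled, int fail_second)
{
	p->valid_state = state_get(valid);
	if (fail_second) {
		/* Error path: drop the reference already taken. */
		state_put(p->valid_state);
		p->valid_state = NULL;
		return -1;
	}
	p->enabled_state = state_get(enabled);
	return 0;
}

static void smp2p_exit(struct smp2p *p)
{
	/* Release in reverse order of acquisition. */
	state_put(p->enabled_state);
	state_put(p->valid_state);
	p->enabled_state = NULL;
	p->valid_state = NULL;
}
```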

					-Alex

> ---
>   drivers/net/ipa/ipa_smp2p.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ipa/ipa_smp2p.c b/drivers/net/ipa/ipa_smp2p.c
> index 2f0ccdd937cc..d8fd56949082 100644
> --- a/drivers/net/ipa/ipa_smp2p.c
> +++ b/drivers/net/ipa/ipa_smp2p.c
> @@ -228,15 +228,15 @@ ipa_smp2p_init(struct ipa *ipa, struct platform_device *pdev, bool modem_init)
>   	u32 valid_bit;
>   	int ret;
>   
> -	valid_state = qcom_smem_state_get(dev, "ipa-clock-enabled-valid",
> -					  &valid_bit);
> +	valid_state = devm_qcom_smem_state_get(dev, "ipa-clock-enabled-valid",
> +					       &valid_bit);
>   	if (IS_ERR(valid_state))
>   		return PTR_ERR(valid_state);
>   	if (valid_bit >= 32)		/* BITS_PER_U32 */
>   		return -EINVAL;
>   
> -	enabled_state = qcom_smem_state_get(dev, "ipa-clock-enabled",
> -					    &enabled_bit);
> +	enabled_state = devm_qcom_smem_state_get(dev, "ipa-clock-enabled",
> +						 &enabled_bit);
>   	if (IS_ERR(enabled_state))
>   		return PTR_ERR(enabled_state);
>   	if (enabled_bit >= 32)		/* BITS_PER_U32 */



* Re: [PATCH 0/3] SM8450 IPA support
From: Alex Elder @ 2026-06-23 15:56 UTC
  To: esteuwu, Bjorn Andersson, Konrad Dybcio, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alex Elder
  Cc: linux-arm-msm, devicetree, linux-kernel, netdev
In-Reply-To: <20260622-sm8450-ipa-v1-0-532f0299f96e@proton.me>

On 6/22/26 8:44 PM, Esteban Urrutia via B4 Relay wrote:
> This series adds support for the IPA subsystem found in the SM8450 SoC.
> While IPA v5.0 is very similar to IPA v5.1 (heck, it even managed to
> properly get the modem up and running), it wasn't perfect, since the
> modem would sometimes hang when rebooting or powering the AP off.
> After a thorough investigation, I managed to create the proper data file
> required for IPA v5.1.
> 
> Regards,
> Esteban

I assume you have implemented this based on what you found in
some downstream code.  And if so, could you please indicate
where to find that (so I can do some cross-referencing myself).
I no longer have access to any Qualcomm internal documentation.

Thanks.

					-Alex

> Signed-off-by: Esteban Urrutia <esteuwu@proton.me>
> ---
> Esteban Urrutia (3):
>        arm64: dts: qcom: sm8450: Add IPA support
>        dt-bindings: net: qcom,ipa: Add SM8450 compatible string
>        net: ipa: Add IPA v5.1 data
> 
>   .../devicetree/bindings/net/qcom,ipa.yaml          |   1 +
>   arch/arm64/boot/dts/qcom/sm8450.dtsi               |  55 ++-
>   drivers/net/ipa/Makefile                           |   2 +-
>   drivers/net/ipa/data/ipa_data-v5.1.c               | 477 +++++++++++++++++++++
>   drivers/net/ipa/gsi_reg.c                          |   1 +
>   drivers/net/ipa/ipa_data.h                         |   1 +
>   drivers/net/ipa/ipa_main.c                         |   4 +
>   drivers/net/ipa/ipa_reg.c                          |   1 +
>   8 files changed, 536 insertions(+), 6 deletions(-)
> ---
> base-commit: 948efecf22e49aa4bf55bb73ec79a0ddcfd38571
> change-id: 20260622-sm8450-ipa-5da81f67eb65
> 
> Best regards,
> --
> Esteban Urrutia <esteuwu@proton.me>
> 
> 



* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Kuniyuki Iwashima @ 2026-06-23 16:08 UTC
  To: Jakub Sitnicki
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jiayuan Chen, John Fastabend, netdev, kernel-team
In-Reply-To: <20260623-bpf-sk_msg-split-unix-v2-1-ca7a626a94a5@cloudflare.com>

On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> completed all code paths related to sockmap-based redirects should be
> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> socket references would remain under BPF_SYSCALL.
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> ---
> Changes in v2:
> - Handle prot->recvmsg being NULL (Sashiko)
> - Elaborate on the end goal in description
> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> ---
>  net/unix/af_unix.c  | 4 ++--
>  net/unix/unix_bpf.c | 6 ++++++
>  2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index f7a9d55eee8a..84c11c60c75f 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
>  #ifdef CONFIG_BPF_SYSCALL
>         const struct proto *prot = READ_ONCE(sk->sk_prot);
>
> -       if (prot != &unix_dgram_proto)
> +       if (prot->recvmsg)

There is no reason to have this dead branch when
CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.

Let's compile out all sockmap code when both configs
are not enabled.

Since AF_UNIX differs from TCP/UDP, it can take the
simpler approach.


>                 return prot->recvmsg(sk, msg, size, flags);
>  #endif
>         return __unix_dgram_recvmsg(sk, msg, size, flags);
> @@ -3152,7 +3152,7 @@ static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
>         struct sock *sk = sock->sk;
>         const struct proto *prot = READ_ONCE(sk->sk_prot);
>
> -       if (prot != &unix_stream_proto)
> +       if (prot->recvmsg)
>                 return prot->recvmsg(sk, msg, size, flags);
>  #endif
>         return unix_stream_read_generic(&state, true);
> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> index f86ff19e9764..5289a04b4993 100644
> --- a/net/unix/unix_bpf.c
> +++ b/net/unix/unix_bpf.c
> @@ -7,6 +7,7 @@
>
>  #include "af_unix.h"
>
> +#ifdef CONFIG_NET_SOCK_MSG
>  #define unix_sk_has_data(__sk, __psock)                                        \
>                 ({      !skb_queue_empty(&__sk->sk_receive_queue) ||    \
>                         !skb_queue_empty(&__psock->ingress_skb) ||      \
> @@ -94,6 +95,7 @@ static int unix_bpf_recvmsg(struct sock *sk, struct msghdr *msg,
>         sk_psock_put(sk, psock);
>         return copied;
>  }
> +#endif /* CONFIG_NET_SOCK_MSG */
>
>  static struct proto *unix_dgram_prot_saved __read_mostly;
>  static DEFINE_SPINLOCK(unix_dgram_prot_lock);
> @@ -107,8 +109,10 @@ static void unix_dgram_bpf_rebuild_protos(struct proto *prot, const struct proto
>  {
>         *prot        = *base;
>         prot->close  = sock_map_close;
> +#ifdef CONFIG_NET_SOCK_MSG
>         prot->recvmsg = unix_bpf_recvmsg;
>         prot->sock_is_readable = sk_msg_is_readable;
> +#endif
>  }
>
>  static void unix_stream_bpf_rebuild_protos(struct proto *prot,
> @@ -116,8 +120,10 @@ static void unix_stream_bpf_rebuild_protos(struct proto *prot,
>  {
>         *prot        = *base;
>         prot->close  = sock_map_close;
> +#ifdef CONFIG_NET_SOCK_MSG
>         prot->recvmsg = unix_bpf_recvmsg;
>         prot->sock_is_readable = sk_msg_is_readable;
> +#endif
>         prot->unhash  = sock_map_unhash;
>  }
>
>
>
>


* Re: [PATCH net-next v3 1/2] net: phy: sfp: free mii_bus in sfp_i2c_mdiobus_destroy
From: Maxime Chevallier @ 2026-06-23 16:23 UTC (permalink / raw)
  To: Petr Wozniak, linux, andrew, hkallweit1
  Cc: kuba, davem, edumazet, pabeni, netdev, linux-kernel, linux-phy,
	bjorn, olek2, kabel
In-Reply-To: <20260623080538.7646-2-petr.wozniak@gmail.com>



On 6/23/26 10:05, Petr Wozniak wrote:
> sfp_i2c_mdiobus_create() allocates the I2C MDIO bus with mdio_i2c_alloc(),
> a plain (non-devm) allocation, and registers it. sfp_i2c_mdiobus_destroy()
> only unregisters the bus and clears sfp->i2c_mii without calling
> mdiobus_free(). As the only reference to the bus is then cleared, the
> struct mii_bus is leaked.
> 
> This is hit whenever a copper/RollBall SFP module that instantiated an MDIO
> bus is removed: sfp_sm_main() takes the global teardown path and calls
> sfp_i2c_mdiobus_destroy(). sfp_cleanup(), on driver unbind, frees
> sfp->i2c_mii directly, which is why the leak only triggered on module
> hot-removal and not on unbind.

What's worse, this can happen many times in a row :)

> 
> Free the bus in sfp_i2c_mdiobus_destroy() to match the allocation done in
> sfp_i2c_mdiobus_create().
> 
> Fixes: e85b1347ace6 ("net: sfp: create/destroy I2C mdiobus before PHY probe/after PHY release")
> Signed-off-by: Petr Wozniak <petr.wozniak@gmail.com>

With this patch sent towards the -net tree,

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>

Maxime

> ---
>  drivers/net/phy/sfp.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c
> index 03bfd8640db9..c4d274ab651e 100644
> --- a/drivers/net/phy/sfp.c
> +++ b/drivers/net/phy/sfp.c
> @@ -963,6 +963,7 @@ static int sfp_i2c_mdiobus_create(struct sfp *sfp)
>  static void sfp_i2c_mdiobus_destroy(struct sfp *sfp)
>  {
>  	mdiobus_unregister(sfp->i2c_mii);
> +	mdiobus_free(sfp->i2c_mii);
>  	sfp->i2c_mii = NULL;
>  }
>  



* Re: [PATCH iwl-net v2 1/2] ice: skip per-VLAN promisc rules when default VSI Rx rule is set
From: Simon Horman @ 2026-06-23 16:25 UTC (permalink / raw)
  To: poros; +Cc: Simon Horman, netdev, aleksandr.loktionov
In-Reply-To: <20260622113428.2565255-2-poros@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
Subject: ice: skip per-VLAN promisc rules when default VSI Rx rule is set

This patch optimizes the number of hardware switch rules used when an ice
port is part of a vlan-filtering bridge. It avoids installing per-VLAN
promiscuous rules if the broad default VSI Rx rule is already catching
all traffic.

> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> index e2fbe111f8496..efb1f536de6c7 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
[ ... ]
> @@ -442,17 +496,12 @@ static int ice_vsi_sync_fltr(struct ice_vsi *vsi)

[Severity: High]
This is a pre-existing issue, but does MAC filter overflow fail to force
promiscuous mode, resulting in silent packet drops?

When the hardware MAC filter limit is reached (LIBIE_AQ_RC_ENOSPC), the driver
attempts to fall back to promiscuous mode by setting promisc_forced_on = true.

However, the execution flow then reaches a check for IFF_PROMISC:

    if (vsi->current_netdev_flags & IFF_PROMISC) {
        /* Apply Rx filter rule to get traffic from wire */
        ...
    } else {
        /* Clear Rx filter to remove traffic from wire */

Since the user never explicitly requested IFF_PROMISC, this evaluates to false,
causing the driver to take the else branch. The else branch actively disables
the default VSI rule and clears promiscuous mode instead of enabling it. The
fallback feature appears to be non-functional, leading to lost traffic for
newly added MACs.

[Severity: High]
This is a pre-existing issue, but could this function prevent retries of
incomplete configurations and break multicast promiscuity if any default
VSI is already active?

In ice_vsi_sync_fltr(), the setup of multicast promiscuity and disabling of
Rx filtering is gated:

    if (!ice_is_dflt_vsi_in_use(vsi->port_info)) {
        err = ice_set_dflt_vsi(vsi);
        ...
        err = ice_set_promisc(vsi, ICE_MCAST_PROMISC_BITS);
        if (err)
            goto out_promisc;
    }

If ice_set_dflt_vsi() succeeds but the subsequent ice_set_promisc() fails,
the function jumps to out_promisc to trigger a retry. On the next cycle,
ice_is_dflt_vsi_in_use() returns true because this VSI is now the default,
causing the entire block to be skipped and leaving the VSI misconfigured.

Furthermore, if another VSI (such as a VF) is already the default VSI,
the PF VSI entirely skips setting multicast promiscuity, breaking multicast
reception in promiscuous mode.

>  		} else {
>  			/* Clear Rx filter to remove traffic from wire */
>  			if (ice_is_vsi_dflt_vsi(vsi)) {
> -				err = ice_clear_dflt_vsi(vsi);
> +				err = ice_vsi_exit_dflt_promisc(vsi);
>  				if (err) {
> -					netdev_err(netdev, "Error %d clearing default VSI %i Rx rule\n",
> -						   err, vsi->vsi_num);
>  					vsi->current_netdev_flags |=
>  						IFF_PROMISC;
>  					goto out_promisc;
>  				}
> -				if (vsi->netdev->features &
> -				    NETIF_F_HW_VLAN_CTAG_FILTER)
> -					vlan_ops->ena_rx_filtering(vsi);
>  			}
[ ... ]
> @@ -3675,10 +3724,9 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
>  	while (test_and_set_bit(ICE_CFG_BUSY, vsi->state))
>  		usleep_range(1000, 2000);

[Severity: High]
This is a pre-existing issue, but isn't there a risk of TOCTOU races with
this ad-hoc lock pattern lacking memory barriers?

The driver spins on ICE_CFG_BUSY using test_and_set_bit() and usleep_range()
to guard VSI filter synchronization. The lock is subsequently dropped using
clear_bit(ICE_CFG_BUSY, vsi->state) at the exit label.

Because clear_bit() lacks a memory release barrier, the CPU or compiler can
reorder memory stores from the critical section to occur after the flag is
cleared, leading to data races and corruption of shared filter state on
weakly ordered architectures.

> -	/* Add multicast promisc rule for the VLAN ID to be added if
> -	 * all-multicast is currently enabled.
> -	 */
> -	if (vsi->current_netdev_flags & IFF_ALLMULTI) {
> +	/* skip the per-VID rule when the DFLT Rx rule already covers this VID */
> +	if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
> +	    !ice_is_vsi_dflt_vsi(vsi)) {
>  		ret = ice_fltr_set_vsi_promisc(&vsi->back->hw, vsi->idx,
>  					       ICE_MCAST_VLAN_PROMISC_BITS,
>  					       vid);
> @@ -3696,11 +3744,12 @@ int ice_vlan_rx_add_vid(struct net_device *netdev, __be16 proto, u16 vid)
>  	if (ret)
>  		goto finish;

[Severity: High]
This is a pre-existing issue, but what happens if vlan_ops->add_vlan()
fails here? Is the hardware switch rule leaked?

When adding a VLAN to an interface with IFF_ALLMULTI enabled, this function
first allocates a promiscuous switch rule in hardware via
ice_fltr_set_vsi_promisc().

If this succeeds but the subsequent call to vlan_ops->add_vlan() fails
(e.g., due to ENOSPC or memory allocation failure), the function returns
early without rolling back the promiscuous rule. Because the network stack
considers the VLAN addition to have failed, ice_vlan_rx_kill_vid() will
never be called for this VID, permanently leaking a Flow Lookup Unit (FLU).

> -	/* If all-multicast is currently enabled and this VLAN ID is only one
> -	 * besides VLAN-0 we have to update look-up type of multicast promisc
> -	 * rule for VLAN-0 from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN.
> +	/* On the first non-zero VLAN, promote the VLAN-0 multicast promisc
> +	 * rule from ICE_SW_LKUP_PROMISC to ICE_SW_LKUP_PROMISC_VLAN. Skip when
> +	 * the DFLT Rx rule is installed; it already covers every VID.
>  	 */
>  	if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
> +	    !ice_is_vsi_dflt_vsi(vsi) &&
>  	    ice_vsi_num_non_zero_vlans(vsi) == 1) {
[ ... ]
> @@ -3763,11 +3812,12 @@ int ice_vlan_rx_kill_vid(struct net_device *netdev, __be16 proto, u16 vid)
>  					   ICE_MCAST_VLAN_PROMISC_BITS, vid);
>  
>  	if (!ice_vsi_has_non_zero_vlans(vsi)) {
> -		/* Update look-up type of multicast promisc rule for VLAN 0
> -		 * from ICE_SW_LKUP_PROMISC_VLAN to ICE_SW_LKUP_PROMISC when
> -		 * all-multicast is enabled and VLAN 0 is the only VLAN rule.
> +		/* Last non-zero VLAN gone: demote the VLAN-0 multicast promisc
> +		 * rule back to ICE_SW_LKUP_PROMISC. Skip when the DFLT Rx rule
> +		 * is installed; no recipe swap is needed.
>  		 */
> -		if (vsi->current_netdev_flags & IFF_ALLMULTI) {
> +		if ((vsi->current_netdev_flags & IFF_ALLMULTI) &&
> +		    !ice_is_vsi_dflt_vsi(vsi)) {

[Severity: High]
Will this modification cause a stale switch rule (ICE_SW_LKUP_PROMISC_VLAN for
vid=0) to be permanently leaked if the last VLAN is removed while the DFLT VSI
rule is active?

This patch skips demoting the vid=0 multicast promiscuous rule from
ICE_SW_LKUP_PROMISC_VLAN back to ICE_SW_LKUP_PROMISC when the last VLAN is
removed, if ice_is_vsi_dflt_vsi(vsi) is true.

However, if the interface later drops IFF_PROMISC (removing the DFLT VSI rule)
and then drops IFF_ALLMULTI, ice_clear_promisc() is called. Because the
interface now has zero VLANs, ice_clear_promisc() executes its else branch:

    } else {
        status = ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
                                            promisc_m, 0);
    }

This only searches and clears rules from the ICE_SW_LKUP_PROMISC recipe. The
stranded rule in the ICE_SW_LKUP_PROMISC_VLAN recipe is never cleared,
resulting in a permanent hardware switch rule leak.

>  			ice_fltr_clear_vsi_promisc(&vsi->back->hw, vsi->idx,
>  						   ICE_MCAST_VLAN_PROMISC_BITS,
>  						   0);


* Re: [PATCH net-next v3 2/2] net: phy: mdio-i2c: defer RollBall bridge probe to PHY discovery
From: Maxime Chevallier @ 2026-06-23 16:28 UTC (permalink / raw)
  To: Petr Wozniak, linux, andrew, hkallweit1
  Cc: kuba, davem, edumazet, pabeni, netdev, linux-kernel, linux-phy,
	bjorn, olek2, kabel
In-Reply-To: <20260623080538.7646-3-petr.wozniak@gmail.com>

Hi Petr,

On 6/23/26 10:05, Petr Wozniak wrote:
> commit 8fe125892f40 ("net: phy: sfp: probe for RollBall I2C-to-MDIO
> bridge in mdio-i2c") introduced a regression: the RollBall I2C-to-MDIO
> bridge is not yet ready to respond to CMD_READ/CMD_DONE cycles when
> sfp_sm_add_mdio_bus() runs in SFP_S_INIT.  The 200 ms probe times out,
> i2c_mii_probe_rollball() returns -ENODEV, and sfp_sm_add_mdio_bus()
> sets mdio_protocol = MDIO_I2C_NONE.  By the time sfp_sm_probe_for_phy()
> runs (up to ~17 s later on affected hardware), the bridge is fully
> initialized but PHY probing is skipped because the protocol has already
> been changed to NONE.
> 
> This affects both modules inserted before boot and hotplugged modules on
> hardware where bridge initialization exceeds the 200 ms probe window
> (confirmed: FLYPRO SFP-10GT-CS-30M with Aquantia AQR113C, hotplugged).
> 
> Move the probe from i2c_mii_init_rollball(), called at bus-creation time,
> to sfp_sm_probe_for_phy() in sfp.c, where it runs after the SFP state
> machine module initialization delays.  Export the probe function as
> mdio_i2c_probe_rollball() so sfp.c can call it.
> 
> For RTL8261BE-based modules the probe correctly returns -ENODEV at PHY
> discovery time, causing sfp_sm_probe_for_phy() to destroy the MDIO bus
> and set MDIO_I2C_NONE, eliminating the 5+ minute PHY probe retry loop.
> 
> For genuine RollBall modules (e.g. FLYPRO SFP-10GT-CS-30M with Aquantia
> AQR113C) the probe now runs after initialization is complete and
> correctly returns 0, so PHY detection proceeds normally.
> 
> Reported-by: Aleksander Bajkowski <olek2@wp.pl>
> Fixes: 8fe125892f40 ("net: phy: sfp: probe for RollBall I2C-to-MDIO bridge in mdio-i2c")
> Signed-off-by: Petr Wozniak <petr.wozniak@gmail.com>

I'm not currently at home so I can't test that on my side, but as you'll
have to resend to the net tree, can you CC me for the next round so that
I can test with the few odd-ball modules I have?

I expect to be able to test this on Friday :(

Maxime

> ---
> v3: regenerated against net-next (v2 failed to apply due to transit
>     corruption); fixed block comment style (checkpatch); no functional
>     change.
> v2: commit message only - generalized scope (Aleksander Bajkowski);
>     corrected SM description (Jan Hoffmann); no code change from v1.
> v1: initial.
>  drivers/net/mdio/mdio-i2c.c   | 15 +++++++++------
>  drivers/net/phy/sfp.c         | 22 +++++++++++++---------
>  include/linux/mdio/mdio-i2c.h |  1 +
>  3 files changed, 23 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/net/mdio/mdio-i2c.c b/drivers/net/mdio/mdio-i2c.c
> index b88f63234b4e..2a3a418c1369 100644
> --- a/drivers/net/mdio/mdio-i2c.c
> +++ b/drivers/net/mdio/mdio-i2c.c
> @@ -419,7 +419,7 @@ static int i2c_mii_write_rollball(struct mii_bus *bus, int phy_id, int devad,
>  	return 0;
>  }
>  
> -static int i2c_mii_probe_rollball(struct i2c_adapter *i2c)
> +int mdio_i2c_probe_rollball(struct i2c_adapter *i2c)
>  {
>  	u8 data_buf[] = { ROLLBALL_DATA_ADDR, 0x01, 0x00, 0x00 };
>  	u8 cmd_buf[]  = { ROLLBALL_CMD_ADDR, ROLLBALL_CMD_READ };
> @@ -462,9 +462,13 @@ static int i2c_mii_probe_rollball(struct i2c_adapter *i2c)
>  
>  	return -ENODEV;
>  }
> +EXPORT_SYMBOL_GPL(mdio_i2c_probe_rollball);
>  
>  static int i2c_mii_init_rollball(struct i2c_adapter *i2c)
>  {
> +	/* Send the RollBall unlock password; bridge presence is verified
> +	 * later, in sfp_sm_probe_for_phy(), after module initialization.
> +	 */
>  	struct i2c_msg msg;
>  	u8 pw[5];
>  	int ret;
> @@ -486,7 +490,7 @@ static int i2c_mii_init_rollball(struct i2c_adapter *i2c)
>  	if (ret != 1)
>  		return -EIO;
>  
> -	return i2c_mii_probe_rollball(i2c);
> +	return 0;
>  }
>  
>  static bool mdio_i2c_check_functionality(struct i2c_adapter *i2c,
> @@ -531,10 +535,9 @@ struct mii_bus *mdio_i2c_alloc(struct device *parent, struct i2c_adapter *i2c,
>  	case MDIO_I2C_ROLLBALL:
>  		ret = i2c_mii_init_rollball(i2c);
>  		if (ret < 0) {
> -			if (ret != -ENODEV)
> -				dev_err(parent,
> -					"Cannot initialize RollBall MDIO I2C protocol: %d\n",
> -					ret);
> +			dev_err(parent,
> +				"Cannot initialize RollBall MDIO I2C protocol: %d\n",
> +				ret);
>  			mdiobus_free(mii);
>  			return ERR_PTR(ret);
>  		}
> diff --git a/drivers/net/phy/sfp.c b/drivers/net/phy/sfp.c
> index c4d274ab651e..bbfaa0450798 100644
> --- a/drivers/net/phy/sfp.c
> +++ b/drivers/net/phy/sfp.c
> @@ -2174,17 +2174,10 @@ static void sfp_sm_fault(struct sfp *sfp, unsigned int next_state, bool warn)
>  
>  static int sfp_sm_add_mdio_bus(struct sfp *sfp)
>  {
> -	int ret;
> -
>  	if (sfp->mdio_protocol == MDIO_I2C_NONE)
>  		return 0;
>  
> -	ret = sfp_i2c_mdiobus_create(sfp);
> -	if (ret == -ENODEV) {
> -		sfp->mdio_protocol = MDIO_I2C_NONE;
> -		return 0;
> -	}
> -	return ret;
> +	return sfp_i2c_mdiobus_create(sfp);
>  }
>  
>  /* Probe a SFP for a PHY device if the module supports copper - the PHY
> @@ -2215,7 +2208,18 @@ static int sfp_sm_probe_for_phy(struct sfp *sfp)
>  		break;
>  
>  	case MDIO_I2C_ROLLBALL:
> -		err = sfp_sm_probe_phy(sfp, SFP_PHY_ADDR_ROLLBALL, true);
> +		/* Probe here, after module initialization delays, so that
> +		 * genuine RollBall bridges have had time to start up.
> +		 * Modules without a bridge (e.g. RTL8261BE) return -ENODEV.
> +		 */
> +		err = mdio_i2c_probe_rollball(sfp->i2c);
> +		if (err == -ENODEV) {
> +			sfp_i2c_mdiobus_destroy(sfp);
> +			sfp->mdio_protocol = MDIO_I2C_NONE;
> +			break;
> +		}
> +		if (!err)
> +			err = sfp_sm_probe_phy(sfp, SFP_PHY_ADDR_ROLLBALL, true);
>  		break;
>  	}
>  
> diff --git a/include/linux/mdio/mdio-i2c.h b/include/linux/mdio/mdio-i2c.h
> index 65b550a6fc32..5cf14f45c94b 100644
> --- a/include/linux/mdio/mdio-i2c.h
> +++ b/include/linux/mdio/mdio-i2c.h
> @@ -20,5 +20,6 @@ enum mdio_i2c_proto {
>  
>  struct mii_bus *mdio_i2c_alloc(struct device *parent, struct i2c_adapter *i2c,
>  			       enum mdio_i2c_proto protocol);
> +int mdio_i2c_probe_rollball(struct i2c_adapter *i2c);
>  
>  #endif



* Re: [PATCH v7 01/15] arm64: dts: qcom: kodiak: Add EL2 overlay
From: Mukesh Ojha @ 2026-06-23 16:31 UTC (permalink / raw)
  To: Sumit Garg
  Cc: andersson, linux-arm-msm, devicetree, dri-devel, freedreno,
	linux-media, netdev, linux-wireless, ath12k, linux-remoteproc,
	konradybcio, robh, krzk+dt, conor+dt, robin.clark, sean, akhilpo,
	lumag, abhinav.kumar, jesszhan0024, marijn.suijten, airlied,
	simona, vikash.garodia, dikshita.agarwal, bod, mchehab, elder,
	andrew+netdev, davem, edumazet, kuba, pabeni, jjohnson,
	mathieu.poirier, trilokkumar.soni, pavan.kondeti, jorge.ramirez,
	tonyh, vignesh.viswanathan, srinivas.kandagatla, amirreza.zarrabi,
	jens.wiklander, op-tee, apurupa, skare, linux-kernel, Sumit Garg
In-Reply-To: <20260522115936.201208-2-sumit.garg@kernel.org>

On Fri, May 22, 2026 at 05:29:22PM +0530, Sumit Garg wrote:
> From: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
> 
> All existing variants of Kodiak boards use the Gunyah hypervisor,
> which means that, so far, a Linux-based OS could only boot in EL1 on those
> devices.  However, it is possible for us to boot Linux at EL2 on these
> devices [1].
> 
> When running under Gunyah, the remote processor firmware IOMMU
> streams are controlled by Gunyah. However, without Gunyah, the IOMMU is
> managed by the consumer of this DeviceTree. Therefore, describe the
> firmware streams for each remote processor.
> 
> Add an EL2-specific DT overlay and apply it to Kodiak IOT variant
> devices to create -el2.dtb for each of them alongside "normal" dtb.
> 
> Note that modem and media subsystems aren't supported yet due
> to missing dependencies. For the GPU to work, the zap shader is disabled,
> and in EL2 mode the kernel owns the hardware watchdog, which is enabled here.
> 
> [1]
> https://docs.qualcomm.com/bundle/publicresource/topics/80-70020-4/boot-developer-touchpoints.html#uefi
> 
> Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
> [SG: watchdog and modem fixup]
> Signed-off-by: Sumit Garg <sumit.garg@oss.qualcomm.com>

As discussed internally, I will be taking this patch separately, so you
can drop it from the series.

-- 
-Mukesh Ojha


* Re: [PATCH iwl-net v2 2/2] ice: preserve uplink DFLT Rx rule on switchdev release
From: Simon Horman @ 2026-06-23 16:25 UTC (permalink / raw)
  To: poros; +Cc: Simon Horman, netdev, aleksandr.loktionov
In-Reply-To: <20260622113428.2565255-3-poros@redhat.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
ice: preserve uplink DFLT Rx rule on switchdev release

This commit preserves the default Rx rule on the uplink VSI when releasing
the switchdev environment if the interface is in promiscuous mode. This
prevents the catch-all rule from being removed while the netdev flags still
request promiscuity.

> diff --git a/drivers/net/ethernet/intel/ice/ice_eswitch.c b/drivers/net/ethernet/intel/ice/ice_eswitch.c
> index 2e4f0969035f7..48273ef9f69dc 100644
> --- a/drivers/net/ethernet/intel/ice/ice_eswitch.c
> +++ b/drivers/net/ethernet/intel/ice/ice_eswitch.c
[ ... ]
> @@ -278,8 +280,16 @@ static void ice_eswitch_release_env(struct ice_pf *pf)
>  	vlan_ops->ena_rx_filtering(uplink_vsi);

[Severity: High]
Could this lockless read of netdev->flags cause a regression due to a TOCTOU
race?

The ena_rx_filtering() call eventually reaches ice_cfg_vlan_pruning(),
which reads uplink_vsi->netdev->flags & IFF_PROMISC locklessly to decide
whether to enable VLAN pruning.

>  	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
>  			 ICE_FLTR_TX);
> -	ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx, false,
> -			 ICE_FLTR_RX);
> +
> +	/* Keep the Rx DFLT rule if the uplink is promiscuous; it must outlive
> +	 * the session. Test the live netdev->flags, the same value
> +	 * ena_rx_filtering() -> ice_cfg_vlan_pruning() above keys its decision
> +	 * on, so the preserved DFLT rule and the pruning state stay consistent.
> +	 */
> +	if (!(uplink_vsi->netdev->flags & IFF_PROMISC))

... and then the exact same flags are read locklessly again right here.

Because setup/teardown paths are protected by devl_lock rather than
rtnl_lock, could a concurrent dev_change_flags() holding rtnl_lock toggle
IFF_PROMISC between the two checks?

If IFF_PROMISC changes from 0 to 1 in this window, the code would incorrectly
enable VLAN pruning (seeing 0) and then incorrectly retain the DFLT_RX rule
(seeing 1), leaving the hardware with VLAN pruning enabled during promiscuous
mode.
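
As a sketch of the general remedy (one snapshot, two decisions), assuming a
user-space model in which the flags word is a C11 atomic. struct release_plan
and plan_release() are hypothetical; in the driver the first read happens
inside ice_cfg_vlan_pruning(), so applying this would need restructuring that
is not shown here:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_IFF_PROMISC 0x100u	/* same bit value as IFF_PROMISC */

/* Hypothetical: the two decisions the release path derives from the flags. */
struct release_plan {
	bool enable_vlan_pruning;
	bool keep_dflt_rx_rule;
};

/* Read the flags once and derive both decisions from that snapshot. */
static struct release_plan plan_release(atomic_uint *netdev_flags)
{
	unsigned int flags;
	bool promisc;

	assert(netdev_flags != NULL);
	flags = atomic_load_explicit(netdev_flags, memory_order_relaxed);
	promisc = (flags & MODEL_IFF_PROMISC) != 0;

	struct release_plan plan = {
		.enable_vlan_pruning = !promisc,
		.keep_dflt_rx_rule = promisc,
	};

	return plan;
}
```

A concurrent writer can still change the flags after the snapshot, so this
only guarantees that pruning and the DFLT Rx rule agree with each other, not
that they match the latest flags.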

[Severity: High]
Does preserving the Rx DFLT rule here cause a regression by permanently
breaking multicast promiscuity configuration if IFF_PROMISC was set?

During the switchdev session, ice_remove_vsi_fltr() removes the
ICE_SW_LKUP_PROMISC rules. By manually preserving the DFLT rule here, any
future synchronization via ice_vsi_sync_fltr() will evaluate
!ice_is_dflt_vsi_in_use(vsi->port_info) as false:

ice_main.c:ice_vsi_sync_fltr() {
    ...
    if (!ice_is_dflt_vsi_in_use(vsi->port_info)) {
        err = ice_set_dflt_vsi(vsi);
        ...
        err = ice_set_promisc(vsi, ICE_MCAST_PROMISC_BITS);
    }
    ...
}

Because it is skipped, ICE_MCAST_PROMISC_BITS is never configured, leaving
the VSI in unicast promiscuous mode but silently dropping multicast traffic
destined for other MAC addresses.

> +		ice_cfg_dflt_vsi(uplink_vsi->port_info, uplink_vsi->idx,
> +				 false, ICE_FLTR_RX);
> +
>  	ice_fltr_add_mac_and_broadcast(uplink_vsi,
>  				       uplink_vsi->port_info->mac.perm_addr,
>  				       ICE_FWD_TO_VSI);

[Severity: High]
This is a pre-existing issue, but does this teardown function fail to
synchronize hardware state for previously joined multicast groups and
secondary unicast addresses?

When entering switchdev mode, __dev_uc_unsync() and __dev_mc_unsync() clear
the synchronization state of OS-level MAC addresses, and
ice_remove_vsi_fltr() removes the actual hardware filters. While switchdev
runs, calls to ndo_set_rx_mode return early and do not update hardware.

When returning to legacy mode, this function restores only the primary MAC
and broadcast addresses, but doesn't seem to invoke netif_set_rx_mode() or
equivalent to resynchronize the deferred Rx filters from the netdev to the
hardware. Could this cause the hardware to silently drop packets for
previously joined groups until the interface is bounced?


* Re: [PATCH net-next v3 0/2] net: phy: sfp/mdio-i2c: defer RollBall probe + fix mii_bus leak
From: Maxime Chevallier @ 2026-06-23 16:34 UTC (permalink / raw)
  To: Petr Wozniak, linux, andrew, hkallweit1
  Cc: kuba, davem, edumazet, pabeni, netdev, linux-kernel, linux-phy,
	bjorn, olek2, kabel
In-Reply-To: <20260623080538.7646-1-petr.wozniak@gmail.com>

Hi Petr,

On 6/23/26 10:05, Petr Wozniak wrote:
> This series resends the RollBall bridge probe deferral (a fix for the
> regression in commit 8fe125892f40) and adds a related mii_bus leak fix.

These are bugfixes, so you need to target the 'net' tree, as explained here:

https://docs.kernel.org/process/maintainer-netdev.html

Thanks :)

Maxime
> 
> Patch 1 fixes a pre-existing mii_bus leak in sfp_i2c_mdiobus_destroy()
> that has been present since the helper was introduced in 2022. Patch 2's
> new -ENODEV path destroys the MDIO bus via sfp_i2c_mdiobus_destroy(), so
> patch 1 is a prerequisite to avoid leaking the bus on that path.
> 
> The v2 deferral patch was corrupted in transit and failed to apply; it is
> regenerated here against current net-next with no functional change.
> 
> v3:
>  - Resend: v2 defer patch was corrupted in transit and failed to apply
>    (netdev/apply); regenerated against current net-next.
>  - Fixed block comment style flagged by checkpatch. No functional change.
>  - Added patch 1/2 (sfp: free mii_bus in sfp_i2c_mdiobus_destroy).
> v2 (defer):
>  - Generalized scope: regression affects boot-inserted and hotplugged
>    modules where bridge init exceeds 200 ms; Aleksander Bajkowski
>    confirmed FLYPRO SFP-10GT-CS-30M / AQR113C broken when hotplugged.
>  - Corrected state machine description (probe runs in SFP_S_INIT after
>    SFP_S_WAIT) - Jan Hoffmann.
>  - No code changes from v1.
> v1: initial submission.
> 
> Petr Wozniak (2):
>   net: phy: sfp: free mii_bus in sfp_i2c_mdiobus_destroy
>   net: phy: mdio-i2c: defer RollBall bridge probe to PHY discovery
> 
>  drivers/net/mdio/mdio-i2c.c   | 15 +++++++++------
>  drivers/net/phy/sfp.c         | 23 ++++++++++++++---------
>  include/linux/mdio/mdio-i2c.h |  1 +
>  3 files changed, 24 insertions(+), 15 deletions(-)
> 
> 
> base-commit: b85966adbf5de0668a815c6e3527f87e0c387fb4



* Re: [RFC net-next 08/15] ipxlat: add translation engine and dispatch core
From: Ralf Lici @ 2026-06-23 16:36 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: netdev, Daniel Gröber, Antonio Quartulli, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Beniamino Galvani
In-Reply-To: <87ik7aej6f.fsf@toke.dk>

On Mon, 22 Jun 2026 16:36:24 +0200, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> > My second concern is that the SIIT boundary would be a property of
> >> > rule and hook placement. That gives flexibility, but it also means the
> >> > translation point has to be constrained and documented very carefully
> >> > to avoid ambiguous TTL/Hop Limit, PMTU/ICMP, and hook-order behavior.
> >> > For this use case I would rather have the route that matches the
> >> > translation prefix also be the object that says: leave this family
> >> > here and continue in the other one.
> >>
> >> Yeah, with flexibility comes the ability to shoot yourself in the foot.
> >> But that's not really different from much of the other functionality we
> >> have in the kernel today, is it? For netfilter in particular it's
> >> certainly possible to configure a broken NAT configuration that leads to
> >> packet drops (or just invalid packets being sent out on a network
> >> device).
> >>
> >
> > True, misconfiguration is always possible and that alone is not an
> > argument against the netfilter model. But what do we actually gain in
> > capability from that flexibility? I agree on the UX argument (an admin
> > would look in nft first), but in terms of what the feature can do, I
> > can't yet see what the nft model unlocks. More on this just below.
> >
> >> > After looking at the available kernel mechanisms again, I think the
> >> > better model is probably LWT: routes carry an ipxlat encap referencing a
> >> > named translator domain configured over netlink. That should represent
> >> > the stateless, prefix-based and symmetric nature of ipxlat.
> >>
> >> I think this description actually hits the nail on the head: What are we
> >> implementing here? Is it a product feature, or a building block for one?
> >> The properties you mention wrt consistency, symmetry etc are properties
> >> of the high-level feature (which is also generally the level things are
> >> specified in RFCs). Whereas other packet mangling features in the kernel
> >> are more in the "building block" category, where it's possible to
> >> configure things to implement a particular feature set / compliance with
> >> a particular RFC, but it's also possible to do things that are outside
> >> of that.
> >>
> >> I think this relates to the "mechanism, not policy" approach that we
> >> take to most things in the kernel: implement the building blocks to do
> >> something in the most general way we can, and then leave it up to
> >> userspace to configure things in a way that results in a consistent
> >> high-level system behaviour.
> >>
> >
> > That's a good point, and I agree that we should not bake a high-level
> > product policy into the kernel if what we need is a reusable mechanism
> > (the LWT idea was my attempt at exactly that). What I am still trying to
> > understand is whether there is a useful generic trigger for stateless
> > cross-family translation beyond the route/prefix/policy-routing cases.
> >
> > Routes and policy routing already cover the selectors I can make
> > coherent for a stateless, per-packet translator: destination/source
> > prefix, iif/oif/VRF, mark, TOS/DSCP, and so on. nft can of course match
> > much more than that, but the additional selectors that would materially
> > change the translation decision seem to be selectors such as L4 fields,
> > payload state, or conntrack state. Those are exactly the selectors I am
> > struggling to make correct for a stateless translator:
> >
> > - non-first fragments carry no L4 header at all, yet the translator must
> >   rewrite every fragment (an nft ... tcp dport trigger cannot fire on
> >   them);
> >
> > - ICMP errors must be translated too, but the flow identity lives in the
> >   quoted inner header (reversed), not in anything an L4/ct match on the
> >   error packet can see, and there is no conntrack to associate them,
> >   since this is stateless.
>
> True in principle, but if (say) you deploy this on a network that is
> configured so it will never fragment packets, this won't be an issue in
> practice.
>
> I.e., you're quite right that arbitrary matching criteria cannot be
> guaranteed to result in coherent translation. But I think that goes into
> the "use it wrong, get wrong results" bin. E.g., if you match on
> something that results in only a subset of the packets of a flow being
> translated, well, only that subset of the packets will make it to the
> destination. The SIIT translator itself should not try to fix this, but
> neither should it prevent it; that's what I mean by "building block" -
> it's up to the builder using the blocks to make sure the building
> doesn't collapse, that's out of scope for the block manufacturer to
> worry about :)
>

I agree with that framing. The translation core should not try to prove
that the surrounding policy describes a coherent SIIT deployment.

> > So an L4-conditional trigger does not look like a good primitive for
> > correct stateless SIIT unless the action also defragments/refragments or
> > uses conntrack-like state. Those may be valid mechanisms, but they move
> > the design away from the stateless per-packet SIIT boundary this RFC is
> > trying to model.
> >
> > So my first question is: is there a useful nft configuration this should
> > enable that is not naturally expressible as route selection, while still
> > remaining stateless SIIT rather than a NAT64-like stateful feature?
> > Maybe there is a real use case there, but I cannot construct one yet.
>
> So the poster child for "match on arbitrary criteria" is of course BPF.
> You can write BPF programs that match on arbitrary parts of the packet
> header, custom encapsulation headers, or even on out-of-band things like
> system state, phase of the moon, or what have you. And we should
> certainly allow a BPF program to make the decision on whether to perform
> the SIIT translation.
>
> Which... maybe is an argument to keep it as a device like you do in this
> RFC series? Redirecting to a device is trivially supported from TC-BPF,
> which also makes it possible to use the translation mechanism without
> going through the routing subsystem at all, saving a bit of overhead.
> Whereas making it a route action ties it very closely to the routing
> subsystem.
>
> WDYT?
>

I see the netdevice appeal for this, especially as a BPF redirect
target. But as we discussed earlier, the device model has some real
problems: the device selected by the first route is not the real
post-translation egress, so the model ends up doing translation and
reinjection rather than normal transmission. Concretely:

- it needs synthetic routing state purely to get things like MTU for
  fragmentation, because the real post-translation nexthop is not known
  at translation time;

- TTL/Hop Limit handling gets harder to reason about because the packet
  has effectively gone through two routing decisions;

- rx/tx stats can't be made meaningful for a direction-agnostic device
  whose ndo_start_xmit is really "translate and receive";

- and the setup is not very obvious: create an interface, route packets
  to it, then have them come back translated.

None of these is fatal on its own, but together they make me think the
abstraction does not quite fit.

On the BPF point specifically: I agree a BPF program should be able to
decide whether to translate. What I am less sure about is whether
redirecting to a netdevice is the best way to expose that. A TC action
(yet another model, I know :)) gives you the same thing in-pipeline and
more directly:

    tc filter add dev wwan0 egress \
        bpf obj match.o action ipxlat4to6 domain clat0

Let BPF make the policy decision, with the native action doing the
translation work that the current BPF CLAT implementations have trouble
with: fragmentation, checksum corner cases, and ICMP error inner
headers (as explained by Beniamino).

So TC clsact looks like the natural in-kernel replacement for today's
TC-BPF CLAT programs: no extra netdev, you attach to the existing
uplink, direction is explicit, and on egress you sit on the real route
dst, so the synthetic-dst and double-routing problems above just don't
arise. The cost is more moving parts than a single bpf_redirect since
userspace has to manage clsact, filters, priorities and action
lifecycle/cleanup.

For a gateway translator, though, I still think a device-bound model is
less natural. There the translation point is more like a forwarding
decision across routes and nexthops, so a route/LWT attachment, or
possibly a netfilter attachment, seems easier to reason about. Also, as
you already pointed out while discussing LWT, an admin setting up NAT64
is more likely to reach for an nft rule than for a clsact filter on a
specific device.

Taking a step back, ipxlat is really a generic translation engine plus a
thin harness around it. So rather than pick one attachment, it might be
worth structuring the engine so different harnesses can drive it.
There's interesting precedent for this shape:

- ILA, again, is the closest sibling: stateless IPv6 address translation
  with a shared core in ila_common.c, driven both by an LWT frontend in
  ila_lwt.c and by an inline netfilter hook with a netlink-configured
  mapping table in ila_xlat.c.

- act_ct is the precedent for the TC side specifically: a TC action that
  reuses the netfilter conntrack engine rather than reimplementing it.

And act_nat is the cautionary counter-example: a standalone TC
reimplementation of stateless NAT that shares no code with nf_nat, and
carries a "would be nice to share code" comment :)

So I am wondering whether the right direction is to factor the
translation engine cleanly, land it with one harness first, and keep the
other attachment points as follow-up work once the core semantics are
settled.
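
To make the "engine plus thin harness" shape a bit more concrete, here
is a minimal userspace sketch in C. It is not the ipxlat code: every
demo_ name is hypothetical, and the only translation step shown is the
RFC 6052 address mapping for a /96 prefix. The point is only that both
harnesses end up calling the same core function.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical shared core state: a single /96 translation prefix. */
struct demo_xlat_core {
	uint8_t prefix96[12];
};

/*
 * Core: RFC 6052 mapping for a /96 prefix. The IPv4 address fills the
 * last 32 bits of the resulting IPv6 address.
 */
void demo_xlat_addr_4to6(const struct demo_xlat_core *core,
			 const uint8_t v4[4], uint8_t v6[16])
{
	memcpy(v6, core->prefix96, 12);
	memcpy(v6 + 12, v4, 4);
}

/* Hypothetical route/LWT-style harness: only glue, no translation logic. */
void demo_lwt_harness(const struct demo_xlat_core *core,
		      const uint8_t v4[4], uint8_t v6[16])
{
	demo_xlat_addr_4to6(core, v4, v6);
}

/* Hypothetical TC-action-style harness: same core, different entry point. */
void demo_tc_harness(const struct demo_xlat_core *core,
		     const uint8_t v4[4], uint8_t v6[16])
{
	demo_xlat_addr_4to6(core, v4, v6);
}
```

With the prefix 64:ff9b::/96, both harnesses map 192.0.2.33 to
64:ff9b::c000:221, which is the property a shared core is meant to
guarantee.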

Does that direction seem reasonable to you?

-- 
Ralf Lici
Mandelbit Srl

* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Eric Biggers @ 2026-06-23 16:49 UTC (permalink / raw)
  To: Bastien Nocera
  Cc: linux-crypto, Herbert Xu, Marcel Holtmann, Luiz Augusto von Dentz,
	linux-doc, linux-api, linux-kernel, netdev, Linus Torvalds,
	linux-bluetooth, ell
In-Reply-To: <7d08a6df54279e9915f5df6bd4e5e5dde52b4fe1.camel@hadess.net>

On Tue, Jun 23, 2026 at 02:44:28PM +0200, Bastien Nocera wrote:
> Hey,
> 
> Replying to this older patch.
> 
> On Wed, 2026-04-29 at 18:15 -0700, Eric Biggers wrote:
> <snip>
> > This isn't intended to change anything overnight.  After all, most Linux
> > distros won't be able to disable the kconfig options quite yet, mainly
> > because of iwd.  But this should create a bit more impetus for these
> > userspace programs to be fixed, and the documentation update should also
> > help prevent more users from appearing.
> 
> There are 2 other users that I know of: bluez, and the ell library
> (used by iwd and bluez).
>
> From what I could tell, bluetoothd uses AF_ALG for cryptography:
> https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/src/shared/crypto.c
> https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/tools/mesh-gatt/crypto.c
> 
> It uses "ecb(aes)" and "cmac(aes)" as algorithms.
> 
> Finally, it also uses them both again:
> https://git.kernel.org/pub/scm/bluetooth/bluez.git/tree/mesh/crypto.c
> through ell:
> https://git.kernel.org/pub/scm/libs/ell/ell.git/tree/ell/cipher.c
> 
> Because that's a question that also came up, bluetoothd also uses the
> CAP_NET_ADMIN capability.
> 
> I'll let Luiz and Marcel take it over from here.
> 

We're aware of that and are taking it into account in the allowlist:
https://lore.kernel.org/linux-crypto/20260622234803.6982-1-ebiggers@kernel.org/
If you have any feedback on the allowlist, please respond to that patch.

- Eric

* Re: [PATCH net v2 1/2] net: stmmac: dwmac-spacemit: Fix wrong phy interface definition
From: Maxime Chevallier @ 2026-06-23 16:53 UTC (permalink / raw)
  To: Inochi Amaoto, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, Alexandre Torgue,
	Yixun Lan, Russell King (Oracle)
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-riscv, spacemit,
	linux-kernel, Yixun Lan, Longbin Li
In-Reply-To: <20260623074637.503864-2-inochiama@gmail.com>

Hello,

On 6/23/26 09:46, Inochi Amaoto wrote:
> The current MII interface register definition from the vendor is wrong,
> use the right number for the macro. Also, correct the interface mask
> in spacemit_set_phy_intf_sel() so it can update the register with the
> right number
> 
> Fixes: 30f0ba420ed3 ("net: stmmac: Add glue layer for Spacemit K3 SoC")
> Signed-off-by: Inochi Amaoto <inochiama@gmail.com>
> ---
>  drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c
> index 223754cc5c79..3bfb6d49be6c 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c
> @@ -18,8 +18,10 @@
>  #include "stmmac_platform.h"
>  
>  /* ctrl register bits */
> -#define CTRL_PHY_INTF_RGMII		BIT(3)
> -#define CTRL_PHY_INTF_MII		BIT(4)
> +#define CTRL_PHY_INTF_MODE		GENMASK(4, 3)
> +#define CTRL_PHY_INTF_RMII		FIELD_PREP(CTRL_PHY_INTF_MODE, 0)
> +#define CTRL_PHY_INTF_RGMII		FIELD_PREP(CTRL_PHY_INTF_MODE, 1)
> +#define CTRL_PHY_INTF_MII		FIELD_PREP(CTRL_PHY_INTF_MODE, 3)
>  #define CTRL_WAKE_IRQ_EN		BIT(9)
>  #define CTRL_PHY_IRQ_EN			BIT(12)
>  
> @@ -118,7 +120,7 @@ static void spacemit_get_interfaces(struct stmmac_priv *priv, void *bsp_priv,
>  
>  static int spacemit_set_phy_intf_sel(void *bsp_priv, u8 phy_intf_sel)
>  {
> -	unsigned int mask = CTRL_PHY_INTF_MII | CTRL_PHY_INTF_RGMII;
> +	unsigned int mask = CTRL_PHY_INTF_MODE;
>  	struct spacmit_dwmac *dwmac = bsp_priv;
>  	unsigned int val = 0;
>  
> @@ -128,6 +130,7 @@ static int spacemit_set_phy_intf_sel(void *bsp_priv, u8 phy_intf_sel)
>  		break;
>  
>  	case PHY_INTF_SEL_RMII:
> +		val = CTRL_PHY_INTF_RMII;

This isn't strictly speaking necessary, as this is 0 and val is already 0.
Maybe compilers can figure it out, and this leaves us with more
self-documenting code?
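
For reference, a small userspace sketch of the field encoding from the
hunk above. DEMO_GENMASK() and demo_field_prep() are simplified 32-bit
stand-ins for the kernel's GENMASK() and FIELD_PREP(), and
demo_set_phy_intf() is a hypothetical read-modify-write helper, not the
driver's actual register access path.

```c
#include <stdint.h>

/* Contiguous bitmask covering bits h..l (inclusive), 32-bit only. */
#define DEMO_GENMASK(h, l) \
	((~UINT32_C(0) >> (31 - (h))) & (~UINT32_C(0) << (l)))

#define DEMO_CTRL_PHY_INTF_MODE	DEMO_GENMASK(4, 3)

/* Shift val into the position described by a non-zero contiguous mask. */
static inline uint32_t demo_field_prep(uint32_t mask, uint32_t val)
{
	unsigned int shift = 0;

	while (!((mask >> shift) & 1))
		shift++;
	return (val << shift) & mask;
}

/* Replace only the interface-mode field, leaving other bits untouched. */
uint32_t demo_set_phy_intf(uint32_t ctrl, uint32_t mode)
{
	uint32_t mask = DEMO_CTRL_PHY_INTF_MODE;

	return (ctrl & ~mask) | demo_field_prep(mask, mode);
}
```

With these definitions, mode 0 (RMII) encodes to 0x00, mode 1 (RGMII) to
0x08 and mode 3 (MII) to 0x18, so the explicit RMII assignment is indeed
a no-op on a zero-initialized val.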

So I'm ok with that personally,

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>

Maxime


>  		break;
>  
>  	case PHY_INTF_SEL_RGMII:


* Re: [PATCH net v2 2/2] net: stmmac: dwmac-spacemit: Fix wrong irq definition
From: Maxime Chevallier @ 2026-06-23 16:54 UTC (permalink / raw)
  To: Inochi Amaoto, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Maxime Coquelin, Alexandre Torgue,
	Yixun Lan, Russell King (Oracle)
  Cc: netdev, linux-stm32, linux-arm-kernel, linux-riscv, spacemit,
	linux-kernel, Yixun Lan, Longbin Li
In-Reply-To: <20260623074637.503864-3-inochiama@gmail.com>

Hi,

On 6/23/26 09:46, Inochi Amaoto wrote:
> The current irq definition of the wake irq and the lpi irq
> is wrong, replace them with the right number and name.
> 
> Fixes: 30f0ba420ed3 ("net: stmmac: Add glue layer for Spacemit K3 SoC")
> Signed-off-by: Inochi Amaoto <inochiama@gmail.com>

Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>

Maxime

> ---
>  drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c
> index 3bfb6d49be6c..322bdf167a4a 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-spacemit.c
> @@ -22,8 +22,8 @@
>  #define CTRL_PHY_INTF_RMII		FIELD_PREP(CTRL_PHY_INTF_MODE, 0)
>  #define CTRL_PHY_INTF_RGMII		FIELD_PREP(CTRL_PHY_INTF_MODE, 1)
>  #define CTRL_PHY_INTF_MII		FIELD_PREP(CTRL_PHY_INTF_MODE, 3)
> -#define CTRL_WAKE_IRQ_EN		BIT(9)
> -#define CTRL_PHY_IRQ_EN			BIT(12)
> +#define CTRL_LPI_IRQ_EN			BIT(9)
> +#define CTRL_WAKE_IRQ_EN		BIT(12)
>  
>  /* dline register bits */
>  #define RGMII_RX_DLINE_EN		BIT(0)


* Re: Ethtool : PRBS feature
From: Lee Trager @ 2026-06-23 17:10 UTC (permalink / raw)
  To: Andrew Lunn, Das, Shubham
  Cc: Maxime Chevallier, Alexander H Duyck, netdev@vger.kernel.org,
	mkubecek@suse.cz, D H, Siddaraju, Chintalapalle, Balaji,
	Lindberg, Magnus, niklas.damberg@ericsson.com
In-Reply-To: <5f22c491-b816-421e-a531-bf87a07fea70@lunn.ch>

On 6/23/26 2:43 AM, Andrew Lunn wrote:

> Taking a quick look at this:
>
> You are missing a way to enumerate what test patterns the hardware
> supports. There is more than prbs7. You want to be able to report the
> contents of C45 1.1500, and other similar registers.

Not only is there more than PRBS7, there is also PRBS 8/10 encoding,
which is an option on any test. There may be other options; that was the
only one fbnic supported. I agree there needs to be a user interface
that displays the supported tests and options.

> To avoid race conditions, maybe some of these commands need combining.
> ethtool --phy-test eth1 tx-prbs prbs7 rx-prbs prbs7 bert start
>
> The configuration is then atomic, with respect to the uAPI, so we
> don't get two users configuring it at the same time, ending up with a
> messed up configuration.

Testing consumes the link, so you really don't want anything done to the
netdev while testing is running. fbnic does the following:

1. Testing cannot start when the link is up
2. Once testing starts the driver removes the netdev to prevent use. The 
netdev is only added back when testing stops. The upstream solution will 
need something that can keep the netdev but lock everything down while 
testing is running.
3. Once testing starts you cannot change the test, even on an individual 
lane basis. You must stop testing first.
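
As a rough illustration of rules 1 and 3, here is a tiny userspace
model in C. All names are hypothetical and this is not fbnic code; rule
2 (hiding or locking down the netdev) is deliberately not modelled.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-port state. */
struct demo_port {
	bool link_up;
	bool testing;
	int pattern;
};

/* Rule 1: refuse to start while the link is up. */
int demo_test_start(struct demo_port *p, int pattern)
{
	if (p->testing || p->link_up)
		return -EBUSY;
	p->pattern = pattern;
	p->testing = true;
	return 0;
}

/* Rule 3: the test cannot be changed while it is running. */
int demo_test_set_pattern(struct demo_port *p, int pattern)
{
	if (p->testing)
		return -EBUSY;
	p->pattern = pattern;
	return 0;
}

/* Stopping is always allowed and makes reconfiguration possible again. */
void demo_test_stop(struct demo_port *p)
{
	p->testing = false;
}
```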
>
> Traditionally, Unix does not offer a way to clear statistic counters
> back to zero. So i'm not sure about clear-stats. We also need to think
> about hardware which does not support that. And there is locking
> issues, can the stats be cleared while a test is active?

fbnic actually has separate registers for PRBS test results. Results do
need to be cleared between runs, but I never created an explicit clear
interface. Firmware automatically reset the registers when a new test
was started. This also allows results to be viewed after testing has
stopped.

Reading results was a little tricky due to rollover between two 32-bit
registers. I was able to read results while testing was running without
pausing. Technically I could clear results while testing was running, but
I never saw a need to.
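
One common way to handle that kind of rollover is a high/low/high read
that retries when the high word changes. The sketch below shows the
pattern against a simulated register pair. It is an assumption for
illustration, not necessarily what fbnic does, and every demo_ name is
hypothetical.

```c
#include <stdint.h>

/*
 * Simulated hardware: a free-running 64-bit counter exposed as two
 * 32-bit registers. It advances by 'step' on every register read to
 * mimic a test that keeps running while we sample it.
 */
struct demo_regs {
	uint64_t count;
	uint64_t step;
};

static uint32_t demo_rd(struct demo_regs *r, int hi)
{
	uint32_t v = hi ? (uint32_t)(r->count >> 32) : (uint32_t)r->count;

	r->count += r->step;
	return v;
}

/*
 * Read high, low, high again. If the high word changed, the low word
 * may belong to either side of the rollover, so retry.
 */
uint64_t demo_read_counter64(struct demo_regs *r)
{
	uint32_t hi, lo, hi2;

	do {
		hi = demo_rd(r, 1);
		lo = demo_rd(r, 0);
		hi2 = demo_rd(r, 1);
	} while (hi != hi2);

	return ((uint64_t)hi << 32) | lo;
}
```

A naive high-then-low read taken exactly across the rollover would pair
the old high word with the new low word; the retry loop avoids that
without pausing the test.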

>
> You need to think about the units for inject errors. There is no
> floating point support. Also, is this corrupt packets? Or single bit
> flips in the stream? It needs to be well defined what it actually
> means. The driver can then convert it to whatever the hardware
> supports. How does 802.3 specify this?
>
> Also, 802.3 defines PRBS7 as a benign pattern. With a quick look, i
> did not find a definition of benign, but injecting errors does not
> seem benign to me.
>
> I'm assuming when 'start' is used, the networking core will change the
> interface status to IF_OPER_TESTING. It is not always obvious why an
> interface is in testing mode, rather than IF_OPER_UP. Cable testing
> could also be running, etc. So maybe there needs to be a way to report
> why it is in IF_OPER_TESTING?
>
> I also wonder if a timeout should be used with start, so that it will
> return to IF_OPER_UP after a time period?

When I spoke to hardware engineers at Meta, they did not want a timeout.
Testing often ran for days, so they wanted to be able to start it and
explicitly stop it. I'm not against a timeout, but I do think it should
be optional.

Since PRBS testing is handled by firmware, one safety measure I added is
that if firmware lost contact with the host, testing was automatically
stopped and the TX FIR values were reset to factory defaults. This
ensured that the NIC couldn't get stuck in testing, and that on
initialization the driver doesn't have to worry about testing state.

Lee


* Re: [PATCH net-next v3] vsock/virtio: rewrite MSG_ZEROCOPY flag handling
From: Michael S. Tsirkin @ 2026-06-23 17:26 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jason Wang,
	Bobby Eshleman, Xuan Zhuo, Eugenio Pérez, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, oxffffaa, rulkc
In-Reply-To: <20260623153819.697635-1-avkrasnov@rulkc.org>

On Tue, Jun 23, 2026 at 06:38:19PM +0300, Arseniy Krasnov wrote:
> Logically it was based on TCP implementation, so to make further support
> easier, rewrite it in the TCP way (like in 'tcp_sendmsg_locked()'). This
> patch only rewrites flag handling (e.g. it doesn't change logic).
> 
> Signed-off-by: Arseniy Krasnov <avkrasnov@rulkc.org>


It seems to change logic though:

> ---
>  Changelog v1->v2:
>  * Rebase on last 'net-next'. Don't need 'skb_zcopy_set()' now - it was
>    already added.
>  Changelog v2->v3:
>  * Update commit message.
>  * Remove one empty line.
> 
>  net/vmw_vsock/virtio_transport_common.c | 47 ++++++++++++-------------
>  1 file changed, 22 insertions(+), 25 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 09475007165b..41c2a0b82a8e 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -328,38 +328,35 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>  	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>  		return pkt_len;
>  
> -	if (info->msg) {
> -		/* If zerocopy is not enabled by 'setsockopt()', we behave as
> -		 * there is no MSG_ZEROCOPY flag set.
> +	if (info->msg && (info->msg->msg_flags & MSG_ZEROCOPY)) {
> +		/* If 'info->msg' is not NULL, this is only VIRTIO_VSOCK_OP_RW.
> +		 * 'MSG_ZEROCOPY' flag handling here is based on the same flag
> +		 * handling from 'tcp_sendmsg_locked()'.
>  		 */
> -		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
> -			info->msg->msg_flags &= ~MSG_ZEROCOPY;

So previously without SOCK_ZEROCOPY, MSG_ZEROCOPY was always ignored...


> +		if (info->msg->msg_ubuf) {
> +			uarg = info->msg->msg_ubuf;
> +			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);

now it's not in this case?


Maybe the right call, but saying "does not change logic" seems wrong.
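
To spell out the difference, here is a reduced userspace model in C of
when virtio_transport_can_zcopy() gets consulted, as I read the old and
new hunks. The function names are hypothetical and everything except
the three inputs is abstracted away.

```c
#include <stdbool.h>

/*
 * Old code: MSG_ZEROCOPY is cleared up front unless SOCK_ZEROCOPY is
 * set, so msg_ubuf never matters for this decision.
 */
bool demo_old_try_zcopy(bool msg_zerocopy, bool sock_zerocopy, bool has_ubuf)
{
	(void)has_ubuf;
	return msg_zerocopy && sock_zerocopy;
}

/*
 * New code: with MSG_ZEROCOPY set, a caller-provided msg_ubuf is enough
 * on its own; SOCK_ZEROCOPY is only checked in the else branch.
 */
bool demo_new_try_zcopy(bool msg_zerocopy, bool sock_zerocopy, bool has_ubuf)
{
	if (!msg_zerocopy)
		return false;
	if (has_ubuf)
		return true;
	return sock_zerocopy;
}
```

In this model the two only disagree for MSG_ZEROCOPY set, SOCK_ZEROCOPY
clear and msg_ubuf present, which is the case pointed out above.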


> +		} else if (sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
> +			uarg = msg_zerocopy_realloc(sk_vsock(vsk), pkt_len,
> +						    NULL, false);
> +			if (!uarg) {
> +				virtio_transport_put_credit(vvs, pkt_len);
> +				return -ENOMEM;
> +			}
>  
> -		if (info->msg->msg_flags & MSG_ZEROCOPY)
>  			can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len);
> +			if (!can_zcopy)
> +				uarg_to_msgzc(uarg)->zerocopy = 0;
>  
> +			have_uref = true;
> +		}
> +
> +		/* 'can_zcopy' means that this transmission will be
> +		 * in zerocopy way (e.g. using 'frags' array).
> +		 */
>  		if (can_zcopy)
>  			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>  					    (MAX_SKB_FRAGS * PAGE_SIZE));
> -
> -		if (info->msg->msg_flags & MSG_ZEROCOPY &&
> -		    info->op == VIRTIO_VSOCK_OP_RW) {
> -			uarg = info->msg->msg_ubuf;
> -
> -			if (!uarg) {
> -				uarg = msg_zerocopy_realloc(sk_vsock(vsk),
> -							    pkt_len, NULL, false);
> -				if (!uarg) {
> -					virtio_transport_put_credit(vvs, pkt_len);
> -					return -ENOMEM;
> -				}
> -
> -				if (!can_zcopy)
> -					uarg_to_msgzc(uarg)->zerocopy = 0;
> -
> -				have_uref = true;
> -			}
> -		}
>  	}
>  
>  	rest_len = pkt_len;
> -- 
> 2.25.1


* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andrei Vagin @ 2026-06-23 17:29 UTC (permalink / raw)
  To: Askar Safin
  Cc: akpm, alexander, axboe, bernd, brauner, criu, david, dhowells,
	fuse-devel, hch, jack, joannelkoong, linux-api, linux-fsdevel,
	linux-kernel, linux-mm, miklos, netdev, patches, pfalcato,
	rostedt, torvalds, val, viro, willy
In-Reply-To: <20260623094211.1080873-1-safinaskar@gmail.com>

On Tue, Jun 23, 2026 at 2:42 AM Askar Safin <safinaskar@gmail.com> wrote:
>
> Andrei Vagin <avagin@gmail.com>:
> > Actually, this change introduces a performance and functional
> > regression for CRIU.
> >
> > Here is a brief overview of how CRIU currently dumps memory pages:
> >
> > CRIU injects a parasite code blob into the target process's address
> > space. The parasite invokes vmsplice() with the SPLICE_F_GIFT flag to
> > pin physical pages directly inside a pipe without copying them. The main
> > CRIU process then takes over from outside the target context, calling
> > splice() on the other end of the pipe to stream the data directly into
> > checkpoint image files or a remote network socket.
> >
> > I ran a simple test that creates an anonymous mapping and touches every
> > page within it:
> > Without this patch, CRIU takes 9 seconds to dump the test process.
> > With this patch, it takes 18 seconds...
> >
> > Plus, it obviously introduces some memory overhead.
> >
> > If these changes are merged, we will need to completely rework the
> > memory dumping mechanism in CRIU. Using vmsplice() in this proposed form
> > no longer makes any sense for our architecture...
>
> I have just read some docs for CRIU. I found this statement:
>
> > #### Why `splice` is Better:
> > *   **Consistency via COW**: The `SPLICE_F_GIFT` flag ensures that if the process modifies a "gifted" page after resuming, the kernel performs a **Copy-on-Write (COW)**. The pipe buffer continues to hold the *original* version of the page as it existed at the moment of the `vmsplice()` call, ensuring a perfectly consistent snapshot of that page.
>
> This is wrong (with released kernels). I confirmed this by testing this on my current kernel (6.12.90).
>
> See the code at the end of this message.
>
> If you actually rely on mentioned consistency, then, it seems, CRIU is broken.
>
> So, in fact, my patch actually brings consistency to CRIU. :)

Askar, unfortunately, the statement about "Consistency via COW" is just
"AI imagination". The under-the-hood docs were recently transferred from
the criu.org wiki using some AI transformations, which introduced this
wrong statement. The original document can be found here:
https://criu.org/Memory_dumping_and_restoring.

To clarify, CRIU does not rely on page consistency for intermediate
pre-dumps. The pre-dump mechanism is designed to be iterative: during a
pre-dump, tasks are briefly frozen to vmsplice dirty pages into pipes.
Then the tasks are resumed, and CRIU drains the pipes. If the process
modifies a page after it has been spliced, the data in the pipe may
become inconsistent. However, any such modification is tracked by the
soft-dirty page tracker. In the next pre-dump iteration (or the final
dump), these modified pages will be identified as dirty again and
re-dumped. During restore, the images are applied sequentially, and the
final dump (taken while the process is fully frozen) ensures we
reconstruct a consistent final state.
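
For readers unfamiliar with the soft-dirty tracker, here is a minimal
sketch in C of the per-round selection step. It is not CRIU code and
the demo_ names are hypothetical. What it relies on is the kernel
interface: bit 55 of a /proc/PID/pagemap entry reports a soft-dirty
PTE, and writing "4" to /proc/PID/clear_refs resets those bits before
the next round.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit 55 of a /proc/PID/pagemap entry: the PTE is soft-dirty. */
#define DEMO_PM_SOFT_DIRTY (UINT64_C(1) << 55)

static inline int demo_page_soft_dirty(uint64_t pagemap_entry)
{
	return (pagemap_entry & DEMO_PM_SOFT_DIRTY) != 0;
}

/*
 * Given the pagemap entries for n consecutive pages, store the indices
 * of the pages written since the last clear in 'out' (which must have
 * room for n entries) and return how many were found. Those are the
 * pages the next pre-dump round would have to send again.
 */
size_t demo_select_dirty(const uint64_t *entries, size_t n, size_t *out)
{
	size_t i, cnt = 0;

	for (i = 0; i < n; i++)
		if (demo_page_soft_dirty(entries[i]))
			out[cnt++] = i;
	return cnt;
}
```

A page that changes after it was spliced simply shows up again in the
next round, which is why the intermediate pipe contents do not need to
be consistent.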

But what really matters in this scheme is the vmsplice performance.
The proposed change significantly slows it down. In the case of CRIU,
vmsplice performance is critical because the target process is frozen
during these calls. Minimizing the freeze time is the primary goal of
pre-dumping to make migration almost invisible to the user workload.

Thanks,
Andrei

