Netdev List


* Re: [PATCH net-next v7 00/11] BIG TCP for UDP tunnels
From: patchwork-bot+netdevbpf @ 2026-06-14 13:50 UTC
  To: Alice Mikityanska
  Cc: daniel, davem, edumazet, kuba, pabeni, lucien.xin,
	willemdebruijn.kernel, willemb, dsahern, razor, shuah, stfomichev,
	andrew+netdev, horms, fw, netdev, alice
In-Reply-To: <20260611192955.604661-1-alice.kernel@fastmail.im>

Hello:

This series was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Thu, 11 Jun 2026 21:29:44 +0200 you wrote:
> From: Alice Mikityanska <alice@isovalent.com>
> 
> This series is a follow-up to "BIG TCP without HBH in IPv6", and it adds
> support for BIG TCP IPv4/IPv6 workloads in vxlan and geneve. Now that
> IPv6 BIG TCP doesn't require stripping the HBH in all various
> combinations in tunneled traffic, adding BIG TCP becomes feasible.
> 
> [...]

Here is the summary with links:
  - [net-next,v7,01/11] net/sched: act_csum: don't mangle UDP tunnel GSO packets
    https://git.kernel.org/netdev/net-next/c/9bcb30b389ec
  - [net-next,v7,02/11] geneve: Fix off-by-one comparing with GRO_LEGACY_MAX_SIZE
    https://git.kernel.org/netdev/net-next/c/2319688890d9
  - [net-next,v7,03/11] net: Use helpers to get/set UDP len tree-wide
    (no matching commit)
  - [net-next,v7,04/11] net: Enable BIG TCP with partial GSO
    (no matching commit)
  - [net-next,v7,05/11] udp: Support BIG TCP GSO packets where they can occur
    (no matching commit)
  - [net-next,v7,06/11] udp: Support gro_ipv4_max_size > 65536
    (no matching commit)
  - [net-next,v7,07/11] udp: Validate UDP length in udp_gro_receive
    (no matching commit)
  - [net-next,v7,08/11] udp: Set length in UDP header to 0 for big GSO packets
    (no matching commit)
  - [net-next,v7,09/11] vxlan: Enable BIG TCP packets
    (no matching commit)
  - [net-next,v7,10/11] geneve: Enable BIG TCP packets
    (no matching commit)
  - [net-next,v7,11/11] selftests: net: Add a test for BIG TCP in UDP tunnels
    (no matching commit)

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net-next v13 1/2] tcp: rehash onto different local ECMP path on retransmit timeout
From: Paolo Abeni @ 2026-06-14 13:54 UTC
  To: Neil Spring, netdev
  Cc: edumazet, ncardwell, kuniyu, davem, kuba, dsahern, horms, shuah,
	linux-kselftest, bpf, martin.lau, daniel
In-Reply-To: <20260612010047.1377331-2-ntspring@meta.com>

On 6/12/26 3:00 AM, Neil Spring wrote:
> Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO, PLB,
> and spurious-retransmission events, but the cached route is reused and
> the new hash is not propagated into the ECMP path selection logic.  Two
> changes are needed to make rehash select a different local ECMP path:
> 
> 1. Add __sk_dst_reset() alongside sk_rethink_txhash() in
>    tcp_write_timeout(), tcp_rcv_spurious_retrans(), and
>    tcp_plb_check_rehash() so the cached dst is invalidated and the
>    next transmit triggers a fresh route lookup.
> 
> 2. Set fl6->mp_hash from sk_txhash (or tcp_rsk(req)->txhash for
>    SYN/ACK retransmits and syncookies) in tcp_v6_connect(),
>    inet6_sk_rebuild_header(), inet6_csk_route_req(),
>    inet6_csk_route_socket(), tcp_v6_send_response(), and
>    cookie_v6_check() so fib6_select_path() picks a path based on the
>    new hash.
> 
> The mp_hash override only applies to fib_multipath_hash_policy 0 (the
> default L3 policy).  Its hash includes the flow label, but that is 0 by
> default -- np->flow_label is unset, and auto_flowlabels only computes
> the on-wire label later, per packet -- so flows to the same peer share
> one local path.  Keying the hash on sk_txhash makes the local path
> per-connection and lets a rehash re-select it.  Policies 1-3 are left
> unchanged.
> 
> The mp_hash assignment is factored into a small helper,
> ip6_ecmp_set_mp_hash(), shared by inet6_csk_route_req(),
> inet6_csk_route_socket(), tcp_v6_connect(), inet6_sk_rebuild_header(),
> tcp_v6_send_response(), and cookie_v6_check().  It applies
> (txhash >> 1) ?: 1 for policy 0 (the >> 1 keeps mp_hash in the 31-bit
> range; ?: 1 keeps it non-zero, since 0 would fall back to
> rt6_multipath_hash()).  inet6_csk_route_socket() calls it only for
> sk_protocol == IPPROTO_TCP so that non-TCP callers (e.g., L2TP via
> inet6_csk_xmit) fall through to rt6_multipath_hash() and retain their
> existing flow-key-based ECMP behavior.
> 
> tcp_v6_send_response() also sets mp_hash from the response txhash so
> that a control packet (a RST from the full socket, or an ACK from a
> time-wait socket) selects the same local ECMP nexthop as the
> connection's txhash rather than falling back to the flow hash.  The
> time-wait socket's tw_txhash is copied from sk_txhash when the
> connection enters TIME_WAIT, so it reflects any rehash that occurred.
> 
> Setting mp_hash explicitly is necessary because the default ECMP hash
> derives from fl6->flowlabel via np->flow_label, which is not updated
> from sk_txhash (REPFLOW is off by default).  ip6_make_flowlabel()
> cannot help either, as it runs after the route lookup.
> 
> As a consequence, for policy 0 the local ECMP path of an IPv6 TCP
> flow follows sk_txhash even when fl6->flowlabel is non-zero, e.g. a
> reflected (REPFLOW) or explicitly set (IPV6_FLOWLABEL_MGR) flow
> label.  This is intentional: only local path selection changes, so
> rehash can recover from a failed path; the on-wire flow label is
> unchanged.
> 
> sk_set_txhash() is moved before ip6_dst_lookup_flow() in
> tcp_v6_connect() so the initial ECMP path is selected by the same
> txhash that subsequent route rebuilds will use.  This avoids
> unintended path changes when the cached dst is naturally invalidated
> (e.g., by PMTU discovery or route changes).
> 
> The rehash sites (tcp_write_timeout(), tcp_plb_check_rehash(), and
> tcp_rcv_spurious_retrans()) call __sk_rethink_txhash_reset_dst(),
> which re-rolls the txhash and, when it changed, drops the cached dst
> so the next transmit re-runs route selection.  The dst reset is
> guarded by sk->sk_family == AF_INET6 since IPv4 ECMP does not
> currently use sk_txhash for path selection.  For IPv4-mapped IPv6
> sockets this produces a redundant dst reset on a cold path
> (RTO/PLB); the subsequent IPv4 route lookup returns the same result.
> The helper is deliberately separate from sk_rethink_txhash() itself:
> dst_negative_advice() calls sk_rethink_txhash() before its own dst op,
> so resetting the dst inside sk_rethink_txhash() would skip that op
> (e.g. rt6_remove_exception_rt()).
> 
> For syncookies, cookie_init_sequence() computes the cookie value
> before route_req() and sets txhash so the SYN-ACK selects the same
> ECMP path that cookie_v6_check() will use when the full socket is
> created.  cookie_tcp_reqsk_init() derives txhash from the cookie so
> the full socket's ECMP path matches the SYN-ACK.  Both the SYN-ACK
> assignment in tcp_conn_request() and the full-socket assignment in
> cookie_tcp_reqsk_init() are keyed on the packet family
> (skb->protocol == ETH_P_IPV6), not sk->sk_family: a dual-stack
> AF_INET6 listener also serves IPv4 connections, and the v4 cookie has
> mssind bits that would bias TX queue distribution if used as txhash.
> IPv4 connections retain net_tx_rndhash().
> 
> cookie_init_sequence() is split from the former version that also
> called tcp_synq_overflow() and incremented SYNCOOKIESSENT; those
> side effects are now in cookie_record_sent(), called after
> route_req() succeeds so they are not bumped when route_req() fails.
> cookie_record_sent() is guarded by CONFIG_SYN_COOKIES to
> match the guard on tcp_synq_overflow().  route_req() receives 0 as
> tw_isn for the syncookie path so that tcp_v6_init_req() still saves
> ireq->pktopts for REPFLOW flowlabel reflection and IPv6 cmsg
> options.  The ecn_ok clear for syncookies without timestamps stays
> after tcp_ecn_create_request() so it takes precedence.
> 
> Signed-off-by: Neil Spring <ntspring@meta.com>

This looks good to me, with a minor comment below.

> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index df479277fb80..cc71d84df42b 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -280,9 +280,18 @@ static int cookie_tcp_reqsk_init(struct sock *sk, struct sk_buff *skb,
>  	treq->snt_synack = 0;
>  	treq->snt_tsval_first = 0;
>  	treq->tfo_listener = false;
> -	treq->txhash = net_tx_rndhash();
>  	treq->rcv_isn = ntohl(th->seq) - 1;
>  	treq->snt_isn = ntohl(th->ack_seq) - 1;
> +	if (skb->protocol == htons(ETH_P_IPV6)) {
> +		/* Use the cookie as txhash so the ECMP path matches
> +		 * the SYN-ACK, where txhash was also set to the
> +		 * cookie.  The original request socket (and its
> +		 * txhash) was freed after sending the SYN-ACK.
> +		 */
> +		treq->txhash = treq->snt_isn;
> +	} else {
> +		treq->txhash = net_tx_rndhash();

I'm wondering if it would make sense to always use snt_isn for txhash in
the syncookie case, regardless of the IP protocol. Beyond reducing the
differences between IPv4 and IPv6, it would make the code a little simpler.

Not a blocker in any case.

Still, I think this deserves an explicit ack from Eric.

/P
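
The mp_hash derivation described in the quoted commit message,
(txhash >> 1) ?: 1 for policy 0, can be sketched in userspace roughly as
below. This is only an illustrative model, not the patch's helper: the
name model_mp_hash() is made up for this note, and the comments restate
the reasoning given in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of the mapping described in the commit
 * message for policy 0: (txhash >> 1) ?: 1.
 */
uint32_t model_mp_hash(uint32_t txhash)
{
	uint32_t h = txhash >> 1;	/* keep the value in the 31-bit range */

	/* Per the commit message, 0 would fall back to the flow-key hash,
	 * so a zero result is replaced with 1.
	 */
	return h ? h : 1;
}
```

With this mapping, txhash values 0 and 1 both yield 1, and any other
value is simply halved, so the result is never zero.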



* Re: [PATCH net v3] net: protect egress device access in the output path with rcu_read_lock
From: Paolo Abeni @ 2026-06-14 14:02 UTC
  To: Hyunwoo Kim, dsahern, idosch, davem, edumazet, kuba, horms,
	steffen.klassert, herbert, andrew+netdev, kuniyu, jlayton
  Cc: netdev
In-Reply-To: <aitI7sLX8SIC5RqQ@v4bel>

On 6/12/26 1:46 AM, Hyunwoo Kim wrote:
> The locally generated output path reads the egress network device from
> the route attached to the skb (skb_dst(skb)->dev, skb_dst_dev() or
> rt->dst.dev) and uses it as the 'out' device argument of an
> NF_HOOK()/nf_hook() call, or for direct field reads, without holding
> rcu_read_lock().
> 
> dst->dev is protected by RCU. When a device is unregistered its value is
> replaced with blackhole_netdev and the previous device is freed after an
> RCU grace period. A section that reads dst->dev and uses that pointer
> must therefore hold rcu_read_lock(). Otherwise a LOCAL_OUT / POST_ROUTING
> hook consumer (nft meta oif, selinux_ip_postroute_compat, etc.) or an
> early field read can reference a device that is no longer valid when the
> egress device is unregistered concurrently with transmission.
> 
> Rather than taking the lock in each dst->output leaf, take it once at the
> common ip_local_out() and ip6_local_out() level. This covers
> __ip_local_out() / __ip6_local_out() (the LOCAL_OUT hook) and
> dst_output(), and therefore ip_output(), ip_mc_output(), ip_mr_output(),
> xfrm4_output() and vrf_output(), as well as the IPv6 leaves ip6_output(),
> ip6_mr_output(), xfrm6_output() and vrf_output6(), in one place.
> 
> raw_send_hdrinc() and rawv6_send_hdrinc() do not go through
> ip_local_out() / ip6_local_out(); they run their own NF_HOOK() and also
> read the device (mtu, LL_RESERVED_SPACE(), needed_tailroom) before that
> hook, so they take their own rcu_read_lock(). The device fields are read
> under a short rcu_read_lock() at function entry that is dropped before
> the blocking sock_alloc_send_skb(); the NF_HOOK() itself runs under
> rcu_read_lock() (added in raw_send_hdrinc(), already present in
> rawv6_send_hdrinc()).
> 
> xfrm_output_resume() is left unchanged. It reads the device locklessly
> and is only reached either under the rcu_read_lock() held by its
> dst_output() caller (now including ip_local_out() / ip6_local_out() for
> locally generated traffic) or with softirqs disabled: the async crypto
> completion runs in BH-off context (cryptd and padata invoke it under
> local_bh_disable(); hardware crypto drivers complete it from softirq),
> and the IPTFS output runs from a HRTIMER_MODE_REL_SOFT timer. For the
> same reason __ip_local_out() and __ip6_local_out() keep the lockless
> skb_dst_dev() accessor: they are also reached from xfrm_output_resume()
> via ->local_out() in BH-off context, where the rcu_dereference() based
> accessor would trip CONFIG_PROVE_RCU.
> 
> Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()")
> Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
> ---
> Changes in v3:
> - Take the lock once at ip_local_out() / ip6_local_out() instead of in
>   each output leaf, and cover the IPv6 path as well (David Ahern, Ido
>   Schimmel).
> - Drop the xfrm_output_resume() change; its resumption paths already run
>   with BH off (Herbert Xu). Its synchronous entry is reached under the
>   rcu held by its dst_output() caller.
> - Also protect the early device reads (mtu, headroom) in
>   raw_send_hdrinc() / rawv6_send_hdrinc(), split so the lock is not held
>   across the blocking allocation.
> - The now redundant rcu_read_lock() nested in ip_output() / ip6_output()
>   will be removed in a follow-up.
> - v2: https://lore.kernel.org/all/ahveCu-zzFlpeVut@v4bel/
> 
> Changes in v2:
> - Changed to a net-wide patch that also fixes the issue in
>   raw_send_hdrinc(), xfrm_output_resume() and vrf_output().
> - v1: https://lore.kernel.org/all/ahqIg6vURwYI0LJ5@v4bel/
> ---
>  net/ipv4/ip_output.c   |  2 ++
>  net/ipv4/raw.c         | 19 ++++++++++++++-----
>  net/ipv6/output_core.c |  2 ++
>  net/ipv6/raw.c         | 16 ++++++++++++----
>  4 files changed, 30 insertions(+), 9 deletions(-)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 5bcd73cbdb41c..26b51ef0763fa 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -126,9 +126,11 @@ int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
>  	int err;
>  
> +	rcu_read_lock();
>  	err = __ip_local_out(net, sk, skb);
>  	if (likely(err == 1))
>  		err = dst_output(net, sk, skb);
> +	rcu_read_unlock();

At RH we are observing small but measurable and reproducible regressions
due to the increased number of RCU lock pairs in the fast path.
ip_local_out() is already invoked by __ip_queue_xmit() under the RCU lock.

I think it would be useful either to provide a variant that does not take
the RCU lock and use it in __ip_queue_xmit(), or to open-code
ip_local_out() there.

AFAICS the IPv6 counterpart is not in the local output fast path.

/P
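
The shape of the suggestion above (a variant for callers that already
hold the RCU read lock) can be illustrated with a small userspace model.
Everything here is hypothetical: the model_* names, the counters, and
the stand-in for the output work are invented for this note and are not
kernel APIs. The model only counts lock/unlock pairs to show what the
fast-path caller would save.

```c
#include <assert.h>

int model_rcu_depth;	/* current read-side nesting depth */
int model_rcu_pairs;	/* completed lock/unlock pairs */

void model_rcu_read_lock(void)
{
	model_rcu_depth++;
}

void model_rcu_read_unlock(void)
{
	model_rcu_depth--;
	model_rcu_pairs++;
}

/* Stand-in for work that must run inside a read-side section. */
int model_do_output(void)
{
	return model_rcu_depth > 0 ? 0 : -1;
}

/* Variant for callers that already hold the lock: no extra pair. */
int model_local_out_rcu(void)
{
	return model_do_output();
}

/* Generic entry point: takes the lock itself. */
int model_local_out(void)
{
	int err;

	model_rcu_read_lock();
	err = model_local_out_rcu();
	model_rcu_read_unlock();
	return err;
}

/* Fast-path caller that is already inside a read-side section. */
int model_queue_xmit(void)
{
	int err;

	model_rcu_read_lock();
	err = model_local_out_rcu();	/* no nested lock pair */
	model_rcu_read_unlock();
	return err;
}
```

In this model a call to model_queue_xmit() completes exactly one lock
pair, whereas calling model_local_out() from inside it would have
completed two.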



* Re: [PATCH v2 2/5] binder: Make shrinker rely solely on per-VMA lock
From: Carlos Llamas @ 2026-06-14 14:10 UTC
  To: Alice Ryhl
  Cc: Dave Hansen, Suren Baghdasaryan, Vlastimil Babka (SUSE),
	Dave Hansen, linux-kernel, Andrew Morton,
	Arve Hjønnevåg, Christian Brauner, David Ahern,
	David S. Miller, Greg Kroah-Hartman, Liam R. Howlett, linux-mm,
	Lorenzo Stoakes, netdev, Shakeel Butt, Todd Kjos
In-Reply-To: <aixi-DxMuc0MiGeO@google.com>

On Fri, Jun 12, 2026 at 07:50:16PM +0000, Alice Ryhl wrote:
> On Fri, Jun 12, 2026 at 11:47:59AM -0700, Dave Hansen wrote:
> > On 6/12/26 10:44, Suren Baghdasaryan wrote:
> > >> It's not impossible, but I do think it is irrelevant. Or at least that
> > >> the *VMA* is irrelevant in this case. binder_alloc_is_mapped()==false
> > >> means that the binder VMA is gone. It's not in the maple tree, and it's
> > >> not coming back. If a VMA is found, it's an impostor.
> > > Right, but before your change we were bailing out early. With your
> > > change we would be generating the traces and freeing the page. I think
> > > that's a functional change. Was that your intention?
> > 
> > Yeah, it was intentional.
> > 
> > I think the existing behavior is buggy. It also complicates the goal of
> > removing the mmap lock fallback. I've broken that behavior change out
> > into a separate patch. (attached here)
> 
> I think you can just:
> 
> 1. do a lock_vma_under_rcu().
> 2. if it fails, check binder_alloc_is_mapped().
> 3. if still mapped, return LRU_SKIP, otherwise behave like a failed
>    vma_lookup() does today under the mmap read lock.

Right! This is the same suggestion I sent.

...
Also, I would _prefer_ it if the commit message were more accurate. The
mmap_lock fallback was there for "compatibility", as per-VMA locking is
technically behind CONFIG_PER_VMA_LOCK. IMO this is the only part that
describes the actual reason for the change:

> Now that per-VMA locks are universally available, lock_vma_under_rcu()
> will not persistently fail. Rely on it alone and simplify the code.

Cheers,
--
Carlos Llamas


* [PATCH net-next] net: dsa: sja1105: fix lastused timestamp in flower stats
From: David Yang @ 2026-06-14 14:13 UTC
  To: netdev
  Cc: David Yang, Vladimir Oltean, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel

flow_stats_update() takes an absolute timestamp for lastused, not a
delta. Pass jiffies directly instead of the difference from the
previous update.

Signed-off-by: David Yang <mmyangfl@gmail.com>
---
 drivers/net/dsa/sja1105/sja1105_vl.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/dsa/sja1105/sja1105_vl.c b/drivers/net/dsa/sja1105/sja1105_vl.c
index 0ae9cb5ea8d1..e6ba9d4f8d1e 100644
--- a/drivers/net/dsa/sja1105/sja1105_vl.c
+++ b/drivers/net/dsa/sja1105/sja1105_vl.c
@@ -791,8 +791,7 @@ int sja1105_vl_stats(struct sja1105_private *priv, int port,
 	pkts = timingerr + unreleased + lengtherr;
 
 	flow_stats_update(stats, 0, pkts - rule->vl.stats.pkts, 0,
-			  jiffies - rule->vl.stats.lastused,
-			  FLOW_ACTION_HW_STATS_IMMEDIATE);
+			  jiffies, FLOW_ACTION_HW_STATS_IMMEDIATE);
 
 	rule->vl.stats.pkts = pkts;
 	rule->vl.stats.lastused = jiffies;
-- 
2.53.0
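
Why a delta is the wrong argument can be seen in a small userspace
model. This is a sketch under an assumption: the consumer is taken to
accumulate packet counts and keep the most recent absolute timestamp.
The model_stats type and model_stats_update() are invented for this
note and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

struct model_stats {
	uint64_t pkts;
	uint64_t lastused;	/* absolute timestamp of the last hit */
};

/* Assumed semantics: add the packet delta, keep the newest timestamp. */
void model_stats_update(struct model_stats *s, uint64_t pkts,
			uint64_t lastused)
{
	s->pkts += pkts;
	if (lastused > s->lastused)
		s->lastused = lastused;
}
```

With a current time of 1000 and a previous update at 990, passing the
delta records a last-used time of 10, which looks like a moment long
in the past, while passing the absolute value records 1000.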



* [PATCH net] atm: br2684: reject short VC-MUX bridged frames
From: Yizhou Zhao @ 2026-06-14 15:27 UTC
  To: netdev
  Cc: Yizhou Zhao, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Kees Cook, linux-kernel, Yuxiang Yang,
	Ao Wang, Xuewei Feng, Qi Li, Ke Xu, stable

br2684_push() validates the two-byte pad at the start of received
VC-MUX bridged frames with memcmp(), but does not first make sure that
those two bytes are present in the skb.

A short AAL5 PDU can reach this path after a BR2684 VCC is attached with
BR2684_ENCAPS_VC and bridged payload.  If skb->len is 0 or 1, the pad
comparison reads beyond the valid skb data.  When the bytes beyond
skb->len compare as zero, the code then continues toward eth_type_trans()
with the malformed frame.

Reject frames shorter than BR2684_PAD_LEN before checking the pad.  This
keeps the existing validation for valid VC-MUX bridged frames, which must
carry the two-byte pad before the Ethernet header.

Fixes: 7e903c2ae36e ("atm: [br2864] fix routed vcmux support")
Cc: stable@vger.kernel.org
Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn>
Reported-by: Ao Wang <wangao@seu.edu.cn>
Reported-by: Xuewei Feng <fengxw06@126.com>
Reported-by: Qi Li <qli01@tsinghua.edu.cn>
Reported-by: Ke Xu <xuke@tsinghua.edu.cn>
Assisted-by: GLM:GLM-5.1
Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
---
 net/atm/br2684.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/atm/br2684.c b/net/atm/br2684.c
index 6580d67c3456..07283c475a40 100644
--- a/net/atm/br2684.c
+++ b/net/atm/br2684.c
@@ -491,6 +491,8 @@ static void br2684_push(struct atm_vcc *atmvcc, struct sk_buff *skb)
 			skb->pkt_type = PACKET_HOST;
 		} else { /* p_bridged */
 			/* first 2 chars should be 0 */
+			if (skb->len < BR2684_PAD_LEN)
+				goto error;
 			if (memcmp(skb->data, pad, BR2684_PAD_LEN) != 0)
 				goto error;
 			skb_pull(skb, BR2684_PAD_LEN);

-- 
2.43.0
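
The ordering the patch enforces (check the length first, compare the
pad second) can be sketched outside the kernel as follows. The names
MODEL_PAD_LEN and model_pad_ok() are hypothetical and exist only for
this illustration. The function works on a plain byte buffer rather
than an skb.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAD_LEN 2

/* Returns 1 if buf starts with a two-byte zero pad, 0 otherwise. */
int model_pad_ok(const unsigned char *buf, size_t len)
{
	static const unsigned char pad[MODEL_PAD_LEN] = { 0, 0 };

	if (len < MODEL_PAD_LEN)	/* reject before reading any bytes */
		return 0;
	return memcmp(buf, pad, MODEL_PAD_LEN) == 0;
}
```

A zero-length or one-byte input is rejected without the comparison
ever reading past the valid data.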


* Re: [PATCH RESEND 1/6] sock: add sock_kzalloc helper
From: Thorsten Blum @ 2026-06-14 15:32 UTC
  To: Herbert Xu, David S. Miller, Eric Dumazet, Kuniyuki Iwashima,
	Paolo Abeni, Willem de Bruijn, Jakub Kicinski, Simon Horman
  Cc: linux-crypto, linux-kernel, netdev
In-Reply-To: <20260527082509.1133816-8-thorsten.blum@linux.dev>

On Wed, May 27, 2026 at 10:25:11AM +0200, Thorsten Blum wrote:
> Add sock_kzalloc() helper - the sock equivalent to kzalloc().
> 
> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
> ---
> Patch 1/6 needs an Acked-by: from netdev maintainers for the series to
> go through Herbert's crypto tree:
> https://lore.kernel.org/lkml/ahVkZOxZtFes6Huf@gondor.apana.org.au/
> ---
>  include/net/sock.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 76bfd3e56d63..b521bd34ac9f 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1913,6 +1913,11 @@ void sock_kfree_s(struct sock *sk, void *mem, int size);
>  void sock_kzfree_s(struct sock *sk, void *mem, int size);
>  void sk_send_sigurg(struct sock *sk);
>  
> +static inline void *sock_kzalloc(struct sock *sk, int size, gfp_t priority)
> +{
> +	return sock_kmalloc(sk, size, priority | __GFP_ZERO);
> +}
> +
>  static inline void sock_replace_proto(struct sock *sk, struct proto *proto)
>  {
>  	if (sk->sk_socket)

Gentle ping? Patch 1/6 still needs an ack from netdev maintainers.

Thanks,
Thorsten


* Re: [PATCH iproute2-next v2] tc: fq_pie: add support for printing per-flow PIE statistics
From: Stephen Hemminger @ 2026-06-14 15:43 UTC
  To: Hemendra M. Naik; +Cc: netdev, jiri, jhs, linux-kernel, vishy0777, tahiliani
In-Reply-To: <20260614130729.10076-1-hemendranaik@gmail.com>

On Sun, 14 Jun 2026 18:37:29 +0530
"Hemendra M. Naik" <hemendranaik@gmail.com> wrote:

> 'tc -s class show' against an fq_pie qdisc now prints:
> 
>  prob           drop probability for the flow
>  delay          per-flow queue sojourn time (microseconds)
>  deficit        remaining DRR byte credits (signed integer)
>  avg_dq_rate    dequeue rate estimate in bytes/second
>              	(dq_rate_estimator mode only)
> 
> avg_dq_rate is formatted using tc_print_rate(), which converts the
> kernel's bytes/second value to a human-readable bits/second string
> (e.g. '3906Kbit'), consistent with how other tc schedulers display
> rate fields. Apply the same fix to tc/q_pie.c, where avg_dq_rate was
> also printed as a raw integer without a unit.
> 
> Update the UAPI header to mirror tc_fq_pie_cl_stats from the kernel.
> Fix the 'delay' field comment in struct tc_pie_xstats from "in ms" to
> "in microseconds" to match the kernel's
> PSCHED_TICKS2NS / NSEC_PER_USEC conversion.
> 
> Add a 'tc -s class show' example to tc-fq_pie(8) with dq_rate_estimator
> enabled, showing all per-flow fields (prob, delay, deficit, avg_dq_rate)
> across multiple flows. Update tc-pie(8) avg_dq_rate example from a raw
> integer to a formatted bits/second string.
> 
> The corresponding kernel patch can be viewed here:
> https://lore.kernel.org/netdev/20260614125000.6058-1-hemendranaik@gmail.com/
> 
> Signed-off-by: Hemendra M. Naik <hemendranaik@gmail.com>
> Signed-off-by: Vishal Kamath <vishy0777@gmail.com>
> Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>

Minor feedback from AI review was:
Subject: Re: [PATCH iproute2-next v2] tc: fq_pie: add support for printing per-flow PIE statistics

On Sun, 14 Jun 2026, Hemendra M. Naik wrote:
> diff --git a/tc/q_fq_pie.c b/tc/q_fq_pie.c
> @@ -283,25 +285,43 @@ static int fq_pie_print_xstats(const struct qdisc_util *qu, FILE *f,
> +	if (st->type == TCA_FQ_PIE_XSTATS_CLASS) {
> +		print_float(PRINT_ANY, "prob", " prob %lg",
> +			    (double)st->class_stats.prob / (double)UINT64_MAX);
> +		print_uint(PRINT_JSON, "delay", NULL, st->class_stats.delay);
> +		print_string(PRINT_FP, NULL, " delay %s",
> +			     sprint_time(st->class_stats.delay, b1));
> +		print_int(PRINT_ANY, "deficit", " deficit %d",
> +			  st->class_stats.deficit);
> +
> +		if (st->class_stats.dq_rate_estimating) {
> +			tc_print_rate(PRINT_ANY, "avg_dq_rate", " avg_dq_rate %s",
> +				      st->class_stats.avg_dq_rate);
> +		}
> +	}
>  	print_nl();

The print_nl() at line 334 appears to be misplaced. It's outside both
conditional blocks, which means it will always print a newline regardless
of the statistics type being displayed. This could cause formatting issues:

- For TCA_FQ_PIE_XSTATS_CLASS, you'll get a newline after the class stats
- For qdisc stats, you'll get an extra newline after the memory_used field

The original code had print_nl() after the qdisc statistics. With the new
class statistics block, you likely need print_nl() inside each conditional
block to maintain proper formatting for each type.

Consider restructuring like this:
	if (!st->type || st->type == TCA_FQ_PIE_XSTATS_QDISC) {
		/* qdisc stats */
		...
		print_uint(PRINT_ANY, "memory_used", " memory_used %u",
			   st->memory_usage);
		print_nl();
	}

	if (st->type == TCA_FQ_PIE_XSTATS_CLASS) {
		/* class stats */
		...
		print_nl();
	}

Otherwise the patch looks good:
- Good use of print_* helpers throughout
- Proper handling of JSON vs text output modes
- The tc_print_rate() usage is correct and consistent
- Documentation updates in man pages are helpful



* [PATCH] amt: don't read the IP source address from a reallocated skb header
From: Michael Bommarito @ 2026-06-14 15:55 UTC
  To: Taehee Yoo; +Cc: netdev, linux-kernel

amt_update_handler() caches iph = ip_hdr(skb) and then calls
pskb_may_pull(). pskb_may_pull() can reallocate the skb head: the new
head is allocated and the old one is freed. The cached iph is not
refreshed, so the following tunnel lookup reads iph->saddr from the
freed head. On an AMT relay this lookup runs for every incoming
membership update, before the update's nonce and response MAC are
validated.

The sibling handlers amt_multicast_data_handler() and
amt_membership_query_handler() re-read ip_hdr() after the pull and are
not affected; only amt_update_handler() keeps the pre-pull pointer.

Snapshot the source address before the pulls and match against the
snapshot.

The stale read was confirmed by instrumentation rather than a sanitizer:
after the head is reallocated the comparison reads from the freed old
head. KASAN does not flag it because the skb head is released through
the page-fragment free path, which is not poisoned on free.

Fixes: cbc21dc1cfe9 ("amt: add data plane of amt interface")
Cc: stable@vger.kernel.org # v5.16+
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
Confirmed on x86_64 by instrumenting the comparison: with the update
packet built so the first pskb_may_pull() reallocates the head (it pulls
bytes out of a page fragment with no tailroom), the read runs against
the freed old head -- the head pointer moves and the old page's refcount
is 0. Neither generic KASAN nor arm64 HW-tag KASAN reports it: page-
fragment frees are not synchronously poisoned, and under MTE the freed
page keeps a tag matching the stale pointer, so this class of stale-
header read escapes the usual fuzzing oracles. On a live relay the freed
head is also exposed to reuse by later skb allocations.

  amtdbg: cmp reads iph=...e000 (skb->head=...384380) stale_head=1 ref=0

A KUnit covering the re-read can follow separately.

 drivers/net/amt.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/amt.c b/drivers/net/amt.c
index f2f3139..af6e28d 100644
--- a/drivers/net/amt.c
+++ b/drivers/net/amt.c
@@ -2455,8 +2455,10 @@ static bool amt_update_handler(struct amt_dev *amt, struct sk_buff *skb)
 	struct ethhdr *eth;
 	struct iphdr *iph;
 	int len, hdr_size;
+	__be32 saddr;

 	iph = ip_hdr(skb);
+	saddr = iph->saddr;

 	hdr_size = sizeof(*amtmu) + sizeof(struct udphdr);
 	if (!pskb_may_pull(skb, hdr_size))
@@ -2472,7 +2474,7 @@ static bool amt_update_handler(struct amt_dev *amt, struct sk_buff *skb)
 	skb_reset_network_header(skb);

 	list_for_each_entry_rcu(tunnel, &amt->tunnel_list, list) {
-		if (tunnel->ip4 == iph->saddr) {
+		if (tunnel->ip4 == saddr) {
 			if ((amtmu->nonce == tunnel->nonce &&
 			     amtmu->response_mac == tunnel->mac)) {
 				mod_delayed_work(amt_wq, &tunnel->gc_wq,
base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
-- 
2.53.0
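
The snapshot-before-pull pattern used by the fix can be modelled in
userspace as below. This is only a sketch: model_buf, model_hdr,
model_may_pull() and model_match() are invented for this note, and the
pull here always reallocates, which is an assumption chosen to make the
stale-pointer hazard deterministic.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct model_hdr {
	uint32_t saddr;
};

struct model_buf {
	unsigned char *head;
	size_t size;
};

/* Stand-in for a pull that may move the buffer: always reallocates. */
int model_may_pull(struct model_buf *b, size_t need)
{
	unsigned char *n = malloc(b->size + need);

	if (!n)
		return 0;
	memcpy(n, b->head, b->size);
	free(b->head);		/* pointers into the old head are now stale */
	b->head = n;
	b->size += need;
	return 1;
}

/* Copy the field out first, then pull; the old pointer is never reused. */
int model_match(struct model_buf *b, uint32_t want)
{
	struct model_hdr h;
	uint32_t saddr;

	memcpy(&h, b->head, sizeof(h));
	saddr = h.saddr;		/* snapshot before the pull */
	if (!model_may_pull(b, 8))
		return 0;
	return saddr == want;
}
```

Because the comparison uses the copied value, it stays valid no matter
where the buffer ends up after the pull.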


* Re: [PATCH iproute2-next] ipaddress: add support for showing IPv4 devconf attributes
From: Stephen Hemminger @ 2026-06-14 15:55 UTC
  To: Fernando Fernandez Mancera
  Cc: netdev, dsahern, davem, edumazet, kuba, pabeni, horms
In-Reply-To: <3e4a425e-0f58-48dc-a2bc-88fd6eb4a302@suse.de>

On Sat, 13 Jun 2026 09:22:38 +0200
Fernando Fernandez Mancera <fmancera@suse.de> wrote:

> On 6/13/26 8:41 AM, Fernando Fernandez Mancera wrote:
> > On 6/13/26 4:29 AM, Stephen Hemminger wrote:  
> >> On Sat, 13 Jun 2026 01:17:22 +0200
> >> Fernando Fernandez Mancera <fmancera@suse.de> wrote:
> >>  
> >>> +static void print_inet(FILE *fp, struct rtattr *inet_attr)
> >>> +{
> >>> +    struct rtattr *tb[IFLA_INET_MAX + 1];
> >>> +
> >>> +    parse_rtattr_nested(tb, IFLA_INET_MAX, inet_attr);
> >>> +
> >>> +    if (tb[IFLA_INET_CONF] && show_details) {
> >>> +        int *conf = RTA_DATA(tb[IFLA_INET_CONF]);
> >>> +        int max_elements = RTA_PAYLOAD(tb[IFLA_INET_CONF]) / 
> >>> sizeof(int);
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_FORWARDING)
> >>> +            print_string(PRINT_ANY, "forwarding", "forwarding %s ",
> >>> +                     conf[IPV4_DEVCONF_FORWARDING - 1] ? "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_MC_FORWARDING)
> >>> +            print_string(PRINT_ANY, "mc_forwarding", "mc_forwarding 
> >>> %s ",
> >>> +                     conf[IPV4_DEVCONF_MC_FORWARDING - 1] ? "on" : 
> >>> "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_PROXY_ARP)
> >>> +            print_string(PRINT_ANY, "proxy_arp", "proxy_arp %s ",
> >>> +                     conf[IPV4_DEVCONF_PROXY_ARP - 1] ? "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_ACCEPT_REDIRECTS)
> >>> +            print_string(PRINT_ANY, "accept_redirects",
> >>> +                     "accept_redirects %s ",
> >>> +                     conf[IPV4_DEVCONF_ACCEPT_REDIRECTS - 1] ? 
> >>> "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_SECURE_REDIRECTS)
> >>> +            print_string(PRINT_ANY, "secure_redirects",
> >>> +                     "secure_redirects %s ",
> >>> +                     conf[IPV4_DEVCONF_SECURE_REDIRECTS - 1] ? 
> >>> "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_SEND_REDIRECTS)
> >>> +            print_string(PRINT_ANY, "send_redirects", 
> >>> "send_redirects %s ",
> >>> +                     conf[IPV4_DEVCONF_SEND_REDIRECTS - 1] ? "on" : 
> >>> "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_SHARED_MEDIA)
> >>> +            print_string(PRINT_ANY, "shared_media", "shared_media %s ",
> >>> +                     conf[IPV4_DEVCONF_SHARED_MEDIA - 1] ? "on" : 
> >>> "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_RP_FILTER)
> >>> +            print_int(PRINT_ANY, "rp_filter", "rp_filter %d ",
> >>> +                  conf[IPV4_DEVCONF_RP_FILTER - 1]);
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_ACCEPT_SOURCE_ROUTE)
> >>> +            print_string(PRINT_ANY, "accept_source_route",
> >>> +                     "accept_source_route %s ",
> >>> +                     conf[IPV4_DEVCONF_ACCEPT_SOURCE_ROUTE - 1] ? 
> >>> "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_BOOTP_RELAY)
> >>> +            print_string(PRINT_ANY, "bootp_relay", "bootp_relay %s ",
> >>> +                     conf[IPV4_DEVCONF_BOOTP_RELAY - 1] ? "on" : 
> >>> "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_LOG_MARTIANS)
> >>> +            print_string(PRINT_ANY, "log_martians", "log_martians %s ",
> >>> +                     conf[IPV4_DEVCONF_LOG_MARTIANS - 1] ? "on" : 
> >>> "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_TAG)
> >>> +            print_int(PRINT_ANY, "tag", "tag %d ",
> >>> +                  conf[IPV4_DEVCONF_TAG - 1]);
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_ARPFILTER)
> >>> +            print_string(PRINT_ANY, "arpfilter", "arpfilter %s ",
> >>> +                     conf[IPV4_DEVCONF_ARPFILTER - 1] ? "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_MEDIUM_ID)
> >>> +            print_int(PRINT_ANY, "medium_id", "medium_id %d ",
> >>> +                  conf[IPV4_DEVCONF_MEDIUM_ID - 1]);
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_NOXFRM)
> >>> +            print_string(PRINT_ANY, "noxfrm", "noxfrm %s ",
> >>> +                     conf[IPV4_DEVCONF_NOXFRM - 1] ? "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_NOPOLICY)
> >>> +            print_string(PRINT_ANY, "nopolicy", "nopolicy %s ",
> >>> +                     conf[IPV4_DEVCONF_NOPOLICY - 1] ? "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_FORCE_IGMP_VERSION)
> >>> +            print_int(PRINT_ANY, "force_igmp_version", 
> >>> "force_igmp_version %d ",
> >>> +                  conf[IPV4_DEVCONF_FORCE_IGMP_VERSION - 1]);
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_ARP_ANNOUNCE)
> >>> +            print_int(PRINT_ANY, "arp_announce", "arp_announce %d ",
> >>> +                  conf[IPV4_DEVCONF_ARP_ANNOUNCE - 1]);
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_ARP_IGNORE)
> >>> +            print_int(PRINT_ANY, "arp_ignore", "arp_ignore %d ",
> >>> +                  conf[IPV4_DEVCONF_ARP_IGNORE - 1]);
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_PROMOTE_SECONDARIES)
> >>> +            print_string(PRINT_ANY, "promote_secondaries",
> >>> +                     "promote_secondaries %s ",
> >>> +                     conf[IPV4_DEVCONF_PROMOTE_SECONDARIES - 1] ? 
> >>> "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_ARP_ACCEPT)
> >>> +            print_int(PRINT_ANY, "arp_accept", "arp_accept %d ",
> >>> +                  conf[IPV4_DEVCONF_ARP_ACCEPT - 1]);
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_ARP_NOTIFY)
> >>> +            print_string(PRINT_ANY, "arp_notify", "arp_notify %s ",
> >>> +                     conf[IPV4_DEVCONF_ARP_NOTIFY - 1] ? "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_ACCEPT_LOCAL)
> >>> +            print_string(PRINT_ANY, "accept_local", "accept_local %s ",
> >>> +                     conf[IPV4_DEVCONF_ACCEPT_LOCAL - 1] ? "on" : 
> >>> "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_SRC_VMARK)
> >>> +            print_string(PRINT_ANY, "src_vmark", " src_vmark %s",
> >>> +                     conf[IPV4_DEVCONF_SRC_VMARK - 1] ? "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_PROXY_ARP_PVLAN)
> >>> +            print_string(PRINT_ANY, "proxy_arp_pvlan", 
> >>> "proxy_arp_pvlan %s ",
> >>> +                     conf[IPV4_DEVCONF_PROXY_ARP_PVLAN - 1] ? "on" : 
> >>> "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_ROUTE_LOCALNET)
> >>> +            print_string(PRINT_ANY, "route_localnet", 
> >>> "route_localnet %s ",
> >>> +                     conf[IPV4_DEVCONF_ROUTE_LOCALNET - 1] ? "on" : 
> >>> "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_BC_FORWARDING)
> >>> +            print_string(PRINT_ANY, "bc_forwarding", "bc_forwarding 
> >>> %s ",
> >>> +                     conf[IPV4_DEVCONF_BC_FORWARDING - 1] ? "on" : 
> >>> "off");
> >>> +
> >>> +        if (max_elements >= 
> >>> IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL)
> >>> +            print_int(PRINT_ANY, "igmpv2_unsolicited_report_interval",
> >>> +                  "igmpv2_unsolicited_report_interval %d ",
> >>> +                  
> >>> conf[IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL - 1]);
> >>> +
> >>> +        if (max_elements >= 
> >>> IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL)
> >>> +            print_int(PRINT_ANY, "igmpv3_unsolicited_report_interval",
> >>> +                  "igmpv3_unsolicited_report_interval %d ",
> >>> +                  
> >>> conf[IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL - 1]);
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN)
> >>> +            print_string(PRINT_ANY, "ignore_routes_with_linkdown",
> >>> +                     "ignore_routes_with_linkdown %s ",
> >>> +                     conf[IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN - 
> >>> 1] ?
> >>> +                     "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_DROP_UNICAST_IN_L2_MULTICAST)
> >>> +            print_string(PRINT_ANY, "drop_unicast_in_l2_multicast",
> >>> +                     "drop_unicast_in_l2_multicast %s ",
> >>> +                     conf[IPV4_DEVCONF_DROP_UNICAST_IN_L2_MULTICAST 
> >>> - 1] ?
> >>> +                     "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_DROP_GRATUITOUS_ARP)
> >>> +            print_string(PRINT_ANY, "drop_gratuitous_arp",
> >>> +                     "drop_gratuitous_arp %s ",
> >>> +                     conf[IPV4_DEVCONF_DROP_GRATUITOUS_ARP - 1] ? 
> >>> "on" : "off");
> >>> +
> >>> +        if (max_elements >= IPV4_DEVCONF_ARP_EVICT_NOCARRIER)
> >>> +            print_string(PRINT_ANY, "arp_evict_nocarrier",
> >>> +                     "arp_evict_nocarrier %s ",
> >>> +                     conf[IPV4_DEVCONF_ARP_EVICT_NOCARRIER - 1] ? 
> >>> "on" : "off");
> >>> +    }
> >>> +}
> >>> +  
> >> There are three different ways to display a flag value in JSON used in 
> >> iproute2.
> >> This one is my least favorite.
> >>
> >> The three ways are:
> >>     - print_bool
> >>     - print_null (only if on)
> >>     - print_string
> >>
> >> I would use the print_null pattern but print_bool would also be ok.
> >>  
> > 
> > Thanks for the suggestion Stephen, I would pick print_bool in this case.
> > 
> > If one of the options evolves to supporting something else we could 
> > easily adapt it without breaking compatibility if we use print_bool. If 
> > we use print_null I don't think we could do that.
> >   
> 
> Hm. I am actually not so sure about this..
> 
> The current print_string approach matches the setter and also the
> netconf side. While print_bool would be easier to parse for JSON, it
> does not look as good in command-line output.
> 
> print_null presents a different problem: users would need to make sure
> their parsing works with an iproute2 version that supports these new
> attributes.
> 
> So I am not so sure what is the best option here.

None of this is a hard requirement. The requirement is that the output
be valid JSON. The other common practices are:
  - non-JSON display output should match the input command line;
    some user tools even depended on this to save and restore state
  - JSON output should be easy to parse in Python.

If the user is reading the JSON from Python (which seems to be the most
common case), then a bool or key presence can be used directly in a
conditional.



>>> import json
>>> json.loads('{"forwarding": true}')["forwarding"]   # parsed as a Python bool
True
>>> json.loads('{"forwarding": "on"}')["forwarding"]   # parsed as a str
'on'
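
To make the "use in conditional" point concrete, here is a minimal sketch
of how each of the three styles might be consumed from Python. The JSON
fragments and the "forwarding" key are made up for illustration, and the
print_null case assumes the key is emitted with a JSON null when the flag
is set and omitted otherwise.

```python
import json

# Hypothetical JSON fragments; "forwarding" is only an example key.
as_bool = json.loads('{"forwarding": false}')     # print_bool style
as_null_on = json.loads('{"forwarding": null}')   # print_null style, flag set
as_null_off = json.loads('{}')                    # print_null style, flag clear
as_string = json.loads('{"forwarding": "off"}')   # print_string style


def enabled_bool(obj, key):
    """A JSON bool can be tested directly."""
    return bool(obj.get(key, False))


def enabled_presence(obj, key):
    """With the presence convention, only the key's existence matters."""
    return key in obj


def enabled_string(obj, key):
    """A string needs an explicit comparison; "off" is truthy in Python."""
    return obj.get(key) == "on"
```

With the string form, a plain `if obj["forwarding"]:` is true for both
"on" and "off", which is why the bool and presence forms are easier to
use in a conditional.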

^ permalink raw reply

* [PATCH net-next] i40e: add devlink parameter for Flow Director ATR sample rate
From: mheib @ 2026-06-14 16:11 UTC (permalink / raw)
  To: intel-wired-lan
  Cc: netdev, jiri, davem, edumazet, kuba, pabeni, horms, corbet,
	anthony.l.nguyen, przemyslaw.kitszel, andrew+netdev,
	Mohammad Heib

From: Mohammad Heib <mheib@redhat.com>

The i40e driver uses Flow Director ATR to periodically update flow
steering information for active TCP flows. The update frequency is
currently controlled by I40E_DEFAULT_ATR_SAMPLE_RATE and is fixed at
driver build time.

On systems with a large number of queues and high-rate TCP workloads,
the default sampling interval can result in frequent Flow Director
reprogramming for long-lived flows.

The amount of TCP packet reordering observed on some systems is
sensitive to the ATR sampling interval. Increasing the interval reduces
Flow Director programming activity and can significantly reduce the
associated reordering.

Since the optimal sampling interval depends on the workload and system
configuration, a single fixed value is not suitable for all deployments.

Add a devlink parameter to allow administrators to tune the ATR sample
rate at runtime without rebuilding the driver or disabling ATR
functionality entirely.

Signed-off-by: Mohammad Heib <mheib@redhat.com>
---
 Documentation/networking/devlink/i40e.rst     | 19 ++++++
 drivers/net/ethernet/intel/i40e/i40e.h        |  1 +
 .../net/ethernet/intel/i40e/i40e_devlink.c    | 65 +++++++++++++++++++
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  4 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  4 +-
 5 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/devlink/i40e.rst b/Documentation/networking/devlink/i40e.rst
index 51c887f0dc83..704469aa9acf 100644
--- a/Documentation/networking/devlink/i40e.rst
+++ b/Documentation/networking/devlink/i40e.rst
@@ -40,6 +40,25 @@ Parameters
 
         The default value is ``0`` (internal calculation is used).
 
+.. list-table:: Driver specific parameters implemented
+    :widths: 5 5 90
+
+    * - Name
+      - Mode
+      - Description
+    * - ``atr_sample_rate``
+      - runtime
+      - Controls how frequently Flow Director ATR updates flow steering
+        information for active TCP flows.
+
+        ATR programs Flow Director entries based on sampled transmitted
+        packets. The sampling interval is specified as the number of
+        transmitted packets between ATR updates.
+
+        Lower values increase Flow Director programming activity, while
+        higher values reduce the update frequency.
+
+        The default value is ``20``.
 
 Info versions
 =============
diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 1b6a8fbaa648..88eb40ee45f0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -487,6 +487,7 @@ struct i40e_pf {
 	u16 rss_size_max;          /* HW defined max RSS queues */
 	u16 fdir_pf_filter_count;  /* num of guaranteed filters for this PF */
 	u16 num_alloc_vsi;         /* num VSIs this driver supports */
+	u32 atr_sample_rate;
 	bool wol_en;
 
 	struct hlist_head fdir_filter_list;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_devlink.c b/drivers/net/ethernet/intel/i40e/i40e_devlink.c
index 229179ccc131..16e51762db45 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_devlink.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_devlink.c
@@ -33,12 +33,77 @@ static int i40e_max_mac_per_vf_get(struct devlink *devlink,
 	return 0;
 }
 
+static int i40e_atr_sample_rate_set(struct devlink *devlink,
+				    u32 id,
+				    struct devlink_param_gset_ctx *ctx,
+				    struct netlink_ext_ack *extack)
+{
+	struct i40e_pf *pf = devlink_priv(devlink);
+	struct i40e_vsi *vsi;
+	u32 sample_rate = ctx->val.vu32;
+	int i;
+
+	pf->atr_sample_rate = sample_rate;
+
+	if (!test_bit(I40E_FLAG_FD_ATR_ENA, pf->flags))
+		return 0;
+
+	vsi = i40e_pf_get_main_vsi(pf);
+	if (!vsi)
+		return 0;
+
+	for (i = 0; i < vsi->num_queue_pairs; i++) {
+		if (!vsi->tx_rings[i])
+			continue;
+		vsi->tx_rings[i]->atr_sample_rate = sample_rate;
+		vsi->tx_rings[i]->atr_count = 0;
+	}
+
+	return 0;
+}
+
+static int i40e_atr_sample_rate_get(struct devlink *devlink,
+				    u32 id,
+				    struct devlink_param_gset_ctx *ctx,
+				    struct netlink_ext_ack *extack)
+{
+	struct i40e_pf *pf = devlink_priv(devlink);
+
+	ctx->val.vu32 = pf->atr_sample_rate;
+
+	return 0;
+}
+
+static int i40e_atr_sample_rate_validate(struct devlink *devlink, u32 id,
+					 union devlink_param_value val,
+					 struct netlink_ext_ack *extack)
+{
+	if (!val.vu32) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "ATR sample rate must be greater than 0");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+enum i40e_dl_param_id {
+	I40E_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,
+	I40E_DEVLINK_PARAM_ID_ATR_SAMPLE_RATE,
+};
+
 static const struct devlink_param i40e_dl_params[] = {
 	DEVLINK_PARAM_GENERIC(MAX_MAC_PER_VF,
 			      BIT(DEVLINK_PARAM_CMODE_RUNTIME),
 			      i40e_max_mac_per_vf_get,
 			      i40e_max_mac_per_vf_set,
 			      NULL),
+	DEVLINK_PARAM_DRIVER(I40E_DEVLINK_PARAM_ID_ATR_SAMPLE_RATE,
+			     "atr_sample_rate",
+			     DEVLINK_PARAM_TYPE_U32,
+			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
+			     i40e_atr_sample_rate_get,
+			     i40e_atr_sample_rate_set,
+			     i40e_atr_sample_rate_validate),
 };
 
 static void i40e_info_get_dsn(struct i40e_pf *pf, char *buf, size_t len)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index d59750c490f4..9c8144970a34 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3458,7 +3458,7 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
 
 	/* some ATR related tx ring init */
 	if (test_bit(I40E_FLAG_FD_ATR_ENA, vsi->back->flags)) {
-		ring->atr_sample_rate = I40E_DEFAULT_ATR_SAMPLE_RATE;
+		ring->atr_sample_rate = vsi->back->atr_sample_rate;
 		ring->atr_count = 0;
 	} else {
 		ring->atr_sample_rate = 0;
@@ -12745,6 +12745,8 @@ static int i40e_sw_init(struct i40e_pf *pf)
 		}
 	}
 
+	pf->atr_sample_rate = I40E_DEFAULT_ATR_SAMPLE_RATE;
+
 	if ((pf->hw.func_caps.fd_filters_guaranteed > 0) ||
 	    (pf->hw.func_caps.fd_filters_best_effort > 0)) {
 		set_bit(I40E_FLAG_FD_ATR_ENA, pf->flags);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index bb741ff3e5f2..7e29e9244c3a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -372,8 +372,8 @@ struct i40e_ring {
 	u16 next_to_clean;
 	u16 xdp_tx_active;
 
-	u8 atr_sample_rate;
-	u8 atr_count;
+	u32 atr_sample_rate;
+	u32 atr_count;
 
 	bool ring_active;		/* is ring online or not */
 	bool arm_wb;		/* do something to arm write back */
-- 
2.53.0
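
As a rough illustration of what the parameter controls, the sketch below
models a per-ring packet counter that triggers an update once every
atr_sample_rate transmitted packets. It is a simplified user-space model,
not driver code: the class and helper names are hypothetical, it ignores
any special cases the real transmit path may apply, and it skips the
I40E_FLAG_FD_ATR_ENA check that the set callback performs.

```python
from dataclasses import dataclass


@dataclass
class TxRing:
    """Hypothetical stand-in for the per-ring ATR state touched by the patch."""
    atr_sample_rate: int = 20  # documented default; 0 means ATR is off
    atr_count: int = 0

    def should_program(self) -> bool:
        """Return True when this transmitted packet triggers an ATR update."""
        if self.atr_sample_rate == 0:
            return False
        self.atr_count += 1
        if self.atr_count < self.atr_sample_rate:
            return False
        self.atr_count = 0
        return True


def validate_sample_rate(value: int) -> None:
    """Mirror of the validate callback: zero is rejected."""
    if value == 0:
        raise ValueError("ATR sample rate must be greater than 0")


def set_sample_rate(rings, value: int) -> None:
    """Mirror of the set callback: store the rate and restart the counters."""
    validate_sample_rate(value)
    for ring in rings:
        ring.atr_sample_rate = value
        ring.atr_count = 0


def count_updates(ring: TxRing, packets: int) -> int:
    """Count how many ATR updates `packets` transmitted packets would cause."""
    return sum(ring.should_program() for _ in range(packets))
```

In this model, raising the rate from 20 to 200 cuts the number of updates
for the same packet count by a factor of ten, which is the trade-off the
commit message describes.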


^ permalink raw reply related

* Re: [PATCH net-next V3 10/15] net/mlx5: LAG, disable both regular and SD LAG on lag_disable_change
From: Shay Drori @ 2026-06-14 16:43 UTC (permalink / raw)
  To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, David S. Miller
  Cc: Saeed Mahameed, Leon Romanovsky, Mark Bloch, Or Har-Toov,
	Edward Srouji, Maher Sanalla, Simon Horman, Gerd Bayer, Kees Cook,
	Moshe Shemesh, Parav Pandit, Patrisious Haddad, netdev,
	linux-rdma, linux-kernel, Gal Pressman
In-Reply-To: <20260612113904.537595-11-tariqt@nvidia.com>



On 12/06/2026 14:38, Tariq Toukan wrote:
> From: Shay Drory <shayd@nvidia.com>
> 
> Extend mlx5_lag_disable_change() to properly disable both regular LAG
> and SD LAG when requested. Each LAG type uses its own devcom component
> for locking.
> 
> Use mlx5_sd_get_devcom() helper to retrieve the SD devcom component,
> needed for proper locking when disabling SD LAG.
> 
> Signed-off-by: Shay Drory <shayd@nvidia.com>
> Reviewed-by: Mark Bloch <mbloch@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>   .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 29 +++++++++++++++++--
>   1 file changed, 27 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> index e23c1e81b98f..84eff995cad1 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> @@ -2494,13 +2494,22 @@ EXPORT_SYMBOL(mlx5_lag_is_shared_fdb);
>   
>   void mlx5_lag_disable_change(struct mlx5_core_dev *dev)
>   {
> +	struct mlx5_devcom_comp_dev *sd_devcom = mlx5_sd_get_devcom(dev);
> +	struct mlx5_core_dev *primary = dev;
>   	struct mlx5_lag *ldev;
> +	struct lag_func *pf;
> +	int i;
>   
>   	ldev = mlx5_lag_dev(dev);
>   	if (!ldev)
>   		return;
>   
> -	mlx5_devcom_comp_lock(dev->priv.hca_devcom_comp);
> +	if (sd_devcom) {
> +		mlx5_devcom_comp_lock(sd_devcom);
> +		primary = mlx5_sd_get_primary(dev) ?: dev;
> +		mlx5_devcom_comp_unlock(sd_devcom);
> +	}
> +	mlx5_devcom_comp_lock(primary->priv.hca_devcom_comp);
>   	mutex_lock(&ldev->lock);
>   
>   	ldev->mode_changes_in_progress++;
> @@ -2512,7 +2521,23 @@ void mlx5_lag_disable_change(struct mlx5_core_dev *dev)
>   	}
>   
>   	mutex_unlock(&ldev->lock);
> -	mlx5_devcom_comp_unlock(dev->priv.hca_devcom_comp);
> +	mlx5_devcom_comp_unlock(primary->priv.hca_devcom_comp);
> +
> +	if (!sd_devcom)
> +		return;
> +
> +	/* Teardown SD shared FDB for this device's group if active */
> +	mlx5_devcom_comp_lock(sd_devcom);
> +	mutex_lock(&ldev->lock);
> +	mlx5_lag_for_each(i, 0, ldev, MLX5_LAG_FILTER_ALL) {
> +		pf = mlx5_lag_pf(ldev, i);
> +		if (pf->dev == dev && pf->sd_fdb_active) {
> +			mlx5_lag_shared_fdb_destroy(ldev, pf->group_id);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&ldev->lock);
> +	mlx5_devcom_comp_unlock(sd_devcom);

sashiko.dev says:
Does holding the sd_devcom lock while calling mlx5_lag_shared_fdb_destroy()
introduce an AB-BA deadlock with auxiliary device probe?
This path acquires sd_devcom, and mlx5_lag_shared_fdb_destroy() eventually
reaches mlx5_rescan_drivers_locked() calling device_del() on auxiliary
devices, which attempts to acquire device_lock(&adev->dev). This gives us:
sd_devcom -> device_lock()
However, during auxiliary device probe, the driver core holds
device_lock(&adev->dev) before calling mlx5e_probe().
mlx5e_probe() then calls mlx5_sd_get_adev() which acquires sd_devcom,
giving us the reverse:
device_lock() -> sd_devcom
Could the teardown be performed without holding the sd_devcom lock here
to prevent this deadlock?

[SD] No. The teardown's device_del() runs on the IB aux devices, while
the device_lock held during probe belongs to the ETH aux device
(mlx5e_probe()). These are different struct devices, so there is no AB-BA.
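
The argument above can be sketched as a lock-order graph. This is only an
illustration of the reasoning, not a check of the driver: the lock names
are labels invented for the example, and the helper treats each name as a
distinct lock instance.

```python
def has_lock_cycle(paths):
    """Return True if the acquisition orders in `paths` can form a cycle.

    Each path is a list of lock names in the order one code path takes them.
    """
    # Build "held before" edges: a -> b means b was taken while a was held.
    graph = {}
    for path in paths:
        for i, held in enumerate(path):
            graph.setdefault(held, set()).update(path[i + 1:])

    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back edge: an ordering cycle exists
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


# Orders as described in the reply: the two device locks are different.
teardown = ["sd_devcom", "device_lock(ib_adev)"]
probe = ["device_lock(eth_adev)", "sd_devcom"]

# Orders as the review comment feared: the same device lock on both sides.
same_dev = [["sd_devcom", "device_lock(adev)"],
            ["device_lock(adev)", "sd_devcom"]]
```

Here has_lock_cycle([teardown, probe]) is False and
has_lock_cycle(same_dev) is True, so the conclusion rests entirely on the
two device locks really being different instances.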

>   }
>   
>   void mlx5_lag_enable_change(struct mlx5_core_dev *dev)


^ permalink raw reply

* Re: [PATCH net-next V3 13/15] net/mlx5: E-Switch, Tie rep load/unload to SD LAG state
From: Shay Drori @ 2026-06-14 16:44 UTC (permalink / raw)
  To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, David S. Miller
  Cc: Saeed Mahameed, Leon Romanovsky, Mark Bloch, Or Har-Toov,
	Edward Srouji, Maher Sanalla, Simon Horman, Gerd Bayer, Kees Cook,
	Moshe Shemesh, Parav Pandit, Patrisious Haddad, netdev,
	linux-rdma, linux-kernel, Gal Pressman
In-Reply-To: <20260612113904.537595-14-tariqt@nvidia.com>



On 12/06/2026 14:39, Tariq Toukan wrote:
> From: Shay Drory <shayd@nvidia.com>
> 
> On an SD device, vport representors are not functional until the SD
> group is combined and shared FDB is active. Skip the initial load and
> the reload paths in that window; reps are loaded as part of the SD LAG
> activation flow once it becomes active.
> 
> In addition, explicitly unload representors when SD LAG is destroyed.
> 
> Signed-off-by: Shay Drory <shayd@nvidia.com>
> Reviewed-by: Mark Bloch <mbloch@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>   .../net/ethernet/mellanox/mlx5/core/eswitch.h |  4 +++
>   .../mellanox/mlx5/core/eswitch_offloads.c     | 26 +++++++++++++++++++
>   .../net/ethernet/mellanox/mlx5/core/lag/lag.c | 26 +++++++++++++++++++
>   .../net/ethernet/mellanox/mlx5/core/lag/lag.h |  1 +
>   .../mellanox/mlx5/core/lag/shared_fdb.c       |  1 +
>   5 files changed, 58 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> index a5f0774834fe..b2b3150f1f04 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> @@ -959,6 +959,7 @@ int mlx5_eswitch_offloads_single_fdb_add_one(struct mlx5_eswitch *master_esw,
>   void mlx5_eswitch_offloads_single_fdb_del_one(struct mlx5_eswitch *master_esw,
>   					      struct mlx5_eswitch *slave_esw);
>   int mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw);
> +void mlx5_eswitch_unload_reps(struct mlx5_eswitch *esw);
>   bool mlx5_eswitch_is_peer(struct mlx5_eswitch *esw,
>   			  struct mlx5_eswitch *peer_esw);
>   
> @@ -1063,6 +1064,9 @@ mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw)
>   	return 0;
>   }
>   
> +static inline void
> +mlx5_eswitch_unload_reps(struct mlx5_eswitch *esw) {}
> +
>   static inline bool
>   mlx5_eswitch_block_encap(struct mlx5_core_dev *dev, bool from_fdb)
>   {
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index a24719cfba34..4dc190a4e7b2 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -2863,6 +2863,10 @@ static int mlx5_esw_offloads_rep_load(struct mlx5_eswitch *esw, u16 vport_num)
>   	int rep_type;
>   	int err;
>   
> +	if (vport_num != MLX5_VPORT_UPLINK &&
> +	    mlx5_get_sd(esw->dev) && !mlx5_lag_is_active(esw->dev))
> +		return 0;

sashiko.dev says:
If we intentionally skip the initial load of the Ethernet representors
(REP_ETH) here, will secondary devices permanently fail to load them later?
Looking at mlx5_lag_shared_fdb_create(), the code triggers an asynchronous
rescan exclusively for the primary device (dev0):
     mlx5_lag_rescan_dev_locked(ldev, dev0, true);
     err = mlx5_lag_reload_ib_reps_from_locked(ldev, 0, filter, false);
Because secondary devices are not rescanned, their auxiliary eth-rep drivers
are never re-probed, and their asynchronous reload is never queued. Since the
synchronous reload depends on REP_ETH already being loaded, skipping it here
seems to leave secondary devices permanently without network interfaces.

[SD] The primary will invoke mlx5_esw_add_work() for secondary devices as
well via mlx5_eswitch_register_vport_reps_nested().
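
For reference, the gating condition this patch adds can be written as a
small predicate. This is a sketch of the quoted checks only: UPLINK is a
placeholder rather than the real MLX5_VPORT_UPLINK value, and is_sd and
lag_active stand in for mlx5_get_sd() and mlx5_lag_is_active().

```python
UPLINK = "uplink"  # placeholder for MLX5_VPORT_UPLINK, not the real value


def skip_rep_load(vport, is_sd, lag_active):
    """Per-vport check quoted in mlx5_esw_offloads_rep_load()."""
    return vport != UPLINK and is_sd and not lag_active


def skip_reload_all(is_sd, lag_active):
    """Whole-eswitch check quoted in mlx5_eswitch_reload_reps_blocked()."""
    return is_sd and not lag_active
```

The uplink rep is always exempt from the per-vport check, and non-SD
devices are never affected by either check.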

> +
>   	rep = mlx5_eswitch_get_rep(esw, vport_num);
>   	for (rep_type = 0; rep_type < NUM_REP_TYPES; rep_type++) {
>   		err = __esw_offloads_load_rep(esw, rep, rep_type,
> @@ -3779,6 +3783,21 @@ static void esw_destroy_offloads_acl_tables(struct mlx5_eswitch *esw)
>   		esw_vport_destroy_offloads_acl_tables(esw, vport);
>   }
>   
> +void mlx5_eswitch_unload_reps(struct mlx5_eswitch *esw)
> +{
> +	struct mlx5_eswitch_rep *rep;
> +	unsigned long i;
> +
> +	if (!esw || esw->mode != MLX5_ESWITCH_OFFLOADS)
> +		return;
> +
> +	mlx5_esw_for_each_rep(esw, i, rep) {
> +		if (rep->vport == MLX5_VPORT_UPLINK)
> +			continue;
> +		mlx5_esw_offloads_rep_unload(esw, rep->vport);
> +	}
> +}
> +
>   int mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw)
>   {
>   	struct mlx5_eswitch_rep *rep;
> @@ -3805,6 +3824,10 @@ int mlx5_eswitch_reload_ib_reps(struct mlx5_eswitch *esw)
>   		if (!mlx5_sd_is_primary(esw->dev) &&
>   		    rep->vport == MLX5_VPORT_UPLINK)
>   			continue;
> +		if (rep->vport != MLX5_VPORT_UPLINK &&
> +		    mlx5_get_sd(esw->dev) && !mlx5_lag_is_active(esw->dev))
> +			continue;
> +

sashiko.dev says:
Is there a race condition here during SD LAG activation that bypasses the
synchronous load of primary device representors?
In mlx5_lag_shared_fdb_create(), the unbind/rebind of auxiliary drivers for
the primary device queues an asynchronous work item to load REP_ETH.
Immediately following this, mlx5_lag_reload_ib_reps_from_locked() executes
synchronously.
Because the asynchronous work hasn't run yet, REP_ETH is not loaded.
Consequently, this synchronous loop will evaluate the REP_LOADED check as
false and silently skip loading REP_IB.

[SD] The async reload loads both REP_ETH and REP_IB for VF/SF. The
synchronous reload_ib_reps only re-adds IB for reps whose ETH is already
loaded; skipping IB when ETH isn't up yet is not a loss — the async path
loads both. No race.


>   		if (atomic_read(&rep->rep_data[REP_ETH].state) == REP_LOADED)
>   			__esw_offloads_load_rep(esw, rep, REP_IB, NULL);
>   	}
> @@ -4764,6 +4787,9 @@ static void mlx5_eswitch_reload_reps_blocked(struct mlx5_eswitch *esw)
>   		return;
>   	}
>   
> +	if (mlx5_get_sd(esw->dev) && !mlx5_lag_is_active(esw->dev))
> +		return;
> +
>   	mlx5_esw_for_each_vport(esw, i, vport) {
>   		if (!vport)
>   			continue;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> index 424478e649ef..28d16fdc3f06 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> @@ -1312,6 +1312,32 @@ int mlx5_lag_reload_ib_reps_from_locked(struct mlx5_lag *ldev, u32 flags,
>   	return mlx5_lag_reload_ib_reps(ldev, flags, filter, cont_on_fail);
>   }
>   
> +static void mlx5_lag_unload_reps_unlocked(struct mlx5_lag *ldev, u32 filter)
> +{
> +	struct lag_func *pf;
> +	int i;
> +
> +	mlx5_lag_for_each(i, 0, ldev, filter) {
> +		struct mlx5_eswitch *esw;
> +
> +		pf = mlx5_lag_pf(ldev, i);
> +		esw = pf->dev->priv.eswitch;
> +		mlx5_esw_reps_block(esw);
> +		mlx5_eswitch_unload_reps(esw);
> +		mlx5_esw_reps_unblock(esw);
> +	}
> +}
> +
> +void mlx5_lag_unload_reps_from_locked(struct mlx5_lag *ldev, u32 filter)
> +{
> +	/* Same lock dance as mlx5_lag_reload_ib_reps: drop ldev->lock around
> +	 * the per-eswitch reps_lock to keep the reps_lock -> ldev->lock order.
> +	 */
> +	mlx5_lag_drop_lock_for_reps(ldev, filter);
> +	mlx5_lag_unload_reps_unlocked(ldev, filter);
> +	mlx5_lag_retake_lock_after_reps(ldev);
> +}
> +
>   void mlx5_disable_lag(struct mlx5_lag *ldev)
>   {
>   	bool shared_fdb = test_bit(MLX5_LAG_MODE_FLAG_SHARED_FDB, &ldev->mode_flags);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
> index 8481ce55c10a..e9f0ef83ce1d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.h
> @@ -310,6 +310,7 @@ int mlx5_lag_num_devs(struct mlx5_lag *ldev);
>   int mlx5_lag_num_netdevs(struct mlx5_lag *ldev);
>   int mlx5_lag_reload_ib_reps_from_locked(struct mlx5_lag *ldev, u32 flags,
>   					u32 filter, bool cont_on_fail);
> +void mlx5_lag_unload_reps_from_locked(struct mlx5_lag *ldev, u32 filter);
>   int mlx5_ldev_add_mdev(struct mlx5_lag *ldev, struct mlx5_core_dev *dev,
>   		       u32 group_id);
>   void mlx5_ldev_remove_mdev(struct mlx5_lag *ldev, struct mlx5_core_dev *dev);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
> index 8d4f2903a101..113866494d16 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/shared_fdb.c
> @@ -296,6 +296,7 @@ void mlx5_lag_shared_fdb_destroy(struct mlx5_lag *ldev, u32 group_id)
>   			pf->sd_fdb_active = false;
>   		}
>   		mlx5_lag_destroy_single_fdb_filter(ldev, group_id);
> +		mlx5_lag_unload_reps_from_locked(ldev, filter);

sashiko.dev says:
Does explicitly unloading all representors here render the subsequent IB
reload a dead-code no-op?
Immediately after mlx5_lag_unload_reps_from_locked() forcefully unloads
REP_ETH and other representors, this function calls
mlx5_lag_reload_ib_reps_from_locked().
Because REP_ETH was just unloaded, the condition checking if the state is
REP_LOADED inside mlx5_eswitch_reload_ib_reps() will evaluate to false,
silently skipping all IB representors.

[SD] This is intended.

>   	}
>   
>   	mlx5_lag_add_devices_filter(ldev, filter);


^ permalink raw reply

* Re: [PATCH net-next V3 14/15] net/mlx5: SD, defer vport metadata init until SD is ready
From: Shay Drori @ 2026-06-14 16:44 UTC (permalink / raw)
  To: Tariq Toukan, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, David S. Miller
  Cc: Saeed Mahameed, Leon Romanovsky, Mark Bloch, Or Har-Toov,
	Edward Srouji, Maher Sanalla, Simon Horman, Gerd Bayer, Kees Cook,
	Moshe Shemesh, Parav Pandit, Patrisious Haddad, netdev,
	linux-rdma, linux-kernel, Gal Pressman
In-Reply-To: <20260612113904.537595-15-tariqt@nvidia.com>



On 12/06/2026 14:39, Tariq Toukan wrote:
> From: Shay Drory <shayd@nvidia.com>
> 
> Allow SD devices to transition to switchdev before the SD group is
> fully up. Metadata allocation requires the SD group to be ready, so
> defer it from esw_offloads_enable() until SD shared-FDB activation.
> 
> Add mlx5_esw_offloads_init_deferred_metadata() which allocates per-vport
> metadata and refreshes the ingress ACLs that were previously programmed
> with metadata=0. The helper is idempotent and can be called multiple
> times.
> 
> Signed-off-by: Shay Drory <shayd@nvidia.com>
> Reviewed-by: Mark Bloch <mbloch@nvidia.com>
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
> ---
>   .../net/ethernet/mellanox/mlx5/core/eswitch.h |  1 +
>   .../mellanox/mlx5/core/eswitch_offloads.c     | 79 ++++++++++++++++++-
>   .../net/ethernet/mellanox/mlx5/core/lib/sd.c  | 16 ++++
>   3 files changed, 93 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> index b2b3150f1f04..fea72b1dedab 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
> @@ -440,6 +440,7 @@ struct mlx5_eswitch {
>   
>   void esw_offloads_disable(struct mlx5_eswitch *esw);
>   int esw_offloads_enable(struct mlx5_eswitch *esw);
> +int mlx5_esw_offloads_init_deferred_metadata(struct mlx5_eswitch *esw);
>   void esw_offloads_cleanup(struct mlx5_eswitch *esw);
>   int esw_offloads_init(struct mlx5_eswitch *esw);
>   
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index 4dc190a4e7b2..8fa7e633451c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -3675,6 +3675,7 @@ static void esw_offloads_vport_metadata_cleanup(struct mlx5_eswitch *esw,
>   
>   	WARN_ON(vport->metadata != vport->default_metadata);
>   	mlx5_esw_match_metadata_free(esw, vport->default_metadata);
> +	vport->default_metadata = 0;
>   }
>   
>   static void esw_offloads_metadata_uninit(struct mlx5_eswitch *esw)
> @@ -3711,6 +3712,73 @@ static int esw_offloads_metadata_init(struct mlx5_eswitch *esw)
>   	return err;
>   }
>   
> +/* Deferred metadata init for SD devices: allocate vport metadata and
> + * refresh the ingress ACL for every vport whose ACL was created with
> + * metadata=0 in esw_create_offloads_acl_tables() / esw_vport_setup().
> + *
> + * No Rep is loaded at this point ==> no Rep net-dev exists, so no need
> + * to take rtnl lock.
> + *
> + * Safe to call multiple times - subsequent calls are no-ops.
> + */
> +int mlx5_esw_offloads_init_deferred_metadata(struct mlx5_eswitch *esw)
> +{
> +	struct mlx5_vport *manager, *vport;
> +	unsigned long i;
> +	int err;
> +
> +	if (!mlx5_eswitch_vport_match_metadata_enabled(esw))
> +		return 0;
> +
> +	manager = mlx5_eswitch_get_vport(esw, esw->manager_vport);
> +	if (IS_ERR(manager))
> +		return PTR_ERR(manager);
> +
> +	/* Sanity check: skip if metadata was already initialized */
> +	if (manager->default_metadata)
> +		return 0;
> +
> +	err = esw_offloads_metadata_init(esw);

sashiko.dev says:

Does unconditionally allocating new metadata here overwrite any dynamically
assigned vport metadata?

[SD] No. No reps are loaded until SD LAG is created, so the user can't
bond any reps.
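
The "safe to call multiple times" property of the quoted helper appears
to rest on two things visible in the hunks: the early return when the
manager vport's default_metadata is already set, and the cleanup path now
resetting default_metadata to 0 so a failed attempt can be retried. A
minimal model of that pattern follows; the class, the fail_acl switch and
the counters are hypothetical and exist only to exercise the guard.

```python
class Esw:
    """Toy model of the deferred metadata state; not the driver structure."""

    def __init__(self, vports):
        self.default_metadata = {v: 0 for v in vports}  # 0 = not allocated
        self._next_id = 1
        self.acl_updates = 0

    def metadata_init(self):
        """Allocate a non-zero metadata value for every vport."""
        for vport in self.default_metadata:
            self.default_metadata[vport] = self._next_id
            self._next_id += 1

    def metadata_uninit(self):
        """Free the metadata and reset it to 0 so the guard works again."""
        for vport in self.default_metadata:
            self.default_metadata[vport] = 0

    def init_deferred_metadata(self, manager, fail_acl=False):
        """Allocate metadata once; later calls return early."""
        if self.default_metadata[manager]:
            return 0  # already initialized
        self.metadata_init()
        if fail_acl:  # simulate an ACL refresh failure
            self.metadata_uninit()
            return -1
        self.acl_updates += len(self.default_metadata)
        return 0
```

In this model a second successful call changes nothing, and a call after
a simulated failure starts from a clean state instead of returning early
with stale values.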

> +	if (err)
> +		return err;
> +
> +	mutex_lock(&esw->state_lock);
> +	/* Manager vport doesn't have a rep/netdev loaded but its ingress ACL
> +	 * was programmed with metadata=0 - refresh it explicitly.
> +	 */
> +	err = mlx5_esw_acl_ingress_vport_metadata_update(esw,
> +							 esw->manager_vport,
> +							 0);
> +	if (err)
> +		goto err_acl;
> +
> +	/* UPLINK is never marked enabled but its ACL is programmed in
> +	 * esw_create_offloads_acl_tables(); refresh it explicitly.
> +	 */
> +	err = mlx5_esw_acl_ingress_vport_metadata_update(esw, MLX5_VPORT_UPLINK,
> +							 0);

sashiko.dev says:
Is the rtnl_lock required during these deferred ACL updates?
The code comment assumes no representor netdev exists at this point. However,
for the primary SD device, representor netdevs were loaded and exposed to
user-space much earlier when its switchdev transition completed.

[SD] Same as above: no REP netdev exists at this point.

> +	if (err)
> +		goto err_acl;
> +
> +	mlx5_esw_for_each_vport(esw, i, vport) {
> +		if (!vport || !vport->enabled)
> +			continue;
> +		err = mlx5_esw_acl_ingress_vport_metadata_update(esw,
> +								 vport->vport,
> +								 0);
> +		if (err)
> +			goto err_acl;
> +	}
> +
> +	mutex_unlock(&esw->state_lock);
> +	return 0;
> +
> +err_acl:
> +	esw_offloads_metadata_uninit(esw);
> +	mutex_unlock(&esw->state_lock);
> +	return err;
> +}
> +
>   int
>   esw_vport_create_offloads_acl_tables(struct mlx5_eswitch *esw,
>   				     struct mlx5_vport *vport)
> @@ -4072,9 +4140,14 @@ int esw_offloads_enable(struct mlx5_eswitch *esw)
>   	if (err)
>   		goto err_roce;
>   
> -	err = esw_offloads_metadata_init(esw);
> -	if (err)
> -		goto err_metadata;
> +	/* SD devices defer metadata init until SD is ready and
> +	 * mlx5_sd_pf_num_get() can return the correct pf_num.
> +	 */
> +	if (!mlx5_get_sd(esw->dev)) {
> +		err = esw_offloads_metadata_init(esw);

sashiko.dev says:
Does skipping metadata initialization here prematurely expose representors
with uninitialized metadata?
esw_offloads_enable() skips metadata initialization for SD devices, leaving
vport->metadata = 0. Later in this function, representor netdevs are loaded
and exposed to userspace, which allows userspace to add TC rules matching
metadata=0 before the metadata is actually initialized.

[SD] same as above

> +		if (err)
> +			goto err_metadata;
> +	}
>   
>   	err = esw_set_passing_vport_metadata(esw, true);
>   	if (err)
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> index b35795bac098..2fcccd329eb5 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c
> @@ -992,6 +992,7 @@ static bool mlx5_sd_all_paired(struct mlx5_core_dev *primary)
>   static void mlx5_sd_activate_shared_fdb(struct mlx5_core_dev *primary)
>   {
>   	struct mlx5_sd *sd = mlx5_get_sd(primary);
> +	struct mlx5_core_dev *pos;
>   	struct mlx5_lag *ldev;
>   	struct lag_func *pf;
>   	int err;
> @@ -1024,6 +1025,21 @@ static void mlx5_sd_activate_shared_fdb(struct mlx5_core_dev *primary)
>   		goto unlock;
>   	}
>   
> +	/* Initialize vport metadata for all group devices. This is deferred
> +	 * from esw_offloads_enable() because mlx5_sd_pf_num_get() requires
> +	 * the SD group to be ready.
> +	 */
> +	mlx5_sd_for_each_dev(i, primary, pos) {
> +		struct mlx5_eswitch *esw = pos->priv.eswitch;
> +
> +		err = mlx5_esw_offloads_init_deferred_metadata(esw);
> +		if (err) {
> +			sd_warn(primary, "Failed to init metadata for %s: %d\n",
> +				dev_name(pos->device), err);
> +			goto unlock;
> +		}
> +	}
> +
>   	err = mlx5_lag_shared_fdb_create(ldev, NULL, 0, sd->group_id);
>   	if (err)
>   		sd_warn(primary, "Failed to create shared FDB: %d\n", err);


^ permalink raw reply

* [syzbot] [wireless?] WARNING in __ieee80211_start_scan (2)
From: syzbot @ 2026-06-14 16:49 UTC (permalink / raw)
  To: johannes, linux-kernel, linux-wireless, netdev, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    627366c51145 ptp: ocp: fix resource freeing order
git tree:       net
console output: https://syzkaller.appspot.com/x/log.txt?x=1114f186580000
kernel config:  https://syzkaller.appspot.com/x/.config?x=65472e27d1590a04
dashboard link: https://syzkaller.appspot.com/bug?extid=f961b9f94edbc266f1f8
compiler:       Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/fdf7eb944feb/disk-627366c5.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/ab07a79e10f6/vmlinux-627366c5.xz
kernel image: https://storage.googleapis.com/syzbot-assets/270e46d829a1/bzImage-627366c5.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+f961b9f94edbc266f1f8@syzkaller.appspotmail.com

------------[ cut here ]------------
!ieee80211_prep_hw_scan(sdata)
WARNING: net/mac80211/scan.c:879 at __ieee80211_start_scan+0x1336/0x1d40 net/mac80211/scan.c:879, CPU#0: syz.0.5003/24116
Modules linked in:
CPU: 0 UID: 0 PID: 24116 Comm: syz.0.5003 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
RIP: 0010:__ieee80211_start_scan+0x1336/0x1d40 net/mac80211/scan.c:879
Code: 06 90 e8 bd 74 b0 f6 48 83 fd 09 0f 84 41 07 00 00 83 fd 03 0f 84 3f 07 00 00 e8 25 6f b0 f6 e9 d5 f2 ff ff e8 1b 6f b0 f6 90 <0f> 0b 90 e9 0d fe ff ff 89 d9 80 e1 07 80 c1 03 38 c1 0f 8c 53 fa
RSP: 0018:ffffc9000674f170 EFLAGS: 00010293
RAX: ffffffff8b154795 RBX: ffff888053bf8e40 RCX: ffff888079661f00
RDX: 0000000000000000 RSI: 00000000fffffff4 RDI: 0000000000000000
RBP: ffff88804acc3024 R08: ffffffff903034f7 R09: 1ffffffff206069e
R10: dffffc0000000000 R11: fffffbfff206069f R12: ffff88804acc3060
R13: dffffc0000000000 R14: ffff88805c5a0f20 R15: ffff88805c5a2a88
FS:  00007fe61492d6c0(0000) GS:ffff8881252a0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe613987cc0 CR3: 0000000075924000 CR4: 00000000003526f0
Call Trace:
 <TASK>
 rdev_scan+0x147/0x300 net/wireless/rdev-ops.h:467
 nl80211_trigger_scan+0x1aa1/0x1f50 net/wireless/nl80211.c:11046
 genl_family_rcv_msg_doit+0x22a/0x330 net/netlink/genetlink.c:1114
 genl_family_rcv_msg net/netlink/genetlink.c:1194 [inline]
 genl_rcv_msg+0x61c/0x7a0 net/netlink/genetlink.c:1209
 netlink_rcv_skb+0x232/0x4b0 net/netlink/af_netlink.c:2555
 genl_rcv+0x28/0x40 net/netlink/genetlink.c:1218
 netlink_unicast_kernel net/netlink/af_netlink.c:1318 [inline]
 netlink_unicast+0x75c/0x8e0 net/netlink/af_netlink.c:1344
 netlink_sendmsg+0x813/0xb40 net/netlink/af_netlink.c:1899
 sock_sendmsg_nosec net/socket.c:787 [inline]
 __sock_sendmsg net/socket.c:802 [inline]
 ____sys_sendmsg+0x972/0x9f0 net/socket.c:2699
 ___sys_sendmsg+0x2a5/0x360 net/socket.c:2753
 __sys_sendmsg net/socket.c:2785 [inline]
 __do_sys_sendmsg net/socket.c:2790 [inline]
 __se_sys_sendmsg net/socket.c:2788 [inline]
 __x64_sys_sendmsg+0x1bd/0x2a0 net/socket.c:2788
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x174/0x580 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fe61399ce59
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fe61492d028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007fe613c15fa0 RCX: 00007fe61399ce59
RDX: 0000000004000000 RSI: 0000200000000900 RDI: 0000000000000003
RBP: 00007fe61492d090 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 00007fe613c16038 R14: 00007fe613c15fa0 R15: 00007fff3acb29d8
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite the report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply

* [PATCH stable 6.6.y v3 0/4] bpf: linked scalar precision fixes
From: Zhenzhong Wu @ 2026-06-14 16:58 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kernel, ast, daniel, john.fastabend, andrii,
	martin.lau, song, yonghong.song, kpsingh, haoluo, jolsa,
	menglong8.dong, eddyz87, shung-hsi.yu, stable, mykolal, tamird

Hi,

This v3 targets 6.6.y and changes the backport strategy based on review
feedback on v2.

The original observed failure was found in Rust/Aya-generated eBPF
around helper calls. Rust match lowering can keep a helper return value
and a scalar filled through a by-reference helper argument in the same
enum-style control flow. That makes it easy for the verifier-visible
scalar values to become linked by scalar id.

The relevant verifier-log bytecode from the original reproducer is
below. The later instructions only store r7 into a map so user space can
observe which branch the verifier kept.

  15: (85) call bpf_get_func_ret#184    ; R0_w=scalar() fp-8_w=mmmmmmmm
  16: (79) r7 = *(u64 *)(r10 -8)        ; R7_w=scalar() R10=fp0
  17: (15) if r0 == 0x0 goto pc+1       ; R0_w=scalar()
  18: (bf) r7 = r0                      ; R0=scalar(id=1) R7=scalar(id=1)
  19: (55) if r0 != 0x0 goto pc+6       ; R0=0
  20: (67) r7 <<= 32                    ; R7_w=0
  21: (77) r7 >>= 32                    ; R7_w=0
  22: (b7) r1 = 1                       ; R1_w=1
  23: (55) if r7 != 0xf goto pc+1

The important verifier state shape is:

  1. The program checks "if r0 == 0". The jump target is the success
     path, and the fallthrough path is the failure path.

  2. On the failure path, "r7 = r0" gives r0 and r7 the same scalar id.
     The real success path skips that assignment, so r7 is independent
     there.

  3. At the later "if r0 != 0" check, an affected verifier can explore
     an impossible continuation where r0 is zero and r7 is narrowed
     through the shared scalar id as well.

  4. That impossible continuation reaches the return-value comparison
     with the wrong r7 value. When the real success path is analyzed
     later, state pruning can consider it safe against the earlier
     cached verifier state and skip the real continuation.
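
The state shape in steps 1-3 can be illustrated with a toy model. This is
only a sketch under simplifying assumptions, not verifier code: struct
toy_reg, toy_mov() and toy_narrow() are hypothetical names, and the model
tracks nothing but a scalar id and an optional known value per register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NREGS 11

/* Toy scalar state: id == 0 means "not linked to any other register". */
struct toy_reg {
	uint32_t id;
	bool known;
	uint64_t value;
};

/* Models "r_dst = r_src": both registers end up sharing one non-zero id. */
void toy_mov(struct toy_reg *regs, int dst, int src, uint32_t *id_gen)
{
	assert(dst >= 0 && dst < TOY_NREGS && src >= 0 && src < TOY_NREGS);
	if (!regs[src].known && regs[src].id == 0)
		regs[src].id = ++*id_gen;
	regs[dst] = regs[src];
}

/* Models a branch that proved regs[idx] == value: the knowledge is
 * copied to every register that shares the same non-zero id.
 */
void toy_narrow(struct toy_reg *regs, int idx, uint64_t value)
{
	uint32_t id;
	int i;

	assert(idx >= 0 && idx < TOY_NREGS);
	id = regs[idx].id;
	regs[idx].known = true;
	regs[idx].value = value;
	if (id == 0)
		return;
	for (i = 0; i < TOY_NREGS; i++) {
		if (regs[i].id == id) {
			regs[i].known = true;
			regs[i].value = value;
		}
	}
}
```

In this model, a path that runs toy_mov(regs, 7, 0, &id_gen) before
toy_narrow(regs, 0, 0) ends with r7 known to be 0, while a path that skips
the move leaves r7 unknown. That difference is what makes the two states
non-equivalent for pruning.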

The root cause is verifier scalar state tracking, not helper-specific
behavior. A helper return value in r0 and another scalar can become
linked by scalar id on one branch. The real success path can skip that
assignment, so the two verifier states are not equivalent.

The relevant pruning point is that regsafe()/states_equal() accepted
the real success-path state against an earlier cached state where r0 was
an imprecise scalar and r7 constraints were loose enough to cover the
current r7. In the impossible path, r0 and r7 are linked by scalar id
after instruction 18. In the real success path, instruction 18 is
skipped and that scalar-id relationship does not exist. These states
should therefore not be treated as equivalent for pruning.

The upstream linked-scalar precision series fixes that root cause by
recording, in jmp_history, which linked registers were synchronized at
each conditional jump and by using that per-instruction history during
precision backtracking. This covers both the original r0 == 0 /
r0 != 0 shape and the r0 == 1 / r0 != 1 shape used by the separate
runtime selftest.

A Rust/Aya-specific runtime reproducer/selftest discussed in the v2
thread has been submitted separately to bpf-next for review:

  https://lore.kernel.org/bpf/20260611160749.391279-1-jt26wzz@gmail.com/

That reproducer keeps the same helper-return/control-flow shape but
shifts the success value to 1 before branching. This avoids depending
on the not-equal-zero refinement and exercises linked scalar precision
during state pruning directly. It uses bpf_skb_load_bytes() in the
normal tc test-run path and does not require fexit attach or
bpf_get_func_ret(). It is not included in this stable series because
per review feedback it should go through bpf-next first before being
considered for stable.

Targeted results for that separate helper-status runtime reproducer are:

  v6.6.142 + reproducer:                                  FAIL
  v6.6.142 + v2 d028/9e backport path + reproducer:      FAIL
  v6.6.142 + this linked-scalars series + reproducer:    PASS
  bpf-next + reproducer:                                  PASS

Changes since v2:
  - update the subject from the v2 not-equal title to reflect the
    linked-scalar precision backport used in this version;
  - replace the d028f87517d6/9e314f5d8682 backport path with the full
    upstream linked-scalar precision-tracking series suggested during
    review;
  - drop the custom Rust/Aya selftest from the stable series and point
    to the separate bpf-next review instead;
  - adapt the linked_regs_broken_link_2 selftest log expectations for
    6.6.y, where the verifier does not derive the same non-constant
    JMP_X scalar-vs-scalar range used by the upstream log check;
  - keep 6.6.y as the first stable target and document that older LTS
    trees need separate adaptations.

v2:
  https://lore.kernel.org/r/20260607170959.823755-1-jt26wzz@gmail.com/

RFC v1:
  https://lore.kernel.org/r/20260601180400.1381736-1-jt26wzz@gmail.com/

Backport details:

This series is based on v6.6.142 / stable/linux-6.6.y commit
924b4a879cbb ("Linux 6.6.142"). I would like it applied to 6.6.y first.
The same issue is reproducible on 6.1.y, 5.15.y, and 5.10.y, but those
trees need separate older-layout adaptations.

Instead of backporting the d028f87517d6 not-equal refinement plus the
9e314f5d8682 range-combining prerequisite, this series backports the
full upstream linked-scalar precision-tracking series:

  4bf79f9be434
    bpf: Track equal scalars history on per-instruction level
  842edb5507a1
    bpf: Remove mark_precise_scalar_ids()
  bebc17b1c03b
    selftests/bpf: Tests for per-insn sync_linked_regs() precision
    tracking
  cfbf25481d6d
    selftests/bpf: Update comments find_equal_scalars->sync_linked_regs

Upstream series:
  https://lore.kernel.org/r/20240718202357.1746514-1-eddyz87@gmail.com/

Patches 1 and 2 are the verifier changes from that upstream series. The
main 6.6.y-specific verifier adaptation is in patch 1: 6.6.y does not
yet have the newer BPF_ADD_CONST scalar-id representation, so
sync_linked_regs() is adapted to the older scalar-id layout. Patch 2
then follows on top of that adapted layout.

Patches 3 and 4 bring the upstream verifier selftests and comment
updates. Patch 3 has one 6.6.y-specific log adaptation:
linked_regs_broken_link_2 keeps the "div by zero" reject check, but
drops the upstream mark_precise log expectations because 6.6.y does not
derive the scalar-vs-scalar range for that non-constant JMP_X
comparison. Patch 4 only updates the two pre-existing comments that are
present in 6.6.y.

Relevant QEMU selftest results on 6.6.y with this backport:

  verifier_scalar_ids passed all 18 subtests, including the newly
  backported linked-scalar precision tests and the related
  check_ids_in_regsafe tests.

Thanks to Shung-Hsi Yu for reviewing v2 and suggesting the upstream
linked-scalar precision series as the preferred backport direction.

Eduard Zingerman (4):
  bpf: Track equal scalars history on per-instruction level
  bpf: Remove mark_precise_scalar_ids()
  selftests/bpf: Tests for per-insn sync_linked_regs() precision
    tracking
  selftests/bpf: Update comments find_equal_scalars->sync_linked_regs

 include/linux/bpf_verifier.h                  |   4 +
 kernel/bpf/verifier.c                         | 367 +++++++++++-------
 .../selftests/bpf/progs/verifier_scalar_ids.c | 253 ++++++++----
 .../selftests/bpf/progs/verifier_spill_fill.c |   4 +-
 .../bpf/progs/verifier_subprog_precision.c    |   2 +-
 .../testing/selftests/bpf/verifier/precise.c  |   2 +-
 6 files changed, 417 insertions(+), 215 deletions(-)

base-commit: 924b4a879cbb75aef37c160b955b92f6894b11a4
-- 
2.43.0

^ permalink raw reply

* [PATCH stable 6.6.y v3 1/4] bpf: Track equal scalars history on per-instruction level
From: Zhenzhong Wu @ 2026-06-14 16:58 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kernel, ast, daniel, john.fastabend, andrii,
	martin.lau, song, yonghong.song, kpsingh, haoluo, jolsa,
	menglong8.dong, eddyz87, shung-hsi.yu, stable, mykolal, tamird,
	Hao Sun
In-Reply-To: <cover.1781194510.git.jt26wzz@gmail.com>

From: Eduard Zingerman <eddyz87@gmail.com>

[ Upstream commit 4bf79f9be434e000c8e12fe83b2f4402480f1460 ]

Use bpf_verifier_state->jmp_history to track which registers were
updated by find_equal_scalars() (renamed to collect_linked_regs())
when conditional jump was verified. Use recorded information in
backtrack_insn() to propagate precision.

E.g. for the following program:

            while verifying instructions
  1: r1 = r0              |
  2: if r1 < 8  goto ...  | push r0,r1 as linked registers in jmp_history
  3: if r0 > 16 goto ...  | push r0,r1 as linked registers in jmp_history
  4: r2 = r10             |
  5: r2 += r0             v mark_chain_precision(r0)

            while doing mark_chain_precision(r0)
  5: r2 += r0             | mark r0 precise
  4: r2 = r10             |
  3: if r0 > 16 goto ...  | mark r0,r1 as precise
  2: if r1 < 8  goto ...  | mark r0,r1 as precise
  1: r1 = r0              v
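
The backward pass above can be modelled in user space roughly as follows.
This is a simplified sketch, not the kernel implementation: it covers a
single frame, represents registers as bits in a mask (bit N stands for rN),
and models only the linked-register step of backtracking. The names
toy_sync_linked() and toy_backtrack() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* If any register recorded as linked at an instruction is already
 * marked precise, mark every linked register precise.
 */
uint32_t toy_sync_linked(uint32_t precise, uint32_t linked)
{
	if (precise & linked)
		precise |= linked;
	return precise;
}

/* Walk the recorded history from the last instruction to the first.
 * linked_hist[i] is the mask of linked registers recorded at
 * instruction i, or 0 if nothing was recorded there.
 */
uint32_t toy_backtrack(uint32_t precise, const uint32_t *linked_hist, int n)
{
	int i;

	assert(n >= 0);
	for (i = n - 1; i >= 0; i--)
		precise = toy_sync_linked(precise, linked_hist[i]);
	return precise;
}
```

For the five-instruction example, the history would be {0, r0|r1, r0|r1, 0, 0}
and the walk starts with only r0 marked; r1 is picked up when the walk
reaches the conditional jump at instruction 3.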

Technically, do this as follows:
- Use 10 bits to identify each register that gains range because of
  sync_linked_regs():
  - 3 bits for frame number;
  - 6 bits for register or stack slot number;
  - 1 bit to indicate if register is spilled.
- Use u64 as a vector of 6 such records + 4 bits for vector length.
- Augment struct bpf_jmp_history_entry with a field 'linked_regs'
  representing such vector.
- When doing check_cond_jmp_op() remember up to 6 registers that
  gain range because of sync_linked_regs() in such a vector.
- Don't propagate range information and reset IDs for registers that
  don't fit in 6-value vector.
- Push a pair {instruction index, linked registers vector}
  to bpf_verifier_state->jmp_history.
- When doing backtrack_insn() check if any of recorded linked
  registers is currently marked precise, if so mark all linked
  registers as precise.
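
The bit layout described in the list above can be sketched in user space
as shown below. The sketch mirrors the 3 + 6 + 1 bit record and the 4-bit
length field, which together fill a 64-bit word exactly (6 * 10 + 4 = 64).
It uses <stdint.h> types instead of kernel types, and the field name
spi_or_reg is an assumption made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LR_FRAMENO_BITS 3
#define LR_SPI_BITS     6
#define LR_ENTRY_BITS   (LR_FRAMENO_BITS + LR_SPI_BITS + 1)
#define LR_SIZE_BITS    4
#define LINKED_REGS_MAX 6

struct linked_reg {
	uint8_t frameno;    /* call frame number, 0..7 */
	uint8_t spi_or_reg; /* register or stack slot number, 0..63 */
	bool is_reg;        /* true: register, false: spilled stack slot */
};

struct linked_regs {
	int cnt;
	struct linked_reg entries[LINKED_REGS_MAX];
};

/* Pack up to six 10-bit records plus a 4-bit count into one u64. */
uint64_t linked_regs_pack(const struct linked_regs *s)
{
	uint64_t val = 0;
	int i;

	assert(s->cnt >= 0 && s->cnt <= LINKED_REGS_MAX);
	for (i = 0; i < s->cnt; i++) {
		const struct linked_reg *e = &s->entries[i];
		uint64_t rec = 0;

		rec |= (uint64_t)(e->frameno & 0x7);
		rec |= (uint64_t)(e->spi_or_reg & 0x3f) << LR_FRAMENO_BITS;
		rec |= (uint64_t)(e->is_reg ? 1 : 0) << (LR_FRAMENO_BITS + LR_SPI_BITS);
		val = (val << LR_ENTRY_BITS) | rec;
	}
	return (val << LR_SIZE_BITS) | (uint64_t)s->cnt;
}

/* Inverse of linked_regs_pack(); records come back in reverse push order. */
void linked_regs_unpack(uint64_t val, struct linked_regs *s)
{
	int i;

	s->cnt = (int)(val & ((1u << LR_SIZE_BITS) - 1));
	val >>= LR_SIZE_BITS;
	for (i = 0; i < s->cnt; i++) {
		struct linked_reg *e = &s->entries[i];

		e->frameno = val & 0x7;
		e->spi_or_reg = (val >> LR_FRAMENO_BITS) & 0x3f;
		e->is_reg = (val >> (LR_FRAMENO_BITS + LR_SPI_BITS)) & 0x1;
		val >>= LR_ENTRY_BITS;
	}
}
```

Because the vector is treated as a set of registers, the reversed order
after unpacking does not matter to a consumer that only asks whether any
member is precise and then marks all members.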

This also requires fixes for two test_verifier tests:
- precise: test 1
- precise: test 2

Both tests contain the following instruction sequence:

19: (bf) r2 = r9                      ; R2=scalar(id=3) R9=scalar(id=3)
20: (a5) if r2 < 0x8 goto pc+1        ; R2=scalar(id=3,umin=8)
21: (95) exit
22: (07) r2 += 1                      ; R2_w=scalar(id=3+1,...)
23: (bf) r1 = r10                     ; R1_w=fp0 R10=fp0
24: (07) r1 += -8                     ; R1_w=fp-8
25: (b7) r3 = 0                       ; R3_w=0
26: (85) call bpf_probe_read_kernel#113

The call to bpf_probe_read_kernel() at (26) forces r2 to be precise.
Previously, this forced all registers with same id to become precise
immediately when mark_chain_precision() is called.
After this change, the precision is propagated to registers sharing
same id only when 'if' instruction is backtracked.
Hence verification log for both tests is changed:
regs=r2,r9 -> regs=r2 for instructions 25..20.

Fixes: 904e6ddf4133 ("bpf: Use scalar ids in mark_chain_precision()")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240718202357.1746514-2-eddyz87@gmail.com
Closes: https://lore.kernel.org/bpf/CAEf4BzZ0xidVCqB47XnkXcNhkPWF6_nTV7yt+_Lf0kcFEut2Mg@mail.gmail.com/
[ zhenzhong: backport to 6.6.y verifier layout and adapt
  sync_linked_regs() to the pre-BPF_ADD_CONST scalar-id code. ]
Signed-off-by: Zhenzhong Wu <jt26wzz@gmail.com>
---
 include/linux/bpf_verifier.h                  |   4 +
 kernel/bpf/verifier.c                         | 256 ++++++++++++++++--
 .../bpf/progs/verifier_subprog_precision.c    |   2 +-
 3 files changed, 239 insertions(+), 23 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index dba211d3b..9a3b93c24 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -345,6 +345,10 @@ struct bpf_jmp_history_entry {
 	u32 prev_idx : 22;
 	/* special flags, e.g., whether insn is doing register stack spill/load */
 	u32 flags : 10;
+	/* additional registers that need precision tracking when this
+	 * jump is backtracked, vector of six 10-bit records
+	 */
+	u64 linked_regs;
 };
 
 /* Maximum number of register states that can exist at once */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0d90236d0..3cc0fc902 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3461,9 +3461,87 @@ static bool is_jmp_point(struct bpf_verifier_env *env, int insn_idx)
 	return env->insn_aux_data[insn_idx].jmp_point;
 }
 
+#define LR_FRAMENO_BITS	3
+#define LR_SPI_BITS	6
+#define LR_ENTRY_BITS	(LR_SPI_BITS + LR_FRAMENO_BITS + 1)
+#define LR_SIZE_BITS	4
+#define LR_FRAMENO_MASK	((1ull << LR_FRAMENO_BITS) - 1)
+#define LR_SPI_MASK	((1ull << LR_SPI_BITS)     - 1)
+#define LR_SIZE_MASK	((1ull << LR_SIZE_BITS)    - 1)
+#define LR_SPI_OFF	LR_FRAMENO_BITS
+#define LR_IS_REG_OFF	(LR_SPI_BITS + LR_FRAMENO_BITS)
+#define LINKED_REGS_MAX	6
+
+struct linked_reg {
+	u8 frameno;
+	union {
+		u8 spi;
+		u8 regno;
+	};
+	bool is_reg;
+};
+
+struct linked_regs {
+	int cnt;
+	struct linked_reg entries[LINKED_REGS_MAX];
+};
+
+static struct linked_reg *linked_regs_push(struct linked_regs *s)
+{
+	if (s->cnt < LINKED_REGS_MAX)
+		return &s->entries[s->cnt++];
+
+	return NULL;
+}
+
+/* Use u64 as a vector of 6 10-bit values, use first 4-bits to track
+ * number of elements currently in stack.
+ * Pack one history entry for linked registers as 10 bits in the following format:
+ * - 3-bits frameno
+ * - 6-bits spi_or_reg
+ * - 1-bit  is_reg
+ */
+static u64 linked_regs_pack(struct linked_regs *s)
+{
+	u64 val = 0;
+	int i;
+
+	for (i = 0; i < s->cnt; ++i) {
+		struct linked_reg *e = &s->entries[i];
+		u64 tmp = 0;
+
+		tmp |= e->frameno;
+		tmp |= e->spi << LR_SPI_OFF;
+		tmp |= (e->is_reg ? 1 : 0) << LR_IS_REG_OFF;
+
+		val <<= LR_ENTRY_BITS;
+		val |= tmp;
+	}
+	val <<= LR_SIZE_BITS;
+	val |= s->cnt;
+	return val;
+}
+
+static void linked_regs_unpack(u64 val, struct linked_regs *s)
+{
+	int i;
+
+	s->cnt = val & LR_SIZE_MASK;
+	val >>= LR_SIZE_BITS;
+
+	for (i = 0; i < s->cnt; ++i) {
+		struct linked_reg *e = &s->entries[i];
+
+		e->frameno =  val & LR_FRAMENO_MASK;
+		e->spi     = (val >> LR_SPI_OFF) & LR_SPI_MASK;
+		e->is_reg  = (val >> LR_IS_REG_OFF) & 0x1;
+		val >>= LR_ENTRY_BITS;
+	}
+}
+
 /* for any branch, call, exit record the history of jmps in the given state */
 static int push_jmp_history(struct bpf_verifier_env *env, struct bpf_verifier_state *cur,
-			    int insn_flags)
+			    int insn_flags, u64 linked_regs)
 {
 	u32 cnt = cur->jmp_history_cnt;
 	struct bpf_jmp_history_entry *p;
@@ -3479,6 +3557,10 @@ static int push_jmp_history(struct bpf_verifier_env *env, struct bpf_verifier_st
 			  "verifier insn history bug: insn_idx %d cur flags %x new flags %x\n",
 			  env->insn_idx, env->cur_hist_ent->flags, insn_flags);
 		env->cur_hist_ent->flags |= insn_flags;
+		WARN_ONCE(env->cur_hist_ent->linked_regs != 0,
+			  "verifier insn history bug: insn_idx %d linked_regs != 0: %#llx\n",
+			  env->insn_idx, env->cur_hist_ent->linked_regs);
+		env->cur_hist_ent->linked_regs = linked_regs;
 		return 0;
 	}
 
@@ -3493,6 +3575,7 @@ static int push_jmp_history(struct bpf_verifier_env *env, struct bpf_verifier_st
 	p->idx = env->insn_idx;
 	p->prev_idx = env->prev_insn_idx;
 	p->flags = insn_flags;
+	p->linked_regs = linked_regs;
 	cur->jmp_history_cnt = cnt;
 	env->cur_hist_ent = p;
 
@@ -3668,6 +3751,11 @@ static inline bool bt_is_reg_set(struct backtrack_state *bt, u32 reg)
 	return bt->reg_masks[bt->frame] & (1 << reg);
 }
 
+static inline bool bt_is_frame_reg_set(struct backtrack_state *bt, u32 frame, u32 reg)
+{
+	return bt->reg_masks[frame] & (1 << reg);
+}
+
 static inline bool bt_is_frame_slot_set(struct backtrack_state *bt, u32 frame, u32 slot)
 {
 	return bt->stack_masks[frame] & (1ull << slot);
@@ -3717,6 +3805,42 @@ static void fmt_stack_mask(char *buf, ssize_t buf_sz, u64 stack_mask)
 	}
 }
 
+/* If any register R in hist->linked_regs is marked as precise in bt,
+ * do bt_set_frame_{reg,slot}(bt, R) for all registers in hist->linked_regs.
+ */
+static void bt_sync_linked_regs(struct backtrack_state *bt, struct bpf_jmp_history_entry *hist)
+{
+	struct linked_regs linked_regs;
+	bool some_precise = false;
+	int i;
+
+	if (!hist || hist->linked_regs == 0)
+		return;
+
+	linked_regs_unpack(hist->linked_regs, &linked_regs);
+	for (i = 0; i < linked_regs.cnt; ++i) {
+		struct linked_reg *e = &linked_regs.entries[i];
+
+		if ((e->is_reg && bt_is_frame_reg_set(bt, e->frameno, e->regno)) ||
+		    (!e->is_reg && bt_is_frame_slot_set(bt, e->frameno, e->spi))) {
+			some_precise = true;
+			break;
+		}
+	}
+
+	if (!some_precise)
+		return;
+
+	for (i = 0; i < linked_regs.cnt; ++i) {
+		struct linked_reg *e = &linked_regs.entries[i];
+
+		if (e->is_reg)
+			bt_set_frame_reg(bt, e->frameno, e->regno);
+		else
+			bt_set_frame_slot(bt, e->frameno, e->spi);
+	}
+}
+
 static bool calls_callback(struct bpf_verifier_env *env, int insn_idx);
 
 /* For given verifier state backtrack_insn() is called from the last insn to
@@ -3756,6 +3880,12 @@ static int backtrack_insn(struct bpf_verifier_env *env, int idx, int subseq_idx,
 		print_bpf_insn(&cbs, insn, env->allow_ptr_leaks);
 	}
 
+	/* If there is a history record that some registers gained range at this insn,
+	 * propagate precision marks to those registers, so that bt_is_reg_set()
+	 * accounts for these registers.
+	 */
+	bt_sync_linked_regs(bt, hist);
+
 	if (class == BPF_ALU || class == BPF_ALU64) {
 		if (!bt_is_reg_set(bt, dreg))
 			return 0;
@@ -3985,7 +4115,8 @@ static int backtrack_insn(struct bpf_verifier_env *env, int idx, int subseq_idx,
 			 */
 			bt_set_reg(bt, dreg);
 			bt_set_reg(bt, sreg);
-			 /* else dreg <cond> K
+		} else if (BPF_SRC(insn->code) == BPF_K) {
+			 /* dreg <cond> K
 			  * Only dreg still needs precision before
 			  * this insn, so for the K-based conditional
 			  * there is nothing new to be marked.
@@ -4003,6 +4134,10 @@ static int backtrack_insn(struct bpf_verifier_env *env, int idx, int subseq_idx,
 			/* to be analyzed */
 			return -ENOTSUPP;
 	}
+	/* Propagate precision marks to linked registers, to account for
+	 * registers marked as precise in this function.
+	 */
+	bt_sync_linked_regs(bt, hist);
 	return 0;
 }
 
@@ -4354,7 +4489,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno)
 
 		/* If some register with scalar ID is marked as precise,
 		 * make sure that all registers sharing this ID are also precise.
-		 * This is needed to estimate effect of find_equal_scalars().
+		 * This is needed to estimate effect of sync_linked_regs().
 		 * Do this at the last instruction of each state,
 		 * bpf_reg_state::id fields are valid for these instructions.
 		 *
@@ -4368,7 +4503,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno)
 		 *     ...
 		 *   --- state #1 {r1.id = A, r2.id = A} ---
 		 *     ...
-		 *     if (r2 > 10) goto exit; // find_equal_scalars() assigns range to r1
+		 *     if (r2 > 10) goto exit; // sync_linked_regs() assigns range to r1
 		 *     ...
 		 *   --- state #2 {r1.id = A, r2.id = A} ---
 		 *     r3 = r10
@@ -4736,7 +4871,7 @@ static int check_stack_write_fixed_off(struct bpf_verifier_env *env,
 	}
 
 	if (insn_flags)
-		return push_jmp_history(env, env->cur_state, insn_flags);
+		return push_jmp_history(env, env->cur_state, insn_flags, 0);
 	return 0;
 }
 
@@ -5032,7 +5167,7 @@ static int check_stack_read_fixed_off(struct bpf_verifier_env *env,
 		insn_flags = 0; /* we are not restoring spilled register */
 	}
 	if (insn_flags)
-		return push_jmp_history(env, env->cur_state, insn_flags);
+		return push_jmp_history(env, env->cur_state, insn_flags, 0);
 	return 0;
 }
 
@@ -13540,7 +13675,7 @@ static int adjust_reg_min_max_vals(struct bpf_verifier_env *env,
 		ptr_reg = dst_reg;
 	else
 		/* Make sure ID is cleared otherwise dst_reg min/max could be
-		 * incorrectly propagated into other registers by find_equal_scalars()
+		 * incorrectly propagated into other registers by sync_linked_regs()
 		 */
 		dst_reg->id = 0;
 	if (BPF_SRC(insn->code) == BPF_X) {
@@ -13700,7 +13835,7 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
 					 */
 					if (need_id)
 						/* Assign src and dst registers the same ID
-						 * that will be used by find_equal_scalars()
+						 * that will be used by sync_linked_regs()
 						 * to propagate min/max range.
 						 */
 						src_reg->id = ++env->id_gen;
@@ -13746,7 +13881,7 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
 						copy_register_state(dst_reg, src_reg);
 						/* Make sure ID is cleared if src_reg is not in u32
 						 * range otherwise dst_reg min/max could be incorrectly
-						 * propagated into src_reg by find_equal_scalars()
+						 * propagated into src_reg by sync_linked_regs()
 						 */
 						if (!is_src_reg_u32)
 							dst_reg->id = 0;
@@ -14564,19 +14699,78 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn,
 	return true;
 }
 
-static void find_equal_scalars(struct bpf_verifier_state *vstate,
-			       struct bpf_reg_state *known_reg)
+static void __collect_linked_regs(struct linked_regs *reg_set, struct bpf_reg_state *reg,
+				  u32 id, u32 frameno, u32 spi_or_reg, bool is_reg)
 {
-	struct bpf_func_state *state;
+	struct linked_reg *e;
+
+	if (reg->type != SCALAR_VALUE || reg->id != id)
+		return;
+
+	e = linked_regs_push(reg_set);
+	if (e) {
+		e->frameno = frameno;
+		e->is_reg = is_reg;
+		e->regno = spi_or_reg;
+	} else {
+		reg->id = 0;
+	}
+}
+
+/* For all R being scalar registers or spilled scalar registers
+ * in verifier state, save R in linked_regs if R->id == id.
+ * If there are too many Rs sharing same id, reset id for leftover Rs.
+ */
+static void collect_linked_regs(struct bpf_verifier_state *vstate, u32 id,
+				struct linked_regs *linked_regs)
+{
+	struct bpf_func_state *func;
 	struct bpf_reg_state *reg;
+	int i, j;
 
-	bpf_for_each_reg_in_vstate(vstate, state, reg, ({
-		if (reg->type == SCALAR_VALUE && reg->id == known_reg->id) {
+	for (i = vstate->curframe; i >= 0; i--) {
+		func = vstate->frame[i];
+		for (j = 0; j < BPF_REG_FP; j++) {
+			reg = &func->regs[j];
+			__collect_linked_regs(linked_regs, reg, id, i, j, true);
+		}
+		for (j = 0; j < func->allocated_stack / BPF_REG_SIZE; j++) {
+			if (!is_spilled_reg(&func->stack[j]))
+				continue;
+			reg = &func->stack[j].spilled_ptr;
+			__collect_linked_regs(linked_regs, reg, id, i, j, false);
+		}
+	}
+
+	if (linked_regs->cnt == 1)
+		linked_regs->cnt = 0;
+}
+
+/* For all R in linked_regs, copy known_reg range into R
+ * if R->id == known_reg->id.
+ */
+static void sync_linked_regs(struct bpf_verifier_state *vstate, struct bpf_reg_state *known_reg,
+			     struct linked_regs *linked_regs)
+{
+	struct bpf_reg_state *reg;
+	struct linked_reg *e;
+	int i;
+
+	for (i = 0; i < linked_regs->cnt; ++i) {
+		e = &linked_regs->entries[i];
+		reg = e->is_reg ? &vstate->frame[e->frameno]->regs[e->regno]
+				: &vstate->frame[e->frameno]->stack[e->spi].spilled_ptr;
+		if (reg->type != SCALAR_VALUE || reg == known_reg)
+			continue;
+		if (reg->id != known_reg->id)
+			continue;
+		{
 			s32 saved_subreg_def = reg->subreg_def;
+
 			copy_register_state(reg, known_reg);
 			reg->subreg_def = saved_subreg_def;
 		}
-	}));
+	}
 }
 
 static int check_cond_jmp_op(struct bpf_verifier_env *env,
@@ -14587,6 +14781,7 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 	struct bpf_reg_state *regs = this_branch->frame[this_branch->curframe]->regs;
 	struct bpf_reg_state *dst_reg, *other_branch_regs, *src_reg = NULL;
 	struct bpf_reg_state *eq_branch_regs;
+	struct linked_regs linked_regs = {};
 	u8 opcode = BPF_OP(insn->code);
 	bool is_jmp32;
 	int pred = -1;
@@ -14704,6 +14899,21 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		return 0;
 	}
 
+	/* Push scalar registers sharing same ID to jump history,
+	 * do this before creating 'other_branch', so that both
+	 * 'this_branch' and 'other_branch' share this history
+	 * if parent state is created.
+	 */
+	if (BPF_SRC(insn->code) == BPF_X && src_reg->type == SCALAR_VALUE && src_reg->id)
+		collect_linked_regs(this_branch, src_reg->id, &linked_regs);
+	if (dst_reg->type == SCALAR_VALUE && dst_reg->id)
+		collect_linked_regs(this_branch, dst_reg->id, &linked_regs);
+	if (linked_regs.cnt > 0) {
+		err = push_jmp_history(env, this_branch, 0, linked_regs_pack(&linked_regs));
+		if (err)
+			return err;
+	}
+
 	other_branch = push_stack(env, *insn_idx + insn->off + 1, *insn_idx,
 				  false);
 	if (!other_branch)
@@ -14746,8 +14956,9 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 						    src_reg, dst_reg, opcode);
 			if (src_reg->id &&
 			    !WARN_ON_ONCE(src_reg->id != other_branch_regs[insn->src_reg].id)) {
-				find_equal_scalars(this_branch, src_reg);
-				find_equal_scalars(other_branch, &other_branch_regs[insn->src_reg]);
+				sync_linked_regs(this_branch, src_reg, &linked_regs);
+				sync_linked_regs(other_branch, &other_branch_regs[insn->src_reg],
+						 &linked_regs);
 			}
 
 		}
@@ -14759,8 +14970,9 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 
 	if (dst_reg->type == SCALAR_VALUE && dst_reg->id &&
 	    !WARN_ON_ONCE(dst_reg->id != other_branch_regs[insn->dst_reg].id)) {
-		find_equal_scalars(this_branch, dst_reg);
-		find_equal_scalars(other_branch, &other_branch_regs[insn->dst_reg]);
+		sync_linked_regs(this_branch, dst_reg, &linked_regs);
+		sync_linked_regs(other_branch, &other_branch_regs[insn->dst_reg],
+				 &linked_regs);
 	}
 
 	/* if one pointer register is compared to another pointer
@@ -16182,7 +16394,7 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
 		 *
 		 * First verification path is [1-6]:
 		 * - at (4) same bpf_reg_state::id (b) would be assigned to r6 and r7;
-		 * - at (5) r6 would be marked <= X, find_equal_scalars() would also mark
+		 * - at (5) r6 would be marked <= X, sync_linked_regs() would also mark
 		 *   r7 <= X, because r6 and r7 share same id.
 		 * Next verification path is [1-4, 6].
 		 *
@@ -16915,7 +17127,7 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
 			 * the current state.
 			 */
 			if (is_jmp_point(env, env->insn_idx))
-				err = err ? : push_jmp_history(env, cur, 0);
+				err = err ? : push_jmp_history(env, cur, 0, 0);
 			err = err ? : propagate_precision(env, &sl->state);
 			if (err)
 				return err;
@@ -17181,7 +17393,7 @@ static int do_check(struct bpf_verifier_env *env)
 		}
 
 		if (is_jmp_point(env, env->insn_idx)) {
-			err = push_jmp_history(env, state, 0);
+			err = push_jmp_history(env, state, 0, 0);
 			if (err)
 				return err;
 		}
diff --git a/tools/testing/selftests/bpf/progs/verifier_subprog_precision.c b/tools/testing/selftests/bpf/progs/verifier_subprog_precision.c
index 4b8b0f45d..a188e26f0 100644
--- a/tools/testing/selftests/bpf/progs/verifier_subprog_precision.c
+++ b/tools/testing/selftests/bpf/progs/verifier_subprog_precision.c
@@ -141,7 +141,7 @@ __msg("mark_precise: frame0: last_idx 14 first_idx 9")
 __msg("mark_precise: frame0: regs=r6 stack= before 13: (bf) r1 = r7")
 __msg("mark_precise: frame0: regs=r6 stack= before 12: (27) r6 *= 4")
 __msg("mark_precise: frame0: regs=r6 stack= before 11: (25) if r6 > 0x3 goto pc+4")
-__msg("mark_precise: frame0: regs=r6 stack= before 10: (bf) r6 = r0")
+__msg("mark_precise: frame0: regs=r0,r6 stack= before 10: (bf) r6 = r0")
 __msg("mark_precise: frame0: regs=r0 stack= before 9: (85) call bpf_loop")
 /* State entering callback body popped from states stack */
 __msg("from 9 to 17: frame1:")
-- 
2.43.0



* [PATCH stable 6.6.y v3 2/4] bpf: Remove mark_precise_scalar_ids()
From: Zhenzhong Wu @ 2026-06-14 16:58 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kernel, ast, daniel, john.fastabend, andrii,
	martin.lau, song, yonghong.song, kpsingh, haoluo, jolsa,
	menglong8.dong, eddyz87, shung-hsi.yu, stable, mykolal, tamird
In-Reply-To: <cover.1781194510.git.jt26wzz@gmail.com>

From: Eduard Zingerman <eddyz87@gmail.com>

[ Upstream commit 842edb5507a1038e009d27e69d13b94b6f085763 ]

Function mark_precise_scalar_ids() is superseded by
bt_sync_linked_regs() and equal scalars tracking in jump history.
mark_precise_scalar_ids() propagates precision over registers sharing
same ID on parent/child state boundaries, while jump history records
allow bt_sync_linked_regs() to propagate same information with
instruction level granularity, which is strictly more precise.
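
The per-instruction rule described above can be modelled outside the
kernel. The sketch below is a simplified, single-frame model and not the
kernel implementation: struct linked_set, bt_mask_t and
sync_linked_precision() are hypothetical names, registers are tracked as
bits in a mask, and stack slots are left out. The limit of 6 is taken from
the linked_regs_too_many_regs selftest in this series. It illustrates the
idea that if any register recorded for a conditional jump is already
precise, every register recorded for that jump becomes precise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LINKED 6	/* assumed per-instruction cap, see lead-in */

/* Hypothetical stand-in for the linked-register record kept in jump history. */
struct linked_set {
	uint8_t regno[MAX_LINKED];
	int cnt;
};

typedef uint32_t bt_mask_t;	/* bit N set => rN must be precise */

/* If any recorded register is precise, mark all recorded registers precise. */
static bt_mask_t sync_linked_precision(bt_mask_t precise,
				       const struct linked_set *set)
{
	bool some_precise = false;
	int i;

	assert(set->cnt >= 0 && set->cnt <= MAX_LINKED);

	for (i = 0; i < set->cnt; i++) {
		assert(set->regno[i] < 32);
		if (precise & (1u << set->regno[i]))
			some_precise = true;
	}
	if (!some_precise)
		return precise;	/* nothing to propagate for this insn */

	for (i = 0; i < set->cnt; i++)
		precise |= 1u << set->regno[i];
	return precise;
}
```

With r0 precise and {r0, r1, r2} recorded for 'if r1 > 7 goto +0', the
mask 0x1 becomes 0x7 in this model, which corresponds to the
"parent state regs=r0,r1,r2" expectation in linked_regs_bpf_k.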

This commit removes mark_precise_scalar_ids() and updates test cases
in progs/verifier_scalar_ids to reflect new verifier behavior.

The tests are updated in the following manner:
- mark_precise_scalar_ids() propagated precision regardless of
  presence of conditional jumps, while new jump history based logic
  only kicks in when conditional jumps are present.
  Hence test cases are augmented with conditional jumps to still
  trigger precision propagation.
- As equal scalars tracking no longer relies on parent/child state
  boundaries, some test cases are no longer interesting.
  Such test cases are removed, namely:
  - precision_same_state and precision_cross_state are superseded by
    linked_regs_bpf_k;
  - precision_same_state_broken_link and equal_scalars_broken_link
    are superseded by linked_regs_broken_link.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240718202357.1746514-3-eddyz87@gmail.com
[ zhenzhong: backport to 6.6.y after adapting the first linked-regs
  history commit to the older scalar-id verifier layout. ]
Signed-off-by: Zhenzhong Wu <jt26wzz@gmail.com>
---
 kernel/bpf/verifier.c                         | 115 ------------
 .../selftests/bpf/progs/verifier_scalar_ids.c | 171 ++++++------------
 .../testing/selftests/bpf/verifier/precise.c  |   2 +-
 3 files changed, 56 insertions(+), 232 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 3cc0fc902..55a5a5bed 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4265,96 +4265,6 @@ static void mark_all_scalars_imprecise(struct bpf_verifier_env *env, struct bpf_
 	}
 }
 
-static bool idset_contains(struct bpf_idset *s, u32 id)
-{
-	u32 i;
-
-	for (i = 0; i < s->count; ++i)
-		if (s->ids[i] == id)
-			return true;
-
-	return false;
-}
-
-static int idset_push(struct bpf_idset *s, u32 id)
-{
-	if (WARN_ON_ONCE(s->count >= ARRAY_SIZE(s->ids)))
-		return -EFAULT;
-	s->ids[s->count++] = id;
-	return 0;
-}
-
-static void idset_reset(struct bpf_idset *s)
-{
-	s->count = 0;
-}
-
-/* Collect a set of IDs for all registers currently marked as precise in env->bt.
- * Mark all registers with these IDs as precise.
- */
-static int mark_precise_scalar_ids(struct bpf_verifier_env *env, struct bpf_verifier_state *st)
-{
-	struct bpf_idset *precise_ids = &env->idset_scratch;
-	struct backtrack_state *bt = &env->bt;
-	struct bpf_func_state *func;
-	struct bpf_reg_state *reg;
-	DECLARE_BITMAP(mask, 64);
-	int i, fr;
-
-	idset_reset(precise_ids);
-
-	for (fr = bt->frame; fr >= 0; fr--) {
-		func = st->frame[fr];
-
-		bitmap_from_u64(mask, bt_frame_reg_mask(bt, fr));
-		for_each_set_bit(i, mask, 32) {
-			reg = &func->regs[i];
-			if (!reg->id || reg->type != SCALAR_VALUE)
-				continue;
-			if (idset_push(precise_ids, reg->id))
-				return -EFAULT;
-		}
-
-		bitmap_from_u64(mask, bt_frame_stack_mask(bt, fr));
-		for_each_set_bit(i, mask, 64) {
-			if (i >= func->allocated_stack / BPF_REG_SIZE)
-				break;
-			if (!is_spilled_scalar_reg(&func->stack[i]))
-				continue;
-			reg = &func->stack[i].spilled_ptr;
-			if (!reg->id)
-				continue;
-			if (idset_push(precise_ids, reg->id))
-				return -EFAULT;
-		}
-	}
-
-	for (fr = 0; fr <= st->curframe; ++fr) {
-		func = st->frame[fr];
-
-		for (i = BPF_REG_0; i < BPF_REG_10; ++i) {
-			reg = &func->regs[i];
-			if (!reg->id)
-				continue;
-			if (!idset_contains(precise_ids, reg->id))
-				continue;
-			bt_set_frame_reg(bt, fr, i);
-		}
-		for (i = 0; i < func->allocated_stack / BPF_REG_SIZE; ++i) {
-			if (!is_spilled_scalar_reg(&func->stack[i]))
-				continue;
-			reg = &func->stack[i].spilled_ptr;
-			if (!reg->id)
-				continue;
-			if (!idset_contains(precise_ids, reg->id))
-				continue;
-			bt_set_frame_slot(bt, fr, i);
-		}
-	}
-
-	return 0;
-}
-
 /*
  * __mark_chain_precision() backtracks BPF program instruction sequence and
  * chain of verifier states making sure that register *regno* (if regno >= 0)
@@ -4487,31 +4397,6 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno)
 				bt->frame, last_idx, first_idx, subseq_idx);
 		}
 
-		/* If some register with scalar ID is marked as precise,
-		 * make sure that all registers sharing this ID are also precise.
-		 * This is needed to estimate effect of sync_linked_regs().
-		 * Do this at the last instruction of each state,
-		 * bpf_reg_state::id fields are valid for these instructions.
-		 *
-		 * Allows to track precision in situation like below:
-		 *
-		 *     r2 = unknown value
-		 *     ...
-		 *   --- state #0 ---
-		 *     ...
-		 *     r1 = r2                 // r1 and r2 now share the same ID
-		 *     ...
-		 *   --- state #1 {r1.id = A, r2.id = A} ---
-		 *     ...
-		 *     if (r2 > 10) goto exit; // sync_linked_regs() assigns range to r1
-		 *     ...
-		 *   --- state #2 {r1.id = A, r2.id = A} ---
-		 *     r3 = r10
-		 *     r3 += r1                // need to mark both r1 and r2
-		 */
-		if (mark_precise_scalar_ids(env, st))
-			return -EFAULT;
-
 		if (last_idx < 0) {
 			/* we are at the entry into subprog, which
 			 * is expected for global funcs, but only if
diff --git a/tools/testing/selftests/bpf/progs/verifier_scalar_ids.c b/tools/testing/selftests/bpf/progs/verifier_scalar_ids.c
index 22a6cf6e8..f70392bf6 100644
--- a/tools/testing/selftests/bpf/progs/verifier_scalar_ids.c
+++ b/tools/testing/selftests/bpf/progs/verifier_scalar_ids.c
@@ -5,54 +5,27 @@
 #include "bpf_misc.h"
 
 /* Check that precision marks propagate through scalar IDs.
- * Registers r{0,1,2} have the same scalar ID at the moment when r0 is
- * marked to be precise, this mark is immediately propagated to r{1,2}.
+ * Registers r{0,1,2} have the same scalar ID.
+ * Range information is propagated for scalars sharing same ID.
+ * Check that precision mark for r0 causes precision marks for r{1,2}
+ * when range information is propagated for 'if <reg> <op> <const>' insn.
  */
 SEC("socket")
 __success __log_level(2)
-__msg("frame0: regs=r0,r1,r2 stack= before 4: (bf) r3 = r10")
-__msg("frame0: regs=r0,r1,r2 stack= before 3: (bf) r2 = r0")
-__msg("frame0: regs=r0,r1 stack= before 2: (bf) r1 = r0")
-__msg("frame0: regs=r0 stack= before 1: (57) r0 &= 255")
-__msg("frame0: regs=r0 stack= before 0: (85) call bpf_ktime_get_ns")
-__flag(BPF_F_TEST_STATE_FREQ)
-__naked void precision_same_state(void)
-{
-	asm volatile (
-	/* r0 = random number up to 0xff */
-	"call %[bpf_ktime_get_ns];"
-	"r0 &= 0xff;"
-	/* tie r0.id == r1.id == r2.id */
-	"r1 = r0;"
-	"r2 = r0;"
-	/* force r0 to be precise, this immediately marks r1 and r2 as
-	 * precise as well because of shared IDs
-	 */
-	"r3 = r10;"
-	"r3 += r0;"
-	"r0 = 0;"
-	"exit;"
-	:
-	: __imm(bpf_ktime_get_ns)
-	: __clobber_all);
-}
-
-/* Same as precision_same_state, but mark propagates through state /
- * parent state boundary.
- */
-SEC("socket")
-__success __log_level(2)
-__msg("frame0: last_idx 6 first_idx 5 subseq_idx -1")
-__msg("frame0: regs=r0,r1,r2 stack= before 5: (bf) r3 = r10")
+/* first 'if' branch */
+__msg("6: (0f) r3 += r0")
+__msg("frame0: regs=r0 stack= before 4: (25) if r1 > 0x7 goto pc+0")
 __msg("frame0: parent state regs=r0,r1,r2 stack=:")
-__msg("frame0: regs=r0,r1,r2 stack= before 4: (05) goto pc+0")
 __msg("frame0: regs=r0,r1,r2 stack= before 3: (bf) r2 = r0")
-__msg("frame0: regs=r0,r1 stack= before 2: (bf) r1 = r0")
-__msg("frame0: regs=r0 stack= before 1: (57) r0 &= 255")
-__msg("frame0: parent state regs=r0 stack=:")
-__msg("frame0: regs=r0 stack= before 0: (85) call bpf_ktime_get_ns")
+/* second 'if' branch */
+__msg("from 4 to 5: ")
+__msg("6: (0f) r3 += r0")
+__msg("frame0: regs=r0 stack= before 5: (bf) r3 = r10")
+__msg("frame0: regs=r0 stack= before 4: (25) if r1 > 0x7 goto pc+0")
+/* parent state already has r{0,1,2} as precise */
+__msg("frame0: parent state regs= stack=:")
 __flag(BPF_F_TEST_STATE_FREQ)
-__naked void precision_cross_state(void)
+__naked void linked_regs_bpf_k(void)
 {
 	asm volatile (
 	/* r0 = random number up to 0xff */
@@ -61,9 +34,8 @@ __naked void precision_cross_state(void)
 	/* tie r0.id == r1.id == r2.id */
 	"r1 = r0;"
 	"r2 = r0;"
-	/* force checkpoint */
-	"goto +0;"
-	/* force r0 to be precise, this immediately marks r1 and r2 as
+	"if r1 > 7 goto +0;"
+	/* force r0 to be precise, this eventually marks r1 and r2 as
 	 * precise as well because of shared IDs
 	 */
 	"r3 = r10;"
@@ -75,59 +47,18 @@ __naked void precision_cross_state(void)
 	: __clobber_all);
 }
 
-/* Same as precision_same_state, but break one of the
+/* Same as linked_regs_bpf_k, but break one of the
  * links, note that r1 is absent from regs=... in __msg below.
  */
 SEC("socket")
 __success __log_level(2)
-__msg("frame0: regs=r0,r2 stack= before 5: (bf) r3 = r10")
-__msg("frame0: regs=r0,r2 stack= before 4: (b7) r1 = 0")
-__msg("frame0: regs=r0,r2 stack= before 3: (bf) r2 = r0")
-__msg("frame0: regs=r0 stack= before 2: (bf) r1 = r0")
-__msg("frame0: regs=r0 stack= before 1: (57) r0 &= 255")
-__msg("frame0: regs=r0 stack= before 0: (85) call bpf_ktime_get_ns")
-__flag(BPF_F_TEST_STATE_FREQ)
-__naked void precision_same_state_broken_link(void)
-{
-	asm volatile (
-	/* r0 = random number up to 0xff */
-	"call %[bpf_ktime_get_ns];"
-	"r0 &= 0xff;"
-	/* tie r0.id == r1.id == r2.id */
-	"r1 = r0;"
-	"r2 = r0;"
-	/* break link for r1, this is the only line that differs
-	 * compared to the previous test
-	 */
-	"r1 = 0;"
-	/* force r0 to be precise, this immediately marks r1 and r2 as
-	 * precise as well because of shared IDs
-	 */
-	"r3 = r10;"
-	"r3 += r0;"
-	"r0 = 0;"
-	"exit;"
-	:
-	: __imm(bpf_ktime_get_ns)
-	: __clobber_all);
-}
-
-/* Same as precision_same_state_broken_link, but with state /
- * parent state boundary.
- */
-SEC("socket")
-__success __log_level(2)
-__msg("frame0: regs=r0,r2 stack= before 6: (bf) r3 = r10")
-__msg("frame0: regs=r0,r2 stack= before 5: (b7) r1 = 0")
-__msg("frame0: parent state regs=r0,r2 stack=:")
-__msg("frame0: regs=r0,r1,r2 stack= before 4: (05) goto pc+0")
-__msg("frame0: regs=r0,r1,r2 stack= before 3: (bf) r2 = r0")
-__msg("frame0: regs=r0,r1 stack= before 2: (bf) r1 = r0")
-__msg("frame0: regs=r0 stack= before 1: (57) r0 &= 255")
+__msg("7: (0f) r3 += r0")
+__msg("frame0: regs=r0 stack= before 6: (bf) r3 = r10")
 __msg("frame0: parent state regs=r0 stack=:")
-__msg("frame0: regs=r0 stack= before 0: (85) call bpf_ktime_get_ns")
+__msg("frame0: regs=r0 stack= before 5: (25) if r0 > 0x7 goto pc+0")
+__msg("frame0: parent state regs=r0,r2 stack=:")
 __flag(BPF_F_TEST_STATE_FREQ)
-__naked void precision_cross_state_broken_link(void)
+__naked void linked_regs_broken_link(void)
 {
 	asm volatile (
 	/* r0 = random number up to 0xff */
@@ -136,18 +67,13 @@ __naked void precision_cross_state_broken_link(void)
 	/* tie r0.id == r1.id == r2.id */
 	"r1 = r0;"
 	"r2 = r0;"
-	/* force checkpoint, although link between r1 and r{0,2} is
-	 * broken by the next statement current precision tracking
-	 * algorithm can't react to it and propagates mark for r1 to
-	 * the parent state.
-	 */
-	"goto +0;"
 	/* break link for r1, this is the only line that differs
-	 * compared to precision_cross_state()
+	 * compared to the previous test
 	 */
 	"r1 = 0;"
-	/* force r0 to be precise, this immediately marks r1 and r2 as
-	 * precise as well because of shared IDs
+	"if r0 > 7 goto +0;"
+	/* force r0 to be precise,
+	 * this eventually marks r2 as precise because of shared IDs
 	 */
 	"r3 = r10;"
 	"r3 += r0;"
@@ -164,10 +90,16 @@ __naked void precision_cross_state_broken_link(void)
  */
 SEC("socket")
 __success __log_level(2)
-__msg("11: (0f) r2 += r1")
+__msg("12: (0f) r2 += r1")
 /* Current state */
-__msg("frame2: last_idx 11 first_idx 10 subseq_idx -1")
-__msg("frame2: regs=r1 stack= before 10: (bf) r2 = r10")
+__msg("frame2: last_idx 12 first_idx 11 subseq_idx -1 ")
+__msg("frame2: regs=r1 stack= before 11: (bf) r2 = r10")
+__msg("frame2: parent state regs=r1 stack=")
+__msg("frame1: parent state regs= stack=")
+__msg("frame0: parent state regs= stack=")
+/* Parent state */
+__msg("frame2: last_idx 10 first_idx 10 subseq_idx 11 ")
+__msg("frame2: regs=r1 stack= before 10: (25) if r1 > 0x7 goto pc+0")
 __msg("frame2: parent state regs=r1 stack=")
 /* frame1.r{6,7} are marked because mark_precise_scalar_ids()
  * looks for all registers with frame2.r1.id in the current state
@@ -192,7 +124,7 @@ __msg("frame1: regs=r1 stack= before 4: (85) call pc+1")
 __msg("frame0: parent state regs=r1,r6 stack=")
 /* Parent state */
 __msg("frame0: last_idx 3 first_idx 1 subseq_idx 4")
-__msg("frame0: regs=r0,r1,r6 stack= before 3: (bf) r6 = r0")
+__msg("frame0: regs=r1,r6 stack= before 3: (bf) r6 = r0")
 __msg("frame0: regs=r0,r1 stack= before 2: (bf) r1 = r0")
 __msg("frame0: regs=r0 stack= before 1: (57) r0 &= 255")
 __flag(BPF_F_TEST_STATE_FREQ)
@@ -230,7 +162,8 @@ static __naked __noinline __used
 void precision_many_frames__bar(void)
 {
 	asm volatile (
-	/* force r1 to be precise, this immediately marks:
+	"if r1 > 7 goto +0;"
+	/* force r1 to be precise, this eventually marks:
 	 * - bar frame r1
 	 * - foo frame r{1,6,7}
 	 * - main frame r{1,6}
@@ -247,14 +180,16 @@ void precision_many_frames__bar(void)
  */
 SEC("socket")
 __success __log_level(2)
+__msg("11: (0f) r2 += r1")
 /* foo frame */
-__msg("frame1: regs=r1 stack=-8,-16 before 9: (bf) r2 = r10")
+__msg("frame1: regs=r1 stack= before 10: (bf) r2 = r10")
+__msg("frame1: regs=r1 stack= before 9: (25) if r1 > 0x7 goto pc+0")
 __msg("frame1: regs=r1 stack=-8,-16 before 8: (7b) *(u64 *)(r10 -16) = r1")
 __msg("frame1: regs=r1 stack=-8 before 7: (7b) *(u64 *)(r10 -8) = r1")
 __msg("frame1: regs=r1 stack= before 4: (85) call pc+2")
 /* main frame */
-__msg("frame0: regs=r0,r1 stack=-8 before 3: (7b) *(u64 *)(r10 -8) = r1")
-__msg("frame0: regs=r0,r1 stack= before 2: (bf) r1 = r0")
+__msg("frame0: regs=r1 stack=-8 before 3: (7b) *(u64 *)(r10 -8) = r1")
+__msg("frame0: regs=r1 stack= before 2: (bf) r1 = r0")
 __msg("frame0: regs=r0 stack= before 1: (57) r0 &= 255")
 __flag(BPF_F_TEST_STATE_FREQ)
 __naked void precision_stack(void)
@@ -283,7 +218,8 @@ void precision_stack__foo(void)
 	 */
 	"*(u64*)(r10 - 8) = r1;"
 	"*(u64*)(r10 - 16) = r1;"
-	/* force r1 to be precise, this immediately marks:
+	"if r1 > 7 goto +0;"
+	/* force r1 to be precise, this eventually marks:
 	 * - foo frame r1,fp{-8,-16}
 	 * - main frame r1,fp{-8}
 	 */
@@ -299,15 +235,17 @@ void precision_stack__foo(void)
 SEC("socket")
 __success __log_level(2)
 /* r{6,7} */
-__msg("11: (0f) r3 += r7")
-__msg("frame0: regs=r6,r7 stack= before 10: (bf) r3 = r10")
+__msg("12: (0f) r3 += r7")
+__msg("frame0: regs=r7 stack= before 11: (bf) r3 = r10")
+__msg("frame0: regs=r7 stack= before 9: (25) if r7 > 0x7 goto pc+0")
 /* ... skip some insns ... */
 __msg("frame0: regs=r6,r7 stack= before 3: (bf) r7 = r0")
 __msg("frame0: regs=r0,r6 stack= before 2: (bf) r6 = r0")
 /* r{8,9} */
-__msg("12: (0f) r3 += r9")
-__msg("frame0: regs=r8,r9 stack= before 11: (0f) r3 += r7")
+__msg("13: (0f) r3 += r9")
+__msg("frame0: regs=r9 stack= before 12: (0f) r3 += r7")
 /* ... skip some insns ... */
+__msg("frame0: regs=r9 stack= before 10: (25) if r9 > 0x7 goto pc+0")
 __msg("frame0: regs=r8,r9 stack= before 7: (bf) r9 = r0")
 __msg("frame0: regs=r0,r8 stack= before 6: (bf) r8 = r0")
 __flag(BPF_F_TEST_STATE_FREQ)
@@ -328,8 +266,9 @@ __naked void precision_two_ids(void)
 	"r9 = r0;"
 	/* clear r0 id */
 	"r0 = 0;"
-	/* force checkpoint */
-	"goto +0;"
+	/* propagate equal scalars precision */
+	"if r7 > 7 goto +0;"
+	"if r9 > 7 goto +0;"
 	"r3 = r10;"
 	/* force r7 to be precise, this also marks r6 */
 	"r3 += r7;"
diff --git a/tools/testing/selftests/bpf/verifier/precise.c b/tools/testing/selftests/bpf/verifier/precise.c
index 8a2ff81d8..17fbc1e61 100644
--- a/tools/testing/selftests/bpf/verifier/precise.c
+++ b/tools/testing/selftests/bpf/verifier/precise.c
@@ -106,7 +106,7 @@
 	mark_precise: frame0: regs=r2 stack= before 22\
 	mark_precise: frame0: parent state regs=r2 stack=:\
 	mark_precise: frame0: last_idx 20 first_idx 20\
-	mark_precise: frame0: regs=r2,r9 stack= before 20\
+	mark_precise: frame0: regs=r2 stack= before 20\
 	mark_precise: frame0: parent state regs=r2,r9 stack=:\
 	mark_precise: frame0: last_idx 19 first_idx 17\
 	mark_precise: frame0: regs=r2,r9 stack= before 19\
-- 
2.43.0



* [PATCH stable 6.6.y v3 3/4] selftests/bpf: Tests for per-insn sync_linked_regs() precision tracking
From: Zhenzhong Wu @ 2026-06-14 16:58 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kernel, ast, daniel, john.fastabend, andrii,
	martin.lau, song, yonghong.song, kpsingh, haoluo, jolsa,
	menglong8.dong, eddyz87, shung-hsi.yu, stable, mykolal, tamird
In-Reply-To: <cover.1781194510.git.jt26wzz@gmail.com>

From: Eduard Zingerman <eddyz87@gmail.com>

[ Upstream commit bebc17b1c03b224a0b4aec6a171815e39f8ba9bc ]

Add a few test cases to verify precision tracking for scalars gaining
range because of sync_linked_regs():
- check what happens when more than 6 registers might gain range in
  sync_linked_regs();
- check if precision is propagated correctly when operand of
  conditional jump gained range in sync_linked_regs() and one of
  linked registers is marked precise;
- check if precision is propagated correctly when operand of
  conditional jump gained range in sync_linked_regs() and the
  other operand of the conditional jump is marked precise;
- add a minimized reproducer for precision tracking bug reported in [0];
- check that mark_chain_precision() for one of the conditional jump
  operands does not trigger equal scalars precision propagation.
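
The first case above (more than 6 registers sharing an ID) can be
pictured with a small stand-alone model. This is a sketch under stated
assumptions, not the verifier code: collect_linked(), struct linked_set
and the flat ids[] array are hypothetical, only one frame of r0-r9 is
modelled, spilled registers are ignored, and the set is reset on every
call. It shows the two behaviours the selftests rely on: registers beyond
the cap lose their ID, and a set with a single member is treated as
empty.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LINKED 6	/* cap named in the selftest comment */
#define NUM_REGS 10	/* r0-r9, frame pointer excluded */

/* Hypothetical record of registers that share one scalar ID. */
struct linked_set {
	uint8_t regno[MAX_LINKED];
	int cnt;
};

/* Collect registers whose ID equals 'id'; leftovers get their ID reset. */
static void collect_linked(uint32_t ids[NUM_REGS], uint32_t id,
			   struct linked_set *set)
{
	int i;

	assert(id != 0);	/* ID 0 means "not linked" in this model */
	set->cnt = 0;

	for (i = 0; i < NUM_REGS; i++) {
		if (ids[i] != id)
			continue;
		if (set->cnt < MAX_LINKED)
			set->regno[set->cnt++] = (uint8_t)i;
		else
			ids[i] = 0;	/* too many: break the link */
	}

	/* A single register is linked to nothing, so record nothing. */
	if (set->cnt == 1)
		set->cnt = 0;
}
```

In this model, seven registers r0-r6 sharing one ID yield a set of
r0-r5 while r6 has its ID cleared, which mirrors the expectation in
linked_regs_too_many_regs that r0 and r6 end up with different IDs.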

[0] https://lore.kernel.org/bpf/CAEf4BzZ0xidVCqB47XnkXcNhkPWF6_nTV7yt+_Lf0kcFEut2Mg@mail.gmail.com/

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240718202357.1746514-4-eddyz87@gmail.com
[ zhenzhong: keep the linked_regs_broken_link_2 reject check, but
  drop the mark_precise log expectations because 6.6.y does not derive
  the scalar-vs-scalar range for that non-constant JMP_X comparison. ]
Signed-off-by: Zhenzhong Wu <jt26wzz@gmail.com>
---
 .../selftests/bpf/progs/verifier_scalar_ids.c | 162 ++++++++++++++++++
 1 file changed, 162 insertions(+)

diff --git a/tools/testing/selftests/bpf/progs/verifier_scalar_ids.c b/tools/testing/selftests/bpf/progs/verifier_scalar_ids.c
index f70392bf6..778630402 100644
--- a/tools/testing/selftests/bpf/progs/verifier_scalar_ids.c
+++ b/tools/testing/selftests/bpf/progs/verifier_scalar_ids.c
@@ -47,6 +47,72 @@ __naked void linked_regs_bpf_k(void)
 	: __clobber_all);
 }

+/* Registers r{0,1,2} share same ID when 'if r1 > ...' insn is processed,
+ * check that verifier marks r{1,2} as precise while backtracking
+ * 'if r1 > ...' with r0 already marked.
+ */
+SEC("socket")
+__success __log_level(2)
+__flag(BPF_F_TEST_STATE_FREQ)
+__msg("frame0: regs=r0 stack= before 5: (2d) if r1 > r3 goto pc+0")
+__msg("frame0: parent state regs=r0,r1,r2,r3 stack=:")
+__msg("frame0: regs=r0,r1,r2,r3 stack= before 4: (b7) r3 = 7")
+__naked void linked_regs_bpf_x_src(void)
+{
+	asm volatile (
+	/* r0 = random number up to 0xff */
+	"call %[bpf_ktime_get_ns];"
+	"r0 &= 0xff;"
+	/* tie r0.id == r1.id == r2.id */
+	"r1 = r0;"
+	"r2 = r0;"
+	"r3 = 7;"
+	"if r1 > r3 goto +0;"
+	/* force r0 to be precise, this eventually marks r1 and r2 as
+	 * precise as well because of shared IDs
+	 */
+	"r4 = r10;"
+	"r4 += r0;"
+	"r0 = 0;"
+	"exit;"
+	:
+	: __imm(bpf_ktime_get_ns)
+	: __clobber_all);
+}
+
+/* Registers r{0,1,2} share same ID when 'if r1 > r3' insn is processed,
+ * check that verifier marks r{0,1,2} as precise while backtracking
+ * 'if r1 > r3' with r3 already marked.
+ */
+SEC("socket")
+__success __log_level(2)
+__flag(BPF_F_TEST_STATE_FREQ)
+__msg("frame0: regs=r3 stack= before 5: (2d) if r1 > r3 goto pc+0")
+__msg("frame0: parent state regs=r0,r1,r2,r3 stack=:")
+__msg("frame0: regs=r0,r1,r2,r3 stack= before 4: (b7) r3 = 7")
+__naked void linked_regs_bpf_x_dst(void)
+{
+	asm volatile (
+	/* r0 = random number up to 0xff */
+	"call %[bpf_ktime_get_ns];"
+	"r0 &= 0xff;"
+	/* tie r0.id == r1.id == r2.id */
+	"r1 = r0;"
+	"r2 = r0;"
+	"r3 = 7;"
+	"if r1 > r3 goto +0;"
+	/* force r3 to be precise, this eventually marks r{0,1,2} as
+	 * precise as well because r1 is compared with r3 above
+	 */
+	"r4 = r10;"
+	"r4 += r3;"
+	"r0 = 0;"
+	"exit;"
+	:
+	: __imm(bpf_ktime_get_ns)
+	: __clobber_all);
+}
+
 /* Same as linked_regs_bpf_k, but break one of the
  * links, note that r1 is absent from regs=... in __msg below.
  */
@@ -280,6 +346,102 @@ __naked void precision_two_ids(void)
 	: __clobber_all);
 }

+SEC("socket")
+__success __log_level(2)
+__flag(BPF_F_TEST_STATE_FREQ)
+/* check that r0 and r6 have different IDs after 'if',
+ * collect_linked_regs() can't tie more than 6 registers for a single insn.
+ */
+__msg("8: (25) if r0 > 0x7 goto pc+0         ; R0=scalar(id=1")
+__msg("9: (bf) r6 = r6                       ; R6_w=scalar(id=2")
+/* check that r{0-5} are marked precise after 'if' */
+__msg("frame0: regs=r0 stack= before 8: (25) if r0 > 0x7 goto pc+0")
+__msg("frame0: parent state regs=r0,r1,r2,r3,r4,r5 stack=:")
+__naked void linked_regs_too_many_regs(void)
+{
+	asm volatile (
+	/* r0 = random number up to 0xff */
+	"call %[bpf_ktime_get_ns];"
+	"r0 &= 0xff;"
+	/* tie r{0-6} IDs */
+	"r1 = r0;"
+	"r2 = r0;"
+	"r3 = r0;"
+	"r4 = r0;"
+	"r5 = r0;"
+	"r6 = r0;"
+	/* propagate range for r{0-6} */
+	"if r0 > 7 goto +0;"
+	/* make r6 appear in the log */
+	"r6 = r6;"
+	/* force r0 to be precise,
+	 * this would cause r{0-5} to be precise because of shared IDs
+	 */
+	"r7 = r10;"
+	"r7 += r0;"
+	"r0 = 0;"
+	"exit;"
+	:
+	: __imm(bpf_ktime_get_ns)
+	: __clobber_all);
+}
+
+SEC("socket")
+__failure __log_level(2)
+__flag(BPF_F_TEST_STATE_FREQ)
+__msg("div by zero")
+__naked void linked_regs_broken_link_2(void)
+{
+	asm volatile (
+	"call %[bpf_get_prandom_u32];"
+	"r7 = r0;"
+	"r8 = r0;"
+	"call %[bpf_get_prandom_u32];"
+	"if r0 > 1 goto +0;"
+	/* r7.id == r8.id,
+	 * thus r7 precision implies r8 precision,
+	 * which implies r0 precision because of the conditional below.
+	 */
+	"if r8 >= r0 goto 1f;"
+	/* break id relation between r7 and r8 */
+	"r8 += r8;"
+	/* make r7 precise */
+	"if r7 == 0 goto 1f;"
+	"r0 /= 0;"
+"1:"
+	"r0 = 42;"
+	"exit;"
+	:
+	: __imm(bpf_get_prandom_u32)
+	: __clobber_all);
+}
+
+/* Check that mark_chain_precision() for one of the conditional jump
+ * operands does not trigger equal scalars precision propagation.
+ */
+SEC("socket")
+__success __log_level(2)
+__msg("3: (25) if r1 > 0x100 goto pc+0")
+__msg("frame0: regs=r1 stack= before 2: (bf) r1 = r0")
+__naked void cjmp_no_linked_regs_trigger(void)
+{
+	asm volatile (
+	/* r0 = random number up to 0xff */
+	"call %[bpf_ktime_get_ns];"
+	"r0 &= 0xff;"
+	/* tie r0.id == r1.id */
+	"r1 = r0;"
+	/* the jump below would be predicted, thus r1 would be marked precise,
+	 * this should not imply precision mark for r0
+	 */
+	"if r1 > 256 goto +0;"
+	"r0 = 0;"
+	"exit;"
+	:
+	: __imm(bpf_ktime_get_ns)
+	: __clobber_all);
+}
+
 /* Verify that check_ids() is used by regsafe() for scalars.
  *
  * r9 = ... some pointer with range X ...
--
2.43.0


* [PATCH stable 6.6.y v3 4/4] selftests/bpf: Update comments find_equal_scalars->sync_linked_regs
From: Zhenzhong Wu @ 2026-06-14 16:58 UTC (permalink / raw)
  To: bpf
  Cc: netdev, linux-kernel, ast, daniel, john.fastabend, andrii,
	martin.lau, song, yonghong.song, kpsingh, haoluo, jolsa,
	menglong8.dong, eddyz87, shung-hsi.yu, stable, mykolal, tamird
In-Reply-To: <cover.1781194510.git.jt26wzz@gmail.com>

From: Eduard Zingerman <eddyz87@gmail.com>

[ Upstream commit cfbf25481d6dec0089c99c9d33a2ea634fe8f008 ]

find_equal_scalars() was renamed to sync_linked_regs();
this commit updates the remaining references in the selftests comments.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240718202357.1746514-5-eddyz87@gmail.com
[ zhenzhong: only two pre-existing comments still needed updating in 6.6.y. ]
Signed-off-by: Zhenzhong Wu <jt26wzz@gmail.com>
---
 tools/testing/selftests/bpf/progs/verifier_spill_fill.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/verifier_spill_fill.c b/tools/testing/selftests/bpf/progs/verifier_spill_fill.c
index 1f71f596d..07a2527a8 100644
--- a/tools/testing/selftests/bpf/progs/verifier_spill_fill.c
+++ b/tools/testing/selftests/bpf/progs/verifier_spill_fill.c
@@ -392,7 +392,7 @@ __naked void spill_32bit_of_64bit_fail(void)
 	*(u32*)(r10 - 8) = r1;				\
 	/* 32-bit fill r2 from stack. */		\
 	r2 = *(u32*)(r10 - 8);				\
-	/* Compare r2 with another register to trigger find_equal_scalars.\
+	/* Compare r2 with another register to trigger sync_linked_regs.\
 	 * Having one random bit is important here, otherwise the verifier cuts\
 	 * the corners. If the ID was mistakenly preserved on spill, this would\
 	 * cause the verifier to think that r1 is also equal to zero in one of\
@@ -431,7 +431,7 @@ __naked void spill_16bit_of_32bit_fail(void)
 	*(u16*)(r10 - 8) = r1;				\
 	/* 16-bit fill r2 from stack. */		\
 	r2 = *(u16*)(r10 - 8);				\
-	/* Compare r2 with another register to trigger find_equal_scalars.\
+	/* Compare r2 with another register to trigger sync_linked_regs.\
 	 * Having one random bit is important here, otherwise the verifier cuts\
 	 * the corners. If the ID was mistakenly preserved on spill, this would\
 	 * cause the verifier to think that r1 is also equal to zero in one of\
-- 
2.43.0



* [PATCH net-next v5 03/15] net: ethernet: oa_tc6: Move oa_tc6.c to its own directory
From: Selvamani Rajagopal via B4 Relay @ 2026-06-14 17:00 UTC (permalink / raw)
  To: Andrew Lunn, Piergiorgio Beruto, Heiner Kallweit, Russell King,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, Parthiban Veerasooran, Selva Rajagopal,
	Richard Cochran, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Simon Horman, Jonathan Corbet, Shuah Khan
  Cc: netdev, linux-kernel, devicetree, linux-doc, Jerry Ray,
	Selvamani Rajagopal
In-Reply-To: <20260614-s2500-mac-phy-support-v5-0-89874b72f725@onsemi.com>

From: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>

Move oa_tc6.c to its own directory, drivers/net/ethernet/oa_tc6. This
will facilitate adding more files to support other features
defined by the OPEN Alliance 10BASE-T1x Serial Interface specification.

This patch series adds two files, one for hardware
timestamp related functions and one for PTP related APIs.

Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>

---
changes in v5
  - No change
changes in v4
  - Removed reference to onsemi in Kconfig files
changes in v3
  - Moved oa_tc6.c to its own oa_tc6 directory under ethernet.
  - First patch
---
 MAINTAINERS                                |  2 +-
 drivers/net/ethernet/Kconfig               | 12 +-----------
 drivers/net/ethernet/Makefile              |  2 +-
 drivers/net/ethernet/oa_tc6/Kconfig        | 16 ++++++++++++++++
 drivers/net/ethernet/oa_tc6/Makefile       |  7 +++++++
 drivers/net/ethernet/{ => oa_tc6}/oa_tc6.c |  0
 6 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index cc1dde0c9067..4cee98fc922c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20001,7 +20001,7 @@ M:	Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
 L:	netdev@vger.kernel.org
 S:	Maintained
 F:	Documentation/networking/oa-tc6-framework.rst
-F:	drivers/net/ethernet/oa_tc6.c
+F:	drivers/net/ethernet/oa_tc6/oa_tc6*
 F:	include/linux/oa_tc6.h
 
 OPEN FIRMWARE AND FLATTENED DEVICE TREE
diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index 78c79ad7bba5..49d93488ba52 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -134,6 +134,7 @@ source "drivers/net/ethernet/netronome/Kconfig"
 source "drivers/net/ethernet/8390/Kconfig"
 source "drivers/net/ethernet/nvidia/Kconfig"
 source "drivers/net/ethernet/nxp/Kconfig"
+source "drivers/net/ethernet/oa_tc6/Kconfig"
 source "drivers/net/ethernet/oki-semi/Kconfig"
 
 config ETHOC
@@ -146,17 +147,6 @@ config ETHOC
 	help
 	  Say Y here if you want to use the OpenCores 10/100 Mbps Ethernet MAC.
 
-config OA_TC6
-	tristate "OPEN Alliance TC6 10BASE-T1x MAC-PHY support" if COMPILE_TEST
-	depends on SPI
-	select PHYLIB
-	help
-	  This library implements OPEN Alliance TC6 10BASE-T1x MAC-PHY
-	  Serial Interface protocol for supporting 10BASE-T1x MAC-PHYs.
-
-	  To know the implementation details, refer documentation in
-	  <file:Documentation/networking/oa-tc6-framework.rst>.
-
 source "drivers/net/ethernet/pasemi/Kconfig"
 source "drivers/net/ethernet/pensando/Kconfig"
 source "drivers/net/ethernet/qlogic/Kconfig"
diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
index bba55d9af387..77b11d5a7abf 100644
--- a/drivers/net/ethernet/Makefile
+++ b/drivers/net/ethernet/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_NET_VENDOR_NETRONOME) += netronome/
 obj-$(CONFIG_NET_VENDOR_NI) += ni/
 obj-$(CONFIG_NET_VENDOR_NVIDIA) += nvidia/
 obj-$(CONFIG_LPC_ENET) += nxp/
+obj-$(CONFIG_OA_TC6) += oa_tc6/
 obj-$(CONFIG_NET_VENDOR_OKI) += oki-semi/
 obj-$(CONFIG_ETHOC) += ethoc.o
 obj-$(CONFIG_NET_VENDOR_PASEMI) += pasemi/
@@ -104,4 +105,3 @@ obj-$(CONFIG_NET_VENDOR_XILINX) += xilinx/
 obj-$(CONFIG_NET_VENDOR_XIRCOM) += xircom/
 obj-$(CONFIG_NET_VENDOR_SYNOPSYS) += synopsys/
 obj-$(CONFIG_NET_VENDOR_PENSANDO) += pensando/
-obj-$(CONFIG_OA_TC6) += oa_tc6.o
diff --git a/drivers/net/ethernet/oa_tc6/Kconfig b/drivers/net/ethernet/oa_tc6/Kconfig
new file mode 100644
index 000000000000..97345f345fb9
--- /dev/null
+++ b/drivers/net/ethernet/oa_tc6/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# OA TC6 10BASE-T1x MAC-PHY configuration
+#
+
+config OA_TC6
+	tristate "OPEN Alliance TC6 10BASE-T1x MAC-PHY support"
+	depends on SPI
+	select PHYLIB
+	help
+	  This library implements OPEN Alliance TC6 10BASE-T1x MAC-PHY
+	  Serial Interface protocol for supporting 10BASE-T1x MAC-PHYs.
+
+	  To know the implementation details, refer documentation in
+	  <file:Documentation/networking/oa-tc6-framework.rst>.
+
diff --git a/drivers/net/ethernet/oa_tc6/Makefile b/drivers/net/ethernet/oa_tc6/Makefile
new file mode 100644
index 000000000000..f24aae852ef2
--- /dev/null
+++ b/drivers/net/ethernet/oa_tc6/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Makefile for OA TC6 10BASE-T1x MAC-PHY
+#
+
+obj-$(CONFIG_OA_TC6) := oa_tc6_mod.o
+oa_tc6_mod-objs := oa_tc6.o
diff --git a/drivers/net/ethernet/oa_tc6.c b/drivers/net/ethernet/oa_tc6/oa_tc6.c
similarity index 100%
rename from drivers/net/ethernet/oa_tc6.c
rename to drivers/net/ethernet/oa_tc6/oa_tc6.c

-- 
2.43.0



^ permalink raw reply related

* [PATCH net-next v5 01/15] net: phy: Helper to read and write through C45 without lock
From: Selvamani Rajagopal via B4 Relay @ 2026-06-14 17:00 UTC (permalink / raw)
  To: Andrew Lunn, Piergiorgio Beruto, Heiner Kallweit, Russell King,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, Parthiban Veerasooran, Selva Rajagopal,
	Richard Cochran, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Simon Horman, Jonathan Corbet, Shuah Khan
  Cc: netdev, linux-kernel, devicetree, linux-doc, Jerry Ray,
	Selvamani Rajagopal
In-Reply-To: <20260614-s2500-mac-phy-support-v5-0-89874b72f725@onsemi.com>

From: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>

Add generic helper functions that read and write registers through the
C45 bus protocol without taking the MDIO bus lock. This lets PHYs avoid
indirect C22 access to C45 registers, which a PHY may not support.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>

---
changes in v5
  - no change
changes in v4
  - lockdep_assert_held added to ensure correct calling convention
changes in v3
  - Added the genphy APIs to initiate Clause 45 register read/write
  - first patch
---
 drivers/net/phy/phy_device.c | 55 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/phy.h          |  4 ++++
 2 files changed, 59 insertions(+)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 0615228459ef..b82b99d08132 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -2787,6 +2787,61 @@ int genphy_write_mmd_unsupported(struct phy_device *phdev, int devnum,
 }
 EXPORT_SYMBOL(genphy_write_mmd_unsupported);
 
+/**
+ * genphy_phy_read_mmd - Helper for reading a register without lock
+ * from the given MMD and PHY.
+ * @phydev: The phy_device struct
+ * @devnum: The MMD to read from
+ * @regnum: The register on the MMD to read
+ *
+ * Description: PHYs can have both C22 and C45 register spaces. Once a PHY
+ * is discovered via C22 bus protocol, it uses C22 indirect access to
+ * access C45 registers. Some PHYs, like 10Base-T1S PHYs defined by OPEN
+ * Alliance 10BASE-T1x, support only direct access.
+ *
+ * If PHY indicates C45 support through DTS entry, it avoids C22 APIs
+ * entirely and therefore generic MDIO registers are inaccessible.
+ *
+ * MDIO bus isn't locked here because when called through read_mmd
+ * callback of phy_driver, caller is expected to lock the bus as
+ * implemented in phy_read_mmd.
+ *
+ * Returns: Register value if successful, negative error code on failure.
+ */
+int genphy_phy_read_mmd(struct phy_device *phydev, int devnum,
+			u16 regnum)
+{
+	struct mii_bus *bus = phydev->mdio.bus;
+	int addr = phydev->mdio.addr;
+
+	lockdep_assert_held(&bus->mdio_lock);
+	return __mdiobus_c45_read(bus, addr, devnum, regnum);
+}
+EXPORT_SYMBOL(genphy_phy_read_mmd);
+
+/**
+ * genphy_phy_write_mmd - Helper for writing a register without lock
+ * to the given MMD and PHY.
+ * @phydev: The phy_device struct
+ * @devnum: The MMD to write to
+ * @regnum: The register on the MMD to write
+ * @val:    Value to write
+ *
+ * Description: Similar to genphy_phy_read_mmd
+ *
+ * Returns: 0 if successful, negative error code on failure.
+ */
+int genphy_phy_write_mmd(struct phy_device *phydev, int devnum,
+			 u16 regnum, u16 val)
+{
+	struct mii_bus *bus = phydev->mdio.bus;
+	int addr = phydev->mdio.addr;
+
+	lockdep_assert_held(&bus->mdio_lock);
+	return __mdiobus_c45_write(bus, addr, devnum, regnum, val);
+}
+EXPORT_SYMBOL(genphy_phy_write_mmd);
+
 int genphy_suspend(struct phy_device *phydev)
 {
 	return phy_set_bits(phydev, MII_BMCR, BMCR_PDOWN);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 199a7aaa341b..8266dd4a8dbe 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -2301,6 +2301,10 @@ int genphy_read_mmd_unsupported(struct phy_device *phdev, int devad,
 				u16 regnum);
 int genphy_write_mmd_unsupported(struct phy_device *phdev, int devnum,
 				 u16 regnum, u16 val);
+int genphy_phy_write_mmd(struct phy_device *phydev, int devnum,
+			 u16 regnum, u16 val);
+int genphy_phy_read_mmd(struct phy_device *phydev, int devnum,
+			u16 regnum);
 
 /* Clause 37 */
 int genphy_c37_config_aneg(struct phy_device *phydev);

-- 
2.43.0
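
The locking convention described in the kernel-doc above (the helper does not take the bus lock; the caller does, and the helper only asserts it) can be illustrated with a small userspace sketch. Everything here is hypothetical and for illustration only: `struct mock_bus`, `mock_c45_read_unlocked()`, `mock_c45_write_unlocked()`, `mock_read_mmd()` and `mock_write_mmd()` are not kernel APIs, and a plain `assert()` on a flag stands in for `lockdep_assert_held()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MOCK_MMD_COUNT 32	/* C45 device addresses are 5 bits wide */
#define MOCK_REG_COUNT 16	/* arbitrary small register file per MMD */

/* Hypothetical stand-in for struct mii_bus: a lock flag plus register space. */
struct mock_bus {
	bool locked;
	uint16_t regs[MOCK_MMD_COUNT][MOCK_REG_COUNT];
};

/* Unlocked read: the caller must already hold the bus lock. */
int mock_c45_read_unlocked(struct mock_bus *bus, int devnum, uint16_t regnum)
{
	assert(bus->locked);	/* plays the role of lockdep_assert_held() */
	if (devnum < 0 || devnum >= MOCK_MMD_COUNT || regnum >= MOCK_REG_COUNT)
		return -EINVAL;
	return bus->regs[devnum][regnum];
}

/* Unlocked write: same calling convention as the read helper. */
int mock_c45_write_unlocked(struct mock_bus *bus, int devnum,
			    uint16_t regnum, uint16_t val)
{
	assert(bus->locked);
	if (devnum < 0 || devnum >= MOCK_MMD_COUNT || regnum >= MOCK_REG_COUNT)
		return -EINVAL;
	bus->regs[devnum][regnum] = val;
	return 0;
}

/* Locked wrapper: takes the lock, calls the unlocked helper, releases it. */
int mock_read_mmd(struct mock_bus *bus, int devnum, uint16_t regnum)
{
	int ret;

	bus->locked = true;
	ret = mock_c45_read_unlocked(bus, devnum, regnum);
	bus->locked = false;
	return ret;
}

int mock_write_mmd(struct mock_bus *bus, int devnum, uint16_t regnum,
		   uint16_t val)
{
	int ret;

	bus->locked = true;
	ret = mock_c45_write_unlocked(bus, devnum, regnum, val);
	bus->locked = false;
	return ret;
}
```

In this sketch, calling `mock_c45_read_unlocked()` directly on an unlocked bus trips the assertion, which mirrors the intent of the `lockdep_assert_held()` check added in v4 of the patch.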



^ permalink raw reply related

* [PATCH net-next v5 04/15] net: phy: microchip_t1s: Use generic APIs for C45 read and write
From: Selvamani Rajagopal via B4 Relay @ 2026-06-14 17:00 UTC (permalink / raw)
  To: Andrew Lunn, Piergiorgio Beruto, Heiner Kallweit, Russell King,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, Parthiban Veerasooran, Selva Rajagopal,
	Richard Cochran, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Simon Horman, Jonathan Corbet, Shuah Khan
  Cc: netdev, linux-kernel, devicetree, linux-doc, Jerry Ray,
	Selvamani Rajagopal
In-Reply-To: <20260614-s2500-mac-phy-support-v5-0-89874b72f725@onsemi.com>

From: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>

Replace the vendor-specific implementation with the generic API to read
and write PHY registers using the C45 bus protocol.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>

---
changes in v5
  - no change
changes in v4
  - no change
changes in v3
  - Updated vendor specific phy_read_mmd/phy_write_mmd functions to
    use the newly introduced genphy read/write APIs
  - First patch
---
 drivers/net/phy/microchip_t1s.c | 32 ++------------------------------
 1 file changed, 2 insertions(+), 30 deletions(-)

diff --git a/drivers/net/phy/microchip_t1s.c b/drivers/net/phy/microchip_t1s.c
index e601d56b2507..0c4dc70641d8 100644
--- a/drivers/net/phy/microchip_t1s.c
+++ b/drivers/net/phy/microchip_t1s.c
@@ -506,34 +506,6 @@ static int lan86xx_read_status(struct phy_device *phydev)
 	return 0;
 }
 
-/* OPEN Alliance 10BASE-T1x compliance MAC-PHYs will have both C22 and
- * C45 registers space. If the PHY is discovered via C22 bus protocol it assumes
- * it uses C22 protocol and always uses C22 registers indirect access to access
- * C45 registers. This is because, we don't have a clean separation between
- * C22/C45 register space and C22/C45 MDIO bus protocols. Resulting, PHY C45
- * registers direct access can't be used which can save multiple SPI bus access.
- * To support this feature, set .read_mmd/.write_mmd in the PHY driver to call
- * .read_c45/.write_c45 in the OPEN Alliance framework
- * drivers/net/ethernet/oa_tc6.c
- */
-static int lan865x_phy_read_mmd(struct phy_device *phydev, int devnum,
-				u16 regnum)
-{
-	struct mii_bus *bus = phydev->mdio.bus;
-	int addr = phydev->mdio.addr;
-
-	return __mdiobus_c45_read(bus, addr, devnum, regnum);
-}
-
-static int lan865x_phy_write_mmd(struct phy_device *phydev, int devnum,
-				 u16 regnum, u16 val)
-{
-	struct mii_bus *bus = phydev->mdio.bus;
-	int addr = phydev->mdio.addr;
-
-	return __mdiobus_c45_write(bus, addr, devnum, regnum, val);
-}
-
 static struct phy_driver microchip_t1s_driver[] = {
 	{
 		PHY_ID_MATCH_EXACT(PHY_ID_LAN867X_REVB1),
@@ -584,8 +556,8 @@ static struct phy_driver microchip_t1s_driver[] = {
 		.features           = PHY_BASIC_T1S_P2MP_FEATURES,
 		.config_init        = lan865x_revb_config_init,
 		.read_status        = lan86xx_read_status,
-		.read_mmd           = lan865x_phy_read_mmd,
-		.write_mmd          = lan865x_phy_write_mmd,
+		.read_mmd           = genphy_phy_read_mmd,
+		.write_mmd          = genphy_phy_write_mmd,
 		.get_plca_cfg	    = genphy_c45_plca_get_cfg,
 		.set_plca_cfg	    = lan86xx_plca_set_cfg,
 		.get_plca_status    = genphy_c45_plca_get_status,

-- 
2.43.0



^ permalink raw reply related

* [PATCH net-next v5 02/15] net: phy: Helper to modify PHY loopback mode only
From: Selvamani Rajagopal via B4 Relay @ 2026-06-14 17:00 UTC (permalink / raw)
  To: Andrew Lunn, Piergiorgio Beruto, Heiner Kallweit, Russell King,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, Parthiban Veerasooran, Selva Rajagopal,
	Richard Cochran, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Simon Horman, Jonathan Corbet, Shuah Khan
  Cc: netdev, linux-kernel, devicetree, linux-doc, Jerry Ray,
	Selvamani Rajagopal
In-Reply-To: <20260614-s2500-mac-phy-support-v5-0-89874b72f725@onsemi.com>

From: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>

Add a generic helper function that modifies the loopback bit of the PHY
without touching any other bit. This helps PHYs that have a fixed
speed, such as 10Base-T1S, and PHYs that need no other settings
to enter loopback mode.

Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>

---
changes in v5
  - No change
changes in v4
  - Created a new genphy API to set the loopback. No other PHY
    registers touched.
---
 drivers/net/phy/dp83867.c    | 11 +----------
 drivers/net/phy/phy_device.c | 20 ++++++++++++++++++++
 include/linux/phy.h          |  2 ++
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/net/phy/dp83867.c b/drivers/net/phy/dp83867.c
index 88255e92b4cd..01ea2e8dd253 100644
--- a/drivers/net/phy/dp83867.c
+++ b/drivers/net/phy/dp83867.c
@@ -991,15 +991,6 @@ static void dp83867_link_change_notify(struct phy_device *phydev)
 	}
 }
 
-static int dp83867_loopback(struct phy_device *phydev, bool enable, int speed)
-{
-	if (enable && speed)
-		return -EOPNOTSUPP;
-
-	return phy_modify(phydev, MII_BMCR, BMCR_LOOPBACK,
-			  enable ? BMCR_LOOPBACK : 0);
-}
-
 static int
 dp83867_led_brightness_set(struct phy_device *phydev,
 			   u8 index, enum led_brightness brightness)
@@ -1204,7 +1195,7 @@ static struct phy_driver dp83867_driver[] = {
 		.resume		= dp83867_resume,
 
 		.link_change_notify = dp83867_link_change_notify,
-		.set_loopback	= dp83867_loopback,
+		.set_loopback	= genphy_loopback_fixed_speed,
 
 		.led_brightness_set = dp83867_led_brightness_set,
 		.led_hw_is_supported = dp83867_led_hw_is_supported,
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index b82b99d08132..11fd204eea16 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -2842,6 +2842,26 @@ int genphy_phy_write_mmd(struct phy_device *phydev, int devnum,
 }
 EXPORT_SYMBOL(genphy_phy_write_mmd);
 
+/**
+ * genphy_loopback_fixed_speed - Helper to modify the PHY loopback mode
+ * without affecting any other settings.
+ * @phydev: The phy_device struct
+ * @enable: Flag to enable or disable the PHY level loopback.
+ * @speed: Speed setting. Not expected to be set. Error if it is set.
+ *
+ * Returns: 0 if successful, negative error code on failure.
+ */
+int genphy_loopback_fixed_speed(struct phy_device *phydev, bool enable,
+				int speed)
+{
+	if (enable && speed)
+		return -EOPNOTSUPP;
+
+	return phy_modify(phydev, MII_BMCR, BMCR_LOOPBACK,
+			  enable ? BMCR_LOOPBACK : 0);
+}
+EXPORT_SYMBOL(genphy_loopback_fixed_speed);
+
 int genphy_suspend(struct phy_device *phydev)
 {
 	return phy_set_bits(phydev, MII_BMCR, BMCR_PDOWN);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 8266dd4a8dbe..61bcd71a3143 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -2301,6 +2301,8 @@ int genphy_read_mmd_unsupported(struct phy_device *phdev, int devad,
 				u16 regnum);
 int genphy_write_mmd_unsupported(struct phy_device *phdev, int devnum,
 				 u16 regnum, u16 val);
+int genphy_loopback_fixed_speed(struct phy_device *phydev, bool enable,
+				int speed);
 int genphy_phy_write_mmd(struct phy_device *phydev, int devnum,
 			 u16 regnum, u16 val);
 int genphy_phy_read_mmd(struct phy_device *phydev, int devnum,

-- 
2.43.0
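
The read-modify-write behaviour of the loopback helper above can be tried out in userspace with a minimal sketch. The BMCR bit value matches `linux/mii.h`; `struct mock_phy`, `mock_modify()` and `mock_loopback_fixed_speed()` are hypothetical names used only for illustration, and a plain variable replaces the real MDIO register access.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BMCR_LOOPBACK 0x4000	/* loopback bit, same value as in linux/mii.h */

/* Hypothetical stand-in for a PHY: only the BMCR register is modelled. */
struct mock_phy {
	uint16_t bmcr;
};

/* Clear the bits in mask, then set the bits in set; other bits are kept. */
int mock_modify(uint16_t *reg, uint16_t mask, uint16_t set)
{
	*reg = (uint16_t)((*reg & ~mask) | set);
	return 0;
}

/* Toggle loopback only; a requested speed is rejected when enabling. */
int mock_loopback_fixed_speed(struct mock_phy *phy, bool enable, int speed)
{
	assert(phy != NULL);
	if (enable && speed)
		return -EOPNOTSUPP;

	return mock_modify(&phy->bmcr, BMCR_LOOPBACK,
			   enable ? BMCR_LOOPBACK : 0);
}
```

Starting from a BMCR value with other bits already set, enabling and then disabling loopback in this sketch returns the register to its starting value, which is the property the commit message describes as not modifying any other bit.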



^ permalink raw reply related

* [PATCH net-next v5 00/15] Support for onsemi's S2500 10Base-T1S MAC-PHY
From: Selvamani Rajagopal via B4 Relay @ 2026-06-14 17:00 UTC (permalink / raw)
  To: Andrew Lunn, Piergiorgio Beruto, Heiner Kallweit, Russell King,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, Parthiban Veerasooran, Selva Rajagopal,
	Richard Cochran, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Simon Horman, Jonathan Corbet, Shuah Khan
  Cc: netdev, linux-kernel, devicetree, linux-doc, Jerry Ray,
	Selvamani Rajagopal

This patch series adds support for onsemi's S2500, an
IEEE 802.3cg compliant Ethernet transceiver with an integrated
Media Access Controller (MAC-PHY).

The driver is compatible with the existing OA TC6 framework
and supports hardware timestamping.

The driver supports running selftest and loopback tests.
Through ethtool, it can provide traffic stats, rmon stats,
and timestamping related traffic stats.

As the S2500 has an internal PHY, onsemi's PHY driver has been
changed to support this device.

---
Changes in v5:
 - kernel-doc related changes in oa_tc6.c, onsemi driver files and
   oa tc6 rst file
 - Link to v4: https://lore.kernel.org/r/20260605-s2500-mac-phy-support-v4-0-de0fbc13c6d8@onsemi.com

Changes in v4:
 - Added return value comment for genphy_phy_read_mmd/genphy_phy_write_mmd functions
 - Added genphy_loopback_fixed_speed helper function to be used in
   set_loopback callbacks
 - Updated networking documentation for OA TC6 framework to elaborate
   on what is expected in the ptp_clock_info structure for registration.
 - added spi-max-frequency in YAML file based on alert from sashiko-bot
 - Removed model/version from the onsemi driver's private structure as
   they were useful as "information-only" data.
 - Replaced the non-standard selftest with Linux's standard selftest
   and made it as a separate patch
 - Changed bit manipulation, shift operations to use macros so that
   it is clean and readable.
 - Added new read_register and write_register APIs with an _mms suffix
   so that MMS (memory map selector) can be given as a parameter.
 - Fixed the wrong condition check with NETIF_F_RXFCS to subtract
   FCS size from the length of the frame.

To: Andrew Lunn <andrew@lunn.ch>
To: Piergiorgio Beruto <pier.beruto@onsemi.com>
To: Heiner Kallweit <hkallweit1@gmail.com>
To: Russell King <linux@armlinux.org.uk>
To: David S. Miller <davem@davemloft.net>
To: Eric Dumazet <edumazet@google.com>
To: Jakub Kicinski <kuba@kernel.org>
To: Paolo Abeni <pabeni@redhat.com>
To: Andrew Lunn <andrew+netdev@lunn.ch>
To: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
To: Selva Rajagopal <selvamani.rajagopal@onsemi.com>
To: Richard Cochran <richardcochran@gmail.com>
To: Rob Herring <robh@kernel.org>
To: Krzysztof Kozlowski <krzk+dt@kernel.org>
To: Conor Dooley <conor+dt@kernel.org>
To: Simon Horman <horms@kernel.org>
To: Jonathan Corbet <corbet@lwn.net>
To: Shuah Khan <skhan@linuxfoundation.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: devicetree@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: Jerry Ray <jerry.ray@microchip.com>
Signed-off-by: Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>

---
Selvamani Rajagopal (15):
      net: phy: Helper to read and write through C45 without lock
      net: phy: Helper to modify PHY loopback mode only
      net: ethernet: oa_tc6: Move oa_tc6.c to its own directory
      net: phy: microchip_t1s: Use generic APIs for C45 read and write
      net: ethernet: oa_tc6: Move constant definitions to header file
      net: ethernet: oa_tc6: Support for hardware timestamp
      net: ethernet: oa_tc6: Support for vendor specific MMS
      net: ethernet: oa_tc6: read, write interface with MMS option
      net: phy: ncn26000: Support for onsemi's S2500 internal phy
      net: phy: ncn26000: Enable enhanced noise immunity
      net: phy: ncn26000: Support for loopback
      onsemi: s2500: Add driver support for TS2500 MAC-PHY
      onsemi: s2500: Added selftest support to onsemi's S2500 driver
      dt-bindings: net: add onsemi's S2500
      Documentation: networking: Add timestamp related APIs to OA TC6 framework

 .../devicetree/bindings/net/onnn,s2500.yaml        |  67 +++
 Documentation/networking/oa-tc6-framework.rst      |  80 +++
 MAINTAINERS                                        |  13 +-
 drivers/net/ethernet/Kconfig                       |  12 +-
 drivers/net/ethernet/Makefile                      |   2 +-
 drivers/net/ethernet/microchip/lan865x/lan865x.c   |  61 +-
 drivers/net/ethernet/oa_tc6/Kconfig                |  16 +
 drivers/net/ethernet/oa_tc6/Makefile               |   7 +
 drivers/net/ethernet/{ => oa_tc6}/oa_tc6.c         | 465 +++++++++------
 drivers/net/ethernet/oa_tc6/oa_tc6_ptp.c           |  67 +++
 drivers/net/ethernet/oa_tc6/oa_tc6_std_def.h       | 190 +++++++
 drivers/net/ethernet/oa_tc6/oa_tc6_tstamp.c        | 201 +++++++
 drivers/net/ethernet/onsemi/Kconfig                |  21 +
 drivers/net/ethernet/onsemi/Makefile               |   7 +
 drivers/net/ethernet/onsemi/s2500/Kconfig          |  22 +
 drivers/net/ethernet/onsemi/s2500/Makefile         |   7 +
 drivers/net/ethernet/onsemi/s2500/s2500_ethtool.c  | 354 ++++++++++++
 drivers/net/ethernet/onsemi/s2500/s2500_hw_def.h   | 225 ++++++++
 drivers/net/ethernet/onsemi/s2500/s2500_main.c     | 632 +++++++++++++++++++++
 drivers/net/ethernet/onsemi/s2500/s2500_ptp.c      | 233 ++++++++
 drivers/net/phy/dp83867.c                          |  11 +-
 drivers/net/phy/microchip_t1s.c                    |  32 +-
 drivers/net/phy/ncn26000.c                         |  63 +-
 drivers/net/phy/phy_device.c                       |  75 +++
 include/linux/oa_tc6.h                             |  36 ++
 include/linux/phy.h                                |   6 +
 26 files changed, 2655 insertions(+), 250 deletions(-)
---
base-commit: 2319688890d97c63da423a3c57c23b4ab5952dfc
change-id: 20260601-s2500-mac-phy-support-4f3ae920fb73

Best regards,
--  
Selvamani Rajagopal <Selvamani.Rajagopal@onsemi.com>



^ permalink raw reply
