Netdev List
* Re: [RFC net-next 2/4] selftests: drv-net: tso: add helpers for double tunneling GSO
From: Jakub Kicinski @ 2026-04-10  2:23 UTC (permalink / raw)
  To: Xu Du
  Cc: davem, edumazet, pabeni, horms, shuah, netdev, linux-kselftest,
	linux-kernel
In-Reply-To: <CAA92Kxmka9=GEvNwxOy7pSEuudqsaF3WGJttyQpYEkTyyYhgLg@mail.gmail.com>

On Thu, 9 Apr 2026 15:35:21 +0800 Xu Du wrote:
> > > I want to test the gro-hint parameter functionality of the GENEVE tunnel,
> > > so I intend to use YNL for the testing. I am conducting the test between
> > > two machines using SSH type. I want to add the gro-hint parameter on
> > > both the local and remote nodes; however, I am unable to invoke class
> > > RtnlFamily on the remote node via SSH.  
> >
> > Oh. But that's not really what you're doing:
> >
> > +def ynlcli(family, args, json=None, ns=None, host=None):
> > +    if (KSFT_DIR / "kselftest-list.txt").exists():
> > +        cli = KSFT_DIR / "net/lib/ynl/pyynl/cli.py"
> > +        spec = KSFT_DIR / f"net/lib/specs/{family}.yaml"
> > +    else:
> > +        cli = KSRC / "tools/net/ynl/pyynl/cli.py"
> > +        spec = KSRC / f"Documentation/netlink/specs/{family}.yaml"
> > +    if not cli.exists():
> > +        raise FileNotFoundError(f"cli not found at {cli}")
> > +    args = f"--spec {spec} --no-schema {args}"
> > +    return tool(cli.as_posix(), args, json=json, ns=ns, host=host, shell=True)
> >
> > You're not deploying anything to the remote system.
> > Are you assuming that the remote system magically has the same
> > filesystem layout?
> >
> > You can use the ynl CLI but it has to be whatever version is on
> > the remote system. Just call ynl --family rt-link, don't dig
> > around for the spec paths etc.
> >  
> 
> In fact, I have tested this from two different locations. The first is in
> tools/testing/selftests/drivers/net/hw/ using python3 tso.py,
> which utilizes the specs located in Documentation/netlink/specs/.
> The second follows the testing methodology described in the
> README.rst of tools/testing/selftests/drivers/net/, which uses the specs
> in net/lib/specs/. Based on this, I conclude that different processes utilize
> different spec locations.
> I also referred to the implementation in net/lib/py/ynl.py, which employs
> a similar handling logic. Both using the source code repository and
> installing the package can meet the requirements for remote testing.

On the local system you have Python bindings.
On the remote system you can't assume KSFT_DIR or KSRC exist.

You can use ynl CLI on the remote system, like we use ip, tc etc.
But then just use --family rt-link, don't try to find the filesystem
path to the spec.
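A minimal sketch of this suggestion, assuming a `ynl` binary is installed on the remote host and resolves specs by family name. `ynl_remote_cmd()` is a hypothetical helper, not part of the selftest library. It only builds the command string and performs no KSFT_DIR/KSRC lookups, so nothing about the remote filesystem layout is assumed:

```python
import shlex


def ynl_remote_cmd(family: str, args: str) -> str:
    """Build a ynl CLI command line to run on the remote host.

    Hypothetical helper: the remote ynl resolves the spec from the
    family name itself, so no local spec paths leak into the command.
    """
    return f"ynl --family {shlex.quote(family)} {args}"


# e.g. dump links on the remote side, the same way ip/tc are invoked there
print(ynl_remote_cmd("rt-link", "--dump getlink"))
```

The resulting string would presumably be run through the same remote-command path the tests already use for `ip` and `tc`.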

Is this clear now? Am I misunderstanding your misunderstanding?

^ permalink raw reply

* Re: [PATCH net-next v2 5/5] ethtool: strset: check nla_len overflow
From: Jakub Kicinski @ 2026-04-10  2:14 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Stanislav Fomichev, Hangbin Liu, Donald Hunter, David S. Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, netdev, linux-kernel
In-Reply-To: <bb2e7087-aa36-4556-8778-b65d11354779@lunn.ch>

On Thu, 9 Apr 2026 18:12:36 +0200 Andrew Lunn wrote:
> > I guess... Should we update ethtool.yaml doc to tell the users to prefer
> > ioctl over netlink for strset-get and mention this new EMSGSIZE?  
> 
> No. The ioctl is deprecated. It can still be used for drivers which
> need it, but netlink is the preferred method.

Off the top of my head I think string sets are used in Netlink in
bitsets. These are the first class citizen string sets, maintained 
in the core.

I'm not sure if there was any motivation for exposing other string
sets (like driver priv flags, legacy stats etc) via Netlink or it 
was just a "why not cover the entire enum if it just works" thing.

There is no known real user space which runs into the Netlink + legacy
strings issue. Hangbin was doing exhaustive testing of the ethtool
YNL.

LMK if this makes sense or I'm missing a concern. If we need to make
larger string sets work we'll definitely have to revisit. But I'd like
to have that helper from patch 4 in tree sooner rather than later,
so I'd lean towards merging this series.

^ permalink raw reply

* [PATCH net v2] net: fix __this_cpu_add() in preemptible code in dev_xmit_recursion_inc/dec
From: Jiayuan Chen @ 2026-04-10  2:06 UTC (permalink / raw)
  To: netdev
  Cc: Jiayuan Chen, David S. Miller, David Ahern, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Weiming Shi,
	linux-kernel

dev_xmit_recursion_{inc,dec}() use __this_cpu_{inc,dec}() which requires
the caller to be non-preemptible in order to avoid cpu migration. However,
some callers like SCTP's UDP encapsulation path invoke iptunnel_xmit()
from process context without disabling BH or preemption:

  sctp_inet_connect -> __sctp_connect -> sctp_do_sm ->
  sctp_outq_flush -> sctp_packet_transmit -> sctp_v4_xmit ->
  udp_tunnel_xmit_skb -> iptunnel_xmit -> dev_xmit_recursion_inc

This triggers the following warning on PREEMPT(full) kernels:

  BUG: using __this_cpu_add() in preemptible [00000000]
  caller is dev_xmit_recursion_inc include/linux/netdevice.h:3595 [inline]
  caller is iptunnel_xmit+0x1cd/0xb80 net/ipv4/ip_tunnel_core.c:72
  Tainted: [L]=SOFTLOCKUP
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:94 [inline]
   dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
   check_preemption_disabled+0xd8/0xe0 lib/smp_processor_id.c:47
   dev_xmit_recursion_inc include/linux/netdevice.h:3595 [inline]
   iptunnel_xmit+0x1cd/0xb80 net/ipv4/ip_tunnel_core.c:72
   sctp_v4_xmit+0x75f/0x1060 net/sctp/protocol.c:1073
   sctp_packet_transmit+0x22ec/0x3060 net/sctp/output.c:653
   sctp_packet_singleton+0x19e/0x370 net/sctp/outqueue.c:783
   sctp_outq_flush_ctrl net/sctp/outqueue.c:914 [inline]
   sctp_outq_flush+0x315/0x3350 net/sctp/outqueue.c:1212
   sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1824 [inline]
   sctp_side_effects net/sctp/sm_sideeffect.c:1204 [inline]
   sctp_do_sm+0xce1/0x5be0 net/sctp/sm_sideeffect.c:1175
   sctp_primitive_ASSOCIATE+0x9c/0xd0 net/sctp/primitive.c:73
   __sctp_connect+0x9fc/0xc70 net/sctp/socket.c:1235
   sctp_connect net/sctp/socket.c:4818 [inline]
   sctp_inet_connect+0x15f/0x220 net/sctp/socket.c:4833
   __sys_connect_file+0x141/0x1a0 net/socket.c:2089
   __sys_connect+0x141/0x170 net/socket.c:2108
   __do_sys_connect net/socket.c:2114 [inline]
   __se_sys_connect net/socket.c:2111 [inline]
   __x64_sys_connect+0x72/0xb0 net/socket.c:2111
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

All other callers of dev_xmit_recursion_{inc,dec}() are fine: those in
net/core/dev.c and net/core/filter.c run under local_bh_disable(), and
lwtunnel_input() asserts in_softirq() context. Currently only
iptunnel_xmit() and ip6tunnel_xmit() can be reached from process
context via the SCTP UDP encapsulation path.

Fix this by adding guard(migrate)() at the top of iptunnel_xmit() and
ip6tunnel_xmit() to ensure dev_recursion_level(), dev_xmit_recursion_inc()
and dev_xmit_recursion_dec() all run on the same CPU.

Fixes: 6f1a9140ecda ("net: add xmit recursion limit to tunnel xmit functions")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
v1->v2: https://lore.kernel.org/netdev/20260409035344.214279-1-jiayuan.chen@linux.dev/
 - Move guard(migrate)() to iptunnel_xmit()/ip6tunnel_xmit() instead
   of dev_xmit_recursion_{inc,dec}(), so that dev_recursion_level() is
   also covered under the same migration protection.
 - Revert changes to include/linux/netdevice.h.
---
 include/net/ip6_tunnel.h  | 2 ++
 net/ipv4/ip_tunnel_core.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 359b595f1df9..3f877164233c 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -156,6 +156,8 @@ static inline void ip6tunnel_xmit(struct sock *sk, struct sk_buff *skb,
 {
 	int pkt_len, err;
 
+	guard(migrate)();
+
 	if (unlikely(dev_recursion_level() > IP_TUNNEL_RECURSION_LIMIT)) {
 		if (dev) {
 			net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 5683c328990f..808b8eaf7fad 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -58,6 +58,8 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb,
 	struct iphdr *iph;
 	int err;
 
+	guard(migrate)();
+
 	if (unlikely(dev_recursion_level() > IP_TUNNEL_RECURSION_LIMIT)) {
 		if (dev) {
 			net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
-- 
2.43.0
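The hazard fixed here can be shown with a toy Python model (not kernel code). It assumes a per-CPU recursion counter that is incremented and decremented around the transmit. If a preemptible caller migrates between the increment and the decrement, the two operations land on different CPUs and both counters are left unbalanced. Keeping the whole function on one CPU, which is what guard(migrate)() provides, keeps them balanced:

```python
class PerCpuCounter:
    """Toy model of a per-CPU counter such as the xmit recursion level."""

    def __init__(self, ncpus: int):
        self.vals = [0] * ncpus

    def inc(self, cpu: int) -> None:
        self.vals[cpu] += 1

    def dec(self, cpu: int) -> None:
        self.vals[cpu] -= 1


def xmit(counter: PerCpuCounter, cpu_at_inc: int, cpu_at_dec: int) -> None:
    """Model of iptunnel_xmit(): inc, transmit, dec.

    cpu_at_inc != cpu_at_dec models a migration in between, which is
    possible when neither BH nor migration is disabled.
    """
    counter.inc(cpu_at_inc)
    # ... packet transmission would happen here ...
    counter.dec(cpu_at_dec)


# Preemptible caller migrated from CPU 0 to CPU 1 mid-transmit.
c = PerCpuCounter(2)
xmit(c, cpu_at_inc=0, cpu_at_dec=1)
print(c.vals)  # [1, -1]: CPU 0 now looks permanently "recursed"

# Migration disabled for the whole function: inc and dec on the same CPU.
c = PerCpuCounter(2)
xmit(c, cpu_at_inc=0, cpu_at_dec=0)
print(c.vals)  # [0, 0]
```

The same reasoning covers the dev_recursion_level() read at the top of the function, which is why the guard covers the read as well as the inc/dec pair.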


^ permalink raw reply related

* Re: [PATCH net-next v11 00/14] netkit: Support for io_uring zero-copy and AF_XDP
From: patchwork-bot+netdevbpf @ 2026-04-10  2:00 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, bpf, kuba, davem, razor, pabeni, willemb, sdf,
	john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, dw, toke, yangzhenze, wangdongdong.6
In-Reply-To: <20260402231031.447597-1-daniel@iogearbox.net>

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Fri,  3 Apr 2026 01:10:17 +0200 you wrote:
> Containers use virtual netdevs to route traffic from a physical netdev
> in the host namespace. They do not have access to the physical netdev
> in the host and thus can't use memory providers or AF_XDP that require
> reconfiguring/restarting queues in the physical netdev.
> 
> This patchset adds the concept of queue leasing to virtual netdevs that
> allow containers to use memory providers and AF_XDP at native speed.
> Leased queues are bound to a real queue in a physical netdev and act
> as a proxy.
> 
> [...]

Here is the summary with links:
  - [net-next,v11,01/14] net: Add queue-create operation
    https://git.kernel.org/netdev/net-next/c/7789c6bb76ac
  - [net-next,v11,02/14] net: Implement netdev_nl_queue_create_doit
    https://git.kernel.org/netdev/net-next/c/d04686d9bc86
  - [net-next,v11,03/14] net: Add lease info to queue-get response
    https://git.kernel.org/netdev/net-next/c/21d58b35e500
  - [net-next,v11,04/14] net, ethtool: Disallow leased real rxqs to be resized
    https://git.kernel.org/netdev/net-next/c/22fdf28f7c03
  - [net-next,v11,05/14] net: Slightly simplify net_mp_{open,close}_rxq
    https://git.kernel.org/netdev/net-next/c/1e91c98bc9a8
  - [net-next,v11,06/14] net: Proxy netif_mp_{open,close}_rxq for leased queues
    https://git.kernel.org/netdev/net-next/c/5602ad61ebee
  - [net-next,v11,07/14] net: Proxy netdev_queue_get_dma_dev for leased queues
    https://git.kernel.org/netdev/net-next/c/222b5566a02d
  - [net-next,v11,08/14] xsk: Extend xsk_rcv_check validation
    https://git.kernel.org/netdev/net-next/c/9368397fb92a
  - [net-next,v11,09/14] xsk: Proxy pool management for leased queues
    https://git.kernel.org/netdev/net-next/c/910f636db958
  - [net-next,v11,10/14] netkit: Add single device mode for netkit
    https://git.kernel.org/netdev/net-next/c/481038960538
  - [net-next,v11,11/14] netkit: Implement rtnl_link_ops->alloc and ndo_queue_create
    https://git.kernel.org/netdev/net-next/c/b789acc0695c
  - [net-next,v11,12/14] netkit: Add netkit notifier to check for unregistering devices
    https://git.kernel.org/netdev/net-next/c/25444470570b
  - [net-next,v11,13/14] netkit: Add xsk support for af_xdp applications
    https://git.kernel.org/netdev/net-next/c/a14fd6474883
  - [net-next,v11,14/14] selftests/net: Add queue leasing tests with netkit
    https://git.kernel.org/netdev/net-next/c/65d657d80684

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH v2] net/mlx5: Fix OOB access and stack information leak in
From: Prathamesh Deshpande @ 2026-04-10  2:00 UTC (permalink / raw)
  To: cjubran
  Cc: leon, linux-kernel, linux-rdma, mbloch, netdev,
	prathameshdeshpande7, richardcochran, saeedm, tariqt
In-Reply-To: <3a238d0c-4ec1-432d-995a-19d7db3e310e@nvidia.com>

On Thu, Apr 9, 2026 at 17:16 +0300, Carolina Jubran wrote:
> pin is defined as u8 in struct mlx5_eqe_pps, so pin < 0 is dead code.
> 
> As for the upper bound: in order to receive a PPS event on a pin, the 
> user must first configure it via mlx5_ptp_enable, which already 
> validates the index (rq->extts.index >= clock->ptp_info.n_pins returns 
> -EINVAL) and since the mtpps register only defines capabilities for 8 
> pins, so n_pins cannot exceed MAX_PIN_NUM.
> 
> Maybe wrap it with WARN_ON_ONCE instead of silently returning, so if 
> future hardware adds support for more pins we would notice rather than 
> silently dropping events.

Hi Carolina,

Thanks for the feedback. I've removed the redundant pin < 0 check and 
implemented the WARN_ON_ONCE for the upper bound as suggested.

I just submitted a v3 as a fresh thread with these changes and a fix 
for the union corruption bug.

Thanks,
Prathamesh


^ permalink raw reply

* [PATCH v3] net/mlx5: Fix OOB access and stack information leak in PTP event handling
From: Prathamesh Deshpande @ 2026-04-10  1:53 UTC (permalink / raw)
  To: Carolina Jubran, Saeed Mahameed, Leon Romanovsky
  Cc: Richard Cochran, Tariq Toukan, Mark Bloch, netdev, linux-rdma,
	linux-kernel, Prathamesh Deshpande

In mlx5_pps_event(), several critical issues were identified during
review by Sashiko:

1. The 'pin' index from the hardware event was used without bounds
   checking to index 'pin_config' and 'pps_info->start', leading to
   potential out-of-bounds memory access.
2. 'ptp_event' was not zero-initialized. Since it contains a union,
   assigning only the timestamp initializes the union partially and
   leaves the 'ts_raw' field with uninitialized stack memory, which can
   leak kernel data or corrupt the time sync logic in hardpps().
3. A NULL 'pin_config' could be dereferenced if initialization failed.
4. 'clock->ptp' could be NULL if ptp_clock_register() failed.

Fix these by zero-initializing the event struct, adding a bounds
check against n_pins, and adding appropriate NULL guards.

Fixes: 7c39afb394c7 ("net/mlx5: PTP code migration to driver core section")
Suggested-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Prathamesh Deshpande <prathameshdeshpande7@gmail.com>
---
v3:
- Fix union corruption by using a local timestamp variable [Sashiko].
- Validate pin index against n_pins with WARN_ON_ONCE [Carolina].
- Remove redundant pin < 0 check and clean up the TODO comment.
v2:
- Zero-initialize ptp_event to prevent stack information leak [Sashiko].
- Add bounds check for hardware pin index to prevent OOB access [Sashiko].
- Add NULL guard for pin_config to handle initialization failures [Sashiko].
- Add NULL check for clock->ptp as originally intended.

 .../net/ethernet/mellanox/mlx5/core/lib/clock.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
index bd4e042077af..674dd048a6b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
@@ -1164,16 +1164,22 @@ static int mlx5_pps_event(struct notifier_block *nb,
 							       pps_nb);
 	struct mlx5_core_dev *mdev = clock_state->mdev;
 	struct mlx5_clock *clock = mdev->clock;
-	struct ptp_clock_event ptp_event;
+	struct ptp_clock_event ptp_event = {};
 	struct mlx5_eqe *eqe = data;
 	int pin = eqe->data.pps.pin;
 	unsigned long flags;
 	u64 ns;
 
+	if (!clock->ptp_info.pin_config)
+		return NOTIFY_OK;
+
+	if (WARN_ON_ONCE(pin >= clock->ptp_info.n_pins))
+		return NOTIFY_OK;
+
 	switch (clock->ptp_info.pin_config[pin].func) {
 	case PTP_PF_EXTTS:
 		ptp_event.index = pin;
-		ptp_event.timestamp = mlx5_real_time_mode(mdev) ?
+		ns = mlx5_real_time_mode(mdev) ?
 			mlx5_real_time_cyc2time(clock,
 						be64_to_cpu(eqe->data.pps.time_stamp)) :
 			mlx5_timecounter_cyc2time(clock,
@@ -1181,12 +1187,13 @@ static int mlx5_pps_event(struct notifier_block *nb,
 		if (clock->pps_info.enabled) {
 			ptp_event.type = PTP_CLOCK_PPSUSR;
 			ptp_event.pps_times.ts_real =
-					ns_to_timespec64(ptp_event.timestamp);
+					ns_to_timespec64(ns);
 		} else {
 			ptp_event.type = PTP_CLOCK_EXTTS;
+			ptp_event.timestamp = ns;
 		}
-		/* TODOL clock->ptp can be NULL if ptp_clock_register fails */
-		ptp_clock_event(clock->ptp, &ptp_event);
+		if (clock->ptp)
+			ptp_clock_event(clock->ptp, &ptp_event);
 		break;
 	case PTP_PF_PEROUT:
 		if (clock->shared) {
-- 
2.43.0
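The order of the new early-return checks in mlx5_pps_event() can be sketched in Python as follows. `pps_pin_usable()` is a hypothetical helper written for illustration. It mirrors the two guards in the diff: no pin_config at all, then a hardware pin index at or beyond n_pins. Because `pin` is a u8 in struct mlx5_eqe_pps, no negative check is needed:

```python
def pps_pin_usable(pin_config, n_pins: int, pin: int) -> bool:
    """Return True if a hardware-reported pin may index pin_config.

    Hypothetical model of the guards added in the diff:
    - pin_config is None models a failed pin_config allocation at init;
    - pin >= n_pins is where the driver uses WARN_ON_ONCE() and returns.
    """
    if pin_config is None:
        return False
    if pin >= n_pins:
        return False
    return True


pins = ["PTP_PF_EXTTS"] * 8  # stand-in for clock->ptp_info.pin_config
print(pps_pin_usable(pins, len(pins), 7))  # True
print(pps_pin_usable(pins, len(pins), 8))  # False: would be out of bounds
```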


^ permalink raw reply related

* Re: [PATCH net-next v11 03/14] net: Add lease info to queue-get response
From: Jakub Kicinski @ 2026-04-10  1:51 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, bpf, davem, razor, pabeni, willemb, sdf, john.fastabend,
	martin.lau, jordan, maciej.fijalkowski, magnus.karlsson, dw, toke,
	yangzhenze, wangdongdong.6
In-Reply-To: <aa222301-e716-41cf-ab10-5365c0e15b6a@iogearbox.net>

On Thu, 9 Apr 2026 17:32:31 +0200 Daniel Borkmann wrote:
> > I think the test has to be reworked but of the available options seems
> > like merging it as is and following up quickly is the best. I've only
> > set up the container testing in our CI yesterday anyway so there may
> > be more things that need changing in the test as we gain experience :S  
> 
> No objections obviously if you want to land as-is with your refactor on
> top.

Done, please double check my work, there were some conflicts with net.

^ permalink raw reply

* Re: [PATCH net-next v3] selftests/net: convert so_txtime to drv-net
From: Jakub Kicinski @ 2026-04-10  1:50 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netdev, davem, edumazet, pabeni, horms, Willem de Bruijn
In-Reply-To: <willemdebruijn.kernel.1b750e4c127c7@gmail.com>

On Thu, 09 Apr 2026 11:01:49 -0400 Willem de Bruijn wrote:
> > Alternatively could record the root qdisc at the start of the test and
> > restore that.  
> 
> This should work:
> 
>     def main() -> None:
>         """Boilerplate ksft main."""
>         with NetDrvEpEnv(__file__) as cfg:
>     +        # Record original root qdisc
>     +        cmd_obj = cmd((f"tc -j qdisc show dev {cfg.ifname} root"))
>     +        qdisc_root = json.loads(cmd_obj.stdout)[0].get("kind", None)

I don't like doing setup in main() TBH. It may well fail, in which case
no KTAP will be produced, breaking all the tracking and stability-based
filtering. Not sure if it's still the case, but for a very long time
not all tc qdiscs supported JSON, for example.

>     	ksft_run([test_so_txtime_mono, test_so_txtime_etf], args=(cfg,))
>     +
>     +        # Restore original root qdisc. If mq, populate with default_qdisc nodes
>     +        if (qdisc_root):
>     +            cmd(f"tc qdisc replace dev {cfg.ifname} root {qdisc_root}")
>         ksft_exit()
> 
> 
> Do we want to add a tc command similar to ip, bpftool, etc.?

Yes, we can wrap it if it outputs json.
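A minimal sketch of such a wrapper, assuming it simply runs `tc -j` and parses stdout. `tc_json()` and `root_qdisc_kind()` are hypothetical names, not existing helpers in the selftest library. Following the point above about setup in main(), something like root_qdisc_kind() would be called from inside a test case so that a failure is still reported through KTAP:

```python
import json
import subprocess


def tc_json(args: str, run=subprocess.run):
    """Run `tc -j <args>` and return the parsed JSON output.

    Hypothetical wrapper; `run` is injectable so the parsing can be
    exercised without tc installed. json.loads() raises
    json.JSONDecodeError if tc emits non-JSON output.
    """
    proc = run(["tc", "-j"] + args.split(),
               capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)


def root_qdisc_kind(ifname: str, run=subprocess.run):
    """Return the kind of the root qdisc on ifname, or None if none is listed."""
    qdiscs = tc_json(f"qdisc show dev {ifname} root", run=run)
    return qdiscs[0].get("kind") if qdiscs else None
```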

^ permalink raw reply

* Re: [PATCH v2 bpf-next 4/4] bpf: Replace bpf memory allocator with kmalloc_nolock() in local storage
From: Slava Imameev @ 2026-04-10  1:43 UTC (permalink / raw)
  To: ameryhung
  Cc: alexei.starovoitov, andrii, bpf, daniel, kernel-team, kpsingh,
	martin.lau, memxor, netdev, song, yonghong.song,
	linux-open-source
In-Reply-To: <20251114201329.3275875-5-ameryhung@gmail.com>

I found that this patch restricts task storage value allocation to
KMALLOC_MAX_CACHE_SIZE (8KB on many systems) regardless of
CONFIG_PREEMPT_RT, because task_storage_map_alloc always sets
use_kmalloc_nolock to true for task storage. Before this patch there
was no such restriction: task storage supported value allocations of
at least 64KB, limited only by BPF_LOCAL_STORAGE_MAX_VALUE_SIZE. Was
this KMALLOC_MAX_CACHE_SIZE restriction added deliberately by this
patch?
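The "8KB on many systems" figure follows from page size. A small sketch of the arithmetic, assuming the SLUB definitions KMALLOC_SHIFT_HIGH = PAGE_SHIFT + 1 and KMALLOC_MAX_CACHE_SIZE = 1 << KMALLOC_SHIFT_HIGH (please double-check against the slab.h in your tree):

```python
def kmalloc_max_cache_size(page_shift: int) -> int:
    """KMALLOC_MAX_CACHE_SIZE assuming KMALLOC_SHIFT_HIGH = PAGE_SHIFT + 1."""
    return 1 << (page_shift + 1)


# 4K, 16K and 64K pages respectively
for page_shift in (12, 14, 16):
    print(f"{1 << page_shift} B pages -> limit {kmalloc_max_cache_size(page_shift)} B")

value_size = 64 * 1024  # a 64KB task storage value, as in the report
print(value_size > kmalloc_max_cache_size(12))  # True: rejected with 4K pages
```

So the new cap only looks generous on 64K-page systems; with 4K pages a 64KB value no longer fits.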


^ permalink raw reply

* [PATCH net-next v2] selftests: net: py: add test case filtering and listing
From: Jakub Kicinski @ 2026-04-10  1:39 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
	Willem de Bruijn, Gal Pressman, Breno Leitao, shuah, petrm,
	linux-kselftest

When developing new test cases and reproducing failures in
existing ones we currently have to run the entire test which
can take minutes to finish.

Add command line options for test selection, modeled after
kselftest_harness.h:

  -l       list tests (filtered, if filters were specified)
  -t name  include test
  -T name  exclude test

Since we don't have as clean a separation into fixture / variant /
test as kselftest_harness, this is not really a 1-to-1 match.
We have to lean on glob patterns instead.

As in kselftest_harness, filters are evaluated in order and the first
match wins. If only exclusions are specified, everything else is
included, and vice versa.

Glob patterns (*, ?, [) are supported in addition to exact
matching.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Tested-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
v2:
 - update the help message for -l
v1: https://lore.kernel.org/20260407151715.3800579-1-kuba@kernel.org

CC: shuah@kernel.org
CC: petrm@nvidia.com
CC: linux-kselftest@vger.kernel.org
---
 tools/testing/selftests/net/lib/py/ksft.py | 65 +++++++++++++++++++++-
 1 file changed, 62 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/net/lib/py/ksft.py b/tools/testing/selftests/net/lib/py/ksft.py
index 7b8af463e35d..71518c3f8ad9 100644
--- a/tools/testing/selftests/net/lib/py/ksft.py
+++ b/tools/testing/selftests/net/lib/py/ksft.py
@@ -1,6 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 
+import fnmatch
 import functools
+import getopt
 import inspect
 import os
 import signal
@@ -32,6 +34,34 @@ KSFT_DISRUPTIVE = True
     pass
 
 
+class _KsftArgs:
+    def __init__(self):
+        self.list_tests = False
+        self.filters = []
+
+        try:
+            opts, _ = getopt.getopt(sys.argv[1:], 'hlt:T:')
+        except getopt.GetoptError as e:
+            print(e, file=sys.stderr)
+            sys.exit(1)
+
+        for opt, val in opts:
+            if opt == '-h':
+                print(f"Usage: {sys.argv[0]} [-h|-l] [-t|-T name]\n"
+                      f"\t-h       print help\n"
+                      f"\t-l       list tests (filtered, if filters were specified)\n"
+                      f"\t-t name  include test\n"
+                      f"\t-T name  exclude test",
+                      file=sys.stderr)
+                sys.exit(0)
+            elif opt == '-l':
+                self.list_tests = True
+            elif opt == '-t':
+                self.filters.append((True, val))
+            elif opt == '-T':
+                self.filters.append((False, val))
+
+
 @functools.lru_cache()
 def _ksft_supports_color():
     if os.environ.get("NO_COLOR") is not None:
@@ -298,8 +328,26 @@ KsftCaseFunction = namedtuple("KsftCaseFunction",
         ksft_pr(f"Ignoring SIGTERM (cnt: {term_cnt}), already exiting...")
 
 
-def _ksft_generate_test_cases(cases, globs, case_pfx, args):
-    """Generate a flat list of (func, args, name) tuples"""
+def _ksft_name_matches(name, pattern):
+    if '*' in pattern or '?' in pattern or '[' in pattern:
+        return fnmatch.fnmatchcase(name, pattern)
+    return name == pattern
+
+
+def _ksft_test_enabled(name, filters):
+    has_positive = False
+    for include, pattern in filters:
+        has_positive |= include
+        if _ksft_name_matches(name, pattern):
+            return include
+    return not has_positive
+
+
+def _ksft_generate_test_cases(cases, globs, case_pfx, args, cli_args):
+    """Generate a filtered list of (func, args, name) tuples.
+
+    If -l is given, prints matching test names and exits.
+    """
 
     cases = cases or []
     test_cases = []
@@ -329,11 +377,22 @@ KsftCaseFunction = namedtuple("KsftCaseFunction",
         else:
             test_cases.append((func, args, func.__name__))
 
+    if cli_args.filters:
+        test_cases = [tc for tc in test_cases
+                      if _ksft_test_enabled(tc[2], cli_args.filters)]
+
+    if cli_args.list_tests:
+        for _, _, name in test_cases:
+            print(name)
+        sys.exit(0)
+
     return test_cases
 
 
 def ksft_run(cases=None, globs=None, case_pfx=None, args=()):
-    test_cases = _ksft_generate_test_cases(cases, globs, case_pfx, args)
+    cli_args = _KsftArgs()
+    test_cases = _ksft_generate_test_cases(cases, globs, case_pfx, args,
+                                           cli_args)
 
     global term_cnt
     term_cnt = 0
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH nf] netfilter: nf_conntrack_sip: fix OOB read in epaddr_len and ct_sip_parse_header_uri
From: Weiming Shi @ 2026-04-10  1:36 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Pablo Neira Ayuso, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Phil Sutter, Simon Horman, Patrick McHardy,
	netfilter-devel, coreteam, netdev, linux-kernel, Xiang Mei
In-Reply-To: <adfEUtiiLzjtKd8m@strlen.de>

On 26-04-09 17:22, Florian Westphal wrote:
> Weiming Shi <bestswngs@gmail.com> wrote:
> > In epaddr_len() and ct_sip_parse_header_uri(), after sip_parse_addr()
> > successfully parses an IP address, the code checks whether the next
> > character is ':' to determine if a port number follows. However,
> > neither function verifies that the pointer is still within bounds
> > before dereferencing it.
> 
> I already queued up:
> https://patchwork.ozlabs.org/project/netfilter-devel/patch/20260313195256.2783257-1-qguanni@gmail.com/
> 
> for nf-next (I already sent the 'last' PR for 7.0).
> 
> Could you check if that resolves the problem you're reporting?
> 
> >  		p = simple_strtoul(c, (char **)&c, 10);
> 
> All of these functions require a c-string, which we usually
> don't have with network packet parsing.
> 
> IOW, sip helper needs to be audited for these problems
> but I don't know when I can get to it.

Tested-by: Weiming Shi <bestswngs@gmail.com>
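The class of bug discussed above, dereferencing the byte after a parsed address without checking the limit and using C-string parsers on packet data, can be illustrated with a bounded parser. `parse_port()` is a hypothetical Python helper for illustration only. It is neither the queued fix nor kernel code. It never reads at or past `limit` and parses digits itself rather than relying on NUL termination as simple_strtoul() does:

```python
def parse_port(buf: bytes, pos: int, limit: int):
    """Parse an optional ':<port>' in buf[pos:limit].

    Returns (port, new_pos), or (None, pos) if there is no valid port.
    Every access to buf is preceded by a bounds check against limit.
    """
    if pos >= limit or buf[pos] != ord(":"):
        return None, pos
    start = pos + 1
    cur = start
    port = 0
    while cur < limit and ord("0") <= buf[cur] <= ord("9"):
        port = port * 10 + (buf[cur] - ord("0"))
        cur += 1
    if cur == start or port > 65535:
        return None, pos
    return port, cur


pkt = b"10.0.0.1:5060;transport=udp"
print(parse_port(pkt, 8, len(pkt)))  # (5060, 13)
print(parse_port(pkt, 8, 8))         # (None, 8): address ends the data
```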


^ permalink raw reply

* Re: [PATCH net-next v6 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Jakub Kicinski @ 2026-04-10  1:35 UTC (permalink / raw)
  To: Dipayaan Roy
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, john.fastabend,
	hawk, bpf, daniel, ast, sdf, dipayanroy
In-Reply-To: <20260407200216.272659-1-dipayanroy@linux.microsoft.com>

On Tue,  7 Apr 2026 12:59:17 -0700 Dipayaan Roy wrote:
> This behavior is observed on a single platform; other platforms
> perform better with page_pool fragments, indicating this is not a
> page_pool issue but platform-specific.

Well, someone has to run some experiments and confirm other ARM
platforms are not impacted, with data. I was hoping to do it myself
but doesn't look like that will happen in time for the merge window :(

> Changes in v6:
>  - Added missed maintainers.

STOP REPOSTING PATCHES FOR NO REASON.

^ permalink raw reply

* Re: [PATCH 0/6] IPA v5.2 support for Milos and Fairphone (Gen. 6)
From: Jakub Kicinski @ 2026-04-10  1:31 UTC (permalink / raw)
  To: Luca Weiss
  Cc: Paolo Abeni, Alex Elder, Andrew Lunn, David S. Miller,
	Eric Dumazet, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	Bjorn Andersson, Konrad Dybcio, Alexander Koskovich,
	~postmarketos/upstreaming, phone-devel, netdev, linux-kernel,
	linux-arm-msm, devicetree
In-Reply-To: <48464d44-1fac-47a2-839a-c963e9421615@redhat.com>

On Thu, 9 Apr 2026 09:46:31 +0200 Paolo Abeni wrote:
> On 4/3/26 6:43 PM, Luca Weiss wrote:
> > First, two fixes that unbreak IPA v5.0+, which can be applied
> > independently.
> > 
> > Then add support for IPA v5.2 which can be found in the Milos SoC. And
> > finally enable it on Fairphone (Gen. 6) so that mobile data (4G/5G/..)
> > starts working.  
> 
> You should have probably split the series in 2, with patches 1 & 2
> targeting net and the following ones targeting net-next. It looks like
> patch 5 needs some adjustment. I'm applying the first 2.

1 & 2 have now propagated to net-next, please repost 3 & 4.

^ permalink raw reply

* Re: [net,PATCH v2] net: ks8851: Reinstate disabling of BHs around IRQ handler
From: Marek Vasut @ 2026-04-09 15:26 UTC (permalink / raw)
  To: Nicolai Buchwitz
  Cc: netdev, stable, David S. Miller, Andrew Lunn, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Ronald Wahl, Yicong Hui,
	linux-kernel
In-Reply-To: <6391ee36b7d9c66d33c734650ebfb7fe@tipi-net.de>

On 4/9/26 8:52 AM, Nicolai Buchwitz wrote:

Hello Nicolai,

>> @@ -408,7 +426,9 @@ static int ks8851_net_open(struct net_device *dev)
>>      unsigned long flags;
>>      int ret;
>>
>> -    ret = request_threaded_irq(dev->irq, NULL, ks8851_irq,
>> +    ret = request_threaded_irq(dev->irq, NULL,
>> +                   ks->no_bh_in_irq_handler ?
>> +                   ks8851_irq_nobh : ks8851_irq,
> 
> This works, but wouldn't it be simpler to put the BH disable
> into the PAR lock/unlock directly?
> 
>    static void ks8851_lock_par(...)
>    {
>        local_bh_disable();
>        spin_lock_irqsave(&ksp->lock, *flags);
>    }
> 
>    static void ks8851_unlock_par(...)
>    {
>        spin_unlock_irqrestore(&ksp->lock, *flags);
>        local_bh_enable();
>    }
> 
> No flag, no wrapper, no conditional in request_threaded_irq.
> And it protects all PAR lock/unlock callsites, not just the
> IRQ handler.
That is exactly why I wrapped the IRQ handler, because the BH should be 
disabled ONLY around the IRQ handler, not around the other call sites.
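The difference between the two approaches can be shown with a toy Python model. The names mirror the driver's, but the bodies are stand-ins, and the BH state is just a counter. With the wrapped handler chosen at request_threaded_irq() time, BHs are disabled only while the IRQ handler runs. Other PAR lock/unlock call sites see BH state untouched, which would not be the case if local_bh_disable() lived inside ks8851_lock_par():

```python
class BhState:
    """Toy stand-in for the local BH-disable depth."""

    def __init__(self):
        self.disable_depth = 0
        self.log = []


bh = BhState()


def local_bh_disable():
    bh.disable_depth += 1


def local_bh_enable():
    bh.disable_depth -= 1


def ks8851_irq():
    # Stand-in for the threaded IRQ handler body.
    bh.log.append(("irq", bh.disable_depth > 0))


def ks8851_irq_nobh():
    # Wrapper: BHs disabled around the IRQ handler only.
    local_bh_disable()
    try:
        ks8851_irq()
    finally:
        local_bh_enable()


def par_locked_op():
    # Any other PAR lock/unlock call site: BH state left untouched.
    bh.log.append(("par", bh.disable_depth > 0))


def pick_handler(no_bh_in_irq_handler: bool):
    # Mirrors the conditional passed to request_threaded_irq().
    return ks8851_irq_nobh if no_bh_in_irq_handler else ks8851_irq
```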

^ permalink raw reply

* Re: [PATCH net-next v11 14/14] selftests/net: Add queue leasing tests with netkit
From: Jakub Kicinski @ 2026-04-10  1:19 UTC (permalink / raw)
  To: David Wei
  Cc: Daniel Borkmann, netdev, bpf, davem, razor, pabeni, willemb, sdf,
	john.fastabend, martin.lau, jordan, maciej.fijalkowski,
	magnus.karlsson, toke, yangzhenze, wangdongdong.6
In-Reply-To: <244b745a-3c95-48c3-b6c5-11ed3eacdf46@davidwei.uk>

On Thu, 9 Apr 2026 08:26:30 -0700 David Wei wrote:
> > ksft_run() can't be called multiple times.
> > 
> > The first run looks like it's purely testing netdevsim. So that should
> > move to selftests/net. The rest which tests HW should stay here.
> > Please also move all the setup inside the test cases.  
> 
> Sorry, didn't know about multiple ksft_run(). I'll prep the follow up
> separately so we can land it soon after 7.1.

I have a patch queued to make multiple ksft_run()s hard-fail.
So let's not wait until after the merge window if that's what you mean.
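A minimal sketch of what "hard-fail on a second ksft_run()" could look like, assuming a module-level flag. This is hypothetical and not the queued patch, which is not shown in this thread. `KsftRunError` and `_ksft_run_called` are illustrative names:

```python
class KsftRunError(Exception):
    """Raised when ksft_run() is invoked more than once."""


_ksft_run_called = False  # hypothetical module-level flag


def ksft_run(cases=None):
    global _ksft_run_called
    if _ksft_run_called:
        raise KsftRunError("ksft_run() may only be called once per test program")
    _ksft_run_called = True
    for case in cases or []:
        case()  # setup belongs inside each case, not around ksft_run()
```

Under this model, a test that needs both netdevsim and HW coverage would be split into separate programs rather than calling ksft_run() twice.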

^ permalink raw reply

* Re: [PATCH] netfilter: ipset: harden payload calculation in call_ad()
From: Pablo Neira Ayuso @ 2026-04-10  1:19 UTC (permalink / raw)
  To: David Baum
  Cc: fw, phil, davem, edumazet, kuba, pabeni, horms, netfilter-devel,
	coreteam, netdev
In-Reply-To: <adhMzvFhTcmTMZTV@chamomile>

On Fri, Apr 10, 2026 at 03:05:18AM +0200, Pablo Neira Ayuso wrote:
> On Fri, Mar 13, 2026 at 01:01:32PM -0500, David Baum wrote:
> > call_ad() computes the netlink error payload size with
> > min(SIZE_MAX, sizeof(*errmsg) + nlmsg_len(nlh)), but min(SIZE_MAX, x)
> > is always x, so the guard is a no-op.
> > 
> > Replace it with an explicit negative-length check and
> > check_add_overflow() so the addition is validated before being passed
> > to nlmsg_new().
> > 
> > Signed-off-by: David Baum <davidbaum461@gmail.com>
> > ---
> >  net/netfilter/ipset/ip_set_core.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
> > index a2fe711cb5e3..11d3854d9b11 100644
> > --- a/net/netfilter/ipset/ip_set_core.c
> > +++ b/net/netfilter/ipset/ip_set_core.c
> > @@ -10,6 +10,7 @@
> >  #include <linux/module.h>
> >  #include <linux/moduleparam.h>
> >  #include <linux/ip.h>
> > +#include <linux/overflow.h>
> >  #include <linux/skbuff.h>
> >  #include <linux/spinlock.h>
> >  #include <linux/rculist.h>
> > @@ -1763,13 +1764,18 @@ call_ad(struct net *net, struct sock *ctnl, struct sk_buff *skb,
> >  		struct nlmsghdr *rep, *nlh = nlmsg_hdr(skb);
> >  		struct sk_buff *skb2;
> >  		struct nlmsgerr *errmsg;
> > -		size_t payload = min(SIZE_MAX,
> > -				     sizeof(*errmsg) + nlmsg_len(nlh));
> > +		int nlmsg_payload_len = nlmsg_len(nlh);
> > +		size_t payload;
> >  		int min_len = nlmsg_total_size(sizeof(struct nfgenmsg));
> >  		struct nlattr *cda[IPSET_ATTR_CMD_MAX + 1];
> >  		struct nlattr *cmdattr;
> >  		u32 *errline;
> >  
> > +		if (nlmsg_payload_len < 0 ||
> 
> Hm, nlh was already sanitized by nfnetlink?
> 
> > +		    check_add_overflow(sizeof(*errmsg),
> > +				       (size_t)nlmsg_payload_len, &payload))
> 
> Maybe cap this to int:
> 
>                 int payload = sizeof(struct nlmsgerr);
>                 ...
>         
>                 if (check_add_overflow(payload, nlmsg_payload_len, &payload))

Wait, then payload and nlmsg_payload_len should be __u32, not int.

> 
> > +			return -ENOMEM;
> > +
> >  		skb2 = nlmsg_new(payload, GFP_KERNEL);
> >  		if (!skb2)
> >  			return -ENOMEM;
> 

* Re: [PATCH net v3] ipvs: fix MTU check for GSO packets in tunnel mode
From: Pablo Neira Ayuso @ 2026-04-10  1:11 UTC (permalink / raw)
  To: Yingnan Zhang
  Cc: horms, ja, fw, phil, davem, edumazet, kuba, pabeni, netdev,
	lvs-devel, netfilter-devel, coreteam, linux-kernel
In-Reply-To: <tencent_73010FBD5FA1C05C3BC23A07A50B11CEC90A@qq.com>

On Thu, Apr 02, 2026 at 10:46:16PM +0800, Yingnan Zhang wrote:
> Currently, IPVS skips MTU checks for GSO packets by excluding them with
> the !skb_is_gso(skb) condition. This creates problems when IPVS tunnel
> mode encapsulates GSO packets with IPIP headers.
> 
> The issue manifests in two ways:
> 
> 1. MTU violation after encapsulation:
>    When a GSO packet passes through IPVS tunnel mode, the original MTU
>    check is bypassed. After adding the IPIP tunnel header, the packet
>    size may exceed the outgoing interface MTU, leading to unexpected
>    fragmentation at the IP layer.
> 
> 2. Fragmentation with problematic IP IDs:
>    When net.ipv4.vs.pmtu_disc=1 and a GSO packet with multiple segments
>    is fragmented after encapsulation, each segment gets a sequentially
>    incremented IP ID (0, 1, 2, ...). This happens because:
> 
>    a) The GSO packet bypasses MTU check and gets encapsulated
>    b) At __ip_finish_output, the oversized GSO packet is split into
>       separate SKBs (one per segment), with IP IDs incrementing
>    c) Each SKB is then fragmented again based on the actual MTU
> 
>    This sequential IP ID allocation differs from the expected behavior
>    and can cause issues with fragment reassembly and packet tracking.
> 
> Fix this by properly validating GSO packets using
> skb_gso_validate_network_len(). This function correctly validates
> whether the GSO segments will fit within the MTU after segmentation. If
> validation fails, send an ICMP Fragmentation Needed message to enable
> proper PMTU discovery.
> 
> Fixes: 4cdd34084d53 ("netfilter: nf_conntrack_ipv6: improve fragmentation handling")
> Signed-off-by: Yingnan Zhang <342144303@qq.com>
> ---
> v3:
> - Fixed compilation error (removed extra closing brace in IPv6 function)
> - Fixed indentation to match kernel style
> 
> v2: https://lore.kernel.org/netdev/20260402030541.27855-1-342144303@qq.com/
> v1: https://lore.kernel.org/netdev/20260401152228.31190-1-342144303@qq.com/
> ---
>  net/netfilter/ipvs/ip_vs_xmit.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 3601eb86d..a4ca7cad0 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -111,8 +111,8 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
>  		 */
>  		if (IP6CB(skb)->frag_max_size > mtu)
>  			return true; /* largest fragment violate MTU */
> -	}
> -	else if (skb->len > mtu && !skb_is_gso(skb)) {
> +	} else if (skb->len > mtu &&
> +		   !(skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))) {

Maybe a helper function would make this more readable?

/* Based on ip_exceeds_mtu(). */
static bool ip_vs_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
{
        if (skb->len <= mtu)
                return false;

        if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
                return false;

        return true;
}


>  		return true; /* Packet size violate MTU size */
>  	}
>  	return false;
> @@ -232,8 +232,9 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
>  			return true;
>  
>  		if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) &&
> -			     skb->len > mtu && !skb_is_gso(skb) &&
> -			     !ip_vs_iph_icmp(ipvsh))) {
> +			     skb->len > mtu && !ip_vs_iph_icmp(ipvsh) &&
> +			     !(skb_is_gso(skb) &&
> +			       skb_gso_validate_network_len(skb, mtu)))) {
>  			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
>  				  htonl(mtu));
>  			IP_VS_DBG(1, "frag needed for %pI4\n",
> -- 
> 2.51.0
> 

* Re: [PATCH v5 net-next 0/8] dpll/ice: Add TXC DPLL type and full TX reference clock control for E825
From: Jakub Kicinski @ 2026-04-10  1:10 UTC (permalink / raw)
  To: Nitka, Grzegorz
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	intel-wired-lan@lists.osuosl.org, Oros, Petr,
	richardcochran@gmail.com, andrew+netdev@lunn.ch,
	Kitszel, Przemyslaw, Nguyen, Anthony L,
	Prathosh.Satish@microchip.com, Vecera, Ivan, jiri@resnulli.us,
	Kubalewski, Arkadiusz, vadim.fedorenko@linux.dev,
	donald.hunter@gmail.com, horms@kernel.org, pabeni@redhat.com,
	davem@davemloft.net, edumazet@google.com
In-Reply-To: <IA1PR11MB621925C1718B838147404DC492582@IA1PR11MB6219.namprd11.prod.outlook.com>

On Thu, 9 Apr 2026 11:21:35 +0000 Nitka, Grzegorz wrote:
> > On Fri,  3 Apr 2026 01:06:18 +0200 Grzegorz Nitka wrote:  
> > > This series adds TX reference clock support for E825 devices and exposes
> > > TX clock selection and synchronization status via the Linux DPLL
> > > subsystem.
> > > E825 hardware contains a dedicated Tx clock (TXC) domain that is
> > > distinct from PPS and EEC. TX reference clock selection is device-wide,
> > > shared across ports, and mediated by firmware as part of the link
> > > bring-up process. As a result, TX clock selection intent may differ
> > > from the effective hardware configuration, and software must verify
> > > the outcome after link-up.
> > > To support this, the series introduces TXC support incrementally across
> > > the DPLL core and the ice driver:
> > >
> > > - add a new DPLL type (TXC) to represent transmit clock generators;  
> > 
> > I'm not grasping why this is needed, isn't it part of any EEC system
> > that the DPLL can drive the TXC? Is your system going to expose multiple
> > DPLLs now for one NIC?
> 
> Hello Jakub,
> For E825 device, the short answer is yes. We have platform EEC now and
> we want to add:
> - TXC DPLLs per port, and
> - PPS DPLL for TSPLL config purposes (in the near future)
> 
> An EEC (Ethernet Equipment Clock) type DPLL is designed to control
> multiple source signals (internal to the NIC or external), one of which
> drives the DPLL device. It can have multiple outputs, each of which can
> drive various components or propagate the signal to external devices.
> A TXC is a specific DPLL device associated with a single Ethernet port
> to control its source; there is no need to declare any outputs, as the
> single output is already determined.
> In short, a TXC DPLL indicates per-port control over the SyncE (or other
> external) clock source.

Could you share a diagram of how things are wired up?
DPLL can have multiple outputs and multiple inputs. I'm not getting why
a single device would have to have multiple actual DPLLs (which makes
me worried this is just some "convenient use of the uAPI")

* Re: [PATCH] netfilter: ipset: harden payload calculation in call_ad()
From: Pablo Neira Ayuso @ 2026-04-10  1:05 UTC (permalink / raw)
  To: David Baum
  Cc: fw, phil, davem, edumazet, kuba, pabeni, horms, netfilter-devel,
	coreteam, netdev
In-Reply-To: <20260313180132.75655-1-davidbaum461@gmail.com>

On Fri, Mar 13, 2026 at 01:01:32PM -0500, David Baum wrote:
> call_ad() computes the netlink error payload size with
> min(SIZE_MAX, sizeof(*errmsg) + nlmsg_len(nlh)), but min(SIZE_MAX, x)
> is always x, so the guard is a no-op.
> 
> Replace it with an explicit negative-length check and
> check_add_overflow() so the addition is validated before being passed
> to nlmsg_new().
> 
> Signed-off-by: David Baum <davidbaum461@gmail.com>
> ---
>  net/netfilter/ipset/ip_set_core.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
> index a2fe711cb5e3..11d3854d9b11 100644
> --- a/net/netfilter/ipset/ip_set_core.c
> +++ b/net/netfilter/ipset/ip_set_core.c
> @@ -10,6 +10,7 @@
>  #include <linux/module.h>
>  #include <linux/moduleparam.h>
>  #include <linux/ip.h>
> +#include <linux/overflow.h>
>  #include <linux/skbuff.h>
>  #include <linux/spinlock.h>
>  #include <linux/rculist.h>
> @@ -1763,13 +1764,18 @@ call_ad(struct net *net, struct sock *ctnl, struct sk_buff *skb,
>  		struct nlmsghdr *rep, *nlh = nlmsg_hdr(skb);
>  		struct sk_buff *skb2;
>  		struct nlmsgerr *errmsg;
> -		size_t payload = min(SIZE_MAX,
> -				     sizeof(*errmsg) + nlmsg_len(nlh));
> +		int nlmsg_payload_len = nlmsg_len(nlh);
> +		size_t payload;
>  		int min_len = nlmsg_total_size(sizeof(struct nfgenmsg));
>  		struct nlattr *cda[IPSET_ATTR_CMD_MAX + 1];
>  		struct nlattr *cmdattr;
>  		u32 *errline;
>  
> +		if (nlmsg_payload_len < 0 ||

Hm, nlh was already sanitized by nfnetlink?

> +		    check_add_overflow(sizeof(*errmsg),
> +				       (size_t)nlmsg_payload_len, &payload))

Maybe cap this to int:

                int payload = sizeof(struct nlmsgerr);
                ...
        
                if (check_add_overflow(payload, nlmsg_payload_len, &payload))

> +			return -ENOMEM;
> +
>  		skb2 = nlmsg_new(payload, GFP_KERNEL);
>  		if (!skb2)
>  			return -ENOMEM;

* [PATCH RFC v2] r8169: implement SFP support
From: Fabio Baltieri @ 2026-04-10  0:53 UTC (permalink / raw)
  To: Heiner Kallweit, nic_swsd, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: netdev, linux-kernel, Fabio Baltieri

Implement support for reading the identification and diagnostic
information of SFP modules on rtl8127atf devices.

This uses the sfp module and implements a GPIO device for presence
detection and loss of signal, with I2C communication handled by the
DesignWare module.

Signed-off-by: Fabio Baltieri <fabio.baltieri@gmail.com>
---
Hi,

here's the rework of the v1 I sent as "r8169: implement get_module
functions for rtl8127atf". It now implements SFP support using the
kernel sfp module and software nodes, and reuses the DesignWare driver
for I2C.

Module presence detection seems to work correctly:

[  555.853597] sfp sfp.256: module removed
[  561.628005] sfp sfp.256: module QSFPTEK          QT-SFP+-SR       rev      sn QT8250805132     dc 250806  

I had to guess the GPIO input register offset (the out-of-tree driver
does not implement this); hopefully the Realtek folks can chime in down
the road so other functions can be implemented too.

This is largely a copy-paste of the txgbe_phy.c code from the txgbe
driver, though the PCS and phylink code is missing since, as far as I
understand, it should be implemented separately. As a result, the sfp
module here only reports status via hwmon and is stuck in:

Module state: waitdev

Just looking for early feedback: this is functional as is, but I guess
it will have to wait for phylink support to be implemented first and
then be rebased on top of it. It will probably also need a
Realtek-specific variant of the wx,i2c-snps-model property.

Cheers,
Fabio

 drivers/net/ethernet/realtek/Kconfig      |   3 +
 drivers/net/ethernet/realtek/r8169_main.c | 302 +++++++++++++++++++++-
 2 files changed, 304 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/Kconfig b/drivers/net/ethernet/realtek/Kconfig
index 9b0f4f9631db..ae936e1586aa 100644
--- a/drivers/net/ethernet/realtek/Kconfig
+++ b/drivers/net/ethernet/realtek/Kconfig
@@ -88,6 +88,9 @@ config R8169
 	select CRC32
 	select PHYLIB
 	select REALTEK_PHY
+	select REGMAP
+	select SFP
+	select GPIOLIB
 	help
 	  Say Y here if you have a Realtek Ethernet adapter belonging to
 	  the following families:
diff --git a/drivers/net/ethernet/realtek/r8169_main.c b/drivers/net/ethernet/realtek/r8169_main.c
index 58788d196c57..77266de27656 100644
--- a/drivers/net/ethernet/realtek/r8169_main.c
+++ b/drivers/net/ethernet/realtek/r8169_main.c
@@ -29,6 +29,15 @@
 #include <linux/prefetch.h>
 #include <linux/ipv6.h>
 #include <linux/unaligned.h>
+#include <linux/regmap.h>
+#include <linux/platform_device.h>
+#include <linux/i2c.h>
+#include <linux/property.h>
+#include <linux/clkdev.h>
+#include <linux/clk-provider.h>
+#include <linux/gpio/machine.h>
+#include <linux/gpio/driver.h>
+#include <linux/gpio/property.h>
 #include <net/ip6_checksum.h>
 #include <net/netdev_queues.h>
 #include <net/phy/realtek_phy.h>
@@ -724,6 +733,37 @@ enum rtl_dash_type {
 	RTL_DASH_25_BP,
 };
 
+#define NODE_PROP(_NAME, _PROP)                 \
+	(const struct software_node) {          \
+		.name = (_NAME),                \
+		.properties = (_PROP),          \
+	}
+
+enum rtl8169_swnodes {
+	SWNODE_GPIO = 0,
+	SWNODE_I2C,
+	SWNODE_SFP,
+	SWNODE_PHYLINK,
+	SWNODE_MAX
+};
+
+struct rtl8169_nodes {
+	char gpio_name[32];
+	char i2c_name[32];
+	char sfp_name[32];
+	char phylink_name[32];
+	struct property_entry gpio_props[2];
+	struct property_entry i2c_props[4];
+	struct property_entry sfp_props[5];
+	struct property_entry phylink_props[3];
+	struct software_node_ref_args i2c_ref[1];
+	struct software_node_ref_args gpio0_ref[1];
+	struct software_node_ref_args gpio1_ref[1];
+	struct software_node_ref_args sfp_ref[1];
+	struct software_node swnodes[SWNODE_MAX];
+	const struct software_node *group[SWNODE_MAX + 1];
+};
+
 struct rtl8169_private {
 	void __iomem *mmio_addr;	/* memory map physical address */
 	struct pci_dev *pci_dev;
@@ -770,6 +810,13 @@ struct rtl8169_private {
 	struct r8169_led_classdev *leds;
 
 	u32 ocp_base;
+
+	struct platform_device *sfp_dev;
+	struct platform_device *i2c_dev;
+	struct clk_lookup *i2c_clock;
+	struct clk *i2c_clk;
+	struct gpio_chip *gpio;
+	struct rtl8169_nodes nodes;
 };
 
 typedef void (*rtl_generic_fct)(struct rtl8169_private *tp);
@@ -2411,6 +2458,246 @@ static int rtl8169_set_link_ksettings(struct net_device *ndev,
 	return 0;
 }
 
+static int r8169_swnodes_register(struct rtl8169_private *tp)
+{
+	struct rtl8169_nodes *nodes = &tp->nodes;
+	struct pci_dev *pdev = tp->pci_dev;
+	struct software_node *swnodes;
+	u32 id;
+
+	id = pci_dev_id(pdev);
+
+	snprintf(nodes->gpio_name, sizeof(nodes->gpio_name), "r8169_gpio-%x", id);
+	snprintf(nodes->i2c_name, sizeof(nodes->i2c_name), "r8169_i2c-%x", id);
+	snprintf(nodes->sfp_name, sizeof(nodes->sfp_name), "r8169_sfp-%x", id);
+	snprintf(nodes->phylink_name, sizeof(nodes->phylink_name), "r8169_phylink-%x", id);
+
+	swnodes = nodes->swnodes;
+
+	/* GPIO 8: module presence
+	 * GPIO 11: rx signal lost
+	 */
+	nodes->gpio_props[0] = PROPERTY_ENTRY_STRING("pinctrl-names", "default");
+	swnodes[SWNODE_GPIO] = NODE_PROP(nodes->gpio_name, nodes->gpio_props);
+	nodes->gpio0_ref[0] = SOFTWARE_NODE_REFERENCE(&swnodes[SWNODE_GPIO], 8, GPIO_ACTIVE_LOW);
+	nodes->gpio1_ref[0] = SOFTWARE_NODE_REFERENCE(&swnodes[SWNODE_GPIO], 11, GPIO_ACTIVE_LOW);
+
+	nodes->i2c_props[0] = PROPERTY_ENTRY_STRING("compatible", "snps,designware-i2c");
+	nodes->i2c_props[1] = PROPERTY_ENTRY_BOOL("wx,i2c-snps-model");
+	nodes->i2c_props[2] = PROPERTY_ENTRY_U32("clock-frequency", I2C_MAX_STANDARD_MODE_FREQ);
+	swnodes[SWNODE_I2C] = NODE_PROP(nodes->i2c_name, nodes->i2c_props);
+	nodes->i2c_ref[0] = SOFTWARE_NODE_REFERENCE(&swnodes[SWNODE_I2C]);
+
+	nodes->sfp_props[0] = PROPERTY_ENTRY_STRING("compatible", "sff,sfp");
+	nodes->sfp_props[1] = PROPERTY_ENTRY_REF_ARRAY("i2c-bus", nodes->i2c_ref);
+	nodes->sfp_props[2] = PROPERTY_ENTRY_REF_ARRAY("mod-def0-gpios", nodes->gpio0_ref);
+	nodes->sfp_props[3] = PROPERTY_ENTRY_REF_ARRAY("los-gpios", nodes->gpio1_ref);
+	swnodes[SWNODE_SFP] = NODE_PROP(nodes->sfp_name, nodes->sfp_props);
+	nodes->sfp_ref[0] = SOFTWARE_NODE_REFERENCE(&swnodes[SWNODE_SFP]);
+
+	nodes->phylink_props[0] = PROPERTY_ENTRY_STRING("managed", "in-band-status");
+	nodes->phylink_props[1] = PROPERTY_ENTRY_REF_ARRAY("sfp", nodes->sfp_ref);
+	swnodes[SWNODE_PHYLINK] = NODE_PROP(nodes->phylink_name, nodes->phylink_props);
+
+	nodes->group[SWNODE_GPIO] = &swnodes[SWNODE_GPIO];
+	nodes->group[SWNODE_I2C] = &swnodes[SWNODE_I2C];
+	nodes->group[SWNODE_SFP] = &swnodes[SWNODE_SFP];
+	nodes->group[SWNODE_PHYLINK] = &swnodes[SWNODE_PHYLINK];
+
+	return software_node_register_node_group(nodes->group);
+}
+
+static int r8169_gpio_get(struct gpio_chip *chip, unsigned int offset)
+{
+	struct rtl8169_private *tp = gpiochip_get_data(chip);
+	int val;
+
+	val = r8168_mac_ocp_read(tp, 0xdc30);
+
+	return !!(val & BIT(offset));
+}
+
+static int r8169_gpio_init(struct rtl8169_private *tp)
+{
+	struct gpio_chip *gc;
+	struct pci_dev *pdev = tp->pci_dev;
+	struct device *dev;
+	int ret;
+
+	dev = &pdev->dev;
+
+	gc = devm_kzalloc(dev, sizeof(*gc), GFP_KERNEL);
+	if (!gc)
+		return -ENOMEM;
+
+	gc->label = devm_kasprintf(dev, GFP_KERNEL, "r8169_gpio-%x",
+				   pci_dev_id(pdev));
+	if (!gc->label)
+		return -ENOMEM;
+
+	gc->base = -1;
+	gc->ngpio = 16;
+	gc->owner = THIS_MODULE;
+	gc->parent = dev;
+	gc->fwnode = software_node_fwnode(tp->nodes.group[SWNODE_GPIO]);
+	gc->get = r8169_gpio_get;
+
+	ret = devm_gpiochip_add_data(dev, gc, tp);
+	if (ret)
+		return ret;
+
+	tp->gpio = gc;
+
+	return 0;
+}
+
+static int r8169_clock_register(struct rtl8169_private *tp)
+{
+	struct pci_dev *pdev = tp->pci_dev;
+	struct clk_lookup *clock;
+	char clk_name[32];
+	struct clk *clk;
+
+	snprintf(clk_name, sizeof(clk_name), "i2c_designware.%d",
+		 pci_dev_id(pdev));
+
+	/* 115MHz seems to result in an i2c clock of 100kHz */
+	clk = clk_register_fixed_rate(NULL, clk_name, NULL, 0, 115000000);
+	if (IS_ERR(clk))
+		return PTR_ERR(clk);
+
+	clock = clkdev_create(clk, NULL, "%s", clk_name);
+	if (!clock) {
+		clk_unregister(clk);
+		return -ENOMEM;
+	}
+
+	tp->i2c_clk = clk;
+	tp->i2c_clock = clock;
+
+	return 0;
+}
+
+#define R8127_SDS_I2C_BASE 0xe200
+
+static int r8169_i2c_read(void *context, unsigned int reg, unsigned int *val)
+{
+	struct rtl8169_private *tp = context;
+
+	*val = r8168_mac_ocp_read(tp, R8127_SDS_I2C_BASE + reg);
+
+	return 0;
+}
+
+static int r8169_i2c_write(void *context, unsigned int reg, unsigned int val)
+{
+	struct rtl8169_private *tp = context;
+
+	r8168_mac_ocp_write(tp, R8127_SDS_I2C_BASE + reg, val);
+
+	return 0;
+}
+
+static const struct regmap_config i2c_regmap_config = {
+	.reg_bits = 32,
+	.val_bits = 32,
+	.reg_read = r8169_i2c_read,
+	.reg_write = r8169_i2c_write,
+	.fast_io = true,
+};
+
+static int r8169_i2c_register(struct rtl8169_private *tp)
+{
+	struct platform_device_info info = {};
+	struct platform_device *i2c_dev;
+	struct regmap *i2c_regmap;
+	struct pci_dev *pdev = tp->pci_dev;
+
+	i2c_regmap = devm_regmap_init(&pdev->dev, NULL, tp, &i2c_regmap_config);
+	if (IS_ERR(i2c_regmap)) {
+		dev_err(&pdev->dev, "failed to init I2C regmap\n");
+		return PTR_ERR(i2c_regmap);
+	}
+
+	info.parent = &pdev->dev;
+	info.fwnode = software_node_fwnode(tp->nodes.group[SWNODE_I2C]);
+	info.name = "i2c_designware";
+	info.id = pci_dev_id(pdev);
+
+	i2c_dev = platform_device_register_full(&info);
+	if (IS_ERR(i2c_dev))
+		return PTR_ERR(i2c_dev);
+
+	tp->i2c_dev = i2c_dev;
+
+	return 0;
+}
+
+static int r8169_sfp_register(struct rtl8169_private *tp)
+{
+	struct pci_dev *pdev = tp->pci_dev;
+	struct platform_device_info info = {};
+	struct platform_device *sfp_dev;
+
+	info.parent = &pdev->dev;
+	info.fwnode = software_node_fwnode(tp->nodes.group[SWNODE_SFP]);
+	info.name = "sfp";
+	info.id = pci_dev_id(pdev);
+	sfp_dev = platform_device_register_full(&info);
+	if (IS_ERR(sfp_dev))
+		return PTR_ERR(sfp_dev);
+
+	tp->sfp_dev = sfp_dev;
+
+	return 0;
+}
+
+static int r8169_sfp_nodes_init(struct rtl8169_private *tp)
+{
+	struct pci_dev *pdev = tp->pci_dev;
+	int ret;
+
+	ret = r8169_swnodes_register(tp);
+	if (ret < 0)
+		return dev_err_probe(&pdev->dev, ret, "r8169_swnodes_register\n");
+
+	ret = r8169_gpio_init(tp);
+	if (ret < 0) {
+		ret = dev_err_probe(&pdev->dev, ret, "r8169_gpio_init\n");
+		goto err_unregister_swnode;
+	}
+
+	ret = r8169_clock_register(tp);
+	if (ret < 0) {
+		ret = dev_err_probe(&pdev->dev, ret, "r8169_clock_register\n");
+		goto err_unregister_swnode;
+	}
+
+	ret = r8169_i2c_register(tp);
+	if (ret < 0) {
+		ret = dev_err_probe(&pdev->dev, ret, "r8169_i2c_register\n");
+		goto err_unregister_clk;
+	}
+
+	ret = r8169_sfp_register(tp);
+	if (ret < 0) {
+		ret = dev_err_probe(&pdev->dev, ret, "r8169_sfp_register\n");
+		goto err_unregister_i2c;
+	}
+
+	return 0;
+
+err_unregister_i2c:
+	platform_device_unregister(tp->i2c_dev);
+err_unregister_clk:
+	clkdev_drop(tp->i2c_clock);
+	clk_unregister(tp->i2c_clk);
+err_unregister_swnode:
+	software_node_unregister_node_group(tp->nodes.group);
+
+	return ret;
+}
+
 static const struct ethtool_ops rtl8169_ethtool_ops = {
 	.supported_coalesce_params = ETHTOOL_COALESCE_USECS |
 				     ETHTOOL_COALESCE_MAX_FRAMES,
@@ -5289,6 +5576,14 @@ static void rtl_remove_one(struct pci_dev *pdev)
 
 	/* restore original MAC address */
 	rtl_rar_set(tp, tp->dev->perm_addr);
+
+	if (tp->sfp_mode) {
+		platform_device_unregister(tp->sfp_dev);
+		platform_device_unregister(tp->i2c_dev);
+		clkdev_drop(tp->i2c_clock);
+		clk_unregister(tp->i2c_clk);
+		software_node_unregister_node_group(tp->nodes.group);
+	}
 }
 
 static const struct net_device_ops rtl_netdev_ops = {
@@ -5675,8 +5970,13 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (rtl_is_8125(tp)) {
 		u16 data = r8168_mac_ocp_read(tp, 0xd006);
 
-		if ((data & 0xff) == 0x07)
+		if ((data & 0xff) == 0x07) {
 			tp->sfp_mode = true;
+
+			rc = r8169_sfp_nodes_init(tp);
+			if (rc < 0)
+				return rc;
+		}
 	}
 
 	tp->dash_type = rtl_get_dash_type(tp);
-- 
2.47.3


* Re: [PATCH v6 net-next 0/8] dpll/ice: Add TXC DPLL type and full TX reference clock control for E825
From: Jakub Kicinski @ 2026-04-10  0:33 UTC (permalink / raw)
  To: Grzegorz Nitka
  Cc: netdev, linux-kernel, intel-wired-lan, poros, richardcochran,
	andrew+netdev, przemyslaw.kitszel, anthony.l.nguyen,
	Prathosh.Satish, ivecera, jiri, arkadiusz.kubalewski,
	vadim.fedorenko, donald.hunter, horms, pabeni, davem, edumazet
In-Reply-To: <20260409235122.436749-1-grzegorz.nitka@intel.com>

On Fri, 10 Apr 2026 01:51:14 +0200 Grzegorz Nitka wrote:
> NOTE: This series is intentionally submitted on net-next (not
> intel-wired-lan) as early feedback of DPLL subsystem changes is
> welcomed. In the past possible approaches were discussed in [1].

I LOVE when someone takes 3 days to respond but then posts the next
version the same day.

* Re: [PATCH] net:mctp: split mctp hdr version to ver and rsvd
From: Jeremy Kerr @ 2026-04-10  0:13 UTC (permalink / raw)
  To: wit_yuan; +Cc: yuanzm2, matt, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel
In-Reply-To: <20260409125129.9210-1-yuanzhaoming901030@126.com>

Hi yuanzhaoming,

> From spec dsp0236_1.2.1.pdf, page 26, the MCTP header contains the
> RSVD (4 bits) and Hdr version (4 bits) fields.
> 
> mctp_pkttype_receive() invokes mctp_hdr() and compares the whole
> mh->ver byte against MCTP_VER_MIN and MCTP_VER_MAX, so the reserved
> bits may be misinterpreted as part of the version.
> 
> Signed-off-by: yuanzhaoming <yuanzm2@lenovo.com>
> ---
>  include/net/mctp.h | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/mctp.h b/include/net/mctp.h
> index e1e0a69afdce..80cc9c63f6ba 100644
> --- a/include/net/mctp.h
> +++ b/include/net/mctp.h
> @@ -14,10 +14,17 @@
>  #include <linux/netdevice.h>
>  #include <net/net_namespace.h>
>  #include <net/sock.h>
> +#include <asm/byteorder.h>
>  
>  /* MCTP packet definitions */
>  struct mctp_hdr {
> -       u8      ver;
> +#if defined(__LITTLE_ENDIAN_BITFIELD)
> +       u8      ver:4, rsvd: 4;
> +#elif defined(__BIG_ENDIAN_BITFIELD)
> +       u8      rsvd:4, ver: 4;
> +#else
> +#error "Please fix <asm/byteorder.h>"
> +#endif

I would strongly prefer that we do not use C bitfields for a wire
format. The existing flags_seq_tag member contains three fields, which
we use with a couple of helpers to extract the flag, sequence number or
tag values - please follow that convention if we need changes here.

Also, this introduces a few subtle bugs in that we are no longer setting
the reserved bits to zero when preparing an outgoing TX packet header.

What is the underlying issue you are fixing here? Are you seeing a peer
that is sending us packets with bits set in the reserved field?

(if so, that would also be handy information to include in the commit
message)

> From: yuanzhaoming <yuanzm2@lenovo.com>

Is this the preferred format of your name? These are generally in
a full-name format, or a known identity. There's no particular issue
with what you're using there, if that's what you prefer.

Cheers,


Jeremy

* [PATCH v6 net-next 8/8] ice: implement E825 TX ref clock control and TXC hardware sync status
From: Grzegorz Nitka @ 2026-04-09 23:51 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, intel-wired-lan, poros, richardcochran,
	andrew+netdev, przemyslaw.kitszel, anthony.l.nguyen,
	Prathosh.Satish, ivecera, jiri, arkadiusz.kubalewski,
	vadim.fedorenko, donald.hunter, horms, pabeni, kuba, davem,
	edumazet, Grzegorz Nitka
In-Reply-To: <20260409235122.436749-1-grzegorz.nitka@intel.com>

Build on the previously introduced TXC DPLL framework and implement
full TX reference clock control and hardware-backed synchronization
status reporting for E825 devices.

E825 firmware may accept or override TX reference clock requests based
on device-wide routing constraints and link conditions. For this
reason, TX reference selection and synchronization status must be
observed from hardware rather than inferred from user intent.

This change implements TX reference switching using a deferred worker,
triggered by DPLL TXCLK pin operations. Pin set callbacks express
selection intent and schedule the operation asynchronously; firmware
commands and autonegotiation restarts are executed outside of DPLL
context.

After link-up, the effective TX reference clock is read back from
hardware and software state is reconciled accordingly. TXCLK pin state
reflects only the selected reference clock topology:
- External references (SYNCE, EREF0) are represented as TXCLK pins
- The internal ENET/TXCO clock has no pin representation; when selected,
  all TXCLK pins are reported DISCONNECTED

The actual hardware synchronization result is reported exclusively via the
TXC DPLL lock status:
- LOCKED when an external TX reference is in use
- UNLOCKED when falling back to ENET/TXCO

This separation allows userspace to distinguish between TX reference
selection and successful synchronization, matching the DPLL subsystem
model where pin state describes topology and device lock status
describes signal quality.

The implementation also tracks TX reference usage per PHY across all
PFs to ensure shared TX clock resources are not disabled while still in
use by peer ports.

With this change, TX reference clocks on E825 devices can be reliably
selected, verified against hardware state, and monitored for effective
synchronization via standard DPLL interfaces.

Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
---
 drivers/net/ethernet/intel/ice/Makefile     |   2 +-
 drivers/net/ethernet/intel/ice/ice.h        |  12 +
 drivers/net/ethernet/intel/ice/ice_dpll.c   | 110 ++++++++-
 drivers/net/ethernet/intel/ice/ice_dpll.h   |   4 +
 drivers/net/ethernet/intel/ice/ice_ptp.c    |  26 +-
 drivers/net/ethernet/intel/ice/ice_ptp.h    |   7 +
 drivers/net/ethernet/intel/ice/ice_ptp_hw.c |  37 +++
 drivers/net/ethernet/intel/ice/ice_ptp_hw.h |  27 +++
 drivers/net/ethernet/intel/ice/ice_txclk.c  | 251 ++++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_txclk.h  |  38 +++
 10 files changed, 495 insertions(+), 19 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_txclk.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_txclk.h

diff --git a/drivers/net/ethernet/intel/ice/Makefile b/drivers/net/ethernet/intel/ice/Makefile
index 38db476ab2ec..95fd0c49800f 100644
--- a/drivers/net/ethernet/intel/ice/Makefile
+++ b/drivers/net/ethernet/intel/ice/Makefile
@@ -54,7 +54,7 @@ ice-$(CONFIG_PCI_IOV) +=	\
 	ice_vf_mbx.o		\
 	ice_vf_vsi_vlan_ops.o	\
 	ice_vf_lib.o
-ice-$(CONFIG_PTP_1588_CLOCK) += ice_ptp.o ice_ptp_hw.o ice_dpll.o ice_tspll.o ice_cpi.o
+ice-$(CONFIG_PTP_1588_CLOCK) += ice_ptp.o ice_ptp_hw.o ice_dpll.o ice_tspll.o ice_cpi.o ice_txclk.o
 ice-$(CONFIG_DCB) += ice_dcb.o ice_dcb_nl.o ice_dcb_lib.o
 ice-$(CONFIG_RFS_ACCEL) += ice_arfs.o
 ice-$(CONFIG_XDP_SOCKETS) += ice_xsk.o
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index eb3a48330cc1..6edafce4624a 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -1155,4 +1155,16 @@ static inline struct ice_hw *ice_get_primary_hw(struct ice_pf *pf)
 	else
 		return &pf->adapter->ctrl_pf->hw;
 }
+
+/**
+ * ice_get_ctrl_pf - Get pointer to Control PF of the adapter
+ * @pf: pointer to the current PF structure
+ *
+ * Return: A pointer to ice_pf structure which is Control PF,
+ * NULL if it's not initialized yet.
+ */
+static inline struct ice_pf *ice_get_ctrl_pf(struct ice_pf *pf)
+{
+	return !pf->adapter ? NULL : pf->adapter->ctrl_pf;
+}
 #endif /* _ICE_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_dpll.c b/drivers/net/ethernet/intel/ice/ice_dpll.c
index ab62aac77399..09855f211ed5 100644
--- a/drivers/net/ethernet/intel/ice/ice_dpll.c
+++ b/drivers/net/ethernet/intel/ice/ice_dpll.c
@@ -4,6 +4,7 @@
 #include "ice.h"
 #include "ice_lib.h"
 #include "ice_trace.h"
+#include "ice_txclk.h"
 #include <linux/dpll.h>
 #include <linux/property.h>
 
@@ -19,8 +20,6 @@
 #define ICE_DPLL_SW_PIN_INPUT_BASE_QSFP		6
 #define ICE_DPLL_SW_PIN_OUTPUT_BASE		0
 
-#define E825_EXT_EREF_PIN_IDX			0
-#define E825_EXT_SYNCE_PIN_IDX			1
 #define E825_RCLK_PARENT_0_PIN_IDX		0
 #define E825_RCLK_PARENT_1_PIN_IDX		1
 
@@ -2527,6 +2526,44 @@ ice_dpll_rclk_state_on_pin_get(const struct dpll_pin *pin, void *pin_priv,
 	return ret;
 }
 
+/**
+ * ice_dpll_txclk_work - apply a pending TX reference clock change
+ * @work: work_struct embedded in struct ice_dplls
+ *
+ * This worker executes an outstanding TX reference clock switch request
+ * that was previously queued via the DPLL TXCLK pin set callback.
+ *
+ * The worker performs only the operational part of the switch, issuing
+ * the necessary firmware commands to request a new TX reference clock
+ * selection (e.g. triggering an AN restart). It does not verify whether
+ * the requested clock was ultimately accepted by the hardware.
+ *
+ * Hardware verification, software state reconciliation, pin state
+ * notification, and TXC DPLL lock-status updates are performed later,
+ * after link-up, by ice_txclk_update_and_notify().
+ *
+ * Context:
+ *   - Runs in process context on pf->dplls.wq and may sleep.
+ *   - Serializes access to shared TXCLK state using pf->dplls.lock.
+ */
+static void ice_dpll_txclk_work(struct work_struct *work)
+{
+	struct ice_dplls *dplls =
+		container_of(work, struct ice_dplls, txclk_work);
+	struct ice_pf *pf = container_of(dplls, struct ice_pf, dplls);
+	enum ice_e825c_ref_clk clk;
+	bool do_switch;
+
+	mutex_lock(&pf->dplls.lock);
+	do_switch = pf->dplls.txclk_switch_requested;
+	clk = pf->ptp.port.tx_clk_req;
+	pf->dplls.txclk_switch_requested = false;
+	mutex_unlock(&pf->dplls.lock);
+
+	if (do_switch)
+		ice_txclk_set_clk(pf, clk);
+}
+
 /**
  * ice_dpll_txclk_state_on_dpll_set - set a state on TX clk pin
  * @pin: pointer to a pin
@@ -2538,7 +2575,9 @@ ice_dpll_rclk_state_on_pin_get(const struct dpll_pin *pin, void *pin_priv,
  *
  * Dpll subsystem callback, set a state of a Tx reference clock pin
  *
+ * Context: Acquires and releases pf->dplls.lock
  * Return:
+ * * 0 - success
  * * negative - failure
  */
 static int
@@ -2547,11 +2586,29 @@ ice_dpll_txclk_state_on_dpll_set(const struct dpll_pin *pin, void *pin_priv,
 				 void *dpll_priv, enum dpll_pin_state state,
 				 struct netlink_ext_ack *extack)
 {
-	/*
-	 * TODO: set HW accordingly to selected TX reference clock.
-	 * To be added in the follow up patches.
-	 */
-	return -EOPNOTSUPP;
+	struct ice_dpll_pin *p = pin_priv;
+	struct ice_pf *pf = p->pf;
+	enum ice_e825c_ref_clk new_clk;
+
+	if (ice_dpll_is_reset(pf, extack))
+		return -EBUSY;
+
+	mutex_lock(&pf->dplls.lock);
+	new_clk = (state == DPLL_PIN_STATE_DISCONNECTED) ? ICE_REF_CLK_ENET :
+			p->tx_ref_src;
+	if (new_clk == pf->ptp.port.tx_clk_req) {
+		NL_SET_ERR_MSG_FMT(extack,
+				   "pin:%u state:%u on parent device already set",
+				   p->idx, state);
+		goto unlock;
+	}
+
+	pf->ptp.port.tx_clk_req = new_clk;
+	pf->dplls.txclk_switch_requested = true;
+	queue_work(pf->dplls.wq, &pf->dplls.txclk_work);
+unlock:
+	mutex_unlock(&pf->dplls.lock);
+	return 0;
 }
 
 /**
@@ -2563,10 +2620,21 @@ ice_dpll_txclk_state_on_dpll_set(const struct dpll_pin *pin, void *pin_priv,
  * @state: on success holds pin state on parent pin
  * @extack: error reporting
  *
- * dpll subsystem callback, get a state of a TX clock reference pin.
+ * TXCLK DPLL pin state is derived and not stored explicitly.
+ *
+ * Only external TX reference clocks (SYNCE, EREF0) are modeled
+ * as DPLL pins. The internal ENET (TXCO) clock has no pin and,
+ * when selected, all TXCLK pins are reported DISCONNECTED.
+ *
+ * During a pending TXCLK switch, the requested pin may be
+ * reported as CONNECTED before hardware verification.
+ * Hardware acceptance and synchronization are reported
+ * exclusively via TXC DPLL lock-status.
  *
+ * Context: Acquires and releases pf->dplls.lock
  * Return:
  * * 0 - success
+ * * negative - failure
  */
 static int
 ice_dpll_txclk_state_on_dpll_get(const struct dpll_pin *pin, void *pin_priv,
@@ -2575,11 +2643,21 @@ ice_dpll_txclk_state_on_dpll_get(const struct dpll_pin *pin, void *pin_priv,
 				 enum dpll_pin_state *state,
 				 struct netlink_ext_ack *extack)
 {
-	/*
-	 * TODO: query HW status to determine if the TX reference is selected.
-	 * To be added in the follow up patches.
-	 */
-	*state = DPLL_PIN_STATE_DISCONNECTED;
+	struct ice_dpll_pin *p = pin_priv;
+	struct ice_pf *pf = p->pf;
+
+	if (ice_dpll_is_reset(pf, extack))
+		return -EBUSY;
+
+	mutex_lock(&pf->dplls.lock);
+	if (pf->ptp.port.tx_clk_req == p->tx_ref_src)
+		*state = DPLL_PIN_STATE_CONNECTED;
+	else if (pf->ptp.port.tx_clk == p->tx_ref_src &&
+		 pf->ptp.port.tx_clk_req == pf->ptp.port.tx_clk)
+		*state = DPLL_PIN_STATE_CONNECTED;
+	else
+		*state = DPLL_PIN_STATE_DISCONNECTED;
+	mutex_unlock(&pf->dplls.lock);
 
 	return 0;
 }
@@ -4402,6 +4480,8 @@ static int ice_dpll_init_info_e825c(struct ice_pf *pf)
 	if (ret)
 		goto deinit_info;
 
+	INIT_WORK(&pf->dplls.txclk_work, ice_dpll_txclk_work);
+
 	dev_dbg(ice_pf_to_dev(pf),
 		"%s - success, inputs: %u, outputs: %u, rclk-parents: %u\n",
 		 __func__, d->num_inputs, d->num_outputs, d->rclk.num_parents);
@@ -4538,6 +4618,10 @@ void ice_dpll_deinit(struct ice_pf *pf)
 		ice_dpll_deinit_dpll(pf, &pf->dplls.txc, false);
 
 	ice_dpll_deinit_info(pf);
+
+	if (pf->hw.mac_type == ICE_MAC_GENERIC_3K_E825)
+		flush_work(&pf->dplls.txclk_work);
+
 	mutex_destroy(&pf->dplls.lock);
 }
 
diff --git a/drivers/net/ethernet/intel/ice/ice_dpll.h b/drivers/net/ethernet/intel/ice/ice_dpll.h
index 23f9d4da73c5..9a96b905141d 100644
--- a/drivers/net/ethernet/intel/ice/ice_dpll.h
+++ b/drivers/net/ethernet/intel/ice/ice_dpll.h
@@ -8,6 +8,8 @@
 
 #define ICE_DPLL_RCLK_NUM_MAX	4
 #define ICE_DPLL_TXCLK_NUM_MAX	2
+#define E825_EXT_EREF_PIN_IDX	0
+#define E825_EXT_SYNCE_PIN_IDX	1
 
 /**
  * enum ice_dpll_pin_sw - enumerate ice software pin indices:
@@ -152,6 +154,8 @@ struct ice_dplls {
 	s32 output_phase_adj_max;
 	u32 periodic_counter;
 	bool generic;
+	struct work_struct txclk_work;
+	bool txclk_switch_requested;
 };
 
 #if IS_ENABLED(CONFIG_PTP_1588_CLOCK)
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
index 6cb0cf7a9891..31b1ec5cd9a3 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
@@ -4,6 +4,7 @@
 #include "ice.h"
 #include "ice_lib.h"
 #include "ice_trace.h"
+#include "ice_txclk.h"
 
 static const char ice_pin_names[][64] = {
 	"SDP0",
@@ -54,11 +55,6 @@ static const struct ice_ptp_pin_desc ice_pin_desc_dpll[] = {
 	{  SDP3, {  3, -1 }, { 0, 0 }},
 };
 
-static struct ice_pf *ice_get_ctrl_pf(struct ice_pf *pf)
-{
-	return !pf->adapter ? NULL : pf->adapter->ctrl_pf;
-}
-
 static struct ice_ptp *ice_get_ctrl_ptp(struct ice_pf *pf)
 {
 	struct ice_pf *ctrl_pf = ice_get_ctrl_pf(pf);
@@ -1328,6 +1324,9 @@ void ice_ptp_link_change(struct ice_pf *pf, bool linkup)
 			}
 		}
 		mutex_unlock(&pf->dplls.lock);
+
+		if (linkup)
+			ice_txclk_update_and_notify(pf);
 	}
 
 	switch (hw->mac_type) {
@@ -3081,6 +3080,7 @@ static int ice_ptp_setup_pf(struct ice_pf *pf)
 {
 	struct ice_ptp *ctrl_ptp = ice_get_ctrl_ptp(pf);
 	struct ice_ptp *ptp = &pf->ptp;
+	u8 port_num, phy;
 
 	if (!ctrl_ptp) {
 		dev_info(ice_pf_to_dev(pf),
@@ -3098,6 +3098,10 @@ static int ice_ptp_setup_pf(struct ice_pf *pf)
 		 &pf->adapter->ports.ports);
 	mutex_unlock(&pf->adapter->ports.lock);
 
+	port_num = ptp->port.port_num;
+	phy = port_num / pf->hw.ptp.ports_per_phy;
+	set_bit(port_num, &ctrl_ptp->tx_refclks[phy][pf->ptp.port.tx_clk]);
+
 	return 0;
 }
 
@@ -3298,6 +3302,7 @@ static void ice_ptp_init_tx_interrupt_mode(struct ice_pf *pf)
  */
 void ice_ptp_init(struct ice_pf *pf)
 {
+	enum ice_e825c_ref_clk tx_ref_clk;
 	struct ice_ptp *ptp = &pf->ptp;
 	struct ice_hw *hw = &pf->hw;
 	int err;
@@ -3326,6 +3331,17 @@ void ice_ptp_init(struct ice_pf *pf)
 			goto err_exit;
 	}
 
+	ptp->port.tx_clk = ICE_REF_CLK_ENET;
+	ptp->port.tx_clk_req = ICE_REF_CLK_ENET;
+	if (hw->mac_type == ICE_MAC_GENERIC_3K_E825) {
+		err = ice_get_serdes_ref_sel_e825c(hw, ptp->port.port_num,
+						   &tx_ref_clk);
+		if (!err) {
+			ptp->port.tx_clk = tx_ref_clk;
+			ptp->port.tx_clk_req = tx_ref_clk;
+		}
+	}
+
 	err = ice_ptp_setup_pf(pf);
 	if (err)
 		goto err_exit;
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.h b/drivers/net/ethernet/intel/ice/ice_ptp.h
index 8c44bd758a4f..8b385271ab36 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.h
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.h
@@ -144,6 +144,8 @@ struct ice_ptp_tx {
  * @link_up: indicates whether the link is up
  * @tx_fifo_busy_cnt: number of times the Tx FIFO was busy
  * @port_num: the port number this structure represents
+ * @tx_clk: currently active Tx reference clock source
+ * @tx_clk_req: requested Tx reference clock source (new target)
  */
 struct ice_ptp_port {
 	struct list_head list_node;
@@ -153,6 +155,8 @@ struct ice_ptp_port {
 	bool link_up;
 	u8 tx_fifo_busy_cnt;
 	u8 port_num;
+	enum ice_e825c_ref_clk tx_clk;
+	enum ice_e825c_ref_clk tx_clk_req;
 };
 
 enum ice_ptp_tx_interrupt {
@@ -236,6 +240,7 @@ struct ice_ptp_pin_desc {
  * @info: structure defining PTP hardware capabilities
  * @clock: pointer to registered PTP clock device
  * @tstamp_config: hardware timestamping configuration
+ * @tx_refclks: per-PHY bitmaps recording which ports use each TX reference clock
  * @reset_time: kernel time after clock stop on reset
  * @tx_hwtstamp_good: number of completed Tx timestamp requests
  * @tx_hwtstamp_skipped: number of Tx time stamp requests skipped
@@ -261,6 +266,8 @@ struct ice_ptp {
 	struct ptp_clock_info info;
 	struct ptp_clock *clock;
 	struct kernel_hwtstamp_config tstamp_config;
+#define ICE_E825_MAX_PHYS 2
+	unsigned long tx_refclks[ICE_E825_MAX_PHYS][ICE_REF_CLK_MAX];
 	u64 reset_time;
 	u64 tx_hwtstamp_good;
 	u32 tx_hwtstamp_skipped;
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
index 61c0a0d93ea8..c0720525ac49 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.c
@@ -461,6 +461,43 @@ static int ice_read_phy_eth56g(struct ice_hw *hw, u8 port, u32 addr, u32 *val)
 	return err;
 }
 
+/**
+ * ice_get_serdes_ref_sel_e825c - Read current Tx ref clock source
+ * @hw: pointer to the HW struct
+ * @port: port number for which Tx reference clock is read
+ * @clk: Tx reference clock value (output)
+ *
+ * Return: 0 on success, negative error code if the PHY read fails
+ */
+int ice_get_serdes_ref_sel_e825c(struct ice_hw *hw, u8 port,
+				 enum ice_e825c_ref_clk *clk)
+{
+	u8 lane = port % hw->ptp.ports_per_phy;
+	u32 serdes_rx_nt, serdes_tx_nt;
+	u32 val;
+	int ret;
+
+	ret = ice_read_phy_eth56g(hw, port,
+				  SERDES_IP_IF_LN_FLXM_GENERAL(lane, 0),
+				  &val);
+	if (ret)
+		return ret;
+
+	serdes_rx_nt = FIELD_GET(CFG_ICTL_PCS_REF_SEL_RX_NT, val);
+	serdes_tx_nt = FIELD_GET(CFG_ICTL_PCS_REF_SEL_TX_NT, val);
+
+	if (serdes_tx_nt == REF_SEL_NT_SYNCE &&
+	    serdes_rx_nt == REF_SEL_NT_SYNCE)
+		*clk = ICE_REF_CLK_SYNCE;
+	else if (serdes_tx_nt == REF_SEL_NT_EREF0 &&
+		 serdes_rx_nt == REF_SEL_NT_EREF0)
+		*clk = ICE_REF_CLK_EREF0;
+	else
+		*clk = ICE_REF_CLK_ENET;
+
+	return 0;
+}
+
 /**
  * ice_phy_res_address_eth56g - Calculate a PHY port register address
  * @hw: pointer to the HW struct
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.h b/drivers/net/ethernet/intel/ice/ice_ptp_hw.h
index cbc9693179a1..820ba953ea01 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.h
+++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.h
@@ -381,6 +381,8 @@ int ice_stop_phy_timer_eth56g(struct ice_hw *hw, u8 port, bool soft_reset);
 int ice_start_phy_timer_eth56g(struct ice_hw *hw, u8 port);
 int ice_phy_cfg_intr_eth56g(struct ice_hw *hw, u8 port, bool ena, u8 threshold);
 int ice_phy_cfg_ptp_1step_eth56g(struct ice_hw *hw, u8 port);
+int ice_get_serdes_ref_sel_e825c(struct ice_hw *hw, u8 port,
+				 enum ice_e825c_ref_clk *clk);
 
 #define ICE_ETH56G_NOMINAL_INCVAL	0x140000000ULL
 #define ICE_ETH56G_NOMINAL_PCS_REF_TUS	0x100000000ULL
@@ -790,4 +792,29 @@ static inline u64 ice_get_base_incval(struct ice_hw *hw)
 #define PHY_PTP_1STEP_PD_DELAY_M	GENMASK(30, 1)
 #define PHY_PTP_1STEP_PD_DLY_V_M	BIT(31)
 
+#define SERDES_IP_IF_LN_FLXM_GENERAL(n, m) \
+	(0x32B800 + (m) * 0x100000 + (n) * 0x8000)
+#define CFG_RESERVED0_1                         GENMASK(1, 0)
+#define CFG_ICTL_PCS_MODE_NT                    BIT(2)
+#define CFG_ICTL_PCS_RCOMP_SLAVE_EN_NT          BIT(3)
+#define CFG_ICTL_PCS_CMN_FORCE_PUP_A            BIT(4)
+#define CFG_ICTL_PCS_RCOMP_SLAVE_VALID_A        BIT(5)
+#define CFG_ICTL_PCS_REF_SEL_RX_NT              GENMASK(9, 6)
+#define REF_SEL_NT_ENET                         0
+#define REF_SEL_NT_EREF0                        1
+#define REF_SEL_NT_SYNCE                        2
+#define CFG_IDAT_DFX_OBS_DIG_                   GENMASK(11, 10)
+#define CFG_IRST_APB_MEM_B                      BIT(12)
+#define CFG_ICTL_PCS_DISCONNECT_NT              BIT(13)
+#define CFG_ICTL_PCS_ISOLATE_NT                 BIT(14)
+#define CFG_RESERVED15_15                       BIT(15)
+#define CFG_IRST_PCS_TSTBUS_B_A                 BIT(16)
+#define CFG_ICTL_PCS_REF_TERM_HIZ_EN_NT         BIT(17)
+#define CFG_RESERVED18_19                       GENMASK(19, 18)
+#define CFG_ICTL_PCS_SYNTHLCSLOW_FORCE_PUP_A    BIT(20)
+#define CFG_ICTL_PCS_SYNTHLCFAST_FORCE_PUP_A    BIT(21)
+#define CFG_RESERVED22_24                       GENMASK(24, 22)
+#define CFG_ICTL_PCS_REF_SEL_TX_NT              GENMASK(28, 25)
+#define CFG_RESERVED29_31                       GENMASK(31, 29)
+
 #endif /* _ICE_PTP_HW_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_txclk.c b/drivers/net/ethernet/intel/ice/ice_txclk.c
new file mode 100644
index 000000000000..ba7e83b8952e
--- /dev/null
+++ b/drivers/net/ethernet/intel/ice/ice_txclk.c
@@ -0,0 +1,251 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2026 Intel Corporation */
+
+#include "ice.h"
+#include "ice_cpi.h"
+#include "ice_txclk.h"
+
+#define ICE_PHY0	0
+#define ICE_PHY1	1
+
+/**
+ * ice_txclk_get_pin - map TX reference clock to its DPLL pin
+ * @pf: pointer to the PF structure
+ * @ref_clk: TX reference clock selection
+ *
+ * Return the DPLL pin corresponding to a given external TX reference
+ * clock. Only external TX reference clocks (SYNCE and EREF0) are
+ * represented as DPLL pins. The internal ENET (TXCO) clock has no
+ * associated DPLL pin and therefore yields %NULL.
+ *
+ * This helper is used when emitting DPLL pin change notifications
+ * after TX reference clock transitions have been verified.
+ *
+ * Return: Pointer to the corresponding struct dpll_pin, or %NULL if
+ *         the TX reference clock has no DPLL pin representation.
+ */
+static struct dpll_pin *
+ice_txclk_get_pin(struct ice_pf *pf, enum ice_e825c_ref_clk ref_clk)
+{
+	switch (ref_clk) {
+	case ICE_REF_CLK_SYNCE:
+		return pf->dplls.txclks[E825_EXT_SYNCE_PIN_IDX].pin;
+	case ICE_REF_CLK_EREF0:
+		return pf->dplls.txclks[E825_EXT_EREF_PIN_IDX].pin;
+	case ICE_REF_CLK_ENET:
+	default:
+		return NULL;
+	}
+}
+
+/**
+ * ice_txclk_enable_peer - Enable required TX reference clock on peer PHY
+ * @pf: pointer to the PF structure
+ * @clk: TX reference clock that must be enabled
+ *
+ * Some TX reference clocks on E825-class devices (SyncE and EREF0) must
+ * be enabled on both PHY complexes to allow proper routing:
+ *
+ *   - SyncE must be enabled on both PHYs when used by PHY0
+ *   - EREF0 must be enabled on both PHYs when used by PHY1
+ *
+ * If the requested clock is not yet enabled on the peer PHY, enable it.
+ * ENET does not require duplication and is ignored.
+ *
+ * Return: 0 on success or negative error code on failure.
+ */
+static int ice_txclk_enable_peer(struct ice_pf *pf, enum ice_e825c_ref_clk clk)
+{
+	struct ice_pf *ctrl_pf = ice_get_ctrl_pf(pf);
+	bool peer_clk_in_use;
+	u8 port_num, phy;
+	int err;
+
+	if (clk == ICE_REF_CLK_ENET)
+		return 0;
+
+	if (IS_ERR_OR_NULL(ctrl_pf)) {
+		dev_err(ice_pf_to_dev(pf),
+			"Can't enable tx-clk on peer: no controlling PF\n");
+		return -EINVAL;
+	}
+
+	port_num = pf->ptp.port.port_num;
+	phy = port_num / pf->hw.ptp.ports_per_phy;
+	peer_clk_in_use = true;
+	mutex_lock(&ctrl_pf->dplls.lock);
+	if (clk == ICE_REF_CLK_SYNCE && phy == ICE_PHY0)
+		peer_clk_in_use = ice_txclk_any_port_uses(ctrl_pf, ICE_PHY1, clk);
+	else if (clk == ICE_REF_CLK_EREF0 && phy == ICE_PHY1)
+		peer_clk_in_use = ice_txclk_any_port_uses(ctrl_pf, ICE_PHY0, clk);
+	mutex_unlock(&ctrl_pf->dplls.lock);
+
+	if ((clk == ICE_REF_CLK_SYNCE && phy == ICE_PHY0 && !peer_clk_in_use) ||
+	    (clk == ICE_REF_CLK_EREF0 && phy == ICE_PHY1 && !peer_clk_in_use)) {
+		u8 peer_phy = phy ? ICE_PHY0 : ICE_PHY1;
+
+		err = ice_cpi_ena_dis_clk_ref(&pf->hw, peer_phy, clk, true);
+		if (err) {
+			dev_err(ice_hw_to_dev(&pf->hw),
+				"Failed to enable the %u TX clock for the %u PHY\n",
+				clk, peer_phy);
+			return err;
+		}
+	}
+
+	return 0;
+}
+
+#define ICE_REFCLK_USER_TO_AQ_IDX(x) ((x) + 1)
+
+/**
+ * ice_txclk_set_clk - Set Tx reference clock
+ * @pf: pointer to pf structure
+ * @clk: new Tx clock
+ *
+ * Return: 0 on success, negative value otherwise.
+ */
+int ice_txclk_set_clk(struct ice_pf *pf, enum ice_e825c_ref_clk clk)
+{
+	struct ice_pf *ctrl_pf = ice_get_ctrl_pf(pf);
+	struct ice_port_info *port_info;
+	bool clk_in_use;
+	u8 port_num, phy;
+	int err;
+
+	if (pf->ptp.port.tx_clk == clk)
+		return 0;
+
+	if (IS_ERR_OR_NULL(ctrl_pf)) {
+		dev_err(ice_pf_to_dev(pf),
+			"Can't set tx-clk: no controlling PF\n");
+		return -EINVAL;
+	}
+
+	port_num = pf->ptp.port.port_num;
+	phy = port_num / pf->hw.ptp.ports_per_phy;
+	port_info = pf->hw.port_info;
+	mutex_lock(&ctrl_pf->dplls.lock);
+	clk_in_use = ice_txclk_any_port_uses(ctrl_pf, phy, clk);
+	mutex_unlock(&ctrl_pf->dplls.lock);
+
+	/* Check if the TX clk is enabled for this PHY, if not - enable it */
+	if (!clk_in_use) {
+		err = ice_cpi_ena_dis_clk_ref(&pf->hw, phy, clk, true);
+		if (err) {
+			dev_err(ice_pf_to_dev(pf), "Failed to enable the %u TX clock for the %u PHY\n",
+				clk, phy);
+			return err;
+		}
+		err = ice_txclk_enable_peer(pf, clk);
+		if (err)
+			return err;
+	}
+
+	/* We are ready to switch to the new TX clk. */
+	err = ice_aq_set_link_restart_an(port_info, true, NULL,
+					 ICE_REFCLK_USER_TO_AQ_IDX(clk));
+	if (err)
+		dev_err(ice_pf_to_dev(pf),
+			"AN restart AQ command failed with err %d\n",
+			err);
+
+	return err;
+}
+
+/**
+ * ice_txclk_update_and_notify - Validate TX reference clock switching
+ * @pf: pointer to PF structure
+ *
+ * After a link-up event, verify whether the previously requested TX reference
+ * clock transition actually succeeded. The SERDES reference selector reflects
+ * the effective hardware choice, which may differ from the requested clock
+ * when Auto-Negotiation or firmware applies additional policy.
+ *
+ * If the hardware-selected clock differs from the requested one, update the
+ * software state accordingly and skip the usage-bitmap update.
+ *
+ * When the switch is successful, update the per-PHY usage bitmaps so that the
+ * driver knows which reference clock is currently in use by this port.
+ *
+ * This function does not initiate a clock switch; it only validates the result
+ * of a previously triggered transition and clears stale usage-map entries.
+ */
+void ice_txclk_update_and_notify(struct ice_pf *pf)
+{
+	struct ice_ptp_port *ptp_port = &pf->ptp.port;
+	struct ice_pf *ctrl_pf = ice_get_ctrl_pf(pf);
+	struct dpll_pin *old_pin = NULL;
+	struct dpll_pin *new_pin = NULL;
+	struct ice_hw *hw = &pf->hw;
+	enum ice_e825c_ref_clk clk;
+	int err;
+	u8 phy;
+
+	phy = ptp_port->port_num / hw->ptp.ports_per_phy;
+
+	mutex_lock(&pf->dplls.lock);
+	/* no TX clock change requested */
+	if (pf->ptp.port.tx_clk == pf->ptp.port.tx_clk_req) {
+		mutex_unlock(&pf->dplls.lock);
+		return;
+	}
+	/* verify current Tx reference settings */
+	err = ice_get_serdes_ref_sel_e825c(hw,
+					   ptp_port->port_num,
+					   &clk);
+	if (err) {
+		mutex_unlock(&pf->dplls.lock);
+		return;
+	}
+
+	if (clk != pf->ptp.port.tx_clk_req) {
+		dev_warn(ice_pf_to_dev(pf),
+			 "Failed to switch tx-clk for phy %d and clk %u (current: %u)\n",
+			 phy, pf->ptp.port.tx_clk_req, clk);
+		old_pin = ice_txclk_get_pin(pf, pf->ptp.port.tx_clk_req);
+		new_pin = ice_txclk_get_pin(pf, clk);
+		pf->ptp.port.tx_clk = clk;
+		pf->ptp.port.tx_clk_req = clk;
+		goto err_notify;
+	}
+
+	old_pin = ice_txclk_get_pin(pf, pf->ptp.port.tx_clk);
+	pf->ptp.port.tx_clk = clk;
+	pf->ptp.port.tx_clk_req = clk;
+
+	if (IS_ERR_OR_NULL(ctrl_pf)) {
+		dev_err(ice_pf_to_dev(pf),
+			"Can't set tx-clk: no controlling PF\n");
+		goto err_notify;
+	}
+
+	/* update Tx reference clock usage map */
+	for (int i = 0; i < ICE_REF_CLK_MAX; i++)
+		(clk == i) ?
+		 set_bit(ptp_port->port_num,
+			 &ctrl_pf->ptp.tx_refclks[phy][i]) :
+		 clear_bit(ptp_port->port_num,
+			   &ctrl_pf->ptp.tx_refclks[phy][i]);
+
+err_notify:
+	mutex_unlock(&pf->dplls.lock);
+
+	/* Notify TX clk pins state transition */
+	if (old_pin)
+		dpll_pin_change_ntf(old_pin);
+	if (new_pin)
+		dpll_pin_change_ntf(new_pin);
+
+	/* Update TXC DPLL lock status based on effective TX clk */
+	if (!IS_ERR_OR_NULL(pf->dplls.txc.dpll)) {
+		enum dpll_lock_status new_lock;
+
+		new_lock = ice_txclk_lock_status(clk);
+
+		if (pf->dplls.txc.dpll_state != new_lock) {
+			pf->dplls.txc.dpll_state = new_lock;
+			dpll_device_change_ntf(pf->dplls.txc.dpll);
+		}
+	}
+}
diff --git a/drivers/net/ethernet/intel/ice/ice_txclk.h b/drivers/net/ethernet/intel/ice/ice_txclk.h
new file mode 100644
index 000000000000..bdf591ea8f14
--- /dev/null
+++ b/drivers/net/ethernet/intel/ice/ice_txclk.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2026 Intel Corporation */
+
+#ifndef _ICE_TXCLK_H_
+#define _ICE_TXCLK_H_
+
+/**
+ * ice_txclk_any_port_uses - check if any port on a PHY uses this TX refclk
+ * @ctrl_pf: control PF (owner of the shared tx_refclks map)
+ * @phy: PHY index
+ * @clk: TX reference clock
+ *
+ * Return: true if any bit (port) is set for this clock on this PHY
+ */
+static inline bool
+ice_txclk_any_port_uses(const struct ice_pf *ctrl_pf, u8 phy,
+			enum ice_e825c_ref_clk clk)
+{
+	return find_first_bit(&ctrl_pf->ptp.tx_refclks[phy][clk],
+			BITS_PER_LONG) < BITS_PER_LONG;
+}
+
+static inline enum dpll_lock_status
+ice_txclk_lock_status(enum ice_e825c_ref_clk clk)
+{
+	switch (clk) {
+	case ICE_REF_CLK_SYNCE:
+	case ICE_REF_CLK_EREF0:
+		return DPLL_LOCK_STATUS_LOCKED;
+	case ICE_REF_CLK_ENET:
+	default:
+		return DPLL_LOCK_STATUS_UNLOCKED;
+	}
+}
+
+int ice_txclk_set_clk(struct ice_pf *pf, enum ice_e825c_ref_clk clk);
+void ice_txclk_update_and_notify(struct ice_pf *pf);
+#endif /* _ICE_TXCLK_H_ */
-- 
2.39.3



* [PATCH v6 net-next 7/8] ice: add Tx reference clock index handling to AN restart command
From: Grzegorz Nitka @ 2026-04-09 23:51 UTC
  To: netdev
  Cc: linux-kernel, intel-wired-lan, poros, richardcochran,
	andrew+netdev, przemyslaw.kitszel, anthony.l.nguyen,
	Prathosh.Satish, ivecera, jiri, arkadiusz.kubalewski,
	vadim.fedorenko, donald.hunter, horms, pabeni, kuba, davem,
	edumazet, Grzegorz Nitka
In-Reply-To: <20260409235122.436749-1-grzegorz.nitka@intel.com>

Extend the Restart Auto-Negotiation (AN) AdminQ command with a new
parameter allowing software to specify the Tx reference clock index to
be used during link restart.

This patch:
 - adds REFCLK field definitions to ice_aqc_restart_an
 - updates ice_aq_set_link_restart_an() to take a new refclk parameter
   and properly encode it into the command
 - keeps legacy behavior by passing REFCLK_NOCHANGE where appropriate

This prepares the driver for configurations requiring dynamic selection
of the Tx reference clock as part of the AN flow.

Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 2 ++
 drivers/net/ethernet/intel/ice/ice_common.c     | 5 ++++-
 drivers/net/ethernet/intel/ice/ice_common.h     | 2 +-
 drivers/net/ethernet/intel/ice/ice_lib.c        | 3 ++-
 4 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 859e9c66f3e7..a24a0613d887 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -1169,6 +1169,8 @@ struct ice_aqc_restart_an {
 	u8 cmd_flags;
 #define ICE_AQC_RESTART_AN_LINK_RESTART	BIT(1)
 #define ICE_AQC_RESTART_AN_LINK_ENABLE	BIT(2)
+#define ICE_AQC_RESTART_AN_REFCLK_M	GENMASK(4, 3)
+#define ICE_AQC_RESTART_AN_REFCLK_NOCHANGE 0
 	u8 reserved2[13];
 };
 
diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index ce11fea122d0..de88aec9137c 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -4126,12 +4126,13 @@ int ice_get_link_status(struct ice_port_info *pi, bool *link_up)
  * @pi: pointer to the port information structure
  * @ena_link: if true: enable link, if false: disable link
  * @cd: pointer to command details structure or NULL
+ * @refclk: the new TX reference clock, 0 if no change
  *
  * Sets up the link and restarts the Auto-Negotiation over the link.
  */
 int
 ice_aq_set_link_restart_an(struct ice_port_info *pi, bool ena_link,
-			   struct ice_sq_cd *cd)
+			   struct ice_sq_cd *cd,  u8 refclk)
 {
 	struct ice_aqc_restart_an *cmd;
 	struct libie_aq_desc desc;
@@ -4147,6 +4148,8 @@ ice_aq_set_link_restart_an(struct ice_port_info *pi, bool ena_link,
 	else
 		cmd->cmd_flags &= ~ICE_AQC_RESTART_AN_LINK_ENABLE;
 
+	cmd->cmd_flags |= FIELD_PREP(ICE_AQC_RESTART_AN_REFCLK_M, refclk);
+
 	return ice_aq_send_cmd(pi->hw, &desc, NULL, 0, cd);
 }
 
diff --git a/drivers/net/ethernet/intel/ice/ice_common.h b/drivers/net/ethernet/intel/ice/ice_common.h
index e700ac0dc347..9f5344212195 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.h
+++ b/drivers/net/ethernet/intel/ice/ice_common.h
@@ -215,7 +215,7 @@ ice_cfg_phy_fec(struct ice_port_info *pi, struct ice_aqc_set_phy_cfg_data *cfg,
 		enum ice_fec_mode fec);
 int
 ice_aq_set_link_restart_an(struct ice_port_info *pi, bool ena_link,
-			   struct ice_sq_cd *cd);
+			   struct ice_sq_cd *cd, u8 refclk);
 int
 ice_aq_set_mac_cfg(struct ice_hw *hw, u16 max_frame_size, struct ice_sq_cd *cd);
 int
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index 689c6025ea82..c2c7f186bcc7 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -3769,7 +3769,8 @@ int ice_set_link(struct ice_vsi *vsi, bool ena)
 	if (vsi->type != ICE_VSI_PF)
 		return -EINVAL;
 
-	status = ice_aq_set_link_restart_an(pi, ena, NULL);
+	status = ice_aq_set_link_restart_an(pi, ena, NULL,
+					    ICE_AQC_RESTART_AN_REFCLK_NOCHANGE);
 
 	/* if link is owned by manageability, FW will return LIBIE_AQ_RC_EMODE.
 	 * this is not a fatal error, so print a warning message and return
-- 
2.39.3



* [PATCH v6 net-next 6/8] ice: implement CPI support for E825C
From: Grzegorz Nitka @ 2026-04-09 23:51 UTC
  To: netdev
  Cc: linux-kernel, intel-wired-lan, poros, richardcochran,
	andrew+netdev, przemyslaw.kitszel, anthony.l.nguyen,
	Prathosh.Satish, ivecera, jiri, arkadiusz.kubalewski,
	vadim.fedorenko, donald.hunter, horms, pabeni, kuba, davem,
	edumazet, Grzegorz Nitka
In-Reply-To: <20260409235122.436749-1-grzegorz.nitka@intel.com>

Add full CPI (Converged PHY Interface) command handling required for
E825C devices. The CPI interface allows the driver to interact with
PHY-side control logic through the LM/PHY command registers, including
enabling, disabling, and selecting the PHY reference clock.

This patch introduces:
 - a new CPI subsystem (ice_cpi.c / ice_cpi.h) implementing the CPI
   request/acknowledge state machine, including REQ/ACK protocol,
   command execution, and response handling
 - helper functions for reading/writing PHY registers over the Sideband
   Queue (SBQ)
 - CPI command execution API (ice_cpi_exec) and a helper for enabling or
   disabling Tx reference clocks (CPI 0xF1 opcode 'Config PHY clocking')
 - serialization of CPI transactions inside the CPI core.
   CPI REQ/ACK is a multi-step handshake and must be executed
   atomically per PHY. Centralize the lock in ice_cpi_exec() and
   use adapter-scoped per-PHY mutexes, which match the hardware sharing
   model across PFs.
 - addition of the non-posted write opcode (wr_np) to the SBQ opcodes
 - Makefile integration to build CPI support together with the PTP stack

This provides the infrastructure necessary to support PHY-side
configuration flows on E825C and is required for advanced link control
and Tx reference clock management.

Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
---
 drivers/net/ethernet/intel/ice/Makefile      |   2 +-
 drivers/net/ethernet/intel/ice/ice_adapter.c |   4 +
 drivers/net/ethernet/intel/ice/ice_adapter.h |   7 +
 drivers/net/ethernet/intel/ice/ice_cpi.c     | 364 +++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_cpi.h     |  61 ++++
 drivers/net/ethernet/intel/ice/ice_sbq_cmd.h |   5 +-
 drivers/net/ethernet/intel/ice/ice_type.h    |   2 +
 7 files changed, 442 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ice/ice_cpi.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_cpi.h

diff --git a/drivers/net/ethernet/intel/ice/Makefile b/drivers/net/ethernet/intel/ice/Makefile
index 5b2c666496e7..38db476ab2ec 100644
--- a/drivers/net/ethernet/intel/ice/Makefile
+++ b/drivers/net/ethernet/intel/ice/Makefile
@@ -54,7 +54,7 @@ ice-$(CONFIG_PCI_IOV) +=	\
 	ice_vf_mbx.o		\
 	ice_vf_vsi_vlan_ops.o	\
 	ice_vf_lib.o
-ice-$(CONFIG_PTP_1588_CLOCK) += ice_ptp.o ice_ptp_hw.o ice_dpll.o ice_tspll.o
+ice-$(CONFIG_PTP_1588_CLOCK) += ice_ptp.o ice_ptp_hw.o ice_dpll.o ice_tspll.o ice_cpi.o
 ice-$(CONFIG_DCB) += ice_dcb.o ice_dcb_nl.o ice_dcb_lib.o
 ice-$(CONFIG_RFS_ACCEL) += ice_arfs.o
 ice-$(CONFIG_XDP_SOCKETS) += ice_xsk.o
diff --git a/drivers/net/ethernet/intel/ice/ice_adapter.c b/drivers/net/ethernet/intel/ice/ice_adapter.c
index cbb57060bd56..2dc3629d6d0f 100644
--- a/drivers/net/ethernet/intel/ice/ice_adapter.c
+++ b/drivers/net/ethernet/intel/ice/ice_adapter.c
@@ -62,6 +62,8 @@ static struct ice_adapter *ice_adapter_new(struct pci_dev *pdev)
 	adapter->index = ice_adapter_index(pdev);
 	spin_lock_init(&adapter->ptp_gltsyn_time_lock);
 	spin_lock_init(&adapter->txq_ctx_lock);
+	for (int i = 0; i < ARRAY_SIZE(adapter->cpi_phy_lock); i++)
+		mutex_init(&adapter->cpi_phy_lock[i]);
 	refcount_set(&adapter->refcount, 1);
 
 	mutex_init(&adapter->ports.lock);
@@ -73,6 +75,8 @@ static struct ice_adapter *ice_adapter_new(struct pci_dev *pdev)
 static void ice_adapter_free(struct ice_adapter *adapter)
 {
 	WARN_ON(!list_empty(&adapter->ports.ports));
+	for (int i = 0; i < ARRAY_SIZE(adapter->cpi_phy_lock); i++)
+		mutex_destroy(&adapter->cpi_phy_lock[i]);
 	mutex_destroy(&adapter->ports.lock);
 
 	kfree(adapter);
diff --git a/drivers/net/ethernet/intel/ice/ice_adapter.h b/drivers/net/ethernet/intel/ice/ice_adapter.h
index e95266c7f20b..fa238a6a0e1a 100644
--- a/drivers/net/ethernet/intel/ice/ice_adapter.h
+++ b/drivers/net/ethernet/intel/ice/ice_adapter.h
@@ -5,9 +5,12 @@
 #define _ICE_ADAPTER_H_
 
 #include <linux/types.h>
+#include <linux/mutex.h>
 #include <linux/spinlock_types.h>
 #include <linux/refcount_types.h>
 
+#include "ice_type.h"
+
 struct pci_dev;
 struct ice_pf;
 
@@ -31,6 +34,8 @@ struct ice_port_list {
  * @ptp_gltsyn_time_lock: Spinlock protecting access to the GLTSYN_TIME
  *                        register of the PTP clock.
  * @txq_ctx_lock: Spinlock protecting access to the GLCOMM_QTX_CNTX_CTL register
+ * @cpi_phy_lock: Per-PHY mutex serializing CPI REQ/ACK transactions.
+ *               Index 0 = PHY0, index 1 = PHY1. Only used on E825C.
  * @ctrl_pf: Control PF of the adapter
  * @ports: Ports list
  * @index: 64-bit index cached for collision detection on 32bit systems
@@ -41,6 +46,8 @@ struct ice_adapter {
 	spinlock_t ptp_gltsyn_time_lock;
 	/* For access to GLCOMM_QTX_CNTX_CTL register */
 	spinlock_t txq_ctx_lock;
+	/* Serialize CPI REQ/ACK transactions per PHY (E825C only) */
+	struct mutex cpi_phy_lock[ICE_E825_MAX_PHYS];
 
 	struct ice_pf *ctrl_pf;
 	struct ice_port_list ports;
diff --git a/drivers/net/ethernet/intel/ice/ice_cpi.c b/drivers/net/ethernet/intel/ice/ice_cpi.c
new file mode 100644
index 000000000000..22c8d5a9f859
--- /dev/null
+++ b/drivers/net/ethernet/intel/ice/ice_cpi.c
@@ -0,0 +1,364 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2018-2026 Intel Corporation */
+
+#include "ice_type.h"
+#include "ice_common.h"
+#include "ice_ptp_hw.h"
+#include "ice.h"
+#include "ice_cpi.h"
+
+/**
+ * ice_cpi_get_dest_dev - get destination PHY for given phy index
+ * @hw: pointer to the HW struct
+ * @phy: phy index of port the CPI action is taken on
+ *
+ * Return: sideband queue destination PHY device.
+ */
+static enum ice_sbq_dev_id ice_cpi_get_dest_dev(struct ice_hw *hw, u8 phy)
+{
+	u8 curr_phy = hw->lane_num / hw->ptp.ports_per_phy;
+
+	/* In the driver, lanes 4..7 are in fact lanes 0..3 of a second PHY.
+	 * On a single-complex E825C, PHY 0 is always destination device
+	 * phy_0 and PHY 1 is phy_0_peer.
+	 * On a dual-complex E825C, device phy_0 points to the PHY on the
+	 * current complex and phy_0_peer to the PHY on the other complex.
+	 */
+	if ((!ice_is_dual(hw) && phy) ||
+	    (ice_is_dual(hw) && phy != curr_phy))
+		return ice_sbq_dev_phy_0_peer;
+
+	return ice_sbq_dev_phy_0;
+}
+
+/**
+ * ice_cpi_write_phy - Write a CPI port register
+ * @hw: pointer to the HW struct
+ * @phy: phy index of port the CPI action is taken on
+ * @addr: PHY register address
+ * @val: Value to write
+ *
+ * Return:
+ * * 0 on success
+ * * other error codes when failed to write to PHY
+ */
+static int ice_cpi_write_phy(struct ice_hw *hw, u8 phy, u32 addr, u32 val)
+{
+	struct ice_sbq_msg_input msg = {
+		.dest_dev = ice_cpi_get_dest_dev(hw, phy),
+		.opcode = ice_sbq_msg_wr_np,
+		.msg_addr_low = lower_16_bits(addr),
+		.msg_addr_high = upper_16_bits(addr),
+		.data = val
+	};
+	int err;
+
+	err = ice_sbq_rw_reg(hw, &msg, LIBIE_AQ_FLAG_RD);
+	if (err)
+		ice_debug(hw, ICE_DBG_PTP,
+			  "Failed to write CPI msg to phy %d, err: %d\n",
+			  phy, err);
+
+	return err;
+}
+
+/**
+ * ice_cpi_read_phy - Read a CPI port register
+ * @hw: pointer to the HW struct
+ * @phy: phy index of port the CPI action is taken on
+ * @addr: PHY register address
+ * @val: storage for register value
+ *
+ * Return:
+ * * 0 on success
+ * * other error codes when failed to read from PHY
+ */
+static int ice_cpi_read_phy(struct ice_hw *hw, u8 phy, u32 addr, u32 *val)
+{
+	struct ice_sbq_msg_input msg = {
+		.dest_dev = ice_cpi_get_dest_dev(hw, phy),
+		.opcode = ice_sbq_msg_rd,
+		.msg_addr_low = lower_16_bits(addr),
+		.msg_addr_high = upper_16_bits(addr)
+	};
+	int err;
+
+	err = ice_sbq_rw_reg(hw, &msg, LIBIE_AQ_FLAG_RD);
+	if (err) {
+		ice_debug(hw, ICE_DBG_PTP,
+			  "Failed to read CPI msg from phy %d, err: %d\n",
+			  phy, err);
+		return err;
+	}
+
+	*val = msg.data;
+
+	return 0;
+}
+
+/**
+ * ice_cpi_wait_req0_ack0 - waits for CPI interface to be available
+ * @hw: pointer to the HW struct
+ * @phy: phy index of port the CPI action is taken on
+ *
+ * Check that the CPI interface is free for use by a CPI client: fail with
+ * -EBUSY if LM.CMD.REQ is set, otherwise poll until PHY.CMD.ACK in the CPI
+ * interface registers reads 0.
+ *
+ * Return: 0 on success, negative on error
+ */
+static int ice_cpi_wait_req0_ack0(struct ice_hw *hw, u8 phy)
+{
+	u32 phy_val;
+	u32 lm_val;
+
+	for (int i = 0; i < CPI_RETRIES_COUNT; i++) {
+		int err;
+
+		/* check if another CPI Client is also accessing CPI */
+		err = ice_cpi_read_phy(hw, phy, CPI0_LM1_CMD_DATA, &lm_val);
+		if (err)
+			return err;
+		if (FIELD_GET(CPI_LM_CMD_REQ_M, lm_val))
+			return -EBUSY;
+
+		/* check if PHY.ACK is deasserted */
+		err = ice_cpi_read_phy(hw, phy, CPI0_PHY1_CMD_DATA, &phy_val);
+		if (err)
+			return err;
+		if (FIELD_GET(CPI_PHY_CMD_ERROR_M, phy_val))
+			return -EFAULT;
+		if (!FIELD_GET(CPI_PHY_CMD_ACK_M, phy_val))
+			/* req0 and ack0 at this point - ready to go */
+			return 0;
+
+		msleep(CPI_RETRIES_CADENCE_MS);
+	}
+
+	return -ETIMEDOUT;
+}
+
+/**
+ * ice_cpi_wait_ack - Waits for the PHY.ACK bit to be asserted/deasserted
+ * @hw: pointer to the HW struct
+ * @phy: phy index of port the CPI action is taken on
+ * @asserted: desired state of PHY.ACK bit
+ * @data: optional storage for the PHY command register value, may be NULL
+ *
+ * Poll PHY.ACK until it is asserted or deasserted, as required by the
+ * current phase of the CPI handshake. When waiting for assertion, the full
+ * PHY command register value is stored in @data if it is not NULL.
+ *
+ * Return: 0 on success, negative on error
+ */
+static int ice_cpi_wait_ack(struct ice_hw *hw, u8 phy, bool asserted,
+			    u32 *data)
+{
+	u32 phy_val;
+
+	for (int i = 0; i < CPI_RETRIES_COUNT; i++) {
+		int err;
+
+		err = ice_cpi_read_phy(hw, phy, CPI0_PHY1_CMD_DATA, &phy_val);
+		if (err)
+			return err;
+		if (FIELD_GET(CPI_PHY_CMD_ERROR_M, phy_val))
+			return -EFAULT;
+		if (asserted && FIELD_GET(CPI_PHY_CMD_ACK_M, phy_val)) {
+			if (data)
+				*data = phy_val;
+			return 0;
+		}
+		if (!asserted && !FIELD_GET(CPI_PHY_CMD_ACK_M, phy_val))
+			return 0;
+
+		msleep(CPI_RETRIES_CADENCE_MS);
+	}
+
+	return -ETIMEDOUT;
+}
+
+#define ice_cpi_wait_ack0(hw, phy) \
+	ice_cpi_wait_ack(hw, phy, false, NULL)
+
+#define ice_cpi_wait_ack1(hw, phy, data) \
+	ice_cpi_wait_ack(hw, phy, true, data)
+
+/**
+ * ice_cpi_req0 - deasserts LM.REQ bit
+ * @hw: pointer to the HW struct
+ * @phy: phy index of port the CPI action is taken on
+ * @data: the command data
+ *
+ * Return: 0 on success, negative on CPI write error
+ */
+static int ice_cpi_req0(struct ice_hw *hw, u8 phy, u32 data)
+{
+	data &= ~CPI_LM_CMD_REQ_M;
+
+	return ice_cpi_write_phy(hw, phy, CPI0_LM1_CMD_DATA, data);
+}
+
+/**
+ * ice_cpi_exec_cmd - writes command data to CPI interface
+ * @hw: pointer to the HW struct
+ * @phy: phy index of port the CPI action is taken on
+ * @data: the command data
+ *
+ * Return: 0 on success, otherwise negative on error
+ */
+static int ice_cpi_exec_cmd(struct ice_hw *hw, u8 phy, u32 data)
+{
+	return ice_cpi_write_phy(hw, phy, CPI0_LM1_CMD_DATA, data);
+}
+
+/**
+ * ice_cpi_phy_lock - get per-PHY lock for CPI transaction serialization
+ * @hw: pointer to the HW struct
+ * @phy: PHY index
+ *
+ * Return: pointer to PHY mutex, or %NULL when context is unavailable.
+ */
+static struct mutex *ice_cpi_phy_lock(struct ice_hw *hw, u8 phy)
+{
+	struct ice_pf *pf = hw->back;
+
+	if (!pf || !pf->adapter || phy >= ICE_E825_MAX_PHYS)
+		return NULL;
+
+	return &pf->adapter->cpi_phy_lock[phy];
+}
+
+/**
+ * ice_cpi_exec - executes CPI command
+ * @hw: pointer to the HW struct
+ * @phy: phy index of port the CPI action is taken on
+ * @cmd: pointer to the command struct to execute
+ * @resp: pointer to user allocated CPI response struct
+ *
+ * Execute a CPI request using the CPI REQ/ACK handshake. Transactions on
+ * the same PHY are serialized by a per-PHY mutex in the shared adapter.
+ *
+ * Return: 0 on success, otherwise negative on error
+ */
+int ice_cpi_exec(struct ice_hw *hw, u8 phy,
+		 const struct ice_cpi_cmd *cmd,
+		 struct ice_cpi_resp *resp)
+{
+	struct mutex *cpi_lock;
+	u32 phy_cmd, lm_cmd = 0;
+	int err, err1 = 0;
+
+	if (!cmd || !resp)
+		return -EINVAL;
+
+	cpi_lock = ice_cpi_phy_lock(hw, phy);
+	if (!cpi_lock)
+		return -EINVAL;
+
+	mutex_lock(cpi_lock);
+
+	lm_cmd =
+		FIELD_PREP(CPI_LM_CMD_REQ_M, CPI_LM_CMD_REQ) |
+		FIELD_PREP(CPI_LM_CMD_GET_SET_M, cmd->set) |
+		FIELD_PREP(CPI_LM_CMD_OPCODE_M, cmd->opcode) |
+		FIELD_PREP(CPI_LM_CMD_PORTLANE_M, cmd->port) |
+		FIELD_PREP(CPI_LM_CMD_DATA_M, cmd->data);
+
+	/* 1. Try to acquire the bus, PHY ACK should be low before we begin */
+	err = ice_cpi_wait_req0_ack0(hw, phy);
+	if (err)
+		goto cpi_exec_exit;
+
+	/* 2. We start the CPI request */
+	err = ice_cpi_exec_cmd(hw, phy, lm_cmd);
+	if (err)
+		goto cpi_exec_exit;
+
+	/*
+	 * 3. Wait for CPI confirmation, PHY ACK should be asserted and opcode
+	 *    echoed in the response
+	 */
+	err = ice_cpi_wait_ack1(hw, phy, &phy_cmd);
+	if (err)
+		goto cpi_deassert;
+
+	if (FIELD_GET(CPI_PHY_CMD_ACK_M, phy_cmd) &&
+	    FIELD_GET(CPI_LM_CMD_OPCODE_M, lm_cmd) !=
+	    FIELD_GET(CPI_PHY_CMD_OPCODE_M, phy_cmd)) {
+		err = -EFAULT;
+		goto cpi_deassert;
+	}
+
+	resp->opcode = FIELD_GET(CPI_PHY_CMD_OPCODE_M, phy_cmd);
+	resp->data = FIELD_GET(CPI_PHY_CMD_DATA_M, phy_cmd);
+	resp->port = FIELD_GET(CPI_PHY_CMD_PORTLANE_M, phy_cmd);
+
+cpi_deassert:
+	/* 4. We deassert REQ */
+	err1 = ice_cpi_req0(hw, phy, lm_cmd);
+	if (err1)
+		goto cpi_exec_exit;
+
+	/* 5. PHY ACK should be deasserted in response */
+	err1 = ice_cpi_wait_ack0(hw, phy);
+
+cpi_exec_exit:
+	if (!err)
+		err = err1;
+
+	mutex_unlock(cpi_lock);
+
+	return err;
+}
+
+/**
+ * ice_cpi_set_cmd - execute CPI SET command
+ * @hw: pointer to the HW struct
+ * @opcode: CPI command opcode
+ * @phy: phy index the CPI command is applied to
+ * @port_lane: ephy index the CPI command is applied to
+ * @data: CPI opcode context-specific data
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+static int ice_cpi_set_cmd(struct ice_hw *hw, u16 opcode, u8 phy, u8 port_lane,
+			   u16 data)
+{
+	struct ice_cpi_resp cpi_resp = {};
+	struct ice_cpi_cmd cpi_cmd = {
+		.opcode = opcode,
+		.set = true,
+		.port = port_lane,
+		.data = data,
+	};
+
+	return ice_cpi_exec(hw, phy, &cpi_cmd, &cpi_resp);
+}
+
+/**
+ * ice_cpi_ena_dis_clk_ref - enables/disables Tx reference clock on port
+ * @hw: pointer to the HW struct
+ * @phy: phy index of port for which Tx reference clock is enabled/disabled
+ * @clk: Tx reference clock to enable or disable
+ * @enable: bool value to enable or disable Tx reference clock
+ *
+ * Issue a CPI SET request that enables or disables the given Tx reference
+ * clock on the given PHY.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int ice_cpi_ena_dis_clk_ref(struct ice_hw *hw, u8 phy,
+			    enum ice_e825c_ref_clk clk, bool enable)
+{
+	u16 val;
+
+	val = FIELD_PREP(CPI_OPCODE_PHY_CLK_PHY_SEL_M, phy) |
+	      FIELD_PREP(CPI_OPCODE_PHY_CLK_REF_CTRL_M,
+			 enable ? CPI_OPCODE_PHY_CLK_ENABLE :
+			 CPI_OPCODE_PHY_CLK_DISABLE) |
+	      FIELD_PREP(CPI_OPCODE_PHY_CLK_REF_SEL_M, clk);
+
+	return ice_cpi_set_cmd(hw, CPI_OPCODE_PHY_CLK, phy, 0, val);
+}
+
diff --git a/drivers/net/ethernet/intel/ice/ice_cpi.h b/drivers/net/ethernet/intel/ice/ice_cpi.h
new file mode 100644
index 000000000000..932fe0c0824a
--- /dev/null
+++ b/drivers/net/ethernet/intel/ice/ice_cpi.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (C) 2018-2025 Intel Corporation */
+
+#ifndef _ICE_CPI_H_
+#define _ICE_CPI_H_
+
+#define CPI0_PHY1_CMD_DATA	0x7FD028
+#define CPI0_LM1_CMD_DATA	0x7FD024
+#define CPI_RETRIES_COUNT	10
+#define CPI_RETRIES_CADENCE_MS	100
+
+/* CPI PHY CMD DATA register (CPI0_PHY1_CMD_DATA) */
+#define CPI_PHY_CMD_DATA_M	GENMASK(15, 0)
+#define CPI_PHY_CMD_OPCODE_M	GENMASK(23, 16)
+#define CPI_PHY_CMD_PORTLANE_M	GENMASK(26, 24)
+#define CPI_PHY_CMD_RSVD_M	GENMASK(29, 27)
+#define CPI_PHY_CMD_ERROR_M	BIT(30)
+#define CPI_PHY_CMD_ACK_M	BIT(31)
+
+/* CPI LM CMD DATA register (CPI0_LM1_CMD_DATA) */
+#define CPI_LM_CMD_DATA_M	GENMASK(15, 0)
+#define CPI_LM_CMD_OPCODE_M	GENMASK(23, 16)
+#define CPI_LM_CMD_PORTLANE_M	GENMASK(26, 24)
+#define CPI_LM_CMD_RSVD_M	GENMASK(28, 27)
+#define CPI_LM_CMD_GET_SET_M	BIT(29)
+#define CPI_LM_CMD_RESET_M	BIT(30)
+#define CPI_LM_CMD_REQ_M	BIT(31)
+
+#define CPI_OPCODE_PHY_CLK			0xF1
+#define CPI_OPCODE_PHY_CLK_PHY_SEL_M		GENMASK(9, 6)
+#define CPI_OPCODE_PHY_CLK_REF_CTRL_M		GENMASK(5, 4)
+#define CPI_OPCODE_PHY_CLK_PORT_SEL		0
+#define CPI_OPCODE_PHY_CLK_DISABLE		1
+#define CPI_OPCODE_PHY_CLK_ENABLE		2
+#define CPI_OPCODE_PHY_CLK_REF_SEL_M		GENMASK(3, 0)
+
+#define CPI_OPCODE_PHY_PCS_RESET		0xF0
+#define CPI_OPCODE_PHY_PCS_ONPI_RESET_VAL	0x3F
+
+#define CPI_LM_CMD_REQ		1
+#define CPI_LM_CMD_SET		1
+
+struct ice_cpi_cmd {
+	u8 port;
+	u8 opcode;
+	u16 data;
+	bool set;
+};
+
+struct ice_cpi_resp {
+	u8 port;
+	u8 opcode;
+	u16 data;
+};
+
+int ice_cpi_exec(struct ice_hw *hw, u8 phy,
+		 const struct ice_cpi_cmd *cmd,
+		 struct ice_cpi_resp *resp);
+int ice_cpi_ena_dis_clk_ref(struct ice_hw *hw, u8 phy,
+			    enum ice_e825c_ref_clk clk, bool enable);
+#endif /* _ICE_CPI_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_sbq_cmd.h b/drivers/net/ethernet/intel/ice/ice_sbq_cmd.h
index 21bb861febbf..226243d32968 100644
--- a/drivers/net/ethernet/intel/ice/ice_sbq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_sbq_cmd.h
@@ -54,8 +54,9 @@ enum ice_sbq_dev_id {
 };
 
 enum ice_sbq_msg_opcode {
-	ice_sbq_msg_rd	= 0x00,
-	ice_sbq_msg_wr	= 0x01
+	ice_sbq_msg_rd		= 0x00,
+	ice_sbq_msg_wr		= 0x01,
+	ice_sbq_msg_wr_np	= 0x02
 };
 
 #define ICE_SBQ_MSG_FLAGS	0x40
diff --git a/drivers/net/ethernet/intel/ice/ice_type.h b/drivers/net/ethernet/intel/ice/ice_type.h
index 1e82f4c40b32..d9a5c1aae7c2 100644
--- a/drivers/net/ethernet/intel/ice/ice_type.h
+++ b/drivers/net/ethernet/intel/ice/ice_type.h
@@ -893,6 +893,8 @@ struct ice_ptp_hw {
 	u8 ports_per_phy;
 };
 
+#define ICE_E825_MAX_PHYS	2
+
 /* Port hardware description */
 struct ice_hw {
 	u8 __iomem *hw_addr;
-- 
2.39.3

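The destination selection in ice_cpi_get_dest_dev() can be checked in isolation. Below is a minimal userspace sketch of the same decision, assuming 4 lanes per PHY (per the patch comment that lanes 4..7 are lanes 0..3 of a second PHY). struct hw_model, enum sbq_dest and cpi_dest_dev() are hypothetical stand-ins for struct ice_hw, enum ice_sbq_dev_id and the driver function, and the `dual` flag replaces ice_is_dual().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sideband destinations used by the patch. */
enum sbq_dest { SBQ_DEV_PHY_0, SBQ_DEV_PHY_0_PEER };

/* Hypothetical subset of struct ice_hw that the decision depends on. */
struct hw_model {
	uint8_t lane_num;      /* lane of the calling PF */
	uint8_t ports_per_phy; /* lanes per PHY, e.g. 4 */
	bool dual;             /* stands in for ice_is_dual(hw) */
};

static enum sbq_dest cpi_dest_dev(const struct hw_model *hw, uint8_t phy)
{
	uint8_t curr_phy;

	assert(hw->ports_per_phy != 0);
	/* PHY the calling PF itself sits on */
	curr_phy = hw->lane_num / hw->ports_per_phy;

	/* Single complex: PHY 1 is always reached through phy_0_peer. */
	if (!hw->dual)
		return phy ? SBQ_DEV_PHY_0_PEER : SBQ_DEV_PHY_0;

	/* Dual complex: phy_0_peer is whichever PHY the caller is not on. */
	return phy != curr_phy ? SBQ_DEV_PHY_0_PEER : SBQ_DEV_PHY_0;
}
```

For example, on a dual-complex device a PF on lane 5 sits on PHY 1, so PHY 1 maps to phy_0 and PHY 0 to phy_0_peer; on a single-complex device the same lane maps PHY 1 to phy_0_peer.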

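The five numbered steps in ice_cpi_exec() form a REQ/ACK handshake over two registers. The sketch below models that ordering in userspace so it can be stepped through without hardware. The register model, the behaviour of the hypothetical phy_step() and the 0x1234 response payload are assumptions for illustration, and the poll loop advances the model instead of calling msleep(). Like the patch, it always deasserts REQ once it has been raised, even when the ACK phase fails, and reports the first error seen.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Bit layout taken from ice_cpi.h (REQ lives in LM, ACK/ERR in PHY). */
#define REQ_M    (1u << 31)
#define ACK_M    (1u << 31)
#define ERR_M    (1u << 30)
#define OPCODE_M 0x00FF0000u
#define DATA_M   0x0000FFFFu
#define RETRIES  10

/* Hypothetical model of CPI0_LM1_CMD_DATA and CPI0_PHY1_CMD_DATA. */
struct cpi_model {
	uint32_t lm;  /* written by the driver */
	uint32_t phy; /* written by the PHY */
};

/* Hypothetical PHY: answer a raised REQ with ACK plus the echoed opcode,
 * and drop ACK once REQ has been deasserted. */
static void phy_step(struct cpi_model *m)
{
	if ((m->lm & REQ_M) && !(m->phy & ACK_M))
		m->phy = ACK_M | (m->lm & OPCODE_M) | 0x1234u;
	else if (!(m->lm & REQ_M) && (m->phy & ACK_M))
		m->phy = 0;
}

/* Poll until PHY.ACK equals @want, like ice_cpi_wait_ack(). */
static int wait_ack(struct cpi_model *m, int want)
{
	assert(want == 0 || want == 1);
	for (int i = 0; i < RETRIES; i++) {
		if (m->phy & ERR_M)
			return -EFAULT;
		if (!!(m->phy & ACK_M) == want)
			return 0;
		phy_step(m); /* replaces msleep() + re-read */
	}
	return -ETIMEDOUT;
}

static int cpi_exec_model(struct cpi_model *m, uint32_t lm_cmd, uint32_t *resp)
{
	int err, err1;

	/* 1. Bus must be free: no REQ from another client, ACK low. */
	if (m->lm & REQ_M)
		return -EBUSY;
	err = wait_ack(m, 0);
	if (err)
		return err;

	/* 2. Raise REQ together with the command word. */
	m->lm = lm_cmd | REQ_M;

	/* 3. Wait for ACK and verify the echoed opcode. */
	err = wait_ack(m, 1);
	if (!err && (m->phy & OPCODE_M) != (lm_cmd & OPCODE_M))
		err = -EFAULT;
	if (!err)
		*resp = m->phy & DATA_M;

	/* 4. Always deassert REQ, then 5. wait for ACK to drop. */
	m->lm &= ~REQ_M;
	err1 = wait_ack(m, 0);

	return err ? err : err1;
}
```

A successful run leaves both REQ and ACK low, which is exactly the precondition that step 1 of the next transaction checks.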

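Step 2 of ice_cpi_exec() writes a single LM command word built with FIELD_PREP() over the CPI_LM_CMD_* masks, and ice_cpi_ena_dis_clk_ref() packs its 16-bit payload the same way. A minimal userspace sketch of that packing follows, useful for checking register values by hand. GENMASK32(), BIT32(), field_prep(), field_get(), cpi_clk_ref_data() and cpi_lm_cmd() are hypothetical helpers standing in for the kernel macros and driver code; only the mask and opcode values are copied from ice_cpi.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel GENMASK()/BIT() macros (32-bit). */
#define GENMASK32(h, l) ((~0u >> (31 - (h))) & (~0u << (l)))
#define BIT32(n)        (1u << (n))

/* Mask and opcode values copied from ice_cpi.h */
#define CPI_LM_CMD_DATA_M		GENMASK32(15, 0)
#define CPI_LM_CMD_OPCODE_M		GENMASK32(23, 16)
#define CPI_LM_CMD_PORTLANE_M		GENMASK32(26, 24)
#define CPI_LM_CMD_GET_SET_M		BIT32(29)
#define CPI_LM_CMD_REQ_M		BIT32(31)
#define CPI_OPCODE_PHY_CLK		0xF1
#define CPI_OPCODE_PHY_CLK_PHY_SEL_M	GENMASK32(9, 6)
#define CPI_OPCODE_PHY_CLK_REF_CTRL_M	GENMASK32(5, 4)
#define CPI_OPCODE_PHY_CLK_REF_SEL_M	GENMASK32(3, 0)
#define CPI_OPCODE_PHY_CLK_DISABLE	1
#define CPI_OPCODE_PHY_CLK_ENABLE	2

/* Position of the lowest set bit of a non-zero mask. */
static unsigned int mask_shift(uint32_t mask)
{
	unsigned int shift = 0;

	assert(mask != 0);
	while (!(mask & 1u)) {
		mask >>= 1;
		shift++;
	}
	return shift;
}

/* Like FIELD_PREP(), but the "value fits" check happens at run time. */
static uint32_t field_prep(uint32_t mask, uint32_t val)
{
	unsigned int shift = mask_shift(mask);

	assert((val & ~(mask >> shift)) == 0);
	return val << shift;
}

/* Like FIELD_GET(). */
static uint32_t field_get(uint32_t mask, uint32_t reg)
{
	return (reg & mask) >> mask_shift(mask);
}

/* Payload as built by ice_cpi_ena_dis_clk_ref() */
static uint16_t cpi_clk_ref_data(uint8_t phy, uint8_t clk, bool enable)
{
	return (uint16_t)(field_prep(CPI_OPCODE_PHY_CLK_PHY_SEL_M, phy) |
			  field_prep(CPI_OPCODE_PHY_CLK_REF_CTRL_M,
				     enable ? CPI_OPCODE_PHY_CLK_ENABLE :
					      CPI_OPCODE_PHY_CLK_DISABLE) |
			  field_prep(CPI_OPCODE_PHY_CLK_REF_SEL_M, clk));
}

/* LM command word as written in step 2 of ice_cpi_exec() */
static uint32_t cpi_lm_cmd(uint8_t opcode, uint8_t port, uint16_t data, bool set)
{
	return field_prep(CPI_LM_CMD_REQ_M, 1) |
	       field_prep(CPI_LM_CMD_GET_SET_M, set) |
	       field_prep(CPI_LM_CMD_OPCODE_M, opcode) |
	       field_prep(CPI_LM_CMD_PORTLANE_M, port) |
	       field_prep(CPI_LM_CMD_DATA_M, data);
}
```

With these helpers, enabling reference clock 0 on PHY 1 yields a payload of 0x60 and the LM command word 0xA0F10060 (REQ, SET, opcode 0xF1, port/lane 0). Because cmd->set is a bool, it fills the single GET_SET bit directly.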
