* Re: [PATCH net 0/3][pull request] Intel Wired LAN Driver Updates 2026-06-09 (idpf, ixgbe, igc)
From: patchwork-bot+netdevbpf @ 2026-06-13 10:00 UTC (permalink / raw)
To: Tony Nguyen; +Cc: davem, kuba, pabeni, edumazet, andrew+netdev, netdev
In-Reply-To: <20260609172458.4046222-1-anthony.l.nguyen@intel.com>
Hello:
This series was applied to netdev/net.git (main)
by Tony Nguyen <anthony.l.nguyen@intel.com>:
On Tue, 9 Jun 2026 10:24:53 -0700 you wrote:
> Przemyslaw adds needed padding to idpf PTP structures to match firmware
> expectations.
>
> Larysa bypasses XPS configuration on XDP queues for ixgbe.
>
> Khai Wen corrects offset into packet buffer when handling for frame
> preemption on igc.
>
> [...]
Here is the summary with links:
- [net,1/3] idpf: add padding to PTP virtchnl structures
https://git.kernel.org/netdev/net/c/d1e8f9fd6b98
- [net,2/3] ixgbe: do not configure xps for XDP queues
https://git.kernel.org/netdev/net/c/7bd4355272de
- [net,3/3] igc: skip RX timestamp header for frame preemption verification
https://git.kernel.org/netdev/net/c/38b7a274cf84
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [net-next 8/9] dt-bindings: net: renesas,etheravb: Add optional gPTP phandle for Gen4
From: Krzysztof Kozlowski @ 2026-06-13 10:04 UTC (permalink / raw)
To: Niklas Söderlund
Cc: Paul Barker, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Richard Cochran, Geert Uytterhoeven, Magnus Damm,
Sergei Shtylyov, netdev, linux-renesas-soc, devicetree,
linux-kernel
In-Reply-To: <20260610102432.3538432-9-niklas.soderlund+renesas@ragnatech.se>
On Wed, Jun 10, 2026 at 12:24:31PM +0200, Niklas Söderlund wrote:
> The RAVB module on Gen4 has no gPTP clock as part of the RAVB module
> itself; instead it relies on an external system-wide gPTP clock. The
> gPTP clock is shared with RTSN on V4H and RSWITCH on S4.
>
> Add an optional phandle so that the RAVB driver can find and use the
> gPTP clock. Ideally this should have been a mandatory property, but for
> backward compatibility it is optional. The RAVB module is capable of
> functioning without it, but in such cases cannot provide PTP
> functionality.
>
> Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
> ---
> .../bindings/net/renesas,etheravb.yaml | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/Documentation/devicetree/bindings/net/renesas,etheravb.yaml b/Documentation/devicetree/bindings/net/renesas,etheravb.yaml
> index 1e00ef5b3acd..7bc910ab3ae0 100644
> --- a/Documentation/devicetree/bindings/net/renesas,etheravb.yaml
> +++ b/Documentation/devicetree/bindings/net/renesas,etheravb.yaml
> @@ -122,6 +122,13 @@ properties:
> Specify when the AVB_LINK signal is active-low instead of normal
> active-high.
>
> + renesas,gptp:
Aren't you duplicating the existing timestamper property? Isn't the
purpose of both the same?
> + $ref: /schemas/types.yaml#/definitions/phandle
> + description:
> + A phandle to an external gPTP clock for Gen4 platforms. The property is
Explain the purpose of this in the hardware.
> + optional for backwards compatibility, but without it gPTP timestamps are
> + disabled as Gen4 have no gPTP as part of the RAVB module itself.
> +
Best regards,
Krzysztof
^ permalink raw reply
* Re: [PATCH net-next v5 3/3] net: airoha: defer GDM3/GDM4 WAN mode and GDM2 loopback to QoS offload
From: Lorenzo Bianconi @ 2026-06-13 10:04 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: linux-arm-kernel, linux-mediatek, netdev, Madhur Agrawal,
Alexander Lobakin
In-Reply-To: <20260611-airoha-ethtool-priv_flags-v5-3-c11de08486d1@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 7641 bytes --]
Commenting on sashiko's report:
https://netdev-ai.bots.linux.dev/sashiko/#/patchset/20260611-airoha-ethtool-priv_flags-v5-0-c11de08486d1%40kernel.org
[...]
> +static int airoha_enable_qos_for_gdm34(struct net_device *netdev,
> + struct netlink_ext_ack *extack)
> +{
> + struct airoha_gdm_dev *wan_dev, *dev = netdev_priv(netdev);
> + struct airoha_gdm_port *port = dev->port;
> + struct airoha_eth *eth = dev->eth;
> + int err = -EBUSY;
> +
> + if (port->id != AIROHA_GDM3_IDX &&
> + port->id != AIROHA_GDM4_IDX) {
> + /* HW QoS is always supported by GDM1 and GDM2 */
> + return 0;
> + }
> +
> + if (!airoha_is_lan_gdm_dev(dev)) /* Already enabled */
> + return 0;
> +
> + mutex_lock(&flow_offload_mutex);
> +
> + wan_dev = airoha_get_wan_gdm_dev(eth);
> + if (wan_dev) {
> + if ((wan_dev->flags & AIROHA_PRIV_F_QOS) ||
> + wan_dev->port->id == AIROHA_GDM2_IDX) {
> + NL_SET_ERR_MSG_MOD(extack,
> + "QoS configured for WAN device");
> + goto error_unlock;
> + }
> + airoha_disable_qos_for_gdm34(netdev_from_priv(wan_dev));
> + }
> +
> + dev->flags |= AIROHA_PRIV_F_WAN;
> + airoha_dev_set_qdma(dev);
> + err = airoha_enable_gdm2_loopback(dev);
> + if (err)
> + goto error_disable_wan;
> +
> + err = airoha_set_macaddr(dev, netdev->dev_addr);
> + if (err)
> + goto error_disable_loopback;
> +
> + if (netif_running(netdev)) {
> + u32 pse_port;
> +
> + pse_port = airoha_ppe_is_enabled(eth, 1) ? FE_PSE_PORT_PPE2
> + : FE_PSE_PORT_PPE1;
> + airoha_set_gdm_port_fwd_cfg(eth, REG_GDM_FWD_CFG(port->id),
> + pse_port);
> + }
> +
> + mutex_unlock(&flow_offload_mutex);
> +
> + return 0;
> +
> +error_disable_loopback:
> + airoha_disable_gdm2_loopback(dev);
> +error_disable_wan:
> + dev->flags &= ~AIROHA_PRIV_F_WAN;
> + airoha_dev_set_qdma(dev);
> +error_unlock:
> + mutex_unlock(&flow_offload_mutex);
> +
> + return err;
> +}
- The error_disable_loopback / error_disable_wan / error_unlock paths only
revert the requesting dev. The earlier airoha_disable_qos_for_gdm34(wan_dev)
demotion is never undone.
Could a TC_HTB_CREATE that returns an error to userspace then leave the
system with no WAN GDM3/GDM4 device, with the previously-working sibling
silently flipped to LAN, its QDMA migrated to QDMA0, GDM2 loopback torn
down, and its forwarding rewritten to PPE1?
Should the failure paths re-promote wan_dev (re-call airoha_enable_qos_for_gdm34
or an equivalent restore helper on it) before unlocking and returning?
- This is the same item reported by sashiko-gemini in [0]. In my previous
reply I explained why I do not think it is worth re-promoting the
original interface if airoha_enable_qos_for_gdm34() fails.
Regards,
Lorenzo
[0] https://sashiko.dev/#/patchset/20260611-airoha-ethtool-priv_flags-v5-0-c11de08486d1%40kernel.org
> +
> static int airoha_tc_htb_destroy(struct net_device *netdev)
> {
> struct airoha_gdm_dev *dev = netdev_priv(netdev);
> @@ -3038,6 +3205,8 @@ static int airoha_tc_htb_destroy(struct net_device *netdev)
> for_each_set_bit(q, dev->qos_sq_bmap, AIROHA_NUM_QOS_CHANNELS)
> airoha_tc_remove_htb_queue(netdev, q);
>
> + dev->flags &= ~AIROHA_PRIV_F_QOS;
> +
> return 0;
> }
>
> @@ -3057,24 +3226,33 @@ static int airoha_tc_get_htb_get_leaf_queue(struct net_device *netdev,
> return 0;
> }
>
> -static int airoha_tc_setup_qdisc_htb(struct net_device *dev,
> +static int airoha_tc_setup_qdisc_htb(struct net_device *netdev,
> struct tc_htb_qopt_offload *opt)
> {
> switch (opt->command) {
> - case TC_HTB_CREATE:
> + case TC_HTB_CREATE: {
> + struct airoha_gdm_dev *dev = netdev_priv(netdev);
> + int err;
> +
> + err = airoha_enable_qos_for_gdm34(netdev, opt->extack);
> + if (err)
> + return err;
> +
> + dev->flags |= AIROHA_PRIV_F_QOS;
> break;
> + }
> case TC_HTB_DESTROY:
> - return airoha_tc_htb_destroy(dev);
> + return airoha_tc_htb_destroy(netdev);
> case TC_HTB_NODE_MODIFY:
> - return airoha_tc_htb_modify_queue(dev, opt);
> + return airoha_tc_htb_modify_queue(netdev, opt);
> case TC_HTB_LEAF_ALLOC_QUEUE:
> - return airoha_tc_htb_alloc_leaf_queue(dev, opt);
> + return airoha_tc_htb_alloc_leaf_queue(netdev, opt);
> case TC_HTB_LEAF_DEL:
> case TC_HTB_LEAF_DEL_LAST:
> case TC_HTB_LEAF_DEL_LAST_FORCE:
> - return airoha_tc_htb_delete_leaf_queue(dev, opt);
> + return airoha_tc_htb_delete_leaf_queue(netdev, opt);
> case TC_HTB_LEAF_QUERY_QUEUE:
> - return airoha_tc_get_htb_get_leaf_queue(dev, opt);
> + return airoha_tc_get_htb_get_leaf_queue(netdev, opt);
> default:
> return -EOPNOTSUPP;
> }
> diff --git a/drivers/net/ethernet/airoha/airoha_eth.h b/drivers/net/ethernet/airoha/airoha_eth.h
> index 24fd8dcf7fca..d1390ffcea7c 100644
> --- a/drivers/net/ethernet/airoha/airoha_eth.h
> +++ b/drivers/net/ethernet/airoha/airoha_eth.h
> @@ -540,11 +540,12 @@ struct airoha_qdma {
>
> enum airoha_priv_flags {
> AIROHA_PRIV_F_WAN = BIT(0),
> + AIROHA_PRIV_F_QOS = BIT(1),
> };
>
> struct airoha_gdm_dev {
> + struct airoha_qdma __rcu *qdma;
> struct airoha_gdm_port *port;
> - struct airoha_qdma *qdma;
> struct airoha_eth *eth;
>
> DECLARE_BITMAP(qos_sq_bmap, AIROHA_NUM_QOS_CHANNELS);
> @@ -676,6 +677,16 @@ int airoha_get_fe_port(struct airoha_gdm_dev *dev);
> bool airoha_is_valid_gdm_dev(struct airoha_eth *eth,
> struct airoha_gdm_dev *dev);
>
> +extern struct mutex flow_offload_mutex;
> +
> +static inline struct airoha_qdma *
> +airoha_qdma_deref(struct airoha_gdm_dev *dev)
> +{
> + return rcu_dereference_protected(dev->qdma,
> + lockdep_rtnl_is_held() ||
> + lockdep_is_held(&flow_offload_mutex));
> +}
> +
> void airoha_ppe_set_cpu_port(struct airoha_gdm_dev *dev, u8 ppe_id, u8 fport);
> bool airoha_ppe_is_enabled(struct airoha_eth *eth, int index);
> void airoha_ppe_check_skb(struct airoha_ppe_dev *dev, struct sk_buff *skb,
> diff --git a/drivers/net/ethernet/airoha/airoha_ppe.c b/drivers/net/ethernet/airoha/airoha_ppe.c
> index 91bcc55a6ac6..1d1b1a57d795 100644
> --- a/drivers/net/ethernet/airoha/airoha_ppe.c
> +++ b/drivers/net/ethernet/airoha/airoha_ppe.c
> @@ -15,7 +15,10 @@
> #include "airoha_regs.h"
> #include "airoha_eth.h"
>
> -static DEFINE_MUTEX(flow_offload_mutex);
> +/* Serialize airoha_gdm_dev flags, QDMA pointer and PPE CPU port
> + * configuration.
> + */
> +DEFINE_MUTEX(flow_offload_mutex);
> static DEFINE_SPINLOCK(ppe_lock);
>
> static const struct rhashtable_params airoha_flow_table_params = {
> @@ -86,8 +89,8 @@ static u32 airoha_ppe_get_timestamp(struct airoha_ppe *ppe)
>
> void airoha_ppe_set_cpu_port(struct airoha_gdm_dev *dev, u8 ppe_id, u8 fport)
> {
> - struct airoha_qdma *qdma = dev->qdma;
> - struct airoha_eth *eth = qdma->eth;
> + struct airoha_qdma *qdma = airoha_qdma_deref(dev);
> + struct airoha_eth *eth = dev->eth;
> u8 qdma_id = qdma - &eth->qdma[0];
> u32 fe_cpu_port;
>
> diff --git a/drivers/net/ethernet/airoha/airoha_regs.h b/drivers/net/ethernet/airoha/airoha_regs.h
> index 436f3c8779c1..4e17dfbcf2b8 100644
> --- a/drivers/net/ethernet/airoha/airoha_regs.h
> +++ b/drivers/net/ethernet/airoha/airoha_regs.h
> @@ -376,6 +376,7 @@
>
> #define REG_SRC_PORT_FC_MAP6 0x2298
> #define FC_ID_OF_SRC_PORT_MASK(_n) GENMASK(4 + ((_n) << 3), ((_n) << 3))
> +#define FC_MAP6_DEF_VALUE 0x1b1a1918
>
> #define REG_CDM5_RX_OQ1_DROP_CNT 0x29d4
>
>
> --
> 2.54.0
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply
* Re: [PATCH] nbd: Reclassify sockets to avoid lockdep circular dependency
From: Hillf Danton @ 2026-06-13 10:12 UTC (permalink / raw)
To: Eric Dumazet
Cc: linux-kernel, Jens Axboe, linux-block, nbd, Kuniyuki Iwashima,
netdev, syzbot+607cdcf978b3e79da878
In-Reply-To: <20260613042619.1108126-1-edumazet@google.com>
On Sat, 13 Jun 2026 04:26:19 +0000 Eric Dumazet wrote:
> syzbot reported a possible circular locking dependency in udp_sendmsg()
> where fs_reclaim can be triggered while holding sk_lock, and fs_reclaim
> can eventually depend on another sk_lock (e.g., if NBD is used for swap
> or writeback and NBD uses TLS/TCP which acquires sk_lock).
>
> Since the UDP socket and the NBD TCP/TLS socket are different, this is a
> false positive. Fix this by reclassifying NBD sockets to a separate lock
> class when they are added to the NBD device.
>
> This is similar to what nvme-tcp and other network block devices do.
>
> Fixes: ffa1e7ada456 ("block: Make request_queue lockdep splats show up earlier")
Given the Fixes tag, can you point out what that commit got wrong?
> Reported-by: syzbot+607cdcf978b3e79da878@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/netdev/6a2cdafe.428ffe26.258b27.0161.GAE@google.com/T/#u
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
^ permalink raw reply
* Re: [net-next 8/9] dt-bindings: net: renesas,etheravb: Add optional gPTP phandle for Gen4
From: Niklas Söderlund @ 2026-06-13 10:13 UTC (permalink / raw)
To: Krzysztof Kozlowski
Cc: Paul Barker, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Rob Herring, Krzysztof Kozlowski,
Conor Dooley, Richard Cochran, Geert Uytterhoeven, Magnus Damm,
Sergei Shtylyov, netdev, linux-renesas-soc, devicetree,
linux-kernel
In-Reply-To: <20260613-caped-ferret-of-philosophy-acae13@quoll>
On 2026-06-13 12:04:09 +0200, Krzysztof Kozlowski wrote:
> On Wed, Jun 10, 2026 at 12:24:31PM +0200, Niklas Söderlund wrote:
> > The RAVB module on Gen4 has no gPTP clock as part of the RAVB module
> > itself; instead it relies on an external system-wide gPTP clock. The
> > gPTP clock is shared with RTSN on V4H and RSWITCH on S4.
> >
> > Add an optional phandle so that the RAVB driver can find and use the
> > gPTP clock. Ideally this should have been a mandatory property, but for
> > backward compatibility it is optional. The RAVB module is capable of
> > functioning without it, but in such cases cannot provide PTP
> > functionality.
> >
> > Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
> > ---
> > .../bindings/net/renesas,etheravb.yaml | 16 ++++++++++++++++
> > 1 file changed, 16 insertions(+)
> >
> > diff --git a/Documentation/devicetree/bindings/net/renesas,etheravb.yaml b/Documentation/devicetree/bindings/net/renesas,etheravb.yaml
> > index 1e00ef5b3acd..7bc910ab3ae0 100644
> > --- a/Documentation/devicetree/bindings/net/renesas,etheravb.yaml
> > +++ b/Documentation/devicetree/bindings/net/renesas,etheravb.yaml
> > @@ -122,6 +122,13 @@ properties:
> > Specify when the AVB_LINK signal is active-low instead of normal
> > active-high.
> >
> > + renesas,gptp:
>
> Aren't you duplicating the existing timestamper property? Isn't the
> purpose of both the same?
Yes, I am. I will switch to using the existing ptp-timer property. Thank
you.
>
> > + $ref: /schemas/types.yaml#/definitions/phandle
> > + description:
> > + A phandle to an external gPTP clock for Gen4 platforms. The property is
>
> Explain the purpose of this in the hardware.
>
> > + optional for backwards compatibility, but without it gPTP timestamps are
> > + disabled as Gen4 have no gPTP as part of the RAVB module itself.
> > +
>
> Best regards,
> Krzysztof
>
--
Kind Regards,
Niklas Söderlund
^ permalink raw reply
* [PATCH net] ice: Fix use-after-scope in ice_sched_add_nodes_to_layer()
From: NeKon69 @ 2026-06-13 10:14 UTC (permalink / raw)
To: anthony.l.nguyen, przemyslaw.kitszel
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, victor.raj,
intel-wired-lan, netdev, linux-kernel, NeKon69
Commit 7fb09a737536 ("ice: Modify recursive way of adding nodes")
changed ice_sched_add_nodes_to_layer() from recursive control flow to an
iterative loop.
Inside the loop, first_teid_ptr may be set to the address of a
block-local variable:
u32 temp;
...
if (num_added)
first_teid_ptr = &temp;
On the next loop iteration, first_teid_ptr may be passed to
ice_sched_add_nodes_to_hw_layer(), after temp from the previous
iteration has gone out of scope.
Move temp outside the loop so the pointer remains valid for the lifetime
of ice_sched_add_nodes_to_layer().
This was found by Clang with LifetimeSafety enabled while testing C
language support on a Linux allmodconfig build.
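The lifetime problem described above can be illustrated with a minimal
userspace sketch. It is not the driver code: add_batch() and add_nodes()
are hypothetical stand-ins for ice_sched_add_nodes_to_hw_layer() and
ice_sched_add_nodes_to_layer(), and the TEID values are made up. The
sketch only shows the fixed shape, where temp lives at function scope so
the pointer stored in first_teid_ptr stays valid on later iterations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ice_sched_add_nodes_to_hw_layer(): "adds" up
 * to max_per_call nodes and reports the first TEID through *first_teid
 * when the caller supplies a pointer.
 */
static uint16_t add_batch(uint16_t want, uint16_t max_per_call,
			  uint32_t next_teid, uint32_t *first_teid)
{
	uint16_t added = want < max_per_call ? want : max_per_call;

	if (first_teid && added)
		*first_teid = next_teid;
	return added;
}

/* Fixed shape: temp is declared at function scope, so first_teid_ptr may
 * keep pointing at it across loop iterations.
 */
static uint16_t add_nodes(uint16_t num_nodes, uint16_t max_per_call,
			  uint32_t *first_node_teid)
{
	uint32_t *first_teid_ptr = first_node_teid;
	uint16_t total = 0;
	uint32_t temp = 0;	/* block-local inside the loop before the fix */

	assert(max_per_call > 0);

	while (total < num_nodes) {
		uint16_t added = add_batch((uint16_t)(num_nodes - total),
					   max_per_call, 100u + total,
					   first_teid_ptr);

		total = (uint16_t)(total + added);
		/* Later batches must not overwrite the caller's first TEID,
		 * so redirect them to the scratch variable.
		 */
		first_teid_ptr = &temp;
	}
	return total;
}
```

With temp declared inside the while body instead, the second call to
add_batch() would write through a pointer to an object whose lifetime
ended with the previous iteration, which is the pattern the fix removes.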
Fixes: 7fb09a737536 ("ice: Modify recursive way of adding nodes")
Link: https://github.com/llvm/llvm-project/pull/203270
Signed-off-by: NeKon69 <nobodqwe@gmail.com>
---
drivers/net/ethernet/intel/ice/ice_sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_sched.c b/drivers/net/ethernet/intel/ice/ice_sched.c
index fff0c1afdb41..089ad3967be5 100644
--- a/drivers/net/ethernet/intel/ice/ice_sched.c
+++ b/drivers/net/ethernet/intel/ice/ice_sched.c
@@ -1074,11 +1074,11 @@ ice_sched_add_nodes_to_layer(struct ice_port_info *pi,
u32 *first_teid_ptr = first_node_teid;
u16 new_num_nodes = num_nodes;
int status = 0;
+ u32 temp;
*num_nodes_added = 0;
while (*num_nodes_added < num_nodes) {
u16 max_child_nodes, num_added = 0;
- u32 temp;
status = ice_sched_add_nodes_to_hw_layer(pi, tc_node, parent,
layer, new_num_nodes,
--
2.54.0
^ permalink raw reply related
* [PATCH v3] flow_dissector: fix uninit-value in __skb_flow_dissect() for ETH_ADDRS
From: Yun Zhou @ 2026-06-13 11:00 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, horms, qingfang.deng, jiri
Cc: netdev, linux-kernel, yun.zhou
In-Reply-To: <20260609023752.1245848-1-yun.zhou@windriver.com>
__skb_flow_dissect() unconditionally reads 12 bytes from eth_hdr(skb)
when FLOW_DISSECTOR_KEY_ETH_ADDRS is requested. This assumes the skb
has a valid Ethernet header at mac_header, which is not always the case.
The problem can be triggered by:
1. Creating a TUN device in L3 mode (IFF_TUN, hard_header_len=0)
2. Attaching a multiq qdisc with a flower filter matching on eth_src
3. Sending a packet through AF_PACKET
Since TUN in L3 mode has no link-layer header, mac_header points to
the L3 data area. The flow dissector reads 12 bytes of uninitialized
skb memory, which then propagates through fl_set_masked_key() and is
used as a rhashtable lookup key in __fl_lookup(), as reported by KMSAN.
Rejecting the filter in the control path (at tc filter add time) is
not feasible because TC filter blocks can be shared between arbitrary
devices -- a filter installed on an Ethernet device may later classify
packets on a headerless device through a shared block. The device
association is not fixed at filter creation time.
Fix this in the data path by checking skb->dev->hard_header_len before
reading. If the device does not have a link-layer header large enough
to contain the Ethernet addresses, zero the key so the filter will not
match.
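As a rough illustration of the data-path check described above, here is
a self-contained userspace sketch. The types fake_dev, fake_skb and
eth_addrs_key and the helper dissect_eth_addrs() are hypothetical,
simplified stand-ins for the kernel structures, not the real API; only
the copy-or-zero decision mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct eth_addrs_key {
	unsigned char dst[ETH_ALEN];
	unsigned char src[ETH_ALEN];
};

struct fake_dev {
	unsigned short hard_header_len;	/* 0 for an L3 TUN device */
};

struct fake_skb {
	const struct fake_dev *dev;
	const unsigned char *mac_header;
};

/* Copy the Ethernet addresses only when the device has a link-layer
 * header large enough to hold them; otherwise zero the key so that a
 * filter on eth_src/eth_dst does not match on stale bytes.
 */
static void dissect_eth_addrs(const struct fake_skb *skb,
			      struct eth_addrs_key *key)
{
	assert(skb && key);

	if (skb->dev && skb->dev->hard_header_len >= sizeof(*key))
		memcpy(key, skb->mac_header, sizeof(*key));
	else
		memset(key, 0, sizeof(*key));
}
```

For a device with hard_header_len of 14 the key receives the first 12
bytes of the frame; for a headerless device the key is all zeroes.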
Reported-by: syzbot+fa2f5b1fb06147be5e16@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=fa2f5b1fb06147be5e16
Fixes: 67a900cc0436 ("flow_dissector: introduce support for Ethernet addresses")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v3: Replace skb_tail_pointer() - skb_mac_header() length check with
skb->dev->hard_header_len check.
v2: Adjust commit message and comment.
net/core/flow_dissector.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 2a98f5fa74eb..0b235ec0743f 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1173,13 +1173,20 @@ bool __skb_flow_dissect(const struct net *net,
if (dissector_uses_key(flow_dissector,
FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
- struct ethhdr *eth = eth_hdr(skb);
struct flow_dissector_key_eth_addrs *key_eth_addrs;
key_eth_addrs = skb_flow_dissector_target(flow_dissector,
FLOW_DISSECTOR_KEY_ETH_ADDRS,
target_container);
- memcpy(key_eth_addrs, eth, sizeof(*key_eth_addrs));
+ /* TC filter blocks can be shared across devices with
+ * different header lengths, so we cannot validate this
+ * when the filter is installed -- check at dissect time.
+ */
+ if (skb->dev &&
+ skb->dev->hard_header_len >= sizeof(*key_eth_addrs))
+ memcpy(key_eth_addrs, eth_hdr(skb), sizeof(*key_eth_addrs));
+ else
+ memset(key_eth_addrs, 0, sizeof(*key_eth_addrs));
}
if (dissector_uses_key(flow_dissector,
--
2.43.0
^ permalink raw reply related
* Re: [PATCH net-next v2 05/15] tcp: allow mptcp to drop TS for some packets
From: Simon Baatz @ 2026-06-13 11:16 UTC (permalink / raw)
To: Matthieu Baerts
Cc: Mat Martineau, Geliang Tang, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, netdev, mptcp,
linux-kernel, Neal Cardwell, Kuniyuki Iwashima
In-Reply-To: <8d744049-65b7-4652-ae9f-a4a159fff56c@kernel.org>
Hi Matt,
On Thu, Jun 11, 2026 at 09:40:23PM +0200, Matthieu Baerts wrote:
> Hi Simon,
>
> Thank you for your review.
>
> 11 Jun 2026 19:18:29 Simon Baatz <gmbnomis@gmail.com>:
>
> > Hi Matt,
> >
> > On Fri, Jun 05, 2026 at 07:21:49PM +1000, Matthieu Baerts (NGI0) wrote:
> >> With TCP-timestamps (padded) taking 12 bytes and ADD_ADDR IPv6 + port
> >> taking 30 bytes, the 40-byte limit for the TCP options is reached. In
> >> this case, it is then not possible to send the address signal.
> >>
> >> The idea is to let MPTCP drop the TCP-timestamps option for some
> >> specific packets, to be able to send some specific pure ACK carrying >28
> >> bytes of MPTCP options, like with this specific ADD_ADDR. A new
> >> parameter is passed from tcp_established_options to the MPTCP side to
> >> indicate if the TCP TS option is used, and if it should be dropped. The
> >> next commit implements the part on MPTCP side, but split into two
> >> patches to help TCP maintainers to identify the modifications on TCP
> >> side. This feature will be controlled by a new add_addr_v6_port_drop_ts
> >> MPTCP sysctl knob.
> >>
> >> It is important to keep in mind that dropping the TCP timestamps option
> >> for one packet of the connection could eventually disrupt some
> >> middleboxes: even if it should be unlikely, they could drop the packet
> >> or even block the connection. That's why this new feature will be
> >> controlled by a sysctl knob.
> >
> > RFC 7323 (which obsoletes RFC 1323) specifies an "all or nothing"
> > approach for the TS option. Section 3.2 states:
> >
> > Once TSopt has been successfully negotiated, that is both <SYN> and
> > <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
> > segment for the duration of the connection, [...] If a
> > non-<RST> segment is received without a TSopt, a TCP SHOULD silently
> > drop the segment.
> >
> > So, selectively omitting the TS option on established subflows
> > appears to go against RFC 7323 at the TCP level. Do we consider that
> > an acceptable deviation for MPTCP subflows?
>
> Yes, to me, it is acceptable. Please note that here, TCP TS is only
> dropped on some specific packets -- ADD_ADDR with v6 + port --
> which are TCP pure ACK acking the same sequence as the previous
> one (dupack from a TCP point of view). So it is just a signalling packet,
> specific to MPTCP. If it is dropped by a middlebox, that's not nice, but
> that's OK.
>
> Or do you think something would break when this happens?
No, I just wanted to point out that an MPTCP peer with an underlying
TCP implementation that follows RFC 7323 strictly might drop these
packets (not only middleboxes). But given your explanation, this
sounds more like "bad luck" than a real breakage.
Thanks for the clarification.
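For readers following the thread, the option-space arithmetic from the
quoted commit message can be sketched as below. The three sizes are the
ones stated there (40-byte option limit, 12 bytes for padded timestamps,
30 bytes for ADD_ADDR with IPv6 address and port); the macro names and
the helper mptcp_opt_fits() are hypothetical and any further alignment
padding done by a real stack is ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Sizes as stated in the quoted commit message. */
#define TCP_OPTION_SPACE	40	/* maximum TCP option bytes */
#define TCP_TS_PADDED_LEN	12	/* timestamps option, padded */
#define ADD_ADDR6_PORT_LEN	30	/* ADD_ADDR with IPv6 address + port */

/* Hypothetical helper: does an MPTCP option of mptcp_len bytes fit into
 * the TCP option space, with or without the timestamps option?
 */
static bool mptcp_opt_fits(unsigned int mptcp_len, bool with_ts)
{
	unsigned int used = with_ts ? TCP_TS_PADDED_LEN : 0;

	assert(mptcp_len <= TCP_OPTION_SPACE);
	return used + mptcp_len <= TCP_OPTION_SPACE;
}
```

With timestamps, 12 + 30 = 42 bytes exceeds the limit, while anything up
to 28 bytes of MPTCP options still fits; without timestamps the 30-byte
ADD_ADDR fits.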
- Simon
--
Simon Baatz <gmbnomis@gmail.com>
^ permalink raw reply
* Re: [PATCH v3] flow_dissector: fix uninit-value in __skb_flow_dissect() for ETH_ADDRS
From: Zhou, Yun @ 2026-06-13 11:29 UTC (permalink / raw)
To: davem@davemloft.net, edumazet@google.com, kuba@kernel.org,
pabeni@redhat.com, horms@kernel.org, qingfang.deng@linux.dev,
jiri@resnulli.us
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260613110055.2318264-1-yun.zhou@windriver.com>
Superseded. I will launch a new thread later.
^ permalink raw reply related
* [PATCH net-next 0/8] net: mdio: realtek-rtl9300: Add RTL83xx support
From: Markus Stockhausen @ 2026-06-13 11:29 UTC (permalink / raw)
To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
chris.packham, daniel, robh, krzk+dt, conor+dt, devicetree
Cc: Markus Stockhausen
The Realtek Otto switch platform consists of four different series:
- RTL838x aka maple : 28 port 1G Switches
- RTL839x aka cypress : 52 port 1G Switches
- RTL930x aka longan : 28 port 1G/2.5G/10G Switches
- RTL931x aka mango : 56 port 1G/2.5G/10G Switches
While the MDIO hardware polling unit and its necessity for the MAC
layer have long been understood in principle, no detailed documentation
was available. For this series the MDIO bus was inspected with a logic
analyzer to better understand how polling and kernel access interact on
the bus. The findings are now explained in the driver comments.
This patch series adds support for the RTL83xx devices. To do so:
- Enhance the device tree binding.
- Add special handling for limitations enforced by hardware polling.
These already have minor side effects on RTL93xx devices but are even
more critical for the RTL83xx hardware.
- Add the RTL83xx code.
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
---
Markus Stockhausen (8):
dt-bindings: net: realtek,rtl9301-mdio: Add RTL83xx series
net: mdio: realtek-rtl9300: Add polling documentation
net: mdio: realtek-rtl9300: Add page tracking
net: mdio: realtek-rtl9300: Configure hardware polling during probing
net: mdio: realtek-rtl9300: Add c45 over c22 mitigation
net: mdio: realtek-rtl9300: Increase MDIO timeout
net: mdio: realtek-rtl9300: Add support for RTL838x
net: mdio: realtek-rtl9300: Add support for RTL839x
.../bindings/net/realtek,rtl9301-mdio.yaml | 12 +
drivers/net/mdio/mdio-realtek-rtl9300.c | 399 +++++++++++++++++-
2 files changed, 398 insertions(+), 13 deletions(-)
--
2.54.0
^ permalink raw reply
* [PATCH net-next 3/8] net: mdio: realtek-rtl9300: Add page tracking
From: Markus Stockhausen @ 2026-06-13 11:29 UTC (permalink / raw)
To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
chris.packham, daniel, robh, krzk+dt, conor+dt, devicetree
Cc: Markus Stockhausen
In-Reply-To: <20260613112946.1071411-1-markus.stockhausen@gmx.de>
The hardware polling unit of the Realtek switches handles PHY register 31
(aka the Realtek page register) in a very special way.
- On the RTL838x it is permanently reset to zero.
- On other devices the register is saved and restored (aka parked)
in the background.
This makes access to PHYs a gamble.
As of now all known existing hardware designs have Realtek based 1G PHYs.
Otherwise the polling engine and the MAC status update will not work at
all and the vendor SDK would fail totally.
This driver differentiates clearly between c22 and c45 buses. During
probing it enables only one of the protocols for a bus. So it is safe
to assume that any c22 access will only target a Realtek based 1G PHY.
Intercept access to register 31 and store the desired value for each port
in the driver. When issuing access to other registers add the saved page.
With this in place, the hardware runs two consecutive c22 commands that
are not interrupted by polling.
... hardware poll ...
phy_write(phy, 31, page)
phy_write(phy, reg, value)
... hardware poll ...
Note: to keep this simple, writes to register 31 are only accepted if
the value is lower than the device-specific raw page, i.e. in the range
0..4094 or 0..8190. Otherwise -EINVAL is returned. Under the above
assumption (only 1G Realtek PHYs on a c22 bus) this is no limitation.
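The interception scheme can be sketched in isolation as follows. This is
a simplified userspace model, not the driver: struct emdio, hw_write()
and emdio_write_c22() are hypothetical names, RAW_PAGE is fixed to 4095
for illustration, and the "hardware" merely records which page and
register the last command carried.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_PORTS	4
#define PAGE_REG	31
#define RAW_PAGE	4095	/* hypothetical device-specific raw page */

struct emdio {
	uint16_t page[MAX_PORTS];	/* cached register 31 value per port */
	uint16_t last_page;		/* page carried by the last HW command */
	int last_reg;			/* register carried by the last HW command */
};

/* Hypothetical stand-in for the hardware command: the page travels with
 * the register access, so polling cannot slip in between the two.
 */
static int hw_write(struct emdio *e, uint16_t page, int reg, uint16_t val)
{
	(void)val;
	e->last_page = page;
	e->last_reg = reg;
	return 0;
}

static int emdio_write_c22(struct emdio *e, int port, int reg, uint16_t val)
{
	assert(port >= 0 && port < MAX_PORTS);

	if (reg == PAGE_REG) {
		if (val >= RAW_PAGE)
			return -EINVAL;
		e->page[port] = val;	/* remember it, do not touch the bus */
		return 0;
	}

	/* Any other register is accessed together with the saved page. */
	return hw_write(e, e->page[port], reg, val);
}
```

A write to register 31 therefore never reaches the bus on its own; the
cached page is attached to the next access of any other register on the
same port.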
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
---
drivers/net/mdio/mdio-realtek-rtl9300.c | 32 +++++++++++++++++--------
1 file changed, 22 insertions(+), 10 deletions(-)
diff --git a/drivers/net/mdio/mdio-realtek-rtl9300.c b/drivers/net/mdio/mdio-realtek-rtl9300.c
index da2864c94d2c..c3a9eeca3154 100644
--- a/drivers/net/mdio/mdio-realtek-rtl9300.c
+++ b/drivers/net/mdio/mdio-realtek-rtl9300.c
@@ -193,6 +193,7 @@ struct otto_emdio_priv {
struct regmap *regmap;
struct mutex lock; /* protect HW access */
DECLARE_BITMAP(valid_ports, MAX_PORTS);
+ u16 page[MAX_PORTS];
u8 smi_bus[MAX_PORTS];
u8 smi_addr[MAX_PORTS];
bool smi_bus_is_c45[MAX_SMI_BUSSES];
@@ -337,7 +338,7 @@ static int otto_emdio_9300_read_c22(struct mii_bus *bus, int port, int regnum, u
struct otto_emdio_cmd_regs cmd_data = {
.c22_data = FIELD_PREP(RTL9300_PHY_CTRL_REG_ADDR, regnum) |
FIELD_PREP(RTL9300_PHY_CTRL_PARK_PAGE, 0x1f) |
- FIELD_PREP(RTL9300_PHY_CTRL_MAIN_PAGE, RAW_PAGE(priv)),
+ FIELD_PREP(RTL9300_PHY_CTRL_MAIN_PAGE, priv->page[port]),
.io_data = FIELD_PREP(RTL9300_PHY_CTRL_INDATA, port),
};
@@ -351,7 +352,7 @@ static int otto_emdio_9300_write_c22(struct mii_bus *bus, int port, int regnum,
struct otto_emdio_cmd_regs cmd_data = {
.c22_data = FIELD_PREP(RTL9300_PHY_CTRL_REG_ADDR, regnum) |
FIELD_PREP(RTL9300_PHY_CTRL_PARK_PAGE, 0x1f) |
- FIELD_PREP(RTL9300_PHY_CTRL_MAIN_PAGE, RAW_PAGE(priv)),
+ FIELD_PREP(RTL9300_PHY_CTRL_MAIN_PAGE, priv->page[port]),
.io_data = FIELD_PREP(RTL9300_PHY_CTRL_INDATA, value),
.port_mask_low = BIT(port),
};
@@ -391,7 +392,7 @@ static int otto_emdio_9310_read_c22(struct mii_bus *bus, int port, int regnum, u
struct otto_emdio_cmd_regs cmd_data = {
.broadcast = FIELD_PREP(RTL9310_BC_PORT_ID, port),
.c22_data = FIELD_PREP(RTL9310_PHY_CTRL_REG_ADDR, regnum) |
- FIELD_PREP(RTL9310_PHY_CTRL_MAIN_PAGE, RAW_PAGE(priv)),
+ FIELD_PREP(RTL9310_PHY_CTRL_MAIN_PAGE, priv->page[port]),
};
return otto_emdio_read_cmd(bus, RTL9310_PHY_CTRL_TYPE_C22, &cmd_data,
@@ -403,7 +404,7 @@ static int otto_emdio_9310_write_c22(struct mii_bus *bus, int port, int regnum,
struct otto_emdio_priv *priv = otto_emdio_bus_to_priv(bus);
struct otto_emdio_cmd_regs cmd_data = {
.c22_data = FIELD_PREP(RTL9310_PHY_CTRL_REG_ADDR, regnum) |
- FIELD_PREP(RTL9310_PHY_CTRL_MAIN_PAGE, RAW_PAGE(priv)),
+ FIELD_PREP(RTL9310_PHY_CTRL_MAIN_PAGE, priv->page[port]),
.io_data = FIELD_PREP(RTL9310_PHY_CTRL_INDATA, value),
.port_mask_high = (u32)(BIT_ULL(port) >> 32),
.port_mask_low = (u32)(BIT_ULL(port)),
@@ -442,15 +443,19 @@ static int otto_emdio_9310_write_c45(struct mii_bus *bus, int port,
static int otto_emdio_read_c22(struct mii_bus *bus, int phy_id, int regnum)
{
struct otto_emdio_priv *priv = otto_emdio_bus_to_priv(bus);
- int ret, port;
+ int port, ret = 0;
u32 value;
port = otto_emdio_phy_to_port(bus, phy_id);
if (port < 0)
return port;
- scoped_guard(mutex, &priv->lock)
+ scoped_guard(mutex, &priv->lock) {
+ if (regnum == 31)
+ return priv->page[port];
+
ret = priv->info->read_c22(bus, port, regnum, &value);
+ }
return ret ? ret : value;
}
@@ -458,16 +463,23 @@ static int otto_emdio_read_c22(struct mii_bus *bus, int phy_id, int regnum)
static int otto_emdio_write_c22(struct mii_bus *bus, int phy_id, int regnum, u16 value)
{
struct otto_emdio_priv *priv = otto_emdio_bus_to_priv(bus);
- int ret, port;
+ int port;
port = otto_emdio_phy_to_port(bus, phy_id);
if (port < 0)
return port;
- scoped_guard(mutex, &priv->lock)
- ret = priv->info->write_c22(bus, port, regnum, value);
+ scoped_guard(mutex, &priv->lock) {
+ if (regnum == 31) {
+ if (value >= RAW_PAGE(priv))
+ return -EINVAL;
- return ret;
+ priv->page[port] = value;
+ return 0;
+ }
+
+ return priv->info->write_c22(bus, port, regnum, value);
+ }
}
static int otto_emdio_read_c45(struct mii_bus *bus, int phy_id, int dev_addr, int regnum)
--
2.54.0
^ permalink raw reply related
* [PATCH net-next 7/8] net: mdio: realtek-rtl9300: Add support for RTL838x
From: Markus Stockhausen @ 2026-06-13 11:29 UTC (permalink / raw)
To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
chris.packham, daniel, robh, krzk+dt, conor+dt, devicetree
Cc: Markus Stockhausen
In-Reply-To: <20260613112946.1071411-1-markus.stockhausen@gmx.de>
The MDIO driver has been prepared for multiple device support. Add all
required bits for the RTL838x (aka maple) series. This is straightforward
but some things are worth mentioning.
- The device has a lot in common with the RTL930x series. 28 ports, 4096
(Realtek) pages, 4 MMIO registers
- The MDIO engine has no fail bit. Thus the mask is set to zero
- There is only one SMI bus for 1G PHYs. No bus_map_base register exists.
- The setup_controller() function needs no c45 setup but must activate
the PHY access.
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
---
drivers/net/mdio/mdio-realtek-rtl9300.c | 109 ++++++++++++++++++++++++
1 file changed, 109 insertions(+)
diff --git a/drivers/net/mdio/mdio-realtek-rtl9300.c b/drivers/net/mdio/mdio-realtek-rtl9300.c
index 244af5fdeaf3..d9ff0b0aecbb 100644
--- a/drivers/net/mdio/mdio-realtek-rtl9300.c
+++ b/drivers/net/mdio/mdio-realtek-rtl9300.c
@@ -117,6 +117,28 @@
#include <linux/property.h>
#include <linux/regmap.h>
+#define RTL8380_NUM_BUSES 1
+#define RTL8380_NUM_PAGES 4096
+#define RTL8380_NUM_PORTS 28
+#define RTL8380_SMI_GLB_CTRL 0xa100
+#define RTL8380_SMI_PHY_PATCH_DONE BIT(15)
+#define RTL8380_SMI_ACCESS_PHY_CTRL_0 0xa1b8
+#define RTL8380_SMI_ACCESS_PHY_CTRL_1 0xa1bc
+#define RTL8380_PHY_CTRL_REG_ADDR GENMASK(24, 20)
+#define RTL8380_PHY_CTRL_PARK_PAGE GENMASK(19, 15)
+#define RTL8380_PHY_CTRL_MAIN_PAGE GENMASK(14, 3)
+#define RTL8380_PHY_CTRL_WRITE BIT(2)
+#define RTL8380_PHY_CTRL_READ 0
+#define RTL8380_PHY_CTRL_TYPE_C45 BIT(1)
+#define RTL8380_PHY_CTRL_TYPE_C22 0
+#define RTL8380_PHY_CTRL_FAIL 0 /* no fail indicator */
+#define RTL8380_SMI_ACCESS_PHY_CTRL_2 0xa1c0
+#define RTL8380_PHY_CTRL_INDATA GENMASK(31, 16)
+#define RTL8380_PHY_CTRL_DATA GENMASK(15, 0)
+#define RTL8380_SMI_ACCESS_PHY_CTRL_3 0xa1c4
+#define RTL8380_SMI_POLL_CTRL 0xa17c
+#define RTL8380_SMI_PORT0_5_ADDR_CTRL 0xa1c8
+
#define RTL9300_NUM_BUSES 4
#define RTL9300_NUM_PAGES 4096
#define RTL9300_NUM_PORTS 28
@@ -381,6 +403,60 @@ static int otto_emdio_write_cmd(struct mii_bus *bus, u32 cmd,
return otto_emdio_run_cmd(bus, cmd | priv->info->cmd_write, cmd_data);
}
+static int otto_emdio_8380_read_c22(struct mii_bus *bus, int port, int regnum, u32 *value)
+{
+ struct otto_emdio_priv *priv = otto_emdio_bus_to_priv(bus);
+ struct otto_emdio_cmd_regs cmd_data = {
+ .c22_data = FIELD_PREP(RTL8380_PHY_CTRL_REG_ADDR, regnum) |
+ FIELD_PREP(RTL8380_PHY_CTRL_PARK_PAGE, 0x1f) |
+ FIELD_PREP(RTL8380_PHY_CTRL_MAIN_PAGE, priv->page[port]),
+ .io_data = FIELD_PREP(RTL8380_PHY_CTRL_INDATA, port),
+ };
+
+ return otto_emdio_read_cmd(bus, RTL8380_PHY_CTRL_TYPE_C22, &cmd_data,
+ RTL8380_PHY_CTRL_DATA, value);
+}
+
+static int otto_emdio_8380_write_c22(struct mii_bus *bus, int port, int regnum, u16 value)
+{
+ struct otto_emdio_priv *priv = otto_emdio_bus_to_priv(bus);
+ struct otto_emdio_cmd_regs cmd_data = {
+ .c22_data = FIELD_PREP(RTL8380_PHY_CTRL_REG_ADDR, regnum) |
+ FIELD_PREP(RTL8380_PHY_CTRL_PARK_PAGE, 0x1f) |
+ FIELD_PREP(RTL8380_PHY_CTRL_MAIN_PAGE, priv->page[port]),
+ .io_data = FIELD_PREP(RTL8380_PHY_CTRL_INDATA, value),
+ .port_mask_low = BIT(port),
+ };
+
+ return otto_emdio_write_cmd(bus, RTL8380_PHY_CTRL_TYPE_C22, &cmd_data);
+}
+
+static int otto_emdio_8380_read_c45(struct mii_bus *bus, int port,
+ int dev_addr, int regnum, u32 *value)
+{
+ struct otto_emdio_cmd_regs cmd_data = {
+ .c45_data = FIELD_PREP(PHY_CTRL_MMD_DEVAD, dev_addr) |
+ FIELD_PREP(PHY_CTRL_MMD_REG, regnum),
+ .io_data = FIELD_PREP(RTL8380_PHY_CTRL_INDATA, port),
+ };
+
+ return otto_emdio_read_cmd(bus, RTL8380_PHY_CTRL_TYPE_C45, &cmd_data,
+ RTL8380_PHY_CTRL_DATA, value);
+}
+
+static int otto_emdio_8380_write_c45(struct mii_bus *bus, int port,
+ int dev_addr, int regnum, u16 value)
+{
+ struct otto_emdio_cmd_regs cmd_data = {
+ .c45_data = FIELD_PREP(PHY_CTRL_MMD_DEVAD, dev_addr) |
+ FIELD_PREP(PHY_CTRL_MMD_REG, regnum),
+ .io_data = FIELD_PREP(RTL8380_PHY_CTRL_INDATA, value),
+ .port_mask_low = BIT(port),
+ };
+
+ return otto_emdio_write_cmd(bus, RTL8380_PHY_CTRL_TYPE_C45, &cmd_data);
+}
+
static int otto_emdio_9300_read_c22(struct mii_bus *bus, int port, int regnum, u32 *value)
{
struct otto_emdio_priv *priv = otto_emdio_bus_to_priv(bus);
@@ -615,6 +691,16 @@ static int otto_emdio_setup_topology(struct otto_emdio_priv *priv)
return 0;
}
+static int otto_emdio_8380_setup_controller(struct otto_emdio_priv *priv)
+{
+ /*
+ * PHY_PATCH_DONE enables PHY control via SoC. This is required for PHY access, including
+ * patching and must be set before the PHYs are probed.
+ */
+ return regmap_set_bits(priv->regmap, RTL8380_SMI_GLB_CTRL,
+ RTL8380_SMI_PHY_PATCH_DONE);
+}
+
static int otto_emdio_9300_setup_controller(struct otto_emdio_priv *priv)
{
u32 glb_ctrl_mask = 0, glb_ctrl_val = 0;
@@ -855,6 +941,28 @@ static int otto_emdio_probe(struct platform_device *pdev)
return 0;
}
+static const struct otto_emdio_info otto_emdio_8380_info = {
+ .addr_map_base = RTL8380_SMI_PORT0_5_ADDR_CTRL,
+ .cmd_fail = RTL8380_PHY_CTRL_FAIL,
+ .cmd_read = RTL8380_PHY_CTRL_READ,
+ .cmd_write = RTL8380_PHY_CTRL_WRITE,
+ .cmd_regs = {
+ .c22_data = RTL8380_SMI_ACCESS_PHY_CTRL_1,
+ .c45_data = RTL8380_SMI_ACCESS_PHY_CTRL_3,
+ .io_data = RTL8380_SMI_ACCESS_PHY_CTRL_2,
+ .port_mask_low = RTL8380_SMI_ACCESS_PHY_CTRL_0,
+ },
+ .num_buses = RTL8380_NUM_BUSES,
+ .num_pages = RTL8380_NUM_PAGES,
+ .num_ports = RTL8380_NUM_PORTS,
+ .poll_ctrl = RTL8380_SMI_POLL_CTRL,
+ .setup_controller = otto_emdio_8380_setup_controller,
+ .read_c22 = otto_emdio_8380_read_c22,
+ .read_c45 = otto_emdio_8380_read_c45,
+ .write_c22 = otto_emdio_8380_write_c22,
+ .write_c45 = otto_emdio_8380_write_c45,
+};
+
static const struct otto_emdio_info otto_emdio_9300_info = {
.addr_map_base = RTL9300_SMI_PORT0_5_ADDR_CTRL,
.bus_map_base = RTL9300_SMI_PORT0_15_POLLING_SEL,
@@ -905,6 +1013,7 @@ static const struct otto_emdio_info otto_emdio_9310_info = {
};
static const struct of_device_id otto_emdio_ids[] = {
+ { .compatible = "realtek,rtl8380-mdio", .data = &otto_emdio_8380_info },
{ .compatible = "realtek,rtl9301-mdio", .data = &otto_emdio_9300_info },
{ .compatible = "realtek,rtl9311-mdio", .data = &otto_emdio_9310_info },
{}
--
2.54.0
^ permalink raw reply related
* [PATCH net-next 1/8] dt-bindings: net: realtek,rtl9301-mdio: Add RTL83xx series
From: Markus Stockhausen @ 2026-06-13 11:29 UTC (permalink / raw)
To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
chris.packham, daniel, robh, krzk+dt, conor+dt, devicetree
Cc: Markus Stockhausen
In-Reply-To: <20260613112946.1071411-1-markus.stockhausen@gmx.de>
The lower end Realtek Otto switches provide 1G only and are divided into
two series:
- Maple : RTL838x up to 28 ports
- Cypress: RTL839x up to 56 ports
The Maple based devices have 3 different SoCs: RTL8380, RTL8381 and
RTL8382. The Cypress series consists of the RTL8391, RTL8392 and
RTL8393 SoCs. The MDIO controller of these switches works like the
existing RTL93xx logic but has different characteristics and different
registers. Add new compatibles in the device tree.
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
---
.../bindings/net/realtek,rtl9301-mdio.yaml | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/Documentation/devicetree/bindings/net/realtek,rtl9301-mdio.yaml b/Documentation/devicetree/bindings/net/realtek,rtl9301-mdio.yaml
index 271e05bae9c5..de33364b67ef 100644
--- a/Documentation/devicetree/bindings/net/realtek,rtl9301-mdio.yaml
+++ b/Documentation/devicetree/bindings/net/realtek,rtl9301-mdio.yaml
@@ -12,6 +12,16 @@ maintainers:
properties:
compatible:
oneOf:
+ - items:
+ - enum:
+ - realtek,rtl8381-mdio
+ - realtek,rtl8382-mdio
+ - const: realtek,rtl8380-mdio
+ - items:
+ - enum:
+ - realtek,rtl8392-mdio
+ - realtek,rtl8393-mdio
+ - const: realtek,rtl8391-mdio
- items:
- enum:
- realtek,rtl9302b-mdio
@@ -24,6 +34,8 @@ properties:
- realtek,rtl9313-mdio
- const: realtek,rtl9311-mdio
- enum:
+ - realtek,rtl8380-mdio
+ - realtek,rtl8391-mdio
- realtek,rtl9301-mdio
- realtek,rtl9311-mdio
--
2.54.0
^ permalink raw reply related
* [PATCH net-next 4/8] net: mdio: realtek-rtl9300: Configure hardware polling during probing
From: Markus Stockhausen @ 2026-06-13 11:29 UTC (permalink / raw)
To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
chris.packham, daniel, robh, krzk+dt, conor+dt, devicetree
Cc: Markus Stockhausen
In-Reply-To: <20260613112946.1071411-1-markus.stockhausen@gmx.de>
During bus probing the PHYs are initialized and firmware might be
loaded. This often requires complex access sequences where the switch
hardware polling might interfere badly.
The polling can be configured with one or two 32 bit mask registers.
Each bit enables (=1) or disables (=0) the polling of the corresponding
port.
Provide a helper to enable/disable polling for a specific port. Use it
to disable hardware polling temporarily during bus probing and to enable
it afterwards according to the device tree topology. Nice side effect:
this patch brings the hardware polling into a consistent state on
devices where U-Boot does not take care of it.
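How a port number maps onto the one or two 32 bit mask registers can be
sketched in userspace C. The arithmetic mirrors the helper in this patch
(register offset = base + (port / 32) * 4, bit = port % 32); the names
struct poll_bit, poll_bit_for_port() and poll_assign() are hypothetical
and exist only for this illustration.

```c
#include <stdint.h>

struct poll_bit {
	uint32_t reg;  /* MMIO offset of the mask register holding the port */
	uint32_t mask; /* single bit for the port inside that register */
};

/* Locate the polling bit of a port, given the base of the mask registers. */
static struct poll_bit poll_bit_for_port(uint32_t poll_ctrl_base, int port)
{
	struct poll_bit pb = {
		.reg = poll_ctrl_base + (uint32_t)(port / 32) * 4,
		.mask = UINT32_C(1) << (port % 32),
	};

	return pb;
}

/* Set (active != 0) or clear (active == 0) the bit in a register value. */
static uint32_t poll_assign(uint32_t regval, uint32_t mask, int active)
{
	return active ? (regval | mask) : (regval & ~mask);
}
```

With the RTL9310 base 0x0ccc from this patch, port 40 lands in the second
register at 0x0cd0, bit 8; ports below 32 stay in the first register.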
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
---
drivers/net/mdio/mdio-realtek-rtl9300.c | 26 ++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)
diff --git a/drivers/net/mdio/mdio-realtek-rtl9300.c b/drivers/net/mdio/mdio-realtek-rtl9300.c
index c3a9eeca3154..a7fd075947b6 100644
--- a/drivers/net/mdio/mdio-realtek-rtl9300.c
+++ b/drivers/net/mdio/mdio-realtek-rtl9300.c
@@ -137,6 +137,7 @@
#define RTL9300_PHY_CTRL_INDATA GENMASK(31, 16)
#define RTL9300_PHY_CTRL_DATA GENMASK(15, 0)
#define RTL9300_SMI_ACCESS_PHY_CTRL_3 0xcb7c
+#define RTL9300_SMI_POLL_CTRL 0xca90
#define RTL9300_SMI_PORT0_5_ADDR_CTRL 0xcb80
#define RTL9310_NUM_BUSES 4
@@ -162,6 +163,7 @@
#define RTL9310_PHY_CTRL_INDATA GENMASK(15, 0)
#define RTL9310_SMI_INDRT_ACCESS_MMD_CTRL 0x0c18
#define RTL9310_SMI_PORT_ADDR_CTRL 0x0c74
+#define RTL9310_SMI_PORT_POLLING_CTRL 0x0ccc
#define RTL9310_SMI_PORT_POLLING_SEL 0x0c9c
#define PHY_CTRL_CMD BIT(0)
@@ -210,6 +212,7 @@ struct otto_emdio_info {
u8 num_buses;
u8 num_ports;
u16 num_pages;
+ u32 poll_ctrl;
int (*setup_controller)(struct otto_emdio_priv *priv);
int (*read_c22)(struct mii_bus *bus, int port, int regnum, u32 *value);
int (*read_c45)(struct mii_bus *bus, int port, int dev_addr, int regnum, u32 *value);
@@ -245,6 +248,12 @@ static struct otto_emdio_priv *otto_emdio_bus_to_priv(struct mii_bus *bus)
return chan->priv;
}
+static int otto_emdio_set_port_polling(struct otto_emdio_priv *priv, int port, bool active)
+{
+ return regmap_assign_bits(priv->regmap, priv->info->poll_ctrl + (port / 32) * 4,
+ BIT(port % 32), active);
+}
+
static int otto_emdio_run_cmd(struct mii_bus *bus, u32 cmd,
struct otto_emdio_cmd_regs *cmd_data)
{
@@ -735,7 +744,7 @@ static int otto_emdio_probe(struct platform_device *pdev)
{
struct device *dev = &pdev->dev;
struct otto_emdio_priv *priv;
- int err;
+ int port, err;
priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
if (!priv)
@@ -750,6 +759,13 @@ static int otto_emdio_probe(struct platform_device *pdev)
if (IS_ERR(priv->regmap))
return PTR_ERR(priv->regmap);
+ /* Avoid issues with complex firmware loads. */
+ for (port = 0; port < priv->info->num_ports; port++) {
+ err = otto_emdio_set_port_polling(priv, port, false);
+ if (err)
+ return err;
+ }
+
platform_set_drvdata(pdev, priv);
err = otto_emdio_map_ports(dev);
@@ -772,6 +788,12 @@ static int otto_emdio_probe(struct platform_device *pdev)
return err;
}
+ for_each_set_bit(port, priv->valid_ports, priv->info->num_ports) {
+ err = otto_emdio_set_port_polling(priv, port, true);
+ if (err)
+ return err;
+ }
+
return 0;
}
@@ -790,6 +812,7 @@ static const struct otto_emdio_info otto_emdio_9300_info = {
.num_buses = RTL9300_NUM_BUSES,
.num_ports = RTL9300_NUM_PORTS,
.num_pages = RTL9300_NUM_PAGES,
+ .poll_ctrl = RTL9300_SMI_POLL_CTRL,
.setup_controller = otto_emdio_9300_setup_controller,
.read_c22 = otto_emdio_9300_read_c22,
.read_c45 = otto_emdio_9300_read_c45,
@@ -815,6 +838,7 @@ static const struct otto_emdio_info otto_emdio_9310_info = {
.num_buses = RTL9310_NUM_BUSES,
.num_pages = RTL9310_NUM_PAGES,
.num_ports = RTL9310_NUM_PORTS,
+ .poll_ctrl = RTL9310_SMI_PORT_POLLING_CTRL,
.setup_controller = otto_emdio_9310_setup_controller,
.read_c22 = otto_emdio_9310_read_c22,
.read_c45 = otto_emdio_9310_read_c45,
--
2.54.0
^ permalink raw reply related
* [PATCH net-next 2/8] net: mdio: realtek-rtl9300: Add polling documentation
From: Markus Stockhausen @ 2026-06-13 11:29 UTC (permalink / raw)
To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
chris.packham, daniel, robh, krzk+dt, conor+dt, devicetree
Cc: Markus Stockhausen
In-Reply-To: <20260613112946.1071411-1-markus.stockhausen@gmx.de>
Add a detailed explanation of how the hardware polling unit in the
Realtek Otto switches works. This simplifies developing and reviewing
future patches.
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
---
drivers/net/mdio/mdio-realtek-rtl9300.c | 66 +++++++++++++++++++++++++
1 file changed, 66 insertions(+)
diff --git a/drivers/net/mdio/mdio-realtek-rtl9300.c b/drivers/net/mdio/mdio-realtek-rtl9300.c
index 892ed3780a65..da2864c94d2c 100644
--- a/drivers/net/mdio/mdio-realtek-rtl9300.c
+++ b/drivers/net/mdio/mdio-realtek-rtl9300.c
@@ -35,6 +35,72 @@
*
* The driver works out the mapping based on the MDIO bus described in device tree and phandles on
* the ethernet-ports property.
+ *
+ * The devices have a hardware polling unit that runs in the background without any CPU load. It
+ * constantly scans the MDIO bus and the attached PHYs and updates the MAC status registers.
+ *
+ * How does the polling work?
+ *
+ * Each device has a SMI_POLL_CTRL register. A per-port bitmask decides if the hardware polling of
+ * the associated bus/address is active or not. The hardware runs a tight loop over this and for
+ * each set polling bit it issues a status check for the PHY. Attaching a logic analyzer to the
+ * MDIO bus of an RTL8380 and RTL8393 gives the following commands (in kernel notation):
+ *
+ * RTL8380 RTL8393
+ * --------------------------- ---------------------------
+ * phy_write(phy, 31, 0x0); phy_read(phy, 0);
+ * phy_write(phy, 13, 0x7); phy_read(phy, 1);
+ * phy_write(phy, 14, 0x3c); phy_read(phy, 4);
+ * phy_write(phy, 13, 0x8007); phy_read(phy, 5);
+ * phy_read(phy, 14); phy_read(phy, 6);
+ * phy_write(phy, 13, 0x7); phy_read(phy, 9);
+ * phy_write(phy, 14, 0x3d); phy_read(phy, 10);
+ * phy_write(phy, 13, 0x8007); phy_read(phy, 15);
+ * phy_read(phy, 14); phy_write(phy, 13, 0x7);
+ * phy_read(phy, 9); phy_write(phy, 14, 0x3c);
+ * phy_read(phy, 10); phy_write(phy, 13, 0x4007);
+ * phy_read(phy, 15); phy_read(phy, 14);
+ * phy_read(phy, 0); phy_write(phy, 13, 0x7);
+ * phy_read(phy, 1); phy_write(phy, 14, 0x3d);
+ * phy_read(phy, 4); phy_write(phy, 13, 0x4007);
+ * phy_read(phy, 5); phy_read(phy, 14);
+ * phy_read(phy, 6);
+ *
+ * The c45 over c22 register 13/14 sequences read MDIO_AN_EEE_ADV and MDIO_AN_EEE_LPABLE. As soon
+ * as one PHY status is read, the polling engine goes over to the next PHY. Basically the bus is
+ * always busy and the MAC status is updated in realtime.
+ *
+ * How does MDIO access from kernel work?
+ *
+ * When issuing MDIO accesses via an MMIO based interface the final write to the command register
+ * sets a "run command now" bit. Between two polling sequences for different PHYs the hardware
+ * checks if a user command needs to run and sends it onto the bus. Afterwards it simply continues
+ * its polling work. Inspecting the command sequence for a paged read on the logic analyzer gives:
+ *
+ * RTL8380 RTL8393
+ * --------------------------- ---------------------------
+ * phy_write(phy, 31, page); phy_write(phy, 31, page);
+ * phy_write(phy, reg, value); phy_write(phy, reg, value);
+ * phy_write(phy, 31, 0);
+ *
+ * What does this mean?
+ *
+ * There are slight differences in polling and PHY access between the models but the challenge
+ * stays the same. On the one hand that greatly simplifies the MAC layer, on the other hand it
+ * has some implications for the kernel PHY subsystem.
+ *
+ * - Without the polling and a proper MAC status, some of the link handling features do not work.
+ * Especially an unpopulated MAC_LINK_STS register cancels operations to other MAC registers.
+ * - The Realtek page register 31 is magically modified in the background. On the RTL838x it is
+ * simply reset. Other devices have hardware mitigations for this in place.
+ * - A c45 over c22 kernel access sequence is most likely to fail because chances are high that
+ * the polling engine overwrites registers 13/14 in between.
+ * - PHY firmware loading can have issues. Especially if a PHY is designed to expect a clean
+ * sequence of registers and values without deviation.
+ * - An access to one PHY will need to wait for the next free slot of the polling engine.
+ *
+ * Conclusion: Kernel access to the PHYs must know and handle any interference that arises from
+ * the above described hardware polling.
*/
#include <linux/bitfield.h>
--
2.54.0
^ permalink raw reply related
* [PATCH net-next 5/8] net: mdio: realtek-rtl9300: Add c45 over c22 mitigation
From: Markus Stockhausen @ 2026-06-13 11:29 UTC (permalink / raw)
To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
chris.packham, daniel, robh, krzk+dt, conor+dt, devicetree
Cc: Markus Stockhausen
In-Reply-To: <20260613112946.1071411-1-markus.stockhausen@gmx.de>
When reading the PHY state on c22 based buses the hardware polling unit
reads the EEE status with a sequence similar to this:
...
phy_write(phy, 31, 0x0);
phy_write(phy, 13, 0x7); /* c45 over c22 MDIO_AN_EEE_ADV */
phy_write(phy, 14, 0x3c);
phy_write(phy, 13, 0x8007);
phy_read(phy, 14);
phy_write(phy, 13, 0x7); /* c45 over c22 MDIO_AN_EEE_LPABLE */
phy_write(phy, 14, 0x3d);
phy_write(phy, 13, 0x8007);
...
If the Linux kernel wants to do the same in mmd_phy_read() via a call to
mmd_phy_indirect() this most likely fails. The commands are issued in a
straight sequence but between two of them the hardware polling might run
a status check for the same PHY. This effectively breaks the kernel access
and makes c45 over c22 unusable.
Detailed analysis shows that for RTL838x and RTL931x polling can be
safely deactivated during operation. The MAC layer will continue to show
the last known state. RTL839x is the exception: as soon as polling is
disabled the MAC link status register shows "port down".
Enhance the driver to detect this register 13/14/13/14 access sequence.
Before the first access to register 13 of a PHY disable polling for the
corresponding port. Reenable polling as soon as the sequence is finished
or any other unexpected input is detected. Some details about the stop
and start timing:
- The stopping is issued inflight while the polling engine is working.
After it is finished no new polling for the port will be issued (tested
with only one port with active polling).
- Reenabling the polling engine happens within ~25us after the last
command of the MMD sequence. This is mostly due to MMIO overhead.
Technically speaking, add a simple state machine that increments a
per-port MMD counter for each successful step of the sequence. When the
first command starts (counter=1) stop polling. When the last command
finishes (counter=4) or unexpected data is sent start polling.
Additionally:
- Add a global "initialization done" tracker that stops the mechanism
from kicking in during bus probing.
- Add a global "link flapping" option that allows disabling the state
tracker completely for the to-be-added RTL839x series.
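The state machine described above can be exercised in isolation with a
small userspace model. It follows the counter logic of this patch but
replaces the regmap access with a plain flag; struct mmd_track,
mmd_prefix(), mmd_postfix() and mmd_access() are hypothetical names, and
the init_done and link_flap shortcuts are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_MMD_CTRL 13 /* same value as MII_MMD_CTRL */
#define SK_MMD_DATA 14 /* same value as MII_MMD_DATA */

struct mmd_track {
	uint8_t state; /* number of matched steps of the 13/14/13/14 sequence */
	bool polling;  /* stands in for the per-port polling bit */
};

/* Runs before the bus access: advance or reset the sequence counter. */
static void mmd_prefix(struct mmd_track *t, int regnum)
{
	int expected = (t->state & 1) ? SK_MMD_DATA : SK_MMD_CTRL;
	uint8_t newstate = (regnum == expected) ? t->state + 1 : 0;

	/* First step stops polling, an aborted sequence restarts it. */
	if (newstate == 1 || newstate < t->state)
		t->polling = !newstate;
	t->state = newstate;
}

/* Runs after the bus access: a completed sequence restarts polling. */
static void mmd_postfix(struct mmd_track *t)
{
	if (t->state != 4)
		return;

	t->state = 0;
	t->polling = true;
}

/* One c22 access as seen by the tracker; the bus transfer is omitted. */
static void mmd_access(struct mmd_track *t, int regnum)
{
	mmd_prefix(t, regnum);
	mmd_postfix(t);
}
```

Feeding registers 13, 14, 13, 14 keeps polling off from the first access
until the fourth one completes; any other register in between resets the
counter and turns polling back on immediately.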
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
---
drivers/net/mdio/mdio-realtek-rtl9300.c | 62 ++++++++++++++++++++++++-
1 file changed, 60 insertions(+), 2 deletions(-)
diff --git a/drivers/net/mdio/mdio-realtek-rtl9300.c b/drivers/net/mdio/mdio-realtek-rtl9300.c
index a7fd075947b6..e206ee3e2b1c 100644
--- a/drivers/net/mdio/mdio-realtek-rtl9300.c
+++ b/drivers/net/mdio/mdio-realtek-rtl9300.c
@@ -196,10 +196,12 @@ struct otto_emdio_priv {
struct mutex lock; /* protect HW access */
DECLARE_BITMAP(valid_ports, MAX_PORTS);
u16 page[MAX_PORTS];
+ u8 mmd_state[MAX_PORTS];
u8 smi_bus[MAX_PORTS];
u8 smi_addr[MAX_PORTS];
bool smi_bus_is_c45[MAX_SMI_BUSSES];
struct mii_bus *bus[MAX_SMI_BUSSES];
+ bool init_done;
};
struct otto_emdio_info {
@@ -209,6 +211,7 @@ struct otto_emdio_info {
u32 cmd_read;
u32 cmd_write;
struct otto_emdio_cmd_regs cmd_regs;
+ bool link_flap;
u8 num_buses;
u8 num_ports;
u16 num_pages;
@@ -254,6 +257,43 @@ static int otto_emdio_set_port_polling(struct otto_emdio_priv *priv, int port, b
BIT(port % 32), active);
}
+static int otto_emdio_mmd_prefix(struct otto_emdio_priv *priv, int port, int regnum)
+{
+ u8 newstate, *state = &priv->mmd_state[port];
+ int expected, ret = 0;
+
+ if (!priv->init_done)
+ return 0;
+ /*
+ * Disabled polling might produce link flapping and false notification interrupts on the
+ * MAC layer. In this case disable c45 over c22 MMD access because chances are high that
+ * the register 13/14/13/14 sequence is intercepted by a parallel hardware access. As
+ * a workaround the PHY must provide its own mmd read/write() callbacks and redirect to
+ * normal c22 registers. See rtlgen_read_mmd().
+ */
+ if (priv->info->link_flap)
+ return (regnum == MII_MMD_DATA || regnum == MII_MMD_CTRL) ? -EIO : 0;
+
+ expected = (*state & 1) ? MII_MMD_DATA : MII_MMD_CTRL;
+ newstate = regnum == expected ? *state + 1 : 0;
+
+ if (newstate == 1 || newstate < *state)
+ ret = otto_emdio_set_port_polling(priv, port, !newstate);
+ *state = newstate;
+
+ return ret;
+}
+
+static int otto_emdio_mmd_postfix(struct otto_emdio_priv *priv, int port, int regnum)
+{
+ if (priv->mmd_state[port] != 4)
+ return 0;
+
+ priv->mmd_state[port] = 0;
+
+ return otto_emdio_set_port_polling(priv, port, true);
+}
+
static int otto_emdio_run_cmd(struct mii_bus *bus, u32 cmd,
struct otto_emdio_cmd_regs *cmd_data)
{
@@ -463,7 +503,15 @@ static int otto_emdio_read_c22(struct mii_bus *bus, int phy_id, int regnum)
if (regnum == 31)
return priv->page[port];
+ ret = otto_emdio_mmd_prefix(priv, port, regnum);
+ if (ret)
+ return ret;
+
ret = priv->info->read_c22(bus, port, regnum, &value);
+ if (ret)
+ return ret;
+
+ ret = otto_emdio_mmd_postfix(priv, port, regnum);
}
return ret ? ret : value;
@@ -472,7 +520,7 @@ static int otto_emdio_read_c22(struct mii_bus *bus, int phy_id, int regnum)
static int otto_emdio_write_c22(struct mii_bus *bus, int phy_id, int regnum, u16 value)
{
struct otto_emdio_priv *priv = otto_emdio_bus_to_priv(bus);
- int port;
+ int port, ret;
port = otto_emdio_phy_to_port(bus, phy_id);
if (port < 0)
@@ -487,7 +535,15 @@ static int otto_emdio_write_c22(struct mii_bus *bus, int phy_id, int regnum, u16
return 0;
}
- return priv->info->write_c22(bus, port, regnum, value);
+ ret = otto_emdio_mmd_prefix(priv, port, regnum);
+ if (ret)
+ return ret;
+
+ ret = priv->info->write_c22(bus, port, regnum, value);
+ if (ret)
+ return ret;
+
+ return otto_emdio_mmd_postfix(priv, port, regnum);
}
}
@@ -794,6 +850,8 @@ static int otto_emdio_probe(struct platform_device *pdev)
return err;
}
+ priv->init_done = true;
+
return 0;
}
--
2.54.0
^ permalink raw reply related
* [PATCH net-next 6/8] net: mdio: realtek-rtl9300: Increase MDIO timeout
From: Markus Stockhausen @ 2026-06-13 11:29 UTC (permalink / raw)
To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
chris.packham, daniel, robh, krzk+dt, conor+dt, devicetree
Cc: Markus Stockhausen
In-Reply-To: <20260613112946.1071411-1-markus.stockhausen@gmx.de>
Access to the Realtek Otto ethernet MDIO bus must wait for a free slot
between two hardware polls. The polling sequence consists of at least
17 commands on the RTL8380 devices. This delay can be nicely seen when
disabling polling completely. The following times are measured from the
last register write that sets the command bit until hardware responds
with the command finished bit set.
- average c22 read with polling enabled on all ports: ~380us
- average c22 read with polling enabled on one port: ~380us
- average c22 read with polling completely disabled: ~180us
With a default MDIO bus frequency of 2.5MHz the bare hardware runtime
for a single command (32 bit preamble + 32 bit data) is ~25us. So the
hardware adds quite some overhead. On top of this comes the fact that
especially the RTL838x devices are low on resources (500MHz 4KEc core
with 16K cache).
Analysis on an RTL838x device with 28 ports shows PHY access timeouts
in roughly one of three boots while waiting for command completion. The
timeout is currently set to 1ms. From the above explanation one can see
that there is not much headroom left.
Increase the timeout to 5ms.
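The numbers above can be checked with a few lines of C. This is only a
back-of-the-envelope sketch: mdio_frame_ns() and headroom_x10() are
hypothetical helpers, and the 380us figure is the measured average from
this message, not a worst case.

```c
#include <stdint.h>

/* Wire time of one MDIO frame in nanoseconds at the given bus clock. */
static uint64_t mdio_frame_ns(uint32_t bits, uint32_t bus_hz)
{
	return (uint64_t)bits * 1000000000ULL / bus_hz;
}

/* Timeout divided by the typical access time, scaled by 10 (26 means 2.6x). */
static uint32_t headroom_x10(uint32_t timeout_us, uint32_t access_us)
{
	return timeout_us * 10 / access_us;
}
```

A 64 bit frame at 2.5MHz takes 25600ns on the wire. Against the measured
~380us average, a 1ms timeout leaves a factor of about 2.6, while 5ms
leaves a factor of about 13.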
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
---
drivers/net/mdio/mdio-realtek-rtl9300.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/mdio/mdio-realtek-rtl9300.c b/drivers/net/mdio/mdio-realtek-rtl9300.c
index e206ee3e2b1c..244af5fdeaf3 100644
--- a/drivers/net/mdio/mdio-realtek-rtl9300.c
+++ b/drivers/net/mdio/mdio-realtek-rtl9300.c
@@ -302,9 +302,9 @@ static int otto_emdio_run_cmd(struct mii_bus *bus, u32 cmd,
u32 cmdstate;
int ret;
- /* Defensive pre check just in case something goes horrible wrong */
+ /* Defensive pre check just in case something goes horribly wrong */
ret = regmap_read_poll_timeout(priv->regmap, info->cmd_regs.c22_data,
- cmdstate, !(cmdstate & PHY_CTRL_CMD), 10, 1000);
+ cmdstate, !(cmdstate & PHY_CTRL_CMD), 10, 5000);
if (ret)
return ret;
@@ -344,7 +344,7 @@ static int otto_emdio_run_cmd(struct mii_bus *bus, u32 cmd,
return ret;
ret = regmap_read_poll_timeout(priv->regmap, info->cmd_regs.c22_data,
- cmdstate, !(cmdstate & PHY_CTRL_CMD), 10, 1000);
+ cmdstate, !(cmdstate & PHY_CTRL_CMD), 10, 5000);
if (ret)
return ret;
--
2.54.0
^ permalink raw reply related
* [PATCH net-next 8/8] net: mdio: realtek-rtl9300: Add support for RTL839x
From: Markus Stockhausen @ 2026-06-13 11:29 UTC (permalink / raw)
To: andrew, hkallweit1, linux, davem, edumazet, kuba, pabeni, netdev,
chris.packham, daniel, robh, krzk+dt, conor+dt, devicetree
Cc: Markus Stockhausen
In-Reply-To: <20260613112946.1071411-1-markus.stockhausen@gmx.de>
The MDIO driver has been prepared for multiple device support. Add all
required bits for the RTL839x (aka cypress) series. This is straightforward
but some things are worth mentioning.
- The device has a lot in common with the RTL931x series. 8192 (Realtek)
pages and 7 MMIO registers
- There are two SMI buses for 1G PHYs. Neither the bus nor address map
register exists.
- The MAC layer shows link flapping when temporarily deactivating the
hardware polling for one port. Mark this in the info structure.
- There is not much to configure in hardware, so the setup_controller()
function is not needed.
Signed-off-by: Markus Stockhausen <markus.stockhausen@gmx.de>
---
drivers/net/mdio/mdio-realtek-rtl9300.c | 104 ++++++++++++++++++++++++
1 file changed, 104 insertions(+)
diff --git a/drivers/net/mdio/mdio-realtek-rtl9300.c b/drivers/net/mdio/mdio-realtek-rtl9300.c
index d9ff0b0aecbb..2ab1aeb85eed 100644
--- a/drivers/net/mdio/mdio-realtek-rtl9300.c
+++ b/drivers/net/mdio/mdio-realtek-rtl9300.c
@@ -139,6 +139,29 @@
#define RTL8380_SMI_POLL_CTRL 0xa17c
#define RTL8380_SMI_PORT0_5_ADDR_CTRL 0xa1c8
+#define RTL8390_NUM_BUSES 2
+#define RTL8390_NUM_PAGES 8192
+#define RTL8390_NUM_PORTS 52
+#define RTL8390_BCAST_PHYID_CTRL 0x03ec
+#define RTL8390_PHYREG_ACCESS_CTRL 0x03dc
+#define RTL8390_PHY_CTRL_REG_ADDR GENMASK(9, 5)
+#define RTL8390_PHY_CTRL_PARK_PAGE GENMASK(27, 23)
+#define RTL8390_PHY_CTRL_MAIN_PAGE GENMASK(22, 10)
+#define RTL8390_PHY_CTRL_FAIL BIT(1)
+#define RTL8390_PHY_CTRL_WRITE BIT(3)
+#define RTL8390_PHY_CTRL_READ 0
+#define RTL8390_PHY_CTRL_TYPE_C45 BIT(2)
+#define RTL8390_PHY_CTRL_TYPE_C22 0
+#define RTL8390_PHYREG_CTRL 0x03e0
+#define RTL8390_PHY_CTRL_EXT_PAGE GENMASK(8, 0)
+#define RTL8390_PHYREG_DATA_CTRL 0x03f0
+#define RTL8390_PHY_CTRL_INDATA GENMASK(31, 16)
+#define RTL8390_PHY_CTRL_DATA GENMASK(15, 0)
+#define RTL8390_PHYREG_MMD_CTRL 0x03f4
+#define RTL8390_PHYREG_PORT_CTRL_LOW 0x03e4
+#define RTL8390_PHYREG_PORT_CTRL_HIGH 0x03e8
+#define RTL8390_SMI_PORT_POLLING_CTRL 0x03fc
+
#define RTL9300_NUM_BUSES 4
#define RTL9300_NUM_PAGES 4096
#define RTL9300_NUM_PORTS 28
@@ -457,6 +480,62 @@ static int otto_emdio_8380_write_c45(struct mii_bus *bus, int port,
return otto_emdio_write_cmd(bus, RTL8380_PHY_CTRL_TYPE_C45, &cmd_data);
}
+static int otto_emdio_8390_read_c22(struct mii_bus *bus, int port, int regnum, u32 *value)
+{
+ struct otto_emdio_priv *priv = otto_emdio_bus_to_priv(bus);
+ struct otto_emdio_cmd_regs cmd_data = {
+ .c22_data = FIELD_PREP(RTL8390_PHY_CTRL_REG_ADDR, regnum) |
+ FIELD_PREP(RTL8390_PHY_CTRL_MAIN_PAGE, priv->page[port]),
+ .ext_page = FIELD_PREP(RTL8390_PHY_CTRL_EXT_PAGE, 0x1ff),
+ .io_data = FIELD_PREP(RTL8390_PHY_CTRL_INDATA, port),
+ };
+
+ return otto_emdio_read_cmd(bus, RTL8390_PHY_CTRL_TYPE_C22, &cmd_data,
+ RTL8390_PHY_CTRL_DATA, value);
+}
+
+static int otto_emdio_8390_write_c22(struct mii_bus *bus, int port, int regnum, u16 value)
+{
+ struct otto_emdio_priv *priv = otto_emdio_bus_to_priv(bus);
+ struct otto_emdio_cmd_regs cmd_data = {
+ .c22_data = FIELD_PREP(RTL8390_PHY_CTRL_REG_ADDR, regnum) |
+ FIELD_PREP(RTL8390_PHY_CTRL_MAIN_PAGE, priv->page[port]),
+ .ext_page = FIELD_PREP(RTL8390_PHY_CTRL_EXT_PAGE, 0x1ff),
+ .io_data = FIELD_PREP(RTL8390_PHY_CTRL_INDATA, value),
+ .port_mask_high = (u32)(BIT_ULL(port) >> 32),
+ .port_mask_low = (u32)(BIT_ULL(port)),
+ };
+
+ return otto_emdio_write_cmd(bus, RTL8390_PHY_CTRL_TYPE_C22, &cmd_data);
+}
+
+static int otto_emdio_8390_read_c45(struct mii_bus *bus, int port,
+ int dev_addr, int regnum, u32 *value)
+{
+ struct otto_emdio_cmd_regs cmd_data = {
+ .c45_data = FIELD_PREP(PHY_CTRL_MMD_DEVAD, dev_addr) |
+ FIELD_PREP(PHY_CTRL_MMD_REG, regnum),
+ .io_data = FIELD_PREP(RTL8390_PHY_CTRL_INDATA, port),
+ };
+
+ return otto_emdio_read_cmd(bus, RTL8390_PHY_CTRL_TYPE_C45, &cmd_data,
+ RTL8390_PHY_CTRL_DATA, value);
+}
+
+static int otto_emdio_8390_write_c45(struct mii_bus *bus, int port,
+ int dev_addr, int regnum, u16 value)
+{
+ struct otto_emdio_cmd_regs cmd_data = {
+ .c45_data = FIELD_PREP(PHY_CTRL_MMD_DEVAD, dev_addr) |
+ FIELD_PREP(PHY_CTRL_MMD_REG, regnum),
+ .io_data = FIELD_PREP(RTL8390_PHY_CTRL_INDATA, value),
+ .port_mask_high = (u32)(BIT_ULL(port) >> 32),
+ .port_mask_low = (u32)(BIT_ULL(port)),
+ };
+
+ return otto_emdio_write_cmd(bus, RTL8390_PHY_CTRL_TYPE_C45, &cmd_data);
+}
+
static int otto_emdio_9300_read_c22(struct mii_bus *bus, int port, int regnum, u32 *value)
{
struct otto_emdio_priv *priv = otto_emdio_bus_to_priv(bus);
@@ -963,6 +1042,30 @@ static const struct otto_emdio_info otto_emdio_8380_info = {
.write_c45 = otto_emdio_8380_write_c45,
};
+static const struct otto_emdio_info otto_emdio_8390_info = {
+ .cmd_fail = RTL8390_PHY_CTRL_FAIL,
+ .cmd_read = RTL8390_PHY_CTRL_READ,
+ .cmd_write = RTL8390_PHY_CTRL_WRITE,
+ .cmd_regs = {
+ .broadcast = RTL8390_BCAST_PHYID_CTRL,
+ .c22_data = RTL8390_PHYREG_ACCESS_CTRL,
+ .c45_data = RTL8390_PHYREG_MMD_CTRL,
+ .ext_page = RTL8390_PHYREG_CTRL,
+ .io_data = RTL8390_PHYREG_DATA_CTRL,
+ .port_mask_low = RTL8390_PHYREG_PORT_CTRL_LOW,
+ .port_mask_high = RTL8390_PHYREG_PORT_CTRL_HIGH,
+ },
+ .link_flap = true,
+ .num_buses = RTL8390_NUM_BUSES,
+ .num_pages = RTL8390_NUM_PAGES,
+ .num_ports = RTL8390_NUM_PORTS,
+ .poll_ctrl = RTL8390_SMI_PORT_POLLING_CTRL,
+ .read_c22 = otto_emdio_8390_read_c22,
+ .read_c45 = otto_emdio_8390_read_c45,
+ .write_c22 = otto_emdio_8390_write_c22,
+ .write_c45 = otto_emdio_8390_write_c45,
+};
+
static const struct otto_emdio_info otto_emdio_9300_info = {
.addr_map_base = RTL9300_SMI_PORT0_5_ADDR_CTRL,
.bus_map_base = RTL9300_SMI_PORT0_15_POLLING_SEL,
@@ -1014,6 +1117,7 @@ static const struct otto_emdio_info otto_emdio_9310_info = {
static const struct of_device_id otto_emdio_ids[] = {
{ .compatible = "realtek,rtl8380-mdio", .data = &otto_emdio_8380_info },
+ { .compatible = "realtek,rtl8391-mdio", .data = &otto_emdio_8390_info },
{ .compatible = "realtek,rtl9301-mdio", .data = &otto_emdio_9300_info },
{ .compatible = "realtek,rtl9311-mdio", .data = &otto_emdio_9310_info },
{}
--
2.54.0
^ permalink raw reply related
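The write helpers in the patch above split a one-hot 64-bit port mask across the LOW and HIGH port-control registers with BIT_ULL(port) and a 32-bit shift. A minimal userspace sketch of that split follows. The struct and function names are hypothetical and are not part of the driver; only the shift-and-truncate arithmetic is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two 32-bit register values. */
struct port_mask_halves {
	uint32_t low;	/* bit set here for ports 0..31 */
	uint32_t high;	/* bit set here for ports 32..63 */
};

/* Split a one-hot port mask the way the write helpers do:
 * low  = (u32)BIT_ULL(port)
 * high = (u32)(BIT_ULL(port) >> 32)
 */
static struct port_mask_halves split_port_mask(unsigned int port)
{
	struct port_mask_halves h;
	uint64_t mask;

	assert(port < 64);		/* a larger shift would be undefined */
	mask = UINT64_C(1) << port;	/* userspace stand-in for BIT_ULL() */
	h.low = (uint32_t)mask;
	h.high = (uint32_t)(mask >> 32);
	return h;
}
```

For example, split_port_mask(32) yields low == 0 and high == 1, so exactly one of the two registers carries the port bit.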
* [PATCH v3] flow_dissector: fix uninit-value in __skb_flow_dissect() for ETH_ADDRS
From: Yun Zhou @ 2026-06-13 11:31 UTC (permalink / raw)
To: davem, edumazet, kuba, pabeni, horms, qingfang.deng, jiri
Cc: netdev, linux-kernel, yun.zhou
__skb_flow_dissect() unconditionally reads 12 bytes from eth_hdr(skb)
when FLOW_DISSECTOR_KEY_ETH_ADDRS is requested. This assumes the skb
has a valid Ethernet header at mac_header, which is not always the case.
The problem can be triggered by:
1. Creating a TUN device in L3 mode (IFF_TUN, hard_header_len=0)
2. Attaching a multiq qdisc with a flower filter matching on eth_src
3. Sending a packet through AF_PACKET
Since TUN in L3 mode has no link-layer header, mac_header points to
the L3 data area. The flow dissector reads 12 bytes of uninitialized
skb memory, which then propagates through fl_set_masked_key() and is
used as a rhashtable lookup key in __fl_lookup(), as reported by KMSAN.
Rejecting the filter in the control path (at tc filter add time) is
not feasible because TC filter blocks can be shared between arbitrary
devices -- a filter installed on an Ethernet device may later classify
packets on a headerless device through a shared block. The device
association is not fixed at filter creation time.
Fix this in the data path by checking skb->dev->hard_header_len before
reading. If the device does not have a link-layer header large enough
to contain the Ethernet addresses, zero the key so the filter will not
match.
Reported-by: syzbot+fa2f5b1fb06147be5e16@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=fa2f5b1fb06147be5e16
Fixes: 67a900cc0436 ("flow_dissector: introduce support for Ethernet addresses")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
---
v3: Replace skb_tail_pointer() - skb_mac_header() length check with
skb->dev->hard_header_len check.
v2: Adjust commit message and comment.
net/core/flow_dissector.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 2a98f5fa74eb..0b235ec0743f 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1173,13 +1173,20 @@ bool __skb_flow_dissect(const struct net *net,
if (dissector_uses_key(flow_dissector,
FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
- struct ethhdr *eth = eth_hdr(skb);
struct flow_dissector_key_eth_addrs *key_eth_addrs;
key_eth_addrs = skb_flow_dissector_target(flow_dissector,
FLOW_DISSECTOR_KEY_ETH_ADDRS,
target_container);
- memcpy(key_eth_addrs, eth, sizeof(*key_eth_addrs));
+ /* TC filter blocks can be shared across devices with
+ * different header lengths, so we cannot validate this
+ * when the filter is installed -- check at dissect time.
+ */
+ if (skb->dev &&
+ skb->dev->hard_header_len >= sizeof(*key_eth_addrs))
+ memcpy(key_eth_addrs, eth_hdr(skb), sizeof(*key_eth_addrs));
+ else
+ memset(key_eth_addrs, 0, sizeof(*key_eth_addrs));
}
if (dissector_uses_key(flow_dissector,
--
2.43.0
^ permalink raw reply related
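The data-path check in the flow_dissector patch above can be modelled outside the kernel. The sketch below is only an illustration under stated assumptions: struct fake_dev and fill_eth_addrs_key() are hypothetical stand-ins for struct net_device and the dissector code, and 12 is the size of the two MAC addresses that the commit message says are read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ADDRS_LEN 12	/* destination MAC + source MAC */

/* Hypothetical stand-in for the one net_device field the check uses. */
struct fake_dev {
	unsigned short hard_header_len;
};

/* Copy the addresses only if the device has a link-layer header large
 * enough to hold them; otherwise zero the key so that a filter matching
 * on a real address will not match.
 */
static void fill_eth_addrs_key(unsigned char *key,
			       const struct fake_dev *dev,
			       const unsigned char *mac_header)
{
	assert(key != NULL);

	if (dev && dev->hard_header_len >= ETH_ADDRS_LEN)
		memcpy(key, mac_header, ETH_ADDRS_LEN);
	else
		memset(key, 0, ETH_ADDRS_LEN);
}
```

With hard_header_len == 0, as for a TUN device in L3 mode, the key is zeroed and mac_header is never dereferenced.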
* Re: [PATCH net-next v2 1/2] virtio_net: xsk: fix race in rx wake up
From: Menglong Dong @ 2026-06-13 12:26 UTC (permalink / raw)
To: menglong8.dong, xuanzhuo, eperezma, Bui Quang Minh
Cc: mst, jasowang, andrew+netdev, davem, edumazet, kuba, pabeni,
kerneljasonxing, netdev, virtualization, linux-kernel
In-Reply-To: <41eefa1d-99bf-450d-988e-7dec67c6b61e@gmail.com>
On 2026/6/12 00:24, Bui Quang Minh wrote:
> On 6/11/26 09:56, menglong8.dong@gmail.com wrote:
> > From: Menglong Dong <dongml2@chinatelecom.cn>
> >
> > During packet receiving in virtio-net, the rq can be empty, which means
> > "rq->vq->num_free == virtqueue_get_vring_size(rq->vq)", in
> > virtnet_add_recvbuf_xsk(), if we are using xsk. Meanwhile, the fill ring
> > can be empty too, which means we can't allocate anything from
> > xsk_buff_alloc_batch(). Then, we will set the XDP_RING_NEED_WAKEUP flag.
> >
> > However, if the user cleans all the data in the rx ring, fills the
> > "fill ring", and checks the XDP_RING_NEED_WAKEUP flag after
> > xsk_buff_alloc_batch() and before xsk_set_rx_need_wakeup(), then the rx
> > napi will never be scheduled: the rx ring is empty, which means we will
> > never receive a packet to trigger the further recv fill. The rx ring is
> > empty now, so the user will not check the flag either.
> >
> > Fix this by setting the XDP_RING_NEED_WAKEUP flag before
> > xsk_buff_alloc_batch() if both rq->vq and fill ring are empty.
> >
> > Meanwhile, set the XDP_RING_NEED_WAKEUP flag if we have any free entry in
> > rq->vq.
> >
> > Fixes: e3f8800aa243 ("virtio-net: xsk: Support wakeup on RX side")
> > Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
> > ---
> > drivers/net/virtio_net.c | 25 ++++++++++++++++++++++---
> > 1 file changed, 22 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index f4adcfee7a80..4b5b3fa62008 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1323,16 +1323,27 @@ static int virtnet_add_recvbuf_xsk(struct virtnet_info *vi, struct receive_queue
> > struct xsk_buff_pool *pool, gfp_t gfp)
> > {
> > struct xdp_buff **xsk_buffs;
> > + bool need_wakeup;
> > dma_addr_t addr;
> > int err = 0;
> > u32 len, i;
> > int num;
> >
> > + need_wakeup = xsk_uses_need_wakeup(pool);
> > xsk_buffs = rq->xsk_buffs;
> >
> > + /* If both rq->vq and fill ring are empty, and then the user submit
> > + * all the chunks to the fill ring and check the wake up flag
> > + * after xsk_buff_alloc_batch() and before xsk_set_rx_need_wakeup(),
> > + * we will lose the chance to wake up the rx napi, so we have to
> > + * set the need_wakeup flag here.
> > + */
> > + if (need_wakeup && virtqueue_get_vring_size(rq->vq) == rq->vq->num_free)
> > + xsk_set_rx_need_wakeup(pool);
>
Hi, Bui Quang. Thanks for your reply. I spent some time working through
what you said.
> I think when polling the receive queue, the userspace program needs to
> check the XDP_RING_NEED_WAKEUP flag if it does not see any packets. The
> flag check is quite lightweight in my opinion. Here are some examples I find
>
> -
> https://github.com/xdp-project/xdp-tools/blob/e9469501622aa22a7e452a671000bec8685edcde/lib/util/xdpsock.c#L1206
You are right, I was overly concerned about this point. My original
concern was that we can't wake up from the poll syscall in this case:
The umem has 2000 chunks. In the beginning, the xsk->fill_ring
is filled with 2000 chunks, and then the user falls asleep and doesn't
do anything.
Kernel: the 2000th packet is received
Kernel: xsk_buff_alloc_batch() returns 0 (xsk->fill_ring is empty and xsk->rx_ring is full)
User: handle the xsk->rx_ring
User: fill the xsk->fill_ring with 2000 chunks
User: check the wake up flag
User: no need_wakeup flag, fall asleep with poll() syscall
Kernel: call xsk_set_rx_need_wakeup()
Kernel: virtio-net rx ringbuf is empty, so we can't receive any further packet
Kernel: to call virtnet_add_recvbuf_xsk(); we are dead
But then I found that the 2000th packet can still wake us up from the
poll syscall, which means that the case where neither the NAPI nor the
user can be woken up doesn't exist.
> -
> https://github.com/xdp-project/bpf-examples/blob/43e565901c4287efa863edca7f0e6cd6e35ed896/AF_XDP-forwarding/xsk_fwd.c#L540
>
> Furthermore, the XDP_RING_NEED_WAKEUP flag related functions do not
> provide any memory ordering. So even with your patch, I'm worried that
> this case is possible
>
> kernel                              userspace
>
> xsk_buff_alloc_batch -> failed
>                                     submit fill ring
>                                     flag != XDP_RING_NEED_WAKEUP
> // reordering due to lack of memory orderings
> xsk_set_rx_need_wakeup
>
> I'm not an expert here, so correct me if I'm wrong. I think the wake up
> flag is designed with no ordering guarantees, so we cannot rely on it to
> reason and skip further checks.
>
> > +
> > num = xsk_buff_alloc_batch(pool, xsk_buffs, rq->vq->num_free);
[....]
> > +
>
> Why do we need to set XDP_RING_NEED_WAKEUP even when
> xsk_buff_alloc_batch succeeds?
Ah, never mind this part. I just thought that if xsk_buff_alloc_batch()
didn't allocate as many chunks as we need, we could wake up
the NAPI as soon as possible, in case the virtio-net ringbuf
is full and causes packet drops :)
Anyway, I'll drop the first patch and send only the second patch
in v3.
Thanks!
Menglong Dong
>
> > return num;
> >
> > err:
>
> Thanks,
> Quang Minh.
>
>
>
>
^ permalink raw reply
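The ordering concern quoted in the message above is the classic store-buffering pattern: each side writes its own state and then reads the other's. The toy model below, using C11 sequentially consistent atomics, shows the kind of ordering that would make the handshake safe to reason about. It is only a sketch: all names are hypothetical, and it makes no claim about which barriers the real xsk helpers do or do not provide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy state, not kernel symbols. */
static atomic_bool need_wakeup;		/* models XDP_RING_NEED_WAKEUP */
static atomic_uint fill_entries;	/* models fill ring occupancy */

static void model_reset(void)
{
	atomic_store(&need_wakeup, false);
	atomic_store(&fill_entries, 0);
}

/* Driver side: publish the flag first, then look at the fill ring.
 * Returns true if the ring looked empty, so the driver relies on a kick.
 */
static bool driver_must_wait(void)
{
	atomic_store(&need_wakeup, true);	/* seq_cst store */
	return atomic_load(&fill_entries) == 0;	/* seq_cst load */
}

/* User side: refill first, then look at the flag.
 * Returns true if the user saw the flag and has to kick the driver.
 */
static bool user_must_kick(unsigned int n)
{
	assert(n > 0);
	atomic_fetch_add(&fill_entries, n);	/* seq_cst read-modify-write */
	return atomic_load(&need_wakeup);	/* seq_cst load */
}
```

With sequentially consistent operations, driver_must_wait() returning true and a concurrent user_must_kick() returning false cannot both happen, so at least one side always notices the other. With relaxed or plain accesses that guarantee is lost, which is the reordering described in the quoted reply.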
* Re: [PATCH] nbd: Reclassify sockets to avoid lockdep circular dependency
From: Jens Axboe @ 2026-06-13 12:34 UTC (permalink / raw)
To: Josef Bacik, Eric Dumazet
Cc: linux-kernel, linux-block, nbd, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Kuniyuki Iwashima, netdev,
syzbot+607cdcf978b3e79da878
In-Reply-To: <20260613042619.1108126-1-edumazet@google.com>
On Sat, 13 Jun 2026 04:26:19 +0000, Eric Dumazet wrote:
> syzbot reported a possible circular locking dependency in udp_sendmsg()
> where fs_reclaim can be triggered while holding sk_lock, and fs_reclaim
> can eventually depend on another sk_lock (e.g., if NBD is used for swap
> or writeback and NBD uses TLS/TCP which acquires sk_lock).
>
> Since the UDP socket and the NBD TCP/TLS socket are different, this is a
> false positive. Fix this by reclassifying NBD sockets to a separate lock
> class when they are added to the NBD device.
>
> [...]
Applied, thanks!
[1/1] nbd: Reclassify sockets to avoid lockdep circular dependency
commit: d532cddb6c6049ced414d64d83c6ce7149a6421a
Best regards,
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH] r8152: add vendor/device ID for CoreChips SR9900
From: Nicolai Buchwitz @ 2026-06-13 12:52 UTC (permalink / raw)
To: zjzhao, hayeswang
Cc: andrew+netdev, linux-usb, netdev, linux-kernel, zjzhao-eda
In-Reply-To: <20260613090154.1975753-1-zjzhao@edatec.cn>
Hi
On June 13, 2026 11:01:54 AM GMT+02:00, zjzhao@edatec.cn wrote:
>From: zjzhao-eda <zjzhao@edatec.cn>
>
>The CoreChips SR9900 (0x0fe6:0x9900) is a USB 2.0 10/100
>Ethernet adapter. Testing shows it works correctly with the
>r8152 driver, reaching wire speed (94 Mbps) with zero packet
>loss on both TCP and UDP.
>
>Tested on Raspberry Pi, including hotplug and extended data
>transfer.
>
>Signed-off-by: zjzhao-eda <zjzhao@edatec.cn>
AFAIK the DCO must contain a full name and not just an alias.
>---
> drivers/net/usb/r8152.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
>index d61074178279..ea1733e3619c 100644
>--- a/drivers/net/usb/r8152.c
>+++ b/drivers/net/usb/r8152.c
>@@ -10062,6 +10062,7 @@ static const struct usb_device_id rtl8152_table[] = {
> { USB_DEVICE(VENDOR_ID_DELL, 0xb097) },
> { USB_DEVICE(VENDOR_ID_ASUS, 0x1976) },
> { USB_DEVICE(VENDOR_ID_TRENDNET, 0xe02b) },
>+ { USB_DEVICE(0x0fe6, 0x9900) },
> {}
> };
>
Also, please indicate the target tree for your patch in the subject (e.g. net-next). For further details, have a look at the netdev FAQ [1].
[1] https://www.kernel.org/doc/html/v6.1/process/maintainer-netdev.html
Thanks
Nicolai
^ permalink raw reply
* Re: [RFC net-next 08/15] ipxlat: add translation engine and dispatch core
From: Ralf Lici @ 2026-06-13 13:17 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: netdev, Daniel Gröber, Antonio Quartulli, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
linux-kernel
In-Reply-To: <87y0gm8x5k.fsf@toke.dk>
On Wed, 10 Jun 2026 13:14:47 +0200, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> Ralf Lici <ralf@mandelbit.com> writes:
>
> > Hi Toke,
> >
> > On Thu, 04 Jun 2026 20:23:51 +0200, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> Ralf Lici <ralf@mandelbit.com> writes:
> >>
> >> > This commit introduces the core start_xmit processing flow: validate,
> >> > select action, translate, and forward. It centralizes action resolution
> >> > in the dispatch layer and keeps per-direction translation logic separate
> >> > from device glue. The result is a single data-path entry point with
> >> > explicit control over drop/forward/emit behavior.
> >> >
> >> > Signed-off-by: Ralf Lici <ralf@mandelbit.com>
> >>
> >> This is very cool! Going quickly through the series, this seems like
> >> thorough work that will be cool to have available in the kernel, so
> >> thanks for doing this! I'll be quite happy to retire my barebones
> >> BPF-based implementation once this lands :)
> >>
> >
> > Thanks, glad to hear this looks useful. I have not had much time to work
> > on ipxlat lately, but I hope to respin the RFC soon.
> >
> >> One comment on the device model below (which is also why I chose this
> >> patch to reply to):
> >>
> >> > +static void ipxlat_forward_pkt(struct ipxlat_priv *ipxlat, struct sk_buff *skb)
> >> > +{
> >> > + const unsigned int len = skb->len;
> >> > + int err;
> >> > +
> >> > + /* reinject as a fresh packet with scrubbed metadata */
> >> > + skb_set_queue_mapping(skb, 0);
> >> > + skb_scrub_packet(skb, false);
> >> > +
> >> > + err = gro_cells_receive(&ipxlat->gro_cells, skb);
> >>
> >> So given that you're not resetting skb->dev here, IIUC, this means that
> >> the translated packet will magically re-appear as if it arrived on the
> >> interface it first came in on, right?
> >>
> >> That seems... a bit too magical? Sending a packet to one device making
> >> it suddenly reappear on a different, unrelated, device seems like it
> >> will just create confusion. It's like the ipxlat device can't really
> >> decide if it's a device or a tunnel? :)
> >>
> >
> > That's not quite what happens in the routed xmit path. There the stack
> > sets skb->dev to the selected output device before handing the skb to
> > the device. For IPv4 and IPv6 this happens in ip_output/ip6_output,
> > where the output device is taken from the skb dst. So when the route
> > selects the ipxlat device, the skb reaches ndo_start_xmit with skb->dev
> > already pointing at the ipxlat device, not at the original ingress
> > device.
> >
> > The internal 4-to-6 pre-fragmentation path should preserve the same
> > property as well: ip_do_fragment copies the skb metadata to the
> > generated fragments, including skb->dev, and the temporary dst used for
> > that path also points at the ipxlat device. The fragment callback then
> > feeds those fragments back into the same ipxlat processing path.
> >
> > That said, I agree that relying on this implicitly is not great.
> > gro_cells_receive uses skb->dev directly, and the intended receive-side
> > re-injection model should be obvious at the call site. I will set
> > skb->dev = ipxlat->dev explicitly before gro_cells_receive in the next
> > version.
>
> Right, sounds good. I'm also wondering if you actually need the gro_cells
> infrastructure at all? IIUC, the purpose of that is to allow tunnels to
> create GRO superframes of packets after they are decapsulated (and thus
> their l4 commonality becomes apparent). But you're not decapsulating
> anything, you're just translating between protocols the kernel already
> understands. So presumably any opportunity to coalesce GRO packets would
> already have happened pre-translation? So any reason why you can't just
> do what loopback.c does, and do a straight __netif_rx() call in the
> transmit function?
>
No, I think you're right that gro_cells is not justified here, I was
probably biased by my work on tunnel interfaces. Unlike a tunnel decap
path, ipxlat does not reveal a new same-family L4 flow after
decapsulation, so I don't see a translation-specific GRO opportunity
there, and a loopback-style receive handoff would be the simpler version
of that design.
That said, after thinking more about the rest of your feedback, I think
the right fix is probably not just replacing gro_cells with __netif_rx.
The deeper issue is the netdevice/RX-reinjection model itself.
> >> I think a better model is to treat the device as basically a loopback
> >> device that translates packets before looping them back (so when they
> >> come back they appear to be coming from that device).
> >>
> >> Any reason why that wouldn't work?
> >>
> >
> > That's indeed the intended model for the ipxlat netdevice: route packets
> > to it, translate them, then loop them back into the stack as packets
> > received from that same device. That seemed like the simplest model and
> > the one that exposes the translation point most clearly.
>
> Right. I think this could be made a bit more explicit in the
> documentation as well, since it's a bit of an unusual model.
>
> And, well, taking a step back: is it really the right model? Regular NAT
> lives in netfilter, why can't this be a netfilter module as well? Seems
> to me you could have something like:
>
>   table ip xlat4 {
>     chain postrouting {
>       type nat hook postrouting priority srcnat; policy accept;
>       ip daddr 0.0.0.0/0 oifname "eth0" xlat to 64:ff9b::/96
>     }
>   }
>   table ip6 xlat6 {
>     chain prerouting {
>       type nat hook prerouting priority dstnat; policy accept;
>       ip6 saddr 64:ff9b::/96 iifname "eth0" xlat from 64:ff9b::/96
>     }
>   }
>
> and that would provide the functionality without having to implement a
> new interface type and the associated multiple traversals through the
> stack? Did you consider this as an alternative to the new device type?
>
We did consider netfilter, and your example is syntactically attractive,
but I am no longer convinced it is the cleanest model for SIIT.
An nft expression cannot simply rewrite ETH_P_IP <-> ETH_P_IPV6 and
return ACCEPT as if this were normal NAT because the current hook
invocation, dst, and conntrack-related state were established for the
packet as it entered that hook. A cross-family translator would need to
consume the skb, clear or rebuild route and ct metadata as appropriate,
do an other-family route lookup, and resume at a well-defined point in
that family. That seems possible, but it would be a new stateless
cross-family action, not just a new mode of the existing nft nat
expression (which is built around nf_nat_setup_info and assumes the
packet's L3 family does not change AFAICT).
My second concern is that the SIIT boundary would be a property of rule
and hook placement. That gives flexibility, but it also means the
translation point has to be constrained and documented very carefully to
avoid ambiguous TTL/Hop Limit, PMTU/ICMP, and hook-order behavior. For
this use case I would rather have the route that matches the translation
prefix also be the object that says: leave this family here and continue
in the other one.
After looking at the available kernel mechanisms again, I think the
better model is probably LWT: routes carry an ipxlat encap referencing a
named translator domain configured over netlink. That should represent
the stateless, prefix-based and symmetric nature of ipxlat.
Very roughly, userspace could look like:
ip xlat add siit0 prefix6 64:ff9b::/96
ip route add ... encap ipxlat id siit0
ip -6 route add ... encap ipxlat id siit0
There are some useful precedents for this: ILA is stateless address
translation as LWT, seg6_local already has cross-family LWT actions, and
ioam6 has a similar split between separately configured objects and
route attachments.
The invariant I would like v2 to follow is that the original-family
route lookup selects translation as its terminal route action. The
translated skb then gets a fresh lookup in the other family. From that
point on, TTL/Hop Limit where applicable, PMTU, ICMP errors, and
netfilter visibility belong to the translated family.
So I think your question addresses the core design issue in this RFC. My
current preference is to rework the next version around an LWT/domain
model instead of the virtual netdevice model, unless prototyping shows a
fundamental problem with that approach.
Does that model make sense to you?
Thanks for pushing on this.
--
Ralf Lici
Mandelbit Srl
^ permalink raw reply
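For readers unfamiliar with the stateless mapping behind a prefix such as 64:ff9b::/96 in the message above: with a /96 translation prefix, RFC 6052 places the IPv4 address in the last 32 bits of the IPv6 address. The sketch below shows only that address arithmetic. The function names are hypothetical, and it does not represent the ipxlat implementation, which also has to translate headers, checksums, and ICMP.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>

/* Embed an IPv4 address into a /96 translation prefix (RFC 6052,
 * section 2.2): bytes 0..11 come from the prefix, bytes 12..15 are
 * the IPv4 address in network byte order.
 */
static struct in6_addr siit_map_4to6(const struct in6_addr *prefix96,
				     struct in_addr v4)
{
	struct in6_addr out;

	assert(prefix96 != NULL);
	out = *prefix96;
	memcpy(&out.s6_addr[12], &v4.s_addr, 4);
	return out;
}

/* Reverse direction: extract the embedded IPv4 address. The caller is
 * assumed to have checked that v6 really lies inside the prefix.
 */
static struct in_addr siit_map_6to4(const struct in6_addr *v6)
{
	struct in_addr v4;

	assert(v6 != NULL);
	memcpy(&v4.s_addr, &v6->s6_addr[12], 4);
	return v4;
}
```

Mapping 192.0.2.33 into 64:ff9b::/96 this way gives 64:ff9b::c000:221, and extracting the last four bytes returns the original IPv4 address. No per-flow state is needed in either direction.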
* Re: [PATCH net-next v7 0/5] veth: add Byte Queue Limits (BQL) support
From: Simon Schippers @ 2026-06-13 13:57 UTC (permalink / raw)
To: Jonas Köppeler, hawk, netdev
Cc: kernel-team, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Chris Arges, Mike Freemon,
Toke Høiland-Jørgensen, Breno Leitao,
Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Stanislav Fomichev, bpf
In-Reply-To: <4ddf3bcb-db5d-4821-ab32-577de93973a7@tu-berlin.de>
On 6/12/26 19:21, Jonas Köppeler wrote:
> On 6/12/26 16:10, Simon Schippers wrote:
>> On 6/12/26 10:35, hawk@kernel.org wrote:
>>> From: Jesper Dangaard Brouer <hawk@kernel.org>
>>>
>>> This series adds BQL (Byte Queue Limits) to the veth driver, reducing
>>> latency by dynamically limiting in-flight packets in the ptr_ring and
>>> moving buffering into the qdisc where AQM algorithms can act on it.
>>
>> LGTM, thanks for the detailed changelog :)
>>
>> Maybe we should stop searching for the perfect tx-usecs value.
>> 100us is probably fine for most hardware to not have a performance
>> regression. And lowering it does not really improve the RTT anyways.
>> Do you agree?
> I agree, I already thought that it just might be a very lucky case when using 50us where something accidentally aligns nicely. Interestingly, I could also reproduce that 50us was consistently a little better compared to 100us on an Intel CPU. Maybe if I get the time, I'll have another look at it, but in general I think 50us or 100us does not really matter.
>
Interesting.
I ran the benchmarks again, the results are at [1].
I tested values between 0-9us, 10-90us, 100, 500, 1000, 5000 and 10000us.
TLDR: Throughput is fine for everything > 0us. RTT only improves
slightly for < 100us. So 100us is fine.
[1] https://github.com/simoschip2000/veth-backpressure-performance-testing/blob/v7/results/tx-usecs/text_sweep.txt
>>
>> Nevertheless, I will compile and run the benchmarks again.
>>
>> I will go on vacation from 15th to 24th of June, so I will not be able
>> to contribute code or run benchmarks then.
>>
>> Thanks,
>> Simon
>>
>
^ permalink raw reply
* Re: [PATCH net-next v2 1/2] netdev: expose io_uring rx_page_order order via netlink
From: Dragos Tatulea @ 2026-06-13 14:09 UTC (permalink / raw)
To: Pavel Begunkov, Donald Hunter, Jakub Kicinski, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, Andrew Lunn, Jens Axboe
Cc: Yael Chemla, Tariq Toukan, netdev, linux-kernel, io-uring
In-Reply-To: <d0401fab-61c5-43e7-93ae-d4757433eb7a@gmail.com>
On 13.06.26 11:53, Pavel Begunkov wrote:
> On 6/12/26 22:17, Dragos Tatulea wrote:
>> This adds observability for the io_uring zcrx rx-buf-len configuration.
>
> It might be nicer to look it up in the queue, e.g. rxq->mp_params,
> and make it a queue attribute instead of zcrx specific one. In either
> case, no objections.
>
In io_pp_nl_fill() or in page_pool_nl_fill() as it was done in v1 for order?
Thanks,
Dragos
^ permalink raw reply