* [GIT PULL] Networking for v7.1-rc3
From: Jakub Kicinski @ 2026-05-07 17:21 UTC
To: torvalds; +Cc: kuba, davem, netdev, linux-kernel, pabeni
Hi Linus!
The following changes since commit 08d0d3466664000ba0670e0ef0d447f23459e0d4:
Merge tag 'net-7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (2026-04-30 08:45:43 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git tags/net-7.1-rc3
for you to fetch changes up to 41ae14071cd7f6a7770e2fe1f8a0859d4c2c6ba4:
net: sparx5: configure serdes for 1000BASE-X in sparx5_port_init() (2026-05-07 09:08:47 -0700)
----------------------------------------------------------------
Including fixes from Netfilter, IPsec, Bluetooth and WiFi.
Current release - fix to a fix:
- ipmr: add __rcu to netns_ipv4.mrt, make sure we hold the RCU lock
in all relevant places
Current release - new code bugs:
- fixes for the recently added resizable hash tables
- ipv6: make sure we default IPv6 tunnel drivers to =m now that
IPv6 itself is built in
- drv: octeontx2-af: parser/CAM fixes
Previous releases - regressions:
- phy: micrel: fix LAN8814 QSGMII soft reset
- wifi: cw1200: revert "Fix locking in error paths"
- wifi: ath12k: fix crash on WCN7850, due to adding the same queue
buffer to a list multiple times
Previous releases - always broken:
- a number of info leak fixes
- ipv6: implement limits on extension header parsing
- wifi: a number of fixes for missing bounds checks in the drivers
- Bluetooth: fixes for races and locking issues
- af_unix: fix an issue between garbage collection and PEEK
- af_unix: fix yet another issue with OOB data
- xfrm: esp: avoid in-place decrypt on shared skb frags
- netfilter: replace skb_try_make_writable() by skb_ensure_writable()
- openvswitch: vport: fix race between tunnel creation and linking
leading to invalid memory accesses (type confusion)
- drv: amd-xgbe: fix PTP addend overflow causing frozen clock
Misc:
- sched/isolation: make HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN
(for relevant IPVS change)
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
----------------------------------------------------------------
Aaradhana Sahu (1):
wifi: ath12k: fix OF node refcount imbalance in WSI graph traversal
Aleksander Jan Bajkowski (1):
net: usb: r8152: add TRENDnet TUC-ET2G v2.0
Alex Cheema (1):
net: usb: cdc_ncm: add Apple Mac USB-C direct networking quirk
Alyssa Ross (1):
ipv6: default IPV6_SIT to m
Amir Mohammad Jahangirzad (1):
wifi: libertas: fix integer underflow in process_cmdrequest()
Andreas Haarmann-Thiemann (1):
net: ethernet: cortina: Drop half-assembled SKB
Aurelien DESBRIERES (1):
Bluetooth: hci_uart: Fix NULL deref in recv callbacks when priv is uninitialized
Baochen Qiang (2):
wifi: ath12k: prepare REO update element only for primary link
wifi: ath12k: fix peer_id usage in normal RX path
Bart Van Assche (1):
wifi: cw1200: Revert "Fix locking in error paths"
Benjamin Berg (1):
wifi: mac80211: use safe list iteration in radar detect work
Bobby Eshleman (1):
eth: fbnic: fix double-free of PCS on phylink creation failure
Breno Leitao (1):
netpoll: pass buffer size to egress_dev() to avoid MAC truncation
Catherine (1):
wifi: mac80211: drop stray 'static' from fast-RX rx_result
Cosmin Ratiu (6):
tools/selftests: Use a sensible timeout value for iperf3 client
tools/selftests: Add a VXLAN+IPsec traffic test
xfrm: Don't clobber inner headers when already set
net/mlx5e: psp: Fix invalid access on PSP dev registration fail
net/mlx5e: psp: Expose only a fully initialized priv->psp
net/mlx5e: psp: Hook PSP dev reg/unreg to profile enable/disable
D. Wythe (1):
net/smc: fix missing sk_err when TCP handshake fails
Daniel Borkmann (1):
ipv6: Implement limits on extension header parsing
Daniel Golle (1):
net: dsa: mt7530: fix .get_stats64 sleeping in atomic context
Daniel Machon (2):
net: sparx5: fix wrong chip ids for TSN SKUs
net: sparx5: configure serdes for 1000BASE-X in sparx5_port_init()
Daniel Zahka (3):
netdevsim: psp: only call nsim_psp_uninit() on PFs
netdevsim: psp: serialize calls to nsim_psp_uninit()
netdevsim: psp: rcu protect psp_dev reference
David Carlier (2):
psp: strip variable-length PSP header in psp_dev_rcv()
Bluetooth: hci_conn: fix potential UAF in create_big_sync
Dipayaan Roy (4):
net: mana: check xdp_rxq registration before unreg in mana_destroy_rxq()
net: mana: Skip WQ object destruction for uninitialized RXQ
net: mana: remove double CQ cleanup in mana_create_rxq error path
net: mana: Fix crash from unvalidated SHM offset read from BAR0 during FLR
Dmitry Baryshkov (1):
wifi: ath10k: snoc: select POWER_SEQUENCING
Dudu Lu (2):
Bluetooth: bnep: fix incorrect length parsing in bnep_rx_frame() extension handling
Bluetooth: l2cap: fix MPS check in l2cap_ecred_reconf_req
Eric Dumazet (12):
ipmr: prevent info-leak in ipmr_cache_report()
ipv4: igmp: annotate data-races in igmp_heard_query()
net/sched: sch_pie: annotate more data-races in pie_dump_stats()
net/sched: sch_cake: annotate data-races in cake_dump_class_stats (I)
net/sched: sch_cake: annotate data-races in cake_dump_class_stats (II)
vsock/virtio: fix potential unbounded skb queue
net: prevent possible UAF in rtnl_prop_list_size()
net/sched: sch_fq_codel: annotate data-races from fq_codel_dump_class_stats()
ipv6: fix potential UAF caused by ip6_forward_proxy_check()
inetpeer: add a missing read_seqretry() in inet_getpeer()
net/sched: sch_sfq: annotate data-races from sfq_dump_class_stats()
tcp: tcp_child_process() related UAF
Fernando Fernandez Mancera (3):
netfilter: nf_socket: skip socket lookup for non-first fragments
netfilter: nf_tables: skip L4 header parsing for non-first fragments
netfilter: xtables: fix L4 header parsing for non-first fragments
Florian Westphal (2):
netfilter: xt_CT: fix usersize for v1 and v2 revision
netfilter: nf_tables: fix netdev hook allocation memleak with dormant tables
Gregory Fuchedgi (1):
amd-xgbe: fix PTP addend overflow causing frozen clock
Holger Brunck (2):
net: wan: fsl_ucc_hdlc: fix uhdlc_memclean
net: wan: fsl_ucc_hdlc: fix ucc_hdlc_remove
Ilya Maximets (3):
openvswitch: vport: fix race between tunnel creation and linking
openvswitch: vport: fix self-deadlock on release of tunnel ports
selftests: openvswitch: add tests for tunnel vport refcounting
Jakov Novak (1):
wifi: libertas: notify firmware load wait on disconnect
Jakub Kicinski (21):
Merge branch 'net-mctp-test-minor-kunit-test-fixes'
Merge branch 'octeontx2-af-npc-cn20k-mcam-fixes'
Merge tag 'nf-26-05-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Merge branch 'ipv6-fix-ecmp-route-failover-on-carrier-loss'
Merge branch 'replace-direct-dequeue-call-with-qdisc_dequeue_peeked'
Merge branch 'net-sched-sch_cake-annotate-data-races-in-cake_dump_class_stats-series'
net: tls: fix silent data drop under pipe back-pressure
selftests: tls: add test for data loss on small pipe
Merge branch 'mptcp-misc-fixes-for-v7-1-rc3'
Merge branch 'bnxt_en-bug-fixes'
Merge tag 'nf-26-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Merge branch 'net-mlx5e-psp-fixes'
Merge branch 'net-mlx5-fixes-for-socket-direct'
Merge branch 'xsk-fix-bugs-around-xsk-skb-allocation'
Merge tag 'wireless-2026-05-06' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Merge tag 'for-net-2026-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Merge tag 'ovpn-net-20260504' of https://github.com/OpenVPN/ovpn-net-next
Merge tag 'ipsec-2026-05-05' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
selftests: drv-net: fix sort order of makefile and config
Merge branch 'netdevsim-psp-fix-init-and-uninit-bugs'
Merge branch 'mptcp-pm-misc-fixes-for-v7-1-rc3'
Jamal Hadi Salim (1):
net/sched: sch_red: Replace direct dequeue call with peek and qdisc_dequeue_peeked
Jann Horn (1):
Bluetooth: hci_event: fix memset typo
Jason Xing (8):
xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
xsk: free the skb when hitting the upper bound MAX_SKB_FRAGS
xsk: handle NULL dereference of the skb without frags issue
xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
xsk: prevent CQ desync when freeing half-built skbs in xsk_build_skb()
xsk: avoid skb leak in XDP_TX_METADATA case
xsk: fix xsk_addrs slab leak on multi-buffer error path
xsk: fix u64 descriptor address truncation on 32-bit architectures
Jeongjun Park (1):
wifi: rsi: fix kthread lifetime race between self-exit and external-stop
Jeremy Kerr (2):
net: mctp: test: use a zeroed struct sockaddr_mctp
net: mctp: test: Use dev_direct_xmit for TX to our test device
Jesper Dangaard Brouer (1):
veth: fix OOB txq access in veth_poll() with asymmetric queue counts
Jiawen Wu (2):
net: libwx: fix VF illegal register access
net: libwx: use request_irq for VF misc interrupt
Jiexun Wang (1):
af_unix: Reject SIOCATMARK on non-stream sockets
Jiri Slaby (SUSE) (1):
wifi: ath5k: do not access array OOB
Joey Lu (1):
net: stmmac: dwmac-nuvoton: fix NULL pointer dereference in nvt_set_phy_intf_sel()
Johannes Berg (5):
Merge tag 'ath-current-20260427' of git://git.kernel.org/pub/scm/linux/kernel/git/ath/ath
wifi: mac80211: tests: mark HT check strict
Merge tag 'ath-current-20260505' of git://git.kernel.org/pub/scm/linux/kernel/git/ath/ath
wifi: mac80211: remove station if connection prep fails
wifi: nl80211: fix NL80211_PMSR_FTM_REQ_ATTR_FTMS_PER_BURST usage
Julian Anastasov (6):
ipvs: fixes for the new ip_vs_status info
ipvs: fix races around the conn_lfactor and svc_lfactor sysctl vars
ipvs: fix the spin_lock usage for RT build
ipvs: do not leak dest after get from dest trash
ipvs: fix races around est_mutex and est_cpulist
ipvs: fix shift-out-of-bounds in ip_vs_rht_desired_size
Justin Chen (1):
net: phy: broadcom: Save PHY counters during suspend
Kai Zen (1):
net: rtnetlink: zero ifla_vf_broadcast to avoid stack infoleak in rtnl_fill_vfinfo
Kalesh AP (1):
bnxt_en: Check return value of bnxt_hwrm_vnic_cfg
Kuan-Ting Chen (1):
xfrm: esp: avoid in-place decrypt on shared skb frags
Kuniyuki Iwashima (6):
selftest: net: Add test for TCP flow failover with ECMP routes.
af_unix: Set gc_in_progress to true in unix_gc().
ipmr: Add __rcu to netns_ipv4.mrt.
ipv6: Fix null-ptr-deref in fib6_mtu().
ipmr: Call ipmr_fib_lookup() under RCU.
tcp: Fix dst leak in tcp_v6_connect().
Lorenzo Bianconi (1):
net: airoha: Move entries to queue head in case of DMA mapping failure in airoha_dev_xmit()
Luiz Augusto von Dentz (1):
Bluetooth: hci_event: Fix OOB read and infinite loop in hci_le_create_big_complete_evt
Maciej W. Rozycki (1):
MAINTAINERS: Add self for the DEC LANCE network driver
Maoyi Xie (3):
ip6_gre: Use cached t->net in ip6erspan_changelink().
wifi: nl80211: require CAP_NET_ADMIN over the target netns in SET_WIPHY_NETNS
wifi: nl80211: re-check wiphy netns in nl80211_prepare_wdev_dump() continuation
Marek Szyprowski (1):
wifi: brcmfmac: Fix potential use-after-free issue when stopping watchdog task
Markus Baier (1):
net: usb: asix: ax88772: re-add usbnet_link_change() in phylink callbacks
Matthieu Baerts (NGI0) (12):
mptcp: sockopt: increase seq in mptcp_setsockopt_all_sf
mptcp: pm: kernel: correctly retransmit ADD_ADDR ID 0
mptcp: pm: ADD_ADDR rtx: allow ID 0
mptcp: pm: ADD_ADDR rtx: fix potential data-race
mptcp: pm: ADD_ADDR rtx: always decrease sk refcount
mptcp: pm: ADD_ADDR rtx: free sk if last
mptcp: pm: ADD_ADDR rtx: resched blocked ADD_ADDR quicker
mptcp: pm: ADD_ADDR rtx: skip inactive subflows
mptcp: pm: ADD_ADDR rtx: return early if no retrans
mptcp: pm: prio: skip closed subflows
selftests: mptcp: check output: catch cmd errors
selftests: mptcp: pm: restrict 'unknown' check to pm_nl_ctl
Michael Bommarito (6):
xfrm: ah: account for ESN high bits in async callbacks
wifi: nl80211: require admin perm on SET_PMK / DEL_PMK
wifi: mac80211: check ieee80211_rx_data_set_link return in pubsta MLO path
Bluetooth: virtio_bt: clamp rx length before skb_put
Bluetooth: virtio_bt: validate rx pkt_type header length
Bluetooth: HIDP: serialise l2cap_unregister_user via hidp_session_sem
Michael Chan (2):
bnxt_en: Delay for 5 seconds after AER DPC for all chips
bnxt_en: Set bp->max_tpa according to what the FW supports
Michal Kosiorek (1):
xfrm: defensively unhash xfrm_state lists in __xfrm_state_delete
Mikhail Gavrilov (1):
Bluetooth: l2cap: defer conn param update to avoid conn->lock/hdev->lock inversion
Nan Li (1):
net/rds: handle zerocopy send cleanup before the message is queued
Nicolas Escande (1):
wifi: ath12k: fix leak in some ath12k_wmi_xxx() functions
Pablo Neira Ayuso (8):
netfilter: replace skb_try_make_writable() by skb_ensure_writable()
netfilter: nft_fwd_netdev: add device and headroom validate with neigh forwarding
netfilter: x_tables: add .check_hooks to matches and targets
netfilter: nft_compat: run xt_check_hooks_{match,target}() from .validate
netfilter: flowtable: ensure sufficient headroom in xmit path
netfilter: flowtable: fix inline vlan encapsulation in xmit path
netfilter: flowtable: fix inline pppoe encapsulation in xmit path
netfilter: flowtable: use skb_pull_rcsum() to pop vlan/pppoe header
Paolo Abeni (3):
mptcp: fix rx timestamp corruption on fastopen
Merge branch 'net-mana-fix-mana_destroy_rxq-cleanup-for-partial-rxq-init'
Merge branch 'openvswitch-fix-self-deadlock-on-release-of-tunnel-vports'
Pauli Virtanen (2):
Bluetooth: SCO: fix sleeping under spinlock in sco_conn_ready
Bluetooth: SCO: hold sk properly in sco_conn_ready
Pavan Chebbi (1):
bnxt_en: Use absolute target ns from ptp_clock_request
Pavitra Jha (1):
net: wwan: t7xx: validate port_count against message length in t7xx_port_enum_msg_handler
Pengpeng Hou (1):
Bluetooth: RFCOMM: pull credit byte with skb_pull_data()
Qingfang Deng (1):
ovpn: reset MAC header before passing skb up
Ralf Lici (2):
ovpn: ensure packet delivery happens with BH disabled
selftests: ovpn: reduce ping count in test.sh
Rameshkumar Sundaram (1):
wifi: ath12k: initialize RSSI dBm conversion event state
Ratheesh Kannoth (10):
octeontx2-af: npc: cn20k: Propagate MCAM key-type errors on cn20k
octeontx2-af: npc: cn20k: Drop debugfs_create_file() error checks in init
octeontx2-af: npc: cn20k: Propagate errors in defrag MCAM alloc rollback
octeontx2-af: npc: cn20k: Fix target map and rule
octeontx2-af: npc: cn20k: Clear MCAM entries by index and key width
octeontx2-af: npc: cn20k: Fix bank value
octeontx2-af: npc: cn20k: Fix MCAM actions read
octeontx2-af: npc: cn20k: Initialize default-rule index outputs up front
octeontx2-af: npc: cn20k: Tear down default MCAM rules explicitly on free
octeontx2-af: npc: cn20k: Reject missing default-rule MCAM indices
Rio Liu (1):
wifi: mac80211: skip ieee80211_verify_sta_ht_mcs_support check in non-strict mode
Robert Marko (1):
net: phy: micrel: fix LAN8814 QSGMII soft reset
Ruijie Li (1):
xfrm: provide message size for XFRM_MSG_MAPPING
Sagarika Sharma (1):
ipv6: update route serial number on NETDEV_CHANGE
Sai Teja Aluvala (1):
Bluetooth: btintel_pcie: treat boot stage bit 12 as warning
SeungJu Cheon (2):
Bluetooth: ISO: Fix data-race on dst in iso_sock_connect()
Bluetooth: ISO: Fix data-race on iso_pi(sk) in socket and HCI event paths
Shardul Bankar (2):
mptcp: use MPJoinSynAckHMacFailure for SynAck HMAC failure
mptcp: use MPTCP_RST_EMPTCP for ACK HMAC validation failure
Shay Drory (4):
net/mlx5: SD: Serialize init/cleanup
net/mlx5: SD, Keep multi-pf debugfs entries on primary
net/mlx5e: SD, Fix missing cleanup on probe error
net/mlx5e: SD, Fix race condition in secondary device probe/remove
Shitalkumar Gandhi (1):
net: rtsn: fix mdio_node leak in rtsn_mdio_alloc()
Siwei Zhang (3):
Bluetooth: L2CAP: Fix null-ptr-deref in l2cap_sock_state_change_cb()
Bluetooth: L2CAP: Fix null-ptr-deref in l2cap_sock_get_sndtimeo_cb()
Bluetooth: L2CAP: Fix null-ptr-deref in l2cap_sock_new_connection_cb()
Tristan Madani (3):
wifi: b43: enforce bounds check on firmware key index in b43_rx()
wifi: b43legacy: enforce bounds check on firmware key index in RX path
Bluetooth: btmtk: validate WMT event SKB length before struct access
Victor Nogueira (1):
selftests/tc-testing: Add tests that force red and sfb to dequeue from child's gso_skb
Victor Nogueria (1):
net/sched: sch_sfb: Replace direct dequeue call with peek and qdisc_dequeue_peeked
Waiman Long (2):
ipvs: Guard access of HK_TYPE_KTHREAD cpumask with RCU
sched/isolation: Make HK_TYPE_KTHREAD an alias of HK_TYPE_DOMAIN
Wei Fang (1):
net: enetc: fix VSI mailbox timeout handling and DMA lifecycle
Weiming Shi (1):
netfilter: nft_fwd_netdev: use recursion counter in neigh egress path
Yilin Zhu (1):
ipv6: xfrm6: release dst on error in xfrm6_rcv_encap()
Yu-Hsiang Tseng (1):
wifi: ath12k: use lockdep_assert_in_rcu_read_lock() for RCU assertions
MAINTAINERS | 6 +
drivers/bluetooth/btintel_pcie.c | 13 +-
drivers/bluetooth/btintel_pcie.h | 2 +-
drivers/bluetooth/btmtk.c | 15 +-
drivers/bluetooth/hci_ath.c | 3 +
drivers/bluetooth/hci_bcsp.c | 3 +
drivers/bluetooth/hci_h4.c | 3 +
drivers/bluetooth/hci_h5.c | 3 +
drivers/bluetooth/virtio_bt.c | 39 ++-
drivers/net/dsa/mt7530.c | 75 +++-
drivers/net/dsa/mt7530.h | 8 +
drivers/net/ethernet/airoha/airoha_eth.c | 6 +-
drivers/net/ethernet/amd/xgbe/xgbe.h | 4 +-
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 16 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c | 29 +-
drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c | 10 +-
drivers/net/ethernet/cortina/gemini.c | 5 +
drivers/net/ethernet/freescale/enetc/enetc.h | 1 +
drivers/net/ethernet/freescale/enetc/enetc_vf.c | 42 ++-
.../ethernet/marvell/octeontx2/af/cn20k/debugfs.c | 33 +-
.../net/ethernet/marvell/octeontx2/af/cn20k/npc.c | 382 ++++++++++++++-------
.../net/ethernet/marvell/octeontx2/af/cn20k/npc.h | 24 +-
.../net/ethernet/marvell/octeontx2/af/rvu_nix.c | 3 +
.../net/ethernet/marvell/octeontx2/af/rvu_npc.c | 231 +++++++++++--
.../net/ethernet/marvell/octeontx2/af/rvu_npc_fs.c | 30 +-
.../net/ethernet/mellanox/mlx5/core/en_accel/psp.c | 36 +-
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +-
drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c | 114 +++++-
drivers/net/ethernet/mellanox/mlx5/core/lib/sd.h | 2 +
drivers/net/ethernet/meta/fbnic/fbnic_netdev.c | 3 +-
.../net/ethernet/microchip/sparx5/sparx5_main.h | 10 +-
.../net/ethernet/microchip/sparx5/sparx5_port.c | 3 +-
drivers/net/ethernet/microsoft/mana/gdma_main.c | 40 ++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 10 +-
drivers/net/ethernet/microsoft/mana/shm_channel.c | 5 -
drivers/net/ethernet/renesas/rtsn.c | 6 +-
.../net/ethernet/stmicro/stmmac/dwmac-nuvoton.c | 2 +
drivers/net/ethernet/wangxun/libwx/wx_hw.c | 7 +-
drivers/net/ethernet/wangxun/libwx/wx_vf_common.c | 4 +-
drivers/net/netdevsim/netdev.c | 3 +-
drivers/net/netdevsim/netdevsim.h | 4 +-
drivers/net/netdevsim/psp.c | 65 +++-
drivers/net/ovpn/io.c | 7 +
drivers/net/phy/bcm-phy-lib.c | 9 +
drivers/net/phy/bcm-phy-lib.h | 1 +
drivers/net/phy/bcm7xxx.c | 14 +
drivers/net/phy/broadcom.c | 5 +
drivers/net/phy/micrel.c | 15 +-
drivers/net/usb/asix_devices.c | 2 +
drivers/net/usb/cdc_ncm.c | 8 +
drivers/net/usb/r8152.c | 1 +
drivers/net/veth.c | 3 +-
drivers/net/wan/fsl_ucc_hdlc.c | 9 +-
drivers/net/wireless/ath/ath10k/Kconfig | 1 +
drivers/net/wireless/ath/ath12k/core.c | 77 +++--
drivers/net/wireless/ath/ath12k/dp_rx.c | 5 +-
drivers/net/wireless/ath/ath12k/mac.c | 2 +-
drivers/net/wireless/ath/ath12k/p2p.c | 2 +-
drivers/net/wireless/ath/ath12k/wmi.c | 105 +++++-
drivers/net/wireless/ath/ath5k/base.c | 3 +-
drivers/net/wireless/broadcom/b43/xmit.c | 3 +-
drivers/net/wireless/broadcom/b43legacy/xmit.c | 3 +-
.../wireless/broadcom/brcm80211/brcmfmac/sdio.c | 6 +-
drivers/net/wireless/marvell/libertas/if_usb.c | 6 +-
drivers/net/wireless/rsi/rsi_common.h | 5 +-
drivers/net/wireless/st/cw1200/pm.c | 2 -
drivers/net/wwan/t7xx/t7xx_modem_ops.c | 20 +-
drivers/net/wwan/t7xx/t7xx_port_ctrl_msg.c | 18 +-
drivers/net/wwan/t7xx/t7xx_port_proxy.h | 2 +-
include/linux/netfilter/x_tables.h | 8 +
include/linux/sched/isolation.h | 6 +-
include/net/bluetooth/hci_core.h | 2 +-
include/net/dropreason-core.h | 6 +
include/net/ip_vs.h | 31 +-
include/net/ipv6.h | 3 +
include/net/mana/shm_channel.h | 6 +
include/net/netfilter/nf_dup_netdev.h | 13 +
include/net/netfilter/nf_flow_table.h | 4 +-
include/net/netns/ipv4.h | 2 +-
net/bluetooth/bnep/core.c | 13 +-
net/bluetooth/hci_conn.c | 124 +++++--
net/bluetooth/hci_event.c | 31 +-
net/bluetooth/hidp/core.c | 27 +-
net/bluetooth/iso.c | 56 +--
net/bluetooth/l2cap_core.c | 14 +-
net/bluetooth/l2cap_sock.c | 9 +
net/bluetooth/rfcomm/core.c | 7 +-
net/bluetooth/sco.c | 62 ++--
net/core/dev.c | 2 +-
net/core/netpoll.c | 23 +-
net/core/rtnetlink.c | 1 +
net/ipv4/ah4.c | 14 +-
net/ipv4/esp4.c | 3 +-
net/ipv4/igmp.c | 58 ++--
net/ipv4/inetpeer.c | 3 +-
net/ipv4/ip_output.c | 2 +
net/ipv4/ipmr.c | 10 +-
net/ipv4/netfilter/nf_socket_ipv4.c | 3 +
net/ipv4/tcp_ipv4.c | 14 +-
net/ipv4/tcp_minisocks.c | 2 +-
net/ipv6/Kconfig | 4 +-
net/ipv6/ah6.c | 14 +-
net/ipv6/esp6.c | 3 +-
net/ipv6/exthdrs_core.c | 7 +
net/ipv6/ip6_gre.c | 5 +-
net/ipv6/ip6_input.c | 5 +
net/ipv6/ip6_output.c | 5 +
net/ipv6/ip6_tunnel.c | 4 +
net/ipv6/netfilter/nf_socket_ipv6.c | 5 +-
net/ipv6/route.c | 5 +
net/ipv6/tcp_ipv6.c | 17 +-
net/ipv6/xfrm6_protocol.c | 4 +-
net/mac80211/mlme.c | 18 +-
net/mac80211/rx.c | 6 +-
net/mac80211/tests/chan-mode.c | 1 +
net/mac80211/util.c | 4 +-
net/mctp/test/route-test.c | 2 +-
net/mctp/test/utils.c | 2 +-
net/mptcp/fastopen.c | 4 +-
net/mptcp/pm.c | 62 ++--
net/mptcp/pm_kernel.c | 13 +-
net/mptcp/sockopt.c | 4 +
net/mptcp/subflow.c | 4 +-
net/netfilter/ipvs/ip_vs_conn.c | 74 ++--
net/netfilter/ipvs/ip_vs_core.c | 2 +-
net/netfilter/ipvs/ip_vs_ctl.c | 164 ++++++---
net/netfilter/ipvs/ip_vs_est.c | 83 +++--
net/netfilter/nf_dup_netdev.c | 16 -
net/netfilter/nf_flow_table_core.c | 1 +
net/netfilter/nf_flow_table_ip.c | 151 ++++++--
net/netfilter/nf_flow_table_path.c | 7 +-
net/netfilter/nf_tables_api.c | 35 +-
net/netfilter/nf_tables_core.c | 2 +-
net/netfilter/nft_compat.c | 45 ++-
net/netfilter/nft_exthdr.c | 2 +-
net/netfilter/nft_fwd_netdev.c | 29 +-
net/netfilter/nft_osf.c | 2 +-
net/netfilter/nft_tproxy.c | 8 +-
net/netfilter/x_tables.c | 79 ++++-
net/netfilter/xt_CT.c | 8 +-
net/netfilter/xt_TCPMSS.c | 33 +-
net/netfilter/xt_TPROXY.c | 11 +-
net/netfilter/xt_addrtype.c | 25 +-
net/netfilter/xt_devgroup.c | 18 +-
net/netfilter/xt_ecn.c | 4 +
net/netfilter/xt_hashlimit.c | 4 +-
net/netfilter/xt_osf.c | 3 +
net/netfilter/xt_physdev.c | 24 +-
net/netfilter/xt_policy.c | 24 +-
net/netfilter/xt_set.c | 39 ++-
net/netfilter/xt_tcpmss.c | 4 +
net/openvswitch/vport-geneve.c | 5 +-
net/openvswitch/vport-gre.c | 5 +-
net/openvswitch/vport-netdev.c | 64 ++--
net/openvswitch/vport-netdev.h | 2 +-
net/openvswitch/vport-vxlan.c | 5 +-
net/psp/psp_main.c | 42 ++-
net/rds/message.c | 20 +-
net/sched/sch_cake.c | 153 +++++----
net/sched/sch_fq_codel.c | 39 ++-
net/sched/sch_pie.c | 14 +-
net/sched/sch_red.c | 2 +-
net/sched/sch_sfb.c | 2 +-
net/sched/sch_sfq.c | 48 +--
net/smc/af_smc.c | 8 +-
net/tls/tls_sw.c | 6 +-
net/unix/af_unix.c | 3 +
net/unix/garbage.c | 6 +-
net/vmw_vsock/virtio_transport_common.c | 4 +-
net/wireless/nl80211.c | 27 ++
net/wireless/pmsr.c | 2 +-
net/xdp/xsk.c | 115 ++++---
net/xdp/xsk_buff_pool.c | 3 +
net/xfrm/xfrm_output.c | 20 +-
net/xfrm/xfrm_state.c | 12 +-
net/xfrm/xfrm_user.c | 1 +
tools/testing/selftests/drivers/net/hw/Makefile | 1 +
tools/testing/selftests/drivers/net/hw/config | 5 +
.../selftests/drivers/net/hw/ipsec_vxlan.py | 204 +++++++++++
tools/testing/selftests/drivers/net/lib/py/load.py | 5 +-
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/mptcp/mptcp_lib.sh | 16 +-
tools/testing/selftests/net/mptcp/pm_netlink.sh | 20 +-
.../selftests/net/openvswitch/openvswitch.sh | 37 ++
.../testing/selftests/net/openvswitch/ovs-dpctl.py | 19 +-
tools/testing/selftests/net/ovpn/test.sh | 4 +-
tools/testing/selftests/net/tcp_ecmp_failover.sh | 216 ++++++++++++
tools/testing/selftests/net/tls.c | 43 +++
.../tc-testing/tc-tests/infra/qdiscs.json | 148 ++++++++
189 files changed, 3485 insertions(+), 1160 deletions(-)
create mode 100755 tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py
create mode 100755 tools/testing/selftests/net/tcp_ecmp_failover.sh
* Re: [PATCH net v1 1/2] dt-bindings: ethernet: eswin: refine delay model and HSP register description
From: Conor Dooley @ 2026-05-07 17:24 UTC
To: lizhi2
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, robh, krzk+dt,
conor+dt, netdev, devicetree, linux-kernel, mcoquelin.stm32,
alexandre.torgue, rmk+kernel, maxime.chevallier, linux-stm32,
linux-arm-kernel, ningyu, linmin, pinkesh.vaghela, pritesh.patel,
weishangjuan
In-Reply-To: <20260507083136.175-1-lizhi2@eswincomputing.com>
On Thu, May 07, 2026 at 04:31:36PM +0800, lizhi2@eswincomputing.com wrote:
> From: Zhi Li <lizhi2@eswincomputing.com>
>
> Refine the EIC7700 Ethernet dt-binding based on observed hardware behavior
> and clarify the original delay model for eth0.
>
> The previous binding used an enum-based definition for
> rx-internal-delay-ps and tx-internal-delay-ps. Replace it with a
> range-based model using:
>
> - minimum: 0
> - maximum: 2540
> - multipleOf: 20
>
> This better reflects the actual hardware implementation, which
> supports 20ps granularity delay steps in the MAC RGMII interface.
>
> The tx/rx internal delay values are clarified as MAC-side programmable
> delay components applied on the RGMII clock/data path, representing
> the effective delay seen at the MAC interface.
>
> This does not change the intended hardware semantics, but aligns the
> binding with the actual hardware implementation.
>
> These properties are optional and only required when MAC-side fine
> tuning is needed; otherwise delay alignment is provided by PHY or
> board design.
>
> Depending on the selected RGMII timing mode, delay alignment may be
> provided by the PHY (e.g. rgmii-id) or by board/MAC-side configuration.
> When PHY or board design already provides the required delay, these
> MAC-side properties may be omitted. When MAC-side fine tuning is
> required, they should be provided to describe the internal RGMII
> timing adjustment.
>
> Additionally, extend the description of the HSP subsystem register
> layout used by the MAC glue logic. This includes explicit TXD and RXD
> delay control registers to ensure deterministic initialization and
> to override any residual configuration potentially left by bootloaders.
>
> Add reference to the EIC7700X SoC Technical Reference Manual,
> Chapter 10 ("High-Speed Interface"), Part 4 for background of the
> HSP CSR block:
> https://github.com/eswincomputing/EIC7700X-SoC-Technical-Reference-Manual/releases
>
> There are no in-tree users of this binding, so no ABI impact is
> expected.
>
> Fixes: 888bd0eca93c ("dt-bindings: ethernet: eswin: Document for EIC7700 SoC")
> Signed-off-by: Zhi Li <lizhi2@eswincomputing.com>
> ---
While this is v1, it's really v8 and there should therefore be a
changelog that explains where my ack and the new compatible went.
Cheers,
Conor.
> .../bindings/net/eswin,eic7700-eth.yaml | 50 +++++++++++++------
> 1 file changed, 36 insertions(+), 14 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/net/eswin,eic7700-eth.yaml b/Documentation/devicetree/bindings/net/eswin,eic7700-eth.yaml
> index 91e8cd1db67b..fab95603bd82 100644
> --- a/Documentation/devicetree/bindings/net/eswin,eic7700-eth.yaml
> +++ b/Documentation/devicetree/bindings/net/eswin,eic7700-eth.yaml
> @@ -63,16 +63,39 @@ properties:
> - const: stmmaceth
>
> rx-internal-delay-ps:
> - enum: [0, 200, 600, 1200, 1600, 1800, 2000, 2200, 2400]
> + minimum: 0
> + maximum: 2540
> + multipleOf: 20
> + description:
> + RX internal delay in picoseconds applied on the RGMII clock at the MAC
> + side. The hardware supports 20 ps steps.
> + This property is optional and only needed when MAC-side delay tuning
> + is required.
>
> tx-internal-delay-ps:
> - enum: [0, 200, 600, 1200, 1600, 1800, 2000, 2200, 2400]
> + minimum: 0
> + maximum: 2540
> + multipleOf: 20
> + description:
> + TX internal delay in picoseconds applied on the RGMII clock at the MAC
> + side. The hardware supports 20 ps steps.
> + This property is optional and only needed when MAC-side delay tuning
> + is required.
>
> eswin,hsp-sp-csr:
> description:
> HSP CSR is to control and get status of different high-speed peripherals
> (such as Ethernet, USB, SATA, etc.) via register, which can tune
> board-level's parameters of PHY, etc.
> +
> + Additional background information about the High-Speed Subsystem
> + and the HSP CSR block is available in Chapter 10 ("High-Speed Interface")
> + of the EIC7700X SoC Technical Reference Manual, Part 4
> + (EIC7700X_SoC_Technical_Reference_Manual_Part4.pdf). The manual is
> + publicly available at
> + https://github.com/eswincomputing/EIC7700X-SoC-Technical-Reference-Manual/releases
> +
> + This reference is provided for background information only.
> $ref: /schemas/types.yaml#/definitions/phandle-array
> items:
> - items:
> @@ -82,6 +105,8 @@ properties:
> - description: Offset of AXI clock controller Low-Power request
> register
> - description: Offset of register controlling TX/RX clock delay
> + - description: Offset of register controlling TXD delay
> + - description: Offset of register controlling RXD delay
>
> required:
> - compatible
> @@ -93,8 +118,6 @@ required:
> - phy-mode
> - resets
> - reset-names
> - - rx-internal-delay-ps
> - - tx-internal-delay-ps
> - eswin,hsp-sp-csr
>
> unevaluatedProperties: false
> @@ -104,24 +127,23 @@ examples:
> ethernet@50400000 {
> compatible = "eswin,eic7700-qos-eth", "snps,dwmac-5.20";
> reg = <0x50400000 0x10000>;
> - clocks = <&d0_clock 186>, <&d0_clock 171>, <&d0_clock 40>,
> - <&d0_clock 193>;
> - clock-names = "axi", "cfg", "stmmaceth", "tx";
> interrupt-parent = <&plic>;
> interrupts = <61>;
> interrupt-names = "macirq";
> - phy-mode = "rgmii-id";
> - phy-handle = <&phy0>;
> + clocks = <&d0_clock 186>, <&d0_clock 171>, <&d0_clock 40>,
> + <&d0_clock 193>;
> + clock-names = "axi", "cfg", "stmmaceth", "tx";
> resets = <&reset 95>;
> reset-names = "stmmaceth";
> - rx-internal-delay-ps = <200>;
> - tx-internal-delay-ps = <200>;
> - eswin,hsp-sp-csr = <&hsp_sp_csr 0x100 0x108 0x118>;
> - snps,axi-config = <&stmmac_axi_setup>;
> + eswin,hsp-sp-csr = <&hsp_sp_csr 0x100 0x108 0x118 0x114 0x11c>;
> + phy-handle = <&phy0>;
> + phy-mode = "rgmii-id";
> snps,aal;
> snps,fixed-burst;
> snps,tso;
> - stmmac_axi_setup: stmmac-axi-config {
> + snps,axi-config = <&stmmac_axi_setup_gmac0>;
> +
> + stmmac_axi_setup_gmac0: stmmac-axi-config {
> snps,blen = <0 0 0 0 16 8 4>;
> snps,rd_osr_lmt = <2>;
> snps,wr_osr_lmt = <2>;
> --
> 2.25.1
>
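The range-based delay model in the quoted binding (minimum 0, maximum 2540, multipleOf 20) can be sketched as a small check. This is only an illustration of the three constraints, not the dt-schema implementation; the function names are hypothetical, and how a driver maps step counts to register bits is not described in the binding and is not modelled here.

```python
# Bounds copied from the quoted binding; all names below are hypothetical.
DELAY_MIN_PS = 0
DELAY_MAX_PS = 2540
DELAY_STEP_PS = 20

# Values accepted by the enum that the patch removes.
OLD_ENUM_PS = [0, 200, 600, 1200, 1600, 1800, 2000, 2200, 2400]


def is_valid_internal_delay(delay_ps: int) -> bool:
    """Return True if delay_ps satisfies minimum, maximum and multipleOf."""
    in_range = DELAY_MIN_PS <= delay_ps <= DELAY_MAX_PS
    on_step = delay_ps % DELAY_STEP_PS == 0
    return in_range and on_step


def delay_to_steps(delay_ps: int) -> int:
    """Convert a valid delay in picoseconds to a count of 20 ps steps."""
    if not is_valid_internal_delay(delay_ps):
        raise ValueError(f"invalid internal delay: {delay_ps} ps")
    return delay_ps // DELAY_STEP_PS


if __name__ == "__main__":
    for value in (0, 200, 210, 2540, 2560):
        print(value, is_valid_internal_delay(value))
```

Every value in the old enum is a multiple of 20 no larger than 2540, so a device tree that was valid under the enum still passes the range check.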
* Re: [syzbot] [kernel?] WARNING: ODEBUG bug in smpboot_thread_fn
From: Ido Schimmel @ 2026-05-07 17:30 UTC
To: Thomas Gleixner
Cc: syzbot, linux-kernel, peterz, syzkaller-bugs, bridge,
Nikolay Aleksandrov, netdev
In-Reply-To: <87bjerwqan.ffs@tglx>
On Thu, May 07, 2026 at 10:57:04AM +0200, Thomas Gleixner wrote:
> On Wed, May 06 2026 at 18:29, Thomas Gleixner wrote:
> > On Mon, May 04 2026 at 05:23, syzbot wrote:
> >>
> >> ------------[ cut here ]------------
> >> ODEBUG: free active (active state 0) object: ffff888033a47278 object type: timer_list hint: br_ip6_multicast_port_query_expired+0x0/0x380 net/bridge/br_multicast.c:-1
> >
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > An object which contains an active timer is RCU freed....
>
> Unlike the other timer in the same object, the own_query timer is not
> shut down in br_multicast_port_ctx_deinit()
>
> Something like the below.
>
> Thanks,
>
> tglx
> ---
> --- a/net/bridge/br_multicast.c
> +++ b/net/bridge/br_multicast.c
> @@ -2030,8 +2030,10 @@ void br_multicast_port_ctx_deinit(struct
>
> #if IS_ENABLED(CONFIG_IPV6)
> timer_delete_sync(&pmctx->ip6_mc_router_timer);
> + timer_delete_sync(&pmctx->ip6_own_query_timer);
> #endif
> timer_delete_sync(&pmctx->ip4_mc_router_timer);
> + timer_delete_sync(&pmctx->ip4_own_query_timer);
>
> spin_lock_bh(&br->multicast_lock);
> del |= br_ip6_multicast_rport_del(pmctx);
Thanks for the report and the fix. It looks correct, but it's unclear to
me which commit to blame.
AFAICT, the trace tells us that the timer is pending (not executing)
when the object that contains it is RCU freed. However, it shouldn't be
possible for the timer to be pending at this stage since it is
deactivated when the port multicast context is disabled and it is only
reactivated if the context is not disabled.
So, I see two options:
1. We did not disable the port multicast context.
2. We did disable the port multicast context, but the timer somehow got
reactivated.
I will look into it...
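For illustration only, the invariant behind the warning can be modeled in userspace: an object must not be freed while any timer embedded in it is still active. This is a toy sketch, not kernel code; every model_* name is hypothetical, and the two timers stand in for the router and own_query timers discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct timer_list and the port multicast context. */
struct model_timer {
	bool pending;
};

struct model_port_ctx {
	struct model_timer mc_router_timer;
	struct model_timer own_query_timer;
};

/* Stand-in for timer_delete_sync(): once it returns, the timer is inactive. */
static void model_timer_delete_sync(struct model_timer *t)
{
	assert(t != NULL);
	t->pending = false;
}

/* Deinit path; shut_down_own_query selects whether the extra call is made. */
static void model_ctx_deinit(struct model_port_ctx *ctx, bool shut_down_own_query)
{
	assert(ctx != NULL);
	model_timer_delete_sync(&ctx->mc_router_timer);
	if (shut_down_own_query)
		model_timer_delete_sync(&ctx->own_query_timer);
}

/* The condition the warning reports: freeing while a timer is still active. */
static bool model_free_would_warn(const struct model_port_ctx *ctx)
{
	return ctx->mc_router_timer.pending || ctx->own_query_timer.pending;
}
```

In this model the free is only clean if the own_query timer is inactive by the time deinit finishes, whichever path was supposed to deactivate it earlier.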
^ permalink raw reply
* [PATCH net-next 0/3] net/mlx5: Steering misc enhancements
From: Tariq Toukan @ 2026-05-07 17:34 UTC (permalink / raw)
To: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Yevgeny Kliteynik, Vlad Dogaru, Simon Horman, Kees Cook,
Alex Vesker, Erez Shitrit, netdev, linux-rdma, linux-kernel,
Gal Pressman, Dragos Tatulea
Hi,
This small series by Yevgeny contains a few steering enhancements /
cleanups.
Regards,
Tariq
Yevgeny Kliteynik (3):
net/mlx5: HWS, Check if device is down while polling for completion
net/mlx5: HWS, Handle destroying table that has a miss table
net/mlx5: DR, Remove unused field of struct mlx5dr_matcher_rx_tx
.../ethernet/mellanox/mlx5/core/steering/hws/bwc.c | 12 ++++++++++++
.../ethernet/mellanox/mlx5/core/steering/hws/table.c | 3 +++
.../mellanox/mlx5/core/steering/sws/dr_types.h | 1 -
3 files changed, 15 insertions(+), 1 deletion(-)
base-commit: dacf281771a9aed1a723b196120a0de8637910b9
--
2.44.0
^ permalink raw reply
* [PATCH net-next 1/3] net/mlx5: HWS, Check if device is down while polling for completion
From: Tariq Toukan @ 2026-05-07 17:34 UTC (permalink / raw)
To: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Yevgeny Kliteynik, Vlad Dogaru, Simon Horman, Kees Cook,
Alex Vesker, Erez Shitrit, netdev, linux-rdma, linux-kernel,
Gal Pressman, Dragos Tatulea, Shay Drori
In-Reply-To: <20260507173443.320465-1-tariqt@nvidia.com>
From: Yevgeny Kliteynik <kliteyn@nvidia.com>
In case the device is down for any reason (e.g. FLR),
the HW will no longer generate completions - no point
polling and waiting for timeout.
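As a rough userspace sketch of the idea (not the driver code), the early exit can be modeled like this. The model_* names, the max_polls timeout stand-in and the "one completion per poll" behavior are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device states; the driver checks its own internal-error state. */
enum model_dev_state { MODEL_DEV_UP, MODEL_DEV_INTERNAL_ERROR };

struct model_queue {
	enum model_dev_state state;
	unsigned int pending;   /* completions still outstanding */
	unsigned int polls;     /* poll attempts made so far */
	unsigned int max_polls; /* stand-in for the polling timeout */
};

/* Returns 0 when all completions were reaped, -ETIMEDOUT otherwise. */
static int model_queue_poll(struct model_queue *q)
{
	assert(q != NULL);

	/* Device down: no completion will ever arrive, so do not poll at all. */
	if (q->state == MODEL_DEV_INTERNAL_ERROR)
		return -ETIMEDOUT;

	while (q->pending) {
		if (q->polls++ >= q->max_polls)
			return -ETIMEDOUT;
		q->pending--; /* pretend each poll reaps one completion */
	}
	return 0;
}
```

Returning the same error code as a real timeout lets callers that already handle timeouts treat both cases the same way, without spending the timeout period first.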
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
.../ethernet/mellanox/mlx5/core/steering/hws/bwc.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/hws/bwc.c b/drivers/net/ethernet/mellanox/mlx5/core/steering/hws/bwc.c
index 6dcd9c2a78aa..eae02bc74221 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/steering/hws/bwc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/hws/bwc.c
@@ -422,6 +422,18 @@ int mlx5hws_bwc_queue_poll(struct mlx5hws_context *ctx,
if (!got_comp && !drain)
return 0;
+ if (unlikely(ctx->mdev->state == MLX5_DEVICE_STATE_INTERNAL_ERROR)) {
+ /* If the device is down for any reason (e.g. FLR), the HW will
+ * no longer generate completions.
+ * Note that ETIMEDOUT is returned here because the BWC layer
+ * already has a special handling for timeouts - it breaks the
+ * rehash / resize / shrink loops to avoid chain of timeouts.
+ */
+ mlx5_core_warn_once(ctx->mdev,
+ "BWC poll: device is down, polling for completion aborted\n");
+ return -ETIMEDOUT;
+ }
+
queue_full = mlx5hws_send_engine_full(&ctx->send_queue[queue_id]);
while (queue_full || ((got_comp || drain) && *pending_rules)) {
ret = mlx5hws_send_queue_poll(ctx, queue_id, comp, burst_th);
--
2.44.0
^ permalink raw reply related
* [PATCH net-next 2/3] net/mlx5: HWS, Handle destroying table that has a miss table
From: Tariq Toukan @ 2026-05-07 17:34 UTC (permalink / raw)
To: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Yevgeny Kliteynik, Vlad Dogaru, Simon Horman, Kees Cook,
Alex Vesker, Erez Shitrit, netdev, linux-rdma, linux-kernel,
Gal Pressman, Dragos Tatulea, Moshe Shemesh
In-Reply-To: <20260507173443.320465-1-tariqt@nvidia.com>
From: Yevgeny Kliteynik <kliteyn@nvidia.com>
If a table has a miss table that was created by the
'mlx5hws_table_set_default_miss' API function, its miss_tbl
keeps the table that points to it in a list.
If such a table is deleted, we also need to remove it from the
miss_tbl list; otherwise the node in the miss_tbl list will contain
garbage.
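A minimal userspace sketch of the problem, assuming a kernel-style circular doubly linked list. The model_* types and the field names dependents and miss_link are hypothetical and do not mirror the driver's structures.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct model_list {
	struct model_list *next, *prev;
};

static void model_list_init(struct model_list *n)
{
	n->next = n;
	n->prev = n;
}

static void model_list_add(struct model_list *n, struct model_list *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink the node and leave it pointing at itself, like list_del_init(). */
static void model_list_del_init(struct model_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	model_list_init(n);
}

static size_t model_list_len(const struct model_list *head)
{
	size_t len = 0;
	const struct model_list *p;

	for (p = head->next; p != head; p = p->next)
		len++;
	return len;
}

struct model_table {
	struct model_list dependents; /* tables that use this one as miss table */
	struct model_list miss_link;  /* this table's node in miss_tbl->dependents */
	struct model_table *miss_tbl;
};

static void model_table_init(struct model_table *t)
{
	model_list_init(&t->dependents);
	model_list_init(&t->miss_link);
	t->miss_tbl = NULL;
}

static void model_set_default_miss(struct model_table *t, struct model_table *miss)
{
	assert(t != NULL && miss != NULL);
	t->miss_tbl = miss;
	model_list_add(&t->miss_link, &miss->dependents);
}

/* Destroy path: unlink from the miss table before the memory goes away. */
static void model_table_destroy(struct model_table *t)
{
	assert(t != NULL);
	if (t->miss_tbl)
		model_list_del_init(&t->miss_link);
	t->miss_tbl = NULL;
}
```

Without the unlink step in the destroy path, the miss table's list would still reference a node inside memory that is about to be freed.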
Signed-off-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/steering/hws/table.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/hws/table.c b/drivers/net/ethernet/mellanox/mlx5/core/steering/hws/table.c
index bd292485a25b..dd7927983ab2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/steering/hws/table.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/hws/table.c
@@ -282,6 +282,9 @@ int mlx5hws_table_destroy(struct mlx5hws_table *tbl)
goto unlock_err;
}
+ if (tbl->default_miss.miss_tbl)
+ list_del_init(&tbl->default_miss.next);
+
list_del_init(&tbl->tbl_list_node);
mutex_unlock(&ctx->ctrl_lock);
--
2.44.0
^ permalink raw reply related
* [PATCH net-next 3/3] net/mlx5: DR, Remove unused field of struct mlx5dr_matcher_rx_tx
From: Tariq Toukan @ 2026-05-07 17:34 UTC (permalink / raw)
To: Christoph Paasch, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Andrew Lunn, David S. Miller
Cc: Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch,
Yevgeny Kliteynik, Vlad Dogaru, Simon Horman, Kees Cook,
Alex Vesker, Erez Shitrit, netdev, linux-rdma, linux-kernel,
Gal Pressman, Dragos Tatulea
In-Reply-To: <20260507173443.320465-1-tariqt@nvidia.com>
From: Yevgeny Kliteynik <kliteyn@nvidia.com>
Remove a field that was never used.
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Erez Shitrit <erezsh@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/steering/sws/dr_types.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/steering/sws/dr_types.h b/drivers/net/ethernet/mellanox/mlx5/core/steering/sws/dr_types.h
index cc328292bf84..e0344707f522 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/steering/sws/dr_types.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/steering/sws/dr_types.h
@@ -986,7 +986,6 @@ struct mlx5dr_matcher_rx_tx {
[DR_RULE_MAX_STES];
u8 num_of_builders;
u8 num_of_builders_arr[DR_RULE_IPV_MAX][DR_RULE_IPV_MAX];
- u64 default_icm_addr;
struct mlx5dr_table_rx_tx *nic_tbl;
u32 prio;
struct list_head list_node;
--
2.44.0
^ permalink raw reply related
* Re: [net-next v3 1/5] dt-bindings: net: starfive,jh7110-dwmac: Remove jh8100
From: Conor Dooley @ 2026-05-07 17:36 UTC (permalink / raw)
To: Minda Chen
Cc: Alexandre Torgue, Andrew Lunn, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Maxime Coquelin,
Emil Renner Berthing, Rob Herring, Krzysztof Kozlowski, netdev,
linux-kernel, linux-stm32, devicetree
In-Reply-To: <20260507094115.8355-2-minda.chen@starfivetech.com>
[-- Attachment #1: Type: text/plain, Size: 318 bytes --]
On Thu, May 07, 2026 at 05:41:11PM +0800, Minda Chen wrote:
> Remove the jh8100 dt-bindings because it is not supported now.
> StarFive has stopped jh8100 development and will not release
> it externally.
>
> Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
Acked-by: Conor Dooley <conor.dooley@microchip.com>
^ permalink raw reply
* Re: [net-next v3 3/5] dt-bindings: net: starfive,jh7110-dwmac: Add jhb100 sgmii rx clk
From: Conor Dooley @ 2026-05-07 17:42 UTC (permalink / raw)
To: Minda Chen
Cc: Alexandre Torgue, Andrew Lunn, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Maxime Coquelin,
Emil Renner Berthing, Rob Herring, Krzysztof Kozlowski, netdev,
linux-kernel, linux-stm32, devicetree
In-Reply-To: <20260507094115.8355-4-minda.chen@starfivetech.com>
[-- Attachment #1: Type: text/plain, Size: 3320 bytes --]
On Thu, May 07, 2026 at 05:41:13PM +0800, Minda Chen wrote:
> The jhb100 SGMII interface has split tx/rx MAC clocks and requires
> the clock rate to be set for 10M/100M/1000M speeds. So a new rx
> clock needs to be added in the code, dts and dt binding doc.
> With this, the jhb100 SGMII interface contains 6 clocks, while the
> RMII/RGMII interface still contains 5 clocks.
Why is this not being done in the commit adding the jhb100 in the first
place?
>
> Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
> ---
> .../bindings/net/starfive,jh7110-dwmac.yaml | 42 ++++++++++++++++---
> 1 file changed, 36 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/net/starfive,jh7110-dwmac.yaml b/Documentation/devicetree/bindings/net/starfive,jh7110-dwmac.yaml
> index 06aeaa0f6f00..af160a8dedb8 100644
> --- a/Documentation/devicetree/bindings/net/starfive,jh7110-dwmac.yaml
> +++ b/Documentation/devicetree/bindings/net/starfive,jh7110-dwmac.yaml
> @@ -39,20 +39,18 @@ properties:
> maxItems: 1
>
> clocks:
> + minItems: 5
> items:
> - description: GMAC main clock
> - description: GMAC AHB clock
> - description: PTP clock
> - description: TX clock
> - description: GTX clock
> + - description: SGMII RX clock
>
> clock-names:
> - items:
> - - const: stmmaceth
> - - const: pclk
> - - const: ptp_ref
> - - const: tx
> - - const: gtx
> + minItems: 5
> + maxItems: 6
>
> starfive,tx-use-rgmii-clk:
> description:
> @@ -99,6 +97,18 @@ allOf:
> minItems: 2
> maxItems: 2
>
> + clocks:
> + minItems: 5
> + maxItems: 5
This can just be "maxItems: 5", since minItems is set outside the
conditional to 5.
> +
> + clock-names:
> + items:
> + - const: stmmaceth
> + - const: pclk
> + - const: ptp_ref
> + - const: tx
> + - const: gtx
> +
> resets:
> maxItems: 1
>
> @@ -111,6 +121,26 @@ allOf:
> contains:
> const: starfive,jh7110-dwmac
> then:
> + properties:
> + clocks:
> + minItems: 5
> + maxItems: 6
Remove these constraints, since they don't do anything more than the
outside ones do.
> +
> + clock-names:
> + oneOf:
> + - items:
> + - const: stmmaceth
> + - const: pclk
> + - const: ptp_ref
> + - const: tx
> + - const: gtx
> + - items:
> + - const: stmmaceth
> + - const: pclk
> + - const: ptp_ref
> + - const: tx
> + - const: gtx
> + - const: sgmii_rx
Can't you just leave this list outside the conditional section, and add
the extra item to the end? The only difference appears to be the
sgmii_rx clock, and it's at the end.
I'm also not really convinced that this flexibility is required, unless
there are some controllers on the platform that do not support sgmii.
pw-bot: changes-requested
Cheers,
Conor.
> if:
> properties:
> compatible:
> --
> 2.17.1
>
^ permalink raw reply
* Re: [GIT PULL] Networking for v7.1-rc3
From: pr-tracker-bot @ 2026-05-07 17:42 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: torvalds, kuba, davem, netdev, linux-kernel, pabeni
In-Reply-To: <20260507172147.3509230-1-kuba@kernel.org>
The pull request you sent on Thu, 7 May 2026 10:21:47 -0700:
> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git tags/net-7.1-rc3
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/fcee7d82f27d6a8b1ddc5bbefda59b4e441e9bc0
Thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html
^ permalink raw reply
* Re: [net-next v3 4/5] net: stmmac: starfive: Add jhb100 SGMII interface
From: Conor Dooley @ 2026-05-07 17:44 UTC (permalink / raw)
To: Minda Chen
Cc: Alexandre Torgue, Andrew Lunn, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Maxime Coquelin,
Emil Renner Berthing, Rob Herring, Krzysztof Kozlowski, netdev,
linux-kernel, linux-stm32, devicetree
In-Reply-To: <20260507094115.8355-5-minda.chen@starfivetech.com>
[-- Attachment #1: Type: text/plain, Size: 925 bytes --]
On Thu, May 07, 2026 at 05:41:14PM +0800, Minda Chen wrote:
> Add the jhb100 compatible and SGMII support. The jhb100 SoC contains
> 2 SGMII interfaces integrated with a serdes PHY. SGMII uses split
> TX/RX MAC clocks, which need to be set to a 2.5M/25M/125M TX/RX clock
> rate in 10M/100M/1000M speed mode.
>
> Signed-off-by: Minda Chen <minda.chen@starfivetech.com>
> Reviewed-by: Sai Krishna <saikrishnag@marvell.com>
> @@ -130,6 +160,7 @@ static const struct starfive_dwmac_data jh7100_data = {
> static const struct of_device_id starfive_dwmac_match[] = {
> { .compatible = "starfive,jh7100-dwmac", .data = &jh7100_data },
> { .compatible = "starfive,jh7110-dwmac" },
> + { .compatible = "starfive,jhb100-dwmac" },
You've declared compatibility with the jh7110, why do you also need to
add the new compatible?
> { /* sentinel */ }
> };
> MODULE_DEVICE_TABLE(of, starfive_dwmac_match);
> --
> 2.17.1
>
^ permalink raw reply
* [PATCH net] rtnetlink: add RTEXT_FILTER_TERSE_DUMP support
From: Eric Dumazet @ 2026-05-07 17:45 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, David Ahern, Kuniyuki Iwashima, netdev,
eric.dumazet, Eric Dumazet
iproute2 can spend a considerable amount of time in ll_init_map()
or ll_link_get() dumping verbose netdev attributes, contributing
to RTNL pressure.
Add a new RTEXT_FILTER_TERSE_DUMP flag so that rtnl_fill_ifinfo()
limits its output to:
- struct nlmsghdr
- IFLA_IFNAME
- IFLA_PROP_LIST
We can later avoid using RTNL when RTEXT_FILTER_TERSE_DUMP
is requested, as none of these attributes need RTNL.
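For illustration, a userspace caller could request a terse dump by passing the flag in IFLA_EXT_MASK on an RTM_GETLINK dump request. This is a sketch under assumptions: the flag value is taken from this patch and defined locally because installed UAPI headers may not carry it, and struct terse_dump_req and build_terse_dump_req() are hypothetical helpers. The code only builds the request and does not send it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Value proposed by the patch; defined here in case the headers lack it. */
#ifndef RTEXT_FILTER_TERSE_DUMP
#define RTEXT_FILTER_TERSE_DUMP (1 << 8)
#endif

/* Hypothetical request layout: header, ifinfomsg, one u32 attribute. */
struct terse_dump_req {
	struct nlmsghdr nlh;
	struct ifinfomsg ifm;
	char attrs[RTA_SPACE(sizeof(uint32_t))];
};

static void build_terse_dump_req(struct terse_dump_req *req, uint32_t seq)
{
	uint32_t mask = RTEXT_FILTER_TERSE_DUMP;
	struct rtattr *rta;

	assert(req != NULL);
	memset(req, 0, sizeof(*req));

	req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req->ifm));
	req->nlh.nlmsg_type = RTM_GETLINK;
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	req->nlh.nlmsg_seq = seq;
	req->ifm.ifi_family = AF_UNSPEC;

	/* Append IFLA_EXT_MASK right after the aligned ifinfomsg. */
	rta = (struct rtattr *)((char *)req + NLMSG_ALIGN(req->nlh.nlmsg_len));
	rta->rta_type = IFLA_EXT_MASK;
	rta->rta_len = RTA_LENGTH(sizeof(mask));
	memcpy(RTA_DATA(rta), &mask, sizeof(mask));

	req->nlh.nlmsg_len = NLMSG_ALIGN(req->nlh.nlmsg_len) +
			     RTA_ALIGN(rta->rta_len);
}
```

A kernel without this patch would presumably ignore the unknown bit and return the full attribute set, so a caller along these lines should not assume that the reply is terse.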
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/uapi/linux/rtnetlink.h | 1 +
net/core/rtnetlink.c | 31 ++++++++++++++++++++++---------
2 files changed, 23 insertions(+), 9 deletions(-)
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index dab9493c791b8465c6476990f42c4ee5ae82da2d..4b1dbd554e5c72c90d2416f7ade37956ad5472b7 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -840,6 +840,7 @@ enum {
#define RTEXT_FILTER_CFM_CONFIG (1 << 5)
#define RTEXT_FILTER_CFM_STATUS (1 << 6)
#define RTEXT_FILTER_MST (1 << 7)
+#define RTEXT_FILTER_TERSE_DUMP (1 << 8)
/* End of information exported to user level */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index b613bb6e07df6586aa06e7a59c8384dcedffeeef..7a9a769d142b6826d7fb01b599a4d9e0ae09a97d 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1295,7 +1295,12 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
size = NLMSG_ALIGN(sizeof(struct ifinfomsg))
+ nla_total_size(IFNAMSIZ) /* IFLA_IFNAME */
- + nla_total_size(IFALIASZ) /* IFLA_IFALIAS */
+ + rtnl_prop_list_size(dev);
+
+ if (ext_filter_mask & RTEXT_FILTER_TERSE_DUMP)
+ return size;
+
+ size += nla_total_size(IFALIASZ) /* IFLA_IFALIAS */
+ nla_total_size(IFNAMSIZ) /* IFLA_QDISC */
+ nla_total_size_64bit(sizeof(struct rtnl_link_ifmap))
+ nla_total_size(MAX_ADDR_LEN) /* IFLA_ADDRESS */
@@ -1342,7 +1347,6 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
+ nla_total_size(4) /* IFLA_CARRIER_DOWN_COUNT */
+ nla_total_size(4) /* IFLA_MIN_MTU */
+ nla_total_size(4) /* IFLA_MAX_MTU */
- + rtnl_prop_list_size(dev)
+ nla_total_size(MAX_ADDR_LEN) /* IFLA_PERM_ADDRESS */
+ rtnl_devlink_port_size(dev)
+ rtnl_dpll_pin_size(dev)
@@ -1940,15 +1944,18 @@ static int rtnl_fill_alt_ifnames(struct sk_buff *skb,
struct netdev_name_node *name_node;
int count = 0;
+ rcu_read_lock();
list_for_each_entry_rcu(name_node, &dev->name_node->list, list) {
- if (nla_put_string(skb, IFLA_ALT_IFNAME, name_node->name))
+ if (nla_put_string(skb, IFLA_ALT_IFNAME, name_node->name)) {
+ rcu_read_unlock();
return -EMSGSIZE;
+ }
count++;
}
+ rcu_read_unlock();
return count;
}
-/* RCU protected. */
static int rtnl_fill_prop_list(struct sk_buff *skb,
const struct net_device *dev)
{
@@ -2071,13 +2078,20 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
ifm->ifi_flags = netif_get_flags(dev);
ifm->ifi_change = change;
- if (tgt_netnsid >= 0 && nla_put_s32(skb, IFLA_TARGET_NETNSID, tgt_netnsid))
- goto nla_put_failure;
-
netdev_copy_name(dev, devname);
if (nla_put_string(skb, IFLA_IFNAME, devname))
goto nla_put_failure;
+ if (rtnl_fill_prop_list(skb, dev))
+ goto nla_put_failure;
+
+ if (ext_filter_mask & RTEXT_FILTER_TERSE_DUMP)
+ goto end;
+
+ if (tgt_netnsid >= 0 &&
+ nla_put_s32(skb, IFLA_TARGET_NETNSID, tgt_netnsid))
+ goto nla_put_failure;
+
if (nla_put_u32(skb, IFLA_TXQLEN, READ_ONCE(dev->tx_queue_len)) ||
nla_put_u8(skb, IFLA_OPERSTATE,
netif_running(dev) ? READ_ONCE(dev->operstate) :
@@ -2190,8 +2204,6 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
goto nla_put_failure_rcu;
if (rtnl_fill_link_ifmap(skb, dev))
goto nla_put_failure_rcu;
- if (rtnl_fill_prop_list(skb, dev))
- goto nla_put_failure_rcu;
rcu_read_unlock();
if (dev->dev.parent &&
@@ -2210,6 +2222,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
if (rtnl_fill_dpll_pin(skb, dev))
goto nla_put_failure;
+end:
nlmsg_end(skb, nlh);
return 0;
--
2.54.0.563.g4f69b47b94-goog
^ permalink raw reply related
* Re: [PATCH net] rtnetlink: add RTEXT_FILTER_TERSE_DUMP support
From: Eric Dumazet @ 2026-05-07 17:46 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, David Ahern, Kuniyuki Iwashima, netdev,
eric.dumazet
In-Reply-To: <20260507174547.4125412-1-edumazet@google.com>
On Thu, May 7, 2026 at 10:45 AM Eric Dumazet <edumazet@google.com> wrote:
>
> iproute2 can spend a considerable amount of time in ll_init_map()
> or ll_link_get() dumping verbose netdev attributes, contributing
> to RTNL pressure.
>
> Add a new RTEXT_FILTER_TERSE_DUMP flag so that rtnl_fill_ifinfo()
> limits its output to:
>
> - struct nlmsghdr
> - IFLA_IFNAME
> - IFLA_PROP_LIST
>
> We can later avoid using RTNL when RTEXT_FILTER_TERSE_DUMP
> is requested, as none of these attributes need RTNL.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Wrong patch title, this targets net-next tree (obviously).
Thanks.
^ permalink raw reply
* RE: [PATCH net] net: ena: PHC: Check return code before setting timestamp output
From: Kiyanovski, Arthur @ 2026-05-07 18:09 UTC (permalink / raw)
To: Vadim Fedorenko, David Miller, Jakub Kicinski,
netdev@vger.kernel.org
Cc: Richard Cochran, Eric Dumazet, Paolo Abeni, David Woodhouse,
Thomas Gleixner, Miroslav Lichvar, Andrew Lunn, Wen Gu, Xuan Zhuo,
Woodhouse, David, Sarna, Yuval, Machulsky, Zorik,
Matushevsky, Alexander, Bshara, Saeed, Wilson, Matt,
Liguori, Anthony, Bshara, Nafea, Schmeilin, Evgeny,
Belgazal, Netanel, Saidi, Ali, Herrenschmidt, Benjamin,
Dagan, Noam, Arinzon, David, Ostrovsky, Evgeny, Tabachnik, Ofir,
Bernstein, Amit, stable@vger.kernel.org
In-Reply-To: <6511ab18-250b-436a-a11c-f50e78334666@linux.dev>
> -----Original Message-----
> From: Vadim Fedorenko <vadim.fedorenko@linux.dev>
> Sent: Thursday, May 7, 2026 3:38 AM
> Subject: RE: [EXTERNAL] [PATCH net] net: ena: PHC: Check return code before
> setting timestamp output
> ...
> Just an observation while reviewing - the idea of taking 2 spinlocks while
> reading timestamp doesn't look great and can potentially be CPU-expensive.
> Please, consider refactoring into RCU-style...
Noted, thanks for the review. We'll evaluate whether an RCU-based approach is appropriate here.
Arthur
^ permalink raw reply
* [PATCH net] ice: fix packet corruption due to extraneous page flip
From: John Ousterhout @ 2026-05-07 18:12 UTC (permalink / raw)
To: anthony.l.nguyen
Cc: intel-wired-lan, przemyslaw.kitszel, netdev, John Ousterhout
Consider the following sequence of events:
* The bottom half of a buffer page is filled with data from
packet A. The page has a net reference count (reference count
- bias) of 1. The page is returned to the NIC, flipped to
use the top half.
* Before the reference on the page is released, the NIC returns
the page with no data in it ('size' is zero in ice_clean_rx_irq).
In this case the bias does not get decremented. The page still
has a net reference count of 1, so it gets returned to the NIC.
However, ice_put_rx_mbuf flipped the page so that the bottom
half is active.
* If the NIC stores another packet in the page before packet A
has released its reference, the data in packet A will be
overwritten with data from the new packet.
The fix is for ice_put_rx_mbuf not to flip pages that have a
size of 0.
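The sequence above can be replayed with a toy userspace model. This is a simplified sketch, not the driver code: the model_* names, the 2048-byte half-page size and the single "live half" bookkeeping are assumptions made for illustration, and reference counting is reduced to one flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_HALF_PAGE 2048u /* assumed half-page buffer size */

struct model_rx_buf {
	unsigned int page_offset; /* half the NIC will write next */
	unsigned int live_offset; /* half still referenced by a packet */
	bool live;                /* true while that packet holds its reference */
};

/* Toggle between the bottom (0) and top (MODEL_HALF_PAGE) half. */
static void model_flip(struct model_rx_buf *buf)
{
	buf->page_offset ^= MODEL_HALF_PAGE;
}

/*
 * Hand a buffer back to the NIC after a descriptor of 'size' bytes.
 * skip_empty == false mimics the unconditional flip; true mimics the
 * proposed behavior of leaving the page alone when size is 0.
 */
static void model_put_rx_buf(struct model_rx_buf *buf, unsigned int size,
			     bool skip_empty)
{
	assert(buf != NULL);
	if (size != 0) {
		/* Data was taken from the current half, so it is now live. */
		buf->live_offset = buf->page_offset;
		buf->live = true;
		model_flip(buf);
	} else if (!skip_empty) {
		model_flip(buf);
	}
}

/* True if the next NIC write would land on a half that is still in use. */
static bool model_would_corrupt(const struct model_rx_buf *buf)
{
	return buf->live && buf->page_offset == buf->live_offset;
}
```

Replaying "packet A, then an empty descriptor" with skip_empty set to false leaves the NIC pointing at packet A's half, while with skip_empty set to true it still points at the unused half.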
Note: major revisions to the ice driver make this patch irrelevant
for recent versions. It applies to longterm stable versions
6.18.27 and 6.12.86; it also seems relevant for 6.6.137, but would
need modifications for that version. I have not examined earlier
versions.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
drivers/net/ethernet/intel/ice/ice_txrx.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 51c459a3e722..371e6db3c272 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1215,6 +1215,9 @@ static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
xdp_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
while (idx != ntc) {
+ union ice_32b_rx_flex_desc *rx_desc;
+ unsigned int size;
+
buf = &rx_ring->rx_buf[idx];
if (++idx == cnt)
idx = 0;
@@ -1224,10 +1227,20 @@ static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
* To do this, only adjust pagecnt_bias for fragments up to
* the total remaining after the XDP program has run.
*/
- if (verdict != ICE_XDP_CONSUMED)
- ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
- else if (i++ <= xdp_frags)
+ if (verdict != ICE_XDP_CONSUMED) {
+ /* Don't "flip" the page if size is 0: in this case
+ * the data in the current half will not be used so
+ * it's OK to reuse that half. And, since the bias
+ * didn't get decremented for this half, the page can
+ * be returned to the NIC even if the other half is
+ * still in use, so flipping the page could cause
+ * live packet data to be overwritten.
+ */
+ if (size != 0)
+ ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
+ } else if (i++ <= xdp_frags) {
buf->pagecnt_bias++;
+ }
ice_put_rx_buf(rx_ring, buf);
}
--
2.43.0
^ permalink raw reply related
* Re: [PATCH net-next 08/12] dt-bindings: net: toshiba,tc965x-dwmac: add TC956x Ethernet bridge
From: Alex Elder @ 2026-05-07 18:37 UTC (permalink / raw)
To: Bjorn Andersson
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, maxime.chevallier,
rmk+kernel, konradybcio, robh, krzk+dt, conor+dt, linusw, brgl,
arnd, gregkh, Daniel Thompson, mohd.anwar, a0987203069,
alexandre.torgue, ast, boon.khai.ng, chenchuangyu, chenhuacai,
daniel, hawk, hkallweit1, inochiama, john.fastabend, julianbraha,
livelycarpet87, matthew.gerlach, mcoquelin.stm32, me,
prabhakar.mahadev-lad.rj, richardcochran, rohan.g.thomas, sdf,
siyanteng, weishangjuan, wens, netdev, bpf, linux-arm-msm,
devicetree, linux-gpio, linux-stm32, linux-arm-kernel,
linux-kernel
In-Reply-To: <afycOwz5TpkegkZd@baldur>
On 5/7/26 9:12 AM, Bjorn Andersson wrote:
> On Fri, May 01, 2026 at 10:54:16AM -0500, Alex Elder wrote:
>> diff --git a/Documentation/devicetree/bindings/net/toshiba,tc956x-dwmac.yaml b/Documentation/devicetree/bindings/net/toshiba,tc956x-dwmac.yaml
> [..]
>> +
>> + gpio-controller: true
>
> I don't have any concern with the use of a proper gpio driver to model
> the implementation, but if I understand correctly this relationship
> between gpio controller and gpio consumer is strictly internal to "the
> PCI device".
(I think you're already cool with this but I still wanted to respond.)
That is not correct. These GPIO lines are used in two ways on the
RB3gen2:
- drivers/pci/pwrctrl/pci-pwrctrl-tc9563.c uses GPIOs 2 and 3 to
assert/deassert the reset lines associated with the two exposed
downstream PCIe ports on the PCIe switch within the TC956x.
- Each of the Ethernet PHYs has a reset GPIO. On the RB3gen2, the
GPIOs used for the purpose come from the GPIO controller embedded
in the TC9564 (00 and 01).
These are therefore "exposed" (they are *not* strictly internal).
> Is this connection variable or is the link merely expressed in
> DeviceTree to mitigate the fact that you choose to implement the
> responsibilities of the two parts split into two device drivers?
It is variable. These resets might be implemented by other GPIO
controllers on other platforms.
> Are there other consumers of these TC956x gpios which would result in a
> board designer (and hence dts author) to ever reference this
> gpio-controller in a different way?
They could. Nine of these GPIOs are exposed by the TC956x pins
(GPIO00-06, GPIO12, GPIO35 and GPIO36). The RB3gen2 uses 00-03
(and possibly 04 but that's for a PHY we haven't tested yet).
-Alex
> Regards,
> Bjorn
^ permalink raw reply
* Re: [PATCH net-next 09/12] gpio: tc956x: add TC956x/QPS615 support
From: Alex Elder @ 2026-05-07 18:39 UTC (permalink / raw)
To: Andrew Lunn
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, maxime.chevallier,
rmk+kernel, andersson, konradybcio, robh, krzk+dt, conor+dt,
linusw, brgl, arnd, gregkh, daniel, mohd.anwar, a0987203069,
alexandre.torgue, ast, boon.khai.ng, chenchuangyu, chenhuacai,
daniel, hawk, hkallweit1, inochiama, john.fastabend, julianbraha,
livelycarpet87, matthew.gerlach, mcoquelin.stm32, me,
prabhakar.mahadev-lad.rj, richardcochran, rohan.g.thomas, sdf,
siyanteng, weishangjuan, wens, netdev, bpf, linux-arm-msm,
devicetree, linux-gpio, linux-stm32, linux-arm-kernel,
linux-kernel
In-Reply-To: <3e5a42cc-53b8-4065-a32a-d754be40b4c7@lunn.ch>
On 5/2/26 9:48 PM, Andrew Lunn wrote:
>> It's possible gpio-regmap.c *could* be used. We started with
>> vendor code and this code got separated at some point along
>> the way. It was working, and I don't think I pursued other
>> options at that point. I'll look at this possibility before we
>> send out the next version.
>
> The GPIO subsystem has made a big effort to provide generic code,
> since GPIOs are pretty simple things with a lot in common. So if the
> generic code works, or can be made to work with minor changes, you
> should use it.
Yes, I can confirm this is what I will use. As mentioned
elsewhere, LinusW provided a patch that supports the one
unusual thing this does (input-only GPIO lines).
>> What do you mean instantiate it twice?
>
> I _think_ you need one instance for the first 32 GPIOs, and a second
> one for the remaining GPIOs. But maybe config->reg_stride might allow
> it to work with a single instance?
Oh I see. Looking at it now, I presume the reg_stride will work here.
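For illustration, the mapping a single instance would need is roughly the following: a GPIO offset selects a register by integer division and a bit by remainder, with reg_stride giving the distance between consecutive registers. This is a userspace sketch of that arithmetic as I understand the generic scheme, not the gpio-regmap code itself; model_gpio_xlate() and any register base passed to it are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Register address and bit mask for one GPIO line. */
struct model_gpio_loc {
	unsigned int reg;
	uint32_t mask;
};

/*
 * Translate a GPIO offset into (register, mask), assuming the lines are
 * packed ngpio_per_reg to a register and the registers are reg_stride
 * bytes apart, starting at reg_base.
 */
static struct model_gpio_loc model_gpio_xlate(unsigned int reg_base,
					      unsigned int reg_stride,
					      unsigned int ngpio_per_reg,
					      unsigned int offset)
{
	struct model_gpio_loc loc;

	assert(ngpio_per_reg > 0 && ngpio_per_reg <= 32);
	loc.reg = reg_base + (offset / ngpio_per_reg) * reg_stride;
	loc.mask = UINT32_C(1) << (offset % ngpio_per_reg);
	return loc;
}
```

With 32 lines per register, a line such as GPIO35 would land in the second register at bit 3, which is why one instance could cover more than 32 lines if the stride matches the hardware layout.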
Thanks.
-Alex
> Andrew
^ permalink raw reply
* [PATCH net v2] ice: fix packet corruption due to extraneous page flip
From: John Ousterhout @ 2026-05-07 18:38 UTC (permalink / raw)
To: anthony.l.nguyen
Cc: intel-wired-lan, przemyslaw.kitszel, netdev, John Ousterhout
Consider the following sequence of events:
* The bottom half of a buffer page is filled with data from
packet A. The page has a net reference count (reference count
- bias) of 1. The page is returned to the NIC, flipped to
use the top half.
* Before the reference on the page is released, the NIC returns
the page with no data in it ('size' is zero in ice_clean_rx_irq).
In this case the bias does not get decremented. The page still
has a net reference count of 1, so it gets returned to the NIC.
However, ice_put_rx_mbuf flipped the page so that the bottom
half is active.
* If the NIC stores another packet in the page before packet A
has released its reference, the data in packet A will be
overwritten with data from the new packet.
The fix is for ice_put_rx_mbuf not to flip pages that have a
size of 0.
Note: major revisions to the ice driver make this patch irrelevant
for recent versions. It applies to longterm stable versions
6.18.27 and 6.12.86; it also seems relevant for 6.6.137, but would
need modifications for that version. I have not examined earlier
versions.
Signed-off-by: John Ousterhout <ouster@cs.stanford.edu>
---
drivers/net/ethernet/intel/ice/ice_txrx.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 51c459a3e722..081c7a7392b7 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1215,6 +1215,13 @@ static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
xdp_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags;
while (idx != ntc) {
+ union ice_32b_rx_flex_desc *rx_desc;
+ unsigned int size;
+
+ rx_desc = ICE_RX_DESC(rx_ring, idx);
+ size = le16_to_cpu(rx_desc->wb.pkt_len) &
+ ICE_RX_FLX_DESC_PKT_LEN_M;
+
buf = &rx_ring->rx_buf[idx];
if (++idx == cnt)
idx = 0;
@@ -1224,10 +1231,20 @@ static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
* To do this, only adjust pagecnt_bias for fragments up to
* the total remaining after the XDP program has run.
*/
- if (verdict != ICE_XDP_CONSUMED)
- ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
- else if (i++ <= xdp_frags)
+ if (verdict != ICE_XDP_CONSUMED) {
+ /* Don't "flip" the page if size is 0: in this case
+ * the data in the current half will not be used so
+ * it's OK to reuse that half. And, since the bias
+ * didn't get decremented for this half, the page can
+ * be returned to the NIC even if the other half is
+ * still in use, so flipping the page could cause
+ * live packet data to be overwritten.
+ */
+ if (size != 0)
+ ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz);
+ } else if (i++ <= xdp_frags) {
buf->pagecnt_bias++;
+ }
ice_put_rx_buf(rx_ring, buf);
}
--
2.43.0
^ permalink raw reply related
* Re: [PATCH net-next 10/12] net: stmmac: tc956x: add TC956x/QPS615 support
From: Alex Elder @ 2026-05-07 18:44 UTC (permalink / raw)
To: Xilin Wu, andrew+netdev, davem, edumazet, kuba, pabeni,
maxime.chevallier, rmk+kernel, andersson, konradybcio, robh,
krzk+dt, conor+dt, linusw, brgl, arnd, gregkh
Cc: Daniel Thompson, mohd.anwar, a0987203069, alexandre.torgue, ast,
boon.khai.ng, chenchuangyu, chenhuacai, daniel, hawk, hkallweit1,
inochiama, john.fastabend, julianbraha, livelycarpet87,
matthew.gerlach, mcoquelin.stm32, me, prabhakar.mahadev-lad.rj,
richardcochran, rohan.g.thomas, sdf, siyanteng, weishangjuan,
wens, netdev, bpf, linux-arm-msm, devicetree, linux-gpio,
linux-stm32, linux-arm-kernel, linux-kernel
In-Reply-To: <224E233C593EF171+8c8a43dd-5061-40f8-9eb7-f360eabf2ecc@radxa.com>
On 5/6/26 7:59 AM, Xilin Wu wrote:
> On 5/1/2026 11:54 PM, Alex Elder wrote:
>> + /* AXI Configuration */
>> + axi = &td->axi;
>> + axi->axi_lpi_en = 1;
>> + axi->axi_wr_osr_lmt = 31;
>> + axi->axi_rd_osr_lmt = 31;
>> + /* All sizes (2^2..2^8) are supported */
>> + axi->axi_blen_regval = DMA_AXI_BLEN_MASK;
>> + plat->axi = axi;
>> +
>> + plat->mac_port_sel_speed = speed;
>> + plat->flags = STMMAC_FLAG_MULTI_MSI_EN | STMMAC_FLAG_TSO_EN;
>
> I got WoL working only after adding STMMAC_FLAG_USE_PHY_WOL here. I
> guess it's required, since the driver clocks down the MAC/PMA/XPCS in
> its suspend hook?
I just want to respond to this with a summary of our plans.
We will *not* be implementing wake-on-LAN (WoL) initially. We
will work to get support for the eMACs upstream for TC956x, and
then as a separate step, we will enable WoL.
It's great to know you have it working, and our plan is to
implement it via the PHYs and not involve the MAC. It seems
it will be relatively easy, but we have no plans to add it to
the current series.
-Alex
^ permalink raw reply
* [PATCH iproute2-next] rdma: Align FRMR pool UAPI names with merged kernel UAPI
From: Chiara Meiohas @ 2026-05-07 18:46 UTC (permalink / raw)
To: leon, dsahern, stephen
Cc: michaelgur, jgg, linux-rdma, netdev, Chiara Meiohas
From: Michael Guralnik <michaelgur@nvidia.com>
The FRMR pools UAPI merged in kernel v7.0-rc1 commit dbd0472fd7a5
("RDMA/nldev: Expose kernel-internal FRMR pools in netlink")
uses different identifier names than what the iproute2 FRMR pools
series was developed against.
Update the vendored copy of RDMA UAPI and all references in the rdma
tool to match the names that actually shipped in the kernel.
Fixes: 93368ee34528 ("rdma: Update headers")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
---
rdma/include/uapi/rdma/rdma_netlink.h | 30 ++++----
rdma/res-frmr-pools.c | 98 +++++++++++++--------------
rdma/res.h | 2 +-
3 files changed, 65 insertions(+), 65 deletions(-)
diff --git a/rdma/include/uapi/rdma/rdma_netlink.h b/rdma/include/uapi/rdma/rdma_netlink.h
index 8709e558b..4356ec4a1 100644
--- a/rdma/include/uapi/rdma/rdma_netlink.h
+++ b/rdma/include/uapi/rdma/rdma_netlink.h
@@ -308,9 +308,9 @@ enum rdma_nldev_command {
RDMA_NLDEV_CMD_MONITOR,
- RDMA_NLDEV_CMD_RES_FRMR_POOLS_GET, /* can dump */
+ RDMA_NLDEV_CMD_FRMR_POOLS_GET, /* can dump */
- RDMA_NLDEV_CMD_RES_FRMR_POOLS_SET,
+ RDMA_NLDEV_CMD_FRMR_POOLS_SET,
RDMA_NLDEV_NUM_OPS
};
@@ -590,19 +590,19 @@ enum rdma_nldev_attr {
/*
* FRMR Pools attributes
*/
- RDMA_NLDEV_ATTR_RES_FRMR_POOLS, /* nested table */
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_ENTRY, /* nested table */
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY, /* nested table */
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_ATS, /* u8 */
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_ACCESS_FLAGS, /* u32 */
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_VENDOR_KEY, /* u64 */
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_NUM_DMA_BLOCKS, /* u64 */
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_QUEUE_HANDLES, /* u32 */
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_MAX_IN_USE, /* u64 */
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_IN_USE, /* u64 */
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_AGING_PERIOD, /* u32 */
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_PINNED, /* u32 */
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_KERNEL_VENDOR_KEY, /* u64 */
+ RDMA_NLDEV_ATTR_FRMR_POOLS, /* nested table */
+ RDMA_NLDEV_ATTR_FRMR_POOL_ENTRY, /* nested table */
+ RDMA_NLDEV_ATTR_FRMR_POOL_KEY, /* nested table */
+ RDMA_NLDEV_ATTR_FRMR_POOL_KEY_ATS, /* u8 */
+ RDMA_NLDEV_ATTR_FRMR_POOL_KEY_ACCESS_FLAGS, /* u32 */
+ RDMA_NLDEV_ATTR_FRMR_POOL_KEY_VENDOR_KEY, /* u64 */
+ RDMA_NLDEV_ATTR_FRMR_POOL_KEY_NUM_DMA_BLOCKS, /* u64 */
+ RDMA_NLDEV_ATTR_FRMR_POOL_QUEUE_HANDLES, /* u32 */
+ RDMA_NLDEV_ATTR_FRMR_POOL_MAX_IN_USE, /* u64 */
+ RDMA_NLDEV_ATTR_FRMR_POOL_IN_USE, /* u64 */
+ RDMA_NLDEV_ATTR_FRMR_POOLS_AGING_PERIOD, /* u32 */
+ RDMA_NLDEV_ATTR_FRMR_POOL_PINNED_HANDLES, /* u32 */
+ RDMA_NLDEV_ATTR_FRMR_POOL_KEY_KERNEL_VENDOR_KEY, /* u64 */
/*
* Always the end
diff --git a/rdma/res-frmr-pools.c b/rdma/res-frmr-pools.c
index abcd21884..d5faa5c14 100644
--- a/rdma/res-frmr-pools.c
+++ b/rdma/res-frmr-pools.c
@@ -80,83 +80,83 @@ static int res_frmr_pools_line(struct rd *rd, const char *name, int idx,
char key_str[FRMR_POOL_KEY_MAX_LEN];
struct frmr_pool_key key = { 0 };
- if (nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY]) {
+ if (nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_KEY]) {
if (mnl_attr_parse_nested(
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY],
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_KEY],
rd_attr_cb, key_tb) != MNL_CB_OK)
return MNL_CB_ERROR;
- if (key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_ATS])
+ if (key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_ATS])
key.ats = mnl_attr_get_u8(
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_ATS]);
- if (key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_ACCESS_FLAGS])
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_ATS]);
+ if (key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_ACCESS_FLAGS])
key.access_flags = mnl_attr_get_u32(
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_ACCESS_FLAGS]);
- if (key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_VENDOR_KEY])
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_ACCESS_FLAGS]);
+ if (key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_VENDOR_KEY])
key.vendor_key = mnl_attr_get_u64(
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_VENDOR_KEY]);
- if (key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_NUM_DMA_BLOCKS])
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_VENDOR_KEY]);
+ if (key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_NUM_DMA_BLOCKS])
key.num_dma_blocks = mnl_attr_get_u64(
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_NUM_DMA_BLOCKS]);
- if (key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_KERNEL_VENDOR_KEY])
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_NUM_DMA_BLOCKS]);
+ if (key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_KERNEL_VENDOR_KEY])
kernel_vendor_key = mnl_attr_get_u64(
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_KERNEL_VENDOR_KEY]);
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_KERNEL_VENDOR_KEY]);
if (rd_is_filtered_attr(
rd, "ats", key.ats,
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_ATS]))
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_ATS]))
goto out;
if (rd_is_filtered_attr(
rd, "access_flags", key.access_flags,
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_ACCESS_FLAGS]))
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_ACCESS_FLAGS]))
goto out;
if (rd_is_filtered_attr(
rd, "vendor_key", key.vendor_key,
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_VENDOR_KEY]))
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_VENDOR_KEY]))
goto out;
if (rd_is_filtered_attr(
rd, "num_dma_blocks", key.num_dma_blocks,
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_NUM_DMA_BLOCKS]))
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_NUM_DMA_BLOCKS]))
goto out;
}
- if (nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_QUEUE_HANDLES])
+ if (nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_QUEUE_HANDLES])
queue_handles = mnl_attr_get_u32(
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_QUEUE_HANDLES]);
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_QUEUE_HANDLES]);
if (rd_is_filtered_attr(
rd, "queue", queue_handles,
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_QUEUE_HANDLES]))
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_QUEUE_HANDLES]))
goto out;
- if (nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_IN_USE])
+ if (nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_IN_USE])
in_use = mnl_attr_get_u64(
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_IN_USE]);
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_IN_USE]);
if (rd_is_filtered_attr(rd, "in_use", in_use,
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_IN_USE]))
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_IN_USE]))
goto out;
- if (nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_MAX_IN_USE])
+ if (nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_MAX_IN_USE])
max_in_use = mnl_attr_get_u64(
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_MAX_IN_USE]);
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_MAX_IN_USE]);
if (rd_is_filtered_attr(
rd, "max_in_use", max_in_use,
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_MAX_IN_USE]))
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_MAX_IN_USE]))
goto out;
- if (nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_PINNED])
+ if (nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_PINNED_HANDLES])
pinned_handles = mnl_attr_get_u32(
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_PINNED]);
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_PINNED_HANDLES]);
if (rd_is_filtered_attr(rd, "pinned", pinned_handles,
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_PINNED]))
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_PINNED_HANDLES]))
goto out;
open_json_object(NULL);
print_dev(idx, name);
- if (nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY]) {
+ if (nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_KEY]) {
snprintf(key_str, sizeof(key_str),
"%" PRIx64 ":%" PRIx64 ":%x:%s",
key.vendor_key, key.num_dma_blocks,
@@ -166,30 +166,30 @@ static int res_frmr_pools_line(struct rd *rd, const char *name, int idx,
if (rd->show_details) {
res_print_u32(
"ats", key.ats,
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_ATS]);
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_ATS]);
res_print_u32(
"access_flags", key.access_flags,
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_ACCESS_FLAGS]);
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_ACCESS_FLAGS]);
res_print_u64(
"vendor_key", key.vendor_key,
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_VENDOR_KEY]);
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_VENDOR_KEY]);
res_print_u64(
"num_dma_blocks", key.num_dma_blocks,
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_NUM_DMA_BLOCKS]);
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_NUM_DMA_BLOCKS]);
res_print_u64(
"kernel_vendor_key", kernel_vendor_key,
- key_tb[RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_KERNEL_VENDOR_KEY]);
+ key_tb[RDMA_NLDEV_ATTR_FRMR_POOL_KEY_KERNEL_VENDOR_KEY]);
}
}
res_print_u32("queue", queue_handles,
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_QUEUE_HANDLES]);
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_QUEUE_HANDLES]);
res_print_u64("in_use", in_use,
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_IN_USE]);
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_IN_USE]);
res_print_u64("max_in_use", max_in_use,
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_MAX_IN_USE]);
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_MAX_IN_USE]);
res_print_u32("pinned", pinned_handles,
- nla_line[RDMA_NLDEV_ATTR_RES_FRMR_POOL_PINNED]);
+ nla_line[RDMA_NLDEV_ATTR_FRMR_POOL_PINNED_HANDLES]);
print_driver_table(rd, nla_line[RDMA_NLDEV_ATTR_DRIVER]);
close_json_object();
@@ -215,12 +215,12 @@ int res_frmr_pools_parse_cb(const struct nlmsghdr *nlh, void *data)
mnl_attr_parse(nlh, 0, rd_attr_cb, tb);
if (!tb[RDMA_NLDEV_ATTR_DEV_INDEX] || !tb[RDMA_NLDEV_ATTR_DEV_NAME] ||
- !tb[RDMA_NLDEV_ATTR_RES_FRMR_POOLS])
+ !tb[RDMA_NLDEV_ATTR_FRMR_POOLS])
return MNL_CB_ERROR;
name = mnl_attr_get_str(tb[RDMA_NLDEV_ATTR_DEV_NAME]);
idx = mnl_attr_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]);
- nla_table = tb[RDMA_NLDEV_ATTR_RES_FRMR_POOLS];
+ nla_table = tb[RDMA_NLDEV_ATTR_FRMR_POOLS];
mnl_attr_for_each_nested(nla_entry, nla_table) {
struct nlattr *nla_line[RDMA_NLDEV_ATTR_MAX] = {};
@@ -256,10 +256,10 @@ static int res_frmr_pools_one_set_aging(struct rd *rd)
return -EINVAL;
}
- rd_prepare_msg(rd, RDMA_NLDEV_CMD_RES_FRMR_POOLS_SET, &seq,
+ rd_prepare_msg(rd, RDMA_NLDEV_CMD_FRMR_POOLS_SET, &seq,
(NLM_F_REQUEST | NLM_F_ACK));
mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);
- mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_RES_FRMR_POOL_AGING_PERIOD,
+ mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_FRMR_POOLS_AGING_PERIOD,
aging_period);
return rd_sendrecv_msg(rd, seq);
@@ -294,24 +294,24 @@ static int res_frmr_pools_one_set_pinned(struct rd *rd)
return -EINVAL;
}
- rd_prepare_msg(rd, RDMA_NLDEV_CMD_RES_FRMR_POOLS_SET, &seq,
+ rd_prepare_msg(rd, RDMA_NLDEV_CMD_FRMR_POOLS_SET, &seq,
(NLM_F_REQUEST | NLM_F_ACK));
mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_DEV_INDEX, rd->dev_idx);
- mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_RES_FRMR_POOL_PINNED,
+ mnl_attr_put_u32(rd->nlh, RDMA_NLDEV_ATTR_FRMR_POOL_PINNED_HANDLES,
pinned_value);
key_attr =
- mnl_attr_nest_start(rd->nlh, RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY);
- mnl_attr_put_u8(rd->nlh, RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_ATS,
+ mnl_attr_nest_start(rd->nlh, RDMA_NLDEV_ATTR_FRMR_POOL_KEY);
+ mnl_attr_put_u8(rd->nlh, RDMA_NLDEV_ATTR_FRMR_POOL_KEY_ATS,
pool_key.ats);
mnl_attr_put_u32(rd->nlh,
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_ACCESS_FLAGS,
+ RDMA_NLDEV_ATTR_FRMR_POOL_KEY_ACCESS_FLAGS,
pool_key.access_flags);
- mnl_attr_put_u64(rd->nlh, RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_VENDOR_KEY,
+ mnl_attr_put_u64(rd->nlh, RDMA_NLDEV_ATTR_FRMR_POOL_KEY_VENDOR_KEY,
pool_key.vendor_key);
mnl_attr_put_u64(rd->nlh,
- RDMA_NLDEV_ATTR_RES_FRMR_POOL_KEY_NUM_DMA_BLOCKS,
+ RDMA_NLDEV_ATTR_FRMR_POOL_KEY_NUM_DMA_BLOCKS,
pool_key.num_dma_blocks);
mnl_attr_nest_end(rd->nlh, key_attr);
diff --git a/rdma/res.h b/rdma/res.h
index 8d7b4a0bf..1f71115b9 100644
--- a/rdma/res.h
+++ b/rdma/res.h
@@ -200,7 +200,7 @@ struct filters frmr_pools_valid_filters[MAX_NUMBER_OF_FILTERS] = {
{ .name = "pinned", .is_number = true },
};
-RES_FUNC(res_frmr_pools, RDMA_NLDEV_CMD_RES_FRMR_POOLS_GET,
+RES_FUNC(res_frmr_pools, RDMA_NLDEV_CMD_FRMR_POOLS_GET,
frmr_pools_valid_filters, true, 0);
int res_frmr_pools_set(struct rd *rd);
--
2.38.1
^ permalink raw reply related
* Re: [PATCH net v2 1/4] net: sparx5: defer VCAP debugfs creation until after netdev registration
From: Daniel Machon @ 2026-05-07 18:47 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Steen Hegelund, UNGLinuxDriver, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, Bjarni Jonasson, Lars Povlsen,
Philipp Zabel, kees, linux-kernel, netdev, linux-arm-kernel,
linux-rt-devel
In-Reply-To: <20260507090810.53e66ef6@kernel.org>
> On Wed, 6 May 2026 09:25:36 +0200 Daniel Machon wrote:
> > Move the debugfs setup into a new sparx5_debugfs() helper in
> > sparx5_debugfs.c, invoked after sparx5_register_notifier_blocks()
> > succeeds so the netdev names are finalized. sparx5_vcap_init() now
> > only deals with VCAP state. The sparx5/ debugfs root is created in
> > the new helper as well.
>
> netdev names are never final :( User can change them at any time.
> The best practice is to name the debugfs file by some stable hw-related
> property, bus, port number etc.
Right, but they are finalized in the sense that we have a name we can use for the
debugfs files (which we don't pre-patch).
Hmm. I think this patch fixes an actual issue, where you cannot query the
debugfs files, because a previous patch broke the ordering. I agree that the
names chosen (netdev_name()) for the files were poor, but is that really a fix
for this series? Should that not be addressed in a future patch for net-next (it
involves changing a VCAP API function that is used not only by Sparx5/lan969x,
but also by lan966x)?
/Daniel
^ permalink raw reply
* Re: [PATCH v2 iproute2-next 1/4] rdma: Update headers
From: Chiara Meiohas @ 2026-05-07 19:03 UTC (permalink / raw)
To: David Ahern, Stephen Hemminger
Cc: leon, michaelgur, jgg, linux-rdma, netdev, Patrisious Haddad
In-Reply-To: <3cf0dcca-9a3f-4cee-83d7-f058f33bcc04@gmail.com>
On 07/05/2026 19:20, David Ahern wrote:
> On 4/28/26 4:05 AM, Chiara Meiohas wrote:
>> We will prepare a sync patch to align the names with the kernel and send
>> it shortly.
> what happened to this request? I see that Stephen had to post a patch
> (not yet applied) to address this problem:
>
> https://patchwork.kernel.org/project/netdevbpf/patch/20260505181045.748088-1-stephen@networkplumber.org/
>
> We allow rdma to have separate uapi headers for convenience. Responses
> to mistakes need to be timely.
Hi David,
My apologies for the delay; I will make sure these mistakes are handled
more promptly in the future.
I have sent our version to the mailing list, as I was not sure how you
would prefer to proceed given the overlap.
https://lore.kernel.org/linux-rdma/20260507184609.3439875-1-cmeiohas@nvidia.com/
Thanks,
Chiara
^ permalink raw reply
* Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
From: Jesper Dangaard Brouer @ 2026-05-07 19:09 UTC (permalink / raw)
To: Simon Schippers, Paolo Abeni, netdev
Cc: kernel-team, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann,
John Fastabend, Stanislav Fomichev, linux-kernel, bpf
In-Reply-To: <e3a91545-13cd-4f87-8375-d707865bdbca@schippers-hamm.de>
[-- Attachment #1: Type: text/plain, Size: 3900 bytes --]
On 07/05/2026 16.46, Simon Schippers wrote:
>
>
> On 5/7/26 16:34, Paolo Abeni wrote:
>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>> }
>>>> } else {
>>>> /* ndo_start_xmit */
>>>> - struct sk_buff *skb = ptr;
>>>> + bool bql_charged = veth_ptr_is_bql(ptr);
>>>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>
>>>> stats->xdp_bytes += skb->len;
>>>> + if (peer_txq && bql_charged)
>>>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>
>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>> this doesn’t work.
>>>
I've experimented with doing the "completion" at NAPI-end in
veth_poll(), but that resulted in a BQL limit of 128 packets, which
leads to bad latency results (not acceptable).
(See the detailed report later.)
>>> I still think first that adding an option to modify the hard-coded
>>> VETH_RING_SIZE is the way to go.
>>>
Not against being able to modify VETH_RING_SIZE, but I don't think it is
the solution here.
The simple solution is to configure the BQL limit_min:
`/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`
My experiments (below) find that limit_min=8 gives good performance.
We can simply set the default to 8, as this still allows userspace to
change it later if lower latency is preferred.
>>> Thanks!
>>>
>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>
>> In the above discussion a 20% regression is reported, which IMHO can't
>> be ignored. Still the tput figures in the data are extremely low,
>> something is possibly off?!? I would expect a few Mpps with pktgen on
>> top of veth, while the reported data is ~20-30Kpps.
>>
>> /P
>>
>
> The ~20-30Kpps occur when thousands of iptables rules are applied and
> an UDP userspace application is sending.
>
> And there is a 20% pktgen regression (no iptables rules applied).
>
The pktgen test is a little dubious/weird and Jonas had to modify pktgen
to test this. John Fastabend added a config to pktgen that allows us
to benchmark the egress qdisc path; it might be better to use that.
The samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh is a demo usage.
If redoing the tests, can you adjust limit_min to see the effect?
/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min
A 20% throughput regression is of course too much, but I will
remind us that adding a qdisc will "cost" some overhead; that is a
configuration choice. Our purpose here is to reduce bufferbloat and
latency, not to optimize for throughput.
> I am pretty sure the reason is because the BQL limit is stuck at 2
> packets (because the completed queue is always called with 1 packet
> and not in a interrupt/timer with multiple packets...).
>
I've run a lot of experiments, which I had an AI write up as a report; see
the attachment. The TL;DR is that the best performance vs latency tradeoff
is defaulting BQL/DQL limit_min to 8 packets.
I fear this patchset will stall forever if we keep searching for a
perfect solution without any overhead. The qdisc layer will be a
baseline overhead. A limit of 2 packets is actually the optimal
dark-buffer queue size, but I acknowledge that this causes too many qdisc
requeue events (leading to overhead). I suggest that I add another
patch in V6 that defaults limit_min to 8 (a separate patch to make it
easier to revert/adjust later).
I've talked with Jonas, and we want to experiment with different
solutions to make BQL/DQL work better with virtual devices.
This patchset helps our (production) use-case reduce mice-flow latency
under load from approx 22ms to 1.3ms. Because the consumer namespace is
the bottleneck, the requeue overhead is negligible in comparison.
-Jesper
[-- Attachment #2: PERF-2651-bql-completion-experiment.md --]
[-- Type: text/markdown, Size: 13601 bytes --]
# PERF-2651: BQL Completion Batching Experiment (2026-05-05)
## Background
Simon Schippers and Jonas Koeppeler raised concerns that DQL settles at
limit=2 with veth BQL, citing the netdevice.h comment:
> "Must be called at most once per TX completion round (and not per
> individual packet), so that BQL can adjust its limits appropriately."
And Tom Herbert's original BQL cover letter:
> "BQL accounting is in the transmit path for every packet, and the
> function to recompute the byte limit is run once per transmit completion."
Thread: https://lore.kernel.org/all/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
## Experiment: Batch BQL completion at end of veth_poll
Created stg patch `experiment-batch-bql-completion` that moves
`netdev_tx_completed_queue()` from per-SKB inside `veth_xdp_rcv()` to a
single batched call at the end of `veth_poll()`.
### Code change (drivers/net/veth.c)
In `veth_xdp_rcv()`: replace per-SKB completion with counter accumulation:
```c
// Before (V5, per-packet):
if (peer_txq && bql_charged)
	netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
// After (experiment, accumulate):
if (peer_txq && bql_charged)
	stats->bql_completed += VETH_BQL_UNIT;
```
In `veth_poll()`: single batched call after veth_xdp_rcv() returns:
```c
if (peer_txq && stats.bql_completed)
	netdev_tx_completed_queue(peer_txq, stats.bql_completed,
				  stats.bql_completed);
```
Note: cannot use `done` (return value of veth_xdp_rcv) because it counts
all consumed ring entries including XDP frames that were never BQL-charged.
Using `done` would over-complete and hit BUG_ON in dql_completed().
## Why DQL settles at limit=2 with per-packet completion
The DQL slack calculation in `dql_completed()` uses:
```c
slack = POSDIFF(limit + prev_ovlimit, 2 * (completed - num_completed));
```
`completed - num_completed` equals the `count` parameter (bytes completed
this call). Per-packet: count=1, so slack = limit + prev_ovlimit - 2.
With limit=2, slack=0, so the algorithm holds steady at 2.
With batched completion: count=~64, slack calculation sees the real batch
size, and DQL converges to limit=128 (~2x NAPI budget).
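The slack expression quoted above can be tried in isolation. The following is a minimal sketch, assuming plain unsigned arithmetic; the helper names `posdiff` and `dql_slack` are illustrative and are not kernel functions, and the real `dql_completed()` keeps additional state and has branches that are not modeled here.

```c
/* Illustrative stand-in for the kernel's POSDIFF(): a - b if positive, else 0. */
unsigned int posdiff(unsigned int a, unsigned int b)
{
	return (int)(a - b) > 0 ? a - b : 0;
}

/*
 * Slack as in the quoted expression, where "count" stands for
 * (completed - num_completed), i.e. the units completed in this call.
 */
unsigned int dql_slack(unsigned int limit, unsigned int prev_ovlimit,
		       unsigned int count)
{
	return posdiff(limit + prev_ovlimit, 2 * count);
}
```

In this simplified model the slack reaches zero once `limit + prev_ovlimit` equals `2 * count`: `dql_slack(2, 0, 1)` and `dql_slack(128, 0, 64)` are both 0, which is at least consistent with the limits of 2 and 128 reported for per-packet and batched completion.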
## Results: nrules=3500 (sfq + tiny-flood)
| Metric | No BQL | Per-pkt | limit=4 | limit=8 | limit=16 | Batched |
| | | (limit=2) | | | | (limit=128) |
|---------------|-----------|-----------|---------|---------|----------|-------------|
| BQL limit | unlimited | 2 | 4 | 8 | 16 | 128 |
| BQL inflight | 254 | 3 | 5 | 9-17 | 25 | 133 |
| Ping RTT avg | 9.3ms | 0.94ms | 1.07ms | 1.24ms | 1.69ms | 4.0ms |
| requeues | 52K | 454K | 426K | 399K | 356K | 112K |
| NAPI avg_work | 63 | 5 | 15 | 63 | 63 | 63 |
| NAPI polls | ~2.2K | ~27K | ~10.5K | ~2.5K | ~2.3K | ~2.5K |
| Consumer pps | ~26K | ~30K | ~30K | ~30K | ~29K | ~30K |
## Results: nrules=15000 (sfq + tiny-flood, slower consumer)
| Metric | No BQL | Per-pkt | limit=4 | limit=8 | limit=16 | Batched |
| | | (limit=2) | | | | (limit=128) |
|---------------|------------|-----------|-----------|-----------|-----------|-------------|
| BQL limit | unlimited | 2 | 4 | 8 | 16 | 128 |
| BQL inflight | 211-227 | 3 | 5-12 | 9 | 17 | 132-136 |
| Ping RTT avg | **37.8ms** | **4.5ms** | **5.0ms** | **6.0ms** | **6.4ms** | **20.0ms** |
| Ping RTT min | 27.7ms | 1.4ms | 1.7ms | 2.4ms | 3.0ms | 10.4ms |
| requeues | 12.9K | 93K | 87K | 80K | 86K | 22.8K |
| NAPI avg_work | 61 | 6 | 17 | 60 | 61 | 61 |
| NAPI polls | ~540 | ~4.9K | ~1.9K | ~540 | ~550 | ~540 |
| Consumer pps | ~6.7K | ~6.7K | ~6.9K | ~6.8K | ~7.0K | ~6.7K |
## Analysis
### Batched completion is clearly worse for latency
At nrules=15000, batched completion gives 20ms ping RTT -- only 2x better
than no-BQL (37.8ms). Per-packet gives 4.5ms -- an 8x improvement.
The math confirms this: 128 packets / 6.7K pps = 19ms of uncontrolled
queuing delay. This matches the measured 20ms almost exactly.
### Per-packet completion (limit=2) is correct for veth
Simon's concern that limit=2 is a DQL defect is wrong. limit=2 is the
ideal behavior for dark-buffer elimination:
- Only 2-3 packets in the ptr_ring at any time
- Qdisc gets immediate control over all buffering
- 8x latency reduction vs no-BQL
The DQL comment "once per TX completion round" was written for HW NICs
where interrupt coalescing batches completions naturally. For veth, each
per-SKB completion within a NAPI poll technically violates the letter of
the comment, but the resulting limit=2 is correct for the use case.
The concern with limit=2 is the overhead it introduces:
### Trade-off: NAPI polling overhead
Per-packet (limit=2) causes many more NAPI polls:
- nrules=3500: 27K polls (avg_work=5) vs 2.5K polls (avg_work=63)
- nrules=15000: 4.9K polls (avg_work=6) vs 540 polls (avg_work=61)
This is because with only 2-3 items in the ring, each NAPI poll drains
the ring quickly -> napi_complete_done -> reschedule. More scheduling
overhead, but no throughput impact when consumer is the bottleneck.
### limit_min tuning via sysfs
DQL limit_min can be set via:
`/sys/class/net/<dev>/queues/tx-0/byte_queue_limits/limit_min`
The selftest `--bql-min-limit N` flag writes to this sysfs file.
- **limit_min=4**: half a cache-line (32 bytes of ptr_ring pointers).
avg_work=17, 1.9K polls. Ping 5.0ms -- close to limit=2 (4.5ms).
- **limit_min=8**: one cache-line (64 bytes of ptr_ring pointers).
avg_work=60, 540 polls. Ping 6.0ms -- efficient full-budget polls.
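For reference, setting `limit_min` from a program amounts to writing a decimal number to the sysfs file named above. This is a minimal userspace sketch, not part of the series: the function names `bql_limit_min_path` and `bql_set_limit_min` are hypothetical, the write needs sufficient privileges, and the value is in whatever unit the driver charges per packet (a value of 8 means 8 packets only if veth charges one unit per packet, which is an assumption here).

```c
#include <stdio.h>
#include <string.h>

/* Build the sysfs path for one TX queue; returns 0 on success, -1 on error. */
int bql_limit_min_path(char *buf, size_t len, const char *dev,
		       unsigned int txq)
{
	int n;

	/* Reject empty names and names that would escape the directory. */
	if (!dev[0] || strchr(dev, '/'))
		return -1;
	n = snprintf(buf, len,
		     "/sys/class/net/%s/queues/tx-%u/byte_queue_limits/limit_min",
		     dev, txq);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "value" to limit_min of the given queue; returns 0 on success. */
int bql_set_limit_min(const char *dev, unsigned int txq, unsigned int value)
{
	char path[256];
	FILE *f;
	int ok;

	if (bql_limit_min_path(path, sizeof(path), dev, txq) < 0)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%u\n", value) > 0;
	if (fclose(f) != 0)	/* sysfs write errors can surface at close */
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would use something like `bql_set_limit_min("veth0", 0, 8)` for each TX queue of interest; the device name here is only an example.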
### Dark buffer formula
At consumer rate R (pps) and BQL limit L (packets):
- Dark buffer latency = L / R
- limit=2: 2/6700 = 0.3ms (negligible)
- limit=8: 8/6700 = 1.2ms
- limit=128: 128/6700 = 19ms (matches measured 20ms)
- unlimited (254): 254/6700 = 38ms (matches measured 37.8ms)
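The arithmetic in the list above can be reproduced with a one-line helper. This is only a sketch of the L / R formula; the function name `dark_buffer_latency_ms` is hypothetical, and the consumer rate of 6700 pps is the approximate figure taken from the nrules=15000 table.

```c
/*
 * Dark buffer latency in milliseconds for a BQL limit of "limit_pkts"
 * packets drained at "consumer_pps" packets per second (L / R).
 * Returns -1.0 for a non-positive rate.
 */
double dark_buffer_latency_ms(unsigned int limit_pkts, double consumer_pps)
{
	if (consumer_pps <= 0.0)
		return -1.0;
	return 1000.0 * limit_pkts / consumer_pps;
}
```

For example, `dark_buffer_latency_ms(128, 6700.0)` evaluates to roughly 19.1, in line with the 19ms estimate for the batched case.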
## Results: nrules=0 (no consumer overhead, max throughput)
This tests the raw throughput overhead of BQL stop/start oscillation.
All values are averages of 4 runs (VM noise is ~15-20% per-run variance).
| Metric | No BQL | limit=2 | limit=4 | limit=8 | limit=16 |
|-----------------|--------|---------|---------|---------|----------|
| Sink pps (large)| 841K | 759K | 692K | 762K | 736K |
| Sink pps (small)| 950K | 874K | 807K | 874K | 844K |
| qdisc pkts | 48.6M | 44.8M | 40.1M | 45.0M | 44.8M |
| requeues | 311K | 6.1M | 13.4M | 5.8M | 5.2M |
| NAPI avg_work | 22 | 27 | 12 | 19 | 21 |
| Ping RTT avg | 0.17ms | 0.11ms | 0.10ms | 0.085ms | 0.095ms |
| Runs | 4 | 4 | 4 | 4 | 4 |
Observations:
- **limit=2 is NOT the worst** -- limit=4 has higher requeues (13.4M) and
lower throughput (692K sink) due to more stop/start cycles at a less
efficient NAPI batch size (avg_work=12)
- **limit=8 and limit=16 match No-BQL throughput** within noise (762K and 736K
vs 841K sink pps for large pkts, ~9-12% lower, inside the per-run variance)
- **Requeue overhead**: 311K (No BQL) -> 5.2-5.8M (limit=8/16) -> 13.4M (limit=4)
- Latency sub-0.2ms for all settings at this speed -- not a differentiator
## Comparison: limit=8 vs limit=16
Multi-run (4 iterations each, nrules=0) to cut through VM noise:
### limit=8 (4 runs)
| Run | Sink pps (large/small) | qdisc pkts | requeues | avg_work | Ping avg |
|-----|------------------------|-----------|----------|----------|----------|
| 1 | 796K / 911K | 46.2M | 5.6M | 20 | 0.062ms |
| 2 | 796K / 883K | 45.5M | 4.7M | 16 | 0.081ms |
| 3 | 654K / 836K | 43.5M | 8.3M | 22 | 0.100ms |
| 4 | 803K / 865K | 44.8M | 4.4M | 16 | 0.095ms |
| **avg** | **762K / 874K** | **45.0M** | **5.8M** | **19** | **0.085ms** |
### limit=16 (4 runs)
| Run | Sink pps (large/small) | qdisc pkts | requeues | avg_work | Ping avg |
|-----|------------------------|-----------|----------|----------|----------|
| 1 | 844K / 940K | 48.1M | 3.3M | 20 | 0.081ms |
| 2 | 768K / 873K | 45.6M | 4.1M | 15 | 0.097ms |
| 3 | 733K / 804K | 44.8M | 6.5M | 26 | 0.085ms |
| 4 | 597K / 757K | 40.7M | 6.9M | 23 | 0.115ms |
| **avg** | **736K / 844K** | **44.8M** | **5.2M** | **21** | **0.095ms** |
### Averaged comparison (nrules=0, 4 runs)
| Metric | limit=8 | limit=16 |
|---------------------|-----------|-----------|
| Sink pps (large) | 762K | 736K |
| Sink pps (small) | 874K | 844K |
| qdisc pkts | 45.0M | 44.8M |
| requeues | 5.8M | 5.2M |
| avg_work | 19 | 21 |
| Ping RTT avg | 0.085ms | 0.095ms |
At max throughput, limit=8 and limit=16 are within VM noise (~3-4%).
### Cross-load comparison (all averages of 4 runs)
| Metric | limit=8 | limit=16 | Winner |
|---------------|---------|----------|---------------|
| nrules=15000: | | | |
| Ping RTT | 6.73ms | 8.00ms | 8 (+1.3ms) |
| requeues | 71K | 73K | ~same |
| avg_work | 59 | 59 | ~same |
| nrules=3500: | | | |
| Ping RTT | 1.77ms | 2.11ms | 8 (+0.34ms) |
| requeues | 279K | 282K | ~same |
| avg_work | 62 | 62 | ~same |
| nrules=0: | | | |
| Sink pps | 762K | 736K | ~same (noise) |
| requeues | 5.8M | 5.2M | ~same (noise) |
**Verdict: limit=8 is the better default.**
- Consistent latency advantage under load: +1.3ms at nrules=15000,
+0.34ms at nrules=3500 (reproducible across 4 runs each)
- Throughput indistinguishable from limit=16 after averaging
- One cache-line (64 bytes) is a clean hardware alignment
- More conservative -- smaller dark buffer
## Proposed patch: dql_set_min_limit() + veth default min_limit=8
Two-part solution in stg patch `veth-set-bql-min-limit-8`:
### 1. New DQL API helper (include/linux/dynamic_queue_limits.h)
```c
static inline void dql_set_min_limit(struct dql *dql, unsigned int min_limit)
{
	dql->min_limit = min_limit;
}
```
Gives drivers a clean API to set a default floor. Currently no driver
sets min_limit -- all rely on the dql_init() default of 0 or user sysfs.
### 2. Veth sets min_limit=8 at device creation (drivers/net/veth.c)
In `veth_init_queues()`, after TX queue setup:
```c
#ifdef CONFIG_BQL
	for (i = 0; i < dev->num_tx_queues; i++)
		dql_set_min_limit(&netdev_get_tx_queue(dev, i)->dql,
				  VETH_BQL_UNIT * 8);
#endif
```
Called for both `dev` and `peer` in `veth_newlink()`. Uses
`num_tx_queues` (all pre-allocated queues), not `real_num_tx_queues`,
so channel changes via `ethtool -L` are covered -- no new queues are
ever created at runtime.
### Why min_limit=8
- One cache-line of ptr_ring pointers (8 x 8 = 64 bytes)
- Lowest requeue count at max throughput (5.3M vs 16.9M at limit=2)
- Keeps full-budget NAPI polls (avg_work=63) -- no scheduling overhead
- Latency only 0.3ms worse than limit=2 at moderate load (1.24ms vs 0.94ms)
- Still 6x better latency than no-BQL at heavy load (6ms vs 37.8ms)
- User can lower to 0 or raise via sysfs limit_min at any time
### Verified: driver default works (nrules=15000, --hist)
Tested with `veth-set-bql-min-limit-8` patch applied, no `--bql-min-limit`
sysfs override. BQL limit=8 held stable, ping RTT ~6.5ms (matches sysfs
override results).
BQL inflight histogram (bpftrace, 169K samples):
```
[1] 15 | |
[2, 4) 21193 |@@@@@@@@@@@@@ |
[4, 8) 63615 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16) 80116 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 4709 |@@@ |
```
- Inflight avg=7, max=17 -- ring stays shallow
- Peak at [8,16): inflight near the limit=8 floor most of the time
- [4,8) second: ring draining between NAPI polls
- [16,32) rare: brief producer bursts
- stack_xoff ~15K/5s, drv_xoff=0 -- BQL stops queue well before ring fills
- NAPI avg_work=61, almost all full-budget polls
## Conclusion
Per-packet BQL completion in V5 is the right design. It gives DQL the
information it needs to keep the dark buffer minimal, which is exactly
what we want for latency reduction.
Simon's suggestion to call netdev_tx_completed_queue() once per NAPI poll
would regress ping latency from 4.5ms to 20ms at production-like iptables
rule counts.
The default min_limit=8 (via dql_set_min_limit) is the proposed follow-up
to address the requeue overhead that per-packet completion causes. It
keeps latency close to optimal while reducing the ~10% throughput loss
and 20x requeue increase (6.1M vs 311K) that limit=2 causes at max speed.
Users wanting tighter latency can set limit_min=0 via sysfs to get the
original limit=2 behavior.
^ permalink raw reply
* [PATCH net-next v7 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management
From: Long Li @ 2026-05-07 19:12 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
This series adds per-vPort Event Queue (EQ) allocation and MSI-X interrupt
management for the MANA driver. Previously, all vPorts shared a single set
of EQs. This change enables dedicated EQs per vPort with support for both
dedicated and shared MSI-X vector allocation modes.
Patch 1 moves EQ ownership from mana_context to per-vPort mana_port_context
and exports the create/destroy functions for the RDMA driver. It also adds
EQ create/destroy calls to mana_ib_cfg_vport/uncfg_vport so that RDMA vPorts
get their own EQs.
Patch 2 adds device capability queries to determine whether MSI-X vectors
should be dedicated per-vPort or shared. When the number of available MSI-X
vectors is insufficient for dedicated allocation, the driver enables sharing
mode with bitmap-based vector assignment.
Patch 3 introduces the GIC (GDMA IRQ Context) abstraction with reference
counting, allowing multiple EQs to safely share a single MSI-X vector.
Patch 4 converts the global EQ allocation in probe/resume to use the new
GIC functions.
Patch 5 adds per-vPort GIC lifecycle management, calling get/put on each
EQ creation and destruction during vPort open/close.
Patch 6 extends the same GIC lifecycle management to the RDMA driver's EQ
allocation path.
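To illustrate the reference-counting idea behind Patch 3, here is a simplified userspace sketch in C. It is not the driver code: struct irq_ctx, irq_ctx_get() and irq_ctx_put() are hypothetical stand-ins, the table size is arbitrary, and locking is omitted even though concurrent callers would need it.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_MSIX 8	/* arbitrary table size for this sketch */

/* Hypothetical stand-in for a per-vector interrupt context. */
struct irq_ctx {
	int refs;		/* number of EQs attached to this vector */
	bool irq_requested;	/* models whether the IRQ handler is installed */
};

static struct irq_ctx ctx_table[MAX_MSIX];

/* Take a reference on a vector; the first user "requests" the IRQ. */
static struct irq_ctx *irq_ctx_get(int msix_index)
{
	struct irq_ctx *c;

	if (msix_index < 0 || msix_index >= MAX_MSIX)
		return NULL;

	c = &ctx_table[msix_index];
	if (c->refs++ == 0)
		c->irq_requested = true;
	return c;
}

/* Drop a reference; the last user "frees" the IRQ. */
static void irq_ctx_put(int msix_index)
{
	struct irq_ctx *c;

	if (msix_index < 0 || msix_index >= MAX_MSIX)
		return;

	c = &ctx_table[msix_index];
	if (c->refs > 0 && --c->refs == 0)
		c->irq_requested = false;
}
```

Two EQs that land on the same vector both call irq_ctx_get() with the same index and receive the same context; the vector is only released once both have called irq_ctx_put().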
Changes in v7:
- Rebased on net-next/main
- Patch 1: Guard ibdev_dbg() in mana_ib_cfg_vport() with error check so
the vport handle is not logged on the failure path
- Patch 1: Fix checkpatch line length warning in debugfs_create_dir() call
- Patch 2: Use rounddown_pow_of_two() instead of roundup_pow_of_two() when
computing per-vPort queue count to avoid unnecessarily forcing shared
MSI-X mode in borderline configurations
- Patch 2: Call mana_gd_setup_remaining_irqs() unconditionally to ensure
irq_contexts are populated in both dedicated and shared MSI-X modes,
fixing bisectability between patches 2 and 5
- Patch 2: Fix checkpatch line length warning in debugfs_create_u16() call
- Patch 3: Use cached gic->irq instead of pci_irq_vector() lookup in
mana_gd_put_gic() for consistency with the allocation path
- Patch 3: Fix checkpatch line length warning in mana_gd_get_gic()
declaration
- Patch 5: Fix unsigned int* to int* pointer type mismatch when calling
mana_gd_get_gic() by using a local int variable for the MSI index
- Patch 6: Fix same unsigned int* to int* pointer type mismatch in RDMA
EQ creation path
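The rounding change in Patch 2 can be illustrated with a small worked example in userspace C. The helpers below reimplement the rounding for illustration only (the kernel versions are rounddown_pow_of_two() and roundup_pow_of_two()), and dedicated_fits() along with the numbers used are assumptions, not the driver's actual check.

```c
#include <stdbool.h>

/* Largest power of two <= n (n must be non-zero). */
static unsigned int rounddown_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p <= n / 2)
		p *= 2;
	return p;
}

/* Smallest power of two >= n (n must be non-zero and at most 2^31). */
static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = rounddown_pow2(n);

	return p == n ? p : p * 2;
}

/* Hypothetical check: can every vPort get its own set of vectors? */
static bool dedicated_fits(unsigned int per_vport, unsigned int num_ports,
			   unsigned int msix_avail)
{
	return per_vport * num_ports <= msix_avail;
}
```

With, say, 48 usable vectors and 2 vPorts, an even split gives 24 per vPort. Rounding up yields 32, and 2 * 32 = 64 no longer fits in 48, so such a configuration would fall back to sharing. Rounding down yields 16, and 2 * 16 = 32 fits, so dedicated vectors remain possible.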
Changes in v6:
- Rebased on net-next/main (v7.1-rc1)
Changes in v5:
- Rebased on net-next/main
Changes in v4:
- Rebased on net-next/main (v7.0-rc4)
- Patch 2: Use MANA_DEF_NUM_QUEUES instead of hardcoded 16 for
max_num_queues clamping
- Patch 3: Track dyn_msix in GIC context instead of re-checking
pci_msix_can_alloc_dyn() on each call; improved remove_irqs iteration
to skip unallocated entries
Changes in v3:
- Rebased on net-next/main
- Patch 1: Added NULL check for mpc->eqs in mana_ib_create_qp_rss() to
prevent NULL pointer dereference when RSS QP is created before a raw QP
has configured the vport and allocated EQs
Changes in v2:
- Rebased on net-next/main (adapted to kzalloc_objs/kzalloc_obj macros,
new GDMA_DRV_CAP_FLAG definitions)
- Patch 2: Fixed misleading comment for max_num_queues vs
max_num_queues_vport in gdma.h
- Patch 3: Fixed spelling typo in gdma_main.c ("difference" -> "different")
Long Li (6):
net: mana: Create separate EQs for each vPort
net: mana: Query device capabilities and configure MSI-X sharing for
EQs
net: mana: Introduce GIC context with refcounting for interrupt
management
net: mana: Use GIC functions to allocate global EQs
net: mana: Allocate interrupt context for each EQ when creating vPort
RDMA/mana_ib: Allocate interrupt contexts on EQs
drivers/infiniband/hw/mana/main.c | 60 +++-
drivers/infiniband/hw/mana/qp.c | 16 +-
.../net/ethernet/microsoft/mana/gdma_main.c | 297 +++++++++++++-----
drivers/net/ethernet/microsoft/mana/mana_en.c | 168 ++++++----
include/net/mana/gdma.h | 33 +-
include/net/mana/mana.h | 7 +-
6 files changed, 425 insertions(+), 156 deletions(-)
--
2.43.0
* [PATCH net-next v7 1/6] net: mana: Create separate EQs for each vPort
From: Long Li @ 2026-05-07 19:12 UTC
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui, shradhagupta
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260507191237.438671-1-longli@microsoft.com>
To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ
sharing among the vPorts and create dedicated EQs for each vPort.
Move the EQ definition from struct mana_context to struct mana_port_context
and update related support functions. Export mana_create_eq() and
mana_destroy_eq() for use by the MANA RDMA driver.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/main.c | 19 ++-
drivers/infiniband/hw/mana/qp.c | 16 ++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 111 ++++++++++--------
include/net/mana/mana.h | 7 +-
4 files changed, 98 insertions(+), 55 deletions(-)
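The diff below reorders setup in mana_alloc_queues() to vPort, EQ, TXQ, RXQ, with error labels that unwind in reverse. As a reading aid, here is a toy userspace model of that goto ladder in C. Everything in it is illustrative: create(), step() and the single-letter step names are invented, and no real resources are involved.

```c
#include <stddef.h>

/* Records setup (uppercase) and teardown (lowercase) steps in order. */
static char trace[64];
static size_t trace_len;

static void trace_reset(void)
{
	trace_len = 0;
	trace[0] = '\0';
}

static void step(char c)
{
	if (trace_len < sizeof(trace) - 1)
		trace[trace_len++] = c;
	trace[trace_len] = '\0';
}

/* Stand-in for a setup step: fails when fail_at names it. */
static int create(char id, char fail_at)
{
	if (id == fail_at)
		return -1;
	step(id);
	return 0;
}

/* V = vPort, E = EQs, T = TXQs, N = set real TX queue count, R = RXQs. */
static int alloc_queues(char fail_at)
{
	int err;

	err = create('V', fail_at);
	if (err)
		return err;
	err = create('E', fail_at);
	if (err)
		goto destroy_vport;
	err = create('T', fail_at);
	if (err)
		goto destroy_eq;
	err = create('N', fail_at);	/* nothing of its own to undo */
	if (err)
		goto destroy_txq;
	err = create('R', fail_at);	/* may be partially set up */
	if (err)
		goto destroy_rxq;
	return 0;

destroy_rxq:
	step('r');
destroy_txq:
	step('t');
destroy_eq:
	step('e');
destroy_vport:
	step('v');
	return err;
}
```

A failure in the RXQ step leaves the trace "VETNrtev": each later label falls through to the earlier ones, so the EQs are always destroyed after the queues that point at them and before the vPort is torn down.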
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index ac5e75dd3494..8000ab6e8beb 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -20,8 +20,10 @@ void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
pd->vport_use_count--;
WARN_ON(pd->vport_use_count < 0);
- if (!pd->vport_use_count)
+ if (!pd->vport_use_count) {
+ mana_destroy_eq(mpc);
mana_uncfg_vport(mpc);
+ }
mutex_unlock(&pd->vport_mutex);
}
@@ -55,15 +57,22 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
return err;
}
- mutex_unlock(&pd->vport_mutex);
pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
pd->tx_vp_offset = mpc->tx_vp_offset;
+ err = mana_create_eq(mpc);
+ if (err) {
+ mana_uncfg_vport(mpc);
+ pd->vport_use_count--;
+ }
- ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
- mpc->port_handle, pd->pdn, doorbell_id);
+ mutex_unlock(&pd->vport_mutex);
- return 0;
+ if (!err)
+ ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
+ mpc->port_handle, pd->pdn, doorbell_id);
+
+ return err;
}
int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 645581359cee..6f1043383e8c 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -168,7 +168,15 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
cq_spec.gdma_region = cq->queue.gdma_region;
cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
cq_spec.modr_ctx_id = 0;
- eq = &mpc->ac->eqs[cq->comp_vector];
+ /* EQs are created when a raw QP configures the vport.
+ * A raw QP must be created before creating rwq_ind_tbl.
+ */
+ if (!mpc->eqs) {
+ ret = -EINVAL;
+ i--;
+ goto fail;
+ }
+ eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
cq_spec.attached_eq = eq->eq->id;
ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
@@ -317,7 +325,11 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
cq_spec.modr_ctx_id = 0;
eq_vec = send_cq->comp_vector;
- eq = &mpc->ac->eqs[eq_vec];
+ if (!mpc->eqs) {
+ err = -EINVAL;
+ goto err_destroy_queue;
+ }
+ eq = &mpc->eqs[eq_vec % mpc->num_queues];
cq_spec.attached_eq = eq->eq->id;
err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 462a457e7d53..a13204b3ee79 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1615,78 +1615,83 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
}
EXPORT_SYMBOL_NS(mana_destroy_wq_obj, "NET_MANA");
-static void mana_destroy_eq(struct mana_context *ac)
+void mana_destroy_eq(struct mana_port_context *apc)
{
+ struct mana_context *ac = apc->ac;
struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct gdma_queue *eq;
int i;
- if (!ac->eqs)
+ if (!apc->eqs)
return;
- debugfs_remove_recursive(ac->mana_eqs_debugfs);
- ac->mana_eqs_debugfs = NULL;
+ debugfs_remove_recursive(apc->mana_eqs_debugfs);
+ apc->mana_eqs_debugfs = NULL;
- for (i = 0; i < gc->max_num_queues; i++) {
- eq = ac->eqs[i].eq;
+ for (i = 0; i < apc->num_queues; i++) {
+ eq = apc->eqs[i].eq;
if (!eq)
continue;
mana_gd_destroy_queue(gc, eq);
}
- kfree(ac->eqs);
- ac->eqs = NULL;
+ kfree(apc->eqs);
+ apc->eqs = NULL;
}
+EXPORT_SYMBOL_NS(mana_destroy_eq, "NET_MANA");
-static void mana_create_eq_debugfs(struct mana_context *ac, int i)
+static void mana_create_eq_debugfs(struct mana_port_context *apc, int i)
{
- struct mana_eq eq = ac->eqs[i];
+ struct mana_eq eq = apc->eqs[i];
char eqnum[32];
sprintf(eqnum, "eq%d", i);
- eq.mana_eq_debugfs = debugfs_create_dir(eqnum, ac->mana_eqs_debugfs);
+ eq.mana_eq_debugfs = debugfs_create_dir(eqnum, apc->mana_eqs_debugfs);
debugfs_create_u32("head", 0400, eq.mana_eq_debugfs, &eq.eq->head);
debugfs_create_u32("tail", 0400, eq.mana_eq_debugfs, &eq.eq->tail);
debugfs_create_file("eq_dump", 0400, eq.mana_eq_debugfs, eq.eq, &mana_dbg_q_fops);
}
-static int mana_create_eq(struct mana_context *ac)
+int mana_create_eq(struct mana_port_context *apc)
{
- struct gdma_dev *gd = ac->gdma_dev;
+ struct gdma_dev *gd = apc->ac->gdma_dev;
struct gdma_context *gc = gd->gdma_context;
struct gdma_queue_spec spec = {};
int err;
int i;
- ac->eqs = kzalloc_objs(struct mana_eq, gc->max_num_queues);
- if (!ac->eqs)
+ WARN_ON(apc->eqs);
+ apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
+ if (!apc->eqs)
return -ENOMEM;
spec.type = GDMA_EQ;
spec.monitor_avl_buf = false;
spec.queue_size = EQ_SIZE;
spec.eq.callback = NULL;
- spec.eq.context = ac->eqs;
+ spec.eq.context = apc->eqs;
spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
- ac->mana_eqs_debugfs = debugfs_create_dir("EQs", gc->mana_pci_debugfs);
+ apc->mana_eqs_debugfs = debugfs_create_dir("EQs",
+ apc->mana_port_debugfs);
- for (i = 0; i < gc->max_num_queues; i++) {
+ for (i = 0; i < apc->num_queues; i++) {
spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
- err = mana_gd_create_mana_eq(gd, &spec, &ac->eqs[i].eq);
+ err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
if (err) {
dev_err(gc->dev, "Failed to create EQ %d : %d\n", i, err);
goto out;
}
- mana_create_eq_debugfs(ac, i);
+ mana_create_eq_debugfs(apc, i);
}
return 0;
out:
- mana_destroy_eq(ac);
+ mana_destroy_eq(apc);
return err;
}
+EXPORT_SYMBOL_NS(mana_create_eq, "NET_MANA");
static int mana_fence_rq(struct mana_port_context *apc, struct mana_rxq *rxq)
{
@@ -2451,7 +2456,7 @@ static int mana_create_txq(struct mana_port_context *apc,
spec.monitor_avl_buf = false;
spec.queue_size = cq_size;
spec.cq.callback = mana_schedule_napi;
- spec.cq.parent_eq = ac->eqs[i].eq;
+ spec.cq.parent_eq = apc->eqs[i].eq;
spec.cq.context = cq;
err = mana_gd_create_mana_wq_cq(gd, &spec, &cq->gdma_cq);
if (err)
@@ -2844,13 +2849,12 @@ static void mana_create_rxq_debugfs(struct mana_port_context *apc, int idx)
static int mana_add_rx_queues(struct mana_port_context *apc,
struct net_device *ndev)
{
- struct mana_context *ac = apc->ac;
struct mana_rxq *rxq;
int err = 0;
int i;
for (i = 0; i < apc->num_queues; i++) {
- rxq = mana_create_rxq(apc, i, &ac->eqs[i], ndev);
+ rxq = mana_create_rxq(apc, i, &apc->eqs[i], ndev);
if (!rxq) {
err = -ENOMEM;
netdev_err(ndev, "Failed to create rxq %d : %d\n", i, err);
@@ -2869,9 +2873,8 @@ static int mana_add_rx_queues(struct mana_port_context *apc,
return err;
}
-static void mana_destroy_vport(struct mana_port_context *apc)
+static void mana_destroy_rxqs(struct mana_port_context *apc)
{
- struct gdma_dev *gd = apc->ac->gdma_dev;
struct mana_rxq *rxq;
u32 rxq_idx;
@@ -2883,8 +2886,12 @@ static void mana_destroy_vport(struct mana_port_context *apc)
mana_destroy_rxq(apc, rxq, true);
apc->rxqs[rxq_idx] = NULL;
}
+}
+
+static void mana_destroy_vport(struct mana_port_context *apc)
+{
+ struct gdma_dev *gd = apc->ac->gdma_dev;
- mana_destroy_txq(apc);
mana_uncfg_vport(apc);
if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode)
@@ -2905,11 +2912,7 @@ static int mana_create_vport(struct mana_port_context *apc,
return err;
}
- err = mana_cfg_vport(apc, gd->pdid, gd->doorbell);
- if (err)
- return err;
-
- return mana_create_txq(apc, net);
+ return mana_cfg_vport(apc, gd->pdid, gd->doorbell);
}
static int mana_rss_table_alloc(struct mana_port_context *apc)
@@ -3195,21 +3198,36 @@ int mana_alloc_queues(struct net_device *ndev)
err = mana_create_vport(apc, ndev);
if (err) {
- netdev_err(ndev, "Failed to create vPort %u : %d\n", apc->port_idx, err);
+ netdev_err(ndev, "Failed to create vPort %u : %d\n",
+ apc->port_idx, err);
return err;
}
+ err = mana_create_eq(apc);
+ if (err) {
+ netdev_err(ndev, "Failed to create EQ on vPort %u: %d\n",
+ apc->port_idx, err);
+ goto destroy_vport;
+ }
+
+ err = mana_create_txq(apc, ndev);
+ if (err) {
+ netdev_err(ndev, "Failed to create TXQ on vPort %u: %d\n",
+ apc->port_idx, err);
+ goto destroy_eq;
+ }
+
err = netif_set_real_num_tx_queues(ndev, apc->num_queues);
if (err) {
netdev_err(ndev,
"netif_set_real_num_tx_queues () failed for ndev with num_queues %u : %d\n",
apc->num_queues, err);
- goto destroy_vport;
+ goto destroy_txq;
}
err = mana_add_rx_queues(apc, ndev);
if (err)
- goto destroy_vport;
+ goto destroy_rxq;
apc->rss_state = apc->num_queues > 1 ? TRI_STATE_TRUE : TRI_STATE_FALSE;
@@ -3218,7 +3236,7 @@ int mana_alloc_queues(struct net_device *ndev)
netdev_err(ndev,
"netif_set_real_num_rx_queues () failed for ndev with num_queues %u : %d\n",
apc->num_queues, err);
- goto destroy_vport;
+ goto destroy_rxq;
}
mana_rss_table_init(apc);
@@ -3226,19 +3244,25 @@ int mana_alloc_queues(struct net_device *ndev)
err = mana_config_rss(apc, TRI_STATE_TRUE, true, true);
if (err) {
netdev_err(ndev, "Failed to configure RSS table: %d\n", err);
- goto destroy_vport;
+ goto destroy_rxq;
}
if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode) {
err = mana_pf_register_filter(apc);
if (err)
- goto destroy_vport;
+ goto destroy_rxq;
}
mana_chn_setxdp(apc, mana_xdp_get(apc));
return 0;
+destroy_rxq:
+ mana_destroy_rxqs(apc);
+destroy_txq:
+ mana_destroy_txq(apc);
+destroy_eq:
+ mana_destroy_eq(apc);
destroy_vport:
mana_destroy_vport(apc);
return err;
@@ -3343,6 +3367,9 @@ static int mana_dealloc_queues(struct net_device *ndev)
mana_fence_rqs(apc);
/* Even in err case, still need to cleanup the vPort */
+ mana_destroy_rxqs(apc);
+ mana_destroy_txq(apc);
+ mana_destroy_eq(apc);
mana_destroy_vport(apc);
return 0;
@@ -3663,12 +3690,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
INIT_DELAYED_WORK(&ac->gf_stats_work, mana_gf_stats_work_handler);
- err = mana_create_eq(ac);
- if (err) {
- dev_err(dev, "Failed to create EQs: %d\n", err);
- goto out;
- }
-
err = mana_query_device_cfg(ac, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
if (err)
@@ -3808,8 +3829,6 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
free_netdev(ndev);
}
- mana_destroy_eq(ac);
-
if (ac->per_port_queue_reset_wq) {
destroy_workqueue(ac->per_port_queue_reset_wq);
ac->per_port_queue_reset_wq = NULL;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index aa90a858c8e3..c8e7d16f6685 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -480,8 +480,6 @@ struct mana_context {
u8 bm_hostmode;
struct mana_ethtool_hc_stats hc_stats;
- struct mana_eq *eqs;
- struct dentry *mana_eqs_debugfs;
struct workqueue_struct *per_port_queue_reset_wq;
/* Workqueue for querying hardware stats */
struct delayed_work gf_stats_work;
@@ -501,6 +499,9 @@ struct mana_port_context {
u8 mac_addr[ETH_ALEN];
+ struct mana_eq *eqs;
+ struct dentry *mana_eqs_debugfs;
+
enum TRI_STATE rss_state;
mana_handle_t default_rxobj;
@@ -1034,6 +1035,8 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
u32 doorbell_pg_id);
void mana_uncfg_vport(struct mana_port_context *apc);
+int mana_create_eq(struct mana_port_context *apc);
+void mana_destroy_eq(struct mana_port_context *apc);
struct net_device *mana_get_primary_netdev(struct mana_context *ac,
u32 port_index,
--
2.43.0