Netdev List
 help / color / mirror / Atom feed
* [PATCH v6] vmxnet3: Add XDP support.
From: William Tu @ 2022-12-16 14:41 UTC (permalink / raw)
  To: netdev; +Cc: tuc, gyang, doshir, alexander.duyck

The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT.

Background:
The vmxnet3 rx consists of three rings: ring0, ring1, and dataring.
For r0 and r1, buffers at r0 are allocated using alloc_skb APIs and dma
mapped to the ring's descriptor. If LRO is enabled and packet size larger
than 3K, VMXNET3_MAX_SKB_BUF_SIZE, then r1 is used to mapped the rest of
the buffer larger than VMXNET3_MAX_SKB_BUF_SIZE. Each buffer in r1 is
allocated using alloc_page. So for LRO packets, the payload will be in one
buffer from r0 and multiple from r1, for non-LRO packets, only one
descriptor in r0 is used for packet size less than 3k.

When receiving a packet, the first descriptor will have the sop (start of
packet) bit set, and the last descriptor will have the eop (end of packet)
bit set. Non-LRO packets will have only one descriptor with both sop and
eop set.

Other than r0 and r1, vmxnet3 dataring is specifically designed for
handling packets with small size, usually 128 bytes, defined in
VMXNET3_DEF_RXDATA_DESC_SIZE, by simply copying the packet from the backend
driver in ESXi to the ring's memory region at front-end vmxnet3 driver, in
order to avoid memory mapping/unmapping overhead. In summary, packet size:
    A. < 128B: use dataring
    B. 128B - 3K: use ring0 (VMXNET3_RX_BUF_SKB)
    C. > 3K: use ring0 and ring1 (VMXNET3_RX_BUF_SKB + VMXNET3_RX_BUF_PAGE)
As a result, the patch adds XDP support for packets using dataring
and r0 (case A and B), not the large packet size when LRO is enabled.

XDP Implementation:
When user loads and XDP prog, vmxnet3 driver checks configurations, such
as mtu, lro, and re-allocate the rx buffer size for reserving the extra
headroom, XDP_PACKET_HEADROOM, for XDP frame. The XDP prog will then be
associated with every rx queue of the device. Note that when using dataring
for small packet size, vmxnet3 (front-end driver) doesn't control the
buffer allocation, as a result the XDP frame's headroom is zero.

The receive side of XDP is implemented for case A and B, by invoking the
bpf program at vmxnet3_rq_rx_complete and handle its returned action.
The new vmxnet3_run_xdp function handles the difference of using dataring
or ring0, and decides the next journey of the packet afterward.

For TX, vmxnet3 has split header design. Outgoing packets are parsed
first and protocol headers (L2/L3/L4) are copied to the backend. The
rest of the payload are dma mapped. Since XDP_TX does not parse the
packet protocol, the entire XDP frame is using dma mapped for the
transmission. Because the actual TX is deferred until txThreshold, it's
possible that an rx buffer is being overwritten by incoming burst of rx
packets, before tx buffer being transmitted. As a result, we allocate new
skb and introduce a copy.

Performance:
Tested using two VMs inside one ESXi machine, using single core on each
vmxnet3 device, sender using DPDK testpmd tx-mode attached to vmxnet3
device, sending 64B or 512B packet.

VM1 txgen:
$ dpdk-testpmd -l 0-3 -n 1 -- -i --nb-cores=3 \
--forward-mode=txonly --eth-peer=0,<mac addr of vm2>
option: add "--txonly-multi-flow"
option: use --txpkts=512 or 64 byte

VM2 running XDP:
$ ./samples/bpf/xdp_rxq_info -d ens160 -a <options> --skb-mode
$ ./samples/bpf/xdp_rxq_info -d ens160 -a <options>
options: XDP_DROP, XDP_PASS, XDP_TX

To test REDIRECT to cpu 0, use
$ ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop

Single core performance comparison with skb-mode.
64B:      skb-mode -> native-mode (with this patch)
XDP_DROP: 932Kpps -> 2.0Mpps
XDP_PASS: 284Kpps -> 314Kpps
XDP_TX:   591Kpps -> 1.8Mpps
REDIRECT: 453Kpps -> 501Kpps

512B:      skb-mode -> native-mode (with this patch)
XDP_DROP: 890Kpps -> 1.3Mpps
XDP_PASS: 284Kpps -> 314Kpps
XDP_TX:   555Kpps -> 1.2Mpps
REDIRECT: 670Kpps -> 430Kpps

Limitations:
a. LRO will be disabled when users load XDP program
b. MTU will be checked and limit to
   VMXNET3_MAX_SKB_BUF_SIZE(3K) - XDP_PACKET_HEADROOM(256) -
   SKB_DATA_ALIGN(sizeof(struct skb_shared_info))

Signed-off-by: William Tu <u9012063@gmail.com>
---
v5 -> v6:
work on feedbacks from Alexander Duyck
- remove unused skb parameter at vmxnet3_xdp_xmit
- add trace point for XDP_ABORTED
- avoid TX packet buffer being overwritten by allocatin
  new skb and memcpy (TX performance drop from 2.3 to 1.8Mpps)
- update some of the performance number and a demo video
  https://youtu.be/I3nx5wQDTXw
- pass VMware internal regression test using non-ENS: vmxnet3v2,
  vmxnet3v3, vmxnet3v4, so remove RFC tag.

v4 -> v5:
- move XDP code to separate file: vmxnet3_xdp.{c, h},
  suggested by Guolin
- expose vmxnet3_rq_create_all and vmxnet3_adjust_rx_ring_size
- more test using samples/bpf/{xdp1, xdp2, xdp_adjust_tail}
- add debug print
- rebase on commit 65e6af6cebe

v3 -> v4:
- code refactoring and improved commit message
- make dataring and non-dataring case clear
- in XDP_PASS, handle xdp.data and xdp.data_end change after
  bpf program executed
- now still working on internal testing
- v4 applied on net-next commit 65e6af6cebef

v2 -> v3:
- code refactoring: move the XDP processing to the front
  of the vmxnet3_rq_rx_complete, and minimize the places
  of changes to existing code.
- Performance improvement over BUF_SKB (512B) due to skipping
  skb allocation when DROP and TX.

v1 -> v2:
- Avoid skb allocation for small packet size (when dataring is used)
- Use rcu_read_lock unlock instead of READ_ONCE
- Peroformance improvement over v1
- Merge xdp drop, tx, pass, and redirect into 1 patch

I tested the patch using below script:
while [ true ]; do
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_DROP --skb-mode
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_DROP
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_PASS --skb-mode
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_PASS
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_TX --skb-mode
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_TX
timeout 20 ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop
timeout 20 ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e pass
done
---
 drivers/net/vmxnet3/Makefile          |   2 +-
 drivers/net/vmxnet3/vmxnet3_drv.c     |  48 ++-
 drivers/net/vmxnet3/vmxnet3_ethtool.c |  14 +
 drivers/net/vmxnet3/vmxnet3_int.h     |  20 ++
 drivers/net/vmxnet3/vmxnet3_xdp.c     | 458 ++++++++++++++++++++++++++
 drivers/net/vmxnet3/vmxnet3_xdp.h     |  39 +++
 6 files changed, 573 insertions(+), 8 deletions(-)
 create mode 100644 drivers/net/vmxnet3/vmxnet3_xdp.c
 create mode 100644 drivers/net/vmxnet3/vmxnet3_xdp.h

diff --git a/drivers/net/vmxnet3/Makefile b/drivers/net/vmxnet3/Makefile
index a666a88ac1ff..f82870c10205 100644
--- a/drivers/net/vmxnet3/Makefile
+++ b/drivers/net/vmxnet3/Makefile
@@ -32,4 +32,4 @@
 
 obj-$(CONFIG_VMXNET3) += vmxnet3.o
 
-vmxnet3-objs := vmxnet3_drv.o vmxnet3_ethtool.o
+vmxnet3-objs := vmxnet3_drv.o vmxnet3_ethtool.o vmxnet3_xdp.o
diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index d3e7b27eb933..b55fec2ac2bf 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -28,6 +28,7 @@
 #include <net/ip6_checksum.h>
 
 #include "vmxnet3_int.h"
+#include "vmxnet3_xdp.h"
 
 char vmxnet3_driver_name[] = "vmxnet3";
 #define VMXNET3_DRIVER_DESC "VMware vmxnet3 virtual NIC driver"
@@ -351,7 +352,6 @@ vmxnet3_unmap_pkt(u32 eop_idx, struct vmxnet3_tx_queue *tq,
 	BUG_ON(VMXNET3_TXDESC_GET_EOP(&(tq->tx_ring.base[eop_idx].txd)) != 1);
 
 	skb = tq->buf_info[eop_idx].skb;
-	BUG_ON(skb == NULL);
 	tq->buf_info[eop_idx].skb = NULL;
 
 	VMXNET3_INC_RING_IDX_ONLY(eop_idx, tq->tx_ring.size);
@@ -592,6 +592,9 @@ vmxnet3_rq_alloc_rx_buf(struct vmxnet3_rx_queue *rq, u32 ring_idx,
 				rbi->skb = __netdev_alloc_skb_ip_align(adapter->netdev,
 								       rbi->len,
 								       GFP_KERNEL);
+				if (adapter->xdp_enabled)
+					skb_reserve(rbi->skb, XDP_PACKET_HEADROOM);
+
 				if (unlikely(rbi->skb == NULL)) {
 					rq->stats.rx_buf_alloc_failure++;
 					break;
@@ -1404,6 +1407,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 	struct Vmxnet3_RxDesc rxCmdDesc;
 	struct Vmxnet3_RxCompDesc rxComp;
 #endif
+	bool need_flush = 0;
+
 	vmxnet3_getRxComp(rcd, &rq->comp_ring.base[rq->comp_ring.next2proc].rcd,
 			  &rxComp);
 	while (rcd->gen == rq->comp_ring.gen) {
@@ -1444,6 +1449,19 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 			goto rcd_done;
 		}
 
+		if (unlikely(rcd->sop && rcd->eop && adapter->xdp_enabled)) {
+			int act = vmxnet3_process_xdp(adapter, rq, rcd, rbi,
+						      rxd, &need_flush);
+			ctx->skb = NULL;
+			switch (act) {
+			case VMXNET3_XDP_TAKEN:
+				goto rcd_done;
+			case VMXNET3_XDP_CONTINUE:
+			default:
+				break;
+			}
+		}
+
 		if (rcd->sop) { /* first buf of the pkt */
 			bool rxDataRingUsed;
 			u16 len;
@@ -1483,6 +1501,10 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 				goto rcd_done;
 			}
 
+			if (adapter->xdp_enabled && !rxDataRingUsed)
+				skb_reserve(new_skb,
+					    XDP_PACKET_HEADROOM);
+
 			if (rxDataRingUsed) {
 				size_t sz;
 
@@ -1730,6 +1752,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 		vmxnet3_getRxComp(rcd,
 				  &rq->comp_ring.base[rq->comp_ring.next2proc].rcd, &rxComp);
 	}
+	if (need_flush)
+		xdp_do_flush_map();
 
 	return num_pkts;
 }
@@ -1776,6 +1800,7 @@ vmxnet3_rq_cleanup(struct vmxnet3_rx_queue *rq,
 
 	rq->comp_ring.gen = VMXNET3_INIT_GEN;
 	rq->comp_ring.next2proc = 0;
+	rq->xdp_bpf_prog = NULL;
 }
 
 
@@ -1788,7 +1813,6 @@ vmxnet3_rq_cleanup_all(struct vmxnet3_adapter *adapter)
 		vmxnet3_rq_cleanup(&adapter->rx_queue[i], adapter);
 }
 
-
 static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
 			       struct vmxnet3_adapter *adapter)
 {
@@ -1832,6 +1856,8 @@ static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
 	kfree(rq->buf_info[0]);
 	rq->buf_info[0] = NULL;
 	rq->buf_info[1] = NULL;
+
+	vmxnet3_unregister_xdp_rxq(rq);
 }
 
 static void
@@ -1893,6 +1919,10 @@ vmxnet3_rq_init(struct vmxnet3_rx_queue *rq,
 	}
 	vmxnet3_rq_alloc_rx_buf(rq, 1, rq->rx_ring[1].size - 1, adapter);
 
+	/* always register, even if no XDP prog used */
+	if (vmxnet3_register_xdp_rxq(rq, adapter))
+		return -EINVAL;
+
 	/* reset the comp ring */
 	rq->comp_ring.next2proc = 0;
 	memset(rq->comp_ring.base, 0, rq->comp_ring.size *
@@ -1989,7 +2019,7 @@ vmxnet3_rq_create(struct vmxnet3_rx_queue *rq, struct vmxnet3_adapter *adapter)
 }
 
 
-static int
+int
 vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter)
 {
 	int i, err = 0;
@@ -2585,7 +2615,8 @@ vmxnet3_setup_driver_shared(struct vmxnet3_adapter *adapter)
 	if (adapter->netdev->features & NETIF_F_RXCSUM)
 		devRead->misc.uptFeatures |= UPT1_F_RXCSUM;
 
-	if (adapter->netdev->features & NETIF_F_LRO) {
+	if (adapter->netdev->features & NETIF_F_LRO &&
+	    !adapter->xdp_enabled) {
 		devRead->misc.uptFeatures |= UPT1_F_LRO;
 		devRead->misc.maxNumRxSG = cpu_to_le16(1 + MAX_SKB_FRAGS);
 	}
@@ -3026,7 +3057,7 @@ vmxnet3_free_pci_resources(struct vmxnet3_adapter *adapter)
 }
 
 
-static void
+void
 vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
 {
 	size_t sz, i, ring0_size, ring1_size, comp_size;
@@ -3035,7 +3066,8 @@ vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
 		if (adapter->netdev->mtu <= VMXNET3_MAX_SKB_BUF_SIZE -
 					    VMXNET3_MAX_ETH_HDR_SIZE) {
 			adapter->skb_buf_size = adapter->netdev->mtu +
-						VMXNET3_MAX_ETH_HDR_SIZE;
+						VMXNET3_MAX_ETH_HDR_SIZE +
+						vmxnet3_xdp_headroom(adapter);
 			if (adapter->skb_buf_size < VMXNET3_MIN_T0_BUF_SIZE)
 				adapter->skb_buf_size = VMXNET3_MIN_T0_BUF_SIZE;
 
@@ -3563,7 +3595,6 @@ vmxnet3_reset_work(struct work_struct *data)
 	clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
 }
 
-
 static int
 vmxnet3_probe_device(struct pci_dev *pdev,
 		     const struct pci_device_id *id)
@@ -3585,6 +3616,8 @@ vmxnet3_probe_device(struct pci_dev *pdev,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 		.ndo_poll_controller = vmxnet3_netpoll,
 #endif
+		.ndo_bpf = vmxnet3_xdp,
+		.ndo_xdp_xmit = vmxnet3_xdp_xmit,
 	};
 	int err;
 	u32 ver;
@@ -3900,6 +3933,7 @@ vmxnet3_probe_device(struct pci_dev *pdev,
 		goto err_register;
 	}
 
+	adapter->xdp_enabled = false;
 	vmxnet3_check_link(adapter, false);
 	return 0;
 
diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c b/drivers/net/vmxnet3/vmxnet3_ethtool.c
index 18cf7c723201..6f542236b26e 100644
--- a/drivers/net/vmxnet3/vmxnet3_ethtool.c
+++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
@@ -76,6 +76,10 @@ vmxnet3_tq_driver_stats[] = {
 					 copy_skb_header) },
 	{ "  giant hdr",	offsetof(struct vmxnet3_tq_driver_stats,
 					 oversized_hdr) },
+	{ "  xdp xmit",		offsetof(struct vmxnet3_tq_driver_stats,
+					 xdp_xmit) },
+	{ "  xdp xmit err",	offsetof(struct vmxnet3_tq_driver_stats,
+					 xdp_xmit_err) },
 };
 
 /* per rq stats maintained by the device */
@@ -106,6 +110,16 @@ vmxnet3_rq_driver_stats[] = {
 					 drop_fcs) },
 	{ "  rx buf alloc fail", offsetof(struct vmxnet3_rq_driver_stats,
 					  rx_buf_alloc_failure) },
+	{ "     xdp packets", offsetof(struct vmxnet3_rq_driver_stats,
+				       xdp_packets) },
+	{ "     xdp tx", offsetof(struct vmxnet3_rq_driver_stats,
+				  xdp_tx) },
+	{ "     xdp redirects", offsetof(struct vmxnet3_rq_driver_stats,
+					 xdp_redirects) },
+	{ "     xdp drops", offsetof(struct vmxnet3_rq_driver_stats,
+				     xdp_drops) },
+	{ "     xdp aborted", offsetof(struct vmxnet3_rq_driver_stats,
+				       xdp_aborted) },
 };
 
 /* global stats maintained by the driver */
diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
index 3367db23aa13..5cf4033930d8 100644
--- a/drivers/net/vmxnet3/vmxnet3_int.h
+++ b/drivers/net/vmxnet3/vmxnet3_int.h
@@ -56,6 +56,8 @@
 #include <linux/if_arp.h>
 #include <linux/inetdevice.h>
 #include <linux/log2.h>
+#include <linux/bpf.h>
+#include <linux/skbuff.h>
 
 #include "vmxnet3_defs.h"
 
@@ -217,6 +219,9 @@ struct vmxnet3_tq_driver_stats {
 	u64 linearized;         /* # of pkts linearized */
 	u64 copy_skb_header;    /* # of times we have to copy skb header */
 	u64 oversized_hdr;
+
+	u64 xdp_xmit;
+	u64 xdp_xmit_err;
 };
 
 struct vmxnet3_tx_ctx {
@@ -285,6 +290,12 @@ struct vmxnet3_rq_driver_stats {
 	u64 drop_err;
 	u64 drop_fcs;
 	u64 rx_buf_alloc_failure;
+
+	u64 xdp_packets;	/* Total packets processed by XDP. */
+	u64 xdp_tx;
+	u64 xdp_redirects;
+	u64 xdp_drops;
+	u64 xdp_aborted;
 };
 
 struct vmxnet3_rx_data_ring {
@@ -307,6 +318,8 @@ struct vmxnet3_rx_queue {
 	struct vmxnet3_rx_buf_info     *buf_info[2];
 	struct Vmxnet3_RxQueueCtrl            *shared;
 	struct vmxnet3_rq_driver_stats  stats;
+	struct bpf_prog __rcu *xdp_bpf_prog;
+	struct xdp_rxq_info xdp_rxq;
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));
 
 #define VMXNET3_DEVICE_MAX_TX_QUEUES 32
@@ -415,6 +428,7 @@ struct vmxnet3_adapter {
 	u16    tx_prod_offset;
 	u16    rx_prod_offset;
 	u16    rx_prod2_offset;
+	bool   xdp_enabled;
 };
 
 #define VMXNET3_WRITE_BAR0_REG(adapter, reg, val)  \
@@ -490,6 +504,12 @@ vmxnet3_tq_destroy_all(struct vmxnet3_adapter *adapter);
 void
 vmxnet3_rq_destroy_all(struct vmxnet3_adapter *adapter);
 
+int
+vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter);
+
+void
+vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter);
+
 netdev_features_t
 vmxnet3_fix_features(struct net_device *netdev, netdev_features_t features);
 
diff --git a/drivers/net/vmxnet3/vmxnet3_xdp.c b/drivers/net/vmxnet3/vmxnet3_xdp.c
new file mode 100644
index 000000000000..afb2d43b5464
--- /dev/null
+++ b/drivers/net/vmxnet3/vmxnet3_xdp.c
@@ -0,0 +1,458 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Linux driver for VMware's vmxnet3 ethernet NIC.
+ * Copyright (C) 2008-2023, VMware, Inc. All Rights Reserved.
+ * Maintained by: pv-drivers@vmware.com
+ *
+ */
+
+#include "vmxnet3_int.h"
+#include "vmxnet3_xdp.h"
+
+static void
+vmxnet3_xdp_exchange_program(struct vmxnet3_adapter *adapter,
+			     struct bpf_prog *prog)
+{
+	struct vmxnet3_rx_queue *rq;
+	int i;
+
+	for (i = 0; i < adapter->num_rx_queues; i++) {
+		rq = &adapter->rx_queue[i];
+		rcu_assign_pointer(rq->xdp_bpf_prog, prog);
+	}
+	if (prog)
+		adapter->xdp_enabled = true;
+	else
+		adapter->xdp_enabled = false;
+}
+
+static int
+vmxnet3_xdp_set(struct net_device *netdev, struct netdev_bpf *bpf,
+		struct netlink_ext_ack *extack)
+{
+	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
+	struct bpf_prog *new_bpf_prog = bpf->prog;
+	struct bpf_prog *old_bpf_prog;
+	bool need_update;
+	bool running;
+	int err = 0;
+
+	if (new_bpf_prog && netdev->mtu > VMXNET3_XDP_MAX_MTU) {
+		NL_SET_ERR_MSG_MOD(extack, "MTU too large for XDP");
+		return -EOPNOTSUPP;
+	}
+
+	old_bpf_prog = READ_ONCE(adapter->rx_queue[0].xdp_bpf_prog);
+	if (!new_bpf_prog && !old_bpf_prog) {
+		adapter->xdp_enabled = false;
+		return 0;
+	}
+	running = netif_running(netdev);
+	need_update = !!old_bpf_prog != !!new_bpf_prog;
+
+	if (running && need_update)
+		vmxnet3_quiesce_dev(adapter);
+
+	vmxnet3_xdp_exchange_program(adapter, new_bpf_prog);
+	if (old_bpf_prog)
+		bpf_prog_put(old_bpf_prog);
+
+	if (running && need_update) {
+		vmxnet3_reset_dev(adapter);
+		vmxnet3_rq_destroy_all(adapter);
+		vmxnet3_adjust_rx_ring_size(adapter);
+		err = vmxnet3_rq_create_all(adapter);
+		if (err) {
+			NL_SET_ERR_MSG_MOD(extack,
+				"failed to re-create rx queues for XDP.");
+			err = -EOPNOTSUPP;
+			goto out;
+		}
+		err = vmxnet3_activate_dev(adapter);
+		if (err) {
+			NL_SET_ERR_MSG_MOD(extack,
+				"failed to activate device for XDP.");
+			err = -EOPNOTSUPP;
+			goto out;
+		}
+		clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
+	}
+out:
+	return err;
+}
+
+/* This is the main xdp call used by kernel to set/unset eBPF program. */
+int
+vmxnet3_xdp(struct net_device *netdev, struct netdev_bpf *bpf)
+{
+	switch (bpf->command) {
+	case XDP_SETUP_PROG:
+		netdev_dbg(netdev, "XDP: set program to ");
+		return vmxnet3_xdp_set(netdev, bpf, bpf->extack);
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+int
+vmxnet3_xdp_headroom(struct vmxnet3_adapter *adapter)
+{
+	if (adapter->xdp_enabled)
+		return VMXNET3_XDP_ROOM;
+	else
+		return 0;
+}
+
+void
+vmxnet3_unregister_xdp_rxq(struct vmxnet3_rx_queue *rq)
+{
+	xdp_rxq_info_unreg_mem_model(&rq->xdp_rxq);
+	xdp_rxq_info_unreg(&rq->xdp_rxq);
+}
+
+int
+vmxnet3_register_xdp_rxq(struct vmxnet3_rx_queue *rq,
+			 struct vmxnet3_adapter *adapter)
+{
+	int err;
+
+	err = xdp_rxq_info_reg(&rq->xdp_rxq, adapter->netdev, rq->qid, 0);
+	if (err < 0)
+		return err;
+
+	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, MEM_TYPE_PAGE_SHARED,
+					 NULL);
+	if (err < 0) {
+		xdp_rxq_info_unreg(&rq->xdp_rxq);
+		return err;
+	}
+	return 0;
+}
+
+static int
+vmxnet3_xdp_xmit_frame(struct vmxnet3_adapter *adapter,
+		       struct xdp_frame *xdpf,
+		       struct vmxnet3_tx_queue *tq)
+{
+	struct vmxnet3_tx_buf_info *tbi = NULL;
+	union Vmxnet3_GenericDesc *gdesc;
+	struct vmxnet3_tx_ctx ctx;
+	int tx_num_deferred;
+	struct sk_buff *skb;
+	u32 buf_size;
+	int ret = 0;
+	u32 dw2;
+
+	if (vmxnet3_cmd_ring_desc_avail(&tq->tx_ring) == 0) {
+		tq->stats.tx_ring_full++;
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	skb = __netdev_alloc_skb(adapter->netdev, xdpf->len, GFP_KERNEL);
+	if (unlikely(!skb))
+		goto exit;
+
+	memcpy(skb->data, xdpf->data, xdpf->len);
+	dw2 = (tq->tx_ring.gen ^ 0x1) << VMXNET3_TXD_GEN_SHIFT;
+	dw2 |= xdpf->len;
+	ctx.sop_txd = tq->tx_ring.base + tq->tx_ring.next2fill;
+	gdesc = ctx.sop_txd;
+
+	buf_size = xdpf->len;
+	tbi = tq->buf_info + tq->tx_ring.next2fill;
+	tbi->map_type = VMXNET3_MAP_SINGLE;
+	tbi->dma_addr = dma_map_single(&adapter->pdev->dev,
+				       skb->data, buf_size,
+				       DMA_TO_DEVICE);
+	if (dma_mapping_error(&adapter->pdev->dev, tbi->dma_addr)) {
+		ret = -EFAULT;
+		goto exit;
+	}
+	tbi->len = buf_size;
+
+	gdesc = tq->tx_ring.base + tq->tx_ring.next2fill;
+	WARN_ON_ONCE(gdesc->txd.gen == tq->tx_ring.gen);
+
+	gdesc->txd.addr = cpu_to_le64(tbi->dma_addr);
+	gdesc->dword[2] = cpu_to_le32(dw2);
+
+	/* Setup the EOP desc */
+	gdesc->dword[3] = cpu_to_le32(VMXNET3_TXD_CQ | VMXNET3_TXD_EOP);
+
+	gdesc->txd.om = 0;
+	gdesc->txd.msscof = 0;
+	gdesc->txd.hlen = 0;
+	gdesc->txd.ti = 0;
+
+	tx_num_deferred = le32_to_cpu(tq->shared->txNumDeferred);
+	tq->shared->txNumDeferred += 1;
+	tx_num_deferred++;
+
+	vmxnet3_cmd_ring_adv_next2fill(&tq->tx_ring);
+
+	/* set the last buf_info for the pkt */
+	tbi->skb = skb;
+	tbi->sop_idx = ctx.sop_txd - tq->tx_ring.base;
+
+	dma_wmb();
+	gdesc->dword[2] = cpu_to_le32(le32_to_cpu(gdesc->dword[2]) ^
+						  VMXNET3_TXD_GEN);
+	if (tx_num_deferred >= le32_to_cpu(tq->shared->txThreshold)) {
+		tq->shared->txNumDeferred = 0;
+		VMXNET3_WRITE_BAR0_REG(adapter,
+				       VMXNET3_REG_TXPROD + tq->qid * 8,
+				       tq->tx_ring.next2fill);
+	}
+exit:
+	return ret;
+}
+
+static int
+vmxnet3_xdp_xmit_back(struct vmxnet3_adapter *adapter,
+		      struct xdp_frame *xdpf)
+{
+	struct vmxnet3_tx_queue *tq;
+	struct netdev_queue *nq;
+	int err = 0, cpu;
+	int tq_number;
+
+	tq_number = adapter->num_tx_queues;
+	cpu = smp_processor_id();
+	tq = &adapter->tx_queue[cpu % tq_number];
+	if (tq->stopped)
+		return -ENETDOWN;
+
+	nq = netdev_get_tx_queue(adapter->netdev, tq->qid);
+
+	__netif_tx_lock(nq, cpu);
+	err = vmxnet3_xdp_xmit_frame(adapter, xdpf, tq);
+	if (err)
+		goto exit;
+
+exit:
+	__netif_tx_unlock(nq);
+	return err;
+}
+
+int
+vmxnet3_xdp_xmit(struct net_device *dev,
+		 int n, struct xdp_frame **frames, u32 flags)
+{
+	struct vmxnet3_adapter *adapter;
+	struct vmxnet3_tx_queue *tq;
+	struct netdev_queue *nq;
+	int i, err, cpu;
+	int nxmit = 0;
+	int tq_number;
+
+	adapter = netdev_priv(dev);
+
+	if (unlikely(test_bit(VMXNET3_STATE_BIT_QUIESCED, &adapter->state)))
+		return -ENETDOWN;
+	if (unlikely(test_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state)))
+		return -EINVAL;
+
+	tq_number = adapter->num_tx_queues;
+	cpu = smp_processor_id();
+	tq = &adapter->tx_queue[cpu % tq_number];
+	if (tq->stopped)
+		return -ENETDOWN;
+
+	nq = netdev_get_tx_queue(adapter->netdev, tq->qid);
+
+	__netif_tx_lock(nq, cpu);
+	for (i = 0; i < n; i++) {
+		err = vmxnet3_xdp_xmit_frame(adapter, frames[i], tq);
+		if (err) {
+			tq->stats.xdp_xmit_err++;
+			break;
+		}
+		nxmit++;
+	}
+
+	tq->stats.xdp_xmit += nxmit;
+	__netif_tx_unlock(nq);
+
+	return nxmit;
+}
+
+static int
+__vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, void *data, int data_len,
+		  int headroom, int frame_sz, bool *need_xdp_flush,
+		  struct sk_buff *skb)
+{
+	struct xdp_frame *xdpf;
+	void *buf_hard_start;
+	struct xdp_buff xdp;
+	struct page *page;
+	void *orig_data;
+	int err, delta;
+	int delta_len;
+	u32 act;
+
+	buf_hard_start = data;
+	xdp_init_buff(&xdp, frame_sz, &rq->xdp_rxq);
+	xdp_prepare_buff(&xdp, buf_hard_start, headroom, data_len, true);
+	orig_data = xdp.data;
+
+	act = bpf_prog_run_xdp(rq->xdp_bpf_prog, &xdp);
+	rq->stats.xdp_packets++;
+
+	switch (act) {
+	case XDP_DROP:
+		rq->stats.xdp_drops++;
+		break;
+	case XDP_PASS:
+		/* bpf prog might change len and data position.
+		 * dataring does not use skb so not support this.
+		 */
+		delta = xdp.data - orig_data;
+		delta_len = (xdp.data_end - xdp.data) - data_len;
+		if (skb) {
+			skb_reserve(skb, delta);
+			skb_put(skb, delta_len);
+		}
+		break;
+	case XDP_TX:
+		xdpf = xdp_convert_buff_to_frame(&xdp);
+		if (!xdpf ||
+		    vmxnet3_xdp_xmit_back(rq->adapter, xdpf)) {
+			rq->stats.xdp_drops++;
+		} else {
+			rq->stats.xdp_tx++;
+		}
+		break;
+	case XDP_ABORTED:
+		trace_xdp_exception(rq->adapter->netdev, rq->xdp_bpf_prog,
+				    act);
+		rq->stats.xdp_aborted++;
+		break;
+	case XDP_REDIRECT:
+		page = alloc_page(GFP_ATOMIC);
+		if (!page) {
+			rq->stats.rx_buf_alloc_failure++;
+			return XDP_DROP;
+		}
+		xdp_init_buff(&xdp, PAGE_SIZE, &rq->xdp_rxq);
+		xdp_prepare_buff(&xdp, page_address(page),
+				 XDP_PACKET_HEADROOM,
+				 data_len, true);
+		memcpy(xdp.data, data, data_len);
+		err = xdp_do_redirect(rq->adapter->netdev, &xdp,
+				      rq->xdp_bpf_prog);
+		if (!err) {
+			rq->stats.xdp_redirects++;
+		} else {
+			__free_page(page);
+			rq->stats.xdp_drops++;
+		}
+		*need_xdp_flush = true;
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(rq->adapter->netdev,
+					    rq->xdp_bpf_prog, act);
+		break;
+	}
+	return act;
+}
+
+static int
+vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, struct vmxnet3_rx_buf_info *rbi,
+		struct Vmxnet3_RxCompDesc *rcd, bool *need_flush,
+		bool rxDataRingUsed)
+{
+	struct vmxnet3_adapter *adapter;
+	struct ethhdr *ehdr;
+	int act = XDP_PASS;
+	void *data;
+	int sz;
+
+	adapter = rq->adapter;
+	if (rxDataRingUsed) {
+		sz = rcd->rxdIdx * rq->data_ring.desc_size;
+		data = &rq->data_ring.base[sz];
+		ehdr = data;
+		netdev_dbg(adapter->netdev,
+			   "XDP: rxDataRing packet size %d, eth proto 0x%x\n",
+			   rcd->len, ntohs(ehdr->h_proto));
+		act = __vmxnet3_run_xdp(rq, data, rcd->len, 0,
+					rq->data_ring.desc_size, need_flush,
+					NULL);
+	} else {
+		dma_unmap_single(&adapter->pdev->dev,
+				 rbi->dma_addr,
+				 rbi->len,
+				 DMA_FROM_DEVICE);
+		ehdr = (struct ethhdr *)rbi->skb->data;
+		netdev_dbg(adapter->netdev,
+			   "XDP: packet size %d, eth proto 0x%x\n",
+			   rcd->len, ntohs(ehdr->h_proto));
+		act = __vmxnet3_run_xdp(rq,
+					rbi->skb->data - XDP_PACKET_HEADROOM,
+					rcd->len, XDP_PACKET_HEADROOM,
+					rbi->len, need_flush, rbi->skb);
+	}
+	return act;
+}
+
+int
+vmxnet3_process_xdp(struct vmxnet3_adapter *adapter,
+		    struct vmxnet3_rx_queue *rq,
+		    struct Vmxnet3_RxCompDesc *rcd,
+		    struct vmxnet3_rx_buf_info *rbi,
+		    struct Vmxnet3_RxDesc *rxd,
+		    bool *need_flush)
+{
+	struct bpf_prog *xdp_prog;
+	dma_addr_t new_dma_addr;
+	struct sk_buff *new_skb;
+	bool rxDataRingUsed;
+	int ret, act;
+
+	ret = VMXNET3_XDP_CONTINUE;
+	if (unlikely(rcd->len == 0))
+		return VMXNET3_XDP_TAKEN;
+
+	rxDataRingUsed = VMXNET3_RX_DATA_RING(adapter, rcd->rqID);
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(rq->xdp_bpf_prog);
+	if (!xdp_prog) {
+		rcu_read_unlock();
+		return VMXNET3_XDP_CONTINUE;
+	}
+	act = vmxnet3_run_xdp(rq, rbi, rcd, need_flush, rxDataRingUsed);
+	rcu_read_unlock();
+
+	switch (act) {
+	case XDP_PASS:
+		ret = VMXNET3_XDP_CONTINUE;
+		break;
+	case XDP_DROP:
+	case XDP_TX:
+	case XDP_REDIRECT:
+	case XDP_ABORTED:
+	default:
+		/* Reuse and remap the existing buffer. */
+		ret = VMXNET3_XDP_TAKEN;
+		if (rxDataRingUsed)
+			return ret;
+
+		new_skb = rbi->skb;
+		new_dma_addr =
+			dma_map_single(&adapter->pdev->dev,
+				       new_skb->data, rbi->len,
+				       DMA_FROM_DEVICE);
+		if (dma_mapping_error(&adapter->pdev->dev,
+				      new_dma_addr)) {
+			dev_kfree_skb(new_skb);
+			rq->stats.drop_total++;
+			return ret;
+		}
+		rbi->dma_addr = new_dma_addr;
+		rxd->addr = cpu_to_le64(rbi->dma_addr);
+		rxd->len = rbi->len;
+	}
+	return ret;
+}
diff --git a/drivers/net/vmxnet3/vmxnet3_xdp.h b/drivers/net/vmxnet3/vmxnet3_xdp.h
new file mode 100644
index 000000000000..6a3c662a4464
--- /dev/null
+++ b/drivers/net/vmxnet3/vmxnet3_xdp.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * Linux driver for VMware's vmxnet3 ethernet NIC.
+ * Copyright (C) 2008-2023, VMware, Inc. All Rights Reserved.
+ * Maintained by: pv-drivers@vmware.com
+ *
+ */
+
+#ifndef _VMXNET3_XDP_H
+#define _VMXNET3_XDP_H
+
+#include <linux/filter.h>
+#include <linux/bpf_trace.h>
+#include <linux/netlink.h>
+#include <net/xdp.h>
+
+#include "vmxnet3_int.h"
+
+#define VMXNET3_XDP_ROOM (SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) + \
+				XDP_PACKET_HEADROOM)
+#define VMXNET3_XDP_MAX_MTU (VMXNET3_MAX_SKB_BUF_SIZE - VMXNET3_XDP_ROOM)
+
+#define VMXNET3_XDP_CONTINUE 0	/* Pass to the stack, ex: XDP_PASS. */
+#define VMXNET3_XDP_TAKEN 1	/* Skip the stack, ex: XDP_DROP/TX/REDIRECT */
+
+int vmxnet3_xdp(struct net_device *netdev, struct netdev_bpf *bpf);
+void vmxnet3_unregister_xdp_rxq(struct vmxnet3_rx_queue *rq);
+int vmxnet3_register_xdp_rxq(struct vmxnet3_rx_queue *rq,
+			     struct vmxnet3_adapter *adapter);
+int vmxnet3_xdp_headroom(struct vmxnet3_adapter *adapter);
+int vmxnet3_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
+		     u32 flags);
+int vmxnet3_process_xdp(struct vmxnet3_adapter *adapter,
+			struct vmxnet3_rx_queue *rq,
+			struct Vmxnet3_RxCompDesc *rcd,
+			struct vmxnet3_rx_buf_info *rbi,
+			struct Vmxnet3_RxDesc *rxd,
+			bool *need_flush);
+#endif
-- 
2.25.1


^ permalink raw reply related

* [PATCH] net: lan78xx: isolate LAN88XX specific operations
From: Enguerrand de Ribaucourt @ 2022-12-16 14:49 UTC (permalink / raw)
  To: netdev; +Cc: woojung.huh, davem, UNGLinuxDriver, Enguerrand de Ribaucourt
In-Reply-To: <9235D6609DB808459E95D78E17F2E43D408987FF@CHN-SV-EXMX02.mchp-main.com>

Some operations during the cable switch workaround modify the register
LAN88XX_INT_MASK of the PHY. However, this register is specific to the
LAN8835 PHY. For instance, if a DP8322I PHY is connected to the LAN7801,
that register (0x19), corresponds to the LED and MAC address
configuration, resulting in unapropriate behavior.

Fixes: 14437e3fa284 ("lan78xx: workaround of forced 100 Full/Half duplex mode error")
Signed-off-by: Enguerrand de Ribaucourt <enguerrand.de-ribaucourt@savoirfairelinux.com>
---
 drivers/net/usb/lan78xx.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index f18ab8e220db..ea0a56e6cd40 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2116,6 +2116,7 @@ static void lan78xx_link_status_change(struct net_device *net)
 {
 	struct phy_device *phydev = net->phydev;
 	int temp;
+	bool lan88_fixup;
 
 	/* At forced 100 F/H mode, chip may fail to set mode correctly
 	 * when cable is switched between long(~50+m) and short one.
@@ -2123,10 +2124,15 @@ static void lan78xx_link_status_change(struct net_device *net)
 	 * at forced 100 F/H mode.
 	 */
 	if (!phydev->autoneg && (phydev->speed == 100)) {
-		/* disable phy interrupt */
-		temp = phy_read(phydev, LAN88XX_INT_MASK);
-		temp &= ~LAN88XX_INT_MASK_MDINTPIN_EN_;
-		phy_write(phydev, LAN88XX_INT_MASK, temp);
+		lan88_fixup = (PHY_LAN8835 & 0xfffffff0) ==
+			(phydev->phy_id & 0xfffffff0);
+
+		if(lan88_fixup) {
+			/* disable phy interrupt */
+			temp = phy_read(phydev, LAN88XX_INT_MASK);
+			temp &= ~LAN88XX_INT_MASK_MDINTPIN_EN_;
+			phy_write(phydev, LAN88XX_INT_MASK, temp);
+		}
 
 		temp = phy_read(phydev, MII_BMCR);
 		temp &= ~(BMCR_SPEED100 | BMCR_SPEED1000);
@@ -2134,13 +2140,15 @@ static void lan78xx_link_status_change(struct net_device *net)
 		temp |= BMCR_SPEED100;
 		phy_write(phydev, MII_BMCR, temp); /* set to 100 later */
 
-		/* clear pending interrupt generated while workaround */
-		temp = phy_read(phydev, LAN88XX_INT_STS);
+		if(lan88_fixup) {
+			/* clear pending interrupt generated while workaround */
+			temp = phy_read(phydev, LAN88XX_INT_STS);
 
-		/* enable phy interrupt back */
-		temp = phy_read(phydev, LAN88XX_INT_MASK);
-		temp |= LAN88XX_INT_MASK_MDINTPIN_EN_;
-		phy_write(phydev, LAN88XX_INT_MASK, temp);
+			/* enable phy interrupt back */
+			temp = phy_read(phydev, LAN88XX_INT_MASK);
+			temp |= LAN88XX_INT_MASK_MDINTPIN_EN_;
+			phy_write(phydev, LAN88XX_INT_MASK, temp);
+		}
 	}
 }
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH net-next v2] igc: offload queue max SDU from tc-taprio
From: Muhammad Husaini Zulkifli @ 2022-12-16 15:03 UTC (permalink / raw)
  To: intel-wired-lan, vinicius.gomes
  Cc: tee.min.tan, davem, kuba, netdev, muhammad.husaini.zulkifli,
	naamax.meir, anthony.l.nguyen

From: Tan Tee Min <tee.min.tan@linux.intel.com>

Add support for configuring the max SDU for each Tx queue.
If not specified, keep the default.

All link speeds have been tested with this implementation.
No performance issue observed.

How to test:

1) Configure the tc with max-sdu

tc qdisc replace dev $IFACE parent root handle 100 taprio \
    num_tc 4 \
    map 0 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 \
    queues 1@0 1@1 1@2 1@3 \
    base-time $BASE \
    sched-entry S 0xF 1000000 \
    max-sdu 1500 1498 0 0 0 0 0 0 0 0 0 0 0 0 0 0 \
    flags 0x2 \
    txtime-delay 0

2) Use network statistic to watch the tx queue packet to see if
packet able to go out or drop.

Signed-off-by: Tan Tee Min <tee.min.tan@linux.intel.com>
Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com>

---
V1 -> V2: Rework based on Vinicius's comment.
---
---
 drivers/net/ethernet/intel/igc/igc.h      |  1 +
 drivers/net/ethernet/intel/igc/igc_main.c | 44 +++++++++++++++++++++++
 2 files changed, 45 insertions(+)

diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h
index 5da8d162cd38..ce9e88687d8c 100644
--- a/drivers/net/ethernet/intel/igc/igc.h
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -99,6 +99,7 @@ struct igc_ring {
 
 	u32 start_time;
 	u32 end_time;
+	u32 max_sdu;
 
 	/* CBS parameters */
 	bool cbs_enable;                /* indicates if CBS is enabled */
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index fdb7f0b26ed0..741c938313cf 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -1508,6 +1508,7 @@ static netdev_tx_t igc_xmit_frame_ring(struct sk_buff *skb,
 	__le32 launch_time = 0;
 	u32 tx_flags = 0;
 	unsigned short f;
+	u32 max_sdu = 0;
 	ktime_t txtime;
 	u8 hdr_len = 0;
 	int tso = 0;
@@ -1527,6 +1528,14 @@ static netdev_tx_t igc_xmit_frame_ring(struct sk_buff *skb,
 		return NETDEV_TX_BUSY;
 	}
 
+	if (tx_ring->max_sdu > 0) {
+		max_sdu = tx_ring->max_sdu +
+			  (skb_vlan_tagged(skb) ? VLAN_HLEN : 0);
+
+		if (skb->len > max_sdu)
+			goto skb_drop;
+	}
+
 	if (!tx_ring->launchtime_enable)
 		goto done;
 
@@ -1606,6 +1615,12 @@ static netdev_tx_t igc_xmit_frame_ring(struct sk_buff *skb,
 	dev_kfree_skb_any(first->skb);
 	first->skb = NULL;
 
+	return NETDEV_TX_OK;
+
+skb_drop:
+	dev_kfree_skb_any(skb);
+	skb = NULL;
+
 	return NETDEV_TX_OK;
 }
 
@@ -6018,6 +6033,7 @@ static int igc_tsn_clear_schedule(struct igc_adapter *adapter)
 
 		ring->start_time = 0;
 		ring->end_time = NSEC_PER_SEC;
+		ring->max_sdu = 0;
 	}
 
 	return 0;
@@ -6101,6 +6117,16 @@ static int igc_save_qbv_schedule(struct igc_adapter *adapter,
 		}
 	}
 
+	for (i = 0; i < adapter->num_tx_queues; i++) {
+		struct igc_ring *ring = adapter->tx_ring[i];
+		struct net_device *dev = adapter->netdev;
+
+		if (qopt->max_sdu[i])
+			ring->max_sdu = qopt->max_sdu[i] + dev->hard_header_len;
+		else
+			ring->max_sdu = 0;
+	}
+
 	return 0;
 }
 
@@ -6188,12 +6214,30 @@ static int igc_tsn_enable_cbs(struct igc_adapter *adapter,
 	return igc_tsn_offload_apply(adapter);
 }
 
+static int igc_tsn_query_caps(struct tc_query_caps_base *base)
+{
+	switch (base->type) {
+	case TC_SETUP_QDISC_TAPRIO: {
+		struct tc_taprio_caps *caps = base->caps;
+
+		caps->supports_queue_max_sdu = true;
+
+		return 0;
+	}
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
 static int igc_setup_tc(struct net_device *dev, enum tc_setup_type type,
 			void *type_data)
 {
 	struct igc_adapter *adapter = netdev_priv(dev);
 
 	switch (type) {
+	case TC_QUERY_CAPS:
+		return igc_tsn_query_caps(type_data);
+
 	case TC_SETUP_QDISC_TAPRIO:
 		return igc_tsn_enable_qbv_scheduling(adapter, type_data);
 
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH net] net: fec: Coverity issue: Dereference null return value
From: Alexander H Duyck @ 2022-12-16 15:34 UTC (permalink / raw)
  To: wei.fang, davem, edumazet, kuba, pabeni, xiaoning.wang,
	shenwei.wang, linux-imx
  Cc: netdev, linux-kernel
In-Reply-To: <20221215091149.936369-1-wei.fang@nxp.com>

On Thu, 2022-12-15 at 17:11 +0800, wei.fang@nxp.com wrote:
> From: Wei Fang <wei.fang@nxp.com>
> 
> The build_skb might return a null pointer but there is no check on the
> return value in the fec_enet_rx_queue(). So a null pointer dereference
> might occur. To avoid this, we check the return value of build_skb. If
> the return value is a null pointer, the driver will recycle the page and
> update the statistic of ndev. Then jump to rx_processing_done to clear
> the status flags of the BD so that the hardware can recycle the BD.
> 
> Signed-off-by: Wei Fang <wei.fang@nxp.com>
> Reviewed-by: Shenwei Wang <Shenwei.wang@nxp.com>
> ---
>  drivers/net/ethernet/freescale/fec_main.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
> index 5528b0af82ae..c78aaa780983 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -1674,6 +1674,16 @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16 queue_id)
>  		 * bridging applications.
>  		 */
>  		skb = build_skb(page_address(page), PAGE_SIZE);
> +		if (unlikely(!skb)) {
> +			page_pool_recycle_direct(rxq->page_pool, page);
> +			ndev->stats.rx_packets--;
> +			ndev->stats.rx_bytes -= pkt_len;
> +			ndev->stats.rx_dropped++;

I'm not sure you really need to bother with rewinding the rx_packets
and rx_bytes counters. I know that the rx_dropped statistic will get
incremented in the network stack in the event of a packet failing to
enqueue to the backlog, so it might be better to just leave the
rx_packets counter as is and assume the actual packet count is
rx_packets - rx_dropped.

> +
> +			netdev_err(ndev, "build_skb failed!\n");

Instead of netdev_err you may want to consider netdev_err_once for
this. Generally speaking when we start seeing memory allocation error
issues they can get very noisy very quickly as you are likely to fail
the allocation for every packet in a given polling session, and
sessions to follow.

> +			goto rx_processing_done;
> +		}
> +
>  		skb_reserve(skb, data_start);
>  		skb_put(skb, pkt_len - sub_len);
>  		skb_mark_for_recycle(skb);


^ permalink raw reply

* Re: [PATCH] net: ethernet: ti: am65-cpsw: fix CONFIG_PM #ifdef
From: Alexander H Duyck @ 2022-12-16 16:14 UTC (permalink / raw)
  To: Arnd Bergmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Arnd Bergmann, Roger Quadros, Siddharth Vadapalli, Jiri Pirko,
	netdev, linux-kernel
In-Reply-To: <20221215163918.611609-1-arnd@kernel.org>

On Thu, 2022-12-15 at 17:39 +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> The #ifdef check is incorrect and leads to a warning:
> 
> drivers/net/ethernet/ti/am65-cpsw-nuss.c:1679:13: error: 'am65_cpsw_nuss_remove_rx_chns' defined but not used [-Werror=unused-function]
>  1679 | static void am65_cpsw_nuss_remove_rx_chns(void *data)
> 
> It's better to remove the #ifdef here and use the modern
> SYSTEM_SLEEP_PM_OPS() macro instead.
> 
> Fixes: 24bc19b05f1f ("net: ethernet: ti: am65-cpsw: Add suspend/resume support")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  drivers/net/ethernet/ti/am65-cpsw-nuss.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
> index 9decb0c7961b..ecbde83b5243 100644
> --- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
> +++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
> @@ -2878,7 +2878,6 @@ static int am65_cpsw_nuss_remove(struct platform_device *pdev)
>  	return 0;
>  }
>  
> -#ifdef CONFIG_PM_SLEEP
>  static int am65_cpsw_nuss_suspend(struct device *dev)
>  {
>  	struct am65_cpsw_common *common = dev_get_drvdata(dev);
> @@ -2964,10 +2963,9 @@ static int am65_cpsw_nuss_resume(struct device *dev)
>  
>  	return 0;
>  }
> -#endif /* CONFIG_PM_SLEEP */
>  
>  static const struct dev_pm_ops am65_cpsw_nuss_dev_pm_ops = {
> -	SET_SYSTEM_SLEEP_PM_OPS(am65_cpsw_nuss_suspend, am65_cpsw_nuss_resume)
> +	SYSTEM_SLEEP_PM_OPS(am65_cpsw_nuss_suspend, am65_cpsw_nuss_resume)
>  };
>  
>  static struct platform_driver am65_cpsw_nuss_driver = {

Looks good to me.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>

^ permalink raw reply

* iproute2 merge conflicts
From: David Ahern @ 2022-12-16 16:14 UTC (permalink / raw)
  To: Michal Wilczynski; +Cc: Stephen Hemminger, netdev@vger.kernel.org

Hi Michal:

I merged main into next and hit a conflict with your recent patch set.
Can you take a look at devlink/devlink.c, cmd_port_fn_rate_add and
verify the conflict is correctly resolved?

Thanks,
David

^ permalink raw reply

* [net PATCH 1/5] net: dsa: qca8k: fix wrong length value for mgmt eth packet
From: Christian Marangi @ 2022-12-16 16:17 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Marangi,
	Russell King (Oracle), netdev, linux-kernel
  Cc: Ronald Wahl, stable

The assumption that Documentation was right about how this value work was
wrong. It was discovered that the length value of the mgmt header is in
step of word size.

As an example to process 4 byte of data the correct length to set is 2.
To process 8 byte 4, 12 byte 6, 16 byte 8...

Odd values will always return the next size on the ack packet.
(length of 3 (6 byte) will always return 8 bytes of data)

This means that a value of 15 (0xf) actually means reading/writing 32 bytes
of data instead of 16 bytes. This behaviour is totally absent and not
documented in the switch Documentation.

In fact from Documentation the max value that mgmt eth can process is
16 byte of data while in reality it can process 32 bytes at once.

To handle this we always round up the length after deviding it for word
size. We check if the result is odd and we round another time to align
to what the switch will provide in the ack packet.
The workaround for the length limit of 15 is still needed as the length
reg max value is 0xf(15)

Reported-by: Ronald Wahl <ronald.wahl@raritan.com>
Tested-by: Ronald Wahl <ronald.wahl@raritan.com>
Fixes: 90386223f44e ("net: dsa: qca8k: add support for larger read/write size with mgmt Ethernet")
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Cc: stable@vger.kernel.org # v5.18+
---
 drivers/net/dsa/qca/qca8k-8xxx.c | 45 +++++++++++++++++++++++++-------
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/qca/qca8k-8xxx.c b/drivers/net/dsa/qca/qca8k-8xxx.c
index c5c3b4e92f28..46151320b2a8 100644
--- a/drivers/net/dsa/qca/qca8k-8xxx.c
+++ b/drivers/net/dsa/qca/qca8k-8xxx.c
@@ -146,7 +146,16 @@ static void qca8k_rw_reg_ack_handler(struct dsa_switch *ds, struct sk_buff *skb)
 
 	command = get_unaligned_le32(&mgmt_ethhdr->command);
 	cmd = FIELD_GET(QCA_HDR_MGMT_CMD, command);
+
 	len = FIELD_GET(QCA_HDR_MGMT_LENGTH, command);
+	/* Special case for len of 15 as this is the max value for len and needs to
+	 * be increased before converting it from word to dword.
+	 */
+	if (len == 15)
+		len++;
+
+	/* We can ignore odd value, we always round up them in the alloc function. */
+	len *= sizeof(u16);
 
 	/* Make sure the seq match the requested packet */
 	if (get_unaligned_le32(&mgmt_ethhdr->seq) == mgmt_eth_data->seq)
@@ -193,17 +202,33 @@ static struct sk_buff *qca8k_alloc_mdio_header(enum mdio_cmd cmd, u32 reg, u32 *
 	if (!skb)
 		return NULL;
 
-	/* Max value for len reg is 15 (0xf) but the switch actually return 16 byte
-	 * Actually for some reason the steps are:
-	 * 0: nothing
-	 * 1-4: first 4 byte
-	 * 5-6: first 12 byte
-	 * 7-15: all 16 byte
+	/* Hdr mgmt length value is in step of word size.
+	 * As an example to process 4 byte of data the correct length to set is 2.
+	 * To process 8 byte 4, 12 byte 6, 16 byte 8...
+	 *
+	 * Odd values will always return the next size on the ack packet.
+	 * (length of 3 (6 byte) will always return 8 bytes of data)
+	 *
+	 * This means that a value of 15 (0xf) actually means reading/writing 32 bytes
+	 * of data.
+	 *
+	 * To correctly calculate the length we devide the requested len by word and
+	 * round up.
+	 * On the ack function we can skip the odd check as we already handle the
+	 * case here.
+	 */
+	real_len = DIV_ROUND_UP(len, sizeof(u16));
+
+	/* We check if the result len is odd and we round up another time to
+	 * the next size. (length of 3 will be increased to 4 as switch will always
+	 * return 8 bytes)
 	 */
-	if (len == 16)
-		real_len = 15;
-	else
-		real_len = len;
+	if (real_len % sizeof(u16) != 0)
+		real_len++;
+
+	/* Max reg value is 0xf(15) but switch will always return the next size (32 byte) */
+	if (real_len == 16)
+		real_len--;
 
 	skb_reset_mac_header(skb);
 	skb_set_network_header(skb, skb->len);
-- 
2.37.2


^ permalink raw reply related

* [net PATCH 2/5] net: dsa: tag_qca: fix wrong MGMT_DATA2 size
From: Christian Marangi @ 2022-12-16 16:17 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Marangi,
	Russell King (Oracle), netdev, linux-kernel
  Cc: Ronald Wahl, stable
In-Reply-To: <20221216161721.23863-1-ansuelsmth@gmail.com>

It was discovered that MGMT_DATA2 can contain up to 28 bytes of data
instead of the 12 bytes written in the Documentation by accounting the
limit of 16 bytes declared in Documentation subtracting the first 4 byte
in the packet header.

Update the define with the real world value.

Tested-by: Ronald Wahl <ronald.wahl@raritan.com>
Fixes: c2ee8181fddb ("net: dsa: tag_qca: add define for handling mgmt Ethernet packet")
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Cc: stable@vger.kernel.org # v5.18+
---
 include/linux/dsa/tag_qca.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/dsa/tag_qca.h b/include/linux/dsa/tag_qca.h
index b1b5720d89a5..ee657452f122 100644
--- a/include/linux/dsa/tag_qca.h
+++ b/include/linux/dsa/tag_qca.h
@@ -45,8 +45,8 @@ struct sk_buff;
 					QCA_HDR_MGMT_COMMAND_LEN + \
 					QCA_HDR_MGMT_DATA1_LEN)
 
-#define QCA_HDR_MGMT_DATA2_LEN		12 /* Other 12 byte for the mdio data */
-#define QCA_HDR_MGMT_PADDING_LEN	34 /* Padding to reach the min Ethernet packet */
+#define QCA_HDR_MGMT_DATA2_LEN		28 /* Other 28 byte for the mdio data */
+#define QCA_HDR_MGMT_PADDING_LEN	18 /* Padding to reach the min Ethernet packet */
 
 #define QCA_HDR_MGMT_PKT_LEN		(QCA_HDR_MGMT_HEADER_LEN + \
 					QCA_HDR_LEN + \
-- 
2.37.2


^ permalink raw reply related

* [net PATCH 3/5] Revert "net: dsa: qca8k: cache lo and hi for mdio write"
From: Christian Marangi @ 2022-12-16 16:17 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Marangi,
	Russell King (Oracle), netdev, linux-kernel
  Cc: Ronald Wahl, stable
In-Reply-To: <20221216161721.23863-1-ansuelsmth@gmail.com>

This reverts commit 2481d206fae7884cd07014fd1318e63af35e99eb.

The Documentation is very confusing about the topic.
The cache logic for hi and lo is wrong and actually miss some regs to be
actually written.

What the Docuemntation actually intended was that it's possible to skip
writing hi OR lo if half of the reg is not needed to be written or read.

Revert the change in favor of a better and correct implementation.

Reported-by: Ronald Wahl <ronald.wahl@raritan.com>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Cc: stable@vger.kernel.org # v5.18+
---
 drivers/net/dsa/qca/qca8k-8xxx.c | 61 +++++++-------------------------
 drivers/net/dsa/qca/qca8k.h      |  5 ---
 2 files changed, 12 insertions(+), 54 deletions(-)

diff --git a/drivers/net/dsa/qca/qca8k-8xxx.c b/drivers/net/dsa/qca/qca8k-8xxx.c
index 46151320b2a8..fbcd5c2b13ae 100644
--- a/drivers/net/dsa/qca/qca8k-8xxx.c
+++ b/drivers/net/dsa/qca/qca8k-8xxx.c
@@ -36,44 +36,6 @@ qca8k_split_addr(u32 regaddr, u16 *r1, u16 *r2, u16 *page)
 	*page = regaddr & 0x3ff;
 }
 
-static int
-qca8k_set_lo(struct qca8k_priv *priv, int phy_id, u32 regnum, u16 lo)
-{
-	u16 *cached_lo = &priv->mdio_cache.lo;
-	struct mii_bus *bus = priv->bus;
-	int ret;
-
-	if (lo == *cached_lo)
-		return 0;
-
-	ret = bus->write(bus, phy_id, regnum, lo);
-	if (ret < 0)
-		dev_err_ratelimited(&bus->dev,
-				    "failed to write qca8k 32bit lo register\n");
-
-	*cached_lo = lo;
-	return 0;
-}
-
-static int
-qca8k_set_hi(struct qca8k_priv *priv, int phy_id, u32 regnum, u16 hi)
-{
-	u16 *cached_hi = &priv->mdio_cache.hi;
-	struct mii_bus *bus = priv->bus;
-	int ret;
-
-	if (hi == *cached_hi)
-		return 0;
-
-	ret = bus->write(bus, phy_id, regnum, hi);
-	if (ret < 0)
-		dev_err_ratelimited(&bus->dev,
-				    "failed to write qca8k 32bit hi register\n");
-
-	*cached_hi = hi;
-	return 0;
-}
-
 static int
 qca8k_mii_read32(struct mii_bus *bus, int phy_id, u32 regnum, u32 *val)
 {
@@ -97,7 +59,7 @@ qca8k_mii_read32(struct mii_bus *bus, int phy_id, u32 regnum, u32 *val)
 }
 
 static void
-qca8k_mii_write32(struct qca8k_priv *priv, int phy_id, u32 regnum, u32 val)
+qca8k_mii_write32(struct mii_bus *bus, int phy_id, u32 regnum, u32 val)
 {
 	u16 lo, hi;
 	int ret;
@@ -105,9 +67,12 @@ qca8k_mii_write32(struct qca8k_priv *priv, int phy_id, u32 regnum, u32 val)
 	lo = val & 0xffff;
 	hi = (u16)(val >> 16);
 
-	ret = qca8k_set_lo(priv, phy_id, regnum, lo);
+	ret = bus->write(bus, phy_id, regnum, lo);
 	if (ret >= 0)
-		ret = qca8k_set_hi(priv, phy_id, regnum + 1, hi);
+		ret = bus->write(bus, phy_id, regnum + 1, hi);
+	if (ret < 0)
+		dev_err_ratelimited(&bus->dev,
+				    "failed to write qca8k 32bit register\n");
 }
 
 static int
@@ -442,7 +407,7 @@ qca8k_regmap_write(void *ctx, uint32_t reg, uint32_t val)
 	if (ret < 0)
 		goto exit;
 
-	qca8k_mii_write32(priv, 0x10 | r2, r1, val);
+	qca8k_mii_write32(bus, 0x10 | r2, r1, val);
 
 exit:
 	mutex_unlock(&bus->mdio_lock);
@@ -475,7 +440,7 @@ qca8k_regmap_update_bits(void *ctx, uint32_t reg, uint32_t mask, uint32_t write_
 
 	val &= ~mask;
 	val |= write_val;
-	qca8k_mii_write32(priv, 0x10 | r2, r1, val);
+	qca8k_mii_write32(bus, 0x10 | r2, r1, val);
 
 exit:
 	mutex_unlock(&bus->mdio_lock);
@@ -750,14 +715,14 @@ qca8k_mdio_write(struct qca8k_priv *priv, int phy, int regnum, u16 data)
 	if (ret)
 		goto exit;
 
-	qca8k_mii_write32(priv, 0x10 | r2, r1, val);
+	qca8k_mii_write32(bus, 0x10 | r2, r1, val);
 
 	ret = qca8k_mdio_busy_wait(bus, QCA8K_MDIO_MASTER_CTRL,
 				   QCA8K_MDIO_MASTER_BUSY);
 
 exit:
 	/* even if the busy_wait timeouts try to clear the MASTER_EN */
-	qca8k_mii_write32(priv, 0x10 | r2, r1, 0);
+	qca8k_mii_write32(bus, 0x10 | r2, r1, 0);
 
 	mutex_unlock(&bus->mdio_lock);
 
@@ -787,7 +752,7 @@ qca8k_mdio_read(struct qca8k_priv *priv, int phy, int regnum)
 	if (ret)
 		goto exit;
 
-	qca8k_mii_write32(priv, 0x10 | r2, r1, val);
+	qca8k_mii_write32(bus, 0x10 | r2, r1, val);
 
 	ret = qca8k_mdio_busy_wait(bus, QCA8K_MDIO_MASTER_CTRL,
 				   QCA8K_MDIO_MASTER_BUSY);
@@ -798,7 +763,7 @@ qca8k_mdio_read(struct qca8k_priv *priv, int phy, int regnum)
 
 exit:
 	/* even if the busy_wait timeouts try to clear the MASTER_EN */
-	qca8k_mii_write32(priv, 0x10 | r2, r1, 0);
+	qca8k_mii_write32(bus, 0x10 | r2, r1, 0);
 
 	mutex_unlock(&bus->mdio_lock);
 
@@ -1968,8 +1933,6 @@ qca8k_sw_probe(struct mdio_device *mdiodev)
 	}
 
 	priv->mdio_cache.page = 0xffff;
-	priv->mdio_cache.lo = 0xffff;
-	priv->mdio_cache.hi = 0xffff;
 
 	/* Check the detected switch id */
 	ret = qca8k_read_switch_id(priv);
diff --git a/drivers/net/dsa/qca/qca8k.h b/drivers/net/dsa/qca/qca8k.h
index 0b7a5cb12321..03514f7a20be 100644
--- a/drivers/net/dsa/qca/qca8k.h
+++ b/drivers/net/dsa/qca/qca8k.h
@@ -375,11 +375,6 @@ struct qca8k_mdio_cache {
  * mdio writes
  */
 	u16 page;
-/* lo and hi can also be cached and from Documentation we can skip one
- * extra mdio write if lo or hi is didn't change.
- */
-	u16 lo;
-	u16 hi;
 };
 
 struct qca8k_pcs {
-- 
2.37.2


^ permalink raw reply related

* [net PATCH 4/5] net: dsa: qca8k: introduce single mii read/write lo/hi
From: Christian Marangi @ 2022-12-16 16:17 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Marangi,
	Russell King (Oracle), netdev, linux-kernel
  Cc: Ronald Wahl
In-Reply-To: <20221216161721.23863-1-ansuelsmth@gmail.com>

It may be useful to read/write just the lo or hi half of a reg.

This is especially useful for phy poll with the use of mdio master.
The mdio master reg is composed by the first 16 bit related to setup and
the other half with the returned data or data to write.

Refactor the mii function to permit single mii read/write of lo or hi
half of the reg.

Tested-by: Ronald Wahl <ronald.wahl@raritan.com>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
---
 drivers/net/dsa/qca/qca8k-8xxx.c | 106 ++++++++++++++++++++++++-------
 1 file changed, 84 insertions(+), 22 deletions(-)

diff --git a/drivers/net/dsa/qca/qca8k-8xxx.c b/drivers/net/dsa/qca/qca8k-8xxx.c
index fbcd5c2b13ae..92c4bfef7c97 100644
--- a/drivers/net/dsa/qca/qca8k-8xxx.c
+++ b/drivers/net/dsa/qca/qca8k-8xxx.c
@@ -37,42 +37,104 @@ qca8k_split_addr(u32 regaddr, u16 *r1, u16 *r2, u16 *page)
 }
 
 static int
-qca8k_mii_read32(struct mii_bus *bus, int phy_id, u32 regnum, u32 *val)
+qca8k_mii_write_lo(struct mii_bus *bus, int phy_id, u32 regnum, u32 val)
 {
 	int ret;
+	u16 lo;
 
-	ret = bus->read(bus, phy_id, regnum);
-	if (ret >= 0) {
-		*val = ret;
-		ret = bus->read(bus, phy_id, regnum + 1);
-		*val |= ret << 16;
-	}
+	lo = val & 0xffff;
+	ret = bus->write(bus, phy_id, regnum, lo);
+	if (ret < 0)
+		dev_err_ratelimited(&bus->dev,
+				    "failed to write qca8k 32bit lo register\n");
 
-	if (ret < 0) {
+	return ret;
+}
+
+static int
+qca8k_mii_write_hi(struct mii_bus *bus, int phy_id, u32 regnum, u32 val)
+{
+	int ret;
+	u16 hi;
+
+	hi = (u16)(val >> 16);
+	ret = bus->write(bus, phy_id, regnum, hi);
+	if (ret < 0)
 		dev_err_ratelimited(&bus->dev,
-				    "failed to read qca8k 32bit register\n");
-		*val = 0;
-		return ret;
-	}
+				    "failed to write qca8k 32bit hi register\n");
+
+	return ret;
+}
+
+static int
+qca8k_mii_read_lo(struct mii_bus *bus, int phy_id, u32 regnum, u32 *val)
+{
+	int ret;
+
+	ret = bus->read(bus, phy_id, regnum);
+	if (ret < 0)
+		goto err;
 
+	*val = ret & 0xffff;
 	return 0;
+
+err:
+	dev_err_ratelimited(&bus->dev,
+			    "failed to read qca8k 32bit lo register\n");
+	*val = 0;
+
+	return ret;
 }
 
-static void
-qca8k_mii_write32(struct mii_bus *bus, int phy_id, u32 regnum, u32 val)
+static int
+qca8k_mii_read_hi(struct mii_bus *bus, int phy_id, u32 regnum, u32 *val)
 {
-	u16 lo, hi;
 	int ret;
 
-	lo = val & 0xffff;
-	hi = (u16)(val >> 16);
+	ret = bus->read(bus, phy_id, regnum);
+	if (ret < 0)
+		goto err;
 
-	ret = bus->write(bus, phy_id, regnum, lo);
-	if (ret >= 0)
-		ret = bus->write(bus, phy_id, regnum + 1, hi);
+	*val = ret << 16;
+	return 0;
+
+err:
+	dev_err_ratelimited(&bus->dev,
+			    "failed to read qca8k 32bit hi register\n");
+	*val = 0;
+
+	return ret;
+}
+
+static int
+qca8k_mii_read32(struct mii_bus *bus, int phy_id, u32 regnum, u32 *val)
+{
+	u32 hi, lo;
+	int ret;
+
+	*val = 0;
+
+	ret = qca8k_mii_read_lo(bus, phy_id, regnum, &lo);
 	if (ret < 0)
-		dev_err_ratelimited(&bus->dev,
-				    "failed to write qca8k 32bit register\n");
+		goto err;
+
+	ret = qca8k_mii_read_hi(bus, phy_id, regnum + 1, &hi);
+	if (ret < 0)
+		goto err;
+
+	*val = lo | hi;
+
+err:
+	return ret;
+}
+
+static void
+qca8k_mii_write32(struct mii_bus *bus, int phy_id, u32 regnum, u32 val)
+{
+	if (qca8k_mii_write_lo(bus, phy_id, regnum, val) < 0)
+		return;
+
+	qca8k_mii_write_hi(bus, phy_id, regnum + 1, val);
 }
 
 static int
-- 
2.37.2


^ permalink raw reply related

* [net PATCH 5/5] net: dsa: qca8k: improve mdio master read/write by using single lo/hi
From: Christian Marangi @ 2022-12-16 16:17 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christian Marangi,
	Russell King (Oracle), netdev, linux-kernel
  Cc: Ronald Wahl
In-Reply-To: <20221216161721.23863-1-ansuelsmth@gmail.com>

Improve mdio master read/write by using singe mii read/write lo/hi.

In a read and write we need to poll the mdio master regs in a busy loop
to check for a specific bit present in the upper half of the reg. We can
ignore the other half since it won't contain useful data. This will save
an additional useless read for each read and write operation.

In a read operation the returned data is present in the mdio master reg
lower half. We can ignore the other half since it won't contain useful
data. This will save an additional useless read for each read operation.

In a read operation it's needed to just set the hi half of the mdio
master reg as the lo half will be replaced by the result. This will save
an additional useless write for each read operation.

Tested-by: Ronald Wahl <ronald.wahl@raritan.com>
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
---
 drivers/net/dsa/qca/qca8k-8xxx.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/dsa/qca/qca8k-8xxx.c b/drivers/net/dsa/qca/qca8k-8xxx.c
index 92c4bfef7c97..2f224b166bbb 100644
--- a/drivers/net/dsa/qca/qca8k-8xxx.c
+++ b/drivers/net/dsa/qca/qca8k-8xxx.c
@@ -740,9 +740,9 @@ qca8k_mdio_busy_wait(struct mii_bus *bus, u32 reg, u32 mask)
 
 	qca8k_split_addr(reg, &r1, &r2, &page);
 
-	ret = read_poll_timeout(qca8k_mii_read32, ret1, !(val & mask), 0,
+	ret = read_poll_timeout(qca8k_mii_read_hi, ret1, !(val & mask), 0,
 				QCA8K_BUSY_WAIT_TIMEOUT * USEC_PER_MSEC, false,
-				bus, 0x10 | r2, r1, &val);
+				bus, 0x10 | r2, r1 + 1, &val);
 
 	/* Check if qca8k_read has failed for a different reason
 	 * before returnting -ETIMEDOUT
@@ -784,7 +784,7 @@ qca8k_mdio_write(struct qca8k_priv *priv, int phy, int regnum, u16 data)
 
 exit:
 	/* even if the busy_wait timeouts try to clear the MASTER_EN */
-	qca8k_mii_write32(bus, 0x10 | r2, r1, 0);
+	qca8k_mii_write_hi(bus, 0x10 | r2, r1 + 1, 0);
 
 	mutex_unlock(&bus->mdio_lock);
 
@@ -814,18 +814,18 @@ qca8k_mdio_read(struct qca8k_priv *priv, int phy, int regnum)
 	if (ret)
 		goto exit;
 
-	qca8k_mii_write32(bus, 0x10 | r2, r1, val);
+	qca8k_mii_write_hi(bus, 0x10 | r2, r1 + 1, val);
 
 	ret = qca8k_mdio_busy_wait(bus, QCA8K_MDIO_MASTER_CTRL,
 				   QCA8K_MDIO_MASTER_BUSY);
 	if (ret)
 		goto exit;
 
-	ret = qca8k_mii_read32(bus, 0x10 | r2, r1, &val);
+	ret = qca8k_mii_read_lo(bus, 0x10 | r2, r1, &val);
 
 exit:
 	/* even if the busy_wait timeouts try to clear the MASTER_EN */
-	qca8k_mii_write32(bus, 0x10 | r2, r1, 0);
+	qca8k_mii_write_hi(bus, 0x10 | r2, r1 + 1, 0);
 
 	mutex_unlock(&bus->mdio_lock);
 
-- 
2.37.2


^ permalink raw reply related

* Re: [net PATCH 3/5] Revert "net: dsa: qca8k: cache lo and hi for mdio write"
From: Christian Marangi @ 2022-12-16 16:21 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Russell King (Oracle),
	netdev, linux-kernel
  Cc: Ronald Wahl, stable
In-Reply-To: <20221216161721.23863-3-ansuelsmth@gmail.com>

On Fri, Dec 16, 2022 at 05:17:19PM +0100, Christian Marangi wrote:
> This reverts commit 2481d206fae7884cd07014fd1318e63af35e99eb.
> 
> The Documentation is very confusing about the topic.
> The cache logic for hi and lo is wrong and actually miss some regs to be
> actually written.
> 
> What the Docuemntation actually intended was that it's possible to skip

Just notice that I forgot to fix a typo here! Hope it's not a problem if
this can be fixed when merged. If it is I will happly send v2 if the
rest of the changes are OK.

> writing hi OR lo if half of the reg is not needed to be written or read.
> 
> Revert the change in favor of a better and correct implementation.
> 
> Reported-by: Ronald Wahl <ronald.wahl@raritan.com>
> Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
> Cc: stable@vger.kernel.org # v5.18+
> ---
>  drivers/net/dsa/qca/qca8k-8xxx.c | 61 +++++++-------------------------
>  drivers/net/dsa/qca/qca8k.h      |  5 ---
>  2 files changed, 12 insertions(+), 54 deletions(-)
> 
> diff --git a/drivers/net/dsa/qca/qca8k-8xxx.c b/drivers/net/dsa/qca/qca8k-8xxx.c
> index 46151320b2a8..fbcd5c2b13ae 100644
> --- a/drivers/net/dsa/qca/qca8k-8xxx.c
> +++ b/drivers/net/dsa/qca/qca8k-8xxx.c
> @@ -36,44 +36,6 @@ qca8k_split_addr(u32 regaddr, u16 *r1, u16 *r2, u16 *page)
>  	*page = regaddr & 0x3ff;
>  }
>  
> -static int
> -qca8k_set_lo(struct qca8k_priv *priv, int phy_id, u32 regnum, u16 lo)
> -{
> -	u16 *cached_lo = &priv->mdio_cache.lo;
> -	struct mii_bus *bus = priv->bus;
> -	int ret;
> -
> -	if (lo == *cached_lo)
> -		return 0;
> -
> -	ret = bus->write(bus, phy_id, regnum, lo);
> -	if (ret < 0)
> -		dev_err_ratelimited(&bus->dev,
> -				    "failed to write qca8k 32bit lo register\n");
> -
> -	*cached_lo = lo;
> -	return 0;
> -}
> -
> -static int
> -qca8k_set_hi(struct qca8k_priv *priv, int phy_id, u32 regnum, u16 hi)
> -{
> -	u16 *cached_hi = &priv->mdio_cache.hi;
> -	struct mii_bus *bus = priv->bus;
> -	int ret;
> -
> -	if (hi == *cached_hi)
> -		return 0;
> -
> -	ret = bus->write(bus, phy_id, regnum, hi);
> -	if (ret < 0)
> -		dev_err_ratelimited(&bus->dev,
> -				    "failed to write qca8k 32bit hi register\n");
> -
> -	*cached_hi = hi;
> -	return 0;
> -}
> -
>  static int
>  qca8k_mii_read32(struct mii_bus *bus, int phy_id, u32 regnum, u32 *val)
>  {
> @@ -97,7 +59,7 @@ qca8k_mii_read32(struct mii_bus *bus, int phy_id, u32 regnum, u32 *val)
>  }
>  
>  static void
> -qca8k_mii_write32(struct qca8k_priv *priv, int phy_id, u32 regnum, u32 val)
> +qca8k_mii_write32(struct mii_bus *bus, int phy_id, u32 regnum, u32 val)
>  {
>  	u16 lo, hi;
>  	int ret;
> @@ -105,9 +67,12 @@ qca8k_mii_write32(struct qca8k_priv *priv, int phy_id, u32 regnum, u32 val)
>  	lo = val & 0xffff;
>  	hi = (u16)(val >> 16);
>  
> -	ret = qca8k_set_lo(priv, phy_id, regnum, lo);
> +	ret = bus->write(bus, phy_id, regnum, lo);
>  	if (ret >= 0)
> -		ret = qca8k_set_hi(priv, phy_id, regnum + 1, hi);
> +		ret = bus->write(bus, phy_id, regnum + 1, hi);
> +	if (ret < 0)
> +		dev_err_ratelimited(&bus->dev,
> +				    "failed to write qca8k 32bit register\n");
>  }
>  
>  static int
> @@ -442,7 +407,7 @@ qca8k_regmap_write(void *ctx, uint32_t reg, uint32_t val)
>  	if (ret < 0)
>  		goto exit;
>  
> -	qca8k_mii_write32(priv, 0x10 | r2, r1, val);
> +	qca8k_mii_write32(bus, 0x10 | r2, r1, val);
>  
>  exit:
>  	mutex_unlock(&bus->mdio_lock);
> @@ -475,7 +440,7 @@ qca8k_regmap_update_bits(void *ctx, uint32_t reg, uint32_t mask, uint32_t write_
>  
>  	val &= ~mask;
>  	val |= write_val;
> -	qca8k_mii_write32(priv, 0x10 | r2, r1, val);
> +	qca8k_mii_write32(bus, 0x10 | r2, r1, val);
>  
>  exit:
>  	mutex_unlock(&bus->mdio_lock);
> @@ -750,14 +715,14 @@ qca8k_mdio_write(struct qca8k_priv *priv, int phy, int regnum, u16 data)
>  	if (ret)
>  		goto exit;
>  
> -	qca8k_mii_write32(priv, 0x10 | r2, r1, val);
> +	qca8k_mii_write32(bus, 0x10 | r2, r1, val);
>  
>  	ret = qca8k_mdio_busy_wait(bus, QCA8K_MDIO_MASTER_CTRL,
>  				   QCA8K_MDIO_MASTER_BUSY);
>  
>  exit:
>  	/* even if the busy_wait timeouts try to clear the MASTER_EN */
> -	qca8k_mii_write32(priv, 0x10 | r2, r1, 0);
> +	qca8k_mii_write32(bus, 0x10 | r2, r1, 0);
>  
>  	mutex_unlock(&bus->mdio_lock);
>  
> @@ -787,7 +752,7 @@ qca8k_mdio_read(struct qca8k_priv *priv, int phy, int regnum)
>  	if (ret)
>  		goto exit;
>  
> -	qca8k_mii_write32(priv, 0x10 | r2, r1, val);
> +	qca8k_mii_write32(bus, 0x10 | r2, r1, val);
>  
>  	ret = qca8k_mdio_busy_wait(bus, QCA8K_MDIO_MASTER_CTRL,
>  				   QCA8K_MDIO_MASTER_BUSY);
> @@ -798,7 +763,7 @@ qca8k_mdio_read(struct qca8k_priv *priv, int phy, int regnum)
>  
>  exit:
>  	/* even if the busy_wait timeouts try to clear the MASTER_EN */
> -	qca8k_mii_write32(priv, 0x10 | r2, r1, 0);
> +	qca8k_mii_write32(bus, 0x10 | r2, r1, 0);
>  
>  	mutex_unlock(&bus->mdio_lock);
>  
> @@ -1968,8 +1933,6 @@ qca8k_sw_probe(struct mdio_device *mdiodev)
>  	}
>  
>  	priv->mdio_cache.page = 0xffff;
> -	priv->mdio_cache.lo = 0xffff;
> -	priv->mdio_cache.hi = 0xffff;
>  
>  	/* Check the detected switch id */
>  	ret = qca8k_read_switch_id(priv);
> diff --git a/drivers/net/dsa/qca/qca8k.h b/drivers/net/dsa/qca/qca8k.h
> index 0b7a5cb12321..03514f7a20be 100644
> --- a/drivers/net/dsa/qca/qca8k.h
> +++ b/drivers/net/dsa/qca/qca8k.h
> @@ -375,11 +375,6 @@ struct qca8k_mdio_cache {
>   * mdio writes
>   */
>  	u16 page;
> -/* lo and hi can also be cached and from Documentation we can skip one
> - * extra mdio write if lo or hi is didn't change.
> - */
> -	u16 lo;
> -	u16 hi;
>  };
>  
>  struct qca8k_pcs {
> -- 
> 2.37.2
> 

-- 
	Ansuel

^ permalink raw reply

* [PATCH net] net: stream: purge sk_error_queue in sk_stream_kill_queues()
From: Eric Dumazet @ 2022-12-16 16:29 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, eric.dumazet, Eric Dumazet, Changheon Lee

Changheon Lee reported TCP socket leaks, with a nice repro.

It seems we leak TCP sockets with the following sequence:

1) SOF_TIMESTAMPING_TX_ACK is enabled on the socket.

   Each ACK will cook an skb put in error queue, from __skb_tstamp_tx().
   __skb_tstamp_tx() is using skb_clone(), unless
   SOF_TIMESTAMPING_OPT_TSONLY was also requested.

2) If the application is also using MSG_ZEROCOPY, then we put in the
   error queue cloned skbs that had a struct ubuf_info attached to them.

   Whenever an struct ubuf_info is allocated, sock_zerocopy_alloc()
   does a sock_hold().

   As long as the cloned skbs are still in sk_error_queue,
   socket refcount is kept elevated.

3) Application closes the socket, while error queue is not empty.

Since tcp_close() no longer purges the socket error queue,
we might end up with a TCP socket with at least one skb in
error queue keeping the socket alive forever.

This bug can be (ab)used to consume all kernel memory
and freeze the host.

We need to purge the error queue, with proper synchronization
against concurrent writers.

Fixes: 24bcbe1cc69f ("net: stream: don't purge sk_error_queue in sk_stream_kill_queues()")
Reported-by: Changheon Lee <darklight2357@icloud.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/stream.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/core/stream.c b/net/core/stream.c
index 5b1fe2b82eac753bc8e18c02db04c5906b3a2d97..cd06750dd3297cd0e0f073057a4d85d4078f87c3 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -196,6 +196,12 @@ void sk_stream_kill_queues(struct sock *sk)
 	/* First the read buffer. */
 	__skb_queue_purge(&sk->sk_receive_queue);
 
+	/* Next, the error queue.
+	 * We need to use queue lock, because other threads might
+	 * add packets to the queue without socket lock being held.
+	 */
+	skb_queue_purge(&sk->sk_error_queue);
+
 	/* Next, the write queue. */
 	WARN_ON_ONCE(!skb_queue_empty(&sk->sk_write_queue));
 
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply related

* Re: [PATCH net v2 0/2] iavf: fix temporary deadlock and failure to set MAC address
From: Alexander H Duyck @ 2022-12-16 16:29 UTC (permalink / raw)
  To: Michal Schmidt, intel-wired-lan
  Cc: Ivan Vecera, netdev, Mateusz Palczewski, Patryk Piotrowski
In-Reply-To: <20221215225049.508812-1-mschmidt@redhat.com>

On Thu, 2022-12-15 at 23:50 +0100, Michal Schmidt wrote:
> This fixes an issue where setting the MAC address on iavf runs into a
> timeout and fails with EAGAIN.
> 
> Changes in v2:
>  - Removed unused 'ret' variable in patch 1.
>  - Added patch 2 to fix another cause of the same timeout.
> 
> Michal Schmidt (2):
>   iavf: fix temporary deadlock and failure to set MAC address
>   iavf: avoid taking rtnl_lock in adminq_task
> 
>  drivers/net/ethernet/intel/iavf/iavf.h        |   4 +-
>  .../net/ethernet/intel/iavf/iavf_ethtool.c    |  10 +-
>  drivers/net/ethernet/intel/iavf/iavf_main.c   | 135 ++++++++++--------
>  .../net/ethernet/intel/iavf/iavf_virtchnl.c   |   8 +-
>  4 files changed, 86 insertions(+), 71 deletions(-)
> 

The series looks good to me.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>

^ permalink raw reply

* Re: [PATCH net] mctp: serial: Fix starting value for frame check sequence
From: Alexander H Duyck @ 2022-12-16 16:41 UTC (permalink / raw)
  To: Jeremy Kerr, netdev
  Cc: Matt Johnston, David S. Miller, Jakub Kicinski, Eric Dumazet,
	Paolo Abeni, Harsh Tyagi
In-Reply-To: <20221216034409.27174-1-jk@codeconstruct.com.au>

On Fri, 2022-12-16 at 11:44 +0800, Jeremy Kerr wrote:
> RFC1662 defines the start state for the crc16 FCS to be 0xffff, but
> we're currently starting at zero.
> 
> This change uses the correct start state. We're only early in the
> adoption for the serial binding, so there aren't yet any other users to
> interface to.
> 
> Fixes: a0c2ccd9b5ad ("mctp: Add MCTP-over-serial transport binding")
> 
> Reported-by: Harsh Tyagi <harshtya@google.com>
> Tested-by: Harsh Tyagi <harshtya@google.com>
> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
> ---
>  drivers/net/mctp/mctp-serial.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/mctp/mctp-serial.c b/drivers/net/mctp/mctp-serial.c
> index 7cd103fd34ef..9f9eaf896047 100644
> --- a/drivers/net/mctp/mctp-serial.c
> +++ b/drivers/net/mctp/mctp-serial.c
> @@ -35,6 +35,8 @@
>  #define BYTE_FRAME		0x7e
>  #define BYTE_ESC		0x7d
>  
> +#define FCS_INIT		0xffff
> +
>  static DEFINE_IDA(mctp_serial_ida);
>  
>  enum mctp_serial_state {
> @@ -123,7 +125,7 @@ static void mctp_serial_tx_work(struct work_struct *work)
>  		buf[2] = dev->txlen;
>  
>  		if (!dev->txpos)
> -			dev->txfcs = crc_ccitt(0, buf + 1, 2);
> +			dev->txfcs = crc_ccitt(FCS_INIT, buf + 1, 2);
>  
>  		txlen = write_chunk(dev, buf + dev->txpos, 3 - dev->txpos);
>  		if (txlen <= 0) {
> @@ -303,7 +305,7 @@ static void mctp_serial_push_header(struct mctp_serial *dev, unsigned char c)
>  	case 1:
>  		if (c == MCTP_SERIAL_VERSION) {
>  			dev->rxpos++;
> -			dev->rxfcs = crc_ccitt_byte(0, c);
> +			dev->rxfcs = crc_ccitt_byte(FCS_INIT, c);
>  		} else {
>  			dev->rxstate = STATE_ERR;
>  		}

Since the starting value isn't unique would it possibly be worthwhile
to look at adding a define to include/linux/crc-ccitt.h to be used to
handle the cases where the initial value is 0xffff? I notice there
seems to only be two starting values 0 and 0xffff for all callers so it
might make sense to centralize it in one place.

Otherwise the code itself looks good.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>

^ permalink raw reply

* Re: [PATCH v7 01/11] leds: add support for hardware driven LEDs
From: Christian Marangi @ 2022-12-16 16:45 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Andrew Lunn, Florian Fainelli, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Jonathan Corbet, Pavel Machek, John Crispin,
	netdev, devicetree, linux-kernel, linux-doc, linux-leds,
	Tim Harvey, Alexander Stein, Rasmus Villemoes, Marek Behún
In-Reply-To: <Y5tHjwx1Boj3xMok@shell.armlinux.org.uk>

On Thu, Dec 15, 2022 at 04:13:03PM +0000, Russell King (Oracle) wrote:
> Hi Christian,
> 
> Thanks for the patch.
> 
> I think Andrew's email is offline at the moment.
>

Notice by gmail spamming me "I CAN'T SEND IT AHHHHH"
Holidy times I guess?

> On Thu, Dec 15, 2022 at 12:54:28AM +0100, Christian Marangi wrote:
> > +static bool led_trigger_is_supported(struct led_classdev *led_cdev,
> > +				     struct led_trigger *trigger)
> > +{
> > +	switch (led_cdev->blink_mode) {
> > +	case SOFTWARE_CONTROLLED:
> > +		if (trigger->supported_blink_modes == HARDWARE_ONLY)
> > +			return 0;
> > +		break;
> > +	case HARDWARE_CONTROLLED:
> > +		if (trigger->supported_blink_modes == SOFTWARE_ONLY)
> > +			return 0;
> > +		break;
> > +	case SOFTWARE_HARDWARE_CONTROLLED:
> > +		break;
> > +	default:
> > +		return 0;
> > +	}
> > +
> > +	return 1;
> 
> Should be returning true/false. I'm not sure I'm a fan of the style of
> this though - wouldn't the following be easier to read?
> 
> 	switch (led_cdev->blink_mode) {
> 	case SOFTWARE_CONTROLLED:
> 		return trigger->supported_blink_modes != HARDWARE_ONLY;
> 
> 	case HARDWARE_CONTROLLED:
> 		return trigger->supported_blink_modes != SOFTWARE_ONLY;
> 
> 	case SOFTWARE_HARDWARE_CONTROLLED:
> 		return true;
> 	}
> ?

Much better!

> 
> Also, does it really need a default case - without it, when the
> led_blink_modes enum is expanded and the switch statement isn't
> updated, we'll get a compiler warning which will prompt this to be
> updated - whereas, with a default case, it won't.
> 

I added the default just to mute some compiler warning. But guess if
every enum is handled the warning should not be reported.

> > @@ -188,6 +213,10 @@ int led_trigger_set(struct led_classdev *led_cdev, struct led_trigger *trig)
> >  		led_set_brightness(led_cdev, LED_OFF);
> >  	}
> >  	if (trig) {
> > +		/* Make sure the trigger support the LED blink mode */
> > +		if (!led_trigger_is_supported(led_cdev, trig))
> > +			return -EINVAL;
> 
> Shouldn't validation happen before we start taking any actions? In other
> words, before we remove the previous trigger?
> 

trigger_set first remove any trigger and set the led off. Then apply the
new trigger. So the validation is done only when a trigger is actually
applied. Think we should understand the best case here.

> > @@ -350,12 +381,26 @@ static inline bool led_sysfs_is_disabled(struct led_classdev *led_cdev)
> >  
> >  #define TRIG_NAME_MAX 50
> >  
> > +enum led_trigger_blink_supported_modes {
> > +	SOFTWARE_ONLY = SOFTWARE_CONTROLLED,
> > +	HARDWARE_ONLY = HARDWARE_CONTROLLED,
> > +	SOFTWARE_HARDWARE = SOFTWARE_HARDWARE_CONTROLLED,
> 
> I suspect all these generic names are asking for eventual namespace
> clashes. Maybe prefix them with LED_ ?

Agree they are pretty generic so I can see why... My only concern was
making them too long... Maybe reduce them to SW or HW? LEDS_SW_ONLY...
LEDS_SW_CONTROLLED?

> 
> Thanks.
> 
> -- 
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

-- 
	Ansuel

^ permalink raw reply

* RE: [PATCH net-next] powerpc: dts: t208x: Disable 10G on MAC1 and MAC2
From: Camelia Alexandra Groza @ 2022-12-16 16:46 UTC (permalink / raw)
  To: Sean Anderson
  Cc: David S . Miller, netdev@vger.kernel.org,
	devicetree@vger.kernel.org, Michael Ellerman,
	linux-kernel@vger.kernel.org, Rob Herring, Christophe Leroy,
	linuxppc-dev@lists.ozlabs.org, Nicholas Piggin,
	Krzysztof Kozlowski
In-Reply-To: <8b6ad6a9-05fd-d4c4-16cb-5596e3defc04@seco.com>

> -----Original Message-----
> From: Sean Anderson <sean.anderson@seco.com>
> Sent: Thursday, December 15, 2022 18:33
> To: Camelia Alexandra Groza <camelia.groza@nxp.com>
> Cc: David S . Miller <davem@davemloft.net>; netdev@vger.kernel.org;
> devicetree@vger.kernel.org; Michael Ellerman <mpe@ellerman.id.au>;
> linux-kernel@vger.kernel.org; Rob Herring <robh+dt@kernel.org>;
> Christophe Leroy <christophe.leroy@csgroup.eu>; linuxppc-
> dev@lists.ozlabs.org; Nicholas Piggin <npiggin@gmail.com>; Krzysztof
> Kozlowski <krzysztof.kozlowski+dt@linaro.org>
> Subject: Re: [PATCH net-next] powerpc: dts: t208x: Disable 10G on MAC1 and
> MAC2
> 
> On 12/15/22 11:12, Camelia Alexandra Groza wrote:
> >> -----Original Message-----
> >> From: Sean Anderson <sean.anderson@seco.com>
> >> Sent: Thursday, November 3, 2022 23:29
> >> To: David S . Miller <davem@davemloft.net>; netdev@vger.kernel.org
> >> Cc: devicetree@vger.kernel.org; Michael Ellerman
> <mpe@ellerman.id.au>;
> >> linux-kernel@vger.kernel.org; Rob Herring <robh+dt@kernel.org>;
> >> Christophe Leroy <christophe.leroy@csgroup.eu>; linuxppc-
> >> dev@lists.ozlabs.org; Nicholas Piggin <npiggin@gmail.com>; Krzysztof
> >> Kozlowski <krzysztof.kozlowski+dt@linaro.org>; Sean Anderson
> >> <sean.anderson@seco.com>; Camelia Alexandra Groza
> >> <camelia.groza@nxp.com>
> >> Subject: [PATCH net-next] powerpc: dts: t208x: Disable 10G on MAC1 and
> >> MAC2
> >>
> >> There aren't enough resources to run these ports at 10G speeds. Just
> >> keep the pcs changes, and revert the rest. This is not really correct,
> >> since the hardware could support 10g in some other configuration...
> >>
> >> Fixes: 36926a7d70c2 ("powerpc: dts: t208x: Mark MAC1 and MAC2 as 10G")
> >> Reported-by: Camelia Alexandra Groza <camelia.groza@nxp.com>
> >> Signed-off-by: Sean Anderson <sean.anderson@seco.com>
> >> ---
> >>
> >
> > Hi Sean,
> >
> > I know I'm late, but there are a couple of issues with this patch. Do you
> intend
> > on sending a v2 or should I pick it up?
> >
> >>  .../boot/dts/fsl/qoriq-fman3-0-10g-2.dtsi     | 45 -------------------
> >>  .../boot/dts/fsl/qoriq-fman3-0-10g-3.dtsi     | 45 -------------------
> >>  arch/powerpc/boot/dts/fsl/t2081si-post.dtsi   |  6 ++-
> >>  3 files changed, 4 insertions(+), 92 deletions(-)
> >>  delete mode 100644 arch/powerpc/boot/dts/fsl/qoriq-fman3-0-10g-
> 2.dtsi
> >>  delete mode 100644 arch/powerpc/boot/dts/fsl/qoriq-fman3-0-10g-
> 3.dtsi
> >>
> >> diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman3-0-10g-2.dtsi
> >> b/arch/powerpc/boot/dts/fsl/qoriq-fman3-0-10g-2.dtsi
> >> deleted file mode 100644
> >> index 6b3609574b0f..000000000000
> >> --- a/arch/powerpc/boot/dts/fsl/qoriq-fman3-0-10g-2.dtsi
> >> +++ /dev/null
> >> @@ -1,45 +0,0 @@
> >> -// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0-or-later
> >> -/*
> >> - * QorIQ FMan v3 10g port #2 device tree stub [ controller @ offset
> >> 0x400000 ]
> >> - *
> >> - * Copyright 2022 Sean Anderson <sean.anderson@seco.com>
> >> - * Copyright 2012 - 2015 Freescale Semiconductor Inc.
> >> - */
> >> -
> >> -fman@400000 {
> >> -	fman0_rx_0x08: port@88000 {
> >> -		cell-index = <0x8>;
> >> -		compatible = "fsl,fman-v3-port-rx";
> >> -		reg = <0x88000 0x1000>;
> >> -		fsl,fman-10g-port;
> >> -	};
> >> -
> >> -	fman0_tx_0x28: port@a8000 {
> >> -		cell-index = <0x28>;
> >> -		compatible = "fsl,fman-v3-port-tx";
> >> -		reg = <0xa8000 0x1000>;
> >> -		fsl,fman-10g-port;
> >> -	};
> >> -
> >> -	ethernet@e0000 {
> >> -		cell-index = <0>;
> >> -		compatible = "fsl,fman-memac";
> >> -		reg = <0xe0000 0x1000>;
> >> -		fsl,fman-ports = <&fman0_rx_0x08 &fman0_tx_0x28>;
> >> -		ptp-timer = <&ptp_timer0>;
> >> -		pcsphy-handle = <&pcsphy0>, <&pcsphy0>;
> >> -		pcs-handle-names = "sgmii", "xfi";
> >> -	};
> >> -
> >> -	mdio@e1000 {
> >> -		#address-cells = <1>;
> >> -		#size-cells = <0>;
> >> -		compatible = "fsl,fman-memac-mdio", "fsl,fman-xmdio";
> >> -		reg = <0xe1000 0x1000>;
> >> -		fsl,erratum-a011043; /* must ignore read errors */
> >> -
> >> -		pcsphy0: ethernet-phy@0 {
> >> -			reg = <0x0>;
> >> -		};
> >> -	};
> >> -};
> >> diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman3-0-10g-3.dtsi
> >> b/arch/powerpc/boot/dts/fsl/qoriq-fman3-0-10g-3.dtsi
> >> deleted file mode 100644
> >> index 28ed1a85a436..000000000000
> >> --- a/arch/powerpc/boot/dts/fsl/qoriq-fman3-0-10g-3.dtsi
> >> +++ /dev/null
> >> @@ -1,45 +0,0 @@
> >> -// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0-or-later
> >> -/*
> >> - * QorIQ FMan v3 10g port #3 device tree stub [ controller @ offset
> >> 0x400000 ]
> >> - *
> >> - * Copyright 2022 Sean Anderson <sean.anderson@seco.com>
> >> - * Copyright 2012 - 2015 Freescale Semiconductor Inc.
> >> - */
> >> -
> >> -fman@400000 {
> >> -	fman0_rx_0x09: port@89000 {
> >> -		cell-index = <0x9>;
> >> -		compatible = "fsl,fman-v3-port-rx";
> >> -		reg = <0x89000 0x1000>;
> >> -		fsl,fman-10g-port;
> >> -	};
> >> -
> >> -	fman0_tx_0x29: port@a9000 {
> >> -		cell-index = <0x29>;
> >> -		compatible = "fsl,fman-v3-port-tx";
> >> -		reg = <0xa9000 0x1000>;
> >> -		fsl,fman-10g-port;
> >> -	};
> >> -
> >> -	ethernet@e2000 {
> >> -		cell-index = <1>;
> >> -		compatible = "fsl,fman-memac";
> >> -		reg = <0xe2000 0x1000>;
> >> -		fsl,fman-ports = <&fman0_rx_0x09 &fman0_tx_0x29>;
> >> -		ptp-timer = <&ptp_timer0>;
> >> -		pcsphy-handle = <&pcsphy1>, <&pcsphy1>;
> >> -		pcs-handle-names = "sgmii", "xfi";
> >> -	};
> >> -
> >> -	mdio@e3000 {
> >> -		#address-cells = <1>;
> >> -		#size-cells = <0>;
> >> -		compatible = "fsl,fman-memac-mdio", "fsl,fman-xmdio";
> >> -		reg = <0xe3000 0x1000>;
> >> -		fsl,erratum-a011043; /* must ignore read errors */
> >> -
> >> -		pcsphy1: ethernet-phy@0 {
> >> -			reg = <0x0>;
> >> -		};
> >> -	};
> >> -};
> >> diff --git a/arch/powerpc/boot/dts/fsl/t2081si-post.dtsi
> >> b/arch/powerpc/boot/dts/fsl/t2081si-post.dtsi
> >> index 74e17e134387..fed3879fa0aa 100644
> >> --- a/arch/powerpc/boot/dts/fsl/t2081si-post.dtsi
> >> +++ b/arch/powerpc/boot/dts/fsl/t2081si-post.dtsi
> >> @@ -609,8 +609,8 @@ usb1: usb@211000 {
> >>  /include/ "qoriq-bman1.dtsi"
> >>
> >>  /include/ "qoriq-fman3-0.dtsi"
> >> -/include/ "qoriq-fman3-0-10g-2.dtsi"
> >> -/include/ "qoriq-fman3-0-10g-3.dtsi"
> >> +/include/ "qoriq-fman3-0-1g-2.dtsi"
> >> +/include/ "qoriq-fman3-0-1g-3.dtsi"
> >
> > These two should be qoriq-fman3-0-1g-0.dtsi and qoriq-fman3-0-1g-1.dtsi.
> > You are including 1g-2.dtsi and 1g-3.dtsi twice.
> 
> So they should.
> 
> >>  /include/ "qoriq-fman3-0-1g-2.dtsi"
> >>  /include/ "qoriq-fman3-0-1g-3.dtsi"
> >>  /include/ "qoriq-fman3-0-1g-4.dtsi"
> >> @@ -619,9 +619,11 @@ usb1: usb@211000 {
> >>  /include/ "qoriq-fman3-0-10g-1.dtsi"
> >>  	fman@400000 {
> >>  		enet0: ethernet@e0000 {
> >> +			pcs-handle-names = "sgmii", "xfi";
> >>  		};
> >>
> >>  		enet1: ethernet@e2000 {
> >> +			pcs-handle-names = "sgmii", "xfi";
> >
> > The second pcsphy for this port is still qsgmiia_pcs1 as described in
> > qoriq-fman3-0-1g-1.dtsi. It should also be overwritten, not only the name
> > property:
> > 	pcsphy-handle = <&pcsphy1>, <&pcsphy1>;
> 
> This is the sort of reason I wanted to just delete the 10g property.
> 
> --Sean

I was against adding the 10g property in qoriq-fman3-0-10g-2/3.dtsi and
then deleting it in from t2081si-post.dtsi. Since there aren't other users for
these two new dtsi, it felt unneeded adding it in the first place.

I was thinking of just removing the property from qoriq-fman3-0-10g-2/3.dtsi
from the start.

I am reconsidering now. I see the value in maintaining the layout across all
qoriq-fman3-0-10g-x.dtsi files. And given the resource allocation issue is
relevant only for this one SoC, not all generic ports, it makes sense to edit the 
SoC dtsi only.

In short, we can go with your original proposal. Sorry for the noise.

Camelia

> >>  		};
> >>
> >>  		enet2: ethernet@e4000 {
> >> --
> >> 2.35.1.1320.gc452695387.dirty
> >


^ permalink raw reply

* Re: Possible race with xsk_flush
From: Shawn Bohrer @ 2022-12-16 16:48 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: netdev, bpf, bjorn, magnus.karlsson, kernel-team
In-Reply-To: <CAJ8uoz1GKvoaM0DCo1Ki8q=LHR1cjrNC=1BK7chTKKW9Po5F5A@mail.gmail.com>

On Fri, Dec 16, 2022 at 11:05:19AM +0100, Magnus Karlsson wrote:
> To summarize, we are expecting this ordering:
> 
> CPU 0 __xsk_rcv_zc()
> CPU 0 __xsk_map_flush()
> CPU 2 __xsk_rcv_zc()
> CPU 2 __xsk_map_flush()
> 
> But we are seeing this order:
> 
> CPU 0 __xsk_rcv_zc()
> CPU 2 __xsk_rcv_zc()
> CPU 0 __xsk_map_flush()
> CPU 2 __xsk_map_flush()
 
Yes exactly, and I think I've proved that this really is the order,
and the race is occurring.  See my cookie/poisoning below.

> Here is the veth NAPI poll loop:
> 
> static int veth_poll(struct napi_struct *napi, int budget)
> {
>     struct veth_rq *rq =
>     container_of(napi, struct veth_rq, xdp_napi);
>     struct veth_stats stats = {};
>     struct veth_xdp_tx_bq bq;
>     int done;
> 
>     bq.count = 0;
> 
>     xdp_set_return_frame_no_direct();
>     done = veth_xdp_rcv(rq, budget, &bq, &stats);
> 
>     if (done < budget && napi_complete_done(napi, done)) {
>         /* Write rx_notify_masked before reading ptr_ring */
>        smp_store_mb(rq->rx_notify_masked, false);
>        if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
>            if (napi_schedule_prep(&rq->xdp_napi)) {
>                WRITE_ONCE(rq->rx_notify_masked, true);
>                __napi_schedule(&rq->xdp_napi);
>             }
>         }
>     }
> 
>     if (stats.xdp_tx > 0)
>         veth_xdp_flush(rq, &bq);
>     if (stats.xdp_redirect > 0)
>         xdp_do_flush();
>     xdp_clear_return_frame_no_direct();
> 
>     return done;
> }
> 
> Something I have never seen before is that there is
> napi_complete_done() and a __napi_schedule() before xdp_do_flush().
> Let us check if this has something to do with it. So two suggestions
> to be executed separately:
> 
> * Put a probe at the __napi_schedule() above and check if it gets
> triggered before this problem
> * Move the "if (stats.xdp_redirect > 0) xdp_do_flush();" to just
> before "if (done < budget && napi_complete_done(napi, done)) {"
> 
> This might provide us some hints on what is going on.

Excellent observation, I haven't really looked at what
napi_complete_done() does yet.  I did notice it could call
__napi_schedule() and that seemed like it might be fine.  I'll also
note that veth_xdp_flush() can also ultimately call __napi_schedule().
I'll see what I can do to explore these ideas.
 
> > I've additionally updated my application to put a bad "cookie"
> > descriptor address back in the RX ring before updating the consumer
> > pointer.  My hope is that if we then ever receive that cookie it
> > proves the kernel raced and failed to update the correct address.

I guess this is more like poisoning the old descriptors rather than a
cookie.  This ran last night and one of my machines read back my
0xdeadbeefdeadbeef poisoned cookie value:

          iperf2-125483  [003] d.Z1. 792878.867088: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0x8d4900 len=0x42 xs=0xffff8bbc542a5000 fq=0xffff8bbc1c464e40
          iperf2-125483  [003] d.Z1. 792878.867093: xsk_flush: (__xsk_map_flush+0x4e/0x180) xs=0xffff8bbc542a5000
          iperf2-125491  [001] d.Z1. 792878.867219: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0xc79900 len=0x42 xs=0xffff8bbc542a5000 fq=0xffff8bbc1c464e40
          iperf2-125491  [001] d.Z1. 792878.867229: xsk_flush: (__xsk_map_flush+0x4e/0x180) xs=0xffff8bbc542a5000
          iperf2-125491  [001] d.Z1. 792878.867291: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0x18e1900 len=0x42 xs=0xffff8bbc542a5000 fq=0xffff8bbc1c464e40
          iperf2-125483  [003] d.Z1. 792878.867441: __xsk_rcv_zc_L7: (__xsk_rcv_zc+0xa7/0x250) addr=0xc0a900 len=0x42 xs=0xffff8bbc542a5000 fq=0xffff8bbc1c464e40
          iperf2-125491  [001] d.Z1. 792878.867457: xsk_flush: (__xsk_map_flush+0x4e/0x180) xs=0xffff8bbc542a5000
 flowtrackd-zjTA-201813  [001] ..... 792878.867496: tracing_mark_write: ingress q:2 0x8d4900 FILL -> RX
 flowtrackd-zjTA-201813  [001] ..... 792878.867503: tracing_mark_write: ingress q:2 0xc79900 FILL -> RX
 flowtrackd-zjTA-201813  [001] ..... 792878.867506: tracing_mark_write: ingress q:2 0x18e1900 FILL -> RX
 flowtrackd-zjTA-201813  [001] ..... 792878.867524: tracing_mark_write: read invalid descriptor cookie: 0xdeadbeefdeadbeef

This shows what I've seen before where the xsk_flush() of CPU 1 runs
after (during?) __xsk_rcv_zc() of CPU 3.  In this trace we never see
the xsk_flush() from CPU 3 but I stop tracing when the bug occurs so
it probably just hasn't happened yet.

So at least to me this does confirm there is definitely a race here
where we can flush an updated producer pointer before the descriptor
address has been filled in.

--
Shawn

^ permalink raw reply

* [PATCH net-next v4 0/4] phy: aquantia: Determine rate adaptation support from registers
From: Sean Anderson @ 2022-12-16 16:48 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit, netdev, Russell King
  Cc: Paolo Abeni, Vladimir Oltean, linux-kernel, Jakub Kicinski,
	Tim Harvey, David S . Miller, Eric Dumazet, Sean Anderson

This attempts to address the problems first reported in [1]. Tim has an
Aquantia phy where the firmware is set up to use "5G XFI" (underclocked
10GBASE-R) when rate adapting lower speeds. This results in us
advertising that we support lower speeds and then failing to bring the
link up. To avoid this, determine whether to enable rate adaptation
based on what's programmed by the firmware. This is "the worst choice"
[2], but we can't really do better until we have more insight into
what the firmware is doing. At the very least, we can prevent bad
firmware from causing us to advertise the wrong modes.

Past submissions may be found at [3, 4].

[1] https://lore.kernel.org/netdev/CAJ+vNU3zeNqiGhjTKE8jRjDYR0D7f=iqPLB8phNyA2CWixy7JA@mail.gmail.com/
[2] https://lore.kernel.org/netdev/20221118171643.vu6uxbnmog4sna65@skbuf/
[3] https://lore.kernel.org/netdev/20221114210740.3332937-1-sean.anderson@seco.com/
[4] https://lore.kernel.org/netdev/20221128195409.100873-1-sean.anderson@seco.com/

Changes in v4:
- Reorganize MDIO defines
- Fix kerneldoc using - instead of : for parameters

Changes in v3:
- Update speed register bits
- Fix incorrect bits for PMA/PMD speed

Changes in v2:
- Move/rename phylink_interface_max_speed
- Rework to just validate things instead of modifying registers

Sean Anderson (4):
  net: phy: Move/rename phylink_interface_max_speed
  phy: mdio: Reorganize defines
  net: mdio: Update speed register bits
  phy: aquantia: Determine rate adaptation support from registers

 drivers/net/phy/aquantia_main.c | 160 ++++++++++++++++++++++++++++++--
 drivers/net/phy/phy-core.c      |  70 ++++++++++++++
 drivers/net/phy/phylink.c       |  75 +--------------
 include/linux/phy.h             |   1 +
 include/uapi/linux/mdio.h       | 109 ++++++++++++++--------
 5 files changed, 299 insertions(+), 116 deletions(-)

-- 
2.35.1.1320.gc452695387.dirty


^ permalink raw reply

* [PATCH net-next v4 1/4] net: phy: Move/rename phylink_interface_max_speed
From: Sean Anderson @ 2022-12-16 16:48 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit, netdev, Russell King
  Cc: Paolo Abeni, Vladimir Oltean, linux-kernel, Jakub Kicinski,
	Tim Harvey, David S . Miller, Eric Dumazet, Sean Anderson,
	Russell King
In-Reply-To: <20221216164851.2932043-1-sean.anderson@seco.com>

This is really a core phy function like phy_interface_num_ports. Move it
to drivers/net/phy/phy-core.c and rename it accordingly.

Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---

(no changes since v2)

Changes in v2:
- New

 drivers/net/phy/phy-core.c | 70 +++++++++++++++++++++++++++++++++++
 drivers/net/phy/phylink.c  | 75 ++------------------------------------
 include/linux/phy.h        |  1 +
 3 files changed, 74 insertions(+), 72 deletions(-)

diff --git a/drivers/net/phy/phy-core.c b/drivers/net/phy/phy-core.c
index 5d08c627a516..5a515434a228 100644
--- a/drivers/net/phy/phy-core.c
+++ b/drivers/net/phy/phy-core.c
@@ -150,6 +150,76 @@ int phy_interface_num_ports(phy_interface_t interface)
 }
 EXPORT_SYMBOL_GPL(phy_interface_num_ports);
 
+/**
+ * phy_interface_max_speed() - get the maximum speed of a phy interface
+ * @interface: phy interface mode defined by &typedef phy_interface_t
+ *
+ * Determine the maximum speed of a phy interface. This is intended to help
+ * determine the correct speed to pass to the MAC when the phy is performing
+ * rate matching.
+ *
+ * Return: The maximum speed of @interface
+ */
+int phy_interface_max_speed(phy_interface_t interface)
+{
+	switch (interface) {
+	case PHY_INTERFACE_MODE_100BASEX:
+	case PHY_INTERFACE_MODE_REVRMII:
+	case PHY_INTERFACE_MODE_RMII:
+	case PHY_INTERFACE_MODE_SMII:
+	case PHY_INTERFACE_MODE_REVMII:
+	case PHY_INTERFACE_MODE_MII:
+		return SPEED_100;
+
+	case PHY_INTERFACE_MODE_TBI:
+	case PHY_INTERFACE_MODE_MOCA:
+	case PHY_INTERFACE_MODE_RTBI:
+	case PHY_INTERFACE_MODE_1000BASEX:
+	case PHY_INTERFACE_MODE_1000BASEKX:
+	case PHY_INTERFACE_MODE_TRGMII:
+	case PHY_INTERFACE_MODE_RGMII_TXID:
+	case PHY_INTERFACE_MODE_RGMII_RXID:
+	case PHY_INTERFACE_MODE_RGMII_ID:
+	case PHY_INTERFACE_MODE_RGMII:
+	case PHY_INTERFACE_MODE_QSGMII:
+	case PHY_INTERFACE_MODE_SGMII:
+	case PHY_INTERFACE_MODE_GMII:
+		return SPEED_1000;
+
+	case PHY_INTERFACE_MODE_2500BASEX:
+		return SPEED_2500;
+
+	case PHY_INTERFACE_MODE_5GBASER:
+		return SPEED_5000;
+
+	case PHY_INTERFACE_MODE_XGMII:
+	case PHY_INTERFACE_MODE_RXAUI:
+	case PHY_INTERFACE_MODE_XAUI:
+	case PHY_INTERFACE_MODE_10GBASER:
+	case PHY_INTERFACE_MODE_10GKR:
+	case PHY_INTERFACE_MODE_USXGMII:
+	case PHY_INTERFACE_MODE_QUSGMII:
+		return SPEED_10000;
+
+	case PHY_INTERFACE_MODE_25GBASER:
+		return SPEED_25000;
+
+	case PHY_INTERFACE_MODE_XLGMII:
+		return SPEED_40000;
+
+	case PHY_INTERFACE_MODE_INTERNAL:
+	case PHY_INTERFACE_MODE_NA:
+	case PHY_INTERFACE_MODE_MAX:
+		/* No idea! Garbage in, unknown out */
+		return SPEED_UNKNOWN;
+	}
+
+	/* If we get here, someone forgot to add an interface mode above */
+	WARN_ON_ONCE(1);
+	return SPEED_UNKNOWN;
+}
+EXPORT_SYMBOL_GPL(phy_interface_max_speed);
+
 /* A mapping of all SUPPORTED settings to speed/duplex.  This table
  * must be grouped by speed and sorted in descending match priority
  * - iow, descending speed.
diff --git a/drivers/net/phy/phylink.c b/drivers/net/phy/phylink.c
index 09cc65c0da93..f8cba09f9d87 100644
--- a/drivers/net/phy/phylink.c
+++ b/drivers/net/phy/phylink.c
@@ -156,75 +156,6 @@ static const char *phylink_an_mode_str(unsigned int mode)
 	return mode < ARRAY_SIZE(modestr) ? modestr[mode] : "unknown";
 }
 
-/**
- * phylink_interface_max_speed() - get the maximum speed of a phy interface
- * @interface: phy interface mode defined by &typedef phy_interface_t
- *
- * Determine the maximum speed of a phy interface. This is intended to help
- * determine the correct speed to pass to the MAC when the phy is performing
- * rate matching.
- *
- * Return: The maximum speed of @interface
- */
-static int phylink_interface_max_speed(phy_interface_t interface)
-{
-	switch (interface) {
-	case PHY_INTERFACE_MODE_100BASEX:
-	case PHY_INTERFACE_MODE_REVRMII:
-	case PHY_INTERFACE_MODE_RMII:
-	case PHY_INTERFACE_MODE_SMII:
-	case PHY_INTERFACE_MODE_REVMII:
-	case PHY_INTERFACE_MODE_MII:
-		return SPEED_100;
-
-	case PHY_INTERFACE_MODE_TBI:
-	case PHY_INTERFACE_MODE_MOCA:
-	case PHY_INTERFACE_MODE_RTBI:
-	case PHY_INTERFACE_MODE_1000BASEX:
-	case PHY_INTERFACE_MODE_1000BASEKX:
-	case PHY_INTERFACE_MODE_TRGMII:
-	case PHY_INTERFACE_MODE_RGMII_TXID:
-	case PHY_INTERFACE_MODE_RGMII_RXID:
-	case PHY_INTERFACE_MODE_RGMII_ID:
-	case PHY_INTERFACE_MODE_RGMII:
-	case PHY_INTERFACE_MODE_QSGMII:
-	case PHY_INTERFACE_MODE_SGMII:
-	case PHY_INTERFACE_MODE_GMII:
-		return SPEED_1000;
-
-	case PHY_INTERFACE_MODE_2500BASEX:
-		return SPEED_2500;
-
-	case PHY_INTERFACE_MODE_5GBASER:
-		return SPEED_5000;
-
-	case PHY_INTERFACE_MODE_XGMII:
-	case PHY_INTERFACE_MODE_RXAUI:
-	case PHY_INTERFACE_MODE_XAUI:
-	case PHY_INTERFACE_MODE_10GBASER:
-	case PHY_INTERFACE_MODE_10GKR:
-	case PHY_INTERFACE_MODE_USXGMII:
-	case PHY_INTERFACE_MODE_QUSGMII:
-		return SPEED_10000;
-
-	case PHY_INTERFACE_MODE_25GBASER:
-		return SPEED_25000;
-
-	case PHY_INTERFACE_MODE_XLGMII:
-		return SPEED_40000;
-
-	case PHY_INTERFACE_MODE_INTERNAL:
-	case PHY_INTERFACE_MODE_NA:
-	case PHY_INTERFACE_MODE_MAX:
-		/* No idea! Garbage in, unknown out */
-		return SPEED_UNKNOWN;
-	}
-
-	/* If we get here, someone forgot to add an interface mode above */
-	WARN_ON_ONCE(1);
-	return SPEED_UNKNOWN;
-}
-
 /**
  * phylink_caps_to_linkmodes() - Convert capabilities to ethtool link modes
  * @linkmodes: ethtool linkmode mask (must be already initialised)
@@ -435,7 +366,7 @@ unsigned long phylink_get_capabilities(phy_interface_t interface,
 				       unsigned long mac_capabilities,
 				       int rate_matching)
 {
-	int max_speed = phylink_interface_max_speed(interface);
+	int max_speed = phy_interface_max_speed(interface);
 	unsigned long caps = MAC_SYM_PAUSE | MAC_ASYM_PAUSE;
 	unsigned long matched_caps = 0;
 
@@ -1221,7 +1152,7 @@ static void phylink_link_up(struct phylink *pl,
 		 * the link_state) to the interface speed, and will send
 		 * pause frames to the MAC to limit its transmission speed.
 		 */
-		speed = phylink_interface_max_speed(link_state.interface);
+		speed = phy_interface_max_speed(link_state.interface);
 		duplex = DUPLEX_FULL;
 		rx_pause = true;
 		break;
@@ -1231,7 +1162,7 @@ static void phylink_link_up(struct phylink *pl,
 		 * the link_state) to the interface speed, and will cause
 		 * collisions to the MAC to limit its transmission speed.
 		 */
-		speed = phylink_interface_max_speed(link_state.interface);
+		speed = phy_interface_max_speed(link_state.interface);
 		duplex = DUPLEX_HALF;
 		break;
 	}
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 71eeb4e3b1fd..65d21a79bab3 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1004,6 +1004,7 @@ const char *phy_duplex_to_str(unsigned int duplex);
 const char *phy_rate_matching_to_str(int rate_matching);
 
 int phy_interface_num_ports(phy_interface_t interface);
+int phy_interface_max_speed(phy_interface_t interface);
 
 /* A structure for mapping a particular speed and duplex
  * combination to a particular SUPPORTED and ADVERTISED value
-- 
2.35.1.1320.gc452695387.dirty


^ permalink raw reply related

* [PATCH net-next v4 2/4] phy: mdio: Reorganize defines
From: Sean Anderson @ 2022-12-16 16:48 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit, netdev, Russell King
  Cc: Paolo Abeni, Vladimir Oltean, linux-kernel, Jakub Kicinski,
	Tim Harvey, David S . Miller, Eric Dumazet, Sean Anderson
In-Reply-To: <20221216164851.2932043-1-sean.anderson@seco.com>

Reorder all registers to be grouped by MMD. Groups fields in
similarly-named registers in the same way. This is especially useful for
registers which may have some bits in common, but interpret other bits
in different ways. The comments have been tweaked to more closely follow
802.3's naming.

Signed-off-by: Sean Anderson <sean.anderson@seco.com>
---

Changes in v4:
- New

 include/uapi/linux/mdio.h | 102 ++++++++++++++++++++++++--------------
 1 file changed, 64 insertions(+), 38 deletions(-)

diff --git a/include/uapi/linux/mdio.h b/include/uapi/linux/mdio.h
index 75b7257a51e1..14b779a8577b 100644
--- a/include/uapi/linux/mdio.h
+++ b/include/uapi/linux/mdio.h
@@ -37,40 +37,47 @@
 #define MDIO_DEVS2		6
 #define MDIO_CTRL2		7	/* 10G control 2 */
 #define MDIO_STAT2		8	/* 10G status 2 */
+#define MDIO_PKGID1		14	/* Package identifier */
+#define MDIO_PKGID2		15
+
+/* PMA/PMD registers. */
 #define MDIO_PMA_TXDIS		9	/* 10G PMA/PMD transmit disable */
 #define MDIO_PMA_RXDET		10	/* 10G PMA/PMD receive signal detect */
 #define MDIO_PMA_EXTABLE	11	/* 10G PMA/PMD extended ability */
-#define MDIO_PKGID1		14	/* Package identifier */
-#define MDIO_PKGID2		15
-#define MDIO_AN_ADVERTISE	16	/* AN advertising (base page) */
-#define MDIO_AN_LPA		19	/* AN LP abilities (base page) */
-#define MDIO_PCS_EEE_ABLE	20	/* EEE Capability register */
-#define MDIO_PCS_EEE_ABLE2	21	/* EEE Capability register 2 */
+#define MDIO_PMA_PMD_BT1	18	/* BASE-T1 PMA/PMD extended ability */
 #define MDIO_PMA_NG_EXTABLE	21	/* 2.5G/5G PMA/PMD extended ability */
-#define MDIO_PCS_EEE_WK_ERR	22	/* EEE wake error counter */
-#define MDIO_PHYXS_LNSTAT	24	/* PHY XGXS lane state */
-#define MDIO_AN_EEE_ADV		60	/* EEE advertisement */
-#define MDIO_AN_EEE_LPABLE	61	/* EEE link partner ability */
-#define MDIO_AN_EEE_ADV2	62	/* EEE advertisement 2 */
-#define MDIO_AN_EEE_LPABLE2	63	/* EEE link partner ability 2 */
-#define MDIO_AN_CTRL2		64	/* AN THP bypass request control */
-
-/* Media-dependent registers. */
 #define MDIO_PMA_10GBT_SWAPPOL	130	/* 10GBASE-T pair swap & polarity */
 #define MDIO_PMA_10GBT_TXPWR	131	/* 10GBASE-T TX power control */
 #define MDIO_PMA_10GBT_SNR	133	/* 10GBASE-T SNR margin, lane A.
 					 * Lanes B-D are numbered 134-136. */
 #define MDIO_PMA_10GBR_FSRT_CSR	147	/* 10GBASE-R fast retrain status and control */
 #define MDIO_PMA_10GBR_FECABLE	170	/* 10GBASE-R FEC ability */
+#define MDIO_PMA_PMD_BT1_CTRL	2100	/* BASE-T1 PMA/PMD control register */
+#define MDIO_B10L_PMA_CTRL	2294	/* 10BASE-T1L PMA control */
+#define MDIO_PMA_10T1L_STAT	2295	/* 10BASE-T1L PMA status */
+
+/* PCS registers */
+#define MDIO_PCS_EEE_ABLE	20	/* EEE Capability register */
+#define MDIO_PCS_EEE_ABLE2	21	/* EEE Capability register 2 */
+#define MDIO_PCS_EEE_WK_ERR	22	/* EEE wake error counter */
 #define MDIO_PCS_10GBX_STAT1	24	/* 10GBASE-X PCS status 1 */
 #define MDIO_PCS_10GBRT_STAT1	32	/* 10GBASE-R/-T PCS status 1 */
 #define MDIO_PCS_10GBRT_STAT2	33	/* 10GBASE-R/-T PCS status 2 */
+#define MDIO_PCS_10T1L_CTRL	2278	/* 10BASE-T1L PCS control */
+
+/* PHY XS registers */
+#define MDIO_PHYXS_LNSTAT	24	/* PHY XGXS lane state */
+
+/* Auto_negotiation registers */
+#define MDIO_AN_ADVERTISE	16	/* AN advertising (base page) */
+#define MDIO_AN_LPA		19	/* AN LP abilities (base page) */
 #define MDIO_AN_10GBT_CTRL	32	/* 10GBASE-T auto-negotiation control */
 #define MDIO_AN_10GBT_STAT	33	/* 10GBASE-T auto-negotiation status */
-#define MDIO_B10L_PMA_CTRL	2294	/* 10BASE-T1L PMA control */
-#define MDIO_PMA_10T1L_STAT	2295	/* 10BASE-T1L PMA status */
-#define MDIO_PCS_10T1L_CTRL	2278	/* 10BASE-T1L PCS control */
-#define MDIO_PMA_PMD_BT1	18	/* BASE-T1 PMA/PMD extended ability */
+#define MDIO_AN_EEE_ADV		60	/* EEE advertisement */
+#define MDIO_AN_EEE_LPABLE	61	/* EEE link partner ability */
+#define MDIO_AN_EEE_ADV2	62	/* EEE advertisement 2 */
+#define MDIO_AN_EEE_LPABLE2	63	/* EEE link partner ability 2 */
+#define MDIO_AN_CTRL2		64	/* AN THP bypass request control */
 #define MDIO_AN_T1_CTRL		512	/* BASE-T1 AN control */
 #define MDIO_AN_T1_STAT		513	/* BASE-T1 AN status */
 #define MDIO_AN_T1_ADV_L	514	/* BASE-T1 AN advertisement register [15:0] */
@@ -79,7 +86,6 @@
 #define MDIO_AN_T1_LP_L		517	/* BASE-T1 AN LP Base Page ability register [15:0] */
 #define MDIO_AN_T1_LP_M		518	/* BASE-T1 AN LP Base Page ability register [31:16] */
 #define MDIO_AN_T1_LP_H		519	/* BASE-T1 AN LP Base Page ability register [47:32] */
-#define MDIO_PMA_PMD_BT1_CTRL	2100	/* BASE-T1 PMA/PMD control register */
 
 /* LASI (Link Alarm Status Interrupt) registers, defined by XENPAK MSA. */
 #define MDIO_PMA_LASI_RXCTRL	0x9000	/* RX_ALARM control */
@@ -89,7 +95,7 @@
 #define MDIO_PMA_LASI_TXSTAT	0x9004	/* TX_ALARM status */
 #define MDIO_PMA_LASI_STAT	0x9005	/* LASI status */
 
-/* Control register 1. */
+/* Generic control 1 register. */
 /* Enable extended speed selection */
 #define MDIO_CTRL1_SPEEDSELEXT		(BMCR_SPEED1000 | BMCR_SPEED100)
 /* All speed selection bits */
@@ -97,15 +103,6 @@
 #define MDIO_CTRL1_FULLDPLX		BMCR_FULLDPLX
 #define MDIO_CTRL1_LPOWER		BMCR_PDOWN
 #define MDIO_CTRL1_RESET		BMCR_RESET
-#define MDIO_PMA_CTRL1_LOOPBACK		0x0001
-#define MDIO_PMA_CTRL1_SPEED1000	BMCR_SPEED1000
-#define MDIO_PMA_CTRL1_SPEED100		BMCR_SPEED100
-#define MDIO_PCS_CTRL1_LOOPBACK		BMCR_LOOPBACK
-#define MDIO_PHYXS_CTRL1_LOOPBACK	BMCR_LOOPBACK
-#define MDIO_AN_CTRL1_RESTART		BMCR_ANRESTART
-#define MDIO_AN_CTRL1_ENABLE		BMCR_ANENABLE
-#define MDIO_AN_CTRL1_XNP		0x2000	/* Enable extended next page */
-#define MDIO_PCS_CTRL1_CLKSTOP_EN	0x400	/* Stop the clock during LPI */
 
 /* 10 Gb/s */
 #define MDIO_CTRL1_SPEED10G		(MDIO_CTRL1_SPEEDSELEXT | 0x00)
@@ -116,10 +113,29 @@
 /* 5 Gb/s */
 #define MDIO_CTRL1_SPEED5G		(MDIO_CTRL1_SPEEDSELEXT | 0x1c)
 
-/* Status register 1. */
+/* PMA/PMD control 1 register. */
+#define MDIO_PMA_CTRL1_LOOPBACK		0x0001
+#define MDIO_PMA_CTRL1_SPEED1000	BMCR_SPEED1000
+#define MDIO_PMA_CTRL1_SPEED100		BMCR_SPEED100
+
+/* PCS control 1 register. */
+#define MDIO_PCS_CTRL1_LOOPBACK		BMCR_LOOPBACK
+#define MDIO_PCS_CTRL1_CLKSTOP_EN	0x400	/* Stop the clock during LPI */
+
+/* PHY XS control 1 register. */
+#define MDIO_PHYXS_CTRL1_LOOPBACK	BMCR_LOOPBACK
+
+/* AN control register. */
+#define MDIO_AN_CTRL1_RESTART		BMCR_ANRESTART
+#define MDIO_AN_CTRL1_ENABLE		BMCR_ANENABLE
+#define MDIO_AN_CTRL1_XNP		0x2000	/* Enable extended next page */
+
+/* Generic status 1 register. */
 #define MDIO_STAT1_LPOWERABLE		0x0002	/* Low-power ability */
 #define MDIO_STAT1_LSTATUS		BMSR_LSTATUS
 #define MDIO_STAT1_FAULT		0x0080	/* Fault */
+
+/* AN status register. */
 #define MDIO_AN_STAT1_LPABLE		0x0001	/* Link partner AN ability */
 #define MDIO_AN_STAT1_ABLE		BMSR_ANEGCAPABLE
 #define MDIO_AN_STAT1_RFAULT		BMSR_RFAULT
@@ -127,13 +143,17 @@
 #define MDIO_AN_STAT1_PAGE		0x0040	/* Page received */
 #define MDIO_AN_STAT1_XNP		0x0080	/* Extended next page status */
 
-/* Speed register. */
+/* Generic Speed register. */
 #define MDIO_SPEED_10G			0x0001	/* 10G capable */
+
+/* PMA/PMD Speed register. */
 #define MDIO_PMA_SPEED_2B		0x0002	/* 2BASE-TL capable */
 #define MDIO_PMA_SPEED_10P		0x0004	/* 10PASS-TS capable */
 #define MDIO_PMA_SPEED_1000		0x0010	/* 1000M capable */
 #define MDIO_PMA_SPEED_100		0x0020	/* 100M capable */
 #define MDIO_PMA_SPEED_10		0x0040	/* 10M capable */
+
+/* PCS et al. Speed register. */
 #define MDIO_PCS_SPEED_10P2B		0x0002	/* 10PASS-TS/2BASE-TL capable */
 #define MDIO_PCS_SPEED_2_5G		0x0040	/* 2.5G capable */
 #define MDIO_PCS_SPEED_5G		0x0080	/* 5G capable */
@@ -152,7 +172,7 @@
 #define MDIO_DEVS_VEND1			MDIO_DEVS_PRESENT(MDIO_MMD_VEND1)
 #define MDIO_DEVS_VEND2			MDIO_DEVS_PRESENT(MDIO_MMD_VEND2)
 
-/* Control register 2. */
+/* PMA/PMD control 2 register. */
 #define MDIO_PMA_CTRL2_TYPE		0x000f	/* PMA/PMD type selection */
 #define MDIO_PMA_CTRL2_10GBCX4		0x0000	/* 10GBASE-CX4 type */
 #define MDIO_PMA_CTRL2_10GBEW		0x0001	/* 10GBASE-EW type */
@@ -173,17 +193,21 @@
 #define MDIO_PMA_CTRL2_2_5GBT		0x0030  /* 2.5GBaseT type */
 #define MDIO_PMA_CTRL2_5GBT		0x0031  /* 5GBaseT type */
 #define MDIO_PMA_CTRL2_BASET1		0x003D  /* BASE-T1 type */
+
+/* PCS control 2 register. */
 #define MDIO_PCS_CTRL2_TYPE		0x0003	/* PCS type selection */
 #define MDIO_PCS_CTRL2_10GBR		0x0000	/* 10GBASE-R type */
 #define MDIO_PCS_CTRL2_10GBX		0x0001	/* 10GBASE-X type */
 #define MDIO_PCS_CTRL2_10GBW		0x0002	/* 10GBASE-W type */
 #define MDIO_PCS_CTRL2_10GBT		0x0003	/* 10GBASE-T type */
 
-/* Status register 2. */
+/* Generic status 2 register. */
 #define MDIO_STAT2_RXFAULT		0x0400	/* Receive fault */
 #define MDIO_STAT2_TXFAULT		0x0800	/* Transmit fault */
 #define MDIO_STAT2_DEVPRST		0xc000	/* Device present */
 #define MDIO_STAT2_DEVPRST_VAL		0x8000	/* Device present value */
+
+/* PMA/PMD status 2 register */
 #define MDIO_PMA_STAT2_LBABLE		0x0001	/* PMA loopback ability */
 #define MDIO_PMA_STAT2_10GBEW		0x0002	/* 10GBASE-EW ability */
 #define MDIO_PMA_STAT2_10GBLW		0x0004	/* 10GBASE-LW ability */
@@ -196,27 +220,29 @@
 #define MDIO_PMA_STAT2_EXTABLE		0x0200	/* Extended abilities */
 #define MDIO_PMA_STAT2_RXFLTABLE	0x1000	/* Receive fault ability */
 #define MDIO_PMA_STAT2_TXFLTABLE	0x2000	/* Transmit fault ability */
+
+/* PCS status 2 register */
 #define MDIO_PCS_STAT2_10GBR		0x0001	/* 10GBASE-R capable */
 #define MDIO_PCS_STAT2_10GBX		0x0002	/* 10GBASE-X capable */
 #define MDIO_PCS_STAT2_10GBW		0x0004	/* 10GBASE-W capable */
 #define MDIO_PCS_STAT2_RXFLTABLE	0x1000	/* Receive fault ability */
 #define MDIO_PCS_STAT2_TXFLTABLE	0x2000	/* Transmit fault ability */
 
-/* Transmit disable register. */
+/* PMD Transmit disable register. */
 #define MDIO_PMD_TXDIS_GLOBAL		0x0001	/* Global PMD TX disable */
 #define MDIO_PMD_TXDIS_0		0x0002	/* PMD TX disable 0 */
 #define MDIO_PMD_TXDIS_1		0x0004	/* PMD TX disable 1 */
 #define MDIO_PMD_TXDIS_2		0x0008	/* PMD TX disable 2 */
 #define MDIO_PMD_TXDIS_3		0x0010	/* PMD TX disable 3 */
 
-/* Receive signal detect register. */
+/* PMD receive signal detect register. */
 #define MDIO_PMD_RXDET_GLOBAL		0x0001	/* Global PMD RX signal detect */
 #define MDIO_PMD_RXDET_0		0x0002	/* PMD RX signal detect 0 */
 #define MDIO_PMD_RXDET_1		0x0004	/* PMD RX signal detect 1 */
 #define MDIO_PMD_RXDET_2		0x0008	/* PMD RX signal detect 2 */
 #define MDIO_PMD_RXDET_3		0x0010	/* PMD RX signal detect 3 */
 
-/* Extended abilities register. */
+/* PMA/PMD extended ability register. */
 #define MDIO_PMA_EXTABLE_10GCX4		0x0001	/* 10GBASE-CX4 ability */
 #define MDIO_PMA_EXTABLE_10GBLRM	0x0002	/* 10GBASE-LRM ability */
 #define MDIO_PMA_EXTABLE_10GBT		0x0004	/* 10GBASE-T ability */
@@ -229,7 +255,7 @@
 #define MDIO_PMA_EXTABLE_BT1		0x0800	/* BASE-T1 ability */
 #define MDIO_PMA_EXTABLE_NBT		0x4000  /* 2.5/5GBASE-T ability */
 
-/* PHY XGXS lane state register. */
+/* 10G PHY XGXS lane status register. */
 #define MDIO_PHYXS_LNSTAT_SYNC0		0x0001
 #define MDIO_PHYXS_LNSTAT_SYNC1		0x0002
 #define MDIO_PHYXS_LNSTAT_SYNC2		0x0004
-- 
2.35.1.1320.gc452695387.dirty


^ permalink raw reply related

* [PATCH net-next v4 3/4] net: mdio: Update speed register bits
From: Sean Anderson @ 2022-12-16 16:48 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit, netdev, Russell King
  Cc: Paolo Abeni, Vladimir Oltean, linux-kernel, Jakub Kicinski,
	Tim Harvey, David S . Miller, Eric Dumazet, Sean Anderson
In-Reply-To: <20221216164851.2932043-1-sean.anderson@seco.com>

This updates the speed register bits to the 2018 revision of 802.3.

Signed-off-by: Sean Anderson <sean.anderson@seco.com>
---

(no changes since v3)

Changes in v3:
- New

 include/uapi/linux/mdio.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/uapi/linux/mdio.h b/include/uapi/linux/mdio.h
index 14b779a8577b..490466f9a5c5 100644
--- a/include/uapi/linux/mdio.h
+++ b/include/uapi/linux/mdio.h
@@ -147,6 +147,7 @@
 #define MDIO_SPEED_10G			0x0001	/* 10G capable */
 
 /* PMA/PMD Speed register. */
+#define MDIO_PMA_SPEED_10G		MDIO_SPEED_10G
 #define MDIO_PMA_SPEED_2B		0x0002	/* 2BASE-TL capable */
 #define MDIO_PMA_SPEED_10P		0x0004	/* 10PASS-TS capable */
 #define MDIO_PMA_SPEED_1000		0x0010	/* 1000M capable */
@@ -154,9 +155,15 @@
 #define MDIO_PMA_SPEED_10		0x0040	/* 10M capable */
 
 /* PCS et al. Speed register. */
+#define MDIO_PCS_SPEED_10G		MDIO_SPEED_10G
 #define MDIO_PCS_SPEED_10P2B		0x0002	/* 10PASS-TS/2BASE-TL capable */
+#define MDIO_PCS_SPEED_40G		0x0004  /* 450G capable */
+#define MDIO_PCS_SPEED_100G		0x0008  /* 100G capable */
+#define MDIO_PCS_SPEED_25G		0x0010  /* 25G capable */
 #define MDIO_PCS_SPEED_2_5G		0x0040	/* 2.5G capable */
 #define MDIO_PCS_SPEED_5G		0x0080	/* 5G capable */
+#define MDIO_PCS_SPEED_200G		0x0100  /* 200G capable */
+#define MDIO_PCS_SPEED_400G		0x0200  /* 400G capable */
 
 /* Device present registers. */
 #define MDIO_DEVS_PRESENT(devad)	(1 << (devad))
-- 
2.35.1.1320.gc452695387.dirty


^ permalink raw reply related

* [PATCH net-next v4 4/4] phy: aquantia: Determine rate adaptation support from registers
From: Sean Anderson @ 2022-12-16 16:48 UTC (permalink / raw)
  To: Andrew Lunn, Heiner Kallweit, netdev, Russell King
  Cc: Paolo Abeni, Vladimir Oltean, linux-kernel, Jakub Kicinski,
	Tim Harvey, David S . Miller, Eric Dumazet, Sean Anderson
In-Reply-To: <20221216164851.2932043-1-sean.anderson@seco.com>

When autonegotiation completes, the phy interface will be set based on
the global config register for that speed. If the SERDES mode is set to
something which the MAC does not support, then the link will not come
up. To avoid this, validate each combination of interface speed and link
speed which might be configured. This way, we ensure that we only
consider rate adaptation in our advertisement when we can actually use
it.

The API for get_rate_matching requires that PHY_INTERFACE_MODE_NA be
handled properly. To do this, we adopt a structure similar to
phylink_validate. At the top-level, we either validate a particular
interface speed or all of them. Below that, we validate each combination
of serdes speed and link speed.

For some firmwares, not all speeds are supported. In this case, the
global config register for that speed will be initialized to zero
(indicating that rate adaptation is not supported). We can detect this
by reading the PMA/PMD speed register to determine which speeds are
supported. This register is read once in probe and cached for later.

Signed-off-by: Sean Anderson <sean.anderson@seco.com>
---
This commit fixes 3c42563b3041 ("net: phy: aquantia: Add support for
rate matching"). In an effort to avoid backporting of this commit until
it has soaked in master for a while, the fixes tag has been left off.

Changes in v4:
- Fix kerneldoc using - instead of : for parameters

Changes in v3:
- Fix incorrect bits for PMA/PMD speed

Changes in v2:
- Rework to just validate things instead of modifying registers

 drivers/net/phy/aquantia_main.c | 160 ++++++++++++++++++++++++++++++--
 1 file changed, 154 insertions(+), 6 deletions(-)

diff --git a/drivers/net/phy/aquantia_main.c b/drivers/net/phy/aquantia_main.c
index 334a6904ca5a..e942b99be823 100644
--- a/drivers/net/phy/aquantia_main.c
+++ b/drivers/net/phy/aquantia_main.c
@@ -111,6 +111,12 @@
 #define VEND1_GLOBAL_CFG_RATE_ADAPT_NONE	0
 #define VEND1_GLOBAL_CFG_RATE_ADAPT_USX		1
 #define VEND1_GLOBAL_CFG_RATE_ADAPT_PAUSE	2
+#define VEND1_GLOBAL_CFG_SERDES_MODE		GENMASK(2, 0)
+#define VEND1_GLOBAL_CFG_SERDES_MODE_XFI	0
+#define VEND1_GLOBAL_CFG_SERDES_MODE_SGMII	3
+#define VEND1_GLOBAL_CFG_SERDES_MODE_OCSGMII	4
+#define VEND1_GLOBAL_CFG_SERDES_MODE_XFI5G	6
+#define VEND1_GLOBAL_CFG_SERDES_MODE_XFI20G	7
 
 #define VEND1_GLOBAL_RSVD_STAT1			0xc885
 #define VEND1_GLOBAL_RSVD_STAT1_FW_BUILD_ID	GENMASK(7, 4)
@@ -175,6 +181,7 @@ static const struct aqr107_hw_stat aqr107_hw_stats[] = {
 
 struct aqr107_priv {
 	u64 sgmii_stats[AQR107_SGMII_STAT_SZ];
+	int pmapmd_speeds;
 };
 
 static int aqr107_get_sset_count(struct phy_device *phydev)
@@ -677,13 +684,146 @@ static int aqr107_wait_processor_intensive_op(struct phy_device *phydev)
 	return 0;
 }
 
+/**
+ * struct aqr107_link_speed_cfg - Common configuration for link speeds
+ * @speed: The speed of this config
+ * @reg: The global system configuration register for this speed
+ * @speed_bit: The bit in the PMA/PMD speed ability register which determines
+ *             whether this link speed is supported
+ */
+struct aqr107_link_speed_cfg {
+	int speed;
+	u16 reg, speed_bit;
+};
+
+/**
+ * aqr107_rate_adapt_ok_one() - Validate rate adaptation for one configuration
+ * @phydev: The phy to act on
+ * @serdes_speed: The speed of the serdes (aka the phy interface)
+ * @link_cfg: The config for the link speed
+ *
+ * This function validates whether rate adaptation will work for a particular
+ * combination of @serdes_speed and @link_cfg.
+ *
+ * Return: %true if the @link_cfg.reg is configured for rate adaptation, %true
+ *         if @link_cfg.speed will not be advertised, %false otherwise.
+ */
+static bool aqr107_rate_adapt_ok_one(struct phy_device *phydev, int serdes_speed,
+				     const struct aqr107_link_speed_cfg *link_cfg)
+{
+	struct aqr107_priv *priv = phydev->priv;
+	int val;
+
+	phydev_dbg(phydev, "validating link_speed=%d serdes_speed=%d\n",
+		   link_cfg->speed, serdes_speed);
+
+	/* Vacuously OK, since we won't advertise it anyway */
+	if (!(priv->pmapmd_speeds & link_cfg->speed_bit))
+		return true;
+
+	val = phy_read_mmd(phydev, MDIO_MMD_VEND1, link_cfg->reg);
+	if (val < 0) {
+		phydev_warn(phydev, "could not read register %x:%.04x (err = %d)\n",
+			    MDIO_MMD_VEND1, link_cfg->reg, val);
+		return false;
+	}
+
+	phydev_dbg(phydev, "%x:%.04x = %.04x\n", MDIO_MMD_VEND1, link_cfg->reg, val);
+	if (FIELD_GET(VEND1_GLOBAL_CFG_RATE_ADAPT, val) !=
+		VEND1_GLOBAL_CFG_RATE_ADAPT_PAUSE)
+		return false;
+
+	switch (FIELD_GET(VEND1_GLOBAL_CFG_SERDES_MODE, val)) {
+	case VEND1_GLOBAL_CFG_SERDES_MODE_XFI20G:
+		return serdes_speed == SPEED_20000;
+	case VEND1_GLOBAL_CFG_SERDES_MODE_XFI:
+		return serdes_speed == SPEED_10000;
+	case VEND1_GLOBAL_CFG_SERDES_MODE_XFI5G:
+		return serdes_speed == SPEED_5000;
+	case VEND1_GLOBAL_CFG_SERDES_MODE_OCSGMII:
+		return serdes_speed == SPEED_2500;
+	case VEND1_GLOBAL_CFG_SERDES_MODE_SGMII:
+		return serdes_speed == SPEED_1000;
+	default:
+		return false;
+	}
+}
+
+/**
+ * aqr107_rate_adapt_ok() - Validate rate adaptation for an interface speed
+ * @phydev: The phy device
+ * @speed: The serdes (phy interface) speed
+ *
+ * This validates whether rate adaptation will work for a particular @speed.
+ * All link speeds less than or equal to @speed are validate to ensure they are
+ * configured properly.
+ *
+ * Return: %true if rate adaptation is supported for @speed, %false otherwise.
+ */
+static bool aqr107_rate_adapt_ok(struct phy_device *phydev, int speed)
+{
+	static const struct aqr107_link_speed_cfg speed_table[] = {
+		{
+			.speed = SPEED_10,
+			.reg = VEND1_GLOBAL_CFG_10M,
+			.speed_bit = MDIO_PMA_SPEED_10,
+		},
+		{
+			.speed = SPEED_100,
+			.reg = VEND1_GLOBAL_CFG_100M,
+			.speed_bit = MDIO_PMA_SPEED_100,
+		},
+		{
+			.speed = SPEED_1000,
+			.reg = VEND1_GLOBAL_CFG_1G,
+			.speed_bit = MDIO_PMA_SPEED_1000,
+		},
+		{
+			.speed = SPEED_2500,
+			.reg = VEND1_GLOBAL_CFG_2_5G,
+			.speed_bit = MDIO_PMA_SPEED_2_5G,
+		},
+		{
+			.speed = SPEED_5000,
+			.reg = VEND1_GLOBAL_CFG_5G,
+			.speed_bit = MDIO_PMA_SPEED_5G,
+		},
+		{
+			.speed = SPEED_10000,
+			.reg = VEND1_GLOBAL_CFG_10G,
+			.speed_bit = MDIO_PMA_SPEED_10G,
+		},
+	};
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(speed_table) &&
+		    speed_table[i].speed <= speed; i++)
+		if (!aqr107_rate_adapt_ok_one(phydev, speed, &speed_table[i]))
+			return false;
+
+	/* Must match at least one speed */
+	if (i == ARRAY_SIZE(speed_table) && speed != speed_table[i].speed)
+		return false;
+
+	return true;
+}
+
 static int aqr107_get_rate_matching(struct phy_device *phydev,
 				    phy_interface_t iface)
 {
-	if (iface == PHY_INTERFACE_MODE_10GBASER ||
-	    iface == PHY_INTERFACE_MODE_2500BASEX ||
-	    iface == PHY_INTERFACE_MODE_NA)
+	if (iface != PHY_INTERFACE_MODE_NA) {
+		if (aqr107_rate_adapt_ok(phydev,
+					 phy_interface_max_speed(iface)))
+			return RATE_MATCH_PAUSE;
+		else
+			return RATE_MATCH_NONE;
+	}
+
+	if (aqr107_rate_adapt_ok(phydev, SPEED_10000) ||
+	    aqr107_rate_adapt_ok(phydev, SPEED_2500) ||
+	    aqr107_rate_adapt_ok(phydev, SPEED_1000))
 		return RATE_MATCH_PAUSE;
+
 	return RATE_MATCH_NONE;
 }
 
@@ -713,10 +853,18 @@ static int aqr107_resume(struct phy_device *phydev)
 
 static int aqr107_probe(struct phy_device *phydev)
 {
-	phydev->priv = devm_kzalloc(&phydev->mdio.dev,
-				    sizeof(struct aqr107_priv), GFP_KERNEL);
-	if (!phydev->priv)
+	struct aqr107_priv *priv;
+
+	priv = devm_kzalloc(&phydev->mdio.dev, sizeof(*priv), GFP_KERNEL);
+	if (!priv)
 		return -ENOMEM;
+	phydev->priv = priv;
+
+	priv->pmapmd_speeds = phy_read_mmd(phydev, MDIO_MMD_PMAPMD, MDIO_SPEED);
+	if (priv->pmapmd_speeds < 0) {
+		phydev_err(phydev, "could not read PMA/PMD speeds\n");
+		return priv->pmapmd_speeds;
+	};
 
 	return aqr_hwmon_probe(phydev);
 }
-- 
2.35.1.1320.gc452695387.dirty


^ permalink raw reply related

* Re: [PATCH v7 01/11] leds: add support for hardware driven LEDs
From: Russell King (Oracle) @ 2022-12-16 16:51 UTC (permalink / raw)
  To: Christian Marangi
  Cc: Andrew Lunn, Florian Fainelli, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Rob Herring,
	Krzysztof Kozlowski, Jonathan Corbet, Pavel Machek, John Crispin,
	netdev, devicetree, linux-kernel, linux-doc, linux-leds,
	Tim Harvey, Alexander Stein, Rasmus Villemoes, Marek Behún
In-Reply-To: <639ca0a4.050a0220.99395.8fd3@mx.google.com>

On Fri, Dec 16, 2022 at 05:45:25PM +0100, Christian Marangi wrote:
> On Thu, Dec 15, 2022 at 04:13:03PM +0000, Russell King (Oracle) wrote:
> > Hi Christian,
> > 
> > Thanks for the patch.
> > 
> > I think Andrew's email is offline at the moment.
> >
> 
> Notice by gmail spamming me "I CAN'T SEND IT AHHHHH"
> Holidy times I guess?

Sadly, Andrew's email has done this a number of times - and Andrew
used to be on IRC so I could prod him about it, but it seems he
doesn't hang out on IRC anymore. It's been like it a few days now.

> > On Thu, Dec 15, 2022 at 12:54:28AM +0100, Christian Marangi wrote:
> > > @@ -188,6 +213,10 @@ int led_trigger_set(struct led_classdev *led_cdev, struct led_trigger *trig)
> > >  		led_set_brightness(led_cdev, LED_OFF);
> > >  	}
> > >  	if (trig) {
> > > +		/* Make sure the trigger support the LED blink mode */
> > > +		if (!led_trigger_is_supported(led_cdev, trig))
> > > +			return -EINVAL;
> > 
> > Shouldn't validation happen before we start taking any actions? In other
> > words, before we remove the previous trigger?
> > 
> 
> trigger_set first remove any trigger and set the led off. Then apply the
> new trigger. So the validation is done only when a trigger is actually
> applied. Think we should understand the best case here.

I think this is a question that needs to be answered by the LEDs folk,
as it's an interface behaviour / quality of implementation issue.

> > > @@ -350,12 +381,26 @@ static inline bool led_sysfs_is_disabled(struct led_classdev *led_cdev)
> > >  
> > >  #define TRIG_NAME_MAX 50
> > >  
> > > +enum led_trigger_blink_supported_modes {
> > > +	SOFTWARE_ONLY = SOFTWARE_CONTROLLED,
> > > +	HARDWARE_ONLY = HARDWARE_CONTROLLED,
> > > +	SOFTWARE_HARDWARE = SOFTWARE_HARDWARE_CONTROLLED,
> > 
> > I suspect all these generic names are asking for eventual namespace
> > clashes. Maybe prefix them with LED_ ?
> 
> Agree they are pretty generic so I can see why... My only concern was
> making them too long... Maybe reduce them to SW or HW? LEDS_SW_ONLY...
> LEDS_SW_CONTROLLED?

Seems sensible to me - and as a bonus they get shorter than the above!

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply

* Re: [PATCH net-next v4 0/4] phy: aquantia: Determine rate adaptation support from registers
From: Russell King (Oracle) @ 2022-12-16 16:54 UTC (permalink / raw)
  To: Sean Anderson
  Cc: Andrew Lunn, Heiner Kallweit, netdev, Paolo Abeni,
	Vladimir Oltean, linux-kernel, Jakub Kicinski, Tim Harvey,
	David S . Miller, Eric Dumazet
In-Reply-To: <20221216164851.2932043-1-sean.anderson@seco.com>

Hi Sean,

Please note net-next is currently closed due to the merge window, so
please don't send patches for it. However, feel free to send RFC
patches for net-next so that reviews can still happen.

Thanks!

On Fri, Dec 16, 2022 at 11:48:47AM -0500, Sean Anderson wrote:
> This attempts to address the problems first reported in [1]. Tim has an
> Aquantia phy where the firmware is set up to use "5G XFI" (underclocked
> 10GBASE-R) when rate adapting lower speeds. This results in us
> advertising that we support lower speeds and then failing to bring the
> link up. To avoid this, determine whether to enable rate adaptation
> based on what's programmed by the firmware. This is "the worst choice"
> [2], but we can't really do better until we have more insight into
> what the firmware is doing. At the very least, we can prevent bad
> firmware from causing us to advertise the wrong modes.
> 
> Past submissions may be found at [3, 4].
> 
> [1] https://lore.kernel.org/netdev/CAJ+vNU3zeNqiGhjTKE8jRjDYR0D7f=iqPLB8phNyA2CWixy7JA@mail.gmail.com/
> [2] https://lore.kernel.org/netdev/20221118171643.vu6uxbnmog4sna65@skbuf/
> [3] https://lore.kernel.org/netdev/20221114210740.3332937-1-sean.anderson@seco.com/
> [4] https://lore.kernel.org/netdev/20221128195409.100873-1-sean.anderson@seco.com/
> 
> Changes in v4:
> - Reorganize MDIO defines
> - Fix kerneldoc using - instead of : for parameters
> 
> Changes in v3:
> - Update speed register bits
> - Fix incorrect bits for PMA/PMD speed
> 
> Changes in v2:
> - Move/rename phylink_interface_max_speed
> - Rework to just validate things instead of modifying registers
> 
> Sean Anderson (4):
>   net: phy: Move/rename phylink_interface_max_speed
>   phy: mdio: Reorganize defines
>   net: mdio: Update speed register bits
>   phy: aquantia: Determine rate adaptation support from registers
> 
>  drivers/net/phy/aquantia_main.c | 160 ++++++++++++++++++++++++++++++--
>  drivers/net/phy/phy-core.c      |  70 ++++++++++++++
>  drivers/net/phy/phylink.c       |  75 +--------------
>  include/linux/phy.h             |   1 +
>  include/uapi/linux/mdio.h       | 109 ++++++++++++++--------
>  5 files changed, 299 insertions(+), 116 deletions(-)
> 
> -- 
> 2.35.1.1320.gc452695387.dirty
> 
> 

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox