Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH v6] vmxnet3: Add XDP support.
From: William Tu @ 2022-12-16 14:41 UTC (permalink / raw)
  To: netdev; +Cc: tuc, gyang, doshir, alexander.duyck

The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT.

Background:
The vmxnet3 rx consists of three rings: ring0, ring1, and dataring.
For r0 and r1, buffers at r0 are allocated using alloc_skb APIs and dma
mapped to the ring's descriptor. If LRO is enabled and packet size larger
than 3K, VMXNET3_MAX_SKB_BUF_SIZE, then r1 is used to mapped the rest of
the buffer larger than VMXNET3_MAX_SKB_BUF_SIZE. Each buffer in r1 is
allocated using alloc_page. So for LRO packets, the payload will be in one
buffer from r0 and multiple from r1, for non-LRO packets, only one
descriptor in r0 is used for packet size less than 3k.

When receiving a packet, the first descriptor will have the sop (start of
packet) bit set, and the last descriptor will have the eop (end of packet)
bit set. Non-LRO packets will have only one descriptor with both sop and
eop set.

Other than r0 and r1, vmxnet3 dataring is specifically designed for
handling packets with small size, usually 128 bytes, defined in
VMXNET3_DEF_RXDATA_DESC_SIZE, by simply copying the packet from the backend
driver in ESXi to the ring's memory region at front-end vmxnet3 driver, in
order to avoid memory mapping/unmapping overhead. In summary, packet size:
    A. < 128B: use dataring
    B. 128B - 3K: use ring0 (VMXNET3_RX_BUF_SKB)
    C. > 3K: use ring0 and ring1 (VMXNET3_RX_BUF_SKB + VMXNET3_RX_BUF_PAGE)
As a result, the patch adds XDP support for packets using dataring
and r0 (case A and B), not the large packet size when LRO is enabled.

XDP Implementation:
When user loads and XDP prog, vmxnet3 driver checks configurations, such
as mtu, lro, and re-allocate the rx buffer size for reserving the extra
headroom, XDP_PACKET_HEADROOM, for XDP frame. The XDP prog will then be
associated with every rx queue of the device. Note that when using dataring
for small packet size, vmxnet3 (front-end driver) doesn't control the
buffer allocation, as a result the XDP frame's headroom is zero.

The receive side of XDP is implemented for case A and B, by invoking the
bpf program at vmxnet3_rq_rx_complete and handle its returned action.
The new vmxnet3_run_xdp function handles the difference of using dataring
or ring0, and decides the next journey of the packet afterward.

For TX, vmxnet3 has split header design. Outgoing packets are parsed
first and protocol headers (L2/L3/L4) are copied to the backend. The
rest of the payload are dma mapped. Since XDP_TX does not parse the
packet protocol, the entire XDP frame is using dma mapped for the
transmission. Because the actual TX is deferred until txThreshold, it's
possible that an rx buffer is being overwritten by incoming burst of rx
packets, before tx buffer being transmitted. As a result, we allocate new
skb and introduce a copy.

Performance:
Tested using two VMs inside one ESXi machine, using single core on each
vmxnet3 device, sender using DPDK testpmd tx-mode attached to vmxnet3
device, sending 64B or 512B packet.

VM1 txgen:
$ dpdk-testpmd -l 0-3 -n 1 -- -i --nb-cores=3 \
--forward-mode=txonly --eth-peer=0,<mac addr of vm2>
option: add "--txonly-multi-flow"
option: use --txpkts=512 or 64 byte

VM2 running XDP:
$ ./samples/bpf/xdp_rxq_info -d ens160 -a <options> --skb-mode
$ ./samples/bpf/xdp_rxq_info -d ens160 -a <options>
options: XDP_DROP, XDP_PASS, XDP_TX

To test REDIRECT to cpu 0, use
$ ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop

Single core performance comparison with skb-mode.
64B:      skb-mode -> native-mode (with this patch)
XDP_DROP: 932Kpps -> 2.0Mpps
XDP_PASS: 284Kpps -> 314Kpps
XDP_TX:   591Kpps -> 1.8Mpps
REDIRECT: 453Kpps -> 501Kpps

512B:      skb-mode -> native-mode (with this patch)
XDP_DROP: 890Kpps -> 1.3Mpps
XDP_PASS: 284Kpps -> 314Kpps
XDP_TX:   555Kpps -> 1.2Mpps
REDIRECT: 670Kpps -> 430Kpps

Limitations:
a. LRO will be disabled when users load XDP program
b. MTU will be checked and limit to
   VMXNET3_MAX_SKB_BUF_SIZE(3K) - XDP_PACKET_HEADROOM(256) -
   SKB_DATA_ALIGN(sizeof(struct skb_shared_info))

Signed-off-by: William Tu <u9012063@gmail.com>
---
v5 -> v6:
work on feedbacks from Alexander Duyck
- remove unused skb parameter at vmxnet3_xdp_xmit
- add trace point for XDP_ABORTED
- avoid TX packet buffer being overwritten by allocatin
  new skb and memcpy (TX performance drop from 2.3 to 1.8Mpps)
- update some of the performance number and a demo video
  https://youtu.be/I3nx5wQDTXw
- pass VMware internal regression test using non-ENS: vmxnet3v2,
  vmxnet3v3, vmxnet3v4, so remove RFC tag.

v4 -> v5:
- move XDP code to separate file: vmxnet3_xdp.{c, h},
  suggested by Guolin
- expose vmxnet3_rq_create_all and vmxnet3_adjust_rx_ring_size
- more test using samples/bpf/{xdp1, xdp2, xdp_adjust_tail}
- add debug print
- rebase on commit 65e6af6cebe

v3 -> v4:
- code refactoring and improved commit message
- make dataring and non-dataring case clear
- in XDP_PASS, handle xdp.data and xdp.data_end change after
  bpf program executed
- now still working on internal testing
- v4 applied on net-next commit 65e6af6cebef

v2 -> v3:
- code refactoring: move the XDP processing to the front
  of the vmxnet3_rq_rx_complete, and minimize the places
  of changes to existing code.
- Performance improvement over BUF_SKB (512B) due to skipping
  skb allocation when DROP and TX.

v1 -> v2:
- Avoid skb allocation for small packet size (when dataring is used)
- Use rcu_read_lock unlock instead of READ_ONCE
- Peroformance improvement over v1
- Merge xdp drop, tx, pass, and redirect into 1 patch

I tested the patch using below script:
while [ true ]; do
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_DROP --skb-mode
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_DROP
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_PASS --skb-mode
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_PASS
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_TX --skb-mode
timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_TX
timeout 20 ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop
timeout 20 ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e pass
done
---
 drivers/net/vmxnet3/Makefile          |   2 +-
 drivers/net/vmxnet3/vmxnet3_drv.c     |  48 ++-
 drivers/net/vmxnet3/vmxnet3_ethtool.c |  14 +
 drivers/net/vmxnet3/vmxnet3_int.h     |  20 ++
 drivers/net/vmxnet3/vmxnet3_xdp.c     | 458 ++++++++++++++++++++++++++
 drivers/net/vmxnet3/vmxnet3_xdp.h     |  39 +++
 6 files changed, 573 insertions(+), 8 deletions(-)
 create mode 100644 drivers/net/vmxnet3/vmxnet3_xdp.c
 create mode 100644 drivers/net/vmxnet3/vmxnet3_xdp.h

diff --git a/drivers/net/vmxnet3/Makefile b/drivers/net/vmxnet3/Makefile
index a666a88ac1ff..f82870c10205 100644
--- a/drivers/net/vmxnet3/Makefile
+++ b/drivers/net/vmxnet3/Makefile
@@ -32,4 +32,4 @@
 
 obj-$(CONFIG_VMXNET3) += vmxnet3.o
 
-vmxnet3-objs := vmxnet3_drv.o vmxnet3_ethtool.o
+vmxnet3-objs := vmxnet3_drv.o vmxnet3_ethtool.o vmxnet3_xdp.o
diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index d3e7b27eb933..b55fec2ac2bf 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -28,6 +28,7 @@
 #include <net/ip6_checksum.h>
 
 #include "vmxnet3_int.h"
+#include "vmxnet3_xdp.h"
 
 char vmxnet3_driver_name[] = "vmxnet3";
 #define VMXNET3_DRIVER_DESC "VMware vmxnet3 virtual NIC driver"
@@ -351,7 +352,6 @@ vmxnet3_unmap_pkt(u32 eop_idx, struct vmxnet3_tx_queue *tq,
 	BUG_ON(VMXNET3_TXDESC_GET_EOP(&(tq->tx_ring.base[eop_idx].txd)) != 1);
 
 	skb = tq->buf_info[eop_idx].skb;
-	BUG_ON(skb == NULL);
 	tq->buf_info[eop_idx].skb = NULL;
 
 	VMXNET3_INC_RING_IDX_ONLY(eop_idx, tq->tx_ring.size);
@@ -592,6 +592,9 @@ vmxnet3_rq_alloc_rx_buf(struct vmxnet3_rx_queue *rq, u32 ring_idx,
 				rbi->skb = __netdev_alloc_skb_ip_align(adapter->netdev,
 								       rbi->len,
 								       GFP_KERNEL);
+				if (adapter->xdp_enabled)
+					skb_reserve(rbi->skb, XDP_PACKET_HEADROOM);
+
 				if (unlikely(rbi->skb == NULL)) {
 					rq->stats.rx_buf_alloc_failure++;
 					break;
@@ -1404,6 +1407,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 	struct Vmxnet3_RxDesc rxCmdDesc;
 	struct Vmxnet3_RxCompDesc rxComp;
 #endif
+	bool need_flush = 0;
+
 	vmxnet3_getRxComp(rcd, &rq->comp_ring.base[rq->comp_ring.next2proc].rcd,
 			  &rxComp);
 	while (rcd->gen == rq->comp_ring.gen) {
@@ -1444,6 +1449,19 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 			goto rcd_done;
 		}
 
+		if (unlikely(rcd->sop && rcd->eop && adapter->xdp_enabled)) {
+			int act = vmxnet3_process_xdp(adapter, rq, rcd, rbi,
+						      rxd, &need_flush);
+			ctx->skb = NULL;
+			switch (act) {
+			case VMXNET3_XDP_TAKEN:
+				goto rcd_done;
+			case VMXNET3_XDP_CONTINUE:
+			default:
+				break;
+			}
+		}
+
 		if (rcd->sop) { /* first buf of the pkt */
 			bool rxDataRingUsed;
 			u16 len;
@@ -1483,6 +1501,10 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 				goto rcd_done;
 			}
 
+			if (adapter->xdp_enabled && !rxDataRingUsed)
+				skb_reserve(new_skb,
+					    XDP_PACKET_HEADROOM);
+
 			if (rxDataRingUsed) {
 				size_t sz;
 
@@ -1730,6 +1752,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
 		vmxnet3_getRxComp(rcd,
 				  &rq->comp_ring.base[rq->comp_ring.next2proc].rcd, &rxComp);
 	}
+	if (need_flush)
+		xdp_do_flush_map();
 
 	return num_pkts;
 }
@@ -1776,6 +1800,7 @@ vmxnet3_rq_cleanup(struct vmxnet3_rx_queue *rq,
 
 	rq->comp_ring.gen = VMXNET3_INIT_GEN;
 	rq->comp_ring.next2proc = 0;
+	rq->xdp_bpf_prog = NULL;
 }
 
 
@@ -1788,7 +1813,6 @@ vmxnet3_rq_cleanup_all(struct vmxnet3_adapter *adapter)
 		vmxnet3_rq_cleanup(&adapter->rx_queue[i], adapter);
 }
 
-
 static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
 			       struct vmxnet3_adapter *adapter)
 {
@@ -1832,6 +1856,8 @@ static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
 	kfree(rq->buf_info[0]);
 	rq->buf_info[0] = NULL;
 	rq->buf_info[1] = NULL;
+
+	vmxnet3_unregister_xdp_rxq(rq);
 }
 
 static void
@@ -1893,6 +1919,10 @@ vmxnet3_rq_init(struct vmxnet3_rx_queue *rq,
 	}
 	vmxnet3_rq_alloc_rx_buf(rq, 1, rq->rx_ring[1].size - 1, adapter);
 
+	/* always register, even if no XDP prog used */
+	if (vmxnet3_register_xdp_rxq(rq, adapter))
+		return -EINVAL;
+
 	/* reset the comp ring */
 	rq->comp_ring.next2proc = 0;
 	memset(rq->comp_ring.base, 0, rq->comp_ring.size *
@@ -1989,7 +2019,7 @@ vmxnet3_rq_create(struct vmxnet3_rx_queue *rq, struct vmxnet3_adapter *adapter)
 }
 
 
-static int
+int
 vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter)
 {
 	int i, err = 0;
@@ -2585,7 +2615,8 @@ vmxnet3_setup_driver_shared(struct vmxnet3_adapter *adapter)
 	if (adapter->netdev->features & NETIF_F_RXCSUM)
 		devRead->misc.uptFeatures |= UPT1_F_RXCSUM;
 
-	if (adapter->netdev->features & NETIF_F_LRO) {
+	if (adapter->netdev->features & NETIF_F_LRO &&
+	    !adapter->xdp_enabled) {
 		devRead->misc.uptFeatures |= UPT1_F_LRO;
 		devRead->misc.maxNumRxSG = cpu_to_le16(1 + MAX_SKB_FRAGS);
 	}
@@ -3026,7 +3057,7 @@ vmxnet3_free_pci_resources(struct vmxnet3_adapter *adapter)
 }
 
 
-static void
+void
 vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
 {
 	size_t sz, i, ring0_size, ring1_size, comp_size;
@@ -3035,7 +3066,8 @@ vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
 		if (adapter->netdev->mtu <= VMXNET3_MAX_SKB_BUF_SIZE -
 					    VMXNET3_MAX_ETH_HDR_SIZE) {
 			adapter->skb_buf_size = adapter->netdev->mtu +
-						VMXNET3_MAX_ETH_HDR_SIZE;
+						VMXNET3_MAX_ETH_HDR_SIZE +
+						vmxnet3_xdp_headroom(adapter);
 			if (adapter->skb_buf_size < VMXNET3_MIN_T0_BUF_SIZE)
 				adapter->skb_buf_size = VMXNET3_MIN_T0_BUF_SIZE;
 
@@ -3563,7 +3595,6 @@ vmxnet3_reset_work(struct work_struct *data)
 	clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
 }
 
-
 static int
 vmxnet3_probe_device(struct pci_dev *pdev,
 		     const struct pci_device_id *id)
@@ -3585,6 +3616,8 @@ vmxnet3_probe_device(struct pci_dev *pdev,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 		.ndo_poll_controller = vmxnet3_netpoll,
 #endif
+		.ndo_bpf = vmxnet3_xdp,
+		.ndo_xdp_xmit = vmxnet3_xdp_xmit,
 	};
 	int err;
 	u32 ver;
@@ -3900,6 +3933,7 @@ vmxnet3_probe_device(struct pci_dev *pdev,
 		goto err_register;
 	}
 
+	adapter->xdp_enabled = false;
 	vmxnet3_check_link(adapter, false);
 	return 0;
 
diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c b/drivers/net/vmxnet3/vmxnet3_ethtool.c
index 18cf7c723201..6f542236b26e 100644
--- a/drivers/net/vmxnet3/vmxnet3_ethtool.c
+++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
@@ -76,6 +76,10 @@ vmxnet3_tq_driver_stats[] = {
 					 copy_skb_header) },
 	{ "  giant hdr",	offsetof(struct vmxnet3_tq_driver_stats,
 					 oversized_hdr) },
+	{ "  xdp xmit",		offsetof(struct vmxnet3_tq_driver_stats,
+					 xdp_xmit) },
+	{ "  xdp xmit err",	offsetof(struct vmxnet3_tq_driver_stats,
+					 xdp_xmit_err) },
 };
 
 /* per rq stats maintained by the device */
@@ -106,6 +110,16 @@ vmxnet3_rq_driver_stats[] = {
 					 drop_fcs) },
 	{ "  rx buf alloc fail", offsetof(struct vmxnet3_rq_driver_stats,
 					  rx_buf_alloc_failure) },
+	{ "     xdp packets", offsetof(struct vmxnet3_rq_driver_stats,
+				       xdp_packets) },
+	{ "     xdp tx", offsetof(struct vmxnet3_rq_driver_stats,
+				  xdp_tx) },
+	{ "     xdp redirects", offsetof(struct vmxnet3_rq_driver_stats,
+					 xdp_redirects) },
+	{ "     xdp drops", offsetof(struct vmxnet3_rq_driver_stats,
+				     xdp_drops) },
+	{ "     xdp aborted", offsetof(struct vmxnet3_rq_driver_stats,
+				       xdp_aborted) },
 };
 
 /* global stats maintained by the driver */
diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
index 3367db23aa13..5cf4033930d8 100644
--- a/drivers/net/vmxnet3/vmxnet3_int.h
+++ b/drivers/net/vmxnet3/vmxnet3_int.h
@@ -56,6 +56,8 @@
 #include <linux/if_arp.h>
 #include <linux/inetdevice.h>
 #include <linux/log2.h>
+#include <linux/bpf.h>
+#include <linux/skbuff.h>
 
 #include "vmxnet3_defs.h"
 
@@ -217,6 +219,9 @@ struct vmxnet3_tq_driver_stats {
 	u64 linearized;         /* # of pkts linearized */
 	u64 copy_skb_header;    /* # of times we have to copy skb header */
 	u64 oversized_hdr;
+
+	u64 xdp_xmit;
+	u64 xdp_xmit_err;
 };
 
 struct vmxnet3_tx_ctx {
@@ -285,6 +290,12 @@ struct vmxnet3_rq_driver_stats {
 	u64 drop_err;
 	u64 drop_fcs;
 	u64 rx_buf_alloc_failure;
+
+	u64 xdp_packets;	/* Total packets processed by XDP. */
+	u64 xdp_tx;
+	u64 xdp_redirects;
+	u64 xdp_drops;
+	u64 xdp_aborted;
 };
 
 struct vmxnet3_rx_data_ring {
@@ -307,6 +318,8 @@ struct vmxnet3_rx_queue {
 	struct vmxnet3_rx_buf_info     *buf_info[2];
 	struct Vmxnet3_RxQueueCtrl            *shared;
 	struct vmxnet3_rq_driver_stats  stats;
+	struct bpf_prog __rcu *xdp_bpf_prog;
+	struct xdp_rxq_info xdp_rxq;
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));
 
 #define VMXNET3_DEVICE_MAX_TX_QUEUES 32
@@ -415,6 +428,7 @@ struct vmxnet3_adapter {
 	u16    tx_prod_offset;
 	u16    rx_prod_offset;
 	u16    rx_prod2_offset;
+	bool   xdp_enabled;
 };
 
 #define VMXNET3_WRITE_BAR0_REG(adapter, reg, val)  \
@@ -490,6 +504,12 @@ vmxnet3_tq_destroy_all(struct vmxnet3_adapter *adapter);
 void
 vmxnet3_rq_destroy_all(struct vmxnet3_adapter *adapter);
 
+int
+vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter);
+
+void
+vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter);
+
 netdev_features_t
 vmxnet3_fix_features(struct net_device *netdev, netdev_features_t features);
 
diff --git a/drivers/net/vmxnet3/vmxnet3_xdp.c b/drivers/net/vmxnet3/vmxnet3_xdp.c
new file mode 100644
index 000000000000..afb2d43b5464
--- /dev/null
+++ b/drivers/net/vmxnet3/vmxnet3_xdp.c
@@ -0,0 +1,458 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Linux driver for VMware's vmxnet3 ethernet NIC.
+ * Copyright (C) 2008-2023, VMware, Inc. All Rights Reserved.
+ * Maintained by: pv-drivers@vmware.com
+ *
+ */
+
+#include "vmxnet3_int.h"
+#include "vmxnet3_xdp.h"
+
+static void
+vmxnet3_xdp_exchange_program(struct vmxnet3_adapter *adapter,
+			     struct bpf_prog *prog)
+{
+	struct vmxnet3_rx_queue *rq;
+	int i;
+
+	for (i = 0; i < adapter->num_rx_queues; i++) {
+		rq = &adapter->rx_queue[i];
+		rcu_assign_pointer(rq->xdp_bpf_prog, prog);
+	}
+	if (prog)
+		adapter->xdp_enabled = true;
+	else
+		adapter->xdp_enabled = false;
+}
+
+static int
+vmxnet3_xdp_set(struct net_device *netdev, struct netdev_bpf *bpf,
+		struct netlink_ext_ack *extack)
+{
+	struct vmxnet3_adapter *adapter = netdev_priv(netdev);
+	struct bpf_prog *new_bpf_prog = bpf->prog;
+	struct bpf_prog *old_bpf_prog;
+	bool need_update;
+	bool running;
+	int err = 0;
+
+	if (new_bpf_prog && netdev->mtu > VMXNET3_XDP_MAX_MTU) {
+		NL_SET_ERR_MSG_MOD(extack, "MTU too large for XDP");
+		return -EOPNOTSUPP;
+	}
+
+	old_bpf_prog = READ_ONCE(adapter->rx_queue[0].xdp_bpf_prog);
+	if (!new_bpf_prog && !old_bpf_prog) {
+		adapter->xdp_enabled = false;
+		return 0;
+	}
+	running = netif_running(netdev);
+	need_update = !!old_bpf_prog != !!new_bpf_prog;
+
+	if (running && need_update)
+		vmxnet3_quiesce_dev(adapter);
+
+	vmxnet3_xdp_exchange_program(adapter, new_bpf_prog);
+	if (old_bpf_prog)
+		bpf_prog_put(old_bpf_prog);
+
+	if (running && need_update) {
+		vmxnet3_reset_dev(adapter);
+		vmxnet3_rq_destroy_all(adapter);
+		vmxnet3_adjust_rx_ring_size(adapter);
+		err = vmxnet3_rq_create_all(adapter);
+		if (err) {
+			NL_SET_ERR_MSG_MOD(extack,
+				"failed to re-create rx queues for XDP.");
+			err = -EOPNOTSUPP;
+			goto out;
+		}
+		err = vmxnet3_activate_dev(adapter);
+		if (err) {
+			NL_SET_ERR_MSG_MOD(extack,
+				"failed to activate device for XDP.");
+			err = -EOPNOTSUPP;
+			goto out;
+		}
+		clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
+	}
+out:
+	return err;
+}
+
+/* This is the main xdp call used by kernel to set/unset eBPF program. */
+int
+vmxnet3_xdp(struct net_device *netdev, struct netdev_bpf *bpf)
+{
+	switch (bpf->command) {
+	case XDP_SETUP_PROG:
+		netdev_dbg(netdev, "XDP: set program to ");
+		return vmxnet3_xdp_set(netdev, bpf, bpf->extack);
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+int
+vmxnet3_xdp_headroom(struct vmxnet3_adapter *adapter)
+{
+	if (adapter->xdp_enabled)
+		return VMXNET3_XDP_ROOM;
+	else
+		return 0;
+}
+
+void
+vmxnet3_unregister_xdp_rxq(struct vmxnet3_rx_queue *rq)
+{
+	xdp_rxq_info_unreg_mem_model(&rq->xdp_rxq);
+	xdp_rxq_info_unreg(&rq->xdp_rxq);
+}
+
+int
+vmxnet3_register_xdp_rxq(struct vmxnet3_rx_queue *rq,
+			 struct vmxnet3_adapter *adapter)
+{
+	int err;
+
+	err = xdp_rxq_info_reg(&rq->xdp_rxq, adapter->netdev, rq->qid, 0);
+	if (err < 0)
+		return err;
+
+	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, MEM_TYPE_PAGE_SHARED,
+					 NULL);
+	if (err < 0) {
+		xdp_rxq_info_unreg(&rq->xdp_rxq);
+		return err;
+	}
+	return 0;
+}
+
+static int
+vmxnet3_xdp_xmit_frame(struct vmxnet3_adapter *adapter,
+		       struct xdp_frame *xdpf,
+		       struct vmxnet3_tx_queue *tq)
+{
+	struct vmxnet3_tx_buf_info *tbi = NULL;
+	union Vmxnet3_GenericDesc *gdesc;
+	struct vmxnet3_tx_ctx ctx;
+	int tx_num_deferred;
+	struct sk_buff *skb;
+	u32 buf_size;
+	int ret = 0;
+	u32 dw2;
+
+	if (vmxnet3_cmd_ring_desc_avail(&tq->tx_ring) == 0) {
+		tq->stats.tx_ring_full++;
+		ret = -ENOMEM;
+		goto exit;
+	}
+
+	skb = __netdev_alloc_skb(adapter->netdev, xdpf->len, GFP_KERNEL);
+	if (unlikely(!skb))
+		goto exit;
+
+	memcpy(skb->data, xdpf->data, xdpf->len);
+	dw2 = (tq->tx_ring.gen ^ 0x1) << VMXNET3_TXD_GEN_SHIFT;
+	dw2 |= xdpf->len;
+	ctx.sop_txd = tq->tx_ring.base + tq->tx_ring.next2fill;
+	gdesc = ctx.sop_txd;
+
+	buf_size = xdpf->len;
+	tbi = tq->buf_info + tq->tx_ring.next2fill;
+	tbi->map_type = VMXNET3_MAP_SINGLE;
+	tbi->dma_addr = dma_map_single(&adapter->pdev->dev,
+				       skb->data, buf_size,
+				       DMA_TO_DEVICE);
+	if (dma_mapping_error(&adapter->pdev->dev, tbi->dma_addr)) {
+		ret = -EFAULT;
+		goto exit;
+	}
+	tbi->len = buf_size;
+
+	gdesc = tq->tx_ring.base + tq->tx_ring.next2fill;
+	WARN_ON_ONCE(gdesc->txd.gen == tq->tx_ring.gen);
+
+	gdesc->txd.addr = cpu_to_le64(tbi->dma_addr);
+	gdesc->dword[2] = cpu_to_le32(dw2);
+
+	/* Setup the EOP desc */
+	gdesc->dword[3] = cpu_to_le32(VMXNET3_TXD_CQ | VMXNET3_TXD_EOP);
+
+	gdesc->txd.om = 0;
+	gdesc->txd.msscof = 0;
+	gdesc->txd.hlen = 0;
+	gdesc->txd.ti = 0;
+
+	tx_num_deferred = le32_to_cpu(tq->shared->txNumDeferred);
+	tq->shared->txNumDeferred += 1;
+	tx_num_deferred++;
+
+	vmxnet3_cmd_ring_adv_next2fill(&tq->tx_ring);
+
+	/* set the last buf_info for the pkt */
+	tbi->skb = skb;
+	tbi->sop_idx = ctx.sop_txd - tq->tx_ring.base;
+
+	dma_wmb();
+	gdesc->dword[2] = cpu_to_le32(le32_to_cpu(gdesc->dword[2]) ^
+						  VMXNET3_TXD_GEN);
+	if (tx_num_deferred >= le32_to_cpu(tq->shared->txThreshold)) {
+		tq->shared->txNumDeferred = 0;
+		VMXNET3_WRITE_BAR0_REG(adapter,
+				       VMXNET3_REG_TXPROD + tq->qid * 8,
+				       tq->tx_ring.next2fill);
+	}
+exit:
+	return ret;
+}
+
+static int
+vmxnet3_xdp_xmit_back(struct vmxnet3_adapter *adapter,
+		      struct xdp_frame *xdpf)
+{
+	struct vmxnet3_tx_queue *tq;
+	struct netdev_queue *nq;
+	int err = 0, cpu;
+	int tq_number;
+
+	tq_number = adapter->num_tx_queues;
+	cpu = smp_processor_id();
+	tq = &adapter->tx_queue[cpu % tq_number];
+	if (tq->stopped)
+		return -ENETDOWN;
+
+	nq = netdev_get_tx_queue(adapter->netdev, tq->qid);
+
+	__netif_tx_lock(nq, cpu);
+	err = vmxnet3_xdp_xmit_frame(adapter, xdpf, tq);
+	if (err)
+		goto exit;
+
+exit:
+	__netif_tx_unlock(nq);
+	return err;
+}
+
+int
+vmxnet3_xdp_xmit(struct net_device *dev,
+		 int n, struct xdp_frame **frames, u32 flags)
+{
+	struct vmxnet3_adapter *adapter;
+	struct vmxnet3_tx_queue *tq;
+	struct netdev_queue *nq;
+	int i, err, cpu;
+	int nxmit = 0;
+	int tq_number;
+
+	adapter = netdev_priv(dev);
+
+	if (unlikely(test_bit(VMXNET3_STATE_BIT_QUIESCED, &adapter->state)))
+		return -ENETDOWN;
+	if (unlikely(test_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state)))
+		return -EINVAL;
+
+	tq_number = adapter->num_tx_queues;
+	cpu = smp_processor_id();
+	tq = &adapter->tx_queue[cpu % tq_number];
+	if (tq->stopped)
+		return -ENETDOWN;
+
+	nq = netdev_get_tx_queue(adapter->netdev, tq->qid);
+
+	__netif_tx_lock(nq, cpu);
+	for (i = 0; i < n; i++) {
+		err = vmxnet3_xdp_xmit_frame(adapter, frames[i], tq);
+		if (err) {
+			tq->stats.xdp_xmit_err++;
+			break;
+		}
+		nxmit++;
+	}
+
+	tq->stats.xdp_xmit += nxmit;
+	__netif_tx_unlock(nq);
+
+	return nxmit;
+}
+
+static int
+__vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, void *data, int data_len,
+		  int headroom, int frame_sz, bool *need_xdp_flush,
+		  struct sk_buff *skb)
+{
+	struct xdp_frame *xdpf;
+	void *buf_hard_start;
+	struct xdp_buff xdp;
+	struct page *page;
+	void *orig_data;
+	int err, delta;
+	int delta_len;
+	u32 act;
+
+	buf_hard_start = data;
+	xdp_init_buff(&xdp, frame_sz, &rq->xdp_rxq);
+	xdp_prepare_buff(&xdp, buf_hard_start, headroom, data_len, true);
+	orig_data = xdp.data;
+
+	act = bpf_prog_run_xdp(rq->xdp_bpf_prog, &xdp);
+	rq->stats.xdp_packets++;
+
+	switch (act) {
+	case XDP_DROP:
+		rq->stats.xdp_drops++;
+		break;
+	case XDP_PASS:
+		/* bpf prog might change len and data position.
+		 * dataring does not use skb so not support this.
+		 */
+		delta = xdp.data - orig_data;
+		delta_len = (xdp.data_end - xdp.data) - data_len;
+		if (skb) {
+			skb_reserve(skb, delta);
+			skb_put(skb, delta_len);
+		}
+		break;
+	case XDP_TX:
+		xdpf = xdp_convert_buff_to_frame(&xdp);
+		if (!xdpf ||
+		    vmxnet3_xdp_xmit_back(rq->adapter, xdpf)) {
+			rq->stats.xdp_drops++;
+		} else {
+			rq->stats.xdp_tx++;
+		}
+		break;
+	case XDP_ABORTED:
+		trace_xdp_exception(rq->adapter->netdev, rq->xdp_bpf_prog,
+				    act);
+		rq->stats.xdp_aborted++;
+		break;
+	case XDP_REDIRECT:
+		page = alloc_page(GFP_ATOMIC);
+		if (!page) {
+			rq->stats.rx_buf_alloc_failure++;
+			return XDP_DROP;
+		}
+		xdp_init_buff(&xdp, PAGE_SIZE, &rq->xdp_rxq);
+		xdp_prepare_buff(&xdp, page_address(page),
+				 XDP_PACKET_HEADROOM,
+				 data_len, true);
+		memcpy(xdp.data, data, data_len);
+		err = xdp_do_redirect(rq->adapter->netdev, &xdp,
+				      rq->xdp_bpf_prog);
+		if (!err) {
+			rq->stats.xdp_redirects++;
+		} else {
+			__free_page(page);
+			rq->stats.xdp_drops++;
+		}
+		*need_xdp_flush = true;
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(rq->adapter->netdev,
+					    rq->xdp_bpf_prog, act);
+		break;
+	}
+	return act;
+}
+
+static int
+vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, struct vmxnet3_rx_buf_info *rbi,
+		struct Vmxnet3_RxCompDesc *rcd, bool *need_flush,
+		bool rxDataRingUsed)
+{
+	struct vmxnet3_adapter *adapter;
+	struct ethhdr *ehdr;
+	int act = XDP_PASS;
+	void *data;
+	int sz;
+
+	adapter = rq->adapter;
+	if (rxDataRingUsed) {
+		sz = rcd->rxdIdx * rq->data_ring.desc_size;
+		data = &rq->data_ring.base[sz];
+		ehdr = data;
+		netdev_dbg(adapter->netdev,
+			   "XDP: rxDataRing packet size %d, eth proto 0x%x\n",
+			   rcd->len, ntohs(ehdr->h_proto));
+		act = __vmxnet3_run_xdp(rq, data, rcd->len, 0,
+					rq->data_ring.desc_size, need_flush,
+					NULL);
+	} else {
+		dma_unmap_single(&adapter->pdev->dev,
+				 rbi->dma_addr,
+				 rbi->len,
+				 DMA_FROM_DEVICE);
+		ehdr = (struct ethhdr *)rbi->skb->data;
+		netdev_dbg(adapter->netdev,
+			   "XDP: packet size %d, eth proto 0x%x\n",
+			   rcd->len, ntohs(ehdr->h_proto));
+		act = __vmxnet3_run_xdp(rq,
+					rbi->skb->data - XDP_PACKET_HEADROOM,
+					rcd->len, XDP_PACKET_HEADROOM,
+					rbi->len, need_flush, rbi->skb);
+	}
+	return act;
+}
+
+int
+vmxnet3_process_xdp(struct vmxnet3_adapter *adapter,
+		    struct vmxnet3_rx_queue *rq,
+		    struct Vmxnet3_RxCompDesc *rcd,
+		    struct vmxnet3_rx_buf_info *rbi,
+		    struct Vmxnet3_RxDesc *rxd,
+		    bool *need_flush)
+{
+	struct bpf_prog *xdp_prog;
+	dma_addr_t new_dma_addr;
+	struct sk_buff *new_skb;
+	bool rxDataRingUsed;
+	int ret, act;
+
+	ret = VMXNET3_XDP_CONTINUE;
+	if (unlikely(rcd->len == 0))
+		return VMXNET3_XDP_TAKEN;
+
+	rxDataRingUsed = VMXNET3_RX_DATA_RING(adapter, rcd->rqID);
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(rq->xdp_bpf_prog);
+	if (!xdp_prog) {
+		rcu_read_unlock();
+		return VMXNET3_XDP_CONTINUE;
+	}
+	act = vmxnet3_run_xdp(rq, rbi, rcd, need_flush, rxDataRingUsed);
+	rcu_read_unlock();
+
+	switch (act) {
+	case XDP_PASS:
+		ret = VMXNET3_XDP_CONTINUE;
+		break;
+	case XDP_DROP:
+	case XDP_TX:
+	case XDP_REDIRECT:
+	case XDP_ABORTED:
+	default:
+		/* Reuse and remap the existing buffer. */
+		ret = VMXNET3_XDP_TAKEN;
+		if (rxDataRingUsed)
+			return ret;
+
+		new_skb = rbi->skb;
+		new_dma_addr =
+			dma_map_single(&adapter->pdev->dev,
+				       new_skb->data, rbi->len,
+				       DMA_FROM_DEVICE);
+		if (dma_mapping_error(&adapter->pdev->dev,
+				      new_dma_addr)) {
+			dev_kfree_skb(new_skb);
+			rq->stats.drop_total++;
+			return ret;
+		}
+		rbi->dma_addr = new_dma_addr;
+		rxd->addr = cpu_to_le64(rbi->dma_addr);
+		rxd->len = rbi->len;
+	}
+	return ret;
+}
diff --git a/drivers/net/vmxnet3/vmxnet3_xdp.h b/drivers/net/vmxnet3/vmxnet3_xdp.h
new file mode 100644
index 000000000000..6a3c662a4464
--- /dev/null
+++ b/drivers/net/vmxnet3/vmxnet3_xdp.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * Linux driver for VMware's vmxnet3 ethernet NIC.
+ * Copyright (C) 2008-2023, VMware, Inc. All Rights Reserved.
+ * Maintained by: pv-drivers@vmware.com
+ *
+ */
+
+#ifndef _VMXNET3_XDP_H
+#define _VMXNET3_XDP_H
+
+#include <linux/filter.h>
+#include <linux/bpf_trace.h>
+#include <linux/netlink.h>
+#include <net/xdp.h>
+
+#include "vmxnet3_int.h"
+
+#define VMXNET3_XDP_ROOM (SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) + \
+				XDP_PACKET_HEADROOM)
+#define VMXNET3_XDP_MAX_MTU (VMXNET3_MAX_SKB_BUF_SIZE - VMXNET3_XDP_ROOM)
+
+#define VMXNET3_XDP_CONTINUE 0	/* Pass to the stack, ex: XDP_PASS. */
+#define VMXNET3_XDP_TAKEN 1	/* Skip the stack, ex: XDP_DROP/TX/REDIRECT */
+
+int vmxnet3_xdp(struct net_device *netdev, struct netdev_bpf *bpf);
+void vmxnet3_unregister_xdp_rxq(struct vmxnet3_rx_queue *rq);
+int vmxnet3_register_xdp_rxq(struct vmxnet3_rx_queue *rq,
+			     struct vmxnet3_adapter *adapter);
+int vmxnet3_xdp_headroom(struct vmxnet3_adapter *adapter);
+int vmxnet3_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
+		     u32 flags);
+int vmxnet3_process_xdp(struct vmxnet3_adapter *adapter,
+			struct vmxnet3_rx_queue *rq,
+			struct Vmxnet3_RxCompDesc *rcd,
+			struct vmxnet3_rx_buf_info *rbi,
+			struct Vmxnet3_RxDesc *rxd,
+			bool *need_flush);
+#endif
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH] ath9k: use proper statements in conditionals
From: Toke Høiland-Jørgensen @ 2022-12-16 14:33 UTC (permalink / raw)
  To: Arnd Bergmann, Kalle Valo, Pavel Skripkin
  Cc: Arnd Bergmann, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Tetsuo Handa, linux-wireless, netdev, linux-kernel
In-Reply-To: <20221215165553.1950307-1-arnd@kernel.org>

Arnd Bergmann <arnd@kernel.org> writes:

> From: Arnd Bergmann <arnd@arndb.de>
>
> A previous cleanup patch accidentally broke some conditional
> expressions by replacing the safe "do {} while (0)" constructs
> with empty macros. gcc points this out when extra warnings
> are enabled:
>
> drivers/net/wireless/ath/ath9k/hif_usb.c: In function 'ath9k_skb_queue_complete':
> drivers/net/wireless/ath/ath9k/hif_usb.c:251:57: error: suggest braces around empty body in an 'else' statement [-Werror=empty-body]
>   251 |                         TX_STAT_INC(hif_dev, skb_failed);
>
> Make both sets of macros proper expressions again.
>
> Fixes: d7fc76039b74 ("ath9k: htc: clean up statistics macros")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>

^ permalink raw reply

* Re: [PATCH] ath9k: use proper statements in conditionals
From: Toke Høiland-Jørgensen @ 2022-12-16 14:33 UTC (permalink / raw)
  To: Arnd Bergmann, Arnd Bergmann, Kalle Valo, Pavel Skripkin
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Tetsuo Handa, linux-wireless, Netdev, linux-kernel
In-Reply-To: <486f9bc9-408f-4c29-b675-cbd61673f58c@app.fastmail.com>

"Arnd Bergmann" <arnd@arndb.de> writes:

> On Thu, Dec 15, 2022, at 18:16, Toke Høiland-Jørgensen wrote:
>>> index 30f0765fb9fd..237f4ec2cffd 100644
>>> --- a/drivers/net/wireless/ath/ath9k/htc.h
>>> +++ b/drivers/net/wireless/ath/ath9k/htc.h
>>> @@ -327,9 +327,9 @@ static inline struct ath9k_htc_tx_ctl *HTC_SKB_CB(struct sk_buff *skb)
>>>  }
>>>  
>>>  #ifdef CONFIG_ATH9K_HTC_DEBUGFS
>>> -#define __STAT_SAFE(hif_dev, expr)	((hif_dev)->htc_handle->drv_priv ? (expr) : 0)
>>> -#define CAB_STAT_INC(priv)		((priv)->debug.tx_stats.cab_queued++)
>>> -#define TX_QSTAT_INC(priv, q)		((priv)->debug.tx_stats.queue_stats[q]++)
>>> +#define __STAT_SAFE(hif_dev, expr)	do { ((hif_dev)->htc_handle->drv_priv ? (expr) : 0); } while (0)
>>> +#define CAB_STAT_INC(priv)		do { ((priv)->debug.tx_stats.cab_queued++); } while (0)
>>> +#define TX_QSTAT_INC(priv, q)		do { ((priv)->debug.tx_stats.queue_stats[q]++); } while (0)
>>
>> Hmm, is it really necessary to wrap these in do/while constructs? AFAICT
>> they're all simple statements already?
>
> It's generally safer to do the same thing on both side of the #ifdef.
>
> The "do { } while (0)" is an empty statement that is needed to fix
> the bug on the #else side. The expressions you have on the #ifdef
> side can be used as values, and wrapping them in do{}while(0)
> turns them into statements (without a value) as well, so fewer
> things can go wrong when you only test one side.

Alright, makes sense; thanks for explaining!

> I suppose the best solution would be to just use inline functions
> for all of them and get rid of the macros.

Let's merge this patch to fix the bug, and if someone wants to follow up
with a conversion to inline functions, that would be awesome, of course :)

-Toke

^ permalink raw reply

* [PATCH net] nfp: fix unaligned io read of capabilities word
From: Simon Horman @ 2022-12-16 14:31 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, oss-drivers, Huanhuan Wang, Niklas Söderlund,
	Simon Horman

From: Huanhuan Wang <huanhuan.wang@corigine.com>

The address of 32-bit extend capability is not qword aligned,
and may cause exception in some arch.

Fixes: 484963ce9f1e ("nfp: extend capability and control words")
Signed-off-by: Huanhuan Wang <huanhuan.wang@corigine.com>
Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 2314cf55e821..09053373288f 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -2509,7 +2509,7 @@ static int nfp_net_read_caps(struct nfp_net *nn)
 {
 	/* Get some of the read-only fields from the BAR */
 	nn->cap = nn_readl(nn, NFP_NET_CFG_CAP);
-	nn->cap_w1 = nn_readq(nn, NFP_NET_CFG_CAP_WORD1);
+	nn->cap_w1 = nn_readl(nn, NFP_NET_CFG_CAP_WORD1);
 	nn->max_mtu = nn_readl(nn, NFP_NET_CFG_MAX_MTU);
 
 	/* ABI 4.x and ctrl vNIC always use chained metadata, in other cases
-- 
2.30.2


^ permalink raw reply related

* Re: [PATCH] net: ks8851: Drop IRQ threading
From: Eric Dumazet @ 2022-12-16 13:23 UTC (permalink / raw)
  To: Marek Vasut
  Cc: netdev, David S. Miller, Dmitry Torokhov, Geoff Levand,
	Jakub Kicinski, Linus Walleij, Paolo Abeni, Petr Machata,
	Wolfram Sang
In-Reply-To: <20221216124731.122459-1-marex@denx.de>

On Fri, Dec 16, 2022 at 1:47 PM Marek Vasut <marex@denx.de> wrote:
>
> Request non-threaded IRQ in the KSZ8851 driver, this fixes the following warning:
> "
> NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!

This changelog is a bit terse.

Why can other drivers use request_threaded_irq(), but not this one ?


> "
>
> Signed-off-by: Marek Vasut <marex@denx.de>
> ---
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Geoff Levand <geoff@infradead.org>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Linus Walleij <linus.walleij@linaro.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Petr Machata <petrm@nvidia.com>
> Cc: Wolfram Sang <wsa+renesas@sang-engineering.com>
> To: netdev@vger.kernel.org
> ---
>  drivers/net/ethernet/micrel/ks8851_common.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/micrel/ks8851_common.c b/drivers/net/ethernet/micrel/ks8851_common.c
> index cfbc900d4aeb9..1eba4ba0b95cf 100644
> --- a/drivers/net/ethernet/micrel/ks8851_common.c
> +++ b/drivers/net/ethernet/micrel/ks8851_common.c
> @@ -443,9 +443,7 @@ static int ks8851_net_open(struct net_device *dev)
>         unsigned long flags;
>         int ret;
>
> -       ret = request_threaded_irq(dev->irq, NULL, ks8851_irq,
> -                                  IRQF_TRIGGER_LOW | IRQF_ONESHOT,
> -                                  dev->name, ks);
> +       ret = request_irq(dev->irq, ks8851_irq, IRQF_TRIGGER_LOW, dev->name, ks);
>         if (ret < 0) {
>                 netdev_err(dev, "failed to get irq\n");
>                 return ret;
> --
> 2.35.1
>

^ permalink raw reply

* Re: [PATCH v2 1/3] dsa: marvell: Provide per device information about max frame size
From: Lukasz Majewski @ 2022-12-16 13:05 UTC (permalink / raw)
  To: Alexander H Duyck
  Cc: Andrew Lunn, Vladimir Oltean, Vivien Didelot, Florian Fainelli,
	David S. Miller, Jakub Kicinski, Russell King, netdev,
	linux-kernel
In-Reply-To: <4d16ffd327d193f8c1f7c40f968fda90a267348e.camel@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3447 bytes --]

Hi Alexander,

> On Thu, 2022-12-15 at 15:45 +0100, Lukasz Majewski wrote:
> > Different Marvell DSA switches support different size of max frame
> > bytes to be sent.
> > 
> > For example mv88e6185 supports max 1632 bytes, which is now
> > in-driver standard value. On the other hand - mv88e6250 supports
> > 2048 bytes.
> > 
> > As this value is internal and may be different for each switch IC,
> > new entry in struct mv88e6xxx_info has been added to store it.
> > 
> > Signed-off-by: Lukasz Majewski <lukma@denx.de>
> > ---
> > Changes for v2:
> > - Define max_frame_size with default value of 1632 bytes,
> > - Set proper value for the mv88e6250 switch SoC (linkstreet) family
> > ---
> >  drivers/net/dsa/mv88e6xxx/chip.c | 13 ++++++++++++-
> >  drivers/net/dsa/mv88e6xxx/chip.h |  1 +
> >  2 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/dsa/mv88e6xxx/chip.c
> > b/drivers/net/dsa/mv88e6xxx/chip.c index 2ca3cbba5764..7ae4c389ce50
> > 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c
> > +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> > @@ -3093,7 +3093,9 @@ static int mv88e6xxx_get_max_mtu(struct
> > dsa_switch *ds, int port) if (chip->info->ops->port_set_jumbo_size)
> >  		return 10240 - VLAN_ETH_HLEN - EDSA_HLEN -
> > ETH_FCS_LEN; else if (chip->info->ops->set_max_frame_size)
> > -		return 1632 - VLAN_ETH_HLEN - EDSA_HLEN -
> > ETH_FCS_LEN;
> > +		return (chip->info->max_frame_size  - VLAN_ETH_HLEN
> > +			- EDSA_HLEN - ETH_FCS_LEN);
> > +
> >  	return 1522 - VLAN_ETH_HLEN - EDSA_HLEN - ETH_FCS_LEN;
> >  }
> > 
> >   
> 
> Is there any specific reason for triggering this based on the
> existance of the function call? 

This was the original code in this driver.

This value (1632 or 2048 bytes) is SoC (family) specific.

By checking which device defines set_max_frame_size callback, I could
fill the chip->info->max_frame_size with 1632 value.

> Why not just replace:
> 	else if (chip->info->ops->set_max_frame_size)
> with:
> 	else if (chip->info->max_frame_size)
> 

I think that the callback check is a bit "defensive" approach -> 1522B
is the default value and 1632 (or 10240 - jumbo) can be set only when
proper callback is defined.

> Otherwise my concern is one gets defined without the other leading to
> a future issue as 0 - extra headers will likely wrap and while the
> return value may be a signed int, it is usually stored in an unsigned
> int so it would effectively uncap the MTU.

Please correct me if I misunderstood something:

The problem is with new mv88eXXXX devices, which will not provide
max_frame_size information to their chip->info struct?

Or is there any other issue?

> 
> Actually you could take this one step further since all values should
> be 1522 or greater you could just drop the else/if and replace the
> last line with "max_t(int, chip->info->max_frame_size, 1522) -
> (headers)".

This would be possible, yes.

However, then we will not check if the proper callback (e.g.
chip->info->ops->set_max_frame_size) is available for specific SoC.

If this is OK for maintainers for this driver, then I don't mind.


Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 02/11] sfc: implement MCDI interface for vDPA operations
From: Gautam Dawar @ 2022-12-16 12:50 UTC (permalink / raw)
  To: Jason Wang, Gautam Dawar
  Cc: linux-net-drivers, netdev, eperezma, tanuj.kamde, Koushik.Dutta,
	harpreet.anand, Edward Cree, Martin Habets, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel
In-Reply-To: <CACGkMEuEJ9+wkFSiwUFGUi4RuQyJe2mc4fCNTwMw=S4SsSboiQ@mail.gmail.com>


On 12/14/22 12:13, Jason Wang wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>
>
> On Wed, Dec 7, 2022 at 10:56 PM Gautam Dawar <gautam.dawar@amd.com> wrote:
>> Implement functions to perform vDPA operations like creating and
>> removing virtqueues, getting doorbell register offset etc. using
>> the MCDI interface with FW.
>>
>> Signed-off-by: Gautam Dawar <gautam.dawar@amd.com>
>> ---
>>   drivers/net/ethernet/sfc/Kconfig      |   8 +
>>   drivers/net/ethernet/sfc/Makefile     |   1 +
>>   drivers/net/ethernet/sfc/ef100_vdpa.h |  32 +++
>>   drivers/net/ethernet/sfc/mcdi.h       |   4 +
>>   drivers/net/ethernet/sfc/mcdi_vdpa.c  | 268 ++++++++++++++++++++++++++
>>   drivers/net/ethernet/sfc/mcdi_vdpa.h  |  84 ++++++++
>>   6 files changed, 397 insertions(+)
>>   create mode 100644 drivers/net/ethernet/sfc/ef100_vdpa.h
>>   create mode 100644 drivers/net/ethernet/sfc/mcdi_vdpa.c
>>   create mode 100644 drivers/net/ethernet/sfc/mcdi_vdpa.h
>>
>> diff --git a/drivers/net/ethernet/sfc/Kconfig b/drivers/net/ethernet/sfc/Kconfig
>> index 0950e6b0508f..1fa626c87d36 100644
>> --- a/drivers/net/ethernet/sfc/Kconfig
>> +++ b/drivers/net/ethernet/sfc/Kconfig
>> @@ -63,6 +63,14 @@ config SFC_MCDI_LOGGING
>>            Driver-Interface) commands and responses, allowing debugging of
>>            driver/firmware interaction.  The tracing is actually enabled by
>>            a sysfs file 'mcdi_logging' under the PCI device.
>> +config SFC_VDPA
>> +       bool "Solarflare EF100-family VDPA support"
>> +       depends on SFC && VDPA && SFC_SRIOV
>> +       default y
>> +       help
>> +         This enables support for the virtio data path acceleration (vDPA).
>> +         vDPA device's datapath complies with the virtio specification,
>> +         but control path is vendor specific.
>>
>>   source "drivers/net/ethernet/sfc/falcon/Kconfig"
>>   source "drivers/net/ethernet/sfc/siena/Kconfig"
>> diff --git a/drivers/net/ethernet/sfc/Makefile b/drivers/net/ethernet/sfc/Makefile
>> index 712a48d00069..059a0944e89a 100644
>> --- a/drivers/net/ethernet/sfc/Makefile
>> +++ b/drivers/net/ethernet/sfc/Makefile
>> @@ -11,6 +11,7 @@ sfc-$(CONFIG_SFC_MTD) += mtd.o
>>   sfc-$(CONFIG_SFC_SRIOV)        += sriov.o ef10_sriov.o ef100_sriov.o ef100_rep.o \
>>                              mae.o tc.o tc_bindings.o tc_counters.o
>>
>> +sfc-$(CONFIG_SFC_VDPA) += mcdi_vdpa.o
>>   obj-$(CONFIG_SFC)      += sfc.o
>>
>>   obj-$(CONFIG_SFC_FALCON) += falcon/
>> diff --git a/drivers/net/ethernet/sfc/ef100_vdpa.h b/drivers/net/ethernet/sfc/ef100_vdpa.h
>> new file mode 100644
>> index 000000000000..f6564448d0c7
>> --- /dev/null
>> +++ b/drivers/net/ethernet/sfc/ef100_vdpa.h
>> @@ -0,0 +1,32 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Driver for Xilinx network controllers and boards
>> + * Copyright (C) 2020-2022, Xilinx, Inc.
>> + * Copyright (C) 2022, Advanced Micro Devices, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published
>> + * by the Free Software Foundation, incorporated herein by reference.
>> + */
>> +
>> +#ifndef __EF100_VDPA_H__
>> +#define __EF100_VDPA_H__
>> +
>> +#include <linux/vdpa.h>
>> +#include <uapi/linux/virtio_net.h>
>> +#include "net_driver.h"
>> +#include "ef100_nic.h"
>> +
>> +#if defined(CONFIG_SFC_VDPA)
>> +
>> +enum ef100_vdpa_device_type {
>> +       EF100_VDPA_DEVICE_TYPE_NET,
>> +};
>> +
>> +enum ef100_vdpa_vq_type {
>> +       EF100_VDPA_VQ_TYPE_NET_RXQ,
>> +       EF100_VDPA_VQ_TYPE_NET_TXQ,
>> +       EF100_VDPA_VQ_NTYPES
>> +};
>> +
>> +#endif /* CONFIG_SFC_VDPA */
>> +#endif /* __EF100_VDPA_H__ */
>> diff --git a/drivers/net/ethernet/sfc/mcdi.h b/drivers/net/ethernet/sfc/mcdi.h
>> index 7e35fec9da35..db4ca4975ada 100644
>> --- a/drivers/net/ethernet/sfc/mcdi.h
>> +++ b/drivers/net/ethernet/sfc/mcdi.h
>> @@ -214,6 +214,10 @@ void efx_mcdi_sensor_event(struct efx_nic *efx, efx_qword_t *ev);
>>   #define _MCDI_STRUCT_DWORD(_buf, _field)                               \
>>          ((_buf) + (_MCDI_CHECK_ALIGN(_field ## _OFST, 4) >> 2))
>>
>> +#define MCDI_SET_BYTE(_buf, _field, _value) do {                       \
>> +       BUILD_BUG_ON(MC_CMD_ ## _field ## _LEN != 1);                   \
>> +       *(u8 *)MCDI_PTR(_buf, _field) = _value;                         \
>> +       } while (0)
>>   #define MCDI_STRUCT_SET_BYTE(_buf, _field, _value) do {                        \
>>          BUILD_BUG_ON(_field ## _LEN != 1);                              \
>>          *(u8 *)MCDI_STRUCT_PTR(_buf, _field) = _value;                  \
>> diff --git a/drivers/net/ethernet/sfc/mcdi_vdpa.c b/drivers/net/ethernet/sfc/mcdi_vdpa.c
>> new file mode 100644
>> index 000000000000..35f822170049
>> --- /dev/null
>> +++ b/drivers/net/ethernet/sfc/mcdi_vdpa.c
>> @@ -0,0 +1,268 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Driver for Xilinx network controllers and boards
>> + * Copyright (C) 2020-2022, Xilinx, Inc.
>> + * Copyright (C) 2022, Advanced Micro Devices, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published
>> + * by the Free Software Foundation, incorporated herein by reference.
>> + */
>> +
>> +#include <linux/vdpa.h>
>> +#include "ef100_vdpa.h"
>> +#include "efx.h"
>> +#include "nic.h"
>> +#include "mcdi_vdpa.h"
>> +#include "mcdi_pcol.h"
>> +
>> +/* The value of target_vf in virtio MC commands like
>> + * virtqueue create, delete and get doorbell offset should
>> + * contain the VF index when the calling function is a PF
>> + * and VF_NULL (0xFFFF) otherwise. As the vDPA driver invokes
>> + * MC commands in context of the VF, it uses VF_NULL.
>> + */
>> +#define MC_CMD_VIRTIO_TARGET_VF_NULL 0xFFFF
>> +
>> +struct efx_vring_ctx *efx_vdpa_vring_init(struct efx_nic *efx,  u32 vi,
>> +                                         enum ef100_vdpa_vq_type vring_type)
>> +{
>> +       struct efx_vring_ctx *vring_ctx;
>> +       u32 queue_cmd;
>> +
>> +       vring_ctx = kzalloc(sizeof(*vring_ctx), GFP_KERNEL);
>> +       if (!vring_ctx)
>> +               return ERR_PTR(-ENOMEM);
>> +
>> +       switch (vring_type) {
>> +       case EF100_VDPA_VQ_TYPE_NET_RXQ:
>> +               queue_cmd = MC_CMD_VIRTIO_INIT_QUEUE_REQ_NET_RXQ;
>> +               break;
>> +       case EF100_VDPA_VQ_TYPE_NET_TXQ:
>> +               queue_cmd = MC_CMD_VIRTIO_INIT_QUEUE_REQ_NET_TXQ;
>> +               break;
>> +       default:
>> +               pci_err(efx->pci_dev,
>> +                       "%s: Invalid Queue type %u\n", __func__, vring_type);
>> +               kfree(vring_ctx);
>> +               return ERR_PTR(-ENOMEM);
>> +       }
>> +
>> +       vring_ctx->efx = efx;
>> +       vring_ctx->vf_index = MC_CMD_VIRTIO_TARGET_VF_NULL;
>> +       vring_ctx->vi_index = vi;
>> +       vring_ctx->mcdi_vring_type = queue_cmd;
>> +       return vring_ctx;
>> +}
>> +
>> +void efx_vdpa_vring_fini(struct efx_vring_ctx *vring_ctx)
>> +{
>> +       kfree(vring_ctx);
>> +}
>> +
>> +int efx_vdpa_get_features(struct efx_nic *efx,
>> +                         enum ef100_vdpa_device_type type,
>> +                         u64 *features)
>> +{
>> +       MCDI_DECLARE_BUF(outbuf, MC_CMD_VIRTIO_GET_FEATURES_OUT_LEN);
>> +       MCDI_DECLARE_BUF(inbuf, MC_CMD_VIRTIO_GET_FEATURES_IN_LEN);
>> +       u32 high_val, low_val;
>> +       ssize_t outlen;
>> +       u32 dev_type;
>> +       int rc;
>> +
>> +       if (!efx) {
>> +               pci_err(efx->pci_dev, "%s: Invalid NIC pointer\n", __func__);
>> +               return -EINVAL;
>> +       }
>> +       switch (type) {
>> +       case EF100_VDPA_DEVICE_TYPE_NET:
>> +               dev_type = MC_CMD_VIRTIO_GET_FEATURES_IN_NET;
>> +               break;
>> +       default:
>> +               pci_err(efx->pci_dev,
>> +                       "%s: Device type %d not supported\n", __func__, type);
>> +               return -EINVAL;
>> +       }
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_GET_FEATURES_IN_DEVICE_ID, dev_type);
>> +       rc = efx_mcdi_rpc(efx, MC_CMD_VIRTIO_GET_FEATURES, inbuf, sizeof(inbuf),
>> +                         outbuf, sizeof(outbuf), &outlen);
>> +       if (rc)
>> +               return rc;
>> +       if (outlen < MC_CMD_VIRTIO_GET_FEATURES_OUT_LEN)
>> +               return -EIO;
>> +       low_val = MCDI_DWORD(outbuf, VIRTIO_GET_FEATURES_OUT_FEATURES_LO);
>> +       high_val = MCDI_DWORD(outbuf, VIRTIO_GET_FEATURES_OUT_FEATURES_HI);
>> +       *features = ((u64)high_val << 32) | low_val;
>> +       return 0;
>> +}
>> +
>> +int efx_vdpa_verify_features(struct efx_nic *efx,
>> +                            enum ef100_vdpa_device_type type, u64 features)
>> +{
>> +       MCDI_DECLARE_BUF(inbuf, MC_CMD_VIRTIO_TEST_FEATURES_IN_LEN);
>> +       u32 dev_type;
>> +       int rc;
>> +
>> +       BUILD_BUG_ON(MC_CMD_VIRTIO_TEST_FEATURES_OUT_LEN != 0);
>> +       switch (type) {
>> +       case EF100_VDPA_DEVICE_TYPE_NET:
>> +               dev_type = MC_CMD_VIRTIO_GET_FEATURES_IN_NET;
>> +               break;
>> +       default:
>> +               pci_err(efx->pci_dev,
>> +                       "%s: Device type %d not supported\n", __func__, type);
>> +               return -EINVAL;
>> +       }
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_TEST_FEATURES_IN_DEVICE_ID, dev_type);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_TEST_FEATURES_IN_FEATURES_LO, features);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_TEST_FEATURES_IN_FEATURES_HI,
>> +                      features >> 32);
>> +       rc = efx_mcdi_rpc(efx, MC_CMD_VIRTIO_TEST_FEATURES, inbuf,
>> +                         sizeof(inbuf), NULL, 0, NULL);
>> +       return rc;
>> +}
>> +
>> +int efx_vdpa_vring_create(struct efx_vring_ctx *vring_ctx,
>> +                         struct efx_vring_cfg *vring_cfg,
>> +                         struct efx_vring_dyn_cfg *vring_dyn_cfg)
>> +{
>> +       MCDI_DECLARE_BUF(inbuf, MC_CMD_VIRTIO_INIT_QUEUE_REQ_LEN);
>> +       struct efx_nic *efx = vring_ctx->efx;
>> +       int rc;
>> +
>> +       BUILD_BUG_ON(MC_CMD_VIRTIO_INIT_QUEUE_RESP_LEN != 0);
>> +
>> +       MCDI_SET_BYTE(inbuf, VIRTIO_INIT_QUEUE_REQ_QUEUE_TYPE,
>> +                     vring_ctx->mcdi_vring_type);
>> +       MCDI_SET_WORD(inbuf, VIRTIO_INIT_QUEUE_REQ_TARGET_VF,
>> +                     vring_ctx->vf_index);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_INSTANCE,
>> +                      vring_ctx->vi_index);
>> +
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_SIZE, vring_cfg->size);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_DESC_TBL_ADDR_LO,
>> +                      vring_cfg->desc);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_DESC_TBL_ADDR_HI,
>> +                      vring_cfg->desc >> 32);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_AVAIL_RING_ADDR_LO,
>> +                      vring_cfg->avail);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_AVAIL_RING_ADDR_HI,
>> +                      vring_cfg->avail >> 32);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_USED_RING_ADDR_LO,
>> +                      vring_cfg->used);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_USED_RING_ADDR_HI,
>> +                      vring_cfg->used >> 32);
>> +       MCDI_SET_WORD(inbuf, VIRTIO_INIT_QUEUE_REQ_MSIX_VECTOR,
>> +                     vring_cfg->msix_vector);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_FEATURES_LO,
>> +                      vring_cfg->features);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_FEATURES_HI,
>> +                      vring_cfg->features >> 32);
>> +
>> +       if (vring_dyn_cfg) {
>> +               MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_INITIAL_PIDX,
>> +                              vring_dyn_cfg->avail_idx);
>> +               MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_INITIAL_CIDX,
>> +                              vring_dyn_cfg->used_idx);
>> +       }
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_MPORT_SELECTOR,
>> +                      MAE_MPORT_SELECTOR_ASSIGNED);
>> +
>> +       rc = efx_mcdi_rpc(efx, MC_CMD_VIRTIO_INIT_QUEUE, inbuf, sizeof(inbuf),
>> +                         NULL, 0, NULL);
> It looks to me the mcdi_buffer belongs to the VF (allocated by the
> calling of ef100_probe_vf()), I wonder how it is isolated from the DMA
> that is initiated by userspace(guest)?
>
> Thanks
Yes, I just responded to this concern on a similar comment in patch 9/11.
>
>
>> +       return rc;
>> +}
>> +
>> +int efx_vdpa_vring_destroy(struct efx_vring_ctx *vring_ctx,
>> +                          struct efx_vring_dyn_cfg *vring_dyn_cfg)
>> +{
>> +       MCDI_DECLARE_BUF(outbuf, MC_CMD_VIRTIO_FINI_QUEUE_RESP_LEN);
>> +       MCDI_DECLARE_BUF(inbuf, MC_CMD_VIRTIO_FINI_QUEUE_REQ_LEN);
>> +       struct efx_nic *efx = vring_ctx->efx;
>> +       ssize_t outlen;
>> +       int rc;
>> +
>> +       MCDI_SET_BYTE(inbuf, VIRTIO_FINI_QUEUE_REQ_QUEUE_TYPE,
>> +                     vring_ctx->mcdi_vring_type);
>> +       MCDI_SET_WORD(inbuf, VIRTIO_INIT_QUEUE_REQ_TARGET_VF,
>> +                     vring_ctx->vf_index);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_INIT_QUEUE_REQ_INSTANCE,
>> +                      vring_ctx->vi_index);
>> +       rc = efx_mcdi_rpc(efx, MC_CMD_VIRTIO_FINI_QUEUE, inbuf, sizeof(inbuf),
>> +                         outbuf, sizeof(outbuf), &outlen);
>> +
>> +       if (rc)
>> +               return rc;
>> +
>> +       if (outlen < MC_CMD_VIRTIO_FINI_QUEUE_RESP_LEN)
>> +               return -EIO;
>> +
>> +       if (vring_dyn_cfg) {
>> +               vring_dyn_cfg->avail_idx = MCDI_DWORD(outbuf,
>> +                                                     VIRTIO_FINI_QUEUE_RESP_FINAL_PIDX);
>> +               vring_dyn_cfg->used_idx = MCDI_DWORD(outbuf,
>> +                                                    VIRTIO_FINI_QUEUE_RESP_FINAL_CIDX);
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +int efx_vdpa_get_doorbell_offset(struct efx_vring_ctx *vring_ctx,
>> +                                u32 *offset)
>> +{
>> +       MCDI_DECLARE_BUF(outbuf, MC_CMD_VIRTIO_GET_NET_DOORBELL_OFFSET_RESP_LEN);
>> +       MCDI_DECLARE_BUF(inbuf, MC_CMD_VIRTIO_GET_DOORBELL_OFFSET_REQ_LEN);
>> +       struct efx_nic *efx = vring_ctx->efx;
>> +       ssize_t outlen;
>> +       int rc;
>> +
>> +       if (vring_ctx->mcdi_vring_type != MC_CMD_VIRTIO_INIT_QUEUE_REQ_NET_RXQ &&
>> +           vring_ctx->mcdi_vring_type != MC_CMD_VIRTIO_INIT_QUEUE_REQ_NET_TXQ) {
>> +               pci_err(efx->pci_dev,
>> +                       "%s: Invalid Queue type %u\n",
>> +                       __func__, vring_ctx->mcdi_vring_type);
>> +               return -EINVAL;
>> +       }
>> +
>> +       MCDI_SET_BYTE(inbuf, VIRTIO_GET_DOORBELL_OFFSET_REQ_DEVICE_ID,
>> +                     MC_CMD_VIRTIO_GET_FEATURES_IN_NET);
>> +       MCDI_SET_WORD(inbuf, VIRTIO_GET_DOORBELL_OFFSET_REQ_TARGET_VF,
>> +                     vring_ctx->vf_index);
>> +       MCDI_SET_DWORD(inbuf, VIRTIO_GET_DOORBELL_OFFSET_REQ_INSTANCE,
>> +                      vring_ctx->vi_index);
>> +
>> +       rc = efx_mcdi_rpc(efx, MC_CMD_VIRTIO_GET_DOORBELL_OFFSET, inbuf,
>> +                         sizeof(inbuf), outbuf, sizeof(outbuf), &outlen);
>> +       if (rc)
>> +               return rc;
>> +
>> +       if (outlen < MC_CMD_VIRTIO_GET_NET_DOORBELL_OFFSET_RESP_LEN)
>> +               return -EIO;
>> +       if (vring_ctx->mcdi_vring_type == MC_CMD_VIRTIO_INIT_QUEUE_REQ_NET_RXQ)
>> +               *offset = MCDI_DWORD(outbuf,
>> +                                    VIRTIO_GET_NET_DOORBELL_OFFSET_RESP_RX_DBL_OFFSET);
>> +       else
>> +               *offset = MCDI_DWORD(outbuf,
>> +                                    VIRTIO_GET_NET_DOORBELL_OFFSET_RESP_TX_DBL_OFFSET);
>> +
>> +       return 0;
>> +}
>> +
>> +int efx_vdpa_get_mtu(struct efx_nic *efx, u16 *mtu)
>> +{
>> +       MCDI_DECLARE_BUF(outbuf, MC_CMD_SET_MAC_V2_OUT_LEN);
>> +       MCDI_DECLARE_BUF(inbuf, MC_CMD_SET_MAC_EXT_IN_LEN);
>> +       ssize_t outlen;
>> +       int rc;
>> +
>> +       MCDI_SET_DWORD(inbuf, SET_MAC_EXT_IN_CONTROL, 0);
>> +       rc =  efx_mcdi_rpc(efx, MC_CMD_SET_MAC, inbuf, sizeof(inbuf),
>> +                          outbuf, sizeof(outbuf), &outlen);
>> +       if (rc)
>> +               return rc;
>> +       if (outlen < MC_CMD_SET_MAC_V2_OUT_LEN)
>> +               return -EIO;
>> +
>> +       *mtu = MCDI_DWORD(outbuf, SET_MAC_V2_OUT_MTU);
>> +       return 0;
>> +}
>> diff --git a/drivers/net/ethernet/sfc/mcdi_vdpa.h b/drivers/net/ethernet/sfc/mcdi_vdpa.h
>> new file mode 100644
>> index 000000000000..2a0f7c647c44
>> --- /dev/null
>> +++ b/drivers/net/ethernet/sfc/mcdi_vdpa.h
>> @@ -0,0 +1,84 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Driver for Xilinx network controllers and boards
>> + * Copyright (C) 2020-2022, Xilinx, Inc.
>> + * Copyright (C) 2022, Advanced Micro Devices, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published
>> + * by the Free Software Foundation, incorporated herein by reference.
>> + */
>> +
>> +#ifndef EFX_MCDI_VDPA_H
>> +#define EFX_MCDI_VDPA_H
>> +
>> +#if defined(CONFIG_SFC_VDPA)
>> +#include "mcdi.h"
>> +
>> +/**
>> + * struct efx_vring_ctx: The vring context
>> + *
>> + * @efx: pointer of the VF's efx_nic object
>> + * @vf_index: VF index of the vDPA VF
>> + * @vi_index: vi index to be used for queue creation
>> + * @mcdi_vring_type: corresponding MCDI vring type
>> + */
>> +struct efx_vring_ctx {
>> +       struct efx_nic *efx;
>> +       u32 vf_index;
>> +       u32 vi_index;
>> +       u32 mcdi_vring_type;
>> +};
>> +
>> +/**
>> + * struct efx_vring_cfg: Configuration for vring creation
>> + *
>> + * @desc: Descriptor area address of the vring
>> + * @avail: Available area address of the vring
>> + * @used: Device area address of the vring
>> + * @size: Queue size, in entries. Must be a power of two
>> + * @msix_vector: msix vector address for the queue
>> + * @features: negotiated feature bits
>> + */
>> +struct efx_vring_cfg {
>> +       u64 desc;
>> +       u64 avail;
>> +       u64 used;
>> +       u32 size;
>> +       u16 msix_vector;
>> +       u64 features;
>> +};
>> +
>> +/**
>> + * struct efx_vring_dyn_cfg - dynamic vring configuration
>> + *
>> + * @avail_idx: last available index of the vring
>> + * @used_idx: last used index of the vring
>> + */
>> +struct efx_vring_dyn_cfg {
>> +       u32 avail_idx;
>> +       u32 used_idx;
>> +};
>> +
>> +int efx_vdpa_get_features(struct efx_nic *efx, enum ef100_vdpa_device_type type,
>> +                         u64 *featuresp);
>> +
>> +int efx_vdpa_verify_features(struct efx_nic *efx,
>> +                            enum ef100_vdpa_device_type type, u64 features);
>> +
>> +struct efx_vring_ctx *efx_vdpa_vring_init(struct efx_nic *efx, u32 vi,
>> +                                         enum ef100_vdpa_vq_type vring_type);
>> +
>> +void efx_vdpa_vring_fini(struct efx_vring_ctx *vring_ctx);
>> +
>> +int efx_vdpa_vring_create(struct efx_vring_ctx *vring_ctx,
>> +                         struct efx_vring_cfg *vring_cfg,
>> +                         struct efx_vring_dyn_cfg *vring_dyn_cfg);
>> +
>> +int efx_vdpa_vring_destroy(struct efx_vring_ctx *vring_ctx,
>> +                          struct efx_vring_dyn_cfg *vring_dyn_cfg);
>> +
>> +int efx_vdpa_get_doorbell_offset(struct efx_vring_ctx *vring_ctx,
>> +                                u32 *offsetp);
>> +int efx_vdpa_get_mtu(struct efx_nic *efx, u16 *mtu);
>> +#endif
>> +#endif
>> --
>> 2.30.1
>>

^ permalink raw reply

* Re: [PATCH net-next 09/11] sfc: implement iova rbtree to store dma mappings
From: Gautam Dawar @ 2022-12-16 12:48 UTC (permalink / raw)
  To: Jason Wang, Gautam Dawar
  Cc: linux-net-drivers, netdev, eperezma, tanuj.kamde, Koushik.Dutta,
	harpreet.anand, Edward Cree, Martin Habets, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, linux-kernel
In-Reply-To: <CACGkMEt+euNwg+DEYFMNhJGXm1v2UqiPx622F-=DARFB4CWavQ@mail.gmail.com>


On 12/14/22 12:16, Jason Wang wrote:
> Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
>
>
> On Wed, Dec 7, 2022 at 10:57 PM Gautam Dawar <gautam.dawar@amd.com> wrote:
>> sfc uses a MCDI DMA buffer that is allocated on the host
>> for communicating with the Firmware. The MCDI buffer IOVA
>> could overlap with the IOVA used by the guest for the
>> virtqueue buffers. To detect such overlap, the DMA mappings
>> from the guest will be stored in a IOVA rbtree and every
>> such mapping will be compared against the MCDI buffer IOVA
>> range. If an overlap is detected, the MCDI buffer will be
>> relocated to a different IOVA.
> I think it can't prevent guests from guessing the MCDI buffer address
> and trying to DMA to/from that buffer.

Yes, if the guest can guess the MCDI buffer address, it could use it to 
DMA to/from this buffer.

However, the guest can modify the buffer contents but can't instruct the 
MC to execute the request. To cause any MCDI failure, the request buffer 
needs to be updated when host driver is about to execute the request or 
response buffer needs to be updated after command execution but before 
host driver reads it. This would be a very small time window and hard to 
guess for the guest.

> It might work with some co-operation of the NIC who can validate the
> DMA initialized by the virtqueue and forbid the DMA to the MDCI
> buffer.
I think this problem can be solved using PASID which will be supported 
by our next hardware version.
>
> Thanks

>> Signed-off-by: Gautam Dawar <gautam.dawar@amd.com>
>> ---
>>   drivers/net/ethernet/sfc/Makefile         |   3 +-
>>   drivers/net/ethernet/sfc/ef100_iova.c     | 205 ++++++++++++++++++++++
>>   drivers/net/ethernet/sfc/ef100_iova.h     |  40 +++++
>>   drivers/net/ethernet/sfc/ef100_nic.c      |   1 -
>>   drivers/net/ethernet/sfc/ef100_vdpa.c     |  38 ++++
>>   drivers/net/ethernet/sfc/ef100_vdpa.h     |  15 ++
>>   drivers/net/ethernet/sfc/ef100_vdpa_ops.c |   5 +
>>   drivers/net/ethernet/sfc/mcdi.h           |   3 +
>>   8 files changed, 308 insertions(+), 2 deletions(-)
>>   create mode 100644 drivers/net/ethernet/sfc/ef100_iova.c
>>   create mode 100644 drivers/net/ethernet/sfc/ef100_iova.h
>>
>> diff --git a/drivers/net/ethernet/sfc/Makefile b/drivers/net/ethernet/sfc/Makefile
>> index a10eac91ab23..85852ff50b7c 100644
>> --- a/drivers/net/ethernet/sfc/Makefile
>> +++ b/drivers/net/ethernet/sfc/Makefile
>> @@ -11,7 +11,8 @@ sfc-$(CONFIG_SFC_MTD) += mtd.o
>>   sfc-$(CONFIG_SFC_SRIOV)        += sriov.o ef10_sriov.o ef100_sriov.o ef100_rep.o \
>>                              mae.o tc.o tc_bindings.o tc_counters.o
>>
>> -sfc-$(CONFIG_SFC_VDPA) += mcdi_vdpa.o ef100_vdpa.o ef100_vdpa_ops.o
>> +sfc-$(CONFIG_SFC_VDPA) += mcdi_vdpa.o ef100_vdpa.o ef100_vdpa_ops.o \
>> +                          ef100_iova.o
>>   obj-$(CONFIG_SFC)      += sfc.o
>>
>>   obj-$(CONFIG_SFC_FALCON) += falcon/
>> diff --git a/drivers/net/ethernet/sfc/ef100_iova.c b/drivers/net/ethernet/sfc/ef100_iova.c
>> new file mode 100644
>> index 000000000000..863314c5b9b5
>> --- /dev/null
>> +++ b/drivers/net/ethernet/sfc/ef100_iova.c
>> @@ -0,0 +1,205 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Driver for Xilinx network controllers and boards
>> + * Copyright(C) 2020-2022 Xilinx, Inc.
>> + * Copyright(C) 2022 Advanced Micro Devices, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published
>> + * by the Free Software Foundation, incorporated herein by reference.
>> + */
>> +
>> +#include "ef100_iova.h"
>> +
>> +static void update_free_list_node(struct ef100_vdpa_iova_node *target_node,
>> +                                 struct ef100_vdpa_iova_node *next_node,
>> +                                 struct ef100_vdpa_nic *vdpa_nic)
>> +{
>> +       unsigned long target_node_end;
>> +       unsigned long free_area;
>> +       bool in_list;
>> +
>> +       target_node_end = target_node->iova + target_node->size;
>> +       free_area = next_node->iova - target_node_end;
>> +       in_list = !(list_empty(&target_node->free_node));
>> +
>> +       if (!in_list && free_area >= MCDI_BUF_LEN) {
>> +               list_add(&target_node->free_node,
>> +                        &vdpa_nic->free_list);
>> +       } else if (in_list && free_area < MCDI_BUF_LEN) {
>> +               list_del_init(&target_node->free_node);
>> +       }
>> +}
>> +
>> +static void update_free_list(struct ef100_vdpa_iova_node *iova_node,
>> +                            struct ef100_vdpa_nic *vdpa_nic,
>> +                            bool add_node)
>> +{
>> +       struct ef100_vdpa_iova_node *prev_in = NULL;
>> +       struct ef100_vdpa_iova_node *next_in = NULL;
>> +       struct rb_node *prev_node;
>> +       struct rb_node *next_node;
>> +
>> +       prev_node = rb_prev(&iova_node->node);
>> +       next_node = rb_next(&iova_node->node);
>> +
>> +       if (prev_node)
>> +               prev_in = rb_entry(prev_node,
>> +                                  struct ef100_vdpa_iova_node, node);
>> +       if (next_node)
>> +               next_in = rb_entry(next_node,
>> +                                  struct ef100_vdpa_iova_node, node);
>> +
>> +       if (add_node) {
>> +               if (prev_in)
>> +                       update_free_list_node(prev_in, iova_node, vdpa_nic);
>> +
>> +               if (next_in)
>> +                       update_free_list_node(iova_node, next_in, vdpa_nic);
>> +       } else {
>> +               if (next_in && prev_in)
>> +                       update_free_list_node(prev_in, next_in, vdpa_nic);
>> +               if (!list_empty(&iova_node->free_node))
>> +                       list_del_init(&iova_node->free_node);
>> +       }
>> +}
>> +
>> +int efx_ef100_insert_iova_node(struct ef100_vdpa_nic *vdpa_nic,
>> +                              u64 iova, u64 size)
>> +{
>> +       struct ef100_vdpa_iova_node *iova_node;
>> +       struct ef100_vdpa_iova_node *new_node;
>> +       struct rb_node *parent;
>> +       struct rb_node **link;
>> +       struct rb_root *root;
>> +       int rc = 0;
>> +
>> +       mutex_lock(&vdpa_nic->iova_lock);
>> +
>> +       root = &vdpa_nic->iova_root;
>> +       link = &root->rb_node;
>> +       parent = *link;
>> +       /* Go to the bottom of the tree */
>> +       while (*link) {
>> +               parent = *link;
>> +               iova_node = rb_entry(parent, struct ef100_vdpa_iova_node, node);
>> +
>> +               /* handle duplicate node */
>> +               if (iova_node->iova == iova) {
>> +                       rc = -EEXIST;
>> +                       goto out_unlock;
>> +               }
>> +
>> +               if (iova_node->iova > iova)
>> +                       link = &(*link)->rb_left;
>> +               else
>> +                       link = &(*link)->rb_right;
>> +       }
>> +
>> +       new_node = kzalloc(sizeof(*new_node), GFP_KERNEL);
>> +       if (!new_node) {
>> +               rc = -ENOMEM;
>> +               goto out_unlock;
>> +       }
>> +
>> +       new_node->iova = iova;
>> +       new_node->size = size;
>> +       INIT_LIST_HEAD(&new_node->free_node);
>> +
>> +       /* Put the new node here */
>> +       rb_link_node(&new_node->node, parent, link);
>> +       rb_insert_color(&new_node->node, root);
>> +
>> +       update_free_list(new_node, vdpa_nic, true);
>> +
>> +out_unlock:
>> +       mutex_unlock(&vdpa_nic->iova_lock);
>> +       return rc;
>> +}
>> +
>> +static struct ef100_vdpa_iova_node*
>> +ef100_rbt_search_node(struct ef100_vdpa_nic *vdpa_nic,
>> +                     unsigned long iova)
>> +{
>> +       struct ef100_vdpa_iova_node *iova_node;
>> +       struct rb_node *rb_node;
>> +       struct rb_root *root;
>> +
>> +       root = &vdpa_nic->iova_root;
>> +       if (!root)
>> +               return NULL;
>> +
>> +       rb_node = root->rb_node;
>> +
>> +       while (rb_node) {
>> +               iova_node = rb_entry(rb_node, struct ef100_vdpa_iova_node,
>> +                                    node);
>> +               if (iova_node->iova > iova)
>> +                       rb_node = rb_node->rb_left;
>> +               else if (iova_node->iova < iova)
>> +                       rb_node = rb_node->rb_right;
>> +               else
>> +                       return iova_node;
>> +       }
>> +
>> +       return NULL;
>> +}
>> +
>> +void efx_ef100_remove_iova_node(struct ef100_vdpa_nic *vdpa_nic,
>> +                               unsigned long iova)
>> +{
>> +       struct ef100_vdpa_iova_node *iova_node;
>> +
>> +       mutex_lock(&vdpa_nic->iova_lock);
>> +       iova_node = ef100_rbt_search_node(vdpa_nic, iova);
>> +       if (!iova_node)
>> +               goto out_unlock;
>> +
>> +       update_free_list(iova_node, vdpa_nic, false);
>> +
>> +       rb_erase(&iova_node->node, &vdpa_nic->iova_root);
>> +       kfree(iova_node);
>> +
>> +out_unlock:
>> +       mutex_unlock(&vdpa_nic->iova_lock);
>> +}
>> +
>> +void efx_ef100_delete_iova(struct ef100_vdpa_nic *vdpa_nic)
>> +{
>> +       struct ef100_vdpa_iova_node *iova_node;
>> +       struct rb_root *iova_root;
>> +       struct rb_node *node;
>> +
>> +       mutex_lock(&vdpa_nic->iova_lock);
>> +
>> +       iova_root = &vdpa_nic->iova_root;
>> +       while (!RB_EMPTY_ROOT(iova_root)) {
>> +               node = rb_first(iova_root);
>> +               iova_node = rb_entry(node, struct ef100_vdpa_iova_node, node);
>> +               if (!list_empty(&iova_node->free_node))
>> +                       list_del_init(&iova_node->free_node);
>> +               if (vdpa_nic->domain)
>> +                       iommu_unmap(vdpa_nic->domain, iova_node->iova,
>> +                                   iova_node->size);
>> +               rb_erase(node, iova_root);
>> +               kfree(iova_node);
>> +       }
>> +
>> +       mutex_unlock(&vdpa_nic->iova_lock);
>> +}
>> +
>> +int efx_ef100_find_new_iova(struct ef100_vdpa_nic *vdpa_nic,
>> +                           unsigned int buf_len,
>> +                           u64 *new_iova)
>> +{
>> +       struct ef100_vdpa_iova_node *iova_node;
>> +
>> +       /* pick the first node from freelist */
>> +       iova_node = list_first_entry_or_null(&vdpa_nic->free_list,
>> +                                            struct ef100_vdpa_iova_node,
>> +                                            free_node);
>> +       if (!iova_node)
>> +               return -ENOENT;
>> +
>> +       *new_iova = iova_node->iova + iova_node->size;
>> +       return 0;
>> +}
>> diff --git a/drivers/net/ethernet/sfc/ef100_iova.h b/drivers/net/ethernet/sfc/ef100_iova.h
>> new file mode 100644
>> index 000000000000..68e39c4152c7
>> --- /dev/null
>> +++ b/drivers/net/ethernet/sfc/ef100_iova.h
>> @@ -0,0 +1,40 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Driver for Xilinx network controllers and boards
>> + * Copyright(C) 2020-2022 Xilinx, Inc.
>> + * Copyright(C) 2022 Advanced Micro Devices, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published
>> + * by the Free Software Foundation, incorporated herein by reference.
>> + */
>> +#ifndef EFX_EF100_IOVA_H
>> +#define EFX_EF100_IOVA_H
>> +
>> +#include "ef100_nic.h"
>> +#include "ef100_vdpa.h"
>> +
>> +#if defined(CONFIG_SFC_VDPA)
>> +/**
>> + * struct ef100_vdpa_iova_node - guest buffer iova entry
>> + *
>> + * @node: red black tree node
>> + * @iova: mapping's IO virtual address
>> + * @size: length of mapped region in bytes
>> + * @free_node: free list node
>> + */
>> +struct ef100_vdpa_iova_node {
>> +       struct rb_node node;
>> +       unsigned long iova;
>> +       size_t size;
>> +       struct list_head free_node;
>> +};
>> +
>> +int efx_ef100_insert_iova_node(struct ef100_vdpa_nic *vdpa_nic,
>> +                              u64 iova, u64 size);
>> +void efx_ef100_remove_iova_node(struct ef100_vdpa_nic *vdpa_nic,
>> +                               unsigned long iova);
>> +void efx_ef100_delete_iova(struct ef100_vdpa_nic *vdpa_nic);
>> +int efx_ef100_find_new_iova(struct ef100_vdpa_nic *vdpa_nic,
>> +                           unsigned int buf_len, u64 *new_iova);
>> +#endif  /* CONFIG_SFC_VDPA */
>> +#endif /* EFX_EF100_IOVA_H */
>> diff --git a/drivers/net/ethernet/sfc/ef100_nic.c b/drivers/net/ethernet/sfc/ef100_nic.c
>> index 41811c519275..72820d2fe19d 100644
>> --- a/drivers/net/ethernet/sfc/ef100_nic.c
>> +++ b/drivers/net/ethernet/sfc/ef100_nic.c
>> @@ -33,7 +33,6 @@
>>
>>   #define EF100_MAX_VIS 4096
>>   #define EF100_NUM_MCDI_BUFFERS 1
>> -#define MCDI_BUF_LEN (8 + MCDI_CTL_SDU_LEN_MAX)
>>
>>   #define EF100_RESET_PORT ((ETH_RESET_MAC | ETH_RESET_PHY) << ETH_RESET_SHARED_SHIFT)
>>
>> diff --git a/drivers/net/ethernet/sfc/ef100_vdpa.c b/drivers/net/ethernet/sfc/ef100_vdpa.c
>> index 80bca281a748..b9368eb1acd5 100644
>> --- a/drivers/net/ethernet/sfc/ef100_vdpa.c
>> +++ b/drivers/net/ethernet/sfc/ef100_vdpa.c
>> @@ -14,6 +14,7 @@
>>   #include <uapi/linux/vdpa.h>
>>   #include "ef100_vdpa.h"
>>   #include "mcdi_vdpa.h"
>> +#include "ef100_iova.h"
>>   #include "mcdi_filters.h"
>>   #include "mcdi_functions.h"
>>   #include "ef100_netdev.h"
>> @@ -280,6 +281,34 @@ static int get_net_config(struct ef100_vdpa_nic *vdpa_nic)
>>          return 0;
>>   }
>>
>> +static int vdpa_update_domain(struct ef100_vdpa_nic *vdpa_nic)
>> +{
>> +       struct vdpa_device *vdpa = &vdpa_nic->vdpa_dev;
>> +       struct iommu_domain_geometry *geo;
>> +       struct device *dma_dev;
>> +
>> +       dma_dev = vdpa_get_dma_dev(vdpa);
>> +       if (!device_iommu_capable(dma_dev, IOMMU_CAP_CACHE_COHERENCY))
>> +               return -EOPNOTSUPP;
>> +
>> +       vdpa_nic->domain = iommu_get_domain_for_dev(dma_dev);
>> +       if (!vdpa_nic->domain)
>> +               return -ENODEV;
>> +
>> +       geo = &vdpa_nic->domain->geometry;
>> +       /* save the geo aperture range for validation in dma_map */
>> +       vdpa_nic->geo_aper_start = geo->aperture_start;
>> +
>> +       /* Handle the boundary case */
>> +       if (geo->aperture_end == ~0ULL)
>> +               geo->aperture_end -= 1;
>> +       vdpa_nic->geo_aper_end = geo->aperture_end;
>> +
>> +       /* insert a sentinel node */
>> +       return efx_ef100_insert_iova_node(vdpa_nic,
>> +                                         vdpa_nic->geo_aper_end + 1, 0);
>> +}
>> +
>>   static struct ef100_vdpa_nic *ef100_vdpa_create(struct efx_nic *efx,
>>                                                  const char *dev_name,
>>                                                  enum ef100_vdpa_class dev_type,
>> @@ -316,6 +345,7 @@ static struct ef100_vdpa_nic *ef100_vdpa_create(struct efx_nic *efx,
>>          }
>>
>>          mutex_init(&vdpa_nic->lock);
>> +       mutex_init(&vdpa_nic->iova_lock);
>>          dev = &vdpa_nic->vdpa_dev.dev;
>>          efx->vdpa_nic = vdpa_nic;
>>          vdpa_nic->vdpa_dev.dma_dev = &efx->pci_dev->dev;
>> @@ -325,9 +355,11 @@ static struct ef100_vdpa_nic *ef100_vdpa_create(struct efx_nic *efx,
>>          vdpa_nic->pf_index = nic_data->pf_index;
>>          vdpa_nic->vf_index = nic_data->vf_index;
>>          vdpa_nic->vdpa_state = EF100_VDPA_STATE_INITIALIZED;
>> +       vdpa_nic->iova_root = RB_ROOT;
>>          vdpa_nic->mac_address = (u8 *)&vdpa_nic->net_config.mac;
>>          ether_addr_copy(vdpa_nic->mac_address, mac);
>>          vdpa_nic->mac_configured = true;
>> +       INIT_LIST_HEAD(&vdpa_nic->free_list);
>>
>>          for (i = 0; i < EF100_VDPA_MAC_FILTER_NTYPES; i++)
>>                  vdpa_nic->filters[i].filter_id = EFX_INVALID_FILTER_ID;
>> @@ -353,6 +385,12 @@ static struct ef100_vdpa_nic *ef100_vdpa_create(struct efx_nic *efx,
>>                  goto err_put_device;
>>          }
>>
>> +       rc = vdpa_update_domain(vdpa_nic);
>> +       if (rc) {
>> +               pci_err(efx->pci_dev, "update_domain failed, err: %d\n", rc);
>> +               goto err_put_device;
>> +       }
>> +
>>          rc = get_net_config(vdpa_nic);
>>          if (rc)
>>                  goto err_put_device;
>> diff --git a/drivers/net/ethernet/sfc/ef100_vdpa.h b/drivers/net/ethernet/sfc/ef100_vdpa.h
>> index 1b0bbba88154..c3c77029973d 100644
>> --- a/drivers/net/ethernet/sfc/ef100_vdpa.h
>> +++ b/drivers/net/ethernet/sfc/ef100_vdpa.h
>> @@ -12,7 +12,9 @@
>>   #define __EF100_VDPA_H__
>>
>>   #include <linux/vdpa.h>
>> +#include <linux/iommu.h>
>>   #include <uapi/linux/virtio_net.h>
>> +#include <linux/rbtree.h>
>>   #include "net_driver.h"
>>   #include "ef100_nic.h"
>>
>> @@ -155,6 +157,12 @@ struct ef100_vdpa_filter {
>>    * @mac_configured: true after MAC address is configured
>>    * @filters: details of all filters created on this vdpa device
>>    * @cfg_cb: callback for config change
>> + * @domain: IOMMU domain
>> + * @iova_root: iova rbtree root
>> + * @iova_lock: lock to synchronize updates to rbtree and freelist
>> + * @free_list: list to store free iova areas of size >= MCDI buffer length
>> + * @geo_aper_start: start of valid IOVA range
>> + * @geo_aper_end: end of valid IOVA range
>>    */
>>   struct ef100_vdpa_nic {
>>          struct vdpa_device vdpa_dev;
>> @@ -174,6 +182,13 @@ struct ef100_vdpa_nic {
>>          bool mac_configured;
>>          struct ef100_vdpa_filter filters[EF100_VDPA_MAC_FILTER_NTYPES];
>>          struct vdpa_callback cfg_cb;
>> +       struct iommu_domain *domain;
>> +       struct rb_root iova_root;
>> +       /* mutex to synchronize rbtree operations */
>> +       struct mutex iova_lock;
>> +       struct list_head free_list;
>> +       u64 geo_aper_start;
>> +       u64 geo_aper_end;
>>   };
>>
>>   int ef100_vdpa_init(struct efx_probe_data *probe_data);
>> diff --git a/drivers/net/ethernet/sfc/ef100_vdpa_ops.c b/drivers/net/ethernet/sfc/ef100_vdpa_ops.c
>> index 718b67f6da90..8c198d949fdb 100644
>> --- a/drivers/net/ethernet/sfc/ef100_vdpa_ops.c
>> +++ b/drivers/net/ethernet/sfc/ef100_vdpa_ops.c
>> @@ -10,6 +10,7 @@
>>
>>   #include <linux/vdpa.h>
>>   #include "ef100_vdpa.h"
>> +#include "ef100_iova.h"
>>   #include "io.h"
>>   #include "mcdi_vdpa.h"
>>
>> @@ -260,6 +261,7 @@ static void ef100_reset_vdpa_device(struct ef100_vdpa_nic *vdpa_nic)
>>          if (!vdpa_nic->status)
>>                  return;
>>
>> +       efx_ef100_delete_iova(vdpa_nic);
>>          vdpa_nic->vdpa_state = EF100_VDPA_STATE_INITIALIZED;
>>          vdpa_nic->status = 0;
>>          vdpa_nic->features = 0;
>> @@ -743,9 +745,12 @@ static void ef100_vdpa_free(struct vdpa_device *vdev)
>>          int i;
>>
>>          if (vdpa_nic) {
>> +               /* clean-up the mappings and iova tree */
>> +               efx_ef100_delete_iova(vdpa_nic);
>>                  for (i = 0; i < (vdpa_nic->max_queue_pairs * 2); i++)
>>                          reset_vring(vdpa_nic, i);
>>                  ef100_vdpa_irq_vectors_free(vdpa_nic->efx->pci_dev);
>> +               mutex_destroy(&vdpa_nic->iova_lock);
>>                  mutex_destroy(&vdpa_nic->lock);
>>                  vdpa_nic->efx->vdpa_nic = NULL;
>>          }
>> diff --git a/drivers/net/ethernet/sfc/mcdi.h b/drivers/net/ethernet/sfc/mcdi.h
>> index db4ca4975ada..7d977a58a0df 100644
>> --- a/drivers/net/ethernet/sfc/mcdi.h
>> +++ b/drivers/net/ethernet/sfc/mcdi.h
>> @@ -7,6 +7,7 @@
>>   #ifndef EFX_MCDI_H
>>   #define EFX_MCDI_H
>>
>> +#include "mcdi_pcol.h"
>>   /**
>>    * enum efx_mcdi_state - MCDI request handling state
>>    * @MCDI_STATE_QUIESCENT: No pending MCDI requests. If the caller holds the
>> @@ -40,6 +41,8 @@ enum efx_mcdi_mode {
>>          MCDI_MODE_FAIL,
>>   };
>>
>> +#define MCDI_BUF_LEN (8 + MCDI_CTL_SDU_LEN_MAX)
>> +
>>   /**
>>    * struct efx_mcdi_iface - MCDI protocol context
>>    * @efx: The associated NIC.
>> --
>> 2.30.1
>>

^ permalink raw reply

* [PATCH] net: ks8851: Drop IRQ threading
From: Marek Vasut @ 2022-12-16 12:47 UTC (permalink / raw)
  To: netdev
  Cc: Marek Vasut, David S. Miller, Dmitry Torokhov, Eric Dumazet,
	Geoff Levand, Jakub Kicinski, Linus Walleij, Paolo Abeni,
	Petr Machata, Wolfram Sang

Request non-threaded IRQ in the KSZ8851 driver, this fixes the following warning:
"
NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
"

Signed-off-by: Marek Vasut <marex@denx.de>
---
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Petr Machata <petrm@nvidia.com>
Cc: Wolfram Sang <wsa+renesas@sang-engineering.com>
To: netdev@vger.kernel.org
---
 drivers/net/ethernet/micrel/ks8851_common.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/micrel/ks8851_common.c b/drivers/net/ethernet/micrel/ks8851_common.c
index cfbc900d4aeb9..1eba4ba0b95cf 100644
--- a/drivers/net/ethernet/micrel/ks8851_common.c
+++ b/drivers/net/ethernet/micrel/ks8851_common.c
@@ -443,9 +443,7 @@ static int ks8851_net_open(struct net_device *dev)
 	unsigned long flags;
 	int ret;
 
-	ret = request_threaded_irq(dev->irq, NULL, ks8851_irq,
-				   IRQF_TRIGGER_LOW | IRQF_ONESHOT,
-				   dev->name, ks);
+	ret = request_irq(dev->irq, ks8851_irq, IRQF_TRIGGER_LOW, dev->name, ks);
 	if (ret < 0) {
 		netdev_err(dev, "failed to get irq\n");
 		return ret;
-- 
2.35.1


^ permalink raw reply related

* [PATCH net v4 2/3] Treewide: Stop corrupting socket's task_frag
From: Benjamin Coddington @ 2022-12-16 12:45 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev
In-Reply-To: <cover.1671194454.git.bcodding@redhat.com>

Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag.  The results of this are
unexpected corruption in task_frag when SUNRPC is involved in memory
reclaim.

The corruption can be seen in crashes, but the root cause is often
difficult to ascertain as a crashing machine's stack trace will have no
evidence of being near NFS or SUNRPC code.  I believe this problem to
be much more pervasive than reports to the community may indicate.

Fix this by having kernel users of sockets that may corrupt task_frag due
to reclaim set sk_use_task_frag = false.  Preemptively correcting this
situation for users that still set sk_allocation allows them to convert to
memalloc_nofs_save/restore without the same unexpected corruptions that are
sure to follow, unlikely to show up in testing, and difficult to bisect.

CC: Philipp Reisner <philipp.reisner@linbit.com>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Josef Bacik <josef@toxicpanda.com>
CC: Keith Busch <kbusch@kernel.org>
CC: Christoph Hellwig <hch@lst.de>
CC: Sagi Grimberg <sagi@grimberg.me>
CC: Lee Duncan <lduncan@suse.com>
CC: Chris Leech <cleech@redhat.com>
CC: Mike Christie <michael.christie@oracle.com>
CC: "James E.J. Bottomley" <jejb@linux.ibm.com>
CC: "Martin K. Petersen" <martin.petersen@oracle.com>
CC: Valentina Manea <valentina.manea.m@gmail.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: David Howells <dhowells@redhat.com>
CC: Marc Dionne <marc.dionne@auristor.com>
CC: Steve French <sfrench@samba.org>
CC: Christine Caulfield <ccaulfie@redhat.com>
CC: David Teigland <teigland@redhat.com>
CC: Mark Fasheh <mark@fasheh.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: Joseph Qi <joseph.qi@linux.alibaba.com>
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: Dominique Martinet <asmadeus@codewreck.org>
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jakub Kicinski <kuba@kernel.org>
CC: Paolo Abeni <pabeni@redhat.com>
CC: Ilya Dryomov <idryomov@gmail.com>
CC: Xiubo Li <xiubli@redhat.com>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Jeff Layton <jlayton@kernel.org>
CC: Trond Myklebust <trond.myklebust@hammerspace.com>
CC: Anna Schumaker <anna@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: netdev@vger.kernel.org

Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
---
 drivers/block/drbd/drbd_receiver.c | 3 +++
 drivers/block/nbd.c                | 1 +
 drivers/nvme/host/tcp.c            | 1 +
 drivers/scsi/iscsi_tcp.c           | 1 +
 drivers/usb/usbip/usbip_common.c   | 1 +
 fs/cifs/connect.c                  | 1 +
 fs/dlm/lowcomms.c                  | 2 ++
 fs/ocfs2/cluster/tcp.c             | 1 +
 net/9p/trans_fd.c                  | 1 +
 net/ceph/messenger.c               | 1 +
 net/sunrpc/xprtsock.c              | 3 +++
 net/xfrm/espintcp.c                | 1 +
 12 files changed, 17 insertions(+)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 0e58a3187345..757f4692b5bd 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1030,6 +1030,9 @@ static int conn_connect(struct drbd_connection *connection)
 	sock.socket->sk->sk_allocation = GFP_NOIO;
 	msock.socket->sk->sk_allocation = GFP_NOIO;
 
+	sock.socket->sk->sk_use_task_frag = false;
+	msock.socket->sk->sk_use_task_frag = false;
+
 	sock.socket->sk->sk_priority = TC_PRIO_INTERACTIVE_BULK;
 	msock.socket->sk->sk_priority = TC_PRIO_INTERACTIVE;
 
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index e379ccc63c52..592cfa8b765a 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -512,6 +512,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
 	noreclaim_flag = memalloc_noreclaim_save();
 	do {
 		sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
+		sock->sk->sk_use_task_frag = false;
 		msg.msg_name = NULL;
 		msg.msg_namelen = 0;
 		msg.msg_control = NULL;
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index b69b89166b6b..8cedc1ef496c 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1537,6 +1537,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
 	queue->sock->sk->sk_rcvtimeo = 10 * HZ;
 
 	queue->sock->sk->sk_allocation = GFP_ATOMIC;
+	queue->sock->sk->sk_use_task_frag = false;
 	nvme_tcp_set_queue_io_cpu(queue);
 	queue->request = NULL;
 	queue->data_remaining = 0;
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 5fb1f364e815..1d1cf641937c 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -738,6 +738,7 @@ iscsi_sw_tcp_conn_bind(struct iscsi_cls_session *cls_session,
 	sk->sk_reuse = SK_CAN_REUSE;
 	sk->sk_sndtimeo = 15 * HZ; /* FIXME: make it configurable */
 	sk->sk_allocation = GFP_ATOMIC;
+	sk->sk_use_task_frag = false;
 	sk_set_memalloc(sk);
 	sock_no_linger(sk);
 
diff --git a/drivers/usb/usbip/usbip_common.c b/drivers/usb/usbip/usbip_common.c
index f8b326eed54d..a2b2da1255dd 100644
--- a/drivers/usb/usbip/usbip_common.c
+++ b/drivers/usb/usbip/usbip_common.c
@@ -315,6 +315,7 @@ int usbip_recv(struct socket *sock, void *buf, int size)
 
 	do {
 		sock->sk->sk_allocation = GFP_NOIO;
+		sock->sk->sk_use_task_frag = false;
 
 		result = sock_recvmsg(sock, &msg, MSG_WAITALL);
 		if (result <= 0)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index e80252a83225..7bc7b5e03c51 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -2944,6 +2944,7 @@ generic_ip_connect(struct TCP_Server_Info *server)
 		cifs_dbg(FYI, "Socket created\n");
 		server->ssocket = socket;
 		socket->sk->sk_allocation = GFP_NOFS;
+		socket->sk->sk_use_task_frag = false;
 		if (sfamily == AF_INET6)
 			cifs_reclassify_socket6(socket);
 		else
diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 8b80ca0cd65f..4450721ec83c 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -645,6 +645,7 @@ static void add_sock(struct socket *sock, struct connection *con)
 	if (dlm_config.ci_protocol == DLM_PROTO_SCTP)
 		sk->sk_state_change = lowcomms_state_change;
 	sk->sk_allocation = GFP_NOFS;
+	sk->sk_use_task_frag = false;
 	sk->sk_error_report = lowcomms_error_report;
 	release_sock(sk);
 }
@@ -1769,6 +1770,7 @@ static int dlm_listen_for_all(void)
 	listen_con.sock = sock;
 
 	sock->sk->sk_allocation = GFP_NOFS;
+	sock->sk->sk_use_task_frag = false;
 	sock->sk->sk_data_ready = lowcomms_listen_data_ready;
 	release_sock(sock->sk);
 
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 37d222bdfc8c..a07b24d170f2 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -1602,6 +1602,7 @@ static void o2net_start_connect(struct work_struct *work)
 	sc->sc_sock = sock; /* freed by sc_kref_release */
 
 	sock->sk->sk_allocation = GFP_ATOMIC;
+	sock->sk->sk_use_task_frag = false;
 
 	myaddr.sin_family = AF_INET;
 	myaddr.sin_addr.s_addr = mynode->nd_ipv4_address;
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 07db2f436d44..d9120f14684b 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -868,6 +868,7 @@ static int p9_socket_open(struct p9_client *client, struct socket *csocket)
 	}
 
 	csocket->sk->sk_allocation = GFP_NOIO;
+	csocket->sk->sk_use_task_frag = false;
 	file = sock_alloc_file(csocket, 0, NULL);
 	if (IS_ERR(file)) {
 		pr_err("%s (%d): failed to map fd\n",
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index dfa237fbd5a3..1d06e114ba3f 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -446,6 +446,7 @@ int ceph_tcp_connect(struct ceph_connection *con)
 	if (ret)
 		return ret;
 	sock->sk->sk_allocation = GFP_NOFS;
+	sock->sk->sk_use_task_frag = false;
 
 #ifdef CONFIG_LOCKDEP
 	lockdep_set_class(&sock->sk->sk_lock, &socket_class);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index c0506d0d7478..aaa5b2741b79 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1882,6 +1882,7 @@ static int xs_local_finish_connecting(struct rpc_xprt *xprt,
 		sk->sk_write_space = xs_udp_write_space;
 		sk->sk_state_change = xs_local_state_change;
 		sk->sk_error_report = xs_error_report;
+		sk->sk_use_task_frag = false;
 
 		xprt_clear_connected(xprt);
 
@@ -2082,6 +2083,7 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 		sk->sk_user_data = xprt;
 		sk->sk_data_ready = xs_data_ready;
 		sk->sk_write_space = xs_udp_write_space;
+		sk->sk_use_task_frag = false;
 
 		xprt_set_connected(xprt);
 
@@ -2249,6 +2251,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 		sk->sk_state_change = xs_tcp_state_change;
 		sk->sk_write_space = xs_tcp_write_space;
 		sk->sk_error_report = xs_error_report;
+		sk->sk_use_task_frag = false;
 
 		/* socket options */
 		sock_reset_flag(sk, SOCK_LINGER);
diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
index d6fece1ed982..74a54295c164 100644
--- a/net/xfrm/espintcp.c
+++ b/net/xfrm/espintcp.c
@@ -489,6 +489,7 @@ static int espintcp_init_sk(struct sock *sk)
 
 	/* avoid using task_frag */
 	sk->sk_allocation = GFP_ATOMIC;
+	sk->sk_use_task_frag = false;
 
 	return 0;
 
-- 
2.31.1


^ permalink raw reply related

* [PATCH net v4 1/3] net: Introduce sk_use_task_frag in struct sock.
From: Benjamin Coddington @ 2022-12-16 12:45 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev
In-Reply-To: <cover.1671194454.git.bcodding@redhat.com>

From: Guillaume Nault <gnault@redhat.com>

Sockets that can be used while recursing into memory reclaim, like
those used by network block devices and file systems, mustn't use
current->task_frag: if the current process is already using it, then
the inner memory reclaim call would corrupt the task_frag structure.

To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
that mustn't use current->task_frag, assuming that those used during
memory reclaim had their allocation constraints reflected in
->sk_allocation.

This unfortunately doesn't cover all cases: in an attempt to remove all
usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
->sk_allocation, and used memalloc_nofs critical sections instead.
This breaks the sk_page_frag() heuristic since the allocation
constraints are now stored in current->flags, which sk_page_frag()
can't read without risking triggering a cache miss and slowing down
TCP's fast path.

This patch creates a new field in struct sock, named sk_use_task_frag,
which sockets with memory reclaim constraints can set to false if they
can't safely use current->task_frag. In such cases, sk_page_frag() now
always returns the socket's page_frag (->sk_frag). The first user is
sunrpc, which needs to avoid using current->task_frag but can keep
->sk_allocation set to GFP_KERNEL otherwise.

Eventually, it might be possible to simplify sk_page_frag() by only
testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
heuristic entirely (assuming other sockets will set ->sk_use_task_frag
according to their constraints in the future).

The new ->sk_use_task_frag field is placed in a hole in struct sock and
belongs to a cache line shared with ->sk_shutdown. Therefore it should
be hot and shouldn't have negative performance impacts on TCP's fast
path (sk_shutdown is tested just before the while() loop in
tcp_sendmsg_locked()).

Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
---
 include/net/sock.h | 11 +++++++++--
 net/core/sock.c    |  1 +
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index ecea3dcc2217..fefe1f4abf19 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -318,6 +318,9 @@ struct sk_filter;
   *	@sk_stamp: time stamp of last packet received
   *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
   *	@sk_tsflags: SO_TIMESTAMPING flags
+  *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
+  *			   Sockets that can be used under memory reclaim should
+  *			   set this to false.
   *	@sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
   *	              for timestamping
   *	@sk_tskey: counter to disambiguate concurrent tstamp requests
@@ -512,6 +515,7 @@ struct sock {
 	u8			sk_txtime_deadline_mode : 1,
 				sk_txtime_report_errors : 1,
 				sk_txtime_unused : 6;
+	bool			sk_use_task_frag;
 
 	struct socket		*sk_socket;
 	void			*sk_user_data;
@@ -2561,14 +2565,17 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
  * socket operations and end up recursing into sk_page_frag()
  * while it's already in use: explicitly avoid task page_frag
  * usage if the caller is potentially doing any of them.
- * This assumes that page fault handlers use the GFP_NOFS flags.
+ * This assumes that page fault handlers use the GFP_NOFS flags or
+ * explicitly disable sk_use_task_frag.
  *
  * Return: a per task page_frag if context allows that,
  * otherwise a per socket one.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if ((sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC | __GFP_FS)) ==
+	if (sk->sk_use_task_frag &&
+	    (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
+				  __GFP_FS)) ==
 	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
 		return &current->task_frag;
 
diff --git a/net/core/sock.c b/net/core/sock.c
index d2587d8712db..f954d5893e79 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3390,6 +3390,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_rcvbuf		=	READ_ONCE(sysctl_rmem_default);
 	sk->sk_sndbuf		=	READ_ONCE(sysctl_wmem_default);
 	sk->sk_state		=	TCP_CLOSE;
+	sk->sk_use_task_frag	=	true;
 	sk_set_socket(sk, sock);
 
 	sock_set_flag(sk, SOCK_ZAPPED);
-- 
2.31.1


^ permalink raw reply related

* [PATCH net v4 3/3] net: simplify sk_page_frag
From: Benjamin Coddington @ 2022-12-16 12:45 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev
In-Reply-To: <cover.1671194454.git.bcodding@redhat.com>

Now that in-kernel socket users that may recurse during reclaim have benn
converted to sk_use_task_frag = false, we can have sk_page_frag() simply
check that value.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
---
 include/net/sock.h | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index fefe1f4abf19..dcd72e6285b2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2564,19 +2564,14 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
  * Both direct reclaim and page faults can nest inside other
  * socket operations and end up recursing into sk_page_frag()
  * while it's already in use: explicitly avoid task page_frag
- * usage if the caller is potentially doing any of them.
- * This assumes that page fault handlers use the GFP_NOFS flags or
- * explicitly disable sk_use_task_frag.
+ * when users disable sk_use_task_frag.
  *
  * Return: a per task page_frag if context allows that,
  * otherwise a per socket one.
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if (sk->sk_use_task_frag &&
-	    (sk->sk_allocation & (__GFP_DIRECT_RECLAIM | __GFP_MEMALLOC |
-				  __GFP_FS)) ==
-	    (__GFP_DIRECT_RECLAIM | __GFP_FS))
+	if (sk->sk_use_task_frag)
 		return &current->task_frag;
 
 	return &sk->sk_frag;
-- 
2.31.1


^ permalink raw reply related

* [PATCH net v4 0/3] Stop corrupting socket's task_frag
From: Benjamin Coddington @ 2022-12-16 12:45 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, David S. Miller, Eric Dumazet
  Cc: Guillaume Nault, Philipp Reisner, Lars Ellenberg,
	Christoph Böhmwalder, Jens Axboe, Josef Bacik, Keith Busch,
	Christoph Hellwig, Sagi Grimberg, Lee Duncan, Chris Leech,
	Mike Christie, James E.J. Bottomley, Martin K. Petersen,
	Valentina Manea, Shuah Khan, Greg Kroah-Hartman, David Howells,
	Marc Dionne, Steve French, Christine Caulfield, David Teigland,
	Mark Fasheh, Joel Becker, Joseph Qi, Eric Van Hensbergen,
	Latchesar Ionkov, Dominique Martinet, Ilya Dryomov, Xiubo Li,
	Chuck Lever, Jeff Layton, Trond Myklebust, Anna Schumaker,
	Steffen Klassert, Herbert Xu, netdev

The networking code uses flags in sk_allocation to determine if it can use
current->task_frag, however in-kernel users of sockets may stop setting
sk_allocation when they convert to the preferred memalloc_nofs_save/restore,
as SUNRPC has done in commit a1231fda7e94 ("SUNRPC: Set memalloc_nofs_save()
on all rpciod/xprtiod jobs").

This will cause corruption in current->task_frag when recursing into the
network layer for those subsystems during page fault or reclaim.  The
corruption is difficult to diagnose because stack traces may not contain the
offending subsystem at all.  The corruption is unlikely to show up in
testing because it requires memory pressure, and so subsystems that
convert to memalloc_nofs_save/restore are likely to continue to run into
this issue.

Previous reports and proposed fixes:
https://lore.kernel.org/netdev/96a18bd00cbc6cb554603cc0d6ef1c551965b078.1663762494.git.gnault@redhat.com/
https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
https://lore.kernel.org/linux-nfs/de6d99321d1dcaa2ad456b92b3680aa77c07a747.1665401788.git.gnault@redhat.com/

Guilluame Nault has done all of the hard work tracking this problem down and
finding the best fix for this issue.  I'm just taking a turn posting another
fix.

Changes on v2:
	- rebased on -net
	- set sk_use_task_frag = false for xfrm/espintcp.c

Changes on v3:
	- fixup comments in sock.h for kernel-doc

Changes on v4:
	- rebased on -net
	- sk_use_task_frag moved to hole after sk_txtime_unused
	- droppd afs/rxrpc.c hunk, not needed

Benjamin Coddington (2):
  Treewide: Stop corrupting socket's task_frag
  net: simplify sk_page_frag

Guillaume Nault (1):
  net: Introduce sk_use_task_frag in struct sock.

 drivers/block/drbd/drbd_receiver.c |  3 +++
 drivers/block/nbd.c                |  1 +
 drivers/nvme/host/tcp.c            |  1 +
 drivers/scsi/iscsi_tcp.c           |  1 +
 drivers/usb/usbip/usbip_common.c   |  1 +
 fs/cifs/connect.c                  |  1 +
 fs/dlm/lowcomms.c                  |  2 ++
 fs/ocfs2/cluster/tcp.c             |  1 +
 include/net/sock.h                 | 10 ++++++----
 net/9p/trans_fd.c                  |  1 +
 net/ceph/messenger.c               |  1 +
 net/core/sock.c                    |  1 +
 net/sunrpc/xprtsock.c              |  3 +++
 net/xfrm/espintcp.c                |  1 +
 14 files changed, 24 insertions(+), 4 deletions(-)

-- 
2.31.1

^ permalink raw reply

* Re: [PATCH iproute2-next 6/6] bridge: mdb: Add replace support
From: Nikolay Aleksandrov @ 2022-12-16 12:06 UTC (permalink / raw)
  To: Ido Schimmel, netdev; +Cc: dsahern, stephen, mlxsw
In-Reply-To: <20221215175230.1907938-7-idosch@nvidia.com>

On 15/12/2022 19:52, Ido Schimmel wrote:
> Allow user space to replace MDB port group entries by specifying the
> 'NLM_F_REPLACE' flag in the netlink message header.
> 
> Examples:
> 
>  # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.2 filter_mode include
>  # bridge -d -s mdb show
>  dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.2 permanent filter_mode include proto static     0.00
>  dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto static     0.00
>  dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode include source_list 192.0.2.2/0.00,192.0.2.1/0.00 proto static     0.00
> 
>  # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra
>  # bridge -d -s mdb show
>  dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra  blocked    0.00
>  dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra  blocked    0.00
>  dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra     0.00
> 
>  # bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 temp source_list 192.0.2.4,192.0.2.3 filter_mode include proto bgp
>  # bridge -d -s mdb show
>  dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.4 temp filter_mode include proto bgp     0.00
>  dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 temp filter_mode include proto bgp     0.00
>  dev br0 port dummy10 grp 239.1.1.1 temp filter_mode include source_list 192.0.2.4/259.44,192.0.2.3/259.44 proto bgp     0.00
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  bridge/mdb.c      |  4 +++-
>  man/man8/bridge.8 | 13 ++++++++++---
>  2 files changed, 13 insertions(+), 4 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>



^ permalink raw reply

* Re: [PATCH iproute2-next 5/6] bridge: mdb: Add routing protocol support
From: Nikolay Aleksandrov @ 2022-12-16 12:06 UTC (permalink / raw)
  To: Ido Schimmel, netdev; +Cc: dsahern, stephen, mlxsw
In-Reply-To: <20221215175230.1907938-6-idosch@nvidia.com>

On 15/12/2022 19:52, Ido Schimmel wrote:
> Allow user space to specify the routing protocol of the MDB port group
> entry by adding the 'MDBE_ATTR_RTPROT' attribute to the
> 'MDBA_SET_ENTRY_ATTRS' nest.
> 
> Examples:
> 
>  # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto zebra
> 
>  # bridge mdb add dev br0 port dummy10 grp 239.1.1.2 permanent
> 
>  # bridge -d mdb show
>  dev br0 port dummy10 grp 239.1.1.2 permanent filter_mode exclude proto static
>  dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude proto zebra
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  bridge/mdb.c      | 28 ++++++++++++++++++++++++++--
>  man/man8/bridge.8 | 12 +++++++++++-
>  2 files changed, 37 insertions(+), 3 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>



^ permalink raw reply

* Re: [PATCH iproute2-next 4/6] bridge: mdb: Add source list support
From: Nikolay Aleksandrov @ 2022-12-16 12:05 UTC (permalink / raw)
  To: Ido Schimmel, netdev; +Cc: dsahern, stephen, mlxsw
In-Reply-To: <20221215175230.1907938-5-idosch@nvidia.com>

On 15/12/2022 19:52, Ido Schimmel wrote:
> Allow user space to specify the source list of (*, G) entries by adding
> the 'MDBE_ATTR_SRC_LIST' attribute to the 'MDBA_SET_ENTRY_ATTRS' nest.
> 
> Example:
> 
>  # bridge mdb add dev br0 port dummy10 grp 239.1.1.1 temp source_list 198.51.100.1,198.51.100.2 filter_mode exclude
> 
>  # bridge -d -s mdb show
>  dev br0 port dummy10 grp 239.1.1.1 src 198.51.100.2 temp filter_mode include proto static  blocked    0.00
>  dev br0 port dummy10 grp 239.1.1.1 src 198.51.100.1 temp filter_mode include proto static  blocked    0.00
>  dev br0 port dummy10 grp 239.1.1.1 temp filter_mode exclude source_list 198.51.100.2/0.00,198.51.100.1/0.00 proto static   256.42
> 
>  # bridge -j -p -d -s mdb show
>  [ {
>          "mdb": [ {
>                  "index": 10,
>                  "dev": "br0",
>                  "port": "dummy10",
>                  "grp": "239.1.1.1",
>                  "src": "198.51.100.2",
>                  "state": "temp",
>                  "filter_mode": "include",
>                  "protocol": "static",
>                  "flags": [ "blocked" ],
>                  "timer": "   0.00"
>              },{
>                  "index": 10,
>                  "dev": "br0",
>                  "port": "dummy10",
>                  "grp": "239.1.1.1",
>                  "src": "198.51.100.1",
>                  "state": "temp",
>                  "filter_mode": "include",
>                  "protocol": "static",
>                  "flags": [ "blocked" ],
>                  "timer": "   0.00"
>              },{
>              },{
>                  "index": 10,
>                  "dev": "br0",
>                  "port": "dummy10",
>                  "grp": "239.1.1.1",
>                  "state": "temp",
>                  "filter_mode": "exclude",
>                  "source_list": [ {
>                          "address": "198.51.100.2",
>                          "timer": "0.00"
>                      },{
>                          "address": "198.51.100.1",
>                          "timer": "0.00"
>                      } ],
>                  "protocol": "static",
>                  "flags": [ ],
>                  "timer": " 251.19"
>              } ],
>          "router": {}
>      } ]
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  bridge/mdb.c      | 58 ++++++++++++++++++++++++++++++++++++++++++++++-
>  man/man8/bridge.8 | 11 ++++++++-
>  2 files changed, 67 insertions(+), 2 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>



^ permalink raw reply

* Re: [PATCH iproute2-next 3/6] bridge: mdb: Add filter mode support
From: Nikolay Aleksandrov @ 2022-12-16 12:00 UTC (permalink / raw)
  To: Ido Schimmel, netdev; +Cc: dsahern, stephen, mlxsw
In-Reply-To: <20221215175230.1907938-4-idosch@nvidia.com>

On 15/12/2022 19:52, Ido Schimmel wrote:
> Allow user space to specify the filter mode of (*, G) entries by adding
> the 'MDBE_ATTR_GROUP_MODE' attribute to the 'MDBA_SET_ENTRY_ATTRS' nest.
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  bridge/mdb.c      | 27 ++++++++++++++++++++++++++-
>  man/man8/bridge.8 |  8 +++++++-
>  2 files changed, 33 insertions(+), 2 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>



^ permalink raw reply

* Re: imx7: USB modem reset causes modem to not re-connect
From: Fabio Estevam @ 2022-12-16 12:00 UTC (permalink / raw)
  To: Jun Li
  Cc: bjorn@mork.no, Peter Chen, Marek Vasut, netdev, USB list,
	Alexander Stein, Schrempf Frieder
In-Reply-To: <PA4PR04MB9640B1C33E8D5704885A9A3B89E39@PA4PR04MB9640.eurprd04.prod.outlook.com>

Hi Li Jun,

On Tue, Dec 13, 2022 at 8:17 AM Jun Li <jun.li@nxp.com> wrote:

> What's the OC polarity config in your SW, active low, or active high?
> Basically if the OC condition is active, the host mode cannot work
> well.

Yes, if I keep the OC pinctrl definition and pass
'over-current-active-low;' the problem
does not happen.

Thanks a lot,

Fabio Estevam

^ permalink raw reply

* Re: [PATCH iproute2-next 2/6] bridge: mdb: Split source parsing to a separate function
From: Nikolay Aleksandrov @ 2022-12-16 11:59 UTC (permalink / raw)
  To: Ido Schimmel, netdev; +Cc: dsahern, stephen, mlxsw
In-Reply-To: <20221215175230.1907938-3-idosch@nvidia.com>

On 15/12/2022 19:52, Ido Schimmel wrote:
> Currently, the only attribute inside the 'MDBA_SET_ENTRY_ATTRS' nest is
> 'MDBE_ATTR_SOURCE', but subsequent patches are going to add more
> attributes to the nest.
> 
> Prepare for the addition of these attributes by splitting the parsing of
> individual attributes inside the nest to separate functions.
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  bridge/mdb.c | 34 ++++++++++++++++++++++++----------
>  1 file changed, 24 insertions(+), 10 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>

^ permalink raw reply

* Re: [PATCH iproute2-next 1/6] bridge: mdb: Use a boolean to indicate nest is required
From: Nikolay Aleksandrov @ 2022-12-16 11:59 UTC (permalink / raw)
  To: Ido Schimmel, netdev; +Cc: dsahern, stephen, mlxsw
In-Reply-To: <20221215175230.1907938-2-idosch@nvidia.com>

On 15/12/2022 19:52, Ido Schimmel wrote:
> Currently, the only attribute inside the 'MDBA_SET_ENTRY_ATTRS' nest is
> 'MDBE_ATTR_SOURCE', but subsequent patches are going to add more
> attributes to the nest.
> 
> Prepare for the addition of these attributes by determining the
> necessity of the nest from a boolean variable that is set whenever one
> of these attributes is parsed. This avoids the need to have one long
> condition that checks for the presence of one of the individual
> attributes.
> 
> Signed-off-by: Ido Schimmel <idosch@nvidia.com>
> ---
>  bridge/mdb.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/bridge/mdb.c b/bridge/mdb.c
> index d3afc900e798..4ae91f15dac8 100644
> --- a/bridge/mdb.c
> +++ b/bridge/mdb.c
> @@ -488,6 +488,7 @@ static int mdb_modify(int cmd, int flags, int argc, char **argv)
>  	};
>  	char *d = NULL, *p = NULL, *grp = NULL, *src = NULL;
>  	struct br_mdb_entry entry = {};
> +	bool set_attrs = false;
>  	short vid = 0;
>  
>  	while (argc > 0) {
> @@ -511,6 +512,7 @@ static int mdb_modify(int cmd, int flags, int argc, char **argv)
>  		} else if (strcmp(*argv, "src") == 0) {
>  			NEXT_ARG();
>  			src = *argv;
> +			set_attrs = true;
>  		} else {
>  			if (matches(*argv, "help") == 0)
>  				usage();
> @@ -538,7 +540,7 @@ static int mdb_modify(int cmd, int flags, int argc, char **argv)
>  
>  	entry.vid = vid;
>  	addattr_l(&req.n, sizeof(req), MDBA_SET_ENTRY, &entry, sizeof(entry));
> -	if (src) {
> +	if (set_attrs) {
>  		struct rtattr *nest = addattr_nest(&req.n, sizeof(req),
>  						   MDBA_SET_ENTRY_ATTRS);
>  		struct in6_addr src_ip6;

Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>


^ permalink raw reply

* Re: [PATCH v2 6/9] net: phy: motorcomm: Add YT8531 phy support
From: Heiner Kallweit @ 2022-12-16 11:58 UTC (permalink / raw)
  To: Yanhong Wang, linux-riscv, netdev, devicetree, linux-kernel
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Rob Herring, Krzysztof Kozlowski, Emil Renner Berthing,
	Richard Cochran, Andrew Lunn, Peter Geis
In-Reply-To: <20221216070632.11444-7-yanhong.wang@starfivetech.com>

On 16.12.2022 08:06, Yanhong Wang wrote:
> This adds basic support for the Motorcomm YT8531
> Gigabit Ethernet PHY.
> 
> Signed-off-by: Yanhong Wang <yanhong.wang@starfivetech.com>
> ---
>  drivers/net/phy/Kconfig     |   3 +-
>  drivers/net/phy/motorcomm.c | 202 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 204 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
> index c57a0262fb64..86399254d9ff 100644
> --- a/drivers/net/phy/Kconfig
> +++ b/drivers/net/phy/Kconfig
> @@ -258,9 +258,10 @@ config MICROSEMI_PHY
>  
>  config MOTORCOMM_PHY
>  	tristate "Motorcomm PHYs"
> +	default SOC_STARFIVE

Both are completely independent. This default should be removed.

>  	help
>  	  Enables support for Motorcomm network PHYs.
> -	  Currently supports the YT8511 gigabit PHY.
> +	  Currently supports the YT8511 and YT8531 gigabit PHYs.
>  

This doesn't apply. Parts of your patch exist already in net-next.
Support for YT8531S has been added in the meantime. Please rebase
your patch on net-next and annotate your patch as net-next.

>  config NATIONAL_PHY
>  	tristate "National Semiconductor PHYs"
> diff --git a/drivers/net/phy/motorcomm.c b/drivers/net/phy/motorcomm.c
> index 7e6ac2c5e27e..bca03185b338 100644
> --- a/drivers/net/phy/motorcomm.c
> +++ b/drivers/net/phy/motorcomm.c
> @@ -3,13 +3,17 @@
>   * Driver for Motorcomm PHYs
>   *
>   * Author: Peter Geis <pgwipeout@gmail.com>
> + *
>   */
>  
> +#include <linux/bitops.h>
>  #include <linux/kernel.h>
>  #include <linux/module.h>
> +#include <linux/of.h>
>  #include <linux/phy.h>
>  
>  #define PHY_ID_YT8511		0x0000010a
> +#define PHY_ID_YT8531		0x4f51e91b
>  
>  #define YT8511_PAGE_SELECT	0x1e
>  #define YT8511_PAGE		0x1f
> @@ -17,6 +21,10 @@
>  #define YT8511_EXT_DELAY_DRIVE	0x0d
>  #define YT8511_EXT_SLEEP_CTRL	0x27
>  
> +#define YTPHY_EXT_SMI_SDS_PHY		0xa000
> +#define YTPHY_EXT_CHIP_CONFIG		0xa001
> +#define YTPHY_EXT_RGMII_CONFIG1	0xa003
> +
>  /* 2b00 25m from pll
>   * 2b01 25m from xtl *default*
>   * 2b10 62.m from pll
> @@ -38,6 +46,51 @@
>  #define YT8511_DELAY_FE_TX_EN	(0xf << 12)
>  #define YT8511_DELAY_FE_TX_DIS	(0x2 << 12)
>  
> +struct ytphy_reg_field {
> +	char *name;
> +	u32 mask;
> +	u8	dflt;	/* Default value */
> +};
> +
> +struct ytphy_priv_t {
> +	u32 tx_inverted_1000;
> +	u32 tx_inverted_100;
> +	u32 tx_inverted_10;
> +};
> +
> +/* rx_delay_sel: RGMII rx clock delay train configuration, about 150ps per
> + *               step. Delay = 150ps * N
> + *
> + * tx_delay_sel_fe: RGMII tx clock delay train configuration when speed is
> + *                  100Mbps or 10Mbps, it's 150ps per step. Delay = 150ps * N
> + *
> + * tx_delay_sel: RGMII tx clock delay train configuration when speed is
> + *               1000Mbps, it's 150ps per step. Delay = 150ps * N
> + */
> +static const struct ytphy_reg_field ytphy_rxtxd_grp[] = {
> +	{ "rx_delay_sel", GENMASK(13, 10), 0x0 },
> +	{ "tx_delay_sel_fe", GENMASK(7, 4), 0xf },
> +	{ "tx_delay_sel", GENMASK(3, 0), 0x1 }
> +};
> +
> +/* tx_inverted_x: Use original or inverted RGMII TX_CLK to drive the RGMII
> + *                TX_CLK delay train configuration when speed is
> + *                xMbps(10/100/1000Mbps).
> + *                0: original,  1: inverted
> + */
> +static const struct ytphy_reg_field ytphy_txinver_grp[] = {
> +	{ "tx_inverted_1000", BIT(14), 0x0 },
> +	{ "tx_inverted_100", BIT(14), 0x0 },
> +	{ "tx_inverted_10", BIT(14), 0x0 }

Copy & Paste error that mask is the same for all entries?

> +};
> +
> +/* rxc_dly_en: RGMII clk 2ns delay control bit.
> + *             0: disable   1: enable
> + */
> +static const struct ytphy_reg_field ytphy_rxden_grp[] = {
> +	{ "rxc_dly_en", BIT(8), 0x1 }
> +};
> +
>  static int yt8511_read_page(struct phy_device *phydev)
>  {
>  	return __phy_read(phydev, YT8511_PAGE_SELECT);
> @@ -48,6 +101,33 @@ static int yt8511_write_page(struct phy_device *phydev, int page)
>  	return __phy_write(phydev, YT8511_PAGE_SELECT, page);
>  };
>  
> +static int ytphy_read_ext(struct phy_device *phydev, u32 regnum)
> +{
> +	int ret;
> +	int val;
> +
> +	ret = __phy_write(phydev, YT8511_PAGE_SELECT, regnum);
> +	if (ret < 0)
> +		return ret;
> +
> +	val = __phy_read(phydev, YT8511_PAGE);
> +
> +	return val;
> +}
> +
> +static int ytphy_write_ext(struct phy_device *phydev, u32 regnum, u16 val)
> +{
> +	int ret;
> +
> +	ret = __phy_write(phydev, YT8511_PAGE_SELECT, regnum);
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = __phy_write(phydev, YT8511_PAGE, val);
> +
> +	return ret;
> +}
> +
>  static int yt8511_config_init(struct phy_device *phydev)
>  {
>  	int oldpage, ret = 0;
> @@ -111,6 +191,116 @@ static int yt8511_config_init(struct phy_device *phydev)
>  	return phy_restore_page(phydev, oldpage, ret);
>  }
>  
> +static int ytphy_config_init(struct phy_device *phydev)
> +{
> +	struct device_node *of_node;
> +	u32 val;
> +	u32 mask;
> +	u32 cfg;
> +	int ret;
> +	int i = 0;
> +
> +	of_node = phydev->mdio.dev.of_node;
> +	if (of_node) {
> +		ret = of_property_read_u32(of_node, ytphy_rxden_grp[0].name, &cfg);
> +		if (!ret) {
> +			mask = ytphy_rxden_grp[0].mask;
> +			val = ytphy_read_ext(phydev, YTPHY_EXT_CHIP_CONFIG);
> +
> +			/* check the cfg overflow or not */
> +			cfg = cfg > mask >> (ffs(mask) - 1) ? mask : cfg;
> +
> +			val &= ~mask;
> +			val |= FIELD_PREP(mask, cfg);
> +			ytphy_write_ext(phydev, YTPHY_EXT_CHIP_CONFIG, val);

This is the unlocked version. MDIO bus locking is missing.

> +		}
> +
> +		val = ytphy_read_ext(phydev, YTPHY_EXT_RGMII_CONFIG1);
> +		for (i = 0; i < ARRAY_SIZE(ytphy_rxtxd_grp); i++) {
> +			ret = of_property_read_u32(of_node, ytphy_rxtxd_grp[i].name, &cfg);
> +			if (!ret) {
> +				mask = ytphy_rxtxd_grp[i].mask;
> +
> +				/* check the cfg overflow or not */
> +				cfg = cfg > mask >> (ffs(mask) - 1) ? mask : cfg;
> +
> +				val &= ~mask;
> +				val |= cfg << (ffs(mask) - 1);
> +			}
> +		}
> +		return ytphy_write_ext(phydev, YTPHY_EXT_RGMII_CONFIG1, val);
> +	}
> +
> +	phydev_err(phydev, "Get of node fail\n");
> +

Please consider that the PHY may be used on non-DT systems.

> +	return -EINVAL;
> +}
> +
> +static void ytphy_link_change_notify(struct phy_device *phydev)
> +{
> +	u32 val;
> +	struct ytphy_priv_t *ytphy_priv = phydev->priv;
> +
> +	if (phydev->speed < 0)
> +		return;
> +
> +	val = ytphy_read_ext(phydev, YTPHY_EXT_RGMII_CONFIG1);
> +	switch (phydev->speed) {
> +	case SPEED_1000:
> +		val  &= ~ytphy_txinver_grp[0].mask;
> +		val |= FIELD_PREP(ytphy_txinver_grp[0].mask,
> +				ytphy_priv->tx_inverted_1000);
> +		break;
> +
> +	case SPEED_100:
> +		val  &= ~ytphy_txinver_grp[1].mask;
> +		val |= FIELD_PREP(ytphy_txinver_grp[1].mask,
> +				ytphy_priv->tx_inverted_100);
> +		break;
> +
> +	case SPEED_10:
> +		val  &= ~ytphy_txinver_grp[2].mask;
> +		val |= FIELD_PREP(ytphy_txinver_grp[2].mask,
> +				ytphy_priv->tx_inverted_10);
> +		break;
> +
> +	default:
> +		break;
> +	}
> +
> +	ytphy_write_ext(phydev, YTPHY_EXT_RGMII_CONFIG1, val);
> +}
> +
> +static int yt8531_probe(struct phy_device *phydev)
> +{
> +	struct ytphy_priv_t *priv;
> +	const struct device_node *of_node;
> +	u32 val;
> +	int ret;
> +
> +	priv = devm_kzalloc(&phydev->mdio.dev, sizeof(*priv), GFP_KERNEL);
> +	if (!priv)
> +		return -ENOMEM;
> +
> +	of_node = phydev->mdio.dev.of_node;
> +	if (of_node) {
> +		ret = of_property_read_u32(of_node, ytphy_txinver_grp[0].name, &val);
> +		if (!ret)
> +			priv->tx_inverted_1000 = val;
> +
> +		ret = of_property_read_u32(of_node, ytphy_txinver_grp[1].name, &val);
> +		if (!ret)
> +			priv->tx_inverted_100 = val;
> +
> +		ret = of_property_read_u32(of_node, ytphy_txinver_grp[2].name, &val);
> +		if (!ret)
> +			priv->tx_inverted_10 = val;
> +	}
> +	phydev->priv = priv;
> +
> +	return 0;
> +}
> +
>  static struct phy_driver motorcomm_phy_drvs[] = {
>  	{
>  		PHY_ID_MATCH_EXACT(PHY_ID_YT8511),
> @@ -120,6 +310,17 @@ static struct phy_driver motorcomm_phy_drvs[] = {
>  		.resume		= genphy_resume,
>  		.read_page	= yt8511_read_page,
>  		.write_page	= yt8511_write_page,
> +	}, {
> +		PHY_ID_MATCH_EXACT(PHY_ID_YT8531),
> +		.name		= "YT8531 Gigabit Ethernet",
> +		.probe		= yt8531_probe,
> +		.config_init	= ytphy_config_init,
> +		.read_status	= genphy_read_status,
> +		.suspend	= genphy_suspend,
> +		.resume		= genphy_resume,
> +		.read_page	= yt8511_read_page,
> +		.write_page	= yt8511_write_page,
> +		.link_change_notify = ytphy_link_change_notify,
>  	},
>  };
>  
> @@ -131,6 +332,7 @@ MODULE_LICENSE("GPL");
>  
>  static const struct mdio_device_id __maybe_unused motorcomm_tbl[] = {
>  	{ PHY_ID_MATCH_EXACT(PHY_ID_YT8511) },
> +	{ PHY_ID_MATCH_EXACT(PHY_ID_YT8531) },
>  	{ /* sentinal */ }
>  };
>  


^ permalink raw reply

* Re: [PATCH v2 1/3] dsa: marvell: Provide per device information about max frame size
From: Lukasz Majewski @ 2022-12-16 11:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, Vladimir Oltean, Vivien Didelot, Florian Fainelli,
	David S. Miller, Russell King, netdev, linux-kernel
In-Reply-To: <20221215202017.4432ef25@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1245 bytes --]

Hi Jakub,

> On Thu, 15 Dec 2022 15:45:34 +0100 Lukasz Majewski wrote:
> > Different Marvell DSA switches support different size of max frame
> > bytes to be sent.
> > 
> > For example mv88e6185 supports max 1632 bytes, which is now
> > in-driver standard value. On the other hand - mv88e6250 supports
> > 2048 bytes.
> > 
> > As this value is internal and may be different for each switch IC,
> > new entry in struct mv88e6xxx_info has been added to store it.  
> 
> # Form letter - net-next is closed
> 

I see....

> We have already submitted the networking pull request to Linus
> for v6.2 and therefore net-next is closed for new drivers, features,
> code refactoring and optimizations. We are currently accepting
> bug fixes only.

Ok.

> 
> Please repost when net-next reopens after Jan 2nd.
> 
> RFC patches sent for review only are obviously welcome at any time.

I hope that the discussion regarding those patches will be done by this
time.


Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: lukma@denx.de

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH] ipvs: use div_s64 for signed division
From: Arnd Bergmann @ 2022-12-16 11:24 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Julian Anastasov
  Cc: Arnd Bergmann, Simon Horman, Jakub Kicinski, Paolo Abeni,
	Jiri Wiesner, Netdev, lvs-devel, netfilter-devel, coreteam,
	linux-kernel
In-Reply-To: <Y5xEITNJkry8uy/h@salvia>

On Fri, Dec 16, 2022, at 11:10, Pablo Neira Ayuso wrote:
> Hi Julian,
>
> On Thu, Dec 15, 2022 at 09:01:59PM +0200, Julian Anastasov wrote:
>> 
>> 	Hello,
>> 
>> On Thu, 15 Dec 2022, Arnd Bergmann wrote:
>> 
>> > From: Arnd Bergmann <arnd@arndb.de>
>> > 
>> > do_div() is only well-behaved for positive numbers, and now warns
>> > when the first argument is a an s64:
>> > 
>> > net/netfilter/ipvs/ip_vs_est.c: In function 'ip_vs_est_calc_limits':
>> > include/asm-generic/div64.h:222:35: error: comparison of distinct pointer types lacks a cast [-Werror]
>> >   222 |         (void)(((typeof((n)) *)0) == ((uint64_t *)0));  \
>> >       |                                   ^~
>> > net/netfilter/ipvs/ip_vs_est.c:694:17: note: in expansion of macro 'do_div'
>> >   694 |                 do_div(val, loops);
>> 
>> 	net-next already contains fix for this warning
>> and changes val to u64.
>
> Arnd's patch applies fine on top of net-next, maybe he is addressing
> something else?

No, it's the same bug. I had prepared my patch before the other fix
went in, and only one of the two is needed.

FWIW, I find my version slightly more readable, but Jakub's fix
is probably more efficient, because the unsigned 64-bit division
is better optimized on 32-bit, while div_s64() goes through an
extern function.

     Arnd

^ permalink raw reply

* Re: [PATCH v1] iavfs/iavf_main: actually log ->src mask when talking about it
From: Jiri Pirko @ 2022-12-16 11:20 UTC (permalink / raw)
  To: Daniil Tatianin
  Cc: Jesse Brandeburg, Tony Nguyen, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Harshitha Ramamurthy, Jeff Kirsher, intel-wired-lan,
	netdev, linux-kernel
In-Reply-To: <20221216091326.1457454-1-d-tatianin@yandex-team.ru>

Fri, Dec 16, 2022 at 10:13:26AM CET, d-tatianin@yandex-team.ru wrote:
>This fixes a copy-paste issue where dev_err would log the dst mask even
>though it is clearly talking about src.
>
>Found by Linux Verification Center (linuxtesting.org) with the SVACE
>static analysis tool.
>
>Fixes: 0075fa0fadd0a ("i40evf: Add support to apply cloud filters")
>Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>

Reviewed-by: Jiri Pirko <jiri@nvidia.com>

^ permalink raw reply

* Re: [PATCH v2] net: Fix documentation for unregister_netdevice_notifier_net
From: Jiri Pirko @ 2022-12-16 11:19 UTC (permalink / raw)
  To: Miaoqian Lin
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Sebastian Andrzej Siewior, Menglong Dong, Kuniyuki Iwashima,
	Petr Machata, Jiri Pirko, netdev, linux-kernel
In-Reply-To: <20221216094838.683379-1-linmq006@gmail.com>

Fri, Dec 16, 2022 at 10:48:35AM CET, linmq006@gmail.com wrote:
>unregister_netdevice_notifier_net() is used for unregister a notifier
>registered by register_netdevice_notifier_net(). Also s/into/from/.
>
>Fixes: a30c7b429f2d ("net: introduce per-netns netdevice notifiers")

Hmm, I believe that comment fixes should not contain the fixes tag not
to confuse anyone (bot) as this is not fixing a bug. This is net-next
tree material, please mark it as such in the patch subject line:
[PATCH net-next v2] ...

Reviewed-by: Jiri Pirko <jiri@nvidia.com>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox