netdev.vger.kernel.org archive mirror
* [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt
@ 2024-10-03 16:06 Taehee Yoo
  2024-10-03 16:06 ` [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command Taehee Yoo
                   ` (7 more replies)
  0 siblings, 8 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 16:06 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan
  Cc: kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley,
	ap420073

This series implements device memory TCP for bnxt_en driver and
necessary ethtool command implementations.

NICs that use the bnxt_en driver support the tcp-data-split feature under
the name HDS (header-data-split).
But there was no way to enable or disable HDS via ethtool.
Only getting the current HDS status was implemented, and HDS was
automatically enabled only when either LRO, HW-GRO, or JUMBO was enabled.
The hds_threshold followed the rx-copybreak value and wasn't changeable.

Currently, the bnxt_en driver enables tcp-data-split by default, but it
does not always take effect.
There is an hds_threshold value: if a received packet is larger than this
value, the packet is split into header and data.
The hds_threshold value has been fixed at 256, which is also the default
rx-copybreak value.
The rx-copybreak value hasn't been allowed to change, so neither has
hds_threshold.

This patchset first decouples hds_threshold from rx-copybreak, and then
makes tcp-data-split, rx-copybreak, and
tcp-data-split-thresh (hds_threshold) configurable independently.

But the default configuration is the same: the default value of
rx-copybreak is 256, and the default tcp-data-split-thresh is also 256.

There are several related features:
TPA (HW-GRO, LRO), JUMBO, jumbo_thresh (a firmware command parameter),
and the aggregation ring.

The aggregation ring is fundamental to all of these features.
When GRO/LRO/jumbo packets are received, the NIC receives the first
packet from the normal ring; the following packets come from the
aggregation ring.

These features work regardless of HDS.
When TPA is enabled and HDS is disabled, the first packet contains both
header and payload, and the following packets contain payload only.
If HDS is enabled, the first packet contains the header only, and the
following packets contain only payload.
So, HW-GRO/LRO works regardless of HDS.

There is another threshold value, jumbo_thresh.
It is very similar to hds_thresh, but jumbo_thresh does not split header
and data; it simply splits the first and following data based on length.
When the NIC receives a 1500-byte packet and jumbo_thresh is 256 (the
default, which follows rx-copybreak), the first buffer holds 256 bytes
and the following buffers hold the remaining 1500-256 bytes.

Before this patch, the aggregation ring was enabled if at least one of
the GRO, LRO, and JUMBO flags was enabled.
If the aggregation ring is enabled, both hds_threshold and
jumbo_thresh are set to the default rx-copybreak value.

So, GRO, LRO, and JUMBO frames larger than 256 bytes are split into
header and data if the protocol is TCP or UDP.
For other protocols, jumbo_thresh applies instead of hds_thresh.

This means that tcp-data-split relied on the GRO, LRO, and JUMBO flags.
With this patch, tcp-data-split no longer relies on these flags:
if tcp-data-split is enabled, the aggregation ring will be
enabled.
Also, hds_threshold no longer follows the rx-copybreak value; it is
set to the tcp-data-split-thresh value given by user space, but the
default value is still 256.

If the protocol is TCP or UDP, HDS is disabled, and the aggregation
ring is enabled, a packet will still be split into several pieces due to
jumbo_thresh.

When XDP is attached, tcp-data-split is automatically disabled.

LRO, GRO, and JUMBO were tested with BCM57414 and BCM57504, with
firmware version 230.0.157.0.
I couldn't find any specification of the minimum and maximum values
of hds_threshold, but from my test results the working range is about
0 ~ 1023.
That is, packets larger than 1023 bytes are split into header and data
whenever tcp-data-split is enabled, regardless of the hds_threshold
value: when hds_threshold is 1500 and the received packet size is 1400,
HDS should not be activated, but it is.
The maximum value of hds_threshold (tcp-data-split-thresh) is set to
256 because that value is known to work.
It was decided very conservatively.

I verified that tcp-data-split (HDS) works independently of GRO, LRO,
and JUMBO, testing GRO/LRO and JUMBO with HDS both enabled and disabled.
I also verified that tcp-data-split is disabled automatically when XDP
is attached and cannot be re-enabled while XDP is attached.
I tested ranges of values from min to max for tcp-data-split-thresh
(0 to 256) and rx-copybreak (65 to 256), and it works.
While testing this patchset, I checked the skb->data, skb->data_len, and
nr_frags values.

The first patch implements .{set,get}_tunable() in the bnxt_en driver.
The driver has supported the rx-copybreak feature, but it was not
configurable; only the default rx-copybreak value worked.
This patch makes the rx-copybreak value configurable in the bnxt_en
driver.

The second patch adds an implementation of the tcp-data-split ethtool
command.
HDS relies on the aggregation ring, which is automatically enabled
when either LRO, GRO, or a large MTU is configured.
So, if the aggregation ring is enabled, HDS is automatically enabled
along with it.

The third patch adds the tcp-data-split-thresh command to ethtool.
This threshold indicates that if a received packet is larger than this
value, the packet's header and payload will be split.
Example:
   # ethtool -G <interface name> tcp-data-split-thresh <value>
This option cannot be used when tcp-data-split is disabled or not
supported.
   # ethtool -G enp14s0f0np0 tcp-data-split on tcp-data-split-thresh 256
   # ethtool -g enp14s0f0np0
   Ring parameters for enp14s0f0np0:
   Pre-set maximums:
   ...
   Current hardware settings:
   ...
   TCP data split:         on
   TCP data split thresh:  256

   # ethtool -G enp14s0f0np0 tcp-data-split off
   # ethtool -g enp14s0f0np0
   Ring parameters for enp14s0f0np0:
   Pre-set maximums:
   ...
   Current hardware settings:
   ...
   TCP data split:         off
   TCP data split thresh:  n/a

The fourth patch adds the implementation of tcp-data-split-thresh logic
in the bnxt_en driver.
The default value is 256, which used to be the default rx-copybreak
value.

The fifth and sixth patches add condition checks for devmem and ethtool.
If tcp-data-split is disabled or the threshold value is not zero, devmem
setup will fail.
Also, tcp-data-split and tcp-data-split-thresh cannot be changed
while devmem is running.

The last patch implements device memory TCP for the bnxt_en driver.
It mostly converts the generic page_pool API to the netmem page_pool API.

No dependencies exist between device memory TCP and GRO/LRO/MTU.
Only tcp-data-split and tcp-data-split-thresh need to be configured when
using device memory TCP.
While devmem TCP is set up, tcp-data-split and tcp-data-split-thresh
can't be updated because the core API disallows the change.

I tested interface up/down while devmem TCP was running; it works well.
Channel count changes and rx/tx ring size changes also work.

The devmem TCP test NIC was a BCM57504.

All necessary configuration validations exist at the core API level.

Note that with this patchset alone, device memory TCP setup will fail
for bnxt, because the ethtool utility does not yet support the
tcp-data-split-thresh command.
tcp-data-split-thresh must be 0 to set up device memory TCP, while the
bnxt default is 256.
So, for bnxt, setup always fails until ethtool supports the
tcp-data-split-thresh command.

ncdevmem.c will be updated after ethtool supports the
tcp-data-split-thresh option.

v3:
 - Change headline
 - Add condition checks for ethtool and devmem
 - Fix documentation
 - Move validation of tcp-data-split and thresh from driver to core API
 - Add implementation of device memory TCP for bnxt_en driver

v2:
 - Add tcp-data-split-thresh ethtool command
 - Implement tcp-data-split-thresh in the bnxt_en driver
 - Define min/max rx-copybreak value
 - Update commit message

Taehee Yoo (7):
  bnxt_en: add support for rx-copybreak ethtool command
  bnxt_en: add support for tcp-data-split ethtool command
  net: ethtool: add support for configuring tcp-data-split-thresh
  bnxt_en: add support for tcp-data-split-thresh ethtool command
  net: devmem: add ring parameter filtering
  net: ethtool: add ring parameter filtering
  bnxt_en: add support for device memory tcp

 Documentation/netlink/specs/ethtool.yaml      |   8 ++
 Documentation/networking/ethtool-netlink.rst  |  75 ++++++----
 drivers/net/ethernet/broadcom/Kconfig         |   1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 130 +++++++++++-------
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  15 +-
 .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c |  78 ++++++++++-
 include/linux/ethtool.h                       |   4 +
 include/uapi/linux/ethtool_netlink.h          |   2 +
 net/core/devmem.c                             |  18 +++
 net/ethtool/common.h                          |   1 +
 net/ethtool/netlink.h                         |   2 +-
 net/ethtool/rings.c                           |  61 +++++++-
 12 files changed, 305 insertions(+), 90 deletions(-)

-- 
2.34.1



* [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-03 16:06 [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Taehee Yoo
@ 2024-10-03 16:06 ` Taehee Yoo
  2024-10-03 16:57   ` Brett Creeley
  2024-10-03 17:13   ` Michael Chan
  2024-10-03 16:06 ` [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split " Taehee Yoo
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 16:06 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan
  Cc: kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley,
	ap420073

The bnxt_en driver supports rx-copybreak, but it couldn't be set from
userspace; only the default value (256) worked.
This patch makes the bnxt_en driver support the following commands:
`ethtool --set-tunable <devname> rx-copybreak <value>` and
`ethtool --get-tunable <devname> rx-copybreak`.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---

v3:
 - Update copybreak value before closing nic.

v2:
 - Define max/min rx_copybreak value.

 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 24 +++++----
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  6 ++-
 .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 49 ++++++++++++++++++-
 3 files changed, 68 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6e422e24750a..8da211e083a4 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -81,7 +81,6 @@ MODULE_DESCRIPTION("Broadcom NetXtreme network driver");
 
 #define BNXT_RX_OFFSET (NET_SKB_PAD + NET_IP_ALIGN)
 #define BNXT_RX_DMA_OFFSET NET_SKB_PAD
-#define BNXT_RX_COPY_THRESH 256
 
 #define BNXT_TX_PUSH_THRESH 164
 
@@ -1330,13 +1329,13 @@ static struct sk_buff *bnxt_copy_data(struct bnxt_napi *bnapi, u8 *data,
 	if (!skb)
 		return NULL;
 
-	dma_sync_single_for_cpu(&pdev->dev, mapping, bp->rx_copy_thresh,
+	dma_sync_single_for_cpu(&pdev->dev, mapping, bp->rx_copybreak,
 				bp->rx_dir);
 
 	memcpy(skb->data - NET_IP_ALIGN, data - NET_IP_ALIGN,
 	       len + NET_IP_ALIGN);
 
-	dma_sync_single_for_device(&pdev->dev, mapping, bp->rx_copy_thresh,
+	dma_sync_single_for_device(&pdev->dev, mapping, bp->rx_copybreak,
 				   bp->rx_dir);
 
 	skb_put(skb, len);
@@ -1829,7 +1828,7 @@ static inline struct sk_buff *bnxt_tpa_end(struct bnxt *bp,
 		return NULL;
 	}
 
-	if (len <= bp->rx_copy_thresh) {
+	if (len <= bp->rx_copybreak) {
 		skb = bnxt_copy_skb(bnapi, data_ptr, len, mapping);
 		if (!skb) {
 			bnxt_abort_tpa(cpr, idx, agg_bufs);
@@ -1931,6 +1930,7 @@ static void bnxt_deliver_skb(struct bnxt *bp, struct bnxt_napi *bnapi,
 		bnxt_vf_rep_rx(bp, skb);
 		return;
 	}
+
 	skb_record_rx_queue(skb, bnapi->index);
 	napi_gro_receive(&bnapi->napi, skb);
 }
@@ -2162,7 +2162,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
 		}
 	}
 
-	if (len <= bp->rx_copy_thresh) {
+	if (len <= bp->rx_copybreak) {
 		if (!xdp_active)
 			skb = bnxt_copy_skb(bnapi, data_ptr, len, dma_addr);
 		else
@@ -4451,6 +4451,11 @@ void bnxt_set_tpa_flags(struct bnxt *bp)
 		bp->flags |= BNXT_FLAG_GRO;
 }
 
+static void bnxt_init_ring_params(struct bnxt *bp)
+{
+	bp->rx_copybreak = BNXT_DEFAULT_RX_COPYBREAK;
+}
+
 /* bp->rx_ring_size, bp->tx_ring_size, dev->mtu, BNXT_FLAG_{G|L}RO flags must
  * be set on entry.
  */
@@ -4465,7 +4470,6 @@ void bnxt_set_ring_params(struct bnxt *bp)
 	rx_space = rx_size + ALIGN(max(NET_SKB_PAD, XDP_PACKET_HEADROOM), 8) +
 		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	bp->rx_copy_thresh = BNXT_RX_COPY_THRESH;
 	ring_size = bp->rx_ring_size;
 	bp->rx_agg_ring_size = 0;
 	bp->rx_agg_nr_pages = 0;
@@ -4510,7 +4514,8 @@ void bnxt_set_ring_params(struct bnxt *bp)
 				  ALIGN(max(NET_SKB_PAD, XDP_PACKET_HEADROOM), 8) -
 				  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 		} else {
-			rx_size = SKB_DATA_ALIGN(BNXT_RX_COPY_THRESH + NET_IP_ALIGN);
+			rx_size = SKB_DATA_ALIGN(bp->rx_copybreak +
+						 NET_IP_ALIGN);
 			rx_space = rx_size + NET_SKB_PAD +
 				SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 		}
@@ -6424,8 +6429,8 @@ static int bnxt_hwrm_vnic_set_hds(struct bnxt *bp, struct bnxt_vnic_info *vnic)
 					  VNIC_PLCMODES_CFG_REQ_FLAGS_HDS_IPV6);
 		req->enables |=
 			cpu_to_le32(VNIC_PLCMODES_CFG_REQ_ENABLES_HDS_THRESHOLD_VALID);
-		req->jumbo_thresh = cpu_to_le16(bp->rx_copy_thresh);
-		req->hds_threshold = cpu_to_le16(bp->rx_copy_thresh);
+		req->jumbo_thresh = cpu_to_le16(bp->rx_copybreak);
+		req->hds_threshold = cpu_to_le16(bp->rx_copybreak);
 	}
 	req->vnic_id = cpu_to_le32(vnic->fw_vnic_id);
 	return hwrm_req_send(bp, req);
@@ -15864,6 +15869,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	bnxt_init_l2_fltr_tbl(bp);
 	bnxt_set_rx_skb_mode(bp, false);
 	bnxt_set_tpa_flags(bp);
+	bnxt_init_ring_params(bp);
 	bnxt_set_ring_params(bp);
 	bnxt_rdma_aux_device_init(bp);
 	rc = bnxt_set_dflt_rings(bp, true);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 69231e85140b..cff031993223 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -34,6 +34,10 @@
 #include <linux/firmware/broadcom/tee_bnxt_fw.h>
 #endif
 
+#define BNXT_DEFAULT_RX_COPYBREAK 256
+#define BNXT_MIN_RX_COPYBREAK 65
+#define BNXT_MAX_RX_COPYBREAK 1024
+
 extern struct list_head bnxt_block_cb_list;
 
 struct page_pool;
@@ -2299,7 +2303,7 @@ struct bnxt {
 	enum dma_data_direction	rx_dir;
 	u32			rx_ring_size;
 	u32			rx_agg_ring_size;
-	u32			rx_copy_thresh;
+	u32			rx_copybreak;
 	u32			rx_ring_mask;
 	u32			rx_agg_ring_mask;
 	int			rx_nr_pages;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index f71cc8188b4e..fdecdf8894b3 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -4319,6 +4319,51 @@ static int bnxt_get_eee(struct net_device *dev, struct ethtool_keee *edata)
 	return 0;
 }
 
+static int bnxt_set_tunable(struct net_device *dev,
+			    const struct ethtool_tunable *tuna,
+			    const void *data)
+{
+	struct bnxt *bp = netdev_priv(dev);
+	u32 rx_copybreak;
+
+	switch (tuna->id) {
+	case ETHTOOL_RX_COPYBREAK:
+		rx_copybreak = *(u32 *)data;
+		if (rx_copybreak < BNXT_MIN_RX_COPYBREAK ||
+		    rx_copybreak > BNXT_MAX_RX_COPYBREAK)
+			return -EINVAL;
+		if (rx_copybreak != bp->rx_copybreak) {
+			if (netif_running(dev)) {
+				bnxt_close_nic(bp, false, false);
+				bp->rx_copybreak = rx_copybreak;
+				bnxt_set_ring_params(bp);
+				bnxt_open_nic(bp, false, false);
+			} else {
+				bp->rx_copybreak = rx_copybreak;
+			}
+		}
+		return 0;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int bnxt_get_tunable(struct net_device *dev,
+			    const struct ethtool_tunable *tuna, void *data)
+{
+	struct bnxt *bp = netdev_priv(dev);
+
+	switch (tuna->id) {
+	case ETHTOOL_RX_COPYBREAK:
+		*(u32 *)data = bp->rx_copybreak;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
 static int bnxt_read_sfp_module_eeprom_info(struct bnxt *bp, u16 i2c_addr,
 					    u16 page_number, u8 bank,
 					    u16 start_addr, u16 data_length,
@@ -4769,7 +4814,7 @@ static int bnxt_run_loopback(struct bnxt *bp)
 	cpr = &rxr->bnapi->cp_ring;
 	if (bp->flags & BNXT_FLAG_CHIP_P5_PLUS)
 		cpr = rxr->rx_cpr;
-	pkt_size = min(bp->dev->mtu + ETH_HLEN, bp->rx_copy_thresh);
+	pkt_size = min(bp->dev->mtu + ETH_HLEN, bp->rx_copybreak);
 	skb = netdev_alloc_skb(bp->dev, pkt_size);
 	if (!skb)
 		return -ENOMEM;
@@ -5342,6 +5387,8 @@ const struct ethtool_ops bnxt_ethtool_ops = {
 	.get_link_ext_stats	= bnxt_get_link_ext_stats,
 	.get_eee		= bnxt_get_eee,
 	.set_eee		= bnxt_set_eee,
+	.get_tunable		= bnxt_get_tunable,
+	.set_tunable		= bnxt_set_tunable,
 	.get_module_info	= bnxt_get_module_info,
 	.get_module_eeprom	= bnxt_get_module_eeprom,
 	.get_module_eeprom_by_page = bnxt_get_module_eeprom_by_page,
-- 
2.34.1



* [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command
  2024-10-03 16:06 [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Taehee Yoo
  2024-10-03 16:06 ` [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command Taehee Yoo
@ 2024-10-03 16:06 ` Taehee Yoo
  2024-10-08 18:19   ` Jakub Kicinski
  2024-10-03 16:06 ` [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh Taehee Yoo
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 16:06 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan
  Cc: kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley,
	ap420073

NICs that use the bnxt_en driver support the tcp-data-split feature
under the name HDS (header-data-split).
But there is no way to enable or disable HDS via ethtool.
Only getting the current HDS status is implemented, and HDS is
automatically enabled only when either LRO, HW-GRO, or JUMBO is enabled.
The hds_threshold follows the rx-copybreak value and was unchangeable.

This patch implements the `ethtool -G <interface name> tcp-data-split
<value>` command option.
The value can be <on>, <off>, or <auto>, but <auto> is automatically
changed to <on>.

The HDS feature relies on the aggregation ring.
So, if HDS is enabled, the bnxt_en driver initializes the aggregation
ring.
This is why BNXT_FLAG_AGG_RINGS includes the HDS condition.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---

v3:
 - No changes.

v2:
 - Do not set hds_threshold to 0.

 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  9 +++----
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  5 ++--
 .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 25 +++++++++++++++++--
 3 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 8da211e083a4..f046478dfd2a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4454,6 +4454,7 @@ void bnxt_set_tpa_flags(struct bnxt *bp)
 static void bnxt_init_ring_params(struct bnxt *bp)
 {
 	bp->rx_copybreak = BNXT_DEFAULT_RX_COPYBREAK;
+	bp->flags |= BNXT_FLAG_HDS;
 }
 
 /* bp->rx_ring_size, bp->tx_ring_size, dev->mtu, BNXT_FLAG_{G|L}RO flags must
@@ -4474,7 +4475,7 @@ void bnxt_set_ring_params(struct bnxt *bp)
 	bp->rx_agg_ring_size = 0;
 	bp->rx_agg_nr_pages = 0;
 
-	if (bp->flags & BNXT_FLAG_TPA)
+	if (bp->flags & BNXT_FLAG_TPA || bp->flags & BNXT_FLAG_HDS)
 		agg_factor = min_t(u32, 4, 65536 / BNXT_RX_PAGE_SIZE);
 
 	bp->flags &= ~BNXT_FLAG_JUMBO;
@@ -6421,15 +6422,13 @@ static int bnxt_hwrm_vnic_set_hds(struct bnxt *bp, struct bnxt_vnic_info *vnic)
 
 	req->flags = cpu_to_le32(VNIC_PLCMODES_CFG_REQ_FLAGS_JUMBO_PLACEMENT);
 	req->enables = cpu_to_le32(VNIC_PLCMODES_CFG_REQ_ENABLES_JUMBO_THRESH_VALID);
+	req->jumbo_thresh = cpu_to_le16(bp->rx_buf_use_size);
 
-	if (BNXT_RX_PAGE_MODE(bp)) {
-		req->jumbo_thresh = cpu_to_le16(bp->rx_buf_use_size);
-	} else {
+	if (bp->flags & BNXT_FLAG_HDS) {
 		req->flags |= cpu_to_le32(VNIC_PLCMODES_CFG_REQ_FLAGS_HDS_IPV4 |
 					  VNIC_PLCMODES_CFG_REQ_FLAGS_HDS_IPV6);
 		req->enables |=
 			cpu_to_le32(VNIC_PLCMODES_CFG_REQ_ENABLES_HDS_THRESHOLD_VALID);
-		req->jumbo_thresh = cpu_to_le16(bp->rx_copybreak);
 		req->hds_threshold = cpu_to_le16(bp->rx_copybreak);
 	}
 	req->vnic_id = cpu_to_le32(vnic->fw_vnic_id);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index cff031993223..35601c71dfe9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -2202,8 +2202,6 @@ struct bnxt {
 	#define BNXT_FLAG_TPA		(BNXT_FLAG_LRO | BNXT_FLAG_GRO)
 	#define BNXT_FLAG_JUMBO		0x10
 	#define BNXT_FLAG_STRIP_VLAN	0x20
-	#define BNXT_FLAG_AGG_RINGS	(BNXT_FLAG_JUMBO | BNXT_FLAG_GRO | \
-					 BNXT_FLAG_LRO)
 	#define BNXT_FLAG_RFS		0x100
 	#define BNXT_FLAG_SHARED_RINGS	0x200
 	#define BNXT_FLAG_PORT_STATS	0x400
@@ -2224,6 +2222,9 @@ struct bnxt {
 	#define BNXT_FLAG_ROCE_MIRROR_CAP	0x4000000
 	#define BNXT_FLAG_TX_COAL_CMPL	0x8000000
 	#define BNXT_FLAG_PORT_STATS_EXT	0x10000000
+	#define BNXT_FLAG_HDS		0x20000000
+	#define BNXT_FLAG_AGG_RINGS	(BNXT_FLAG_JUMBO | BNXT_FLAG_GRO | \
+					 BNXT_FLAG_LRO | BNXT_FLAG_HDS)
 
 	#define BNXT_FLAG_ALL_CONFIG_FEATS (BNXT_FLAG_TPA |		\
 					    BNXT_FLAG_RFS |		\
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index fdecdf8894b3..e9ef65dd2e7b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -829,12 +829,16 @@ static void bnxt_get_ringparam(struct net_device *dev,
 	if (bp->flags & BNXT_FLAG_AGG_RINGS) {
 		ering->rx_max_pending = BNXT_MAX_RX_DESC_CNT_JUM_ENA;
 		ering->rx_jumbo_max_pending = BNXT_MAX_RX_JUM_DESC_CNT;
-		kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_ENABLED;
 	} else {
 		ering->rx_max_pending = BNXT_MAX_RX_DESC_CNT;
 		ering->rx_jumbo_max_pending = 0;
-		kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_DISABLED;
 	}
+
+	if (bp->flags & BNXT_FLAG_HDS)
+		kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_ENABLED;
+	else
+		kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_DISABLED;
+
 	ering->tx_max_pending = BNXT_MAX_TX_DESC_CNT;
 
 	ering->rx_pending = bp->rx_ring_size;
@@ -854,9 +858,25 @@ static int bnxt_set_ringparam(struct net_device *dev,
 	    (ering->tx_pending < BNXT_MIN_TX_DESC_CNT))
 		return -EINVAL;
 
+	if (kernel_ering->tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
+	    BNXT_RX_PAGE_MODE(bp)) {
+		NL_SET_ERR_MSG_MOD(extack, "tcp-data-split can not be enabled with XDP");
+		return -EINVAL;
+	}
+
 	if (netif_running(dev))
 		bnxt_close_nic(bp, false, false);
 
+	switch (kernel_ering->tcp_data_split) {
+	case ETHTOOL_TCP_DATA_SPLIT_UNKNOWN:
+	case ETHTOOL_TCP_DATA_SPLIT_ENABLED:
+		bp->flags |= BNXT_FLAG_HDS;
+		break;
+	case ETHTOOL_TCP_DATA_SPLIT_DISABLED:
+		bp->flags &= ~BNXT_FLAG_HDS;
+		break;
+	}
+
 	bp->rx_ring_size = ering->rx_pending;
 	bp->tx_ring_size = ering->tx_pending;
 	bnxt_set_ring_params(bp);
@@ -5346,6 +5366,7 @@ const struct ethtool_ops bnxt_ethtool_ops = {
 				     ETHTOOL_COALESCE_STATS_BLOCK_USECS |
 				     ETHTOOL_COALESCE_USE_ADAPTIVE_RX |
 				     ETHTOOL_COALESCE_USE_CQE,
+	.supported_ring_params	= ETHTOOL_RING_USE_TCP_DATA_SPLIT,
 	.get_link_ksettings	= bnxt_get_link_ksettings,
 	.set_link_ksettings	= bnxt_set_link_ksettings,
 	.get_fec_stats		= bnxt_get_fec_stats,
-- 
2.34.1



* [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh
  2024-10-03 16:06 [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Taehee Yoo
  2024-10-03 16:06 ` [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command Taehee Yoo
  2024-10-03 16:06 ` [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split " Taehee Yoo
@ 2024-10-03 16:06 ` Taehee Yoo
  2024-10-03 18:25   ` Mina Almasry
  2024-10-08 18:33   ` Jakub Kicinski
  2024-10-03 16:06 ` [PATCH net-next v3 4/7] bnxt_en: add support for tcp-data-split-thresh ethtool command Taehee Yoo
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 16:06 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan
  Cc: kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley,
	ap420073

The tcp-data-split-thresh option configures the threshold value for
tcp-data-split.
If a received packet is larger than this threshold, the packet
is split into header and payload.
The header normally means the TCP header, but this depends on the
driver/hardware.
The bnxt_en driver configures HDS (Header-Data-Split) at the firmware
level, which affects TCP and UDP alike.
So, like the tcp-data-split option, setting tcp-data-split-thresh
affects both UDP and TCP packets.

The tcp-data-split-thresh option depends on the tcp-data-split
option: the threshold value can be get/set only while tcp-data-split
is enabled.

Example:
   # ethtool -G <interface name> tcp-data-split-thresh <value>

   # ethtool -G enp14s0f0np0 tcp-data-split on tcp-data-split-thresh 256
   # ethtool -g enp14s0f0np0
   Ring parameters for enp14s0f0np0:
   Pre-set maximums:
   ...
   TCP data split thresh:  256
   Current hardware settings:
   ...
   TCP data split:         on
   TCP data split thresh:  256

When tcp-data-split is not enabled, tcp-data-split-thresh is not used
and can't be configured.

   # ethtool -G enp14s0f0np0 tcp-data-split off
   # ethtool -g enp14s0f0np0
   Ring parameters for enp14s0f0np0:
   Pre-set maximums:
   ...
   TCP data split thresh:  256
   Current hardware settings:
   ...
   TCP data split:         off
   TCP data split thresh:  n/a

The default/min/max values are not defined by ethtool, so drivers
should define them themselves.
A value of 0 means that every TCP and UDP packet's header and payload
will be split.
Users should consider the overhead incurred by this feature.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---

v3:
 - Fix documentation and ynl
 - Update error messages
 - Validate configuration of tcp-data-split and tcp-data-split-thresh

v2:
 - Patch added.

 Documentation/netlink/specs/ethtool.yaml     |  8 +++
 Documentation/networking/ethtool-netlink.rst | 75 ++++++++++++--------
 include/linux/ethtool.h                      |  4 ++
 include/uapi/linux/ethtool_netlink.h         |  2 +
 net/ethtool/netlink.h                        |  2 +-
 net/ethtool/rings.c                          | 46 ++++++++++--
 6 files changed, 102 insertions(+), 35 deletions(-)

diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml
index 6a050d755b9c..96298fe5ed43 100644
--- a/Documentation/netlink/specs/ethtool.yaml
+++ b/Documentation/netlink/specs/ethtool.yaml
@@ -215,6 +215,12 @@ attribute-sets:
       -
         name: tx-push-buf-len-max
         type: u32
+      -
+        name: tcp-data-split-thresh
+        type: u32
+      -
+        name: tcp-data-split-thresh-max
+        type: u32
 
   -
     name: mm-stat
@@ -1393,6 +1399,8 @@ operations:
             - rx-push
             - tx-push-buf-len
             - tx-push-buf-len-max
+            - tcp-data-split-thresh
+            - tcp-data-split-thresh-max
       dump: *ring-get-op
     -
       name: rings-set
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 295563e91082..f0cd918dbe7e 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -875,24 +875,32 @@ Request contents:
 
 Kernel response contents:
 
-  =======================================   ======  ===========================
-  ``ETHTOOL_A_RINGS_HEADER``                nested  reply header
-  ``ETHTOOL_A_RINGS_RX_MAX``                u32     max size of RX ring
-  ``ETHTOOL_A_RINGS_RX_MINI_MAX``           u32     max size of RX mini ring
-  ``ETHTOOL_A_RINGS_RX_JUMBO_MAX``          u32     max size of RX jumbo ring
-  ``ETHTOOL_A_RINGS_TX_MAX``                u32     max size of TX ring
-  ``ETHTOOL_A_RINGS_RX``                    u32     size of RX ring
-  ``ETHTOOL_A_RINGS_RX_MINI``               u32     size of RX mini ring
-  ``ETHTOOL_A_RINGS_RX_JUMBO``              u32     size of RX jumbo ring
-  ``ETHTOOL_A_RINGS_TX``                    u32     size of TX ring
-  ``ETHTOOL_A_RINGS_RX_BUF_LEN``            u32     size of buffers on the ring
-  ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT``        u8      TCP header / data split
-  ``ETHTOOL_A_RINGS_CQE_SIZE``              u32     Size of TX/RX CQE
-  ``ETHTOOL_A_RINGS_TX_PUSH``               u8      flag of TX Push mode
-  ``ETHTOOL_A_RINGS_RX_PUSH``               u8      flag of RX Push mode
-  ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN``       u32     size of TX push buffer
-  ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX``   u32     max size of TX push buffer
-  =======================================   ======  ===========================
+  =============================================  ======  =======================
+  ``ETHTOOL_A_RINGS_HEADER``                     nested  reply header
+  ``ETHTOOL_A_RINGS_RX_MAX``                     u32     max size of RX ring
+  ``ETHTOOL_A_RINGS_RX_MINI_MAX``                u32     max size of RX mini
+                                                         ring
+  ``ETHTOOL_A_RINGS_RX_JUMBO_MAX``               u32     max size of RX jumbo
+                                                         ring
+  ``ETHTOOL_A_RINGS_TX_MAX``                     u32     max size of TX ring
+  ``ETHTOOL_A_RINGS_RX``                         u32     size of RX ring
+  ``ETHTOOL_A_RINGS_RX_MINI``                    u32     size of RX mini ring
+  ``ETHTOOL_A_RINGS_RX_JUMBO``                   u32     size of RX jumbo ring
+  ``ETHTOOL_A_RINGS_TX``                         u32     size of TX ring
+  ``ETHTOOL_A_RINGS_RX_BUF_LEN``                 u32     size of buffers on the
+                                                         ring
+  ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT``             u8      TCP header / data split
+  ``ETHTOOL_A_RINGS_CQE_SIZE``                   u32     Size of TX/RX CQE
+  ``ETHTOOL_A_RINGS_TX_PUSH``                    u8      flag of TX Push mode
+  ``ETHTOOL_A_RINGS_RX_PUSH``                    u8      flag of RX Push mode
+  ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN``            u32     size of TX push buffer
+  ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX``        u32     max size of TX push
+                                                         buffer
+  ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH``      u32     threshold of
+                                                         TCP header / data split
+  ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH_MAX``  u32     max threshold of
+                                                         TCP header / data split
+  =============================================  ======  =======================
 
 ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with
 page-flipping TCP zero-copy receive (``getsockopt(TCP_ZEROCOPY_RECEIVE)``).
@@ -927,18 +935,21 @@ Sets ring sizes like ``ETHTOOL_SRINGPARAM`` ioctl request.
 
 Request contents:
 
-  ====================================  ======  ===========================
-  ``ETHTOOL_A_RINGS_HEADER``            nested  reply header
-  ``ETHTOOL_A_RINGS_RX``                u32     size of RX ring
-  ``ETHTOOL_A_RINGS_RX_MINI``           u32     size of RX mini ring
-  ``ETHTOOL_A_RINGS_RX_JUMBO``          u32     size of RX jumbo ring
-  ``ETHTOOL_A_RINGS_TX``                u32     size of TX ring
-  ``ETHTOOL_A_RINGS_RX_BUF_LEN``        u32     size of buffers on the ring
-  ``ETHTOOL_A_RINGS_CQE_SIZE``          u32     Size of TX/RX CQE
-  ``ETHTOOL_A_RINGS_TX_PUSH``           u8      flag of TX Push mode
-  ``ETHTOOL_A_RINGS_RX_PUSH``           u8      flag of RX Push mode
-  ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN``   u32     size of TX push buffer
-  ====================================  ======  ===========================
+  =========================================  ======  =======================
+  ``ETHTOOL_A_RINGS_HEADER``                 nested  reply header
+  ``ETHTOOL_A_RINGS_RX``                     u32     size of RX ring
+  ``ETHTOOL_A_RINGS_RX_MINI``                u32     size of RX mini ring
+  ``ETHTOOL_A_RINGS_RX_JUMBO``               u32     size of RX jumbo ring
+  ``ETHTOOL_A_RINGS_TX``                     u32     size of TX ring
+  ``ETHTOOL_A_RINGS_RX_BUF_LEN``             u32     size of buffers on the ring
+  ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT``         u8      TCP header / data split
+  ``ETHTOOL_A_RINGS_CQE_SIZE``               u32     Size of TX/RX CQE
+  ``ETHTOOL_A_RINGS_TX_PUSH``                u8      flag of TX Push mode
+  ``ETHTOOL_A_RINGS_RX_PUSH``                u8      flag of RX Push mode
+  ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN``        u32     size of TX push buffer
+  ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH``  u32     threshold of
+                                                     TCP header / data split
+  =========================================  ======  =======================
 
 Kernel checks that requested ring sizes do not exceed limits reported by
 driver. Driver may impose additional constraints and may not support all
@@ -954,6 +965,10 @@ A bigger CQE can have more receive buffer pointers, and in turn the NIC can
 transfer a bigger frame from wire. Based on the NIC hardware, the overall
 completion queue size can be adjusted in the driver if CQE size is modified.
 
+``ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH`` specifies the threshold value of
+the tcp-data-split feature. If tcp-data-split is enabled and a received
+packet is larger than this threshold, its header and data will be split.
+
 CHANNELS_GET
 ============
 
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 12f6dc567598..891f55b0f6aa 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -78,6 +78,8 @@ enum {
  * @cqe_size: Size of TX/RX completion queue event
  * @tx_push_buf_len: Size of TX push buffer
  * @tx_push_buf_max_len: Maximum allowed size of TX push buffer
+ * @tcp_data_split_thresh: Threshold value of tcp-data-split
+ * @tcp_data_split_thresh_max: Maximum allowed threshold of tcp-data-split
  */
 struct kernel_ethtool_ringparam {
 	u32	rx_buf_len;
@@ -87,6 +89,8 @@ struct kernel_ethtool_ringparam {
 	u32	cqe_size;
 	u32	tx_push_buf_len;
 	u32	tx_push_buf_max_len;
+	u32	tcp_data_split_thresh;
+	u32	tcp_data_split_thresh_max;
 };
 
 /**
diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h
index 283305f6b063..20fe6065b7ba 100644
--- a/include/uapi/linux/ethtool_netlink.h
+++ b/include/uapi/linux/ethtool_netlink.h
@@ -364,6 +364,8 @@ enum {
 	ETHTOOL_A_RINGS_RX_PUSH,			/* u8 */
 	ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN,		/* u32 */
 	ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX,		/* u32 */
+	ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH,		/* u32 */
+	ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH_MAX,	/* u32 */
 
 	/* add new constants above here */
 	__ETHTOOL_A_RINGS_CNT,
diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h
index 203b08eb6c6f..8bea47a26605 100644
--- a/net/ethtool/netlink.h
+++ b/net/ethtool/netlink.h
@@ -455,7 +455,7 @@ extern const struct nla_policy ethnl_features_set_policy[ETHTOOL_A_FEATURES_WANT
 extern const struct nla_policy ethnl_privflags_get_policy[ETHTOOL_A_PRIVFLAGS_HEADER + 1];
 extern const struct nla_policy ethnl_privflags_set_policy[ETHTOOL_A_PRIVFLAGS_FLAGS + 1];
 extern const struct nla_policy ethnl_rings_get_policy[ETHTOOL_A_RINGS_HEADER + 1];
-extern const struct nla_policy ethnl_rings_set_policy[ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX + 1];
+extern const struct nla_policy ethnl_rings_set_policy[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH_MAX + 1];
 extern const struct nla_policy ethnl_channels_get_policy[ETHTOOL_A_CHANNELS_HEADER + 1];
 extern const struct nla_policy ethnl_channels_set_policy[ETHTOOL_A_CHANNELS_COMBINED_COUNT + 1];
 extern const struct nla_policy ethnl_coalesce_get_policy[ETHTOOL_A_COALESCE_HEADER + 1];
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index b7865a14fdf8..c7824515857f 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -61,7 +61,9 @@ static int rings_reply_size(const struct ethnl_req_info *req_base,
 	       nla_total_size(sizeof(u8))  +	/* _RINGS_TX_PUSH */
 	       nla_total_size(sizeof(u8))) +	/* _RINGS_RX_PUSH */
 	       nla_total_size(sizeof(u32)) +	/* _RINGS_TX_PUSH_BUF_LEN */
-	       nla_total_size(sizeof(u32));	/* _RINGS_TX_PUSH_BUF_LEN_MAX */
+	       nla_total_size(sizeof(u32)) +	/* _RINGS_TX_PUSH_BUF_LEN_MAX */
+	       nla_total_size(sizeof(u32)) +	/* _RINGS_TCP_DATA_SPLIT_THRESH */
+	       nla_total_size(sizeof(u32));	/* _RINGS_TCP_DATA_SPLIT_THRESH_MAX */
 }
 
 static int rings_fill_reply(struct sk_buff *skb,
@@ -108,7 +110,13 @@ static int rings_fill_reply(struct sk_buff *skb,
 	     (nla_put_u32(skb, ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX,
 			  kr->tx_push_buf_max_len) ||
 	      nla_put_u32(skb, ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN,
-			  kr->tx_push_buf_len))))
+			  kr->tx_push_buf_len))) ||
+	    (kr->tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_ENABLED &&
+	     (nla_put_u32(skb, ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH,
+			 kr->tcp_data_split_thresh))) ||
+	    (kr->tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_ENABLED &&
+	     (nla_put_u32(skb, ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH_MAX,
+			 kr->tcp_data_split_thresh_max))))
 		return -EMSGSIZE;
 
 	return 0;
@@ -130,6 +138,7 @@ const struct nla_policy ethnl_rings_set_policy[] = {
 	[ETHTOOL_A_RINGS_TX_PUSH]		= NLA_POLICY_MAX(NLA_U8, 1),
 	[ETHTOOL_A_RINGS_RX_PUSH]		= NLA_POLICY_MAX(NLA_U8, 1),
 	[ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN]	= { .type = NLA_U32 },
+	[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH]	= { .type = NLA_U32 },
 };
 
 static int
@@ -155,6 +164,14 @@ ethnl_set_rings_validate(struct ethnl_req_info *req_info,
 		return -EOPNOTSUPP;
 	}
 
+	if (tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH] &&
+	    !(ops->supported_ring_params & ETHTOOL_RING_USE_TCP_DATA_SPLIT)) {
+		NL_SET_ERR_MSG_ATTR(info->extack,
+				    tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH],
+				    "setting tcp-data-split-thresh is not supported");
+		return -EOPNOTSUPP;
+	}
+
 	if (tb[ETHTOOL_A_RINGS_CQE_SIZE] &&
 	    !(ops->supported_ring_params & ETHTOOL_RING_USE_CQE_SIZE)) {
 		NL_SET_ERR_MSG_ATTR(info->extack,
@@ -196,9 +213,9 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
 	struct kernel_ethtool_ringparam kernel_ringparam = {};
 	struct ethtool_ringparam ringparam = {};
 	struct net_device *dev = req_info->dev;
+	bool mod = false, thresh_mod = false;
 	struct nlattr **tb = info->attrs;
 	const struct nlattr *err_attr;
-	bool mod = false;
 	int ret;
 
 	dev->ethtool_ops->get_ringparam(dev, &ringparam,
@@ -222,9 +239,30 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
 			tb[ETHTOOL_A_RINGS_RX_PUSH], &mod);
 	ethnl_update_u32(&kernel_ringparam.tx_push_buf_len,
 			 tb[ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN], &mod);
-	if (!mod)
+	ethnl_update_u32(&kernel_ringparam.tcp_data_split_thresh,
+			 tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH],
+			 &thresh_mod);
+	if (!mod && !thresh_mod)
 		return 0;
 
+	if (kernel_ringparam.tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
+	    thresh_mod) {
+		NL_SET_ERR_MSG_ATTR(info->extack,
+				    tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH],
+				    "tcp-data-split-thresh can not be updated while tcp-data-split is disabled");
+		return -EINVAL;
+	}
+
+	if (kernel_ringparam.tcp_data_split_thresh >
+	    kernel_ringparam.tcp_data_split_thresh_max) {
+		NL_SET_ERR_MSG_ATTR_FMT(info->extack,
+					tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH_MAX],
+					"Requested tcp-data-split-thresh exceeds the maximum of %u",
+					kernel_ringparam.tcp_data_split_thresh_max);
+
+		return -EINVAL;
+	}
+
 	/* ensure new ring parameters are within limits */
 	if (ringparam.rx_pending > ringparam.rx_max_pending)
 		err_attr = tb[ETHTOOL_A_RINGS_RX];
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH net-next v3 4/7] bnxt_en: add support for tcp-data-split-thresh ethtool command
  2024-10-03 16:06 [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Taehee Yoo
                   ` (2 preceding siblings ...)
  2024-10-03 16:06 ` [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh Taehee Yoo
@ 2024-10-03 16:06 ` Taehee Yoo
  2024-10-03 18:13   ` Brett Creeley
  2024-10-08 18:35   ` Jakub Kicinski
  2024-10-03 16:06 ` [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering Taehee Yoo
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 16:06 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan
  Cc: kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley,
	ap420073

The bnxt_en driver has been configuring the hds_threshold value
automatically when TPA is enabled, deriving it from the rx-copybreak
default value. Now that the core supports the tcp-data-split-thresh
ethtool command, implement that option in the driver.

Configuring tcp-data-split-thresh is allowed only while tcp-data-split
is enabled. Its default value is 256, which is also the default
rx-copybreak value, which previously served as the hds_threshold.

   # Example:
   # ethtool -G enp14s0f0np0 tcp-data-split on tcp-data-split-thresh 256
   # ethtool -g enp14s0f0np0
   Ring parameters for enp14s0f0np0:
   Pre-set maximums:
   ...
   TCP data split thresh:  256
   Current hardware settings:
   ...
   TCP data split:         on
   TCP data split thresh:  256

This enables tcp-data-split and sets the tcp-data-split-thresh value to 256.

   # ethtool -G enp14s0f0np0 tcp-data-split off
   # ethtool -g enp14s0f0np0
   Ring parameters for enp14s0f0np0:
   Pre-set maximums:
   ...
   TCP data split thresh:  256
   Current hardware settings:
   ...
   TCP data split:         off
   TCP data split thresh:  n/a

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---

v3:
 - Drop validation logic tcp-data-split and tcp-data-split-thresh.

v2:
 - Patch added.

 drivers/net/ethernet/broadcom/bnxt/bnxt.c         | 3 ++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h         | 2 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 4 ++++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index f046478dfd2a..872b15842b11 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4455,6 +4455,7 @@ static void bnxt_init_ring_params(struct bnxt *bp)
 {
 	bp->rx_copybreak = BNXT_DEFAULT_RX_COPYBREAK;
 	bp->flags |= BNXT_FLAG_HDS;
+	bp->hds_threshold = BNXT_DEFAULT_RX_COPYBREAK;
 }
 
 /* bp->rx_ring_size, bp->tx_ring_size, dev->mtu, BNXT_FLAG_{G|L}RO flags must
@@ -6429,7 +6430,7 @@ static int bnxt_hwrm_vnic_set_hds(struct bnxt *bp, struct bnxt_vnic_info *vnic)
 					  VNIC_PLCMODES_CFG_REQ_FLAGS_HDS_IPV6);
 		req->enables |=
 			cpu_to_le32(VNIC_PLCMODES_CFG_REQ_ENABLES_HDS_THRESHOLD_VALID);
-		req->hds_threshold = cpu_to_le16(bp->rx_copybreak);
+		req->hds_threshold = cpu_to_le16(bp->hds_threshold);
 	}
 	req->vnic_id = cpu_to_le32(vnic->fw_vnic_id);
 	return hwrm_req_send(bp, req);
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 35601c71dfe9..48f390519c35 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -2311,6 +2311,8 @@ struct bnxt {
 	int			rx_agg_nr_pages;
 	int			rx_nr_rings;
 	int			rsscos_nr_ctxs;
+#define BNXT_HDS_THRESHOLD_MAX	256
+	u16			hds_threshold;
 
 	u32			tx_ring_size;
 	u32			tx_ring_mask;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index e9ef65dd2e7b..af6ed492f688 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -839,6 +839,9 @@ static void bnxt_get_ringparam(struct net_device *dev,
 	else
 		kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_DISABLED;
 
+	kernel_ering->tcp_data_split_thresh = bp->hds_threshold;
+	kernel_ering->tcp_data_split_thresh_max = BNXT_HDS_THRESHOLD_MAX;
+
 	ering->tx_max_pending = BNXT_MAX_TX_DESC_CNT;
 
 	ering->rx_pending = bp->rx_ring_size;
@@ -871,6 +874,7 @@ static int bnxt_set_ringparam(struct net_device *dev,
 	case ETHTOOL_TCP_DATA_SPLIT_UNKNOWN:
 	case ETHTOOL_TCP_DATA_SPLIT_ENABLED:
 		bp->flags |= BNXT_FLAG_HDS;
+		bp->hds_threshold = (u16)kernel_ering->tcp_data_split_thresh;
 		break;
 	case ETHTOOL_TCP_DATA_SPLIT_DISABLED:
 		bp->flags &= ~BNXT_FLAG_HDS;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering
  2024-10-03 16:06 [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Taehee Yoo
                   ` (3 preceding siblings ...)
  2024-10-03 16:06 ` [PATCH net-next v3 4/7] bnxt_en: add support for tcp-data-split-thresh ethtool command Taehee Yoo
@ 2024-10-03 16:06 ` Taehee Yoo
  2024-10-03 18:29   ` Mina Almasry
  2024-10-03 18:35   ` Brett Creeley
  2024-10-03 16:06 ` [PATCH net-next v3 6/7] net: ethtool: " Taehee Yoo
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 16:06 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan
  Cc: kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley,
	ap420073

If the driver does not support reading ring parameters, or the
tcp-data-split configuration is insufficient, devmem should not be set
up. Before devmem is set up, tcp-data-split must be ON and the
tcp-data-split-thresh value must be 0.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---

v3:
 - Patch added.

 net/core/devmem.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/net/core/devmem.c b/net/core/devmem.c
index 11b91c12ee11..a9e9b15028e0 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -8,6 +8,8 @@
  */
 
 #include <linux/dma-buf.h>
+#include <linux/ethtool.h>
+#include <linux/ethtool_netlink.h>
 #include <linux/genalloc.h>
 #include <linux/mm.h>
 #include <linux/netdevice.h>
@@ -131,6 +133,8 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 				    struct net_devmem_dmabuf_binding *binding,
 				    struct netlink_ext_ack *extack)
 {
+	struct kernel_ethtool_ringparam kernel_ringparam = {};
+	struct ethtool_ringparam ringparam = {};
 	struct netdev_rx_queue *rxq;
 	u32 xa_idx;
 	int err;
@@ -146,6 +150,20 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
 		return -EEXIST;
 	}
 
+	if (!dev->ethtool_ops->get_ringparam) {
+		NL_SET_ERR_MSG(extack, "can't get ringparam");
+		return -EINVAL;
+	}
+
+	dev->ethtool_ops->get_ringparam(dev, &ringparam,
+					&kernel_ringparam, extack);
+	if (kernel_ringparam.tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_ENABLED ||
+	    kernel_ringparam.tcp_data_split_thresh) {
+		NL_SET_ERR_MSG(extack,
+			       "tcp-header-data-split is disabled or threshold is not zero");
+		return -EINVAL;
+	}
+
 #ifdef CONFIG_XDP_SOCKETS
 	if (rxq->pool) {
 		NL_SET_ERR_MSG(extack, "designated queue already in use by AF_XDP");
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH net-next v3 6/7] net: ethtool: add ring parameter filtering
  2024-10-03 16:06 [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Taehee Yoo
                   ` (4 preceding siblings ...)
  2024-10-03 16:06 ` [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering Taehee Yoo
@ 2024-10-03 16:06 ` Taehee Yoo
  2024-10-03 18:32   ` Mina Almasry
  2024-10-03 16:06 ` [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp Taehee Yoo
  2024-10-16 20:17 ` [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Stanislav Fomichev
  7 siblings, 1 reply; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 16:06 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan
  Cc: kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley,
	ap420073

While devmem is running, the tcp-data-split and tcp-data-split-thresh
configuration must not be changed. If a user tries to change either
value while devmem is running, the request fails with an extack
message.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---

v3:
 - Patch added

 net/ethtool/common.h |  1 +
 net/ethtool/rings.c  | 15 ++++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/net/ethtool/common.h b/net/ethtool/common.h
index d55d5201b085..beebd4db3e10 100644
--- a/net/ethtool/common.h
+++ b/net/ethtool/common.h
@@ -5,6 +5,7 @@
 
 #include <linux/netdevice.h>
 #include <linux/ethtool.h>
+#include <net/netdev_rx_queue.h>
 
 #define ETHTOOL_DEV_FEATURE_WORDS	DIV_ROUND_UP(NETDEV_FEATURE_COUNT, 32)
 
diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
index c7824515857f..0afc6b29a229 100644
--- a/net/ethtool/rings.c
+++ b/net/ethtool/rings.c
@@ -216,7 +216,8 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
 	bool mod = false, thresh_mod = false;
 	struct nlattr **tb = info->attrs;
 	const struct nlattr *err_attr;
-	int ret;
+	struct netdev_rx_queue *rxq;
+	int ret, i;
 
 	dev->ethtool_ops->get_ringparam(dev, &ringparam,
 					&kernel_ringparam, info->extack);
@@ -263,6 +264,18 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
 		return -EINVAL;
 	}
 
+	if (kernel_ringparam.tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_ENABLED ||
+	    kernel_ringparam.tcp_data_split_thresh) {
+		for (i = 0; i < dev->real_num_rx_queues; i++) {
+			rxq = __netif_get_rx_queue(dev, i);
+			if (rxq->mp_params.mp_priv) {
+				NL_SET_ERR_MSG(info->extack,
+					       "tcp-header-data-split is disabled or threshold is not zero");
+				return -EINVAL;
+			}
+		}
+	}
+
 	/* ensure new ring parameters are within limits */
 	if (ringparam.rx_pending > ringparam.rx_max_pending)
 		err_attr = tb[ETHTOOL_A_RINGS_RX];
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-03 16:06 [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Taehee Yoo
                   ` (5 preceding siblings ...)
  2024-10-03 16:06 ` [PATCH net-next v3 6/7] net: ethtool: " Taehee Yoo
@ 2024-10-03 16:06 ` Taehee Yoo
  2024-10-03 18:43   ` Mina Almasry
                     ` (2 more replies)
  2024-10-16 20:17 ` [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Stanislav Fomichev
  7 siblings, 3 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 16:06 UTC (permalink / raw)
  To: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan
  Cc: kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley,
	ap420073

The bnxt_en driver now satisfies the requirement of device memory TCP,
namely tcp-data-split, so implement device memory TCP for it.

From now on, the aggregation ring handles netmem_ref instead of page,
regardless of whether devmem is in use. So, for the aggregation ring,
memory is managed with the netmem page_pool API instead of the generic
page_pool API.

If devmem is enabled, the netmem_ref is used as-is; if devmem is not
enabled, the netmem_ref is converted to a page before use.

The driver recognizes whether devmem is set based on whether
mp_params.mp_priv is non-NULL. Only if devmem is set does it pass
PP_FLAG_ALLOW_UNREADABLE_NETMEM.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---

v3:
 - Patch added

 drivers/net/ethernet/broadcom/Kconfig     |  1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 98 +++++++++++++++--------
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |  2 +-
 3 files changed, 66 insertions(+), 35 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/Kconfig b/drivers/net/ethernet/broadcom/Kconfig
index 75ca3ddda1f5..f37ff12d4746 100644
--- a/drivers/net/ethernet/broadcom/Kconfig
+++ b/drivers/net/ethernet/broadcom/Kconfig
@@ -211,6 +211,7 @@ config BNXT
 	select FW_LOADER
 	select LIBCRC32C
 	select NET_DEVLINK
+	select NET_DEVMEM
 	select PAGE_POOL
 	select DIMLIB
 	select AUXILIARY_BUS
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 872b15842b11..64e07d247f97 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -55,6 +55,7 @@
 #include <net/page_pool/helpers.h>
 #include <linux/align.h>
 #include <net/netdev_queues.h>
+#include <net/netdev_rx_queue.h>
 
 #include "bnxt_hsi.h"
 #include "bnxt.h"
@@ -863,6 +864,22 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
 		bnapi->events &= ~BNXT_TX_CMP_EVENT;
 }
 
+static netmem_ref __bnxt_alloc_rx_netmem(struct bnxt *bp, dma_addr_t *mapping,
+					 struct bnxt_rx_ring_info *rxr,
+					 unsigned int *offset,
+					 gfp_t gfp)
+{
+	netmem_ref netmem;
+
+	netmem = page_pool_alloc_netmem(rxr->page_pool, GFP_ATOMIC);
+	if (!netmem)
+		return 0;
+	*offset = 0;
+
+	*mapping = page_pool_get_dma_addr_netmem(netmem) + *offset;
+	return netmem;
+}
+
 static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
 					 struct bnxt_rx_ring_info *rxr,
 					 unsigned int *offset,
@@ -972,21 +989,21 @@ static inline u16 bnxt_find_next_agg_idx(struct bnxt_rx_ring_info *rxr, u16 idx)
 	return next;
 }
 
-static inline int bnxt_alloc_rx_page(struct bnxt *bp,
-				     struct bnxt_rx_ring_info *rxr,
-				     u16 prod, gfp_t gfp)
+static inline int bnxt_alloc_rx_netmem(struct bnxt *bp,
+				       struct bnxt_rx_ring_info *rxr,
+				       u16 prod, gfp_t gfp)
 {
 	struct rx_bd *rxbd =
 		&rxr->rx_agg_desc_ring[RX_AGG_RING(bp, prod)][RX_IDX(prod)];
 	struct bnxt_sw_rx_agg_bd *rx_agg_buf;
-	struct page *page;
+	netmem_ref netmem;
 	dma_addr_t mapping;
 	u16 sw_prod = rxr->rx_sw_agg_prod;
 	unsigned int offset = 0;
 
-	page = __bnxt_alloc_rx_page(bp, &mapping, rxr, &offset, gfp);
+	netmem = __bnxt_alloc_rx_netmem(bp, &mapping, rxr, &offset, gfp);
 
-	if (!page)
+	if (!netmem)
 		return -ENOMEM;
 
 	if (unlikely(test_bit(sw_prod, rxr->rx_agg_bmap)))
@@ -996,7 +1013,7 @@ static inline int bnxt_alloc_rx_page(struct bnxt *bp,
 	rx_agg_buf = &rxr->rx_agg_ring[sw_prod];
 	rxr->rx_sw_agg_prod = RING_RX_AGG(bp, NEXT_RX_AGG(sw_prod));
 
-	rx_agg_buf->page = page;
+	rx_agg_buf->netmem = netmem;
 	rx_agg_buf->offset = offset;
 	rx_agg_buf->mapping = mapping;
 	rxbd->rx_bd_haddr = cpu_to_le64(mapping);
@@ -1044,7 +1061,7 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_cp_ring_info *cpr, u16 idx,
 		struct rx_agg_cmp *agg;
 		struct bnxt_sw_rx_agg_bd *cons_rx_buf, *prod_rx_buf;
 		struct rx_bd *prod_bd;
-		struct page *page;
+		netmem_ref netmem;
 
 		if (p5_tpa)
 			agg = bnxt_get_tpa_agg_p5(bp, rxr, idx, start + i);
@@ -1061,11 +1078,11 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_cp_ring_info *cpr, u16 idx,
 		cons_rx_buf = &rxr->rx_agg_ring[cons];
 
 		/* It is possible for sw_prod to be equal to cons, so
-		 * set cons_rx_buf->page to NULL first.
+		 * set cons_rx_buf->netmem to 0 first.
 		 */
-		page = cons_rx_buf->page;
-		cons_rx_buf->page = NULL;
-		prod_rx_buf->page = page;
+		netmem = cons_rx_buf->netmem;
+		cons_rx_buf->netmem = 0;
+		prod_rx_buf->netmem = netmem;
 		prod_rx_buf->offset = cons_rx_buf->offset;
 
 		prod_rx_buf->mapping = cons_rx_buf->mapping;
@@ -1192,6 +1209,7 @@ static struct sk_buff *bnxt_rx_skb(struct bnxt *bp,
 
 static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
 			       struct bnxt_cp_ring_info *cpr,
+			       struct sk_buff *skb,
 			       struct skb_shared_info *shinfo,
 			       u16 idx, u32 agg_bufs, bool tpa,
 			       struct xdp_buff *xdp)
@@ -1211,7 +1229,7 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
 		u16 cons, frag_len;
 		struct rx_agg_cmp *agg;
 		struct bnxt_sw_rx_agg_bd *cons_rx_buf;
-		struct page *page;
+		netmem_ref netmem;
 		dma_addr_t mapping;
 
 		if (p5_tpa)
@@ -1223,9 +1241,15 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
 			    RX_AGG_CMP_LEN) >> RX_AGG_CMP_LEN_SHIFT;
 
 		cons_rx_buf = &rxr->rx_agg_ring[cons];
-		skb_frag_fill_page_desc(frag, cons_rx_buf->page,
-					cons_rx_buf->offset, frag_len);
-		shinfo->nr_frags = i + 1;
+		if (skb) {
+			skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
+					       cons_rx_buf->offset, frag_len,
+					       BNXT_RX_PAGE_SIZE);
+		} else {
+			skb_frag_fill_page_desc(frag, netmem_to_page(cons_rx_buf->netmem),
+						cons_rx_buf->offset, frag_len);
+			shinfo->nr_frags = i + 1;
+		}
 		__clear_bit(cons, rxr->rx_agg_bmap);
 
 		/* It is possible for bnxt_alloc_rx_page() to allocate
@@ -1233,15 +1257,15 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
 		 * need to clear the cons entry now.
 		 */
 		mapping = cons_rx_buf->mapping;
-		page = cons_rx_buf->page;
-		cons_rx_buf->page = NULL;
+		netmem = cons_rx_buf->netmem;
+		cons_rx_buf->netmem = 0;
 
-		if (xdp && page_is_pfmemalloc(page))
+		if (xdp && page_is_pfmemalloc(netmem_to_page(netmem)))
 			xdp_buff_set_frag_pfmemalloc(xdp);
 
-		if (bnxt_alloc_rx_page(bp, rxr, prod, GFP_ATOMIC) != 0) {
+		if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_ATOMIC) != 0) {
 			--shinfo->nr_frags;
-			cons_rx_buf->page = page;
+			cons_rx_buf->netmem = netmem;
 
 			/* Update prod since possibly some pages have been
 			 * allocated already.
@@ -1269,7 +1293,7 @@ static struct sk_buff *bnxt_rx_agg_pages_skb(struct bnxt *bp,
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	u32 total_frag_len = 0;
 
-	total_frag_len = __bnxt_rx_agg_pages(bp, cpr, shinfo, idx,
+	total_frag_len = __bnxt_rx_agg_pages(bp, cpr, skb, shinfo, idx,
 					     agg_bufs, tpa, NULL);
 	if (!total_frag_len) {
 		skb_mark_for_recycle(skb);
@@ -1277,9 +1301,6 @@ static struct sk_buff *bnxt_rx_agg_pages_skb(struct bnxt *bp,
 		return NULL;
 	}
 
-	skb->data_len += total_frag_len;
-	skb->len += total_frag_len;
-	skb->truesize += BNXT_RX_PAGE_SIZE * agg_bufs;
 	return skb;
 }
 
@@ -1294,7 +1315,7 @@ static u32 bnxt_rx_agg_pages_xdp(struct bnxt *bp,
 	if (!xdp_buff_has_frags(xdp))
 		shinfo->nr_frags = 0;
 
-	total_frag_len = __bnxt_rx_agg_pages(bp, cpr, shinfo,
+	total_frag_len = __bnxt_rx_agg_pages(bp, cpr, NULL, shinfo,
 					     idx, agg_bufs, tpa, xdp);
 	if (total_frag_len) {
 		xdp_buff_set_frags_flag(xdp);
@@ -3342,15 +3363,15 @@ static void bnxt_free_one_rx_agg_ring(struct bnxt *bp, struct bnxt_rx_ring_info
 
 	for (i = 0; i < max_idx; i++) {
 		struct bnxt_sw_rx_agg_bd *rx_agg_buf = &rxr->rx_agg_ring[i];
-		struct page *page = rx_agg_buf->page;
+		netmem_ref netmem = rx_agg_buf->netmem;
 
-		if (!page)
+		if (!netmem)
 			continue;
 
-		rx_agg_buf->page = NULL;
+		rx_agg_buf->netmem = 0;
 		__clear_bit(i, rxr->rx_agg_bmap);
 
-		page_pool_recycle_direct(rxr->page_pool, page);
+		page_pool_put_full_netmem(rxr->page_pool, netmem, true);
 	}
 }
 
@@ -3608,9 +3629,11 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
 
 static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
 				   struct bnxt_rx_ring_info *rxr,
+				   int queue_idx,
 				   int numa_node)
 {
 	struct page_pool_params pp = { 0 };
+	struct netdev_rx_queue *rxq;
 
 	pp.pool_size = bp->rx_agg_ring_size;
 	if (BNXT_RX_PAGE_MODE(bp))
@@ -3621,8 +3644,15 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
 	pp.dev = &bp->pdev->dev;
 	pp.dma_dir = bp->rx_dir;
 	pp.max_len = PAGE_SIZE;
-	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
+	pp.order = 0;
+
+	rxq = __netif_get_rx_queue(bp->dev, queue_idx);
+	if (rxq->mp_params.mp_priv)
+		pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_ALLOW_UNREADABLE_NETMEM;
+	else
+		pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
 
+	pp.queue_idx = queue_idx;
 	rxr->page_pool = page_pool_create(&pp);
 	if (IS_ERR(rxr->page_pool)) {
 		int err = PTR_ERR(rxr->page_pool);
@@ -3655,7 +3685,7 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
 		cpu_node = cpu_to_node(cpu);
 		netdev_dbg(bp->dev, "Allocating page pool for rx_ring[%d] on numa_node: %d\n",
 			   i, cpu_node);
-		rc = bnxt_alloc_rx_page_pool(bp, rxr, cpu_node);
+		rc = bnxt_alloc_rx_page_pool(bp, rxr, i, cpu_node);
 		if (rc)
 			return rc;
 
@@ -4154,7 +4184,7 @@ static void bnxt_alloc_one_rx_ring_page(struct bnxt *bp,
 
 	prod = rxr->rx_agg_prod;
 	for (i = 0; i < bp->rx_agg_ring_size; i++) {
-		if (bnxt_alloc_rx_page(bp, rxr, prod, GFP_KERNEL)) {
+		if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_KERNEL)) {
 			netdev_warn(bp->dev, "init'ed rx ring %d with %d/%d pages only\n",
 				    ring_nr, i, bp->rx_ring_size);
 			break;
@@ -15063,7 +15093,7 @@ static int bnxt_queue_mem_alloc(struct net_device *dev, void *qmem, int idx)
 	clone->rx_sw_agg_prod = 0;
 	clone->rx_next_cons = 0;
 
-	rc = bnxt_alloc_rx_page_pool(bp, clone, rxr->page_pool->p.nid);
+	rc = bnxt_alloc_rx_page_pool(bp, clone, idx, rxr->page_pool->p.nid);
 	if (rc)
 		return rc;
 
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 48f390519c35..3cf57a3c7664 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -895,7 +895,7 @@ struct bnxt_sw_rx_bd {
 };
 
 struct bnxt_sw_rx_agg_bd {
-	struct page		*page;
+	netmem_ref		netmem;
 	unsigned int		offset;
 	dma_addr_t		mapping;
 };
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread
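
[Editor's sketch] The page-pool flag selection in the hunk above reduces to one decision per queue: a devmem-bound queue (non-NULL mp_priv) must allow unreadable netmem and must not ask the pool to DMA-sync pages the CPU cannot touch. A minimal user-space sketch of that decision; the PP_FLAG_* bit values here are illustrative stand-ins, not the kernel's real definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values -- the real flags live in <net/page_pool/types.h>. */
#define PP_FLAG_DMA_MAP			(1U << 0)
#define PP_FLAG_DMA_SYNC_DEV		(1U << 1)
#define PP_FLAG_ALLOW_UNREADABLE_NETMEM	(1U << 2)

/* Mirrors the branch in bnxt_alloc_rx_page_pool(): queue_has_mp is
 * true when rxq->mp_params.mp_priv is set (devmem TCP bound). */
static unsigned int bnxt_pp_flags(bool queue_has_mp)
{
	if (queue_has_mp)
		return PP_FLAG_DMA_MAP | PP_FLAG_ALLOW_UNREADABLE_NETMEM;
	return PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
}
```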

* Re: [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-03 16:06 ` [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command Taehee Yoo
@ 2024-10-03 16:57   ` Brett Creeley
  2024-10-03 17:15     ` Taehee Yoo
  2024-10-03 17:13   ` Michael Chan
  1 sibling, 1 reply; 73+ messages in thread
From: Brett Creeley @ 2024-10-03 16:57 UTC (permalink / raw)
  To: Taehee Yoo, davem, kuba, pabeni, edumazet, almasrymina, netdev,
	linux-doc, donald.hunter, corbet, michael.chan
  Cc: kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala

On 10/3/2024 9:06 AM, Taehee Yoo wrote:
> 
> The bnxt_en driver supports rx-copybreak, but it couldn't be set by
> userspace. Only the default value (256) has worked.
> This patch makes the bnxt_en driver support the following commands:
> `ethtool --set-tunable <devname> rx-copybreak <value>` and
> `ethtool --get-tunable <devname> rx-copybreak`.
> 
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
> 
> v3:
>   - Update copybreak value before closing nic.

Nit, but maybe this should say:

Update copybreak value after closing nic and before opening nic when the 
device is running.

Definitely not worth a respin, but if you end up having to do a v4.

> 
> v2:
>   - Define max/min rx_copybreak value.
> 
>   drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 24 +++++----
>   drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  6 ++-
>   .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 49 ++++++++++++++++++-
>   3 files changed, 68 insertions(+), 11 deletions(-)

Other than the tiny nit, LGTM.

Reviewed-by: Brett Creeley <brett.creeley@amd.com>

<snip>


* Re: [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-03 16:06 ` [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command Taehee Yoo
  2024-10-03 16:57   ` Brett Creeley
@ 2024-10-03 17:13   ` Michael Chan
  2024-10-03 17:22     ` Taehee Yoo
  1 sibling, 1 reply; 73+ messages in thread
From: Michael Chan @ 2024-10-03 17:13 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley


On Thu, Oct 3, 2024 at 9:06 AM Taehee Yoo <ap420073@gmail.com> wrote:
>
> The bnxt_en driver supports rx-copybreak, but it couldn't be set by
> userspace. Only the default value (256) has worked.
> This patch makes the bnxt_en driver support the following commands:
> `ethtool --set-tunable <devname> rx-copybreak <value>` and
> `ethtool --get-tunable <devname> rx-copybreak`.
>
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
>
> v3:
>  - Update copybreak value before closing nic.
>
> v2:
>  - Define max/min rx_copybreak value.
>
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 24 +++++----
>  drivers/net/ethernet/broadcom/bnxt/bnxt.h     |  6 ++-
>  .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 49 ++++++++++++++++++-
>  3 files changed, 68 insertions(+), 11 deletions(-)
>

> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> index 69231e85140b..cff031993223 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> @@ -34,6 +34,10 @@
>  #include <linux/firmware/broadcom/tee_bnxt_fw.h>
>  #endif
>
> +#define BNXT_DEFAULT_RX_COPYBREAK 256
> +#define BNXT_MIN_RX_COPYBREAK 65
> +#define BNXT_MAX_RX_COPYBREAK 1024
> +

Sorry for the late review.  Perhaps we should also support a value of
zero which means to disable RX copybreak.



* Re: [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-03 16:57   ` Brett Creeley
@ 2024-10-03 17:15     ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 17:15 UTC (permalink / raw)
  To: Brett Creeley
  Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala

On Fri, Oct 4, 2024 at 1:57 AM Brett Creeley <bcreeley@amd.com> wrote:
>

Hi Brett,
Thanks a lot for the review!

> On 10/3/2024 9:06 AM, Taehee Yoo wrote:
> >
> > The bnxt_en driver supports rx-copybreak, but it couldn't be set by
> > userspace. Only the default value (256) has worked.
> > This patch makes the bnxt_en driver support the following commands:
> > `ethtool --set-tunable <devname> rx-copybreak <value>` and
> > `ethtool --get-tunable <devname> rx-copybreak`.
> >
> > Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> > ---
> >
> > v3:
> > - Update copybreak value before closing nic.
>
> Nit, but maybe this should say:
>
> Update copybreak value after closing nic and before opening nic when the
> device is running.
>
> Definitely not worth a respin, but if you end up having to do a v4.
>

Thank you so much for catching this.
I will fix it if I send a v4 patch.

> >
> > v2:
> > - Define max/vim rx_copybreak value.
> >
> > drivers/net/ethernet/broadcom/bnxt/bnxt.c | 24 +++++----
> > drivers/net/ethernet/broadcom/bnxt/bnxt.h | 6 ++-
> > .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 49 ++++++++++++++++++-
> > 3 files changed, 68 insertions(+), 11 deletions(-)
>
> Other than the tiny nit, LGTM.
>
> Reviewed-by: Brett Creeley <brett.creeley@amd.com>
>
> <snip>

Thanks a lot!
Taehee Yoo


* Re: [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-03 17:13   ` Michael Chan
@ 2024-10-03 17:22     ` Taehee Yoo
  2024-10-03 17:43       ` Michael Chan
  0 siblings, 1 reply; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 17:22 UTC (permalink / raw)
  To: Michael Chan
  Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley

On Fri, Oct 4, 2024 at 2:14 AM Michael Chan <michael.chan@broadcom.com> wrote:
>

Hi Michael,
Thanks a lot for the review!

> On Thu, Oct 3, 2024 at 9:06 AM Taehee Yoo <ap420073@gmail.com> wrote:
> >
> > The bnxt_en driver supports rx-copybreak, but it couldn't be set by
> > userspace. Only the default value (256) has worked.
> > This patch makes the bnxt_en driver support the following commands:
> > `ethtool --set-tunable <devname> rx-copybreak <value>` and
> > `ethtool --get-tunable <devname> rx-copybreak`.
> >
> > Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> > ---
> >
> > v3:
> > - Update copybreak value before closing nic.
> >
> > v2:
> > > - Define max/min rx_copybreak value.
> >
> > drivers/net/ethernet/broadcom/bnxt/bnxt.c | 24 +++++----
> > drivers/net/ethernet/broadcom/bnxt/bnxt.h | 6 ++-
> > .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 49 ++++++++++++++++++-
> > 3 files changed, 68 insertions(+), 11 deletions(-)
> >
>
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > index 69231e85140b..cff031993223 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > @@ -34,6 +34,10 @@
> > #include <linux/firmware/broadcom/tee_bnxt_fw.h>
> > #endif
> >
> > +#define BNXT_DEFAULT_RX_COPYBREAK 256
> > +#define BNXT_MIN_RX_COPYBREAK 65
> > +#define BNXT_MAX_RX_COPYBREAK 1024
> > +
>
> Sorry for the late review. Perhaps we should also support a value of
> zero which means to disable RX copybreak.

I agree that we need to support disabling rx-copybreak.
What about treating 0 ~ 64 as disabling rx-copybreak?
Or should only 0 be allowed to disable it?

Thanks a lot!
Taehee Yoo


* Re: [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-03 17:22     ` Taehee Yoo
@ 2024-10-03 17:43       ` Michael Chan
  2024-10-03 18:28         ` Taehee Yoo
  2024-10-03 18:34         ` Andrew Lunn
  0 siblings, 2 replies; 73+ messages in thread
From: Michael Chan @ 2024-10-03 17:43 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley


On Thu, Oct 3, 2024 at 10:23 AM Taehee Yoo <ap420073@gmail.com> wrote:
>
> On Fri, Oct 4, 2024 at 2:14 AM Michael Chan <michael.chan@broadcom.com> wrote:
> >
>
> Hi Michael,
> Thanks a lot for the review!
>
> > On Thu, Oct 3, 2024 at 9:06 AM Taehee Yoo <ap420073@gmail.com> wrote:
> > >
> > > The bnxt_en driver supports rx-copybreak, but it couldn't be set by
> > > userspace. Only the default value (256) has worked.
> > > This patch makes the bnxt_en driver support the following commands:
> > > `ethtool --set-tunable <devname> rx-copybreak <value>` and
> > > `ethtool --get-tunable <devname> rx-copybreak`.
> > >
> > > Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> > > ---
> > >
> > > v3:
> > > - Update copybreak value before closing nic.
> > >
> > > v2:
> > > - Define max/min rx_copybreak value.
> > >
> > > drivers/net/ethernet/broadcom/bnxt/bnxt.c | 24 +++++----
> > > drivers/net/ethernet/broadcom/bnxt/bnxt.h | 6 ++-
> > > .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 49 ++++++++++++++++++-
> > > 3 files changed, 68 insertions(+), 11 deletions(-)
> > >
> >
> > > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > > index 69231e85140b..cff031993223 100644
> > > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > > @@ -34,6 +34,10 @@
> > > #include <linux/firmware/broadcom/tee_bnxt_fw.h>
> > > #endif
> > >
> > > +#define BNXT_DEFAULT_RX_COPYBREAK 256
> > > +#define BNXT_MIN_RX_COPYBREAK 65
> > > +#define BNXT_MAX_RX_COPYBREAK 1024
> > > +
> >
> > Sorry for the late review. Perhaps we should also support a value of
> > zero which means to disable RX copybreak.
>
> I agree that we need to support disabling rx-copybreak.
> What about 0 ~ 64 means to disable rx-copybreak?
> Or should only 0 be allowed to disable rx-copybreak?
>

I think a single value of 0 that means disable RX copybreak is more
clear and intuitive.  Also, I think we can allow 64 to be a valid
value.

So, 0 means to disable.  1 to 63 are -EINVAL and 64 to 1024 are valid.  Thanks.
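
[Editor's sketch] The agreed-upon rule (0 disables, 1-63 rejected, 64-1024 valid) can be expressed as a small validation helper. This is a user-space sketch only; bnxt_validate_rx_copybreak is a made-up name, not the driver's actual setter:

```c
#include <assert.h>
#include <errno.h>

#define BNXT_MIN_RX_COPYBREAK	  64
#define BNXT_MAX_RX_COPYBREAK	1024

/* 0 disables RX copybreak; 1..63 are -EINVAL; 64..1024 are accepted. */
static int bnxt_validate_rx_copybreak(unsigned int val)
{
	if (val == 0)
		return 0;	/* disable RX copybreak */
	if (val < BNXT_MIN_RX_COPYBREAK || val > BNXT_MAX_RX_COPYBREAK)
		return -EINVAL;
	return 0;
}
```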



* Re: [PATCH net-next v3 4/7] bnxt_en: add support for tcp-data-split-thresh ethtool command
  2024-10-03 16:06 ` [PATCH net-next v3 4/7] bnxt_en: add support for tcp-data-split-thresh ethtool command Taehee Yoo
@ 2024-10-03 18:13   ` Brett Creeley
  2024-10-03 19:13     ` Taehee Yoo
  2024-10-08 18:35   ` Jakub Kicinski
  1 sibling, 1 reply; 73+ messages in thread
From: Brett Creeley @ 2024-10-03 18:13 UTC (permalink / raw)
  To: Taehee Yoo, davem, kuba, pabeni, edumazet, almasrymina, netdev,
	linux-doc, donald.hunter, corbet, michael.chan
  Cc: kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala



On 10/3/2024 9:06 AM, Taehee Yoo wrote:
> 
> The bnxt_en driver has automatically configured the hds_threshold value,
> based on the rx-copybreak default, when TPA is enabled.
> Now that the tcp-data-split-thresh ethtool command exists, this patch
> adds an implementation of the tcp-data-split-thresh option.
> 
> Configuring tcp-data-split-thresh is allowed only when
> tcp-data-split is enabled. The default value of
> tcp-data-split-thresh is 256, matching the rx-copybreak default,
> which used to serve as the hds_thresh value.
> 
>     # Example:
>     # ethtool -G enp14s0f0np0 tcp-data-split on tcp-data-split-thresh 256
>     # ethtool -g enp14s0f0np0
>     Ring parameters for enp14s0f0np0:
>     Pre-set maximums:
>     ...
>     TCP data split thresh:  256
>     Current hardware settings:
>     ...
>     TCP data split:         on
>     TCP data split thresh:  256
> 
> It enables tcp-data-split and sets tcp-data-split-thresh value to 256.
> 
>     # ethtool -G enp14s0f0np0 tcp-data-split off
>     # ethtool -g enp14s0f0np0
>     Ring parameters for enp14s0f0np0:
>     Pre-set maximums:
>     ...
>     TCP data split thresh:  256
>     Current hardware settings:
>     ...
>     TCP data split:         off
>     TCP data split thresh:  n/a
> 
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
> 
> v3:
>   - Drop validation logic tcp-data-split and tcp-data-split-thresh.
> 
> v2:
>   - Patch added.
> 
>   drivers/net/ethernet/broadcom/bnxt/bnxt.c         | 3 ++-
>   drivers/net/ethernet/broadcom/bnxt/bnxt.h         | 2 ++
>   drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 4 ++++
>   3 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index f046478dfd2a..872b15842b11 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -4455,6 +4455,7 @@ static void bnxt_init_ring_params(struct bnxt *bp)
>   {
>          bp->rx_copybreak = BNXT_DEFAULT_RX_COPYBREAK;
>          bp->flags |= BNXT_FLAG_HDS;
> +       bp->hds_threshold = BNXT_DEFAULT_RX_COPYBREAK;
>   }
> 
>   /* bp->rx_ring_size, bp->tx_ring_size, dev->mtu, BNXT_FLAG_{G|L}RO flags must
> @@ -6429,7 +6430,7 @@ static int bnxt_hwrm_vnic_set_hds(struct bnxt *bp, struct bnxt_vnic_info *vnic)
>                                            VNIC_PLCMODES_CFG_REQ_FLAGS_HDS_IPV6);
>                  req->enables |=
>                          cpu_to_le32(VNIC_PLCMODES_CFG_REQ_ENABLES_HDS_THRESHOLD_VALID);
> -               req->hds_threshold = cpu_to_le16(bp->rx_copybreak);
> +               req->hds_threshold = cpu_to_le16(bp->hds_threshold);
>          }
>          req->vnic_id = cpu_to_le32(vnic->fw_vnic_id);
>          return hwrm_req_send(bp, req);
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> index 35601c71dfe9..48f390519c35 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> @@ -2311,6 +2311,8 @@ struct bnxt {
>          int                     rx_agg_nr_pages;
>          int                     rx_nr_rings;
>          int                     rsscos_nr_ctxs;
> +#define BNXT_HDS_THRESHOLD_MAX 256
> +       u16                     hds_threshold;

Putting this here creates a 2-byte hole right after hds_threshold and 
also puts a 4-byte hole after cp_nr_rings.

Since hds_threshold doesn't seem to be used in the hotpath, maybe it 
would be best to fill a pre-existing hole in struct bnxt with it?

Thanks,

Brett
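
[Editor's sketch] The padding point above can be demonstrated with toy user-space structs (illustrative only, not the real struct bnxt): on the usual LP64 ABIs, wedging a u16 between naturally aligned 32-bit fields costs 4 bytes (2 for the field, 2 of padding), while tucking it into a pre-existing hole costs nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct base_a {			/* two aligned ints: 8 bytes */
	int32_t rsscos_nr_ctxs;
	int32_t rx_nr_rings;
};

struct appended {		/* u16 wedged in the middle: 12 bytes */
	int32_t  rsscos_nr_ctxs;
	uint16_t hds_threshold;	/* 2-byte hole follows */
	int32_t  rx_nr_rings;
};

struct base_b {			/* a struct that already has a hole: 12 bytes */
	uint8_t flag;		/* 3 bytes of padding follow */
	int32_t rsscos_nr_ctxs;
	int32_t rx_nr_rings;
};

struct hole_filled {		/* u16 placed into the hole: still 12 bytes */
	uint8_t  flag;
	uint16_t hds_threshold;	/* reuses 2 of the 3 padding bytes */
	int32_t  rsscos_nr_ctxs;
	int32_t  rx_nr_rings;
};
```

`pahole` (from dwarves) reports exactly these holes on the built object, which is the easy way to find a spot for the field.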

> 
>          u32                     tx_ring_size;
>          u32                     tx_ring_mask;
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> index e9ef65dd2e7b..af6ed492f688 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> @@ -839,6 +839,9 @@ static void bnxt_get_ringparam(struct net_device *dev,
>          else
>                  kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_DISABLED;
> 
> +       kernel_ering->tcp_data_split_thresh = bp->hds_threshold;
> +       kernel_ering->tcp_data_split_thresh_max = BNXT_HDS_THRESHOLD_MAX;
> +
>          ering->tx_max_pending = BNXT_MAX_TX_DESC_CNT;
> 
>          ering->rx_pending = bp->rx_ring_size;
> @@ -871,6 +874,7 @@ static int bnxt_set_ringparam(struct net_device *dev,
>          case ETHTOOL_TCP_DATA_SPLIT_UNKNOWN:
>          case ETHTOOL_TCP_DATA_SPLIT_ENABLED:
>                  bp->flags |= BNXT_FLAG_HDS;
> +               bp->hds_threshold = (u16)kernel_ering->tcp_data_split_thresh;
>                  break;
>          case ETHTOOL_TCP_DATA_SPLIT_DISABLED:
>                  bp->flags &= ~BNXT_FLAG_HDS;
> --
> 2.34.1
> 


* Re: [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh
  2024-10-03 16:06 ` [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh Taehee Yoo
@ 2024-10-03 18:25   ` Mina Almasry
  2024-10-03 19:33     ` Taehee Yoo
  2024-10-08 18:33   ` Jakub Kicinski
  1 sibling, 1 reply; 73+ messages in thread
From: Mina Almasry @ 2024-10-03 18:25 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, kuba, pabeni, edumazet, netdev, linux-doc, donald.hunter,
	corbet, michael.chan, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley

On Thu, Oct 3, 2024 at 9:07 AM Taehee Yoo <ap420073@gmail.com> wrote:
>
> The tcp-data-split-thresh option configures the threshold value of
> the tcp-data-split.
> If a received packet size is larger than this threshold value, a packet
> will be split into header and payload.

Why do you need this? devmem TCP never works with unsplit
packets, so it seems like you always want to set the thresh to 0 to
support something like devmem TCP.

Why would the user ever want to configure this? I can't think of a
scenario where the user wouldn't want packets under X bytes to be
unsplit.


* Re: [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-03 17:43       ` Michael Chan
@ 2024-10-03 18:28         ` Taehee Yoo
  2024-10-03 18:34         ` Andrew Lunn
  1 sibling, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 18:28 UTC (permalink / raw)
  To: Michael Chan
  Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley

On Fri, Oct 4, 2024 at 2:43 AM Michael Chan <michael.chan@broadcom.com> wrote:
>
> On Thu, Oct 3, 2024 at 10:23 AM Taehee Yoo <ap420073@gmail.com> wrote:
> >
> > On Fri, Oct 4, 2024 at 2:14 AM Michael Chan <michael.chan@broadcom.com> wrote:
> > >
> >
> > Hi Michael,
> > Thanks a lot for the review!
> >
> > > On Thu, Oct 3, 2024 at 9:06 AM Taehee Yoo <ap420073@gmail.com> wrote:
> > > >
> > > > The bnxt_en driver supports rx-copybreak, but it couldn't be set by
> > > > userspace. Only the default value (256) has worked.
> > > > This patch makes the bnxt_en driver support the following commands:
> > > > `ethtool --set-tunable <devname> rx-copybreak <value>` and
> > > > `ethtool --get-tunable <devname> rx-copybreak`.
> > > >
> > > > Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> > > > ---
> > > >
> > > > v3:
> > > > - Update copybreak value before closing nic.
> > > >
> > > > v2:
> > > > - Define max/min rx_copybreak value.
> > > >
> > > > drivers/net/ethernet/broadcom/bnxt/bnxt.c | 24 +++++----
> > > > drivers/net/ethernet/broadcom/bnxt/bnxt.h | 6 ++-
> > > > .../net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 49 ++++++++++++++++++-
> > > > 3 files changed, 68 insertions(+), 11 deletions(-)
> > > >
> > >
> > > > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > > > index 69231e85140b..cff031993223 100644
> > > > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > > > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > > > @@ -34,6 +34,10 @@
> > > > #include <linux/firmware/broadcom/tee_bnxt_fw.h>
> > > > #endif
> > > >
> > > > +#define BNXT_DEFAULT_RX_COPYBREAK 256
> > > > +#define BNXT_MIN_RX_COPYBREAK 65
> > > > +#define BNXT_MAX_RX_COPYBREAK 1024
> > > > +
> > >
> > > Sorry for the late review. Perhaps we should also support a value of
> > > zero which means to disable RX copybreak.
> >
> > I agree that we need to support disabling rx-copybreak.
> > What about 0 ~ 64 means to disable rx-copybreak?
> > Or should only 0 be allowed to disable rx-copybreak?
> >
>
> I think a single value of 0 that means disable RX copybreak is more
> clear and intuitive. Also, I think we can allow 64 to be a valid
> value.
>
> So, 0 means to disable. 1 to 63 are -EINVAL and 64 to 1024 are valid. Thanks.

Thanks for that, It's clear to me.
I will change it as you suggested.

Thanks a lot!
Taehee


* Re: [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering
  2024-10-03 16:06 ` [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering Taehee Yoo
@ 2024-10-03 18:29   ` Mina Almasry
  2024-10-04  3:57     ` Taehee Yoo
  2024-10-03 18:35   ` Brett Creeley
  1 sibling, 1 reply; 73+ messages in thread
From: Mina Almasry @ 2024-10-03 18:29 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, kuba, pabeni, edumazet, netdev, linux-doc, donald.hunter,
	corbet, michael.chan, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley

On Thu, Oct 3, 2024 at 9:07 AM Taehee Yoo <ap420073@gmail.com> wrote:
>
> If the driver doesn't support the ring parameters, or the tcp-data-split
> configuration is not sufficient, devmem should not be set up.
> Before setting up devmem, tcp-data-split should be ON and the
> tcp-data-split-thresh value should be 0.
>
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
>
> v3:
>  - Patch added.
>
>  net/core/devmem.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 11b91c12ee11..a9e9b15028e0 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -8,6 +8,8 @@
>   */
>
>  #include <linux/dma-buf.h>
> +#include <linux/ethtool.h>
> +#include <linux/ethtool_netlink.h>
>  #include <linux/genalloc.h>
>  #include <linux/mm.h>
>  #include <linux/netdevice.h>
> @@ -131,6 +133,8 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>                                     struct net_devmem_dmabuf_binding *binding,
>                                     struct netlink_ext_ack *extack)
>  {
> +       struct kernel_ethtool_ringparam kernel_ringparam = {};
> +       struct ethtool_ringparam ringparam = {};
>         struct netdev_rx_queue *rxq;
>         u32 xa_idx;
>         int err;
> @@ -146,6 +150,20 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>                 return -EEXIST;
>         }
>
> +       if (!dev->ethtool_ops->get_ringparam) {
> +               NL_SET_ERR_MSG(extack, "can't get ringparam");
> +               return -EINVAL;
> +       }
> +
> +       dev->ethtool_ops->get_ringparam(dev, &ringparam,
> +                                       &kernel_ringparam, extack);
> +       if (kernel_ringparam.tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_ENABLED ||

The way I had set this up is that the driver checks whether header
split is enabled, and only sets PP_FLAG_ALLOW_UNREADABLE_NETMEM if it
is. Then core detects that the driver did not allow unreadable netmem
and it fails that way.

This check is redundant with that. I'm not 100% opposed to redundant
checks. Maybe they will add some reliability, but also maybe they will
be confusing to check the same thing essentially in 2 places.

Is the PP_FLAG_ALLOW_UNREADABLE_NETMEM trick not sufficient for you?

> +           kernel_ringparam.tcp_data_split_thresh) {
> +               NL_SET_ERR_MSG(extack,
> +                              "tcp-header-data-split is disabled or threshold is not zero");
> +               return -EINVAL;
> +       }
> +
>  #ifdef CONFIG_XDP_SOCKETS
>         if (rxq->pool) {
>                 NL_SET_ERR_MSG(extack, "designated queue already in use by AF_XDP");
> --
> 2.34.1
>


-- 
Thanks,
Mina


* Re: [PATCH net-next v3 6/7] net: ethtool: add ring parameter filtering
  2024-10-03 16:06 ` [PATCH net-next v3 6/7] net: ethtool: " Taehee Yoo
@ 2024-10-03 18:32   ` Mina Almasry
  2024-10-03 19:35     ` Taehee Yoo
  0 siblings, 1 reply; 73+ messages in thread
From: Mina Almasry @ 2024-10-03 18:32 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, kuba, pabeni, edumazet, netdev, linux-doc, donald.hunter,
	corbet, michael.chan, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley

On Thu, Oct 3, 2024 at 9:07 AM Taehee Yoo <ap420073@gmail.com> wrote:
>
> While devmem is running, the tcp-data-split and
> tcp-data-split-thresh configuration should not be changed.
> If the user tries to change the tcp-data-split or threshold value while
> devmem is running, the request fails and an extack message is shown.
>
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
>
> v3:
>  - Patch added
>
>  net/ethtool/common.h |  1 +
>  net/ethtool/rings.c  | 15 ++++++++++++++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/net/ethtool/common.h b/net/ethtool/common.h
> index d55d5201b085..beebd4db3e10 100644
> --- a/net/ethtool/common.h
> +++ b/net/ethtool/common.h
> @@ -5,6 +5,7 @@
>
>  #include <linux/netdevice.h>
>  #include <linux/ethtool.h>
> +#include <net/netdev_rx_queue.h>
>
>  #define ETHTOOL_DEV_FEATURE_WORDS      DIV_ROUND_UP(NETDEV_FEATURE_COUNT, 32)
>
> diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
> index c7824515857f..0afc6b29a229 100644
> --- a/net/ethtool/rings.c
> +++ b/net/ethtool/rings.c
> @@ -216,7 +216,8 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
>         bool mod = false, thresh_mod = false;
>         struct nlattr **tb = info->attrs;
>         const struct nlattr *err_attr;
> -       int ret;
> +       struct netdev_rx_queue *rxq;
> +       int ret, i;
>
>         dev->ethtool_ops->get_ringparam(dev, &ringparam,
>                                         &kernel_ringparam, info->extack);
> @@ -263,6 +264,18 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
>                 return -EINVAL;
>         }
>
> +       if (kernel_ringparam.tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_ENABLED ||
> +           kernel_ringparam.tcp_data_split_thresh) {
> +               for (i = 0; i < dev->real_num_rx_queues; i++) {
> +                       rxq = __netif_get_rx_queue(dev, i);
> +                       if (rxq->mp_params.mp_priv) {
> +                               NL_SET_ERR_MSG(info->extack,
> +                                              "tcp-header-data-split is disabled or threshold is not zero");
> +                               return -EINVAL;
> +                       }

Probably worth adding a helper for this. I think the same loop is
checked in a few places.

Other than that, yes, this looks good to me.


-- 
Thanks,
Mina
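
[Editor's sketch] The helper suggested above would wrap the per-queue mp_priv scan in one place. A user-space sketch with mocked-up structures; dev_has_mp_bound is a hypothetical name, and the real code would walk struct net_device with __netif_get_rx_queue():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal mock-ups of the kernel structures involved. */
struct netdev_rx_queue {
	struct {
		void *mp_priv;	/* non-NULL while devmem TCP is bound */
	} mp_params;
};

struct net_device {
	unsigned int real_num_rx_queues;
	struct netdev_rx_queue *rx;
};

/* True if any RX queue has a memory provider bound. */
static bool dev_has_mp_bound(const struct net_device *dev)
{
	unsigned int i;

	for (i = 0; i < dev->real_num_rx_queues; i++)
		if (dev->rx[i].mp_params.mp_priv)
			return true;
	return false;
}
```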


* Re: [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-03 17:43       ` Michael Chan
  2024-10-03 18:28         ` Taehee Yoo
@ 2024-10-03 18:34         ` Andrew Lunn
  2024-10-05  6:29           ` Taehee Yoo
  1 sibling, 1 reply; 73+ messages in thread
From: Andrew Lunn @ 2024-10-03 18:34 UTC (permalink / raw)
  To: Michael Chan
  Cc: Taehee Yoo, davem, kuba, pabeni, edumazet, almasrymina, netdev,
	linux-doc, donald.hunter, corbet, kory.maincent,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

> > I agree that we need to support disabling rx-copybreak.
> > What about 0 ~ 64 means to disable rx-copybreak?
> > Or should only 0 be allowed to disable rx-copybreak?
> >
> 
> I think a single value of 0 that means disable RX copybreak is more
> clear and intuitive.  Also, I think we can allow 64 to be a valid
> value.
> 
> So, 0 means to disable.  1 to 63 are -EINVAL and 64 to 1024 are valid.  Thanks.

Please spend a little time and see what other drivers do. Ideally we
want one consistent behaviour for all drivers that allow copybreak to
be disabled.

	Andrew


* Re: [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering
  2024-10-03 16:06 ` [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering Taehee Yoo
  2024-10-03 18:29   ` Mina Almasry
@ 2024-10-03 18:35   ` Brett Creeley
  2024-10-03 18:49     ` Mina Almasry
  2024-10-04  4:01     ` Taehee Yoo
  1 sibling, 2 replies; 73+ messages in thread
From: Brett Creeley @ 2024-10-03 18:35 UTC (permalink / raw)
  To: Taehee Yoo, davem, kuba, pabeni, edumazet, almasrymina, netdev,
	linux-doc, donald.hunter, corbet, michael.chan
  Cc: kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala



On 10/3/2024 9:06 AM, Taehee Yoo wrote:
> 
> If the driver doesn't support the ring parameters, or the tcp-data-split
> configuration is not sufficient, devmem should not be set up.
> Before setting up devmem, tcp-data-split should be ON and the
> tcp-data-split-thresh value should be 0.
> 
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
> 
> v3:
>   - Patch added.
> 
>   net/core/devmem.c | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 11b91c12ee11..a9e9b15028e0 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -8,6 +8,8 @@
>    */
> 
>   #include <linux/dma-buf.h>
> +#include <linux/ethtool.h>
> +#include <linux/ethtool_netlink.h>
>   #include <linux/genalloc.h>
>   #include <linux/mm.h>
>   #include <linux/netdevice.h>
> @@ -131,6 +133,8 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>                                      struct net_devmem_dmabuf_binding *binding,
>                                      struct netlink_ext_ack *extack)
>   {
> +       struct kernel_ethtool_ringparam kernel_ringparam = {};
> +       struct ethtool_ringparam ringparam = {};
>          struct netdev_rx_queue *rxq;
>          u32 xa_idx;
>          int err;
> @@ -146,6 +150,20 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>                  return -EEXIST;
>          }
> 
> +       if (!dev->ethtool_ops->get_ringparam) {
> +               NL_SET_ERR_MSG(extack, "can't get ringparam");
> +               return -EINVAL;
> +       }

Is EINVAL the correct return value here? I think it makes more sense as 
EOPNOTSUPP.

> +
> +       dev->ethtool_ops->get_ringparam(dev, &ringparam,
> +                                       &kernel_ringparam, extack);
> +       if (kernel_ringparam.tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_ENABLED ||
> +           kernel_ringparam.tcp_data_split_thresh) {
> +               NL_SET_ERR_MSG(extack,
> +                              "tcp-header-data-split is disabled or threshold is not zero");
> +               return -EINVAL;
> +       }
> +
Maybe just my personal opinion, but IMHO these checks should be separate 
so the error message can be more concise/clear.

Also, a small nit, but I think both of these checks should be before 
getting the rxq via __netif_get_rx_queue().
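A hedged userspace sketch of the split checks suggested here (the struct and names are illustrative stand-ins; the real code would use kernel_ethtool_ringparam and NL_SET_ERR_MSG): a missing get_ringparam callback maps to -EOPNOTSUPP, and each misconfiguration gets its own message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for the ring parameters devmem cares about. */
struct rp_state {
	bool tcp_data_split_enabled;
	unsigned int tcp_data_split_thresh;
};

/* A NULL rp models a driver without a get_ringparam callback. */
int devmem_check_ringparam(const struct rp_state *rp, const char **err)
{
	if (!rp) {
		*err = "driver does not support ring parameters";
		return -EOPNOTSUPP;
	}
	if (!rp->tcp_data_split_enabled) {
		*err = "tcp-data-split is disabled";
		return -EINVAL;
	}
	if (rp->tcp_data_split_thresh) {
		*err = "tcp-data-split-thresh is not zero";
		return -EINVAL;
	}
	return 0;
}
```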


Thanks,

Brett
>   #ifdef CONFIG_XDP_SOCKETS
>          if (rxq->pool) {
>                  NL_SET_ERR_MSG(extack, "designated queue already in use by AF_XDP");
> --
> 2.34.1
> 


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-03 16:06 ` [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp Taehee Yoo
@ 2024-10-03 18:43   ` Mina Almasry
  2024-10-04 10:34     ` Taehee Yoo
  2024-10-05  3:48   ` kernel test robot
  2024-10-08  2:45   ` David Wei
  2 siblings, 1 reply; 73+ messages in thread
From: Mina Almasry @ 2024-10-03 18:43 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, kuba, pabeni, edumazet, netdev, linux-doc, donald.hunter,
	corbet, michael.chan, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley

On Thu, Oct 3, 2024 at 9:07 AM Taehee Yoo <ap420073@gmail.com> wrote:
>
> The bnxt_en driver now satisfies the requirement of device memory
> TCP, which is tcp-data-split, so this patch implements device memory
> TCP support for it.
>
> From now on, the aggregation ring handles netmem_ref instead of page,
> regardless of whether devmem is in use.
> So, for the aggregation ring, memory is handled with the netmem
> page_pool API instead of the generic page_pool API.
>
> If devmem is enabled, the netmem_ref is used as-is; if devmem is not
> enabled, the netmem_ref is converted to a page before use.
>
> The driver recognizes whether devmem is set based on whether
> mp_params.mp_priv is non-NULL.
> Only if devmem is set does it pass PP_FLAG_ALLOW_UNREADABLE_NETMEM.
>
> Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> ---
>
> v3:
>  - Patch added
>
>  drivers/net/ethernet/broadcom/Kconfig     |  1 +
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 98 +++++++++++++++--------
>  drivers/net/ethernet/broadcom/bnxt/bnxt.h |  2 +-
>  3 files changed, 66 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/Kconfig b/drivers/net/ethernet/broadcom/Kconfig
> index 75ca3ddda1f5..f37ff12d4746 100644
> --- a/drivers/net/ethernet/broadcom/Kconfig
> +++ b/drivers/net/ethernet/broadcom/Kconfig
> @@ -211,6 +211,7 @@ config BNXT
>         select FW_LOADER
>         select LIBCRC32C
>         select NET_DEVLINK
> +       select NET_DEVMEM
>         select PAGE_POOL
>         select DIMLIB
>         select AUXILIARY_BUS
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 872b15842b11..64e07d247f97 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -55,6 +55,7 @@
>  #include <net/page_pool/helpers.h>
>  #include <linux/align.h>
>  #include <net/netdev_queues.h>
> +#include <net/netdev_rx_queue.h>
>
>  #include "bnxt_hsi.h"
>  #include "bnxt.h"
> @@ -863,6 +864,22 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
>                 bnapi->events &= ~BNXT_TX_CMP_EVENT;
>  }
>
> +static netmem_ref __bnxt_alloc_rx_netmem(struct bnxt *bp, dma_addr_t *mapping,
> +                                        struct bnxt_rx_ring_info *rxr,
> +                                        unsigned int *offset,
> +                                        gfp_t gfp)
> +{
> +       netmem_ref netmem;
> +
> +       netmem = page_pool_alloc_netmem(rxr->page_pool, GFP_ATOMIC);
> +       if (!netmem)
> +               return 0;
> +       *offset = 0;
> +
> +       *mapping = page_pool_get_dma_addr_netmem(netmem) + *offset;
> +       return netmem;
> +}
> +
>  static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
>                                          struct bnxt_rx_ring_info *rxr,
>                                          unsigned int *offset,
> @@ -972,21 +989,21 @@ static inline u16 bnxt_find_next_agg_idx(struct bnxt_rx_ring_info *rxr, u16 idx)
>         return next;
>  }
>
> -static inline int bnxt_alloc_rx_page(struct bnxt *bp,
> -                                    struct bnxt_rx_ring_info *rxr,
> -                                    u16 prod, gfp_t gfp)
> +static inline int bnxt_alloc_rx_netmem(struct bnxt *bp,
> +                                      struct bnxt_rx_ring_info *rxr,
> +                                      u16 prod, gfp_t gfp)
>  {
>         struct rx_bd *rxbd =
>                 &rxr->rx_agg_desc_ring[RX_AGG_RING(bp, prod)][RX_IDX(prod)];
>         struct bnxt_sw_rx_agg_bd *rx_agg_buf;
> -       struct page *page;
> +       netmem_ref netmem;
>         dma_addr_t mapping;
>         u16 sw_prod = rxr->rx_sw_agg_prod;
>         unsigned int offset = 0;
>
> -       page = __bnxt_alloc_rx_page(bp, &mapping, rxr, &offset, gfp);
> +       netmem = __bnxt_alloc_rx_netmem(bp, &mapping, rxr, &offset, gfp);

Does __bnxt_alloc_rx_page become dead code after this change? Or is it
still used for something?

>
> -       if (!page)
> +       if (!netmem)
>                 return -ENOMEM;
>
>         if (unlikely(test_bit(sw_prod, rxr->rx_agg_bmap)))
> @@ -996,7 +1013,7 @@ static inline int bnxt_alloc_rx_page(struct bnxt *bp,
>         rx_agg_buf = &rxr->rx_agg_ring[sw_prod];
>         rxr->rx_sw_agg_prod = RING_RX_AGG(bp, NEXT_RX_AGG(sw_prod));
>
> -       rx_agg_buf->page = page;
> +       rx_agg_buf->netmem = netmem;
>         rx_agg_buf->offset = offset;
>         rx_agg_buf->mapping = mapping;
>         rxbd->rx_bd_haddr = cpu_to_le64(mapping);
> @@ -1044,7 +1061,7 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_cp_ring_info *cpr, u16 idx,
>                 struct rx_agg_cmp *agg;
>                 struct bnxt_sw_rx_agg_bd *cons_rx_buf, *prod_rx_buf;
>                 struct rx_bd *prod_bd;
> -               struct page *page;
> +               netmem_ref netmem;
>
>                 if (p5_tpa)
>                         agg = bnxt_get_tpa_agg_p5(bp, rxr, idx, start + i);
> @@ -1061,11 +1078,11 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_cp_ring_info *cpr, u16 idx,
>                 cons_rx_buf = &rxr->rx_agg_ring[cons];
>
>                 /* It is possible for sw_prod to be equal to cons, so
> -                * set cons_rx_buf->page to NULL first.
> +                * set cons_rx_buf->netmem to 0 first.
>                  */
> -               page = cons_rx_buf->page;
> -               cons_rx_buf->page = NULL;
> -               prod_rx_buf->page = page;
> +               netmem = cons_rx_buf->netmem;
> +               cons_rx_buf->netmem = 0;
> +               prod_rx_buf->netmem = netmem;
>                 prod_rx_buf->offset = cons_rx_buf->offset;
>
>                 prod_rx_buf->mapping = cons_rx_buf->mapping;
> @@ -1192,6 +1209,7 @@ static struct sk_buff *bnxt_rx_skb(struct bnxt *bp,
>
>  static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
>                                struct bnxt_cp_ring_info *cpr,
> +                              struct sk_buff *skb,
>                                struct skb_shared_info *shinfo,
>                                u16 idx, u32 agg_bufs, bool tpa,
>                                struct xdp_buff *xdp)
> @@ -1211,7 +1229,7 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
>                 u16 cons, frag_len;
>                 struct rx_agg_cmp *agg;
>                 struct bnxt_sw_rx_agg_bd *cons_rx_buf;
> -               struct page *page;
> +               netmem_ref netmem;
>                 dma_addr_t mapping;
>
>                 if (p5_tpa)
> @@ -1223,9 +1241,15 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
>                             RX_AGG_CMP_LEN) >> RX_AGG_CMP_LEN_SHIFT;
>
>                 cons_rx_buf = &rxr->rx_agg_ring[cons];
> -               skb_frag_fill_page_desc(frag, cons_rx_buf->page,
> -                                       cons_rx_buf->offset, frag_len);
> -               shinfo->nr_frags = i + 1;
> +               if (skb) {
> +                       skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
> +                                              cons_rx_buf->offset, frag_len,
> +                                              BNXT_RX_PAGE_SIZE);
> +               } else {
> +                       skb_frag_fill_page_desc(frag, netmem_to_page(cons_rx_buf->netmem),
> +                                               cons_rx_buf->offset, frag_len);

Our intention with the whole netmem design is that drivers should
never have to call netmem_to_page(). I.e. the driver should use netmem
unaware of whether it's page or non-page underneath, to minimize
complexity driver needs to handle.

This netmem_to_page() call can be removed by using
skb_frag_fill_netmem_desc() instead of the page variant. But, more
importantly, why did the code change here? The code before always
called skb_frag_fill_page_desc(), but the new code sometimes calls
skb_add_rx_frag_netmem() and sometimes skb_frag_fill_netmem_desc().
I'm not sure why that logic changed.

> +                       shinfo->nr_frags = i + 1;
> +               }
>                 __clear_bit(cons, rxr->rx_agg_bmap);
>
>                 /* It is possible for bnxt_alloc_rx_page() to allocate
> @@ -1233,15 +1257,15 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
>                  * need to clear the cons entry now.
>                  */
>                 mapping = cons_rx_buf->mapping;
> -               page = cons_rx_buf->page;
> -               cons_rx_buf->page = NULL;
> +               netmem = cons_rx_buf->netmem;
> +               cons_rx_buf->netmem = 0;
>
> -               if (xdp && page_is_pfmemalloc(page))
> +               if (xdp && page_is_pfmemalloc(netmem_to_page(netmem)))

Similarly, add netmem_is_pfmemalloc to netmem.h, instead of doing a
netmem_to_page() call here I think.

>                         xdp_buff_set_frag_pfmemalloc(xdp);
>
> -               if (bnxt_alloc_rx_page(bp, rxr, prod, GFP_ATOMIC) != 0) {
> +               if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_ATOMIC) != 0) {
>                         --shinfo->nr_frags;
> -                       cons_rx_buf->page = page;
> +                       cons_rx_buf->netmem = netmem;
>
>                         /* Update prod since possibly some pages have been
>                          * allocated already.
> @@ -1269,7 +1293,7 @@ static struct sk_buff *bnxt_rx_agg_pages_skb(struct bnxt *bp,
>         struct skb_shared_info *shinfo = skb_shinfo(skb);
>         u32 total_frag_len = 0;
>
> -       total_frag_len = __bnxt_rx_agg_pages(bp, cpr, shinfo, idx,
> +       total_frag_len = __bnxt_rx_agg_pages(bp, cpr, skb, shinfo, idx,
>                                              agg_bufs, tpa, NULL);
>         if (!total_frag_len) {
>                 skb_mark_for_recycle(skb);
> @@ -1277,9 +1301,6 @@ static struct sk_buff *bnxt_rx_agg_pages_skb(struct bnxt *bp,
>                 return NULL;
>         }
>
> -       skb->data_len += total_frag_len;
> -       skb->len += total_frag_len;
> -       skb->truesize += BNXT_RX_PAGE_SIZE * agg_bufs;
>         return skb;
>  }
>
> @@ -1294,7 +1315,7 @@ static u32 bnxt_rx_agg_pages_xdp(struct bnxt *bp,
>         if (!xdp_buff_has_frags(xdp))
>                 shinfo->nr_frags = 0;
>
> -       total_frag_len = __bnxt_rx_agg_pages(bp, cpr, shinfo,
> +       total_frag_len = __bnxt_rx_agg_pages(bp, cpr, NULL, shinfo,
>                                              idx, agg_bufs, tpa, xdp);
>         if (total_frag_len) {
>                 xdp_buff_set_frags_flag(xdp);
> @@ -3342,15 +3363,15 @@ static void bnxt_free_one_rx_agg_ring(struct bnxt *bp, struct bnxt_rx_ring_info
>
>         for (i = 0; i < max_idx; i++) {
>                 struct bnxt_sw_rx_agg_bd *rx_agg_buf = &rxr->rx_agg_ring[i];
> -               struct page *page = rx_agg_buf->page;
> +               netmem_ref netmem = rx_agg_buf->netmem;
>
> -               if (!page)
> +               if (!netmem)
>                         continue;
>
> -               rx_agg_buf->page = NULL;
> +               rx_agg_buf->netmem = 0;
>                 __clear_bit(i, rxr->rx_agg_bmap);
>
> -               page_pool_recycle_direct(rxr->page_pool, page);
> +               page_pool_put_full_netmem(rxr->page_pool, netmem, true);
>         }
>  }
>
> @@ -3608,9 +3629,11 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
>
>  static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
>                                    struct bnxt_rx_ring_info *rxr,
> +                                  int queue_idx,
>                                    int numa_node)
>  {
>         struct page_pool_params pp = { 0 };
> +       struct netdev_rx_queue *rxq;
>
>         pp.pool_size = bp->rx_agg_ring_size;
>         if (BNXT_RX_PAGE_MODE(bp))
> @@ -3621,8 +3644,15 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
>         pp.dev = &bp->pdev->dev;
>         pp.dma_dir = bp->rx_dir;
>         pp.max_len = PAGE_SIZE;
> -       pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> +       pp.order = 0;
> +
> +       rxq = __netif_get_rx_queue(bp->dev, queue_idx);
> +       if (rxq->mp_params.mp_priv)
> +               pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_ALLOW_UNREADABLE_NETMEM;

This is not the intended use of PP_FLAG_ALLOW_UNREADABLE_NETMEM.

The driver should set PP_FLAG_ALLOW_UNREADABLE_NETMEM when it's able
to handle unreadable netmem, it should not worry about whether
rxq->mp_params.mp_priv is set or not.

You should set PP_FLAG_ALLOW_UNREADABLE_NETMEM when HDS is enabled.
Let core figure out if mp_params.mp_priv is enabled. All the driver
needs to report is whether it's configured to be able to handle
unreadable netmem (which practically means HDS is enabled).
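The suggested behaviour could be sketched like this (the flag values below are illustrative, not the kernel's actual PP_FLAG_* constants): the driver derives its page_pool flags only from whether HDS is enabled, never from mp_params.mp_priv.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits, not the kernel's actual values. */
#define PP_FLAG_DMA_MAP			(1U << 0)
#define PP_FLAG_DMA_SYNC_DEV		(1U << 1)
#define PP_FLAG_ALLOW_UNREADABLE_NETMEM	(1U << 2)

/* The driver only reports its own capability: with HDS on it can
 * handle unreadable netmem; the core decides whether a memory
 * provider is actually bound to the queue. */
unsigned int pp_flags_for_queue(bool hds_enabled)
{
	unsigned int flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;

	if (hds_enabled)
		flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
	return flags;
}
```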

> +       else
> +               pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
>
> +       pp.queue_idx = queue_idx;
>         rxr->page_pool = page_pool_create(&pp);
>         if (IS_ERR(rxr->page_pool)) {
>                 int err = PTR_ERR(rxr->page_pool);
> @@ -3655,7 +3685,7 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
>                 cpu_node = cpu_to_node(cpu);
>                 netdev_dbg(bp->dev, "Allocating page pool for rx_ring[%d] on numa_node: %d\n",
>                            i, cpu_node);
> -               rc = bnxt_alloc_rx_page_pool(bp, rxr, cpu_node);
> +               rc = bnxt_alloc_rx_page_pool(bp, rxr, i, cpu_node);
>                 if (rc)
>                         return rc;
>
> @@ -4154,7 +4184,7 @@ static void bnxt_alloc_one_rx_ring_page(struct bnxt *bp,
>
>         prod = rxr->rx_agg_prod;
>         for (i = 0; i < bp->rx_agg_ring_size; i++) {
> -               if (bnxt_alloc_rx_page(bp, rxr, prod, GFP_KERNEL)) {
> +               if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_KERNEL)) {
>                         netdev_warn(bp->dev, "init'ed rx ring %d with %d/%d pages only\n",
>                                     ring_nr, i, bp->rx_ring_size);
>                         break;
> @@ -15063,7 +15093,7 @@ static int bnxt_queue_mem_alloc(struct net_device *dev, void *qmem, int idx)
>         clone->rx_sw_agg_prod = 0;
>         clone->rx_next_cons = 0;
>
> -       rc = bnxt_alloc_rx_page_pool(bp, clone, rxr->page_pool->p.nid);
> +       rc = bnxt_alloc_rx_page_pool(bp, clone, idx, rxr->page_pool->p.nid);
>         if (rc)
>                 return rc;
>
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> index 48f390519c35..3cf57a3c7664 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> @@ -895,7 +895,7 @@ struct bnxt_sw_rx_bd {
>  };
>
>  struct bnxt_sw_rx_agg_bd {
> -       struct page             *page;
> +       netmem_ref              netmem;
>         unsigned int            offset;
>         dma_addr_t              mapping;
>  };
> --
> 2.34.1
>


-- 
Thanks,
Mina


* Re: [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering
  2024-10-03 18:35   ` Brett Creeley
@ 2024-10-03 18:49     ` Mina Almasry
  2024-10-08 19:28       ` Jakub Kicinski
  2024-10-04  4:01     ` Taehee Yoo
  1 sibling, 1 reply; 73+ messages in thread
From: Mina Almasry @ 2024-10-03 18:49 UTC (permalink / raw)
  To: Brett Creeley
  Cc: Taehee Yoo, davem, kuba, pabeni, edumazet, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala

On Thu, Oct 3, 2024 at 11:35 AM Brett Creeley <bcreeley@amd.com> wrote:
>
>
>
> On 10/3/2024 9:06 AM, Taehee Yoo wrote:
> >
> > If the driver doesn't support ring parameters, or the tcp-data-split
> > configuration is not sufficient, devmem should not be set up.
> > Before setting up devmem, tcp-data-split should be ON and the
> > tcp-data-split-thresh value should be 0.
> >
> > Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> > ---
> >
> > v3:
> >   - Patch added.
> >
> >   net/core/devmem.c | 18 ++++++++++++++++++
> >   1 file changed, 18 insertions(+)
> >
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index 11b91c12ee11..a9e9b15028e0 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -8,6 +8,8 @@
> >    */
> >
> >   #include <linux/dma-buf.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/ethtool_netlink.h>
> >   #include <linux/genalloc.h>
> >   #include <linux/mm.h>
> >   #include <linux/netdevice.h>
> > @@ -131,6 +133,8 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> >                                      struct net_devmem_dmabuf_binding *binding,
> >                                      struct netlink_ext_ack *extack)
> >   {
> > +       struct kernel_ethtool_ringparam kernel_ringparam = {};
> > +       struct ethtool_ringparam ringparam = {};
> >          struct netdev_rx_queue *rxq;
> >          u32 xa_idx;
> >          int err;
> > @@ -146,6 +150,20 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> >                  return -EEXIST;
> >          }
> >
> > +       if (!dev->ethtool_ops->get_ringparam) {
> > +               NL_SET_ERR_MSG(extack, "can't get ringparam");
> > +               return -EINVAL;
> > +       }
>
> Is EINVAL the correct return value here? I think it makes more sense as
> EOPNOTSUPP.
>
> > +
> > +       dev->ethtool_ops->get_ringparam(dev, &ringparam,
> > +                                       &kernel_ringparam, extack);
> > +       if (kernel_ringparam.tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_ENABLED ||
> > +           kernel_ringparam.tcp_data_split_thresh) {
> > +               NL_SET_ERR_MSG(extack,
> > +                              "tcp-header-data-split is disabled or threshold is not zero");
> > +               return -EINVAL;
> > +       }
> > +
> Maybe just my personal opinion, but IMHO these checks should be separate
> so the error message can be more concise/clear.
>

Good point. The error message in itself is valuable.


* Re: [PATCH net-next v3 4/7] bnxt_en: add support for tcp-data-split-thresh ethtool command
  2024-10-03 18:13   ` Brett Creeley
@ 2024-10-03 19:13     ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 19:13 UTC (permalink / raw)
  To: Brett Creeley
  Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala

On Fri, Oct 4, 2024 at 3:13 AM Brett Creeley <bcreeley@amd.com> wrote:
>
>
>
> On 10/3/2024 9:06 AM, Taehee Yoo wrote:
> >
> > The bnxt_en driver has automatically configured the hds_threshold
> > value, based on the rx-copybreak default, whenever TPA is enabled.
> > Now that the tcp-data-split-thresh ethtool command has been added,
> > this adds an implementation of the tcp-data-split-thresh option.
> >
> > Configuring tcp-data-split-thresh is allowed only when
> > tcp-data-split is enabled. The default value of
> > tcp-data-split-thresh is 256, the default rx-copybreak value, which
> > used to serve as the hds_thresh value.
> >
> > # Example:
> > # ethtool -G enp14s0f0np0 tcp-data-split on tcp-data-split-thresh 256
> > # ethtool -g enp14s0f0np0
> > Ring parameters for enp14s0f0np0:
> > Pre-set maximums:
> > ...
> > TCP data split thresh: 256
> > Current hardware settings:
> > ...
> > TCP data split: on
> > TCP data split thresh: 256
> >
> > It enables tcp-data-split and sets tcp-data-split-thresh value to 256.
> >
> > # ethtool -G enp14s0f0np0 tcp-data-split off
> > # ethtool -g enp14s0f0np0
> > Ring parameters for enp14s0f0np0:
> > Pre-set maximums:
> > ...
> > TCP data split thresh: 256
> > Current hardware settings:
> > ...
> > TCP data split: off
> > TCP data split thresh: n/a
> >
> > Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> > ---
> >
> > v3:
> > - Drop validation logic tcp-data-split and tcp-data-split-thresh.
> >
> > v2:
> > - Patch added.
> >
> > drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 ++-
> > drivers/net/ethernet/broadcom/bnxt/bnxt.h | 2 ++
> > drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 4 ++++
> > 3 files changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > index f046478dfd2a..872b15842b11 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > @@ -4455,6 +4455,7 @@ static void bnxt_init_ring_params(struct bnxt *bp)
> > {
> > bp->rx_copybreak = BNXT_DEFAULT_RX_COPYBREAK;
> > bp->flags |= BNXT_FLAG_HDS;
> > + bp->hds_threshold = BNXT_DEFAULT_RX_COPYBREAK;
> > }
> >
> > /* bp->rx_ring_size, bp->tx_ring_size, dev->mtu, BNXT_FLAG_{G|L}RO flags must
> > @@ -6429,7 +6430,7 @@ static int bnxt_hwrm_vnic_set_hds(struct bnxt *bp, struct bnxt_vnic_info *vnic)
> > VNIC_PLCMODES_CFG_REQ_FLAGS_HDS_IPV6);
> > req->enables |=
> > cpu_to_le32(VNIC_PLCMODES_CFG_REQ_ENABLES_HDS_THRESHOLD_VALID);
> > - req->hds_threshold = cpu_to_le16(bp->rx_copybreak);
> > + req->hds_threshold = cpu_to_le16(bp->hds_threshold);
> > }
> > req->vnic_id = cpu_to_le32(vnic->fw_vnic_id);
> > return hwrm_req_send(bp, req);
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > index 35601c71dfe9..48f390519c35 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > @@ -2311,6 +2311,8 @@ struct bnxt {
> > int rx_agg_nr_pages;
> > int rx_nr_rings;
> > int rsscos_nr_ctxs;
> > +#define BNXT_HDS_THRESHOLD_MAX 256
> > + u16 hds_threshold;
>
> Putting this here creates a 2 byte hole right after hds_threshold and
> also puts a 4 byte hole after cp_nr_rings.
>
> Since hds_threshold doesn't seem to be used in the hotpath maybe it
> would be best to fill a pre-existing hole in struct bnxt to put it?
>

Yes, hds_threshold introduces an additional 2-byte hole.
I checked the pre-existing holes in struct bnxt, and almost all
members are grouped by purpose.
However, I think just under num_tc would be a good place for
hds_threshold.

Before:
/* size: 7000, cachelines: 110, members: 185 */
/* sum members: 6931, holes: 21, sum holes: 64 */
/* sum bitfield members: 2 bits, bit holes: 1, sum bit holes: 6 bits */
/* padding: 4 */
/* paddings: 7, sum paddings: 25 */
/* last cacheline: 24 bytes */

After:
/* size: 6992, cachelines: 110, members: 185 */
/* sum members: 6931, holes: 19, sum holes: 56 */
/* sum bitfield members: 2 bits, bit holes: 1, sum bit holes: 6 bits */
/* padding: 4 */
/* paddings: 7, sum paddings: 25 */
/* last cacheline: 16 bytes */
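The padding effect pahole reports can be reproduced with a toy struct; the field names below only echo bnxt members for illustration. A lone u16 between 4-byte-aligned members costs padding, while pairing two u16 members packs them into one 4-byte slot.

```c
#include <assert.h>
#include <stdint.h>

/* A lone u16 wedged between 4-byte-aligned members leaves holes. */
struct scattered {
	uint16_t hds_threshold;	/* 2 bytes + 2 bytes of padding */
	uint32_t rx_ring_size;	/* 4 bytes */
	uint16_t num_tc;	/* 2 bytes + 2 bytes of tail padding */
};

/* Grouping the two u16 members lets them share one 4-byte slot. */
struct grouped {
	uint32_t rx_ring_size;	/* 4 bytes */
	uint16_t hds_threshold;	/* 2 bytes */
	uint16_t num_tc;	/* 2 bytes, no padding needed */
};
```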

So, I would like to change it in a v4 patch if there are no objections.

Thanks a lot!
Taehee Yoo

> Thanks,
>
> Brett
>
> >
> > u32 tx_ring_size;
> > u32 tx_ring_mask;
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> > index e9ef65dd2e7b..af6ed492f688 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> > @@ -839,6 +839,9 @@ static void bnxt_get_ringparam(struct net_device *dev,
> > else
> > kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_DISABLED;
> >
> > + kernel_ering->tcp_data_split_thresh = bp->hds_threshold;
> > + kernel_ering->tcp_data_split_thresh_max = BNXT_HDS_THRESHOLD_MAX;
> > +
> > ering->tx_max_pending = BNXT_MAX_TX_DESC_CNT;
> >
> > ering->rx_pending = bp->rx_ring_size;
> > @@ -871,6 +874,7 @@ static int bnxt_set_ringparam(struct net_device *dev,
> > case ETHTOOL_TCP_DATA_SPLIT_UNKNOWN:
> > case ETHTOOL_TCP_DATA_SPLIT_ENABLED:
> > bp->flags |= BNXT_FLAG_HDS;
> > + bp->hds_threshold = (u16)kernel_ering->tcp_data_split_thresh;
> > break;
> > case ETHTOOL_TCP_DATA_SPLIT_DISABLED:
> > bp->flags &= ~BNXT_FLAG_HDS;
> > --
> > 2.34.1
> >


* Re: [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh
  2024-10-03 18:25   ` Mina Almasry
@ 2024-10-03 19:33     ` Taehee Yoo
  2024-10-04  1:47       ` Mina Almasry
  0 siblings, 1 reply; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 19:33 UTC (permalink / raw)
  To: Mina Almasry
  Cc: davem, kuba, pabeni, edumazet, netdev, linux-doc, donald.hunter,
	corbet, michael.chan, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley

On Fri, Oct 4, 2024 at 3:25 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Thu, Oct 3, 2024 at 9:07 AM Taehee Yoo <ap420073@gmail.com> wrote:
> >
> > > The tcp-data-split-thresh option configures the threshold value for
> > > tcp-data-split.
> > > If a received packet is larger than this threshold, the packet will
> > > be split into header and payload.
>
> Why do you need this? devmem TCP will always not work with unsplit
> packets. Seems like you always want to set thresh to 0 to support
> something like devmem TCP.
>
> Why would the user ever want to configure this? I can't think of a
> scenario where the user wouldn't want packets under X bytes to be
> unsplit.

I understand what you mean.
Yes, tcp-data-split is a zerocopy-friendly option, but as far as I
know it is not only for the zerocopy use case.
For zerocopy, users enabling tcp-data-split would indeed expect the
threshold to be 0.
But there are already NICs that support tcp-data-split enabled by
default; bnxt_en's default value is 256 bytes.
If we forced the tcp-data-split threshold to 0 in all cases, it would
change the default behavior of the bnxt_en driver (and maybe other
drivers too) for the non-zerocopy case.
Jakub pointed out the generic, non-zerocopy case in v1 and I agree
with that opinion:
https://lore.kernel.org/netdev/20240906183844.2e8226f3@kernel.org/
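As a hedged sketch of the semantics being discussed: with HDS on, only packets larger than the threshold are split, so a threshold of 0 splits everything (the devmem-friendly setting), while the bnxt_en default of 256 leaves small packets unsplit.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if a received packet of pkt_len bytes would be split
 * into separate header and payload buffers. */
bool hds_would_split(bool hds_enabled, unsigned int hds_thresh,
		     unsigned int pkt_len)
{
	if (!hds_enabled)
		return false;
	return pkt_len > hds_thresh;
}
```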


* Re: [PATCH net-next v3 6/7] net: ethtool: add ring parameter filtering
  2024-10-03 18:32   ` Mina Almasry
@ 2024-10-03 19:35     ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-03 19:35 UTC (permalink / raw)
  To: Mina Almasry
  Cc: davem, kuba, pabeni, edumazet, netdev, linux-doc, donald.hunter,
	corbet, michael.chan, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley

On Fri, Oct 4, 2024 at 3:33 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Thu, Oct 3, 2024 at 9:07 AM Taehee Yoo <ap420073@gmail.com> wrote:
> >
> > While devmem is running, the tcp-data-split and
> > tcp-data-split-thresh configuration should not be changed.
> > If a user tries to change tcp-data-split or the threshold value while
> > devmem is running, the change fails with an extack message.
> >
> > Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> > ---
> >
> > v3:
> >  - Patch added
> >
> >  net/ethtool/common.h |  1 +
> >  net/ethtool/rings.c  | 15 ++++++++++++++-
> >  2 files changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ethtool/common.h b/net/ethtool/common.h
> > index d55d5201b085..beebd4db3e10 100644
> > --- a/net/ethtool/common.h
> > +++ b/net/ethtool/common.h
> > @@ -5,6 +5,7 @@
> >
> >  #include <linux/netdevice.h>
> >  #include <linux/ethtool.h>
> > +#include <net/netdev_rx_queue.h>
> >
> >  #define ETHTOOL_DEV_FEATURE_WORDS      DIV_ROUND_UP(NETDEV_FEATURE_COUNT, 32)
> >
> > diff --git a/net/ethtool/rings.c b/net/ethtool/rings.c
> > index c7824515857f..0afc6b29a229 100644
> > --- a/net/ethtool/rings.c
> > +++ b/net/ethtool/rings.c
> > @@ -216,7 +216,8 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
> >         bool mod = false, thresh_mod = false;
> >         struct nlattr **tb = info->attrs;
> >         const struct nlattr *err_attr;
> > -       int ret;
> > +       struct netdev_rx_queue *rxq;
> > +       int ret, i;
> >
> >         dev->ethtool_ops->get_ringparam(dev, &ringparam,
> >                                         &kernel_ringparam, info->extack);
> > @@ -263,6 +264,18 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
> >                 return -EINVAL;
> >         }
> >
> > +       if (kernel_ringparam.tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_ENABLED ||
> > +           kernel_ringparam.tcp_data_split_thresh) {
> > +               for (i = 0; i < dev->real_num_rx_queues; i++) {
> > +                       rxq = __netif_get_rx_queue(dev, i);
> > +                       if (rxq->mp_params.mp_priv) {
> > +                               NL_SET_ERR_MSG(info->extack,
> > +                                              "tcp-header-data-split is disabled or threshold is not zero");
> > +                               return -EINVAL;
> > +                       }
>
> Probably worth adding a helper for this. I think the same loop
> appears in a few places.
>
> Other than that, yes, this looks good to me.

Thanks, I will add a helper function for this in a v4 patch.
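For reference, such a helper could look roughly like the sketch below. The structure layouts and the helper name (`dev_rx_queue_has_mp`) are simplified mocks for illustration only, not the real kernel types:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified mocks of the kernel structures involved (illustrative only). */
struct netdev_rx_queue {
	struct { void *mp_priv; } mp_params;
};

struct net_device {
	unsigned int real_num_rx_queues;
	struct netdev_rx_queue *rx_queues;
};

/* Hypothetical helper: returns true if any real RX queue has a memory
 * provider (e.g. a devmem dma-buf binding) attached. Callers such as
 * ethnl_set_rings() could then reject tcp-data-split changes while
 * this holds, instead of open-coding the loop.
 */
static bool dev_rx_queue_has_mp(const struct net_device *dev)
{
	unsigned int i;

	for (i = 0; i < dev->real_num_rx_queues; i++)
		if (dev->rx_queues[i].mp_params.mp_priv)
			return true;
	return false;
}
```

The exact name and placement would of course be up to the v4 patch.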

Thanks a lot,
Taehee Yoo

>
>
> --
> Thanks,
> Mina

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh
  2024-10-03 19:33     ` Taehee Yoo
@ 2024-10-04  1:47       ` Mina Almasry
  2024-10-05  6:11         ` Taehee Yoo
  0 siblings, 1 reply; 73+ messages in thread
From: Mina Almasry @ 2024-10-04  1:47 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, kuba, pabeni, edumazet, netdev, linux-doc, donald.hunter,
	corbet, michael.chan, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley

On Thu, Oct 3, 2024 at 12:33 PM Taehee Yoo <ap420073@gmail.com> wrote:
>
> On Fri, Oct 4, 2024 at 3:25 AM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Thu, Oct 3, 2024 at 9:07 AM Taehee Yoo <ap420073@gmail.com> wrote:
> > >
> > > The tcp-data-split-thresh option configures the threshold value for
> > > tcp-data-split.
> > > If a received packet is larger than this threshold value, the packet
> > > will be split into header and payload.
> >
> > Why do you need this? devmem TCP will never work with unsplit
> > packets. It seems like you always want to set the thresh to 0 to
> > support something like devmem TCP.
> >
> > Why would the user ever want to configure this? I can't think of a
> > scenario where the user wouldn't want packets under X bytes to be
> > unsplit.
>
I totally understand what you mean.
Yes, tcp-data-split is a zerocopy-friendly option, but as far as I
know, this option is not only for the zerocopy use case.
So, if users enable tcp-data-split, they would assume the threshold is 0.
But there are already NICs that support tcp-data-split enabled by
default.
bnxt_en's default value is 256 bytes.
If we just assumed a tcp-data-split threshold of 0 for all cases,
it would change the default behavior of the bnxt_en driver (and maybe
other drivers too) for the non-zerocopy case.
Jakub pointed out the generic case, not only the zerocopy use case,
in v1, and I agree with that opinion.
> https://lore.kernel.org/netdev/20240906183844.2e8226f3@kernel.org/

I see, thanks. The ability to tune the threshold to save some PCIe
bandwidth is interesting. I'm not sure how much it would matter in
practice, but I guess if you're receiving _lots_ of small packets it
could be critical.

Sounds good then. Please consider adding Jakub's reasoning for why
tuning this could be valuable to the commit message, for future
readers who wonder why they would set this.
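For future readers, the semantics under discussion can be condensed into a small sketch (illustrative only, not driver code): with header-data split enabled, a packet is split only when it is larger than the threshold, so a threshold of 0 splits everything (what devmem TCP needs), while bnxt_en's default of 256 leaves small packets unsplit to save bus transactions:

```c
#include <stdbool.h>

/* Illustrative only: should this received packet be header/data split?
 * Mirrors the documented rule "a packet larger than the threshold is
 * split into header and payload".
 */
static bool hds_should_split(bool hds_enabled, unsigned int pkt_len,
			     unsigned int thresh)
{
	return hds_enabled && pkt_len > thresh;
}
```

With `thresh = 0`, any non-empty packet is split; with the default of 256, a 64-byte packet is delivered unsplit.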

-- 
Thanks,
Mina

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering
  2024-10-03 18:29   ` Mina Almasry
@ 2024-10-04  3:57     ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-04  3:57 UTC (permalink / raw)
  To: Mina Almasry
  Cc: davem, kuba, pabeni, edumazet, netdev, linux-doc, donald.hunter,
	corbet, michael.chan, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley

On Fri, Oct 4, 2024 at 3:29 AM Mina Almasry <almasrymina@google.com> wrote:
>

Hi Mina,
Thanks a lot for the review!

> On Thu, Oct 3, 2024 at 9:07 AM Taehee Yoo <ap420073@gmail.com> wrote:
> >
> > If the driver doesn't support ring parameters, or the tcp-data-split
> > configuration is not sufficient, devmem should not be set up.
> > Before setting up devmem, tcp-data-split should be ON and the
> > tcp-data-split-thresh value should be 0.
> >
> > Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> > ---
> >
> > v3:
> >  - Patch added.
> >
> >  net/core/devmem.c | 18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> >
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index 11b91c12ee11..a9e9b15028e0 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -8,6 +8,8 @@
> >   */
> >
> >  #include <linux/dma-buf.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/ethtool_netlink.h>
> >  #include <linux/genalloc.h>
> >  #include <linux/mm.h>
> >  #include <linux/netdevice.h>
> > @@ -131,6 +133,8 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> >                                     struct net_devmem_dmabuf_binding *binding,
> >                                     struct netlink_ext_ack *extack)
> >  {
> > +       struct kernel_ethtool_ringparam kernel_ringparam = {};
> > +       struct ethtool_ringparam ringparam = {};
> >         struct netdev_rx_queue *rxq;
> >         u32 xa_idx;
> >         int err;
> > @@ -146,6 +150,20 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> >                 return -EEXIST;
> >         }
> >
> > +       if (!dev->ethtool_ops->get_ringparam) {
> > +               NL_SET_ERR_MSG(extack, "can't get ringparam");
> > +               return -EINVAL;
> > +       }
> > +
> > +       dev->ethtool_ops->get_ringparam(dev, &ringparam,
> > +                                       &kernel_ringparam, extack);
> > +       if (kernel_ringparam.tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_ENABLED ||
>
> The way I had set this up is that the driver checks whether header
> split is enabled, and only sets PP_FLAG_ALLOW_UNREADABLE_NETMEM if it
> is. Then core detects that the driver did not allow unreadable netmem
> and it fails that way.
>
> This check is redundant with that. I'm not 100% opposed to redundant
> checks. Maybe they will add some reliability, but also maybe they will
> be confusing to check the same thing essentially in 2 places.
>
> Is the PP_FLAG_ALLOW_UNREADABLE_NETMEM trick not sufficient for you?

Ah okay, I understand.
It looks like it's already validated well enough based on
PP_FLAG_ALLOW_UNREADABLE_NETMEM.
I tested it the way you suggested, and it works as you intended.
It's indeed a duplicate validation, so I will drop this patch in v4.

Thanks a lot!
Taehee Yoo

>
> > +           kernel_ringparam.tcp_data_split_thresh) {
> > +               NL_SET_ERR_MSG(extack,
> > +                              "tcp-header-data-split is disabled or threshold is not zero");
> > +               return -EINVAL;
> > +       }
> > +
> >  #ifdef CONFIG_XDP_SOCKETS
> >         if (rxq->pool) {
> >                 NL_SET_ERR_MSG(extack, "designated queue already in use by AF_XDP");
> > --
> > 2.34.1
> >
>
>
> --
> Thanks,
> Mina

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering
  2024-10-03 18:35   ` Brett Creeley
  2024-10-03 18:49     ` Mina Almasry
@ 2024-10-04  4:01     ` Taehee Yoo
  1 sibling, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-04  4:01 UTC (permalink / raw)
  To: Brett Creeley
  Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala

On Fri, Oct 4, 2024 at 3:35 AM Brett Creeley <bcreeley@amd.com> wrote:
>

Hi Brett,
Thanks a lot for your review!

>
>
> On 10/3/2024 9:06 AM, Taehee Yoo wrote:
> >
> > If the driver doesn't support ring parameters, or the tcp-data-split
> > configuration is not sufficient, devmem should not be set up.
> > Before setting up devmem, tcp-data-split should be ON and the
> > tcp-data-split-thresh value should be 0.
> >
> > Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> > ---
> >
> > v3:
> >   - Patch added.
> >
> >   net/core/devmem.c | 18 ++++++++++++++++++
> >   1 file changed, 18 insertions(+)
> >
> > diff --git a/net/core/devmem.c b/net/core/devmem.c
> > index 11b91c12ee11..a9e9b15028e0 100644
> > --- a/net/core/devmem.c
> > +++ b/net/core/devmem.c
> > @@ -8,6 +8,8 @@
> >    */
> >
> >   #include <linux/dma-buf.h>
> > +#include <linux/ethtool.h>
> > +#include <linux/ethtool_netlink.h>
> >   #include <linux/genalloc.h>
> >   #include <linux/mm.h>
> >   #include <linux/netdevice.h>
> > @@ -131,6 +133,8 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> >                                      struct net_devmem_dmabuf_binding *binding,
> >                                      struct netlink_ext_ack *extack)
> >   {
> > +       struct kernel_ethtool_ringparam kernel_ringparam = {};
> > +       struct ethtool_ringparam ringparam = {};
> >          struct netdev_rx_queue *rxq;
> >          u32 xa_idx;
> >          int err;
> > @@ -146,6 +150,20 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> >                  return -EEXIST;
> >          }
> >
> > +       if (!dev->ethtool_ops->get_ringparam) {
> > +               NL_SET_ERR_MSG(extack, "can't get ringparam");
> > +               return -EINVAL;
> > +       }
>
> Is EINVAL the correct return value here? I think it makes more sense as
> EOPNOTSUPP.

Yes, Thanks for catching this.

>
> > +
> > +       dev->ethtool_ops->get_ringparam(dev, &ringparam,
> > +                                       &kernel_ringparam, extack);
> > +       if (kernel_ringparam.tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_ENABLED ||
> > +           kernel_ringparam.tcp_data_split_thresh) {
> > +               NL_SET_ERR_MSG(extack,
> > +                              "tcp-header-data-split is disabled or threshold is not zero");
> > +               return -EINVAL;
> > +       }
> > +
> Maybe just my personal opinion, but IMHO these checks should be separate
> so the error message can be more concise/clear.

I agree; the error message is not clear, since it combines two conditions.

>
> Also, a small nit, but I think both of these checks should be before
> getting the rxq via __netif_get_rx_queue().
>

I will drop this patch in v4.

Thanks a lot!
Taehee Yoo

>
> Thanks,
>
> Brett
> >   #ifdef CONFIG_XDP_SOCKETS
> >          if (rxq->pool) {
> >                  NL_SET_ERR_MSG(extack, "designated queue already in use by AF_XDP");
> > --
> > 2.34.1
> >

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-03 18:43   ` Mina Almasry
@ 2024-10-04 10:34     ` Taehee Yoo
  2024-10-08  2:57       ` David Wei
  2024-10-08 19:50       ` Jakub Kicinski
  0 siblings, 2 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-04 10:34 UTC (permalink / raw)
  To: Mina Almasry
  Cc: davem, kuba, pabeni, edumazet, netdev, linux-doc, donald.hunter,
	corbet, michael.chan, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley

On Fri, Oct 4, 2024 at 3:43 AM Mina Almasry <almasrymina@google.com> wrote:
>

Hi Mina,
Thanks a lot for your review!

> On Thu, Oct 3, 2024 at 9:07 AM Taehee Yoo <ap420073@gmail.com> wrote:
> >
> > Currently, the bnxt_en driver satisfies the requirement of device
> > memory TCP, which is tcp-data-split.
> > So, this patch implements device memory TCP for the bnxt_en driver.
> >
> > From now on, the aggregation ring handles netmem_ref instead of page,
> > regardless of whether netmem is enabled.
> > So, for the aggregation ring, memory is handled with the netmem
> > page_pool API instead of the generic page_pool API.
> >
> > If devmem is enabled, the netmem_ref is used as-is; if devmem is not
> > enabled, the netmem_ref is converted to a page, which is then used.
> >
> > The driver recognizes whether devmem is set based on whether
> > mp_params.mp_priv is non-NULL.
> > Only if devmem is set does it pass PP_FLAG_ALLOW_UNREADABLE_NETMEM.
> >
> > Signed-off-by: Taehee Yoo <ap420073@gmail.com>
> > ---
> >
> > v3:
> >  - Patch added
> >
> >  drivers/net/ethernet/broadcom/Kconfig     |  1 +
> >  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 98 +++++++++++++++--------
> >  drivers/net/ethernet/broadcom/bnxt/bnxt.h |  2 +-
> >  3 files changed, 66 insertions(+), 35 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/broadcom/Kconfig b/drivers/net/ethernet/broadcom/Kconfig
> > index 75ca3ddda1f5..f37ff12d4746 100644
> > --- a/drivers/net/ethernet/broadcom/Kconfig
> > +++ b/drivers/net/ethernet/broadcom/Kconfig
> > @@ -211,6 +211,7 @@ config BNXT
> >         select FW_LOADER
> >         select LIBCRC32C
> >         select NET_DEVLINK
> > +       select NET_DEVMEM
> >         select PAGE_POOL
> >         select DIMLIB
> >         select AUXILIARY_BUS
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > index 872b15842b11..64e07d247f97 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > @@ -55,6 +55,7 @@
> >  #include <net/page_pool/helpers.h>
> >  #include <linux/align.h>
> >  #include <net/netdev_queues.h>
> > +#include <net/netdev_rx_queue.h>
> >
> >  #include "bnxt_hsi.h"
> >  #include "bnxt.h"
> > @@ -863,6 +864,22 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
> >                 bnapi->events &= ~BNXT_TX_CMP_EVENT;
> >  }
> >
> > +static netmem_ref __bnxt_alloc_rx_netmem(struct bnxt *bp, dma_addr_t *mapping,
> > +                                        struct bnxt_rx_ring_info *rxr,
> > +                                        unsigned int *offset,
> > +                                        gfp_t gfp)
> > +{
> > +       netmem_ref netmem;
> > +
> > +       netmem = page_pool_alloc_netmem(rxr->page_pool, GFP_ATOMIC);
> > +       if (!netmem)
> > +               return 0;
> > +       *offset = 0;
> > +
> > +       *mapping = page_pool_get_dma_addr_netmem(netmem) + *offset;
> > +       return netmem;
> > +}
> > +
> >  static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
> >                                          struct bnxt_rx_ring_info *rxr,
> >                                          unsigned int *offset,
> > @@ -972,21 +989,21 @@ static inline u16 bnxt_find_next_agg_idx(struct bnxt_rx_ring_info *rxr, u16 idx)
> >         return next;
> >  }
> >
> > -static inline int bnxt_alloc_rx_page(struct bnxt *bp,
> > -                                    struct bnxt_rx_ring_info *rxr,
> > -                                    u16 prod, gfp_t gfp)
> > +static inline int bnxt_alloc_rx_netmem(struct bnxt *bp,
> > +                                      struct bnxt_rx_ring_info *rxr,
> > +                                      u16 prod, gfp_t gfp)
> >  {
> >         struct rx_bd *rxbd =
> >                 &rxr->rx_agg_desc_ring[RX_AGG_RING(bp, prod)][RX_IDX(prod)];
> >         struct bnxt_sw_rx_agg_bd *rx_agg_buf;
> > -       struct page *page;
> > +       netmem_ref netmem;
> >         dma_addr_t mapping;
> >         u16 sw_prod = rxr->rx_sw_agg_prod;
> >         unsigned int offset = 0;
> >
> > -       page = __bnxt_alloc_rx_page(bp, &mapping, rxr, &offset, gfp);
> > +       netmem = __bnxt_alloc_rx_netmem(bp, &mapping, rxr, &offset, gfp);
>
> Does __bnxt_alloc_rx_page become dead code after this change? Or is it
> still used for something?

__bnxt_alloc_rx_page() is still used.

>
> >
> > -       if (!page)
> > +       if (!netmem)
> >                 return -ENOMEM;
> >
> >         if (unlikely(test_bit(sw_prod, rxr->rx_agg_bmap)))
> > @@ -996,7 +1013,7 @@ static inline int bnxt_alloc_rx_page(struct bnxt *bp,
> >         rx_agg_buf = &rxr->rx_agg_ring[sw_prod];
> >         rxr->rx_sw_agg_prod = RING_RX_AGG(bp, NEXT_RX_AGG(sw_prod));
> >
> > -       rx_agg_buf->page = page;
> > +       rx_agg_buf->netmem = netmem;
> >         rx_agg_buf->offset = offset;
> >         rx_agg_buf->mapping = mapping;
> >         rxbd->rx_bd_haddr = cpu_to_le64(mapping);
> > @@ -1044,7 +1061,7 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_cp_ring_info *cpr, u16 idx,
> >                 struct rx_agg_cmp *agg;
> >                 struct bnxt_sw_rx_agg_bd *cons_rx_buf, *prod_rx_buf;
> >                 struct rx_bd *prod_bd;
> > -               struct page *page;
> > +               netmem_ref netmem;
> >
> >                 if (p5_tpa)
> >                         agg = bnxt_get_tpa_agg_p5(bp, rxr, idx, start + i);
> > @@ -1061,11 +1078,11 @@ static void bnxt_reuse_rx_agg_bufs(struct bnxt_cp_ring_info *cpr, u16 idx,
> >                 cons_rx_buf = &rxr->rx_agg_ring[cons];
> >
> >                 /* It is possible for sw_prod to be equal to cons, so
> > -                * set cons_rx_buf->page to NULL first.
> > +                * set cons_rx_buf->netmem to 0 first.
> >                  */
> > -               page = cons_rx_buf->page;
> > -               cons_rx_buf->page = NULL;
> > -               prod_rx_buf->page = page;
> > +               netmem = cons_rx_buf->netmem;
> > +               cons_rx_buf->netmem = 0;
> > +               prod_rx_buf->netmem = netmem;
> >                 prod_rx_buf->offset = cons_rx_buf->offset;
> >
> >                 prod_rx_buf->mapping = cons_rx_buf->mapping;
> > @@ -1192,6 +1209,7 @@ static struct sk_buff *bnxt_rx_skb(struct bnxt *bp,
> >
> >  static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> >                                struct bnxt_cp_ring_info *cpr,
> > +                              struct sk_buff *skb,
> >                                struct skb_shared_info *shinfo,
> >                                u16 idx, u32 agg_bufs, bool tpa,
> >                                struct xdp_buff *xdp)
> > @@ -1211,7 +1229,7 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> >                 u16 cons, frag_len;
> >                 struct rx_agg_cmp *agg;
> >                 struct bnxt_sw_rx_agg_bd *cons_rx_buf;
> > -               struct page *page;
> > +               netmem_ref netmem;
> >                 dma_addr_t mapping;
> >
> >                 if (p5_tpa)
> > @@ -1223,9 +1241,15 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> >                             RX_AGG_CMP_LEN) >> RX_AGG_CMP_LEN_SHIFT;
> >
> >                 cons_rx_buf = &rxr->rx_agg_ring[cons];
> > -               skb_frag_fill_page_desc(frag, cons_rx_buf->page,
> > -                                       cons_rx_buf->offset, frag_len);
> > -               shinfo->nr_frags = i + 1;
> > +               if (skb) {
> > +                       skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
> > +                                              cons_rx_buf->offset, frag_len,
> > +                                              BNXT_RX_PAGE_SIZE);
> > +               } else {
> > +                       skb_frag_fill_page_desc(frag, netmem_to_page(cons_rx_buf->netmem),
> > +                                               cons_rx_buf->offset, frag_len);
>
> Our intention with the whole netmem design is that drivers should
> never have to call netmem_to_page(). I.e. the driver should use netmem
> unaware of whether it's page or non-page underneath, to minimize the
> complexity the driver needs to handle.
>
> This netmem_to_page() call can be removed by using
> skb_frag_fill_netmem_desc() instead of the page variant. But, more
> importantly, why did the code change here? The code before calls
> skb_frag_fill_page_desc(), but the new code sometimes calls
> skb_frag_fill_netmem_desc() and sometimes skb_add_rx_frag_netmem().
> I'm not sure why that logic changed.

The reason skb_add_rx_frag_netmem() is used here is to set the
skb->unreadable flag. skb_frag_fill_netmem_desc() doesn't set
skb->unreadable because it doesn't handle the skb; it only handles the
frag.
As far as I know, skb->unreadable should be set to true for devmem
TCP. Am I misunderstanding?
I tested not using skb_add_rx_frag_netmem() here, and it immediately
fails.

The "if (skb)" branch is hit only in the devmem TCP path.
Normal packets and the XDP path hit the "else" branch.

I will use skb_frag_fill_netmem_desc() instead of
skb_frag_fill_page_desc() in the "else" branch.
With this change, as you said, there is no netmem_to_page() left in the
bnxt_en driver. Thanks!
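The distinction described above can be sketched with simplified mock types (not the real skb/netmem API; names prefixed `mock_` are invented for illustration): the skb-level helper propagates the frag's unreadability to the skb, while the frag-level fill never touches skb state:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified mocks (illustrative only). */
typedef unsigned long netmem_ref;

/* Pretend odd refs are net_iov-backed (unreadable), loosely echoing the
 * kernel's low-bit tagging of netmem_ref.
 */
static bool netmem_is_net_iov(netmem_ref netmem) { return netmem & 1; }

struct mock_frag { netmem_ref netmem; unsigned int offset, len; };

struct mock_skb {
	bool unreadable;
	unsigned int nr_frags;
	struct mock_frag frags[16];
};

/* Frag-level fill: only writes the descriptor; skb state is untouched. */
static void mock_frag_fill_netmem_desc(struct mock_frag *frag,
				       netmem_ref netmem,
				       unsigned int off, unsigned int len)
{
	frag->netmem = netmem;
	frag->offset = off;
	frag->len = len;
}

/* skb-level helper: also marks the skb unreadable for devmem frags. */
static void mock_skb_add_rx_frag_netmem(struct mock_skb *skb, int i,
					netmem_ref netmem,
					unsigned int off, unsigned int len)
{
	mock_frag_fill_netmem_desc(&skb->frags[i], netmem, off, len);
	skb->nr_frags = i + 1;
	if (netmem_is_net_iov(netmem))
		skb->unreadable = true;
}
```

This is why the devmem path needs the skb-level helper: only it can flip the skb-wide flag that tells the TCP stack the payload is not CPU-readable.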

>
> > +                       shinfo->nr_frags = i + 1;
> > +               }
> >                 __clear_bit(cons, rxr->rx_agg_bmap);
> >
> >                 /* It is possible for bnxt_alloc_rx_page() to allocate
> > @@ -1233,15 +1257,15 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> >                  * need to clear the cons entry now.
> >                  */
> >                 mapping = cons_rx_buf->mapping;
> > -               page = cons_rx_buf->page;
> > -               cons_rx_buf->page = NULL;
> > +               netmem = cons_rx_buf->netmem;
> > +               cons_rx_buf->netmem = 0;
> >
> > -               if (xdp && page_is_pfmemalloc(page))
> > +               if (xdp && page_is_pfmemalloc(netmem_to_page(netmem)))
>
> Similarly, add netmem_is_pfmemalloc to netmem.h, instead of doing a
> netmem_to_page() call here I think.

Thanks, I will add netmem_is_pfmemalloc() to netmem.h in a v4 patch.

>
> >                         xdp_buff_set_frag_pfmemalloc(xdp);
> >
> > -               if (bnxt_alloc_rx_page(bp, rxr, prod, GFP_ATOMIC) != 0) {
> > +               if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_ATOMIC) != 0) {
> >                         --shinfo->nr_frags;
> > -                       cons_rx_buf->page = page;
> > +                       cons_rx_buf->netmem = netmem;
> >
> >                         /* Update prod since possibly some pages have been
> >                          * allocated already.
> > @@ -1269,7 +1293,7 @@ static struct sk_buff *bnxt_rx_agg_pages_skb(struct bnxt *bp,
> >         struct skb_shared_info *shinfo = skb_shinfo(skb);
> >         u32 total_frag_len = 0;
> >
> > -       total_frag_len = __bnxt_rx_agg_pages(bp, cpr, shinfo, idx,
> > +       total_frag_len = __bnxt_rx_agg_pages(bp, cpr, skb, shinfo, idx,
> >                                              agg_bufs, tpa, NULL);
> >         if (!total_frag_len) {
> >                 skb_mark_for_recycle(skb);
> > @@ -1277,9 +1301,6 @@ static struct sk_buff *bnxt_rx_agg_pages_skb(struct bnxt *bp,
> >                 return NULL;
> >         }
> >
> > -       skb->data_len += total_frag_len;
> > -       skb->len += total_frag_len;
> > -       skb->truesize += BNXT_RX_PAGE_SIZE * agg_bufs;
> >         return skb;
> >  }
> >
> > @@ -1294,7 +1315,7 @@ static u32 bnxt_rx_agg_pages_xdp(struct bnxt *bp,
> >         if (!xdp_buff_has_frags(xdp))
> >                 shinfo->nr_frags = 0;
> >
> > -       total_frag_len = __bnxt_rx_agg_pages(bp, cpr, shinfo,
> > +       total_frag_len = __bnxt_rx_agg_pages(bp, cpr, NULL, shinfo,
> >                                              idx, agg_bufs, tpa, xdp);
> >         if (total_frag_len) {
> >                 xdp_buff_set_frags_flag(xdp);
> > @@ -3342,15 +3363,15 @@ static void bnxt_free_one_rx_agg_ring(struct bnxt *bp, struct bnxt_rx_ring_info
> >
> >         for (i = 0; i < max_idx; i++) {
> >                 struct bnxt_sw_rx_agg_bd *rx_agg_buf = &rxr->rx_agg_ring[i];
> > -               struct page *page = rx_agg_buf->page;
> > +               netmem_ref netmem = rx_agg_buf->netmem;
> >
> > -               if (!page)
> > +               if (!netmem)
> >                         continue;
> >
> > -               rx_agg_buf->page = NULL;
> > +               rx_agg_buf->netmem = 0;
> >                 __clear_bit(i, rxr->rx_agg_bmap);
> >
> > -               page_pool_recycle_direct(rxr->page_pool, page);
> > +               page_pool_put_full_netmem(rxr->page_pool, netmem, true);
> >         }
> >  }
> >
> > @@ -3608,9 +3629,11 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
> >
> >  static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> >                                    struct bnxt_rx_ring_info *rxr,
> > +                                  int queue_idx,
> >                                    int numa_node)
> >  {
> >         struct page_pool_params pp = { 0 };
> > +       struct netdev_rx_queue *rxq;
> >
> >         pp.pool_size = bp->rx_agg_ring_size;
> >         if (BNXT_RX_PAGE_MODE(bp))
> > @@ -3621,8 +3644,15 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> >         pp.dev = &bp->pdev->dev;
> >         pp.dma_dir = bp->rx_dir;
> >         pp.max_len = PAGE_SIZE;
> > -       pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> > +       pp.order = 0;
> > +
> > +       rxq = __netif_get_rx_queue(bp->dev, queue_idx);
> > +       if (rxq->mp_params.mp_priv)
> > +               pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_ALLOW_UNREADABLE_NETMEM;
>
> This is not the intended use of PP_FLAG_ALLOW_UNREADABLE_NETMEM.
>
> The driver should set PP_FLAG_ALLOW_UNREADABLE_NETMEM when it's able
> to handle unreadable netmem, it should not worry about whether
> rxq->mp_params.mp_priv is set or not.
>
> You should set PP_FLAG_ALLOW_UNREADABLE_NETMEM when HDS is enabled.
> Let core figure out if mp_params.mp_priv is enabled. All the driver
> needs to report is whether it's configured to be able to handle
> unreadable netmem (which practically means HDS is enabled).

The reason the branch exists here is that the
PP_FLAG_ALLOW_UNREADABLE_NETMEM flag can't be used with
PP_FLAG_DMA_SYNC_DEV.

 228         if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) {
 229                 /* In order to request DMA-sync-for-device the page
 230                  * needs to be mapped
 231                  */
 232                 if (!(pool->slow.flags & PP_FLAG_DMA_MAP))
 233                         return -EINVAL;
 234
 235                 if (!pool->p.max_len)
 236                         return -EINVAL;
 237
 238                 pool->dma_sync = true;                //here
 239
 240                 /* pool->p.offset has to be set according to the address
 241                  * offset used by the DMA engine to start copying rx data
 242                  */
 243         }

If PP_FLAG_DMA_SYNC_DEV is set, pool->dma_sync is set to true.

347 int mp_dmabuf_devmem_init(struct page_pool *pool)
348 {
349         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
350
351         if (!binding)
352                 return -EINVAL;
353
354         if (!pool->dma_map)
355                 return -EOPNOTSUPP;
356
357         if (pool->dma_sync)                      //here
358                 return -EOPNOTSUPP;
359
360         if (pool->p.order != 0)
361                 return -E2BIG;
362
363         net_devmem_dmabuf_binding_get(binding);
364         return 0;
365 }

In the mp_dmabuf_devmem_init(), it fails when pool->dma_sync is true.

tcp-data-split can be used in normal cases, not only the devmem TCP
case. If we keyed the flags on tcp-data-split alone, a page_pool with
tcp-data-split enabled but devmem TCP disabled wouldn't get
PP_FLAG_DMA_SYNC_DEV either.
So I think checking mp_params.mp_priv is still useful.
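The interaction between the two checks quoted above can be modeled in a few lines (a sketch with mock types and invented flag values, not the real page_pool ABI): a pool created with PP_FLAG_DMA_SYNC_DEV records `dma_sync`, and the devmem memory provider then refuses to initialize on it:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bits and pool model (not the real page_pool ABI). */
#define PP_FLAG_DMA_MAP			0x1u
#define PP_FLAG_DMA_SYNC_DEV		0x2u
#define PP_FLAG_ALLOW_UNREADABLE_NETMEM	0x4u

struct mock_pool {
	unsigned int flags;
	bool dma_map, dma_sync;
	void *mp_priv;	/* devmem dma-buf binding, if any */
};

/* Mirrors the quoted page_pool init: DMA_SYNC_DEV requires DMA_MAP
 * and latches pool->dma_sync = true.
 */
static int mock_pool_init(struct mock_pool *pool)
{
	pool->dma_map = pool->flags & PP_FLAG_DMA_MAP;
	if (pool->flags & PP_FLAG_DMA_SYNC_DEV) {
		if (!pool->dma_map)
			return -EINVAL;
		pool->dma_sync = true;
	}
	return 0;
}

/* Mirrors the quoted mp_dmabuf_devmem_init(): a syncing pool is rejected. */
static int mock_devmem_init(struct mock_pool *pool)
{
	if (!pool->mp_priv)
		return -EINVAL;
	if (!pool->dma_map)
		return -EOPNOTSUPP;
	if (pool->dma_sync)
		return -EOPNOTSUPP;
	return 0;
}
```

This is the constraint the per-queue branch in bnxt_alloc_rx_page_pool() resolves: a devmem-bound queue must drop PP_FLAG_DMA_SYNC_DEV, while an ordinary HDS queue should keep it.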

Thanks a lot,
Taehee Yoo

>
> > +       else
> > +               pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> >
> > +       pp.queue_idx = queue_idx;
> >         rxr->page_pool = page_pool_create(&pp);
> >         if (IS_ERR(rxr->page_pool)) {
> >                 int err = PTR_ERR(rxr->page_pool);
> > @@ -3655,7 +3685,7 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
> >                 cpu_node = cpu_to_node(cpu);
> >                 netdev_dbg(bp->dev, "Allocating page pool for rx_ring[%d] on numa_node: %d\n",
> >                            i, cpu_node);
> > -               rc = bnxt_alloc_rx_page_pool(bp, rxr, cpu_node);
> > +               rc = bnxt_alloc_rx_page_pool(bp, rxr, i, cpu_node);
> >                 if (rc)
> >                         return rc;
> >
> > @@ -4154,7 +4184,7 @@ static void bnxt_alloc_one_rx_ring_page(struct bnxt *bp,
> >
> >         prod = rxr->rx_agg_prod;
> >         for (i = 0; i < bp->rx_agg_ring_size; i++) {
> > -               if (bnxt_alloc_rx_page(bp, rxr, prod, GFP_KERNEL)) {
> > +               if (bnxt_alloc_rx_netmem(bp, rxr, prod, GFP_KERNEL)) {
> >                         netdev_warn(bp->dev, "init'ed rx ring %d with %d/%d pages only\n",
> >                                     ring_nr, i, bp->rx_ring_size);
> >                         break;
> > @@ -15063,7 +15093,7 @@ static int bnxt_queue_mem_alloc(struct net_device *dev, void *qmem, int idx)
> >         clone->rx_sw_agg_prod = 0;
> >         clone->rx_next_cons = 0;
> >
> > -       rc = bnxt_alloc_rx_page_pool(bp, clone, rxr->page_pool->p.nid);
> > +       rc = bnxt_alloc_rx_page_pool(bp, clone, idx, rxr->page_pool->p.nid);
> >         if (rc)
> >                 return rc;
> >
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > index 48f390519c35..3cf57a3c7664 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
> > @@ -895,7 +895,7 @@ struct bnxt_sw_rx_bd {
> >  };
> >
> >  struct bnxt_sw_rx_agg_bd {
> > -       struct page             *page;
> > +       netmem_ref              netmem;
> >         unsigned int            offset;
> >         dma_addr_t              mapping;
> >  };
> > --
> > 2.34.1
> >
>
>
> --
> Thanks,
> Mina

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-03 16:06 ` [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp Taehee Yoo
  2024-10-03 18:43   ` Mina Almasry
@ 2024-10-05  3:48   ` kernel test robot
  2024-10-08  2:45   ` David Wei
  2 siblings, 0 replies; 73+ messages in thread
From: kernel test robot @ 2024-10-05  3:48 UTC (permalink / raw)
  To: Taehee Yoo, davem, kuba, pabeni, edumazet, almasrymina, netdev,
	linux-doc, donald.hunter, corbet, michael.chan
  Cc: Paul Gazzillo, Necip Fazil Yildiran, oe-kbuild-all, kory.maincent,
	andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley, ap420073

Hi Taehee,

kernel test robot noticed the following build warnings:

[auto build test WARNING on net-next/main]

url:    https://github.com/intel-lab-lkp/linux/commits/Taehee-Yoo/bnxt_en-add-support-for-rx-copybreak-ethtool-command/20241004-000934
base:   net-next/main
patch link:    https://lore.kernel.org/r/20241003160620.1521626-8-ap420073%40gmail.com
patch subject: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
config: x86_64-kismet-CONFIG_NET_DEVMEM-CONFIG_BNXT-0-0 (https://download.01.org/0day-ci/archive/20241005/202410051156.r68SYo4V-lkp@intel.com/config)
reproduce: (https://download.01.org/0day-ci/archive/20241005/202410051156.r68SYo4V-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410051156.r68SYo4V-lkp@intel.com/

kismet warnings: (new ones prefixed by >>)
>> kismet: WARNING: unmet direct dependencies detected for NET_DEVMEM when selected by BNXT
   WARNING: unmet direct dependencies detected for NET_DEVMEM
     Depends on [n]: NET [=y] && DMA_SHARED_BUFFER [=n] && GENERIC_ALLOCATOR [=y] && PAGE_POOL [=y]
     Selected by [y]:
     - BNXT [=y] && NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_BROADCOM [=y] && PCI [=y] && PTP_1588_CLOCK_OPTIONAL [=y]

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh
  2024-10-04  1:47       ` Mina Almasry
@ 2024-10-05  6:11         ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-05  6:11 UTC (permalink / raw)
  To: Mina Almasry
  Cc: davem, kuba, pabeni, edumazet, netdev, linux-doc, donald.hunter,
	corbet, michael.chan, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, dw, sridhar.samudrala,
	bcreeley

On Fri, Oct 4, 2024 at 10:47 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Thu, Oct 3, 2024 at 12:33 PM Taehee Yoo <ap420073@gmail.com> wrote:
> >
> > On Fri, Oct 4, 2024 at 3:25 AM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > On Thu, Oct 3, 2024 at 9:07 AM Taehee Yoo <ap420073@gmail.com> wrote:
> > > >
> > > > The tcp-data-split-thresh option configures the threshold value of
> > > > the tcp-data-split feature.
> > > > If a received packet is larger than this threshold, the packet
> > > > will be split into header and payload.
> > >
> > > Why do you need this? devmem TCP will always not work with unsplit
> > > packets. Seems like you always want to set thresh to 0 to support
> > > something like devmem TCP.
> > >
> > > Why would the user ever want to configure this? I can't think of a
> > > scenario where the user wouldn't want packets under X bytes to be
> > > unsplit.
> >
> > I totally understand what you mean.
> > Yes, tcp-data-split is a zerocopy-friendly option, but as far as I know,
> > this option is not only for the zerocopy use case.
> > So, if users enable tcp-data-split, they would assume the threshold is 0.
> > But there are already NICs that have been shipping with tcp-data-split
> > enabled by default.
> > bnxt_en's default value is 256 bytes.
> > If we just assume a tcp-data-split threshold of 0 for all cases,
> > it would change the default behavior of the bnxt_en driver (and maybe
> > other drivers too) for the non-zerocopy case.
> > Jakub pointed out the generic case, not only the zerocopy use case,
> > in v1, and I agree with that opinion.
> > https://lore.kernel.org/netdev/20240906183844.2e8226f3@kernel.org/
>
> I see, thanks. The ability to tune the threshold to save some pcie
> bandwidth is interesting. Not sure how much it would matter in
> practice. I guess if you're receiving _lots_ of small packets then it
> could be critical.
>
> Sounds good then, please consider adding Jakub's reasoning for why
> tuning this could be valuable to the commit message for future
> userspace readers that wonder why to set this.

Okay, I will add an explanation of this feature to the commit message in the v4 patch.

Thanks a lot!
Taehee Yoo

>
> --
> Thanks,
> Mina

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-03 18:34         ` Andrew Lunn
@ 2024-10-05  6:29           ` Taehee Yoo
  2024-10-08 18:10             ` Jakub Kicinski
  0 siblings, 1 reply; 73+ messages in thread
From: Taehee Yoo @ 2024-10-05  6:29 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Michael Chan, davem, kuba, pabeni, edumazet, almasrymina, netdev,
	linux-doc, donald.hunter, corbet, kory.maincent,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Fri, Oct 4, 2024 at 1:41 PM Andrew Lunn <andrew@lunn.ch> wrote:
>

Hi Andrew,
Thanks a lot for the review!

> > > I agree that we need to support disabling rx-copybreak.
> > > What about 0 ~ 64 means to disable rx-copybreak?
> > > Or should only 0 be allowed to disable rx-copybreak?
> > >
> >
> > I think a single value of 0 that means disable RX copybreak is more
> > clear and intuitive.  Also, I think we can allow 64 to be a valid
> > value.
> >
> > So, 0 means to disable.  1 to 63 are -EINVAL and 64 to 1024 are valid.  Thanks.
>
> Please spend a little time and see what other drivers do. Ideally we
> want one consistent behaviour for all drivers that allow copybreak to
> be disabled.

There is no specific disable value in other drivers.
But some other drivers have a min/max rx-copybreak value.
If rx-copybreak is low enough, it will simply never take effect.
So the min value has effectively been working as a disable value.

I think Andrew's point makes sense.
So I would like to change the min value from 65 to 64, rather than add
a disable value.
Thanks a lot!
Taehee Yoo

>
>         Andrew

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-03 16:06 ` [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp Taehee Yoo
  2024-10-03 18:43   ` Mina Almasry
  2024-10-05  3:48   ` kernel test robot
@ 2024-10-08  2:45   ` David Wei
  2024-10-08  3:54     ` Taehee Yoo
  2 siblings, 1 reply; 73+ messages in thread
From: David Wei @ 2024-10-08  2:45 UTC (permalink / raw)
  To: Taehee Yoo, davem, kuba, pabeni, edumazet, almasrymina, netdev,
	linux-doc, donald.hunter, corbet, michael.chan
  Cc: kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, sridhar.samudrala, bcreeley

On 2024-10-03 09:06, Taehee Yoo wrote:
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 872b15842b11..64e07d247f97 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -55,6 +55,7 @@
>  #include <net/page_pool/helpers.h>
>  #include <linux/align.h>
>  #include <net/netdev_queues.h>
> +#include <net/netdev_rx_queue.h>
>  
>  #include "bnxt_hsi.h"
>  #include "bnxt.h"
> @@ -863,6 +864,22 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
>  		bnapi->events &= ~BNXT_TX_CMP_EVENT;
>  }
>  
> +static netmem_ref __bnxt_alloc_rx_netmem(struct bnxt *bp, dma_addr_t *mapping,
> +					 struct bnxt_rx_ring_info *rxr,
> +					 unsigned int *offset,
> +					 gfp_t gfp)

gfp is unused

> +{
> +	netmem_ref netmem;
> +
> +	netmem = page_pool_alloc_netmem(rxr->page_pool, GFP_ATOMIC);
> +	if (!netmem)
> +		return 0;
> +	*offset = 0;
> +
> +	*mapping = page_pool_get_dma_addr_netmem(netmem) + *offset;

offset is always 0

> +	return netmem;
> +}
> +
>  static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
>  					 struct bnxt_rx_ring_info *rxr,
>  					 unsigned int *offset,

[...]

> @@ -1192,6 +1209,7 @@ static struct sk_buff *bnxt_rx_skb(struct bnxt *bp,
>  
>  static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
>  			       struct bnxt_cp_ring_info *cpr,
> +			       struct sk_buff *skb,
>  			       struct skb_shared_info *shinfo,
>  			       u16 idx, u32 agg_bufs, bool tpa,
>  			       struct xdp_buff *xdp)
> @@ -1211,7 +1229,7 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
>  		u16 cons, frag_len;
>  		struct rx_agg_cmp *agg;
>  		struct bnxt_sw_rx_agg_bd *cons_rx_buf;
> -		struct page *page;
> +		netmem_ref netmem;
>  		dma_addr_t mapping;
>  
>  		if (p5_tpa)
> @@ -1223,9 +1241,15 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
>  			    RX_AGG_CMP_LEN) >> RX_AGG_CMP_LEN_SHIFT;
>  
>  		cons_rx_buf = &rxr->rx_agg_ring[cons];
> -		skb_frag_fill_page_desc(frag, cons_rx_buf->page,
> -					cons_rx_buf->offset, frag_len);
> -		shinfo->nr_frags = i + 1;
> +		if (skb) {
> +			skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
> +					       cons_rx_buf->offset, frag_len,
> +					       BNXT_RX_PAGE_SIZE);
> +		} else {
> +			skb_frag_fill_page_desc(frag, netmem_to_page(cons_rx_buf->netmem),
> +						cons_rx_buf->offset, frag_len);
> +			shinfo->nr_frags = i + 1;
> +		}

I feel like this function needs a refactor at some point to split out
the skb and xdp paths.

>  		__clear_bit(cons, rxr->rx_agg_bmap);
>  
>  		/* It is possible for bnxt_alloc_rx_page() to allocate

[...]

> @@ -3608,9 +3629,11 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
>  
>  static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
>  				   struct bnxt_rx_ring_info *rxr,
> +				   int queue_idx,

To save a parameter, the index is available already in rxr->bnapi->index

>  				   int numa_node)
>  {
>  	struct page_pool_params pp = { 0 };
> +	struct netdev_rx_queue *rxq;
>  
>  	pp.pool_size = bp->rx_agg_ring_size;
>  	if (BNXT_RX_PAGE_MODE(bp))
> @@ -3621,8 +3644,15 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
>  	pp.dev = &bp->pdev->dev;
>  	pp.dma_dir = bp->rx_dir;
>  	pp.max_len = PAGE_SIZE;
> -	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> +	pp.order = 0;
> +
> +	rxq = __netif_get_rx_queue(bp->dev, queue_idx);
> +	if (rxq->mp_params.mp_priv)
> +		pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_ALLOW_UNREADABLE_NETMEM;
> +	else
> +		pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
>  
> +	pp.queue_idx = queue_idx;
>  	rxr->page_pool = page_pool_create(&pp);
>  	if (IS_ERR(rxr->page_pool)) {
>  		int err = PTR_ERR(rxr->page_pool);
> @@ -3655,7 +3685,7 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
>  		cpu_node = cpu_to_node(cpu);
>  		netdev_dbg(bp->dev, "Allocating page pool for rx_ring[%d] on numa_node: %d\n",
>  			   i, cpu_node);
> -		rc = bnxt_alloc_rx_page_pool(bp, rxr, cpu_node);
> +		rc = bnxt_alloc_rx_page_pool(bp, rxr, i, cpu_node);
>  		if (rc)
>  			return rc;
>  

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-04 10:34     ` Taehee Yoo
@ 2024-10-08  2:57       ` David Wei
  2024-10-09 15:02         ` Taehee Yoo
  2024-10-08 19:50       ` Jakub Kicinski
  1 sibling, 1 reply; 73+ messages in thread
From: David Wei @ 2024-10-08  2:57 UTC (permalink / raw)
  To: Taehee Yoo, Mina Almasry
  Cc: davem, kuba, pabeni, edumazet, netdev, linux-doc, donald.hunter,
	corbet, michael.chan, kory.maincent, andrew, maxime.chevallier,
	danieller, hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1,
	ahmed.zaki, paul.greenwalt, rrameshbabu, idosch, asml.silence,
	kaiyuanz, willemb, aleksander.lobakin, sridhar.samudrala,
	bcreeley, David Wei

On 2024-10-04 03:34, Taehee Yoo wrote:
> On Fri, Oct 4, 2024 at 3:43 AM Mina Almasry <almasrymina@google.com> wrote:
>>> @@ -3608,9 +3629,11 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
>>>
>>>  static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
>>>                                    struct bnxt_rx_ring_info *rxr,
>>> +                                  int queue_idx,
>>>                                    int numa_node)
>>>  {
>>>         struct page_pool_params pp = { 0 };
>>> +       struct netdev_rx_queue *rxq;
>>>
>>>         pp.pool_size = bp->rx_agg_ring_size;
>>>         if (BNXT_RX_PAGE_MODE(bp))
>>> @@ -3621,8 +3644,15 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
>>>         pp.dev = &bp->pdev->dev;
>>>         pp.dma_dir = bp->rx_dir;
>>>         pp.max_len = PAGE_SIZE;
>>> -       pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
>>> +       pp.order = 0;
>>> +
>>> +       rxq = __netif_get_rx_queue(bp->dev, queue_idx);
>>> +       if (rxq->mp_params.mp_priv)
>>> +               pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_ALLOW_UNREADABLE_NETMEM;
>>
>> This is not the intended use of PP_FLAG_ALLOW_UNREADABLE_NETMEM.
>>
>> The driver should set PP_FLAG_ALLOW_UNREADABLE_NETMEM when it's able
>> to handle unreadable netmem, it should not worry about whether
>> rxq->mp_params.mp_priv is set or not.
>>
>> You should set PP_FLAG_ALLOW_UNREADABLE_NETMEM when HDS is enabled.
>> Let core figure out if mp_params.mp_priv is enabled. All the driver
>> needs to report is whether it's configured to be able to handle
>> unreadable netmem (which practically means HDS is enabled).
> 
> The reason why the branch exists here is that the
> PP_FLAG_ALLOW_UNREADABLE_NETMEM flag can't be used with PP_FLAG_DMA_SYNC_DEV.
> 
>  228         if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) {
>  229                 /* In order to request DMA-sync-for-device the page
>  230                  * needs to be mapped
>  231                  */
>  232                 if (!(pool->slow.flags & PP_FLAG_DMA_MAP))
>  233                         return -EINVAL;
>  234
>  235                 if (!pool->p.max_len)
>  236                         return -EINVAL;
>  237
>  238                 pool->dma_sync = true;                //here
>  239
>  240                 /* pool->p.offset has to be set according to the address
>  241                  * offset used by the DMA engine to start copying rx data
>  242                  */
>  243         }
> 
> If PP_FLAG_DMA_SYNC_DEV is set, page->dma_sync is set to true.
> 
> 347 int mp_dmabuf_devmem_init(struct page_pool *pool)
> 348 {
> 349         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> 350
> 351         if (!binding)
> 352                 return -EINVAL;
> 353
> 354         if (!pool->dma_map)
> 355                 return -EOPNOTSUPP;
> 356
> 357         if (pool->dma_sync)                      //here
> 358                 return -EOPNOTSUPP;
> 359
> 360         if (pool->p.order != 0)
> 361                 return -E2BIG;
> 362
> 363         net_devmem_dmabuf_binding_get(binding);
> 364         return 0;
> 365 }
> 
> In the mp_dmabuf_devmem_init(), it fails when pool->dma_sync is true.

This won't work for io_uring zero copy into user memory. We need all
PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | PP_FLAG_ALLOW_UNREADABLE_NETMEM
set.

I agree with Mina that the driver should not be poking at the mp_priv
fields. How about setting all the flags and then letting the mp->init()
figure it out? mp_dmabuf_devmem_init() is called within page_pool_init()
so as long as it resets dma_sync if set I don't see any issues.

> 
> tcp-data-split can be used for normal cases, not only the devmem TCP case.
> If we enable tcp-data-split and disable devmem TCP, page_pool doesn't
> have PP_FLAG_DMA_SYNC_DEV.
> So I think mp_params.mp_priv is still useful.
> 
> Thanks a lot,
> Taehee Yoo
> 
>>
>>
>> --
>> Thanks,
>> Mina

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-08  2:45   ` David Wei
@ 2024-10-08  3:54     ` Taehee Yoo
  2024-10-08  3:58       ` Taehee Yoo
  0 siblings, 1 reply; 73+ messages in thread
From: Taehee Yoo @ 2024-10-08  3:54 UTC (permalink / raw)
  To: David Wei
  Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, sridhar.samudrala, bcreeley

On Tue, Oct 8, 2024 at 11:45 AM David Wei <dw@davidwei.uk> wrote:
>

Hi David,
Thanks a lot for your review!

> On 2024-10-03 09:06, Taehee Yoo wrote:
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > index 872b15842b11..64e07d247f97 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > @@ -55,6 +55,7 @@
> >  #include <net/page_pool/helpers.h>
> >  #include <linux/align.h>
> >  #include <net/netdev_queues.h>
> > +#include <net/netdev_rx_queue.h>
> >
> >  #include "bnxt_hsi.h"
> >  #include "bnxt.h"
> > @@ -863,6 +864,22 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
> >               bnapi->events &= ~BNXT_TX_CMP_EVENT;
> >  }
> >
> > +static netmem_ref __bnxt_alloc_rx_netmem(struct bnxt *bp, dma_addr_t *mapping,
> > +                                      struct bnxt_rx_ring_info *rxr,
> > +                                      unsigned int *offset,
> > +                                      gfp_t gfp)
>
> gfp is unused

I will remove the unnecessary gfp parameter in v4.

>
> > +{
> > +     netmem_ref netmem;
> > +
> > +     netmem = page_pool_alloc_netmem(rxr->page_pool, GFP_ATOMIC);
> > +     if (!netmem)
> > +             return 0;
> > +     *offset = 0;
> > +
> > +     *mapping = page_pool_get_dma_addr_netmem(netmem) + *offset;
>
> offset is always 0

Okay, I will remove this too in v4.

>
> > +     return netmem;
> > +}
> > +
> >  static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
> >                                        struct bnxt_rx_ring_info *rxr,
> >                                        unsigned int *offset,
>
> [...]
>
> > @@ -1192,6 +1209,7 @@ static struct sk_buff *bnxt_rx_skb(struct bnxt *bp,
> >
> >  static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> >                              struct bnxt_cp_ring_info *cpr,
> > +                            struct sk_buff *skb,
> >                              struct skb_shared_info *shinfo,
> >                              u16 idx, u32 agg_bufs, bool tpa,
> >                              struct xdp_buff *xdp)
> > @@ -1211,7 +1229,7 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> >               u16 cons, frag_len;
> >               struct rx_agg_cmp *agg;
> >               struct bnxt_sw_rx_agg_bd *cons_rx_buf;
> > -             struct page *page;
> > +             netmem_ref netmem;
> >               dma_addr_t mapping;
> >
> >               if (p5_tpa)
> > @@ -1223,9 +1241,15 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> >                           RX_AGG_CMP_LEN) >> RX_AGG_CMP_LEN_SHIFT;
> >
> >               cons_rx_buf = &rxr->rx_agg_ring[cons];
> > -             skb_frag_fill_page_desc(frag, cons_rx_buf->page,
> > -                                     cons_rx_buf->offset, frag_len);
> > -             shinfo->nr_frags = i + 1;
> > +             if (skb) {
> > +                     skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
> > +                                            cons_rx_buf->offset, frag_len,
> > +                                            BNXT_RX_PAGE_SIZE);
> > +             } else {
> > +                     skb_frag_fill_page_desc(frag, netmem_to_page(cons_rx_buf->netmem),
> > +                                             cons_rx_buf->offset, frag_len);
> > +                     shinfo->nr_frags = i + 1;
> > +             }
>
> I feel like this function needs a refactor at some point to split out
> the skb and xdp paths.

Okay, I will add __bnxt_rx_agg_netmem() in the v4 patch.

>
> >               __clear_bit(cons, rxr->rx_agg_bmap);
> >
> >               /* It is possible for bnxt_alloc_rx_page() to allocate
>
> [...]
>
> > @@ -3608,9 +3629,11 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
> >
> >  static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> >                                  struct bnxt_rx_ring_info *rxr,
> > +                                int queue_idx,
>
> To save a parameter, the index is available already in rxr->bnapi->index

Okay, I will also remove the queue_idx parameter in v4.

>
> >                                  int numa_node)
> >  {
> >       struct page_pool_params pp = { 0 };
> > +     struct netdev_rx_queue *rxq;
> >
> >       pp.pool_size = bp->rx_agg_ring_size;
> >       if (BNXT_RX_PAGE_MODE(bp))
> > @@ -3621,8 +3644,15 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> >       pp.dev = &bp->pdev->dev;
> >       pp.dma_dir = bp->rx_dir;
> >       pp.max_len = PAGE_SIZE;
> > -     pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> > +     pp.order = 0;
> > +
> > +     rxq = __netif_get_rx_queue(bp->dev, queue_idx);
> > +     if (rxq->mp_params.mp_priv)
> > +             pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_ALLOW_UNREADABLE_NETMEM;
> > +     else
> > +             pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> >
> > +     pp.queue_idx = queue_idx;
> >       rxr->page_pool = page_pool_create(&pp);
> >       if (IS_ERR(rxr->page_pool)) {
> >               int err = PTR_ERR(rxr->page_pool);
> > @@ -3655,7 +3685,7 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
> >               cpu_node = cpu_to_node(cpu);
> >               netdev_dbg(bp->dev, "Allocating page pool for rx_ring[%d] on numa_node: %d\n",
> >                          i, cpu_node);
> > -             rc = bnxt_alloc_rx_page_pool(bp, rxr, cpu_node);
> > +             rc = bnxt_alloc_rx_page_pool(bp, rxr, i, cpu_node);
> >               if (rc)
> >                       return rc;
> >

Thanks a lot for catching these issues.
I will send v4 if there are no problems after some tests.

Thanks!
Taehee Yoo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-08  3:54     ` Taehee Yoo
@ 2024-10-08  3:58       ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-08  3:58 UTC (permalink / raw)
  To: David Wei
  Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, sridhar.samudrala, bcreeley

On Tue, Oct 8, 2024 at 12:54 PM Taehee Yoo <ap420073@gmail.com> wrote:
>
> On Tue, Oct 8, 2024 at 11:45 AM David Wei <dw@davidwei.uk> wrote:
> >
>
> Hi David,
> Thanks a lot for your review!
>
> > On 2024-10-03 09:06, Taehee Yoo wrote:
> > > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > > index 872b15842b11..64e07d247f97 100644
> > > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> > > @@ -55,6 +55,7 @@
> > >  #include <net/page_pool/helpers.h>
> > >  #include <linux/align.h>
> > >  #include <net/netdev_queues.h>
> > > +#include <net/netdev_rx_queue.h>
> > >
> > >  #include "bnxt_hsi.h"
> > >  #include "bnxt.h"
> > > @@ -863,6 +864,22 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
> > >               bnapi->events &= ~BNXT_TX_CMP_EVENT;
> > >  }
> > >
> > > +static netmem_ref __bnxt_alloc_rx_netmem(struct bnxt *bp, dma_addr_t *mapping,
> > > +                                      struct bnxt_rx_ring_info *rxr,
> > > +                                      unsigned int *offset,
> > > +                                      gfp_t gfp)
> >
> > gfp is unused
>
> I will remove unnecessary gfp parameter in v4.

Oh sorry,
I will use the gfp parameter rather than remove it.

>
> >
> > > +{
> > > +     netmem_ref netmem;
> > > +
> > > +     netmem = page_pool_alloc_netmem(rxr->page_pool, GFP_ATOMIC);
> > > +     if (!netmem)
> > > +             return 0;
> > > +     *offset = 0;
> > > +
> > > +     *mapping = page_pool_get_dma_addr_netmem(netmem) + *offset;
> >
> > offset is always 0
>
> Okay, I will remove this too in v4.
>
> >
> > > +     return netmem;
> > > +}
> > > +
> > >  static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
> > >                                        struct bnxt_rx_ring_info *rxr,
> > >                                        unsigned int *offset,
> >
> > [...]
> >
> > > @@ -1192,6 +1209,7 @@ static struct sk_buff *bnxt_rx_skb(struct bnxt *bp,
> > >
> > >  static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> > >                              struct bnxt_cp_ring_info *cpr,
> > > +                            struct sk_buff *skb,
> > >                              struct skb_shared_info *shinfo,
> > >                              u16 idx, u32 agg_bufs, bool tpa,
> > >                              struct xdp_buff *xdp)
> > > @@ -1211,7 +1229,7 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> > >               u16 cons, frag_len;
> > >               struct rx_agg_cmp *agg;
> > >               struct bnxt_sw_rx_agg_bd *cons_rx_buf;
> > > -             struct page *page;
> > > +             netmem_ref netmem;
> > >               dma_addr_t mapping;
> > >
> > >               if (p5_tpa)
> > > @@ -1223,9 +1241,15 @@ static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
> > >                           RX_AGG_CMP_LEN) >> RX_AGG_CMP_LEN_SHIFT;
> > >
> > >               cons_rx_buf = &rxr->rx_agg_ring[cons];
> > > -             skb_frag_fill_page_desc(frag, cons_rx_buf->page,
> > > -                                     cons_rx_buf->offset, frag_len);
> > > -             shinfo->nr_frags = i + 1;
> > > +             if (skb) {
> > > +                     skb_add_rx_frag_netmem(skb, i, cons_rx_buf->netmem,
> > > +                                            cons_rx_buf->offset, frag_len,
> > > +                                            BNXT_RX_PAGE_SIZE);
> > > +             } else {
> > > +                     skb_frag_fill_page_desc(frag, netmem_to_page(cons_rx_buf->netmem),
> > > +                                             cons_rx_buf->offset, frag_len);
> > > +                     shinfo->nr_frags = i + 1;
> > > +             }
> >
> > I feel like this function needs a refactor at some point to split out
> > the skb and xdp paths.
>
> Okay, I will add __bnxt_rx_agg_netmem() in v4 patch.
>
> >
> > >               __clear_bit(cons, rxr->rx_agg_bmap);
> > >
> > >               /* It is possible for bnxt_alloc_rx_page() to allocate
> >
> > [...]
> >
> > > @@ -3608,9 +3629,11 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
> > >
> > >  static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> > >                                  struct bnxt_rx_ring_info *rxr,
> > > +                                int queue_idx,
> >
> > To save a parameter, the index is available already in rxr->bnapi->index
>
> Okay, I also remove the queue_idx parameter in v4.
>
> >
> > >                                  int numa_node)
> > >  {
> > >       struct page_pool_params pp = { 0 };
> > > +     struct netdev_rx_queue *rxq;
> > >
> > >       pp.pool_size = bp->rx_agg_ring_size;
> > >       if (BNXT_RX_PAGE_MODE(bp))
> > > @@ -3621,8 +3644,15 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> > >       pp.dev = &bp->pdev->dev;
> > >       pp.dma_dir = bp->rx_dir;
> > >       pp.max_len = PAGE_SIZE;
> > > -     pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> > > +     pp.order = 0;
> > > +
> > > +     rxq = __netif_get_rx_queue(bp->dev, queue_idx);
> > > +     if (rxq->mp_params.mp_priv)
> > > +             pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_ALLOW_UNREADABLE_NETMEM;
> > > +     else
> > > +             pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> > >
> > > +     pp.queue_idx = queue_idx;
> > >       rxr->page_pool = page_pool_create(&pp);
> > >       if (IS_ERR(rxr->page_pool)) {
> > >               int err = PTR_ERR(rxr->page_pool);
> > > @@ -3655,7 +3685,7 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
> > >               cpu_node = cpu_to_node(cpu);
> > >               netdev_dbg(bp->dev, "Allocating page pool for rx_ring[%d] on numa_node: %d\n",
> > >                          i, cpu_node);
> > > -             rc = bnxt_alloc_rx_page_pool(bp, rxr, cpu_node);
> > > +             rc = bnxt_alloc_rx_page_pool(bp, rxr, i, cpu_node);
> > >               if (rc)
> > >                       return rc;
> > >
>
> Thanks a lot for catching things,
> I will send v4 if there is no problem after some tests.
>
> Thanks!
> Taehee Yoo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-05  6:29           ` Taehee Yoo
@ 2024-10-08 18:10             ` Jakub Kicinski
  2024-10-08 19:38               ` Michael Chan
  0 siblings, 1 reply; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-08 18:10 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: Andrew Lunn, Michael Chan, davem, pabeni, edumazet, almasrymina,
	netdev, linux-doc, donald.hunter, corbet, kory.maincent,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Sat, 5 Oct 2024 15:29:54 +0900 Taehee Yoo wrote:
> > > I think a single value of 0 that means disable RX copybreak is more
> > > clear and intuitive.  Also, I think we can allow 64 to be a valid
> > > value.
> > >
> > > So, 0 means to disable.  1 to 63 are -EINVAL and 64 to 1024 are valid.  Thanks.  
> >
> > Please spend a little time and see what other drivers do. Ideally we
> > want one consistent behaviour for all drivers that allow copybreak to
> > be disabled.  
> 
> There is no specific disable value in other drivers.
> But some other drivers have a min/max rx-copybreak value.
> If rx-copybreak is low enough, it will simply never take effect.
> So the min value has effectively been working as a disable value.
> 
> I think Andrew's point makes sense.
> So I would like to change the min value from 65 to 64, rather than add a disable value.

Where does the min value of 64 come from? Ethernet min frame length?

IIUC the copybreak threshold is purely a SW feature, after this series.
If someone sets the copybreak value to, say 13 it will simply never
engage but it's not really an invalid setting, IMHO. Similarly setting
it to 0 makes intuitive sense (that's how e1000e works, AFAICT).

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command
  2024-10-03 16:06 ` [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split " Taehee Yoo
@ 2024-10-08 18:19   ` Jakub Kicinski
  2024-10-09 13:54     ` Taehee Yoo
  0 siblings, 1 reply; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-08 18:19 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Thu,  3 Oct 2024 16:06:15 +0000 Taehee Yoo wrote:
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> index fdecdf8894b3..e9ef65dd2e7b 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> @@ -829,12 +829,16 @@ static void bnxt_get_ringparam(struct net_device *dev,
>  	if (bp->flags & BNXT_FLAG_AGG_RINGS) {
>  		ering->rx_max_pending = BNXT_MAX_RX_DESC_CNT_JUM_ENA;
>  		ering->rx_jumbo_max_pending = BNXT_MAX_RX_JUM_DESC_CNT;
> -		kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_ENABLED;
>  	} else {
>  		ering->rx_max_pending = BNXT_MAX_RX_DESC_CNT;
>  		ering->rx_jumbo_max_pending = 0;
> -		kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_DISABLED;
>  	}
> +
> +	if (bp->flags & BNXT_FLAG_HDS)
> +		kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_ENABLED;
> +	else
> +		kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_DISABLED;

This breaks previous behavior. The HDS reporting from get was
introduced to signal to user space whether the page flip based
TCP zero-copy (the one added some years ago not the recent one)
will be usable with this NIC.

When HW-GRO is enabled, HDS will be active.

I think that the driver should only track whether the user has set the
value to ENABLED (forced HDS) or left it as UNKNOWN (driver default).
Setting HDS to disabled is not useful, so don't support it.
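
The suggested tracking can be sketched as a small userspace C model (the
enum and helper names are invented for illustration; this is not bnxt_en
code):

```c
#include <assert.h>
#include <stdbool.h>

/* Track only whether the user forced HDS on; otherwise derive the
 * effective state from the features that already imply HDS in this
 * driver (LRO / HW-GRO / jumbo, i.e. aggregation rings). */
enum hds_config {
	HDS_CONFIG_UNKNOWN,	/* driver default, nothing forced */
	HDS_CONFIG_ENABLED,	/* user explicitly enabled HDS */
};

static bool hds_active(enum hds_config cfg, bool agg_rings)
{
	if (cfg == HDS_CONFIG_ENABLED)
		return true;	/* forced on by the user */
	return agg_rings;	/* default: on whenever agg rings are used */
}
```

With this shape there is no DISABLED state to store at all.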

>  	ering->tx_max_pending = BNXT_MAX_TX_DESC_CNT;
>  
>  	ering->rx_pending = bp->rx_ring_size;
> @@ -854,9 +858,25 @@ static int bnxt_set_ringparam(struct net_device *dev,
>  	    (ering->tx_pending < BNXT_MIN_TX_DESC_CNT))
>  		return -EINVAL;
>  
> +	if (kernel_ering->tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
> +	    BNXT_RX_PAGE_MODE(bp)) {
> +		NL_SET_ERR_MSG_MOD(extack, "tcp-data-split can not be enabled with XDP");
> +		return -EINVAL;
> +	}

Technically just if the XDP does not support multi-buffer.
Any chance we could do this check in the core?
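
The core-level check being asked about could look roughly like this
userspace sketch (prog->aux->xdp_has_frags is the real kernel field the
thread refers to; the function and its bool parameters are invented for
illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* tcp-data-split is only incompatible with XDP when the attached
 * program cannot handle multi-buffer (fragmented) frames. */
static bool tcp_data_split_allowed(bool xdp_attached, bool xdp_has_frags)
{
	if (!xdp_attached)
		return true;	/* no XDP program, no conflict */
	return xdp_has_frags;	/* a multi-buffer capable program is fine */
}
```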


* Re: [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh
  2024-10-03 16:06 ` [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh Taehee Yoo
  2024-10-03 18:25   ` Mina Almasry
@ 2024-10-08 18:33   ` Jakub Kicinski
  2024-10-09 14:25     ` Taehee Yoo
  1 sibling, 1 reply; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-08 18:33 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Thu,  3 Oct 2024 16:06:16 +0000 Taehee Yoo wrote:
> The tcp-data-split-thresh option configures the threshold value of
> the tcp-data-split.
> If a received packet size is larger than this threshold value, a packet
> will be split into header and payload.
> The header indicates TCP header, but it depends on driver spec.
> The bnxt_en driver supports HDS(Header-Data-Split) configuration at
> FW level, affecting TCP and UDP too.
> So, like the tcp-data-split option, If tcp-data-split-thresh is set,
> it affects UDP and TCP packets.
> 
> The tcp-data-split-thresh has a dependency, that is tcp-data-split
> option. This threshold value can be get/set only when tcp-data-split
> option is enabled.
> 
> Example:
>    # ethtool -G <interface name> tcp-data-split-thresh <value>
> 
>    # ethtool -G enp14s0f0np0 tcp-data-split on tcp-data-split-thresh 256
>    # ethtool -g enp14s0f0np0
>    Ring parameters for enp14s0f0np0:
>    Pre-set maximums:
>    ...
>    TCP data split thresh:  256
>    Current hardware settings:
>    ...
>    TCP data split:         on
>    TCP data split thresh:  256
> 
> The tcp-data-split is not enabled, the tcp-data-split-thresh will
> not be used and can't be configured.
> 
>    # ethtool -G enp14s0f0np0 tcp-data-split off
>    # ethtool -g enp14s0f0np0
>    Ring parameters for enp14s0f0np0:
>    Pre-set maximums:
>    ...
>    TCP data split thresh:  256
>    Current hardware settings:
>    ...
>    TCP data split:         off
>    TCP data split thresh:  n/a

My reply to Sridhar was probably quite unclear on this point, but FWIW
I do also have a weak preference to drop the "TCP" from the new knob.
Rephrasing what I said here:
https://lore.kernel.org/all/20240911173150.571bf93b@kernel.org/
the old knob is defined as being about TCP but for the new one we can
pick how widely applicable it is (and make it cover UDP as well).

> The default/min/max values are not defined in the ethtool core, so the
> drivers should define them.
> The 0 value means that all TCP and UDP packets' headers and payloads
> will be split.

> diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> index 12f6dc567598..891f55b0f6aa 100644
> --- a/include/linux/ethtool.h
> +++ b/include/linux/ethtool.h
> @@ -78,6 +78,8 @@ enum {
>   * @cqe_size: Size of TX/RX completion queue event
>   * @tx_push_buf_len: Size of TX push buffer
>   * @tx_push_buf_max_len: Maximum allowed size of TX push buffer
> + * @tcp_data_split_thresh: Threshold value of tcp-data-split
> + * @tcp_data_split_thresh_max: Maximum allowed threshold of tcp-data-split-threshold

Please wrap at 80 chars:

./scripts/checkpatch.pl --max-line-length=80 --strict $patch..

>  static int rings_fill_reply(struct sk_buff *skb,
> @@ -108,7 +110,13 @@ static int rings_fill_reply(struct sk_buff *skb,
>  	     (nla_put_u32(skb, ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX,
>  			  kr->tx_push_buf_max_len) ||
>  	      nla_put_u32(skb, ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN,
> -			  kr->tx_push_buf_len))))
> +			  kr->tx_push_buf_len))) ||
> +	    (kr->tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_ENABLED &&

Please add a new ETHTOOL_RING_USE_* flag for this, or fix all the
drivers which set ETHTOOL_RING_USE_TCP_DATA_SPLIT already and use that.
I don't think we should hide the value when HDS is disabled.
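
The gating being requested can be modeled in a few lines of userspace C
(the flag names and bit values are illustrative, not the real ethtool.h
definitions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_USE_TCP_DATA_SPLIT		(1U << 0)	/* illustrative value */
#define RING_USE_TCP_DATA_SPLIT_THRESH	(1U << 1)	/* hypothetical new flag */

/* Report the threshold whenever the driver supports configuring it,
 * independent of whether HDS is currently enabled. */
static bool emit_split_thresh(uint32_t supported_ring_params)
{
	return supported_ring_params & RING_USE_TCP_DATA_SPLIT_THRESH;
}
```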

> +	     (nla_put_u32(skb, ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH,
> +			 kr->tcp_data_split_thresh))) ||

nit: unnecessary brackets around the nla_put_u32()

> +	    (kr->tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_ENABLED &&
> +	     (nla_put_u32(skb, ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH_MAX,
> +			 kr->tcp_data_split_thresh_max))))
>  		return -EMSGSIZE;
>  
>  	return 0;

> +	if (tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH] &&
> +	    !(ops->supported_ring_params & ETHTOOL_RING_USE_TCP_DATA_SPLIT)) {

here you use the existing flag, yet gve and idpf set that flag and will
ignore the setting silently. They need to be changed or we need a new
flag.

> +		NL_SET_ERR_MSG_ATTR(info->extack,
> +				    tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH],
> +				    "setting tcp-data-split-thresh is not supported");
> +		return -EOPNOTSUPP;
> +	}
> +
>  	if (tb[ETHTOOL_A_RINGS_CQE_SIZE] &&
>  	    !(ops->supported_ring_params & ETHTOOL_RING_USE_CQE_SIZE)) {
>  		NL_SET_ERR_MSG_ATTR(info->extack,
> @@ -196,9 +213,9 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
>  	struct kernel_ethtool_ringparam kernel_ringparam = {};
>  	struct ethtool_ringparam ringparam = {};
>  	struct net_device *dev = req_info->dev;
> +	bool mod = false, thresh_mod = false;
>  	struct nlattr **tb = info->attrs;
>  	const struct nlattr *err_attr;
> -	bool mod = false;
>  	int ret;
>  
>  	dev->ethtool_ops->get_ringparam(dev, &ringparam,
> @@ -222,9 +239,30 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
>  			tb[ETHTOOL_A_RINGS_RX_PUSH], &mod);
>  	ethnl_update_u32(&kernel_ringparam.tx_push_buf_len,
>  			 tb[ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN], &mod);
> -	if (!mod)
> +	ethnl_update_u32(&kernel_ringparam.tcp_data_split_thresh,
> +			 tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH],
> +			 &thresh_mod);
> +	if (!mod && !thresh_mod)
>  		return 0;
>  
> +	if (kernel_ringparam.tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
> +	    thresh_mod) {
> +		NL_SET_ERR_MSG_ATTR(info->extack,
> +				    tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH],
> +				    "tcp-data-split-thresh can not be updated while tcp-data-split is disabled");
> +		return -EINVAL;
> +	}

I'm not sure we need to reject changing the setting when HDS is
disabled. Driver can just store the value until HDS gets enabled?
WDYT? I don't have a strong preference.
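
The "store it until HDS gets enabled" alternative amounts to something
like this toy model (the struct and helpers are invented for
illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Accept and remember the threshold even while HDS is off; it simply
 * has no effect until HDS is enabled. */
struct toy_hds_state {
	bool enabled;
	uint32_t thresh;	/* kept regardless of enabled */
};

static void toy_set_thresh(struct toy_hds_state *st, uint32_t thresh)
{
	st->thresh = thresh;	/* never rejected for HDS being off */
}

static bool toy_thresh_in_use(const struct toy_hds_state *st)
{
	return st->enabled;	/* value stays dormant while HDS is off */
}
```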

> +	if (kernel_ringparam.tcp_data_split_thresh >
> +	    kernel_ringparam.tcp_data_split_thresh_max) {
> +		NL_SET_ERR_MSG_ATTR_FMT(info->extack,
> +					tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH_MAX],
> +					"Requested tcp-data-split-thresh exceeds the maximum of %u",

No need for the string, just NL_SET_BAD_ATTR() + ERANGE is enough

> +					kernel_ringparam.tcp_data_split_thresh_max);
> +
> +		return -EINVAL;

ERANGE

> +	}
> +
>  	/* ensure new ring parameters are within limits */
>  	if (ringparam.rx_pending > ringparam.rx_max_pending)
>  		err_attr = tb[ETHTOOL_A_RINGS_RX];
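
The simplification suggested above boils down to returning -ERANGE with
no message string; a minimal userspace sketch of the check (function
name invented):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* An over-limit threshold is an out-of-range value: return -ERANGE,
 * paired in the kernel with NL_SET_BAD_ATTR() instead of a formatted
 * extack message. */
static int check_split_thresh(uint32_t thresh, uint32_t thresh_max)
{
	if (thresh > thresh_max)
		return -ERANGE;
	return 0;
}
```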



* Re: [PATCH net-next v3 4/7] bnxt_en: add support for tcp-data-split-thresh ethtool command
  2024-10-03 16:06 ` [PATCH net-next v3 4/7] bnxt_en: add support for tcp-data-split-thresh ethtool command Taehee Yoo
  2024-10-03 18:13   ` Brett Creeley
@ 2024-10-08 18:35   ` Jakub Kicinski
  2024-10-09 14:31     ` Taehee Yoo
  1 sibling, 1 reply; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-08 18:35 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Thu,  3 Oct 2024 16:06:17 +0000 Taehee Yoo wrote:
> +#define BNXT_HDS_THRESHOLD_MAX	256
> +	u16			hds_threshold;

From the cover letter it sounded like the max is 1023.
Did I misread that?


* Re: [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering
  2024-10-03 18:49     ` Mina Almasry
@ 2024-10-08 19:28       ` Jakub Kicinski
  2024-10-09 14:35         ` Taehee Yoo
  0 siblings, 1 reply; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-08 19:28 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Brett Creeley, Taehee Yoo, davem, pabeni, edumazet, netdev,
	linux-doc, donald.hunter, corbet, michael.chan, kory.maincent,
	andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala

On Thu, 3 Oct 2024 11:49:50 -0700 Mina Almasry wrote:
> > > +       dev->ethtool_ops->get_ringparam(dev, &ringparam,
> > > +                                       &kernel_ringparam, extack);
> > > +       if (kernel_ringparam.tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_ENABLED ||
> > > +           kernel_ringparam.tcp_data_split_thresh) {
> > > +               NL_SET_ERR_MSG(extack,
> > > +                              "tcp-header-data-split is disabled or threshold is not zero");
> > > +               return -EINVAL;
> > > +       }
> > > +  
> > Maybe just my personal opinion, but IMHO these checks should be separate
> > so the error message can be more concise/clear.
> >  
> 
> Good point. The error message in itself is valuable.

If you mean that the error message is more intuitive than debugging why
PP_FLAG_ALLOW_UNREADABLE_NETMEM isn't set - I agree :)

I vote to keep the patch, FWIW. Maybe add a comment that for now drivers
should not set PP_FLAG_ALLOW_UNREADABLE_NETMEM, anyway, but this gives
us better debuggability, and in the future we may find cases where
doing a copy is cheaper than buffer circulation (and therefore may lift
this check).
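
The precondition under discussion, with the checks split for clearer
errors as suggested earlier in this subthread, can be sketched as
(function invented for illustration):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Binding device memory needs tcp-data-split enabled with a zero
 * threshold, so every payload lands in (possibly unreadable) device
 * buffers.  Two separate checks give two distinct error messages. */
static int devmem_precheck(bool split_enabled, uint32_t split_thresh)
{
	if (!split_enabled)
		return -EINVAL;	/* "tcp-data-split is disabled" */
	if (split_thresh)
		return -EINVAL;	/* "threshold is not zero" */
	return 0;
}
```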


* Re: [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-08 18:10             ` Jakub Kicinski
@ 2024-10-08 19:38               ` Michael Chan
  2024-10-08 19:53                 ` Jakub Kicinski
  0 siblings, 1 reply; 73+ messages in thread
From: Michael Chan @ 2024-10-08 19:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Taehee Yoo, Andrew Lunn, davem, pabeni, edumazet, almasrymina,
	netdev, linux-doc, donald.hunter, corbet, kory.maincent,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley


On Tue, Oct 8, 2024 at 11:11 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 5 Oct 2024 15:29:54 +0900 Taehee Yoo wrote:
> > > > I think a single value of 0 that means disable RX copybreak is more
> > > > clear and intuitive.  Also, I think we can allow 64 to be a valid
> > > > value.
> > > >
> > > > So, 0 means to disable.  1 to 63 are -EINVAL and 64 to 1024 are valid.  Thanks.
> > >
> > > Please spend a little time and see what other drivers do. Ideally we
> > > want one consistent behaviour for all drivers that allow copybreak to
> > > be disabled.
> >
> > There is no specific disable value in other drivers.
> > But some other drivers have min/max rx-copybreak values.
> > If rx-copybreak is low enough, it will never take effect.
> > So the min value has effectively acted as a disable value.
> >
> > I think Andrew's point makes sense.
> > So I would like to change the min value from 65 to 64 rather than add
> > a disable value.
>
> Where does the min value of 64 come from? Ethernet min frame length?
>

The length is actually the ethernet length minus the 4-byte CRC.  So
60 is the minimum length that the driver will see.  Anything smaller
coming from the wire will be a runt frame discarded by the chip.

> IIUC the copybreak threshold is purely a SW feature, after this series.
> If someone sets the copybreak value to, say 13 it will simply never
> engage but it's not really an invalid setting, IMHO. Similarly setting
> it to 0 makes intuitive sense (that's how e1000e works, AFAICT).

Right, setting it to 0 or 13 will have the same effect of disabling
it.  0 makes more intuitive sense.



* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-04 10:34     ` Taehee Yoo
  2024-10-08  2:57       ` David Wei
@ 2024-10-08 19:50       ` Jakub Kicinski
  2024-10-09 15:37         ` Taehee Yoo
  1 sibling, 1 reply; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-08 19:50 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: Mina Almasry, davem, pabeni, edumazet, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Fri, 4 Oct 2024 19:34:45 +0900 Taehee Yoo wrote:
> > Our intention with the whole netmem design is that drivers should
> > never have to call netmem_to_page(). I.e. the driver should use netmem
> > unaware of whether it's page or non-page underneath, to minimize
> > complexity driver needs to handle.
> >
> > This netmem_to_page() call can be removed by using
> > skb_frag_fill_netmem_desc() instead of the page variant. But, more
> > improtantly, why did the code change here? The code before calls
> > skb_frag_fill_page_desc, but the new code sometimes will
> > skb_frag_fill_netmem_desc() and sometimes will skb_add_rx_frag_netmem.
> > I'm not sure why that logic changed.  
> 
> The reason skb_add_rx_frag_netmem() is used here is to set the
> skb->unreadable flag. skb_frag_fill_netmem_desc() doesn't set
> skb->unreadable because it doesn't handle the skb, only the frag.
> As far as I know, skb->unreadable should be set to true for devmem
> TCP; have I misunderstood?
> I tested not using skb_add_rx_frag_netmem() here, and it
> immediately fails.

Yes, but netmem_ref can be either a net_iov or a normal page,
and skb_add_rx_frag_netmem() and similar helpers should automatically
set skb->unreadable or not.

IOW you should be able to always use netmem-aware APIs, no?
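
The point about netmem-aware helpers can be illustrated with a toy model
(this is not the kernel API; all names here are invented):

```c
#include <assert.h>
#include <stdbool.h>

/* A netmem reference is either a regular page or a net_iov; the
 * netmem-aware frag helper flips skb->unreadable by itself when the
 * buffer is device memory, so the driver never branches on the kind
 * and never calls netmem_to_page(). */
enum netmem_kind { NETMEM_PAGE, NETMEM_NET_IOV };

struct toy_skb {
	bool unreadable;
	int nr_frags;
};

static void toy_add_rx_frag_netmem(struct toy_skb *skb, enum netmem_kind kind)
{
	skb->nr_frags++;
	if (kind == NETMEM_NET_IOV)
		skb->unreadable = true;	/* devmem payload is not CPU-readable */
}
```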

> > This is not the intended use of PP_FLAG_ALLOW_UNREADABLE_NETMEM.
> >
> > The driver should set PP_FLAG_ALLOW_UNREADABLE_NETMEM when it's able
> > to handle unreadable netmem, it should not worry about whether
> > rxq->mp_params.mp_priv is set or not.
> >
> > You should set PP_FLAG_ALLOW_UNREADABLE_NETMEM when HDS is enabled.
> > Let core figure out if mp_params.mp_priv is enabled. All the driver
> > needs to report is whether it's configured to be able to handle
> > unreadable netmem (which practically means HDS is enabled).  
> 
> The reason the branch exists here is that the PP_FLAG_ALLOW_UNREADABLE_NETMEM
> flag can't be used with PP_FLAG_DMA_SYNC_DEV.

Hm. Isn't the existing check the wrong way around? Is the driver
supposed to sync the buffers for device before passing them down?


* Re: [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-08 19:38               ` Michael Chan
@ 2024-10-08 19:53                 ` Jakub Kicinski
  2024-10-08 20:35                   ` Michael Chan
  0 siblings, 1 reply; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-08 19:53 UTC (permalink / raw)
  To: Michael Chan
  Cc: Taehee Yoo, Andrew Lunn, davem, pabeni, edumazet, almasrymina,
	netdev, linux-doc, donald.hunter, corbet, kory.maincent,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Tue, 8 Oct 2024 12:38:18 -0700 Michael Chan wrote:
> > Where does the min value of 64 come from? Ethernet min frame length?
> 
> The length is actually the ethernet length minus the 4-byte CRC.  So
> 60 is the minimum length that the driver will see.  Anything smaller
> coming from the wire will be a runt frame discarded by the chip.

Also for VF to VF traffic?

> > IIUC the copybreak threshold is purely a SW feature, after this series.
> > If someone sets the copybreak value to, say 13 it will simply never
> > engage but it's not really an invalid setting, IMHO. Similarly setting
> > it to 0 makes intuitive sense (that's how e1000e works, AFAICT).  
> 
> Right, setting it to 0 or 13 will have the same effect of disabling
> it.  0 makes more intuitive sense.

Agreed on 0 making sense, but not sure if rejecting intermediate values
buys us anything. As Andrew mentioned consistency is important. I only
checked two drivers (e1000e and gve) and they don't seem to check 
the lower limit.


* Re: [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command
  2024-10-08 19:53                 ` Jakub Kicinski
@ 2024-10-08 20:35                   ` Michael Chan
  0 siblings, 0 replies; 73+ messages in thread
From: Michael Chan @ 2024-10-08 20:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Taehee Yoo, Andrew Lunn, davem, pabeni, edumazet, almasrymina,
	netdev, linux-doc, donald.hunter, corbet, kory.maincent,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley


On Tue, Oct 8, 2024 at 12:53 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 8 Oct 2024 12:38:18 -0700 Michael Chan wrote:
> > > Where does the min value of 64 come from? Ethernet min frame length?
> >
> > The length is actually the ethernet length minus the 4-byte CRC.  So
> > 60 is the minimum length that the driver will see.  Anything smaller
> > coming from the wire will be a runt frame discarded by the chip.
>
> Also for VF to VF traffic?

Good point.  Loopback traffic is not subject to padding and can be
smaller than 60 bytes.  So, lower limit checking doesn't make much
sense anymore.

>
> > > IIUC the copybreak threshold is purely a SW feature, after this series.
> > > If someone sets the copybreak value to, say 13 it will simply never
> > > engage but it's not really an invalid setting, IMHO. Similarly setting
> > > it to 0 makes intuitive sense (that's how e1000e works, AFAICT).
> >
> > Right, setting it to 0 or 13 will have the same effect of disabling
> > it.  0 makes more intuitive sense.
>
> Agreed on 0 making sense, but not sure if rejecting intermediate values
> buys us anything. As Andrew mentioned consistency is important. I only
> checked two drivers (e1000e and gve) and they don't seem to check
> the lower limit.

Sure, so the range should be 0 to 1024.  Any value close to 0 will
effectively disable it.
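
The agreed-upon semantics reduce to a simple range check; a userspace
sketch (the macro value reflects the 1024 upper bound from this thread;
the function name is invented):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_RX_COPYBREAK 1024	/* upper bound agreed in this thread */

/* Accept 0..1024 with no lower bound: 0, or any value below the
 * smallest possible frame, simply never matches a packet and therefore
 * effectively disables copybreak. */
static int set_rx_copybreak(uint32_t val, uint32_t *copybreak)
{
	if (val > MAX_RX_COPYBREAK)
		return -ERANGE;
	*copybreak = val;
	return 0;
}
```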



* Re: [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command
  2024-10-08 18:19   ` Jakub Kicinski
@ 2024-10-09 13:54     ` Taehee Yoo
  2024-10-09 15:28       ` Jakub Kicinski
  0 siblings, 1 reply; 73+ messages in thread
From: Taehee Yoo @ 2024-10-09 13:54 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Wed, Oct 9, 2024 at 3:19 AM Jakub Kicinski <kuba@kernel.org> wrote:
>

Hi Jakub,
Thanks a lot for your reviews!

> On Thu,  3 Oct 2024 16:06:15 +0000 Taehee Yoo wrote:
> > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> > index fdecdf8894b3..e9ef65dd2e7b 100644
> > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
> > @@ -829,12 +829,16 @@ static void bnxt_get_ringparam(struct net_device *dev,
> >       if (bp->flags & BNXT_FLAG_AGG_RINGS) {
> >               ering->rx_max_pending = BNXT_MAX_RX_DESC_CNT_JUM_ENA;
> >               ering->rx_jumbo_max_pending = BNXT_MAX_RX_JUM_DESC_CNT;
> > -             kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_ENABLED;
> >       } else {
> >               ering->rx_max_pending = BNXT_MAX_RX_DESC_CNT;
> >               ering->rx_jumbo_max_pending = 0;
> > -             kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_DISABLED;
> >       }
> > +
> > +     if (bp->flags & BNXT_FLAG_HDS)
> > +             kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_ENABLED;
> > +     else
> > +             kernel_ering->tcp_data_split = ETHTOOL_TCP_DATA_SPLIT_DISABLED;
>
> This breaks previous behavior. The HDS reporting from get was
> introduced to signal to user space whether the page flip based
> TCP zero-copy (the one added some years ago not the recent one)
> will be usable with this NIC.
>
> When HW-GRO is enabled, HDS will be active.
>
> I think that the driver should only track whether the user has set the
> value to ENABLED (forced HDS) or left it as UNKNOWN (driver default).
> Setting HDS to disabled is not useful, so don't support it.

Okay, I will remove the disable feature in the v4 patch.
Before this patch, hds_threshold followed the rx-copybreak value.
Do you think hds_threshold should still follow the rx-copybreak value
when it is in UNKNOWN mode?
I think hds_threshold needs to follow the new tcp-data-split-thresh
value in both ENABLED and UNKNOWN modes, making rx-copybreak a pure
software feature. But if so, it changes the default behavior.
What do you think?

>
> >       ering->tx_max_pending = BNXT_MAX_TX_DESC_CNT;
> >
> >       ering->rx_pending = bp->rx_ring_size;
> > @@ -854,9 +858,25 @@ static int bnxt_set_ringparam(struct net_device *dev,
> >           (ering->tx_pending < BNXT_MIN_TX_DESC_CNT))
> >               return -EINVAL;
> >
> > +     if (kernel_ering->tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
> > +         BNXT_RX_PAGE_MODE(bp)) {
> > +             NL_SET_ERR_MSG_MOD(extack, "tcp-data-split can not be enabled with XDP");
> > +             return -EINVAL;
> > +     }
>
> Technically just if the XDP does not support multi-buffer.
> Any chance we could do this check in the core?

I think we can access xdp_rxq_info via the netdev_rx_queue structure.
However, xdp_rxq_info is not sufficient to distinguish whether the
driver supports multi-buffer. I think prog->aux->xdp_has_frags is
required to distinguish it correctly.
So, I think we need something more.
Do you have any ideas?


* Re: [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh
  2024-10-08 18:33   ` Jakub Kicinski
@ 2024-10-09 14:25     ` Taehee Yoo
  2024-10-09 15:46       ` Jakub Kicinski
  0 siblings, 1 reply; 73+ messages in thread
From: Taehee Yoo @ 2024-10-09 14:25 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Wed, Oct 9, 2024 at 3:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu,  3 Oct 2024 16:06:16 +0000 Taehee Yoo wrote:
> > The tcp-data-split-thresh option configures the threshold value of
> > the tcp-data-split.
> > If a received packet size is larger than this threshold value, a packet
> > will be split into header and payload.
> > The header indicates TCP header, but it depends on driver spec.
> > The bnxt_en driver supports HDS(Header-Data-Split) configuration at
> > FW level, affecting TCP and UDP too.
> > So, like the tcp-data-split option, If tcp-data-split-thresh is set,
> > it affects UDP and TCP packets.
> >
> > The tcp-data-split-thresh has a dependency, that is tcp-data-split
> > option. This threshold value can be get/set only when tcp-data-split
> > option is enabled.
> >
> > Example:
> >    # ethtool -G <interface name> tcp-data-split-thresh <value>
> >
> >    # ethtool -G enp14s0f0np0 tcp-data-split on tcp-data-split-thresh 256
> >    # ethtool -g enp14s0f0np0
> >    Ring parameters for enp14s0f0np0:
> >    Pre-set maximums:
> >    ...
> >    TCP data split thresh:  256
> >    Current hardware settings:
> >    ...
> >    TCP data split:         on
> >    TCP data split thresh:  256
> >
> > The tcp-data-split is not enabled, the tcp-data-split-thresh will
> > not be used and can't be configured.
> >
> >    # ethtool -G enp14s0f0np0 tcp-data-split off
> >    # ethtool -g enp14s0f0np0
> >    Ring parameters for enp14s0f0np0:
> >    Pre-set maximums:
> >    ...
> >    TCP data split thresh:  256
> >    Current hardware settings:
> >    ...
> >    TCP data split:         off
> >    TCP data split thresh:  n/a
>
> My reply to Sridhar was probably quite unclear on this point, but FWIW
> I do also have a weak preference to drop the "TCP" from the new knob.
> Rephrasing what I said here:
> https://lore.kernel.org/all/20240911173150.571bf93b@kernel.org/
> the old knob is defined as being about TCP but for the new one we can
> pick how widely applicable it is (and make it cover UDP as well).

I'm not sure I understand what "knob" refers to.
Does the knob mean the "tcp-data-split-thresh" parameter?
If so, I would like to rename "tcp-data-split-thresh" to
"header-data-split-thresh".

>
> > The default/min/max values are not defined in the ethtool core, so the
> > drivers should define them.
> > The 0 value means that all TCP and UDP packets' headers and payloads
> > will be split.
>
> > diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
> > index 12f6dc567598..891f55b0f6aa 100644
> > --- a/include/linux/ethtool.h
> > +++ b/include/linux/ethtool.h
> > @@ -78,6 +78,8 @@ enum {
> >   * @cqe_size: Size of TX/RX completion queue event
> >   * @tx_push_buf_len: Size of TX push buffer
> >   * @tx_push_buf_max_len: Maximum allowed size of TX push buffer
> > + * @tcp_data_split_thresh: Threshold value of tcp-data-split
> > + * @tcp_data_split_thresh_max: Maximum allowed threshold of tcp-data-split-threshold
>
> Please wrap at 80 chars:
>
> ./scripts/checkpatch.pl --max-line-length=80 --strict $patch..

Thanks, I will fix this in v4 patch.

>
> >  static int rings_fill_reply(struct sk_buff *skb,
> > @@ -108,7 +110,13 @@ static int rings_fill_reply(struct sk_buff *skb,
> >            (nla_put_u32(skb, ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX,
> >                         kr->tx_push_buf_max_len) ||
> >             nla_put_u32(skb, ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN,
> > -                       kr->tx_push_buf_len))))
> > +                       kr->tx_push_buf_len))) ||
> > +         (kr->tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_ENABLED &&
>
> Please add a new ETHTOOL_RING_USE_* flag for this, or fix all the
> drivers which set ETHTOOL_RING_USE_TCP_DATA_SPLIT already and use that.
> I don't think we should hide the value when HDS is disabled.
>
> > +          (nla_put_u32(skb, ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH,
> > +                      kr->tcp_data_split_thresh))) ||
>
> nit: unnecessary brackets around the nla_put_u32()

I will fix this too.

>
> > +         (kr->tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_ENABLED &&
> > +          (nla_put_u32(skb, ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH_MAX,
> > +                      kr->tcp_data_split_thresh_max))))
> >               return -EMSGSIZE;
> >
> >       return 0;
>
> > +     if (tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH] &&
> > +         !(ops->supported_ring_params & ETHTOOL_RING_USE_TCP_DATA_SPLIT)) {
>
> here you use the existing flag, yet gve and idpf set that flag and will
> ignore the setting silently. They need to be changed or we need a new
> flag.

Okay, I would like to add an ETHTOOL_RING_USE_TCP_DATA_SPLIT_THRESH flag,
or ETHTOOL_RING_USE_HDS_THRESH, which would indicate a header-data-split
thresh. If you agree with adding a new flag, what do you think about the
naming?

>
> > +             NL_SET_ERR_MSG_ATTR(info->extack,
> > +                                 tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH],
> > +                                 "setting tcp-data-split-thresh is not supported");
> > +             return -EOPNOTSUPP;
> > +     }
> > +
> >       if (tb[ETHTOOL_A_RINGS_CQE_SIZE] &&
> >           !(ops->supported_ring_params & ETHTOOL_RING_USE_CQE_SIZE)) {
> >               NL_SET_ERR_MSG_ATTR(info->extack,
> > @@ -196,9 +213,9 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
> >       struct kernel_ethtool_ringparam kernel_ringparam = {};
> >       struct ethtool_ringparam ringparam = {};
> >       struct net_device *dev = req_info->dev;
> > +     bool mod = false, thresh_mod = false;
> >       struct nlattr **tb = info->attrs;
> >       const struct nlattr *err_attr;
> > -     bool mod = false;
> >       int ret;
> >
> >       dev->ethtool_ops->get_ringparam(dev, &ringparam,
> > @@ -222,9 +239,30 @@ ethnl_set_rings(struct ethnl_req_info *req_info, struct genl_info *info)
> >                       tb[ETHTOOL_A_RINGS_RX_PUSH], &mod);
> >       ethnl_update_u32(&kernel_ringparam.tx_push_buf_len,
> >                        tb[ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN], &mod);
> > -     if (!mod)
> > +     ethnl_update_u32(&kernel_ringparam.tcp_data_split_thresh,
> > +                      tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH],
> > +                      &thresh_mod);
> > +     if (!mod && !thresh_mod)
> >               return 0;
> >
> > +     if (kernel_ringparam.tcp_data_split == ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
> > +         thresh_mod) {
> > +             NL_SET_ERR_MSG_ATTR(info->extack,
> > +                                 tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH],
> > +                                 "tcp-data-split-thresh can not be updated while tcp-data-split is disabled");
> > +             return -EINVAL;
> > +     }
>
> I'm not sure we need to reject changing the setting when HDS is
> disabled. Driver can just store the value until HDS gets enabled?
> WDYT? I don't have a strong preference.

I checked similar options, tx-push and tx-push-buf-len; updating
tx-push-buf-len does not fail when tx-push is disabled.

I think this condition check is too strict, and it's not consistent
with those similar options.

The option to disable HDS is going to be removed in the v4 patch.
I asked how to handle hds_threshold in UNKNOWN mode in the previous
patch thread. If hds_threshold should follow the rx-copybreak value in
UNKNOWN mode, this condition check is not necessary.

>
> > +     if (kernel_ringparam.tcp_data_split_thresh >
> > +         kernel_ringparam.tcp_data_split_thresh_max) {
> > +             NL_SET_ERR_MSG_ATTR_FMT(info->extack,
> > +                                     tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH_MAX],
> > +                                     "Requested tcp-data-split-thresh exceeds the maximum of %u",
>
> No need for the string, just NL_SET_BAD_ATTR() + ERANGE is enough

Thanks, I will fix it.

>
> > +                                     kernel_ringparam.tcp_data_split_thresh_max);
> > +
> > +             return -EINVAL;
>
> ERANGE

I will fix it too.

>
> > +     }
> > +
> >       /* ensure new ring parameters are within limits */
> >       if (ringparam.rx_pending > ringparam.rx_max_pending)
> >               err_attr = tb[ETHTOOL_A_RINGS_RX];
>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 4/7] bnxt_en: add support for tcp-data-split-thresh ethtool command
  2024-10-08 18:35   ` Jakub Kicinski
@ 2024-10-09 14:31     ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-09 14:31 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Wed, Oct 9, 2024 at 3:35 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu,  3 Oct 2024 16:06:17 +0000 Taehee Yoo wrote:
> > +#define BNXT_HDS_THRESHOLD_MAX       256
> > +     u16                     hds_threshold;
>
> From the cover letter it sounded like the max is 1023.
> Did I misread that ?

Based on my testing, the maximum value seems to be 1023.
But I'm not sure that all NICs that use the bnxt_en driver support 1023
(at least all the NICs I have do).
I chose 256 as the maximum because it was the default value, so all
NICs can use it safely. If a Broadcom engineer confirms 1023 is right
for all NICs, I would like to change it.


* Re: [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering
  2024-10-08 19:28       ` Jakub Kicinski
@ 2024-10-09 14:35         ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-09 14:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Mina Almasry, Brett Creeley, davem, pabeni, edumazet, netdev,
	linux-doc, donald.hunter, corbet, michael.chan, kory.maincent,
	andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala

On Wed, Oct 9, 2024 at 4:28 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 3 Oct 2024 11:49:50 -0700 Mina Almasry wrote:
> > > > +       dev->ethtool_ops->get_ringparam(dev, &ringparam,
> > > > +                                       &kernel_ringparam, extack);
> > > > +       if (kernel_ringparam.tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_ENABLED ||
> > > > +           kernel_ringparam.tcp_data_split_thresh) {
> > > > +               NL_SET_ERR_MSG(extack,
> > > > +                              "tcp-header-data-split is disabled or threshold is not zero");
> > > > +               return -EINVAL;
> > > > +       }
> > > > +
> > > Maybe just my personal opinion, but IMHO these checks should be separate
> > > so the error message can be more concise/clear.
> > >
> >
> > Good point. The error message in itself is valuable.
>
> If you mean that the error message is more intuitive than debugging why
> PP_FLAG_ALLOW_UNREADABLE_NETMEM isn't set - I agree :)
>
> I vote to keep the patch, FWIW. Maybe add a comment that for now drivers
> should not set PP_FLAG_ALLOW_UNREADABLE_NETMEM, anyway, but this gives
> us better debuggability, and in the future we may find cases where
> doing a copy is cheaper than buffer circulation (and therefore may lift
> this check).

Okay, I will keep this patch in v4 and just fix what Brett and Mina
pointed out.


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-08  2:57       ` David Wei
@ 2024-10-09 15:02         ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-09 15:02 UTC (permalink / raw)
  To: David Wei
  Cc: Mina Almasry, davem, kuba, pabeni, edumazet, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, sridhar.samudrala, bcreeley

On Tue, Oct 8, 2024 at 11:57 AM David Wei <dw@davidwei.uk> wrote:
>
> On 2024-10-04 03:34, Taehee Yoo wrote:
> > On Fri, Oct 4, 2024 at 3:43 AM Mina Almasry <almasrymina@google.com> wrote:
> >>> @@ -3608,9 +3629,11 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
> >>>
> >>>  static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> >>>                                    struct bnxt_rx_ring_info *rxr,
> >>> +                                  int queue_idx,
> >>>                                    int numa_node)
> >>>  {
> >>>         struct page_pool_params pp = { 0 };
> >>> +       struct netdev_rx_queue *rxq;
> >>>
> >>>         pp.pool_size = bp->rx_agg_ring_size;
> >>>         if (BNXT_RX_PAGE_MODE(bp))
> >>> @@ -3621,8 +3644,15 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
> >>>         pp.dev = &bp->pdev->dev;
> >>>         pp.dma_dir = bp->rx_dir;
> >>>         pp.max_len = PAGE_SIZE;
> >>> -       pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> >>> +       pp.order = 0;
> >>> +
> >>> +       rxq = __netif_get_rx_queue(bp->dev, queue_idx);
> >>> +       if (rxq->mp_params.mp_priv)
> >>> +               pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_ALLOW_UNREADABLE_NETMEM;
> >>
> >> This is not the intended use of PP_FLAG_ALLOW_UNREADABLE_NETMEM.
> >>
> >> The driver should set PP_FLAG_ALLOW_UNREADABLE_NETMEM when it's able
> >> to handle unreadable netmem, it should not worry about whether
> >> rxq->mp_params.mp_priv is set or not.
> >>
> >> You should set PP_FLAG_ALLOW_UNREADABLE_NETMEM when HDS is enabled.
> >> Let core figure out if mp_params.mp_priv is enabled. All the driver
> >> needs to report is whether it's configured to be able to handle
> >> unreadable netmem (which practically means HDS is enabled).
> >
> > The reason why the branch exists here is the PP_FLAG_ALLOW_UNREADABLE_NETMEM
> > flag can't be used with PP_FLAG_DMA_SYNC_DEV.
> >
> >  228         if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) {
> >  229                 /* In order to request DMA-sync-for-device the page
> >  230                  * needs to be mapped
> >  231                  */
> >  232                 if (!(pool->slow.flags & PP_FLAG_DMA_MAP))
> >  233                         return -EINVAL;
> >  234
> >  235                 if (!pool->p.max_len)
> >  236                         return -EINVAL;
> >  237
> >  238                 pool->dma_sync = true;                //here
> >  239
> >  240                 /* pool->p.offset has to be set according to the address
> >  241                  * offset used by the DMA engine to start copying rx data
> >  242                  */
> >  243         }
> >
> > If PP_FLAG_DMA_SYNC_DEV is set, page->dma_sync is set to true.
> >
> > 347 int mp_dmabuf_devmem_init(struct page_pool *pool)
> > 348 {
> > 349         struct net_devmem_dmabuf_binding *binding = pool->mp_priv;
> > 350
> > 351         if (!binding)
> > 352                 return -EINVAL;
> > 353
> > 354         if (!pool->dma_map)
> > 355                 return -EOPNOTSUPP;
> > 356
> > 357         if (pool->dma_sync)                      //here
> > 358                 return -EOPNOTSUPP;
> > 359
> > 360         if (pool->p.order != 0)
> > 361                 return -E2BIG;
> > 362
> > 363         net_devmem_dmabuf_binding_get(binding);
> > 364         return 0;
> > 365 }
> >
> > In the mp_dmabuf_devmem_init(), it fails when pool->dma_sync is true.
>
> This won't work for io_uring zero copy into user memory. We need all
> PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | PP_FLAG_ALLOW_UNREADABLE_NETMEM
> set.
>
> I agree with Mina that the driver should not be poking at the mp_priv
> fields. How about setting all the flags and then letting the mp->init()
> figure it out? mp_dmabuf_devmem_init() is called within page_pool_init()
> so as long as it resets dma_sync if set I don't see any issues.
>

Ah, I hadn't considered that failing on PP_FLAG_DMA_SYNC_DEV for
dmabuf may be wrong.
IIUC this flag indicates syncing between device and CPU, but device
memory TCP is not related to that sync.
So I think we need to remove this condition check from the core.
What do you think?

> >
> > tcp-data-split can be used for normal cases, not only devmem TCP case.
> > If we enable tcp-data-split and disable devmem TCP, page_pool doesn't
> > have PP_FLAG_DMA_SYNC_DEV.
> > So I think mp_params.mp_priv is still useful.
> >
> > Thanks a lot,
> > Taehee Yoo
> >
> >>
> >>
> >> --
> >> Thanks,
> >> Mina


* Re: [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command
  2024-10-09 13:54     ` Taehee Yoo
@ 2024-10-09 15:28       ` Jakub Kicinski
  2024-10-09 17:47         ` Taehee Yoo
  2024-10-31 17:34         ` Taehee Yoo
  0 siblings, 2 replies; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-09 15:28 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Wed, 9 Oct 2024 22:54:17 +0900 Taehee Yoo wrote:
> > This breaks previous behavior. The HDS reporting from get was
> > introduced to signal to user space whether the page flip based
> > TCP zero-copy (the one added some years ago not the recent one)
> > will be usable with this NIC.
> >
> > When HW-GRO is enabled HDS will be working.
> >
> > I think that the driver should only track if the user has set the value
> > to ENABLED (forced HDS), or to UNKNOWN (driver default). Setting the HDS
> > to disabled is not useful, don't support it.  
> 
> Okay, I will remove the disable feature in a v4 patch.
> Before this patch, hds_threshold was rx-copybreak value.
> How do you think hds_threshold should still follow rx-copybreak value
> if it is UNKNOWN mode?

IIUC the rx_copybreak only applies to the header? Or does it apply 
to the entire frame?

If rx_copybreak applies to the entire frame and not just the first
buffer (headers or headers+payload if not split) - no preference.
If rx_copybreak only applies to the headers / first buffer then
I'd keep them separate as they operate on a different length.

> I think hds_threshold need to follow new tcp-data-split-thresh value in
> ENABLE/UNKNOWN and make rx-copybreak pure software feature.

Sounds good to me, but just to be clear:

If user sets the HDS enable to UNKNOWN (or doesn't set it):
 - GET returns (current behavior, AFAIU):
   - DISABLED (if HW-GRO is disabled and MTU is not Jumbo)
   - ENABLED (if HW-GRO is enabled or MTU is Jumbo)
If user sets the HDS enable to ENABLED (force HDS on):
 - GET returns ENABLED 

hds_threshold returns: some value, but it's only actually used if GET
returns ENABLED.

> But if so, it changes the default behavior.

How so? The configuration of neither of those two is exposed to 
the user. We can keep the same defaults, until user overrides them.

> How do you think about it?
> 
> >  
> > >       ering->tx_max_pending = BNXT_MAX_TX_DESC_CNT;
> > >
> > >       ering->rx_pending = bp->rx_ring_size;
> > > @@ -854,9 +858,25 @@ static int bnxt_set_ringparam(struct net_device *dev,
> > >           (ering->tx_pending < BNXT_MIN_TX_DESC_CNT))
> > >               return -EINVAL;
> > >
> > > +     if (kernel_ering->tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
> > > +         BNXT_RX_PAGE_MODE(bp)) {
> > > +             NL_SET_ERR_MSG_MOD(extack, "tcp-data-split can not be enabled with XDP");
> > > +             return -EINVAL;
> > > +     }  
> >
> > Technically just if the XDP does not support multi-buffer.
> > Any chance we could do this check in the core?  
> 
> I think we can access xdp_rxq_info with netdev_rx_queue structure.
> However, xdp_rxq_info is not sufficient to distinguish mb is supported
> by the driver or not. I think prog->aux->xdp_has_frags is required to
> distinguish it correctly.
> So, I think we need something more.
> Do you have any idea?

Take a look at dev_xdp_prog_count(), something like that but only
counting non-mb progs?


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-08 19:50       ` Jakub Kicinski
@ 2024-10-09 15:37         ` Taehee Yoo
  2024-10-10  0:01           ` Jakub Kicinski
  0 siblings, 1 reply; 73+ messages in thread
From: Taehee Yoo @ 2024-10-09 15:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Mina Almasry, davem, pabeni, edumazet, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Wed, Oct 9, 2024 at 4:50 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 4 Oct 2024 19:34:45 +0900 Taehee Yoo wrote:
> > > Our intention with the whole netmem design is that drivers should
> > > never have to call netmem_to_page(). I.e. the driver should use netmem
> > > unaware of whether it's page or non-page underneath, to minimize
> > > complexity driver needs to handle.
> > >
> > > This netmem_to_page() call can be removed by using
> > > skb_frag_fill_netmem_desc() instead of the page variant. But, more
> > > improtantly, why did the code change here? The code before calls
> > > skb_frag_fill_page_desc, but the new code sometimes will
> > > skb_frag_fill_netmem_desc() and sometimes will skb_add_rx_frag_netmem.
> > > I'm not sure why that logic changed.
> >
> > The reason why skb_add_rx_frag_netmem() is used here is to set
> > skb->unreadable flag. the skb_frag_fill_netmem_desc() doesn't set
> > skb->unreadable because it doesn't handle skb, it only handles frag.
> > As far as I know, skb->unreadable should be set to true for devmem
> > TCP, am I misunderstood?
> > I tested that don't using skb_add_rx_frag_netmem() here, and it
> > immediately fails.
>
> Yes, but netmem_ref can be either a net_iov or a normal page,
> and skb_add_rx_frag_netmem() and similar helpers should automatically
> set skb->unreadable or not.
>
> IOW you should be able to always use netmem-aware APIs, no?

I'm not sure updating the skb->unreadable flag is possible, because a
frag API like skb_add_rx_frag_netmem() receives only the frag, not the skb.
How about an additional API to update the skb->unreadable flag,
e.g. skb_update_unreadable() or skb_update_netmem()?

>
> > > This is not the intended use of PP_FLAG_ALLOW_UNREADABLE_NETMEM.
> > >
> > > The driver should set PP_FLAG_ALLOW_UNREADABLE_NETMEM when it's able
> > > to handle unreadable netmem, it should not worry about whether
> > > rxq->mp_params.mp_priv is set or not.
> > >
> > > You should set PP_FLAG_ALLOW_UNREADABLE_NETMEM when HDS is enabled.
> > > Let core figure out if mp_params.mp_priv is enabled. All the driver
> > > needs to report is whether it's configured to be able to handle
> > > unreadable netmem (which practically means HDS is enabled).
> >
> > The reason why the branch exists here is the PP_FLAG_ALLOW_UNREADABLE_NETMEM
> > flag can't be used with PP_FLAG_DMA_SYNC_DEV.
>
> Hm. Isn't the existing check the wrong way around? Is the driver
> supposed to sync the buffers for device before passing them down?

I hadn't considered that failing on PP_FLAG_DMA_SYNC_DEV for dmabuf
may be wrong.
I think device memory TCP is not related to this flag, so the device
memory TCP core should not return failure when PP_FLAG_DMA_SYNC_DEV
is set.
How about removing this condition check from the device memory TCP core?


* Re: [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh
  2024-10-09 14:25     ` Taehee Yoo
@ 2024-10-09 15:46       ` Jakub Kicinski
  2024-10-09 17:49         ` Taehee Yoo
  0 siblings, 1 reply; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-09 15:46 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Wed, 9 Oct 2024 23:25:55 +0900 Taehee Yoo wrote:
> > > The tcp-data-split is not enabled, the tcp-data-split-thresh will
> > > not be used and can't be configured.
> > >
> > >    # ethtool -G enp14s0f0np0 tcp-data-split off
> > >    # ethtool -g enp14s0f0np0
> > >    Ring parameters for enp14s0f0np0:
> > >    Pre-set maximums:
> > >    ...
> > >    TCP data split thresh:  256
> > >    Current hardware settings:
> > >    ...
> > >    TCP data split:         off
> > >    TCP data split thresh:  n/a  
> >
> > My reply to Sridhar was probably quite unclear on this point, but FWIW
> > I do also have a weak preference to drop the "TCP" from the new knob.
> > Rephrasing what I said here:
> > https://lore.kernel.org/all/20240911173150.571bf93b@kernel.org/
> > the old knob is defined as being about TCP but for the new one we can
> > pick how widely applicable it is (and make it cover UDP as well).  
> 
> I'm not sure that I understand about "knob".
> The knob means the command "tcp-data-split-thresh"?
> If so, I would like to change from "tcp-data-split-thresh" to
> "header-data-split-thresh".

Sounds good!

> > > +     if (tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH] &&
> > > +         !(ops->supported_ring_params & ETHTOOL_RING_USE_TCP_DATA_SPLIT)) {  
> >
> > here you use the existing flag, yet gve and idpf set that flag and will
> > ignore the setting silently. They need to be changed or we need a new
> > flag.  
> 
> Okay, I would like to add the ETHTOOL_RING_USE_TCP_DATA_SPLIT_THRESH flag.
> Or ETHTOOL_RING_USE_HDS_THRESH, which indicates header-data-split thresh.
> If you agree with adding a new flag, how do you think about naming it?

How about ETHTOOL_RING_USE_HDS_THRS ?


* Re: [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command
  2024-10-09 15:28       ` Jakub Kicinski
@ 2024-10-09 17:47         ` Taehee Yoo
  2024-10-31 17:34         ` Taehee Yoo
  1 sibling, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-09 17:47 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Thu, Oct 10, 2024 at 12:28 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 9 Oct 2024 22:54:17 +0900 Taehee Yoo wrote:
> > > This breaks previous behavior. The HDS reporting from get was
> > > introduced to signal to user space whether the page flip based
> > > TCP zero-copy (the one added some years ago not the recent one)
> > > will be usable with this NIC.
> > >
> > > When HW-GRO is enabled HDS will be working.
> > >
> > > I think that the driver should only track if the user has set the value
> > > to ENABLED (forced HDS), or to UNKNOWN (driver default). Setting the HDS
> > > to disabled is not useful, don't support it.
> >
> > Okay, I will remove the disable feature in a v4 patch.
> > Before this patch, hds_threshold was rx-copybreak value.
> > How do you think hds_threshold should still follow rx-copybreak value
> > if it is UNKNOWN mode?
>
> IIUC the rx_copybreak only applies to the header? Or does it apply
> to the entire frame?
>
> If rx_copybreak applies to the entire frame and not just the first
> buffer (headers or headers+payload if not split) - no preference.
> If rx_copybreak only applies to the headers / first buffer then
> I'd keep them separate as they operate on a different length.

It applies only to the first buffer, so if HDS is enabled it copies
only the header.
Thanks, I will keep rx-copybreak and hds_threshold separate.

>
> > I think hds_threshold need to follow new tcp-data-split-thresh value in
> > ENABLE/UNKNOWN and make rx-copybreak pure software feature.
>
> Sounds good to me, but just to be clear:
>
> If user sets the HDS enable to UNKNOWN (or doesn't set it):
>  - GET returns (current behavior, AFAIU):
>    - DISABLED (if HW-GRO is disabled and MTU is not Jumbo)
>    - ENABLED (if HW-GRO is enabled or MTU is Jumbo)
> If user sets the HDS enable to ENABLED (force HDS on):
>  - GET returns ENABLED
>
> hds_threshold returns: some value, but it's only actually used if GET
> returns ENABLED.
>

Thanks for the detailed explanation!

> > But if so, it changes the default behavior.
>
> How so? The configuration of neither of those two is exposed to
> the user. We can keep the same defaults, until user overrides them.
>

Ah, right.
I understood.

> > How do you think about it?
> >
> > >
> > > >       ering->tx_max_pending = BNXT_MAX_TX_DESC_CNT;
> > > >
> > > >       ering->rx_pending = bp->rx_ring_size;
> > > > @@ -854,9 +858,25 @@ static int bnxt_set_ringparam(struct net_device *dev,
> > > >           (ering->tx_pending < BNXT_MIN_TX_DESC_CNT))
> > > >               return -EINVAL;
> > > >
> > > > +     if (kernel_ering->tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
> > > > +         BNXT_RX_PAGE_MODE(bp)) {
> > > > +             NL_SET_ERR_MSG_MOD(extack, "tcp-data-split can not be enabled with XDP");
> > > > +             return -EINVAL;
> > > > +     }
> > >
> > > Technically just if the XDP does not support multi-buffer.
> > > Any chance we could do this check in the core?
> >
> > I think we can access xdp_rxq_info with netdev_rx_queue structure.
> > However, xdp_rxq_info is not sufficient to distinguish mb is supported
> > by the driver or not. I think prog->aux->xdp_has_frags is required to
> > distinguish it correctly.
> > So, I think we need something more.
> > Do you have any idea?
>
> Take a look at dev_xdp_prog_count(), something like that but only
> counting non-mb progs?

Thanks for the very nice example, I will try it!


* Re: [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh
  2024-10-09 15:46       ` Jakub Kicinski
@ 2024-10-09 17:49         ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-09 17:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Thu, Oct 10, 2024 at 12:46 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 9 Oct 2024 23:25:55 +0900 Taehee Yoo wrote:
> > > > The tcp-data-split is not enabled, the tcp-data-split-thresh will
> > > > not be used and can't be configured.
> > > >
> > > >    # ethtool -G enp14s0f0np0 tcp-data-split off
> > > >    # ethtool -g enp14s0f0np0
> > > >    Ring parameters for enp14s0f0np0:
> > > >    Pre-set maximums:
> > > >    ...
> > > >    TCP data split thresh:  256
> > > >    Current hardware settings:
> > > >    ...
> > > >    TCP data split:         off
> > > >    TCP data split thresh:  n/a
> > >
> > > My reply to Sridhar was probably quite unclear on this point, but FWIW
> > > I do also have a weak preference to drop the "TCP" from the new knob.
> > > Rephrasing what I said here:
> > > https://lore.kernel.org/all/20240911173150.571bf93b@kernel.org/
> > > the old knob is defined as being about TCP but for the new one we can
> > > pick how widely applicable it is (and make it cover UDP as well).
> >
> > I'm not sure that I understand about "knob".
> > The knob means the command "tcp-data-split-thresh"?
> > If so, I would like to change from "tcp-data-split-thresh" to
> > "header-data-split-thresh".
>
> Sounds good!
>
> > > > +     if (tb[ETHTOOL_A_RINGS_TCP_DATA_SPLIT_THRESH] &&
> > > > +         !(ops->supported_ring_params & ETHTOOL_RING_USE_TCP_DATA_SPLIT)) {
> > >
> > > here you use the existing flag, yet gve and idpf set that flag and will
> > > ignore the setting silently. They need to be changed or we need a new
> > > flag.
> >
> > Okay, I would like to add the ETHTOOL_RING_USE_TCP_DATA_SPLIT_THRESH flag.
> > Or ETHTOOL_RING_USE_HDS_THRESH, which indicates header-data-split thresh.
> > If you agree with adding a new flag, how do you think about naming it?
>
> How about ETHTOOL_RING_USE_HDS_THRS ?

Thanks! I will use that name.


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-09 15:37         ` Taehee Yoo
@ 2024-10-10  0:01           ` Jakub Kicinski
  2024-10-10 17:44             ` Mina Almasry
  0 siblings, 1 reply; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-10  0:01 UTC (permalink / raw)
  To: Taehee Yoo, Mina Almasry
  Cc: davem, pabeni, edumazet, netdev, linux-doc, donald.hunter, corbet,
	michael.chan, kory.maincent, andrew, maxime.chevallier, danieller,
	hengqi, ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Thu, 10 Oct 2024 00:37:49 +0900 Taehee Yoo wrote:
> > Yes, but netmem_ref can be either a net_iov or a normal page,
> > and skb_add_rx_frag_netmem() and similar helpers should automatically
> > set skb->unreadable or not.
> >
> > IOW you should be able to always use netmem-aware APIs, no?  
> 
> I'm not sure the update skb->unreadable flag is possible because
> frag API like skb_add_rx_frag_netmem(), receives only frag, not skb.
> How about an additional API to update skb->unreadable flag?
> skb_update_unreadable() or skb_update_netmem()?

Ah, the case where we don't get an skb is because we're just building
an XDP frame at that stage. And XDP can't be netmem.

In that case switching to skb_frag_fill_netmem_desc() should be enough.

> > > The reason why the branch exists here is the PP_FLAG_ALLOW_UNREADABLE_NETMEM
> > > flag can't be used with PP_FLAG_DMA_SYNC_DEV.  
> >
> > Hm. Isn't the existing check the wrong way around? Is the driver
> > supposed to sync the buffers for device before passing them down?  
> 
> I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV
> for dmabuf may be wrong.
> I think device memory TCP is not related to this flag.
> So device memory TCP core API should not return failure when
> PP_FLAG_DMA_SYNC_DEV flag is set.
> How about removing this condition check code in device memory TCP core?

I think we need to invert the check..
Mina, WDYT?

diff --git a/net/core/devmem.c b/net/core/devmem.c
index 11b91c12ee11..c5cace3f9831 100644
--- a/net/core/devmem.c
+++ b/net/core/devmem.c
@@ -331,12 +331,6 @@ int mp_dmabuf_devmem_init(struct page_pool *pool)
 	if (!binding)
 		return -EINVAL;
 
-	if (!pool->dma_map)
-		return -EOPNOTSUPP;
-
-	if (pool->dma_sync)
-		return -EOPNOTSUPP;
-
 	if (pool->p.order != 0)
 		return -E2BIG;
 
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a813d30d2135..c8dbbf262de3 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -287,6 +287,12 @@ static int page_pool_init(struct page_pool *pool,
 	}
 
 	if (pool->mp_priv) {
+		if (!pool->dma_map || !pool->dma_sync)
+			return -EOPNOTSUPP;
+
+		/* Memory provider is responsible for syncing the pages. */
+		pool->dma_sync = 0;
+
 		err = mp_dmabuf_devmem_init(pool);
 		if (err) {
 			pr_warn("%s() mem-provider init failed %d\n", __func__,


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-10  0:01           ` Jakub Kicinski
@ 2024-10-10 17:44             ` Mina Almasry
  2024-10-11  1:34               ` Jakub Kicinski
  0 siblings, 1 reply; 73+ messages in thread
From: Mina Almasry @ 2024-10-10 17:44 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Wed, Oct 9, 2024 at 5:01 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 10 Oct 2024 00:37:49 +0900 Taehee Yoo wrote:
> > > Yes, but netmem_ref can be either a net_iov or a normal page,
> > > and skb_add_rx_frag_netmem() and similar helpers should automatically
> > > set skb->unreadable or not.
> > >
> > > IOW you should be able to always use netmem-aware APIs, no?
> >
> > I'm not sure the update skb->unreadable flag is possible because
> > frag API like skb_add_rx_frag_netmem(), receives only frag, not skb.
> > How about an additional API to update skb->unreadable flag?
> > skb_update_unreadable() or skb_update_netmem()?
>
> Ah, the case where we don't get skb is because we're just building XDP
> frame at that stage. And XDP can't be netmem.
>
> In that case switching to skb_frag_fill_netmem_desc() should be enough.
>
> > > > The reason why the branch exists here is the PP_FLAG_ALLOW_UNREADABLE_NETMEM
> > > > flag can't be used with PP_FLAG_DMA_SYNC_DEV.
> > >
> > > Hm. Isn't the existing check the wrong way around? Is the driver
> > > supposed to sync the buffers for device before passing them down?
> >
> > I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV
> > for dmabuf may be wrong.
> > I think device memory TCP is not related to this flag.
> > So device memory TCP core API should not return failure when
> > PP_FLAG_DMA_SYNC_DEV flag is set.
> > How about removing this condition check code in device memory TCP core?
>
> I think we need to invert the check..
> Mina, WDYT?
>

On a closer look, my feeling is similar to Taehee's:
PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The
memory providers allocate the memory and provide the dma-addr, but
need not dma-sync the dma-addr, right? The driver can sync the
dma-addr if it wants and the driver can delegate the syncing to the pp
via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should
be removed, not inverted, but I could be missing something.

-- 
Thanks,
Mina


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-10 17:44             ` Mina Almasry
@ 2024-10-11  1:34               ` Jakub Kicinski
  2024-10-11 17:33                 ` Mina Almasry
  0 siblings, 1 reply; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-11  1:34 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Thu, 10 Oct 2024 10:44:38 -0700 Mina Almasry wrote:
> > > I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV
> > > for dmabuf may be wrong.
> > > I think device memory TCP is not related to this flag.
> > > So device memory TCP core API should not return failure when
> > > PP_FLAG_DMA_SYNC_DEV flag is set.
> > > How about removing this condition check code in device memory TCP core?  
> >
> > I think we need to invert the check..
> > Mina, WDYT?
> 
> On a closer look, my feeling is similar to Taehee,
> PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The
> memory providers allocate the memory and provide the dma-addr, but
> need not dma-sync the dma-addr, right? The driver can sync the
> dma-addr if it wants and the driver can delegate the syncing to the pp
> via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should
> be removed, not inverted, but I could be missing something.

I don't know much about dmabuf, but it hinges on the question of whether
doing a DMA sync for device on a dmabuf address is:
 - a good thing
 - a noop
 - a bad thing

If it's a good thing or a noop - agreed.

Similar question for the sync for CPU.

I agree that intuitively it should be all fine. But the fact that dmabuf
has a bespoke API for accessing the memory by the CPU makes me worried
that there may be assumptions about these addresses not getting
randomly fed into the normal DMA API..


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-11  1:34               ` Jakub Kicinski
@ 2024-10-11 17:33                 ` Mina Almasry
  2024-10-11 23:42                   ` Jason Gunthorpe
  0 siblings, 1 reply; 73+ messages in thread
From: Mina Almasry @ 2024-10-11 17:33 UTC (permalink / raw)
  To: Jakub Kicinski, Christian König, Jason Gunthorpe,
	Samiullah Khawaja
  Cc: Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Thu, Oct 10, 2024 at 6:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 10 Oct 2024 10:44:38 -0700 Mina Almasry wrote:
> > > > I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV
> > > > for dmabuf may be wrong.
> > > > I think device memory TCP is not related to this flag.
> > > > So device memory TCP core API should not return failure when
> > > > PP_FLAG_DMA_SYNC_DEV flag is set.
> > > > How about removing this condition check code in device memory TCP core?
> > >
> > > I think we need to invert the check..
> > > Mina, WDYT?
> >
> > On a closer look, my feeling is similar to Taehee,
> > PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The
> > memory providers allocate the memory and provide the dma-addr, but
> > need not dma-sync the dma-addr, right? The driver can sync the
> > dma-addr if it wants and the driver can delegate the syncing to the pp
> > via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should
> > be removed, not inverted, but I could be missing something.
>
> I don't know much about dmabuf but it hinges on the question whether
> doing DMA sync for device on a dmabuf address is :
>  - a good thing
>  - a noop
>  - a bad thing
>
> If it's a good thing or a noop - agreed.
>
> Similar question for the sync for CPU.
>
> I agree that intuitively it should be all fine. But the fact that dmabuf
> has a bespoke API for accessing the memory by the CPU makes me worried
> that there may be assumptions about these addresses not getting
> randomly fed into the normal DMA API..

Sorry I'm also a bit unsure what is the right thing to do here. The
code that we've been running in GVE does a dma-sync for cpu
unconditionally on RX for dma-buf and non-dmabuf dma-addrs and we
haven't been seeing issues. It never does dma-sync for device.

My first question is why is dma-sync for device needed on RX path at
all for some drivers in the first place? For incoming (non-dmabuf)
data, the data is written by the device and read by the cpu, so sync
for cpu is really what's needed. Is the sync for device for XDP? Or is
it that buffers should be dma-syncd for device before they are
re-posted to the NIC?

Christian/Jason, sorry, quick question: are
dma_sync_single_for_{device|cpu} needed or wanted when the dma-addrs
come from a dma-buf? Or are these dma-addrs to be treated like any
other, with the normal dma_sync_for_{device|cpu} rules?

--
Thanks,
Mina


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-11 17:33                 ` Mina Almasry
@ 2024-10-11 23:42                   ` Jason Gunthorpe
  2024-10-14 22:38                     ` Mina Almasry
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Gunthorpe @ 2024-10-11 23:42 UTC (permalink / raw)
  To: Mina Almasry, Leon Romanovsky
  Cc: Jakub Kicinski, Christian König, Samiullah Khawaja,
	Taehee Yoo, davem, pabeni, edumazet, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Fri, Oct 11, 2024 at 10:33:43AM -0700, Mina Almasry wrote:
> On Thu, Oct 10, 2024 at 6:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Thu, 10 Oct 2024 10:44:38 -0700 Mina Almasry wrote:
> > > > > I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV
> > > > > for dmabuf may be wrong.
> > > > > I think device memory TCP is not related to this flag.
> > > > > So device memory TCP core API should not return failure when
> > > > > PP_FLAG_DMA_SYNC_DEV flag is set.
> > > > > How about removing this condition check code in device memory TCP core?
> > > >
> > > > I think we need to invert the check..
> > > > Mina, WDYT?
> > >
> > > On a closer look, my feeling is similar to Taehee,
> > > PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The
> > > memory providers allocate the memory and provide the dma-addr, but
> > > need not dma-sync the dma-addr, right? The driver can sync the
> > > dma-addr if it wants and the driver can delegate the syncing to the pp
> > > via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should
> > > be removed, not inverted, but I could be missing something.
> >
> > I don't know much about dmabuf but it hinges on the question whether
> > doing DMA sync for device on a dmabuf address is :
> >  - a good thing
> >  - a noop
> >  - a bad thing
> >
> > If it's a good thing or a noop - agreed.
> >
> > Similar question for the sync for CPU.
> >
> > I agree that intuitively it should be all fine. But the fact that dmabuf
> > has a bespoke API for accessing the memory by the CPU makes me worried
> > that there may be assumptions about these addresses not getting
> > randomly fed into the normal DMA API..
> 
> Sorry I'm also a bit unsure what is the right thing to do here. The
> code that we've been running in GVE does a dma-sync for cpu
> unconditionally on RX for dma-buf and non-dmabuf dma-addrs and we
> haven't been seeing issues. It never does dma-sync for device.
> 
> My first question is why is dma-sync for device needed on RX path at
> all for some drivers in the first place? For incoming (non-dmabuf)
> data, the data is written by the device and read by the cpu, so sync
> for cpu is really what's needed. Is the sync for device for XDP? Or is
> it that buffers should be dma-syncd for device before they are
> re-posted to the NIC?
> 
> Christian/Jason, sorry quick question: are
> dma_sync_single_for_{device|cpu} needed or wanted when the dma-addrs
> come from a dma-buf? Or these dma-addrs to be treated like any other
> with the normal dma_sync_for_{device|cpu} rules?

Um, I think because dma-buf hacks things up and generates illegal
scatterlist entries with weird dma_map_resource() addresses for the
typical P2P case, the dma sync API should not be used on those things.

However, there is no way to know if the dma-buf has done this, and
there are valid cases where the scatterlist is not ill formed and the
sync is necessary.

We are getting so close to being able to start fixing these API
issues in dmabuf; I hope next cycle we can begin.. Fingers crossed.

From a CPU architecture perspective you do not need to cache flush PCI
MMIO BAR memory, and perhaps doing so might be problematic on some
arches (???). But you do need to flush normal cacheable CPU memory if
that is in the DMA buf.

Jason


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-11 23:42                   ` Jason Gunthorpe
@ 2024-10-14 22:38                     ` Mina Almasry
  2024-10-15  0:16                       ` Jakub Kicinski
  2024-10-15 14:29                       ` Pavel Begunkov
  0 siblings, 2 replies; 73+ messages in thread
From: Mina Almasry @ 2024-10-14 22:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Jakub Kicinski, Christian König,
	Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev,
	linux-doc, donald.hunter, corbet, michael.chan, kory.maincent,
	andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Sat, Oct 12, 2024 at 2:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Oct 11, 2024 at 10:33:43AM -0700, Mina Almasry wrote:
> > On Thu, Oct 10, 2024 at 6:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Thu, 10 Oct 2024 10:44:38 -0700 Mina Almasry wrote:
> > > > > > I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV
> > > > > > for dmabuf may be wrong.
> > > > > > I think device memory TCP is not related to this flag.
> > > > > > So device memory TCP core API should not return failure when
> > > > > > PP_FLAG_DMA_SYNC_DEV flag is set.
> > > > > > How about removing this condition check code in device memory TCP core?
> > > > >
> > > > > I think we need to invert the check..
> > > > > Mina, WDYT?
> > > >
> > > > On a closer look, my feeling is similar to Taehee,
> > > > PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The
> > > > memory providers allocate the memory and provide the dma-addr, but
> > > > need not dma-sync the dma-addr, right? The driver can sync the
> > > > dma-addr if it wants and the driver can delegate the syncing to the pp
> > > > via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should
> > > > be removed, not inverted, but I could be missing something.
> > >
> > > I don't know much about dmabuf but it hinges on the question whether
> > > doing DMA sync for device on a dmabuf address is :
> > >  - a good thing
> > >  - a noop
> > >  - a bad thing
> > >
> > > If it's a good thing or a noop - agreed.
> > >
> > > Similar question for the sync for CPU.
> > >
> > > I agree that intuitively it should be all fine. But the fact that dmabuf
> > > has a bespoke API for accessing the memory by the CPU makes me worried
> > > that there may be assumptions about these addresses not getting
> > > randomly fed into the normal DMA API..
> >
> > Sorry I'm also a bit unsure what is the right thing to do here. The
> > code that we've been running in GVE does a dma-sync for cpu
> > unconditionally on RX for dma-buf and non-dmabuf dma-addrs and we
> > haven't been seeing issues. It never does dma-sync for device.
> >
> > My first question is why is dma-sync for device needed on RX path at
> > all for some drivers in the first place? For incoming (non-dmabuf)
> > data, the data is written by the device and read by the cpu, so sync
> > for cpu is really what's needed. Is the sync for device for XDP? Or is
> > it that buffers should be dma-syncd for device before they are
> > re-posted to the NIC?
> >
> > Christian/Jason, sorry quick question: are
> > dma_sync_single_for_{device|cpu} needed or wanted when the dma-addrs
> > come from a dma-buf? Or these dma-addrs to be treated like any other
> > with the normal dma_sync_for_{device|cpu} rules?
>
> Um, I think because dma-buf hacks things up and generates illegal
> scatterlist entries with weird dma_map_resource() addresses for the
> typical P2P case the dma sync API should not be used on those things.
>
> However, there is no way to know if the dma-buf has does this, and
> there are valid case where the scatterlist is not ill formed and the
> sync is necessary.
>
> We are getting soo close to being able to start fixing these API
> issues in dmabuf, I hope next cylce we can begin.. Fingers crossed.
>
> From a CPU architecture perspective you do not need to cache flush PCI
> MMIO BAR memory, and perhaps doing so be might be problematic on some
> arches (???). But you do need to flush normal cachable CPU memory if
> that is in the DMA buf.
>

Thanks Jason. In that case I agree with Jakub we should take in his change here:

https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/

With this change the driver would delegate dma_sync_for_device to the
page_pool, and the page_pool will skip it altogether for the dma-buf
memory provider.

--
Thanks,
Mina


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-14 22:38                     ` Mina Almasry
@ 2024-10-15  0:16                       ` Jakub Kicinski
  2024-10-15  1:10                         ` Mina Almasry
  2024-10-15 14:29                       ` Pavel Begunkov
  1 sibling, 1 reply; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-15  0:16 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Jason Gunthorpe, Leon Romanovsky, Christian König,
	Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev,
	linux-doc, donald.hunter, corbet, michael.chan, kory.maincent,
	andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Tue, 15 Oct 2024 01:38:20 +0300 Mina Almasry wrote:
> Thanks Jason. In that case I agree with Jakub we should take in his change here:
> 
> https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/
> 
> With this change the driver would delegate dma_sync_for_device to the
> page_pool, and the page_pool will skip it altogether for the dma-buf
> memory provider.

And we need a wrapper for a sync for CPU which will skip if the page
comes from an unreadable pool?


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-15  0:16                       ` Jakub Kicinski
@ 2024-10-15  1:10                         ` Mina Almasry
  2024-10-15 12:44                           ` Jason Gunthorpe
  0 siblings, 1 reply; 73+ messages in thread
From: Mina Almasry @ 2024-10-15  1:10 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jason Gunthorpe, Leon Romanovsky, Christian König,
	Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev,
	linux-doc, donald.hunter, corbet, michael.chan, kory.maincent,
	andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Tue, Oct 15, 2024 at 3:16 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 15 Oct 2024 01:38:20 +0300 Mina Almasry wrote:
> > Thanks Jason. In that case I agree with Jakub we should take in his change here:
> >
> > https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/
> >
> > With this change the driver would delegate dma_sync_for_device to the
> > page_pool, and the page_pool will skip it altogether for the dma-buf
> > memory provider.
>
> And we need a wrapper for a sync for CPU which will skip if the page
> comes from an unreadable pool?

This is where it gets a bit tricky, no?

Our production code does a dma_sync_for_cpu but no
dma_sync_for_device. That has been working reliably for us with GPU
dmabufs and udmabuf, but we haven't of course tested every dma-buf.
I'm comfortable enforcing the 'no dma_sync_for_device' now since it
brings upstream in line with our well tested setup. I'm not sure I'm
100% comfortable enforcing the 'no dma_sync_for_cpu' now since it's a
deviation. The dma_sync_for_cpu is very very likely a no-op since we
don't really access the data from cpu ever with devmem TCP, but who
knows.

Is it possible to give me a couple of weeks to make this change
locally and run it through some testing to see if it breaks anything?

But if you or Jason think that enforcing the 'no dma_sync_for_cpu'
now is critical, no problem. We can also provide this patch, and seek
to revert it or fix it up properly later in the event it turns out it
causes issues.

Note that io_uring provider, or other non-dmabuf providers may need to
do a dma-sync, but that bridge can be crossed in David's patchset.

-- 
Thanks,
Mina


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-15  1:10                         ` Mina Almasry
@ 2024-10-15 12:44                           ` Jason Gunthorpe
  2024-10-18  8:25                             ` Mina Almasry
  0 siblings, 1 reply; 73+ messages in thread
From: Jason Gunthorpe @ 2024-10-15 12:44 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Jakub Kicinski, Leon Romanovsky, Christian König,
	Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev,
	linux-doc, donald.hunter, corbet, michael.chan, kory.maincent,
	andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Tue, Oct 15, 2024 at 04:10:44AM +0300, Mina Almasry wrote:
> On Tue, Oct 15, 2024 at 3:16 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Tue, 15 Oct 2024 01:38:20 +0300 Mina Almasry wrote:
> > > Thanks Jason. In that case I agree with Jakub we should take in his change here:
> > >
> > > https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/
> > >
> > > With this change the driver would delegate dma_sync_for_device to the
> > > page_pool, and the page_pool will skip it altogether for the dma-buf
> > > memory provider.
> >
> > And we need a wrapper for a sync for CPU which will skip if the page
> > comes from an unreadable pool?
> 
> This is where it gets a bit tricky, no?
> 
> Our production code does a dma_sync_for_cpu but no
> dma_sync_for_device. That has been working reliably for us with GPU

Those functions are all NOPs on the systems you are testing on.

The question is what is correct to do on systems where it is not a
NOP, and none of this is really right, as I explained..

> But if you or Jason think that enforcing the 'no dma_buf_sync_for_cpu'
> now is critical, no problem. We can also provide this patch, and seek
> to revert it or fix it up properly later in the event it turns out it
> causes issues.

What is important is you organize things going forward to be able to
do this properly, which means the required sync type is dependent on
the actual page being synced and you will eventually somehow learn
which is required from the dmabuf.

Most likely nobody will ever run this code on a system where dma_sync
is not a NOP, but we should still use the DMA API properly, and things
should make architectural sense.

Jason


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-14 22:38                     ` Mina Almasry
  2024-10-15  0:16                       ` Jakub Kicinski
@ 2024-10-15 14:29                       ` Pavel Begunkov
  2024-10-15 17:38                         ` David Wei
  1 sibling, 1 reply; 73+ messages in thread
From: Pavel Begunkov @ 2024-10-15 14:29 UTC (permalink / raw)
  To: Mina Almasry, Jason Gunthorpe
  Cc: Leon Romanovsky, Jakub Kicinski, Christian König,
	Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev,
	linux-doc, donald.hunter, corbet, michael.chan, kory.maincent,
	andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, kaiyuanz, willemb, aleksander.lobakin, dw,
	sridhar.samudrala, bcreeley

On 10/14/24 23:38, Mina Almasry wrote:
> On Sat, Oct 12, 2024 at 2:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> On Fri, Oct 11, 2024 at 10:33:43AM -0700, Mina Almasry wrote:
>>> On Thu, Oct 10, 2024 at 6:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>>
>>>> On Thu, 10 Oct 2024 10:44:38 -0700 Mina Almasry wrote:
>>>>>>> I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV
>>>>>>> for dmabuf may be wrong.
>>>>>>> I think device memory TCP is not related to this flag.
>>>>>>> So device memory TCP core API should not return failure when
>>>>>>> PP_FLAG_DMA_SYNC_DEV flag is set.
>>>>>>> How about removing this condition check code in device memory TCP core?
>>>>>>
>>>>>> I think we need to invert the check..
>>>>>> Mina, WDYT?
>>>>>
>>>>> On a closer look, my feeling is similar to Taehee,
>>>>> PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The
>>>>> memory providers allocate the memory and provide the dma-addr, but
>>>>> need not dma-sync the dma-addr, right? The driver can sync the
>>>>> dma-addr if it wants and the driver can delegate the syncing to the pp
>>>>> via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should
>>>>> be removed, not inverted, but I could be missing something.
>>>>
>>>> I don't know much about dmabuf but it hinges on the question whether
>>>> doing DMA sync for device on a dmabuf address is :
>>>>   - a good thing
>>>>   - a noop
>>>>   - a bad thing
>>>>
>>>> If it's a good thing or a noop - agreed.
>>>>
>>>> Similar question for the sync for CPU.
>>>>
>>>> I agree that intuitively it should be all fine. But the fact that dmabuf
>>>> has a bespoke API for accessing the memory by the CPU makes me worried
>>>> that there may be assumptions about these addresses not getting
>>>> randomly fed into the normal DMA API..
>>>
>>> Sorry I'm also a bit unsure what is the right thing to do here. The
>>> code that we've been running in GVE does a dma-sync for cpu
>>> unconditionally on RX for dma-buf and non-dmabuf dma-addrs and we
>>> haven't been seeing issues. It never does dma-sync for device.
>>>
>>> My first question is why is dma-sync for device needed on RX path at
>>> all for some drivers in the first place? For incoming (non-dmabuf)
>>> data, the data is written by the device and read by the cpu, so sync
>>> for cpu is really what's needed. Is the sync for device for XDP? Or is
>>> it that buffers should be dma-syncd for device before they are
>>> re-posted to the NIC?
>>>
>>> Christian/Jason, sorry quick question: are
>>> dma_sync_single_for_{device|cpu} needed or wanted when the dma-addrs
>>> come from a dma-buf? Or these dma-addrs to be treated like any other
>>> with the normal dma_sync_for_{device|cpu} rules?
>>
>> Um, I think because dma-buf hacks things up and generates illegal
>> scatterlist entries with weird dma_map_resource() addresses for the
>> typical P2P case the dma sync API should not be used on those things.
>>
>> However, there is no way to know if the dma-buf has does this, and
>> there are valid case where the scatterlist is not ill formed and the
>> sync is necessary.
>>
>> We are getting soo close to being able to start fixing these API
>> issues in dmabuf, I hope next cylce we can begin.. Fingers crossed.
>>
>>  From a CPU architecture perspective you do not need to cache flush PCI
>> MMIO BAR memory, and perhaps doing so be might be problematic on some
>> arches (???). But you do need to flush normal cachable CPU memory if
>> that is in the DMA buf.
>>
> 
> Thanks Jason. In that case I agree with Jakub we should take in his change here:
> 
> https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/
> 
> With this change the driver would delegate dma_sync_for_device to the
> page_pool, and the page_pool will skip it altogether for the dma-buf
> memory provider.

Requiring ->dma_map should be common to all providers, as the page pool
shouldn't be dipping into net_iovs to figure out how to map them. However,
judging from this discussion, the ->dma_sync concern is devmem
specific and should be discarded by pp providers using dmabufs, i.e. in
devmem.c:mp_dmabuf_devmem_init().

-- 
Pavel Begunkov


* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-15 14:29                       ` Pavel Begunkov
@ 2024-10-15 17:38                         ` David Wei
  0 siblings, 0 replies; 73+ messages in thread
From: David Wei @ 2024-10-15 17:38 UTC (permalink / raw)
  To: Pavel Begunkov, Mina Almasry, Jason Gunthorpe
  Cc: Leon Romanovsky, Jakub Kicinski, Christian König,
	Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev,
	linux-doc, donald.hunter, corbet, michael.chan, kory.maincent,
	andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, kaiyuanz, willemb, aleksander.lobakin,
	sridhar.samudrala, bcreeley, David Wei

On 2024-10-15 07:29, Pavel Begunkov wrote:
> On 10/14/24 23:38, Mina Almasry wrote:
>> On Sat, Oct 12, 2024 at 2:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>>
>>> On Fri, Oct 11, 2024 at 10:33:43AM -0700, Mina Almasry wrote:
>>>> On Thu, Oct 10, 2024 at 6:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
>>>>>
>>>>> On Thu, 10 Oct 2024 10:44:38 -0700 Mina Almasry wrote:
>>>>>>>> I haven't thought the failure of PP_FLAG_DMA_SYNC_DEV
>>>>>>>> for dmabuf may be wrong.
>>>>>>>> I think device memory TCP is not related to this flag.
>>>>>>>> So device memory TCP core API should not return failure when
>>>>>>>> PP_FLAG_DMA_SYNC_DEV flag is set.
>>>>>>>> How about removing this condition check code in device memory TCP core?
>>>>>>>
>>>>>>> I think we need to invert the check..
>>>>>>> Mina, WDYT?
>>>>>>
>>>>>> On a closer look, my feeling is similar to Taehee,
>>>>>> PP_FLAG_DMA_SYNC_DEV should be orthogonal to memory providers. The
>>>>>> memory providers allocate the memory and provide the dma-addr, but
>>>>>> need not dma-sync the dma-addr, right? The driver can sync the
>>>>>> dma-addr if it wants and the driver can delegate the syncing to the pp
>>>>>> via PP_FLAG_DMA_SYNC_DEV if it wants. AFAICT I think the check should
>>>>>> be removed, not inverted, but I could be missing something.
>>>>>
>>>>> I don't know much about dmabuf but it hinges on the question whether
>>>>> doing DMA sync for device on a dmabuf address is :
>>>>>   - a good thing
>>>>>   - a noop
>>>>>   - a bad thing
>>>>>
>>>>> If it's a good thing or a noop - agreed.
>>>>>
>>>>> Similar question for the sync for CPU.
>>>>>
>>>>> I agree that intuitively it should be all fine. But the fact that dmabuf
>>>>> has a bespoke API for accessing the memory by the CPU makes me worried
>>>>> that there may be assumptions about these addresses not getting
>>>>> randomly fed into the normal DMA API..
>>>>
>>>> Sorry I'm also a bit unsure what is the right thing to do here. The
>>>> code that we've been running in GVE does a dma-sync for cpu
>>>> unconditionally on RX for dma-buf and non-dmabuf dma-addrs and we
>>>> haven't been seeing issues. It never does dma-sync for device.
>>>>
>>>> My first question is why is dma-sync for device needed on RX path at
>>>> all for some drivers in the first place? For incoming (non-dmabuf)
>>>> data, the data is written by the device and read by the cpu, so sync
>>>> for cpu is really what's needed. Is the sync for device for XDP? Or is
>>>> it that buffers should be dma-syncd for device before they are
>>>> re-posted to the NIC?
>>>>
>>>> Christian/Jason, sorry quick question: are
>>>> dma_sync_single_for_{device|cpu} needed or wanted when the dma-addrs
>>>> come from a dma-buf? Or these dma-addrs to be treated like any other
>>>> with the normal dma_sync_for_{device|cpu} rules?
>>>
>>> Um, I think because dma-buf hacks things up and generates illegal
>>> scatterlist entries with weird dma_map_resource() addresses for the
>>> typical P2P case the dma sync API should not be used on those things.
>>>
>>> However, there is no way to know if the dma-buf has does this, and
>>> there are valid case where the scatterlist is not ill formed and the
>>> sync is necessary.
>>>
>>> We are getting soo close to being able to start fixing these API
>>> issues in dmabuf, I hope next cylce we can begin.. Fingers crossed.
>>>
>>>  From a CPU architecture perspective you do not need to cache flush PCI
>>> MMIO BAR memory, and perhaps doing so be might be problematic on some
>>> arches (???). But you do need to flush normal cachable CPU memory if
>>> that is in the DMA buf.
>>>
>>
>> Thanks Jason. In that case I agree with Jakub we should take in his change here:
>>
>> https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/
>>
>> With this change the driver would delegate dma_sync_for_device to the
>> page_pool, and the page_pool will skip it altogether for the dma-buf
>> memory provider.
> 
> Requiring ->dma_map should be common to all providers as page pool
> shouldn't be dipping to net_iovs figuring out how to map them. However,
> looking at this discussion seems that the ->dma_sync concern is devmem
> specific and should be discarded by pp providers using dmabufs, i.e. in
> devmem.c:mp_dmabuf_devmem_init().

Yes, that's my preference as well, see my earlier reply.

> 


* Re: [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt
  2024-10-03 16:06 [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Taehee Yoo
                   ` (6 preceding siblings ...)
  2024-10-03 16:06 ` [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp Taehee Yoo
@ 2024-10-16 20:17 ` Stanislav Fomichev
  2024-10-17  8:58   ` Taehee Yoo
  7 siblings, 1 reply; 73+ messages in thread
From: Stanislav Fomichev @ 2024-10-16 20:17 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On 10/03, Taehee Yoo wrote:
> This series implements device memory TCP for bnxt_en driver and
> necessary ethtool command implementations.
> 
> NICs that use the bnxt_en driver support a tcp-data-split feature named
> HDS (header-data-split).
> But there is no way to enable/disable HDS via ethtool.
> Only getting the current HDS status is implemented, and HDS is
> automatically enabled only when either LRO, HW-GRO, or JUMBO is enabled.
> The hds_threshold follows the rx-copybreak value, but it wasn't
> changeable.
> 
> Currently, the bnxt_en driver enables tcp-data-split by default, but
> it does not always take effect.
> The hds_threshold value determines the split: when a packet is larger
> than this value, it is split into header and data.
> hds_threshold has been fixed at 256, which is also the default
> rx-copybreak value.
> Since the rx-copybreak value could not be changed, neither could
> hds_threshold.
> 
> This patchset first decouples hds_threshold from rx-copybreak,
> and then makes tcp-data-split, rx-copybreak, and
> tcp-data-split-thresh (hds_threshold) configurable independently.
> 
> But the default configuration is the same.
> The default value of rx-copybreak is 256 and default
> tcp-data-split-thresh is also 256.
> 
> There are several related options:
> TPA (HW-GRO, LRO), JUMBO, jumbo_thresh (set via firmware command), and
> the aggregation ring.
> 
> The aggregation ring is fundamental to all of these features.
> When GRO/LRO/jumbo packets are received, the NIC places the first
> packet on the normal ring, and the subsequent packets on the
> aggregation ring.
> 
> These features work regardless of HDS.
> When TPA is enabled and HDS is disabled, the first packet contains
> both header and payload, and the following packets contain payload
> only.
> If HDS is enabled, the first packet contains the header only, and the
> following packets contain only payload.
> So, HW-GRO/LRO works regardless of HDS.
> 
> There is another threshold value, jumbo_thresh.
> It is very similar to hds_thresh, but jumbo_thresh doesn't split
> header and data; it just splits the first and following buffers based
> on length.
> When the NIC receives a 1500-byte packet and jumbo_thresh is 256 (the
> default, following rx-copybreak), the first buffer holds 256 bytes and
> the following buffers hold the remaining 1500 - 256 bytes.
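The length-based split described above is simple enough to sketch in a
few lines. This is an illustrative user-space mock, not driver code; the
struct and function names are made up for the example:

```c
#include <stddef.h>

/* Illustrative sketch of a jumbo_thresh-style length split: how many
 * bytes land in the first (normal ring) buffer vs. the aggregation
 * buffers. Not actual bnxt_en code. */
struct rx_split {
	size_t first;	/* bytes placed in the first buffer */
	size_t rest;	/* bytes left for aggregation ring buffers */
};

static struct rx_split jumbo_split(size_t pkt_len, size_t jumbo_thresh)
{
	struct rx_split s;

	if (pkt_len <= jumbo_thresh) {
		s.first = pkt_len;	/* small packet: no split */
		s.rest = 0;
	} else {
		s.first = jumbo_thresh;
		s.rest = pkt_len - jumbo_thresh;
	}
	return s;
}
```

For the 1500-byte example above with the 256-byte default, this yields a
256-byte first buffer and 1244 bytes on the aggregation ring.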
> 
> Before this patch, the aggregation ring is enabled if at least one of
> the GRO, LRO, and JUMBO flags is enabled.
> If the aggregation ring is enabled, both hds_threshold and
> jumbo_thresh are set to the default rx-copybreak value.
> 
> So GRO, LRO, and JUMBO frames larger than 256 bytes are split into
> header and data if the protocol is TCP or UDP.
> For other protocols, jumbo_thresh applies instead of hds_thresh.
> 
> This means that tcp-data-split relies on the GRO, LRO, and JUMBO
> flags. With this patch, tcp-data-split no longer relies on them:
> if tcp-data-split is enabled, the aggregation ring is enabled.
> Also, hds_threshold no longer follows the rx-copybreak value; it is
> set to the tcp-data-split-thresh value from user-space, and the
> default value is still 256.
> 
> If the protocol is TCP or UDP, HDS is disabled, and the aggregation
> ring is enabled, a packet is still split into several pieces due to
> jumbo_thresh.
> 
> When XDP is attached, tcp-data-split is automatically disabled.
> 
> LRO, GRO, and JUMBO were tested with BCM57414 and BCM57504 on firmware
> version 230.0.157.0.
> I couldn't find any specification of the minimum and maximum
> hds_threshold values, but from my testing the effective range is about
> 0 ~ 1023.
> That is, packets larger than 1023 bytes are split into header and data
> when tcp-data-split is enabled, regardless of the hds_threshold value.
> When hds_threshold is 1500 and the received packet size is 1400, HDS
> should not be activated, but it is.
> The maximum hds_threshold (tcp-data-split-thresh) value was therefore
> set to 256, because that value is known to work; it was chosen very
> conservatively.
> 
> I verified that tcp-data-split (HDS) works independently of GRO, LRO,
> and JUMBO, testing GRO/LRO and JUMBO with HDS both enabled and
> disabled.
> I also verified that tcp-data-split is disabled automatically when XDP
> is attached and cannot be re-enabled while XDP remains attached.
> I tested the full range of values for tcp-data-split-thresh (0 to 256)
> and rx-copybreak (65 to 256), and they work.
> When testing this patchset, I checked the skb->data, skb->data_len,
> and nr_frags values.
> 
> The first patch implements .{set,get}_tunable() in bnxt_en.
> The bnxt_en driver has supported the rx-copybreak feature, but it was
> not configurable; only the default rx-copybreak value was used.
> This patch makes the rx-copybreak value configurable.
> 
> The second patch adds an implementation of the tcp-data-split ethtool
> command.
> HDS relies on the aggregation ring, which is enabled automatically
> when LRO, GRO, or a large MTU is configured.
> So, if the aggregation ring is enabled, HDS is enabled automatically
> as well.
> 
> The third patch adds the tcp-data-split-thresh option to ethtool.
> If a received packet is larger than this threshold, its header and
> payload are split.
> Example:
>    # ethtool -G <interface name> tcp-data-split-thresh <value>
> This option cannot be used when tcp-data-split is disabled or not
> supported.
>    # ethtool -G enp14s0f0np0 tcp-data-split on tcp-data-split-thresh 256
>    # ethtool -g enp14s0f0np0
>    Ring parameters for enp14s0f0np0:
>    Pre-set maximums:
>    ...
>    Current hardware settings:
>    ...
>    TCP data split:         on
>    TCP data split thresh:  256
> 
>    # ethtool -G enp14s0f0np0 tcp-data-split off
>    # ethtool -g enp14s0f0np0
>    Ring parameters for enp14s0f0np0:
>    Pre-set maximums:
>    ...
>    Current hardware settings:
>    ...
>    TCP data split:         off
>    TCP data split thresh:  n/a
> 
> The fourth patch implements the tcp-data-split-thresh logic in the
> bnxt_en driver.
> The default value is 256, matching the previous default rx-copybreak
> value.
> 
> The fifth and sixth patches add condition checks for devmem and
> ethtool.
> If tcp-data-split is disabled or the threshold value is not zero,
> devmem setup fails.
> Also, tcp-data-split and tcp-data-split-thresh cannot be changed
> while devmem is running.
> 
> The last patch implements device memory TCP for the bnxt_en driver.
> It mostly converts the generic page_pool API calls to the netmem
> page_pool API.
> 
> No dependencies exist between device memory TCP and GRO/LRO/MTU.
> Only tcp-data-split needs to be enabled (with tcp-data-split-thresh
> set to 0) for device memory TCP.
> While devmem TCP is active, tcp-data-split and tcp-data-split-thresh
> can't be updated because the core API disallows changes.
> 
> I tested interface up/down while devmem TCP was running; it works
> well. Channel count and rx/tx ring size changes work well too.
> 
> The devmem TCP test NIC is BCM57504

[..]

> All necessary configuration validations exist at the core API level.
> 
> Note that with this patch, setting up device memory TCP fails,
> because the tcp-data-split-thresh command is not yet supported by
> ethtool.
> tcp-data-split-thresh must be 0 to set up device memory TCP, and the
> bnxt default is 256.
> So, for bnxt, setup always fails until ethtool supports the
> tcp-data-split-thresh command.
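The devmem-bind precondition described above (split enabled, threshold
zero) can be mocked in a few lines. This is an illustrative user-space
sketch with made-up names, not the actual core-API check:

```c
#include <stdbool.h>

/* Mock of the devmem setup precondition: tcp-data-split must be on and
 * the split threshold must be 0, so every payload lands in device
 * memory and no header bytes leak into unreadable buffers.
 * Function name is illustrative only. */
static bool devmem_bind_allowed(bool tcp_data_split, unsigned int hds_thresh)
{
	return tcp_data_split && hds_thresh == 0;
}
```

With the bnxt default threshold of 256 this check fails, which is why
the setup cannot succeed until the threshold becomes configurable.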
> 
> ncdevmem.c will be updated once ethtool supports the
> tcp-data-split-thresh option.

FYI, I've tested your series with BCM57504 on top of [1] and [2] with
a couple of patches to make ncdevmem.c and TX work (see below). [1]
decouples ncdevmem from ethtool so we can flip header split settings
without requiring recent ethtool. Both RX and TX work perfectly.
Feel free to carry:

Tested-by: Stanislav Fomichev <sdf@fomichev.me>

Also feel free to take over the ncdevmem patch if my ncdevmem changes
get pulled before your series.

1: https://lore.kernel.org/netdev/20241009171252.2328284-1-sdf@fomichev.me/
2: https://lore.kernel.org/netdev/20240913150913.1280238-1-sdf@fomichev.me/

commit 69bc0e247eb4132ef5fd0b118719427d35d462fc
Author:     Stanislav Fomichev <sdf@fomichev.me>
AuthorDate: Tue Oct 15 15:56:43 2024 -0700
Commit:     Stanislav Fomichev <sdf@fomichev.me>
CommitDate: Wed Oct 16 13:13:42 2024 -0700

    selftests: ncdevmem: Set header split threshold to 0
    
    Needs to happen on BRCM to allow devmem to be attached.
    
    Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>

diff --git a/tools/testing/selftests/drivers/net/hw/ncdevmem.c b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
index 903dac3e61d5..6a94d52a6c43 100644
--- a/tools/testing/selftests/drivers/net/hw/ncdevmem.c
+++ b/tools/testing/selftests/drivers/net/hw/ncdevmem.c
@@ -322,6 +322,8 @@ static int configure_headersplit(bool on)
 	ethtool_rings_set_req_set_header_dev_index(req, ifindex);
 	/* 0 - off, 1 - auto, 2 - on */
 	ethtool_rings_set_req_set_tcp_data_split(req, on ? 2 : 0);
+	if (on)
+		ethtool_rings_set_req_set_tcp_data_split_thresh(req, 0);
 	ret = ethtool_rings_set(ys, req);
 	if (ret < 0)
 		fprintf(stderr, "YNL failed: %s\n", ys->err.msg);


commit ef5ba647bc94a19153c2c5cfc64ebe4cb86ac58d
Author:     Stanislav Fomichev <sdf@fomichev.me>
AuthorDate: Fri Oct 11 13:52:03 2024 -0700
Commit:     Stanislav Fomichev <sdf@fomichev.me>
CommitDate: Wed Oct 16 13:13:42 2024 -0700

    bnxt_en: support tx device memory
    
    The only change is to not unmap the frags on completions.
    
    Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 6e422e24750a..cb22707a35aa 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -692,7 +692,10 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			goto tx_dma_error;
 
 		tx_buf = &txr->tx_buf_ring[RING_TX(bp, prod)];
-		dma_unmap_addr_set(tx_buf, mapping, mapping);
+		if (netmem_is_net_iov(frag->netmem))
+			dma_unmap_addr_set(tx_buf, mapping, 0);
+		else
+			dma_unmap_addr_set(tx_buf, mapping, mapping);
 
 		txbd->tx_bd_haddr = cpu_to_le64(mapping);
 
@@ -749,9 +752,10 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	for (i = 0; i < last_frag; i++) {
 		prod = NEXT_TX(prod);
 		tx_buf = &txr->tx_buf_ring[RING_TX(bp, prod)];
-		dma_unmap_page(&pdev->dev, dma_unmap_addr(tx_buf, mapping),
-			       skb_frag_size(&skb_shinfo(skb)->frags[i]),
-			       DMA_TO_DEVICE);
+		if (dma_unmap_addr(tx_buf, mapping))
+			dma_unmap_page(&pdev->dev, dma_unmap_addr(tx_buf, mapping),
+				       skb_frag_size(&skb_shinfo(skb)->frags[i]),
+				       DMA_TO_DEVICE);
 	}
 
 tx_free:
@@ -821,11 +825,12 @@ static bool __bnxt_tx_int(struct bnxt *bp, struct bnxt_tx_ring_info *txr,
 		for (j = 0; j < last; j++) {
 			cons = NEXT_TX(cons);
 			tx_buf = &txr->tx_buf_ring[RING_TX(bp, cons)];
-			dma_unmap_page(
-				&pdev->dev,
-				dma_unmap_addr(tx_buf, mapping),
-				skb_frag_size(&skb_shinfo(skb)->frags[j]),
-				DMA_TO_DEVICE);
+			if (dma_unmap_addr(tx_buf, mapping))
+				dma_unmap_page(
+					&pdev->dev,
+					dma_unmap_addr(tx_buf, mapping),
+					skb_frag_size(&skb_shinfo(skb)->frags[j]),
+					DMA_TO_DEVICE);
 		}
 		if (unlikely(is_ts_pkt)) {
 			if (BNXT_CHIP_P5(bp)) {
@@ -3296,10 +3301,11 @@ static void bnxt_free_tx_skbs(struct bnxt *bp)
 				skb_frag_t *frag = &skb_shinfo(skb)->frags[k];
 
 				tx_buf = &txr->tx_buf_ring[ring_idx];
-				dma_unmap_page(
-					&pdev->dev,
-					dma_unmap_addr(tx_buf, mapping),
-					skb_frag_size(frag), DMA_TO_DEVICE);
+				if (dma_unmap_addr(tx_buf, mapping))
+					dma_unmap_page(
+						&pdev->dev,
+						dma_unmap_addr(tx_buf, mapping),
+						skb_frag_size(frag), DMA_TO_DEVICE);
 			}
 			dev_kfree_skb(skb);
 		}

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt
  2024-10-16 20:17 ` [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Stanislav Fomichev
@ 2024-10-17  8:58   ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-17  8:58 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: davem, kuba, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Thu, Oct 17, 2024 at 5:17 AM Stanislav Fomichev <stfomichev@gmail.com> wrote:

Hi Stanislav,
Thank you so much for the testing and improvements!

>
> On 10/03, Taehee Yoo wrote:
> > [...]
>
> FYI, I've tested your series with BCM57504 on top of [1] and [2] with
> a couple of patches to make ncdevmem.c and TX work (see below). [1]
> decouples ncdevmem from ethtool so we can flip header split settings
> without requiring recent ethtool. Both RX and TX work perfectly.
> Feel free to carry:
>
> Tested-by: Stanislav Fomichev <sdf@fomichev.me>

Thank you so much for your work!
I will try to test your TX-side patch before sending the v4 patch.

>
> Also feel free to take over the ncdevmem patch if my ncdevmem changes
> get pulled before your series.

Good, Thanks!

>
> 1: https://lore.kernel.org/netdev/20241009171252.2328284-1-sdf@fomichev.me/
> 2: https://lore.kernel.org/netdev/20240913150913.1280238-1-sdf@fomichev.me/
>
> [...]
Thanks a lot!
Taehee Yoo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-15 12:44                           ` Jason Gunthorpe
@ 2024-10-18  8:25                             ` Mina Almasry
  2024-10-19 13:55                               ` Taehee Yoo
  0 siblings, 1 reply; 73+ messages in thread
From: Mina Almasry @ 2024-10-18  8:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jakub Kicinski, Leon Romanovsky, Christian König,
	Samiullah Khawaja, Taehee Yoo, davem, pabeni, edumazet, netdev,
	linux-doc, donald.hunter, corbet, michael.chan, kory.maincent,
	andrew, maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Tue, Oct 15, 2024 at 3:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Oct 15, 2024 at 04:10:44AM +0300, Mina Almasry wrote:
> > On Tue, Oct 15, 2024 at 3:16 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Tue, 15 Oct 2024 01:38:20 +0300 Mina Almasry wrote:
> > > > Thanks Jason. In that case I agree with Jakub we should take in his change here:
> > > >
> > > > https://lore.kernel.org/netdev/20241009170102.1980ed1d@kernel.org/
> > > >
> > > > With this change the driver would delegate dma_sync_for_device to the
> > > > page_pool, and the page_pool will skip it altogether for the dma-buf
> > > > memory provider.
> > >
> > > And we need a wrapper for a sync for CPU which will skip if the page
> > > comes from an unreadable pool?
> >
> > This is where it gets a bit tricky, no?
> >
> > Our production code does a dma_sync_for_cpu but no
> > dma_sync_for_device. That has been working reliably for us with GPU
>
> Those functions are all NOP on systems you are testing on.
>

OK, thanks. This is what I wanted to confirm. If you already know this
here then there is no need to wait for me to confirm.

> The question is what is correct to do on systems where it is not a
> NOP, and none of this is really right, as I explained..
>
> > But if you or Jason think that enforcing the 'no dma_buf_sync_for_cpu'
> > now is critical, no problem. We can also provide this patch, and seek
> > to revert it or fix it up properly later in the event it turns out it
> > causes issues.
>
> What is important is you organize things going forward to be able to
> do this properly, which means the required sync type is dependent on
> the actual page being synced and you will eventually somehow learn
> which is required from the dmabuf.
>
> Most likely nobody will ever run this code on system where dma_sync is
> not a NOP, but we should still use the DMA API properly and things
> should make architectural sense.
>

Makes sense. OK, we can do what Jakub suggested in the thread earlier,
i.e. likely some wrapper that skips dma_sync_for_cpu if the
netmem is unreadable.
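Roughly, such a wrapper would just early-return for unreadable netmem.
A user-space mock of the idea (types, names, and the sync stand-in are
all illustrative; the eventual kernel helper will look different):

```c
#include <stdbool.h>

/* Mocked netmem: the only property we care about here is whether it is
 * unreadable device memory (net_iov-backed). */
struct netmem_mock {
	bool is_net_iov;
};

static int sync_calls;	/* counts how often the (mock) DMA sync runs */

static void dma_sync_single_for_cpu_mock(void)
{
	sync_calls++;
}

/* Wrapper sketch: skip the CPU sync entirely for unreadable netmem,
 * since the CPU cannot touch device memory anyway. */
static void netmem_sync_for_cpu(const struct netmem_mock *nm)
{
	if (nm->is_net_iov)
		return;
	dma_sync_single_for_cpu_mock();
}
```

The point of the wrapper is that drivers call one helper unconditionally
and the netmem layer decides whether a sync is meaningful.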

--
Thanks,
Mina

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp
  2024-10-18  8:25                             ` Mina Almasry
@ 2024-10-19 13:55                               ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-10-19 13:55 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Jason Gunthorpe, Jakub Kicinski, Leon Romanovsky,
	Christian König, Samiullah Khawaja, davem, pabeni, edumazet,
	netdev, linux-doc, donald.hunter, corbet, michael.chan,
	kory.maincent, andrew, maxime.chevallier, danieller, hengqi,
	ecree.xilinx, przemyslaw.kitszel, hkallweit1, ahmed.zaki,
	paul.greenwalt, rrameshbabu, idosch, asml.silence, kaiyuanz,
	willemb, aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Fri, Oct 18, 2024 at 5:25 PM Mina Almasry <almasrymina@google.com> wrote:
>
> [...]
> Makes sense. OK, we can do what Jakub suggested in the thread earlier.
> I.e. likely some wrapper which skips the dma_sync_for_cpu if the
> netmem is unreadable.
>

Thanks a lot for the confirmation.
I will pass the PP_FLAG_ALLOW_UNREADABLE_NETMEM flag
regardless of whether devmem TCP is enabled or disabled in the v4 patch.
The page_pool core logic will handle the flag properly.

I believe Mina is working on the patches for the page_pool changes,
so I will not include page_pool changes in the v4 patch.

If you think I missed something, please let me know :)

Thanks a lot!
Taehee Yoo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command
  2024-10-09 15:28       ` Jakub Kicinski
  2024-10-09 17:47         ` Taehee Yoo
@ 2024-10-31 17:34         ` Taehee Yoo
  2024-10-31 23:56           ` Jakub Kicinski
  1 sibling, 1 reply; 73+ messages in thread
From: Taehee Yoo @ 2024-10-31 17:34 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Thu, Oct 10, 2024 at 12:28 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 9 Oct 2024 22:54:17 +0900 Taehee Yoo wrote:
> > > This breaks previous behavior. The HDS reporting from get was
> > > introduced to signal to user space whether the page flip based
> > > TCP zero-copy (the one added some years ago not the recent one)
> > > will be usable with this NIC.
> > >
> > > When HW-GRO is enabled HDS will be working.
> > >
> > > I think that the driver should only track if the user has set the value
> > > to ENABLED (forced HDS), or to UNKNOWN (driver default). Setting the HDS
> > > to disabled is not useful, don't support it.
> >
> > Okay, I will remove the disable feature in the v4 patch.
> > Before this patch, hds_threshold was the rx-copybreak value.
> > Do you think hds_threshold should still follow the rx-copybreak value
> > if it is in UNKNOWN mode?
>
> IIUC the rx_copybreak only applies to the header? Or does it apply
> to the entire frame?
>
> If rx_copybreak applies to the entire frame and not just the first
> buffer (headers or headers+payload if not split) - no preference.
> If rx_copybreak only applies to the headers / first buffer then
> I'd keep them separate as they operate on a different length.
>
> > I think hds_threshold needs to follow the new tcp-data-split-thresh value
> > in ENABLED/UNKNOWN mode, making rx-copybreak a pure software feature.
>
> Sounds good to me, but just to be clear:
>
> If user sets the HDS enable to UNKNOWN (or doesn't set it):
>  - GET returns (current behavior, AFAIU):
>    - DISABLED (if HW-GRO is disabled and MTU is not Jumbo)
>    - ENABLED (if HW-GRO is enabled or MTU is Jumbo)
> If user sets the HDS enable to ENABLED (force HDS on):
>  - GET returns ENABLED

While writing the patch, I ran into an ambiguous problem here.
ethnl_set_rings() first calls .get_ringparam() to get the current config.
Then it calls .set_ringparam() after it copies the current config plus the
new config into the param structures.
bnxt_set_ringparam() may receive ETHTOOL_TCP_DATA_SPLIT_ENABLED
in two cases:
1. from the user
2. from bnxt_get_ringparam() because of UNKNOWN mode.
The problem is that bnxt_set_ringparam() can't distinguish them.
Here is the problem scenario:
1. tcp-data-split is in UNKNOWN mode.
2. HDS is automatically enabled because LRO or GRO is enabled.
3. The user changes a ring parameter with the following command:
`ethtool -G eth0 rx 1024`
4. ethnl_set_rings() calls .get_ringparam() to get the current config.
5. bnxt_get_ringparam() returns ENABLED for HDS because of UNKNOWN mode.
6. ethnl_set_rings() calls .set_ringparam() after filling the param with
the config that comes from .get_ringparam().
7. bnxt_set_ringparam() is passed ETHTOOL_TCP_DATA_SPLIT_ENABLED, but
the user didn't set it explicitly.
8. bnxt_set_ringparam() eventually force-enables tcp-data-split.

I haven't found a way to distinguish them so far.
I'm not sure whether this is acceptable.
Maybe we need to modify the scenario?

>
> hds_threshold returns: some value, but it's only actually used if GET
> returns ENABLED.
>
> > But if so, it changes the default behavior.
>
> How so? The configuration of neither of those two is exposed to
> the user. We can keep the same defaults, until user overrides them.
>
> > How do you think about it?
> >
> > >
> > > >       ering->tx_max_pending = BNXT_MAX_TX_DESC_CNT;
> > > >
> > > >       ering->rx_pending = bp->rx_ring_size;
> > > > @@ -854,9 +858,25 @@ static int bnxt_set_ringparam(struct net_device *dev,
> > > >           (ering->tx_pending < BNXT_MIN_TX_DESC_CNT))
> > > >               return -EINVAL;
> > > >
> > > > +     if (kernel_ering->tcp_data_split != ETHTOOL_TCP_DATA_SPLIT_DISABLED &&
> > > > +         BNXT_RX_PAGE_MODE(bp)) {
> > > > +             NL_SET_ERR_MSG_MOD(extack, "tcp-data-split can not be enabled with XDP");
> > > > +             return -EINVAL;
> > > > +     }
> > >
> > > Technically just if the XDP does not support multi-buffer.
> > > Any chance we could do this check in the core?
> >
> > I think we can access xdp_rxq_info with netdev_rx_queue structure.
> > However, xdp_rxq_info is not sufficient to determine whether multi-buffer
> > is supported by the driver. I think prog->aux->xdp_has_frags is required
> > to determine it correctly.
> > So, I think we need something more.
> > Do you have any idea?
>
> Take a look at dev_xdp_prog_count(), something like that but only
> counting non-mb progs?


* Re: [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command
  2024-10-31 17:34         ` Taehee Yoo
@ 2024-10-31 23:56           ` Jakub Kicinski
  2024-11-01 17:11             ` Taehee Yoo
  0 siblings, 1 reply; 73+ messages in thread
From: Jakub Kicinski @ 2024-10-31 23:56 UTC (permalink / raw)
  To: Taehee Yoo
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Fri, 1 Nov 2024 02:34:59 +0900 Taehee Yoo wrote:
> While writing the patch, I ran into an ambiguous problem here.
> ethnl_set_rings() first calls .get_ringparam() to get the current config.
> Then it calls .set_ringparam() after it copies the current config plus the
> new config into the param structures.
> bnxt_set_ringparam() may receive ETHTOOL_TCP_DATA_SPLIT_ENABLED
> in two cases:
> 1. from the user
> 2. from bnxt_get_ringparam() because of UNKNOWN mode.
> The problem is that bnxt_set_ringparam() can't distinguish them.
> Here is the problem scenario:
> 1. tcp-data-split is in UNKNOWN mode.
> 2. HDS is automatically enabled because LRO or GRO is enabled.
> 3. The user changes a ring parameter with the following command:
> `ethtool -G eth0 rx 1024`
> 4. ethnl_set_rings() calls .get_ringparam() to get the current config.
> 5. bnxt_get_ringparam() returns ENABLED for HDS because of UNKNOWN mode.
> 6. ethnl_set_rings() calls .set_ringparam() after filling the param with
> the config that comes from .get_ringparam().
> 7. bnxt_set_ringparam() is passed ETHTOOL_TCP_DATA_SPLIT_ENABLED, but
> the user didn't set it explicitly.
> 8. bnxt_set_ringparam() eventually force-enables tcp-data-split.
> 
> I haven't found a way to distinguish them so far.
> I'm not sure whether this is acceptable.
> Maybe we need to modify the scenario?

I thought we discussed this, but I may be misremembering.
You may need to record in the core whether the setting came 
from the user or not (similarly to IFF_RXFH_CONFIGURED).
User setting UNKNOWN would mean "reset".
Maybe I'm misunderstanding..


* Re: [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split ethtool command
  2024-10-31 23:56           ` Jakub Kicinski
@ 2024-11-01 17:11             ` Taehee Yoo
  0 siblings, 0 replies; 73+ messages in thread
From: Taehee Yoo @ 2024-11-01 17:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, pabeni, edumazet, almasrymina, netdev, linux-doc,
	donald.hunter, corbet, michael.chan, kory.maincent, andrew,
	maxime.chevallier, danieller, hengqi, ecree.xilinx,
	przemyslaw.kitszel, hkallweit1, ahmed.zaki, paul.greenwalt,
	rrameshbabu, idosch, asml.silence, kaiyuanz, willemb,
	aleksander.lobakin, dw, sridhar.samudrala, bcreeley

On Fri, Nov 1, 2024 at 8:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 1 Nov 2024 02:34:59 +0900 Taehee Yoo wrote:
> > While writing the patch, I ran into an ambiguous problem here.
> > ethnl_set_rings() first calls .get_ringparam() to get the current config.
> > Then it calls .set_ringparam() after it copies the current config plus the
> > new config into the param structures.
> > bnxt_set_ringparam() may receive ETHTOOL_TCP_DATA_SPLIT_ENABLED
> > in two cases:
> > 1. from the user
> > 2. from bnxt_get_ringparam() because of UNKNOWN mode.
> > The problem is that bnxt_set_ringparam() can't distinguish them.
> > Here is the problem scenario:
> > 1. tcp-data-split is in UNKNOWN mode.
> > 2. HDS is automatically enabled because LRO or GRO is enabled.
> > 3. The user changes a ring parameter with the following command:
> > `ethtool -G eth0 rx 1024`
> > 4. ethnl_set_rings() calls .get_ringparam() to get the current config.
> > 5. bnxt_get_ringparam() returns ENABLED for HDS because of UNKNOWN mode.
> > 6. ethnl_set_rings() calls .set_ringparam() after filling the param with
> > the config that comes from .get_ringparam().
> > 7. bnxt_set_ringparam() is passed ETHTOOL_TCP_DATA_SPLIT_ENABLED, but
> > the user didn't set it explicitly.
> > 8. bnxt_set_ringparam() eventually force-enables tcp-data-split.
> >
> > I haven't found a way to distinguish them so far.
> > I'm not sure whether this is acceptable.
> > Maybe we need to modify the scenario?
>
> I thought we discussed this, but I may be misremembering.
> You may need to record in the core whether the setting came
> from the user or not (similarly to IFF_RXFH_CONFIGURED).
> User setting UNKNOWN would mean "reset".
> Maybe I'm misunderstanding..

Thanks a lot for that!
I will try to add a new variable that indicates whether tcp-data-split was
set by the user. It would be tcp_data_split_mod in the
kernel_ethtool_ringparam structure.

Thanks a lot!
Taehee Yoo


end of thread, other threads:[~2024-11-01 17:11 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-03 16:06 [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Taehee Yoo
2024-10-03 16:06 ` [PATCH net-next v3 1/7] bnxt_en: add support for rx-copybreak ethtool command Taehee Yoo
2024-10-03 16:57   ` Brett Creeley
2024-10-03 17:15     ` Taehee Yoo
2024-10-03 17:13   ` Michael Chan
2024-10-03 17:22     ` Taehee Yoo
2024-10-03 17:43       ` Michael Chan
2024-10-03 18:28         ` Taehee Yoo
2024-10-03 18:34         ` Andrew Lunn
2024-10-05  6:29           ` Taehee Yoo
2024-10-08 18:10             ` Jakub Kicinski
2024-10-08 19:38               ` Michael Chan
2024-10-08 19:53                 ` Jakub Kicinski
2024-10-08 20:35                   ` Michael Chan
2024-10-03 16:06 ` [PATCH net-next v3 2/7] bnxt_en: add support for tcp-data-split " Taehee Yoo
2024-10-08 18:19   ` Jakub Kicinski
2024-10-09 13:54     ` Taehee Yoo
2024-10-09 15:28       ` Jakub Kicinski
2024-10-09 17:47         ` Taehee Yoo
2024-10-31 17:34         ` Taehee Yoo
2024-10-31 23:56           ` Jakub Kicinski
2024-11-01 17:11             ` Taehee Yoo
2024-10-03 16:06 ` [PATCH net-next v3 3/7] net: ethtool: add support for configuring tcp-data-split-thresh Taehee Yoo
2024-10-03 18:25   ` Mina Almasry
2024-10-03 19:33     ` Taehee Yoo
2024-10-04  1:47       ` Mina Almasry
2024-10-05  6:11         ` Taehee Yoo
2024-10-08 18:33   ` Jakub Kicinski
2024-10-09 14:25     ` Taehee Yoo
2024-10-09 15:46       ` Jakub Kicinski
2024-10-09 17:49         ` Taehee Yoo
2024-10-03 16:06 ` [PATCH net-next v3 4/7] bnxt_en: add support for tcp-data-split-thresh ethtool command Taehee Yoo
2024-10-03 18:13   ` Brett Creeley
2024-10-03 19:13     ` Taehee Yoo
2024-10-08 18:35   ` Jakub Kicinski
2024-10-09 14:31     ` Taehee Yoo
2024-10-03 16:06 ` [PATCH net-next v3 5/7] net: devmem: add ring parameter filtering Taehee Yoo
2024-10-03 18:29   ` Mina Almasry
2024-10-04  3:57     ` Taehee Yoo
2024-10-03 18:35   ` Brett Creeley
2024-10-03 18:49     ` Mina Almasry
2024-10-08 19:28       ` Jakub Kicinski
2024-10-09 14:35         ` Taehee Yoo
2024-10-04  4:01     ` Taehee Yoo
2024-10-03 16:06 ` [PATCH net-next v3 6/7] net: ethtool: " Taehee Yoo
2024-10-03 18:32   ` Mina Almasry
2024-10-03 19:35     ` Taehee Yoo
2024-10-03 16:06 ` [PATCH net-next v3 7/7] bnxt_en: add support for device memory tcp Taehee Yoo
2024-10-03 18:43   ` Mina Almasry
2024-10-04 10:34     ` Taehee Yoo
2024-10-08  2:57       ` David Wei
2024-10-09 15:02         ` Taehee Yoo
2024-10-08 19:50       ` Jakub Kicinski
2024-10-09 15:37         ` Taehee Yoo
2024-10-10  0:01           ` Jakub Kicinski
2024-10-10 17:44             ` Mina Almasry
2024-10-11  1:34               ` Jakub Kicinski
2024-10-11 17:33                 ` Mina Almasry
2024-10-11 23:42                   ` Jason Gunthorpe
2024-10-14 22:38                     ` Mina Almasry
2024-10-15  0:16                       ` Jakub Kicinski
2024-10-15  1:10                         ` Mina Almasry
2024-10-15 12:44                           ` Jason Gunthorpe
2024-10-18  8:25                             ` Mina Almasry
2024-10-19 13:55                               ` Taehee Yoo
2024-10-15 14:29                       ` Pavel Begunkov
2024-10-15 17:38                         ` David Wei
2024-10-05  3:48   ` kernel test robot
2024-10-08  2:45   ` David Wei
2024-10-08  3:54     ` Taehee Yoo
2024-10-08  3:58       ` Taehee Yoo
2024-10-16 20:17 ` [PATCH net-next v3 0/7] bnxt_en: implement device memory TCP for bnxt Stanislav Fomichev
2024-10-17  8:58   ` Taehee Yoo
