[PATCH net-next 0/7] ibmvnic RR performance improvements

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net-next 0/7] ibmvnic RR performance improvements
@ 2024-08-01 21:23 Nick Child
  2024-08-01 21:23 ` [PATCH net-next 1/7] ibmvnic: Only replenish rx pool when resources are getting low Nick Child
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Nick Child @ 2024-08-01 21:23 UTC (permalink / raw)
  To: netdev; +Cc: bjking1, haren, ricklind, Nick Child

Hello!

This patchset aims to increase the ibmvnic small packet request
response transaction rate.

When measuring transaction rate on several netperf tcp_rr connections,
a ~2x improvement can be observed! No regressions were seen when
performing other bw/latency tests.

The main points of improvement were from:
 - Patch 1 - request response tests will almost never fill a napi budget
             so wasting time replenishing every poll can be expensive
 - Patch 6 - Turns out that updating BQL completed bytes more than once per
             interrupt can be a really bad idea!

The final patch SHOULD be temporary, we are waiting for our FW teams to
clarify some documentation items. Within a few years that logic should
be replaced. In the meantime, it only effects non-GSO + CSO + !xmit_more
packets. There was no effect on performance from this patch.

Looking forward to any and all feedback!
Thanks,
  Nick Child

Nick Child (7):
  ibmvnic: Only replenish rx pool when resources are getting low
  ibmvnic: Use header len helper functions on tx
  ibmvnic: Reduce memcpys in tx descriptor generation
  ibmvnic: Remove duplicate memory barriers in tx
  ibmvnic: Introduce send sub-crq direct
  ibmvnic: Only record tx completed bytes once per handler
  ibmvnic: Perform tx CSO during send scrq direct

 drivers/net/ethernet/ibm/ibmvnic.c | 174 +++++++++++++++++------------
 1 file changed, 102 insertions(+), 72 deletions(-)

-- 
2.43.0

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH net-next 1/7] ibmvnic: Only replenish rx pool when resources are getting low
  2024-08-01 21:23 [PATCH net-next 0/7] ibmvnic RR performance improvements Nick Child
@ 2024-08-01 21:23 ` Nick Child
  2024-08-01 21:23 ` [PATCH net-next 2/7] ibmvnic: Use header len helper functions on tx Nick Child
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Nick Child @ 2024-08-01 21:23 UTC (permalink / raw)
  To: netdev; +Cc: bjking1, haren, ricklind, Nick Child

Previously, the driver would replenish the rx pool if the polling
function consumed less than the budget. The logic being that the driver
did not exhaust its budget so that must mean that the driver is not busy
and has cycles to spare for replenishing the pool.

So pool replenishment happens on every poll which did not consume
the budget. This can very costly during request-response tests.

In fact, an extra ~100pps can be seen in TCP_RR_150 tests when we remove
this conditional. Trace results (ftrace, graph-time=1) for the poll
function are below:
Previous results:
    ibmvnic_poll = 64951846.0 us / 4167628.0 hits = AVG 15.58
    replenish_rx_pool = 17602846.0 us / 4710437.0 hits = AVG 3.74
Now:
    ibmvnic_poll = 57673941.0 us / 4791737.0 hits = AVG 12.04
    replenish_rx_pool = 3938171.6 us / 4314.0 hits = AVG 912.88

While the replenish function takes longer, it is hit less frequently
meaning the ibmvnic_poll function, on average, is faster.

Furthermore, this change does not have a negative effect on
performance bandwidth/latency measurements.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 23ebeb143987..857d585bd229 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3527,9 +3527,8 @@ static int ibmvnic_poll(struct napi_struct *napi, int budget)
 	}
 
 	if (adapter->state != VNIC_CLOSING &&
-	    ((atomic_read(&adapter->rx_pool[scrq_num].available) <
-	      adapter->req_rx_add_entries_per_subcrq / 2) ||
-	      frames_processed < budget))
+	    (atomic_read(&adapter->rx_pool[scrq_num].available) <
+	      adapter->req_rx_add_entries_per_subcrq / 2))
 		replenish_rx_pool(adapter, &adapter->rx_pool[scrq_num]);
 	if (frames_processed < budget) {
 		if (napi_complete_done(napi, frames_processed)) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net-next 2/7] ibmvnic: Use header len helper functions on tx
  2024-08-01 21:23 [PATCH net-next 0/7] ibmvnic RR performance improvements Nick Child
  2024-08-01 21:23 ` [PATCH net-next 1/7] ibmvnic: Only replenish rx pool when resources are getting low Nick Child
@ 2024-08-01 21:23 ` Nick Child
  2024-08-01 21:23 ` [PATCH net-next 3/7] ibmvnic: Reduce memcpys in tx descriptor generation Nick Child
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Nick Child @ 2024-08-01 21:23 UTC (permalink / raw)
  To: netdev; +Cc: bjking1, haren, ricklind, Nick Child

Use the header length helper functions rather than trying to calculate
it within the driver. There are defined functions for mac and network
headers (skb_mac_header_len and skb_network_header_len) but no such
function exists for the transport header length.

Also, hdr_data was memset during allocation to all 0's so no need to
memset again.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 857d585bd229..7d552d4bbe15 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2156,36 +2156,28 @@ static int build_hdr_data(u8 hdr_field, struct sk_buff *skb,
 	int len = 0;
 	u8 *hdr;
 
-	if (skb_vlan_tagged(skb) && !skb_vlan_tag_present(skb))
-		hdr_len[0] = sizeof(struct vlan_ethhdr);
-	else
-		hdr_len[0] = sizeof(struct ethhdr);
 
 	if (skb->protocol == htons(ETH_P_IP)) {
-		hdr_len[1] = ip_hdr(skb)->ihl * 4;
 		if (ip_hdr(skb)->protocol == IPPROTO_TCP)
 			hdr_len[2] = tcp_hdrlen(skb);
 		else if (ip_hdr(skb)->protocol == IPPROTO_UDP)
 			hdr_len[2] = sizeof(struct udphdr);
 	} else if (skb->protocol == htons(ETH_P_IPV6)) {
-		hdr_len[1] = sizeof(struct ipv6hdr);
 		if (ipv6_hdr(skb)->nexthdr == IPPROTO_TCP)
 			hdr_len[2] = tcp_hdrlen(skb);
 		else if (ipv6_hdr(skb)->nexthdr == IPPROTO_UDP)
 			hdr_len[2] = sizeof(struct udphdr);
-	} else if (skb->protocol == htons(ETH_P_ARP)) {
-		hdr_len[1] = arp_hdr_len(skb->dev);
-		hdr_len[2] = 0;
 	}
 
-	memset(hdr_data, 0, 120);
 	if ((hdr_field >> 6) & 1) {
+		hdr_len[0] = skb_mac_header_len(skb);
 		hdr = skb_mac_header(skb);
 		memcpy(hdr_data, hdr, hdr_len[0]);
 		len += hdr_len[0];
 	}
 
 	if ((hdr_field >> 5) & 1) {
+		hdr_len[1] = skb_network_header_len(skb);
 		hdr = skb_network_header(skb);
 		memcpy(hdr_data + len, hdr, hdr_len[1]);
 		len += hdr_len[1];
@@ -2196,6 +2188,7 @@ static int build_hdr_data(u8 hdr_field, struct sk_buff *skb,
 		memcpy(hdr_data + len, hdr, hdr_len[2]);
 		len += hdr_len[2];
 	}
+
 	return len;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net-next 3/7] ibmvnic: Reduce memcpys in tx descriptor generation
  2024-08-01 21:23 [PATCH net-next 0/7] ibmvnic RR performance improvements Nick Child
  2024-08-01 21:23 ` [PATCH net-next 1/7] ibmvnic: Only replenish rx pool when resources are getting low Nick Child
  2024-08-01 21:23 ` [PATCH net-next 2/7] ibmvnic: Use header len helper functions on tx Nick Child
@ 2024-08-01 21:23 ` Nick Child
  2024-08-01 21:23 ` [PATCH net-next 4/7] ibmvnic: Remove duplicate memory barriers in tx Nick Child
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Nick Child @ 2024-08-01 21:23 UTC (permalink / raw)
  To: netdev; +Cc: bjking1, haren, ricklind, Nick Child

Previously when creating the header descriptors, the driver would:
1. allocate a temporary buffer on the stack (in build_hdr_descs_arr)
2. memcpy the header info into the temporary buffer (in build_hdr_data)
3. memcpy the temp buffer into a local variable (in create_hdr_descs)
4. copy the local variable into the return buffer (in create_hdr_descs)

Since, there is no opportunity for errors during this process, the temp
buffer is not needed and work can be done on the return buffer directly.

Repurpose build_hdr_data() to only calculate the header lengths. Rename
it to get_hdr_lens().
Edit create_hdr_descs() to read from the skb directly and copy directly
into the returned useful buffer.

The process now involves less memory and write operations while
also being more readable.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 80 +++++++++++++-----------------
 1 file changed, 34 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 7d552d4bbe15..4fe2c8c17b05 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2150,46 +2150,38 @@ static int ibmvnic_close(struct net_device *netdev)
  * Builds a buffer containing these headers.  Saves individual header
  * lengths and total buffer length to be used to build descriptors.
  */
-static int build_hdr_data(u8 hdr_field, struct sk_buff *skb,
-			  int *hdr_len, u8 *hdr_data)
+static int get_hdr_lens(u8 hdr_field, struct sk_buff *skb,
+			int *hdr_len)
 {
 	int len = 0;
-	u8 *hdr;
 
 
-	if (skb->protocol == htons(ETH_P_IP)) {
-		if (ip_hdr(skb)->protocol == IPPROTO_TCP)
-			hdr_len[2] = tcp_hdrlen(skb);
-		else if (ip_hdr(skb)->protocol == IPPROTO_UDP)
-			hdr_len[2] = sizeof(struct udphdr);
-	} else if (skb->protocol == htons(ETH_P_IPV6)) {
-		if (ipv6_hdr(skb)->nexthdr == IPPROTO_TCP)
-			hdr_len[2] = tcp_hdrlen(skb);
-		else if (ipv6_hdr(skb)->nexthdr == IPPROTO_UDP)
-			hdr_len[2] = sizeof(struct udphdr);
-	}
-
 	if ((hdr_field >> 6) & 1) {
 		hdr_len[0] = skb_mac_header_len(skb);
-		hdr = skb_mac_header(skb);
-		memcpy(hdr_data, hdr, hdr_len[0]);
 		len += hdr_len[0];
 	}
 
 	if ((hdr_field >> 5) & 1) {
 		hdr_len[1] = skb_network_header_len(skb);
-		hdr = skb_network_header(skb);
-		memcpy(hdr_data + len, hdr, hdr_len[1]);
 		len += hdr_len[1];
 	}
 
-	if ((hdr_field >> 4) & 1) {
-		hdr = skb_transport_header(skb);
-		memcpy(hdr_data + len, hdr, hdr_len[2]);
-		len += hdr_len[2];
+	if (!((hdr_field >> 4) & 1))
+		return len;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		if (ip_hdr(skb)->protocol == IPPROTO_TCP)
+			hdr_len[2] = tcp_hdrlen(skb);
+		else if (ip_hdr(skb)->protocol == IPPROTO_UDP)
+			hdr_len[2] = sizeof(struct udphdr);
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		if (ipv6_hdr(skb)->nexthdr == IPPROTO_TCP)
+			hdr_len[2] = tcp_hdrlen(skb);
+		else if (ipv6_hdr(skb)->nexthdr == IPPROTO_UDP)
+			hdr_len[2] = sizeof(struct udphdr);
 	}
 
-	return len;
+	return len + hdr_len[2];
 }
 
 /**
@@ -2207,7 +2199,7 @@ static int build_hdr_data(u8 hdr_field, struct sk_buff *skb,
 static int create_hdr_descs(u8 hdr_field, u8 *hdr_data, int len, int *hdr_len,
 			    union sub_crq *scrq_arr)
 {
-	union sub_crq hdr_desc;
+	union sub_crq *hdr_desc;
 	int tmp_len = len;
 	int num_descs = 0;
 	u8 *data, *cur;
@@ -2216,28 +2208,26 @@ static int create_hdr_descs(u8 hdr_field, u8 *hdr_data, int len, int *hdr_len,
 	while (tmp_len > 0) {
 		cur = hdr_data + len - tmp_len;
 
-		memset(&hdr_desc, 0, sizeof(hdr_desc));
-		if (cur != hdr_data) {
-			data = hdr_desc.hdr_ext.data;
+		hdr_desc = &scrq_arr[num_descs];
+		if (num_descs) {
+			data = hdr_desc->hdr_ext.data;
 			tmp = tmp_len > 29 ? 29 : tmp_len;
-			hdr_desc.hdr_ext.first = IBMVNIC_CRQ_CMD;
-			hdr_desc.hdr_ext.type = IBMVNIC_HDR_EXT_DESC;
-			hdr_desc.hdr_ext.len = tmp;
+			hdr_desc->hdr_ext.first = IBMVNIC_CRQ_CMD;
+			hdr_desc->hdr_ext.type = IBMVNIC_HDR_EXT_DESC;
+			hdr_desc->hdr_ext.len = tmp;
 		} else {
-			data = hdr_desc.hdr.data;
+			data = hdr_desc->hdr.data;
 			tmp = tmp_len > 24 ? 24 : tmp_len;
-			hdr_desc.hdr.first = IBMVNIC_CRQ_CMD;
-			hdr_desc.hdr.type = IBMVNIC_HDR_DESC;
-			hdr_desc.hdr.len = tmp;
-			hdr_desc.hdr.l2_len = (u8)hdr_len[0];
-			hdr_desc.hdr.l3_len = cpu_to_be16((u16)hdr_len[1]);
-			hdr_desc.hdr.l4_len = (u8)hdr_len[2];
-			hdr_desc.hdr.flag = hdr_field << 1;
+			hdr_desc->hdr.first = IBMVNIC_CRQ_CMD;
+			hdr_desc->hdr.type = IBMVNIC_HDR_DESC;
+			hdr_desc->hdr.len = tmp;
+			hdr_desc->hdr.l2_len = (u8)hdr_len[0];
+			hdr_desc->hdr.l3_len = cpu_to_be16((u16)hdr_len[1]);
+			hdr_desc->hdr.l4_len = (u8)hdr_len[2];
+			hdr_desc->hdr.flag = hdr_field << 1;
 		}
 		memcpy(data, cur, tmp);
 		tmp_len -= tmp;
-		*scrq_arr = hdr_desc;
-		scrq_arr++;
 		num_descs++;
 	}
 
@@ -2260,13 +2250,11 @@ static void build_hdr_descs_arr(struct sk_buff *skb,
 				int *num_entries, u8 hdr_field)
 {
 	int hdr_len[3] = {0, 0, 0};
-	u8 hdr_data[140] = {0};
 	int tot_len;
 
-	tot_len = build_hdr_data(hdr_field, skb, hdr_len,
-				 hdr_data);
-	*num_entries += create_hdr_descs(hdr_field, hdr_data, tot_len, hdr_len,
-					 indir_arr + 1);
+	tot_len = get_hdr_lens(hdr_field, skb, hdr_len);
+	*num_entries += create_hdr_descs(hdr_field, skb_mac_header(skb),
+					 tot_len, hdr_len, indir_arr + 1);
 }
 
 static int ibmvnic_xmit_workarounds(struct sk_buff *skb,
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net-next 4/7] ibmvnic: Remove duplicate memory barriers in tx
  2024-08-01 21:23 [PATCH net-next 0/7] ibmvnic RR performance improvements Nick Child
                   ` (2 preceding siblings ...)
  2024-08-01 21:23 ` [PATCH net-next 3/7] ibmvnic: Reduce memcpys in tx descriptor generation Nick Child
@ 2024-08-01 21:23 ` Nick Child
  2024-08-01 21:23 ` [PATCH net-next 5/7] ibmvnic: Introduce send sub-crq direct Nick Child
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Nick Child @ 2024-08-01 21:23 UTC (permalink / raw)
  To: netdev; +Cc: bjking1, haren, ricklind, Nick Child

send_subcrq_[in]direct() already has a dma memory barrier.
Remove the earlier one.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 4fe2c8c17b05..533e79a0c6ac 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2456,9 +2456,6 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 		skb_copy_from_linear_data(skb, dst, skb->len);
 	}
 
-	/* post changes to long_term_buff *dst before VIOS accessing it */
-	dma_wmb();
-
 	tx_pool->consumer_index =
 	    (tx_pool->consumer_index + 1) % tx_pool->num_buffers;
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net-next 5/7] ibmvnic: Introduce send sub-crq direct
  2024-08-01 21:23 [PATCH net-next 0/7] ibmvnic RR performance improvements Nick Child
                   ` (3 preceding siblings ...)
  2024-08-01 21:23 ` [PATCH net-next 4/7] ibmvnic: Remove duplicate memory barriers in tx Nick Child
@ 2024-08-01 21:23 ` Nick Child
  2024-08-01 21:23 ` [PATCH net-next 6/7] ibmvnic: Only record tx completed bytes once per handler Nick Child
  2024-08-01 21:23 ` [PATCH net-next 7/7] ibmvnic: Perform tx CSO during send scrq direct Nick Child
  6 siblings, 0 replies; 11+ messages in thread
From: Nick Child @ 2024-08-01 21:23 UTC (permalink / raw)
  To: netdev; +Cc: bjking1, haren, ricklind, Nick Child

Firmware supports two hcalls to send a sub-crq request:
H_SEND_SUB_CRQ_INDIRECT and H_SEND_SUB_CRQ. The indirect hcall allows
for submission of batched messages while the other hcall is limited to
only one message. This protocol is defined in PAPR section 17.2.3.3.

Previously, the ibmvnic xmit function only used the indirect hcall. This
allowed the driver to batch it's skbs. A single skb can occupy a few
entries per hcall depending on if FW requires skb header information or
not. The FW only needs header information if the packet is segmented.

By this logic, if an skb is not GSO then it can fit in one sub-crq
message and therefore is a candidate for H_SEND_SUB_CRQ.
Batching skb transmission is only useful when there are more packets
coming down the line (ie netdev_xmit_more is true).

As it turns out, H_SEND_SUB_CRQ induces less latency than
H_SEND_SUB_CRQ_INDIRECT. Therefore, use H_SEND_SUB_CRQ where
appropriate.

Small latency gains seen when doing TCP_RR_150 (request/response
workload). Ftrace results (graph-time=1):
  Previous:
     ibmvnic_xmit = 29618270.83 us / 8860058.0 hits = AVG 3.34
     ibmvnic_tx_scrq_flush = 21972231.02 us / 6553972.0 hits = AVG 3.35
  Now:
     ibmvnic_xmit = 22153350.96 us / 8438942.0 hits = AVG 2.63
     ibmvnic_tx_scrq_flush = 15858922.4 us / 6244076.0 hits = AVG 2.54

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 52 ++++++++++++++++++++++++++----
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 533e79a0c6ac..c9aa276507bb 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -117,6 +117,7 @@ static void free_long_term_buff(struct ibmvnic_adapter *adapter,
 				struct ibmvnic_long_term_buff *ltb);
 static void ibmvnic_disable_irqs(struct ibmvnic_adapter *adapter);
 static void flush_reset_queue(struct ibmvnic_adapter *adapter);
+static void print_subcrq_error(struct device *dev, int rc, const char *func);
 
 struct ibmvnic_stat {
 	char name[ETH_GSTRING_LEN];
@@ -2331,8 +2332,29 @@ static void ibmvnic_tx_scrq_clean_buffer(struct ibmvnic_adapter *adapter,
 	}
 }
 
+static int send_subcrq_direct(struct ibmvnic_adapter *adapter,
+			      u64 remote_handle, u64 *entry)
+{
+	unsigned int ua = adapter->vdev->unit_address;
+	struct device *dev = &adapter->vdev->dev;
+	int rc;
+
+	/* Make sure the hypervisor sees the complete request */
+	dma_wmb();
+	rc = plpar_hcall_norets(H_SEND_SUB_CRQ, ua,
+				cpu_to_be64(remote_handle),
+				cpu_to_be64(entry[0]), cpu_to_be64(entry[1]),
+				cpu_to_be64(entry[2]), cpu_to_be64(entry[3]));
+
+	if (rc)
+		print_subcrq_error(dev, rc, __func__);
+
+	return rc;
+}
+
 static int ibmvnic_tx_scrq_flush(struct ibmvnic_adapter *adapter,
-				 struct ibmvnic_sub_crq_queue *tx_scrq)
+				 struct ibmvnic_sub_crq_queue *tx_scrq,
+				 bool indirect)
 {
 	struct ibmvnic_ind_xmit_queue *ind_bufp;
 	u64 dma_addr;
@@ -2347,7 +2369,13 @@ static int ibmvnic_tx_scrq_flush(struct ibmvnic_adapter *adapter,
 
 	if (!entries)
 		return 0;
-	rc = send_subcrq_indirect(adapter, handle, dma_addr, entries);
+
+	if (indirect)
+		rc = send_subcrq_indirect(adapter, handle, dma_addr, entries);
+	else
+		rc = send_subcrq_direct(adapter, handle,
+					(u64 *)ind_bufp->indir_arr);
+
 	if (rc)
 		ibmvnic_tx_scrq_clean_buffer(adapter, tx_scrq);
 	else
@@ -2405,7 +2433,7 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 		tx_dropped++;
 		tx_send_failed++;
 		ret = NETDEV_TX_OK;
-		lpar_rc = ibmvnic_tx_scrq_flush(adapter, tx_scrq);
+		lpar_rc = ibmvnic_tx_scrq_flush(adapter, tx_scrq, true);
 		if (lpar_rc != H_SUCCESS)
 			goto tx_err;
 		goto out;
@@ -2423,7 +2451,7 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 		tx_send_failed++;
 		tx_dropped++;
 		ret = NETDEV_TX_OK;
-		lpar_rc = ibmvnic_tx_scrq_flush(adapter, tx_scrq);
+		lpar_rc = ibmvnic_tx_scrq_flush(adapter, tx_scrq, true);
 		if (lpar_rc != H_SUCCESS)
 			goto tx_err;
 		goto out;
@@ -2518,6 +2546,16 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 		tx_crq.v1.flags1 |= IBMVNIC_TX_LSO;
 		tx_crq.v1.mss = cpu_to_be16(skb_shinfo(skb)->gso_size);
 		hdrs += 2;
+	} else if (!ind_bufp->index && !netdev_xmit_more()) {
+		ind_bufp->indir_arr[0] = tx_crq;
+		ind_bufp->index = 1;
+		tx_buff->num_entries = 1;
+		netdev_tx_sent_queue(txq, skb->len);
+		lpar_rc = ibmvnic_tx_scrq_flush(adapter, tx_scrq, false);
+		if (lpar_rc != H_SUCCESS)
+			goto tx_err;
+
+		goto early_exit;
 	}
 
 	if ((*hdrs >> 7) & 1)
@@ -2527,7 +2565,7 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 	tx_buff->num_entries = num_entries;
 	/* flush buffer if current entry can not fit */
 	if (num_entries + ind_bufp->index > IBMVNIC_MAX_IND_DESCS) {
-		lpar_rc = ibmvnic_tx_scrq_flush(adapter, tx_scrq);
+		lpar_rc = ibmvnic_tx_scrq_flush(adapter, tx_scrq, true);
 		if (lpar_rc != H_SUCCESS)
 			goto tx_flush_err;
 	}
@@ -2535,15 +2573,17 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 	indir_arr[0] = tx_crq;
 	memcpy(&ind_bufp->indir_arr[ind_bufp->index], &indir_arr[0],
 	       num_entries * sizeof(struct ibmvnic_generic_scrq));
+
 	ind_bufp->index += num_entries;
 	if (__netdev_tx_sent_queue(txq, skb->len,
 				   netdev_xmit_more() &&
 				   ind_bufp->index < IBMVNIC_MAX_IND_DESCS)) {
-		lpar_rc = ibmvnic_tx_scrq_flush(adapter, tx_scrq);
+		lpar_rc = ibmvnic_tx_scrq_flush(adapter, tx_scrq, true);
 		if (lpar_rc != H_SUCCESS)
 			goto tx_err;
 	}
 
+early_exit:
 	if (atomic_add_return(num_entries, &tx_scrq->used)
 					>= adapter->req_tx_entries_per_subcrq) {
 		netdev_dbg(netdev, "Stopping queue %d\n", queue_num);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net-next 6/7] ibmvnic: Only record tx completed bytes once per handler
  2024-08-01 21:23 [PATCH net-next 0/7] ibmvnic RR performance improvements Nick Child
                   ` (4 preceding siblings ...)
  2024-08-01 21:23 ` [PATCH net-next 5/7] ibmvnic: Introduce send sub-crq direct Nick Child
@ 2024-08-01 21:23 ` Nick Child
  2024-08-01 21:23 ` [PATCH net-next 7/7] ibmvnic: Perform tx CSO during send scrq direct Nick Child
  6 siblings, 0 replies; 11+ messages in thread
From: Nick Child @ 2024-08-01 21:23 UTC (permalink / raw)
  To: netdev; +Cc: bjking1, haren, ricklind, Nick Child

Byte Queue Limits depends on dql_completed being called once per tx
completion round in order to adjust its algorithm appropriately. The
dql->limit value is an approximation of the amount of bytes that the NIC
can consume per irq interval. If this approximation is too high then the
NIC will become over-saturated. Too low and the NIC will starve.

The dql->limit depends on dql->prev-* stats to calculate an optimal
value. If dql_completed() is called more than once per irq handler then
those prev-* values become unreliable (because they are not an accurate
representation of the previous state of the NIC) resulting in a
sub-optimal limit value.

Therefore, move the call to netdev_tx_completed_queue() to the end of
ibmvnic_complete_tx().

When performing 150 sessions of TCP rr (request-response 1 byte packets)
workloads, one could observe:
  PREVIOUSLY: - limit and inflight values hovering around 130
              - transaction rate of around 750k pps.

  NOW:        - limit rises and falls in response to inflight (130-900)
              - transaction rate of around 1M pps (33% improvement)

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index c9aa276507bb..05c0d68c3efa 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -4186,20 +4186,17 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter *adapter,
 			       struct ibmvnic_sub_crq_queue *scrq)
 {
 	struct device *dev = &adapter->vdev->dev;
+	int num_packets = 0, total_bytes = 0;
 	struct ibmvnic_tx_pool *tx_pool;
 	struct ibmvnic_tx_buff *txbuff;
 	struct netdev_queue *txq;
 	union sub_crq *next;
-	int index;
-	int i;
+	int index, i;
 
 restart_loop:
 	while (pending_scrq(adapter, scrq)) {
 		unsigned int pool = scrq->pool_index;
 		int num_entries = 0;
-		int total_bytes = 0;
-		int num_packets = 0;
-
 		next = ibmvnic_next_scrq(adapter, scrq);
 		for (i = 0; i < next->tx_comp.num_comps; i++) {
 			index = be32_to_cpu(next->tx_comp.correlators[i]);
@@ -4235,8 +4232,6 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter *adapter,
 		/* remove tx_comp scrq*/
 		next->tx_comp.first = 0;
 
-		txq = netdev_get_tx_queue(adapter->netdev, scrq->pool_index);
-		netdev_tx_completed_queue(txq, num_packets, total_bytes);
 
 		if (atomic_sub_return(num_entries, &scrq->used) <=
 		    (adapter->req_tx_entries_per_subcrq / 2) &&
@@ -4261,6 +4256,9 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter *adapter,
 		goto restart_loop;
 	}
 
+	txq = netdev_get_tx_queue(adapter->netdev, scrq->pool_index);
+	netdev_tx_completed_queue(txq, num_packets, total_bytes);
+
 	return 0;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH net-next 7/7] ibmvnic: Perform tx CSO during send scrq direct
  2024-08-01 21:23 [PATCH net-next 0/7] ibmvnic RR performance improvements Nick Child
                   ` (5 preceding siblings ...)
  2024-08-01 21:23 ` [PATCH net-next 6/7] ibmvnic: Only record tx completed bytes once per handler Nick Child
@ 2024-08-01 21:23 ` Nick Child
  2024-08-03  0:15   ` Jakub Kicinski
  6 siblings, 1 reply; 11+ messages in thread
From: Nick Child @ 2024-08-01 21:23 UTC (permalink / raw)
  To: netdev; +Cc: bjking1, haren, ricklind, Nick Child

During initialization with the vnic server, a bitstring is communicated
to the client regarding header info needed during CSO (See "VNIC
Capabilities" in PAPR). Most of the time, to be safe, vnic server
requests header info for CSO. When header info is needed, multiple TX
descriptors are required per skb; This limits the driver to use
send_subcrq_indirect instead of send_subcrq_direct.

Previously, the vnic server request for header info was ignored. This
allowed the use of send_sub_crq_direct. Transmitions were successful
because the bitstring returned by vnic server is very broad and over
cautionary. It was observed that mlx backing devices could actually
transmit and handle CSO packets without the vnic server receiving
header info (despite the fact that the bitstring requested it).

There was a trust issue: The bitstring was overcautionary. This extra
precaution (requesting header info when the backing device may not use
it) comes at the cost of performance (using direct vs indirect hcalls
has a 30% delta in small packet RR transaction rate). So it has been
requested that the vnic server team tries to ensure that the bitstring
is more exact. In the meantime, disable CSO when it is possible to use
the skb in the send_subcrq_direct path. In other words, calculate the
checksum before handing the packet to FW when the packet is not
segmented and xmit_more is false.

When the bitstring ever specifies that CSO does not require headers
(dependent on VIOS vnic server changes), then this patch should be
removed and replaced with one that investigates the bitstring before
using send_subcrq_direct.

Signed-off-by: Nick Child <nnac123@linux.ibm.com>
---
 drivers/net/ethernet/ibm/ibmvnic.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 05c0d68c3efa..1990d518f247 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2406,6 +2406,7 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 	unsigned long lpar_rc;
 	union sub_crq tx_crq;
 	unsigned int offset;
+	bool use_scrq_send_direct = false;
 	int num_entries = 1;
 	unsigned char *dst;
 	int bufidx = 0;
@@ -2465,6 +2466,18 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 	memset(dst, 0, tx_pool->buf_size);
 	data_dma_addr = ltb->addr + offset;

+	/* if we are going to send_subcrq_direct this then we need to
+	 * update the checksum before copying the data into ltb. Essentially
+	 * these packets force disable CSO so that we can guarantee that
+	 * FW does not need header info and we can send direct.
+	 */
+	if (!skb_is_gso(skb) && !ind_bufp->index && !netdev_xmit_more()) {
+		use_scrq_send_direct = true;
+		if (skb->ip_summed == CHECKSUM_PARTIAL &&
+		    skb_checksum_help(skb))
+			use_scrq_send_direct = false;
+	}
+
 	if (skb_shinfo(skb)->nr_frags) {
 		int cur, i;

@@ -2546,11 +2559,13 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 		tx_crq.v1.flags1 |= IBMVNIC_TX_LSO;
 		tx_crq.v1.mss = cpu_to_be16(skb_shinfo(skb)->gso_size);
 		hdrs += 2;
-	} else if (!ind_bufp->index && !netdev_xmit_more()) {
-		ind_bufp->indir_arr[0] = tx_crq;
+	} else if (use_scrq_send_direct) {
+		/* See above comment, CSO disabled with direct xmit */
+		tx_crq.v1.flags1 &= ~(IBMVNIC_TX_CHKSUM_OFFLOAD);
 		ind_bufp->index = 1;
 		tx_buff->num_entries = 1;
 		netdev_tx_sent_queue(txq, skb->len);
+		ind_bufp->indir_arr[0] = tx_crq;
 		lpar_rc = ibmvnic_tx_scrq_flush(adapter, tx_scrq, false);
 		if (lpar_rc != H_SUCCESS)
 			goto tx_err;
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next 7/7] ibmvnic: Perform tx CSO during send scrq direct
  2024-08-01 21:23 ` [PATCH net-next 7/7] ibmvnic: Perform tx CSO during send scrq direct Nick Child
@ 2024-08-03  0:15   ` Jakub Kicinski
  2024-08-05 13:52     ` Nick Child
  0 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2024-08-03  0:15 UTC (permalink / raw)
  To: Nick Child; +Cc: netdev, bjking1, haren, ricklind

On Thu,  1 Aug 2024 16:23:40 -0500 Nick Child wrote:
> This extra
> precaution (requesting header info when the backing device may not use
> it) comes at the cost of performance (using direct vs indirect hcalls
> has a 30% delta in small packet RR transaction rate). 

What's "small" in this case? Non-GSO, or also less than MTU?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next 7/7] ibmvnic: Perform tx CSO during send scrq direct
  2024-08-03  0:15   ` Jakub Kicinski
@ 2024-08-05 13:52     ` Nick Child
  2024-08-05 19:09       ` Jakub Kicinski
  0 siblings, 1 reply; 11+ messages in thread
From: Nick Child @ 2024-08-05 13:52 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, bjking1, haren, ricklind



On 8/2/24 19:15, Jakub Kicinski wrote:
> On Thu,  1 Aug 2024 16:23:40 -0500 Nick Child wrote:
>> This extra
>> precaution (requesting header info when the backing device may not use
>> it) comes at the cost of performance (using direct vs indirect hcalls
>> has a 30% delta in small packet RR transaction rate).
> 
> What's "small" in this case? Non-GSO, or also less than MTU?

I suppose "non-GSO" is the proper term. If a packet is non-GSO
then we are able to use the direct hcall. On the other hand,
if a packet is GSO then indirect must be used, we do not have the option 
of direct vs indirect.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next 7/7] ibmvnic: Perform tx CSO during send scrq direct
  2024-08-05 13:52     ` Nick Child
@ 2024-08-05 19:09       ` Jakub Kicinski
  0 siblings, 0 replies; 11+ messages in thread
From: Jakub Kicinski @ 2024-08-05 19:09 UTC (permalink / raw)
  To: Nick Child; +Cc: netdev, bjking1, haren, ricklind

On Mon, 5 Aug 2024 08:52:57 -0500 Nick Child wrote:
> On 8/2/24 19:15, Jakub Kicinski wrote:
> > On Thu,  1 Aug 2024 16:23:40 -0500 Nick Child wrote:  
> >> This extra
> >> precaution (requesting header info when the backing device may not use
> >> it) comes at the cost of performance (using direct vs indirect hcalls
> >> has a 30% delta in small packet RR transaction rate).  
> > 
> > What's "small" in this case? Non-GSO, or also less than MTU?  
> 
> I suppose "non-GSO" is the proper term. If a packet is non-GSO
> then we are able to use the direct hcall. On the other hand,
> if a packet is GSO then indirect must be used, we do not have the option 
> of direct vs indirect.

It'd be great to add more exact analysis to the commit message.
Presumably the change is most likely to cause trouble in combination
with large non-GSO frames. Could you measure the perf impact when TSO
is disabled and MTU is 9k?

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-08-05 19:10 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-01 21:23 [PATCH net-next 0/7] ibmvnic RR performance improvements Nick Child
2024-08-01 21:23 ` [PATCH net-next 1/7] ibmvnic: Only replenish rx pool when resources are getting low Nick Child
2024-08-01 21:23 ` [PATCH net-next 2/7] ibmvnic: Use header len helper functions on tx Nick Child
2024-08-01 21:23 ` [PATCH net-next 3/7] ibmvnic: Reduce memcpys in tx descriptor generation Nick Child
2024-08-01 21:23 ` [PATCH net-next 4/7] ibmvnic: Remove duplicate memory barriers in tx Nick Child
2024-08-01 21:23 ` [PATCH net-next 5/7] ibmvnic: Introduce send sub-crq direct Nick Child
2024-08-01 21:23 ` [PATCH net-next 6/7] ibmvnic: Only record tx completed bytes once per handler Nick Child
2024-08-01 21:23 ` [PATCH net-next 7/7] ibmvnic: Perform tx CSO during send scrq direct Nick Child
2024-08-03  0:15   ` Jakub Kicinski
2024-08-05 13:52     ` Nick Child
2024-08-05 19:09       ` Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).