netdev.vger.kernel.org archive mirror
* [PATCH net-next 0/2] net/smc: make wr buffer count configurable
@ 2025-09-04 21:12 Halil Pasic
  2025-09-04 21:12 ` [PATCH net-next 1/2] " Halil Pasic
  2025-09-04 21:12 ` [PATCH net-next 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
  0 siblings, 2 replies; 11+ messages in thread
From: Halil Pasic @ 2025-09-04 21:12 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
  Cc: Halil Pasic

The current value of SMC_WR_BUF_CNT is 16, which leads to heavy
contention on the wr_tx_wait wait queue of the SMC-R linkgroup and its
spinlock when many connections are competing for the work request
buffers. Currently up to 256 connections per linkgroup are supported.

To make things worse, when a buffer finally becomes available and
smc_wr_tx_put_slot() signals the linkgroup's wr_tx_wait wq, all the
waiters get woken up because WQ_FLAG_EXCLUSIVE is not used; most of the
time only a single one can proceed, and the rest contend on the
spinlock of the wq just to go back to sleep again.

Addressing this by simply bumping SMC_WR_BUF_CNT to 256 was deemed
risky, because the large-ish physically contiguous allocation could fail
and lead to TCP fallbacks. For reference see the discussion thread on
"[PATCH net-next] net/smc: increase SMC_WR_BUF_CNT" (in archive
https://lists.openwall.net/netdev/2024/11/05/186), which concluded with
an agreement to try to come up with something smarter, which is what
this series aims for.

Additionally, if for some reason heavy contention is known not to be
expected, going with something like 256 work request buffers is
wasteful. To address these concerns, make the number of work request
buffers configurable, and introduce back-off logic that handles -ENOMEM
from smc_wr_alloc_link_mem() gracefully.
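For reference, with this series applied the new knobs would be tuned
via sysctl roughly as follows (the names come from patch 1; the
net.smc path assumes the series lands as posted):

```shell
# Inspect the current WR buffer counts (the defaults match the old
# hard-coded values: 16 on the send path, 3 * 16 = 48 on receive).
sysctl net.smc.smcr_max_send_wr
sysctl net.smc.smcr_max_recv_wr

# Raise both before new link groups are created; existing link groups
# keep the values they were created with, since the counts are read
# once at link group initialization.
sysctl -w net.smc.smcr_max_send_wr=256
sysctl -w net.smc.smcr_max_recv_wr=768
```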

Halil Pasic (2):
  net/smc: make wr buffer count configurable
  net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully

 Documentation/networking/smc-sysctl.rst | 40 +++++++++++++++++++++++++
 net/smc/smc.h                           |  2 ++
 net/smc/smc_core.c                      | 34 ++++++++++++++-------
 net/smc/smc_core.h                      |  6 ++++
 net/smc/smc_ib.c                        |  7 ++---
 net/smc/smc_llc.c                       |  2 ++
 net/smc/smc_sysctl.c                    | 22 ++++++++++++++
 net/smc/smc_wr.c                        | 32 ++++++++++----------
 net/smc/smc_wr.h                        |  2 --
 9 files changed, 115 insertions(+), 32 deletions(-)


base-commit: 5ef04a7b068cbb828eba226aacb42f880f7924d7
-- 
2.48.1



* [PATCH net-next 1/2] net/smc: make wr buffer count configurable
  2025-09-04 21:12 [PATCH net-next 0/2] net/smc: make wr buffer count configurable Halil Pasic
@ 2025-09-04 21:12 ` Halil Pasic
  2025-09-05  3:45   ` Dust Li
  2025-09-06 18:25   ` kernel test robot
  2025-09-04 21:12 ` [PATCH net-next 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
  1 sibling, 2 replies; 11+ messages in thread
From: Halil Pasic @ 2025-09-04 21:12 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
  Cc: Halil Pasic

Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
get replaced with lgr->max_send_wr and lgr->max_recv_wr respectively.

While at it, let us also remove a confusing comment that is either not
about the context in which it resides (it describes
qp_attr.cap.max_send_wr and qp_attr.cap.max_recv_wr) or no longer
applicable once these values become configurable.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
---
 Documentation/networking/smc-sysctl.rst | 37 +++++++++++++++++++++++++
 net/smc/smc.h                           |  2 ++
 net/smc/smc_core.h                      |  4 +++
 net/smc/smc_ib.c                        |  7 ++---
 net/smc/smc_llc.c                       |  2 ++
 net/smc/smc_sysctl.c                    | 22 +++++++++++++++
 net/smc/smc_wr.c                        | 32 +++++++++++----------
 net/smc/smc_wr.h                        |  2 --
 8 files changed, 86 insertions(+), 22 deletions(-)

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index a874d007f2db..c687092329e3 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -71,3 +71,40 @@ smcr_max_conns_per_lgr - INTEGER
 	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
 
 	Default: 255
+
+smcr_max_send_wr - INTEGER
+	So called work request buffers are SMCR link (and RDMA queue pair) level
+	resources necessary for performing RDMA operations. Since up to 255
+	connections can share a link group and thus also a link and the number
+	of the work request buffers is decided when the link is allocated,
+	depending on the workload it can a bottleneck in a sense that threads
+	have to wait for work request buffers to become available. Before the
+	introduction of this control the maximal number of work request buffers
+	available on the send path used to be hard coded to 16. With this control
+	it becomes configurable. The acceptable range is between 2 and 2048.
+
+	Please be aware that all the buffers need to be allocated as a physically
+	continuous array in which each element is a single buffer and has the size
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	like before having this control.
+	this control.
+
+	Default: 16
+
+smcr_max_recv_wr - INTEGER
+	So called work request buffers are SMCR link (and RDMA queue pair) level
+	resources necessary for performing RDMA operations. Since up to 255
+	connections can share a link group and thus also a link and the number
+	of the work request buffers is decided when the link is allocated,
+	depending on the workload it can a bottleneck in a sense that threads
+	have to wait for work request buffers to become available. Before the
+	introduction of this control the maximal number of work request buffers
+	available on the receive path used to be hard coded to 16. With this control
+	it becomes configurable. The acceptable range is between 2 and 2048.
+
+	Please be aware that all the buffers need to be allocated as a physically
+	continuous array in which each element is a single buffer and has the size
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	like before having this control.
+
+	Default: 48
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 2c9084963739..ffe48253fa1f 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -33,6 +33,8 @@
 
 extern struct proto smc_proto;
 extern struct proto smc_proto6;
+extern unsigned int smc_ib_sysctl_max_send_wr;
+extern unsigned int smc_ib_sysctl_max_recv_wr;
 
 extern struct smc_hashinfo smc_v4_hashinfo;
 extern struct smc_hashinfo smc_v6_hashinfo;
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 48a1b1dcb576..b883f43fc206 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -361,6 +361,10 @@ struct smc_link_group {
 						/* max conn can be assigned to lgr */
 			u8			max_links;
 						/* max links can be added in lgr */
+			u16			max_send_wr;
+						/* number of WR buffers on send */
+			u16			max_recv_wr;
+						/* number of WR buffers on recv */
 		};
 		struct { /* SMC-D */
 			struct smcd_gid		peer_gid;
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 0052f02756eb..e8d35c22c525 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -669,11 +669,6 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 		.recv_cq = lnk->smcibdev->roce_cq_recv,
 		.srq = NULL,
 		.cap = {
-				/* include unsolicited rdma_writes as well,
-				 * there are max. 2 RDMA_WRITE per 1 WR_SEND
-				 */
-			.max_send_wr = SMC_WR_BUF_CNT * 3,
-			.max_recv_wr = SMC_WR_BUF_CNT * 3,
 			.max_send_sge = SMC_IB_MAX_SEND_SGE,
 			.max_recv_sge = lnk->wr_rx_sge_cnt,
 			.max_inline_data = 0,
@@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 	};
 	int rc;
 
+	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
+	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
 	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
 	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
 	if (IS_ERR(lnk->roce_qp))
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index f865c58c3aa7..91c936bf7336 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -2157,6 +2157,8 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
 	init_waitqueue_head(&lgr->llc_msg_waiter);
 	init_rwsem(&lgr->llc_conf_mutex);
 	lgr->llc_testlink_time = READ_ONCE(net->smc.sysctl_smcr_testlink_time);
+	lgr->max_send_wr = (u16)(READ_ONCE(smc_ib_sysctl_max_send_wr));
+	lgr->max_recv_wr = (u16)(READ_ONCE(smc_ib_sysctl_max_recv_wr));
 }
 
 /* called after lgr was removed from lgr_list */
diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
index 2fab6456f765..01da1297e150 100644
--- a/net/smc/smc_sysctl.c
+++ b/net/smc/smc_sysctl.c
@@ -29,6 +29,10 @@ static int links_per_lgr_min = SMC_LINKS_ADD_LNK_MIN;
 static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
 static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
 static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
+unsigned int smc_ib_sysctl_max_send_wr = 16;
+unsigned int smc_ib_sysctl_max_recv_wr = 48;
+static unsigned int smc_ib_sysctl_max_wr_min = 2;
+static unsigned int smc_ib_sysctl_max_wr_max = 2048;
 
 static struct ctl_table smc_table[] = {
 	{
@@ -99,6 +103,24 @@ static struct ctl_table smc_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname       = "smcr_max_send_wr",
+		.data		= &smc_ib_sysctl_max_send_wr,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec_minmax,
+		.extra1		= &smc_ib_sysctl_max_wr_min,
+		.extra2		= &smc_ib_sysctl_max_wr_max,
+	},
+	{
+		.procname       = "smcr_max_recv_wr",
+		.data		= &smc_ib_sysctl_max_recv_wr,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec_minmax,
+		.extra1		= &smc_ib_sysctl_max_wr_min,
+		.extra2		= &smc_ib_sysctl_max_wr_max,
+	},
 };
 
 int __net_init smc_sysctl_net_init(struct net *net)
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index b04a21b8c511..85ebc65f1546 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -34,6 +34,7 @@
 #define SMC_WR_MAX_POLL_CQE 10	/* max. # of compl. queue elements in 1 poll */
 
 #define SMC_WR_RX_HASH_BITS 4
+
 static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
 static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
 
@@ -547,9 +548,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 		    IB_QP_DEST_QPN,
 		    &init_attr);
 
-	lnk->wr_tx_cnt = min_t(size_t, SMC_WR_BUF_CNT,
+	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT * 3,
+	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -741,50 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
 int smc_wr_alloc_link_mem(struct smc_link *link)
 {
 	/* allocate link related memory */
-	link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL);
+	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
+				   SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, link->wr_rx_buflen,
+	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, SMC_WR_BUF_SIZE,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
-	link->wr_tx_ibs = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_ibs[0]),
-				  GFP_KERNEL);
+	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
+				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
-	link->wr_tx_rdmas = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
-	link->wr_tx_rdma_sges = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
 	if (!link->wr_tx_rdma_sges)
 		goto no_mem_wr_tx_rdmas;
-	link->wr_tx_sges = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_sges[0]),
+	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
 		goto no_mem_wr_tx_sges;
-	link->wr_tx_mask = bitmap_zalloc(SMC_WR_BUF_CNT, GFP_KERNEL);
+	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
 	if (!link->wr_tx_mask)
 		goto no_mem_wr_rx_sges;
-	link->wr_tx_pends = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_pends[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
-	link->wr_tx_compl = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_compl)
@@ -905,7 +907,7 @@ int smc_wr_create_link(struct smc_link *lnk)
 		goto dma_unmap;
 	}
 	smc_wr_init_sge(lnk);
-	bitmap_zero(lnk->wr_tx_mask, SMC_WR_BUF_CNT);
+	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
 	init_waitqueue_head(&lnk->wr_tx_wait);
 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
 	if (rc)
diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index f3008dda222a..aa4533af9122 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -19,8 +19,6 @@
 #include "smc.h"
 #include "smc_core.h"
 
-#define SMC_WR_BUF_CNT 16	/* # of ctrl buffers per link */
-
 #define SMC_WR_TX_WAIT_FREE_SLOT_TIME	(10 * HZ)
 
 #define SMC_WR_TX_SIZE 44 /* actual size of wr_send data (<=SMC_WR_BUF_SIZE) */
-- 
2.48.1



* [PATCH net-next 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-04 21:12 [PATCH net-next 0/2] net/smc: make wr buffer count configurable Halil Pasic
  2025-09-04 21:12 ` [PATCH net-next 1/2] " Halil Pasic
@ 2025-09-04 21:12 ` Halil Pasic
  1 sibling, 0 replies; 11+ messages in thread
From: Halil Pasic @ 2025-09-04 21:12 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
  Cc: Halil Pasic

Currently a -ENOMEM from smc_wr_alloc_link_mem() is handled by
giving up and going the way of a TCP fallback. This was reasonable
while the sizes of the allocations there were compile time constants
and reasonably small. But now those are actually configurable.

So instead of giving up, keep retrying with half of the requested
size unless we dip below the old static sizes -- then give up!

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
---
 Documentation/networking/smc-sysctl.rst |  9 ++++---
 net/smc/smc_core.c                      | 34 +++++++++++++++++--------
 net/smc/smc_core.h                      |  2 ++
 net/smc/smc_wr.c                        | 28 ++++++++++----------
 4 files changed, 46 insertions(+), 27 deletions(-)

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index c687092329e3..c8dbe7ac8bdf 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -85,9 +85,10 @@ smcr_max_send_wr - INTEGER
 
 	Please be aware that all the buffers need to be allocated as a physically
 	continuous array in which each element is a single buffer and has the size
-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
+	with half of the buffer count until it is ether successful or (unlikely)
+	we dip below the old hard coded value which is 16 where we give up much
 	like before having this control.
-	this control.
 
 	Default: 16
 
@@ -104,7 +105,9 @@ smcr_max_recv_wr - INTEGER
 
 	Please be aware that all the buffers need to be allocated as a physically
 	continuous array in which each element is a single buffer and has the size
-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
+	with half of the buffer count until it is ether successful or (unlikely)
+	we dip below the old hard coded value which is 16 where we give up much
 	like before having this control.
 
 	Default: 48
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 262746e304dd..da2bde99ebc6 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	lnk->clearing = 0;
 	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
 	lnk->link_id = smcr_next_link_id(lgr);
+	lnk->max_send_wr = lgr->max_send_wr;
+	lnk->max_recv_wr = lgr->max_recv_wr;
 	lnk->lgr = lgr;
 	smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
 	lnk->link_idx = link_idx;
@@ -836,27 +838,39 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	rc = smc_llc_link_init(lnk);
 	if (rc)
 		goto out;
-	rc = smc_wr_alloc_link_mem(lnk);
-	if (rc)
-		goto clear_llc_lnk;
 	rc = smc_ib_create_protection_domain(lnk);
 	if (rc)
-		goto free_link_mem;
-	rc = smc_ib_create_queue_pair(lnk);
-	if (rc)
-		goto dealloc_pd;
+		goto clear_llc_lnk;
+	do {
+		rc = smc_ib_create_queue_pair(lnk);
+		if (rc)
+			goto dealloc_pd;
+		rc = smc_wr_alloc_link_mem(lnk);
+		if (!rc)
+			break;
+		else if (rc != -ENOMEM) /* give up */
+			goto destroy_qp;
+		/* retry with smaller ... */
+		lnk->max_send_wr /= 2;
+		lnk->max_recv_wr /= 2;
+		/* ... unless droping below old SMC_WR_BUF_SIZE */
+		if (lnk->max_send_wr < 16 || lnk->max_recv_wr < 48)
+			goto destroy_qp;
+		smc_ib_destroy_queue_pair(lnk);
+	} while (1);
+
 	rc = smc_wr_create_link(lnk);
 	if (rc)
-		goto destroy_qp;
+		goto free_link_mem;
 	lnk->state = SMC_LNK_ACTIVATING;
 	return 0;
 
+free_link_mem:
+	smc_wr_free_link_mem(lnk);
 destroy_qp:
 	smc_ib_destroy_queue_pair(lnk);
 dealloc_pd:
 	smc_ib_dealloc_protection_domain(lnk);
-free_link_mem:
-	smc_wr_free_link_mem(lnk);
 clear_llc_lnk:
 	smc_llc_link_clear(lnk, false);
 out:
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index b883f43fc206..92d70c57d23d 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -172,6 +172,8 @@ struct smc_link {
 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
 	int			llc_testlink_time; /* testlink interval */
 	atomic_t		conn_cnt; /* connections on this link */
+	u16			max_send_wr;
+	u16			max_recv_wr;
 };
 
 /* For now we just allow one parallel link per link group. The SMC protocol
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 85ebc65f1546..4759041d3b02 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -548,9 +548,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 		    IB_QP_DEST_QPN,
 		    &init_attr);
 
-	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
+	lnk->wr_tx_cnt = min_t(size_t, lnk->max_send_wr,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
+	lnk->wr_rx_cnt = min_t(size_t, lnk->max_recv_wr,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -742,51 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
 int smc_wr_alloc_link_mem(struct smc_link *link)
 {
 	/* allocate link related memory */
-	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_bufs = kcalloc(link->max_send_wr,
 				   SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, SMC_WR_BUF_SIZE,
+	link->wr_rx_bufs = kcalloc(link->max_recv_wr, SMC_WR_BUF_SIZE,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
-	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_ibs = kcalloc(link->max_send_wr,
 				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
+	link->wr_rx_ibs = kcalloc(link->max_recv_wr,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
-	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_rdmas = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
-	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_rdma_sges = kcalloc(link->max_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
 	if (!link->wr_tx_rdma_sges)
 		goto no_mem_wr_tx_rdmas;
-	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
+	link->wr_tx_sges = kcalloc(link->max_send_wr, sizeof(link->wr_tx_sges[0]),
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
+	link->wr_rx_sges = kcalloc(link->max_recv_wr,
 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
 		goto no_mem_wr_tx_sges;
-	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
+	link->wr_tx_mask = bitmap_zalloc(link->max_send_wr, GFP_KERNEL);
 	if (!link->wr_tx_mask)
 		goto no_mem_wr_rx_sges;
-	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_pends = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_pends[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
-	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_compl = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_compl)
@@ -907,7 +907,7 @@ int smc_wr_create_link(struct smc_link *lnk)
 		goto dma_unmap;
 	}
 	smc_wr_init_sge(lnk);
-	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
+	bitmap_zero(lnk->wr_tx_mask, lnk->max_send_wr);
 	init_waitqueue_head(&lnk->wr_tx_wait);
 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
 	if (rc)
-- 
2.48.1



* Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
  2025-09-04 21:12 ` [PATCH net-next 1/2] " Halil Pasic
@ 2025-09-05  3:45   ` Dust Li
  2025-09-05  9:00     ` Halil Pasic
  2025-09-06 18:25   ` kernel test robot
  1 sibling, 1 reply; 11+ messages in thread
From: Dust Li @ 2025-09-05  3:45 UTC (permalink / raw)
  To: Halil Pasic, Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390

On 2025-09-04 23:12:52, Halil Pasic wrote:
>Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
>SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
>get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.

Hi Halil,

I think making the WR buffer count configurable helps make SMC more flexible.

However, there are two additional issues we need to consider:

1. What if the two sides have different max_send_wr/max_recv_wr configurations?
IIUC, for example, if the client sets max_send_wr to 64 but the server sets
max_recv_wr to 16, the client might overflow the server's QP receive
queue, potentially causing an RNR (Receiver Not Ready) error.

2. Since WR buffers are configurable, it’d be helpful to add some
monitoring methods so we can keep track of how many WR buffers each QP
is currently using. This should be useful now that you introduced a fallback
retry mechanism, which can cause the number of WR buffers to change
dynamically.


Some other minor issues in the patch itself, see below

>
>While at it let us also remove a confusing comment that is either not
>about the context in which it resides (describing
>qp_attr.cap.max_send_wr and qp_attr.cap.max_recv_wr) or not applicable
>any more when these values become configurable .
>
>Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
>Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
>---
> Documentation/networking/smc-sysctl.rst | 37 +++++++++++++++++++++++++
> net/smc/smc.h                           |  2 ++
> net/smc/smc_core.h                      |  4 +++
> net/smc/smc_ib.c                        |  7 ++---
> net/smc/smc_llc.c                       |  2 ++
> net/smc/smc_sysctl.c                    | 22 +++++++++++++++
> net/smc/smc_wr.c                        | 32 +++++++++++----------
> net/smc/smc_wr.h                        |  2 --
> 8 files changed, 86 insertions(+), 22 deletions(-)
>
>diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
>index a874d007f2db..c687092329e3 100644
>--- a/Documentation/networking/smc-sysctl.rst
>+++ b/Documentation/networking/smc-sysctl.rst
>@@ -71,3 +71,40 @@ smcr_max_conns_per_lgr - INTEGER
> 	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
> 
> 	Default: 255
>+
>+smcr_max_send_wr - INTEGER

Why call it max? And not something like smcr_send_wr_cnt?

>+	So called work request buffers are SMCR link (and RDMA queue pair) level
>+	resources necessary for performing RDMA operations. Since up to 255
>+	connections can share a link group and thus also a link and the number
>+	of the work request buffers is decided when the link is allocated,
>+	depending on the workload it can a bottleneck in a sense that threads
>+	have to wait for work request buffers to become available. Before the
>+	introduction of this control the maximal number of work request buffers
>+	available on the send path used to be hard coded to 16. With this control
>+	it becomes configurable. The acceptable range is between 2 and 2048.
>+
>+	Please be aware that all the buffers need to be allocated as a physically
>+	continuous array in which each element is a single buffer and has the size
>+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
>+	like before having this control.
>+	this control.
>+
>+	Default: 16
>+
>+smcr_max_recv_wr - INTEGER
>+	So called work request buffers are SMCR link (and RDMA queue pair) level
>+	resources necessary for performing RDMA operations. Since up to 255
>+	connections can share a link group and thus also a link and the number
>+	of the work request buffers is decided when the link is allocated,
>+	depending on the workload it can a bottleneck in a sense that threads
>+	have to wait for work request buffers to become available. Before the
>+	introduction of this control the maximal number of work request buffers
>+	available on the receive path used to be hard coded to 16. With this control
>+	it becomes configurable. The acceptable range is between 2 and 2048.
>+
>+	Please be aware that all the buffers need to be allocated as a physically
>+	continuous array in which each element is a single buffer and has the size
>+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
>+	like before having this control.
>+
>+	Default: 48
>diff --git a/net/smc/smc.h b/net/smc/smc.h
>index 2c9084963739..ffe48253fa1f 100644
>--- a/net/smc/smc.h
>+++ b/net/smc/smc.h
>@@ -33,6 +33,8 @@
> 
> extern struct proto smc_proto;
> extern struct proto smc_proto6;
>+extern unsigned int smc_ib_sysctl_max_send_wr;
>+extern unsigned int smc_ib_sysctl_max_recv_wr;
> 
> extern struct smc_hashinfo smc_v4_hashinfo;
> extern struct smc_hashinfo smc_v6_hashinfo;
>diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
>index 48a1b1dcb576..b883f43fc206 100644
>--- a/net/smc/smc_core.h
>+++ b/net/smc/smc_core.h
>@@ -361,6 +361,10 @@ struct smc_link_group {
> 						/* max conn can be assigned to lgr */
> 			u8			max_links;
> 						/* max links can be added in lgr */
>+			u16			max_send_wr;
>+						/* number of WR buffers on send */
>+			u16			max_recv_wr;
>+						/* number of WR buffers on recv */
> 		};
> 		struct { /* SMC-D */
> 			struct smcd_gid		peer_gid;
>diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
>index 0052f02756eb..e8d35c22c525 100644
>--- a/net/smc/smc_ib.c
>+++ b/net/smc/smc_ib.c
>@@ -669,11 +669,6 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
> 		.recv_cq = lnk->smcibdev->roce_cq_recv,
> 		.srq = NULL,
> 		.cap = {
>-				/* include unsolicited rdma_writes as well,
>-				 * there are max. 2 RDMA_WRITE per 1 WR_SEND
>-				 */
>-			.max_send_wr = SMC_WR_BUF_CNT * 3,
>-			.max_recv_wr = SMC_WR_BUF_CNT * 3,
> 			.max_send_sge = SMC_IB_MAX_SEND_SGE,
> 			.max_recv_sge = lnk->wr_rx_sge_cnt,
> 			.max_inline_data = 0,
>@@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
> 	};
> 	int rc;
> 
>+	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
>+	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
> 	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
> 	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
> 	if (IS_ERR(lnk->roce_qp))
>diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
>index f865c58c3aa7..91c936bf7336 100644
>--- a/net/smc/smc_llc.c
>+++ b/net/smc/smc_llc.c
>@@ -2157,6 +2157,8 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
> 	init_waitqueue_head(&lgr->llc_msg_waiter);
> 	init_rwsem(&lgr->llc_conf_mutex);
> 	lgr->llc_testlink_time = READ_ONCE(net->smc.sysctl_smcr_testlink_time);
>+	lgr->max_send_wr = (u16)(READ_ONCE(smc_ib_sysctl_max_send_wr));
>+	lgr->max_recv_wr = (u16)(READ_ONCE(smc_ib_sysctl_max_recv_wr));
> }
> 
> /* called after lgr was removed from lgr_list */
>diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
>index 2fab6456f765..01da1297e150 100644
>--- a/net/smc/smc_sysctl.c
>+++ b/net/smc/smc_sysctl.c
>@@ -29,6 +29,10 @@ static int links_per_lgr_min = SMC_LINKS_ADD_LNK_MIN;
> static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
> static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
> static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
>+unsigned int smc_ib_sysctl_max_send_wr = 16;
>+unsigned int smc_ib_sysctl_max_recv_wr = 48;
>+static unsigned int smc_ib_sysctl_max_wr_min = 2;
>+static unsigned int smc_ib_sysctl_max_wr_max = 2048;
> 
> static struct ctl_table smc_table[] = {
> 	{
>@@ -99,6 +103,24 @@ static struct ctl_table smc_table[] = {
> 		.extra1		= SYSCTL_ZERO,
> 		.extra2		= SYSCTL_ONE,
> 	},
>+	{
>+		.procname       = "smcr_max_send_wr",
>+		.data		= &smc_ib_sysctl_max_send_wr,
>+		.maxlen         = sizeof(int),
>+		.mode           = 0644,
>+		.proc_handler   = proc_dointvec_minmax,
>+		.extra1		= &smc_ib_sysctl_max_wr_min,
>+		.extra2		= &smc_ib_sysctl_max_wr_max,
>+	},
>+	{
>+		.procname       = "smcr_max_recv_wr",
>+		.data		= &smc_ib_sysctl_max_recv_wr,
>+		.maxlen         = sizeof(int),
>+		.mode           = 0644,
>+		.proc_handler   = proc_dointvec_minmax,

It's better to use tab instead of space before those '=' here.

Best regards,
Dust



* Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
  2025-09-05  3:45   ` Dust Li
@ 2025-09-05  9:00     ` Halil Pasic
  2025-09-05 12:01       ` Halil Pasic
  0 siblings, 1 reply; 11+ messages in thread
From: Halil Pasic @ 2025-09-05  9:00 UTC (permalink / raw)
  To: Dust Li
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Fri, 5 Sep 2025 11:45:36 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:

> On 2025-09-04 23:12:52, Halil Pasic wrote:
> >Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
> >SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
> >get replaced with lgr->max_send_wr and lgr->max_recv_wr respectively.
> 
> Hi Halil,
> 
> I think making the WR buffer count configurable helps make SMC more flexible.

Hi Dust Li,

Thank you for having a look. 

> 
> However, there are two additional issues we need to consider:
> 
> 1. What if the two sides have different max_send_wr/max_recv_wr configurations?
> IIUC, For example, if the client sets max_send_wr to 64, but the server sets
> max_recv_wr to 16, the client might overflow the server's QP receive
> queue, potentially causing an RNR (Receiver Not Ready) error.

I don't think the 16 is spec-ed anywhere and if the client and the server
need to agree on the same value it should either be speced, or a
protocol mechanism for negotiating it needs to exist. So what is your
take on this as an SMC maintainer?

I think, we have tested heterogeneous setups and didn't see any grave
issues. But let me please do a follow up on this. Maybe the other
maintainers can chime in as well.

> 
> 2. Since WR buffers are configurable, it’d be helpful to add some
> monitoring methods so we can keep track of how many WR buffers each QP
> is currently using. This should be useful now that you introduced a fallback
> retry mechanism, which can cause the number of WR buffers to change
> dynamically.
> 

I agree, but I think that can be done in a different scope. I don't think
this needs to be a part of the MVP. Or do you think that it needs to
be part of the series?
> 
> Some other minor issues in the patch itself, see below
> 
> >
> >While at it let us also remove a confusing comment that is either not
> >about the context in which it resides (describing
> >qp_attr.cap.max_send_wr and qp_attr.cap.max_recv_wr) or not applicable
> >any more when these values become configurable.
> >
> >Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> >Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
> >---
> > Documentation/networking/smc-sysctl.rst | 37 +++++++++++++++++++++++++
> > net/smc/smc.h                           |  2 ++
> > net/smc/smc_core.h                      |  4 +++
> > net/smc/smc_ib.c                        |  7 ++---
> > net/smc/smc_llc.c                       |  2 ++
> > net/smc/smc_sysctl.c                    | 22 +++++++++++++++
> > net/smc/smc_wr.c                        | 32 +++++++++++----------
> > net/smc/smc_wr.h                        |  2 --
> > 8 files changed, 86 insertions(+), 22 deletions(-)
> >
> >diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
> >index a874d007f2db..c687092329e3 100644
> >--- a/Documentation/networking/smc-sysctl.rst
> >+++ b/Documentation/networking/smc-sysctl.rst
> >@@ -71,3 +71,40 @@ smcr_max_conns_per_lgr - INTEGER
> > 	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
> > 
> > 	Default: 255
> >+
> >+smcr_max_send_wr - INTEGER  
> 
> Why call it max ? But not something like smcr_send_wr_cnt ?

Because of the back-off mechanism. You are not guaranteed to get
this many but you are guaranteed to not get more.
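
Roughly the idea is this (a hypothetical userspace sketch with made-up
names, not the actual smc_wr_alloc_link_mem() code -- it assumes a simple
halving back-off down to the sysctl minimum of 2):

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for an allocation that can fail with -ENOMEM. */
static int try_alloc(unsigned int cnt, unsigned int budget)
{
	return cnt <= budget ? 0 : -ENOMEM;
}

/*
 * Back off on -ENOMEM: halve the requested WR count until the
 * allocation succeeds or the count would drop below the minimum (2).
 * On success *cnt holds the count actually obtained, which is why
 * the sysctl value is an upper bound rather than a guarantee.
 */
static int alloc_with_backoff(unsigned int *cnt, unsigned int budget)
{
	while (*cnt >= 2) {
		if (!try_alloc(*cnt, budget))
			return 0;
		*cnt /= 2;
	}
	return -ENOMEM;
}
```

So asking for 256 may end up with 32 if memory is tight, but never more
than what was requested.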
> > static struct ctl_table smc_table[] = {
> > 	{
> >@@ -99,6 +103,24 @@ static struct ctl_table smc_table[] = {
> > 		.extra1		= SYSCTL_ZERO,
> > 		.extra2		= SYSCTL_ONE,
> > 	},
> >+	{
> >+		.procname       = "smcr_max_send_wr",
> >+		.data		= &smc_ib_sysctl_max_send_wr,
> >+		.maxlen         = sizeof(int),
> >+		.mode           = 0644,
> >+		.proc_handler   = proc_dointvec_minmax,
> >+		.extra1		= &smc_ib_sysctl_max_wr_min,
> >+		.extra2		= &smc_ib_sysctl_max_wr_max,
> >+	},
> >+	{
> >+		.procname       = "smcr_max_recv_wr",
> >+		.data		= &smc_ib_sysctl_max_recv_wr,
> >+		.maxlen         = sizeof(int),
> >+		.mode           = 0644,
> >+		.proc_handler   = proc_dointvec_minmax,  
> 
> It's better to use tab instead of space before those '=' here.
> 

I can definitely fix that and do a respin if you like. But I think
we need to sort out the other problems first.

Regards,
Halil



* Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
  2025-09-05  9:00     ` Halil Pasic
@ 2025-09-05 12:01       ` Halil Pasic
  2025-09-05 14:12         ` Mahanta Jambigi
  2025-09-05 14:22         ` Dust Li
  0 siblings, 2 replies; 11+ messages in thread
From: Halil Pasic @ 2025-09-05 12:01 UTC (permalink / raw)
  To: Dust Li
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Fri, 5 Sep 2025 11:00:59 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:

> > 1. What if the two sides have different max_send_wr/max_recv_wr configurations?
> > IIUC, For example, if the client sets max_send_wr to 64, but the server sets
> > max_recv_wr to 16, the client might overflow the server's QP receive
> > queue, potentially causing an RNR (Receiver Not Ready) error.  
> 
> I don't think the 16 is spec-ed anywhere and if the client and the server
> need to agree on the same value it should either be speced, or a
> protocol mechanism for negotiating it needs to exist. So what is your
> take on this as an SMC maintainer?
> 
> I think, we have tested heterogeneous setups and didn't see any grave
> issues. But let me please do a follow up on this. Maybe the other
> maintainers can chime in as well.

Did some research and some thinking. Are you concerned about a
performance regression for e.g. 64 -> 16 compared to 16 -> 16? According
to my current understanding the RNR must not lead to a catastrophic
failure, but the RDMA/IB stack is supposed to handle that.

I would like to also point out that bumping SMC_WR_BUF_CNT basically has
the same problem, although admittedly to a smaller extent because it is
only between "old" and "new".

Assuming that my understanding is correct, I believe that the problem of
the potential RNR is inherent to the objective of the series, and
probably one that can be lived with. Given this entire EID business, I
think the SMC-R setup is likely to happen in a coordinated fashion for
all potential peers, and I hope whoever tweaks those values has a
sufficient understanding or empirical evidence to justify the tweaks.

Assuming my understanding is not utterly wrong, I would very much like
to know what you would want me to do with this?

Thank you in advance!

Regards,
Halil


* Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
  2025-09-05 12:01       ` Halil Pasic
@ 2025-09-05 14:12         ` Mahanta Jambigi
  2025-09-05 14:22         ` Dust Li
  1 sibling, 0 replies; 11+ messages in thread
From: Mahanta Jambigi @ 2025-09-05 14:12 UTC (permalink / raw)
  To: Halil Pasic, Dust Li
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Tony Lu, Wen Gu, netdev,
	linux-doc, linux-kernel, linux-rdma, linux-s390



On 05/09/25 5:31 pm, Halil Pasic wrote:
> On Fri, 5 Sep 2025 11:00:59 +0200
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
>>> 1. What if the two sides have different max_send_wr/max_recv_wr configurations?
>>> IIUC, For example, if the client sets max_send_wr to 64, but the server sets
>>> max_recv_wr to 16, the client might overflow the server's QP receive
>>> queue, potentially causing an RNR (Receiver Not Ready) error.  
>>
>> I don't think the 16 is spec-ed anywhere and if the client and the server
>> need to agree on the same value it should either be speced, or a
>> protocol mechanism for negotiating it needs to exist. So what is your
>> take on this as an SMC maintainer?
>>
>> I think, we have tested heterogeneous setups and didn't see any grave
>> issues. But let me please do a follow up on this. Maybe the other
>> maintainers can chime in as well.
> 
> Did some research and some thinking. Are you concerned about a
> performance regression for e.g. 64 -> 16 compared to 16 -> 16? According
> to my current understanding the RNR must not lead to a catastrophic
> failure, but the RDMA/IB stack is supposed to handle that.
> 

Hi Dust,

I configured a client-server setup & did some SMC-R testing by setting
the values you proposed. Ran iperf3(using smc_run) with max parallel
connections of 128 & it looks good. No tcp fallback. No obvious errors.
As Halil mentioned I don't see any catastrophic failure here. Let me
know if I need to stress the system by some more tests or any specific
test that you can think may cause RNR errors. The setup is ready & I can
try it.

*Client* side logs:
[root@client ~]$ sysctl net.smc.smcr_max_send_wr
net.smc.smcr_max_send_wr = 64
[root@client ~]$

[root@client ~]$ smc_run iperf3  -P 128 -t 120 -c 10.25.0.72
Connecting to host 10.25.0.72, port 5201
[  5] local 10.25.0.73 port 52544 connected to 10.25.0.72 port 5201
[  7] local 10.25.0.73 port 52558 connected to 10.25.0.72 port 5201

*Server* side logs:
[root@server ~]$ sysctl net.smc.smcr_max_recv_wr
net.smc.smcr_max_recv_wr = 16
[root@server ~]$

[root@server~]$ smc_run iperf3 -s

-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------


* Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
  2025-09-05 12:01       ` Halil Pasic
  2025-09-05 14:12         ` Mahanta Jambigi
@ 2025-09-05 14:22         ` Dust Li
  2025-09-05 14:51           ` Dust Li
  1 sibling, 1 reply; 11+ messages in thread
From: Dust Li @ 2025-09-05 14:22 UTC (permalink / raw)
  To: Halil Pasic
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390

On 2025-09-05 14:01:35, Halil Pasic wrote:
>On Fri, 5 Sep 2025 11:00:59 +0200
>Halil Pasic <pasic@linux.ibm.com> wrote:
>
>> > 1. What if the two sides have different max_send_wr/max_recv_wr configurations?
>> > IIUC, For example, if the client sets max_send_wr to 64, but the server sets
>> > max_recv_wr to 16, the client might overflow the server's QP receive
>> > queue, potentially causing an RNR (Receiver Not Ready) error.  
>>
>> I don't think the 16 is spec-ed anywhere and if the client and the server
>> need to agree on the same value it should either be speced, or a
>> protocol mechanism for negotiating it needs to exist. So what is your
>> take on this as an SMC maintainer?

Right — I didn't realize that either until I saw this patch today :)
But since the implementation's been set to 16 since day one, bumping it
up might break things.

>>
>> I think, we have tested heterogeneous setups and didn't see any grave
>> issues. But let me please do a follow up on this. Maybe the other
>> maintainers can chime in as well.

I'm glad to hear from others.

>
>Did some research and some thinking. Are you concerned about a
>performance regression for e.g. 64 -> 16 compared to 16 -> 16? According
>to my current understanding the RNR must not lead to a catastrophic
>failure, but the RDMA/IB stack is supposed to handle that.

No, it's not just a performance regression.
If we get an RNR when going from 64 -> 16, the whole link group gets
torn down — and all SMC connections inside it break.
So from the user’s point of view, connections will just randomly drop
out of nowhere.

>
>I would like to also point out that bumping SMC_WR_BUF_CNT basically has
>the same problem, although admittedly to a smaller extent because it is
>only between "old" and "new".

Right.

>
>Assuming that my understanding is correct, I believe that the problem of
>the potential RNR is inherent to the objective of the series, and
>probably one that can be lived with. Given this entire EID business, I
>think the SMC-R setup is likely to happen in a coordinated fashion for
>all potential peers, and I hope whoever tweaks those values has a
>sufficent understanding or empiric evidence to justify the tweaks.
>
>Assuming my understanding is not utterly wrong, I would very much like
>to know what would you want me to do with this?

In my opinion, the best way to support different SMC_WR_BUF_CNT values
is: First, define it in the spec - then do a handshake to agree on the
value.
If the peer doesn’t support this feature, just fall back to 16.
If both sides support it, use the smaller (MIN) of the two values.
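
Something like this (a hypothetical helper, purely to illustrate the
rule -- none of these names exist in the code today):

```c
#include <assert.h>

/* Value implied for peers that don't support the new handshake. */
#define SMC_WR_BUF_CNT_LEGACY 16

/*
 * Proposed negotiation: if the peer doesn't support the feature,
 * fall back to the historical 16; otherwise use the MIN of the
 * two advertised values so neither side can overflow the other.
 */
static unsigned int negotiate_wr_cnt(int peer_supports,
				     unsigned int local_cnt,
				     unsigned int peer_cnt)
{
	if (!peer_supports)
		return SMC_WR_BUF_CNT_LEGACY;
	return local_cnt < peer_cnt ? local_cnt : peer_cnt;
}
```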

Best regards,
Dust



* Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
  2025-09-05 14:22         ` Dust Li
@ 2025-09-05 14:51           ` Dust Li
  2025-09-05 21:05             ` Halil Pasic
  0 siblings, 1 reply; 11+ messages in thread
From: Dust Li @ 2025-09-05 14:51 UTC (permalink / raw)
  To: Halil Pasic
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390

On 2025-09-05 22:22:48, Dust Li wrote:
>On 2025-09-05 14:01:35, Halil Pasic wrote:
>>On Fri, 5 Sep 2025 11:00:59 +0200
>>Halil Pasic <pasic@linux.ibm.com> wrote:
>>
>>> > 1. What if the two sides have different max_send_wr/max_recv_wr configurations?
>>> > IIUC, For example, if the client sets max_send_wr to 64, but the server sets
>>> > max_recv_wr to 16, the client might overflow the server's QP receive
>>> > queue, potentially causing an RNR (Receiver Not Ready) error.  
>>>
>>> I don't think the 16 is spec-ed anywhere and if the client and the server
>>> need to agree on the same value it should either be speced, or a
>>> protocol mechanism for negotiating it needs to exist. So what is your
>>> take on this as an SMC maintainer?
>
>Right — I didn't realize that either until I saw this patch today :)
>But since the implementation's been set to 16 since day one, bumping it
>up might break things.
>
>>>
>>> I think, we have tested heterogeneous setups and didn't see any grave
>>> issues. But let me please do a follow up on this. Maybe the other
>>> maintainers can chime in as well.
>
>I'm glad to hear from others.
>
>>
>>Did some research and some thinking. Are you concerned about a
>>performance regression for e.g. 64 -> 16 compared to 16 -> 16? According
>>to my current understanding the RNR must not lead to a catastrophic
>>failure, but the RDMA/IB stack is supposed to handle that.
>
>No, it's not just a performance regression.
>If we get an RNR when going from 64 -> 16, the whole link group gets
>torn down — and all SMC connections inside it break.
>So from the user’s point of view, connections will just randomly drop
>out of nowhere.

I double-checked the code and noticed we set qp_attr.rnr_retry =
SMC_QP_RNR_RETRY = 7, which means "infinite retries."
So the QP will just keep retrying — we won't actually get an RNR.
That said, yeah, just performance regression.
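
For anyone reading along: rnr_retry is a 3-bit field in the QP
attributes, values 0..6 are literal retry counts and the all-ones
value 7 means "retry indefinitely" -- a trivial sketch of that check:

```c
#include <assert.h>

#define SMC_QP_RNR_RETRY 7	/* 3-bit field; 7 == retry without limit */

/* rnr_retry values 0..6 are literal retry counts; 7 is infinite. */
static int rnr_retry_is_infinite(unsigned int rnr_retry)
{
	return (rnr_retry & 0x7) == 7;
}
```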

So in this case, I would regard it as acceptable. We can go with this.

Best regards,
Dust



* Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
  2025-09-05 14:51           ` Dust Li
@ 2025-09-05 21:05             ` Halil Pasic
  0 siblings, 0 replies; 11+ messages in thread
From: Halil Pasic @ 2025-09-05 21:05 UTC (permalink / raw)
  To: Dust Li
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Fri, 5 Sep 2025 22:51:37 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:

> >>Did some research and some thinking. Are you concerned about a
> >>performance regression for e.g. 64 -> 16 compared to 16 -> 16? According
> >>to my current understanding the RNR must not lead to a catastrophic
> >>failure, but the RDMA/IB stack is supposed to handle that.  
> >
> >No, it's not just a performance regression.
> >If we get an RNR when going from 64 -> 16, the whole link group gets
> >torn down — and all SMC connections inside it break.
> >So from the user’s point of view, connections will just randomly drop
> >out of nowhere.  
> 
> I double-checked the code and noticed we set qp_attr.rnr_retry =
> SMC_QP_RNR_RETRY = 7, which means "infinite retries."
> So the QP will just keep retrying — we won't actually get an RNR.
> That said, yeah, just performance regression.
> 
> So in this case, I would regard it as acceptable. We can go with this.

Yes, that is consistent with Mahanta's testing in a sense that he did
not see any catastrophic failure. Regarding the performance regression,
I don't know how bad it is. Mahanta was so kind to do most of the
testing. 

So that leaves us with replacing the spaces with tabs and maybe with the
names, or? If you have a proposal for a better name, let's talk about
it.

BTW are you aware of any generic counters that would help with
figuring out how many RNRs have been sent/received?

Regards,
Halil


* Re: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
  2025-09-04 21:12 ` [PATCH net-next 1/2] " Halil Pasic
  2025-09-05  3:45   ` Dust Li
@ 2025-09-06 18:25   ` kernel test robot
  1 sibling, 0 replies; 11+ messages in thread
From: kernel test robot @ 2025-09-06 18:25 UTC (permalink / raw)
  To: Halil Pasic, Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Dust Li, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu,
	Wen Gu, netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
  Cc: oe-kbuild-all, Halil Pasic

Hi Halil,

kernel test robot noticed the following build errors:

[auto build test ERROR on 5ef04a7b068cbb828eba226aacb42f880f7924d7]

url:    https://github.com/intel-lab-lkp/linux/commits/Halil-Pasic/net-smc-make-wr-buffer-count-configurable/20250905-051510
base:   5ef04a7b068cbb828eba226aacb42f880f7924d7
patch link:    https://lore.kernel.org/r/20250904211254.1057445-2-pasic%40linux.ibm.com
patch subject: [PATCH net-next 1/2] net/smc: make wr buffer count configurable
config: loongarch-randconfig-002-20250906 (https://download.01.org/0day-ci/archive/20250907/202509070225.pVKkaaCr-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250907/202509070225.pVKkaaCr-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509070225.pVKkaaCr-lkp@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "smc_ib_sysctl_max_send_wr" [net/smc/smc.ko] undefined!
>> ERROR: modpost: "smc_ib_sysctl_max_recv_wr" [net/smc/smc.ko] undefined!

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


end of thread, other threads:[~2025-09-06 18:26 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-04 21:12 [PATCH net-next 0/2] net/smc: make wr buffer count configurable Halil Pasic
2025-09-04 21:12 ` [PATCH net-next 1/2] " Halil Pasic
2025-09-05  3:45   ` Dust Li
2025-09-05  9:00     ` Halil Pasic
2025-09-05 12:01       ` Halil Pasic
2025-09-05 14:12         ` Mahanta Jambigi
2025-09-05 14:22         ` Dust Li
2025-09-05 14:51           ` Dust Li
2025-09-05 21:05             ` Halil Pasic
2025-09-06 18:25   ` kernel test robot
2025-09-04 21:12 ` [PATCH net-next 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).