[PATCH net-next v3 0/2] net/smc: make wr buffer count configurable

linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net-next v3 0/2] net/smc: make wr buffer count configurable
@ 2025-09-21 21:44 Halil Pasic
  2025-09-21 21:44 ` [PATCH net-next v3 1/2] " Halil Pasic
  2025-09-21 21:44 ` [PATCH net-next v3 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
  0 siblings, 2 replies; 21+ messages in thread
From: Halil Pasic @ 2025-09-21 21:44 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
  Cc: Halil Pasic

The current value of SMC_WR_BUF_CNT is 16 which leads to heavy
contention on the wr_tx_wait workqueue of the SMC-R linkgroup and its
spinlock when many connections are competing for the work request
buffers. Currently up to 256 connections per linkgroup are supported.

To make things worse when finally a buffer becomes available and
smc_wr_tx_put_slot() signals the linkgroup's wr_tx_wait wq, because
WQ_FLAG_EXCLUSIVE is not used all the waiters get woken up, most of the
time a single one can proceed, and the rest is contending on the
spinlock of the wq to go to sleep again.

Addressing this by simply bumping SMC_WR_BUF_CNT to 256 was deemed
risky, because the large-ish physically continuous allocation could fail
and lead to TCP fall-backs. For reference see this discussion thread on
"[PATCH net-next] net/smc: increase SMC_WR_BUF_CNT" (in archive
https://lists.openwall.net/netdev/2024/11/05/186), which concludes with
the agreement to try to come up with something smarter, which is what
this series aims for.

Additionally if for some reason it is known that heavy contention is not
to be expected going with something like 256 work request buffers is
wasteful. To address these concerns make the number of work requests
configurable, and introduce a back-off logic with handles -ENOMEM form
smc_wr_alloc_link_mem() gracefully.
---

Changelog:
---------
v3:
 1) Renamed back sysctls from smcr_pref_*_wr to smcr_max_*_wr (Dust Li)
 2) Using link->wr_rx_buflen instead of SMC_WR_BUF_SIZE when allocating
    wr_rx_bufs to not break what was fixed with 
    commit 27ef6a9981fe ("net/smc: support SMC-R V2 for rdma devices
    with max_recv_sge equals to 1") (Dust Li)
 3) Removed unwanted 'this control' in the documentation in #1
    (Dust Li)
v2: https://lore.kernel.org/netdev/20250908220150.3329433-1-pasic@linux.ibm.com/
v1: https://lore.kernel.org/all/20250904211254.1057445-1-pasic@linux.ibm.com/

Halil Pasic (2):
  net/smc: make wr buffer count configurable
  net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully

 Documentation/networking/smc-sysctl.rst | 40 +++++++++++++++++++++++++
 include/net/netns/smc.h                 |  2 ++
 net/smc/smc_core.c                      | 34 ++++++++++++++-------
 net/smc/smc_core.h                      |  8 +++++
 net/smc/smc_ib.c                        |  7 ++---
 net/smc/smc_llc.c                       |  2 ++
 net/smc/smc_sysctl.c                    | 22 ++++++++++++++
 net/smc/smc_sysctl.h                    |  2 ++
 net/smc/smc_wr.c                        | 32 ++++++++++----------
 net/smc/smc_wr.h                        |  2 --
 10 files changed, 119 insertions(+), 32 deletions(-)

base-commit: 312e6f7676e63bbb9b81e5c68e580a9f776cc6f0
-- 
2.48.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-21 21:44 [PATCH net-next v3 0/2] net/smc: make wr buffer count configurable Halil Pasic
@ 2025-09-21 21:44 ` Halil Pasic
  2025-09-24 17:27   ` Sidraya Jayagond
                     ` (2 more replies)
  2025-09-21 21:44 ` [PATCH net-next v3 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
  1 sibling, 3 replies; 21+ messages in thread
From: Halil Pasic @ 2025-09-21 21:44 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
  Cc: Halil Pasic

Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.

While at it let us also remove a confusing comment that is either not
about the context in which it resides (describing
qp_attr.cap.max_send_wr and qp_attr.cap.max_recv_wr) or not applicable
any more when these values become configurable.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
---
 Documentation/networking/smc-sysctl.rst | 36 +++++++++++++++++++++++++
 include/net/netns/smc.h                 |  2 ++
 net/smc/smc_core.h                      |  6 +++++
 net/smc/smc_ib.c                        |  7 ++---
 net/smc/smc_llc.c                       |  2 ++
 net/smc/smc_sysctl.c                    | 22 +++++++++++++++
 net/smc/smc_sysctl.h                    |  2 ++
 net/smc/smc_wr.c                        | 32 +++++++++++-----------
 net/smc/smc_wr.h                        |  2 --
 9 files changed, 89 insertions(+), 22 deletions(-)

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index a874d007f2db..c94d750c7c84 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -71,3 +71,39 @@ smcr_max_conns_per_lgr - INTEGER
 	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
 
 	Default: 255
+
+smcr_max_send_wr - INTEGER
+	So called work request buffers are SMCR link (and RDMA queue pair) level
+	resources necessary for performing RDMA operations. Since up to 255
+	connections can share a link group and thus also a link and the number
+	of the work request buffers is decided when the link is allocated,
+	depending on the workload it can a bottleneck in a sense that threads
+	have to wait for work request buffers to become available. Before the
+	introduction of this control the maximal number of work request buffers
+	available on the send path used to be hard coded to 16. With this control
+	it becomes configurable. The acceptable range is between 2 and 2048.
+
+	Please be aware that all the buffers need to be allocated as a physically
+	continuous array in which each element is a single buffer and has the size
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	like before having this control.
+
+	Default: 16
+
+smcr_max_recv_wr - INTEGER
+	So called work request buffers are SMCR link (and RDMA queue pair) level
+	resources necessary for performing RDMA operations. Since up to 255
+	connections can share a link group and thus also a link and the number
+	of the work request buffers is decided when the link is allocated,
+	depending on the workload it can a bottleneck in a sense that threads
+	have to wait for work request buffers to become available. Before the
+	introduction of this control the maximal number of work request buffers
+	available on the receive path used to be hard coded to 16. With this control
+	it becomes configurable. The acceptable range is between 2 and 2048.
+
+	Please be aware that all the buffers need to be allocated as a physically
+	continuous array in which each element is a single buffer and has the size
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	like before having this control.
+
+	Default: 48
diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
index fc752a50f91b..6ceb12baec24 100644
--- a/include/net/netns/smc.h
+++ b/include/net/netns/smc.h
@@ -24,5 +24,7 @@ struct netns_smc {
 	int				sysctl_rmem;
 	int				sysctl_max_links_per_lgr;
 	int				sysctl_max_conns_per_lgr;
+	unsigned int			sysctl_smcr_max_send_wr;
+	unsigned int			sysctl_smcr_max_recv_wr;
 };
 #endif
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 48a1b1dcb576..ab2d15929cb2 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -33,6 +33,8 @@
 					 * distributions may modify it to a value between
 					 * 16-255 as needed.
 					 */
+#define SMCR_MAX_SEND_WR_DEF	16	/* Default number of work requests per send queue */
+#define SMCR_MAX_RECV_WR_DEF	48	/* Default number of work requests per recv queue */
 
 struct smc_lgr_list {			/* list of link group definition */
 	struct list_head	list;
@@ -361,6 +363,10 @@ struct smc_link_group {
 						/* max conn can be assigned to lgr */
 			u8			max_links;
 						/* max links can be added in lgr */
+			u16			max_send_wr;
+						/* number of WR buffers on send */
+			u16			max_recv_wr;
+						/* number of WR buffers on recv */
 		};
 		struct { /* SMC-D */
 			struct smcd_gid		peer_gid;
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 0052f02756eb..e8d35c22c525 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -669,11 +669,6 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 		.recv_cq = lnk->smcibdev->roce_cq_recv,
 		.srq = NULL,
 		.cap = {
-				/* include unsolicited rdma_writes as well,
-				 * there are max. 2 RDMA_WRITE per 1 WR_SEND
-				 */
-			.max_send_wr = SMC_WR_BUF_CNT * 3,
-			.max_recv_wr = SMC_WR_BUF_CNT * 3,
 			.max_send_sge = SMC_IB_MAX_SEND_SGE,
 			.max_recv_sge = lnk->wr_rx_sge_cnt,
 			.max_inline_data = 0,
@@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 	};
 	int rc;
 
+	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
+	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
 	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
 	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
 	if (IS_ERR(lnk->roce_qp))
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index f865c58c3aa7..f5d5eb617526 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -2157,6 +2157,8 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
 	init_waitqueue_head(&lgr->llc_msg_waiter);
 	init_rwsem(&lgr->llc_conf_mutex);
 	lgr->llc_testlink_time = READ_ONCE(net->smc.sysctl_smcr_testlink_time);
+	lgr->max_send_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_max_send_wr));
+	lgr->max_recv_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_max_recv_wr));
 }
 
 /* called after lgr was removed from lgr_list */
diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
index 2fab6456f765..7b2471904d04 100644
--- a/net/smc/smc_sysctl.c
+++ b/net/smc/smc_sysctl.c
@@ -29,6 +29,8 @@ static int links_per_lgr_min = SMC_LINKS_ADD_LNK_MIN;
 static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
 static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
 static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
+static unsigned int smcr_max_wr_min = 2;
+static unsigned int smcr_max_wr_max = 2048;
 
 static struct ctl_table smc_table[] = {
 	{
@@ -99,6 +101,24 @@ static struct ctl_table smc_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "smcr_max_send_wr",
+		.data		= &init_net.smc.sysctl_smcr_max_send_wr,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &smcr_max_wr_min,
+		.extra2		= &smcr_max_wr_max,
+	},
+	{
+		.procname	= "smcr_max_recv_wr",
+		.data		= &init_net.smc.sysctl_smcr_max_recv_wr,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &smcr_max_wr_min,
+		.extra2		= &smcr_max_wr_max,
+	},
 };
 
 int __net_init smc_sysctl_net_init(struct net *net)
@@ -130,6 +150,8 @@ int __net_init smc_sysctl_net_init(struct net *net)
 	WRITE_ONCE(net->smc.sysctl_rmem, net_smc_rmem_init);
 	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
 	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
+	net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
+	net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
 	/* disable handshake limitation by default */
 	net->smc.limit_smc_hs = 0;
 
diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h
index eb2465ae1e15..8538915af7af 100644
--- a/net/smc/smc_sysctl.h
+++ b/net/smc/smc_sysctl.h
@@ -25,6 +25,8 @@ static inline int smc_sysctl_net_init(struct net *net)
 	net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
 	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
 	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
+	net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
+	net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
 	return 0;
 }
 
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index b04a21b8c511..f5b2772414fd 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -34,6 +34,7 @@
 #define SMC_WR_MAX_POLL_CQE 10	/* max. # of compl. queue elements in 1 poll */
 
 #define SMC_WR_RX_HASH_BITS 4
+
 static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
 static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
 
@@ -547,9 +548,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 		    IB_QP_DEST_QPN,
 		    &init_attr);
 
-	lnk->wr_tx_cnt = min_t(size_t, SMC_WR_BUF_CNT,
+	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT * 3,
+	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -741,50 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
 int smc_wr_alloc_link_mem(struct smc_link *link)
 {
 	/* allocate link related memory */
-	link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL);
+	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
+				   SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, link->wr_rx_buflen,
+	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
-	link->wr_tx_ibs = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_ibs[0]),
-				  GFP_KERNEL);
+	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
+				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
-	link->wr_tx_rdmas = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
-	link->wr_tx_rdma_sges = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
 	if (!link->wr_tx_rdma_sges)
 		goto no_mem_wr_tx_rdmas;
-	link->wr_tx_sges = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_sges[0]),
+	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
 		goto no_mem_wr_tx_sges;
-	link->wr_tx_mask = bitmap_zalloc(SMC_WR_BUF_CNT, GFP_KERNEL);
+	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
 	if (!link->wr_tx_mask)
 		goto no_mem_wr_rx_sges;
-	link->wr_tx_pends = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_pends[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
-	link->wr_tx_compl = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_compl)
@@ -905,7 +907,7 @@ int smc_wr_create_link(struct smc_link *lnk)
 		goto dma_unmap;
 	}
 	smc_wr_init_sge(lnk);
-	bitmap_zero(lnk->wr_tx_mask, SMC_WR_BUF_CNT);
+	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
 	init_waitqueue_head(&lnk->wr_tx_wait);
 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
 	if (rc)
diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index f3008dda222a..aa4533af9122 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -19,8 +19,6 @@
 #include "smc.h"
 #include "smc_core.h"
 
-#define SMC_WR_BUF_CNT 16	/* # of ctrl buffers per link */
-
 #define SMC_WR_TX_WAIT_FREE_SLOT_TIME	(10 * HZ)
 
 #define SMC_WR_TX_SIZE 44 /* actual size of wr_send data (<=SMC_WR_BUF_SIZE) */
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-21 21:44 ` [PATCH net-next v3 1/2] " Halil Pasic
@ 2025-09-24 17:27   ` Sidraya Jayagond
  2025-09-25  9:27   ` Paolo Abeni
  2025-09-26  2:44   ` Guangguan Wang
  2 siblings, 0 replies; 21+ messages in thread
From: Sidraya Jayagond @ 2025-09-24 17:27 UTC (permalink / raw)
  To: Halil Pasic, Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Dust Li, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev,
	linux-doc, linux-kernel, linux-rdma, linux-s390



On 22/09/25 3:14 am, Halil Pasic wrote:
> Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
> SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
> get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.
> 
> While at it let us also remove a confusing comment that is either not
> about the context in which it resides (describing
> qp_attr.cap.max_send_wr and qp_attr.cap.max_recv_wr) or not applicable
> any more when these values become configurable.
> 
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
>  Documentation/networking/smc-sysctl.rst | 36 +++++++++++++++++++++++++
>  include/net/netns/smc.h                 |  2 ++
>  net/smc/smc_core.h                      |  6 +++++
>  net/smc/smc_ib.c                        |  7 ++---
>  net/smc/smc_llc.c                       |  2 ++
>  net/smc/smc_sysctl.c                    | 22 +++++++++++++++
>  net/smc/smc_sysctl.h                    |  2 ++
>  net/smc/smc_wr.c                        | 32 +++++++++++-----------
>  net/smc/smc_wr.h                        |  2 --
>  9 files changed, 89 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
> index a874d007f2db..c94d750c7c84 100644
> --- a/Documentation/networking/smc-sysctl.rst
> +++ b/Documentation/networking/smc-sysctl.rst
> @@ -71,3 +71,39 @@ smcr_max_conns_per_lgr - INTEGER
>  	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
>  
>  	Default: 255
> +
> +smcr_max_send_wr - INTEGER
> +	So called work request buffers are SMCR link (and RDMA queue pair) level
> +	resources necessary for performing RDMA operations. Since up to 255
> +	connections can share a link group and thus also a link and the number
> +	of the work request buffers is decided when the link is allocated,
> +	depending on the workload it can a bottleneck in a sense that threads
> +	have to wait for work request buffers to become available. Before the
> +	introduction of this control the maximal number of work request buffers
> +	available on the send path used to be hard coded to 16. With this control
> +	it becomes configurable. The acceptable range is between 2 and 2048.
> +
> +	Please be aware that all the buffers need to be allocated as a physically
> +	continuous array in which each element is a single buffer and has the size
> +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> +	like before having this control.
> +
> +	Default: 16
> +
> +smcr_max_recv_wr - INTEGER
> +	So called work request buffers are SMCR link (and RDMA queue pair) level
> +	resources necessary for performing RDMA operations. Since up to 255
> +	connections can share a link group and thus also a link and the number
> +	of the work request buffers is decided when the link is allocated,
> +	depending on the workload it can a bottleneck in a sense that threads
> +	have to wait for work request buffers to become available. Before the
> +	introduction of this control the maximal number of work request buffers
> +	available on the receive path used to be hard coded to 16. With this control
> +	it becomes configurable. The acceptable range is between 2 and 2048.
> +
> +	Please be aware that all the buffers need to be allocated as a physically
> +	continuous array in which each element is a single buffer and has the size
> +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> +	like before having this control.
> +
> +	Default: 48

Tuning default WR BUF count and max_conns per lgr together makes sense,
otherwise either memory is wasted or WRs queue up on a QP.
That exercise, however, can be done separately from this patchset.

> diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
> index fc752a50f91b..6ceb12baec24 100644
> --- a/include/net/netns/smc.h
> +++ b/include/net/netns/smc.h
> @@ -24,5 +24,7 @@ struct netns_smc {
>  	int				sysctl_rmem;
>  	int				sysctl_max_links_per_lgr;
>  	int				sysctl_max_conns_per_lgr;
> +	unsigned int			sysctl_smcr_max_send_wr;
> +	unsigned int			sysctl_smcr_max_recv_wr;
>  };
>  #endif
> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> index 48a1b1dcb576..ab2d15929cb2 100644
> --- a/net/smc/smc_core.h
> +++ b/net/smc/smc_core.h
> @@ -33,6 +33,8 @@
>  					 * distributions may modify it to a value between
>  					 * 16-255 as needed.
>  					 */
> +#define SMCR_MAX_SEND_WR_DEF	16	/* Default number of work requests per send queue */
> +#define SMCR_MAX_RECV_WR_DEF	48	/* Default number of work requests per recv queue */
>  
>  struct smc_lgr_list {			/* list of link group definition */
>  	struct list_head	list;
> @@ -361,6 +363,10 @@ struct smc_link_group {
>  						/* max conn can be assigned to lgr */
>  			u8			max_links;
>  						/* max links can be added in lgr */
> +			u16			max_send_wr;
> +						/* number of WR buffers on send */
> +			u16			max_recv_wr;
> +						/* number of WR buffers on recv */
>  		};
>  		struct { /* SMC-D */
>  			struct smcd_gid		peer_gid;
> diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
> index 0052f02756eb..e8d35c22c525 100644
> --- a/net/smc/smc_ib.c
> +++ b/net/smc/smc_ib.c
> @@ -669,11 +669,6 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
>  		.recv_cq = lnk->smcibdev->roce_cq_recv,
>  		.srq = NULL,
>  		.cap = {
> -				/* include unsolicited rdma_writes as well,
> -				 * there are max. 2 RDMA_WRITE per 1 WR_SEND
> -				 */
> -			.max_send_wr = SMC_WR_BUF_CNT * 3,
> -			.max_recv_wr = SMC_WR_BUF_CNT * 3,
>  			.max_send_sge = SMC_IB_MAX_SEND_SGE,
>  			.max_recv_sge = lnk->wr_rx_sge_cnt,
>  			.max_inline_data = 0,
> @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
>  	};
>  	int rc;
>  
> +	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
> +	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
>  	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
>  	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
>  	if (IS_ERR(lnk->roce_qp))
> diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> index f865c58c3aa7..f5d5eb617526 100644
> --- a/net/smc/smc_llc.c
> +++ b/net/smc/smc_llc.c
> @@ -2157,6 +2157,8 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
>  	init_waitqueue_head(&lgr->llc_msg_waiter);
>  	init_rwsem(&lgr->llc_conf_mutex);
>  	lgr->llc_testlink_time = READ_ONCE(net->smc.sysctl_smcr_testlink_time);
> +	lgr->max_send_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_max_send_wr));
> +	lgr->max_recv_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_max_recv_wr));
>  }
>  
>  /* called after lgr was removed from lgr_list */
> diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
> index 2fab6456f765..7b2471904d04 100644
> --- a/net/smc/smc_sysctl.c
> +++ b/net/smc/smc_sysctl.c
> @@ -29,6 +29,8 @@ static int links_per_lgr_min = SMC_LINKS_ADD_LNK_MIN;
>  static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
>  static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
>  static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
> +static unsigned int smcr_max_wr_min = 2;
> +static unsigned int smcr_max_wr_max = 2048;
>  
>  static struct ctl_table smc_table[] = {
>  	{
> @@ -99,6 +101,24 @@ static struct ctl_table smc_table[] = {
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= SYSCTL_ONE,
>  	},
> +	{
> +		.procname	= "smcr_max_send_wr",
> +		.data		= &init_net.smc.sysctl_smcr_max_send_wr,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= &smcr_max_wr_min,
> +		.extra2		= &smcr_max_wr_max,
> +	},
> +	{
> +		.procname	= "smcr_max_recv_wr",
> +		.data		= &init_net.smc.sysctl_smcr_max_recv_wr,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= &smcr_max_wr_min,
> +		.extra2		= &smcr_max_wr_max,
> +	},
>  };
>  
>  int __net_init smc_sysctl_net_init(struct net *net)
> @@ -130,6 +150,8 @@ int __net_init smc_sysctl_net_init(struct net *net)
>  	WRITE_ONCE(net->smc.sysctl_rmem, net_smc_rmem_init);
>  	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
>  	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
> +	net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
> +	net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
>  	/* disable handshake limitation by default */
>  	net->smc.limit_smc_hs = 0;
>  
> diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h
> index eb2465ae1e15..8538915af7af 100644
> --- a/net/smc/smc_sysctl.h
> +++ b/net/smc/smc_sysctl.h
> @@ -25,6 +25,8 @@ static inline int smc_sysctl_net_init(struct net *net)
>  	net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
>  	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
>  	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
> +	net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
> +	net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
>  	return 0;
>  }
>  
> diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> index b04a21b8c511..f5b2772414fd 100644
> --- a/net/smc/smc_wr.c
> +++ b/net/smc/smc_wr.c
> @@ -34,6 +34,7 @@
>  #define SMC_WR_MAX_POLL_CQE 10	/* max. # of compl. queue elements in 1 poll */
>  
>  #define SMC_WR_RX_HASH_BITS 4
> +
>  static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
>  static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
>  
> @@ -547,9 +548,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
>  		    IB_QP_DEST_QPN,
>  		    &init_attr);
>  
> -	lnk->wr_tx_cnt = min_t(size_t, SMC_WR_BUF_CNT,
> +	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
>  			       lnk->qp_attr.cap.max_send_wr);
> -	lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT * 3,
> +	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
>  			       lnk->qp_attr.cap.max_recv_wr);
>  }
>  
> @@ -741,50 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
>  int smc_wr_alloc_link_mem(struct smc_link *link)
>  {
>  	/* allocate link related memory */
> -	link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL);
> +	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
> +				   SMC_WR_BUF_SIZE, GFP_KERNEL);
>  	if (!link->wr_tx_bufs)
>  		goto no_mem;
> -	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, link->wr_rx_buflen,
> +	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
>  				   GFP_KERNEL);
>  	if (!link->wr_rx_bufs)
>  		goto no_mem_wr_tx_bufs;
> -	link->wr_tx_ibs = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_ibs[0]),
> -				  GFP_KERNEL);
> +	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
> +				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
>  	if (!link->wr_tx_ibs)
>  		goto no_mem_wr_rx_bufs;
> -	link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT * 3,
> +	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
>  				  sizeof(link->wr_rx_ibs[0]),
>  				  GFP_KERNEL);
>  	if (!link->wr_rx_ibs)
>  		goto no_mem_wr_tx_ibs;
> -	link->wr_tx_rdmas = kcalloc(SMC_WR_BUF_CNT,
> +	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
>  				    sizeof(link->wr_tx_rdmas[0]),
>  				    GFP_KERNEL);
>  	if (!link->wr_tx_rdmas)
>  		goto no_mem_wr_rx_ibs;
> -	link->wr_tx_rdma_sges = kcalloc(SMC_WR_BUF_CNT,
> +	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
>  					sizeof(link->wr_tx_rdma_sges[0]),
>  					GFP_KERNEL);
>  	if (!link->wr_tx_rdma_sges)
>  		goto no_mem_wr_tx_rdmas;
> -	link->wr_tx_sges = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_sges[0]),
> +	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
>  				   GFP_KERNEL);
>  	if (!link->wr_tx_sges)
>  		goto no_mem_wr_tx_rdma_sges;
> -	link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT * 3,
> +	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
>  				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
>  				   GFP_KERNEL);
>  	if (!link->wr_rx_sges)
>  		goto no_mem_wr_tx_sges;
> -	link->wr_tx_mask = bitmap_zalloc(SMC_WR_BUF_CNT, GFP_KERNEL);
> +	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
>  	if (!link->wr_tx_mask)
>  		goto no_mem_wr_rx_sges;
> -	link->wr_tx_pends = kcalloc(SMC_WR_BUF_CNT,
> +	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
>  				    sizeof(link->wr_tx_pends[0]),
>  				    GFP_KERNEL);
>  	if (!link->wr_tx_pends)
>  		goto no_mem_wr_tx_mask;
> -	link->wr_tx_compl = kcalloc(SMC_WR_BUF_CNT,
> +	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
>  				    sizeof(link->wr_tx_compl[0]),
>  				    GFP_KERNEL);
>  	if (!link->wr_tx_compl)
> @@ -905,7 +907,7 @@ int smc_wr_create_link(struct smc_link *lnk)
>  		goto dma_unmap;
>  	}
>  	smc_wr_init_sge(lnk);
> -	bitmap_zero(lnk->wr_tx_mask, SMC_WR_BUF_CNT);
> +	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
>  	init_waitqueue_head(&lnk->wr_tx_wait);
>  	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
>  	if (rc)
> diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
> index f3008dda222a..aa4533af9122 100644
> --- a/net/smc/smc_wr.h
> +++ b/net/smc/smc_wr.h
> @@ -19,8 +19,6 @@
>  #include "smc.h"
>  #include "smc_core.h"
>  
> -#define SMC_WR_BUF_CNT 16	/* # of ctrl buffers per link */
> -
>  #define SMC_WR_TX_WAIT_FREE_SLOT_TIME	(10 * HZ)
>  
>  #define SMC_WR_TX_SIZE 44 /* actual size of wr_send data (<=SMC_WR_BUF_SIZE) */

Code changes looks good to me.
Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-21 21:44 ` [PATCH net-next v3 1/2] " Halil Pasic
  2025-09-24 17:27   ` Sidraya Jayagond
@ 2025-09-25  9:27   ` Paolo Abeni
  2025-09-25 11:25     ` Halil Pasic
  2025-09-26  2:44   ` Guangguan Wang
  2 siblings, 1 reply; 21+ messages in thread
From: Paolo Abeni @ 2025-09-25  9:27 UTC (permalink / raw)
  To: Halil Pasic, Jakub Kicinski, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390

On 9/21/25 11:44 PM, Halil Pasic wrote:
> diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
> index a874d007f2db..c94d750c7c84 100644
> --- a/Documentation/networking/smc-sysctl.rst
> +++ b/Documentation/networking/smc-sysctl.rst
> @@ -71,3 +71,39 @@ smcr_max_conns_per_lgr - INTEGER
>  	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
>  
>  	Default: 255
> +
> +smcr_max_send_wr - INTEGER
> +	So called work request buffers are SMCR link (and RDMA queue pair) level
> +	resources necessary for performing RDMA operations. Since up to 255
> +	connections can share a link group and thus also a link and the number
> +	of the work request buffers is decided when the link is allocated,
> +	depending on the workload it can a bottleneck in a sense that threads

missing 'be' or 'become'           here^^

> +	have to wait for work request buffers to become available. Before the
> +	introduction of this control the maximal number of work request buffers
> +	available on the send path used to be hard coded to 16. With this control
> +	it becomes configurable. The acceptable range is between 2 and 2048.
> +
> +	Please be aware that all the buffers need to be allocated as a physically
> +	continuous array in which each element is a single buffer and has the size
> +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> +	like before having this control.
> +
> +	Default: 16
> +
> +smcr_max_recv_wr - INTEGER
> +	So called work request buffers are SMCR link (and RDMA queue pair) level
> +	resources necessary for performing RDMA operations. Since up to 255
> +	connections can share a link group and thus also a link and the number
> +	of the work request buffers is decided when the link is allocated,
> +	depending on the workload it can a bottleneck in a sense that threads

same                               here^^

[...]
> @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
>  	};
>  	int rc;
>  
> +	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
> +	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;

Possibly:

	cap = max(3 * lnk->lgr->max_send_wr, lnk->lgr->max_recv_wr);
	qp_attr.cap.max_send_wr = cap;
	qp_attr.cap.max_recv_wr = cap

to avoid assumption on `max_send_wr`, `max_recv_wr` relative values.

[...]
> diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h
> index eb2465ae1e15..8538915af7af 100644
> --- a/net/smc/smc_sysctl.h
> +++ b/net/smc/smc_sysctl.h
> @@ -25,6 +25,8 @@ static inline int smc_sysctl_net_init(struct net *net)
>  	net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
>  	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
>  	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
> +	net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
> +	net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
>  	return 0;
>  }
>  
> diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> index b04a21b8c511..f5b2772414fd 100644
> --- a/net/smc/smc_wr.c
> +++ b/net/smc/smc_wr.c
> @@ -34,6 +34,7 @@
>  #define SMC_WR_MAX_POLL_CQE 10	/* max. # of compl. queue elements in 1 poll */
>  
>  #define SMC_WR_RX_HASH_BITS 4
> +

Please avoid unrelated whitespace only changes.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-25  9:27   ` Paolo Abeni
@ 2025-09-25 11:25     ` Halil Pasic
  2025-09-27 22:55       ` Halil Pasic
  0 siblings, 1 reply; 21+ messages in thread
From: Halil Pasic @ 2025-09-25 11:25 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Jakub Kicinski, Simon Horman, D. Wythe, Dust Li, Sidraya Jayagond,
	Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev, linux-doc,
	linux-kernel, linux-rdma, linux-s390, Halil Pasic

On Thu, 25 Sep 2025 11:27:38 +0200
Paolo Abeni <pabeni@redhat.com> wrote:
[..]
> > +smcr_max_recv_wr - INTEGER
> > +	So called work request buffers are SMCR link (and RDMA queue pair) level
> > +	resources necessary for performing RDMA operations. Since up to 255
> > +	connections can share a link group and thus also a link and the number
> > +	of the work request buffers is decided when the link is allocated,
> > +	depending on the workload it can a bottleneck in a sense that threads  
> 
> same                               here^^

Sorry about those! Will fix for v4.

> 
> [...]
> > @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
> >  	};
> >  	int rc;
> >  
> > +	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
> > +	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;  
> 
> Possibly:
> 
> 	cap = max(3 * lnk->lgr->max_send_wr, lnk->lgr->max_recv_wr);
> 	qp_attr.cap.max_send_wr = cap;
> 	qp_attr.cap.max_recv_wr = cap
> 
> to avoid assumption on `max_send_wr`, `max_recv_wr` relative values.

Can you explain a little more. I'm happy to do the change, but I would
prefer to understand why is keeping qp_attr.cap.max_send_wr ==
qp_attr.cap.max_recv_wr better? But if you tell: "Just trust me!" I will.

[..]

> >  
> > diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> > index b04a21b8c511..f5b2772414fd 100644
> > --- a/net/smc/smc_wr.c
> > +++ b/net/smc/smc_wr.c
> > @@ -34,6 +34,7 @@
> >  #define SMC_WR_MAX_POLL_CQE 10	/* max. # of compl. queue elements in 1 poll */
> >  
> >  #define SMC_WR_RX_HASH_BITS 4
> > +  
> 
> Please avoid unrelated whitespace only changes.

Will fix  for v4. Really sorry!

Regards,
Halil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-25 11:25     ` Halil Pasic
@ 2025-09-27 22:55       ` Halil Pasic
  2025-09-28  2:02         ` Dust Li
  0 siblings, 1 reply; 21+ messages in thread
From: Halil Pasic @ 2025-09-27 22:55 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Jakub Kicinski, Simon Horman, D. Wythe, Dust Li, Sidraya Jayagond,
	Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev, linux-doc,
	linux-kernel, linux-rdma, linux-s390, Halil Pasic

On Thu, 25 Sep 2025 13:25:40 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:

> > [...]  
> > > @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
> > >  	};
> > >  	int rc;
> > >  
> > > +	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
> > > +	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;    
> > 
> > Possibly:
> > 
> > 	cap = max(3 * lnk->lgr->max_send_wr, lnk->lgr->max_recv_wr);
> > 	qp_attr.cap.max_send_wr = cap;
> > 	qp_attr.cap.max_recv_wr = cap
> > 
> > to avoid assumption on `max_send_wr`, `max_recv_wr` relative values.  
> 
> Can you explain a little more. I'm happy to do the change, but I would
> prefer to understand why is keeping qp_attr.cap.max_send_wr ==
> qp_attr.cap.max_recv_wr better? But if you tell: "Just trust me!" I will.

Due to a little accident we ended up having a private conversation
on this, which I'm going to sum up quickly.

Paolo stated that he has no strong preference and that I should at
least add a comment, which I will do for v4. 

Unfortunately I don't quite understand why qp_attr.cap.max_send_wr is 3
times the number of send WR buffers we allocate. My understanding
is that qp_attr.cap.max_send_wr is about the number of send WQEs.
I assume that qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr
is not something we would want to preserve.

Regards,
Halil 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-27 22:55       ` Halil Pasic
@ 2025-09-28  2:02         ` Dust Li
  2025-09-28  2:12           ` Dust Li
  2025-09-28  8:39           ` Halil Pasic
  0 siblings, 2 replies; 21+ messages in thread
From: Dust Li @ 2025-09-28  2:02 UTC (permalink / raw)
  To: Halil Pasic, Paolo Abeni
  Cc: Jakub Kicinski, Simon Horman, D. Wythe, Sidraya Jayagond,
	Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev, linux-doc,
	linux-kernel, linux-rdma, linux-s390

On 2025-09-28 00:55:15, Halil Pasic wrote:
>On Thu, 25 Sep 2025 13:25:40 +0200
>Halil Pasic <pasic@linux.ibm.com> wrote:
>
>> > [...]  
>> > > @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
>> > >  	};
>> > >  	int rc;
>> > >  
>> > > +	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
>> > > +	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;    
>> > 
>> > Possibly:
>> > 
>> > 	cap = max(3 * lnk->lgr->max_send_wr, lnk->lgr->max_recv_wr);
>> > 	qp_attr.cap.max_send_wr = cap;
>> > 	qp_attr.cap.max_recv_wr = cap
>> > 
>> > to avoid assumption on `max_send_wr`, `max_recv_wr` relative values.  
>> 
>> Can you explain a little more. I'm happy to do the change, but I would
>> prefer to understand why is keeping qp_attr.cap.max_send_wr ==
>> qp_attr.cap.max_recv_wr better? But if you tell: "Just trust me!" I will.
>
>Due to a little accident we ended up having a private conversation
>on this, which I'm going to sum up quickly.
>
>Paolo stated that he has no strong preference and that I should at
>least add a comment, which I will do for v4. 
>
>Unfortunately I don't quite understand why qp_attr.cap.max_send_wr is 3
>times the number of send WR buffers we allocate. My understanding
>is that qp_attr.cap.max_send_wr is about the number of send WQEs.

We have at most 2 RDMA Write for 1 RDMA send. So 3 times is necessary.
That is explained in the original comments. Maybe it's better to keep it.

```
.cap = {
                /* include unsolicited rdma_writes as well,
                 * there are max. 2 RDMA_WRITE per 1 WR_SEND
                 */
        .max_send_wr = SMC_WR_BUF_CNT * 3,
        .max_recv_wr = SMC_WR_BUF_CNT * 3,
        .max_send_sge = SMC_IB_MAX_SEND_SGE,
        .max_recv_sge = lnk->wr_rx_sge_cnt,
        .max_inline_data = 0,
},
```

>I assume that qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr
>is not something we would want to preserve.

IIUC, RDMA Write won't consume any RX wqe on the receive side, so I think
the .max_recv_wr can be SMC_WR_BUF_CNT if we don't use RDMA_WRITE_IMM.


Best regards,
Dust

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-28  2:02         ` Dust Li
@ 2025-09-28  2:12           ` Dust Li
  2025-09-28  8:39           ` Halil Pasic
  1 sibling, 0 replies; 21+ messages in thread
From: Dust Li @ 2025-09-28  2:12 UTC (permalink / raw)
  To: Halil Pasic, Paolo Abeni
  Cc: Jakub Kicinski, Simon Horman, D. Wythe, Sidraya Jayagond,
	Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev, linux-doc,
	linux-kernel, linux-rdma, linux-s390

On 2025-09-28 10:02:43, Dust Li wrote:
>On 2025-09-28 00:55:15, Halil Pasic wrote:
>>On Thu, 25 Sep 2025 13:25:40 +0200
>>Halil Pasic <pasic@linux.ibm.com> wrote:
>>
>>> > [...]  
>>> > > @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
>>> > >  	};
>>> > >  	int rc;
>>> > >  
>>> > > +	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
>>> > > +	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;    
>>> > 
>>> > Possibly:
>>> > 
>>> > 	cap = max(3 * lnk->lgr->max_send_wr, lnk->lgr->max_recv_wr);
>>> > 	qp_attr.cap.max_send_wr = cap;
>>> > 	qp_attr.cap.max_recv_wr = cap
>>> > 
>>> > to avoid assumption on `max_send_wr`, `max_recv_wr` relative values.  
>>> 
>>> Can you explain a little more. I'm happy to do the change, but I would
>>> prefer to understand why is keeping qp_attr.cap.max_send_wr ==
>>> qp_attr.cap.max_recv_wr better? But if you tell: "Just trust me!" I will.
>>
>>Due to a little accident we ended up having a private conversation
>>on this, which I'm going to sum up quickly.
>>
>>Paolo stated that he has no strong preference and that I should at
>>least add a comment, which I will do for v4. 
>>
>>Unfortunately I don't quite understand why qp_attr.cap.max_send_wr is 3
>>times the number of send WR buffers we allocate. My understanding
>>is that qp_attr.cap.max_send_wr is about the number of send WQEs.
>
>We have at most 2 RDMA Write for 1 RDMA send. So 3 times is necessary.
>That is explained in the original comments. Maybe it's better to keep it.
>
>```
>.cap = {
>                /* include unsolicited rdma_writes as well,
>                 * there are max. 2 RDMA_WRITE per 1 WR_SEND
>                 */
>        .max_send_wr = SMC_WR_BUF_CNT * 3,
>        .max_recv_wr = SMC_WR_BUF_CNT * 3,
>        .max_send_sge = SMC_IB_MAX_SEND_SGE,
>        .max_recv_sge = lnk->wr_rx_sge_cnt,
>        .max_inline_data = 0,
>},
>```
>
>>I assume that qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr
>>is not something we would want to preserve.
>
>IIUC, RDMA Write won't consume any RX wqe on the receive side, so I think
>the .max_recv_wr can be SMC_WR_BUF_CNT if we don't use RDMA_WRITE_IMM.

I kept thinking about this a bit more, and I realized that max_recv_wr
should be larger than SMC_WR_BUF_CNT.

Since receive WQEs are posted in a softirq context, their posting may be
delayed. Meanwhile, the sender might already have received the TX
completion (CQE) and continue sending new messages. In this case, if the
receiver’s post_recv() (i.e., posting of RX WQEs) is delayed, an RNR
(Receiver Not Ready) can easily occur.

Best regards,
Dust

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-28  2:02         ` Dust Li
  2025-09-28  2:12           ` Dust Li
@ 2025-09-28  8:39           ` Halil Pasic
  2025-09-28 11:42             ` Dust Li
  1 sibling, 1 reply; 21+ messages in thread
From: Halil Pasic @ 2025-09-28  8:39 UTC (permalink / raw)
  To: Dust Li
  Cc: Paolo Abeni, Jakub Kicinski, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Sun, 28 Sep 2025 10:02:43 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:

> >Unfortunately I don't quite understand why qp_attr.cap.max_send_wr is 3
> >times the number of send WR buffers we allocate. My understanding
> >is that qp_attr.cap.max_send_wr is about the number of send WQEs.  
> 
> We have at most 2 RDMA Write for 1 RDMA send. So 3 times is necessary.
> That is explained in the original comments. Maybe it's better to keep it.
> 
> ```
> .cap = {
>                 /* include unsolicited rdma_writes as well,
>                  * there are max. 2 RDMA_WRITE per 1 WR_SEND
>                  */

But what are "the unsolicited" rdma_writes? I have heard of
unsolicited receive, where the data is received without
consuming a WR previously put on the RQ on the receiving end, but
the concept of unsolicited rdma_writes eludes me completely.

I guess what you are trying to say, and what I understand is
that we first put the payload into the RMB of the remote, which
may require up 2 RDMA_WRITE operations, probably because we may
cross the end (and start) of the array that hosts the circular
buffer, and then we send a CDC message to update the cursor.

For the latter a  ib_post_send() is used in smc_wr_tx_send()
and AFAICT it consumes a WR from wr_tx_bufs. For the former
we consume a single wr_tx_rdmas which and each wr_tx_rdmas
has 2 WR allocated.

And all those WRs need a WQE. So I guess now I do understand
SMC_WR_BUF_CNT, but I find the comment still confusing like
hell because of these unsolicited rdma_writes.

Thank you for the explanation! It was indeed helpful! Let
me try to come up with a better comment -- unless somebody
manages to explain "unsolicited rdma_writes" to me.

>         .max_send_wr = SMC_WR_BUF_CNT * 3,
>         .max_recv_wr = SMC_WR_BUF_CNT * 3,
>         .max_send_sge = SMC_IB_MAX_SEND_SGE,
>         .max_recv_sge = lnk->wr_rx_sge_cnt,
>         .max_inline_data = 0,
> },
> ```
> 
> >I assume that qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr
> >is not something we would want to preserve.  
> 
> IIUC, RDMA Write won't consume any RX wqe on the receive side, so I think
> the .max_recv_wr can be SMC_WR_BUF_CNT if we don't use RDMA_WRITE_IMM.

Maybe we don't want to assume somebody else (another implementation)
would not use immediate data. I'm not sure. But I don't quite understand
the why the relationship between the send and the receive side either.

Regards,
Halil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-28  8:39           ` Halil Pasic
@ 2025-09-28 11:42             ` Dust Li
  2025-09-28 18:32               ` Halil Pasic
  0 siblings, 1 reply; 21+ messages in thread
From: Dust Li @ 2025-09-28 11:42 UTC (permalink / raw)
  To: Halil Pasic
  Cc: Paolo Abeni, Jakub Kicinski, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390

On 2025-09-28 10:39:51, Halil Pasic wrote:
>On Sun, 28 Sep 2025 10:02:43 +0800
>Dust Li <dust.li@linux.alibaba.com> wrote:
>
>> >Unfortunately I don't quite understand why qp_attr.cap.max_send_wr is 3
>> >times the number of send WR buffers we allocate. My understanding
>> >is that qp_attr.cap.max_send_wr is about the number of send WQEs.  
>> 
>> We have at most 2 RDMA Write for 1 RDMA send. So 3 times is necessary.
>> That is explained in the original comments. Maybe it's better to keep it.
>> 
>> ```
>> .cap = {
>>                 /* include unsolicited rdma_writes as well,
>>                  * there are max. 2 RDMA_WRITE per 1 WR_SEND
>>                  */
>
>But what are "the unsolicited" rdma_writes? I have heard of
>unsolicited receive, where the data is received without
>consuming a WR previously put on the RQ on the receiving end, but
>the concept of unsolicited rdma_writes eludes me completely.

unsolicited RDMA Writes means those RDMA Writes won't generate
CQEs on the local side. You can refer to:
https://www.rdmamojo.com/2014/05/27/solicited-event/

>
>I guess what you are trying to say, and what I understand is
>that we first put the payload into the RMB of the remote, which
>may require up 2 RDMA_WRITE operations, probably because we may
>cross the end (and start) of the array that hosts the circular
>buffer, and then we send a CDC message to update the cursor.
>
>For the latter a  ib_post_send() is used in smc_wr_tx_send()
>and AFAICT it consumes a WR from wr_tx_bufs. For the former
>we consume a single wr_tx_rdmas which and each wr_tx_rdmas
>has 2 WR allocated.

Right.

>
>And all those WRs need a WQE. So I guess now I do understand
>SMC_WR_BUF_CNT, but I find the comment still confusing like
>hell because of these unsolicited rdma_writes.
>
>Thank you for the explanation! It was indeed helpful! Let
>me try to come up with a better comment -- unless somebody
>manages to explain "unsolicited rdma_writes" to me.
>
>>         .max_send_wr = SMC_WR_BUF_CNT * 3,
>>         .max_recv_wr = SMC_WR_BUF_CNT * 3,
>>         .max_send_sge = SMC_IB_MAX_SEND_SGE,
>>         .max_recv_sge = lnk->wr_rx_sge_cnt,
>>         .max_inline_data = 0,
>> },
>> ```
>> 
>> >I assume that qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr
>> >is not something we would want to preserve.  
>> 
>> IIUC, RDMA Write won't consume any RX wqe on the receive side, so I think
>> the .max_recv_wr can be SMC_WR_BUF_CNT if we don't use RDMA_WRITE_IMM.
>
>Maybe we don't want to assume somebody else (another implementation)
>would not use immediate data. I'm not sure. But I don't quite understand
>the why the relationship between the send and the receive side either.

I missed something here. I sent an other email right after this to
explain my thoughts here:

    I kept thinking about this a bit more, and I realized that max_recv_wr
    should be larger than SMC_WR_BUF_CNT.

    Since receive WQEs are posted in a softirq context, their posting may be
    delayed. Meanwhile, the sender might already have received the TX
    completion (CQE) and continue sending new messages. In this case, if the
    receiver’s post_recv() (i.e., posting of RX WQEs) is delayed, an RNR
    (Receiver Not Ready) can easily occur.

Best regards,
Dust


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-28 11:42             ` Dust Li
@ 2025-09-28 18:32               ` Halil Pasic
  0 siblings, 0 replies; 21+ messages in thread
From: Halil Pasic @ 2025-09-28 18:32 UTC (permalink / raw)
  To: Dust Li
  Cc: Paolo Abeni, Jakub Kicinski, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Sun, 28 Sep 2025 19:42:54 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:

> >> We have at most 2 RDMA Write for 1 RDMA send. So 3 times is necessary.
> >> That is explained in the original comments. Maybe it's better to keep it.
> >> 
> >> ```
> >> .cap = {
> >>                 /* include unsolicited rdma_writes as well,
> >>                  * there are max. 2 RDMA_WRITE per 1 WR_SEND
> >>                  */  
> >
> >But what are "the unsolicited" rdma_writes? I have heard of
> >unsolicited receive, where the data is received without
> >consuming a WR previously put on the RQ on the receiving end, but
> >the concept of unsolicited rdma_writes eludes me completely.  
> 
> unsolicited RDMA Writes means those RDMA Writes won't generate
> CQEs on the local side. You can refer to:
> https://www.rdmamojo.com/2014/05/27/solicited-event/
> 
> >
> >I guess what you are trying to say, and what I understand is
> >that we first put the payload into the RMB of the remote, which
> >may require up 2 RDMA_WRITE operations, probably because we may
> >cross the end (and start) of the array that hosts the circular
> >buffer, and then we send a CDC message to update the cursor.
> >
> >For the latter a  ib_post_send() is used in smc_wr_tx_send()
> >and AFAICT it consumes a WR from wr_tx_bufs. For the former
> >we consume a single wr_tx_rdmas which and each wr_tx_rdmas
> >has 2 WR allocated.  
> 
> Right.

Thank you Dust Li! Unfortunately I have already spinned a v4. Let
me add back that comment, as for people knowledgeable enough it does
not appear to be confusing at all. I can try to improve that comment
and maybe add a new one on the reason why we do need more WRs on the
receive end than on the send end, after this series has been merged. Or
if you want to do it yourself, I'm happy with it as well. In the end
it is you who me helped get a better understanding of this :)

Thank you again!

Regards,
Halil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-21 21:44 ` [PATCH net-next v3 1/2] " Halil Pasic
  2025-09-24 17:27   ` Sidraya Jayagond
  2025-09-25  9:27   ` Paolo Abeni
@ 2025-09-26  2:44   ` Guangguan Wang
  2025-09-26 10:12     ` Halil Pasic
  2 siblings, 1 reply; 21+ messages in thread
From: Guangguan Wang @ 2025-09-26  2:44 UTC (permalink / raw)
  To: Halil Pasic
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390



在 2025/9/22 05:44, Halil Pasic 写道:
> Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
> SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
> get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.
> 
> While at it let us also remove a confusing comment that is either not
> about the context in which it resides (describing
> qp_attr.cap.max_send_wr and qp_attr.cap.max_recv_wr) or not applicable
> any more when these values become configurable.
> 
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
>  Documentation/networking/smc-sysctl.rst | 36 +++++++++++++++++++++++++
>  include/net/netns/smc.h                 |  2 ++
>  net/smc/smc_core.h                      |  6 +++++
>  net/smc/smc_ib.c                        |  7 ++---
>  net/smc/smc_llc.c                       |  2 ++
>  net/smc/smc_sysctl.c                    | 22 +++++++++++++++
>  net/smc/smc_sysctl.h                    |  2 ++
>  net/smc/smc_wr.c                        | 32 +++++++++++-----------
>  net/smc/smc_wr.h                        |  2 --
>  9 files changed, 89 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
> index a874d007f2db..c94d750c7c84 100644
> --- a/Documentation/networking/smc-sysctl.rst
> +++ b/Documentation/networking/smc-sysctl.rst
> @@ -71,3 +71,39 @@ smcr_max_conns_per_lgr - INTEGER
>  	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
>  
>  	Default: 255
> +
> +smcr_max_send_wr - INTEGER
> +	So called work request buffers are SMCR link (and RDMA queue pair) level
> +	resources necessary for performing RDMA operations. Since up to 255
> +	connections can share a link group and thus also a link and the number
> +	of the work request buffers is decided when the link is allocated,
> +	depending on the workload it can a bottleneck in a sense that threads
> +	have to wait for work request buffers to become available. Before the
> +	introduction of this control the maximal number of work request buffers
> +	available on the send path used to be hard coded to 16. With this control
> +	it becomes configurable. The acceptable range is between 2 and 2048.
> +
> +	Please be aware that all the buffers need to be allocated as a physically
> +	continuous array in which each element is a single buffer and has the size
> +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> +	like before having this control.
> +
> +	Default: 16
> +
> +smcr_max_recv_wr - INTEGER
> +	So called work request buffers are SMCR link (and RDMA queue pair) level
> +	resources necessary for performing RDMA operations. Since up to 255
> +	connections can share a link group and thus also a link and the number
> +	of the work request buffers is decided when the link is allocated,
> +	depending on the workload it can a bottleneck in a sense that threads
> +	have to wait for work request buffers to become available. Before the
> +	introduction of this control the maximal number of work request buffers
> +	available on the receive path used to be hard coded to 16. With this control
> +	it becomes configurable. The acceptable range is between 2 and 2048.
> +
> +	Please be aware that all the buffers need to be allocated as a physically
> +	continuous array in which each element is a single buffer and has the size
> +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> +	like before having this control.
> +
> +	Default: 48

Notice that the ratio of smcr_max_recv_wr to smcr_max_send_wr is set to 3:1, with the
intention of ensuring that the peer QP's smcr_max_recv_wr is three times the local QP's
smcr_max_send_wr and the local QP's smcr_max_recv_wr is three times the peer QP's
smcr_max_send_wr, rather than making the local QP's smcr_max_recv_wr three times its own
smcr_max_send_wr. The purpose of this design is to guarantee sufficient receive WRs on
the side to receive incoming data when peer QP doing RDMA sends. Otherwise, RNR (Receiver
Not Ready) may occur, leading to poor performance(RNR will drop the packet and retransmit
happens in the transport layer of the RDMA).

Let us guess a scenario that have multiple hosts, and the multiple hosts have different
smcr_max_send_wr and smcr_max_recv_wr configurations, mesh connections between these hosts.
It is difficult to ensure that the smcr_max_recv_wr/smcr_max_send_wr is 3:1 on the connected
QPs between these hosts, and it may even be hard to guarantee the smcr_max_recv_wr > smcr_max_send_wr
on the connected QPs between these hosts.

Therefore, I believe that if these values are made configurable, additional mechanisms must be
in place to prevent RNR from occurring. Otherwise we need to carefully configure smcr_max_recv_wr
and smcr_max_send_wr, or ensure that all hosts capable of establishing SMC-R connections are configured
smcr_max_recv_wr and smcr_max_send_wr with the same values.

Regards,
Guangguan Wang

> diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
> index fc752a50f91b..6ceb12baec24 100644
> --- a/include/net/netns/smc.h
> +++ b/include/net/netns/smc.h
> @@ -24,5 +24,7 @@ struct netns_smc {
>  	int				sysctl_rmem;
>  	int				sysctl_max_links_per_lgr;
>  	int				sysctl_max_conns_per_lgr;
> +	unsigned int			sysctl_smcr_max_send_wr;
> +	unsigned int			sysctl_smcr_max_recv_wr;
>  };
>  #endif
> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> index 48a1b1dcb576..ab2d15929cb2 100644
> --- a/net/smc/smc_core.h
> +++ b/net/smc/smc_core.h
> @@ -33,6 +33,8 @@
>  					 * distributions may modify it to a value between
>  					 * 16-255 as needed.
>  					 */
> +#define SMCR_MAX_SEND_WR_DEF	16	/* Default number of work requests per send queue */
> +#define SMCR_MAX_RECV_WR_DEF	48	/* Default number of work requests per recv queue */
>  
>  struct smc_lgr_list {			/* list of link group definition */
>  	struct list_head	list;
> @@ -361,6 +363,10 @@ struct smc_link_group {
>  						/* max conn can be assigned to lgr */
>  			u8			max_links;
>  						/* max links can be added in lgr */
> +			u16			max_send_wr;
> +						/* number of WR buffers on send */
> +			u16			max_recv_wr;
> +						/* number of WR buffers on recv */
>  		};
>  		struct { /* SMC-D */
>  			struct smcd_gid		peer_gid;
> diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
> index 0052f02756eb..e8d35c22c525 100644
> --- a/net/smc/smc_ib.c
> +++ b/net/smc/smc_ib.c
> @@ -669,11 +669,6 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
>  		.recv_cq = lnk->smcibdev->roce_cq_recv,
>  		.srq = NULL,
>  		.cap = {
> -				/* include unsolicited rdma_writes as well,
> -				 * there are max. 2 RDMA_WRITE per 1 WR_SEND
> -				 */
> -			.max_send_wr = SMC_WR_BUF_CNT * 3,
> -			.max_recv_wr = SMC_WR_BUF_CNT * 3,
>  			.max_send_sge = SMC_IB_MAX_SEND_SGE,
>  			.max_recv_sge = lnk->wr_rx_sge_cnt,
>  			.max_inline_data = 0,
> @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
>  	};
>  	int rc;
>  
> +	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
> +	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
>  	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
>  	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
>  	if (IS_ERR(lnk->roce_qp))
> diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> index f865c58c3aa7..f5d5eb617526 100644
> --- a/net/smc/smc_llc.c
> +++ b/net/smc/smc_llc.c
> @@ -2157,6 +2157,8 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
>  	init_waitqueue_head(&lgr->llc_msg_waiter);
>  	init_rwsem(&lgr->llc_conf_mutex);
>  	lgr->llc_testlink_time = READ_ONCE(net->smc.sysctl_smcr_testlink_time);
> +	lgr->max_send_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_max_send_wr));
> +	lgr->max_recv_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_max_recv_wr));
>  }
>  
>  /* called after lgr was removed from lgr_list */
> diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
> index 2fab6456f765..7b2471904d04 100644
> --- a/net/smc/smc_sysctl.c
> +++ b/net/smc/smc_sysctl.c
> @@ -29,6 +29,8 @@ static int links_per_lgr_min = SMC_LINKS_ADD_LNK_MIN;
>  static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
>  static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
>  static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
> +static unsigned int smcr_max_wr_min = 2;
> +static unsigned int smcr_max_wr_max = 2048;
>  
>  static struct ctl_table smc_table[] = {
>  	{
> @@ -99,6 +101,24 @@ static struct ctl_table smc_table[] = {
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= SYSCTL_ONE,
>  	},
> +	{
> +		.procname	= "smcr_max_send_wr",
> +		.data		= &init_net.smc.sysctl_smcr_max_send_wr,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= &smcr_max_wr_min,
> +		.extra2		= &smcr_max_wr_max,
> +	},
> +	{
> +		.procname	= "smcr_max_recv_wr",
> +		.data		= &init_net.smc.sysctl_smcr_max_recv_wr,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= &smcr_max_wr_min,
> +		.extra2		= &smcr_max_wr_max,
> +	},
>  };
>  
>  int __net_init smc_sysctl_net_init(struct net *net)
> @@ -130,6 +150,8 @@ int __net_init smc_sysctl_net_init(struct net *net)
>  	WRITE_ONCE(net->smc.sysctl_rmem, net_smc_rmem_init);
>  	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
>  	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
> +	net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
> +	net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
>  	/* disable handshake limitation by default */
>  	net->smc.limit_smc_hs = 0;
>  
> diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h
> index eb2465ae1e15..8538915af7af 100644
> --- a/net/smc/smc_sysctl.h
> +++ b/net/smc/smc_sysctl.h
> @@ -25,6 +25,8 @@ static inline int smc_sysctl_net_init(struct net *net)
>  	net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
>  	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
>  	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
> +	net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
> +	net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
>  	return 0;
>  }
>  
> diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> index b04a21b8c511..f5b2772414fd 100644
> --- a/net/smc/smc_wr.c
> +++ b/net/smc/smc_wr.c
> @@ -34,6 +34,7 @@
>  #define SMC_WR_MAX_POLL_CQE 10	/* max. # of compl. queue elements in 1 poll */
>  
>  #define SMC_WR_RX_HASH_BITS 4
> +
>  static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
>  static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
>  
> @@ -547,9 +548,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
>  		    IB_QP_DEST_QPN,
>  		    &init_attr);
>  
> -	lnk->wr_tx_cnt = min_t(size_t, SMC_WR_BUF_CNT,
> +	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
>  			       lnk->qp_attr.cap.max_send_wr);
> -	lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT * 3,
> +	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
>  			       lnk->qp_attr.cap.max_recv_wr);
>  }
>  
> @@ -741,50 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
>  int smc_wr_alloc_link_mem(struct smc_link *link)
>  {
>  	/* allocate link related memory */
> -	link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL);
> +	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
> +				   SMC_WR_BUF_SIZE, GFP_KERNEL);
>  	if (!link->wr_tx_bufs)
>  		goto no_mem;
> -	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, link->wr_rx_buflen,
> +	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
>  				   GFP_KERNEL);
>  	if (!link->wr_rx_bufs)
>  		goto no_mem_wr_tx_bufs;
> -	link->wr_tx_ibs = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_ibs[0]),
> -				  GFP_KERNEL);
> +	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
> +				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
>  	if (!link->wr_tx_ibs)
>  		goto no_mem_wr_rx_bufs;
> -	link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT * 3,
> +	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
>  				  sizeof(link->wr_rx_ibs[0]),
>  				  GFP_KERNEL);
>  	if (!link->wr_rx_ibs)
>  		goto no_mem_wr_tx_ibs;
> -	link->wr_tx_rdmas = kcalloc(SMC_WR_BUF_CNT,
> +	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
>  				    sizeof(link->wr_tx_rdmas[0]),
>  				    GFP_KERNEL);
>  	if (!link->wr_tx_rdmas)
>  		goto no_mem_wr_rx_ibs;
> -	link->wr_tx_rdma_sges = kcalloc(SMC_WR_BUF_CNT,
> +	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
>  					sizeof(link->wr_tx_rdma_sges[0]),
>  					GFP_KERNEL);
>  	if (!link->wr_tx_rdma_sges)
>  		goto no_mem_wr_tx_rdmas;
> -	link->wr_tx_sges = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_sges[0]),
> +	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
>  				   GFP_KERNEL);
>  	if (!link->wr_tx_sges)
>  		goto no_mem_wr_tx_rdma_sges;
> -	link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT * 3,
> +	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
>  				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
>  				   GFP_KERNEL);
>  	if (!link->wr_rx_sges)
>  		goto no_mem_wr_tx_sges;
> -	link->wr_tx_mask = bitmap_zalloc(SMC_WR_BUF_CNT, GFP_KERNEL);
> +	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
>  	if (!link->wr_tx_mask)
>  		goto no_mem_wr_rx_sges;
> -	link->wr_tx_pends = kcalloc(SMC_WR_BUF_CNT,
> +	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
>  				    sizeof(link->wr_tx_pends[0]),
>  				    GFP_KERNEL);
>  	if (!link->wr_tx_pends)
>  		goto no_mem_wr_tx_mask;
> -	link->wr_tx_compl = kcalloc(SMC_WR_BUF_CNT,
> +	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
>  				    sizeof(link->wr_tx_compl[0]),
>  				    GFP_KERNEL);
>  	if (!link->wr_tx_compl)
> @@ -905,7 +907,7 @@ int smc_wr_create_link(struct smc_link *lnk)
>  		goto dma_unmap;
>  	}
>  	smc_wr_init_sge(lnk);
> -	bitmap_zero(lnk->wr_tx_mask, SMC_WR_BUF_CNT);
> +	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
>  	init_waitqueue_head(&lnk->wr_tx_wait);
>  	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
>  	if (rc)
> diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
> index f3008dda222a..aa4533af9122 100644
> --- a/net/smc/smc_wr.h
> +++ b/net/smc/smc_wr.h
> @@ -19,8 +19,6 @@
>  #include "smc.h"
>  #include "smc_core.h"
>  
> -#define SMC_WR_BUF_CNT 16	/* # of ctrl buffers per link */
> -
>  #define SMC_WR_TX_WAIT_FREE_SLOT_TIME	(10 * HZ)
>  
>  #define SMC_WR_TX_SIZE 44 /* actual size of wr_send data (<=SMC_WR_BUF_SIZE) */


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-26  2:44   ` Guangguan Wang
@ 2025-09-26 10:12     ` Halil Pasic
  2025-09-26 10:30       ` Halil Pasic
  0 siblings, 1 reply; 21+ messages in thread
From: Halil Pasic @ 2025-09-26 10:12 UTC (permalink / raw)
  To: Guangguan Wang
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Fri, 26 Sep 2025 10:44:00 +0800
Guangguan Wang <guangguan.wang@linux.alibaba.com> wrote:

> > +
> > +smcr_max_send_wr - INTEGER
> > +	So called work request buffers are SMCR link (and RDMA queue pair) level
> > +	resources necessary for performing RDMA operations. Since up to 255
> > +	connections can share a link group and thus also a link and the number
> > +	of the work request buffers is decided when the link is allocated,
> > +	depending on the workload it can a bottleneck in a sense that threads
> > +	have to wait for work request buffers to become available. Before the
> > +	introduction of this control the maximal number of work request buffers
> > +	available on the send path used to be hard coded to 16. With this control
> > +	it becomes configurable. The acceptable range is between 2 and 2048.
> > +
> > +	Please be aware that all the buffers need to be allocated as a physically
> > +	continuous array in which each element is a single buffer and has the size
> > +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> > +	like before having this control.
> > +
> > +	Default: 16
> > +
> > +smcr_max_recv_wr - INTEGER
> > +	So called work request buffers are SMCR link (and RDMA queue pair) level
> > +	resources necessary for performing RDMA operations. Since up to 255
> > +	connections can share a link group and thus also a link and the number
> > +	of the work request buffers is decided when the link is allocated,
> > +	depending on the workload it can a bottleneck in a sense that threads
> > +	have to wait for work request buffers to become available. Before the
> > +	introduction of this control the maximal number of work request buffers
> > +	available on the receive path used to be hard coded to 16. With this control
> > +	it becomes configurable. The acceptable range is between 2 and 2048.
> > +
> > +	Please be aware that all the buffers need to be allocated as a physically
> > +	continuous array in which each element is a single buffer and has the size
> > +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> > +	like before having this control.
> > +
> > +	Default: 48  
> 
> Notice that the ratio of smcr_max_recv_wr to smcr_max_send_wr is set to 3:1, with the
> intention of ensuring that the peer QP's smcr_max_recv_wr is three times the local QP's
> smcr_max_send_wr and the local QP's smcr_max_recv_wr is three times the peer QP's
> smcr_max_send_wr, rather than making the local QP's smcr_max_recv_wr three times its own
> smcr_max_send_wr. The purpose of this design is to guarantee sufficient receive WRs on
> the side to receive incoming data when peer QP doing RDMA sends. Otherwise, RNR (Receiver
> Not Ready) may occur, leading to poor performance(RNR will drop the packet and retransmit
> happens in the transport layer of the RDMA).

Thank you Guangguan! I think we already had that discussion. 
> 
> Let us guess a scenario that have multiple hosts, and the multiple hosts have different
> smcr_max_send_wr and smcr_max_recv_wr configurations, mesh connections between these hosts.
> It is difficult to ensure that the smcr_max_recv_wr/smcr_max_send_wr is 3:1 on the connected
> QPs between these hosts, and it may even be hard to guarantee the smcr_max_recv_wr > smcr_max_send_wr
> on the connected QPs between these hosts.


It is not difficult IMHO. You just leave the knobs alone and you have
3:1 per default. If tuning is attempted that needs to be done carefully.
At least with SMC-R V2 there is this whole EID business, as well so it
is reasonable to assume that the environment can be tuned in a coherent
fashion. E.g. whoever is calling the EID could call use smcr_max_recv_wr:=32 and smcr_max_send_wr:=

> 
> Therefore, I believe that if these values are made configurable, additional mechanisms must be
> in place to prevent RNR from occurring. Otherwise we need to carefully configure smcr_max_recv_wr
> and smcr_max_send_wr, or ensure that all hosts capable of establishing SMC-R connections are configured
> smcr_max_recv_wr and smcr_max_send_wr with the same values.

Thank you for 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-26 10:12     ` Halil Pasic
@ 2025-09-26 10:30       ` Halil Pasic
  2025-09-28  3:05         ` Guangguan Wang
  0 siblings, 1 reply; 21+ messages in thread
From: Halil Pasic @ 2025-09-26 10:30 UTC (permalink / raw)
  To: Guangguan Wang
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Fri, 26 Sep 2025 12:12:49 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:

> On Fri, 26 Sep 2025 10:44:00 +0800
> Guangguan Wang <guangguan.wang@linux.alibaba.com> wrote:
> 
> > > +
> > > +smcr_max_send_wr - INTEGER
> > > +	So called work request buffers are SMCR link (and RDMA queue pair) level
> > > +	resources necessary for performing RDMA operations. Since up to 255
> > > +	connections can share a link group and thus also a link and the number
> > > +	of the work request buffers is decided when the link is allocated,
> > > +	depending on the workload it can a bottleneck in a sense that threads
> > > +	have to wait for work request buffers to become available. Before the
> > > +	introduction of this control the maximal number of work request buffers
> > > +	available on the send path used to be hard coded to 16. With this control
> > > +	it becomes configurable. The acceptable range is between 2 and 2048.
> > > +
> > > +	Please be aware that all the buffers need to be allocated as a physically
> > > +	continuous array in which each element is a single buffer and has the size
> > > +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> > > +	like before having this control.
> > > +
> > > +	Default: 16
> > > +
> > > +smcr_max_recv_wr - INTEGER
> > > +	So called work request buffers are SMCR link (and RDMA queue pair) level
> > > +	resources necessary for performing RDMA operations. Since up to 255
> > > +	connections can share a link group and thus also a link and the number
> > > +	of the work request buffers is decided when the link is allocated,
> > > +	depending on the workload it can a bottleneck in a sense that threads
> > > +	have to wait for work request buffers to become available. Before the
> > > +	introduction of this control the maximal number of work request buffers
> > > +	available on the receive path used to be hard coded to 16. With this control
> > > +	it becomes configurable. The acceptable range is between 2 and 2048.
> > > +
> > > +	Please be aware that all the buffers need to be allocated as a physically
> > > +	continuous array in which each element is a single buffer and has the size
> > > +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> > > +	like before having this control.
> > > +
> > > +	Default: 48    
> > 
> > Notice that the ratio of smcr_max_recv_wr to smcr_max_send_wr is set to 3:1, with the
> > intention of ensuring that the peer QP's smcr_max_recv_wr is three times the local QP's
> > smcr_max_send_wr and the local QP's smcr_max_recv_wr is three times the peer QP's
> > smcr_max_send_wr, rather than making the local QP's smcr_max_recv_wr three times its own
> > smcr_max_send_wr. The purpose of this design is to guarantee sufficient receive WRs on
> > the side to receive incoming data when peer QP doing RDMA sends. Otherwise, RNR (Receiver
> > Not Ready) may occur, leading to poor performance(RNR will drop the packet and retransmit
> > happens in the transport layer of the RDMA).  

Sorry this was sent accidentally by the virtue of unintentionally
pressing the shortcut for send while trying to actually edit! 

> 
> Thank you Guangguan! I think we already had that discussion. 

Please have a look at this thread
https://lore.kernel.org/all/4c5347ff-779b-48d7-8234-2aac9992f487@linux.ibm.com/

I'm aware of this, but I think this problem needs to be solved on
a different level.

> > 
> > Let us guess a scenario that have multiple hosts, and the multiple hosts have different
> > smcr_max_send_wr and smcr_max_recv_wr configurations, mesh connections between these hosts.
> > It is difficult to ensure that the smcr_max_recv_wr/smcr_max_send_wr is 3:1 on the connected
> > QPs between these hosts, and it may even be hard to guarantee the smcr_max_recv_wr > smcr_max_send_wr
> > on the connected QPs between these hosts.  
> 
> 
> It is not difficult IMHO. You just leave the knobs alone and you have
[..]

It is not difficult IMHO. You just leave the knobs alone and you have
3:1 per default. If tuning is attempted that needs to be done carefully.
At least with SMC-R V2 there is this whole EID business, as well so it
is reasonable to assume that the environment can be tuned in a coherent
fashion. E.g. whoever is calling the EID could call use smcr_max_recv_wr:=32
and smcr_max_send_wr:=96. 

> > 
> > Therefore, I believe that if these values are made configurable, additional mechanisms must be
> > in place to prevent RNR from occurring. Otherwise we need to carefully configure smcr_max_recv_wr
> > and smcr_max_send_wr, or ensure that all hosts capable of establishing SMC-R connections are configured
> > smcr_max_recv_wr and smcr_max_send_wr with the same values.  
> 

I'm in favor of adding such mechanisms on top of this. Do you have
something particular in mind? Unfortunately I'm not knowledgeable enough
in the area to know what mechanisms you may mean. But I guess it is
patches welcome as always! Currently I would encourage to users
to tune carefully. 

Sorry about that half baked answer before.

Regards,
Halil


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
  2025-09-26 10:30       ` Halil Pasic
@ 2025-09-28  3:05         ` Guangguan Wang
  0 siblings, 0 replies; 21+ messages in thread
From: Guangguan Wang @ 2025-09-28  3:05 UTC (permalink / raw)
  To: Halil Pasic
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390



在 2025/9/26 18:30, Halil Pasic 写道:
> On Fri, 26 Sep 2025 12:12:49 +0200
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
>> On Fri, 26 Sep 2025 10:44:00 +0800
>> Guangguan Wang <guangguan.wang@linux.alibaba.com> wrote:
>>
>>>
>>> Notice that the ratio of smcr_max_recv_wr to smcr_max_send_wr is set to 3:1, with the
>>> intention of ensuring that the peer QP's smcr_max_recv_wr is three times the local QP's
>>> smcr_max_send_wr and the local QP's smcr_max_recv_wr is three times the peer QP's
>>> smcr_max_send_wr, rather than making the local QP's smcr_max_recv_wr three times its own
>>> smcr_max_send_wr. The purpose of this design is to guarantee sufficient receive WRs on
>>> the side to receive incoming data when peer QP doing RDMA sends. Otherwise, RNR (Receiver
>>> Not Ready) may occur, leading to poor performance(RNR will drop the packet and retransmit
>>> happens in the transport layer of the RDMA).  
> 
> Sorry this was sent accidentally by the virtue of unintentionally
> pressing the shortcut for send while trying to actually edit! 
> 
>>
>> Thank you Guangguan! I think we already had that discussion. 
> 
> Please have a look at this thread
> https://lore.kernel.org/all/4c5347ff-779b-48d7-8234-2aac9992f487@linux.ibm.com/
> 
> I'm aware of this, but I think this problem needs to be solved on
> a different level.
> 
Oh, I see. Sorry for missing the previous discussion.

BTW, the RNR counter is the file like '/sys/class/infiniband/mlx5_0/ports/1/hw_counters/rnr_nak_retry_err'.

>>>
>>> Let us guess a scenario that have multiple hosts, and the multiple hosts have different
>>> smcr_max_send_wr and smcr_max_recv_wr configurations, mesh connections between these hosts.
>>> It is difficult to ensure that the smcr_max_recv_wr/smcr_max_send_wr is 3:1 on the connected
>>> QPs between these hosts, and it may even be hard to guarantee the smcr_max_recv_wr > smcr_max_send_wr
>>> on the connected QPs between these hosts.  
>>
>>
>> It is not difficult IMHO. You just leave the knobs alone and you have
> [..]
> 
> It is not difficult IMHO. You just leave the knobs alone and you have
> 3:1 per default. If tuning is attempted that needs to be done carefully.
> At least with SMC-R V2 there is this whole EID business, as well so it
> is reasonable to assume that the environment can be tuned in a coherent
> fashion. E.g. whoever is calling the EID could call use smcr_max_recv_wr:=32
> and smcr_max_send_wr:=96. 
> 
>>>
>>> Therefore, I believe that if these values are made configurable, additional mechanisms must be
>>> in place to prevent RNR from occurring. Otherwise we need to carefully configure smcr_max_recv_wr
>>> and smcr_max_send_wr, or ensure that all hosts capable of establishing SMC-R connections are configured
>>> smcr_max_recv_wr and smcr_max_send_wr with the same values.  
>>
> 
> I'm in favor of adding such mechanisms on top of this. Do you have
> something particular in mind? Unfortunately I'm not knowledgeable enough
> in the area to know what mechanisms you may mean. But I guess it is
> patches welcome as always! Currently I would encourage to users
> to tune carefully. 
> 

AFAIK, flow control is a usual way, maybe credit-based flow control is enough. Credit means the valid
counts of receive wr can be used. The receiver counts the credit every time post_recv, and advertises
credits to the connected sender at a certain frequency. The sender counts the credits advertised from
peer. The sender consumes a credit everytime post_send wr which will consume a receive wr in the receiver,
if have enough credits to consume. Otherwise the sender should hang the wr and should wait for the credits
advertised from peer. 

But this requires support at the SMC-R protocol level. And this also can be addressed as an enhancement.
I do not known if someone from Dust Li's team or someone from IBM has interests to pick this up.

Regards,
Guangguan Wang


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH net-next v3 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-21 21:44 [PATCH net-next v3 0/2] net/smc: make wr buffer count configurable Halil Pasic
  2025-09-21 21:44 ` [PATCH net-next v3 1/2] " Halil Pasic
@ 2025-09-21 21:44 ` Halil Pasic
  2025-09-24 17:28   ` Sidraya Jayagond
  2025-09-25  9:40   ` Paolo Abeni
  1 sibling, 2 replies; 21+ messages in thread
From: Halil Pasic @ 2025-09-21 21:44 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
  Cc: Halil Pasic

Currently if a -ENOMEM from smc_wr_alloc_link_mem() is handled by
giving up and going the way of a TCP fallback. This was reasonable
before the sizes of the allocations there were compile time constants
and reasonably small. But now those are actually configurable.

So instead of giving up, keep retrying with half of the requested
size unless we dip below the old static sizes -- then give up!

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
---
 Documentation/networking/smc-sysctl.rst |  8 ++++--
 net/smc/smc_core.c                      | 34 +++++++++++++++++--------
 net/smc/smc_core.h                      |  2 ++
 net/smc/smc_wr.c                        | 28 ++++++++++----------
 4 files changed, 46 insertions(+), 26 deletions(-)

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index c94d750c7c84..c8dbe7ac8bdf 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -85,7 +85,9 @@ smcr_max_send_wr - INTEGER
 
 	Please be aware that all the buffers need to be allocated as a physically
 	continuous array in which each element is a single buffer and has the size
-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
+	with half of the buffer count until it is ether successful or (unlikely)
+	we dip below the old hard coded value which is 16 where we give up much
 	like before having this control.
 
 	Default: 16
@@ -103,7 +105,9 @@ smcr_max_recv_wr - INTEGER
 
 	Please be aware that all the buffers need to be allocated as a physically
 	continuous array in which each element is a single buffer and has the size
-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
+	with half of the buffer count until it is ether successful or (unlikely)
+	we dip below the old hard coded value which is 16 where we give up much
 	like before having this control.
 
 	Default: 48
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 2a559a98541c..f8131b4dfcd6 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	lnk->clearing = 0;
 	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
 	lnk->link_id = smcr_next_link_id(lgr);
+	lnk->max_send_wr = lgr->max_send_wr;
+	lnk->max_recv_wr = lgr->max_recv_wr;
 	lnk->lgr = lgr;
 	smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
 	lnk->link_idx = link_idx;
@@ -836,27 +838,39 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	rc = smc_llc_link_init(lnk);
 	if (rc)
 		goto out;
-	rc = smc_wr_alloc_link_mem(lnk);
-	if (rc)
-		goto clear_llc_lnk;
 	rc = smc_ib_create_protection_domain(lnk);
 	if (rc)
-		goto free_link_mem;
-	rc = smc_ib_create_queue_pair(lnk);
-	if (rc)
-		goto dealloc_pd;
+		goto clear_llc_lnk;
+	do {
+		rc = smc_ib_create_queue_pair(lnk);
+		if (rc)
+			goto dealloc_pd;
+		rc = smc_wr_alloc_link_mem(lnk);
+		if (!rc)
+			break;
+		else if (rc != -ENOMEM) /* give up */
+			goto destroy_qp;
+		/* retry with smaller ... */
+		lnk->max_send_wr /= 2;
+		lnk->max_recv_wr /= 2;
+		/* ... unless droping below old SMC_WR_BUF_SIZE */
+		if (lnk->max_send_wr < 16 || lnk->max_recv_wr < 48)
+			goto destroy_qp;
+		smc_ib_destroy_queue_pair(lnk);
+	} while (1);
+
 	rc = smc_wr_create_link(lnk);
 	if (rc)
-		goto destroy_qp;
+		goto free_link_mem;
 	lnk->state = SMC_LNK_ACTIVATING;
 	return 0;
 
+free_link_mem:
+	smc_wr_free_link_mem(lnk);
 destroy_qp:
 	smc_ib_destroy_queue_pair(lnk);
 dealloc_pd:
 	smc_ib_dealloc_protection_domain(lnk);
-free_link_mem:
-	smc_wr_free_link_mem(lnk);
 clear_llc_lnk:
 	smc_llc_link_clear(lnk, false);
 out:
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index ab2d15929cb2..b9addf633d8a 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -174,6 +174,8 @@ struct smc_link {
 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
 	int			llc_testlink_time; /* testlink interval */
 	atomic_t		conn_cnt; /* connections on this link */
+	u16			max_send_wr;
+	u16			max_recv_wr;
 };
 
 /* For now we just allow one parallel link per link group. The SMC protocol
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index f5b2772414fd..6b14711f0c93 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -548,9 +548,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 		    IB_QP_DEST_QPN,
 		    &init_attr);
 
-	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
+	lnk->wr_tx_cnt = min_t(size_t, lnk->max_send_wr,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
+	lnk->wr_rx_cnt = min_t(size_t, lnk->max_recv_wr,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -742,51 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
 int smc_wr_alloc_link_mem(struct smc_link *link)
 {
 	/* allocate link related memory */
-	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_bufs = kcalloc(link->max_send_wr,
 				   SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
+	link->wr_rx_bufs = kcalloc(link->max_recv_wr, link->wr_rx_buflen,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
-	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_ibs = kcalloc(link->max_send_wr,
 				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
+	link->wr_rx_ibs = kcalloc(link->max_recv_wr,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
-	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_rdmas = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
-	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_rdma_sges = kcalloc(link->max_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
 	if (!link->wr_tx_rdma_sges)
 		goto no_mem_wr_tx_rdmas;
-	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
+	link->wr_tx_sges = kcalloc(link->max_send_wr, sizeof(link->wr_tx_sges[0]),
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
+	link->wr_rx_sges = kcalloc(link->max_recv_wr,
 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
 		goto no_mem_wr_tx_sges;
-	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
+	link->wr_tx_mask = bitmap_zalloc(link->max_send_wr, GFP_KERNEL);
 	if (!link->wr_tx_mask)
 		goto no_mem_wr_rx_sges;
-	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_pends = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_pends[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
-	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_compl = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_compl)
@@ -907,7 +907,7 @@ int smc_wr_create_link(struct smc_link *lnk)
 		goto dma_unmap;
 	}
 	smc_wr_init_sge(lnk);
-	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
+	bitmap_zero(lnk->wr_tx_mask, lnk->max_send_wr);
 	init_waitqueue_head(&lnk->wr_tx_wait);
 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
 	if (rc)
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-21 21:44 ` [PATCH net-next v3 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
@ 2025-09-24 17:28   ` Sidraya Jayagond
  2025-09-25  9:40   ` Paolo Abeni
  1 sibling, 0 replies; 21+ messages in thread
From: Sidraya Jayagond @ 2025-09-24 17:28 UTC (permalink / raw)
  To: Halil Pasic, Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Dust Li, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev,
	linux-doc, linux-kernel, linux-rdma, linux-s390



On 22/09/25 3:14 am, Halil Pasic wrote:
> Currently if a -ENOMEM from smc_wr_alloc_link_mem() is handled by
> giving up and going the way of a TCP fallback. This was reasonable
> before the sizes of the allocations there were compile time constants
> and reasonably small. But now those are actually configurable.
> 
> So instead of giving up, keep retrying with half of the requested
> size unless we dip below the old static sizes -- then give up!
> 
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
> Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
> ---
>  Documentation/networking/smc-sysctl.rst |  8 ++++--
>  net/smc/smc_core.c                      | 34 +++++++++++++++++--------
>  net/smc/smc_core.h                      |  2 ++
>  net/smc/smc_wr.c                        | 28 ++++++++++----------
>  4 files changed, 46 insertions(+), 26 deletions(-)
> 
> diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
> index c94d750c7c84..c8dbe7ac8bdf 100644
> --- a/Documentation/networking/smc-sysctl.rst
> +++ b/Documentation/networking/smc-sysctl.rst
> @@ -85,7 +85,9 @@ smcr_max_send_wr - INTEGER
>  
>  	Please be aware that all the buffers need to be allocated as a physically
>  	continuous array in which each element is a single buffer and has the size
> -	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
> +	with half of the buffer count until it is ether successful or (unlikely)
> +	we dip below the old hard coded value which is 16 where we give up much
>  	like before having this control.
>  
>  	Default: 16
> @@ -103,7 +105,9 @@ smcr_max_recv_wr - INTEGER
>  
>  	Please be aware that all the buffers need to be allocated as a physically
>  	continuous array in which each element is a single buffer and has the size
> -	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
> +	with half of the buffer count until it is ether successful or (unlikely)
> +	we dip below the old hard coded value which is 16 where we give up much
>  	like before having this control.
>  
>  	Default: 48
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index 2a559a98541c..f8131b4dfcd6 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>  	lnk->clearing = 0;
>  	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
>  	lnk->link_id = smcr_next_link_id(lgr);
> +	lnk->max_send_wr = lgr->max_send_wr;
> +	lnk->max_recv_wr = lgr->max_recv_wr;
>  	lnk->lgr = lgr;
>  	smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
>  	lnk->link_idx = link_idx;
> @@ -836,27 +838,39 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>  	rc = smc_llc_link_init(lnk);
>  	if (rc)
>  		goto out;
> -	rc = smc_wr_alloc_link_mem(lnk);
> -	if (rc)
> -		goto clear_llc_lnk;
>  	rc = smc_ib_create_protection_domain(lnk);
>  	if (rc)
> -		goto free_link_mem;
> -	rc = smc_ib_create_queue_pair(lnk);
> -	if (rc)
> -		goto dealloc_pd;
> +		goto clear_llc_lnk;
> +	do {
> +		rc = smc_ib_create_queue_pair(lnk);
> +		if (rc)
> +			goto dealloc_pd;
> +		rc = smc_wr_alloc_link_mem(lnk);
> +		if (!rc)
> +			break;
> +		else if (rc != -ENOMEM) /* give up */
> +			goto destroy_qp;
> +		/* retry with smaller ... */
> +		lnk->max_send_wr /= 2;
> +		lnk->max_recv_wr /= 2;
> +		/* ... unless droping below old SMC_WR_BUF_SIZE */
> +		if (lnk->max_send_wr < 16 || lnk->max_recv_wr < 48)
> +			goto destroy_qp;
> +		smc_ib_destroy_queue_pair(lnk);
> +	} while (1);
> +
>  	rc = smc_wr_create_link(lnk);
>  	if (rc)
> -		goto destroy_qp;
> +		goto free_link_mem;
>  	lnk->state = SMC_LNK_ACTIVATING;
>  	return 0;
>  
> +free_link_mem:
> +	smc_wr_free_link_mem(lnk);
>  destroy_qp:
>  	smc_ib_destroy_queue_pair(lnk);
>  dealloc_pd:
>  	smc_ib_dealloc_protection_domain(lnk);
> -free_link_mem:
> -	smc_wr_free_link_mem(lnk);
>  clear_llc_lnk:
>  	smc_llc_link_clear(lnk, false);
>  out:
> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> index ab2d15929cb2..b9addf633d8a 100644
> --- a/net/smc/smc_core.h
> +++ b/net/smc/smc_core.h
> @@ -174,6 +174,8 @@ struct smc_link {
>  	struct completion	llc_testlink_resp; /* wait for rx of testlink */
>  	int			llc_testlink_time; /* testlink interval */
>  	atomic_t		conn_cnt; /* connections on this link */
> +	u16			max_send_wr;
> +	u16			max_recv_wr;
>  };
>  
>  /* For now we just allow one parallel link per link group. The SMC protocol
> diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> index f5b2772414fd..6b14711f0c93 100644
> --- a/net/smc/smc_wr.c
> +++ b/net/smc/smc_wr.c
> @@ -548,9 +548,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
>  		    IB_QP_DEST_QPN,
>  		    &init_attr);
>  
> -	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
> +	lnk->wr_tx_cnt = min_t(size_t, lnk->max_send_wr,
>  			       lnk->qp_attr.cap.max_send_wr);
> -	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
> +	lnk->wr_rx_cnt = min_t(size_t, lnk->max_recv_wr,
>  			       lnk->qp_attr.cap.max_recv_wr);
>  }
>  
> @@ -742,51 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
>  int smc_wr_alloc_link_mem(struct smc_link *link)
>  {
>  	/* allocate link related memory */
> -	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
> +	link->wr_tx_bufs = kcalloc(link->max_send_wr,
>  				   SMC_WR_BUF_SIZE, GFP_KERNEL);
>  	if (!link->wr_tx_bufs)
>  		goto no_mem;
> -	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
> +	link->wr_rx_bufs = kcalloc(link->max_recv_wr, link->wr_rx_buflen,
>  				   GFP_KERNEL);
>  	if (!link->wr_rx_bufs)
>  		goto no_mem_wr_tx_bufs;
> -	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
> +	link->wr_tx_ibs = kcalloc(link->max_send_wr,
>  				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
>  	if (!link->wr_tx_ibs)
>  		goto no_mem_wr_rx_bufs;
> -	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
> +	link->wr_rx_ibs = kcalloc(link->max_recv_wr,
>  				  sizeof(link->wr_rx_ibs[0]),
>  				  GFP_KERNEL);
>  	if (!link->wr_rx_ibs)
>  		goto no_mem_wr_tx_ibs;
> -	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
> +	link->wr_tx_rdmas = kcalloc(link->max_send_wr,
>  				    sizeof(link->wr_tx_rdmas[0]),
>  				    GFP_KERNEL);
>  	if (!link->wr_tx_rdmas)
>  		goto no_mem_wr_rx_ibs;
> -	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
> +	link->wr_tx_rdma_sges = kcalloc(link->max_send_wr,
>  					sizeof(link->wr_tx_rdma_sges[0]),
>  					GFP_KERNEL);
>  	if (!link->wr_tx_rdma_sges)
>  		goto no_mem_wr_tx_rdmas;
> -	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
> +	link->wr_tx_sges = kcalloc(link->max_send_wr, sizeof(link->wr_tx_sges[0]),
>  				   GFP_KERNEL);
>  	if (!link->wr_tx_sges)
>  		goto no_mem_wr_tx_rdma_sges;
> -	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
> +	link->wr_rx_sges = kcalloc(link->max_recv_wr,
>  				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
>  				   GFP_KERNEL);
>  	if (!link->wr_rx_sges)
>  		goto no_mem_wr_tx_sges;
> -	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
> +	link->wr_tx_mask = bitmap_zalloc(link->max_send_wr, GFP_KERNEL);
>  	if (!link->wr_tx_mask)
>  		goto no_mem_wr_rx_sges;
> -	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
> +	link->wr_tx_pends = kcalloc(link->max_send_wr,
>  				    sizeof(link->wr_tx_pends[0]),
>  				    GFP_KERNEL);
>  	if (!link->wr_tx_pends)
>  		goto no_mem_wr_tx_mask;
> -	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
> +	link->wr_tx_compl = kcalloc(link->max_send_wr,
>  				    sizeof(link->wr_tx_compl[0]),
>  				    GFP_KERNEL);
>  	if (!link->wr_tx_compl)
> @@ -907,7 +907,7 @@ int smc_wr_create_link(struct smc_link *lnk)
>  		goto dma_unmap;
>  	}
>  	smc_wr_init_sge(lnk);
> -	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
> +	bitmap_zero(lnk->wr_tx_mask, lnk->max_send_wr);
>  	init_waitqueue_head(&lnk->wr_tx_wait);
>  	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
>  	if (rc)

Code changes looks good to me.
Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-21 21:44 ` [PATCH net-next v3 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
  2025-09-24 17:28   ` Sidraya Jayagond
@ 2025-09-25  9:40   ` Paolo Abeni
  2025-09-25 15:05     ` Halil Pasic
  1 sibling, 1 reply; 21+ messages in thread
From: Paolo Abeni @ 2025-09-25  9:40 UTC (permalink / raw)
  To: Halil Pasic, Jakub Kicinski, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390

On 9/21/25 11:44 PM, Halil Pasic wrote:
> @@ -836,27 +838,39 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>  	rc = smc_llc_link_init(lnk);
>  	if (rc)
>  		goto out;
> -	rc = smc_wr_alloc_link_mem(lnk);
> -	if (rc)
> -		goto clear_llc_lnk;
>  	rc = smc_ib_create_protection_domain(lnk);
>  	if (rc)
> -		goto free_link_mem;
> -	rc = smc_ib_create_queue_pair(lnk);
> -	if (rc)
> -		goto dealloc_pd;
> +		goto clear_llc_lnk;
> +	do {
> +		rc = smc_ib_create_queue_pair(lnk);
> +		if (rc)
> +			goto dealloc_pd;
> +		rc = smc_wr_alloc_link_mem(lnk);
> +		if (!rc)
> +			break;
> +		else if (rc != -ENOMEM) /* give up */
> +			goto destroy_qp;
> +		/* retry with smaller ... */
> +		lnk->max_send_wr /= 2;
> +		lnk->max_recv_wr /= 2;
> +		/* ... unless droping below old SMC_WR_BUF_SIZE */
> +		if (lnk->max_send_wr < 16 || lnk->max_recv_wr < 48)
> +			goto destroy_qp;

If i.e. smc.sysctl_smcr_max_recv_wr == 2048, and
smc.sysctl_smcr_max_send_wr == 16, the above loop can give-up a little
too early - after the first failure. What about changing the termination
condition to:

	lnk->max_send_wr < 16 && lnk->max_recv_wr < 48

and use 2 as a lower bound for both lnk->max_send_wr and lnk->max_recv_wr?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-25  9:40   ` Paolo Abeni
@ 2025-09-25 15:05     ` Halil Pasic
  2025-09-25 15:41       ` Paolo Abeni
  0 siblings, 1 reply; 21+ messages in thread
From: Halil Pasic @ 2025-09-25 15:05 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Jakub Kicinski, Simon Horman, D. Wythe, Dust Li, Sidraya Jayagond,
	Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev, linux-doc,
	linux-kernel, linux-rdma, linux-s390, Halil Pasic

On Thu, 25 Sep 2025 11:40:40 +0200
Paolo Abeni <pabeni@redhat.com> wrote:

> > +	do {
> > +		rc = smc_ib_create_queue_pair(lnk);
> > +		if (rc)
> > +			goto dealloc_pd;
> > +		rc = smc_wr_alloc_link_mem(lnk);
> > +		if (!rc)
> > +			break;
> > +		else if (rc != -ENOMEM) /* give up */
> > +			goto destroy_qp;
> > +		/* retry with smaller ... */
> > +		lnk->max_send_wr /= 2;
> > +		lnk->max_recv_wr /= 2;
> > +		/* ... unless droping below old SMC_WR_BUF_SIZE */
> > +		if (lnk->max_send_wr < 16 || lnk->max_recv_wr < 48)
> > +			goto destroy_qp;  
> 
> If i.e. smc.sysctl_smcr_max_recv_wr == 2048, and
> smc.sysctl_smcr_max_send_wr == 16, the above loop can give-up a little
> too early - after the first failure. What about changing the termination
> condition to:
> 
> 	lnk->max_send_wr < 16 && lnk->max_recv_wr < 48
> 
> and use 2 as a lower bound for both lnk->max_send_wr and lnk->max_recv_wr?

My intention was to preserve the ratio (max_recv_wr/max_send_wr) because 
I assume that the optimal ratio is workload dependent, and that scaling
both down at the same rate is easy to understand. And also to never dip
below the old values to avoid regressions due to even less WR buffers
than before the change.

I get your point, but as long as the ratio is kept I think the problem,
if considered a problem is there to stay. For example for 
smc.sysctl_smcr_max_recv_wr == 2048 and smc.sysctl_smcr_max_send_wr == 2
we would still give up after the first failure even with 2 as a lower
bound.

Let me also state that in my opinion giving up isn't that bad, because
SMC-R is supposed to be an optimization, and we still have the TCP
fallback. If we end up much worse than TCP because of back-off going
overboard, that is probably worse than just giving up on SMC-R and
going with TCP.

On the other hand, making the ratio change would make things more
complicated, less predictable, and also possibly take more iterations.
For example smc.sysctl_smcr_max_recv_wr == 2048 and
smc.sysctl_smcr_max_send_wr == 2000.

So I would prefer sticking to the current logic.

Regards,
Halil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-25 15:05     ` Halil Pasic
@ 2025-09-25 15:41       ` Paolo Abeni
  2025-09-25 21:46         ` Halil Pasic
  0 siblings, 1 reply; 21+ messages in thread
From: Paolo Abeni @ 2025-09-25 15:41 UTC (permalink / raw)
  To: Halil Pasic
  Cc: Jakub Kicinski, Simon Horman, D. Wythe, Dust Li, Sidraya Jayagond,
	Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev, linux-doc,
	linux-kernel, linux-rdma, linux-s390

On 9/25/25 5:05 PM, Halil Pasic wrote:
> On Thu, 25 Sep 2025 11:40:40 +0200
> Paolo Abeni <pabeni@redhat.com> wrote:
> 
>>> +	do {
>>> +		rc = smc_ib_create_queue_pair(lnk);
>>> +		if (rc)
>>> +			goto dealloc_pd;
>>> +		rc = smc_wr_alloc_link_mem(lnk);
>>> +		if (!rc)
>>> +			break;
>>> +		else if (rc != -ENOMEM) /* give up */
>>> +			goto destroy_qp;
>>> +		/* retry with smaller ... */
>>> +		lnk->max_send_wr /= 2;
>>> +		lnk->max_recv_wr /= 2;
>>> +		/* ... unless droping below old SMC_WR_BUF_SIZE */
>>> +		if (lnk->max_send_wr < 16 || lnk->max_recv_wr < 48)
>>> +			goto destroy_qp;  
>>
>> If i.e. smc.sysctl_smcr_max_recv_wr == 2048, and
>> smc.sysctl_smcr_max_send_wr == 16, the above loop can give-up a little
>> too early - after the first failure. What about changing the termination
>> condition to:
>>
>> 	lnk->max_send_wr < 16 && lnk->max_recv_wr < 48
>>
>> and use 2 as a lower bound for both lnk->max_send_wr and lnk->max_recv_wr?
> 
> My intention was to preserve the ratio (max_recv_wr/max_send_wr) because 
> I assume that the optimal ratio is workload dependent, and that scaling
> both down at the same rate is easy to understand. And also to never dip
> below the old values to avoid regressions due to even less WR buffers
> than before the change.
> 
> I get your point, but as long as the ratio is kept I think the problem,
> if considered a problem is there to stay. For example for 
> smc.sysctl_smcr_max_recv_wr == 2048 and smc.sysctl_smcr_max_send_wr == 2
> we would still give up after the first failure even with 2 as a lower
> bound.
> 
> Let me also state that in my opinion giving up isn't that bad, because
> SMC-R is supposed to be an optimization, and we still have the TCP
> fallback. If we end up much worse than TCP because of back-off going
> overboard, that is probably worse than just giving up on SMC-R and
> going with TCP.
> 
> On the other hand, making the ratio change would make things more
> complicated, less predictable, and also possibly take more iterations.
> For example smc.sysctl_smcr_max_recv_wr == 2048 and
> smc.sysctl_smcr_max_send_wr == 2000.
> 
> So I would prefer sticking to the current logic.

Ok, makes sense to me. Please capture some of the above either in the
commit message or in a code comment.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next v3 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-25 15:41       ` Paolo Abeni
@ 2025-09-25 21:46         ` Halil Pasic
  0 siblings, 0 replies; 21+ messages in thread
From: Halil Pasic @ 2025-09-25 21:46 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Jakub Kicinski, Simon Horman, D. Wythe, Dust Li, Sidraya Jayagond,
	Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev, linux-doc,
	linux-kernel, linux-rdma, linux-s390, Halil Pasic

On Thu, 25 Sep 2025 17:41:29 +0200
Paolo Abeni <pabeni@redhat.com> wrote:

> > So I would prefer sticking to the current logic.  
> 
> Ok, makes sense to me. Please capture some of the above either in the
> commit message or in a code comment.

I will make sure to add to the commit message that the ratio is preserved
when backing off and spell out the old values the 'we give up condition'.
Maybe add some more words on possible implications for v4.

Thank you very much for your review!

Regards,
Halil

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-09-28 18:32 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-21 21:44 [PATCH net-next v3 0/2] net/smc: make wr buffer count configurable Halil Pasic
2025-09-21 21:44 ` [PATCH net-next v3 1/2] " Halil Pasic
2025-09-24 17:27   ` Sidraya Jayagond
2025-09-25  9:27   ` Paolo Abeni
2025-09-25 11:25     ` Halil Pasic
2025-09-27 22:55       ` Halil Pasic
2025-09-28  2:02         ` Dust Li
2025-09-28  2:12           ` Dust Li
2025-09-28  8:39           ` Halil Pasic
2025-09-28 11:42             ` Dust Li
2025-09-28 18:32               ` Halil Pasic
2025-09-26  2:44   ` Guangguan Wang
2025-09-26 10:12     ` Halil Pasic
2025-09-26 10:30       ` Halil Pasic
2025-09-28  3:05         ` Guangguan Wang
2025-09-21 21:44 ` [PATCH net-next v3 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
2025-09-24 17:28   ` Sidraya Jayagond
2025-09-25  9:40   ` Paolo Abeni
2025-09-25 15:05     ` Halil Pasic
2025-09-25 15:41       ` Paolo Abeni
2025-09-25 21:46         ` Halil Pasic

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).