netdev.vger.kernel.org archive mirror
* [PATCH net-next v2 0/2] net/smc: make wr buffer count configurable
@ 2025-09-08 22:01 Halil Pasic
  2025-09-08 22:01 ` [PATCH net-next v2 1/2] " Halil Pasic
  2025-09-08 22:01 ` [PATCH net-next v2 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
  0 siblings, 2 replies; 9+ messages in thread
From: Halil Pasic @ 2025-09-08 22:01 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
  Cc: Halil Pasic

The current value of SMC_WR_BUF_CNT is 16 which leads to heavy
contention on the wr_tx_wait workqueue of the SMC-R linkgroup and its
spinlock when many connections are competing for the work request
buffers. Currently up to 256 connections per linkgroup are supported.

To make things worse, when a buffer finally becomes available and
smc_wr_tx_put_slot() signals the linkgroup's wr_tx_wait wq, all the
waiters get woken up because WQ_FLAG_EXCLUSIVE is not used. Most of
the time only a single one can proceed, and the rest contend on the
spinlock of the wq just to go back to sleep again.
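For illustration, the send-side slot accounting the waiters above are
queuing for can be sketched as a plain userspace C model (simplified;
the names mirror, but are not, the kernel's wr_tx_mask machinery):

```c
#include <assert.h>

#define WR_BUF_CNT 16 /* the current SMC_WR_BUF_CNT */

static unsigned int wr_tx_mask; /* bit i set = send WR buffer i in use */

/* Grab a free send slot; returns its index, or -1 when all slots are
 * busy and the caller would have to sleep on wr_tx_wait. */
static int wr_tx_get_slot(void)
{
	int i;

	for (i = 0; i < WR_BUF_CNT; i++) {
		if (!(wr_tx_mask & (1u << i))) {
			wr_tx_mask |= 1u << i;
			return i;
		}
	}
	return -1;
}

/* Release a slot; the real smc_wr_tx_put_slot() then wakes *all*
 * waiters, which is the thundering herd described above. */
static void wr_tx_put_slot(int i)
{
	wr_tx_mask &= ~(1u << i);
}
```

With up to 256 connections per linkgroup but only 16 slots, the -1
(sleep) case is the common one under load.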

Addressing this by simply bumping SMC_WR_BUF_CNT to 256 was deemed
risky, because the large-ish physically contiguous allocation could fail
and lead to TCP fallbacks. For reference see this discussion thread on
"[PATCH net-next] net/smc: increase SMC_WR_BUF_CNT" (in archive
https://lists.openwall.net/netdev/2024/11/05/186), which concludes with
the agreement to try to come up with something smarter, which is what
this series aims for.

Additionally, if for some reason it is known that heavy contention is
not to be expected, going with something like 256 work request buffers
is wasteful. To address these concerns, make the number of work requests
configurable, and introduce a back-off logic which handles -ENOMEM from
smc_wr_alloc_link_mem() gracefully.
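The back-off can be sketched as the following userspace model (a
simplified stand-in, not the kernel code; try_alloc() here is a
hypothetical substitute for smc_wr_alloc_link_mem() that fails above
an arbitrary budget):

```c
#include <assert.h>
#include <errno.h>

#define OLD_SEND_WR 16	/* old hard-coded SMC_WR_BUF_CNT */
#define OLD_RECV_WR 48	/* old hard-coded 3 * SMC_WR_BUF_CNT */

/* Hypothetical stand-in for smc_wr_alloc_link_mem(): pretend anything
 * above 'budget' total WRs fails with -ENOMEM. */
static int try_alloc(unsigned int send_wr, unsigned int recv_wr,
		     unsigned int budget)
{
	return (send_wr + recv_wr > budget) ? -ENOMEM : 0;
}

/* Keep halving the preferred counts on -ENOMEM; give up once either
 * count would drop below the old static size. */
static int alloc_with_backoff(unsigned int *send_wr, unsigned int *recv_wr,
			      unsigned int budget)
{
	do {
		if (!try_alloc(*send_wr, *recv_wr, budget))
			return 0;
		*send_wr /= 2;
		*recv_wr /= 2;
	} while (*send_wr >= OLD_SEND_WR && *recv_wr >= OLD_RECV_WR);
	return -ENOMEM;
}
```

E.g. a request for 256/768 WRs against a budget of 300 settles at
64/192 after two halvings, while a budget below the old static sizes
still ends in -ENOMEM, i.e. the pre-series behavior.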

Changelog:
v2:
 1) Fixed https://lore-kernel.gnuweeb.org/linux-doc/202509070225.pVKkaaCr-lkp@intel.com/
 2) Inspired by 1) made the sysctls namespaced and moved the backing
    variables into struct netns_smc 
 3) Use tabs instead of spaces for alignment (Dust Li)
 4) Renamed the sysctls from smcr_max_*_wr to smcr_pref_*_wr (along with
    the affiliated fields/variables) based on the discussion with Dust Li
    (although maximal is mathematically accurate, preferred is more
    fitting IMHO). Should the community decide that max was better, I can
    roll this change back very quickly
 5) Removed Wenjia's r-b from patch 1 as quite a few things changed
 6) Added the r-b from Mahanta I forgot to add in v1 (which remains the
    same modulo the rename)
v1: https://lore.kernel.org/all/20250904211254.1057445-1-pasic@linux.ibm.com/

Halil Pasic (2):
  net/smc: make wr buffer count configurable
  net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully

 Documentation/networking/smc-sysctl.rst | 40 +++++++++++++++++++++++++
 include/net/netns/smc.h                 |  2 ++
 net/smc/smc_core.c                      | 34 ++++++++++++++-------
 net/smc/smc_core.h                      |  8 +++++
 net/smc/smc_ib.c                        |  7 ++---
 net/smc/smc_llc.c                       |  2 ++
 net/smc/smc_sysctl.c                    | 22 ++++++++++++++
 net/smc/smc_sysctl.h                    |  2 ++
 net/smc/smc_wr.c                        | 32 ++++++++++----------
 net/smc/smc_wr.h                        |  2 --
 10 files changed, 119 insertions(+), 32 deletions(-)


base-commit: f3883b1ea5a8f31df4eba1c2cb5196e3a249a09e
-- 
2.48.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH net-next v2 1/2] net/smc: make wr buffer count configurable
  2025-09-08 22:01 [PATCH net-next v2 0/2] net/smc: make wr buffer count configurable Halil Pasic
@ 2025-09-08 22:01 ` Halil Pasic
  2025-09-09  3:00   ` Dust Li
  2025-09-08 22:01 ` [PATCH net-next v2 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
  1 sibling, 1 reply; 9+ messages in thread
From: Halil Pasic @ 2025-09-08 22:01 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
  Cc: Halil Pasic

Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
get replaced with lgr->pref_send_wr and lgr->max_recv_wr respectively.

While at it let us also remove a confusing comment that is either not
about the context in which it resides (describing
qp_attr.cap.pref_send_wr and qp_attr.cap.max_recv_wr) or not applicable
any more when these values become configurable.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
---
 Documentation/networking/smc-sysctl.rst | 37 +++++++++++++++++++++++++
 include/net/netns/smc.h                 |  2 ++
 net/smc/smc_core.h                      |  6 ++++
 net/smc/smc_ib.c                        |  7 ++---
 net/smc/smc_llc.c                       |  2 ++
 net/smc/smc_sysctl.c                    | 22 +++++++++++++++
 net/smc/smc_sysctl.h                    |  2 ++
 net/smc/smc_wr.c                        | 32 +++++++++++----------
 net/smc/smc_wr.h                        |  2 --
 9 files changed, 90 insertions(+), 22 deletions(-)

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index a874d007f2db..d533830df28f 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -71,3 +71,40 @@ smcr_max_conns_per_lgr - INTEGER
 	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
 
 	Default: 255
+
+smcr_pref_send_wr - INTEGER
+	So called work request buffers are SMCR link (and RDMA queue pair) level
+	resources necessary for performing RDMA operations. Since up to 255
+	connections can share a link group and thus also a link and the number
+	of the work request buffers is decided when the link is allocated,
+	depending on the workload it can be a bottleneck in the sense that threads
+	have to wait for work request buffers to become available. Before the
+	introduction of this control the maximal number of work request buffers
+	available on the send path used to be hard coded to 16. With this control
+	it becomes configurable. The acceptable range is between 2 and 2048.
+
+	Please be aware that all the buffers need to be allocated as a physically
+	contiguous array in which each element is a single buffer and has the size
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	like before having this control.
+	this control.
+
+	Default: 16
+
+smcr_pref_recv_wr - INTEGER
+	So called work request buffers are SMCR link (and RDMA queue pair) level
+	resources necessary for performing RDMA operations. Since up to 255
+	connections can share a link group and thus also a link and the number
+	of the work request buffers is decided when the link is allocated,
+	depending on the workload it can be a bottleneck in the sense that threads
+	have to wait for work request buffers to become available. Before the
+	introduction of this control the maximal number of work request buffers
+	available on the receive path used to be hard coded to 48. With this control
+	it becomes configurable. The acceptable range is between 2 and 2048.
+
+	Please be aware that all the buffers need to be allocated as a physically
+	contiguous array in which each element is a single buffer and has the size
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	like before having this control.
+
+	Default: 48
diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
index fc752a50f91b..830817fc7fd7 100644
--- a/include/net/netns/smc.h
+++ b/include/net/netns/smc.h
@@ -24,5 +24,7 @@ struct netns_smc {
 	int				sysctl_rmem;
 	int				sysctl_max_links_per_lgr;
 	int				sysctl_max_conns_per_lgr;
+	unsigned int			sysctl_smcr_pref_send_wr;
+	unsigned int			sysctl_smcr_pref_recv_wr;
 };
 #endif
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 48a1b1dcb576..78d5bcefa1b8 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -33,6 +33,8 @@
 					 * distributions may modify it to a value between
 					 * 16-255 as needed.
 					 */
+#define SMCR_MAX_SEND_WR_DEF	16	/* Default number of work requests per send queue */
+#define SMCR_MAX_RECV_WR_DEF	48	/* Default number of work requests per recv queue */
 
 struct smc_lgr_list {			/* list of link group definition */
 	struct list_head	list;
@@ -361,6 +363,10 @@ struct smc_link_group {
 						/* max conn can be assigned to lgr */
 			u8			max_links;
 						/* max links can be added in lgr */
+			u16			pref_send_wr;
+						/* number of WR buffers on send */
+			u16			pref_recv_wr;
+						/* number of WR buffers on recv */
 		};
 		struct { /* SMC-D */
 			struct smcd_gid		peer_gid;
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 0052f02756eb..2f8f214fc634 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -669,11 +669,6 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 		.recv_cq = lnk->smcibdev->roce_cq_recv,
 		.srq = NULL,
 		.cap = {
-				/* include unsolicited rdma_writes as well,
-				 * there are max. 2 RDMA_WRITE per 1 WR_SEND
-				 */
-			.max_send_wr = SMC_WR_BUF_CNT * 3,
-			.max_recv_wr = SMC_WR_BUF_CNT * 3,
 			.max_send_sge = SMC_IB_MAX_SEND_SGE,
 			.max_recv_sge = lnk->wr_rx_sge_cnt,
 			.max_inline_data = 0,
@@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 	};
 	int rc;
 
+	qp_attr.cap.max_send_wr = 3 * lnk->lgr->pref_send_wr;
+	qp_attr.cap.max_recv_wr = lnk->lgr->pref_recv_wr;
 	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
 	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
 	if (IS_ERR(lnk->roce_qp))
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index f865c58c3aa7..1098bdc3557b 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -2157,6 +2157,8 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
 	init_waitqueue_head(&lgr->llc_msg_waiter);
 	init_rwsem(&lgr->llc_conf_mutex);
 	lgr->llc_testlink_time = READ_ONCE(net->smc.sysctl_smcr_testlink_time);
+	lgr->pref_send_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_pref_send_wr));
+	lgr->pref_recv_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_pref_recv_wr));
 }
 
 /* called after lgr was removed from lgr_list */
diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
index 2fab6456f765..f320443e563b 100644
--- a/net/smc/smc_sysctl.c
+++ b/net/smc/smc_sysctl.c
@@ -29,6 +29,8 @@ static int links_per_lgr_min = SMC_LINKS_ADD_LNK_MIN;
 static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
 static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
 static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
+static unsigned int smcr_max_wr_min = 2;
+static unsigned int smcr_max_wr_max = 2048;
 
 static struct ctl_table smc_table[] = {
 	{
@@ -99,6 +101,24 @@ static struct ctl_table smc_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "smcr_pref_send_wr",
+		.data		= &init_net.smc.sysctl_smcr_pref_send_wr,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &smcr_max_wr_min,
+		.extra2		= &smcr_max_wr_max,
+	},
+	{
+		.procname	= "smcr_pref_recv_wr",
+		.data		= &init_net.smc.sysctl_smcr_pref_recv_wr,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &smcr_max_wr_min,
+		.extra2		= &smcr_max_wr_max,
+	},
 };
 
 int __net_init smc_sysctl_net_init(struct net *net)
@@ -130,6 +150,8 @@ int __net_init smc_sysctl_net_init(struct net *net)
 	WRITE_ONCE(net->smc.sysctl_rmem, net_smc_rmem_init);
 	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
 	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
+	net->smc.sysctl_smcr_pref_send_wr = SMCR_MAX_SEND_WR_DEF;
+	net->smc.sysctl_smcr_pref_recv_wr = SMCR_MAX_RECV_WR_DEF;
 	/* disable handshake limitation by default */
 	net->smc.limit_smc_hs = 0;
 
diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h
index eb2465ae1e15..5d17c6082cc2 100644
--- a/net/smc/smc_sysctl.h
+++ b/net/smc/smc_sysctl.h
@@ -25,6 +25,8 @@ static inline int smc_sysctl_net_init(struct net *net)
 	net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
 	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
 	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
+	net->smc.sysctl_smcr_pref_send_wr = SMCR_MAX_SEND_WR_DEF;
+	net->smc.sysctl_smcr_pref_recv_wr = SMCR_MAX_RECV_WR_DEF;
 	return 0;
 }
 
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index b04a21b8c511..606fe0bec4ef 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -34,6 +34,7 @@
 #define SMC_WR_MAX_POLL_CQE 10	/* max. # of compl. queue elements in 1 poll */
 
 #define SMC_WR_RX_HASH_BITS 4
+
 static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
 static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
 
@@ -547,9 +548,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 		    IB_QP_DEST_QPN,
 		    &init_attr);
 
-	lnk->wr_tx_cnt = min_t(size_t, SMC_WR_BUF_CNT,
+	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->pref_send_wr,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT * 3,
+	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->pref_recv_wr,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -741,50 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
 int smc_wr_alloc_link_mem(struct smc_link *link)
 {
 	/* allocate link related memory */
-	link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL);
+	link->wr_tx_bufs = kcalloc(link->lgr->pref_send_wr,
+				   SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, link->wr_rx_buflen,
+	link->wr_rx_bufs = kcalloc(link->lgr->pref_recv_wr, SMC_WR_BUF_SIZE,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
-	link->wr_tx_ibs = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_ibs[0]),
-				  GFP_KERNEL);
+	link->wr_tx_ibs = kcalloc(link->lgr->pref_send_wr,
+				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_ibs = kcalloc(link->lgr->pref_recv_wr,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
-	link->wr_tx_rdmas = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_rdmas = kcalloc(link->lgr->pref_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
-	link->wr_tx_rdma_sges = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_rdma_sges = kcalloc(link->lgr->pref_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
 	if (!link->wr_tx_rdma_sges)
 		goto no_mem_wr_tx_rdmas;
-	link->wr_tx_sges = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_sges[0]),
+	link->wr_tx_sges = kcalloc(link->lgr->pref_send_wr, sizeof(link->wr_tx_sges[0]),
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_sges = kcalloc(link->lgr->pref_recv_wr,
 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
 		goto no_mem_wr_tx_sges;
-	link->wr_tx_mask = bitmap_zalloc(SMC_WR_BUF_CNT, GFP_KERNEL);
+	link->wr_tx_mask = bitmap_zalloc(link->lgr->pref_send_wr, GFP_KERNEL);
 	if (!link->wr_tx_mask)
 		goto no_mem_wr_rx_sges;
-	link->wr_tx_pends = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_pends = kcalloc(link->lgr->pref_send_wr,
 				    sizeof(link->wr_tx_pends[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
-	link->wr_tx_compl = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_compl = kcalloc(link->lgr->pref_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_compl)
@@ -905,7 +907,7 @@ int smc_wr_create_link(struct smc_link *lnk)
 		goto dma_unmap;
 	}
 	smc_wr_init_sge(lnk);
-	bitmap_zero(lnk->wr_tx_mask, SMC_WR_BUF_CNT);
+	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->pref_send_wr);
 	init_waitqueue_head(&lnk->wr_tx_wait);
 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
 	if (rc)
diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index f3008dda222a..aa4533af9122 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -19,8 +19,6 @@
 #include "smc.h"
 #include "smc_core.h"
 
-#define SMC_WR_BUF_CNT 16	/* # of ctrl buffers per link */
-
 #define SMC_WR_TX_WAIT_FREE_SLOT_TIME	(10 * HZ)
 
 #define SMC_WR_TX_SIZE 44 /* actual size of wr_send data (<=SMC_WR_BUF_SIZE) */
-- 
2.48.1



* [PATCH net-next v2 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-08 22:01 [PATCH net-next v2 0/2] net/smc: make wr buffer count configurable Halil Pasic
  2025-09-08 22:01 ` [PATCH net-next v2 1/2] " Halil Pasic
@ 2025-09-08 22:01 ` Halil Pasic
  1 sibling, 0 replies; 9+ messages in thread
From: Halil Pasic @ 2025-09-08 22:01 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
  Cc: Halil Pasic

Currently a -ENOMEM from smc_wr_alloc_link_mem() is handled by
giving up and going the way of a TCP fallback. This was reasonable
while the sizes of the allocations there were compile time constants
and reasonably small. But now those are actually configurable.

So instead of giving up, keep retrying with half of the requested
size unless we dip below the old static sizes -- then give up!

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
---
 Documentation/networking/smc-sysctl.rst |  9 ++++---
 net/smc/smc_core.c                      | 34 +++++++++++++++++--------
 net/smc/smc_core.h                      |  2 ++
 net/smc/smc_wr.c                        | 28 ++++++++++----------
 4 files changed, 46 insertions(+), 27 deletions(-)

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index d533830df28f..846fdea87c84 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -85,9 +85,10 @@ smcr_pref_send_wr - INTEGER
 
 	Please be aware that all the buffers need to be allocated as a physically
 	contiguous array in which each element is a single buffer and has the size
-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
+	with half of the buffer count until it is either successful or (unlikely)
+	we dip below the old hard coded value which is 16 where we give up much
 	like before having this control.
-	this control.
 
 	Default: 16
 
@@ -104,7 +105,9 @@ smcr_pref_recv_wr - INTEGER
 
 	Please be aware that all the buffers need to be allocated as a physically
 	contiguous array in which each element is a single buffer and has the size
-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
+	with half of the buffer count until it is either successful or (unlikely)
+	we dip below the old hard coded value which is 48 where we give up much
 	like before having this control.
 
 	Default: 48
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 262746e304dd..d55511d79cc2 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	lnk->clearing = 0;
 	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
 	lnk->link_id = smcr_next_link_id(lgr);
+	lnk->pref_send_wr = lgr->pref_send_wr;
+	lnk->pref_recv_wr = lgr->pref_recv_wr;
 	lnk->lgr = lgr;
 	smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
 	lnk->link_idx = link_idx;
@@ -836,27 +838,39 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	rc = smc_llc_link_init(lnk);
 	if (rc)
 		goto out;
-	rc = smc_wr_alloc_link_mem(lnk);
-	if (rc)
-		goto clear_llc_lnk;
 	rc = smc_ib_create_protection_domain(lnk);
 	if (rc)
-		goto free_link_mem;
-	rc = smc_ib_create_queue_pair(lnk);
-	if (rc)
-		goto dealloc_pd;
+		goto clear_llc_lnk;
+	do {
+		rc = smc_ib_create_queue_pair(lnk);
+		if (rc)
+			goto dealloc_pd;
+		rc = smc_wr_alloc_link_mem(lnk);
+		if (!rc)
+			break;
+		else if (rc != -ENOMEM) /* give up */
+			goto destroy_qp;
+		/* retry with smaller ... */
+		lnk->pref_send_wr /= 2;
+		lnk->pref_recv_wr /= 2;
+		/* ... unless dropping below the old SMC_WR_BUF_CNT based counts */
+		if (lnk->pref_send_wr < 16 || lnk->pref_recv_wr < 48)
+			goto destroy_qp;
+		smc_ib_destroy_queue_pair(lnk);
+	} while (1);
+
 	rc = smc_wr_create_link(lnk);
 	if (rc)
-		goto destroy_qp;
+		goto free_link_mem;
 	lnk->state = SMC_LNK_ACTIVATING;
 	return 0;
 
+free_link_mem:
+	smc_wr_free_link_mem(lnk);
 destroy_qp:
 	smc_ib_destroy_queue_pair(lnk);
 dealloc_pd:
 	smc_ib_dealloc_protection_domain(lnk);
-free_link_mem:
-	smc_wr_free_link_mem(lnk);
 clear_llc_lnk:
 	smc_llc_link_clear(lnk, false);
 out:
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 78d5bcefa1b8..18ba0364ff52 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -174,6 +174,8 @@ struct smc_link {
 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
 	int			llc_testlink_time; /* testlink interval */
 	atomic_t		conn_cnt; /* connections on this link */
+	u16			pref_send_wr;
+	u16			pref_recv_wr;
 };
 
 /* For now we just allow one parallel link per link group. The SMC protocol
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 606fe0bec4ef..632d095599ed 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -548,9 +548,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 		    IB_QP_DEST_QPN,
 		    &init_attr);
 
-	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->pref_send_wr,
+	lnk->wr_tx_cnt = min_t(size_t, lnk->pref_send_wr,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->pref_recv_wr,
+	lnk->wr_rx_cnt = min_t(size_t, lnk->pref_recv_wr,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -742,51 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
 int smc_wr_alloc_link_mem(struct smc_link *link)
 {
 	/* allocate link related memory */
-	link->wr_tx_bufs = kcalloc(link->lgr->pref_send_wr,
+	link->wr_tx_bufs = kcalloc(link->pref_send_wr,
 				   SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(link->lgr->pref_recv_wr, SMC_WR_BUF_SIZE,
+	link->wr_rx_bufs = kcalloc(link->pref_recv_wr, SMC_WR_BUF_SIZE,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
-	link->wr_tx_ibs = kcalloc(link->lgr->pref_send_wr,
+	link->wr_tx_ibs = kcalloc(link->pref_send_wr,
 				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(link->lgr->pref_recv_wr,
+	link->wr_rx_ibs = kcalloc(link->pref_recv_wr,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
-	link->wr_tx_rdmas = kcalloc(link->lgr->pref_send_wr,
+	link->wr_tx_rdmas = kcalloc(link->pref_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
-	link->wr_tx_rdma_sges = kcalloc(link->lgr->pref_send_wr,
+	link->wr_tx_rdma_sges = kcalloc(link->pref_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
 	if (!link->wr_tx_rdma_sges)
 		goto no_mem_wr_tx_rdmas;
-	link->wr_tx_sges = kcalloc(link->lgr->pref_send_wr, sizeof(link->wr_tx_sges[0]),
+	link->wr_tx_sges = kcalloc(link->pref_send_wr, sizeof(link->wr_tx_sges[0]),
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(link->lgr->pref_recv_wr,
+	link->wr_rx_sges = kcalloc(link->pref_recv_wr,
 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
 		goto no_mem_wr_tx_sges;
-	link->wr_tx_mask = bitmap_zalloc(link->lgr->pref_send_wr, GFP_KERNEL);
+	link->wr_tx_mask = bitmap_zalloc(link->pref_send_wr, GFP_KERNEL);
 	if (!link->wr_tx_mask)
 		goto no_mem_wr_rx_sges;
-	link->wr_tx_pends = kcalloc(link->lgr->pref_send_wr,
+	link->wr_tx_pends = kcalloc(link->pref_send_wr,
 				    sizeof(link->wr_tx_pends[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
-	link->wr_tx_compl = kcalloc(link->lgr->pref_send_wr,
+	link->wr_tx_compl = kcalloc(link->pref_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_compl)
@@ -907,7 +907,7 @@ int smc_wr_create_link(struct smc_link *lnk)
 		goto dma_unmap;
 	}
 	smc_wr_init_sge(lnk);
-	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->pref_send_wr);
+	bitmap_zero(lnk->wr_tx_mask, lnk->pref_send_wr);
 	init_waitqueue_head(&lnk->wr_tx_wait);
 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
 	if (rc)
-- 
2.48.1



* Re: [PATCH net-next v2 1/2] net/smc: make wr buffer count configurable
  2025-09-08 22:01 ` [PATCH net-next v2 1/2] " Halil Pasic
@ 2025-09-09  3:00   ` Dust Li
  2025-09-09 10:18     ` Halil Pasic
  0 siblings, 1 reply; 9+ messages in thread
From: Dust Li @ 2025-09-09  3:00 UTC (permalink / raw)
  To: Halil Pasic, Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390

On 2025-09-09 00:01:49, Halil Pasic wrote:
>Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
>SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
>get replaced with lgr->pref_send_wr and lgr->max_recv_wr respectively.
                            ^                       ^
                            better to use the same prefix

I personally prefer max_send_wr/max_recv_wr.

>
>While at it let us also remove a confusing comment that is either not
>about the context in which it resides (describing
>qp_attr.cap.pref_send_wr and qp_attr.cap.max_recv_wr) or not applicable
                ^
I haven't found pref_send_wr in qp_attr.cap

>any more when these values become configurable.
>
>Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
>---
> Documentation/networking/smc-sysctl.rst | 37 +++++++++++++++++++++++++
> include/net/netns/smc.h                 |  2 ++
> net/smc/smc_core.h                      |  6 ++++
> net/smc/smc_ib.c                        |  7 ++---
> net/smc/smc_llc.c                       |  2 ++
> net/smc/smc_sysctl.c                    | 22 +++++++++++++++
> net/smc/smc_sysctl.h                    |  2 ++
> net/smc/smc_wr.c                        | 32 +++++++++++----------
> net/smc/smc_wr.h                        |  2 --
> 9 files changed, 90 insertions(+), 22 deletions(-)
>
>diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
>index a874d007f2db..d533830df28f 100644
>--- a/Documentation/networking/smc-sysctl.rst
>+++ b/Documentation/networking/smc-sysctl.rst
>@@ -71,3 +71,40 @@ smcr_max_conns_per_lgr - INTEGER
> 	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
> 
> 	Default: 255
>+
>+smcr_pref_send_wr - INTEGER
>+	So called work request buffers are SMCR link (and RDMA queue pair) level
>+	resources necessary for performing RDMA operations. Since up to 255
>+	connections can share a link group and thus also a link and the number
>+	of the work request buffers is decided when the link is allocated,
>+	depending on the workload it can be a bottleneck in the sense that threads
>+	have to wait for work request buffers to become available. Before the
>+	introduction of this control the maximal number of work request buffers
>+	available on the send path used to be hard coded to 16. With this control
>+	it becomes configurable. The acceptable range is between 2 and 2048.
>+
>+	Please be aware that all the buffers need to be allocated as a physically
>+	contiguous array in which each element is a single buffer and has the size
>+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
>+	like before having this control.
>+	this control.

The final 'this control' looks unwanted.


>+
>+	Default: 16
>+
>+smcr_pref_recv_wr - INTEGER
>+	So called work request buffers are SMCR link (and RDMA queue pair) level
>+	resources necessary for performing RDMA operations. Since up to 255
>+	connections can share a link group and thus also a link and the number
>+	of the work request buffers is decided when the link is allocated,
>+	depending on the workload it can be a bottleneck in the sense that threads
>+	have to wait for work request buffers to become available. Before the
>+	introduction of this control the maximal number of work request buffers
>+	available on the receive path used to be hard coded to 48. With this control
>+	it becomes configurable. The acceptable range is between 2 and 2048.
>+
>+	Please be aware that all the buffers need to be allocated as a physically
>+	contiguous array in which each element is a single buffer and has the size
>+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
>+	like before having this control.
>+
>+	Default: 48
>diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
>index fc752a50f91b..830817fc7fd7 100644
>--- a/include/net/netns/smc.h
>+++ b/include/net/netns/smc.h
>@@ -24,5 +24,7 @@ struct netns_smc {
> 	int				sysctl_rmem;
> 	int				sysctl_max_links_per_lgr;
> 	int				sysctl_max_conns_per_lgr;
>+	unsigned int			sysctl_smcr_pref_send_wr;
>+	unsigned int			sysctl_smcr_pref_recv_wr;
> };
> #endif
>diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
>index 48a1b1dcb576..78d5bcefa1b8 100644
>--- a/net/smc/smc_core.h
>+++ b/net/smc/smc_core.h
>@@ -33,6 +33,8 @@
> 					 * distributions may modify it to a value between
> 					 * 16-255 as needed.
> 					 */
>+#define SMCR_MAX_SEND_WR_DEF	16	/* Default number of work requests per send queue */
>+#define SMCR_MAX_RECV_WR_DEF	48	/* Default number of work requests per recv queue */
> 
> struct smc_lgr_list {			/* list of link group definition */
> 	struct list_head	list;
>@@ -361,6 +363,10 @@ struct smc_link_group {
> 						/* max conn can be assigned to lgr */
> 			u8			max_links;
> 						/* max links can be added in lgr */
>+			u16			pref_send_wr;
>+						/* number of WR buffers on send */
>+			u16			pref_recv_wr;
>+						/* number of WR buffers on recv */
> 		};
> 		struct { /* SMC-D */
> 			struct smcd_gid		peer_gid;
>diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
>index 0052f02756eb..2f8f214fc634 100644
>--- a/net/smc/smc_ib.c
>+++ b/net/smc/smc_ib.c
>@@ -669,11 +669,6 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
> 		.recv_cq = lnk->smcibdev->roce_cq_recv,
> 		.srq = NULL,
> 		.cap = {
>-				/* include unsolicited rdma_writes as well,
>-				 * there are max. 2 RDMA_WRITE per 1 WR_SEND
>-				 */
>-			.max_send_wr = SMC_WR_BUF_CNT * 3,
>-			.max_recv_wr = SMC_WR_BUF_CNT * 3,
> 			.max_send_sge = SMC_IB_MAX_SEND_SGE,
> 			.max_recv_sge = lnk->wr_rx_sge_cnt,
> 			.max_inline_data = 0,
>@@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
> 	};
> 	int rc;
> 
>+	qp_attr.cap.max_send_wr = 3 * lnk->lgr->pref_send_wr;
>+	qp_attr.cap.max_recv_wr = lnk->lgr->pref_recv_wr;
> 	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
> 	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
> 	if (IS_ERR(lnk->roce_qp))
>diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
>index f865c58c3aa7..1098bdc3557b 100644
>--- a/net/smc/smc_llc.c
>+++ b/net/smc/smc_llc.c
>@@ -2157,6 +2157,8 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
> 	init_waitqueue_head(&lgr->llc_msg_waiter);
> 	init_rwsem(&lgr->llc_conf_mutex);
> 	lgr->llc_testlink_time = READ_ONCE(net->smc.sysctl_smcr_testlink_time);
>+	lgr->pref_send_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_pref_send_wr));
>+	lgr->pref_recv_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_pref_recv_wr));
> }
> 
> /* called after lgr was removed from lgr_list */
>diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
>index 2fab6456f765..f320443e563b 100644
>--- a/net/smc/smc_sysctl.c
>+++ b/net/smc/smc_sysctl.c
>@@ -29,6 +29,8 @@ static int links_per_lgr_min = SMC_LINKS_ADD_LNK_MIN;
> static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
> static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
> static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
>+static unsigned int smcr_max_wr_min = 2;
>+static unsigned int smcr_max_wr_max = 2048;
> 
> static struct ctl_table smc_table[] = {
> 	{
>@@ -99,6 +101,24 @@ static struct ctl_table smc_table[] = {
> 		.extra1		= SYSCTL_ZERO,
> 		.extra2		= SYSCTL_ONE,
> 	},
>+	{
>+		.procname	= "smcr_pref_send_wr",
>+		.data		= &init_net.smc.sysctl_smcr_pref_send_wr,
>+		.maxlen		= sizeof(int),
>+		.mode		= 0644,
>+		.proc_handler	= proc_dointvec_minmax,
>+		.extra1		= &smcr_max_wr_min,
>+		.extra2		= &smcr_max_wr_max,
>+	},
>+	{
>+		.procname	= "smcr_pref_recv_wr",
>+		.data		= &init_net.smc.sysctl_smcr_pref_recv_wr,
>+		.maxlen		= sizeof(int),
>+		.mode		= 0644,
>+		.proc_handler	= proc_dointvec_minmax,
>+		.extra1		= &smcr_max_wr_min,
>+		.extra2		= &smcr_max_wr_max,
>+	},
> };
> 
> int __net_init smc_sysctl_net_init(struct net *net)
>@@ -130,6 +150,8 @@ int __net_init smc_sysctl_net_init(struct net *net)
> 	WRITE_ONCE(net->smc.sysctl_rmem, net_smc_rmem_init);
> 	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
> 	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
>+	net->smc.sysctl_smcr_pref_send_wr = SMCR_MAX_SEND_WR_DEF;
>+	net->smc.sysctl_smcr_pref_recv_wr = SMCR_MAX_RECV_WR_DEF;
> 	/* disable handshake limitation by default */
> 	net->smc.limit_smc_hs = 0;
> 
>diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h
>index eb2465ae1e15..5d17c6082cc2 100644
>--- a/net/smc/smc_sysctl.h
>+++ b/net/smc/smc_sysctl.h
>@@ -25,6 +25,8 @@ static inline int smc_sysctl_net_init(struct net *net)
> 	net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
> 	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
> 	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
>+	net->smc.sysctl_smcr_pref_send_wr = SMCR_MAX_SEND_WR_DEF;
>+	net->smc.sysctl_smcr_pref_recv_wr = SMCR_MAX_RECV_WR_DEF;
> 	return 0;
> }
> 
>diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
>index b04a21b8c511..606fe0bec4ef 100644
>--- a/net/smc/smc_wr.c
>+++ b/net/smc/smc_wr.c
>@@ -34,6 +34,7 @@
> #define SMC_WR_MAX_POLL_CQE 10	/* max. # of compl. queue elements in 1 poll */
> 
> #define SMC_WR_RX_HASH_BITS 4
>+
> static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
> static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
> 
>@@ -547,9 +548,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
> 		    IB_QP_DEST_QPN,
> 		    &init_attr);
> 
>-	lnk->wr_tx_cnt = min_t(size_t, SMC_WR_BUF_CNT,
>+	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->pref_send_wr,
> 			       lnk->qp_attr.cap.max_send_wr);
>-	lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT * 3,
>+	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->pref_recv_wr,
> 			       lnk->qp_attr.cap.max_recv_wr);
> }
> 
>@@ -741,50 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
> int smc_wr_alloc_link_mem(struct smc_link *link)
> {
> 	/* allocate link related memory */
>-	link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL);
>+	link->wr_tx_bufs = kcalloc(link->lgr->pref_send_wr,
>+				   SMC_WR_BUF_SIZE, GFP_KERNEL);
> 	if (!link->wr_tx_bufs)
> 		goto no_mem;
>-	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, link->wr_rx_buflen,
>+	link->wr_rx_bufs = kcalloc(link->lgr->pref_recv_wr, SMC_WR_BUF_SIZE,
> 				   GFP_KERNEL);

Why change wr_rx_buflen to SMC_WR_BUF_SIZE? wr_rx_buflen depends on
whether the link is SMC V1 or SMC V2.

If this is a mistake, we need to change the comments in sysctl.rst as
well.

Best regards,
Dust

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next v2 1/2] net/smc: make wr buffer count configurable
  2025-09-09  3:00   ` Dust Li
@ 2025-09-09 10:18     ` Halil Pasic
  2025-09-19 14:55       ` Halil Pasic
  0 siblings, 1 reply; 9+ messages in thread
From: Halil Pasic @ 2025-09-09 10:18 UTC (permalink / raw)
  To: Dust Li
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Tue, 9 Sep 2025 11:00:50 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:

> On 2025-09-09 00:01:49, Halil Pasic wrote:
> >Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
> >SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
> >get replaced with lgr->pref_send_wr and lgr->max_recv_wr respective.  

Yes, it is just in the commit message; I messed up the search and replace
there. :(

>                             ^                       ^
>                             better to use the same prefix
> 
> I personally prefer max_send_wr/max_recv_wr.
> 

Will go back to that then for v3

> >
> >While at it let us also remove a confusing comment that is either not
> >about the context in which it resides (describing
> >qp_attr.cap.pref_send_wr and qp_attr.cap.max_recv_wr) or not applicable  
>                 ^
> I haven't found pref_send_wr in qp_attr.cap
> 

Again search and replace. Sorry!

[..]
> >+
> >+	Please be aware that all the buffers need to be allocated as a physically
> >+	continuous array in which each element is a single buffer and has the size
> >+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> >+	like before having this control.
> >+	this control.  
> 
> The final 'this control' looks unwanted.
 

You are right

[..]
> > 
> >@@ -741,50 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
> > int smc_wr_alloc_link_mem(struct smc_link *link)
> > {
> > 	/* allocate link related memory */
> >-	link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL);
> >+	link->wr_tx_bufs = kcalloc(link->lgr->pref_send_wr,
> >+				   SMC_WR_BUF_SIZE, GFP_KERNEL);
> > 	if (!link->wr_tx_bufs)
> > 		goto no_mem;
> >-	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, link->wr_rx_buflen,
> >+	link->wr_rx_bufs = kcalloc(link->lgr->pref_recv_wr, SMC_WR_BUF_SIZE,
> > 				   GFP_KERNEL);  


I will have to do some digging, let's assume for now that it is my
mistake. Unfortunately I won't be able to revisit this before next
Wednesday.

Thank you for your review!

Regards,
Halil


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next v2 1/2] net/smc: make wr buffer count configurable
  2025-09-09 10:18     ` Halil Pasic
@ 2025-09-19 14:55       ` Halil Pasic
  2025-09-24  3:13         ` Guangguan Wang
  0 siblings, 1 reply; 9+ messages in thread
From: Halil Pasic @ 2025-09-19 14:55 UTC (permalink / raw)
  To: Dust Li, Guangguan Wang
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Tue, 9 Sep 2025 12:18:50 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:

> > >-	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, link->wr_rx_buflen,
> > >+	link->wr_rx_bufs = kcalloc(link->lgr->pref_recv_wr, SMC_WR_BUF_SIZE,
> > > 				   GFP_KERNEL);    
> 
> 
> I will have to do some digging, let's assume for now that it is my
> mistake. Unfortunately I won't be able to revisit this before next
> Wednesday.

Can Wen Gu and Guangguan Wang maybe chime in? From what I read,
link->wr_rx_buflen can be either SMC_WR_BUF_SIZE, which is 48, in which
case it does not matter, or SMC_WR_BUF_V2_SIZE, which is 8192, if
!smc_link_shared_v2_rxbuf(lnk), i.e. max_recv_sge == 1. So we are talking
about roughly a factor of 170 here. For a large pref_recv_wr the
back-off logic is still there to save us, but I really would not say that
this is how this is intended to work.
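To put the factor-of-170 estimate above in numbers, here is a tiny
userland sketch. The sizes are the ones quoted in this thread and the
macro names mirror the kernel's, but wr_rx_array_bytes() is a made-up
helper for illustration, not a kernel function:

```c
#include <stddef.h>

/* Sizes as quoted in this thread: SMC_WR_BUF_SIZE is 48 bytes,
 * SMC_WR_BUF_V2_SIZE is 8192 bytes. */
#define SMC_WR_BUF_SIZE		48
#define SMC_WR_BUF_V2_SIZE	8192

/* Bytes needed for the physically continuous wr_rx_bufs array,
 * given the per-buffer length and the receive ring depth. */
static size_t wr_rx_array_bytes(size_t wr_rx_buflen, size_t recv_wr)
{
	return wr_rx_buflen * recv_wr;
}
```

With the default ring depth of 48 the V1 array stays tiny while the
per-link V2 array grows by roughly two orders of magnitude.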

Maybe not supporting V2 on devices with max_recv_sge == 1 is a better choice,
assuming that a maximal V2 LLC msg needs to fit each and every receive
WR buffer. Which seems to be the case based on 27ef6a9981fe ("net/smc:
support SMC-R V2 for rdma devices with max_recv_sge equals to 1").

For me the best course of action seems to be to send a V3 using
link->wr_rx_buflen. I'm really not that knowledgeable about RDMA or
the SMC-R protocol, but I'm happy to be part of the discussion on this
matter.

Regards,
Halil

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next v2 1/2] net/smc: make wr buffer count configurable
  2025-09-19 14:55       ` Halil Pasic
@ 2025-09-24  3:13         ` Guangguan Wang
  2025-09-24  9:50           ` Halil Pasic
  0 siblings, 1 reply; 9+ messages in thread
From: Guangguan Wang @ 2025-09-24  3:13 UTC (permalink / raw)
  To: Halil Pasic, Dust Li
  Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390



On 2025/9/19 22:55, Halil Pasic wrote:
> On Tue, 9 Sep 2025 12:18:50 +0200
> Halil Pasic <pasic@linux.ibm.com> wrote:
> 
> 
> Can maybe Wen Gu and  Guangguan Wang chime in. From what I read
> link->wr_rx_buflen can be either SMC_WR_BUF_SIZE that is 48 in which
> case it does not matter, or SMC_WR_BUF_V2_SIZE that is 8192, if
> !smc_link_shared_v2_rxbuf(lnk) i.e. max_recv_sge == 1. So we talk
> about roughly a factor of 170 here. For a large pref_recv_wr the
> back of logic is still there to save us but I really would not say that
> this is how this is intended to work.
> 

Hi Halil,

I think the root cause of the problem this patchset tries to solve is a mismatch
between SMC_WR_BUF_CNT and the max_conns per lgr (whose value is 255). Furthermore,
I believe that a max_conns per lgr of 255 is not an optimal value, as too
few connections lead to a waste of memory and too many connections lead to I/O queuing
within a single QP (every WR posted to a single QP will initiate and complete in sequence).
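As a back-of-the-envelope illustration of that mismatch, using the
constants named in this thread (conns_per_tx_slot() is a hypothetical
helper for illustration only, not kernel code):

```c
/* Constants as named in this thread: 16 WR send buffers per link
 * versus up to 255 connections per link group. */
#define SMC_WR_BUF_CNT		16
#define SMC_CONN_PER_LGR_MAX	255

/* Worst case: how many connections end up sharing one tx slot. */
static unsigned int conns_per_tx_slot(unsigned int conns, unsigned int bufs)
{
	return (conns + bufs - 1) / bufs;	/* ceiling division */
}
```

With 255 connections and 16 buffers, up to 16 connections contend for
each slot; with 64 buffers and 32 connections the ratio drops below 1.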

We actually identified this problem long ago. In the Alibaba Cloud Linux
distribution, we changed SMC_WR_BUF_CNT to 64 and reduced max_conns per lgr to
32 (for SMC-R V2.1). This configuration has worked well under various workloads
for a long time.

SMC-R V2.1 already supports negotiation of the max_conns per lgr; simply
changing the value of the macro SMC_CONN_PER_LGR_PREFER influences the
negotiation result. But SMC-R V1.0 and SMC-R V2.0 do not support negotiation
of the max_conns per lgr.
I think it is better to reduce SMC_CONN_PER_LGR_PREFER for SMC-R V2.1, but for
SMC-R V1.0 and SMC-R V2.0 I do not have any good idea.

> Maybe not supporting V2 on devices with max_recv_sge is a better choice,
> assuming that a maximal V2 LLC msg needs to fit each and every receive
> WR buffer. Which seems to be the case based on 27ef6a9981fe ("net/smc:
> support SMC-R V2 for rdma devices with max_recv_sge equals to 1").
>

For an RDMA device whose max_recv_sge is 1, as mentioned in the commit log of
the related patch, it is better to support it than to take the
SMC_CLC_DECL_INTERR fallback, as that fallback is not a fast one and may
heavily affect the efficiency of the connection setup on both the server and
the client side.

 
> For me the best course of action seems to be to send a V3 using
> link->wr_rx_buflen. I'm really not that knowledgeable about RDMA or
> the SMC-R protocol, but I'm happy to be part of the discussion on this
> matter.
> 
> Regards,
> Halil

And a tiny suggestion regarding the risk you mentioned in the commit log
("Addressing this by simply bumping SMC_WR_BUF_CNT to 256 was deemed
risky, because the large-ish physically continuous allocation could fail
and lead to TCP fall-backs."): non-physically continuous allocation
(vmalloc/vzalloc etc.) is also an option for the WR buffers. The SMC-R snd_buf
and rmb already support non-physically continuous memory, when
sysctl_smcr_buf_type is set to SMCR_VIRT_CONT_BUFS or SMCR_MIXED_BUFS; that can
serve as an example of using non-physically continuous memory.
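For readers unfamiliar with the pattern being suggested, here is a rough
userland illustration of the decision the kernel's kvcalloc()/kvfree()
helpers make: small requests come from the physically continuous
allocator, large ones may fall back to vmalloc()-style virtually
continuous memory. CONTIG_LIMIT is invented for this sketch and is not a
kernel constant:

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed order-3 contiguous-allocation limit, for illustration only. */
#define CONTIG_LIMIT	(4096 * 8)

/* Would an n * size request exceed the contiguous limit and thus
 * warrant a vmalloc()-style fallback? */
static bool needs_vmalloc_fallback(size_t n, size_t size)
{
	return n * size > CONTIG_LIMIT;
}
```

The current 16 x 48-byte tx array is far below any such limit, while a
256-deep ring of 8192-byte V2 buffers would clearly be a fallback
candidate.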

Regards,
Guangguan Wang


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next v2 1/2] net/smc: make wr buffer count configurable
  2025-09-24  3:13         ` Guangguan Wang
@ 2025-09-24  9:50           ` Halil Pasic
  2025-09-25  3:48             ` Guangguan Wang
  0 siblings, 1 reply; 9+ messages in thread
From: Halil Pasic @ 2025-09-24  9:50 UTC (permalink / raw)
  To: Guangguan Wang
  Cc: Dust Li, Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Wed, 24 Sep 2025 11:13:05 +0800
Guangguan Wang <guangguan.wang@linux.alibaba.com> wrote:

> On 2025/9/19 22:55, Halil Pasic wrote:
> > On Tue, 9 Sep 2025 12:18:50 +0200
> > Halil Pasic <pasic@linux.ibm.com> wrote:
> > 
> > 
> > Can maybe Wen Gu and  Guangguan Wang chime in. From what I read
> > link->wr_rx_buflen can be either SMC_WR_BUF_SIZE that is 48 in which
> > case it does not matter, or SMC_WR_BUF_V2_SIZE that is 8192, if
> > !smc_link_shared_v2_rxbuf(lnk) i.e. max_recv_sge == 1. So we talk
> > about roughly a factor of 170 here. For a large pref_recv_wr the
> > back of logic is still there to save us but I really would not say that
> > this is how this is intended to work.
> >   
> 
> Hi Halil,
> 
> I think the root cause of the problem this patchset try to solve is a mismatch
> between SMC_WR_BUF_CNT and the max_conns per lgr(which value is 255). Furthermore,
> I believe that value 255 of the max_conns per lgr is not an optimal value, as too
> few connections lead to a waste of memory and too many connections lead to I/O queuing
> within a single QP(every WR post_send to a single QP will initiate and complete in sequence).
> 
> We actually identified this problem long ago. In Alibaba Cloud Linux distribution, we have
> changed SMC_WR_BUF_CNT to 64 and reduced max_conns per lgr to 32(for SMC-R V2.1). This
> configuration has worked well under various workflow for a long time.
> 
> SMC-R V2.1 already support negotiation of the max_conns per lgr. Simply change the value of
> the macro SMC_CONN_PER_LGR_PREFER can influence the negotiation result. But SMC-R V1.0 and SMC-R
> v2.0 do not support the negotiation of the max_conns per lgr.
> I think it is better to reduce SMC_CONN_PER_LGR_PREFER for SMC-R V2.1. But for SMC-R V1.0 and
> SMC-R V2.0, I do not have any good idea.
> 

I agree, the number of WR buffers and the max number of connections per
lgr can and should be tuned in concert.

> > Maybe not supporting V2 on devices with max_recv_sge is a better choice,
> > assuming that a maximal V2 LLC msg needs to fit each and every receive
> > WR buffer. Which seems to be the case based on 27ef6a9981fe ("net/smc:
> > support SMC-R V2 for rdma devices with max_recv_sge equals to 1").
> >  
> 
> For rdma dev whose max_recv_sge is 1, as metioned in the commit log in the related patch,
> it is better to support than SMC_CLC_DECL_INTERR fallback, as SMC_CLC_DECL_INTERR fallback
> is not a fast fallback, and may heavily influence the efficiency of the connecting process
> in both the server and client side.

I mean another possible mitigation of the problem could be the following:
if there is a device in the mix with max_recv_sge < 2, then don't
propose/accept SMC-R V2.

Do you know how prevalent and relevant max_recv_sge < 2 RDMA devices
are, and how likely it is that somebody would want to use SMC-R with
such devices?

> 
>  
> > For me the best course of action seems to be to send a V3 using
> > link->wr_rx_buflen. I'm really not that knowledgeable about RDMA or
> > the SMC-R protocol, but I'm happy to be part of the discussion on this
> > matter.
> > 
> > Regards,
> > Halil  
>
> And a tiny suggestion for the risk you mentioned in commit log
> ("Addressing this by simply bumping SMC_WR_BUF_CNT to 256 was deemed
> risky, because the large-ish physically continuous allocation could fail
> and lead to TCP fall-backs."). Non-physically continuous allocation (vmalloc/vzalloc .etc.) is
> also supported for wr buffers. SMC-R snd_buf and rmb have already supported for non-physically
> continuous memory, when sysctl_smcr_buf_type is set to SMCR_VIRT_CONT_BUFS or SMCR_MIXED_BUFS.
> It can be an example of using non-physically continuous memory.
> 

I think we can put this on the list of possible enhancements. I would
prefer not to add it to the scope of this series, but I would be happy to
see it happen. I don't know if somebody from Alibaba, or maybe
Mahanta or Sid, would like to pick this up as an enhancement on top.

Thank you very much for your comments!

Regards,
Halil 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next v2 1/2] net/smc: make wr buffer count configurable
  2025-09-24  9:50           ` Halil Pasic
@ 2025-09-25  3:48             ` Guangguan Wang
  0 siblings, 0 replies; 9+ messages in thread
From: Guangguan Wang @ 2025-09-25  3:48 UTC (permalink / raw)
  To: Halil Pasic
  Cc: Dust Li, Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390



On 2025/9/24 17:50, Halil Pasic wrote:
> On Wed, 24 Sep 2025 11:13:05 +0800
> Guangguan Wang <guangguan.wang@linux.alibaba.com> wrote:
> 
>> On 2025/9/19 22:55, Halil Pasic wrote:
>>> On Tue, 9 Sep 2025 12:18:50 +0200
>>> Halil Pasic <pasic@linux.ibm.com> wrote:
>>>
>>>
>>> Can maybe Wen Gu and  Guangguan Wang chime in. From what I read
>>> link->wr_rx_buflen can be either SMC_WR_BUF_SIZE that is 48 in which
>>> case it does not matter, or SMC_WR_BUF_V2_SIZE that is 8192, if
>>> !smc_link_shared_v2_rxbuf(lnk) i.e. max_recv_sge == 1. So we talk
>>> about roughly a factor of 170 here. For a large pref_recv_wr the
>>> back of logic is still there to save us but I really would not say that
>>> this is how this is intended to work.
>>>   
>>
>> Hi Halil,
>>
>> I think the root cause of the problem this patchset try to solve is a mismatch
>> between SMC_WR_BUF_CNT and the max_conns per lgr(which value is 255). Furthermore,
>> I believe that value 255 of the max_conns per lgr is not an optimal value, as too
>> few connections lead to a waste of memory and too many connections lead to I/O queuing
>> within a single QP(every WR post_send to a single QP will initiate and complete in sequence).
>>
>> We actually identified this problem long ago. In Alibaba Cloud Linux distribution, we have
>> changed SMC_WR_BUF_CNT to 64 and reduced max_conns per lgr to 32(for SMC-R V2.1). This
>> configuration has worked well under various workflow for a long time.
>>
>> SMC-R V2.1 already support negotiation of the max_conns per lgr. Simply change the value of
>> the macro SMC_CONN_PER_LGR_PREFER can influence the negotiation result. But SMC-R V1.0 and SMC-R
>> v2.0 do not support the negotiation of the max_conns per lgr.
>> I think it is better to reduce SMC_CONN_PER_LGR_PREFER for SMC-R V2.1. But for SMC-R V1.0 and
>> SMC-R V2.0, I do not have any good idea.
>>
> 
> I agree, the number of WR buffers and the max number of connections per
> lgr can an should be tuned in concert.
> 
>>> Maybe not supporting V2 on devices with max_recv_sge is a better choice,
>>> assuming that a maximal V2 LLC msg needs to fit each and every receive
>>> WR buffer. Which seems to be the case based on 27ef6a9981fe ("net/smc:
>>> support SMC-R V2 for rdma devices with max_recv_sge equals to 1").
>>>  
>>
>> For rdma dev whose max_recv_sge is 1, as metioned in the commit log in the related patch,
>> it is better to support than SMC_CLC_DECL_INTERR fallback, as SMC_CLC_DECL_INTERR fallback
>> is not a fast fallback, and may heavily influence the efficiency of the connecting process
>> in both the server and client side.
> 
> I mean another possible mitigation of the problem can be the following,
> if there is a device in the mix with max_recv_sge < 2 the don't propose/
> accept SMCR-V2. 
> 
> Do you know how prevalent and relevant are max_recv_sge < 2 RDMA
> devices, and how likely is it that somebody would like to use SMC-R with
> such devices?
> 

eRDMA in Alibaba Cloud has max_recv_sge < 2, and it is the RDMA device we are primarily focusing on.
eRDMA preferably works with SMC-R V2.1; is it possible to keep the support in SMC-R V2.1 but not in V2.0?

>>
>>  
>>> For me the best course of action seems to be to send a V3 using
>>> link->wr_rx_buflen. I'm really not that knowledgeable about RDMA or
>>> the SMC-R protocol, but I'm happy to be part of the discussion on this
>>> matter.
>>>
>>> Regards,
>>> Halil  
>>
>> And a tiny suggestion for the risk you mentioned in commit log
>> ("Addressing this by simply bumping SMC_WR_BUF_CNT to 256 was deemed
>> risky, because the large-ish physically continuous allocation could fail
>> and lead to TCP fall-backs."). Non-physically continuous allocation (vmalloc/vzalloc .etc.) is
>> also supported for wr buffers. SMC-R snd_buf and rmb have already supported for non-physically
>> continuous memory, when sysctl_smcr_buf_type is set to SMCR_VIRT_CONT_BUFS or SMCR_MIXED_BUFS.
>> It can be an example of using non-physically continuous memory.
>>
> 
> I think we can put this on the list of possible enhancements. I would
> perfer to not add this to the scope of this series. But I would be happy to
> see this happen. Don't know know if somebody form Alibaba, or maybe
> Mahanta or Sid would like to pick this up as an enhancement on top.
> 
> Thank you very much for your comments!
> 
> Regards,
> Halil 


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-09-25  3:54 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-08 22:01 [PATCH net-next v2 0/2] net/smc: make wr buffer count configurable Halil Pasic
2025-09-08 22:01 ` [PATCH net-next v2 1/2] " Halil Pasic
2025-09-09  3:00   ` Dust Li
2025-09-09 10:18     ` Halil Pasic
2025-09-19 14:55       ` Halil Pasic
2025-09-24  3:13         ` Guangguan Wang
2025-09-24  9:50           ` Halil Pasic
2025-09-25  3:48             ` Guangguan Wang
2025-09-08 22:01 ` [PATCH net-next v2 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).