linux-rdma.vger.kernel.org archive mirror
* [PATCH net-next v5 0/2] net/smc: make wr buffer count configurable
@ 2025-09-28 23:59 Halil Pasic
  2025-09-29  0:00 ` [PATCH net-next v5 1/2] " Halil Pasic
  2025-09-29  0:00 ` [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
  0 siblings, 2 replies; 12+ messages in thread
From: Halil Pasic @ 2025-09-28 23:59 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	Guangguan Wang, Halil Pasic, netdev, linux-doc, linux-kernel,
	linux-rdma, linux-s390

The current value of SMC_WR_BUF_CNT is 16, which leads to heavy
contention on the wr_tx_wait wait queue of the SMC-R linkgroup and its
spinlock when many connections are competing for the work request
buffers. Currently up to 256 connections per linkgroup are supported.

To make things worse, when a buffer finally becomes available and
smc_wr_tx_put_slot() signals the linkgroup's wr_tx_wait wq, all the
waiters get woken up because WQ_FLAG_EXCLUSIVE is not used; most of the
time only a single one can proceed, and the rest contend on the
spinlock of the wq to go back to sleep.

Addressing this by simply bumping SMC_WR_BUF_CNT to 256 was deemed
risky, because the large-ish physically contiguous allocation could fail
and lead to TCP fallbacks. For reference see the discussion thread on
"[PATCH net-next] net/smc: increase SMC_WR_BUF_CNT" (archived at
https://lists.openwall.net/netdev/2024/11/05/186), which concludes with
an agreement to try to come up with something smarter, which is what
this series aims for.

Additionally, if for some reason heavy contention is known not to be
expected, going with something like 256 work request buffers is
wasteful. To address these concerns, make the number of work requests
configurable, and introduce back-off logic which handles -ENOMEM from
smc_wr_alloc_link_mem() gracefully.
---

Changelog:
---------
v5:
 * Added back a code comment about the value of qp_attr.cap.max_send_wr
   after Dust Li's explanation of the comment itself and the logic behind
   it, and removed the paragraph from the commit message that concerned
   itself with the removal of that comment. (Dust Li)
v4: https://lore.kernel.org/netdev/20250927232144.3478161-1-pasic@linux.ibm.com/
v3: https://lore.kernel.org/netdev/20250921214440.325325-1-pasic@linux.ibm.com/
v2: https://lore.kernel.org/netdev/20250908220150.3329433-1-pasic@linux.ibm.com/
v1: https://lore.kernel.org/all/20250904211254.1057445-1-pasic@linux.ibm.com/

Halil Pasic (2):
  net/smc: make wr buffer count configurable
  net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully

 Documentation/networking/smc-sysctl.rst | 40 +++++++++++++++++++++++++
 include/net/netns/smc.h                 |  2 ++
 net/smc/smc_core.c                      | 34 ++++++++++++++-------
 net/smc/smc_core.h                      |  8 +++++
 net/smc/smc_ib.c                        | 10 +++----
 net/smc/smc_llc.c                       |  2 ++
 net/smc/smc_sysctl.c                    | 22 ++++++++++++++
 net/smc/smc_sysctl.h                    |  2 ++
 net/smc/smc_wr.c                        | 31 +++++++++----------
 net/smc/smc_wr.h                        |  2 --
 10 files changed, 121 insertions(+), 32 deletions(-)


base-commit: e835faaed2f80ee8652f59a54703edceab04f0d9
-- 
2.48.1



* [PATCH net-next v5 1/2] net/smc: make wr buffer count configurable
  2025-09-28 23:59 [PATCH net-next v5 0/2] net/smc: make wr buffer count configurable Halil Pasic
@ 2025-09-29  0:00 ` Halil Pasic
  2025-09-29  1:28   ` Dust Li
  2025-10-01 10:29   ` Bagas Sanjaya
  2025-09-29  0:00 ` [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
  1 sibling, 2 replies; 12+ messages in thread
From: Halil Pasic @ 2025-09-29  0:00 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	Guangguan Wang, Halil Pasic, netdev, linux-doc, linux-kernel,
	linux-rdma, linux-s390

Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
get replaced with lgr->max_send_wr and lgr->max_recv_wr respectively.

Please note that although with the default sysctl values
qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr is maintained, this
can not be assumed to be generally true any more. I see no downside to
that, but my confidence level is rather modest.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>
---
 Documentation/networking/smc-sysctl.rst | 36 +++++++++++++++++++++++++
 include/net/netns/smc.h                 |  2 ++
 net/smc/smc_core.h                      |  6 +++++
 net/smc/smc_ib.c                        | 10 +++----
 net/smc/smc_llc.c                       |  2 ++
 net/smc/smc_sysctl.c                    | 22 +++++++++++++++
 net/smc/smc_sysctl.h                    |  2 ++
 net/smc/smc_wr.c                        | 31 ++++++++++-----------
 net/smc/smc_wr.h                        |  2 --
 9 files changed, 91 insertions(+), 22 deletions(-)

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index a874d007f2db..5de4893ef3e7 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -71,3 +71,39 @@ smcr_max_conns_per_lgr - INTEGER
 	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
 
 	Default: 255
+
+smcr_max_send_wr - INTEGER
+	So-called work request buffers are SMC-R link (and RDMA queue pair) level
+	resources necessary for performing RDMA operations. Since up to 255
+	connections can share a link group, and thus also a link, and the number
+	of work request buffers is decided when the link is allocated,
+	depending on the workload this can be a bottleneck in the sense that threads
+	have to wait for work request buffers to become available. Before the
+	introduction of this control the maximal number of work request buffers
+	available on the send path used to be hard coded to 16. With this control
+	it becomes configurable. The acceptable range is between 2 and 2048.
+
+	Please be aware that all the buffers need to be allocated as a physically
+	contiguous array in which each element is a single buffer and has the size
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	like before having this control.
+
+	Default: 16
+
+smcr_max_recv_wr - INTEGER
+	So-called work request buffers are SMC-R link (and RDMA queue pair) level
+	resources necessary for performing RDMA operations. Since up to 255
+	connections can share a link group, and thus also a link, and the number
+	of work request buffers is decided when the link is allocated,
+	depending on the workload this can be a bottleneck in the sense that threads
+	have to wait for work request buffers to become available. Before the
+	introduction of this control the maximal number of work request buffers
+	available on the receive path used to be hard coded to 48. With this control
+	it becomes configurable. The acceptable range is between 2 and 2048.
+
+	Please be aware that all the buffers need to be allocated as a physically
+	contiguous array in which each element is a single buffer and has the size
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	like before having this control.
+
+	Default: 48
diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
index fc752a50f91b..6ceb12baec24 100644
--- a/include/net/netns/smc.h
+++ b/include/net/netns/smc.h
@@ -24,5 +24,7 @@ struct netns_smc {
 	int				sysctl_rmem;
 	int				sysctl_max_links_per_lgr;
 	int				sysctl_max_conns_per_lgr;
+	unsigned int			sysctl_smcr_max_send_wr;
+	unsigned int			sysctl_smcr_max_recv_wr;
 };
 #endif
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index a5a78cbff341..8d06c8bb14e9 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -34,6 +34,8 @@
 					 * distributions may modify it to a value between
 					 * 16-255 as needed.
 					 */
+#define SMCR_MAX_SEND_WR_DEF	16	/* Default number of work requests per send queue */
+#define SMCR_MAX_RECV_WR_DEF	48	/* Default number of work requests per recv queue */
 
 struct smc_lgr_list {			/* list of link group definition */
 	struct list_head	list;
@@ -366,6 +368,10 @@ struct smc_link_group {
 						/* max conn can be assigned to lgr */
 			u8			max_links;
 						/* max links can be added in lgr */
+			u16			max_send_wr;
+						/* number of WR buffers on send */
+			u16			max_recv_wr;
+						/* number of WR buffers on recv */
 		};
 		struct { /* SMC-D */
 			struct smcd_gid		peer_gid;
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 0052f02756eb..1154907c5c05 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -669,11 +669,6 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 		.recv_cq = lnk->smcibdev->roce_cq_recv,
 		.srq = NULL,
 		.cap = {
-				/* include unsolicited rdma_writes as well,
-				 * there are max. 2 RDMA_WRITE per 1 WR_SEND
-				 */
-			.max_send_wr = SMC_WR_BUF_CNT * 3,
-			.max_recv_wr = SMC_WR_BUF_CNT * 3,
 			.max_send_sge = SMC_IB_MAX_SEND_SGE,
 			.max_recv_sge = lnk->wr_rx_sge_cnt,
 			.max_inline_data = 0,
@@ -683,6 +678,11 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
 	};
 	int rc;
 
+	/* include unsolicited rdma_writes as well,
+	 * there are max. 2 RDMA_WRITE per 1 WR_SEND
+	 */
+	qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
+	qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
 	lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
 	rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
 	if (IS_ERR(lnk->roce_qp))
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index f865c58c3aa7..f5d5eb617526 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -2157,6 +2157,8 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
 	init_waitqueue_head(&lgr->llc_msg_waiter);
 	init_rwsem(&lgr->llc_conf_mutex);
 	lgr->llc_testlink_time = READ_ONCE(net->smc.sysctl_smcr_testlink_time);
+	lgr->max_send_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_max_send_wr));
+	lgr->max_recv_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_max_recv_wr));
 }
 
 /* called after lgr was removed from lgr_list */
diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
index 2fab6456f765..7b2471904d04 100644
--- a/net/smc/smc_sysctl.c
+++ b/net/smc/smc_sysctl.c
@@ -29,6 +29,8 @@ static int links_per_lgr_min = SMC_LINKS_ADD_LNK_MIN;
 static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
 static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
 static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
+static unsigned int smcr_max_wr_min = 2;
+static unsigned int smcr_max_wr_max = 2048;
 
 static struct ctl_table smc_table[] = {
 	{
@@ -99,6 +101,24 @@ static struct ctl_table smc_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "smcr_max_send_wr",
+		.data		= &init_net.smc.sysctl_smcr_max_send_wr,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &smcr_max_wr_min,
+		.extra2		= &smcr_max_wr_max,
+	},
+	{
+		.procname	= "smcr_max_recv_wr",
+		.data		= &init_net.smc.sysctl_smcr_max_recv_wr,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &smcr_max_wr_min,
+		.extra2		= &smcr_max_wr_max,
+	},
 };
 
 int __net_init smc_sysctl_net_init(struct net *net)
@@ -130,6 +150,8 @@ int __net_init smc_sysctl_net_init(struct net *net)
 	WRITE_ONCE(net->smc.sysctl_rmem, net_smc_rmem_init);
 	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
 	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
+	net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
+	net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
 	/* disable handshake limitation by default */
 	net->smc.limit_smc_hs = 0;
 
diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h
index eb2465ae1e15..8538915af7af 100644
--- a/net/smc/smc_sysctl.h
+++ b/net/smc/smc_sysctl.h
@@ -25,6 +25,8 @@ static inline int smc_sysctl_net_init(struct net *net)
 	net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
 	net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
 	net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
+	net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
+	net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
 	return 0;
 }
 
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index b04a21b8c511..883fb0f1ce43 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -547,9 +547,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 		    IB_QP_DEST_QPN,
 		    &init_attr);
 
-	lnk->wr_tx_cnt = min_t(size_t, SMC_WR_BUF_CNT,
+	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT * 3,
+	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -741,50 +741,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
 int smc_wr_alloc_link_mem(struct smc_link *link)
 {
 	/* allocate link related memory */
-	link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL);
+	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
+				   SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, link->wr_rx_buflen,
+	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
-	link->wr_tx_ibs = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_ibs[0]),
-				  GFP_KERNEL);
+	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
+				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
-	link->wr_tx_rdmas = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
-	link->wr_tx_rdma_sges = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
 	if (!link->wr_tx_rdma_sges)
 		goto no_mem_wr_tx_rdmas;
-	link->wr_tx_sges = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_sges[0]),
+	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
 		goto no_mem_wr_tx_sges;
-	link->wr_tx_mask = bitmap_zalloc(SMC_WR_BUF_CNT, GFP_KERNEL);
+	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
 	if (!link->wr_tx_mask)
 		goto no_mem_wr_rx_sges;
-	link->wr_tx_pends = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_pends[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
-	link->wr_tx_compl = kcalloc(SMC_WR_BUF_CNT,
+	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_compl)
@@ -905,7 +906,7 @@ int smc_wr_create_link(struct smc_link *lnk)
 		goto dma_unmap;
 	}
 	smc_wr_init_sge(lnk);
-	bitmap_zero(lnk->wr_tx_mask, SMC_WR_BUF_CNT);
+	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
 	init_waitqueue_head(&lnk->wr_tx_wait);
 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
 	if (rc)
diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index f3008dda222a..aa4533af9122 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -19,8 +19,6 @@
 #include "smc.h"
 #include "smc_core.h"
 
-#define SMC_WR_BUF_CNT 16	/* # of ctrl buffers per link */
-
 #define SMC_WR_TX_WAIT_FREE_SLOT_TIME	(10 * HZ)
 
 #define SMC_WR_TX_SIZE 44 /* actual size of wr_send data (<=SMC_WR_BUF_SIZE) */
-- 
2.48.1



* [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-28 23:59 [PATCH net-next v5 0/2] net/smc: make wr buffer count configurable Halil Pasic
  2025-09-29  0:00 ` [PATCH net-next v5 1/2] " Halil Pasic
@ 2025-09-29  0:00 ` Halil Pasic
  2025-09-29  1:50   ` Dust Li
  1 sibling, 1 reply; 12+ messages in thread
From: Halil Pasic @ 2025-09-29  0:00 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	Guangguan Wang, Halil Pasic, netdev, linux-doc, linux-kernel,
	linux-rdma, linux-s390

Currently a -ENOMEM from smc_wr_alloc_link_mem() is handled by
giving up and going the way of a TCP fallback. This was reasonable
while the sizes of the allocations there were compile time constants
and reasonably small. But now those are actually configurable.

So instead of giving up, keep retrying with half of the requested size
until we dip below the old static sizes -- then give up! In terms of
numbers that means we give up when it is certain that we would at best
end up allocating fewer than 16 send WR buffers or fewer than 48 recv WR
buffers. This is to avoid regressions due to having fewer buffers
compared to the static values of the past.

Please note that SMC-R is supposed to be an optimisation over TCP, and
falling back to TCP is superior to establishing an SMC connection that
is going to perform worse. If the memory allocation fails (and we
propagate -ENOMEM), we fall back to TCP.

Preserve (modulo truncation) the ratio of send/recv WR buffer counts.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>
---
 Documentation/networking/smc-sysctl.rst |  8 ++++--
 net/smc/smc_core.c                      | 34 +++++++++++++++++--------
 net/smc/smc_core.h                      |  2 ++
 net/smc/smc_wr.c                        | 28 ++++++++++----------
 4 files changed, 46 insertions(+), 26 deletions(-)

diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index 5de4893ef3e7..4a5b4c89bc97 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -85,7 +85,9 @@ smcr_max_send_wr - INTEGER
 
 	Please be aware that all the buffers need to be allocated as a physically
 	contiguous array in which each element is a single buffer and has the size
-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
+	with half of the buffer count until it is either successful or (unlikely)
+	we dip below the old hard coded value of 16, where we give up much
 	like before having this control.
 
 	Default: 16
@@ -103,7 +105,9 @@ smcr_max_recv_wr - INTEGER
 
 	Please be aware that all the buffers need to be allocated as a physically
 	contiguous array in which each element is a single buffer and has the size
-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
+	with half of the buffer count until it is either successful or (unlikely)
+	we dip below the old hard coded value of 48, where we give up much
 	like before having this control.
 
 	Default: 48
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index be0c2da83d2b..e4eabc83719e 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	lnk->clearing = 0;
 	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
 	lnk->link_id = smcr_next_link_id(lgr);
+	lnk->max_send_wr = lgr->max_send_wr;
+	lnk->max_recv_wr = lgr->max_recv_wr;
 	lnk->lgr = lgr;
 	smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
 	lnk->link_idx = link_idx;
@@ -836,27 +838,39 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	rc = smc_llc_link_init(lnk);
 	if (rc)
 		goto out;
-	rc = smc_wr_alloc_link_mem(lnk);
-	if (rc)
-		goto clear_llc_lnk;
 	rc = smc_ib_create_protection_domain(lnk);
 	if (rc)
-		goto free_link_mem;
-	rc = smc_ib_create_queue_pair(lnk);
-	if (rc)
-		goto dealloc_pd;
+		goto clear_llc_lnk;
+	do {
+		rc = smc_ib_create_queue_pair(lnk);
+		if (rc)
+			goto dealloc_pd;
+		rc = smc_wr_alloc_link_mem(lnk);
+		if (!rc)
+			break;
+		else if (rc != -ENOMEM) /* give up */
+			goto destroy_qp;
+		/* retry with smaller ... */
+		lnk->max_send_wr /= 2;
+		lnk->max_recv_wr /= 2;
+		/* ... unless dropping below the old SMC_WR_BUF_CNT based sizes */
+		if (lnk->max_send_wr < 16 || lnk->max_recv_wr < 48)
+			goto destroy_qp;
+		smc_ib_destroy_queue_pair(lnk);
+	} while (1);
+
 	rc = smc_wr_create_link(lnk);
 	if (rc)
-		goto destroy_qp;
+		goto free_link_mem;
 	lnk->state = SMC_LNK_ACTIVATING;
 	return 0;
 
+free_link_mem:
+	smc_wr_free_link_mem(lnk);
 destroy_qp:
 	smc_ib_destroy_queue_pair(lnk);
 dealloc_pd:
 	smc_ib_dealloc_protection_domain(lnk);
-free_link_mem:
-	smc_wr_free_link_mem(lnk);
 clear_llc_lnk:
 	smc_llc_link_clear(lnk, false);
 out:
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 8d06c8bb14e9..5c18f08a4c8a 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -175,6 +175,8 @@ struct smc_link {
 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
 	int			llc_testlink_time; /* testlink interval */
 	atomic_t		conn_cnt; /* connections on this link */
+	u16			max_send_wr;
+	u16			max_recv_wr;
 };
 
 /* For now we just allow one parallel link per link group. The SMC protocol
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 883fb0f1ce43..5feafa98ab1a 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -547,9 +547,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
 		    IB_QP_DEST_QPN,
 		    &init_attr);
 
-	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
+	lnk->wr_tx_cnt = min_t(size_t, lnk->max_send_wr,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
+	lnk->wr_rx_cnt = min_t(size_t, lnk->max_recv_wr,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -741,51 +741,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
 int smc_wr_alloc_link_mem(struct smc_link *link)
 {
 	/* allocate link related memory */
-	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_bufs = kcalloc(link->max_send_wr,
 				   SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
+	link->wr_rx_bufs = kcalloc(link->max_recv_wr, link->wr_rx_buflen,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
-	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_ibs = kcalloc(link->max_send_wr,
 				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
+	link->wr_rx_ibs = kcalloc(link->max_recv_wr,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
 		goto no_mem_wr_tx_ibs;
-	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_rdmas = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_rdmas[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_rdmas)
 		goto no_mem_wr_rx_ibs;
-	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_rdma_sges = kcalloc(link->max_send_wr,
 					sizeof(link->wr_tx_rdma_sges[0]),
 					GFP_KERNEL);
 	if (!link->wr_tx_rdma_sges)
 		goto no_mem_wr_tx_rdmas;
-	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
+	link->wr_tx_sges = kcalloc(link->max_send_wr, sizeof(link->wr_tx_sges[0]),
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
+	link->wr_rx_sges = kcalloc(link->max_recv_wr,
 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
 		goto no_mem_wr_tx_sges;
-	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
+	link->wr_tx_mask = bitmap_zalloc(link->max_send_wr, GFP_KERNEL);
 	if (!link->wr_tx_mask)
 		goto no_mem_wr_rx_sges;
-	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_pends = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_pends[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_pends)
 		goto no_mem_wr_tx_mask;
-	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
+	link->wr_tx_compl = kcalloc(link->max_send_wr,
 				    sizeof(link->wr_tx_compl[0]),
 				    GFP_KERNEL);
 	if (!link->wr_tx_compl)
@@ -906,7 +906,7 @@ int smc_wr_create_link(struct smc_link *lnk)
 		goto dma_unmap;
 	}
 	smc_wr_init_sge(lnk);
-	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
+	bitmap_zero(lnk->wr_tx_mask, lnk->max_send_wr);
 	init_waitqueue_head(&lnk->wr_tx_wait);
 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
 	if (rc)
-- 
2.48.1



* Re: [PATCH net-next v5 1/2] net/smc: make wr buffer count configurable
  2025-09-29  0:00 ` [PATCH net-next v5 1/2] " Halil Pasic
@ 2025-09-29  1:28   ` Dust Li
  2025-10-01 10:29   ` Bagas Sanjaya
  1 sibling, 0 replies; 12+ messages in thread
From: Dust Li @ 2025-09-29  1:28 UTC (permalink / raw)
  To: Halil Pasic, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	Guangguan Wang, netdev, linux-doc, linux-kernel, linux-rdma,
	linux-s390

On 2025-09-29 02:00:00, Halil Pasic wrote:
>Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
>SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
>get replaced with lgr->max_send_wr and lgr->max_recv_wr respectively.
>
>Please note that although with the default sysctl values
>qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr is maintained, this
>can not be assumed to be generally true any more. I see no downside to
>that, but my confidence level is rather modest.
>
>Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
>Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>

Reviewed-by: Dust Li <dust.li@linux.alibaba.com>

Best regards,
Dust



* Re: [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-29  0:00 ` [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
@ 2025-09-29  1:50   ` Dust Li
  2025-09-29  9:22     ` Halil Pasic
  2025-10-06  5:55     ` Mahanta Jambigi
  0 siblings, 2 replies; 12+ messages in thread
From: Dust Li @ 2025-09-29  1:50 UTC (permalink / raw)
  To: Halil Pasic, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	Guangguan Wang, netdev, linux-doc, linux-kernel, linux-rdma,
	linux-s390

On 2025-09-29 02:00:01, Halil Pasic wrote:
>Currently a -ENOMEM from smc_wr_alloc_link_mem() is handled by
>giving up and going the way of a TCP fallback. This was reasonable
>while the sizes of the allocations there were compile time constants
>and reasonably small. But now those are actually configurable.
>
>So instead of giving up, keep retrying with half of the requested size
>until we dip below the old static sizes -- then give up! In terms of
>numbers that means we give up when it is certain that we would at best
>end up allocating fewer than 16 send WR buffers or fewer than 48 recv WR
>buffers. This is to avoid regressions due to having fewer buffers
>compared to the static values of the past.
>
>Please note that SMC-R is supposed to be an optimisation over TCP, and
>falling back to TCP is superior to establishing an SMC connection that
>is going to perform worse. If the memory allocation fails (and we
>propagate -ENOMEM), we fall back to TCP.
>
>Preserve (modulo truncation) the ratio of send/recv WR buffer counts.
>
>Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
>Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
>Reviewed-by: Mahanta Jambigi <mjambigi@linux.ibm.com>
>Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>
>---
> Documentation/networking/smc-sysctl.rst |  8 ++++--
> net/smc/smc_core.c                      | 34 +++++++++++++++++--------
> net/smc/smc_core.h                      |  2 ++
> net/smc/smc_wr.c                        | 28 ++++++++++----------
> 4 files changed, 46 insertions(+), 26 deletions(-)
>
>diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
>index 5de4893ef3e7..4a5b4c89bc97 100644
>--- a/Documentation/networking/smc-sysctl.rst
>+++ b/Documentation/networking/smc-sysctl.rst
>@@ -85,7 +85,9 @@ smcr_max_send_wr - INTEGER
> 
> 	Please be aware that all the buffers need to be allocated as a physically
> 	contiguous array in which each element is a single buffer and has the size
>-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
>+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
>+	with half of the buffer count until it is either successful or (unlikely)
>+	we dip below the old hard coded value of 16, where we give up much
> 	like before having this control.
> 
> 	Default: 16
>@@ -103,7 +105,9 @@ smcr_max_recv_wr - INTEGER
> 
> 	Please be aware that all the buffers need to be allocated as a physically
> 	contiguous array in which each element is a single buffer and has the size
>-	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
>+	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we keep retrying
>+	with half of the buffer count until it is either successful or (unlikely)
>+	we dip below the old hard coded value of 48, where we give up much
> 	like before having this control.
> 
> 	Default: 48
>diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
>index be0c2da83d2b..e4eabc83719e 100644
>--- a/net/smc/smc_core.c
>+++ b/net/smc/smc_core.c
>@@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
> 	lnk->clearing = 0;
> 	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
> 	lnk->link_id = smcr_next_link_id(lgr);
>+	lnk->max_send_wr = lgr->max_send_wr;
>+	lnk->max_recv_wr = lgr->max_recv_wr;
> 	lnk->lgr = lgr;
> 	smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
> 	lnk->link_idx = link_idx;
>@@ -836,27 +838,39 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
> 	rc = smc_llc_link_init(lnk);
> 	if (rc)
> 		goto out;
>-	rc = smc_wr_alloc_link_mem(lnk);
>-	if (rc)
>-		goto clear_llc_lnk;
> 	rc = smc_ib_create_protection_domain(lnk);
> 	if (rc)
>-		goto free_link_mem;
>-	rc = smc_ib_create_queue_pair(lnk);
>-	if (rc)
>-		goto dealloc_pd;
>+		goto clear_llc_lnk;
>+	do {
>+		rc = smc_ib_create_queue_pair(lnk);
>+		if (rc)
>+			goto dealloc_pd;
>+		rc = smc_wr_alloc_link_mem(lnk);
>+		if (!rc)
>+			break;
>+		else if (rc != -ENOMEM) /* give up */
>+			goto destroy_qp;
>+		/* retry with smaller ... */
>+		lnk->max_send_wr /= 2;
>+		lnk->max_recv_wr /= 2;
>+		/* ... unless dropping below old SMC_WR_BUF_SIZE */
>+		if (lnk->max_send_wr < 16 || lnk->max_recv_wr < 48)
>+			goto destroy_qp;
>+		smc_ib_destroy_queue_pair(lnk);
>+	} while (1);
>+
> 	rc = smc_wr_create_link(lnk);
> 	if (rc)
>-		goto destroy_qp;
>+		goto free_link_mem;
> 	lnk->state = SMC_LNK_ACTIVATING;
> 	return 0;
> 
>+free_link_mem:
>+	smc_wr_free_link_mem(lnk);
> destroy_qp:
> 	smc_ib_destroy_queue_pair(lnk);
> dealloc_pd:
> 	smc_ib_dealloc_protection_domain(lnk);
>-free_link_mem:
>-	smc_wr_free_link_mem(lnk);
> clear_llc_lnk:
> 	smc_llc_link_clear(lnk, false);
> out:
>diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
>index 8d06c8bb14e9..5c18f08a4c8a 100644
>--- a/net/smc/smc_core.h
>+++ b/net/smc/smc_core.h
>@@ -175,6 +175,8 @@ struct smc_link {
> 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
> 	int			llc_testlink_time; /* testlink interval */
> 	atomic_t		conn_cnt; /* connections on this link */
>+	u16			max_send_wr;
>+	u16			max_recv_wr;

Here, you've moved max_send_wr/max_recv_wr from the link group to individual links.
This means we can now have different max_send_wr/max_recv_wr values on two
different links within the same link group.

Since at Alibaba we don't use multi-link configurations, we haven't tested
this scenario. Have you tested the link-down handling process in a multi-link
setup?

Otherwise, the patch looks good to me.

Best regards,
Dust

> };
> 
> /* For now we just allow one parallel link per link group. The SMC protocol
>diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
>index 883fb0f1ce43..5feafa98ab1a 100644
>--- a/net/smc/smc_wr.c
>+++ b/net/smc/smc_wr.c
>@@ -547,9 +547,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
> 		    IB_QP_DEST_QPN,
> 		    &init_attr);
> 
>-	lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
>+	lnk->wr_tx_cnt = min_t(size_t, lnk->max_send_wr,
> 			       lnk->qp_attr.cap.max_send_wr);
>-	lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
>+	lnk->wr_rx_cnt = min_t(size_t, lnk->max_recv_wr,
> 			       lnk->qp_attr.cap.max_recv_wr);
> }
> 
>@@ -741,51 +741,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
> int smc_wr_alloc_link_mem(struct smc_link *link)
> {
> 	/* allocate link related memory */
>-	link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
>+	link->wr_tx_bufs = kcalloc(link->max_send_wr,
> 				   SMC_WR_BUF_SIZE, GFP_KERNEL);
> 	if (!link->wr_tx_bufs)
> 		goto no_mem;
>-	link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
>+	link->wr_rx_bufs = kcalloc(link->max_recv_wr, link->wr_rx_buflen,
> 				   GFP_KERNEL);
> 	if (!link->wr_rx_bufs)
> 		goto no_mem_wr_tx_bufs;
>-	link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
>+	link->wr_tx_ibs = kcalloc(link->max_send_wr,
> 				  sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
> 	if (!link->wr_tx_ibs)
> 		goto no_mem_wr_rx_bufs;
>-	link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
>+	link->wr_rx_ibs = kcalloc(link->max_recv_wr,
> 				  sizeof(link->wr_rx_ibs[0]),
> 				  GFP_KERNEL);
> 	if (!link->wr_rx_ibs)
> 		goto no_mem_wr_tx_ibs;
>-	link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
>+	link->wr_tx_rdmas = kcalloc(link->max_send_wr,
> 				    sizeof(link->wr_tx_rdmas[0]),
> 				    GFP_KERNEL);
> 	if (!link->wr_tx_rdmas)
> 		goto no_mem_wr_rx_ibs;
>-	link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
>+	link->wr_tx_rdma_sges = kcalloc(link->max_send_wr,
> 					sizeof(link->wr_tx_rdma_sges[0]),
> 					GFP_KERNEL);
> 	if (!link->wr_tx_rdma_sges)
> 		goto no_mem_wr_tx_rdmas;
>-	link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
>+	link->wr_tx_sges = kcalloc(link->max_send_wr, sizeof(link->wr_tx_sges[0]),
> 				   GFP_KERNEL);
> 	if (!link->wr_tx_sges)
> 		goto no_mem_wr_tx_rdma_sges;
>-	link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
>+	link->wr_rx_sges = kcalloc(link->max_recv_wr,
> 				   sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
> 				   GFP_KERNEL);
> 	if (!link->wr_rx_sges)
> 		goto no_mem_wr_tx_sges;
>-	link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
>+	link->wr_tx_mask = bitmap_zalloc(link->max_send_wr, GFP_KERNEL);
> 	if (!link->wr_tx_mask)
> 		goto no_mem_wr_rx_sges;
>-	link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
>+	link->wr_tx_pends = kcalloc(link->max_send_wr,
> 				    sizeof(link->wr_tx_pends[0]),
> 				    GFP_KERNEL);
> 	if (!link->wr_tx_pends)
> 		goto no_mem_wr_tx_mask;
>-	link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
>+	link->wr_tx_compl = kcalloc(link->max_send_wr,
> 				    sizeof(link->wr_tx_compl[0]),
> 				    GFP_KERNEL);
> 	if (!link->wr_tx_compl)
>@@ -906,7 +906,7 @@ int smc_wr_create_link(struct smc_link *lnk)
> 		goto dma_unmap;
> 	}
> 	smc_wr_init_sge(lnk);
>-	bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
>+	bitmap_zero(lnk->wr_tx_mask, lnk->max_send_wr);
> 	init_waitqueue_head(&lnk->wr_tx_wait);
> 	rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
> 	if (rc)
>-- 
>2.48.1

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-29  1:50   ` Dust Li
@ 2025-09-29  9:22     ` Halil Pasic
  2025-10-01  7:21       ` Paolo Abeni
  2025-10-06  5:55     ` Mahanta Jambigi
  1 sibling, 1 reply; 12+ messages in thread
From: Halil Pasic @ 2025-09-29  9:22 UTC (permalink / raw)
  To: Dust Li
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, D. Wythe, Sidraya Jayagond,
	Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, Guangguan Wang,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Mon, 29 Sep 2025 09:50:52 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:

> >@@ -175,6 +175,8 @@ struct smc_link {
> > 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
> > 	int			llc_testlink_time; /* testlink interval */
> > 	atomic_t		conn_cnt; /* connections on this link */
> >+	u16			max_send_wr;
> >+	u16			max_recv_wr;  
> 
> Here, you've moved max_send_wr/max_recv_wr from the link group to individual links.
> This means we can now have different max_send_wr/max_recv_wr values on two
> different links within the same link group.

Only if allocations fail. Please notice that the hunk:

--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	lnk->clearing = 0;
 	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
 	lnk->link_id = smcr_next_link_id(lgr);
+	lnk->max_send_wr = lgr->max_send_wr;
+	lnk->max_recv_wr = lgr->max_recv_wr;

initializes the link values with the values from the lgr which are in
turn picked up from the sysctls at lgr creation time. I have made an
effort to keep these values the same for each link, but in case the
allocation fails and we do back off, we can end up with different values
on the links. 

The alternative would be to throw in the towel, and not create
a second link if we can't match what worked for the first one.

> 
> Since at Alibaba we don't use multi-link configurations, we haven't tested
> this scenario. Have you tested the link-down handling process in a multi-link
> setup?
> 

Mahanta was so kind to do most of the testing on this. I don't think
I've tested this myself. @Mahanta: Would you be kind to give this a try
if it wasn't covered in the past? The best way is probably to modify
the code to force such a scenario. I don't think it is easy to somehow
trigger in the wild.

BTW I don't expect any problems. I think at worst the one link would
end up giving worse performance than the other, but I guess that can
happen for other reasons as well (like different HW for the two links).

But I think getting some sort of a query interface which would tell
us how much did we end up with down the road would be a good idea anyway.

And I hope we can switch to vmalloc down the road as well, which would
make back off less likely.

> Otherwise, the patch looks good to me.
> 

Thank you very much!

Regards,
Halil


* Re: [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-29  9:22     ` Halil Pasic
@ 2025-10-01  7:21       ` Paolo Abeni
  2025-10-01  9:37         ` Halil Pasic
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Abeni @ 2025-10-01  7:21 UTC (permalink / raw)
  To: Halil Pasic, Dust Li
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Simon Horman,
	Jonathan Corbet, D. Wythe, Sidraya Jayagond, Wenjia Zhang,
	Mahanta Jambigi, Tony Lu, Wen Gu, Guangguan Wang, netdev,
	linux-doc, linux-kernel, linux-rdma, linux-s390

On 9/29/25 11:22 AM, Halil Pasic wrote:
> On Mon, 29 Sep 2025 09:50:52 +0800
> Dust Li <dust.li@linux.alibaba.com> wrote:
> 
>>> @@ -175,6 +175,8 @@ struct smc_link {
>>> 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
>>> 	int			llc_testlink_time; /* testlink interval */
>>> 	atomic_t		conn_cnt; /* connections on this link */
>>> +	u16			max_send_wr;
>>> +	u16			max_recv_wr;  
>>
>> Here, you've moved max_send_wr/max_recv_wr from the link group to individual links.
>> This means we can now have different max_send_wr/max_recv_wr values on two
>> different links within the same link group.
> 
> Only if allocations fail. Please notice that the hunk:
> 
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -810,6 +810,8 @@ int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
>  	lnk->clearing = 0;
>  	lnk->path_mtu = lnk->smcibdev->pattr[lnk->ibport - 1].active_mtu;
>  	lnk->link_id = smcr_next_link_id(lgr);
> +	lnk->max_send_wr = lgr->max_send_wr;
> +	lnk->max_recv_wr = lgr->max_recv_wr;
> 
> initializes the link values with the values from the lgr which are in
> turn picked up from the sysctls at lgr creation time. I have made an
> effort to keep these values the same for each link, but in case the
> allocation fails and we do back off, we can end up with different values
> on the links. 
> 
> The alternative would be to throw in the towel, and not create
> a second link if we can't match what worked for the first one.
> 
>>
>> Since at Alibaba we don't use multi-link configurations, we haven't tested
>> this scenario. Have you tested the link-down handling process in a multi-link
>> setup?
>>
> 
> Mahanta was so kind to do most of the testing on this. I don't think
> I've tested this myself. @Mahanta: Would you be kind to give this a try
> if it wasn't covered in the past? The best way is probably to modify
> the code to force such a scenario. I don't think it is easy to somehow
> trigger in the wild.
> 
> BTW I don't expect any problems. I think at worst the one link would
> end up giving worse performance than the other, but I guess that can
> happen for other reasons as well (like different HW for the two links).
> 
> But I think getting some sort of a query interface which would tell
> us how much did we end up with down the road would be a good idea anyway.
> 
> And I hope we can switch to vmalloc down the road as well, which would
> make back off less likely.

Unfortunately we are closing the net-next PR right now and I would
prefer such testing being reported explicitly. Let's defer this series
to the next cycle: please re-post when net-next will reopen after Oct 12th.

Thanks,

Paolo



* Re: [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-10-01  7:21       ` Paolo Abeni
@ 2025-10-01  9:37         ` Halil Pasic
  0 siblings, 0 replies; 12+ messages in thread
From: Halil Pasic @ 2025-10-01  9:37 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Dust Li, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Simon Horman, Jonathan Corbet, D. Wythe, Sidraya Jayagond,
	Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, Guangguan Wang,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Wed, 1 Oct 2025 09:21:09 +0200
Paolo Abeni <pabeni@redhat.com> wrote:

> >>
> >> Since at Alibaba we don't use multi-link configurations, we haven't tested
> >> this scenario. Have you tested the link-down handling process in a multi-link
> >> setup?
> >>  
> > 
> > Mahanta was so kind to do most of the testing on this. I don't think
> > I've tested this myself. @Mahanta: Would you be kind to give this a try
> > if it wasn't covered in the past? The best way is probably to modify
> > the code to force such a scenario. I don't think it is easy to somehow
> > trigger in the wild.
> > 
> > BTW I don't expect any problems. I think at worst the one link would
> > end up giving worse performance than the other, but I guess that can
> > happen for other reasons as well (like different HW for the two links).
> > 
> > But I think getting some sort of a query interface which would tell
> > us how much did we end up with down the road would be a good idea anyway.
> > 
> > And I hope we can switch to vmalloc down the road as well, which would
> > make back off less likely.  
> 
> Unfortunately we are closing the net-next PR right now and I would
> prefer such testing being reported explicitly. Let's defer this series
> to the next cycle: please re-post when net-next will reopen after Oct 12th.

Thank you, Paolo! Will do! I have talked to Mahanta yesterday about this,
and he has done some testing in the meantime, but I'm not sure he
covered everything he wanted. And he is out for the week (today and
tomorrow are a public holiday in his geography).

Regards,
Halil


* Re: [PATCH net-next v5 1/2] net/smc: make wr buffer count configurable
  2025-09-29  0:00 ` [PATCH net-next v5 1/2] " Halil Pasic
  2025-09-29  1:28   ` Dust Li
@ 2025-10-01 10:29   ` Bagas Sanjaya
  1 sibling, 0 replies; 12+ messages in thread
From: Bagas Sanjaya @ 2025-10-01 10:29 UTC (permalink / raw)
  To: Halil Pasic, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, D. Wythe, Dust Li,
	Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
	Guangguan Wang, netdev, linux-doc, linux-kernel, linux-rdma,
	linux-s390


On Mon, Sep 29, 2025 at 02:00:00AM +0200, Halil Pasic wrote:
> diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
> index a874d007f2db..5de4893ef3e7 100644
> --- a/Documentation/networking/smc-sysctl.rst
> +++ b/Documentation/networking/smc-sysctl.rst
> @@ -71,3 +71,39 @@ smcr_max_conns_per_lgr - INTEGER
>  	acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
>  
>  	Default: 255
> +
> +smcr_max_send_wr - INTEGER
> +	So called work request buffers are SMCR link (and RDMA queue pair) level
        So-called
> +	resources necessary for performing RDMA operations. Since up to 255
> +	connections can share a link group and thus also a link and the number
> +	of the work request buffers is decided when the link is allocated,
> +	depending on the workload it can be a bottleneck in a sense that threads
> +	have to wait for work request buffers to become available. Before the
> +	introduction of this control the maximal number of work request buffers
> +	available on the send path used to be hard coded to 16. With this control
> +	it becomes configurable. The acceptable range is between 2 and 2048.
> +
> +	Please be aware that all the buffers need to be allocated as a physically
> +	continuous array in which each element is a single buffer and has the size
> +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> +	like before having this control.
> +
> +	Default: 16
> +
> +smcr_max_recv_wr - INTEGER
> +	So called work request buffers are SMCR link (and RDMA queue pair) level
Ditto.
> +	resources necessary for performing RDMA operations. Since up to 255
> +	connections can share a link group and thus also a link and the number
> +	of the work request buffers is decided when the link is allocated,
> +	depending on the workload it can be a bottleneck in a sense that threads
> +	have to wait for work request buffers to become available. Before the
> +	introduction of this control the maximal number of work request buffers
> +	available on the receive path used to be hard coded to 16. With this control
> +	it becomes configurable. The acceptable range is between 2 and 2048.
> +
> +	Please be aware that all the buffers need to be allocated as a physically
> +	continuous array in which each element is a single buffer and has the size
> +	of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> +	like before having this control.
> +
> +	Default: 48

Thanks.

-- 
An old man doll... just what I always wanted! - Clara



* Re: [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-09-29  1:50   ` Dust Li
  2025-09-29  9:22     ` Halil Pasic
@ 2025-10-06  5:55     ` Mahanta Jambigi
  2025-10-08 14:06       ` Dust Li
  1 sibling, 1 reply; 12+ messages in thread
From: Mahanta Jambigi @ 2025-10-06  5:55 UTC (permalink / raw)
  To: dust.li, Halil Pasic, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
	D. Wythe, Sidraya Jayagond, Wenjia Zhang, Tony Lu, Wen Gu,
	Guangguan Wang, netdev, linux-doc, linux-kernel, linux-rdma,
	linux-s390

On 29/09/25 7:20 am, Dust Li wrote:
>> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
>> index 8d06c8bb14e9..5c18f08a4c8a 100644
>> --- a/net/smc/smc_core.h
>> +++ b/net/smc/smc_core.h
>> @@ -175,6 +175,8 @@ struct smc_link {
>> 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
>> 	int			llc_testlink_time; /* testlink interval */
>> 	atomic_t		conn_cnt; /* connections on this link */
>> +	u16			max_send_wr;
>> +	u16			max_recv_wr;
> 
> Here, you've moved max_send_wr/max_recv_wr from the link group to individual links.
> This means we can now have different max_send_wr/max_recv_wr values on two
> different links within the same link group.
>> Since at Alibaba we don't use multi-link configurations, we haven't tested

Does Alibaba always use a single RoCE device for SMC-R? In that case how
redundancy is achieved if that link goes down?

> this scenario. Have you tested the link-down handling process in a multi-link
> setup?
I did test this after your query & don't see any issues. As Halil
mentioned, in the worst case one link might perform worse than the
other, and that only if the kcalloc() failed for that link in
smc_wr_alloc_link_mem() & succeeded in a subsequent request with a
reduced (halved) max_send_wr/max_recv_wr.
> Otherwise, the patch looks good to me.
> 
> Best regards,
> Dust


* Re: [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-10-06  5:55     ` Mahanta Jambigi
@ 2025-10-08 14:06       ` Dust Li
  2025-10-08 15:08         ` Halil Pasic
  0 siblings, 1 reply; 12+ messages in thread
From: Dust Li @ 2025-10-08 14:06 UTC (permalink / raw)
  To: Mahanta Jambigi, Halil Pasic, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
	D. Wythe, Sidraya Jayagond, Wenjia Zhang, Tony Lu, Wen Gu,
	Guangguan Wang, netdev, linux-doc, linux-kernel, linux-rdma,
	linux-s390

On 2025-10-06 11:25:22, Mahanta Jambigi wrote:
>On 29/09/25 7:20 am, Dust Li wrote:
>>> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
>>> index 8d06c8bb14e9..5c18f08a4c8a 100644
>>> --- a/net/smc/smc_core.h
>>> +++ b/net/smc/smc_core.h
>>> @@ -175,6 +175,8 @@ struct smc_link {
>>> 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
>>> 	int			llc_testlink_time; /* testlink interval */
>>> 	atomic_t		conn_cnt; /* connections on this link */
>>> +	u16			max_send_wr;
>>> +	u16			max_recv_wr;
>> 
>> Here, you've moved max_send_wr/max_recv_wr from the link group to individual links.
>> This means we can now have different max_send_wr/max_recv_wr values on two
>> different links within the same link group.
>> Since in Alibaba we doesn't use multi-link configurations, we haven't tested
>
>Does Alibaba always use a single RoCE device for SMC-R? In that case how
>redundancy is achieved if that link goes down?

We expose a virtual RDMA device to our client inside their virtual
machine. The underlying network is already redundant, so it’s got
built-in reliability. You can think of it kind of like virtio-net, but
instead of a regular virtual NIC, it’s an RDMA device.

>
>> this scenario. Have you tested the link-down handling process in a multi-link
>> setup?
>I did test this after your query & don't see any issues. As Halil
>mentioned, in the worst case one link might perform worse than the
>other, and that only if the kcalloc() failed for that link in
>smc_wr_alloc_link_mem() & succeeded in a subsequent request with a
>reduced (halved) max_send_wr/max_recv_wr.

Great! You can add my

Reviewed-by: Dust Li <dust.li@linux.alibaba.com>

>> Otherwise, the patch looks good to me.
>> 
>> Best regards,
>> Dust


* Re: [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully
  2025-10-08 14:06       ` Dust Li
@ 2025-10-08 15:08         ` Halil Pasic
  0 siblings, 0 replies; 12+ messages in thread
From: Halil Pasic @ 2025-10-08 15:08 UTC (permalink / raw)
  To: Dust Li
  Cc: Mahanta Jambigi, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, D. Wythe,
	Sidraya Jayagond, Wenjia Zhang, Tony Lu, Wen Gu, Guangguan Wang,
	netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
	Halil Pasic

On Wed, 8 Oct 2025 22:06:08 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:

> >I did test this after you query & don't see any issues. As Halil
> >mentioned in worst case scenario one link might perform lesser than the
> >other, that too if the kcalloc() failed for that link in
> >smc_wr_alloc_link_mem() & succeeded in subsequent request with reduced
> >max_send_wr/max_recv_wr size(half).  
> 
> Great! You can add my
> 
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>

Thank you! Will do and respin once net-next is open again.

Regards,
Halil


end of thread, other threads:[~2025-10-08 15:08 UTC | newest]

Thread overview: 12+ messages
2025-09-28 23:59 [PATCH net-next v5 0/2] net/smc: make wr buffer count configurable Halil Pasic
2025-09-29  0:00 ` [PATCH net-next v5 1/2] " Halil Pasic
2025-09-29  1:28   ` Dust Li
2025-10-01 10:29   ` Bagas Sanjaya
2025-09-29  0:00 ` [PATCH net-next v5 2/2] net/smc: handle -ENOMEM from smc_wr_alloc_link_mem gracefully Halil Pasic
2025-09-29  1:50   ` Dust Li
2025-09-29  9:22     ` Halil Pasic
2025-10-01  7:21       ` Paolo Abeni
2025-10-01  9:37         ` Halil Pasic
2025-10-06  5:55     ` Mahanta Jambigi
2025-10-08 14:06       ` Dust Li
2025-10-08 15:08         ` Halil Pasic
