* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-21 21:44 ` [PATCH net-next v3 1/2] " Halil Pasic
@ 2025-09-24 17:27 ` Sidraya Jayagond
2025-09-25 9:27 ` Paolo Abeni
2025-09-26 2:44 ` Guangguan Wang
2 siblings, 0 replies; 21+ messages in thread
From: Sidraya Jayagond @ 2025-09-24 17:27 UTC (permalink / raw)
To: Halil Pasic, Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe,
Dust Li, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev,
linux-doc, linux-kernel, linux-rdma, linux-s390
On 22/09/25 3:14 am, Halil Pasic wrote:
> Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
> SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
> get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.
>
> While at it let us also remove a confusing comment that is either not
> about the context in which it resides (describing
> qp_attr.cap.max_send_wr and qp_attr.cap.max_recv_wr) or not applicable
> any more when these values become configurable.
>
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
> Documentation/networking/smc-sysctl.rst | 36 +++++++++++++++++++++++++
> include/net/netns/smc.h | 2 ++
> net/smc/smc_core.h | 6 +++++
> net/smc/smc_ib.c | 7 ++---
> net/smc/smc_llc.c | 2 ++
> net/smc/smc_sysctl.c | 22 +++++++++++++++
> net/smc/smc_sysctl.h | 2 ++
> net/smc/smc_wr.c | 32 +++++++++++-----------
> net/smc/smc_wr.h | 2 --
> 9 files changed, 89 insertions(+), 22 deletions(-)
>
> diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
> index a874d007f2db..c94d750c7c84 100644
> --- a/Documentation/networking/smc-sysctl.rst
> +++ b/Documentation/networking/smc-sysctl.rst
> @@ -71,3 +71,39 @@ smcr_max_conns_per_lgr - INTEGER
> acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
>
> Default: 255
> +
> +smcr_max_send_wr - INTEGER
> + So called work request buffers are SMCR link (and RDMA queue pair) level
> + resources necessary for performing RDMA operations. Since up to 255
> + connections can share a link group and thus also a link and the number
> + of the work request buffers is decided when the link is allocated,
> + depending on the workload it can a bottleneck in a sense that threads
> + have to wait for work request buffers to become available. Before the
> + introduction of this control the maximal number of work request buffers
> + available on the send path used to be hard coded to 16. With this control
> + it becomes configurable. The acceptable range is between 2 and 2048.
> +
> + Please be aware that all the buffers need to be allocated as a physically
> + continuous array in which each element is a single buffer and has the size
> + of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> + like before having this control.
> +
> + Default: 16
> +
> +smcr_max_recv_wr - INTEGER
> + So called work request buffers are SMCR link (and RDMA queue pair) level
> + resources necessary for performing RDMA operations. Since up to 255
> + connections can share a link group and thus also a link and the number
> + of the work request buffers is decided when the link is allocated,
> + depending on the workload it can a bottleneck in a sense that threads
> + have to wait for work request buffers to become available. Before the
> + introduction of this control the maximal number of work request buffers
> + available on the receive path used to be hard coded to 16. With this control
> + it becomes configurable. The acceptable range is between 2 and 2048.
> +
> + Please be aware that all the buffers need to be allocated as a physically
> + continuous array in which each element is a single buffer and has the size
> + of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> + like before having this control.
> +
> + Default: 48
Tuning default WR BUF count and max_conns per lgr together makes sense,
otherwise either memory is wasted or WRs queue up on a QP.
That exercise, however, can be done separately from this patchset.
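To make that trade-off concrete, here is a rough back-of-envelope sketch
(illustrative only, not code from the patch; the helper name is made up,
while SMC_WR_BUF_SIZE (48) and the per-link kcalloc() scheme are taken
from the documentation and smc_wr_alloc_link_mem() below):
```
/* Hypothetical helper: bytes pinned per link just for the send-side
 * ctrl buffer array (wr_tx_bufs); the other per-WR arrays (wr_tx_ibs,
 * wr_tx_rdmas, wr_tx_sges, ...) scale linearly in the same way.
 */
#define SMC_WR_BUF_SIZE 48	/* bytes per ctrl buffer, see doc above */

static unsigned long wr_tx_bufs_bytes(unsigned int max_send_wr)
{
	/* 16 WRs -> 768 bytes; 2048 WRs -> 96 KiB of physically
	 * contiguous memory per link
	 */
	return (unsigned long)max_send_wr * SMC_WR_BUF_SIZE;
}
```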
> diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
> index fc752a50f91b..6ceb12baec24 100644
> --- a/include/net/netns/smc.h
> +++ b/include/net/netns/smc.h
> @@ -24,5 +24,7 @@ struct netns_smc {
> int sysctl_rmem;
> int sysctl_max_links_per_lgr;
> int sysctl_max_conns_per_lgr;
> + unsigned int sysctl_smcr_max_send_wr;
> + unsigned int sysctl_smcr_max_recv_wr;
> };
> #endif
> diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
> index 48a1b1dcb576..ab2d15929cb2 100644
> --- a/net/smc/smc_core.h
> +++ b/net/smc/smc_core.h
> @@ -33,6 +33,8 @@
> * distributions may modify it to a value between
> * 16-255 as needed.
> */
> +#define SMCR_MAX_SEND_WR_DEF 16 /* Default number of work requests per send queue */
> +#define SMCR_MAX_RECV_WR_DEF 48 /* Default number of work requests per recv queue */
>
> struct smc_lgr_list { /* list of link group definition */
> struct list_head list;
> @@ -361,6 +363,10 @@ struct smc_link_group {
> /* max conn can be assigned to lgr */
> u8 max_links;
> /* max links can be added in lgr */
> + u16 max_send_wr;
> + /* number of WR buffers on send */
> + u16 max_recv_wr;
> + /* number of WR buffers on recv */
> };
> struct { /* SMC-D */
> struct smcd_gid peer_gid;
> diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
> index 0052f02756eb..e8d35c22c525 100644
> --- a/net/smc/smc_ib.c
> +++ b/net/smc/smc_ib.c
> @@ -669,11 +669,6 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
> .recv_cq = lnk->smcibdev->roce_cq_recv,
> .srq = NULL,
> .cap = {
> - /* include unsolicited rdma_writes as well,
> - * there are max. 2 RDMA_WRITE per 1 WR_SEND
> - */
> - .max_send_wr = SMC_WR_BUF_CNT * 3,
> - .max_recv_wr = SMC_WR_BUF_CNT * 3,
> .max_send_sge = SMC_IB_MAX_SEND_SGE,
> .max_recv_sge = lnk->wr_rx_sge_cnt,
> .max_inline_data = 0,
> @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
> };
> int rc;
>
> + qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
> + qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
> lnk->roce_qp = ib_create_qp(lnk->roce_pd, &qp_attr);
> rc = PTR_ERR_OR_ZERO(lnk->roce_qp);
> if (IS_ERR(lnk->roce_qp))
> diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
> index f865c58c3aa7..f5d5eb617526 100644
> --- a/net/smc/smc_llc.c
> +++ b/net/smc/smc_llc.c
> @@ -2157,6 +2157,8 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
> init_waitqueue_head(&lgr->llc_msg_waiter);
> init_rwsem(&lgr->llc_conf_mutex);
> lgr->llc_testlink_time = READ_ONCE(net->smc.sysctl_smcr_testlink_time);
> + lgr->max_send_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_max_send_wr));
> + lgr->max_recv_wr = (u16)(READ_ONCE(net->smc.sysctl_smcr_max_recv_wr));
> }
>
> /* called after lgr was removed from lgr_list */
> diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
> index 2fab6456f765..7b2471904d04 100644
> --- a/net/smc/smc_sysctl.c
> +++ b/net/smc/smc_sysctl.c
> @@ -29,6 +29,8 @@ static int links_per_lgr_min = SMC_LINKS_ADD_LNK_MIN;
> static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX;
> static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN;
> static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX;
> +static unsigned int smcr_max_wr_min = 2;
> +static unsigned int smcr_max_wr_max = 2048;
>
> static struct ctl_table smc_table[] = {
> {
> @@ -99,6 +101,24 @@ static struct ctl_table smc_table[] = {
> .extra1 = SYSCTL_ZERO,
> .extra2 = SYSCTL_ONE,
> },
> + {
> + .procname = "smcr_max_send_wr",
> + .data = &init_net.smc.sysctl_smcr_max_send_wr,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &smcr_max_wr_min,
> + .extra2 = &smcr_max_wr_max,
> + },
> + {
> + .procname = "smcr_max_recv_wr",
> + .data = &init_net.smc.sysctl_smcr_max_recv_wr,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &smcr_max_wr_min,
> + .extra2 = &smcr_max_wr_max,
> + },
> };
>
> int __net_init smc_sysctl_net_init(struct net *net)
> @@ -130,6 +150,8 @@ int __net_init smc_sysctl_net_init(struct net *net)
> WRITE_ONCE(net->smc.sysctl_rmem, net_smc_rmem_init);
> net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
> net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
> + net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
> + net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
> /* disable handshake limitation by default */
> net->smc.limit_smc_hs = 0;
>
> diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h
> index eb2465ae1e15..8538915af7af 100644
> --- a/net/smc/smc_sysctl.h
> +++ b/net/smc/smc_sysctl.h
> @@ -25,6 +25,8 @@ static inline int smc_sysctl_net_init(struct net *net)
> net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
> net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
> net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
> + net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
> + net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
> return 0;
> }
>
> diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> index b04a21b8c511..f5b2772414fd 100644
> --- a/net/smc/smc_wr.c
> +++ b/net/smc/smc_wr.c
> @@ -34,6 +34,7 @@
> #define SMC_WR_MAX_POLL_CQE 10 /* max. # of compl. queue elements in 1 poll */
>
> #define SMC_WR_RX_HASH_BITS 4
> +
> static DEFINE_HASHTABLE(smc_wr_rx_hash, SMC_WR_RX_HASH_BITS);
> static DEFINE_SPINLOCK(smc_wr_rx_hash_lock);
>
> @@ -547,9 +548,9 @@ void smc_wr_remember_qp_attr(struct smc_link *lnk)
> IB_QP_DEST_QPN,
> &init_attr);
>
> - lnk->wr_tx_cnt = min_t(size_t, SMC_WR_BUF_CNT,
> + lnk->wr_tx_cnt = min_t(size_t, lnk->lgr->max_send_wr,
> lnk->qp_attr.cap.max_send_wr);
> - lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT * 3,
> + lnk->wr_rx_cnt = min_t(size_t, lnk->lgr->max_recv_wr,
> lnk->qp_attr.cap.max_recv_wr);
> }
>
> @@ -741,50 +742,51 @@ int smc_wr_alloc_lgr_mem(struct smc_link_group *lgr)
> int smc_wr_alloc_link_mem(struct smc_link *link)
> {
> /* allocate link related memory */
> - link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL);
> + link->wr_tx_bufs = kcalloc(link->lgr->max_send_wr,
> + SMC_WR_BUF_SIZE, GFP_KERNEL);
> if (!link->wr_tx_bufs)
> goto no_mem;
> - link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, link->wr_rx_buflen,
> + link->wr_rx_bufs = kcalloc(link->lgr->max_recv_wr, link->wr_rx_buflen,
> GFP_KERNEL);
> if (!link->wr_rx_bufs)
> goto no_mem_wr_tx_bufs;
> - link->wr_tx_ibs = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_ibs[0]),
> - GFP_KERNEL);
> + link->wr_tx_ibs = kcalloc(link->lgr->max_send_wr,
> + sizeof(link->wr_tx_ibs[0]), GFP_KERNEL);
> if (!link->wr_tx_ibs)
> goto no_mem_wr_rx_bufs;
> - link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT * 3,
> + link->wr_rx_ibs = kcalloc(link->lgr->max_recv_wr,
> sizeof(link->wr_rx_ibs[0]),
> GFP_KERNEL);
> if (!link->wr_rx_ibs)
> goto no_mem_wr_tx_ibs;
> - link->wr_tx_rdmas = kcalloc(SMC_WR_BUF_CNT,
> + link->wr_tx_rdmas = kcalloc(link->lgr->max_send_wr,
> sizeof(link->wr_tx_rdmas[0]),
> GFP_KERNEL);
> if (!link->wr_tx_rdmas)
> goto no_mem_wr_rx_ibs;
> - link->wr_tx_rdma_sges = kcalloc(SMC_WR_BUF_CNT,
> + link->wr_tx_rdma_sges = kcalloc(link->lgr->max_send_wr,
> sizeof(link->wr_tx_rdma_sges[0]),
> GFP_KERNEL);
> if (!link->wr_tx_rdma_sges)
> goto no_mem_wr_tx_rdmas;
> - link->wr_tx_sges = kcalloc(SMC_WR_BUF_CNT, sizeof(link->wr_tx_sges[0]),
> + link->wr_tx_sges = kcalloc(link->lgr->max_send_wr, sizeof(link->wr_tx_sges[0]),
> GFP_KERNEL);
> if (!link->wr_tx_sges)
> goto no_mem_wr_tx_rdma_sges;
> - link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT * 3,
> + link->wr_rx_sges = kcalloc(link->lgr->max_recv_wr,
> sizeof(link->wr_rx_sges[0]) * link->wr_rx_sge_cnt,
> GFP_KERNEL);
> if (!link->wr_rx_sges)
> goto no_mem_wr_tx_sges;
> - link->wr_tx_mask = bitmap_zalloc(SMC_WR_BUF_CNT, GFP_KERNEL);
> + link->wr_tx_mask = bitmap_zalloc(link->lgr->max_send_wr, GFP_KERNEL);
> if (!link->wr_tx_mask)
> goto no_mem_wr_rx_sges;
> - link->wr_tx_pends = kcalloc(SMC_WR_BUF_CNT,
> + link->wr_tx_pends = kcalloc(link->lgr->max_send_wr,
> sizeof(link->wr_tx_pends[0]),
> GFP_KERNEL);
> if (!link->wr_tx_pends)
> goto no_mem_wr_tx_mask;
> - link->wr_tx_compl = kcalloc(SMC_WR_BUF_CNT,
> + link->wr_tx_compl = kcalloc(link->lgr->max_send_wr,
> sizeof(link->wr_tx_compl[0]),
> GFP_KERNEL);
> if (!link->wr_tx_compl)
> @@ -905,7 +907,7 @@ int smc_wr_create_link(struct smc_link *lnk)
> goto dma_unmap;
> }
> smc_wr_init_sge(lnk);
> - bitmap_zero(lnk->wr_tx_mask, SMC_WR_BUF_CNT);
> + bitmap_zero(lnk->wr_tx_mask, lnk->lgr->max_send_wr);
> init_waitqueue_head(&lnk->wr_tx_wait);
> rc = percpu_ref_init(&lnk->wr_tx_refs, smcr_wr_tx_refs_free, 0, GFP_KERNEL);
> if (rc)
> diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
> index f3008dda222a..aa4533af9122 100644
> --- a/net/smc/smc_wr.h
> +++ b/net/smc/smc_wr.h
> @@ -19,8 +19,6 @@
> #include "smc.h"
> #include "smc_core.h"
>
> -#define SMC_WR_BUF_CNT 16 /* # of ctrl buffers per link */
> -
> #define SMC_WR_TX_WAIT_FREE_SLOT_TIME (10 * HZ)
>
> #define SMC_WR_TX_SIZE 44 /* actual size of wr_send data (<=SMC_WR_BUF_SIZE) */
Code changes look good to me.
Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com>
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-21 21:44 ` [PATCH net-next v3 1/2] " Halil Pasic
2025-09-24 17:27 ` Sidraya Jayagond
@ 2025-09-25 9:27 ` Paolo Abeni
2025-09-25 11:25 ` Halil Pasic
2025-09-26 2:44 ` Guangguan Wang
2 siblings, 1 reply; 21+ messages in thread
From: Paolo Abeni @ 2025-09-25 9:27 UTC (permalink / raw)
To: Halil Pasic, Jakub Kicinski, Simon Horman, D. Wythe, Dust Li,
Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
On 9/21/25 11:44 PM, Halil Pasic wrote:
> diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
> index a874d007f2db..c94d750c7c84 100644
> --- a/Documentation/networking/smc-sysctl.rst
> +++ b/Documentation/networking/smc-sysctl.rst
> @@ -71,3 +71,39 @@ smcr_max_conns_per_lgr - INTEGER
> acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
>
> Default: 255
> +
> +smcr_max_send_wr - INTEGER
> + So called work request buffers are SMCR link (and RDMA queue pair) level
> + resources necessary for performing RDMA operations. Since up to 255
> + connections can share a link group and thus also a link and the number
> + of the work request buffers is decided when the link is allocated,
> + depending on the workload it can a bottleneck in a sense that threads
missing 'be' or 'become' here^^
> + have to wait for work request buffers to become available. Before the
> + introduction of this control the maximal number of work request buffers
> + available on the send path used to be hard coded to 16. With this control
> + it becomes configurable. The acceptable range is between 2 and 2048.
> +
> + Please be aware that all the buffers need to be allocated as a physically
> + continuous array in which each element is a single buffer and has the size
> + of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> + like before having this control.
> +
> + Default: 16
> +
> +smcr_max_recv_wr - INTEGER
> + So called work request buffers are SMCR link (and RDMA queue pair) level
> + resources necessary for performing RDMA operations. Since up to 255
> + connections can share a link group and thus also a link and the number
> + of the work request buffers is decided when the link is allocated,
> + depending on the workload it can a bottleneck in a sense that threads
same here^^
[...]
> @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
> };
> int rc;
>
> + qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
> + qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
Possibly:
cap = max(3 * lnk->lgr->max_send_wr, lnk->lgr->max_recv_wr);
qp_attr.cap.max_send_wr = cap;
qp_attr.cap.max_recv_wr = cap;
to avoid assumption on `max_send_wr`, `max_recv_wr` relative values.
[...]
> diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h
> index eb2465ae1e15..8538915af7af 100644
> --- a/net/smc/smc_sysctl.h
> +++ b/net/smc/smc_sysctl.h
> @@ -25,6 +25,8 @@ static inline int smc_sysctl_net_init(struct net *net)
> net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
> net->smc.sysctl_max_links_per_lgr = SMC_LINKS_PER_LGR_MAX_PREFER;
> net->smc.sysctl_max_conns_per_lgr = SMC_CONN_PER_LGR_PREFER;
> + net->smc.sysctl_smcr_max_send_wr = SMCR_MAX_SEND_WR_DEF;
> + net->smc.sysctl_smcr_max_recv_wr = SMCR_MAX_RECV_WR_DEF;
> return 0;
> }
>
> diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> index b04a21b8c511..f5b2772414fd 100644
> --- a/net/smc/smc_wr.c
> +++ b/net/smc/smc_wr.c
> @@ -34,6 +34,7 @@
> #define SMC_WR_MAX_POLL_CQE 10 /* max. # of compl. queue elements in 1 poll */
>
> #define SMC_WR_RX_HASH_BITS 4
> +
Please avoid unrelated whitespace only changes.
Thanks,
Paolo
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-25 9:27 ` Paolo Abeni
@ 2025-09-25 11:25 ` Halil Pasic
2025-09-27 22:55 ` Halil Pasic
0 siblings, 1 reply; 21+ messages in thread
From: Halil Pasic @ 2025-09-25 11:25 UTC (permalink / raw)
To: Paolo Abeni
Cc: Jakub Kicinski, Simon Horman, D. Wythe, Dust Li, Sidraya Jayagond,
Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev, linux-doc,
linux-kernel, linux-rdma, linux-s390, Halil Pasic
On Thu, 25 Sep 2025 11:27:38 +0200
Paolo Abeni <pabeni@redhat.com> wrote:
[..]
> > +smcr_max_recv_wr - INTEGER
> > + So called work request buffers are SMCR link (and RDMA queue pair) level
> > + resources necessary for performing RDMA operations. Since up to 255
> > + connections can share a link group and thus also a link and the number
> > + of the work request buffers is decided when the link is allocated,
> > + depending on the workload it can a bottleneck in a sense that threads
>
> same here^^
Sorry about those! Will fix for v4.
>
> [...]
> > @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
> > };
> > int rc;
> >
> > + qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
> > + qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
>
> Possibly:
>
> cap = max(3 * lnk->lgr->max_send_wr, lnk->lgr->max_recv_wr);
> qp_attr.cap.max_send_wr = cap;
> qp_attr.cap.max_recv_wr = cap;
>
> to avoid assumption on `max_send_wr`, `max_recv_wr` relative values.
Can you explain a little more. I'm happy to do the change, but I would
prefer to understand why is keeping qp_attr.cap.max_send_wr ==
qp_attr.cap.max_recv_wr better? But if you tell: "Just trust me!" I will.
[..]
> >
> > diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> > index b04a21b8c511..f5b2772414fd 100644
> > --- a/net/smc/smc_wr.c
> > +++ b/net/smc/smc_wr.c
> > @@ -34,6 +34,7 @@
> > #define SMC_WR_MAX_POLL_CQE 10 /* max. # of compl. queue elements in 1 poll */
> >
> > #define SMC_WR_RX_HASH_BITS 4
> > +
>
> Please avoid unrelated whitespace only changes.
Will fix for v4. Really sorry!
Regards,
Halil
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-25 11:25 ` Halil Pasic
@ 2025-09-27 22:55 ` Halil Pasic
2025-09-28 2:02 ` Dust Li
0 siblings, 1 reply; 21+ messages in thread
From: Halil Pasic @ 2025-09-27 22:55 UTC (permalink / raw)
To: Paolo Abeni
Cc: Jakub Kicinski, Simon Horman, D. Wythe, Dust Li, Sidraya Jayagond,
Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev, linux-doc,
linux-kernel, linux-rdma, linux-s390, Halil Pasic
On Thu, 25 Sep 2025 13:25:40 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:
> > [...]
> > > @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
> > > };
> > > int rc;
> > >
> > > + qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
> > > + qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
> >
> > Possibly:
> >
> > cap = max(3 * lnk->lgr->max_send_wr, lnk->lgr->max_recv_wr);
> > qp_attr.cap.max_send_wr = cap;
> > qp_attr.cap.max_recv_wr = cap;
> >
> > to avoid assumption on `max_send_wr`, `max_recv_wr` relative values.
>
> Can you explain a little more. I'm happy to do the change, but I would
> prefer to understand why is keeping qp_attr.cap.max_send_wr ==
> qp_attr.cap.max_recv_wr better? But if you tell: "Just trust me!" I will.
Due to a little accident we ended up having a private conversation
on this, which I'm going to sum up quickly.
Paolo stated that he has no strong preference and that I should at
least add a comment, which I will do for v4.
Unfortunately I don't quite understand why qp_attr.cap.max_send_wr is 3
times the number of send WR buffers we allocate. My understanding
is that qp_attr.cap.max_send_wr is about the number of send WQEs.
I assume that qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr
is not something we would want to preserve.
Regards,
Halil
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-27 22:55 ` Halil Pasic
@ 2025-09-28 2:02 ` Dust Li
2025-09-28 2:12 ` Dust Li
2025-09-28 8:39 ` Halil Pasic
0 siblings, 2 replies; 21+ messages in thread
From: Dust Li @ 2025-09-28 2:02 UTC (permalink / raw)
To: Halil Pasic, Paolo Abeni
Cc: Jakub Kicinski, Simon Horman, D. Wythe, Sidraya Jayagond,
Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev, linux-doc,
linux-kernel, linux-rdma, linux-s390
On 2025-09-28 00:55:15, Halil Pasic wrote:
>On Thu, 25 Sep 2025 13:25:40 +0200
>Halil Pasic <pasic@linux.ibm.com> wrote:
>
>> > [...]
>> > > @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
>> > > };
>> > > int rc;
>> > >
>> > > + qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
>> > > + qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
>> >
>> > Possibly:
>> >
>> > cap = max(3 * lnk->lgr->max_send_wr, lnk->lgr->max_recv_wr);
>> > qp_attr.cap.max_send_wr = cap;
>> > qp_attr.cap.max_recv_wr = cap;
>> >
>> > to avoid assumption on `max_send_wr`, `max_recv_wr` relative values.
>>
>> Can you explain a little more. I'm happy to do the change, but I would
>> prefer to understand why is keeping qp_attr.cap.max_send_wr ==
>> qp_attr.cap.max_recv_wr better? But if you tell: "Just trust me!" I will.
>
>Due to a little accident we ended up having a private conversation
>on this, which I'm going to sum up quickly.
>
>Paolo stated that he has no strong preference and that I should at
>least add a comment, which I will do for v4.
>
>Unfortunately I don't quite understand why qp_attr.cap.max_send_wr is 3
>times the number of send WR buffers we allocate. My understanding
>is that qp_attr.cap.max_send_wr is about the number of send WQEs.
We have at most 2 RDMA Write for 1 RDMA send. So 3 times is necessary.
That is explained in the original comments. Maybe it's better to keep it.
```
.cap = {
/* include unsolicited rdma_writes as well,
* there are max. 2 RDMA_WRITE per 1 WR_SEND
*/
.max_send_wr = SMC_WR_BUF_CNT * 3,
.max_recv_wr = SMC_WR_BUF_CNT * 3,
.max_send_sge = SMC_IB_MAX_SEND_SGE,
.max_recv_sge = lnk->wr_rx_sge_cnt,
.max_inline_data = 0,
},
```
>I assume that qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr
>is not something we would want to preserve.
IIUC, RDMA Write won't consume any RX wqe on the receive side, so I think
the .max_recv_wr can be SMC_WR_BUF_CNT if we don't use RDMA_WRITE_IMM.
Best regards,
Dust
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-28 2:02 ` Dust Li
@ 2025-09-28 2:12 ` Dust Li
2025-09-28 8:39 ` Halil Pasic
1 sibling, 0 replies; 21+ messages in thread
From: Dust Li @ 2025-09-28 2:12 UTC (permalink / raw)
To: Halil Pasic, Paolo Abeni
Cc: Jakub Kicinski, Simon Horman, D. Wythe, Sidraya Jayagond,
Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu, netdev, linux-doc,
linux-kernel, linux-rdma, linux-s390
On 2025-09-28 10:02:43, Dust Li wrote:
>On 2025-09-28 00:55:15, Halil Pasic wrote:
>>On Thu, 25 Sep 2025 13:25:40 +0200
>>Halil Pasic <pasic@linux.ibm.com> wrote:
>>
>>> > [...]
>>> > > @@ -683,6 +678,8 @@ int smc_ib_create_queue_pair(struct smc_link *lnk)
>>> > > };
>>> > > int rc;
>>> > >
>>> > > + qp_attr.cap.max_send_wr = 3 * lnk->lgr->max_send_wr;
>>> > > + qp_attr.cap.max_recv_wr = lnk->lgr->max_recv_wr;
>>> >
>>> > Possibly:
>>> >
>>> > cap = max(3 * lnk->lgr->max_send_wr, lnk->lgr->max_recv_wr);
>>> > qp_attr.cap.max_send_wr = cap;
>>> > qp_attr.cap.max_recv_wr = cap;
>>> >
>>> > to avoid assumption on `max_send_wr`, `max_recv_wr` relative values.
>>>
>>> Can you explain a little more. I'm happy to do the change, but I would
>>> prefer to understand why is keeping qp_attr.cap.max_send_wr ==
>>> qp_attr.cap.max_recv_wr better? But if you tell: "Just trust me!" I will.
>>
>>Due to a little accident we ended up having a private conversation
>>on this, which I'm going to sum up quickly.
>>
>>Paolo stated that he has no strong preference and that I should at
>>least add a comment, which I will do for v4.
>>
>>Unfortunately I don't quite understand why qp_attr.cap.max_send_wr is 3
>>times the number of send WR buffers we allocate. My understanding
>>is that qp_attr.cap.max_send_wr is about the number of send WQEs.
>
>We have at most 2 RDMA Write for 1 RDMA send. So 3 times is necessary.
>That is explained in the original comments. Maybe it's better to keep it.
>
>```
>.cap = {
> /* include unsolicited rdma_writes as well,
> * there are max. 2 RDMA_WRITE per 1 WR_SEND
> */
> .max_send_wr = SMC_WR_BUF_CNT * 3,
> .max_recv_wr = SMC_WR_BUF_CNT * 3,
> .max_send_sge = SMC_IB_MAX_SEND_SGE,
> .max_recv_sge = lnk->wr_rx_sge_cnt,
> .max_inline_data = 0,
>},
>```
>
>>I assume that qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr
>>is not something we would want to preserve.
>
>IIUC, RDMA Write won't consume any RX wqe on the receive side, so I think
>the .max_recv_wr can be SMC_WR_BUF_CNT if we don't use RDMA_WRITE_IMM.
I kept thinking about this a bit more, and I realized that max_recv_wr
should be larger than SMC_WR_BUF_CNT.
Since receive WQEs are posted in a softirq context, their posting may be
delayed. Meanwhile, the sender might already have received the TX
completion (CQE) and continue sending new messages. In this case, if the
receiver’s post_recv() (i.e., posting of RX WQEs) is delayed, an RNR
(Receiver Not Ready) can easily occur.
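To spell that out as a sketch (mine, purely illustrative; the function
name and the slack term are invented):
```
/* RX WQEs are re-posted from softirq context, so refills can lag while
 * the sender, having already seen its TX CQEs, keeps posting new
 * sends. Pre-posting extra RX WQEs absorbs that lag; anything below
 * the peer's in-flight send budget risks RNR on bursts.
 */
static unsigned int rx_wqes_to_prepost(unsigned int peer_inflight_sends,
				       unsigned int refill_slack)
{
	return peer_inflight_sends + refill_slack;
}
```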
Best regards,
Dust
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-28 2:02 ` Dust Li
2025-09-28 2:12 ` Dust Li
@ 2025-09-28 8:39 ` Halil Pasic
2025-09-28 11:42 ` Dust Li
1 sibling, 1 reply; 21+ messages in thread
From: Halil Pasic @ 2025-09-28 8:39 UTC (permalink / raw)
To: Dust Li
Cc: Paolo Abeni, Jakub Kicinski, Simon Horman, D. Wythe,
Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
Halil Pasic
On Sun, 28 Sep 2025 10:02:43 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:
> >Unfortunately I don't quite understand why qp_attr.cap.max_send_wr is 3
> >times the number of send WR buffers we allocate. My understanding
> >is that qp_attr.cap.max_send_wr is about the number of send WQEs.
>
> We have at most 2 RDMA Write for 1 RDMA send. So 3 times is necessary.
> That is explained in the original comments. Maybe it's better to keep it.
>
> ```
> .cap = {
> /* include unsolicited rdma_writes as well,
> * there are max. 2 RDMA_WRITE per 1 WR_SEND
> */
But what are "the unsolicited" rdma_writes? I have heard of
unsolicited receive, where the data is received without
consuming a WR previously put on the RQ on the receiving end, but
the concept of unsolicited rdma_writes eludes me completely.
I guess what you are trying to say, and what I understand is
that we first put the payload into the RMB of the remote, which
may require up 2 RDMA_WRITE operations, probably because we may
cross the end (and start) of the array that hosts the circular
buffer, and then we send a CDC message to update the cursor.
For the latter an ib_post_send() is used in smc_wr_tx_send()
and AFAICT it consumes a WR from wr_tx_bufs. For the former
we consume a single wr_tx_rdmas entry, and each wr_tx_rdmas
entry has 2 WRs allocated.
And all those WRs need a WQE. So I guess now I do understand
SMC_WR_BUF_CNT, but I find the comment still confusing like
hell because of these unsolicited rdma_writes.
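Spelling that accounting out as a sketch (mine, illustrative only; the
wrap test is simplified, the 2-write/1-send budget is from the
discussion above):
```
/* One logical transmit can take up to three send WQEs: up to two
 * RDMA_WRITEs into the peer's circular RMB (two when the payload wraps
 * past the end of the buffer) plus one WR_SEND for the CDC message
 * that updates the cursor -- hence the factor of 3 in
 * qp_attr.cap.max_send_wr.
 */
static unsigned int send_wqes_per_tx(unsigned int offset,
				     unsigned int len,
				     unsigned int rmb_len)
{
	unsigned int rdma_writes = (offset + len > rmb_len) ? 2 : 1;

	return rdma_writes + 1;	/* + the CDC WR_SEND */
}
```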
Thank you for the explanation! It was indeed helpful! Let
me try to come up with a better comment -- unless somebody
manages to explain "unsolicited rdma_writes" to me.
> .max_send_wr = SMC_WR_BUF_CNT * 3,
> .max_recv_wr = SMC_WR_BUF_CNT * 3,
> .max_send_sge = SMC_IB_MAX_SEND_SGE,
> .max_recv_sge = lnk->wr_rx_sge_cnt,
> .max_inline_data = 0,
> },
> ```
>
> >I assume that qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr
> >is not something we would want to preserve.
>
> IIUC, RDMA Write won't consume any RX wqe on the receive side, so I think
> the .max_recv_wr can be SMC_WR_BUF_CNT if we don't use RDMA_WRITE_IMM.
Maybe we don't want to assume somebody else (another implementation)
would not use immediate data. I'm not sure. But I don't quite understand
the relationship between the send and the receive side either.
Regards,
Halil
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-28 8:39 ` Halil Pasic
@ 2025-09-28 11:42 ` Dust Li
2025-09-28 18:32 ` Halil Pasic
0 siblings, 1 reply; 21+ messages in thread
From: Dust Li @ 2025-09-28 11:42 UTC (permalink / raw)
To: Halil Pasic
Cc: Paolo Abeni, Jakub Kicinski, Simon Horman, D. Wythe,
Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
On 2025-09-28 10:39:51, Halil Pasic wrote:
>On Sun, 28 Sep 2025 10:02:43 +0800
>Dust Li <dust.li@linux.alibaba.com> wrote:
>
>> >Unfortunately I don't quite understand why qp_attr.cap.max_send_wr is 3
>> >times the number of send WR buffers we allocate. My understanding
>> >is that qp_attr.cap.max_send_wr is about the number of send WQEs.
>>
>> We have at most 2 RDMA Write for 1 RDMA send. So 3 times is necessary.
>> That is explained in the original comments. Maybe it's better to keep it.
>>
>> ```
>> .cap = {
>> /* include unsolicited rdma_writes as well,
>> * there are max. 2 RDMA_WRITE per 1 WR_SEND
>> */
>
>But what are "the unsolicited" rdma_writes? I have heard of
>unsolicited receive, where the data is received without
>consuming a WR previously put on the RQ on the receiving end, but
>the concept of unsolicited rdma_writes eludes me completely.
unsolicited RDMA Writes means those RDMA Writes won't generate
CQEs on the local side. You can refer to:
https://www.rdmamojo.com/2014/05/27/solicited-event/
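A minimal sketch of that reading (mine, not SMC code; it assumes the QP
was created with sq_sig_type == IB_SIGNAL_REQ_WR, as
smc_ib_create_queue_pair() does):
```
#include <rdma/ib_verbs.h>

/* With IB_SIGNAL_REQ_WR, only WRs posted with IB_SEND_SIGNALED produce
 * a CQE on the local send CQ. An RDMA_WRITE posted without it (the
 * "unsolicited" writes of the comment) still occupies a send queue
 * slot but never surfaces through the local completion path.
 */
static struct ib_rdma_wr unsignaled_write = {
	.wr = {
		.opcode		= IB_WR_RDMA_WRITE,
		.send_flags	= 0,	/* no IB_SEND_SIGNALED */
	},
};
```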
>
>I guess what you are trying to say, and what I understand is
>that we first put the payload into the RMB of the remote, which
>may require up 2 RDMA_WRITE operations, probably because we may
>cross the end (and start) of the array that hosts the circular
>buffer, and then we send a CDC message to update the cursor.
>
>For the latter an ib_post_send() is used in smc_wr_tx_send()
>and AFAICT it consumes a WR from wr_tx_bufs. For the former
>we consume a single wr_tx_rdmas entry, and each wr_tx_rdmas
>entry has 2 WRs allocated.
Right.
>
>And all those WRs need a WQE. So I guess now I do understand
>SMC_WR_BUF_CNT, but I find the comment still confusing like
>hell because of these unsolicited rdma_writes.
>
>Thank you for the explanation! It was indeed helpful! Let
>me try to come up with a better comment -- unless somebody
>manages to explain "unsolicited rdma_writes" to me.
>
>> .max_send_wr = SMC_WR_BUF_CNT * 3,
>> .max_recv_wr = SMC_WR_BUF_CNT * 3,
>> .max_send_sge = SMC_IB_MAX_SEND_SGE,
>> .max_recv_sge = lnk->wr_rx_sge_cnt,
>> .max_inline_data = 0,
>> },
>> ```
>>
>> >I assume that qp_attr.cap.max_send_wr == qp_attr.cap.max_recv_wr
>> >is not something we would want to preserve.
>>
>> IIUC, RDMA Write won't consume any RX wqe on the receive side, so I think
>> the .max_recv_wr can be SMC_WR_BUF_CNT if we don't use RDMA_WRITE_IMM.
>
>Maybe we don't want to assume somebody else (another implementation)
>would not use immediate data. I'm not sure. But I don't quite understand
>the relationship between the send and the receive side either.
I missed something here. I sent another email right after this to
explain my thoughts here:
I kept thinking about this a bit more, and I realized that max_recv_wr
should be larger than SMC_WR_BUF_CNT.
Since receive WQEs are posted in a softirq context, their posting may be
delayed. Meanwhile, the sender might already have received the TX
completion (CQE) and continue sending new messages. In this case, if the
receiver’s post_recv() (i.e., posting of RX WQEs) is delayed, an RNR
(Receiver Not Ready) can easily occur.
Best regards,
Dust
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-28 11:42 ` Dust Li
@ 2025-09-28 18:32 ` Halil Pasic
0 siblings, 0 replies; 21+ messages in thread
From: Halil Pasic @ 2025-09-28 18:32 UTC (permalink / raw)
To: Dust Li
Cc: Paolo Abeni, Jakub Kicinski, Simon Horman, D. Wythe,
Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
Halil Pasic
On Sun, 28 Sep 2025 19:42:54 +0800
Dust Li <dust.li@linux.alibaba.com> wrote:
> >> We have at most 2 RDMA Write for 1 RDMA send. So 3 times is necessary.
> >> That is explained in the original comments. Maybe it's better to keep it.
> >>
> >> ```
> >> .cap = {
> >> /* include unsolicited rdma_writes as well,
> >> * there are max. 2 RDMA_WRITE per 1 WR_SEND
> >> */
> >
> >But what are "the unsolicited" rdma_writes? I have heard of
> >unsolicited receive, where the data is received without
> >consuming a WR previously put on the RQ on the receiving end, but
> >the concept of unsolicited rdma_writes eludes me completely.
>
> unsolicited RDMA Writes means those RDMA Writes won't generate
> CQEs on the local side. You can refer to:
> https://www.rdmamojo.com/2014/05/27/solicited-event/
>
> >
> >I guess what you are trying to say, and what I understand is
> >that we first put the payload into the RMB of the remote, which
> >may require up 2 RDMA_WRITE operations, probably because we may
> >cross the end (and start) of the array that hosts the circular
> >buffer, and then we send a CDC message to update the cursor.
> >
> >For the latter a ib_post_send() is used in smc_wr_tx_send()
> >and AFAICT it consumes a WR from wr_tx_bufs. For the former
> >we consume a single wr_tx_rdmas which and each wr_tx_rdmas
> >has 2 WR allocated.
>
> Right.
Thank you Dust Li! Unfortunately I have already spun a v4. Let
me add back that comment, as for people knowledgeable enough it does
not appear to be confusing at all. I can try to improve that comment
and maybe add a new one on the reason why we do need more WRs on the
receive end than on the send end, after this series has been merged. Or
if you want to do it yourself, I'm happy with it as well. In the end
it is you who helped me get a better understanding of this :)
Thank you again!
Regards,
Halil
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-21 21:44 ` [PATCH net-next v3 1/2] " Halil Pasic
2025-09-24 17:27 ` Sidraya Jayagond
2025-09-25 9:27 ` Paolo Abeni
@ 2025-09-26 2:44 ` Guangguan Wang
2025-09-26 10:12 ` Halil Pasic
2 siblings, 1 reply; 21+ messages in thread
From: Guangguan Wang @ 2025-09-26 2:44 UTC (permalink / raw)
To: Halil Pasic
Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
On 2025/9/22 05:44, Halil Pasic wrote:
> Think SMC_WR_BUF_CNT_SEND := SMC_WR_BUF_CNT used in send context and
> SMC_WR_BUF_CNT_RECV := 3 * SMC_WR_BUF_CNT used in recv context. Those
> get replaced with lgr->max_send_wr and lgr->max_recv_wr respective.
>
> While at it let us also remove a confusing comment that is either not
> about the context in which it resides (describing
> qp_attr.cap.max_send_wr and qp_attr.cap.max_recv_wr) or not applicable
> any more when these values become configurable.
>
> Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
> ---
> Documentation/networking/smc-sysctl.rst | 36 +++++++++++++++++++++++++
> include/net/netns/smc.h | 2 ++
> net/smc/smc_core.h | 6 +++++
> net/smc/smc_ib.c | 7 ++---
> net/smc/smc_llc.c | 2 ++
> net/smc/smc_sysctl.c | 22 +++++++++++++++
> net/smc/smc_sysctl.h | 2 ++
> net/smc/smc_wr.c | 32 +++++++++++-----------
> net/smc/smc_wr.h | 2 --
> 9 files changed, 89 insertions(+), 22 deletions(-)
>
> diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
> index a874d007f2db..c94d750c7c84 100644
> --- a/Documentation/networking/smc-sysctl.rst
> +++ b/Documentation/networking/smc-sysctl.rst
> @@ -71,3 +71,39 @@ smcr_max_conns_per_lgr - INTEGER
> acceptable value ranges from 16 to 255. Only for SMC-R v2.1 and later.
>
> Default: 255
> +
> +smcr_max_send_wr - INTEGER
> + So called work request buffers are SMCR link (and RDMA queue pair) level
> + resources necessary for performing RDMA operations. Since up to 255
> + connections can share a link group and thus also a link and the number
> + of the work request buffers is decided when the link is allocated,
> + depending on the workload it can a bottleneck in a sense that threads
> + have to wait for work request buffers to become available. Before the
> + introduction of this control the maximal number of work request buffers
> + available on the send path used to be hard coded to 16. With this control
> + it becomes configurable. The acceptable range is between 2 and 2048.
> +
> + Please be aware that all the buffers need to be allocated as a physically
> + continuous array in which each element is a single buffer and has the size
> + of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> + like before having this control.
> +
> + Default: 16
> +
> +smcr_max_recv_wr - INTEGER
> + So called work request buffers are SMCR link (and RDMA queue pair) level
> + resources necessary for performing RDMA operations. Since up to 255
> + connections can share a link group and thus also a link and the number
> + of the work request buffers is decided when the link is allocated,
> + depending on the workload it can a bottleneck in a sense that threads
> + have to wait for work request buffers to become available. Before the
> + introduction of this control the maximal number of work request buffers
> + available on the receive path used to be hard coded to 16. With this control
> + it becomes configurable. The acceptable range is between 2 and 2048.
> +
> + Please be aware that all the buffers need to be allocated as a physically
> + continuous array in which each element is a single buffer and has the size
> + of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> + like before having this control.
> +
> + Default: 48
Notice that the ratio of smcr_max_recv_wr to smcr_max_send_wr is set to 3:1, with the
intention of ensuring that the peer QP's smcr_max_recv_wr is three times the local QP's
smcr_max_send_wr and the local QP's smcr_max_recv_wr is three times the peer QP's
smcr_max_send_wr, rather than making the local QP's smcr_max_recv_wr three times its own
smcr_max_send_wr. The purpose of this design is to guarantee sufficient receive WRs on
the receiving side for incoming data when the peer QP is doing RDMA sends. Otherwise,
RNR (Receiver Not Ready) may occur, leading to poor performance (RNR drops the packet
and retransmission happens in the transport layer of RDMA).
Consider a scenario with multiple hosts, where the hosts have different smcr_max_send_wr
and smcr_max_recv_wr configurations and mesh connections between them.
It is difficult to ensure that the smcr_max_recv_wr/smcr_max_send_wr ratio is 3:1 on the
connected QPs between these hosts, and it may even be hard to guarantee
smcr_max_recv_wr > smcr_max_send_wr on the connected QPs between these hosts.
Therefore, I believe that if these values are made configurable, additional mechanisms
must be in place to prevent RNR from occurring. Otherwise we need to carefully configure
smcr_max_recv_wr and smcr_max_send_wr, or ensure that all hosts capable of establishing
SMC-R connections are configured with the same smcr_max_recv_wr and smcr_max_send_wr
values.
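As a sketch of the kind of invariant that would have to hold per link
pair (mine, purely illustrative; SMC-R exchanges no such values on the
wire today, so peer_max_send_wr would have to come from out-of-band
coordination):
```
#include <stdbool.h>

/* Cross-host RNR avoidance is about the pairing: the local receive
 * queue has to cover the peer's worst-case posted sends (plus headroom
 * for the softirq refill lag discussed elsewhere in this thread),
 * not merely keep a local 3:1 recv:send ratio.
 */
static bool recv_covers_peer(unsigned int local_max_recv_wr,
			     unsigned int peer_max_send_wr,
			     unsigned int headroom)
{
	return local_max_recv_wr >= peer_max_send_wr + headroom;
}
```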
Regards,
Guangguan Wang
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-26 2:44 ` Guangguan Wang
@ 2025-09-26 10:12 ` Halil Pasic
2025-09-26 10:30 ` Halil Pasic
0 siblings, 1 reply; 21+ messages in thread
From: Halil Pasic @ 2025-09-26 10:12 UTC (permalink / raw)
To: Guangguan Wang
Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
Halil Pasic
On Fri, 26 Sep 2025 10:44:00 +0800
Guangguan Wang <guangguan.wang@linux.alibaba.com> wrote:
> > +
> > +smcr_max_send_wr - INTEGER
> > + So called work request buffers are SMCR link (and RDMA queue pair) level
> > + resources necessary for performing RDMA operations. Since up to 255
> > + connections can share a link group and thus also a link and the number
> > + of the work request buffers is decided when the link is allocated,
> > + depending on the workload it can a bottleneck in a sense that threads
> > + have to wait for work request buffers to become available. Before the
> > + introduction of this control the maximal number of work request buffers
> > + available on the send path used to be hard coded to 16. With this control
> > + it becomes configurable. The acceptable range is between 2 and 2048.
> > +
> > + Please be aware that all the buffers need to be allocated as a physically
> > + continuous array in which each element is a single buffer and has the size
> > + of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> > + like before having this control.
> > +
> > + Default: 16
> > +
> > +smcr_max_recv_wr - INTEGER
> > + So called work request buffers are SMCR link (and RDMA queue pair) level
> > + resources necessary for performing RDMA operations. Since up to 255
> > + connections can share a link group and thus also a link and the number
> > + of the work request buffers is decided when the link is allocated,
> > + depending on the workload it can a bottleneck in a sense that threads
> > + have to wait for work request buffers to become available. Before the
> > + introduction of this control the maximal number of work request buffers
> > + available on the receive path used to be hard coded to 16. With this control
> > + it becomes configurable. The acceptable range is between 2 and 2048.
> > +
> > + Please be aware that all the buffers need to be allocated as a physically
> > + continuous array in which each element is a single buffer and has the size
> > + of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> > + like before having this control.
> > +
> > + Default: 48
>
> Notice that the ratio of smcr_max_recv_wr to smcr_max_send_wr is set to 3:1, with the
> intention of ensuring that the peer QP's smcr_max_recv_wr is three times the local QP's
> smcr_max_send_wr and the local QP's smcr_max_recv_wr is three times the peer QP's
> smcr_max_send_wr, rather than making the local QP's smcr_max_recv_wr three times its own
> smcr_max_send_wr. The purpose of this design is to guarantee sufficient receive WRs on
> the receiving side for incoming data when the peer QP is doing RDMA sends. Otherwise,
> RNR (Receiver Not Ready) may occur, leading to poor performance (RNR drops the packet
> and retransmission happens in the transport layer of RDMA).
Thank you Guangguan! I think we already had that discussion.
>
> Consider a scenario with multiple hosts, where the hosts have different smcr_max_send_wr
> and smcr_max_recv_wr configurations and mesh connections between them.
> It is difficult to ensure that the smcr_max_recv_wr/smcr_max_send_wr ratio is 3:1 on the
> connected QPs between these hosts, and it may even be hard to guarantee
> smcr_max_recv_wr > smcr_max_send_wr on the connected QPs between these hosts.
It is not difficult IMHO. You just leave the knobs alone and you have
3:1 per default. If tuning is attempted that needs to be done carefully.
At least with SMC-R V2 there is this whole EID business, as well so it
is reasonable to assume that the environment can be tuned in a coherent
fashion. E.g. whoever is calling the EID could call use smcr_max_recv_wr:=32 and smcr_max_send_wr:=
>
> Therefore, I believe that if these values are made configurable, additional mechanisms
> must be in place to prevent RNR from occurring. Otherwise we need to carefully configure
> smcr_max_recv_wr and smcr_max_send_wr, or ensure that all hosts capable of establishing
> SMC-R connections are configured with the same smcr_max_recv_wr and smcr_max_send_wr
> values.
Thank you for
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-26 10:12 ` Halil Pasic
@ 2025-09-26 10:30 ` Halil Pasic
2025-09-28 3:05 ` Guangguan Wang
0 siblings, 1 reply; 21+ messages in thread
From: Halil Pasic @ 2025-09-26 10:30 UTC (permalink / raw)
To: Guangguan Wang
Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
netdev, linux-doc, linux-kernel, linux-rdma, linux-s390,
Halil Pasic
On Fri, 26 Sep 2025 12:12:49 +0200
Halil Pasic <pasic@linux.ibm.com> wrote:
> On Fri, 26 Sep 2025 10:44:00 +0800
> Guangguan Wang <guangguan.wang@linux.alibaba.com> wrote:
>
> > > +
> > > +smcr_max_send_wr - INTEGER
> > > + So called work request buffers are SMCR link (and RDMA queue pair) level
> > > + resources necessary for performing RDMA operations. Since up to 255
> > > + connections can share a link group and thus also a link and the number
> > > + of the work request buffers is decided when the link is allocated,
> > > + depending on the workload it can a bottleneck in a sense that threads
> > > + have to wait for work request buffers to become available. Before the
> > > + introduction of this control the maximal number of work request buffers
> > > + available on the send path used to be hard coded to 16. With this control
> > > + it becomes configurable. The acceptable range is between 2 and 2048.
> > > +
> > > + Please be aware that all the buffers need to be allocated as a physically
> > > + continuous array in which each element is a single buffer and has the size
> > > + of SMC_WR_BUF_SIZE (48) bytes. If the allocation fails we give up much
> > > + like before having this control.
> > > +
> > > + Default: 16
> > > +
> > > +smcr_max_recv_wr - INTEGER
> > > + So called work request buffers are SMCR link (and RDMA queue pair) level
> > > + resources necessary for performing RDMA operations. Since up to 255
> > > + connections can share a link group and thus also a link and the number
> > > + of the work request buffers is decided when the link is allocated,
> > > + depending on the workload it can a bottleneck in a sense that threads
> > > + have to wait for work request buffers to become available. Before the
> > > + introduction of this control the maximal number of work request buffers
> > > + available on the receive path used to be hard coded to 16. With this control
> > > + it becomes configurable. The acceptable range is between 2 and 2048.
> > > +
> > > + Please be aware that all the buffers need to be allocated as one physically
> > > + contiguous array in which each element is a single buffer of
> > > + SMC_WR_BUF_SIZE (48) bytes. If the allocation fails, we give up much
> > > + like before this control existed.
> > > +
> > > + Default: 48
> >
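(For the curious: the "physically contiguous array" above boils down to a
single allocation per direction at link creation time, conceptually along
the lines of the sketch below -- simplified kernel-style fragment, field
names approximate, not the exact smc_wr.c code.)

	/* one contiguous array of lgr->max_send_wr elements,
	 * each SMC_WR_BUF_SIZE (48) bytes */
	link->wr_tx_bufs = kcalloc(lgr->max_send_wr, SMC_WR_BUF_SIZE,
				   GFP_KERNEL);
	if (!link->wr_tx_bufs)
		return -ENOMEM;	/* give up, as before this control */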
> > Notice that the ratio of smcr_max_recv_wr to smcr_max_send_wr is set to 3:1, with the
> > intention of ensuring that the peer QP's smcr_max_recv_wr is three times the local QP's
> > smcr_max_send_wr and the local QP's smcr_max_recv_wr is three times the peer QP's
> > smcr_max_send_wr, rather than making the local QP's smcr_max_recv_wr three times its own
> > smcr_max_send_wr. The purpose of this design is to guarantee sufficient receive WRs on
> > the receiving side when the peer QP is doing RDMA sends. Otherwise, RNR (Receiver
> > Not Ready) may occur, leading to poor performance (on RNR the packet is dropped and
> > retransmission happens in the RDMA transport layer).
Sorry, this was sent accidentally, by virtue of unintentionally
pressing the send shortcut while actually trying to edit!
>
> Thank you Guangguan! I think we already had that discussion.
Please have a look at this thread
https://lore.kernel.org/all/4c5347ff-779b-48d7-8234-2aac9992f487@linux.ibm.com/
I'm aware of this, but I think this problem needs to be solved on
a different level.
> >
> > Consider a scenario with multiple hosts, where the hosts have different
> > smcr_max_send_wr and smcr_max_recv_wr configurations and mesh connections between them.
> > It is difficult to ensure that the smcr_max_recv_wr/smcr_max_send_wr ratio is 3:1 on the connected
> > QPs between these hosts, and it may even be hard to guarantee smcr_max_recv_wr > smcr_max_send_wr
> > on the connected QPs between these hosts.
>
>
> It is not difficult IMHO. You just leave the knobs alone and you have
[..]
It is not difficult IMHO. You just leave the knobs alone and you have
3:1 per default. If tuning is attempted, it needs to be done carefully.
At least with SMC-R V2 there is this whole EID business as well, so it
is reasonable to assume that the environment can be tuned in a coherent
fashion. E.g. whoever is setting the EID could use smcr_max_recv_wr:=96
and smcr_max_send_wr:=32.
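To make that concrete, something as small as the sketch below, run on
each host in the environment, would keep the settings coherent. (The
values are just examples, and the helper is hypothetical; the paths
follow the /proc/sys/net/smc layout documented in smc-sysctl.rst.)

	/* set_smc_wr.c - hypothetical helper, keeps recv at 3x send */
	#include <stdio.h>

	static int write_knob(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f)
			return -1;
		fprintf(f, "%s\n", val);
		return fclose(f);
	}

	int main(void)
	{
		if (write_knob("/proc/sys/net/smc/smcr_max_send_wr", "32") ||
		    write_knob("/proc/sys/net/smc/smcr_max_recv_wr", "96")) {
			perror("sysctl write");
			return 1;
		}
		return 0;
	}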
> >
> > Therefore, I believe that if these values are made configurable, additional mechanisms must be
> > in place to prevent RNR from occurring. Otherwise we need to configure smcr_max_recv_wr
> > and smcr_max_send_wr carefully, or ensure that all hosts capable of establishing SMC-R
> > connections are configured with the same smcr_max_recv_wr and smcr_max_send_wr values.
>
I'm in favor of adding such mechanisms on top of this. Do you have
something particular in mind? Unfortunately I'm not knowledgeable enough
in the area to know what mechanisms you may mean. But I guess it is
patches welcome, as always! For now I would encourage users
to tune carefully.
Sorry about that half-baked answer before.
Regards,
Halil
* Re: [PATCH net-next v3 1/2] net/smc: make wr buffer count configurable
2025-09-26 10:30 ` Halil Pasic
@ 2025-09-28 3:05 ` Guangguan Wang
0 siblings, 0 replies; 21+ messages in thread
From: Guangguan Wang @ 2025-09-28 3:05 UTC (permalink / raw)
To: Halil Pasic
Cc: Jakub Kicinski, Paolo Abeni, Simon Horman, D. Wythe, Dust Li,
Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Tony Lu, Wen Gu,
netdev, linux-doc, linux-kernel, linux-rdma, linux-s390
On 2025/9/26 18:30, Halil Pasic wrote:
> On Fri, 26 Sep 2025 12:12:49 +0200
> Halil Pasic <pasic@linux.ibm.com> wrote:
>
>> On Fri, 26 Sep 2025 10:44:00 +0800
>> Guangguan Wang <guangguan.wang@linux.alibaba.com> wrote:
>>
>>>
>>> Notice that the ratio of smcr_max_recv_wr to smcr_max_send_wr is set to 3:1, with the
>>> intention of ensuring that the peer QP's smcr_max_recv_wr is three times the local QP's
>>> smcr_max_send_wr and the local QP's smcr_max_recv_wr is three times the peer QP's
>>> smcr_max_send_wr, rather than making the local QP's smcr_max_recv_wr three times its own
>>> smcr_max_send_wr. The purpose of this design is to guarantee sufficient receive WRs on
>>> the receiving side when the peer QP is doing RDMA sends. Otherwise, RNR (Receiver
>>> Not Ready) may occur, leading to poor performance (on RNR the packet is dropped and
>>> retransmission happens in the RDMA transport layer).
>
> Sorry, this was sent accidentally, by virtue of unintentionally
> pressing the send shortcut while actually trying to edit!
>
>>
>> Thank you Guangguan! I think we already had that discussion.
>
> Please have a look at this thread
> https://lore.kernel.org/all/4c5347ff-779b-48d7-8234-2aac9992f487@linux.ibm.com/
>
> I'm aware of this, but I think this problem needs to be solved on
> a different level.
>
Oh, I see. Sorry for missing the previous discussion.
BTW, the RNR counter is exposed in a file like '/sys/class/infiniband/mlx5_0/ports/1/hw_counters/rnr_nak_retry_err'.
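(A minimal way to watch that counter from C, using the path above --
the device name and port number are of course setup specific:)

	#include <stdio.h>

	int main(void)
	{
		unsigned long long rnr;
		FILE *f = fopen("/sys/class/infiniband/mlx5_0/ports/1/"
				"hw_counters/rnr_nak_retry_err", "r");

		if (!f || fscanf(f, "%llu", &rnr) != 1) {
			perror("rnr_nak_retry_err");
			return 1;
		}
		fclose(f);
		printf("rnr_nak_retry_err: %llu\n", rnr);
		return 0;
	}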
>>>
>>> Consider a scenario with multiple hosts, where the hosts have different
>>> smcr_max_send_wr and smcr_max_recv_wr configurations and mesh connections between them.
>>> It is difficult to ensure that the smcr_max_recv_wr/smcr_max_send_wr ratio is 3:1 on the connected
>>> QPs between these hosts, and it may even be hard to guarantee smcr_max_recv_wr > smcr_max_send_wr
>>> on the connected QPs between these hosts.
>>
>>
>> It is not difficult IMHO. You just leave the knobs alone and you have
> [..]
>
> It is not difficult IMHO. You just leave the knobs alone and you have
> 3:1 per default. If tuning is attempted, it needs to be done carefully.
> At least with SMC-R V2 there is this whole EID business as well, so it
> is reasonable to assume that the environment can be tuned in a coherent
> fashion. E.g. whoever is setting the EID could use smcr_max_recv_wr:=96
> and smcr_max_send_wr:=32.
>
>>>
>>> Therefore, I believe that if these values are made configurable, additional mechanisms must be
>>> in place to prevent RNR from occurring. Otherwise we need to configure smcr_max_recv_wr
>>> and smcr_max_send_wr carefully, or ensure that all hosts capable of establishing SMC-R
>>> connections are configured with the same smcr_max_recv_wr and smcr_max_send_wr values.
>>
>
> I'm in favor of adding such mechanisms on top of this. Do you have
> something particular in mind? Unfortunately I'm not knowledgeable enough
> in the area to know what mechanisms you may mean. But I guess it is
> patches welcome, as always! For now I would encourage users
> to tune carefully.
>
AFAIK, flow control is the usual way to handle this; maybe credit-based flow control
is enough. A credit represents the number of receive WRs available for the peer to
consume. The receiver accumulates a credit every time it posts a receive WR, and
advertises its credits to the connected sender at a certain frequency. The sender
tracks the credits advertised by the peer and consumes one credit for every
post_send WR that will consume a receive WR on the receiver, provided enough
credits are available. Otherwise the sender should hold the WR back and wait for
more credits to be advertised by the peer. A rough sketch of the accounting is
below.
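(Every name in this sketch is made up; it only illustrates the counting,
and the advertisement itself would need a new protocol message:)

	#include <errno.h>

	struct credit_state {
		int credits;		/* peer recv WRs we may still consume */
		int pending;		/* recv WRs posted since last advertisement */
		int announce_thresh;	/* advertise in batches, not per WR */
	};

	/* stand-ins for the transport hooks */
	static void advertise_credits(int n) { (void)n; /* would send a protocol msg */ }
	static int post_send_wr(void *wr) { (void)wr; return 0; /* would call ib_post_send */ }

	/* receiver side: called after each successful post_recv */
	static void credit_on_post_recv(struct credit_state *cs)
	{
		if (++cs->pending >= cs->announce_thresh) {
			advertise_credits(cs->pending);
			cs->pending = 0;
		}
	}

	/* sender side: one credit per send that consumes a peer recv WR */
	static int credit_try_send(struct credit_state *cs, void *wr)
	{
		if (!cs->credits)
			return -EAGAIN;	/* hold the WR, wait for more credits */
		cs->credits--;
		return post_send_wr(wr);
	}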
But this requires support at the SMC-R protocol level, and it can also be
addressed as a later enhancement. I do not know if someone from Dust Li's
team or someone from IBM has interest in picking this up.
Regards,
Guangguan Wang