Linux RDMA and InfiniBand development
* [PATCH rdma-next 0/9] Rework retry algorithm used when sending MADs
@ 2024-12-05 13:49 Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 1/9] IB/mad: Apply timeout modification (CM MRA) only once Leon Romanovsky
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Leon Romanovsky @ 2024-12-05 13:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-kernel, linux-rdma, Sean Hefty, Vlad Dumitrescu

From Vlad,

This series aims to improve the behaviour of a MAD sender under congestion
and/or receiver overload.  We've seen significant drops in goodput when
MAD receivers are overloaded.  This typically happens with SA requests,
which are served by a single node (SM), but can also happen with CM.

Patch 7 introduces the main change: exponential backoff.  This new retry
algorithm is applied to all MADs, except RMPP and OPA.  To avoid slowing
recovery from transient failures, the exponential backoff only engages
after a certain number of linear timeouts has been experienced.  The
backoff algorithm resets to the beginning after a CM MRA, on the
assumption that the remote is no longer overloaded.

A trade-off must be made between speed of recovery under transient
failure and the load generated by unnecessary retries under persistent
failure.  Because this trade-off depends on the network scale, patch 8
makes mad-linear-timeouts configurable.

Patch 1 makes CM MRA apply only once, to prevent entering an excessive
delay condition, even when the receiver is likely no longer overloaded.

The exponential backoff algorithm (a) increases the time until a send
MAD reaches its final timeout, and (b) makes that time hard for callers
to predict.  Since certain callers appear to care about this, Patch 2
introduces a new option, deadline, which can be used to enforce when
the final timeout is reached.  SA, UMAD and CM are updated to use this
new parameter (patches 3, 5, 6).

Patch 3 also solves a related issue in SA, which, in certain cases,
configures the MAD layer with extremely aggressive retry intervals.
Because the current aggressive retry interval was introduced to solve
another issue, patch 4 makes sa-min-timeout configurable.

Patch 9 resolves another related issue in CM, which uses a retry
interval that is far too high for (low latency) RDMA networks.

In summary:
  1) IB/mad: Apply timeout modification (CM MRA) only once
  2) IB/mad: Add deadline for send MADs
  3) RDMA/sa_query: Enforce min retry interval and deadline
  4) RDMA/nldev: Add sa-min-timeout management attribute
  5) IB/umad: Set deadline when sending non-RMPP MADs
  6) IB/cm: Set deadline when sending MADs
  7) IB/mad: Exponential backoff when retrying sends
  8) RDMA/nldev: Add mad-linear-timeouts management attribute
  9) IB/cma: Lower response timeout to roughly 1s

Two tunables will be added to RDMA tool (iproute2), under the
'management' namespace as follow-up:

  mad-linear-timeouts
  sa-min-timeout

Thanks

Vlad Dumitrescu (9):
  IB/mad: Apply timeout modification (CM MRA) only once
  IB/mad: Add deadline for send MADs
  RDMA/sa_query: Enforce min retry interval and deadline
  RDMA/nldev: Add sa-min-timeout management attribute
  IB/umad: Set deadline when sending non-RMPP MADs
  IB/cm: Set deadline when sending MADs
  IB/mad: Exponential backoff when retrying sends
  RDMA/nldev: Add mad-linear-timeouts management attribute
  IB/cma: Lower response timeout to roughly 1s

 drivers/infiniband/core/cm.c        |  13 +++
 drivers/infiniband/core/cma.c       |   2 +-
 drivers/infiniband/core/core_priv.h |   4 +
 drivers/infiniband/core/mad.c       | 141 ++++++++++++++++++++++++++--
 drivers/infiniband/core/mad_priv.h  |   8 ++
 drivers/infiniband/core/nldev.c     | 133 ++++++++++++++++++++++++++
 drivers/infiniband/core/sa_query.c  |  81 +++++++++++++---
 drivers/infiniband/core/user_mad.c  |   8 ++
 include/rdma/ib_mad.h               |  29 ++++++
 include/uapi/rdma/ib_user_mad.h     |  12 ++-
 include/uapi/rdma/rdma_netlink.h    |   7 ++
 11 files changed, 416 insertions(+), 22 deletions(-)

-- 
2.47.0



* [PATCH rdma-next 1/9] IB/mad: Apply timeout modification (CM MRA) only once
  2024-12-05 13:49 [PATCH rdma-next 0/9] Rework retry algorithm used when sending MADs Leon Romanovsky
@ 2024-12-05 13:49 ` Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 2/9] IB/mad: Add deadline for send MADs Leon Romanovsky
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2024-12-05 13:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Vlad Dumitrescu, linux-rdma, Sean Hefty

From: Vlad Dumitrescu <vdumitrescu@nvidia.com>

ib_modify_mad with non-zero timeout is used exclusively for CM MRA [1].
The timeout is computed as the sum of the local timeout (set via
cm_init_av_by_path) and a constant 60s+ Service Timeout
(CMA_CM_MRA_SETTING) passed via the MRA message by the remote.  MRAs are
generated when duplicate messages are received.

Overwriting send_buf.timeout_ms means the MRA Service Timeout requested
by the remote will apply to all remaining retry attempts.  This can lead
to unnecessary and extreme delays when a receiver is only temporarily
overloaded.

Do not save the MRA timeout so that it only applies to the next retry.

ib_modify_mad is also called with a zero timeout, to implement
ib_cancel_mad.  The timeout was also saved in that case, but saving it
is not required, as timeout_sends skips the retry_send call (which
reads the saved value) based on the non-successful status.

[1] IBTA v1.7 - Section 12.6.6 - MRA - Message Receipt Acknowledgment

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/mad.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 86e846b12e2f..bcfbb2a5c02b 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2623,7 +2623,6 @@ int ib_modify_mad(struct ib_mad_send_buf *send_buf, u32 timeout_ms)
 	if (!timeout_ms)
 		mad_send_wr->status = IB_WC_WR_FLUSH_ERR;
 
-	mad_send_wr->send_buf.timeout_ms = timeout_ms;
 	if (mad_send_wr->state == IB_MAD_STATE_SEND_START ||
 	    (mad_send_wr->state == IB_MAD_STATE_QUEUED && timeout_ms))
 		mad_send_wr->timeout = msecs_to_jiffies(timeout_ms);
-- 
2.47.0



* [PATCH rdma-next 2/9] IB/mad: Add deadline for send MADs
  2024-12-05 13:49 [PATCH rdma-next 0/9] Rework retry algorithm used when sending MADs Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 1/9] IB/mad: Apply timeout modification (CM MRA) only once Leon Romanovsky
@ 2024-12-05 13:49 ` Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 3/9] RDMA/sa_query: Enforce min retry interval and deadline Leon Romanovsky
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2024-12-05 13:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Vlad Dumitrescu, linux-rdma, Sean Hefty

From: Vlad Dumitrescu <vdumitrescu@nvidia.com>

The MAD layer does not currently provide a way to enforce a deadline.
Callers which need that, like rdma_resolve_route / SA, make assumptions
about the MAD layer's retry algorithm and set the retries and timeout_ms
fields of struct ib_mad_send_buf accordingly.  For example, given today's
retry algorithm - linear, with no significant scheduling or queueing
delays - callers expect the final timeout to trigger roughly after
(retries + 1) * timeout_ms.

Add a helper to set an internal deadline based on a timeout relative to
the current time.  Callers can configure the deadline at any time, but
should account for any delays they themselves introduce before calling
ib_post_send_mad.  Otherwise, if the deadline has already passed, the
post fails.

When a deadline is not set, or is too far in the future, clamp it to
5 minutes after post time; accepting arbitrarily long timeouts is
probably not a good idea.

Once callers have been converted to use this new parameter, the MAD
layer can evolve its retry algorithm (e.g., to prevent congestion)
without affecting them.

Note that existing fields still need to be exposed:
  - timeout_ms will be needed to reset the retry algorithm after a
    temporary delay requested by remote via CM MRA [1], and
  - retries is needed to implement CM REQ:Max CM Retries [2].

In case of CM MRA (ib_modify_mad is called with non-zero timeout),
increase the deadline as the sender can't plan for MRA-requested delays.

Ignore RMPP for now - it uses a different per-window retry algorithm.

[1] IBTA v1.7 - Section 12.6.6 - MRA - Message Receipt Acknowledgment
[2] IBTA v1.7 - Section 12.7.27 - Max CM Retries

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/mad.c      | 54 +++++++++++++++++++++++++++---
 drivers/infiniband/core/mad_priv.h |  1 +
 include/rdma/ib_mad.h              | 29 ++++++++++++++++
 3 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index bcfbb2a5c02b..5c255ee3db38 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -54,6 +54,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/ib_mad.h>
 
+#define IB_MAD_MAX_DEADLINE (jiffies + msecs_to_jiffies(5 * 60 * 1000))
+
 #ifdef CONFIG_TRACEPOINTS
 static void create_mad_addr_info(struct ib_mad_send_wr_private *mad_send_wr,
 			  struct ib_mad_qp_info *qp_info,
@@ -855,6 +857,26 @@ int ib_mad_kernel_rmpp_agent(const struct ib_mad_agent *agent)
 }
 EXPORT_SYMBOL(ib_mad_kernel_rmpp_agent);
 
+int ib_set_mad_deadline(struct ib_mad_send_buf *send_buf, u32 total_timeout_ms)
+{
+	struct ib_mad_send_wr_private *mad_send_wr =
+		container_of(send_buf, struct ib_mad_send_wr_private, send_buf);
+
+	if (WARN_ON_ONCE(!total_timeout_ms))
+		return -EINVAL;
+
+	if (WARN_ON_ONCE(mad_send_wr->deadline))
+		return -EINVAL;
+
+	if (WARN_ON_ONCE(ib_mad_kernel_rmpp_agent(
+		    &mad_send_wr->mad_agent_priv->agent)))
+		return -EINVAL;
+
+	mad_send_wr->deadline = jiffies + msecs_to_jiffies(total_timeout_ms);
+	return 0;
+}
+EXPORT_SYMBOL(ib_set_mad_deadline);
+
 struct ib_mad_send_buf *ib_create_send_mad(struct ib_mad_agent *mad_agent,
 					   u32 remote_qpn, u16 pkey_index,
 					   int rmpp_active, int hdr_len,
@@ -1174,6 +1196,19 @@ int ib_post_send_mad(struct ib_mad_send_buf *send_buf,
 				continue;
 		}
 
+		if (!ib_mad_kernel_rmpp_agent(&mad_agent_priv->agent) &&
+		    send_buf->timeout_ms) {
+			if (!mad_send_wr->deadline ||
+			    time_after(mad_send_wr->deadline,
+				       IB_MAD_MAX_DEADLINE)) {
+				mad_send_wr->deadline = IB_MAD_MAX_DEADLINE;
+			} else if (time_after_eq(jiffies,
+						 mad_send_wr->deadline)) {
+				ret = -ETIMEDOUT;
+				goto error;
+			}
+		}
+
 		mad_send_wr->tid = ((struct ib_mad_hdr *) send_buf->mad)->tid;
 		/* Timeout will be updated after send completes */
 		mad_send_wr->timeout = msecs_to_jiffies(send_buf->timeout_ms);
@@ -2293,16 +2328,23 @@ static void adjust_timeout(struct ib_mad_agent_private *mad_agent_priv)
 
 static void wait_for_response(struct ib_mad_send_wr_private *mad_send_wr)
 {
-	struct ib_mad_agent_private *mad_agent_priv;
 	struct ib_mad_send_wr_private *temp_mad_send_wr;
+	struct ib_mad_agent_private *mad_agent_priv;
+	const unsigned long now = jiffies;
 	struct list_head *list_item;
 	unsigned long delay;
 
 	mad_agent_priv = mad_send_wr->mad_agent_priv;
 	list_del_init(&mad_send_wr->agent_list);
 
-	delay = mad_send_wr->timeout;
-	mad_send_wr->timeout += jiffies;
+	/* Caller must ensure mad_send_wr->timeout is relative */
+	if (!mad_send_wr->deadline)
+		delay = mad_send_wr->timeout;
+	else if (time_after_eq(now, mad_send_wr->deadline))
+		delay = 0; /* schedule ASAP */
+	else
+		delay = min(mad_send_wr->deadline - now, mad_send_wr->timeout);
+	mad_send_wr->timeout = now + delay;
 
 	if (delay) {
 		list_for_each_prev(list_item, &mad_agent_priv->wait_list) {
@@ -2623,6 +2665,9 @@ int ib_modify_mad(struct ib_mad_send_buf *send_buf, u32 timeout_ms)
 	if (!timeout_ms)
 		mad_send_wr->status = IB_WC_WR_FLUSH_ERR;
 
+	if (mad_send_wr->deadline)
+		mad_send_wr->deadline += msecs_to_jiffies(timeout_ms);
+
 	if (mad_send_wr->state == IB_MAD_STATE_SEND_START ||
 	    (mad_send_wr->state == IB_MAD_STATE_QUEUED && timeout_ms))
 		mad_send_wr->timeout = msecs_to_jiffies(timeout_ms);
@@ -2726,7 +2771,8 @@ static int retry_send(struct ib_mad_send_wr_private *mad_send_wr)
 {
 	int ret;
 
-	if (!mad_send_wr->retries_left)
+	if (time_after_eq(jiffies, mad_send_wr->deadline) ||
+	    !mad_send_wr->retries_left)
 		return -ETIMEDOUT;
 
 	mad_send_wr->retries_left--;
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index b2a12a82a62d..24580ad2d428 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -140,6 +140,7 @@ struct ib_mad_send_wr_private {
 	struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG];
 	__be64 tid;
 	unsigned long timeout;
+	unsigned long deadline;
 	int max_retries;
 	int retries_left;
 	int retry;
diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
index 3f1b58d8b4bf..69708170a0d6 100644
--- a/include/rdma/ib_mad.h
+++ b/include/rdma/ib_mad.h
@@ -727,6 +727,9 @@ void ib_free_recv_mad(struct ib_mad_recv_wc *mad_recv_wc);
  *
  * This call will reset the timeout value for a sent MAD to the specified
  * value.
+ *
+ * If called with a non-zero value and ib_set_mad_deadline was used, the
+ * deadline will be extended by the @timeout_ms.
  */
 int ib_modify_mad(struct ib_mad_send_buf *send_buf, u32 timeout_ms);
 
@@ -818,4 +821,30 @@ void ib_free_send_mad(struct ib_mad_send_buf *send_buf);
  */
 int ib_mad_kernel_rmpp_agent(const struct ib_mad_agent *agent);
 
+/**
+ * ib_set_mad_deadline - Sets send MAD's deadline based on current time.
+ * @send_buf: Previously allocated send data buffer.
+ * @total_timeout_ms: Time to wait before stopping retries.
+ *
+ * The deadline will start being enforced once ib_post_send_mad is called.
+ * It is NOT guaranteed that at least one send will be performed.  Only valid
+ * for MADs waiting for response (ib_mad_send_buf.timeout_ms must also be set).
+ *
+ * This option allows callers to bound the time a MAD is owned by the MAD layer.
+ * This takes precedence over ib_mad_send_buf.{retries, timeout_ms} and is
+ * independent from the MAD layer's internal retry algorithm.
+ *
+ * Once this deadline expires, the MAD data buffer will be returned to the
+ * caller via the send_handler configured at agent registration time.
+ * Invocation of the send_handler might happen slightly later due to scheduling
+ * delays.
+ *
+ * The deadline will be extended if ib_modify_mad is called.
+ *
+ * Can only be called once.
+ *
+ * Might return errors for MADs which do not support deadline.
+ */
+int ib_set_mad_deadline(struct ib_mad_send_buf *send_buf, u32 total_timeout_ms);
+
 #endif /* IB_MAD_H */
-- 
2.47.0



* [PATCH rdma-next 3/9] RDMA/sa_query: Enforce min retry interval and deadline
  2024-12-05 13:49 [PATCH rdma-next 0/9] Rework retry algorithm used when sending MADs Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 1/9] IB/mad: Apply timeout modification (CM MRA) only once Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 2/9] IB/mad: Add deadline for send MADs Leon Romanovsky
@ 2024-12-05 13:49 ` Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 4/9] RDMA/nldev: Add sa-min-timeout management attribute Leon Romanovsky
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2024-12-05 13:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Vlad Dumitrescu, linux-rdma, Sean Hefty

From: Vlad Dumitrescu <vdumitrescu@nvidia.com>

SA query users, both in-kernel and userspace (via rdma_resolve_route),
pass in a total timeout and expect the SA query layer to handle retries.
SA query relies on the MAD layer to issue a fixed number of 10 retries
at the specified interval (timeout), set to 1/10 of the requested total
timeout.

When the caller-requested total timeout is low (e.g., 1s for IPoIB), the
resulting retry interval (e.g., 100ms) is too aggressive.  There have
been reports of overloaded SA receivers.  Hence, enforce a minimum.  A
follow-up change will make this configurable via rdma tool (netlink) at
per-port granularity.

Continue to enforce the caller's total timeout by using the new MAD
layer deadline option.

Remove the small-timeout special case - the deadline option will take
care of stopping the send even when more retries are left.  Moreover,
this special case results in an extremely aggressive 1ms retry
interval, which is definitely not desirable.

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/sa_query.c | 34 +++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 53571e6b3162..ac0d53bf91c4 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -39,6 +39,7 @@
 #include <linux/slab.h>
 #include <linux/dma-mapping.h>
 #include <linux/kref.h>
+#include <linux/minmax.h>
 #include <linux/xarray.h>
 #include <linux/workqueue.h>
 #include <uapi/linux/if_ether.h>
@@ -59,6 +60,7 @@
 #define IB_SA_LOCAL_SVC_TIMEOUT_MAX		200000
 #define IB_SA_CPI_MAX_RETRY_CNT			3
 #define IB_SA_CPI_RETRY_WAIT			1000 /*msecs */
+#define IB_SA_MIN_TIMEOUT_MS_DEFAULT		500
 static int sa_local_svc_timeout_ms = IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT;
 
 struct ib_sa_sm_ah {
@@ -96,6 +98,7 @@ struct ib_sa_port {
 	spinlock_t                   classport_lock; /* protects class port info set */
 	spinlock_t           ah_lock;
 	u32		     port_num;
+	u32                  min_timeout_ms;
 };
 
 struct ib_sa_device {
@@ -1344,13 +1347,14 @@ static int send_mad(struct ib_sa_query *query, unsigned long timeout_ms,
 	if (ret < 0)
 		return ret;
 
-	query->mad_buf->timeout_ms  = timeout_ms / nmbr_sa_query_retries;
+	query->mad_buf->timeout_ms =
+		max(READ_ONCE(query->port->min_timeout_ms),
+		    timeout_ms / nmbr_sa_query_retries);
 	query->mad_buf->retries = nmbr_sa_query_retries;
-	if (!query->mad_buf->timeout_ms) {
-		/* Special case, very small timeout_ms */
-		query->mad_buf->timeout_ms = 1;
-		query->mad_buf->retries = timeout_ms;
-	}
+	ret = ib_set_mad_deadline(query->mad_buf, timeout_ms);
+	if (ret)
+		goto out;
+
 	query->mad_buf->context[0] = query;
 	query->id = id;
 
@@ -1364,18 +1368,22 @@ static int send_mad(struct ib_sa_query *query, unsigned long timeout_ms,
 	}
 
 	ret = ib_post_send_mad(query->mad_buf, NULL);
-	if (ret) {
-		xa_lock_irqsave(&queries, flags);
-		__xa_erase(&queries, id);
-		xa_unlock_irqrestore(&queries, flags);
-	}
 
 	/*
 	 * It's not safe to dereference query any more, because the
 	 * send may already have completed and freed the query in
 	 * another context.
 	 */
-	return ret ? ret : id;
+
+out:
+	if (ret) {
+		xa_lock_irqsave(&queries, flags);
+		__xa_erase(&queries, id);
+		xa_unlock_irqrestore(&queries, flags);
+		return ret;
+	}
+
+	return id;
 }
 
 void ib_sa_unpack_path(void *attribute, struct sa_path_rec *rec)
@@ -2192,6 +2200,8 @@ static int ib_sa_add_one(struct ib_device *device)
 		INIT_DELAYED_WORK(&sa_dev->port[i].ib_cpi_work,
 				  update_ib_cpi);
 
+		sa_dev->port[i].min_timeout_ms = IB_SA_MIN_TIMEOUT_MS_DEFAULT;
+
 		count++;
 	}
 
-- 
2.47.0



* [PATCH rdma-next 4/9] RDMA/nldev: Add sa-min-timeout management attribute
  2024-12-05 13:49 [PATCH rdma-next 0/9] Rework retry algorithm used when sending MADs Leon Romanovsky
                   ` (2 preceding siblings ...)
  2024-12-05 13:49 ` [PATCH rdma-next 3/9] RDMA/sa_query: Enforce min retry interval and deadline Leon Romanovsky
@ 2024-12-05 13:49 ` Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 5/9] IB/umad: Set deadline when sending non-RMPP MADs Leon Romanovsky
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2024-12-05 13:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Vlad Dumitrescu, linux-rdma, Sean Hefty

From: Vlad Dumitrescu <vdumitrescu@nvidia.com>

Add a new namespace for MAD (Management Datagram) protocols, as we
expect more attributes in this area.

Add the first such attribute, to control the minimum initial timeout used by
the SA client implementation.  The SA client relies on the MAD layer to
issue retries, but has to configure an initial timeout value for the
first retry.  While the SA client provides a default, the right value
likely depends on network size, loss levels and capacity of the SA
server.  This attribute enables system admins to tune the trade-off
between speed of recovery under transient loss and load (on the network
and/or the SA server) generated by unnecessary retries.

Enforce a reasonable range of 50ms - 10s.

Changes do not apply to existing SA queries, which were already posted
to the MAD layer.

Example usage:
  # rdma management show ibp1s0f0/1
  0: ibp1s0f0: 1 sa-min-timeout 500 ...
  # rdma management show
  0: ibp1s0f0: 1 sa-min-timeout 500 ...
  1: ibp1s0f1: 1 sa-min-timeout 500 ...
  # rdma management set ibp1s0f1/1 sa-min-timeout 1000 ...
  # rdma management show
  0: ibp1s0f0: 1 sa-min-timeout 500 ...
  1: ibp1s0f1: 1 sa-min-timeout 1000 ...

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/core_priv.h |   4 +
 drivers/infiniband/core/nldev.c     | 114 ++++++++++++++++++++++++++++
 drivers/infiniband/core/sa_query.c  |  47 ++++++++++++
 include/uapi/rdma/rdma_netlink.h    |   5 ++
 4 files changed, 170 insertions(+)

diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h
index 05102769a918..7a7326588297 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -199,6 +199,10 @@ void ib_mad_cleanup(void);
 int ib_sa_init(void);
 void ib_sa_cleanup(void);
 
+int ib_sa_min_timeout_set(struct ib_device *dev, u32 port_num, u32 val,
+			  struct netlink_ext_ack *extack);
+int ib_sa_min_timeout_get(struct ib_device *dev, u32 port_num, u32 *val);
+
 void rdma_nl_init(void);
 void rdma_nl_exit(void);
 
diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index ff121e59b9c0..363742567dd2 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -171,6 +171,7 @@ static const struct nla_policy nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 	[RDMA_NLDEV_ATTR_PARENT_NAME]		= { .type = NLA_NUL_STRING },
 	[RDMA_NLDEV_ATTR_NAME_ASSIGN_TYPE]	= { .type = NLA_U8 },
 	[RDMA_NLDEV_ATTR_EVENT_TYPE]		= { .type = NLA_U8 },
+	[RDMA_NLDEV_MGMT_ATTR_SA_MIN_TIMEOUT]	= { .type = NLA_U32 },
 };
 
 static int put_driver_name_print_type(struct sk_buff *msg, const char *name,
@@ -2621,6 +2622,112 @@ static int nldev_deldev(struct sk_buff *skb, struct nlmsghdr *nlh,
 	return ib_del_sub_device_and_put(device);
 }
 
+static int nldev_mgmt_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
+			       struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX];
+	struct ib_device *device;
+	struct sk_buff *msg;
+	u32 index;
+	u32 port;
+	u32 sa_min_timeout;
+	int ret;
+
+	ret = __nlmsg_parse(nlh, 0, tb, RDMA_NLDEV_ATTR_MAX - 1, nldev_policy,
+			    NL_VALIDATE_LIBERAL, extack);
+	if (ret ||
+	    !tb[RDMA_NLDEV_ATTR_DEV_INDEX] ||
+	    !tb[RDMA_NLDEV_ATTR_PORT_INDEX])
+		return -EINVAL;
+
+	index = nla_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]);
+	device = ib_device_get_by_index(sock_net(skb->sk), index);
+	if (!device)
+		return -EINVAL;
+
+	port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);
+	if (!rdma_is_port_valid(device, port)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	if (rdma_cap_ib_sa(device, port)) {
+		ret = ib_sa_min_timeout_get(device, port, &sa_min_timeout);
+		if (ret)
+			goto err;
+	}
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	nlh = nlmsg_put(
+		msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq,
+		RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_MGMT_GET), 0, 0);
+	if (!nlh ||
+	    fill_nldev_handle(msg, device) ||
+	    nla_put_u32(msg, RDMA_NLDEV_ATTR_PORT_INDEX, port)) {
+		ret = -EMSGSIZE;
+		goto err_msg;
+	}
+
+	if (rdma_cap_ib_sa(device, port)) {
+		ret = nla_put_u32(msg, RDMA_NLDEV_MGMT_ATTR_SA_MIN_TIMEOUT,
+				  sa_min_timeout);
+		if (ret)
+			goto err_msg;
+	}
+
+	nlmsg_end(msg, nlh);
+	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);
+
+err_msg:
+	nlmsg_free(msg);
+err:
+	ib_device_put(device);
+	return ret;
+}
+
+static int nldev_set_mgmt_set_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
+				   struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX];
+	struct ib_device *device;
+	u32 index;
+	u32 port;
+	u32 sa_min_timeout;
+	int ret;
+
+	ret = nlmsg_parse(nlh, 0, tb, RDMA_NLDEV_ATTR_MAX - 1, nldev_policy,
+			  extack);
+	if (ret ||
+	    !tb[RDMA_NLDEV_ATTR_DEV_INDEX] ||
+	    !tb[RDMA_NLDEV_ATTR_PORT_INDEX])
+		return -EINVAL;
+
+	index = nla_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]);
+	device = ib_device_get_by_index(sock_net(skb->sk), index);
+	if (!device)
+		return -EINVAL;
+
+	port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);
+	if (!rdma_is_port_valid(device, port))
+		goto err;
+
+	if (tb[RDMA_NLDEV_MGMT_ATTR_SA_MIN_TIMEOUT]) {
+		sa_min_timeout =
+			nla_get_u32(tb[RDMA_NLDEV_MGMT_ATTR_SA_MIN_TIMEOUT]);
+		return ib_sa_min_timeout_set(device, port, sa_min_timeout,
+					     extack);
+	}
+
+err:
+	ib_device_put(device);
+	return -EINVAL;
+}
+
 static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = {
 	[RDMA_NLDEV_CMD_GET] = {
 		.doit = nldev_get_doit,
@@ -2727,6 +2834,13 @@ static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = {
 		.doit = nldev_deldev,
 		.flags = RDMA_NL_ADMIN_PERM,
 	},
+	[RDMA_NLDEV_CMD_MGMT_GET] = {
+		.doit = nldev_mgmt_get_doit,
+	},
+	[RDMA_NLDEV_CMD_MGMT_SET] = {
+		.doit = nldev_set_mgmt_set_doit,
+		.flags = RDMA_NL_ADMIN_PERM,
+	},
 };
 
 static int fill_mon_netdev_rename(struct sk_buff *msg,
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index ac0d53bf91c4..7f63cad3f212 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -60,7 +60,10 @@
 #define IB_SA_LOCAL_SVC_TIMEOUT_MAX		200000
 #define IB_SA_CPI_MAX_RETRY_CNT			3
 #define IB_SA_CPI_RETRY_WAIT			1000 /*msecs */
+#define IB_SA_MIN_TIMEOUT_MS_MIN		50
 #define IB_SA_MIN_TIMEOUT_MS_DEFAULT		500
+#define IB_SA_MIN_TIMEOUT_MS_MAX		10000
+
 static int sa_local_svc_timeout_ms = IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT;
 
 struct ib_sa_sm_ah {
@@ -1334,6 +1337,50 @@ static void init_mad(struct ib_sa_query *query, struct ib_mad_agent *agent)
 	spin_unlock_irqrestore(&tid_lock, flags);
 }
 
+int ib_sa_min_timeout_set(struct ib_device *dev, u32 port_num, u32 val,
+			  struct netlink_ext_ack *extack)
+{
+	struct ib_sa_device *sa_dev = ib_get_client_data(dev, &sa_client);
+	struct ib_sa_port *port;
+
+	if (!rdma_cap_ib_sa(dev, port_num))
+		return -EOPNOTSUPP;
+
+	if (!sa_dev)
+		return -ENODEV;
+
+	port = &sa_dev->port[port_num - sa_dev->start_port];
+
+	if (val > IB_SA_MIN_TIMEOUT_MS_MAX || val < IB_SA_MIN_TIMEOUT_MS_MIN) {
+		NL_SET_ERR_MSG_FMT_MOD(extack, "Valid range [%u-%u]ms",
+				       IB_SA_MIN_TIMEOUT_MS_MIN,
+				       IB_SA_MIN_TIMEOUT_MS_MAX);
+		return -EINVAL;
+	}
+
+	WRITE_ONCE(port->min_timeout_ms, val);
+
+	return 0;
+}
+
+int ib_sa_min_timeout_get(struct ib_device *dev, u32 port_num, u32 *val)
+{
+	struct ib_sa_device *sa_dev = ib_get_client_data(dev, &sa_client);
+	struct ib_sa_port *port;
+
+	if (!rdma_cap_ib_sa(dev, port_num))
+		return -EOPNOTSUPP;
+
+	if (!sa_dev)
+		return -ENODEV;
+
+	port = &sa_dev->port[port_num - sa_dev->start_port];
+
+	*val = READ_ONCE(port->min_timeout_ms);
+
+	return 0;
+}
+
 static int send_mad(struct ib_sa_query *query, unsigned long timeout_ms,
 		    gfp_t gfp_mask)
 {
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index 9f9cf20c1cd8..2b1c4c55e51f 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -308,6 +308,9 @@ enum rdma_nldev_command {
 
 	RDMA_NLDEV_CMD_MONITOR,
 
+	RDMA_NLDEV_CMD_MGMT_GET,
+	RDMA_NLDEV_CMD_MGMT_SET,
+
 	RDMA_NLDEV_NUM_OPS
 };
 
@@ -580,6 +583,8 @@ enum rdma_nldev_attr {
 	RDMA_NLDEV_ATTR_EVENT_TYPE,		/* u8 */
 
 	RDMA_NLDEV_SYS_ATTR_MONITOR_MODE,	/* u8 */
+
+	RDMA_NLDEV_MGMT_ATTR_SA_MIN_TIMEOUT,	/* u32 */
 	/*
 	 * Always the end
 	 */
-- 
2.47.0



* [PATCH rdma-next 5/9] IB/umad: Set deadline when sending non-RMPP MADs
  2024-12-05 13:49 [PATCH rdma-next 0/9] Rework retry algorithm used when sending MADs Leon Romanovsky
                   ` (3 preceding siblings ...)
  2024-12-05 13:49 ` [PATCH rdma-next 4/9] RDMA/nldev: Add sa-min-timeout management attribute Leon Romanovsky
@ 2024-12-05 13:49 ` Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 6/9] IB/cm: Set deadline when sending MADs Leon Romanovsky
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2024-12-05 13:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Vlad Dumitrescu, linux-rdma, Sean Hefty

From: Vlad Dumitrescu <vdumitrescu@nvidia.com>

Change the semantics of the retries and timeout_ms fields of the UAPI
struct ib_user_mad_hdr.  Given the current implementation, users likely
expect the total timeout to be (retries + 1) * timeout_ms.  Use that
as the MAD deadline.

This allows changes to the MAD layer's internal retry algorithm, for
non-RMPP agents, without affecting the total timeout experienced by
userspace.

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/user_mad.c |  8 ++++++++
 include/uapi/rdma/ib_user_mad.h    | 12 ++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index fd67fc9fe85a..3da6c0295657 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -650,6 +650,14 @@ static ssize_t ib_umad_write(struct file *filp, const char __user *buf,
 		}
 	}
 
+	if (!ib_mad_kernel_rmpp_agent(agent) && packet->mad.hdr.timeout_ms) {
+		ret = ib_set_mad_deadline(packet->msg,
+					  (packet->mad.hdr.retries + 1) *
+						  packet->mad.hdr.timeout_ms);
+		if (ret)
+			goto err_send;
+	}
+
 	ret = ib_post_send_mad(packet->msg, NULL);
 	if (ret)
 		goto err_send;
diff --git a/include/uapi/rdma/ib_user_mad.h b/include/uapi/rdma/ib_user_mad.h
index 10b5f6a4c677..1e9c20a44e50 100644
--- a/include/uapi/rdma/ib_user_mad.h
+++ b/include/uapi/rdma/ib_user_mad.h
@@ -57,7 +57,11 @@
  *   received (transaction ID in data[] will be set to TID of original
  *   request) (ignored on send)
  * @timeout_ms - Milliseconds to wait for response (unset on receive)
- * @retries - Number of automatic retries to attempt
+ *   before issuing a retry
+ * @retries - Maximum number of automatic retries to attempt. Actual
+ *   number of retries could be less if (@retries + 1) * @timeout_ms
+ *   is exceeded. When the registration request sets @rmpp_version,
+ *   it applies per RMPP window
  * @qpn - Remote QP number received from/to be sent to
  * @qkey - Remote Q_Key to be sent with (unset on receive)
  * @lid - Remote lid received from/to be sent to
@@ -100,7 +104,11 @@ struct ib_user_mad_hdr_old {
  *   received (transaction ID in data[] will be set to TID of original
  *   request) (ignored on send)
  * @timeout_ms - Milliseconds to wait for response (unset on receive)
- * @retries - Number of automatic retries to attempt
+ *   before issuing a retry
+ * @retries - Maximum number of automatic retries to attempt. Actual
+ *   number of retries could be less if (@retries + 1) * @timeout_ms
+ *   is exceeded. When the registration request sets @rmpp_version,
+ *   it applies per RMPP window
  * @qpn - Remote QP number received from/to be sent to
  * @qkey - Remote Q_Key to be sent with (unset on receive)
  * @lid - Remote lid received from/to be sent to
-- 
2.47.0



* [PATCH rdma-next 6/9] IB/cm: Set deadline when sending MADs
  2024-12-05 13:49 [PATCH rdma-next 0/9] Rework retry algorithm used when sending MADs Leon Romanovsky
                   ` (4 preceding siblings ...)
  2024-12-05 13:49 ` [PATCH rdma-next 5/9] IB/umad: Set deadline when sending non-RMPP MADs Leon Romanovsky
@ 2024-12-05 13:49 ` Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 7/9] IB/mad: Exponential backoff when retrying sends Leon Romanovsky
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2024-12-05 13:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Vlad Dumitrescu, linux-rdma, Sean Hefty

From: Vlad Dumitrescu <vdumitrescu@nvidia.com>

With the current MAD retry algorithm, the expected total timeout is
roughly (retries + 1) * timeout_ms.  This is an approximation because
scheduling and completion delays are not strictly accounted for.

For CM the number of retries is typically CMA_MAX_CM_RETRIES (15),
unless the peer is setting REQ:Max CM Retries [1] to a different value.
In theory, the timeout could vary, being based on
CMA_CM_RESPONSE_TIMEOUT + Packet Life Time, as well as the peer's MRA
messages.  In practice, for RoCE, the formula above results in 65536ms.

Based on the above, set a constant deadline of a round 70s for all
cases.  Note that MRAs will end up calling ib_modify_mad, which will
extend the deadline accordingly.

This allows changes to the MAD layer's internal retry algorithm without
affecting the total timeout experienced by CM.

[1] IBTA v1.7 - Section 12.7.27 - Max CM Retries
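
The 65536ms figure can be reproduced from the kernel's approximation of
the IBTA timeout encoding (4.096us * 2^t, rounded to 2^(t - 8) ms); the
Python sketch below is for illustration and mirrors the in-kernel
cm_convert_to_ms helper:

```python
def cm_convert_to_ms(iba_time: int) -> int:
    # IBTA encodes timeouts as 4.096 us * 2^t; the kernel rounds
    # this to 2^(t - 8) ms (with a floor of 1 ms).
    return 1 << max(iba_time - 8, 0)

CMA_CM_RESPONSE_TIMEOUT = 20  # value prior to patch 9 of this series
CMA_MAX_CM_RETRIES = 15

timeout_ms = cm_convert_to_ms(CMA_CM_RESPONSE_TIMEOUT)  # 4096 ms
total_ms = (CMA_MAX_CM_RETRIES + 1) * timeout_ms        # 65536 ms
```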

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/cm.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 142170473e75..36649faf9842 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -36,6 +36,7 @@ MODULE_LICENSE("Dual BSD/GPL");
 
 #define CM_DESTROY_ID_WAIT_TIMEOUT 10000 /* msecs */
 #define CM_DIRECT_RETRY_CTX ((void *) 1UL)
+#define CM_MAD_TOTAL_TIMEOUT 70000 /* msecs */
 
 static const char * const ibcm_rej_reason_strs[] = {
 	[IB_CM_REJ_NO_QP]			= "no QP",
@@ -279,6 +280,7 @@ static struct ib_mad_send_buf *cm_alloc_msg(struct cm_id_private *cm_id_priv)
 	struct ib_mad_agent *mad_agent;
 	struct ib_mad_send_buf *m;
 	struct ib_ah *ah;
+	int ret;
 
 	lockdep_assert_held(&cm_id_priv->lock);
 
@@ -309,6 +311,17 @@ static struct ib_mad_send_buf *cm_alloc_msg(struct cm_id_private *cm_id_priv)
 	}
 
 	m->ah = ah;
+	m->retries = cm_id_priv->max_cm_retries;
+	ret = ib_set_mad_deadline(m, CM_MAD_TOTAL_TIMEOUT);
+	if (ret) {
+		ib_free_send_mad(m);
+		rdma_destroy_ah(ah, 0);
+		m = ERR_PTR(ret);
+		goto out;
+	}
+
+	refcount_inc(&cm_id_priv->refcount);
+	m->context[0] = cm_id_priv;
 
 out:
 	spin_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
-- 
2.47.0



* [PATCH rdma-next 7/9] IB/mad: Exponential backoff when retrying sends
  2024-12-05 13:49 [PATCH rdma-next 0/9] Rework retry algorithm used when sending MADs Leon Romanovsky
                   ` (5 preceding siblings ...)
  2024-12-05 13:49 ` [PATCH rdma-next 6/9] IB/cm: Set deadline when sending MADs Leon Romanovsky
@ 2024-12-05 13:49 ` Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 8/9] RDMA/nldev: Add mad-linear-timeouts management attribute Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 9/9] IB/cma: Lower response timeout to roughly 1s Leon Romanovsky
  8 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2024-12-05 13:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Vlad Dumitrescu, linux-rdma, Sean Hefty

From: Vlad Dumitrescu <vdumitrescu@nvidia.com>

When a receiver is overloaded, MAD requests time out and get retried in
a linear fashion.  This could worsen congestion and reduce goodput.  To
help reduce the load over time, use exponential backoff after a preset
number of retries.  Cap delays between retries at 60s, even when in
exponential mode.

An MRA message from the recipient could request an even higher timeout,
so continue to respect that for the next retry.  However, reset the
backoff algorithm to the beginning when an MRA is received.

Exclude RMPP and OPA from exponential backoff.
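
The resulting retry schedule can be sketched as follows (Python, for
illustration only; it mirrors the set_next_timeout logic in the diff,
with the default of 4 linear timeouts):

```python
MSEC_PER_SEC = 1000
IB_MAD_MAX_TIMEOUT_MS = 60 * MSEC_PER_SEC  # cap between retries
LINEAR_TIMEOUTS = 4                        # default threshold

def retry_delays_ms(timeout_ms: int, retries: int) -> list[int]:
    """Wait applied before each retry: constant for the first
    LINEAR_TIMEOUTS - 1 retries, then doubling, capped at 60 s."""
    delays, var_timeout = [], timeout_ms
    for attempt in range(1, retries + 1):
        if attempt >= LINEAR_TIMEOUTS:
            var_timeout = min(var_timeout * 2, IB_MAD_MAX_TIMEOUT_MS)
        delays.append(var_timeout)
    return delays

# e.g. timeout_ms=1000, 6 retries -> [1000, 1000, 1000, 2000, 4000, 8000]
```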

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/mad.c      | 53 ++++++++++++++++++++++++++++--
 drivers/infiniband/core/mad_priv.h |  3 ++
 2 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 5c255ee3db38..a3a8cf4bbc20 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -54,7 +54,9 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/ib_mad.h>
 
-#define IB_MAD_MAX_DEADLINE (jiffies + msecs_to_jiffies(5 * 60 * 1000))
+#define IB_MAD_LINEAR_TIMEOUTS_DEFAULT	4
+#define IB_MAD_MAX_TIMEOUT_MS		(60 * MSEC_PER_SEC)
+#define IB_MAD_MAX_DEADLINE		(jiffies + msecs_to_jiffies(5 * 60 * 1000))
 
 #ifdef CONFIG_TRACEPOINTS
 static void create_mad_addr_info(struct ib_mad_send_wr_private *mad_send_wr,
@@ -1210,10 +1212,12 @@ int ib_post_send_mad(struct ib_mad_send_buf *send_buf,
 		}
 
 		mad_send_wr->tid = ((struct ib_mad_hdr *) send_buf->mad)->tid;
+		mad_send_wr->var_timeout_ms = send_buf->timeout_ms;
 		/* Timeout will be updated after send completes */
 		mad_send_wr->timeout = msecs_to_jiffies(send_buf->timeout_ms);
 		mad_send_wr->max_retries = send_buf->retries;
 		mad_send_wr->retries_left = send_buf->retries;
+		mad_send_wr->backoff_retries = 0;
 		send_buf->retries = 0;
 		mad_send_wr->status = IB_WC_SUCCESS;
 
@@ -2662,18 +2666,34 @@ int ib_modify_mad(struct ib_mad_send_buf *send_buf, u32 timeout_ms)
 		return -EINVAL;
 	}
 
-	if (!timeout_ms)
+	if (!timeout_ms) {
 		mad_send_wr->status = IB_WC_WR_FLUSH_ERR;
+		goto apply;
+	}
+
+	/* CM MRA requesting a lower timeout than ours.  Could be a delayed MRA
+	 * (variable backoff increased in the meantime) or remote using a const.
+	 */
+	if (timeout_ms < mad_send_wr->var_timeout_ms)
+		goto ignore;
+
+	/* Assume remote will no longer be overloaded after MRA Service Timeout
+	 * passes and restart variable backoff algorithm.
+	 */
+	mad_send_wr->var_timeout_ms = mad_send_wr->send_buf.timeout_ms;
+	mad_send_wr->backoff_retries = 0;
 
 	if (mad_send_wr->deadline)
 		mad_send_wr->deadline += msecs_to_jiffies(timeout_ms);
 
+apply:
 	if (mad_send_wr->state == IB_MAD_STATE_SEND_START ||
 	    (mad_send_wr->state == IB_MAD_STATE_QUEUED && timeout_ms))
 		mad_send_wr->timeout = msecs_to_jiffies(timeout_ms);
 	else
 		ib_reset_mad_timeout(mad_send_wr, timeout_ms);
 
+ignore:
 	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 	return 0;
 }
@@ -2767,6 +2787,30 @@ static void local_completions(struct work_struct *work)
 	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 }
 
+/*
+ * Applies a variable backoff to certain send MADs.
+ *
+ * Exists to scope down the initial variable backoff implementation.
+ */
+static void set_next_timeout(struct ib_mad_send_wr_private *mad_send_wr)
+{
+	const struct ib_mad_agent_private *agent = mad_send_wr->mad_agent_priv;
+	const struct ib_mad_port_private *port = agent->qp_info->port_priv;
+	const struct ib_mad_hdr *hdr = mad_send_wr->send_buf.mad;
+
+	if (ib_mad_kernel_rmpp_agent(&agent->agent))
+		return;
+
+	if (hdr->base_version != IB_MGMT_BASE_VERSION)
+		return;
+
+	if (++mad_send_wr->backoff_retries < READ_ONCE(port->linear_timeouts))
+		return;
+
+	mad_send_wr->var_timeout_ms =
+		min(mad_send_wr->var_timeout_ms << 1, IB_MAD_MAX_TIMEOUT_MS);
+}
+
 static int retry_send(struct ib_mad_send_wr_private *mad_send_wr)
 {
 	int ret;
@@ -2778,7 +2822,8 @@ static int retry_send(struct ib_mad_send_wr_private *mad_send_wr)
 	mad_send_wr->retries_left--;
 	mad_send_wr->send_buf.retries++;
 
-	mad_send_wr->timeout = msecs_to_jiffies(mad_send_wr->send_buf.timeout_ms);
+	set_next_timeout(mad_send_wr);
+	mad_send_wr->timeout = msecs_to_jiffies(mad_send_wr->var_timeout_ms);
 
 	if (ib_mad_kernel_rmpp_agent(&mad_send_wr->mad_agent_priv->agent)) {
 		ret = ib_retry_rmpp(mad_send_wr);
@@ -3195,6 +3240,8 @@ static int ib_mad_port_open(struct ib_device *device,
 		goto error8;
 	}
 
+	port_priv->linear_timeouts = IB_MAD_LINEAR_TIMEOUTS_DEFAULT;
+
 	spin_lock_irqsave(&ib_mad_port_list_lock, flags);
 	list_add_tail(&port_priv->port_list, &ib_mad_port_list);
 	spin_unlock_irqrestore(&ib_mad_port_list_lock, flags);
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 24580ad2d428..076ebcea27b4 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -139,10 +139,12 @@ struct ib_mad_send_wr_private {
 	struct ib_ud_wr send_wr;
 	struct ib_sge sg_list[IB_MAD_SEND_REQ_MAX_SG];
 	__be64 tid;
+	unsigned int var_timeout_ms;
 	unsigned long timeout;
 	unsigned long deadline;
 	int max_retries;
 	int retries_left;
+	int backoff_retries;
 	int retry;
 	enum ib_wc_status status;
 
@@ -222,6 +224,7 @@ struct ib_mad_port_private {
 	struct ib_mad_mgmt_version_table version[MAX_MGMT_VERSION];
 	struct workqueue_struct *wq;
 	struct ib_mad_qp_info qp_info[IB_MAD_QPS_CORE];
+	u8 linear_timeouts;
 };
 
 int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr);
-- 
2.47.0



* [PATCH rdma-next 8/9] RDMA/nldev: Add mad-linear-timeouts management attribute
  2024-12-05 13:49 [PATCH rdma-next 0/9] Rework retry algorithm used when sending MADs Leon Romanovsky
                   ` (6 preceding siblings ...)
  2024-12-05 13:49 ` [PATCH rdma-next 7/9] IB/mad: Exponential backoff when retrying sends Leon Romanovsky
@ 2024-12-05 13:49 ` Leon Romanovsky
  2024-12-05 13:49 ` [PATCH rdma-next 9/9] IB/cma: Lower response timeout to roughly 1s Leon Romanovsky
  8 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2024-12-05 13:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Vlad Dumitrescu, linux-rdma, Sean Hefty

From: Vlad Dumitrescu <vdumitrescu@nvidia.com>

This attribute allows system admins to make a trade-off between speed
of recovery under transient loss and reducing congestion under
persistent loss or overload.

Set 15 as the max value, as it allows system admins to effectively opt
the CM out of exponential backoff.  CM currently uses the
CMA_MAX_CM_RETRIES (15) constant to set retries.  Other MAD layer
callers use different values (e.g., sa_query uses 10, UMAD exposes the
parameter to userspace), but a max of 15 linear retries should be
enough.

Example:
  # rdma management show rocep1s0f1/1
  1: rocep1s0f1: 1 mad-linear-timeouts 4 ...
  # rdma management set rocep1s0f1/1 mad-linear-timeouts 6
  # rdma management show
  0: rocep1s0f0: 1 mad-linear-timeouts 4 ...
  1: rocep1s0f1: 1 mad-linear-timeouts 6 ...
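
To illustrate the trade-off, a hypothetical Python sketch (assuming the
60s cap and CM-like parameters) comparing per-retry waits for the
default threshold of 4 against the maximum of 15:

```python
def schedule(timeout_ms, retries, linear_timeouts, cap_ms=60_000):
    # Per-retry waits: constant until retry number linear_timeouts,
    # then doubling, capped at cap_ms.
    delays, t = [], timeout_ms
    for attempt in range(1, retries + 1):
        if attempt >= linear_timeouts:
            t = min(t * 2, cap_ms)
        delays.append(t)
    return delays

# CM-like parameters: 4096 ms timeout, 15 retries.
# Default threshold (4): backoff engages on the 4th retry.
# Max threshold (15): only the final retry's wait doubles, which
# effectively opts CM out of backoff.
default = schedule(4096, 15, 4)
opt_out = schedule(4096, 15, 15)
```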

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/mad.c      | 35 ++++++++++++++++++++++++++++++
 drivers/infiniband/core/mad_priv.h |  4 ++++
 drivers/infiniband/core/nldev.c    | 19 ++++++++++++++++
 include/uapi/rdma/rdma_netlink.h   |  2 ++
 4 files changed, 60 insertions(+)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index a3a8cf4bbc20..7c4ac8ae0a3f 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -54,7 +54,9 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/ib_mad.h>
 
+#define IB_MAD_LINEAR_TIMEOUTS_MIN	1
 #define IB_MAD_LINEAR_TIMEOUTS_DEFAULT	4
+#define IB_MAD_LINEAR_TIMEOUTS_MAX	15
 #define IB_MAD_MAX_TIMEOUT_MS		(60 * MSEC_PER_SEC)
 #define IB_MAD_MAX_DEADLINE		(jiffies + msecs_to_jiffies(5 * 60 * 1000))
 
@@ -145,6 +147,39 @@ ib_get_mad_port(struct ib_device *device, u32 port_num)
 	return entry;
 }
 
+int ib_mad_linear_timeouts_set(struct ib_device *dev, u32 port_num, u8 val,
+			       struct netlink_ext_ack *extack)
+{
+	struct ib_mad_port_private *port = ib_get_mad_port(dev, port_num);
+
+	if (!port)
+		return -ENODEV;
+
+	if (val > IB_MAD_LINEAR_TIMEOUTS_MAX ||
+	    val < IB_MAD_LINEAR_TIMEOUTS_MIN) {
+		NL_SET_ERR_MSG_FMT_MOD(extack, "Valid range [%u-%u]",
+				       IB_MAD_LINEAR_TIMEOUTS_MIN,
+				       IB_MAD_LINEAR_TIMEOUTS_MAX);
+		return -EINVAL;
+	}
+
+	WRITE_ONCE(port->linear_timeouts, val);
+
+	return 0;
+}
+
+int ib_mad_linear_timeouts_get(struct ib_device *dev, u32 port_num, u8 *val)
+{
+	struct ib_mad_port_private *port = ib_get_mad_port(dev, port_num);
+
+	if (!port)
+		return -ENODEV;
+
+	*val = READ_ONCE(port->linear_timeouts);
+
+	return 0;
+}
+
 static inline u8 convert_mgmt_class(u8 mgmt_class)
 {
 	/* Alias IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE to 0 */
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 076ebcea27b4..e6b362c054a6 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -241,4 +241,8 @@ void ib_mark_mad_done(struct ib_mad_send_wr_private *mad_send_wr);
 void ib_reset_mad_timeout(struct ib_mad_send_wr_private *mad_send_wr,
 			  unsigned long timeout_ms);
 
+int ib_mad_linear_timeouts_set(struct ib_device *dev, u32 port_num, u8 val,
+			       struct netlink_ext_ack *extack);
+int ib_mad_linear_timeouts_get(struct ib_device *dev, u32 port_num, u8 *val);
+
 #endif	/* __IB_MAD_PRIV_H__ */
diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 363742567dd2..acb02f8c87c0 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -172,6 +172,7 @@ static const struct nla_policy nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 	[RDMA_NLDEV_ATTR_NAME_ASSIGN_TYPE]	= { .type = NLA_U8 },
 	[RDMA_NLDEV_ATTR_EVENT_TYPE]		= { .type = NLA_U8 },
 	[RDMA_NLDEV_MGMT_ATTR_SA_MIN_TIMEOUT]	= { .type = NLA_U32 },
+	[RDMA_NLDEV_MGMT_ATTR_MAD_LINEAR_TIMEOUTS] = { .type = NLA_U8 },
 };
 
 static int put_driver_name_print_type(struct sk_buff *msg, const char *name,
@@ -2627,6 +2628,7 @@ static int nldev_mgmt_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 {
 	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX];
 	struct ib_device *device;
+	u8 mad_linear_timeouts;
 	struct sk_buff *msg;
 	u32 index;
 	u32 port;
@@ -2657,6 +2659,10 @@ static int nldev_mgmt_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto err;
 	}
 
+	ret = ib_mad_linear_timeouts_get(device, port, &mad_linear_timeouts);
+	if (ret)
+		goto err;
+
 	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
 	if (!msg) {
 		ret = -ENOMEM;
@@ -2680,6 +2686,11 @@ static int nldev_mgmt_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 			goto err_msg;
 	}
 
+	ret = nla_put_u8(msg, RDMA_NLDEV_MGMT_ATTR_MAD_LINEAR_TIMEOUTS,
+			 mad_linear_timeouts);
+	if (ret)
+		goto err_msg;
+
 	nlmsg_end(msg, nlh);
 	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);
 
@@ -2695,6 +2706,7 @@ static int nldev_set_mgmt_set_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 {
 	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX];
 	struct ib_device *device;
+	u8 mad_linear_timeouts;
 	u32 index;
 	u32 port;
 	u32 sa_min_timeout;
@@ -2723,6 +2735,13 @@ static int nldev_set_mgmt_set_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 					     extack);
 	}
 
+	if (tb[RDMA_NLDEV_MGMT_ATTR_MAD_LINEAR_TIMEOUTS]) {
+		mad_linear_timeouts = nla_get_u8(
+			tb[RDMA_NLDEV_MGMT_ATTR_MAD_LINEAR_TIMEOUTS]);
+		return ib_mad_linear_timeouts_set(device, port,
+						  mad_linear_timeouts, extack);
+	}
+
 err:
 	ib_device_put(device);
 	return -EINVAL;
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index 2b1c4c55e51f..d209a5973c8e 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -585,6 +585,8 @@ enum rdma_nldev_attr {
 	RDMA_NLDEV_SYS_ATTR_MONITOR_MODE,	/* u8 */
 
 	RDMA_NLDEV_MGMT_ATTR_SA_MIN_TIMEOUT,	/* u32 */
+
+	RDMA_NLDEV_MGMT_ATTR_MAD_LINEAR_TIMEOUTS, /* u8 */
 	/*
 	 * Always the end
 	 */
-- 
2.47.0



* [PATCH rdma-next 9/9] IB/cma: Lower response timeout to roughly 1s
  2024-12-05 13:49 [PATCH rdma-next 0/9] Rework retry algorithm used when sending MADs Leon Romanovsky
                   ` (7 preceding siblings ...)
  2024-12-05 13:49 ` [PATCH rdma-next 8/9] RDMA/nldev: Add mad-linear-timeouts management attribute Leon Romanovsky
@ 2024-12-05 13:49 ` Leon Romanovsky
  8 siblings, 0 replies; 10+ messages in thread
From: Leon Romanovsky @ 2024-12-05 13:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Vlad Dumitrescu, linux-rdma, Sean Hefty

From: Vlad Dumitrescu <vdumitrescu@nvidia.com>

The current CMA_CM_RESPONSE_TIMEOUT converts to 4096ms, which is high
for low-latency RDMA networks.  Match TCP's initial RTO of 1s (RFC 6298).

Rely on the recently added MAD layer exponential backoff to
counter-balance this reduction in case of persistent loss.
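
The before/after values can be checked against the kernel's 2^(t - 8) ms
approximation of the IBTA 4.096us * 2^t encoding (illustrative Python,
mirroring the in-kernel cm_convert_to_ms helper):

```python
def cm_convert_to_ms(iba_time: int) -> int:
    # 4.096 us * 2^t, approximated as 2^(t - 8) ms
    return 1 << max(iba_time - 8, 0)

old_ms = cm_convert_to_ms(20)  # previous CMA_CM_RESPONSE_TIMEOUT -> 4096 ms
new_ms = cm_convert_to_ms(18)  # new value -> 1024 ms, close to TCP's 1 s RTO
```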

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Sean Hefty <shefty@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/cma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 64ace0b968f0..ec4141b84351 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -44,7 +44,7 @@ MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
 MODULE_LICENSE("Dual BSD/GPL");
 
-#define CMA_CM_RESPONSE_TIMEOUT 20
+#define CMA_CM_RESPONSE_TIMEOUT 18
 #define CMA_MAX_CM_RETRIES 15
 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24)
 #define CMA_IBOE_PACKET_LIFETIME 16
-- 
2.47.0

