* [PATCH rdma-next 0/3] Add Flow Control for Solicited MADs
@ 2024-12-03 13:52 Leon Romanovsky
  2024-12-03 13:52 ` [PATCH rdma-next 1/3] IB/mad: Replace MAD's refcount with a state machine Leon Romanovsky
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Leon Romanovsky @ 2024-12-03 13:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, linux-kernel, linux-rdma, Maher Sanalla,
	Or Har-Toov

From: Leon Romanovsky <leonro@nvidia.com>

From Or:

This patch series introduces flow control for solicited MADs in
the MAD layer, addressing the need to avoid MAD loss caused by
insufficient resources at the receiver.

Both the client and the server act as receivers - the latter receives
requests, while the former receives responses. To facilitate this flow
control, the series also refactors the MAD code, improving readability
and enabling a more straightforward implementation.

Patch #1: Replace MAD's refcount with 'state' for readability
Patch #2: Remove unnecessary done list by utilizing MAD states
Patch #3: Add flow control for solicited MADs

The primary goal of this series is to add a flow control mechanism
to the MAD layer, reducing the number of timeouts.

The accompanying refactoring simplifies state management, making
the code more maintainable and supporting the new flow control
logic effectively.

Thanks

Or Har-Toov (3):
  IB/mad: Replace MAD's refcount with a state machine
  IB/mad: Remove unnecessary done list by utilizing MAD states
  IB/mad: Add flow control for solicited MADs

 drivers/infiniband/core/mad.c      | 212 +++++++++++++++++++++++++----
 drivers/infiniband/core/mad_priv.h |  19 ++-
 drivers/infiniband/core/mad_rmpp.c |   7 +-
 3 files changed, 202 insertions(+), 36 deletions(-)

-- 
2.47.0



* [PATCH rdma-next 1/3] IB/mad: Replace MAD's refcount with a state machine
  2024-12-03 13:52 [PATCH rdma-next 0/3] Add Flow Control for Solicited MADs Leon Romanovsky
@ 2024-12-03 13:52 ` Leon Romanovsky
  2024-12-10 16:12   ` Jason Gunthorpe
  2024-12-03 13:52 ` [PATCH rdma-next 2/3] IB/mad: Remove unnecessary done list by utilizing MAD states Leon Romanovsky
  2024-12-03 13:52 ` [PATCH rdma-next 3/3] IB/mad: Add flow control for solicited MADs Leon Romanovsky
  2 siblings, 1 reply; 8+ messages in thread
From: Leon Romanovsky @ 2024-12-03 13:52 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Or Har-Toov, linux-rdma, Maher Sanalla

From: Or Har-Toov <ohartoov@nvidia.com>

Replace the refcount mechanism with a 'state' field to track the status
of MAD send work requests (WRs). The state machine better represents
the stages in the MAD lifecycle, specifically indicating whether the
MAD is waiting for a response or a completion.

The existing refcount only takes two values:
 * 1 - MAD is waiting either for a completion or for a response.
 * 2 - MAD is waiting for both a response and a completion. This is
   also the value when a response was received before the completion
   notification.

The current state transitions are not clearly visible, and developers
need to infer the state from the refcount's value, which is error-prone
and difficult to follow.

Thus, replace it with a state machine as follows:
 * IB_MAD_STATE_SEND_START - MAD was sent to the QP and is waiting
   for completion notification
 * IB_MAD_STATE_WAIT_RESP - MAD send completed successfully, waiting
   for a response
 * IB_MAD_STATE_EARLY_RESP - Response came early, before send
   completion notification
 * IB_MAD_STATE_DONE - MAD processing completed
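
For reference, the intended lifecycle can be sketched as follows (an
illustration derived from the description above, not part of the patch
itself):

	/*
	 * post send                              ->  SEND_START
	 * send completion, response expected     SEND_START -> WAIT_RESP
	 * send completion, no response expected  SEND_START -> DONE
	 * response before send completion        SEND_START -> EARLY_RESP
	 * response received                      WAIT_RESP  -> DONE
	 * send completion after early response   EARLY_RESP -> DONE
	 * retry                                  WAIT_RESP  -> SEND_START
	 */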

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/mad.c      | 44 ++++++++++++++----------------
 drivers/infiniband/core/mad_priv.h | 10 ++++++-
 drivers/infiniband/core/mad_rmpp.c |  7 ++---
 3 files changed, 33 insertions(+), 28 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 1fd54d5c4dd8..9b101f91ca3e 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -1118,8 +1118,7 @@ int ib_post_send_mad(struct ib_mad_send_buf *send_buf,
 		mad_send_wr->max_retries = send_buf->retries;
 		mad_send_wr->retries_left = send_buf->retries;
 		send_buf->retries = 0;
-		/* Reference for work request to QP + response */
-		mad_send_wr->refcount = 1 + (mad_send_wr->timeout > 0);
+		mad_send_wr->state = IB_MAD_STATE_SEND_START;
 		mad_send_wr->status = IB_WC_SUCCESS;
 
 		/* Reference MAD agent until send completes */
@@ -1773,9 +1772,13 @@ ib_find_send_mad(const struct ib_mad_agent_private *mad_agent_priv,
 void ib_mark_mad_done(struct ib_mad_send_wr_private *mad_send_wr)
 {
 	mad_send_wr->timeout = 0;
-	if (mad_send_wr->refcount == 1)
+	if (mad_send_wr->state == IB_MAD_STATE_WAIT_RESP) {
+		mad_send_wr->state = IB_MAD_STATE_DONE;
 		list_move_tail(&mad_send_wr->agent_list,
 			      &mad_send_wr->mad_agent_priv->done_list);
+	} else {
+		mad_send_wr->state = IB_MAD_STATE_EARLY_RESP;
+	}
 }
 
 static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
@@ -2195,6 +2198,7 @@ static void wait_for_response(struct ib_mad_send_wr_private *mad_send_wr)
 		list_item = &mad_agent_priv->wait_list;
 	}
 
+	mad_send_wr->state = IB_MAD_STATE_WAIT_RESP;
 	list_add(&mad_send_wr->agent_list, list_item);
 
 	/* Reschedule a work item if we have a shorter timeout */
@@ -2222,6 +2226,11 @@ void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr,
 
 	mad_agent_priv = mad_send_wr->mad_agent_priv;
 	spin_lock_irqsave(&mad_agent_priv->lock, flags);
+	if (mad_send_wr->state == IB_MAD_STATE_EARLY_RESP) {
+		mad_send_wr->state = IB_MAD_STATE_DONE;
+		goto done;
+	}
+
 	if (ib_mad_kernel_rmpp_agent(&mad_agent_priv->agent)) {
 		ret = ib_process_rmpp_send_wc(mad_send_wr, mad_send_wc);
 		if (ret == IB_RMPP_RESULT_CONSUMED)
@@ -2232,14 +2241,10 @@ void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr,
 	if (mad_send_wc->status != IB_WC_SUCCESS &&
 	    mad_send_wr->status == IB_WC_SUCCESS) {
 		mad_send_wr->status = mad_send_wc->status;
-		mad_send_wr->refcount -= (mad_send_wr->timeout > 0);
-	}
-
-	if (--mad_send_wr->refcount > 0) {
-		if (mad_send_wr->refcount == 1 && mad_send_wr->timeout &&
-		    mad_send_wr->status == IB_WC_SUCCESS) {
-			wait_for_response(mad_send_wr);
-		}
+	} else if (mad_send_wr->status == IB_WC_SUCCESS &&
+		   mad_send_wr->timeout &&
+		   mad_send_wr->state == IB_MAD_STATE_SEND_START) {
+		wait_for_response(mad_send_wr);
 		goto done;
 	}
 
@@ -2407,12 +2412,9 @@ static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv)
 
 	spin_lock_irqsave(&mad_agent_priv->lock, flags);
 	list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr,
-				 &mad_agent_priv->send_list, agent_list) {
-		if (mad_send_wr->status == IB_WC_SUCCESS) {
+				 &mad_agent_priv->send_list, agent_list)
+		if (mad_send_wr->status == IB_WC_SUCCESS)
 			mad_send_wr->status = IB_WC_WR_FLUSH_ERR;
-			mad_send_wr->refcount -= (mad_send_wr->timeout > 0);
-		}
-	}
 
 	/* Empty wait list to prevent receives from finding a request */
 	list_splice_init(&mad_agent_priv->wait_list, &cancel_list);
@@ -2459,7 +2461,6 @@ int ib_modify_mad(struct ib_mad_send_buf *send_buf, u32 timeout_ms)
 	struct ib_mad_agent_private *mad_agent_priv;
 	struct ib_mad_send_wr_private *mad_send_wr;
 	unsigned long flags;
-	int active;
 
 	if (!send_buf)
 		return -EINVAL;
@@ -2473,14 +2474,11 @@ int ib_modify_mad(struct ib_mad_send_buf *send_buf, u32 timeout_ms)
 		return -EINVAL;
 	}
 
-	active = (!mad_send_wr->timeout || mad_send_wr->refcount > 1);
-	if (!timeout_ms) {
+	if (!timeout_ms)
 		mad_send_wr->status = IB_WC_WR_FLUSH_ERR;
-		mad_send_wr->refcount -= (mad_send_wr->timeout > 0);
-	}
 
 	mad_send_wr->send_buf.timeout_ms = timeout_ms;
-	if (active)
+	if (mad_send_wr->state == IB_MAD_STATE_SEND_START)
 		mad_send_wr->timeout = msecs_to_jiffies(timeout_ms);
 	else
 		ib_reset_mad_timeout(mad_send_wr, timeout_ms);
@@ -2607,7 +2605,7 @@ static int retry_send(struct ib_mad_send_wr_private *mad_send_wr)
 		ret = ib_send_mad(mad_send_wr);
 
 	if (!ret) {
-		mad_send_wr->refcount++;
+		mad_send_wr->state = IB_MAD_STATE_SEND_START;
 		list_add_tail(&mad_send_wr->agent_list,
 			      &mad_send_wr->mad_agent_priv->send_list);
 	}
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 1b7445a6f671..cc2de81ea6f6 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -118,6 +118,13 @@ struct ib_mad_snoop_private {
 	struct completion comp;
 };
 
+enum ib_mad_state {
+	IB_MAD_STATE_SEND_START,
+	IB_MAD_STATE_WAIT_RESP,
+	IB_MAD_STATE_EARLY_RESP,
+	IB_MAD_STATE_DONE
+};
+
 struct ib_mad_send_wr_private {
 	struct ib_mad_list_head mad_list;
 	struct list_head agent_list;
@@ -132,7 +139,6 @@ struct ib_mad_send_wr_private {
 	int max_retries;
 	int retries_left;
 	int retry;
-	int refcount;
 	enum ib_wc_status status;
 
 	/* RMPP control */
@@ -143,6 +149,8 @@ struct ib_mad_send_wr_private {
 	int seg_num;
 	int newwin;
 	int pad;
+
+	enum ib_mad_state state;
 };
 
 struct ib_mad_local_private {
diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c
index 8af0619a39cd..dff264e9bb68 100644
--- a/drivers/infiniband/core/mad_rmpp.c
+++ b/drivers/infiniband/core/mad_rmpp.c
@@ -717,13 +717,13 @@ static void process_rmpp_ack(struct ib_mad_agent_private *agent,
 			ib_mad_complete_send_wr(mad_send_wr, &wc);
 			return;
 		}
-		if (mad_send_wr->refcount == 1)
+		if (mad_send_wr->state == IB_MAD_STATE_WAIT_RESP)
 			ib_reset_mad_timeout(mad_send_wr,
 					     mad_send_wr->send_buf.timeout_ms);
 		spin_unlock_irqrestore(&agent->lock, flags);
 		ack_ds_ack(agent, mad_recv_wc);
 		return;
-	} else if (mad_send_wr->refcount == 1 &&
+	} else if (mad_send_wr->state == IB_MAD_STATE_WAIT_RESP &&
 		   mad_send_wr->seg_num < mad_send_wr->newwin &&
 		   mad_send_wr->seg_num < mad_send_wr->send_buf.seg_count) {
 		/* Send failure will just result in a timeout/retry */
@@ -731,7 +731,7 @@ static void process_rmpp_ack(struct ib_mad_agent_private *agent,
 		if (ret)
 			goto out;
 
-		mad_send_wr->refcount++;
+		mad_send_wr->state = IB_MAD_STATE_SEND_START;
 		list_move_tail(&mad_send_wr->agent_list,
 			      &mad_send_wr->mad_agent_priv->send_list);
 	}
@@ -890,7 +890,6 @@ int ib_send_rmpp_mad(struct ib_mad_send_wr_private *mad_send_wr)
 	mad_send_wr->newwin = init_newwin(mad_send_wr);
 
 	/* We need to wait for the final ACK even if there isn't a response */
-	mad_send_wr->refcount += (mad_send_wr->timeout == 0);
 	ret = send_next_seg(mad_send_wr);
 	if (!ret)
 		return IB_RMPP_RESULT_CONSUMED;
-- 
2.47.0



* [PATCH rdma-next 2/3] IB/mad: Remove unnecessary done list by utilizing MAD states
  2024-12-03 13:52 [PATCH rdma-next 0/3] Add Flow Control for Solicited MADs Leon Romanovsky
  2024-12-03 13:52 ` [PATCH rdma-next 1/3] IB/mad: Replace MAD's refcount with a state machine Leon Romanovsky
@ 2024-12-03 13:52 ` Leon Romanovsky
  2024-12-10 19:00   ` Jason Gunthorpe
  2024-12-03 13:52 ` [PATCH rdma-next 3/3] IB/mad: Add flow control for solicited MADs Leon Romanovsky
  2 siblings, 1 reply; 8+ messages in thread
From: Leon Romanovsky @ 2024-12-03 13:52 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Or Har-Toov, linux-rdma, Maher Sanalla

From: Or Har-Toov <ohartoov@nvidia.com>

Remove the done list, which has become unnecessary with the
introduction of the `state` field.

Previously, the done list was used to ensure that MADs removed from
the wait list would still be on some list, preventing failures in
the call to `list_del` in `ib_mad_complete_send_wr`. However, with the
new state management, we can mark a MAD as done when it completes and
simply skip the list deletion for MADs that are no longer on any list.

Removing the done list eliminates unnecessary memory usage and
simplifies the code.

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/mad.c      | 13 ++++++-------
 drivers/infiniband/core/mad_priv.h |  1 -
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 9b101f91ca3e..e16bc396f6bc 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -391,7 +391,6 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 	spin_lock_init(&mad_agent_priv->lock);
 	INIT_LIST_HEAD(&mad_agent_priv->send_list);
 	INIT_LIST_HEAD(&mad_agent_priv->wait_list);
-	INIT_LIST_HEAD(&mad_agent_priv->done_list);
 	INIT_LIST_HEAD(&mad_agent_priv->rmpp_list);
 	INIT_DELAYED_WORK(&mad_agent_priv->timed_work, timeout_sends);
 	INIT_LIST_HEAD(&mad_agent_priv->local_list);
@@ -1772,13 +1771,11 @@ ib_find_send_mad(const struct ib_mad_agent_private *mad_agent_priv,
 void ib_mark_mad_done(struct ib_mad_send_wr_private *mad_send_wr)
 {
 	mad_send_wr->timeout = 0;
-	if (mad_send_wr->state == IB_MAD_STATE_WAIT_RESP) {
+	list_del(&mad_send_wr->agent_list);
+	if (mad_send_wr->state == IB_MAD_STATE_WAIT_RESP)
 		mad_send_wr->state = IB_MAD_STATE_DONE;
-		list_move_tail(&mad_send_wr->agent_list,
-			      &mad_send_wr->mad_agent_priv->done_list);
-	} else {
+	else
 		mad_send_wr->state = IB_MAD_STATE_EARLY_RESP;
-	}
 }
 
 static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
@@ -2249,7 +2246,9 @@ void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr,
 	}
 
 	/* Remove send from MAD agent and notify client of completion */
-	list_del(&mad_send_wr->agent_list);
+	if (mad_send_wr->state == IB_MAD_STATE_SEND_START)
+		list_del(&mad_send_wr->agent_list);
+
 	adjust_timeout(mad_agent_priv);
 	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index cc2de81ea6f6..4af63c1664c2 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -96,7 +96,6 @@ struct ib_mad_agent_private {
 	spinlock_t lock;
 	struct list_head send_list;
 	struct list_head wait_list;
-	struct list_head done_list;
 	struct delayed_work timed_work;
 	unsigned long timeout;
 	struct list_head local_list;
-- 
2.47.0



* [PATCH rdma-next 3/3] IB/mad: Add flow control for solicited MADs
  2024-12-03 13:52 [PATCH rdma-next 0/3] Add Flow Control for Solicited MADs Leon Romanovsky
  2024-12-03 13:52 ` [PATCH rdma-next 1/3] IB/mad: Replace MAD's refcount with a state machine Leon Romanovsky
  2024-12-03 13:52 ` [PATCH rdma-next 2/3] IB/mad: Remove unnecessary done list by utilizing MAD states Leon Romanovsky
@ 2024-12-03 13:52 ` Leon Romanovsky
  2024-12-10 20:12   ` Jason Gunthorpe
  2 siblings, 1 reply; 8+ messages in thread
From: Leon Romanovsky @ 2024-12-03 13:52 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Or Har-Toov, linux-rdma, Maher Sanalla

From: Or Har-Toov <ohartoov@nvidia.com>

Currently, MADs sent via an agent are forwarded directly to the
corresponding MAD QP layer.
MADs with a timeout value set and requiring a response (solicited MADs)
will be resent if the timeout expires without receiving a response.
In a congested subnet, flooding the MAD QP layer with more solicited
send requests from the agent will only worsen the situation by
triggering more timeouts and therefore more retries.

Thus, add flow control for non-user solicited MADs to block agents from
issuing new solicited MAD requests to the MAD QP until outstanding
requests are completed and the MAD QP is ready to process additional
requests.

Therefore, keep track of the total outstanding solicited MAD work
requests (MADs that are on the agent's send list or wait list). The
number of outstanding send WRs is limited to a fraction of the RQ size,
and any new send WR that would exceed that limit is held on a backlog
list. Backlog MADs are moved to the agent's send list only once the
total number of outstanding send WRs falls below the limit.

For this purpose, a new state is introduced:
 * IB_MAD_STATE_QUEUED - MAD is in backlog list

Unsolicited MADs, RMPP MADs and MADs which are not SA, SMP or CM are
not subject to this flow control mechanism and will not be affected by
this change.
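
For illustration, the admission check added to ib_post_send_mad() (done
under the agent lock) boils down to roughly the following; this is a
simplified sketch of the diff below and omits the case of agents with
no flow-control limit:

	if (mad_send_wr->is_solicited_fc &&
	    (!list_empty(&mad_agent_priv->backlog_list) ||
	     mad_agent_priv->sol_fc_send_count +
	     mad_agent_priv->sol_fc_wait_count >= mad_agent_priv->sol_fc_max)) {
		/* over the limit (or others already queued): park on backlog */
		list_add_tail(&mad_send_wr->agent_list,
			      &mad_agent_priv->backlog_list);
		mad_send_wr->state = IB_MAD_STATE_QUEUED;
	} else {
		/* within the limit: post to the QP as before */
		list_add_tail(&mad_send_wr->agent_list,
			      &mad_agent_priv->send_list);
		mad_send_wr->state = IB_MAD_STATE_SEND_START;
		mad_agent_priv->sol_fc_send_count += mad_send_wr->is_solicited_fc;
	}

Assuming the default receive queue size of 512 and unchanged module
parameters, this caps CM and SMP agents at 512 / 4 = 128 outstanding
solicited MADs and SA agents at 512 / 32 = 16.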

Signed-off-by: Or Har-Toov <ohartoov@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
---
 drivers/infiniband/core/mad.c      | 171 +++++++++++++++++++++++++++--
 drivers/infiniband/core/mad_priv.h |   8 ++
 2 files changed, 171 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index e16bc396f6bc..86e846b12e2f 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -210,6 +210,29 @@ int ib_response_mad(const struct ib_mad_hdr *hdr)
 }
 EXPORT_SYMBOL(ib_response_mad);
 
+#define SOL_FC_MAX_DEFAULT_FRAC 4
+#define SOL_FC_MAX_SA_FRAC 32
+
+static int get_sol_fc_max_outstanding(struct ib_mad_reg_req *mad_reg_req)
+{
+	if (!mad_reg_req)
+		/* Send only agent */
+		return mad_recvq_size / SOL_FC_MAX_DEFAULT_FRAC;
+
+	switch (mad_reg_req->mgmt_class) {
+	case IB_MGMT_CLASS_CM:
+		return mad_recvq_size / SOL_FC_MAX_DEFAULT_FRAC;
+	case IB_MGMT_CLASS_SUBN_ADM:
+		return mad_recvq_size / SOL_FC_MAX_SA_FRAC;
+	case IB_MGMT_CLASS_SUBN_LID_ROUTED:
+	case IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE:
+		return min(mad_recvq_size, IB_MAD_QP_RECV_SIZE) /
+		       SOL_FC_MAX_DEFAULT_FRAC;
+	default:
+		return 0;
+	}
+}
+
 /*
  * ib_register_mad_agent - Register to send/receive MADs
  *
@@ -392,12 +415,15 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
 	INIT_LIST_HEAD(&mad_agent_priv->send_list);
 	INIT_LIST_HEAD(&mad_agent_priv->wait_list);
 	INIT_LIST_HEAD(&mad_agent_priv->rmpp_list);
+	INIT_LIST_HEAD(&mad_agent_priv->backlog_list);
 	INIT_DELAYED_WORK(&mad_agent_priv->timed_work, timeout_sends);
 	INIT_LIST_HEAD(&mad_agent_priv->local_list);
 	INIT_WORK(&mad_agent_priv->local_work, local_completions);
 	refcount_set(&mad_agent_priv->refcount, 1);
 	init_completion(&mad_agent_priv->comp);
 
+	mad_agent_priv->sol_fc_max =
+		get_sol_fc_max_outstanding(mad_reg_req);
 	ret2 = ib_mad_agent_security_setup(&mad_agent_priv->agent, qp_type);
 	if (ret2) {
 		ret = ERR_PTR(ret2);
@@ -1054,6 +1080,43 @@ int ib_send_mad(struct ib_mad_send_wr_private *mad_send_wr)
 	return ret;
 }
 
+static bool is_solicited_fc_mad(struct ib_mad_send_wr_private *mad_send_wr)
+{
+	struct ib_rmpp_mad *rmpp_mad;
+	u8 mgmt_class;
+
+	if (!mad_send_wr->timeout)
+		return 0;
+
+	rmpp_mad = mad_send_wr->send_buf.mad;
+	if (mad_send_wr->mad_agent_priv->agent.rmpp_version &&
+	    (ib_get_rmpp_flags(&rmpp_mad->rmpp_hdr) & IB_MGMT_RMPP_FLAG_ACTIVE))
+		return 0;
+
+	mgmt_class =
+		((struct ib_mad_hdr *)mad_send_wr->send_buf.mad)->mgmt_class;
+	return mgmt_class == IB_MGMT_CLASS_CM ||
+	       mgmt_class == IB_MGMT_CLASS_SUBN_ADM ||
+	       mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED ||
+	       mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE;
+}
+
+static bool mad_is_for_backlog(struct ib_mad_send_wr_private *mad_send_wr)
+{
+	struct ib_mad_agent_private *mad_agent_priv =
+		mad_send_wr->mad_agent_priv;
+
+	if (!mad_send_wr->is_solicited_fc || !mad_agent_priv->sol_fc_max)
+		return false;
+
+	if (!list_empty(&mad_agent_priv->backlog_list))
+		return true;
+
+	return mad_agent_priv->sol_fc_send_count +
+		       mad_agent_priv->sol_fc_wait_count >=
+	       mad_agent_priv->sol_fc_max;
+}
+
 /*
  * ib_post_send_mad - Posts MAD(s) to the send queue of the QP associated
  *  with the registered client
@@ -1117,14 +1180,26 @@ int ib_post_send_mad(struct ib_mad_send_buf *send_buf,
 		mad_send_wr->max_retries = send_buf->retries;
 		mad_send_wr->retries_left = send_buf->retries;
 		send_buf->retries = 0;
-		mad_send_wr->state = IB_MAD_STATE_SEND_START;
 		mad_send_wr->status = IB_WC_SUCCESS;
 
 		/* Reference MAD agent until send completes */
 		refcount_inc(&mad_agent_priv->refcount);
 		spin_lock_irqsave(&mad_agent_priv->lock, flags);
+
+		mad_send_wr->is_solicited_fc = is_solicited_fc_mad(mad_send_wr);
+		if (mad_is_for_backlog(mad_send_wr)) {
+			list_add_tail(&mad_send_wr->agent_list,
+				      &mad_agent_priv->backlog_list);
+			mad_send_wr->state = IB_MAD_STATE_QUEUED;
+			spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
+			return 0;
+		}
+
 		list_add_tail(&mad_send_wr->agent_list,
 			      &mad_agent_priv->send_list);
+		mad_send_wr->state = IB_MAD_STATE_SEND_START;
+		mad_agent_priv->sol_fc_send_count +=
+			mad_send_wr->is_solicited_fc;
 		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 
 		if (ib_mad_kernel_rmpp_agent(&mad_agent_priv->agent)) {
@@ -1136,6 +1211,8 @@ int ib_post_send_mad(struct ib_mad_send_buf *send_buf,
 		if (ret < 0) {
 			/* Fail send request */
 			spin_lock_irqsave(&mad_agent_priv->lock, flags);
+			mad_agent_priv->sol_fc_send_count -=
+				mad_send_wr->is_solicited_fc;
 			list_del(&mad_send_wr->agent_list);
 			spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 			deref_mad_agent(mad_agent_priv);
@@ -1768,14 +1845,59 @@ ib_find_send_mad(const struct ib_mad_agent_private *mad_agent_priv,
 	return NULL;
 }
 
+static void
+process_mad_from_backlog(struct ib_mad_agent_private *mad_agent_priv)
+{
+	struct ib_mad_send_wr_private *mad_send_wr;
+	struct ib_mad_send_wc mad_send_wc = {};
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&mad_agent_priv->lock, flags);
+	while (!list_empty(&mad_agent_priv->backlog_list) &&
+	       (mad_agent_priv->sol_fc_send_count +
+			mad_agent_priv->sol_fc_wait_count <
+		mad_agent_priv->sol_fc_max)) {
+		mad_send_wr = list_entry(mad_agent_priv->backlog_list.next,
+					 struct ib_mad_send_wr_private,
+					 agent_list);
+		list_move_tail(&mad_send_wr->agent_list,
+			       &mad_agent_priv->send_list);
+		mad_agent_priv->sol_fc_send_count++;
+		mad_send_wr->state = IB_MAD_STATE_SEND_START;
+		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
+		ret = ib_send_mad(mad_send_wr);
+		if (!ret)
+			return;
+
+		spin_lock_irqsave(&mad_agent_priv->lock, flags);
+		deref_mad_agent(mad_agent_priv);
+		mad_agent_priv->sol_fc_send_count--;
+		list_del(&mad_send_wr->agent_list);
+		spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
+		mad_send_wc.send_buf = &mad_send_wr->send_buf;
+		mad_send_wc.status = IB_WC_LOC_QP_OP_ERR;
+		mad_agent_priv->agent.send_handler(&mad_agent_priv->agent,
+						   &mad_send_wc);
+		spin_lock_irqsave(&mad_agent_priv->lock, flags);
+	}
+
+	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
+}
+
 void ib_mark_mad_done(struct ib_mad_send_wr_private *mad_send_wr)
 {
 	mad_send_wr->timeout = 0;
 	list_del(&mad_send_wr->agent_list);
-	if (mad_send_wr->state == IB_MAD_STATE_WAIT_RESP)
+	if (mad_send_wr->state == IB_MAD_STATE_WAIT_RESP) {
+		mad_send_wr->mad_agent_priv->sol_fc_wait_count -=
+			mad_send_wr->is_solicited_fc;
 		mad_send_wr->state = IB_MAD_STATE_DONE;
-	else
+	} else {
+		mad_send_wr->mad_agent_priv->sol_fc_send_count -=
+			mad_send_wr->is_solicited_fc;
 		mad_send_wr->state = IB_MAD_STATE_EARLY_RESP;
+	}
 }
 
 static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
@@ -2177,7 +2299,7 @@ static void wait_for_response(struct ib_mad_send_wr_private *mad_send_wr)
 	unsigned long delay;
 
 	mad_agent_priv = mad_send_wr->mad_agent_priv;
-	list_del(&mad_send_wr->agent_list);
+	list_del_init(&mad_send_wr->agent_list);
 
 	delay = mad_send_wr->timeout;
 	mad_send_wr->timeout += jiffies;
@@ -2195,6 +2317,16 @@ static void wait_for_response(struct ib_mad_send_wr_private *mad_send_wr)
 		list_item = &mad_agent_priv->wait_list;
 	}
 
+	if (mad_send_wr->state == IB_MAD_STATE_SEND_START) {
+		if (mad_send_wr->is_solicited_fc) {
+			mad_agent_priv->sol_fc_send_count--;
+			mad_agent_priv->sol_fc_wait_count++;
+		}
+	} else if (mad_send_wr->state == IB_MAD_STATE_QUEUED) {
+		mad_agent_priv->sol_fc_wait_count +=
+			mad_send_wr->is_solicited_fc;
+	}
+
 	mad_send_wr->state = IB_MAD_STATE_WAIT_RESP;
 	list_add(&mad_send_wr->agent_list, list_item);
 
@@ -2246,19 +2378,25 @@ void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr,
 	}
 
 	/* Remove send from MAD agent and notify client of completion */
-	if (mad_send_wr->state == IB_MAD_STATE_SEND_START)
+	if (mad_send_wr->state == IB_MAD_STATE_SEND_START) {
 		list_del(&mad_send_wr->agent_list);
+		mad_agent_priv->sol_fc_send_count -=
+			mad_send_wr->is_solicited_fc;
+	}
 
 	adjust_timeout(mad_agent_priv);
 	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 
 	if (mad_send_wr->status != IB_WC_SUCCESS)
 		mad_send_wc->status = mad_send_wr->status;
-	if (ret == IB_RMPP_RESULT_INTERNAL)
+	if (ret == IB_RMPP_RESULT_INTERNAL) {
 		ib_rmpp_send_handler(mad_send_wc);
-	else
+	} else {
+		if (mad_send_wr->is_solicited_fc)
+			process_mad_from_backlog(mad_agent_priv);
 		mad_agent_priv->agent.send_handler(&mad_agent_priv->agent,
 						   mad_send_wc);
+	}
 
 	/* Release reference on agent taken when sending */
 	deref_mad_agent(mad_agent_priv);
@@ -2417,6 +2555,8 @@ static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv)
 
 	/* Empty wait list to prevent receives from finding a request */
 	list_splice_init(&mad_agent_priv->wait_list, &cancel_list);
+	mad_agent_priv->sol_fc_wait_count = 0;
+	list_splice_tail_init(&mad_agent_priv->backlog_list, &cancel_list);
 	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);
 
 	/* Report all cancelled requests */
@@ -2452,6 +2592,13 @@ find_send_wr(struct ib_mad_agent_private *mad_agent_priv,
 		    &mad_send_wr->send_buf == send_buf)
 			return mad_send_wr;
 	}
+
+	list_for_each_entry(mad_send_wr, &mad_agent_priv->backlog_list,
+			    agent_list) {
+		if (&mad_send_wr->send_buf == send_buf)
+			return mad_send_wr;
+	}
+
 	return NULL;
 }
 
@@ -2477,7 +2624,8 @@ int ib_modify_mad(struct ib_mad_send_buf *send_buf, u32 timeout_ms)
 		mad_send_wr->status = IB_WC_WR_FLUSH_ERR;
 
 	mad_send_wr->send_buf.timeout_ms = timeout_ms;
-	if (mad_send_wr->state == IB_MAD_STATE_SEND_START)
+	if (mad_send_wr->state == IB_MAD_STATE_SEND_START ||
+	    (mad_send_wr->state == IB_MAD_STATE_QUEUED && timeout_ms))
 		mad_send_wr->timeout = msecs_to_jiffies(timeout_ms);
 	else
 		ib_reset_mad_timeout(mad_send_wr, timeout_ms);
@@ -2607,7 +2755,10 @@ static int retry_send(struct ib_mad_send_wr_private *mad_send_wr)
 		mad_send_wr->state = IB_MAD_STATE_SEND_START;
 		list_add_tail(&mad_send_wr->agent_list,
 			      &mad_send_wr->mad_agent_priv->send_list);
+		mad_send_wr->mad_agent_priv->sol_fc_send_count +=
+			mad_send_wr->is_solicited_fc;
 	}
+
 	return ret;
 }
 
@@ -2641,6 +2792,8 @@ static void timeout_sends(struct work_struct *work)
 		}
 
 		list_del_init(&mad_send_wr->agent_list);
+		mad_agent_priv->sol_fc_wait_count -=
+			mad_send_wr->is_solicited_fc;
 		if (mad_send_wr->status == IB_WC_SUCCESS &&
 		    !retry_send(mad_send_wr))
 			continue;
@@ -2655,6 +2808,8 @@ static void timeout_sends(struct work_struct *work)
 		else
 			mad_send_wc.status = mad_send_wr->status;
 		mad_send_wc.send_buf = &mad_send_wr->send_buf;
+		if (mad_send_wr->is_solicited_fc)
+			process_mad_from_backlog(mad_agent_priv);
 		mad_agent_priv->agent.send_handler(&mad_agent_priv->agent,
 						   &mad_send_wc);
 		deref_mad_agent(mad_agent_priv);
diff --git a/drivers/infiniband/core/mad_priv.h b/drivers/infiniband/core/mad_priv.h
index 4af63c1664c2..b2a12a82a62d 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -95,12 +95,16 @@ struct ib_mad_agent_private {
 
 	spinlock_t lock;
 	struct list_head send_list;
+	unsigned int sol_fc_send_count;
 	struct list_head wait_list;
+	unsigned int sol_fc_wait_count;
 	struct delayed_work timed_work;
 	unsigned long timeout;
 	struct list_head local_list;
 	struct work_struct local_work;
 	struct list_head rmpp_list;
+	unsigned int sol_fc_max;
+	struct list_head backlog_list;
 
 	refcount_t refcount;
 	union {
@@ -118,6 +122,7 @@ struct ib_mad_snoop_private {
 };
 
 enum ib_mad_state {
+	IB_MAD_STATE_QUEUED,
 	IB_MAD_STATE_SEND_START,
 	IB_MAD_STATE_WAIT_RESP,
 	IB_MAD_STATE_EARLY_RESP,
@@ -150,6 +155,9 @@ struct ib_mad_send_wr_private {
 	int pad;
 
 	enum ib_mad_state state;
+
+	/* Solicited MAD flow control */
+	bool is_solicited_fc;
 };
 
 struct ib_mad_local_private {
-- 
2.47.0



* Re: [PATCH rdma-next 1/3] IB/mad: Replace MAD's refcount with a state machine
  2024-12-03 13:52 ` [PATCH rdma-next 1/3] IB/mad: Replace MAD's refcount with a state machine Leon Romanovsky
@ 2024-12-10 16:12   ` Jason Gunthorpe
  0 siblings, 0 replies; 8+ messages in thread
From: Jason Gunthorpe @ 2024-12-10 16:12 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Or Har-Toov, linux-rdma, Maher Sanalla

On Tue, Dec 03, 2024 at 03:52:21PM +0200, Leon Romanovsky wrote:

> Thus, replace with a state machine as the following:
>  * IB_MAD_STATE_SEND_START - MAD was sent to the QP and is waiting
>    for completion notification
>  * IB_MAD_STATE_WAIT_RESP - MAD send completed successfully, waiting
>    for a response
>  * IB_MAD_STATE_EARLY_RESP - Response came early, before send
>    completion notification
>  * IB_MAD_STATE_DONE - MAD processing completed

This is a good idea, that refcounting stuff is totally inscrutable.

> @@ -1773,9 +1772,13 @@ ib_find_send_mad(const struct ib_mad_agent_private *mad_agent_priv,
>  void ib_mark_mad_done(struct ib_mad_send_wr_private *mad_send_wr)
>  {
>  	mad_send_wr->timeout = 0;
> -	if (mad_send_wr->refcount == 1)
> +	if (mad_send_wr->state == IB_MAD_STATE_WAIT_RESP) {
> +		mad_send_wr->state = IB_MAD_STATE_DONE;
>  		list_move_tail(&mad_send_wr->agent_list,
>  			      &mad_send_wr->mad_agent_priv->done_list);
> +	} else {
> +		mad_send_wr->state = IB_MAD_STATE_EARLY_RESP;

Let's be even more explicit on all these transitions:

      	      WARN_ON(mad_send_wr->state != IB_MAD_STATE_SEND_START);

Maybe with some macro compiled out unless lockdep is on or
something. But there are clear rules here for how the FSM should work,
let's document and check them.
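
(One hypothetical shape of such a check, purely illustrative - the
ib_mad_check_state name is made up:

	#ifdef CONFIG_LOCKDEP
	#define ib_mad_check_state(wr, expected) \
		WARN_ON((wr)->state != (expected))
	#else
	#define ib_mad_check_state(wr, expected) do { } while (0)
	#endif

The point is only that the checks could compile away in production
builds.)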

Let me try to guess what they should be at each point..

>  static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
> @@ -2195,6 +2198,7 @@ static void wait_for_response(struct ib_mad_send_wr_private *mad_send_wr)
>  		list_item = &mad_agent_priv->wait_list;
>  	}
>  
> +	mad_send_wr->state = IB_MAD_STATE_WAIT_RESP;
>  	list_add(&mad_send_wr->agent_list, list_item);

      	WARN_ON(mad_send_wr->state != IB_MAD_STATE_SEND_START);

> @@ -2222,6 +2226,11 @@ void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr,
>  
>  	mad_agent_priv = mad_send_wr->mad_agent_priv;
>  	spin_lock_irqsave(&mad_agent_priv->lock, flags);
> +	if (mad_send_wr->state == IB_MAD_STATE_EARLY_RESP) {
> +		mad_send_wr->state = IB_MAD_STATE_DONE;
> +		goto done;
> +	}

     else
      	      WARN_ON(mad_send_wr->state != IB_MAD_STATE_SEND_START);


This flow doesn't seem to touch the agent_list? If we got to
EARLY_RESP then mad_done didn't remove it from the agent list and we
goto done which skips it? Seems like a bug?

> @@ -2232,14 +2241,10 @@ void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr,
>  	if (mad_send_wc->status != IB_WC_SUCCESS &&
>  	    mad_send_wr->status == IB_WC_SUCCESS) {
>  		mad_send_wr->status = mad_send_wc->status;

mad_send_wr->state = IB_MAD_STATE_DONE 

??

> -		mad_send_wr->refcount -= (mad_send_wr->timeout > 0);
> -	}
> -
> -	if (--mad_send_wr->refcount > 0) {
> -		if (mad_send_wr->refcount == 1 && mad_send_wr->timeout &&
> -		    mad_send_wr->status == IB_WC_SUCCESS) {
> -			wait_for_response(mad_send_wr);
> -		}
> +	} else if (mad_send_wr->status == IB_WC_SUCCESS &&
> +		   mad_send_wr->timeout &&
> +		   mad_send_wr->state == IB_MAD_STATE_SEND_START) {
> +		wait_for_response(mad_send_wr);
>  		goto done;
>  	}

This sequence looks a little off: if we got an EARLY_RESP then we skip
everything else. It is possible for the send WC to indicate failure even
if we got a matching response, and all that checking got skipped.

Maybe like this:

	if (mad_send_wc->status != IB_WC_SUCCESS &&
	    mad_send_wr->status == IB_WC_SUCCESS) {
		mad_send_wr->status = mad_send_wc->status;
		WARN_ON(mad_send_wr->state != IB_MAD_STATE_EARLY_RESP &&
			mad_send_wr->state != IB_MAD_STATE_SEND_START);
	} else if (mad_send_wr->status == IB_WC_SUCCESS &&
		   mad_send_wr->timeout) {
		if (mad_send_wr->state == IB_MAD_STATE_SEND_START) {
			wait_for_response(mad_send_wr);
			goto done;
		}
		WARN_ON(mad_send_wr->state == IB_MAD_STATE_EARLY_RESP);
	}

	/* Remove send from MAD agent and notify client of completion */
	mad_send_wr->state = IB_MAD_STATE_DONE;
	list_del(&mad_send_wr->agent_list);

 
> @@ -2407,12 +2412,9 @@ static void cancel_mads(struct ib_mad_agent_private *mad_agent_priv)
>  
>  	spin_lock_irqsave(&mad_agent_priv->lock, flags);
>  	list_for_each_entry_safe(mad_send_wr, temp_mad_send_wr,
> -				 &mad_agent_priv->send_list, agent_list) {
> -		if (mad_send_wr->status == IB_WC_SUCCESS) {
> +				 &mad_agent_priv->send_list, agent_list)
> +		if (mad_send_wr->status == IB_WC_SUCCESS)
>  			mad_send_wr->status = IB_WC_WR_FLUSH_ERR;
> -			mad_send_wr->refcount -= (mad_send_wr->timeout > 0);
> -		}

That status check feels like it should be a state check? Should it be
a new state instead of storing status?

What states are valid here anyhow? IB_MAD_STATE_SEND_START and
IB_MAD_STATE_EARLY_RESP I guess?

> @@ -2459,7 +2461,6 @@ int ib_modify_mad(struct ib_mad_send_buf *send_buf, u32 timeout_ms)
>  	struct ib_mad_agent_private *mad_agent_priv;
>  	struct ib_mad_send_wr_private *mad_send_wr;
>  	unsigned long flags;
> -	int active;
>  
>  	if (!send_buf)
>  		return -EINVAL;
> @@ -2473,14 +2474,11 @@ int ib_modify_mad(struct ib_mad_send_buf *send_buf, u32 timeout_ms)
>  		return -EINVAL;
>  	}
>  
> -	active = (!mad_send_wr->timeout || mad_send_wr->refcount > 1);
> -	if (!timeout_ms) {
> +	if (!timeout_ms)
>  		mad_send_wr->status = IB_WC_WR_FLUSH_ERR;
> -		mad_send_wr->refcount -= (mad_send_wr->timeout > 0);
> -	}

Here again, should FLUSH_ERR be its own state? This is basically
canceling the mad right? Then why do we reset the timeout?

>  	mad_send_wr->send_buf.timeout_ms = timeout_ms;
> -	if (active)
> +	if (mad_send_wr->state == IB_MAD_STATE_SEND_START)
>  		mad_send_wr->timeout = msecs_to_jiffies(timeout_ms);
>  	else
>  		ib_reset_mad_timeout(mad_send_wr, timeout_ms);

So this else is IB_MAD_STATE_WAIT_RESP ?

> @@ -2607,7 +2605,7 @@ static int retry_send(struct ib_mad_send_wr_private *mad_send_wr)
>  		ret = ib_send_mad(mad_send_wr);
>  
>  	if (!ret) {
> -		mad_send_wr->refcount++;
> +		mad_send_wr->state = IB_MAD_STATE_SEND_START;

      	      WARN_ON(mad_send_wr->state != IB_MAD_STATE_WAIT_RESP);

?

Does the caller reliably ensure this?

> diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c
> index 8af0619a39cd..dff264e9bb68 100644
> --- a/drivers/infiniband/core/mad_rmpp.c
> +++ b/drivers/infiniband/core/mad_rmpp.c
> @@ -717,13 +717,13 @@ static void process_rmpp_ack(struct ib_mad_agent_private *agent,
>  			ib_mad_complete_send_wr(mad_send_wr, &wc);
>  			return;
>  		}
> -		if (mad_send_wr->refcount == 1)
> +		if (mad_send_wr->state == IB_MAD_STATE_WAIT_RESP)
>  			ib_reset_mad_timeout(mad_send_wr,
>  					     mad_send_wr->send_buf.timeout_ms);

else
      	      WARN_ON(mad_send_wr->state != IB_MAD_STATE_SEND_START);

>  		spin_unlock_irqrestore(&agent->lock, flags);
>  		ack_ds_ack(agent, mad_recv_wc);
>  		return;
> -	} else if (mad_send_wr->refcount == 1 &&
> +	} else if (mad_send_wr->state == IB_MAD_STATE_WAIT_RESP &&
>  		   mad_send_wr->seg_num < mad_send_wr->newwin &&
>  		   mad_send_wr->seg_num < mad_send_wr->send_buf.seg_count) {
>  		/* Send failure will just result in a timeout/retry */
> @@ -731,7 +731,7 @@ static void process_rmpp_ack(struct ib_mad_agent_private *agent,
>  		if (ret)
>  			goto out;
>  
> -		mad_send_wr->refcount++;
> +		mad_send_wr->state = IB_MAD_STATE_SEND_START;

      	      WARN_ON(mad_send_wr->state != IB_MAD_STATE_WAIT_RESP);

?

>  		list_move_tail(&mad_send_wr->agent_list,
>  			      &mad_send_wr->mad_agent_priv->send_list);
>  	}
> @@ -890,7 +890,6 @@ int ib_send_rmpp_mad(struct ib_mad_send_wr_private *mad_send_wr)
>  	mad_send_wr->newwin = init_newwin(mad_send_wr);
>  
>  	/* We need to wait for the final ACK even if there isn't a response */
> -	mad_send_wr->refcount += (mad_send_wr->timeout == 0);

      WARN_ON(mad_send_wr->state != IB_MAD_STATE_WAIT_RESP); 
??

Jason


* Re: [PATCH rdma-next 2/3] IB/mad: Remove unnecessary done list by utilizing MAD states
  2024-12-03 13:52 ` [PATCH rdma-next 2/3] IB/mad: Remove unnecessary done list by utilizing MAD states Leon Romanovsky
@ 2024-12-10 19:00   ` Jason Gunthorpe
  2024-12-10 23:51     ` Or Har-Toov
  0 siblings, 1 reply; 8+ messages in thread
From: Jason Gunthorpe @ 2024-12-10 19:00 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Or Har-Toov, linux-rdma, Maher Sanalla

On Tue, Dec 03, 2024 at 03:52:22PM +0200, Leon Romanovsky wrote:
> From: Or Har-Toov <ohartoov@nvidia.com>
> 
> Remove the done list, which has become unnecessary with the
> introduction of the `state` parameter.
> 
> Previously, the done list was used to ensure that MADs removed from
> the wait list would still be in some list, preventing failures in
> the call to `list_del` in `ib_mad_complete_send_wr`. 

Yuk, that is a terrible reason for this. list_del_init() would solve
that problem.

> @@ -1772,13 +1771,11 @@ ib_find_send_mad(const struct ib_mad_agent_private *mad_agent_priv,
>  void ib_mark_mad_done(struct ib_mad_send_wr_private *mad_send_wr)
>  {
>  	mad_send_wr->timeout = 0;
> -	if (mad_send_wr->state == IB_MAD_STATE_WAIT_RESP) {
> +	list_del(&mad_send_wr->agent_list);

This is doing more than the commit message says: we are now changing
when the mad is removed from the list. Here you are removing it as soon
as it becomes done, or semi-done, instead of letting
ib_mad_complete_send_wr() always remove it.

I couldn't find a reason it is not OK, but I think it should be in the
commit message.

>  static void ib_mad_complete_recv(struct ib_mad_agent_private *mad_agent_priv,
> @@ -2249,7 +2246,9 @@ void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr,
>  	}
>  
>  	/* Remove send from MAD agent and notify client of completion */
> -	list_del(&mad_send_wr->agent_list);
> +	if (mad_send_wr->state == IB_MAD_STATE_SEND_START)
> +		list_del(&mad_send_wr->agent_list);
> +

This extra if is confusing now. There are two callers to
ib_mad_complete_send_wr(): the receive path did ib_mark_mad_done()
first, so the state should be DONE or EARLY_RESP and the list_del was
done already.

The other one is send completion which should have state be SEND_START
*and* we hit an error making the send, then we remove it from the list?

Again I think this needs to go further and stop using ->status as part
of the FSM too.

Trying again, maybe like this:

	spin_lock_irqsave(&mad_agent_priv->lock, flags);
	if (ib_mad_kernel_rmpp_agent(&mad_agent_priv->agent)) {
		ret = ib_process_rmpp_send_wc(mad_send_wr, mad_send_wc);
		if (ret == IB_RMPP_RESULT_CONSUMED)
			goto done;
	} else
		ret = IB_RMPP_RESULT_UNHANDLED;

	if (mad_send_wr->state == IB_MAD_STATE_SEND_START) {
		if (mad_send_wc->status != IB_WC_SUCCESS &&
		    mad_send_wr->timeout) {
			wait_for_response(mad_send_wr);
			goto done;
		    }
		/* Otherwise error posting send */
		list_del(&mad_send_wr->agent_list);
	}

	WARN_ON(mad_send_wr->state != IB_MAD_STATE_EARLY_RESP &&
		mad_send_wr->state != IB_MAD_STATE_DONE);

	mad_send_wr->state = IB_MAD_STATE_DONE;
	mad_send_wr->status = mad_send_wc->status;
	adjust_timeout(mad_agent_priv);
	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);


Though that might require changing cancel_mad too, as in the other
message, I think with the FSM cancel_mad should put the state to DONE
and leave the status alone.

This status fiddling is probably another patch.

Jason


* Re: [PATCH rdma-next 3/3] IB/mad: Add flow control for solicited MADs
  2024-12-03 13:52 ` [PATCH rdma-next 3/3] IB/mad: Add flow control for solicited MADs Leon Romanovsky
@ 2024-12-10 20:12   ` Jason Gunthorpe
  0 siblings, 0 replies; 8+ messages in thread
From: Jason Gunthorpe @ 2024-12-10 20:12 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Or Har-Toov, linux-rdma, Maher Sanalla

On Tue, Dec 03, 2024 at 03:52:23PM +0200, Leon Romanovsky wrote:
> From: Or Har-Toov <ohartoov@nvidia.com>
> 
> Currently, MADs sent via an agent are being forwarded directly to the
> corresponding MAD QP layer.
> MADs with a timeout value set and requiring a response (solicited MADs)
> will be resent if the timeout expires without receiving a response.
> In a congested subnet, flooding MAD QP layer with more solicited send
> requests from the agent will only worsen the situation by triggering
> more timeouts and therefore more retries.

This explanation does not really capture what this patch is supposed
to be doing. The point of solicited MADs is that they require a reply,
and the purpose of this patch is to try and conserve a reply slot in
the receive queue for the reply to land. Sending more requests than
the kernel has reply buffers is probably going to overflow the
HCA's receive queue.

Jason


* RE: [PATCH rdma-next 2/3] IB/mad: Remove unnecessary done list by utilizing MAD states
  2024-12-10 19:00   ` Jason Gunthorpe
@ 2024-12-10 23:51     ` Or Har-Toov
  0 siblings, 0 replies; 8+ messages in thread
From: Or Har-Toov @ 2024-12-10 23:51 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky
  Cc: linux-rdma@vger.kernel.org, Maher Sanalla

From: Jason Gunthorpe <jgg@nvidia.com> 
Sent: Tuesday, 10 December 2024 21:01
To: Leon Romanovsky <leon@kernel.org>
Cc: Or Har-Toov <ohartoov@nvidia.com>; linux-rdma@vger.kernel.org; Maher Sanalla <msanalla@nvidia.com>
Subject: Re: [PATCH rdma-next 2/3] IB/mad: Remove unnecessary done list by utilizing MAD states

On Tue, Dec 03, 2024 at 03:52:22PM +0200, Leon Romanovsky wrote:
> From: Or Har-Toov <ohartoov@nvidia.com>
> 
> Remove the done list, which has become unnecessary with the 
> introduction of the `state` parameter.
> 
> Previously, the done list was used to ensure that MADs removed from 
> the wait list would still be in some list, preventing failures in the 
> call to `list_del` in `ib_mad_complete_send_wr`.

Yuk, that is a terrible reason for this. list_del_init() would solve that problem.

> @@ -1772,13 +1771,11 @@ ib_find_send_mad(const struct 
> ib_mad_agent_private *mad_agent_priv,  void ib_mark_mad_done(struct 
> ib_mad_send_wr_private *mad_send_wr)  {
>  	mad_send_wr->timeout = 0;
> -	if (mad_send_wr->state == IB_MAD_STATE_WAIT_RESP) {
> +	list_del(&mad_send_wr->agent_list);

This is doing more than the commit message says, we are now changing the order for when the mad is in the list, here you are removing it as soon as it becomes done, or semi-done instead of letting
ib_mad_complete_send_wr() always remove it.

I couldn't find a reason it is not OK, but I think it should be in the commit message.

>  static void ib_mad_complete_recv(struct ib_mad_agent_private 
> *mad_agent_priv, @@ -2249,7 +2246,9 @@ void ib_mad_complete_send_wr(struct ib_mad_send_wr_private *mad_send_wr,
>  	}
>  
>  	/* Remove send from MAD agent and notify client of completion */
> -	list_del(&mad_send_wr->agent_list);
> +	if (mad_send_wr->state == IB_MAD_STATE_SEND_START)
> +		list_del(&mad_send_wr->agent_list);
> +

This extra if is confusing now.. There are two callers to ib_mad_complete_send_wr(), the receive path did ib_mark_mad_done() first so state should be DONE or EARLY_RESP and the list_del was done already.

The other one is send completion which should have state be SEND_START
*and* we hit an error making the send, then we remove it from the list?

Again I think this needs to go further and stop using ->status as part of the FSM too.

Trying again, maybe like this:

	spin_lock_irqsave(&mad_agent_priv->lock, flags);
	if (ib_mad_kernel_rmpp_agent(&mad_agent_priv->agent)) {
		ret = ib_process_rmpp_send_wc(mad_send_wr, mad_send_wc);
		if (ret == IB_RMPP_RESULT_CONSUMED)
			goto done;
	} else
		ret = IB_RMPP_RESULT_UNHANDLED;

	if (mad_send_wr->state == IB_MAD_STATE_SEND_START) {
		if (mad_send_wc->status != IB_WC_SUCCESS &&
		    mad_send_wr->timeout) {
			wait_for_response(mad_send_wr);
			goto done;
		    }
		/* Otherwise error posting send */
		list_del(&mad_send_wr->agent_list);
	}

	WARN_ON(mad_send_wr->state != IB_MAD_STATE_EARLY_RESP &&
		mad_send_wr->state != IB_MAD_STATE_DONE);

	mad_send_wr->state = IB_MAD_STATE_DONE;
	mad_send_wr->status = mad_send_wc->status;
	adjust_timeout(mad_agent_priv);
	spin_unlock_irqrestore(&mad_agent_priv->lock, flags);


Though that might require changing cancel_mad too, as in the other message, I think with the FSM cancel_mad should put the state to DONE and leave the status alone.

>>OH: Wouldn't it be better to remove the status from the wr entirely and add a state of "canceled"? We can't just set the state to DONE in cancel_mad because we need to know to set the wc status to FLUSH_ERR.
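
(For illustration, one hypothetical shape of that idea - not code from
this series - would be an extra IB_MAD_STATE_CANCELED member in
enum ib_mad_state, with the cancel paths setting it instead of touching
->status and the completion path mapping it back:

	if (mad_send_wr->state == IB_MAD_STATE_CANCELED)
		mad_send_wc.status = IB_WC_WR_FLUSH_ERR;
)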

This status fiddling is probably another patch.

Jason

