Distributed Replicated Block Device (DRBD) development
* [PATCH] rdma: Fix cm leak
@ 2025-04-25 10:24 zhengbing.huang
  2025-05-05 14:26 ` Philipp Reisner
  2025-05-05 14:26 ` [PATCH 1/1] " Philipp Reisner
  0 siblings, 2 replies; 4+ messages in thread
From: zhengbing.huang @ 2025-04-25 10:24 UTC
  To: drbd-dev

We found that when all DRBD devices are down, the reference count
of the drbd_transport_rdma module is still 1.

[root@node-4 ~]# drbdadm status
No currently configured DRBD found.
[root@node-4 ~]# lsmod | grep drbd
drbd_transport_rdma   262144  1

Then we found an unreleased cm structure and discovered
that its state is DSB_CONNECT_REQ + DSB_ERROR.

crash> struct dtr_cm ffff57e515da9400
struct dtr_cm {
  kref = {
    refcount = {
      refs = {
        counter = 1
...
state = 9,
...
}

The scenario appears to be as follows:
dtr_cma_event_handler() gets an RDMA_CM_EVENT_CONNECT_REQUEST event
and calls dtr_cma_accept() to allocate a cm, which sets
cm->state = DSM_CONNECT_REQ. At this point the cm->kref count is 2.
Then dtr_cma_event_handler() gets an xxx_CONNECT_ERROR/xxx_UNREACHABLE/
xxx_REJECTED event and calls set_bit(DSB_ERROR, &cm->state).
dtr_cma_retry_connect() removes the cm from the path and puts one ref.
Because cm->state does not have the DSB_CONNECTING flag set, the
handler returns 0 and keeps the remaining reference.
Now the cm->kref count is 1 and the state is DSB_CONNECT_REQ + DSB_ERROR.

Therefore, wherever we test the DSB_CONNECTING flag,
we should also test the DSB_CONNECT_REQ flag to avoid leaking the cm.

Signed-off-by: zhengbing.huang <zhengbing.huang@easystack.cn>
---
 drbd/drbd_transport_rdma.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drbd/drbd_transport_rdma.c b/drbd/drbd_transport_rdma.c
index be919a926..f24440580 100644
--- a/drbd/drbd_transport_rdma.c
+++ b/drbd/drbd_transport_rdma.c
@@ -1307,9 +1307,10 @@ static int dtr_cma_event_handler(struct rdma_cm_id *cm_id, struct rdma_cm_event
 		set_bit(DSB_ERROR, &cm->state);
 
 		dtr_cma_retry_connect(cm->path, cm);
-		if (!test_and_clear_bit(DSB_CONNECTING, &cm->state))
-			return 0; /* keep ref; __dtr_disconnect_path() won */
-		break;
+		if (test_and_clear_bit(DSB_CONNECTING, &cm->state) ||
+			test_and_clear_bit(DSB_CONNECT_REQ, &cm->state))
+			break;
+		return 0; /* keep ref; __dtr_disconnect_path() won */
 
 	case RDMA_CM_EVENT_DISCONNECTED:
 		// pr_info("%s: RDMA_CM_EVENT_DISCONNECTED\n", cm->name);
@@ -2787,7 +2788,8 @@ static void __dtr_disconnect_path(struct dtr_path *path)
 	 * events. Destroy the cm and cm_id to avoid leaking it.
 	 * This is racing with the event delivery, which drops a reference.
 	 */
-	if (test_and_clear_bit(DSB_CONNECTING, &cm->state))
+	if (test_and_clear_bit(DSB_CONNECTING, &cm->state) ||
+		test_and_clear_bit(DSB_CONNECT_REQ, &cm->state))
 		kref_put(&cm->kref, dtr_destroy_cm);
 
 	kref_put(&cm->kref, dtr_destroy_cm);
-- 
2.43.0



* Re: [PATCH] rdma: Fix cm leak
  2025-04-25 10:24 [PATCH] rdma: Fix cm leak zhengbing.huang
@ 2025-05-05 14:26 ` Philipp Reisner
  2025-05-06  2:20   ` ZhengbingHuang
  2025-05-05 14:26 ` [PATCH 1/1] " Philipp Reisner
  1 sibling, 1 reply; 4+ messages in thread
From: Philipp Reisner @ 2025-05-05 14:26 UTC
  To: zhengbing . huang; +Cc: drbd-dev

Hi Zhengbing,

Yes, I agree. I follow your explanation of what happened and your
proposed fix. I think we also need to clear the DSB_CONNECT_REQ bit in
the RDMA_CM_EVENT_ESTABLISHED case.

Please see my proposal, which is slightly modified from your original
patch.

Best regards,
 Philipp

zhengbing.huang (1):
  rdma: Fix cm leak

 drbd/drbd_transport_rdma.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

-- 
2.49.0



* [PATCH 1/1] rdma: Fix cm leak
  2025-04-25 10:24 [PATCH] rdma: Fix cm leak zhengbing.huang
  2025-05-05 14:26 ` Philipp Reisner
@ 2025-05-05 14:26 ` Philipp Reisner
  1 sibling, 0 replies; 4+ messages in thread
From: Philipp Reisner @ 2025-05-05 14:26 UTC
  To: zhengbing . huang; +Cc: drbd-dev

From: "zhengbing.huang" <zhengbing.huang@easystack.cn>

We found that when all the DRBD devices are down, the reference count
of the drbd_transport_rdma module is still 1.

[root@node-4 ~]# drbdadm status
No currently configured DRBD found.
[root@node-4 ~]# lsmod | grep drbd
drbd_transport_rdma   262144  1

Then we found an unreleased cm structure and discovered
that its state is DSB_CONNECT_REQ + DSB_ERROR.

crash> struct dtr_cm ffff57e515da9400
struct dtr_cm {
  kref = {
    refcount = {
      refs = {
        counter = 1
...
state = 9,
...
}

The scenario appears to be as follows:
dtr_cma_event_handler() gets an RDMA_CM_EVENT_CONNECT_REQUEST event
and calls dtr_cma_accept() to allocate a cm, which sets
cm->state = DSM_CONNECT_REQ. At this point the cm->kref count is 2.
Then dtr_cma_event_handler() gets an xxx_CONNECT_ERROR/xxx_UNREACHABLE/
xxx_REJECTED event and calls set_bit(DSB_ERROR, &cm->state).
dtr_cma_retry_connect() removes the cm from the path and puts one ref.
Because cm->state does not have the DSB_CONNECTING flag set, the
handler returns 0 and keeps the remaining reference.
Now the cm->kref count is 1 and the state is DSB_CONNECT_REQ + DSB_ERROR.

Therefore, wherever we test the DSB_CONNECTING flag,
we should also test the DSB_CONNECT_REQ flag to avoid leaking the cm.

Signed-off-by: zhengbing.huang <zhengbing.huang@easystack.cn>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
---
 drbd/drbd_transport_rdma.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drbd/drbd_transport_rdma.c b/drbd/drbd_transport_rdma.c
index be919a926..4a9ba8fa6 100644
--- a/drbd/drbd_transport_rdma.c
+++ b/drbd/drbd_transport_rdma.c
@@ -1278,8 +1278,8 @@ static int dtr_cma_event_handler(struct rdma_cm_id *cm_id, struct rdma_cm_event
 		/* cm->state = DSM_CONNECTED; is set later in the work item */
 		/* This is called for active and passive connections */
 
-		connecting = test_and_clear_bit(DSB_CONNECTING, &cm->state);
-		connecting |= test_bit(DSB_CONNECT_REQ, &cm->state);
+		connecting = test_and_clear_bit(DSB_CONNECTING, &cm->state) ||
+			test_and_clear_bit(DSB_CONNECT_REQ, &cm->state);
 		kref_get(&cm->kref); /* connected -> expect a disconnect in the future */
 		kref_get(&cm->kref); /* for the work */
 		schedule_work(&cm->establish_work);
@@ -1307,7 +1307,9 @@ static int dtr_cma_event_handler(struct rdma_cm_id *cm_id, struct rdma_cm_event
 		set_bit(DSB_ERROR, &cm->state);
 
 		dtr_cma_retry_connect(cm->path, cm);
-		if (!test_and_clear_bit(DSB_CONNECTING, &cm->state))
+		connecting = test_and_clear_bit(DSB_CONNECTING, &cm->state) ||
+			test_and_clear_bit(DSB_CONNECT_REQ, &cm->state);
+		if (!connecting)
 			return 0; /* keep ref; __dtr_disconnect_path() won */
 		break;
 
@@ -2787,7 +2789,8 @@ static void __dtr_disconnect_path(struct dtr_path *path)
 	 * events. Destroy the cm and cm_id to avoid leaking it.
 	 * This is racing with the event delivery, which drops a reference.
 	 */
-	if (test_and_clear_bit(DSB_CONNECTING, &cm->state))
+	if (test_and_clear_bit(DSB_CONNECTING, &cm->state) ||
+	    test_and_clear_bit(DSB_CONNECT_REQ, &cm->state))
 		kref_put(&cm->kref, dtr_destroy_cm);
 
 	kref_put(&cm->kref, dtr_destroy_cm);
-- 
2.49.0



* Re:Re: [PATCH] rdma: Fix cm leak
  2025-05-05 14:26 ` Philipp Reisner
@ 2025-05-06  2:20   ` ZhengbingHuang
  0 siblings, 0 replies; 4+ messages in thread
From: ZhengbingHuang @ 2025-05-06  2:20 UTC
  To: Philipp Reisner; +Cc: drbd-dev

Hi Philipp,


Yes, I think the modifications you added make this patch more complete.


Best regards,
    zhengbing


From: Philipp Reisner <philipp.reisner@linbit.com>
Date: 2025-05-05 22:26:22
To:  "zhengbing . huang" <zhengbing.huang@easystack.cn>
Cc:  drbd-dev@lists.linbit.com
Subject: Re: [PATCH] rdma: Fix cm leak

>Hi Zhengbing,
>
>Yes, I agree. I follow your explanation of what happened and your
>proposed fix. I think we also need to clear the DSB_CONNECT_REQ bit in
>the RDMA_CM_EVENT_ESTABLISHED case.
>
>Please see my proposal, which is slightly modified from your original
>patch.
>
>Best regards,
> Philipp
>
>zhengbing.huang (1):
>  rdma: Fix cm leak
>
> drbd/drbd_transport_rdma.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
>-- 
>2.49.0
>
>




