linux-rdma.vger.kernel.org archive mirror
* [PATCH] IB/cm: use rwlock for MAD agent lock
@ 2025-02-20 17:04 Jacob Moroni
  2025-02-20 17:37 ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: Jacob Moroni @ 2025-02-20 17:04 UTC (permalink / raw)
  To: jgg, leon, markzhang; +Cc: linux-rdma, Eric Dumazet

In workloads where there are many processes establishing
connections using RDMA CM in parallel (large scale MPI),
there can be heavy contention for mad_agent_lock in
cm_alloc_msg.

This contention can occur while inside of a spin_lock_irq
region, leading to interrupts being disabled for extended
durations on many cores. Furthermore, it leads to the
serialization of rdma_create_ah calls, which has negative
performance impacts for NICs which are capable of processing
multiple address handle creations in parallel.

The end result is the machine becoming unresponsive, hung
task warnings, netdev TX timeouts, etc.

Since the lock appears to be only for protection from
cm_remove_one, it can be changed to a rwlock to resolve
these issues.

Reproducer:

Server:
  for i in $(seq 1 512); do
    ucmatose -c 32 -p $((i + 5000)) &
  done

Client:
  for i in $(seq 1 512); do
    ucmatose -c 32 -p $((i + 5000)) -s 10.2.0.52 &
  done

Fixes: 76039ac9095f5ee5 ("IB/cm: Protect cm_dev, cm_ports and
mad_agent with kref and lock")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
---
 drivers/infiniband/core/cm.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 142170473e75..effa53dd6800 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -167,7 +167,7 @@ struct cm_port {
 struct cm_device {
  struct kref kref;
  struct list_head list;
- spinlock_t mad_agent_lock;
+ rwlock_t mad_agent_lock;
  struct ib_device *ib_device;
  u8 ack_delay;
  int going_down;
@@ -285,7 +285,7 @@ static struct ib_mad_send_buf *cm_alloc_msg(struct
cm_id_private *cm_id_priv)
  if (!cm_id_priv->av.port)
  return ERR_PTR(-EINVAL);

- spin_lock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
+ read_lock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
  mad_agent = cm_id_priv->av.port->mad_agent;
  if (!mad_agent) {
  m = ERR_PTR(-EINVAL);
@@ -311,7 +311,7 @@ static struct ib_mad_send_buf *cm_alloc_msg(struct
cm_id_private *cm_id_priv)
  m->ah = ah;

 out:
- spin_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
+ read_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
  return m;
 }

@@ -1297,10 +1297,10 @@ static __be64 cm_form_tid(struct cm_id_private
*cm_id_priv)
  if (!cm_id_priv->av.port)
  return cpu_to_be64(low_tid);

- spin_lock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
+ read_lock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
  if (cm_id_priv->av.port->mad_agent)
  hi_tid = ((u64)cm_id_priv->av.port->mad_agent->hi_tid) << 32;
- spin_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
+ read_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
  return cpu_to_be64(hi_tid | low_tid);
 }

@@ -4378,7 +4378,7 @@ static int cm_add_one(struct ib_device *ib_device)
  return -ENOMEM;

  kref_init(&cm_dev->kref);
- spin_lock_init(&cm_dev->mad_agent_lock);
+ rwlock_init(&cm_dev->mad_agent_lock);
  cm_dev->ib_device = ib_device;
  cm_dev->ack_delay = ib_device->attrs.local_ca_ack_delay;
  cm_dev->going_down = 0;
@@ -4494,9 +4494,9 @@ static void cm_remove_one(struct ib_device
*ib_device, void *client_data)
  * The above ensures no call paths from the work are running,
  * the remaining paths all take the mad_agent_lock.
  */
- spin_lock(&cm_dev->mad_agent_lock);
+ write_lock(&cm_dev->mad_agent_lock);
  port->mad_agent = NULL;
- spin_unlock(&cm_dev->mad_agent_lock);
+ write_unlock(&cm_dev->mad_agent_lock);
  ib_unregister_mad_agent(mad_agent);
  ib_port_unregister_client_groups(ib_device, i,
  cm_counter_groups);
-- 
2.48.1.601.g30ceb7b040-goog

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] IB/cm: use rwlock for MAD agent lock
  2025-02-20 17:04 Jacob Moroni
@ 2025-02-20 17:37 ` Eric Dumazet
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2025-02-20 17:37 UTC (permalink / raw)
  To: Jacob Moroni; +Cc: jgg, leon, markzhang, linux-rdma

On Thu, Feb 20, 2025 at 6:05 PM Jacob Moroni <jmoroni@google.com> wrote:
>
> In workloads where there are many processes establishing
> connections using RDMA CM in parallel (large scale MPI),
> there can be heavy contention for mad_agent_lock in
> cm_alloc_msg.
>
> This contention can occur while inside of a spin_lock_irq
> region, leading to interrupts being disabled for extended
> durations on many cores. Furthermore, it leads to the
> serialization of rdma_create_ah calls, which has negative
> performance impacts for NICs which are capable of processing
> multiple address handle creations in parallel.
>
> The end result is the machine becoming unresponsive, hung
> task warnings, netdev TX timeouts, etc.
>
> Since the lock appears to be only for protection from
> cm_remove_one, it can be changed to a rwlock to resolve
> these issues.
>
> Reproducer:
>
> Server:
>   for i in $(seq 1 512); do
>     ucmatose -c 32 -p $((i + 5000)) &
>   done
>
> Client:
>   for i in $(seq 1 512); do
>     ucmatose -c 32 -p $((i + 5000)) -s 10.2.0.52 &
>   done
>
> Fixes: 76039ac9095f5ee5 ("IB/cm: Protect cm_dev, cm_ports and
> mad_agent with kref and lock")

Fixes: tag should be on a single line.

> Signed-off-by: Jacob Moroni <jmoroni@google.com>
> ---

It seems your patch is mangled.

Can you use "git send-email" to resend it ?


* [PATCH] IB/cm: use rwlock for MAD agent lock
@ 2025-02-20 17:56 Jacob Moroni
  2025-02-21 16:50 ` Eric Dumazet
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Jacob Moroni @ 2025-02-20 17:56 UTC (permalink / raw)
  To: jgg, leon, markzhang; +Cc: linux-rdma, edumazet, Jacob Moroni

In workloads where there are many processes establishing
connections using RDMA CM in parallel (large scale MPI),
there can be heavy contention for mad_agent_lock in
cm_alloc_msg.

This contention can occur while inside of a spin_lock_irq
region, leading to interrupts being disabled for extended
durations on many cores. Furthermore, it leads to the
serialization of rdma_create_ah calls, which has negative
performance impacts for NICs which are capable of processing
multiple address handle creations in parallel.

The end result is the machine becoming unresponsive, hung
task warnings, netdev TX timeouts, etc.

Since the lock appears to be only for protection from
cm_remove_one, it can be changed to a rwlock to resolve
these issues.

Reproducer:

Server:
  for i in $(seq 1 512); do
    ucmatose -c 32 -p $((i + 5000)) &
  done

Client:
  for i in $(seq 1 512); do
    ucmatose -c 32 -p $((i + 5000)) -s 10.2.0.52 &
  done

Fixes: 76039ac9095f5ee5 ("IB/cm: Protect cm_dev, cm_ports and mad_agent with kref and lock")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
---
 drivers/infiniband/core/cm.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 142170473e75..effa53dd6800 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -167,7 +167,7 @@ struct cm_port {
 struct cm_device {
 	struct kref kref;
 	struct list_head list;
-	spinlock_t mad_agent_lock;
+	rwlock_t mad_agent_lock;
 	struct ib_device *ib_device;
 	u8 ack_delay;
 	int going_down;
@@ -285,7 +285,7 @@ static struct ib_mad_send_buf *cm_alloc_msg(struct cm_id_private *cm_id_priv)
 	if (!cm_id_priv->av.port)
 		return ERR_PTR(-EINVAL);
 
-	spin_lock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
+	read_lock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
 	mad_agent = cm_id_priv->av.port->mad_agent;
 	if (!mad_agent) {
 		m = ERR_PTR(-EINVAL);
@@ -311,7 +311,7 @@ static struct ib_mad_send_buf *cm_alloc_msg(struct cm_id_private *cm_id_priv)
 	m->ah = ah;
 
 out:
-	spin_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
+	read_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
 	return m;
 }
 
@@ -1297,10 +1297,10 @@ static __be64 cm_form_tid(struct cm_id_private *cm_id_priv)
 	if (!cm_id_priv->av.port)
 		return cpu_to_be64(low_tid);
 
-	spin_lock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
+	read_lock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
 	if (cm_id_priv->av.port->mad_agent)
 		hi_tid = ((u64)cm_id_priv->av.port->mad_agent->hi_tid) << 32;
-	spin_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
+	read_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
 	return cpu_to_be64(hi_tid | low_tid);
 }
 
@@ -4378,7 +4378,7 @@ static int cm_add_one(struct ib_device *ib_device)
 		return -ENOMEM;
 
 	kref_init(&cm_dev->kref);
-	spin_lock_init(&cm_dev->mad_agent_lock);
+	rwlock_init(&cm_dev->mad_agent_lock);
 	cm_dev->ib_device = ib_device;
 	cm_dev->ack_delay = ib_device->attrs.local_ca_ack_delay;
 	cm_dev->going_down = 0;
@@ -4494,9 +4494,9 @@ static void cm_remove_one(struct ib_device *ib_device, void *client_data)
 		 * The above ensures no call paths from the work are running,
 		 * the remaining paths all take the mad_agent_lock.
 		 */
-		spin_lock(&cm_dev->mad_agent_lock);
+		write_lock(&cm_dev->mad_agent_lock);
 		port->mad_agent = NULL;
-		spin_unlock(&cm_dev->mad_agent_lock);
+		write_unlock(&cm_dev->mad_agent_lock);
 		ib_unregister_mad_agent(mad_agent);
 		ib_port_unregister_client_groups(ib_device, i,
 						 cm_counter_groups);
-- 
2.48.1.601.g30ceb7b040-goog



* Re: [PATCH] IB/cm: use rwlock for MAD agent lock
  2025-02-20 17:56 [PATCH] IB/cm: use rwlock for MAD agent lock Jacob Moroni
@ 2025-02-21 16:50 ` Eric Dumazet
  2025-02-21 17:00 ` Jason Gunthorpe
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2025-02-21 16:50 UTC (permalink / raw)
  To: Jacob Moroni; +Cc: jgg, leon, markzhang, linux-rdma

On Thu, Feb 20, 2025 at 6:56 PM Jacob Moroni <jmoroni@google.com> wrote:
>
> In workloads where there are many processes establishing
> connections using RDMA CM in parallel (large scale MPI),
> there can be heavy contention for mad_agent_lock in
> cm_alloc_msg.
>
> This contention can occur while inside of a spin_lock_irq
> region, leading to interrupts being disabled for extended
> durations on many cores. Furthermore, it leads to the
> serialization of rdma_create_ah calls, which has negative
> performance impacts for NICs which are capable of processing
> multiple address handle creations in parallel.
>
> The end result is the machine becoming unresponsive, hung
> task warnings, netdev TX timeouts, etc.
>
> Since the lock appears to be only for protection from
> cm_remove_one, it can be changed to a rwlock to resolve
> these issues.
>
> Reproducer:
>
> Server:
>   for i in $(seq 1 512); do
>     ucmatose -c 32 -p $((i + 5000)) &
>   done
>
> Client:
>   for i in $(seq 1 512); do
>     ucmatose -c 32 -p $((i + 5000)) -s 10.2.0.52 &
>   done
>
> Fixes: 76039ac9095f5ee5 ("IB/cm: Protect cm_dev, cm_ports and mad_agent with kref and lock")
> Signed-off-by: Jacob Moroni <jmoroni@google.com>

SGTM, thanks.

Acked-by: Eric Dumazet <edumazet@google.com>

RCU could probably be used here, if we expect the
read_lock()/read_unlock() operations to happen in a fast path.
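[Editor's note: a rough, untested kernel-style sketch of what the RCU conversion suggested here might look like; the field names come from the patch, but the __rcu annotation on port->mad_agent and the exact call sites are assumptions, not a reviewed implementation.]

```c
/* Reader side (e.g. cm_alloc_msg()): no shared lock word is written,
 * so readers on different CPUs no longer bounce a cache line. */
rcu_read_lock();
mad_agent = rcu_dereference(cm_id_priv->av.port->mad_agent);
if (mad_agent) {
	/* ... build the MAD, create the AH ... */
}
rcu_read_unlock();

/* Writer side (cm_remove_one()): unpublish the pointer, then wait
 * for all pre-existing readers before unregistering the agent. */
RCU_INIT_POINTER(port->mad_agent, NULL);
synchronize_rcu();	/* may sleep; fine in cm_remove_one() */
ib_unregister_mad_agent(mad_agent);
```

Note the same constraint as with read_lock() applies: the RCU read-side critical section must not sleep, so this would not by itself let AH creation move out of atomic context.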


* Re: [PATCH] IB/cm: use rwlock for MAD agent lock
  2025-02-20 17:56 [PATCH] IB/cm: use rwlock for MAD agent lock Jacob Moroni
  2025-02-21 16:50 ` Eric Dumazet
@ 2025-02-21 17:00 ` Jason Gunthorpe
  2025-02-21 17:03 ` Zhu Yanjun
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Jason Gunthorpe @ 2025-02-21 17:00 UTC (permalink / raw)
  To: Jacob Moroni; +Cc: leon, markzhang, linux-rdma, edumazet

On Thu, Feb 20, 2025 at 05:56:12PM +0000, Jacob Moroni wrote:
> In workloads where there are many processes establishing
> connections using RDMA CM in parallel (large scale MPI),
> there can be heavy contention for mad_agent_lock in
> cm_alloc_msg.
> 
> This contention can occur while inside of a spin_lock_irq
> region, leading to interrupts being disabled for extended
> durations on many cores. Furthermore, it leads to the
> serialization of rdma_create_ah calls, which has negative
> performance impacts for NICs which are capable of processing
> multiple address handle creations in parallel.
> 
> The end result is the machine becoming unresponsive, hung
> task warnings, netdev TX timeouts, etc.

While the patch and fix seem reasonable, I'm somewhat surprised to
see it.

If you are running at such a high workload then I'm shocked you don't
hit all the other nasty problems with RDMA CM scalability?

Is the issue that the AH creation is very slow for some reason? It has
been a longstanding peeve of mine that this is done under a spinlock
context, I've long felt that should be reworked and some of those
spinlocks converted to mutex's.

Jason


* Re: [PATCH] IB/cm: use rwlock for MAD agent lock
  2025-02-20 17:56 [PATCH] IB/cm: use rwlock for MAD agent lock Jacob Moroni
  2025-02-21 16:50 ` Eric Dumazet
  2025-02-21 17:00 ` Jason Gunthorpe
@ 2025-02-21 17:03 ` Zhu Yanjun
  2025-02-21 17:32   ` Eric Dumazet
  2025-04-01 16:18 ` Jason Gunthorpe
  2025-04-07 18:41 ` Jason Gunthorpe
  4 siblings, 1 reply; 13+ messages in thread
From: Zhu Yanjun @ 2025-02-21 17:03 UTC (permalink / raw)
  To: Jacob Moroni, jgg, leon, markzhang; +Cc: linux-rdma, edumazet

On 20.02.25 18:56, Jacob Moroni wrote:
> In workloads where there are many processes establishing
> connections using RDMA CM in parallel (large scale MPI),
> there can be heavy contention for mad_agent_lock in
> cm_alloc_msg.
> 
> This contention can occur while inside of a spin_lock_irq
> region, leading to interrupts being disabled for extended
> durations on many cores. Furthermore, it leads to the
> serialization of rdma_create_ah calls, which has negative
> performance impacts for NICs which are capable of processing
> multiple address handle creations in parallel.

In the link: 
https://www.cs.columbia.edu/~jae/4118-LAST/L12-interrupt-spinlock.html
"
...
spin_lock() / spin_unlock()

must not lose CPU while holding a spin lock, other threads will wait for 
the lock for a long time

spin_lock() prevents kernel preemption by ++preempt_count in 
uniprocessor, that’s all spin_lock() does

must NOT call any function that can potentially sleep
ex) kmalloc, copy_from_user

hardware interrupt is ok unless the interrupt handler may try to lock 
this spin lock
spin lock not recursive: same thread locking twice will deadlock

keep the critical section as small as possible
...
"
And from the source code, it seems that spin_lock/spin_unlock are not 
related to interrupts.

I wonder why "leading to interrupts being disabled for extended 
durations on many cores" happens with spin_lock/spin_unlock?

I am not against this commit. I am just curious why 
spin_lock/spin_unlock are related to "interrupts being disabled".

Thanks a lot.
Zhu Yanjun

> 
> The end result is the machine becoming unresponsive, hung
> task warnings, netdev TX timeouts, etc.
> 
> Since the lock appears to be only for protection from
> cm_remove_one, it can be changed to a rwlock to resolve
> these issues.
> 
> Reproducer:
> 
> Server:
>    for i in $(seq 1 512); do
>      ucmatose -c 32 -p $((i + 5000)) &
>    done
> 
> Client:
>    for i in $(seq 1 512); do
>      ucmatose -c 32 -p $((i + 5000)) -s 10.2.0.52 &
>    done
> 
> Fixes: 76039ac9095f5ee5 ("IB/cm: Protect cm_dev, cm_ports and mad_agent with kref and lock")
> Signed-off-by: Jacob Moroni <jmoroni@google.com>
> ---
>   drivers/infiniband/core/cm.c | 16 ++++++++--------
>   1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index 142170473e75..effa53dd6800 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -167,7 +167,7 @@ struct cm_port {
>   struct cm_device {
>   	struct kref kref;
>   	struct list_head list;
> -	spinlock_t mad_agent_lock;
> +	rwlock_t mad_agent_lock;
>   	struct ib_device *ib_device;
>   	u8 ack_delay;
>   	int going_down;
> @@ -285,7 +285,7 @@ static struct ib_mad_send_buf *cm_alloc_msg(struct cm_id_private *cm_id_priv)
>   	if (!cm_id_priv->av.port)
>   		return ERR_PTR(-EINVAL);
>   
> -	spin_lock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
> +	read_lock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
>   	mad_agent = cm_id_priv->av.port->mad_agent;
>   	if (!mad_agent) {
>   		m = ERR_PTR(-EINVAL);
> @@ -311,7 +311,7 @@ static struct ib_mad_send_buf *cm_alloc_msg(struct cm_id_private *cm_id_priv)
>   	m->ah = ah;
>   
>   out:
> -	spin_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
> +	read_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
>   	return m;
>   }
>   
> @@ -1297,10 +1297,10 @@ static __be64 cm_form_tid(struct cm_id_private *cm_id_priv)
>   	if (!cm_id_priv->av.port)
>   		return cpu_to_be64(low_tid);
>   
> -	spin_lock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
> +	read_lock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
>   	if (cm_id_priv->av.port->mad_agent)
>   		hi_tid = ((u64)cm_id_priv->av.port->mad_agent->hi_tid) << 32;
> -	spin_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
> +	read_unlock(&cm_id_priv->av.port->cm_dev->mad_agent_lock);
>   	return cpu_to_be64(hi_tid | low_tid);
>   }
>   
> @@ -4378,7 +4378,7 @@ static int cm_add_one(struct ib_device *ib_device)
>   		return -ENOMEM;
>   
>   	kref_init(&cm_dev->kref);
> -	spin_lock_init(&cm_dev->mad_agent_lock);
> +	rwlock_init(&cm_dev->mad_agent_lock);
>   	cm_dev->ib_device = ib_device;
>   	cm_dev->ack_delay = ib_device->attrs.local_ca_ack_delay;
>   	cm_dev->going_down = 0;
> @@ -4494,9 +4494,9 @@ static void cm_remove_one(struct ib_device *ib_device, void *client_data)
>   		 * The above ensures no call paths from the work are running,
>   		 * the remaining paths all take the mad_agent_lock.
>   		 */
> -		spin_lock(&cm_dev->mad_agent_lock);
> +		write_lock(&cm_dev->mad_agent_lock);
>   		port->mad_agent = NULL;
> -		spin_unlock(&cm_dev->mad_agent_lock);
> +		write_unlock(&cm_dev->mad_agent_lock);
>   		ib_unregister_mad_agent(mad_agent);
>   		ib_port_unregister_client_groups(ib_device, i,
>   						 cm_counter_groups);



* Re: [PATCH] IB/cm: use rwlock for MAD agent lock
  2025-02-21 17:03 ` Zhu Yanjun
@ 2025-02-21 17:32   ` Eric Dumazet
  2025-02-21 17:39     ` Jacob Moroni
  2025-02-22  6:20     ` Zhu Yanjun
  0 siblings, 2 replies; 13+ messages in thread
From: Eric Dumazet @ 2025-02-21 17:32 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Jacob Moroni, jgg, leon, markzhang, linux-rdma

On Fri, Feb 21, 2025 at 6:04 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>
> On 20.02.25 18:56, Jacob Moroni wrote:
> > In workloads where there are many processes establishing
> > connections using RDMA CM in parallel (large scale MPI),
> > there can be heavy contention for mad_agent_lock in
> > cm_alloc_msg.
> >
> > This contention can occur while inside of a spin_lock_irq
> > region, leading to interrupts being disabled for extended
> > durations on many cores. Furthermore, it leads to the
> > serialization of rdma_create_ah calls, which has negative
> > performance impacts for NICs which are capable of processing
> > multiple address handle creations in parallel.
>
> In the link:
> https://www.cs.columbia.edu/~jae/4118-LAST/L12-interrupt-spinlock.html
> "
> ...
> spin_lock() / spin_unlock()
>
> must not lose CPU while holding a spin lock, other threads will wait for
> the lock for a long time
>
> spin_lock() prevents kernel preemption by ++preempt_count in
> uniprocessor, that’s all spin_lock() does
>
> must NOT call any function that can potentially sleep
> ex) kmalloc, copy_from_user
>
> hardware interrupt is ok unless the interrupt handler may try to lock
> this spin lock
> spin lock not recursive: same thread locking twice will deadlock
>
> keep the critical section as small as possible
> ...
> "
> And from the source code, it seems that spin_lock/spin_unlock are not
> related to interrupts.
>
> I wonder why "leading to interrupts being disabled for extended
> durations on many cores" happens with spin_lock/spin_unlock?
>
> I am not against this commit. I am just curious why
> spin_lock/spin_unlock are related to "interrupts being disabled".

Look at drivers/infiniband/core/cm.c

spin_lock_irqsave(&cm_id_priv->lock, flags);

-> Then call cm_alloc_msg() while hard IRQ are masked.
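[Editor's note: in other words, the call chain nests the contended lock inside an IRQ-disabled region. A simplified caller pattern (not a verbatim excerpt from cm.c):]

```c
spin_lock_irqsave(&cm_id_priv->lock, flags);	/* hard IRQs masked from here */
msg = cm_alloc_msg(cm_id_priv);			/* spins on mad_agent_lock and
						 * calls rdma_create_ah() */
/* ... format and post the MAD ... */
spin_unlock_irqrestore(&cm_id_priv->lock, flags); /* IRQs restored only here */
```

Any time spent waiting on mad_agent_lock, or inside the driver's AH creation, is therefore time with local interrupts off.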


* Re: [PATCH] IB/cm: use rwlock for MAD agent lock
  2025-02-21 17:32   ` Eric Dumazet
@ 2025-02-21 17:39     ` Jacob Moroni
  2025-02-22  6:20     ` Zhu Yanjun
  1 sibling, 0 replies; 13+ messages in thread
From: Jacob Moroni @ 2025-02-21 17:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Zhu Yanjun, jgg, leon, markzhang, linux-rdma

> If you are running at such a high workload then I'm shocked you don't
> hit all the other nasty problems with RDMA CM scalability?

It could be that we just haven't hit those issues yet :)

This serialization was slowing things down so much that I think it
has been masking some other issues. For instance, I just discovered
a bug in rping's persistent server mode (I'll be sending a GitHub PR
soon), which seems to be due to a race condition we started hitting
after this fix.

> Is the issue that the AH creation is very slow for some reason? It has
> been a longstanding peeve of mine that this is done under a spinlock
> context, I've long felt that should be reworked and some of those
> spinlocks converted to mutex's.

Yes, that's exactly it. We have fairly high tail latencies for creating
address handles. By removing the serialization, we can at least take
advantage of queueing, which seems to help a lot. It would be really
great if this could move out of an atomic context.

Thanks,
Jake

On Fri, Feb 21, 2025 at 12:32 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Feb 21, 2025 at 6:04 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
> >
> > On 20.02.25 18:56, Jacob Moroni wrote:
> > > In workloads where there are many processes establishing
> > > connections using RDMA CM in parallel (large scale MPI),
> > > there can be heavy contention for mad_agent_lock in
> > > cm_alloc_msg.
> > >
> > > This contention can occur while inside of a spin_lock_irq
> > > region, leading to interrupts being disabled for extended
> > > durations on many cores. Furthermore, it leads to the
> > > serialization of rdma_create_ah calls, which has negative
> > > performance impacts for NICs which are capable of processing
> > > multiple address handle creations in parallel.
> >
> > In the link:
> > https://www.cs.columbia.edu/~jae/4118-LAST/L12-interrupt-spinlock.html
> > "
> > ...
> > spin_lock() / spin_unlock()
> >
> > must not lose CPU while holding a spin lock, other threads will wait for
> > the lock for a long time
> >
> > spin_lock() prevents kernel preemption by ++preempt_count in
> > uniprocessor, that’s all spin_lock() does
> >
> > must NOT call any function that can potentially sleep
> > ex) kmalloc, copy_from_user
> >
> > hardware interrupt is ok unless the interrupt handler may try to lock
> > this spin lock
> > spin lock not recursive: same thread locking twice will deadlock
> >
> > keep the critical section as small as possible
> > ...
> > "
> > And from the source code, it seems that spin_lock/spin_unlock are not
> > related to interrupts.
> >
> > I wonder why "leading to interrupts being disabled for extended
> > durations on many cores" happens with spin_lock/spin_unlock?
> >
> > I am not against this commit. I am just curious why
> > spin_lock/spin_unlock are related to "interrupts being disabled".
>
> Look at drivers/infiniband/core/cm.c
>
> spin_lock_irqsave(&cm_id_priv->lock, flags);
>
> -> Then call cm_alloc_msg() while hard IRQ are masked.


* Re: [PATCH] IB/cm: use rwlock for MAD agent lock
  2025-02-21 17:32   ` Eric Dumazet
  2025-02-21 17:39     ` Jacob Moroni
@ 2025-02-22  6:20     ` Zhu Yanjun
  2025-02-22  7:38       ` Eric Dumazet
  1 sibling, 1 reply; 13+ messages in thread
From: Zhu Yanjun @ 2025-02-22  6:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jacob Moroni, jgg, leon, markzhang, linux-rdma

On 2025/2/21 18:32, Eric Dumazet wrote:
> On Fri, Feb 21, 2025 at 6:04 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>>
>> On 20.02.25 18:56, Jacob Moroni wrote:
>>> In workloads where there are many processes establishing
>>> connections using RDMA CM in parallel (large scale MPI),
>>> there can be heavy contention for mad_agent_lock in
>>> cm_alloc_msg.
>>>
>>> This contention can occur while inside of a spin_lock_irq
>>> region, leading to interrupts being disabled for extended
>>> durations on many cores. Furthermore, it leads to the
>>> serialization of rdma_create_ah calls, which has negative
>>> performance impacts for NICs which are capable of processing
>>> multiple address handle creations in parallel.
>>
>> In the link:
>> https://www.cs.columbia.edu/~jae/4118-LAST/L12-interrupt-spinlock.html
>> "
>> ...
>> spin_lock() / spin_unlock()
>>
>> must not lose CPU while holding a spin lock, other threads will wait for
>> the lock for a long time
>>
>> spin_lock() prevents kernel preemption by ++preempt_count in
>> uniprocessor, that’s all spin_lock() does
>>
>> must NOT call any function that can potentially sleep
>> ex) kmalloc, copy_from_user
>>
>> hardware interrupt is ok unless the interrupt handler may try to lock
>> this spin lock
>> spin lock not recursive: same thread locking twice will deadlock
>>
>> keep the critical section as small as possible
>> ...
>> "
>> And from the source code, it seems that spin_lock/spin_unlock are not
>> related to interrupts.
>>
>> I wonder why "leading to interrupts being disabled for extended
>> durations on many cores" happens with spin_lock/spin_unlock?
>>
>> I am not against this commit. I am just curious why
>> spin_lock/spin_unlock are related to "interrupts being disabled".
> 
> Look at drivers/infiniband/core/cm.c
> 
> spin_lock_irqsave(&cm_id_priv->lock, flags);

Thanks a lot. Should spin_lock_irq be spin_lock_irqsave?

Following the steps of the reproducer, I cannot reproduce this problem 
on KVMs. Maybe I need a more powerful host.

Anyway, read_lock should be a lighter lock than spin_lock.

Thanks,
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>

Zhu Yanjun

> 
> -> Then call cm_alloc_msg() while hard IRQ are masked.



* Re: [PATCH] IB/cm: use rwlock for MAD agent lock
  2025-02-22  6:20     ` Zhu Yanjun
@ 2025-02-22  7:38       ` Eric Dumazet
  2025-02-22 10:31         ` Zhu Yanjun
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2025-02-22  7:38 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Jacob Moroni, jgg, leon, markzhang, linux-rdma

On Sat, Feb 22, 2025 at 7:21 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>
> On 2025/2/21 18:32, Eric Dumazet wrote:
> > On Fri, Feb 21, 2025 at 6:04 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
> >>
> >> On 20.02.25 18:56, Jacob Moroni wrote:
> >>> In workloads where there are many processes establishing
> >>> connections using RDMA CM in parallel (large scale MPI),
> >>> there can be heavy contention for mad_agent_lock in
> >>> cm_alloc_msg.
> >>>
> >>> This contention can occur while inside of a spin_lock_irq
> >>> region, leading to interrupts being disabled for extended
> >>> durations on many cores. Furthermore, it leads to the
> >>> serialization of rdma_create_ah calls, which has negative
> >>> performance impacts for NICs which are capable of processing
> >>> multiple address handle creations in parallel.
> >>
> >> In the link:
> >> https://www.cs.columbia.edu/~jae/4118-LAST/L12-interrupt-spinlock.html
> >> "
> >> ...
> >> spin_lock() / spin_unlock()
> >>
> >> must not lose CPU while holding a spin lock, other threads will wait for
> >> the lock for a long time
> >>
> >> spin_lock() prevents kernel preemption by ++preempt_count in
> >> uniprocessor, that’s all spin_lock() does
> >>
> >> must NOT call any function that can potentially sleep
> >> ex) kmalloc, copy_from_user
> >>
> >> hardware interrupt is ok unless the interrupt handler may try to lock
> >> this spin lock
> >> spin lock not recursive: same thread locking twice will deadlock
> >>
> >> keep the critical section as small as possible
> >> ...
> >> "
> >> And from the source code, it seems that spin_lock/spin_unlock are not
> >> related to interrupts.
> >>
> >> I wonder why "leading to interrupts being disabled for extended
> >> durations on many cores" happens with spin_lock/spin_unlock?
> >>
> >> I am not against this commit. I am just curious why
> >> spin_lock/spin_unlock are related to "interrupts being disabled".
> >
> > Look at drivers/infiniband/core/cm.c
> >
> > spin_lock_irqsave(&cm_id_priv->lock, flags);
>
> Thanks a lot. Should spin_lock_irq be spin_lock_irqsave?

Both spin_lock_irq and spin_lock_irqsave mask all interrupts.
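[Editor's note: the two variants differ only in how the previous IRQ state is restored on unlock; roughly, on SMP:]

```c
spin_lock_irq(&lock);			/* local_irq_disable() + spin_lock()  */
spin_unlock_irq(&lock);			/* spin_unlock() + local_irq_enable() */

spin_lock_irqsave(&lock, flags);	/* local_irq_save(flags) + spin_lock() */
spin_unlock_irqrestore(&lock, flags);	/* restores the saved IRQ state, so
					 * it is safe even if IRQs were
					 * already disabled on entry */
```

Either way, local interrupts stay off for the whole critical section, which is why contention inside it shows up as long IRQ-off times.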



>
> Following the steps of the reproducer, I cannot reproduce this problem
> on KVMs. Maybe I need a more powerful host.
>
> Anyway, read_lock should be a lighter lock than spin_lock.
>
> Thanks,
> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>
> Zhu Yanjun
>
> >
> > -> Then call cm_alloc_msg() while hard IRQ are masked.
>


* Re: [PATCH] IB/cm: use rwlock for MAD agent lock
  2025-02-22  7:38       ` Eric Dumazet
@ 2025-02-22 10:31         ` Zhu Yanjun
  0 siblings, 0 replies; 13+ messages in thread
From: Zhu Yanjun @ 2025-02-22 10:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jacob Moroni, jgg, leon, markzhang, linux-rdma

On 2025/2/22 8:38, Eric Dumazet wrote:
> On Sat, Feb 22, 2025 at 7:21 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>>
>> On 2025/2/21 18:32, Eric Dumazet wrote:
>>> On Fri, Feb 21, 2025 at 6:04 PM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>>>>
>>>> On 20.02.25 18:56, Jacob Moroni wrote:
>>>>> In workloads where there are many processes establishing
>>>>> connections using RDMA CM in parallel (large scale MPI),
>>>>> there can be heavy contention for mad_agent_lock in
>>>>> cm_alloc_msg.
>>>>>
>>>>> This contention can occur while inside of a spin_lock_irq
>>>>> region, leading to interrupts being disabled for extended
>>>>> durations on many cores. Furthermore, it leads to the
>>>>> serialization of rdma_create_ah calls, which has negative
>>>>> performance impacts for NICs which are capable of processing
>>>>> multiple address handle creations in parallel.
>>>>
>>>> In the link:
>>>> https://www.cs.columbia.edu/~jae/4118-LAST/L12-interrupt-spinlock.html
>>>> "
>>>> ...
>>>> spin_lock() / spin_unlock()
>>>>
>>>> must not lose CPU while holding a spin lock, other threads will wait for
>>>> the lock for a long time
>>>>
>>>> spin_lock() prevents kernel preemption by ++preempt_count in
>>>> uniprocessor, that’s all spin_lock() does
>>>>
>>>> must NOT call any function that can potentially sleep
>>>> ex) kmalloc, copy_from_user
>>>>
>>>> hardware interrupt is ok unless the interrupt handler may try to lock
>>>> this spin lock
>>>> spin lock not recursive: same thread locking twice will deadlock
>>>>
>>>> keep the critical section as small as possible
>>>> ...
>>>> "
>>>> And from the source code, it seems that spin_lock/spin_unlock are not
>>>> related to interrupts.
>>>>
>>>> I wonder why "leading to interrupts being disabled for extended
>>>> durations on many cores" happens with spin_lock/spin_unlock?
>>>>
>>>> I am not against this commit. I am just curious why
>>>> spin_lock/spin_unlock are related to "interrupts being disabled".
>>>
>>> Look at drivers/infiniband/core/cm.c
>>>
>>> spin_lock_irqsave(&cm_id_priv->lock, flags);
>>
>> Thanks a lot. Should spin_lock_irq be spin_lock_irqsave?
> 
> Both spin_lock_irq and spin_lock_irqsave are masking all interrupts.

Maybe the commit log should state this more clearly. Thanks.

Zhu Yanjun

> 
> 
> 
>>
>> Following the steps of the reproducer, I cannot reproduce this problem on
>> KVMs. Maybe I need a more powerful host.
>>
>> Anyway, read_lock should be a lighter lock than spin_lock.
>>
>> Thanks,
>> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
>>
>> Zhu Yanjun
>>
>>>
>>> -> Then cm_alloc_msg() is called while hard IRQs are masked.
>>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] IB/cm: use rwlock for MAD agent lock
  2025-02-20 17:56 [PATCH] IB/cm: use rwlock for MAD agent lock Jacob Moroni
                   ` (2 preceding siblings ...)
  2025-02-21 17:03 ` Zhu Yanjun
@ 2025-04-01 16:18 ` Jason Gunthorpe
  2025-04-07 18:41 ` Jason Gunthorpe
  4 siblings, 0 replies; 13+ messages in thread
From: Jason Gunthorpe @ 2025-04-01 16:18 UTC (permalink / raw)
  To: Jacob Moroni; +Cc: leon, markzhang, linux-rdma, edumazet

On Thu, Feb 20, 2025 at 05:56:12PM +0000, Jacob Moroni wrote:
> In workloads where there are many processes establishing
> connections using RDMA CM in parallel (large scale MPI),
> there can be heavy contention for mad_agent_lock in
> cm_alloc_msg.
> 
> This contention can occur while inside of a spin_lock_irq
> region, leading to interrupts being disabled for extended
> durations on many cores. Furthermore, it leads to the
> serialization of rdma_create_ah calls, which has negative
> performance impacts for NICs which are capable of processing
> multiple address handle creations in parallel.
> 
> The end result is the machine becoming unresponsive, hung
> task warnings, netdev TX timeouts, etc.
> 
> Since the lock appears to be only for protection from
> cm_remove_one, it can be changed to a rwlock to resolve
> these issues.
> 
> Reproducer:
> 
> Server:
>   for i in $(seq 1 512); do
>     ucmatose -c 32 -p $((i + 5000)) &
>   done
> 
> Client:
>   for i in $(seq 1 512); do
>     ucmatose -c 32 -p $((i + 5000)) -s 10.2.0.52 &
>   done
> 
> Fixes: 76039ac9095f5ee5 ("IB/cm: Protect cm_dev, cm_ports and mad_agent with kref and lock")
> Signed-off-by: Jacob Moroni <jmoroni@google.com>
> ---
>  drivers/infiniband/core/cm.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Though I strongly encourage someone to change the spinlocks in this
area to mutexes :)

Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] IB/cm: use rwlock for MAD agent lock
  2025-02-20 17:56 [PATCH] IB/cm: use rwlock for MAD agent lock Jacob Moroni
                   ` (3 preceding siblings ...)
  2025-04-01 16:18 ` Jason Gunthorpe
@ 2025-04-07 18:41 ` Jason Gunthorpe
  4 siblings, 0 replies; 13+ messages in thread
From: Jason Gunthorpe @ 2025-04-07 18:41 UTC (permalink / raw)
  To: Jacob Moroni; +Cc: leon, markzhang, linux-rdma, edumazet

On Thu, Feb 20, 2025 at 05:56:12PM +0000, Jacob Moroni wrote:
> In workloads where there are many processes establishing
> connections using RDMA CM in parallel (large scale MPI),
> there can be heavy contention for mad_agent_lock in
> cm_alloc_msg.
> 
> This contention can occur while inside of a spin_lock_irq
> region, leading to interrupts being disabled for extended
> durations on many cores. Furthermore, it leads to the
> serialization of rdma_create_ah calls, which has negative
> performance impacts for NICs which are capable of processing
> multiple address handle creations in parallel.
> 
> The end result is the machine becoming unresponsive, hung
> task warnings, netdev TX timeouts, etc.
> 
> Since the lock appears to be only for protection from
> cm_remove_one, it can be changed to a rwlock to resolve
> these issues.
> 
> Reproducer:
> 
> Server:
>   for i in $(seq 1 512); do
>     ucmatose -c 32 -p $((i + 5000)) &
>   done
> 
> Client:
>   for i in $(seq 1 512); do
>     ucmatose -c 32 -p $((i + 5000)) -s 10.2.0.52 &
>   done
> 
> Fixes: 76039ac9095f5ee5 ("IB/cm: Protect cm_dev, cm_ports and mad_agent with kref and lock")
> Signed-off-by: Jacob Moroni <jmoroni@google.com>
> Acked-by: Eric Dumazet <edumazet@google.com>
> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/infiniband/core/cm.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)

Applied to for-next, thanks

Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-04-07 18:41 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-20 17:56 [PATCH] IB/cm: use rwlock for MAD agent lock Jacob Moroni
2025-02-21 16:50 ` Eric Dumazet
2025-02-21 17:00 ` Jason Gunthorpe
2025-02-21 17:03 ` Zhu Yanjun
2025-02-21 17:32   ` Eric Dumazet
2025-02-21 17:39     ` Jacob Moroni
2025-02-22  6:20     ` Zhu Yanjun
2025-02-22  7:38       ` Eric Dumazet
2025-02-22 10:31         ` Zhu Yanjun
2025-04-01 16:18 ` Jason Gunthorpe
2025-04-07 18:41 ` Jason Gunthorpe
  -- strict thread matches above, loose matches on Subject: below --
2025-02-20 17:04 Jacob Moroni
2025-02-20 17:37 ` Eric Dumazet
