[PATCH 0/2] IB/ipoib fixes

public inbox for linux-kernel@vger.kernel.org

* [PATCH 0/2] IB/ipoib fixes
@ 2023-12-11 13:04 Daniel Vacek
  2023-12-11 13:04 ` [PATCH 1/2] IB/ipoib: Fix mcast list locking Daniel Vacek
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Daniel Vacek @ 2023-12-11 13:04 UTC
  To: Jason Gunthorpe, Leon Romanovsky; +Cc: linux-rdma, linux-kernel, Daniel Vacek

The first patch (hopefully) fixes a real issue, while the second is an
unrelated cleanup. They share context, so I am sending them as a series.

Daniel Vacek (2):
  IB/ipoib: Fix mcast list locking
  IB/ipoib: Clean up redundant netif_addr_lock

 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

-- 
2.43.0



* [PATCH 1/2] IB/ipoib: Fix mcast list locking
  2023-12-11 13:04 [PATCH 0/2] IB/ipoib fixes Daniel Vacek
@ 2023-12-11 13:04 ` Daniel Vacek
  2023-12-11 13:45   ` Leon Romanovsky
  2023-12-11 13:04 ` [PATCH 2/2] IB/ipoib: Clean up redundant netif_addr_lock Daniel Vacek
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Daniel Vacek @ 2023-12-11 13:04 UTC
  To: Jason Gunthorpe, Leon Romanovsky
  Cc: linux-rdma, linux-kernel, Daniel Vacek, Yuya Fujita-bishamonten

We need additional protection against list removal between ipoib_mcast_join_task()
and ipoib_mcast_dev_flush() in case the &priv->lock needs to be dropped while
iterating the &priv->multicast_list in ipoib_mcast_join_task(). If the mcast
is removed while the lock is dropped, the for loop spins forever, resulting
in a hard lockup (as was reported on the RHEL 4.18.0-372.75.1.el8_6 kernel):

    Task A (kworker/u72:2 below)       | Task B (kworker/u72:0 below)
    -----------------------------------+-----------------------------------
    ipoib_mcast_join_task(work)        | ipoib_ib_dev_flush_light(work)
      spin_lock_irq(&priv->lock)       | __ipoib_ib_dev_flush(priv, ...)
      list_for_each_entry(mcast,       | ipoib_mcast_dev_flush(dev = priv->dev)
          &priv->multicast_list, list) |   mutex_lock(&priv->mcast_mutex)
        ipoib_mcast_join(dev, mcast)   |
          spin_unlock_irq(&priv->lock) |
                                       |   spin_lock_irqsave(&priv->lock, flags)
                                       |   list_for_each_entry_safe(mcast, tmcast,
                                       |                  &priv->multicast_list, list)
                                       |     list_del(&mcast->list);
                                       |     list_add_tail(&mcast->list, &remove_list)
                                       |   spin_unlock_irqrestore(&priv->lock, flags)
          spin_lock_irq(&priv->lock)   |
                                       |   ipoib_mcast_remove_list(&remove_list)
   (Here, mcast is no longer on the    |     list_for_each_entry_safe(mcast, tmcast,
    &priv->multicast_list and we keep  |                            remove_list, list)
    spinning on the &remove_list of the \ >>>  wait_for_completion(&mcast->done)
    other thread which is blocked and the|
    list is still valid on its stack.)   | mutex_unlock(&priv->mcast_mutex)

Fix this by adding mutex_lock(&priv->mcast_mutex) to ipoib_mcast_join_task().
Unfortunately, we could not reproduce the lockup and confirm this fix, but
based on code review I think it should address such lockups.
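
The lock ordering this relies on can be sketched in userspace. This is only an
illustration under stated assumptions: pthread mutexes stand in for both
&priv->mcast_mutex and the &priv->lock spinlock, flush_try() and
join_task_window() are hypothetical names rather than driver functions, and
the flusher is modelled as a trylock issued from the same thread inside the
window instead of a real second worker.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-ins (assumptions): outer mutex and inner "spinlock". */
static pthread_mutex_t mcast_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t priv_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical flusher: like ipoib_mcast_dev_flush(), it must own the
 * outer mutex before it may take the inner lock and move list entries.
 * Returns 1 if it got in, 0 if the outer mutex kept it out.
 */
static int flush_try(void)
{
	int rc;

	if (pthread_mutex_trylock(&mcast_mutex) != 0)
		return 0;	/* excluded by the outer mutex */

	rc = pthread_mutex_lock(&priv_lock);
	assert(rc == 0);
	(void)rc;
	/* ... entries would be moved to a private remove list here ... */
	pthread_mutex_unlock(&priv_lock);
	pthread_mutex_unlock(&mcast_mutex);
	return 1;
}

/*
 * Hypothetical join task: walks the list under the inner lock and has to
 * drop it once in the middle of the walk. Returns whether a flusher
 * managed to run inside that window.
 */
static int join_task_window(int hold_outer_mutex)
{
	int flushed;

	if (hold_outer_mutex)
		pthread_mutex_lock(&mcast_mutex);
	pthread_mutex_lock(&priv_lock);

	/* ... iterating; now the inner lock must be dropped ... */
	pthread_mutex_unlock(&priv_lock);
	flushed = flush_try();		/* models the racing flush work */
	pthread_mutex_lock(&priv_lock);

	/* ... iteration continues here ... */
	pthread_mutex_unlock(&priv_lock);
	if (hold_outer_mutex)
		pthread_mutex_unlock(&mcast_mutex);
	return flushed;
}
```

In this model, join_task_window(0) lets the flusher in during the window,
while join_task_window(1) keeps it out until the walk is finished.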

crash> bc 31
PID: 747      TASK: ff1c6a1a007e8000  CPU: 31   COMMAND: "kworker/u72:2"
--
    [exception RIP: ipoib_mcast_join_task+0x1b1]
    RIP: ffffffffc0944ac1  RSP: ff646f199a8c7e00  RFLAGS: 00000002
    RAX: 0000000000000000  RBX: ff1c6a1a04dc82f8  RCX: 0000000000000000
                                  work (&priv->mcast_task{,.work})
    RDX: ff1c6a192d60ac68  RSI: 0000000000000286  RDI: ff1c6a1a04dc8000
           &mcast->list
    RBP: ff646f199a8c7e90   R8: ff1c699980019420   R9: ff1c6a1920c9a000
    R10: ff646f199a8c7e00  R11: ff1c6a191a7d9800  R12: ff1c6a192d60ac00
                                                         mcast
    R13: ff1c6a1d82200000  R14: ff1c6a1a04dc8000  R15: ff1c6a1a04dc82d8
           dev                    priv (&priv->lock)     &priv->multicast_list (aka head)
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #5 [ff646f199a8c7e00] ipoib_mcast_join_task+0x1b1 at ffffffffc0944ac1 [ib_ipoib]
 #6 [ff646f199a8c7e98] process_one_work+0x1a7 at ffffffff9bf10967

crash> rx ff646f199a8c7e68
ff646f199a8c7e68:  ff1c6a1a04dc82f8 <<< work = &priv->mcast_task.work

crash> list -hO ipoib_dev_priv.multicast_list ff1c6a1a04dc8000
(empty)

crash> ipoib_dev_priv.mcast_task.work.func,mcast_mutex.owner.counter ff1c6a1a04dc8000
  mcast_task.work.func = 0xffffffffc0944910 <ipoib_mcast_join_task>,
  mcast_mutex.owner.counter = 0xff1c69998efec000

crash> b 8
PID: 8        TASK: ff1c69998efec000  CPU: 33   COMMAND: "kworker/u72:0"
--
 #3 [ff646f1980153d50] wait_for_completion+0x96 at ffffffff9c7d7646
 #4 [ff646f1980153d90] ipoib_mcast_remove_list+0x56 at ffffffffc0944dc6 [ib_ipoib]
 #5 [ff646f1980153de8] ipoib_mcast_dev_flush+0x1a7 at ffffffffc09455a7 [ib_ipoib]
 #6 [ff646f1980153e58] __ipoib_ib_dev_flush+0x1a4 at ffffffffc09431a4 [ib_ipoib]
 #7 [ff646f1980153e98] process_one_work+0x1a7 at ffffffff9bf10967

crash> rx ff646f1980153e68
ff646f1980153e68:  ff1c6a1a04dc83f0 <<< work = &priv->flush_light

crash> ipoib_dev_priv.flush_light.func,broadcast ff1c6a1a04dc8000
  flush_light.func = 0xffffffffc0943820 <ipoib_ib_dev_flush_light>,
  broadcast = 0x0,

The mcast(s) on the &remove_list (the remaining part of the ex &priv->multicast_list):

crash> list -s ipoib_mcast.done.done ipoib_mcast.list -H ff646f1980153e10 | paste - -
ff1c6a192bd0c200	  done.done = 0x0,
ff1c6a192d60ac00	  done.done = 0x0,

Reported-by: Yuya Fujita-bishamonten <fj-lsoft-rh-driver@dl.jp.fujitsu.com>
Signed-off-by: Daniel Vacek <neelx@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 5b3154503bf4..8e4f2c8839be 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -580,6 +580,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	}
 	netif_addr_unlock_bh(dev);
 
+	mutex_lock(&priv->mcast_mutex);
 	spin_lock_irq(&priv->lock);
 	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
 		goto out;
@@ -634,6 +635,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 				/* Found the next unjoined group */
 				if (ipoib_mcast_join(dev, mcast)) {
 					spin_unlock_irq(&priv->lock);
+					mutex_unlock(&priv->mcast_mutex);
 					return;
 				}
 			} else if (!delay_until ||
@@ -655,6 +657,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		ipoib_mcast_join(dev, mcast);
 
 	spin_unlock_irq(&priv->lock);
+	mutex_unlock(&priv->mcast_mutex);
 }
 
 void ipoib_mcast_start_thread(struct net_device *dev)
-- 
2.43.0



* Re: [PATCH 1/2] IB/ipoib: Fix mcast list locking
  2023-12-11 13:04 ` [PATCH 1/2] IB/ipoib: Fix mcast list locking Daniel Vacek
@ 2023-12-11 13:45   ` Leon Romanovsky
  2023-12-11 14:25     ` Daniel Vacek
  0 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2023-12-11 13:45 UTC
  To: Daniel Vacek
  Cc: Jason Gunthorpe, linux-rdma, linux-kernel,
	Yuya Fujita-bishamonten

On Mon, Dec 11, 2023 at 02:04:24PM +0100, Daniel Vacek wrote:
> We need an additional protection against list removal between ipoib_mcast_join_task()
> and ipoib_mcast_dev_flush() in case the &priv->lock needs to be dropped while
> iterating the &priv->multicast_list in ipoib_mcast_join_task(). If the mcast
> is removed while the lock was dropped, the for loop spins forever resulting
> in a hard lockup (as was reported on RHEL 4.18.0-372.75.1.el8_6 kernel):
> 
>     Task A (kworker/u72:2 below)       | Task B (kworker/u72:0 below)
>     -----------------------------------+-----------------------------------
>     ipoib_mcast_join_task(work)        | ipoib_ib_dev_flush_light(work)
>       spin_lock_irq(&priv->lock)       | __ipoib_ib_dev_flush(priv, ...)
>       list_for_each_entry(mcast,       | ipoib_mcast_dev_flush(dev = priv->dev)
>           &priv->multicast_list, list) |   mutex_lock(&priv->mcast_mutex)
>         ipoib_mcast_join(dev, mcast)   |
>           spin_unlock_irq(&priv->lock) |
>                                        |   spin_lock_irqsave(&priv->lock, flags)
>                                        |   list_for_each_entry_safe(mcast, tmcast,
>                                        |                  &priv->multicast_list, list)
>                                        |     list_del(&mcast->list);
>                                        |     list_add_tail(&mcast->list, &remove_list)
>                                        |   spin_unlock_irqrestore(&priv->lock, flags)
>           spin_lock_irq(&priv->lock)   |
>                                        |   ipoib_mcast_remove_list(&remove_list)
>    (Here, mcast is no longer on the    |     list_for_each_entry_safe(mcast, tmcast,
>     &priv->multicast_list and we keep  |                            remove_list, list)
>     spinning on the &remove_list of the \ >>>  wait_for_completion(&mcast->done)
>     other thread which is blocked and the|
>     list is still valid on its stack.)   | mutex_unlock(&priv->mcast_mutex)
> 
> Fix this by adding mutex_lock(&priv->mcast_mutex) to ipoib_mcast_join_task().

I don't entirely understand the issue and the proposed solution.
There is only one spin_unlock_irq() in the middle of list_for_each_entry(mcast, &priv->multicast_list, list)
and it is right before the return statement, which will break the loop. So
how will the loop spin forever?

Thanks



* Re: [PATCH 1/2] IB/ipoib: Fix mcast list locking
  2023-12-11 13:45   ` Leon Romanovsky
@ 2023-12-11 14:25     ` Daniel Vacek
  2023-12-11 15:06       ` Leon Romanovsky
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vacek @ 2023-12-11 14:25 UTC
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, linux-rdma, linux-kernel,
	Yuya Fujita-bishamonten

On Mon, Dec 11, 2023 at 2:45 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Dec 11, 2023 at 02:04:24PM +0100, Daniel Vacek wrote:
> > We need an additional protection against list removal between ipoib_mcast_join_task()
> > and ipoib_mcast_dev_flush() in case the &priv->lock needs to be dropped while
> > iterating the &priv->multicast_list in ipoib_mcast_join_task(). If the mcast
> > is removed while the lock was dropped, the for loop spins forever resulting
> > in a hard lockup (as was reported on RHEL 4.18.0-372.75.1.el8_6 kernel):
> >
> >     Task A (kworker/u72:2 below)       | Task B (kworker/u72:0 below)
> >     -----------------------------------+-----------------------------------
> >     ipoib_mcast_join_task(work)        | ipoib_ib_dev_flush_light(work)
> >       spin_lock_irq(&priv->lock)       | __ipoib_ib_dev_flush(priv, ...)
> >       list_for_each_entry(mcast,       | ipoib_mcast_dev_flush(dev = priv->dev)
> >           &priv->multicast_list, list) |   mutex_lock(&priv->mcast_mutex)
> >         ipoib_mcast_join(dev, mcast)   |
> >           spin_unlock_irq(&priv->lock) |
> >                                        |   spin_lock_irqsave(&priv->lock, flags)
> >                                        |   list_for_each_entry_safe(mcast, tmcast,
> >                                        |                  &priv->multicast_list, list)
> >                                        |     list_del(&mcast->list);
> >                                        |     list_add_tail(&mcast->list, &remove_list)
> >                                        |   spin_unlock_irqrestore(&priv->lock, flags)
> >           spin_lock_irq(&priv->lock)   |
> >                                        |   ipoib_mcast_remove_list(&remove_list)
> >    (Here, mcast is no longer on the    |     list_for_each_entry_safe(mcast, tmcast,
> >     &priv->multicast_list and we keep  |                            remove_list, list)
> >     spinning on the &remove_list of the \ >>>  wait_for_completion(&mcast->done)
> >     other thread which is blocked and the|
> >     list is still valid on its stack.)   | mutex_unlock(&priv->mcast_mutex)
> >
> > Fix this by adding mutex_lock(&priv->mcast_mutex) to ipoib_mcast_join_task().
>
> I don't entirely understand the issue and the proposed solution.
> There is only one spin_unlock_irq() in the middle of list_for_each_entry(mcast, &priv->multicast_list, list)
> and it is right before return statement which will break the loop. So
> how will loop spin forever?

There's another unlock/lock pair around the ib_sa_join_multicast() call in
ipoib_mcast_join(), no matter the outcome of the condition.
ib_sa_join_multicast() cannot be called with the lock held because its
GFP_KERNEL allocation can sleep. That's what's causing the issue.

Actually, if you check the code, the lock is released and re-acquired only
if the mentioned condition is false (so the loop is not broken and
ipoib_mcast_join_task() does not return), creating the window for
ipoib_ib_dev_flush_light()/ipoib_mcast_dev_flush() to break the list.
The vmcore data confirms that clearly.
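
To make the failure mode concrete, here is a minimal userspace sketch. It
assumes simplified stand-ins for the kernel list helpers and is not the
driver code; steps_until_head() and walk_with_concurrent_flush() are
hypothetical names, and a step limit replaces the real endless spin. The
walker is parked on an entry of multicast_list while the flusher moves every
entry to its own remove_list, so the walker's end marker is never reached.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/*
 * Follow ->next from 'pos' until 'head' is seen, which is how a
 * list_for_each_entry() style loop decides to stop. Returns the number
 * of steps, or -1 if 'head' is not reached within 'limit' steps.
 */
static int steps_until_head(const struct list_head *pos,
			    const struct list_head *head, int limit)
{
	int steps = 0;

	while (pos != head) {
		if (++steps > limit)
			return -1;
		pos = pos->next;
	}
	return steps;
}

/* Hypothetical scenario with two entries, 'a' and 'b'. */
static int walk_with_concurrent_flush(int flush)
{
	struct list_head multicast_list, remove_list, a, b;
	struct list_head *cursor;

	list_init(&multicast_list);
	list_init(&remove_list);
	list_add_tail(&a, &multicast_list);
	list_add_tail(&b, &multicast_list);

	cursor = multicast_list.next;	/* walker is parked on 'a' */
	assert(cursor == &a);

	if (flush) {
		/* Flusher runs while the walker's lock is dropped. */
		list_del(&a);
		list_add_tail(&a, &remove_list);
		list_del(&b);
		list_add_tail(&b, &remove_list);
	}

	/* Walker resumes, still expecting &multicast_list as the end. */
	return steps_until_head(cursor, &multicast_list, 1000);
}
```

Without the flush the walk ends after two steps. With it, the cursor only
ever cycles through a, b and &remove_list, which matches the empty
multicast_list and the two mcasts on the remove_list seen in the vmcore.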

--nX





* Re: [PATCH 1/2] IB/ipoib: Fix mcast list locking
  2023-12-11 14:25     ` Daniel Vacek
@ 2023-12-11 15:06       ` Leon Romanovsky
  2023-12-11 16:00         ` Daniel Vacek
  0 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2023-12-11 15:06 UTC
  To: Daniel Vacek
  Cc: Jason Gunthorpe, linux-rdma, linux-kernel,
	Yuya Fujita-bishamonten

On Mon, Dec 11, 2023 at 03:25:39PM +0100, Daniel Vacek wrote:
> On Mon, Dec 11, 2023 at 2:45 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Mon, Dec 11, 2023 at 02:04:24PM +0100, Daniel Vacek wrote:
> > > We need an additional protection against list removal between ipoib_mcast_join_task()
> > > and ipoib_mcast_dev_flush() in case the &priv->lock needs to be dropped while
> > > iterating the &priv->multicast_list in ipoib_mcast_join_task(). If the mcast
> > > is removed while the lock was dropped, the for loop spins forever resulting
> > > in a hard lockup (as was reported on RHEL 4.18.0-372.75.1.el8_6 kernel):
> > >
> > >     Task A (kworker/u72:2 below)       | Task B (kworker/u72:0 below)
> > >     -----------------------------------+-----------------------------------
> > >     ipoib_mcast_join_task(work)        | ipoib_ib_dev_flush_light(work)
> > >       spin_lock_irq(&priv->lock)       | __ipoib_ib_dev_flush(priv, ...)
> > >       list_for_each_entry(mcast,       | ipoib_mcast_dev_flush(dev = priv->dev)
> > >           &priv->multicast_list, list) |   mutex_lock(&priv->mcast_mutex)
> > >         ipoib_mcast_join(dev, mcast)   |
> > >           spin_unlock_irq(&priv->lock) |
> > >                                        |   spin_lock_irqsave(&priv->lock, flags)
> > >                                        |   list_for_each_entry_safe(mcast, tmcast,
> > >                                        |                  &priv->multicast_list, list)
> > >                                        |     list_del(&mcast->list);
> > >                                        |     list_add_tail(&mcast->list, &remove_list)
> > >                                        |   spin_unlock_irqrestore(&priv->lock, flags)
> > >           spin_lock_irq(&priv->lock)   |
> > >                                        |   ipoib_mcast_remove_list(&remove_list)
> > >    (Here, mcast is no longer on the    |     list_for_each_entry_safe(mcast, tmcast,
> > >     &priv->multicast_list and we keep  |                            remove_list, list)
> > >     spinning on the &remove_list of the \ >>>  wait_for_completion(&mcast->done)
> > >     other thread which is blocked and the|
> > >     list is still valid on its stack.)   | mutex_unlock(&priv->mcast_mutex)
> > >
> > > Fix this by adding mutex_lock(&priv->mcast_mutex) to ipoib_mcast_join_task().
> >
> > I don't entirely understand the issue and the proposed solution.
> > There is only one spin_unlock_irq() in the middle of list_for_each_entry(mcast, &priv->multicast_list, list)
> > and it is right before return statement which will break the loop. So
> > how will loop spin forever?
> 
> There's another unlock/lock pair around ib_sa_join_multicast() call in
> ipoib_mcast_join() no matter the outcome of the condition. The
> ib_sa_join_multicast() cannot be called with the lock being held due
> to GFP_KERNEL allocation can possibly sleep. That's what's causing the
> issue.
> 
> Actually if you check the code, only if the mentioned condition is
> false (and the loop is not broken and returned from
> ipoib_mcast_join_task()) the lock is released and re-acquired,
> creating the window for
> ipoib_ib_dev_flush_light()/ipoib_mcast_dev_flush() to break the list.
> The vmcore data shows/confirms that clearly.

Thanks, it is clearer now.

What about the following change instead of adding an extra lock to the
already overly complicated IPoIB?

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 5b3154503bf4..bca80fe07584 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -531,21 +531,17 @@ static int ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast)
                if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
                        rec.join_state = SENDONLY_FULLMEMBER_JOIN;
        }
-       spin_unlock_irq(&priv->lock);

        multicast = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
-                                        &rec, comp_mask, GFP_KERNEL,
+                                        &rec, comp_mask, GFP_ATOMIC,
                                         ipoib_mcast_join_complete, mcast);
-       spin_lock_irq(&priv->lock);
        if (IS_ERR(multicast)) {
                ret = PTR_ERR(multicast);
                ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret);
                /* Requeue this join task with a backoff delay */
                __ipoib_mcast_schedule_join_thread(priv, mcast, 1);
                clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-               spin_unlock_irq(&priv->lock);
                complete(&mcast->done);
-               spin_lock_irq(&priv->lock);
        }
        return 0;
 }




* Re: [PATCH 1/2] IB/ipoib: Fix mcast list locking
  2023-12-11 15:06       ` Leon Romanovsky
@ 2023-12-11 16:00         ` Daniel Vacek
  2023-12-12  7:00           ` Leon Romanovsky
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vacek @ 2023-12-11 16:00 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, linux-rdma, linux-kernel,
	Yuya Fujita-bishamonten

On Mon, Dec 11, 2023 at 4:18 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> What about the following change instead of adding an extra lock to the
> already overly complicated IPoIB?

Yeah, that's the other option, and I believe it should work as well. It
also simplifies the code nicely.

The allocated mcast_member and mcast_group structures are small enough
that slab (by default) should not need more than an order-1 block to
eventually extend/refill the full kmalloc-256 cache. Some arches will
even use order 0, I believe.
And unless I'm missing something, I do not see any other sleeps in that path.

That said, as long as you are fine with occasional failures under
memory pressure, it looks OK to me.
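
To put rough numbers on the slab estimate above, here is a minimal
user-space sketch. It assumes 4 KiB pages and no per-slab metadata or
debug padding (both are assumptions), and `objs_per_slab` is a
hypothetical helper, not a kernel function:

```c
#include <stddef.h>

#define PAGE_SIZE_ASSUMED 4096UL /* assumption: 4 KiB pages */
#define OBJ_SIZE          256UL  /* object size of the kmalloc-256 cache */

/* Hypothetical helper: bytes in one slab of the given page order. */
static unsigned long slab_bytes(unsigned int order)
{
	return PAGE_SIZE_ASSUMED << order;
}

/* Hypothetical helper: 256-byte objects that fit in one such slab. */
static unsigned long objs_per_slab(unsigned int order)
{
	return slab_bytes(order) / OBJ_SIZE;
}
```

Under those assumptions an order-0 slab holds 16 objects and an order-1
slab holds 32, so a refill would need at most two contiguous pages.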

--nX

> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index 5b3154503bf4..bca80fe07584 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -531,21 +531,17 @@ static int ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast)
>                 if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
>                         rec.join_state = SENDONLY_FULLMEMBER_JOIN;
>         }
> -       spin_unlock_irq(&priv->lock);
>
>         multicast = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
> -                                        &rec, comp_mask, GFP_KERNEL,
> +                                        &rec, comp_mask, GFP_ATOMIC,
>                                          ipoib_mcast_join_complete, mcast);
> -       spin_lock_irq(&priv->lock);
>         if (IS_ERR(multicast)) {
>                 ret = PTR_ERR(multicast);
>                 ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret);
>                 /* Requeue this join task with a backoff delay */
>                 __ipoib_mcast_schedule_join_thread(priv, mcast, 1);
>                 clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> -               spin_unlock_irq(&priv->lock);
>                 complete(&mcast->done);
> -               spin_lock_irq(&priv->lock);
>         }
>         return 0;
>  }
>
>

* Re: [PATCH 1/2] IB/ipoib: Fix mcast list locking
  2023-12-11 16:00         ` Daniel Vacek
@ 2023-12-12  7:00           ` Leon Romanovsky
  0 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2023-12-12  7:00 UTC (permalink / raw)
  To: Daniel Vacek
  Cc: Jason Gunthorpe, linux-rdma, linux-kernel,
	Yuya Fujita-bishamonten

On Mon, Dec 11, 2023 at 05:00:11PM +0100, Daniel Vacek wrote:
> On Mon, Dec 11, 2023 at 4:18 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > What about the following change instead of adding an extra lock to the
> > already overly complicated IPoIB?
> 
> Yeah, that's the other option, and I believe it should work as well. It
> also simplifies the code nicely.
> 
> The allocated mcast_member and mcast_group structures are small enough
> that slab (by default) should not need more than an order-1 block to
> eventually extend/refill the full kmalloc-256 cache. Some arches will
> even use order 0, I believe.
> And unless I'm missing something, I do not see any other sleeps in that path.
> 
> That said, as long as you are fine with occasional failures under
> memory pressure, it looks OK to me.

Yes, IMHO changing GFP_KERNEL to GFP_ATOMIC is safer than adding an extra lock.

Thanks

> 
> --nX
> 
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > index 5b3154503bf4..bca80fe07584 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > @@ -531,21 +531,17 @@ static int ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast)
> >                 if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
> >                         rec.join_state = SENDONLY_FULLMEMBER_JOIN;
> >         }
> > -       spin_unlock_irq(&priv->lock);
> >
> >         multicast = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
> > -                                        &rec, comp_mask, GFP_KERNEL,
> > +                                        &rec, comp_mask, GFP_ATOMIC,
> >                                          ipoib_mcast_join_complete, mcast);
> > -       spin_lock_irq(&priv->lock);
> >         if (IS_ERR(multicast)) {
> >                 ret = PTR_ERR(multicast);
> >                 ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret);
> >                 /* Requeue this join task with a backoff delay */
> >                 __ipoib_mcast_schedule_join_thread(priv, mcast, 1);
> >                 clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> > -               spin_unlock_irq(&priv->lock);
> >                 complete(&mcast->done);
> > -               spin_lock_irq(&priv->lock);
> >         }
> >         return 0;
> >  }
> >
> >

* [PATCH 2/2] IB/ipoib: Clean up redundant netif_addr_lock
  2023-12-11 13:04 [PATCH 0/2] IB/ipoib fixes Daniel Vacek
  2023-12-11 13:04 ` [PATCH 1/2] IB/ipoib: Fix mcast list locking Daniel Vacek
@ 2023-12-11 13:04 ` Daniel Vacek
  2023-12-12  8:07 ` [PATCH v2] IB/ipoib: Fix mcast list locking Daniel Vacek
  2023-12-12  8:29 ` [PATCH 0/2] IB/ipoib fixes Leon Romanovsky
  3 siblings, 0 replies; 12+ messages in thread
From: Daniel Vacek @ 2023-12-11 13:04 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky; +Cc: linux-rdma, linux-kernel, Daniel Vacek

A single memory load does not need to be protected by any lock.
The same priv->flags is already fetched 15 lines earlier without locking anyway.

Signed-off-by: Daniel Vacek <neelx@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 8e4f2c8839be..f54e0d212630 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -572,13 +572,9 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		return;
 	}
 	priv->local_lid = port_attr.lid;
-	netif_addr_lock_bh(dev);
 
-	if (!test_bit(IPOIB_FLAG_DEV_ADDR_SET, &priv->flags)) {
-		netif_addr_unlock_bh(dev);
+	if (!test_bit(IPOIB_FLAG_DEV_ADDR_SET, &priv->flags))
 		return;
-	}
-	netif_addr_unlock_bh(dev);
 
 	mutex_lock(&priv->mcast_mutex);
 	spin_lock_irq(&priv->lock);
-- 
2.43.0
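
The argument in the commit message above can be illustrated in user
space. This is only an analogue using C11 atomics, not the kernel's
test_bit() implementation, and all `demo_*` names and the bit number are
hypothetical. Reading one flag is a single atomic load; wrapping it in a
lock/unlock pair would not keep the value from changing right after the
unlock:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_FLAG_DEV_ADDR_SET 3 /* hypothetical bit number */

/* Stand-in for priv->flags; zero-initialized at program start. */
static atomic_ulong demo_flags;

/* Set one bit with an atomic read-modify-write. */
static void demo_set_bit(unsigned int nr)
{
	atomic_fetch_or(&demo_flags, 1UL << nr);
}

/* Test one bit with a single relaxed load; no lock is taken. */
static bool demo_test_bit(unsigned int nr)
{
	unsigned long v = atomic_load_explicit(&demo_flags,
					       memory_order_relaxed);

	return (v >> nr) & 1UL;
}
```

A lock would only matter here if the reader had to act on the flag
together with other state that the same lock protects.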



* [PATCH v2] IB/ipoib: Fix mcast list locking
  2023-12-11 13:04 [PATCH 0/2] IB/ipoib fixes Daniel Vacek
  2023-12-11 13:04 ` [PATCH 1/2] IB/ipoib: Fix mcast list locking Daniel Vacek
  2023-12-11 13:04 ` [PATCH 2/2] IB/ipoib: Clean up redundant netif_addr_lock Daniel Vacek
@ 2023-12-12  8:07 ` Daniel Vacek
  2023-12-12  8:29 ` [PATCH 0/2] IB/ipoib fixes Leon Romanovsky
  3 siblings, 0 replies; 12+ messages in thread
From: Daniel Vacek @ 2023-12-12  8:07 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky
  Cc: linux-rdma, linux-kernel, Daniel Vacek, Yuya Fujita-bishamonten

Releasing the `priv->lock` while iterating the `priv->multicast_list` in
`ipoib_mcast_join_task()` opens a window for `ipoib_mcast_dev_flush()` to
remove items in the middle of the iteration. If the mcast is removed
while the lock is dropped, the for loop spins forever, resulting in a hard
lockup (as was reported on the RHEL 4.18.0-372.75.1.el8_6 kernel):

    Task A (kworker/u72:2 below)       | Task B (kworker/u72:0 below)
    -----------------------------------+-----------------------------------
    ipoib_mcast_join_task(work)        | ipoib_ib_dev_flush_light(work)
      spin_lock_irq(&priv->lock)       | __ipoib_ib_dev_flush(priv, ...)
      list_for_each_entry(mcast,       | ipoib_mcast_dev_flush(dev = priv->dev)
          &priv->multicast_list, list) |
        ipoib_mcast_join(dev, mcast)   |
          spin_unlock_irq(&priv->lock) |
                                       |   spin_lock_irqsave(&priv->lock, flags)
                                       |   list_for_each_entry_safe(mcast, tmcast,
                                       |                  &priv->multicast_list, list)
                                       |     list_del(&mcast->list);
                                       |     list_add_tail(&mcast->list, &remove_list)
                                       |   spin_unlock_irqrestore(&priv->lock, flags)
          spin_lock_irq(&priv->lock)   |
                                       |   ipoib_mcast_remove_list(&remove_list)
   (Here, `mcast` is no longer on the  |     list_for_each_entry_safe(mcast, tmcast,
    `priv->multicast_list` and we keep |                            remove_list, list)
    spinning on the `remove_list` of   |  >>>  wait_for_completion(&mcast->done)
    the other thread which is blocked  |
    and the list is still valid on     |
    its stack.)

Fix this by keeping the lock held and changing to GFP_ATOMIC to prevent
possible sleeps.
Unfortunately, we could not reproduce the lockup to confirm this fix, but
based on code review I think it should address such lockups.

crash> bc 31
PID: 747      TASK: ff1c6a1a007e8000  CPU: 31   COMMAND: "kworker/u72:2"
--
    [exception RIP: ipoib_mcast_join_task+0x1b1]
    RIP: ffffffffc0944ac1  RSP: ff646f199a8c7e00  RFLAGS: 00000002
    RAX: 0000000000000000  RBX: ff1c6a1a04dc82f8  RCX: 0000000000000000
                                  work (&priv->mcast_task{,.work})
    RDX: ff1c6a192d60ac68  RSI: 0000000000000286  RDI: ff1c6a1a04dc8000
           &mcast->list
    RBP: ff646f199a8c7e90   R8: ff1c699980019420   R9: ff1c6a1920c9a000
    R10: ff646f199a8c7e00  R11: ff1c6a191a7d9800  R12: ff1c6a192d60ac00
                                                         mcast
    R13: ff1c6a1d82200000  R14: ff1c6a1a04dc8000  R15: ff1c6a1a04dc82d8
           dev                    priv (&priv->lock)     &priv->multicast_list (aka head)
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #5 [ff646f199a8c7e00] ipoib_mcast_join_task+0x1b1 at ffffffffc0944ac1 [ib_ipoib]
 #6 [ff646f199a8c7e98] process_one_work+0x1a7 at ffffffff9bf10967

crash> rx ff646f199a8c7e68
ff646f199a8c7e68:  ff1c6a1a04dc82f8 <<< work = &priv->mcast_task.work

crash> list -hO ipoib_dev_priv.multicast_list ff1c6a1a04dc8000
(empty)

crash> ipoib_dev_priv.mcast_task.work.func,mcast_mutex.owner.counter ff1c6a1a04dc8000
  mcast_task.work.func = 0xffffffffc0944910 <ipoib_mcast_join_task>,
  mcast_mutex.owner.counter = 0xff1c69998efec000

crash> b 8
PID: 8        TASK: ff1c69998efec000  CPU: 33   COMMAND: "kworker/u72:0"
--
 #3 [ff646f1980153d50] wait_for_completion+0x96 at ffffffff9c7d7646
 #4 [ff646f1980153d90] ipoib_mcast_remove_list+0x56 at ffffffffc0944dc6 [ib_ipoib]
 #5 [ff646f1980153de8] ipoib_mcast_dev_flush+0x1a7 at ffffffffc09455a7 [ib_ipoib]
 #6 [ff646f1980153e58] __ipoib_ib_dev_flush+0x1a4 at ffffffffc09431a4 [ib_ipoib]
 #7 [ff646f1980153e98] process_one_work+0x1a7 at ffffffff9bf10967

crash> rx ff646f1980153e68
ff646f1980153e68:  ff1c6a1a04dc83f0 <<< work = &priv->flush_light

crash> ipoib_dev_priv.flush_light.func,broadcast ff1c6a1a04dc8000
  flush_light.func = 0xffffffffc0943820 <ipoib_ib_dev_flush_light>,
  broadcast = 0x0,

The mcast(s) on the `remove_list` (the remaining part of the former `priv->multicast_list`):

crash> list -s ipoib_mcast.done.done ipoib_mcast.list -H ff646f1980153e10 | paste - -
ff1c6a192bd0c200	  done.done = 0x0,
ff1c6a192d60ac00	  done.done = 0x0,

Reported-by: Yuya Fujita-bishamonten <fj-lsoft-rh-driver@dl.jp.fujitsu.com>
Signed-off-by: Daniel Vacek <neelx@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 5b3154503bf4..bca80fe07584 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -531,21 +531,17 @@ static int ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast)
 		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
 			rec.join_state = SENDONLY_FULLMEMBER_JOIN;
 	}
-	spin_unlock_irq(&priv->lock);
 
 	multicast = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
-					 &rec, comp_mask, GFP_KERNEL,
+					 &rec, comp_mask, GFP_ATOMIC,
 					 ipoib_mcast_join_complete, mcast);
-	spin_lock_irq(&priv->lock);
 	if (IS_ERR(multicast)) {
 		ret = PTR_ERR(multicast);
 		ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret);
 		/* Requeue this join task with a backoff delay */
 		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
 		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-		spin_unlock_irq(&priv->lock);
 		complete(&mcast->done);
-		spin_lock_irq(&priv->lock);
 	}
 	return 0;
 }
-- 
2.43.0
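
The endless loop described in the commit message above can be modeled
in user space. The sketch below uses simplified stand-ins for the
kernel's list helpers (same names, reduced behavior, no poisoning), and
`walk_terminates` is a hypothetical function written for this
illustration. It walks a two-entry list and, optionally, moves every
entry to a second list in the middle of the walk, as the flush task
does while the lock is dropped:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's circular doubly linked list. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/*
 * Walk 'head' until the cursor returns to it. If move_mid_walk is set,
 * all entries are moved to 'remove_list' while the cursor sits on the
 * first entry. Returns false if the walk exceeds max_steps, i.e. it
 * would never come back to 'head'.
 */
static bool walk_terminates(bool move_mid_walk, int max_steps)
{
	struct list_head head, remove_list, a, b, *pos;
	int steps = 0;

	list_init(&head);
	list_init(&remove_list);
	list_add_tail(&a, &head);
	list_add_tail(&b, &head);

	for (pos = head.next; pos != &head; pos = pos->next) {
		if (++steps > max_steps)
			return false; /* cursor is circling remove_list */
		if (move_mid_walk && pos == &a) {
			/* The "other task": empty head onto remove_list. */
			while (head.next != &head) {
				struct list_head *n = head.next;

				list_del(n);
				list_add_tail(n, &remove_list);
			}
		}
	}
	return true;
}
```

With the move enabled, the cursor ends up circling `remove_list`, whose
head is never equal to `&head`, so only the step limit ends the walk.
Holding the lock across the whole iteration rules out the move.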



* Re: [PATCH 0/2] IB/ipoib fixes
  2023-12-11 13:04 [PATCH 0/2] IB/ipoib fixes Daniel Vacek
                   ` (2 preceding siblings ...)
  2023-12-12  8:07 ` [PATCH v2] IB/ipoib: Fix mcast list locking Daniel Vacek
@ 2023-12-12  8:29 ` Leon Romanovsky
  2023-12-13 12:18   ` Daniel Vacek
  3 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2023-12-12  8:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Daniel Vacek; +Cc: linux-rdma, linux-kernel


On Mon, 11 Dec 2023 14:04:23 +0100, Daniel Vacek wrote:
> The first patch (hopefully) fixes a real issue while the second is an
> unrelated cleanup. But it shares a context so sending as a series.
> 
> Daniel Vacek (2):
>   IB/ipoib: Fix mcast list locking
>   IB/ipoib: Clean up redundant netif_addr_lock
> 
> [...]

Applied, thanks!

[1/1] IB/ipoib: Fix mcast list locking
      https://git.kernel.org/rdma/rdma/c/4f973e211b3b1c

Best regards,
-- 
Leon Romanovsky <leon@kernel.org>


* Re: [PATCH 0/2] IB/ipoib fixes
  2023-12-12  8:29 ` [PATCH 0/2] IB/ipoib fixes Leon Romanovsky
@ 2023-12-13 12:18   ` Daniel Vacek
  2023-12-17 12:50     ` Leon Romanovsky
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel Vacek @ 2023-12-13 12:18 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Jason Gunthorpe, linux-rdma, linux-kernel

On Tue, Dec 12, 2023 at 9:29 AM Leon Romanovsky <leon@kernel.org> wrote:
>
>
> On Mon, 11 Dec 2023 14:04:23 +0100, Daniel Vacek wrote:
> > The first patch (hopefully) fixes a real issue while the second is an
> > unrelated cleanup. But it shares a context so sending as a series.
> >
> > Daniel Vacek (2):
> >   IB/ipoib: Fix mcast list locking
> >   IB/ipoib: Clean up redundant netif_addr_lock
> >
> > [...]
>
> Applied, thanks!

Thank you.

One small detail - I was asked by Yuya to change the "Reported-by" as follows:

---
Reported-by: Yuya Fujita <fujita.yuya-00@fujitsu.com>
---

Would that be possible? And if yes, could you amend the commit
yourself or do you want me to send a v3?

--nX


> [1/1] IB/ipoib: Fix mcast list locking
>       https://git.kernel.org/rdma/rdma/c/4f973e211b3b1c
>
> Best regards,
> --
> Leon Romanovsky <leon@kernel.org>
>



* Re: [PATCH 0/2] IB/ipoib fixes
  2023-12-13 12:18   ` Daniel Vacek
@ 2023-12-17 12:50     ` Leon Romanovsky
  0 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2023-12-17 12:50 UTC (permalink / raw)
  To: Daniel Vacek; +Cc: Jason Gunthorpe, linux-rdma, linux-kernel

On Wed, Dec 13, 2023 at 01:18:26PM +0100, Daniel Vacek wrote:
> On Tue, Dec 12, 2023 at 9:29 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> >
> > On Mon, 11 Dec 2023 14:04:23 +0100, Daniel Vacek wrote:
> > > The first patch (hopefully) fixes a real issue while the second is an
> > > unrelated cleanup. But it shares a context so sending as a series.
> > >
> > > Daniel Vacek (2):
> > >   IB/ipoib: Fix mcast list locking
> > >   IB/ipoib: Clean up redundant netif_addr_lock
> > >
> > > [...]
> >
> > Applied, thanks!
> 
> Thank you.
> 
> One small detail - I was asked by Yuya to change the "Reported-by" as follows:
> 
> ---
> Reported-by: Yuya Fujita <fujita.yuya-00@fujitsu.com>
> ---
> 
> Would that be possible? And if yes, could you amend the commit
> yourself or do you want me to send a v3?

Unfortunately, it is already too late, as we promoted my wip branch to be the
official rdma/for-next.

Thanks

> 
> --nX
> 
> 
> > [1/1] IB/ipoib: Fix mcast list locking
> >       https://git.kernel.org/rdma/rdma/c/4f973e211b3b1c
> >
> > Best regards,
> > --
> > Leon Romanovsky <leon@kernel.org>
> >
> 
