* much about ah objects in rxe
@ 2022-04-22 18:32 Bob Pearson
  2022-04-22 21:00 ` Jason Gunthorpe
  0 siblings, 1 reply; 8+ messages in thread
From: Bob Pearson @ 2022-04-22 18:32 UTC (permalink / raw)
To: Jason Gunthorpe, Zhu Yanjun, linux-rdma@vger.kernel.org

Jason,

I am a little confused.

- xa_alloc_xxx internally takes xa->xa_lock with a spinlock but has a gfp_t
  parameter which is normally GFP_KERNEL. So I trust them when they say that
  it releases the lock around kmallocs by 'magic', as you say.

- The only read side operation on the rxe pool xarrays is in
  rxe_pool_get_index(), but that will be protected by an rcu_read_lock so it
  can't deadlock with the write side spinlocks regardless of type (plain,
  _bh, _irqsave).

- Apparently CM is calling ib_create_ah while holding spinlocks. This would
  call xa_alloc_xxx, which would unlock xa_lock and call
  kmalloc(..., GFP_KERNEL), which should cause a warning for AH. You say it
  does not because xarray doesn't call might_sleep().

  I am not sure how might_sleep() works. When I add might_sleep() just ahead
  of xa_alloc_xxx() it does not cause warnings for CM test cases (e.g.
  rping). Another way to study this would be to test for in_atomic(), but
  that seems to be discouraged and may not work as assumed. It's hard to
  reproduce evidence that ib_create_ah really has spinlocks held by the
  caller. I think it was seen in lockdep traces but I have a hard time
  reading them.

- There is a lot of effort trying to make 'deadlocks' go away. But the read
  side is going to end up as rcu_read_lock, so there soon will be no
  deadlocks with rxe_pool_get_index() possible. xarrays were designed to work
  well with rcu, so it would be better to just go ahead and do it. Verbs
  objects tend to be long lived with lots of IO on each instance. This is a
  perfect use case for rcu.

I think this means there is no reason for anything but a plain spinlock in
rxe_alloc and rxe_add_to_pool.

To sum up, once we have rcu enabled the only required change is to use
GFP_ATOMIC or find a way to pre-allocate for AH objects (assuming that I can
convince myself that ib_create_ah really comes with spinlocks held).

Bob

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: much about ah objects in rxe
  2022-04-22 18:32 much about ah objects in rxe Bob Pearson
@ 2022-04-22 21:00 ` Jason Gunthorpe
  2022-04-22 22:11   ` Bob Pearson
  0 siblings, 1 reply; 8+ messages in thread
From: Jason Gunthorpe @ 2022-04-22 21:00 UTC (permalink / raw)
To: Bob Pearson; +Cc: Zhu Yanjun, linux-rdma@vger.kernel.org

On Fri, Apr 22, 2022 at 01:32:24PM -0500, Bob Pearson wrote:
> Jason,
>
> I am a little confused.
>
> - xa_alloc_xxx internally takes xa->xa_lock with a spinlock but has a gfp_t
>   parameter which is normally GFP_KERNEL. So I trust them when they say that
>   it releases the lock around kmallocs by 'magic', as you say.
>
> - The only read side operation on the rxe pool xarrays is in
>   rxe_pool_get_index(), but that will be protected by an rcu_read_lock so it
>   can't deadlock with the write side spinlocks regardless of type (plain,
>   _bh, _irqsave).
>
> - Apparently CM is calling ib_create_ah while holding spinlocks. This would
>   call xa_alloc_xxx, which would unlock xa_lock and call
>   kmalloc(..., GFP_KERNEL), which should cause a warning for AH. You say it
>   does not because xarray doesn't call might_sleep().
>
>   I am not sure how might_sleep() works. When I add might_sleep() just ahead
>   of xa_alloc_xxx() it does not cause warnings for CM test cases (e.g.
>   rping). Another way to study this would be to test for in_atomic(), but

might_sleep should work, it definitely triggers from inside a
spinlock. Perhaps you don't have all the right debug kconfig enabled?

>   that seems to be discouraged and may not work as assumed. It's hard to
>   reproduce evidence that ib_create_ah really has spinlocks held by the
>   caller. I think it was seen in lockdep traces but I have a hard time
>   reading them.

There is a call to create_ah inside RDMA CM that is under a spinlock

> - There is a lot of effort trying to make 'deadlocks' go away. But the read
>   side is going to end up as rcu_read_lock, so there soon will be no
>   deadlocks with rxe_pool_get_index() possible. xarrays were designed to
>   work well with rcu, so it would be better to just go ahead and do it.
>   Verbs objects tend to be long lived with lots of IO on each instance.
>   This is a perfect use case for rcu.

Yes

> I think this means there is no reason for anything but a plain spinlock in
> rxe_alloc and rxe_add_to_pool.

Maybe, are you sure there are no other xa spinlocks held from an IRQ?

And you still have to deal with the create AH called in an atomic
region.

> To sum up, once we have rcu enabled the only required change is to use
> GFP_ATOMIC or find a way to pre-allocate for AH objects (assuming that I can
> convince myself that ib_create_ah really comes with spinlocks held).

Possibly yes

Jason

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: much about ah objects in rxe
  2022-04-22 21:00 ` Jason Gunthorpe
@ 2022-04-22 22:11   ` Bob Pearson
  2022-04-23  1:18     ` Zhu Yanjun
  0 siblings, 1 reply; 8+ messages in thread
From: Bob Pearson @ 2022-04-22 22:11 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Zhu Yanjun, linux-rdma@vger.kernel.org

On 4/22/22 16:00, Jason Gunthorpe wrote:
> On Fri, Apr 22, 2022 at 01:32:24PM -0500, Bob Pearson wrote:
>> [...]
>> I am not sure how might_sleep() works. When I add might_sleep() just ahead
>> of xa_alloc_xxx() it does not cause warnings for CM test cases (e.g.
>> rping). Another way to study this would be to test for in_atomic(), but
>
> might_sleep should work, it definitely triggers from inside a
> spinlock. Perhaps you don't have all the right debug kconfig enabled?
>
>> that seems to be discouraged and may not work as assumed. It's hard to
>> reproduce evidence that ib_create_ah really has spinlocks held by the
>> caller. I think it was seen in lockdep traces but I have a hard time
>> reading them.
>
> There is a call to create_ah inside RDMA CM that is under a spinlock
>
>> [...]
>> I think this means there is no reason for anything but a plain spinlock in
>> rxe_alloc and rxe_add_to_pool.
>
> Maybe, are you sure there are no other xa spinlocks held from an IRQ?
>
> And you still have to deal with the create AH called in an atomic
> region.

There are only 3 references to the xarrays:

1. When an object is allocated, either from rxe_alloc(), which is called
   when an MR is registered, or from rxe_add_to_pool() when the other
   objects are created.
2. When an object is looked up from rxe_pool_get_index().
3. When an object is cleaned up from rxe_xxx_destroy() and similar.

For non-AH objects the create and destroy verbs are always called in process
context and non-atomic, and the lookup routine is normally called in soft IRQ
context but doesn't take a lock when rcu is used, so it can't deadlock.

For AH objects the create call is always called in process context but may
or may not hold an irq spinlock, so hard interrupts are disabled to prevent
deadlocking CM's locks. The cleanup call is also in process context but also
may or may not hold an irq spinlock (not sure if that happens). These calls
can't deadlock each other on the xa_lock because either there won't be an
interrupt or the process context calls don't cause reentering the rxe code.
They also can't deadlock with the lookup call when it is using rcu.

>> To sum up, once we have rcu enabled the only required change is to use
>> GFP_ATOMIC or find a way to pre-allocate for AH objects [...]
>
> Possibly yes
>
> Jason

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: much about ah objects in rxe
  2022-04-22 22:11 ` Bob Pearson
@ 2022-04-23  1:18   ` Zhu Yanjun
  2022-04-23  1:48     ` Bob Pearson
  0 siblings, 1 reply; 8+ messages in thread
From: Zhu Yanjun @ 2022-04-23 1:18 UTC (permalink / raw)
To: Bob Pearson; +Cc: Jason Gunthorpe, linux-rdma@vger.kernel.org

On Sat, Apr 23, 2022 at 6:11 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> On 4/22/22 16:00, Jason Gunthorpe wrote:
> > [...]
> > Maybe, are you sure there are no other xa spinlocks held from an IRQ?
> >
> > And you still have to deal with the create AH called in an atomic
> > region.
>
> There are only 3 references to the xarrays:
>
> 1. When an object is allocated, either from rxe_alloc(), which is called
>    when an MR is registered, or from rxe_add_to_pool() when the other
>    objects are created.
> 2. When an object is looked up from rxe_pool_get_index().
> 3. When an object is cleaned up from rxe_xxx_destroy() and similar.
>
> For non-AH objects the create and destroy verbs are always called in process
> context and non-atomic, and the lookup routine is normally called in soft
> IRQ context but doesn't take a lock when rcu is used, so it can't deadlock.

Are you sure about this? There are several non-AH objects.
Can you make sure that they are all created in process context?

And can you ensure it in the future?

> For AH objects the create call is always called in process context but may

How do you ensure that?

> or may not hold an irq spinlock, so hard interrupts are disabled to prevent
> deadlocking CM's locks. The cleanup call is also in process context but also
> may or may not hold an irq spinlock (not sure if that happens). These calls
> can't deadlock each other on the xa_lock because either there won't be an
> interrupt or the process context calls don't cause reentering the rxe code.
> They also can't deadlock with the lookup call when it is using rcu.

From what you say, all the operations, including create, destroy and lookup,
are in process context or soft IRQ context.
How do you ensure that? I mean, how do you ensure that all operations are
always in process or soft IRQ context and never violate that, even across
different calls?

Zhu Yanjun

> > [...]
> > Possibly yes
> >
> > Jason

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: much about ah objects in rxe
  2022-04-23  1:18 ` Zhu Yanjun
@ 2022-04-23  1:48   ` Bob Pearson
  2022-04-23  2:35     ` Zhu Yanjun
  0 siblings, 1 reply; 8+ messages in thread
From: Bob Pearson @ 2022-04-23 1:48 UTC (permalink / raw)
To: Zhu Yanjun; +Cc: Jason Gunthorpe, linux-rdma@vger.kernel.org

On 4/22/22 20:18, Zhu Yanjun wrote:
> On Sat, Apr 23, 2022 at 6:11 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>> [...]
>> For non-AH objects the create and destroy verbs are always called in
>> process context and non-atomic, and the lookup routine is normally called
>> in soft IRQ context but doesn't take a lock when rcu is used, so it can't
>> deadlock.
>
> Are you sure about this? There are several non-AH objects.
> Can you make sure that they are all created in process context?
>
> And can you ensure it in the future?

I added the line

	WARN_ON(!in_task());

to rxe_alloc and rxe_add_to_pool and it never triggered. It would be
theoretically possible for someone to try to write a module that responds to
an interrupt in some widget and then tries to start a verbs session. But it
would be very strange. It is reasonable to just declare that the verbs APIs
are not callable in interrupt (soft or hard) context. I believe this is
tacitly understood and currently is true.

It is a separate issue whether or not the caller is in 'atomic context',
which includes holding a spinlock and implies that the thread cannot sleep.
The xarray code, if you look at the code for xa_alloc_cyclic(), does take the
xa_lock spinlock around __xa_alloc_cyclic(), but the comments also say that
the lock is released internally before calling kmalloc() and friends if the
gfp_t parameter is GFP_KERNEL, so that it is safe to sleep. However, cma.c
calls ib_create_ah() while holding spinlocks (it happens to be
spin_lock_irqsave() but that doesn't matter here, since any spinlock makes it
bad to call kmalloc()). So we have to use GFP_ATOMIC for AH objects.

It is clear that the assumption of the verbs APIs is that they are always
called in process context.

>> For AH objects the create call is always called in process context but may
>
> How do you ensure that?

Same way. It is true now, see the WARN_ON above.

>> or may not hold an irq spinlock [...]
>> They also can't deadlock with the lookup call when it is using rcu.
>
> From what you say, all the operations, including create, destroy and lookup,
> are in process context or soft IRQ context.
> How do you ensure that? I mean, how do you ensure that all operations are
> always in process or soft IRQ context and never violate that, even across
> different calls?

We have no way to get into interrupt context. We get called from above
through the verbs APIs in process context and from below by NAPI passing
network packets to us in softirq context. We also internally defer work to
tasklets, which are always soft IRQs. Unless someone outside of the rdma-core
subsystem called into verbs calls from interrupt context (weird) it can never
happen. And it has never happened. Again, don't get confused by irqsave
spinlocks. They are sometimes used in process or soft irq context and disable
hardware interrupts to prevent a deadlock with another spinlock in a hardware
interrupt handler. But we don't have any code that runs in hardware interrupt
context to worry about. I doubt cma does either, but many people use irqsave
spinlocks when they are not needed.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: much about ah objects in rxe
  2022-04-23  1:48 ` Bob Pearson
@ 2022-04-23  2:35   ` Zhu Yanjun
  2022-04-23  4:34     ` Bob Pearson
  0 siblings, 1 reply; 8+ messages in thread
From: Zhu Yanjun @ 2022-04-23 2:35 UTC (permalink / raw)
To: Bob Pearson; +Cc: Jason Gunthorpe, linux-rdma@vger.kernel.org

On Sat, Apr 23, 2022 at 9:48 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> [...]
> I added the line
>
>	WARN_ON(!in_task());
>
> to rxe_alloc and rxe_add_to_pool and it never triggered. It would be
> theoretically possible for someone to try

Passing in your test environment does not mean that no other test
environment will trigger this.

> to write a module that responds to an interrupt in some widget and then
> tries to start a verbs session. But it would be very strange. It is
> reasonable to just declare that the verbs APIs are not callable in
> interrupt (soft or hard)

This will limit the use of the verbs APIs.

> context. I believe this is tacitly understood and currently is true.
> [...] So we have to use GFP_ATOMIC for AH objects.
>
> It is clear that the assumption of the verbs APIs is that they are always
> called in process context.

How do you ensure that they are not called in interrupt context? Some
kernel modules also use rdma.

> [...]
> We have no way to get into interrupt context. [...] But we don't have any
> code that runs in hardware interrupt context to worry about. I doubt cma
> does either, but many people use irqsave spinlocks when they are not
> needed.

How do you verify that many people use unnecessary irqsave spinlocks?

Zhu Yanjun

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: much about ah objects in rxe 2022-04-23 2:35 ` Zhu Yanjun @ 2022-04-23 4:34 ` Bob Pearson 2022-04-23 8:02 ` Zhu Yanjun 0 siblings, 1 reply; 8+ messages in thread From: Bob Pearson @ 2022-04-23 4:34 UTC (permalink / raw) To: Zhu Yanjun; +Cc: Jason Gunthorpe, linux-rdma@vger.kernel.org On 4/22/22 21:35, Zhu Yanjun wrote: > On Sat, Apr 23, 2022 at 9:48 AM Bob Pearson <rpearsonhpe@gmail.com> wrote: >> >> On 4/22/22 20:18, Zhu Yanjun wrote: >>> On Sat, Apr 23, 2022 at 6:11 AM Bob Pearson <rpearsonhpe@gmail.com> wrote: >>>> >>>> On 4/22/22 16:00, Jason Gunthorpe wrote: >>>>> On Fri, Apr 22, 2022 at 01:32:24PM -0500, Bob Pearson wrote: >>>>>> Jason, >>>>>> >>>>>> I am confused a little. >>>>>> >>>>>> - xa_alloc_xxx internally takes xa->xa_lock with a spinlock but >>>>>> has a gfp_t parameter which is normally GFP_KERNEL. So I trust them when they say >>>>>> that it releases the lock around kmalloc's by 'magic' as you say. >>>>>> >>>>>> - The only read side operation on the rxe pool xarrays is in rxe_pool_get_index() but >>>>>> that will be protected by a rcu_read_lock so it can't deadlock with the write >>>>>> side spinlocks regardless of type (plain, _bh, _saveirq) >>>>>> >>>>>> - Apparently CM is calling ib_create_ah while holding spin locks. This would >>>>>> call xa_alloc_xxx which would unlock xa_lock and call kmalloc(..., GFP_KERNEL) >>>>>> which should cause a warning for AH. You say it does not because xarray doesn't >>>>>> call might_sleep(). >>>>>> >>>>>> I am not sure how might_sleep() works. When I add might_sleep() just ahead of >>>>>> xa_alloc_xxx() it does not cause warnings for CM test cases (e.g. rping.) >>>>>> Another way to study this would be to test for in_atomic() but >>>>> >>>>> might_sleep should work, it definately triggers from inside a >>>>> spinlock. Perhaps you don't have all the right debug kconfig enabled? >>>>> >>>>>> that seems to be discouraged and may not work as assumed. 
It's hard to reproduce >>>>>> evidence that ib_create_ah really has spinlocks held by the caller. I think it >>>>>> was seen in lockdep traces but I have a hard time reading them. >>>>> >>>>> There is a call to create_ah inside RDMA CM that is under a spinlock >>>>> >>>>>> - There is a lot of effort trying to make 'deadlocks' go away. But the read side >>>>>> is going to end as up rcu_read_lock so there soon will be no deadlocks with >>>>>> rxe_pool_get_index() possible. xarrays were designed to work well with rcu >>>>>> so it would better to just go ahead and do it. Verbs objects tend to be long >>>>>> lived with lots of IO on each instance. This is a perfect use case for rcu. >>>>> >>>>> Yes >>>>> >>>>>> I think this means there is no reason for anything but a plain spinlock in rxe_alloc >>>>>> and rxe_add_to_pool. >>>>> >>>>> Maybe, are you sure there are no other xa spinlocks held from an IRQ? >>>>> >>>>> And you still have to deal with the create AH called in an atomic >>>>> region. >>>> >>>> There are only 3 references to the xarrays: >>>> >>>> 1. When an object is allocated. Either from rxe_alloc() which is called >>>> an MR is registered or from rxe_add_to_pool() when the other >>>> objects are created. >>>> 2. When an object is looked up from rxe_pool_get_index() >>>> 3. When an object is cleaned up from rxe_xxx_destroy() and similar. >>>> >>>> For non AH objects the create and destroy verbs are always called in process >>>> context and non-atomic and the lookup routine is normally called in soft IRQ >>>> context but doesn't take a lock when rcu is used so can't deadlock. >>> >>> Are you sure about this? There are about several non AH objects. >>> You can make sure that all in process context? >>> >>> And you can ensure it in the future? >> >> I added the line >> >> WARN_ON(!in_task()); >> >> to rxe_alloc and rxe_add_to_pool >> >> and it never triggered. 
I would be theoretically possible for someone to try > > Your test environment does not mean that all the test environments > will not trigger this. > >> to write a module that responds to an interrupt in some wigit and then >> tries to start a verbs session. But it would be very strange. It is reasonable >> to just declare that the verbs APIs are not callable in interrupt (soft or hard) > > This will limit verbs APIs use. Not really. The verbs APIs are written assuming that they are in process context. Just look at the code. You could never allocate memory without GFP_ATOMIC but that is not the case there are GFP_KERNELs all over the place. No one could use tasklets or mutexes because they schedule. No one could ever sleep. If you want to make the rdma verbs APIs callable from interrupt handlers be my guest but you have a huge job to fix all the places in all the providers that make the opposite assumption. > >> context. I believe this is tacitly understood and currently is true. It is a >> separate issue whether or not the caller is in 'atomic context' which includes >> holding a spinlock and implies that the thread cannot sleep. the xarray code >> if you look at the code for xa_alloc_cyclic() does take the xa_lock spinlock around >> _xa_alloc_cyclic() but they also say in the comments that they release that lock >> internally before calling kmalloc() and friends if the gfp_t parameter is >> GFP_KERNEL so that it is safe to sleep. However cma.c calls ib_create_ah() while >> holding spinlocks (happens to be spin_lock_saveirq() but that doesn't matter here >> since any spinlock makes it bad to call kmalloc()). So we have to use GFP_ATOMIC >> for AH objects. >> >> It is clear that the assumption of the verbs APIs is that they are always called >> in process context. > > How to ensure that they are not called in interrupt context? And some > kernel modules > also use the rdma. Kernel modules can and do use rdma but they use the verbs API from process context. 
Any system call ends up in kernel code in process context.

>
>>>
>>>>
>>>> For AH objects the create call is always called in process context but
>>>> may
>>>
>>> How to ensure it?
>>
>> Same way. It is true now, see the WARN_ON above.
>>>
>>>> or may not hold an irq spinlock, so hard interrupts are disabled to
>>>> prevent deadlocking CM's locks. The cleanup call is also in process
>>>> context but also may or may not hold an irq spinlock (not sure if it
>>>> happens). These calls can't deadlock each other for the xa_lock because
>>>> either there won't be an interrupt or the process context calls don't
>>>> cause reentering the rxe code. They also can't deadlock with the lookup
>>>> call when it is using rcu.
>>>
>>> From you, all the operations including create, destroy and lookup are
>>> in process context or soft IRQ context.
>>> How to ensure it? I mean that all operations are always in process or
>>> soft IRQ context, and will not violate?
>>> Even though in different calls?
>>
>> We have no way to get into interrupt context. We get called from above
>> through the verbs APIs in process context and from below by NAPI passing
>> network packets to us in softirq context. We also internally defer work
>> to tasklets, which are always soft IRQs. Unless someone outside of the
>> rdma-core subsystem called into verbs calls from interrupt context
>> (weird) it can never happen. And it has never happened. Again, don't get
>> confused by irqsave spinlocks. They are sometimes used in process or
>> soft irq context and disable hardware interrupts to prevent a deadlock
>> with another spinlock in a hardware interrupt handler. But we don't have
>> any code that runs in hardware interrupt context to worry about. Neither
>> does cma, but many people use irqsave spinlocks when they are not needed.
>
> How to verify that many people use unnecessary irqsave spinlock?
Please read
https://www.kernel.org/doc/html/v4.13/kernel-hacking/locking.html and
especially look at the "Table of locking requirements". The only place
where _irqsave (SLIS) is really required is when two different interrupt
handlers are sharing data. All other cases are correctly addressed with
plain, _bh and _irq (not _irqsave) spinlocks. Nevertheless it is in
widespread use all over the place. People are just lazy.

>
> Zhu Yanjun
>
>>>
>>> Zhu Yanjun
>>>
>>>>
>>>>>
>>>>>> To sum up, once we have rcu enabled the only required change is to
>>>>>> use GFP_ATOMIC or find a way to pre-allocate for AH objects (assuming
>>>>>> that I can convince myself that ib_create_ah really comes with
>>>>>> spinlocks held).
>>>>>
>>>>> Possibly yes
>>>>>
>>>>> Jason
>>>>
>>

^ permalink raw reply	[flat|nested] 8+ messages in thread
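The rule Bob summarizes from the locking table can be illustrated as follows (a kernel-context sketch with a hypothetical shared counter, not rxe code; which variant is needed depends only on the contexts that can take the lock):

```c
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(lock);
static int counter;

/* Data shared only between process-context callers: plain spinlock. */
void touch_from_process(void)
{
	spin_lock(&lock);
	counter++;
	spin_unlock(&lock);
}

/* Data also touched from softirq (e.g. NAPI, tasklets): _bh variant
 * so the softirq cannot preempt the holder and deadlock. */
void touch_vs_softirq(void)
{
	spin_lock_bh(&lock);
	counter++;
	spin_unlock_bh(&lock);
}

/* Only when a hardirq handler shares the data is an _irq/_irqsave
 * variant actually required. */
void touch_vs_hardirq(void)
{
	unsigned long flags;

	spin_lock_irqsave(&lock, flags);
	counter++;
	spin_unlock_irqrestore(&lock, flags);
}
```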
* Re: much about ah objects in rxe
  2022-04-23  4:34             ` Bob Pearson
@ 2022-04-23  8:02               ` Zhu Yanjun
  0 siblings, 0 replies; 8+ messages in thread
From: Zhu Yanjun @ 2022-04-23 8:02 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Jason Gunthorpe, linux-rdma@vger.kernel.org

On Sat, Apr 23, 2022 at 12:34 PM Bob Pearson <rpearsonhpe@gmail.com> wrote:
> > How to ensure that they are not called in interrupt context? And some
> > kernel modules also use the rdma.
>
> Kernel modules can and do use rdma but they use the verbs API from
> process context. Any system call ends up in kernel code in process
> context.

Thanks. You always assume that these verbs APIs are called in process
context. But how can we prove it?

Last time, bh locks were assumed to be OK in RXE, and commit 21adfa7a3c4e
("RDMA/rxe: Replace irqsave locks with bh locks") was merged into mainline.
But problems appeared, and in the end the bh locks were reverted back to
irqsave locks.

With the above, I do not mean to complain about something. I just want to
prove and confirm your assumption. Then we will continue to do some work
to make RXE stable and better.
> > How to verify that many people use unnecessary irqsave spinlock?
>
> Please read
> https://www.kernel.org/doc/html/v4.13/kernel-hacking/locking.html and
> especially look at the "Table of locking requirements". The only place
> where _irqsave (SLIS) is really required is when two different interrupt
> handlers are sharing data. All other cases are correctly addressed with
> plain, _bh and _irq (not _irqsave) spinlocks. Nevertheless it is in
> widespread use all over the place. People are just lazy.

Thanks a lot. I will read the link
https://www.kernel.org/doc/html/v4.13/kernel-hacking/locking.html
carefully.

Zhu Yanjun

^ permalink raw reply	[flat|nested] 8+ messages in thread
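The read-side design the thread converges on, rcu_read_lock() around the xarray lookup so that rxe_pool_get_index() takes no spinlock at all, might look like the following kernel-context sketch (the element struct and ref-count details are simplified relative to the real rxe code):

```c
#include <linux/xarray.h>
#include <linux/rcupdate.h>
#include <linux/kref.h>

struct rxe_pool_elem {
	struct kref ref_cnt;
	void *obj;
};

/* Read side: no spinlock of any flavor is taken, so the lookup cannot
 * deadlock with the write-side xa_lock regardless of which spinlock
 * variant the writers use (plain, _bh, or _irqsave). */
void *rxe_pool_get_index(struct xarray *xa, u32 index)
{
	struct rxe_pool_elem *elem;
	void *obj = NULL;

	rcu_read_lock();
	elem = xa_load(xa, index);
	/* take a reference only if the object has not started teardown */
	if (elem && kref_get_unless_zero(&elem->ref_cnt))
		obj = elem->obj;
	rcu_read_unlock();

	return obj;
}
```

Long-lived objects with many lookups per instance are exactly the access pattern RCU favors, which is why the thread treats this as the natural end state.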
end of thread, other threads:[~2022-04-23  8:03 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2022-04-22 18:32 much about ah objects in rxe Bob Pearson
2022-04-22 21:00 ` Jason Gunthorpe
2022-04-22 22:11   ` Bob Pearson
2022-04-23  1:18     ` Zhu Yanjun
2022-04-23  1:48       ` Bob Pearson
2022-04-23  2:35         ` Zhu Yanjun
2022-04-23  4:34           ` Bob Pearson
2022-04-23  8:02             ` Zhu Yanjun