* [PATCH] percpu-refcount: relax limit on percpu_ref_reinit()
From: Ming Lei @ 2018-09-12 22:11 UTC
On Wed, Sep 12, 2018 at 08:53:21AM -0700, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 12, 2018 at 09:52:48AM +0800, Ming Lei wrote:
> > > If you killed and waited until kill finished, you should be able to
> > > re-init. Is it that you want to kill but abort killing in some cases?
> >
> > Yes, it can be re-inited, just with the warning from WARN_ON_ONCE(!percpu_ref_is_zero(ref)).
>
> We can add another interface but it can't be re _init_.
OK.
>
> > > How do you then handle the race against release? Can you please
> >
> > The .release is only called at atomic mode, and once we switch to
> > percpu mode, .release can't be called at all. Or I may not follow you,
> > could you explain a bit the race with release?
>
> Yeah but what guards ->release() starting to run and then the ref
> being switched to percpu mode? Or maybe that doesn't matter?
OK, we may add synchronize_rcu() just after clearing the DEAD flag in
the newly introduced helper to avoid the race.
>
> > > describe the exact usage you have on mind?
> >
> > Let me explain the use case:
> >
> > 1) nvme timeout comes
> >
> > 2) all pending requests are canceled, but won't be completed because
> > they have to be retried after the controller is recovered
> >
> > 3) meantime, the queue has to be frozen for avoiding new request, so
> > the refcount is killed via percpu_ref_kill().
> >
> > 4) after the queue is recovered (or the controller is reset successfully), it
> > isn't necessary to wait until the refcount drops to zero, since it is fine to
> > reinit it by clearing DEAD and switching back to percpu mode from atomic mode.
> > And waiting for the refcount to drop to zero in the reset handler may trigger
> > an IO hang if an IO timeout happens again during the reset.
>
> Does the recovery need the in-flight commands actually drained or does
> it just need to block new issues for a while? If the latter, why is
The recovery doesn't actually need to drain the in-flight commands.
> percpu_ref even being used?
Just to avoid reinventing the wheel, especially since .q_usage_counter
has served this purpose for a long time.
>
> > So what I am trying to propose is the following usage:
> >
> > 1) percpu_ref_kill() on .q_usage_counter before recovering the controller, to
> > prevent new requests from entering the queue
>
> The way you're describing it, the above part is no different from
> having a global bool which gates new issues.
Right, but the global bool has to be checked in the fast path, and the
synchronization between updating the flag and checking it has to be
considered. Given that blk-mq already uses .q_usage_counter for this
purpose, I suggest extending percpu-refcount to cover this use case.
>
> > 2) controller is recovered
> >
> > 3) percpu_ref_reinit() on .q_usage_counter, without waiting for
> > .q_usage_counter to drop to zero; then we needn't wait in the NVMe reset
> > handler, which can be thought of as a single thread, and we avoid an IO hang
> > when a new timeout is triggered during the waiting.
>
> This sounds possibly confused to me. Can you please explain how the
> recovery may hang if you wait for the ref to drain?
The reset handler can be thought of as one single dedicated thread; if it
hangs draining in-flight commands, it won't be run again to deal with the
next timeout event.
thanks,
Ming
From: Tejun Heo @ 2018-09-18 12:49 UTC
Hello, Ming.
Sorry about the delay.
On Thu, Sep 13, 2018 at 06:11:40AM +0800, Ming Lei wrote:
> > Yeah but what guards ->release() starting to run and then the ref
> > being switched to percpu mode? Or maybe that doesn't matter?
>
> OK, we may add synchronize_rcu() just after clearing the DEAD flag in
> the newly introduced helper to avoid the race.
That doesn't make sense to me. How is synchronize_rcu() gonna change
anything there?
> > > 4) after the queue is recovered (or the controller is reset successfully), it
> > > isn't necessary to wait until the refcount drops to zero, since it is fine to
> > > reinit it by clearing DEAD and switching back to percpu mode from atomic mode.
> > > And waiting for the refcount to drop to zero in the reset handler may trigger
> > > an IO hang if an IO timeout happens again during the reset.
> >
> > Does the recovery need the in-flight commands actually drained or does
> > it just need to block new issues for a while? If the latter, why is
>
> The recovery doesn't actually need to drain the in-flight commands.
Is it just waiting till confirm_kill is called? So that new ref is
not given away? If synchronization like that is gonna work, the
percpu ref operations on the reader side must be wrapped in a larger
critical region, which brings up two issues.
1. Callers of percpu_ref must not depend on what internal
synchronization construct percpu_ref uses. Again, percpu_ref
doesn't even use regular RCU.
2. If there is already an outer RCU protection around ref operation,
that RCU critical section can and should be used for
synchronization, not percpu_ref.
> > percpu_ref even being used?
>
> Just to avoid reinventing the wheel, especially since .q_usage_counter
> has served this purpose for a long time.
It sounds like this was more of an abuse. So, basically what you want
is sth like the following.
READER
rcu_read_lock();
if (can_issue_new_commands)
issue;
else
abort;
rcu_read_unlock();
WRITER
can_issue_new_commands = false;
synchronize_rcu();
// no new command will be issued anymore
Right? There isn't much wheel to reinvent here and using percpu_ref
for the above is likely already incorrect due to the different RCU
type being used.
> > > So what I am trying to propose is the following usage:
> > >
> > > 1) percpu_ref_kill() on .q_usage_counter before recovering the controller for
> > > preventing new requests from entering queue
> >
> > The way you're describing it, the above part is no different from
> > having a global bool which gates new issues.
>
> Right, but the global bool has to be checked in the fast path, and the synchronization
That likely bool test isn't gonna cost anything.
> between updating the flag and checking it has to be considered. Given
> that blk-mq already uses .q_usage_counter for this purpose, I
> suggest extending percpu-refcount to cover this use case.
And the synchronization part should always be considered and is
already likely wrong.
Thanks.
--
tejun
From: Ming Lei @ 2018-09-19 2:51 UTC
Hi Tejun,
On Tue, Sep 18, 2018 at 05:49:09AM -0700, Tejun Heo wrote:
> Hello, Ming.
>
> Sorry about the delay.
>
> On Thu, Sep 13, 2018 at 06:11:40AM +0800, Ming Lei wrote:
> > > Yeah but what guards ->release() starting to run and then the ref
> > > being switched to percpu mode? Or maybe that doesn't matter?
> >
> > OK, we may add synchronize_rcu() just after clearing the DEAD flag in
> > the newly introduced helper to avoid the race.
>
> That doesn't make sense to me. How is synchronize_rcu() gonna change
> anything there?
As you saw in the new post, synchronize_rcu() isn't used to avoid the
race. Instead, the race is avoided by grabbing one extra ref on the
atomic part.
>
> > > > 4) after the queue is recovered (or the controller is reset successfully), it
> > > > isn't necessary to wait until the refcount drops to zero, since it is fine to
> > > > reinit it by clearing DEAD and switching back to percpu mode from atomic mode.
> > > > And waiting for the refcount to drop to zero in the reset handler may trigger
> > > > an IO hang if an IO timeout happens again during the reset.
> > >
> > > Does the recovery need the in-flight commands actually drained or does
> > > it just need to block new issues for a while? If the latter, why is
> >
> > The recovery doesn't actually need to drain the in-flight commands.
>
> Is it just waiting till confirm_kill is called? So that new ref is
> not given away? If synchronization like that is gonna work, the
> percpu ref operations on the reader side must be wrapped in a larger
> critical region, which brings up two issues.
>
> 1. Callers of percpu_ref must not depend on what internal
> synchronization construct percpu_ref uses. Again, percpu_ref
> doesn't even use regular RCU.
>
> 2. If there is already an outer RCU protection around ref operation,
> that RCU critical section can and should be used for
> synchronization, not percpu_ref.
I guess the above doesn't apply any more because there is no new
synchronize_rcu() introduced in my new post.
>
> > > percpu_ref even being used?
> >
> > Just to avoid reinventing the wheel, especially since .q_usage_counter
> > has served this purpose for a long time.
>
> It sounds like this was more of an abuse. So, basically what you want
> is sth like the following.
>
> READER
>
> rcu_read_lock();
> if (can_issue_new_commands)
> issue;
> else
> abort;
> rcu_read_unlock();
>
> WRITER
>
> can_issue_new_commands = false;
> synchronize_rcu();
> // no new command will be issued anymore
>
> Right? There isn't much wheel to reinvent here and using percpu_ref
> for the above is likely already incorrect due to the different RCU
> type being used.
No RCU story any more, :-)
It might work, but it is still a reinvented wheel since percpu-refcount
does provide the same function. Not to mention the interaction between
the two mechanisms may have to be considered.
Also there is still a cost introduced on the WRITER side, and
synchronize_rcu() often takes quite a while, especially as there might be
lots of namespaces, each of which needs to run one synchronize_rcu(). We
have learned this lesson in converting SCSI to blk-mq, where
synchronize_rcu() introduced a long delay in booting.
Thanks,
Ming
From: Tejun Heo @ 2018-09-19 20:36 UTC
Hello, Ming.
On Wed, Sep 19, 2018 at 10:51:49AM +0800, Ming Lei wrote:
> > That doesn't make sense to me. How is synchronize_rcu() gonna change
> > anything there?
>
> As you saw in the new post, synchronize_rcu() isn't used to avoid the
> race. Instead, the race is avoided by grabbing one extra ref on the
> atomic part.
This is a layering violation. It just isn't a good idea to depend on
percpu_ref internal implementation details like this.
> > 1. Callers of percpu_ref must not depend on what internal
> > synchronization construct percpu_ref uses. Again, percpu_ref
> > doesn't even use regular RCU.
> >
> > 2. If there is already an outer RCU protection around ref operation,
> > that RCU critical section can and should be used for
> > synchronization, not percpu_ref.
>
> I guess the above doesn't apply any more because there is no new
> synchronize_rcu() introduced in my new post.
It still does. The problem is that what you're doing creates
dependencies on percpu_ref's implementation details - how it
guarantees the mode transition visibility using what sort of
synchronization construct.
> > Right? There isn't much wheel to reinvent here and using percpu_ref
> > for the above is likely already incorrect due to the different RCU
> > type being used.
>
> No RCU story any more, :-)
>
> It might work, but it is still a reinvented wheel since percpu-refcount
> does provide the same function. Not to mention the interaction between
> the two mechanisms may have to be considered.
Why would the two independent mechanisms interact with each other?
What's problematic is entangling two mechanisms in an implementation
dependent way.
> Also there is still a cost introduced on the WRITER side, and
> synchronize_rcu() often takes quite a while, especially as there might be
> lots of namespaces, each of which needs to run one synchronize_rcu(). We
> have learned this lesson in converting SCSI to blk-mq, where
> synchronize_rcu() introduced a long delay in booting.
You're already paying that latency. It's not like percpu_ref can make
it happen magically without paying the same cost. You also can easily
overlay the two grace periods as the percpu_ref part can be
asynchronous (if you still care about it). But, from what I've read
till now, it doesn't even look like you'd need to do anything with
percpu_ref if all you need to do is shut down the issuing of new
commands.
Thanks.
--
tejun