* Re: [PATCH] generic dynamic per cpu refcounting
[not found] <20130124232024.GA584@google.com>
@ 2013-01-25 18:09 ` Oleg Nesterov
2013-01-25 18:29 ` Oleg Nesterov
` (2 more replies)
0 siblings, 3 replies; 27+ messages in thread
From: Oleg Nesterov @ 2013-01-25 18:09 UTC (permalink / raw)
To: Kent Overstreet; +Cc: tj, srivatsa.bhat, rusty, linux-kernel
(add lkml)
On 01/24, Kent Overstreet wrote:
>
> This has already been on lkml and is in Andrew's tree, Tejun just asked
> me to send it out again:
I'll try to read this code later; just a couple of questions after a quick
glance. Sorry if this was already discussed...
> +struct percpu_ref {
> + atomic64_t count;
> + unsigned long pcpu_count;
> +};
The code looks a bit tricky mostly because you pack state/pointer/jiffies
into ->pcpu_count. The same for ->count.
I assume that you have a good reason to shrink the sizeof(percpu_ref), but
I am curious: who is the user of this thing?
> + * percpu_ref_get - increment a dynamic percpu refcount
> + *
> + * Increments @ref and possibly converts it to percpu counters. Must be called
> + * with rcu_read_lock() held, and may potentially drop/reacquire rcu_read_lock()
> + * to allocate percpu counters - if sleeping/allocation isn't safe for some
> + * other reason (e.g. a spinlock), see percpu_ref_get_noalloc().
And this looks strange. It must be called under rcu_read_lock(), but
->rcu_read_lock_nesting must be == 1. Otherwise rcu_read_unlock() in
percpu_ref_alloc() won't work.
Again, I think you have a reason, but could you explain? IOW, why can't we
make it might_sleep() instead? The fast path can do rcu_read_lock()
itself.
> +static inline void percpu_ref_get_noalloc(struct percpu_ref *ref)
> +{
> + __percpu_ref_get(ref, false);
> +}
and this could be percpu_ref_get_atomic().
Once again, I am not arguing, just can't understand.
> +void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
> +{
> + unsigned long pcpu_count;
> + uint64_t v;
> +
> + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
> + /* for rcu - we're not using rcu_dereference() */
> + smp_read_barrier_depends();
> + __this_cpu_inc(*((unsigned __percpu *) pcpu_count));
The comment looks a bit confusing... smp_read_barrier_depends() is not
for rcu, we obviously need it to access (unsigned __percpu *) pcpu_count.
But yes, since we didn't use rcu_dereference() we have to add it by hand.
> +int percpu_ref_kill(struct percpu_ref *ref)
> +{
> ...
> + if (status == PCPU_REF_PTR) {
> + unsigned count = 0, cpu;
> +
> + synchronize_rcu();
> +
> + for_each_possible_cpu(cpu)
> + count += *per_cpu_ptr((unsigned __percpu *) pcpu_count, cpu);
> +
> + pr_debug("global %lli pcpu %i",
> + atomic64_read(&ref->count) & PCPU_COUNT_MASK,
> + (int) count);
> +
> + atomic64_add((int) count, &ref->count);
> + smp_wmb();
> + /* Between setting global count and setting PCPU_REF_DEAD */
> + ref->pcpu_count = PCPU_REF_DEAD;
The comment explains what the code does, but not why ;)
I guess this is for percpu_ref_put(), and this wmb() pairs with implicit
mb() implied by atomic64_dec_return().
> + free_percpu((unsigned __percpu *) pcpu_count);
I guess it could be freed right after for_each_possible_cpu() above, but
this doesn't matter.
Oleg.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-25 18:09 ` [PATCH] generic dynamic per cpu refcounting Oleg Nesterov
@ 2013-01-25 18:29 ` Oleg Nesterov
2013-01-28 18:10 ` Kent Overstreet
2013-01-25 19:11 ` Oleg Nesterov
2013-01-28 18:07 ` [PATCH] generic dynamic per cpu refcounting Kent Overstreet
2 siblings, 1 reply; 27+ messages in thread
From: Oleg Nesterov @ 2013-01-25 18:29 UTC (permalink / raw)
To: Kent Overstreet; +Cc: tj, srivatsa.bhat, rusty, linux-kernel
On 01/25, Oleg Nesterov wrote:
>
> > +void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
> > +{
> > + unsigned long pcpu_count;
> > + uint64_t v;
> > +
> > + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > +
> > + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
> > + /* for rcu - we're not using rcu_dereference() */
> > + smp_read_barrier_depends();
> > + __this_cpu_inc(*((unsigned __percpu *) pcpu_count));
>
> The comment looks a bit confusing... smp_read_barrier_depends() is not
> for rcu, we obviously need it to access (unsigned __percpu *) pcpu_count.
> But yes, since we didn't use rcu_dereference() we have to add it by hand.
Hmm. Otoh, arch/alpha uses asm-generic/percpu.h so in theory we need
smp_read_barrier_depends() after __this_cpu_generic_to_op() calculates the
"real" pointer, __this_cpu_ptr() ?
Just curious...
Oleg.
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-25 18:09 ` [PATCH] generic dynamic per cpu refcounting Oleg Nesterov
2013-01-25 18:29 ` Oleg Nesterov
@ 2013-01-25 19:11 ` Oleg Nesterov
2013-01-28 18:15 ` Kent Overstreet
2013-01-28 18:07 ` [PATCH] generic dynamic per cpu refcounting Kent Overstreet
2 siblings, 1 reply; 27+ messages in thread
From: Oleg Nesterov @ 2013-01-25 19:11 UTC (permalink / raw)
To: Kent Overstreet; +Cc: tj, srivatsa.bhat, rusty, linux-kernel
On 01/25, Oleg Nesterov wrote:
>
> > +int percpu_ref_kill(struct percpu_ref *ref)
> > +{
> > ...
> > + if (status == PCPU_REF_PTR) {
> > + unsigned count = 0, cpu;
> > +
> > + synchronize_rcu();
> > +
> > + for_each_possible_cpu(cpu)
> > + count += *per_cpu_ptr((unsigned __percpu *) pcpu_count, cpu);
> > +
> > + pr_debug("global %lli pcpu %i",
> > + atomic64_read(&ref->count) & PCPU_COUNT_MASK,
> > + (int) count);
> > +
> > + atomic64_add((int) count, &ref->count);
> > + smp_wmb();
> > + /* Between setting global count and setting PCPU_REF_DEAD */
> > + ref->pcpu_count = PCPU_REF_DEAD;
>
> The comment explains what the code does, but not why ;)
>
> I guess this is for percpu_ref_put(), and this wmb() pairs with implicit
> mb() implied by atomic64_dec_return().
Hmm. Most probably I missed something, but it seems we need another
synchronize_rcu() _after_ we set PCPU_REF_DEAD.
To simplify, suppose that percpu_ref_put() is never called directly but
we have
void put_and_destroy(...)
{
if (percpu_ref_put(...))
destroy(...);
}
Suppose that ref->count == 2 after atomic64_add() above. IOW, we have
a "master" reference for _kill() and someone else did _get.
So the caller does
percpu_ref_kill();
put_and_destroy();
And this can race with another holder which drops the last reference,
its put_and_destroy() can see PCPU_REF_DYING and return false.
Or I misunderstood the code/interface?
Oleg.
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-25 18:09 ` [PATCH] generic dynamic per cpu refcounting Oleg Nesterov
2013-01-25 18:29 ` Oleg Nesterov
2013-01-25 19:11 ` Oleg Nesterov
@ 2013-01-28 18:07 ` Kent Overstreet
2 siblings, 0 replies; 27+ messages in thread
From: Kent Overstreet @ 2013-01-28 18:07 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: tj, srivatsa.bhat, rusty, linux-kernel
On Fri, Jan 25, 2013 at 07:09:41PM +0100, Oleg Nesterov wrote:
> (add lkml)
>
> On 01/24, Kent Overstreet wrote:
> >
> > This has already been on lkml and is in Andrew's tree, Tejun just asked
> > me to send it out again:
>
> I'll try to read this code later, just a couple of questions after a quick
> glance. Sorry this was already discussed...
No worries, it wasn't that widely circulated.
> > +struct percpu_ref {
> > + atomic64_t count;
> > + unsigned long pcpu_count;
> > +};
>
> The code looks a bit tricky mostly because you pack state/pointer/jiffies
> into ->pcpu_count. The same for ->count.
Yes, it is.
> I assume that you have a good reason to shrink the sizeof(percpu_ref), but
> I am curious: who is the user of this thing?
Right now - just the aio code, but the idea was to make it as close to a
drop in replacement for atomic_t + atomic_get()/atomic_dec_and_test() as
possible.
> > + * percpu_ref_get - increment a dynamic percpu refcount
> > + *
> > + * Increments @ref and possibly converts it to percpu counters. Must be called
> > + * with rcu_read_lock() held, and may potentially drop/reacquire rcu_read_lock()
> > + * to allocate percpu counters - if sleeping/allocation isn't safe for some
> > + * other reason (e.g. a spinlock), see percpu_ref_get_noalloc().
>
> And this looks strange. It must be called under rcu_read_lock(), but
> ->rcu_read_lock_nesting must be == 1. Otherwise rcu_read_unlock() in
> percpu_ref_alloc() won't work.
>
> Again, I think you have a reason, but could you explain? IOW, why we
> can't make it might_sleep() instead? The fast path can do rcu_read_lock()
> itself.
It's stupid and contorted because I didn't have any better ideas when I
first wrote it and haven't fixed it yet.
> > +static inline void percpu_ref_get_noalloc(struct percpu_ref *ref)
> > +{
> > + __percpu_ref_get(ref, false);
> > +}
>
> and this could be percpu_ref_get_atomic().
>
> Once again, I am not arguing, just can't understand.
Same deal, I'm going to get rid of the two different versions.
> > +void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
> > +{
> > + unsigned long pcpu_count;
> > + uint64_t v;
> > +
> > + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > +
> > + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
> > + /* for rcu - we're not using rcu_dereference() */
> > + smp_read_barrier_depends();
> > + __this_cpu_inc(*((unsigned __percpu *) pcpu_count));
>
> The comment looks a bit confusing... smp_read_barrier_depends() is not
> for rcu, we obviously need it to access (unsigned __percpu *) pcpu_count.
> But yes, since we didn't use rcu_dereference() we have to add it by hand.
Yeah - originally I was using rcu_dereference(), but sparse hated
combining __percpu and __rcu and I couldn't get it to stop complaining.
>
> > +int percpu_ref_kill(struct percpu_ref *ref)
> > +{
> > ...
> > + if (status == PCPU_REF_PTR) {
> > + unsigned count = 0, cpu;
> > +
> > + synchronize_rcu();
> > +
> > + for_each_possible_cpu(cpu)
> > + count += *per_cpu_ptr((unsigned __percpu *) pcpu_count, cpu);
> > +
> > + pr_debug("global %lli pcpu %i",
> > + atomic64_read(&ref->count) & PCPU_COUNT_MASK,
> > + (int) count);
> > +
> > + atomic64_add((int) count, &ref->count);
> > + smp_wmb();
> > + /* Between setting global count and setting PCPU_REF_DEAD */
> > + ref->pcpu_count = PCPU_REF_DEAD;
>
> The comment explains what the code does, but not why ;)
That seems like a more straightforward barrier than most... we need the
refcount to be consistent before setting the state to dead :P
> I guess this is for percpu_ref_put(), and this wmb() pairs with implicit
> mb() implied by atomic64_dec_return().
Yeah. I expanded the comment there a bit...
>
> > + free_percpu((unsigned __percpu *) pcpu_count);
>
> I guess it could be freed right after for_each_possible_cpu() above, but
> this doesn't matter.
I think that'd be better though, I'll switch it.
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-25 18:29 ` Oleg Nesterov
@ 2013-01-28 18:10 ` Kent Overstreet
2013-01-28 18:50 ` Oleg Nesterov
0 siblings, 1 reply; 27+ messages in thread
From: Kent Overstreet @ 2013-01-28 18:10 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: tj, srivatsa.bhat, rusty, linux-kernel
On Fri, Jan 25, 2013 at 07:29:24PM +0100, Oleg Nesterov wrote:
> On 01/25, Oleg Nesterov wrote:
> >
> > > +void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
> > > +{
> > > + unsigned long pcpu_count;
> > > + uint64_t v;
> > > +
> > > + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > > +
> > > + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
> > > + /* for rcu - we're not using rcu_dereference() */
> > > + smp_read_barrier_depends();
> > > + __this_cpu_inc(*((unsigned __percpu *) pcpu_count));
> >
> > The comment looks a bit confusing... smp_read_barrier_depends() is not
> > for rcu, we obviously need it to access (unsigned __percpu *) pcpu_count.
> > But yes, since we didn't use rcu_dereference() we have to add it by hand.
>
> Hmm. Otoh, arch/alpha uses asm-generic/percpu.h so in theory we need
> smp_read_barrier_depends() after __this_cpu_generic_to_op() calculates the
> "real" pointer, __this_cpu_ptr() ?
>
> Just curious...
Don't think I follow, but I don't see how... that barrier just needs to
be between the ACCESS_ONCE and touching the memory pointed to; the fact
that there's a calculation in there to get the real pointer is
irrelevant...
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-25 19:11 ` Oleg Nesterov
@ 2013-01-28 18:15 ` Kent Overstreet
2013-01-28 18:27 ` Tejun Heo
0 siblings, 1 reply; 27+ messages in thread
From: Kent Overstreet @ 2013-01-28 18:15 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: tj, srivatsa.bhat, rusty, linux-kernel
On Fri, Jan 25, 2013 at 08:11:39PM +0100, Oleg Nesterov wrote:
> On 01/25, Oleg Nesterov wrote:
> >
> > > +int percpu_ref_kill(struct percpu_ref *ref)
> > > +{
> > > ...
> > > + if (status == PCPU_REF_PTR) {
> > > + unsigned count = 0, cpu;
> > > +
> > > + synchronize_rcu();
> > > +
> > > + for_each_possible_cpu(cpu)
> > > + count += *per_cpu_ptr((unsigned __percpu *) pcpu_count, cpu);
> > > +
> > > + pr_debug("global %lli pcpu %i",
> > > + atomic64_read(&ref->count) & PCPU_COUNT_MASK,
> > > + (int) count);
> > > +
> > > + atomic64_add((int) count, &ref->count);
> > > + smp_wmb();
> > > + /* Between setting global count and setting PCPU_REF_DEAD */
> > > + ref->pcpu_count = PCPU_REF_DEAD;
> >
> > The comment explains what the code does, but not why ;)
> >
> > I guess this is for percpu_ref_put(), and this wmb() pairs with implicit
> > mb() implied by atomic64_dec_return().
>
> Hmm. Most probably I missed something, but it seems we need another
> synchronize_rcu() _after_ we set PCPU_REF_DEAD.
Yeah, correct - documentation bug.
I originally had the synchronize_rcu() there, but this is called by
exit_aio() -> kill_ioctx() when we're killing a process, and Ben LaHaise
pointed out that was less than ideal if a process had a bunch of ioctxs
- so I left the second one out there so the caller would have the option
of using call_rcu() instead.
> To simplify, suppose that percpu_ref_put() is never called directly but
> we have
>
> void put_and_destroy(...)
> {
> if (percpu_ref_put(...))
> destroy(...);
> }
>
> Suppose that ref->count == 2 after atomic64_add() above. IOW, we have
> a "master" reference for _kill() and someone else did _get.
>
> So the caller does
>
> percpu_ref_kill();
> put_and_destroy();
>
> And this can race with another holder which drops the last reference,
> its put_and_destroy() can see PCPU_REF_DYING and return false.
>
> Or I misunderstood the code/interface?
Nope, nailed it :) That should _definitely_ be in the documentation.
Actually - I think it'd be better to have the default percpu_ref_kill()
do the second synchronize_rcu(), and have an unsafe version that skips
it.
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 18:15 ` Kent Overstreet
@ 2013-01-28 18:27 ` Tejun Heo
2013-01-28 18:49 ` Kent Overstreet
0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2013-01-28 18:27 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
Hello, guys.
On Mon, Jan 28, 2013 at 10:15:28AM -0800, Kent Overstreet wrote:
> > percpu_ref_kill();
> > put_and_destroy();
> >
> > And this can race with another holder which drops the last reference,
> > its put_and_destroy() can see PCPU_REF_DYING and return false.
> >
> > Or I misunderstood the code/interface?
>
> Nope, nailed it :) That should _definitely_ be in the documentation.
Can we just combine kill initiation and base ref put and make that the
responsibility of the owner? Extra features on basic constructs may
seem good for certain use cases but tend to bring more confusion than
good in the long run. If a user needs to synchronize among multiple
killers, let the user deal with the issue.
> Actually - I think it'd be better to have the default percpu_ref_kill()
> do the second synchronize_rcu(), and have an unsafe version that skips
> it.
Note that synchronize_rcu/sched() can be very slow and cause problems
in paths which are frequently traveled and visible to userland. It's
fine for things like module destruction but can be a problem even
during device destruction - blkcg had synchronize_rcu() in
request_queue destruction which led to huge latencies during boot
because SCSI wants to create and then destroy request_queues for all
possible LUNs on certain configurations. So, if you put
synchronize_rcu/sched() in percpu_ref_kill(), that better not be used
from e.g. close(2).
Thanks.
--
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 18:27 ` Tejun Heo
@ 2013-01-28 18:49 ` Kent Overstreet
2013-01-28 18:55 ` Tejun Heo
0 siblings, 1 reply; 27+ messages in thread
From: Kent Overstreet @ 2013-01-28 18:49 UTC (permalink / raw)
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
On Mon, Jan 28, 2013 at 10:27:37AM -0800, Tejun Heo wrote:
> Hello, guys.
>
> On Mon, Jan 28, 2013 at 10:15:28AM -0800, Kent Overstreet wrote:
> > > percpu_ref_kill();
> > > put_and_destroy();
> > >
> > > And this can race with another holder which drops the last reference,
> > > its put_and_destroy() can see PCPU_REF_DYING and return false.
> > >
> > > Or I misunderstood the code/interface?
> >
> > Nope, nailed it :) That should _definitely_ be in the documentation.
>
> Can we just combine kill initiation and base ref put and make that the
> responsibility of the owner? Extra features on basic constructs may
> seem good for certain use cases but tend to bring more confusion than
> good in the long run. If a user needs to synchronize among multiple
> killers, let the user deal with the issue.
Don't follow...
Something I forgot to mention in the last mail though is that often the
caller will need its own synchronize_rcu()/call_rcu() -
percpu_ref_kill() corresponds to when you make the object unavailable
(i.e. deleting it from the rcu protected hash table in aio) and you need
a synchronize_rcu() before you drop your initial ref.
So letting the caller do it means the caller can merge the two
synchronize_rcu()s.
>
> > Actually - I think it'd be better to have the default percpu_ref_kill()
> > do the second synchronize_rcu(), and have an unsafe version that skips
> > it.
>
> Note that synchronize_rcu/sched() can be very slow and cause problems
> in paths which are frequently traveled and visible to userland. It's
> fine for things like module destruction but can be a problem even
> during device destruction - blkcg had synchronize_rcu() in
> request_queue destruction which led to huge latencies during boot
> because SCSI wants to create and then destroy request_queues for all
> possible LUNs on certain configurations. So, if you put
> synchronize_rcu/sched() in percpu_ref_kill(), that better not be used
> from e.g. close(2).
Yeah. It'd be really nice if it was doable without synchronize_rcu(),
but it'd definitely make get/put heavier.
Though, re. close() - considering we only need a synchronize_rcu() if
the ref was in percpu mode, I wonder if that would be a dealbreaker. I
have no clue myself.
Getting rid of synchronize_rcu would basically require turning get and
put into cmpxchg() loops - even in the percpu fastpath. However, percpu
mode would still be getting rid of the shared cacheline contention, we'd
just be adding another branch that can be safely marked unlikely() - and
my current version has one of those already, so two branches instead of
one in the fast path.
I suppose I should give it a shot.
As long as I'm going down that route I could probably make the bare non
percpu ref 8 bytes instead of 16, too...
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 18:10 ` Kent Overstreet
@ 2013-01-28 18:50 ` Oleg Nesterov
0 siblings, 0 replies; 27+ messages in thread
From: Oleg Nesterov @ 2013-01-28 18:50 UTC (permalink / raw)
To: Kent Overstreet; +Cc: tj, srivatsa.bhat, rusty, linux-kernel
On 01/28, Kent Overstreet wrote:
>
> On Fri, Jan 25, 2013 at 07:29:24PM +0100, Oleg Nesterov wrote:
> > On 01/25, Oleg Nesterov wrote:
> > >
> > > > +void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
> > > > +{
> > > > + unsigned long pcpu_count;
> > > > + uint64_t v;
> > > > +
> > > > + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > > > +
> > > > + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
> > > > + /* for rcu - we're not using rcu_dereference() */
> > > > + smp_read_barrier_depends();
> > > > + __this_cpu_inc(*((unsigned __percpu *) pcpu_count));
> > >
> > > The comment looks a bit confusing... smp_read_barrier_depends() is not
> > > for rcu, we obviously need it to access (unsigned __percpu *) pcpu_count.
> > > But yes, since we didn't use rcu_dereference() we have to add it by hand.
> >
> > Hmm. Otoh, arch/alpha uses asm-generic/percpu.h so in theory we need
> > smp_read_barrier_depends() after __this_cpu_generic_to_op() calculates the
> > "real" pointer, __this_cpu_ptr() ?
> >
> > Just curious...
>
> Don't think I follow, but I don't see how... that barrier just needs to
> be between the ACCESS_ONCE and touching the memory pointed to; the fact
> that there's a calculation in there to get the real pointer is
> irrelevant...
Yes, I was wrong. I forgot that pcp is already the pointer even in
_generic_ case, we only add the offset.
Thanks for correcting me.
Oleg.
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 18:49 ` Kent Overstreet
@ 2013-01-28 18:55 ` Tejun Heo
2013-01-28 20:22 ` Kent Overstreet
0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2013-01-28 18:55 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
Hey, Kent.
On Mon, Jan 28, 2013 at 10:49:33AM -0800, Kent Overstreet wrote:
> Yeah. It'd be really nice if it was doable without synchronize_rcu(),
> but it'd definitely make get/put heavier.
>
> Though, re. close() - considering we only need a synchronize_rcu() if
> the ref was in percpu mode, I wonder if that would be a dealbreaker. I
> have no clue myself.
The problem is that the performance drop (or latency increase) in
pathological cases would be catastrophic. We're talking about
possibly quite a few millisecs of delay between each close(). When
done sequentially for a large number of files, it gets ugly. It becomes
a dangerous optimization to make.
> Getting rid of synchronize_rcu would basically require turning get and
> put into cmpxchg() loops - even in the percpu fastpath. However, percpu
> mode would still be getting rid of the shared cacheline contention, we'd
> just be adding another branch that can be safely marked unlikely() - and
> my current version has one of those already, so two branches instead of
> one in the fast path.
Or offer an asynchronous interface so that high-frequency users don't
end up inserting synchronize_sched() between each call. It makes the
interface more complex and further away from a simple atomic_t
replacement tho.
Thanks.
--
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 18:55 ` Tejun Heo
@ 2013-01-28 20:22 ` Kent Overstreet
2013-01-28 20:27 ` Tejun Heo
0 siblings, 1 reply; 27+ messages in thread
From: Kent Overstreet @ 2013-01-28 20:22 UTC (permalink / raw)
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
On Mon, Jan 28, 2013 at 10:55:52AM -0800, Tejun Heo wrote:
> Hey, Kent.
>
> On Mon, Jan 28, 2013 at 10:49:33AM -0800, Kent Overstreet wrote:
> > Yeah. It'd be really nice if it was doable without synchronize_rcu(),
> > but it'd definitely make get/put heavier.
> >
> > Though, re. close() - considering we only need a synchronize_rcu() if
> > the ref was in percpu mode, I wonder if that would be a dealbreaker. I
> > have no clue myself.
>
> The problem is that the performance drop (or latency increase) in
> pathological cases would be catastrophic. We're talking about
> possibly quite a few millisecs of delay between each close(). When
> done sequentially for a large number of files, it gets ugly. It becomes
> a dangerous optimization to make.
Yeah, I tend to agree.
> > Getting rid of synchronize_rcu would basically require turning get and
> > put into cmpxchg() loops - even in the percpu fastpath. However, percpu
> > mode would still be getting rid of the shared cacheline contention, we'd
> > just be adding another branch that can be safely marked unlikely() - and
> > my current version has one of those already, so two branches instead of
> > one in the fast path.
>
> Or offer an asynchronous interface so that high-frequency users don't
> end up inserting synchronize_sched() between each call. It makes the
> interface more complex and further away from a simple atomic_t
> replacement tho.
Could do that too, but then teardown gets really messy for the user - we
need two synchronize_rcu()s:
state := dying
synchronize_rcu()
/* Now nothing's changing the per cpu counters */
Add per cpu counters to atomic counter
/* Atomic counter is now consistent */
state := dead
synchronize_rcu()
/* Now percpu_ref_put will check for ref == 0 */
/* Drop initial ref */
percpu_ref_put()
And note that the first synchronize_rcu() is only needed when we had
allocated per cpu counters, my current code skips it otherwise.
(which is a reason for keeping dynamic allocation I hadn't thought of -
if we don't ever allocate percpu counters, teardown is faster)
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 20:22 ` Kent Overstreet
@ 2013-01-28 20:27 ` Tejun Heo
2013-01-28 20:55 ` Kent Overstreet
0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2013-01-28 20:27 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
Hey, Kent.
On Mon, Jan 28, 2013 at 12:22 PM, Kent Overstreet
<koverstreet@google.com> wrote:
> Could do that too, but then teardown gets really messy for the user - we
> need two synchronize_rcu()s:
>
> state := dying
>
> synchronize_rcu()
>
> /* Now nothing's changing the per cpu counters */
>
> Add per cpu counters to atomic counter
>
> /* Atomic counter is now consistent */
>
> state := dead
>
> synchronize_rcu()
I don't understand why we need two stages. What prevents the killing
thread from fetching percpu counters after dying passes one
synchronize_sched()?
> /* Now percpu_ref_put will check for ref == 0 */
>
> /* Drop initial ref */
>
> percpu_ref_put()
>
> And note that the first synchronize_rcu() is only needed when we had
> allocated per cpu counters, my current code skips it otherwise.
And regardless, at the interface level, can't it just provide
percpu_ref_put_base_ref(release_fn)?
Thanks.
--
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 20:27 ` Tejun Heo
@ 2013-01-28 20:55 ` Kent Overstreet
2013-01-28 21:18 ` Tejun Heo
0 siblings, 1 reply; 27+ messages in thread
From: Kent Overstreet @ 2013-01-28 20:55 UTC (permalink / raw)
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
On Mon, Jan 28, 2013 at 12:27:45PM -0800, Tejun Heo wrote:
> Hey, Kent.
>
> On Mon, Jan 28, 2013 at 12:22 PM, Kent Overstreet
> <koverstreet@google.com> wrote:
> > Could do that too, but then teardown gets really messy for the user - we
> > need two synchronize_rcu()s:
> >
> > state := dying
> >
> > synchronize_rcu()
> >
> > /* Now nothing's changing the per cpu counters */
> >
> > Add per cpu counters to atomic counter
> >
> > /* Atomic counter is now consistent */
> >
> > state := dead
> >
> > synchronize_rcu()
>
> I don't understand why we need two stages. What prevents the killing
> thread from fetching percpu counters after dying passes one
> synchronize_sched()?
It does. The second synchronize_sched() is needed after we set state :=
dead, and before we drop the initial ref. Otherwise the ref could hit 0
before percpu_ref_put knows to check for it.
>
> > /* Now percpu_ref_put will check for ref == 0 */
> >
> > /* Drop initial ref */
> >
> > percpu_ref_put()
> >
> > And note that the first synchronize_rcu() is only needed when we had
> > allocated per cpu counters, my current code skips it otherwise.
>
> And regardless, at the interface level, can't it just provide
> percpu_ref_put_base_ref(release_fn)?
Yeah, can definitely provide one that wraps it all together.
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 20:55 ` Kent Overstreet
@ 2013-01-28 21:18 ` Tejun Heo
2013-01-28 21:24 ` Kent Overstreet
0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2013-01-28 21:18 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
Hello, Kent.
On Mon, Jan 28, 2013 at 12:55:40PM -0800, Kent Overstreet wrote:
> > I don't understand why we need two stages. What prevents the killing
> > thread from fetching percpu counters after dying passes one
> > synchronize_sched()?
>
> It does. The second synchronize_sched() is needed after we set state :=
> dead, and before we drop the initial ref. Otherwise the ref could hit 0
> before percpu_ref_put knows to check for it.
Still a bit confused. Why do we need to make the two steps separate?
What prevents us from doing the following?
set dying;
synchronize_sched();
collect percpu refs into global atomic_t;
put the base ref;
Thanks.
--
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 21:18 ` Tejun Heo
@ 2013-01-28 21:24 ` Kent Overstreet
2013-01-28 21:28 ` Tejun Heo
0 siblings, 1 reply; 27+ messages in thread
From: Kent Overstreet @ 2013-01-28 21:24 UTC (permalink / raw)
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
On Mon, Jan 28, 2013 at 01:18:32PM -0800, Tejun Heo wrote:
> Hello, Kent.
>
> On Mon, Jan 28, 2013 at 12:55:40PM -0800, Kent Overstreet wrote:
> > > I don't understand why we need two stages. What prevents the killing
> > > thread from fetching percpu counters after dying passes one
> > > synchronize_sched()?
> >
> > It does. The second synchronize_sched() is needed after we set state :=
> > dead, and before we drop the initial ref. Otherwise the ref could hit 0
> > before percpu_ref_put knows to check for it.
>
> Still a bit confused. Why do we need to make the two steps separate?
> What prevents us from doing the following?
>
> set dying;
> synchronize_sched();
> collect percpu refs into global atomic_t;
> put the base ref;
After you set state := dying, percpu_ref_put() decrements the atomic_t,
but it can't check if it's 0 yet because the thread that's collecting
the percpu refs might not be done yet.
So percpu_ref_put can't check for ref == 0 until after state == dead.
But the put in your example might have made ref 0. When did you set
state to dead?
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 21:24 ` Kent Overstreet
@ 2013-01-28 21:28 ` Tejun Heo
2013-01-28 21:36 ` Tejun Heo
2013-01-28 21:45 ` Kent Overstreet
0 siblings, 2 replies; 27+ messages in thread
From: Tejun Heo @ 2013-01-28 21:28 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
On Mon, Jan 28, 2013 at 01:24:07PM -0800, Kent Overstreet wrote:
> > set dying;
> > synchronize_sched();
> > collect percpu refs into global atomic_t;
> > put the base ref;
>
> After you set state := dying, percpu_ref_put() decrements the atomic_t,
> but it can't check if it's 0 yet because the thread that's collecting
> the percpu refs might not be done yet.
>
> So percpu_ref_put can't check for ref == 0 until after state == dead.
> But the put in your example might have made ref 0. When did you set
> state to dead?
But at that point, the operation is already global, so there gotta be
a lighter way to synchronize stuff than going through full grace
period. ie. You can add a bias value before marking dead so that the
counter never reaches zero before all percpu counters are collected
and then unbias it right before putting the base ref, that way the
only way you can hit zero ref is all refs are actually zero.
Thanks.
--
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 21:28 ` Tejun Heo
@ 2013-01-28 21:36 ` Tejun Heo
2013-01-28 21:48 ` Kent Overstreet
2013-01-28 21:45 ` Kent Overstreet
1 sibling, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2013-01-28 21:36 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
On Mon, Jan 28, 2013 at 01:28:14PM -0800, Tejun Heo wrote:
> But at that point, the operation is already global, so there gotta be
> a lighter way to synchronize stuff than going through full grace
> period. ie. You can add a bias value before marking dead so that the
> counter never reaches zero before all percpu counters are collected
> and then unbias it right before putting the base ref, that way the
> only way you can hit zero ref is all refs are actually zero.
Note that I'm saying that there's no need to distinguish between dying
and dead. The only thing the percpu part should care about is whether
percpu is on or off. We need a full grace period to turn off percpu
operations of any type but that should be the only case where full
grace period is necessary. The rest should be synchronizable via the
usual global synchronization. We probably can just test percpu
variable against NULL and forget about the encoded state numbers.
Thanks.
--
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 21:28 ` Tejun Heo
2013-01-28 21:36 ` Tejun Heo
@ 2013-01-28 21:45 ` Kent Overstreet
2013-01-28 21:50 ` Tejun Heo
1 sibling, 1 reply; 27+ messages in thread
From: Kent Overstreet @ 2013-01-28 21:45 UTC (permalink / raw)
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
On Mon, Jan 28, 2013 at 01:28:14PM -0800, Tejun Heo wrote:
> On Mon, Jan 28, 2013 at 01:24:07PM -0800, Kent Overstreet wrote:
> > > set dying;
> > > synchronize_sched();
> > > collect percpu refs into global atomic_t;
> > > put the base ref;
> >
> > After you set state := dying, percpu_ref_put() decrements the atomic_t,
> > but it can't check if it's 0 yet because the thread that's collecting
> > the percpu refs might not be done yet.
> >
> > So percpu_ref_put can't check for ref == 0 until after state == dead.
> > But the put in your example might have made ref 0. When did you set
> > state to dead?
>
> But at that point, the operation is already global, so there gotta be
> a lighter way to synchronize stuff than going through full grace
> period. ie. You can add a bias value before marking dead so that the
> counter never reaches zero before all percpu counters are collected
> and then unbias it right before putting the base ref, that way the
> only way you can hit zero ref is all refs are actually zero.
Ahh. Bias value sounds... hacky (i.e. harder to convince myself it's
correct) but I see what you're getting at.
Something to consider is wrapping; after we set state to dying but
before we've collected the percpu counters, the atomic counter may be
negative.
But since the atomic counter is 64 bits, we can use 1 << 32 for the bias
value (and just include that when we first initialize it). Which makes
me feel like it's less of a hack too.
I'll have to think about it some more, but it seems like it ought to work...
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 21:36 ` Tejun Heo
@ 2013-01-28 21:48 ` Kent Overstreet
0 siblings, 0 replies; 27+ messages in thread
From: Kent Overstreet @ 2013-01-28 21:48 UTC (permalink / raw)
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
On Mon, Jan 28, 2013 at 01:36:22PM -0800, Tejun Heo wrote:
> On Mon, Jan 28, 2013 at 01:28:14PM -0800, Tejun Heo wrote:
> > But at that point, the operation is already global, so there gotta be
> > a lighter way to synchronize stuff than going through full grace
> > period. ie. You can add a bias value before marking dead so that the
> > counter never reaches zero before all percpu counters are collected
> > and then unbias it right before putting the base ref, that way the
> > only way you can hit zero ref is all refs are actually zero.
>
> Note that I'm saying that there's no need to distinguish between dying
> and dead.
Yeah I got that, I _think_ that makes me like the idea.
> The only thing the percpu part should care about is whether
> percpu is on or off. We need a full grace period to turn off percpu
> operations of any type but that should be the only case where full
> grace period is necessary. The rest should be synchronizable via the
> usual global synchronization. We probably can just test percpu
> variable against NULL and forget about the encoded state numbers.
Can't quite do that because initially, we use that variable for a
timestamp and when we're shutting down we don't want percpu_ref_get()
allocating percpu counters again.
You'll probably use that as an argument against dynamic allocation :P
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 21:45 ` Kent Overstreet
@ 2013-01-28 21:50 ` Tejun Heo
2013-01-29 16:39 ` Kent Overstreet
2013-01-29 18:04 ` [PATCH] module: Convert to generic percpu refcounts Kent Overstreet
0 siblings, 2 replies; 27+ messages in thread
From: Tejun Heo @ 2013-01-28 21:50 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
Hello, Kent.
On Mon, Jan 28, 2013 at 01:45:06PM -0800, Kent Overstreet wrote:
> Ahh. Bias value sounds... hacky (i.e. harder to convince myself it's
> correct) but I see what you're getting at.
I don't think it's that hacky. Just push the base point to the
opposite of zero - LLONG_MAX. The global counter can start biased so
that it gets unbiased only once dying is confirmed and everything can
blindly do atomic64_dec_and_test() and trust the result.
> Something to consider is wrapping; after we set state to dying but
> before we've collected the percpu counters, the atomic counter may be
> negative.
There's a reason why we use 64bit vars for this type of global
counting. They virtually don't wrap. Here, you'll need 1<<63 for the
counter to unintentionally reach 0, which we consider practically
impossible.
Thanks.
--
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-28 21:50 ` Tejun Heo
@ 2013-01-29 16:39 ` Kent Overstreet
2013-01-29 19:29 ` Tejun Heo
2013-01-29 18:04 ` [PATCH] module: Convert to generic percpu refcounts Kent Overstreet
1 sibling, 1 reply; 27+ messages in thread
From: Kent Overstreet @ 2013-01-29 16:39 UTC (permalink / raw)
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
On Mon, Jan 28, 2013 at 01:50:42PM -0800, Tejun Heo wrote:
> Hello, Kent.
>
> On Mon, Jan 28, 2013 at 01:45:06PM -0800, Kent Overstreet wrote:
> > Ahh. Bias value sounds... hacky (i.e. harder to convince myself it's
> > correct) but I see what you're getting at.
>
> I don't think it's that hacky. Just push the base point to the
> opposite of zero - LLONG_MAX. The global counter can start biased so
> that it gets unbiased only once dying is confirmed and everything can
> blindly do atomic64_dec_and_test() and trust the result.
Didn't mean objectively hacky - just my first impression, coming from
having a clear model of how my first version works. I think I've come
around - I coded it up last night and it seems to work.
> > Something to consider is wrapping; after we set state to dying but
> > before we've collected the percpu counters, the atomic counter may be
> > negative.
>
> There's a reason why we use 64bit vars for this type of global
> counting. They virtually don't wrap. Here, you'll need 1<<63 for the
> counter to unintentionally reach 0, which we consider practically
> impossible.
It's not the 64 bit counters themselves that mean you don't have to
consider underflow (which was what I meant, not wrapping), it's the bias
value (though I'm not sure that's the best way to think about it... eh,
quibbling with semantics).
Also the bias value is just stealing one bit from the range of the
counter, and I'm implementing a 32 bit refcount, not 64 (since I'm
stealing some of the high bits of the atomic64_t for the get counter...
though I could still use uint64_t for the percpu counters and implement
a 50 bit refcount).
Oh, if this is going to be widely used I should probably have a
different implementation for archs that don't have atomic64_t in
hardware. Don't suppose you know the CONFIG_ macro to test against? I
couldn't find it when I was looking recently.
Started screwing around with converting the module refcount, too. Looks
like the end result will be cleaner and simpler, though there's some
tricky bits of the conversion due to the weird semantics of module
refcounts.
Anyways, here's the new version:
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
new file mode 100644
index 0000000..9e14a09
--- /dev/null
+++ b/include/linux/percpu-refcount.h
@@ -0,0 +1,153 @@
+/*
+ * Dynamic percpu refcounts:
+ * (C) 2012 Google, Inc.
+ * Author: Kent Overstreet <koverstreet@google.com>
+ *
+ * This implements a refcount with similar semantics to atomic_t - atomic_inc(),
+ * atomic_dec_and_test() - but potentially percpu.
+ *
+ * There's one important difference between percpu refs and normal atomic_t
+ * refcounts; you have to keep track of your initial refcount, and then when you
+ * start shutting down you call percpu_ref_kill() _before_ dropping the initial
+ * refcount.
+ *
+ * Before you call percpu_ref_kill(), percpu_ref_put() does not check for the
+ * refcount hitting 0 - it can't, if it was in percpu mode. percpu_ref_kill()
+ * puts the ref back in single atomic_t mode, collecting the per cpu refs and
+ * issuing the appropriate barriers, and then marks the ref as shutting down so
+ * that percpu_ref_put() will check for the ref hitting 0. After it returns,
+ * it's safe to drop the initial ref.
+ *
+ * BACKGROUND:
+ *
+ * Percpu refcounts are quite useful for performance, but if we blindly
+ * converted all refcounts to percpu counters we'd waste quite a bit of memory.
+ *
+ * Think about all the refcounts embedded in kobjects, files, etc. most of which
+ * aren't used much. These start out as simple atomic counters - a little bigger
+ * than a bare atomic_t, 16 bytes instead of 4 - but if we exceed some arbitrary
+ * number of gets in one second, we then switch to percpu counters.
+ *
+ * This heuristic isn't perfect because it'll fire if the refcount was only
+ * being used on one cpu; ideally we'd be able to count the number of cache
+ * misses on percpu_ref_get() or something similar, but that'd make the non
+ * percpu path significantly heavier/more complex. We can count the number of
+ * gets() without any extra atomic instructions on arches that support
+ * atomic64_t - simply by changing the atomic_inc() to atomic_add_return().
+ *
+ * USAGE:
+ *
+ * See fs/aio.c for some example usage; it's used there for struct kioctx, which
+ * is created when userspace calls io_setup(), and destroyed when userspace
+ * calls io_destroy() or the process exits.
+ *
+ * In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it
+ * calls percpu_ref_kill(), then hlist_del_rcu() and synchronize_rcu() to remove
+ * the kioctx from the process's list of kioctxs - after that, there can't be
+ * any new users of the kioctx (from lookup_ioctx()) and it's then safe to drop
+ * the initial ref with percpu_ref_put().
+ *
+ * Code that does a two stage shutdown like this often needs some kind of
+ * explicit synchronization to ensure the initial refcount can only be dropped
+ * once - percpu_ref_kill() does this for you, it returns true once and false if
+ * someone else already called it. The aio code uses it this way, but it's not
+ * necessary if the code has some other mechanism to synchronize teardown.
+ *
+ * As mentioned previously, we decide when to convert a ref to percpu counters
+ * in percpu_ref_get(). However, since percpu_ref_get() will often be called
+ * with rcu_read_lock() held and can't sleep, the counters aren't allocated
+ * there directly - they come from a small pre-allocated pool, refilled from a
+ * workqueue.
+ *
+ * If the pool happens to be empty when we want to convert, nothing is lost:
+ * the conversion will simply be attempted again the next time the get counter
+ * wraps around.
+ */
+
+#ifndef _LINUX_PERCPU_REFCOUNT_H
+#define _LINUX_PERCPU_REFCOUNT_H
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/rcupdate.h>
+
+struct percpu_ref {
+ atomic64_t count;
+ unsigned long pcpu_count;
+};
+
+void percpu_ref_init(struct percpu_ref *ref);
+void __percpu_ref_get(struct percpu_ref *ref, unsigned long pcpu_count);
+int __percpu_ref_put(struct percpu_ref *ref, unsigned long pcpu_count);
+int percpu_ref_tryget(struct percpu_ref *ref);
+int percpu_ref_put_initial_ref(struct percpu_ref *ref);
+
+#define PCPU_STATUS_BITS 2
+#define PCPU_STATUS_MASK ((1 << PCPU_STATUS_BITS) - 1)
+
+#define PCPU_REF_PTR 0
+#define PCPU_REF_NONE 1
+#define PCPU_REF_DEAD 2
+
+#define REF_STATUS(count) ((count) & PCPU_STATUS_MASK)
+
+/**
+ * percpu_ref_get - increment a dynamic percpu refcount
+ *
+ * Analogous to atomic_inc().
+ */
+static inline void percpu_ref_get(struct percpu_ref *ref)
+{
+ unsigned long pcpu_count;
+
+ preempt_disable();
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
+ /* for rcu - we're not using rcu_dereference() */
+ smp_read_barrier_depends();
+ __this_cpu_inc(*((unsigned __percpu *) pcpu_count));
+ } else {
+ __percpu_ref_get(ref, pcpu_count);
+ }
+
+ preempt_enable();
+}
+
+/**
+ * percpu_ref_put - decrement a dynamic percpu refcount
+ *
+ * Returns true if the result is 0, otherwise false; only checks for the ref
+ * hitting 0 after percpu_ref_kill() has been called. Analogous to
+ * atomic_dec_and_test().
+ */
+static inline int percpu_ref_put(struct percpu_ref *ref)
+{
+ unsigned long pcpu_count;
+ int ret = 0;
+
+ preempt_disable();
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
+ /* for rcu - we're not using rcu_dereference() */
+ smp_read_barrier_depends();
+ __this_cpu_dec(*((unsigned __percpu *) pcpu_count));
+ } else {
+ ret = __percpu_ref_put(ref, pcpu_count);
+ }
+
+ preempt_enable();
+
+ return ret;
+}
+
+unsigned percpu_ref_count(struct percpu_ref *ref);
+int percpu_ref_kill(struct percpu_ref *ref);
+int percpu_ref_dead(struct percpu_ref *ref);
+
+#endif
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
new file mode 100644
index 0000000..bf71d54
--- /dev/null
+++ b/lib/percpu-refcount.c
@@ -0,0 +1,280 @@
+#define pr_fmt(fmt) "%s: " fmt "\n", __func__
+
+#include <linux/jiffies.h>
+#include <linux/percpu-refcount.h>
+
+/*
+ * A percpu refcount can be in 3 different modes. The state is tracked in the
+ * low two bits of percpu_ref->pcpu_count:
+ *
+ * PCPU_REF_NONE - the initial state, no percpu counters allocated.
+ *
+ * PCPU_REF_PTR - using percpu counters for the refcount.
+ *
+ * PCPU_REF_DEAD - we're shutting down; get()/put() use the embedded atomic
+ * counter. The atomic counter starts out biased by PCPU_COUNT_BIAS so that it
+ * can't hit 0 until percpu_ref_kill() has collected the percpu counters and
+ * removed the bias - after that, percpu_ref_put() can trust a result of 0.
+ *
+ * In PCPU_REF_NONE mode, we need to count the number of times percpu_ref_get()
+ * is called; this is done with the high bits of the raw atomic counter. We also
+ * track the time, in jiffies, when the get count last wrapped - this is done
+ * with the remaining bits of percpu_ref->percpu_count.
+ *
+ * So, when percpu_ref_get() is called it increments the get count and checks if
+ * it wrapped; if it did, it checks if the last time it wrapped was less than
+ * one second ago; if so, we want to allocate percpu counters.
+ *
+ * PCPU_COUNT_BITS determines the threshold where we convert to percpu: of the
+ * raw 64 bit counter, we use PCPU_COUNT_BITS for the refcount, and the
+ * remaining (high) bits to count the number of times percpu_ref_get() has been
+ * called. It's currently (completely arbitrarily) 16384 times in one second.
+ *
+ * Percpu mode (PCPU_REF_PTR):
+ *
+ * In percpu mode all we do on get and put is increment or decrement the cpu
+ * local counter, which is a 32 bit unsigned int.
+ *
+ * Note that all the gets() could be happening on one cpu, and all the puts() on
+ * another - the individual cpu counters can wrap (potentially many times).
+ *
+ * But this is fine because we don't need to check for the ref hitting 0 in
+ * percpu mode; before we set the state to PCPU_REF_DEAD we simply sum up all
+ * the percpu counters and add them to the atomic counter. Since addition and
+ * subtraction in modular arithmetic is still associative, the result will be
+ * correct.
+ */
+
+#define PCPU_COUNT_BITS 50
+#define PCPU_COUNT_MASK ((1LL << PCPU_COUNT_BITS) - 1)
+
+#define PCPU_COUNT_BIAS (1ULL << 32)
+
+/*
+ * So that we don't have to call alloc_percpu() from percpu_ref_get(), we use a
+ * small pool and refill from a workqueue.
+ */
+
+#define PCPU_REF_ALLOC_NR 8
+
+static unsigned __percpu *percpu_ref_pool[PCPU_REF_ALLOC_NR];
+static unsigned percpu_ref_alloc_nr;
+static DEFINE_SPINLOCK(percpu_ref_alloc_lock);
+
+static void percpu_ref_alloc_refill(struct work_struct *work)
+{
+ spin_lock_irq(&percpu_ref_alloc_lock);
+
+ while (percpu_ref_alloc_nr < PCPU_REF_ALLOC_NR) {
+ unsigned __percpu *ref;
+
+ ref = alloc_percpu(unsigned);
+ if (!ref)
+ break;
+
+ percpu_ref_pool[percpu_ref_alloc_nr++] = ref;
+ }
+
+ spin_unlock_irq(&percpu_ref_alloc_lock);
+}
+
+static DECLARE_WORK(percpu_ref_alloc_work, percpu_ref_alloc_refill);
+
+static void percpu_ref_alloc(struct percpu_ref *ref, unsigned long pcpu_count)
+{
+ unsigned long flags, now = jiffies;
+ unsigned __percpu *new = NULL;
+
+ now <<= PCPU_STATUS_BITS;
+ now |= PCPU_REF_NONE;
+
+ if (now - pcpu_count <= HZ << PCPU_STATUS_BITS) {
+ spin_lock_irqsave(&percpu_ref_alloc_lock, flags);
+
+ if (percpu_ref_alloc_nr)
+ new = percpu_ref_pool[--percpu_ref_alloc_nr];
+
+ if (percpu_ref_alloc_nr < PCPU_REF_ALLOC_NR / 2)
+ schedule_work(&percpu_ref_alloc_work);
+
+ spin_unlock_irqrestore(&percpu_ref_alloc_lock, flags);
+
+ if (!new)
+ goto update_time;
+
+ BUG_ON((unsigned long) new & PCPU_STATUS_MASK);
+
+ if (cmpxchg(&ref->pcpu_count, pcpu_count,
+ (unsigned long) new) != pcpu_count)
+ free_percpu(new);
+ } else {
+update_time:
+ cmpxchg(&ref->pcpu_count, pcpu_count, now);
+ }
+}
+
+/* Slowpath, i.e. non percpu */
+void __percpu_ref_get(struct percpu_ref *ref, unsigned long pcpu_count)
+{
+ uint64_t v;
+
+ v = atomic64_add_return(1 + (1ULL << PCPU_COUNT_BITS),
+ &ref->count);
+
+ if (unlikely(!(v >> PCPU_COUNT_BITS) &&
+ REF_STATUS(pcpu_count) == PCPU_REF_NONE)) {
+ percpu_ref_alloc(ref, pcpu_count);
+ }
+}
+
+int __percpu_ref_put(struct percpu_ref *ref, unsigned long pcpu_count)
+{
+ uint64_t v;
+
+ v = atomic64_dec_return(&ref->count);
+ v &= PCPU_COUNT_MASK;
+
+ return v == 0;
+}
+
+int percpu_ref_tryget(struct percpu_ref *ref)
+{
+ int ret = 1;
+
+ preempt_disable();
+
+ if (!percpu_ref_dead(ref))
+ percpu_ref_get(ref);
+ else
+ ret = 0;
+
+ preempt_enable();
+
+ return ret;
+}
+
+unsigned percpu_ref_count(struct percpu_ref *ref)
+{
+ unsigned long pcpu_count;
+ unsigned count = 0;
+ int cpu;
+
+ preempt_disable();
+
+ count = atomic64_read(&ref->count) & PCPU_COUNT_MASK;
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
+ /* for rcu - we're not using rcu_dereference() */
+ smp_read_barrier_depends();
+
+ for_each_possible_cpu(cpu)
+ count += *per_cpu_ptr((unsigned __percpu *) pcpu_count, cpu);
+ }
+
+ preempt_enable();
+
+ return count;
+}
+
+/**
+ * percpu_ref_init - initialize a dynamic percpu refcount
+ *
+ * Initializes the refcount in single atomic counter mode with a refcount of 1;
+ * analogous to atomic_set(ref, 1).
+ */
+void percpu_ref_init(struct percpu_ref *ref)
+{
+ unsigned long now = jiffies;
+
+ atomic64_set(&ref->count, 1 + PCPU_COUNT_BIAS);
+
+ now <<= PCPU_STATUS_BITS;
+ now |= PCPU_REF_NONE;
+
+ ref->pcpu_count = now;
+}
+
+/**
+ * percpu_ref_kill - prepare a dynamic percpu refcount for teardown
+ *
+ * Must be called before dropping the initial ref, so that percpu_ref_put()
+ * knows to check for the refcount hitting 0. If the refcount was in percpu
+ * mode, converts it back to single atomic counter mode.
+ *
+ * The caller must issue a synchronize_rcu()/call_rcu() before calling
+ * percpu_ref_put() to drop the initial ref.
+ *
+ * Returns true the first time called on @ref and false if @ref is already
+ * shutting down, so it may be used by the caller for synchronizing other parts
+ * of a two stage shutdown.
+ */
+int percpu_ref_kill(struct percpu_ref *ref)
+{
+ unsigned long old, status, pcpu_count;
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ do {
+ status = REF_STATUS(pcpu_count);
+
+ if (status == PCPU_REF_DEAD)
+ return 0;
+
+ old = pcpu_count;
+ pcpu_count = cmpxchg(&ref->pcpu_count, old, PCPU_REF_DEAD);
+ } while (pcpu_count != old);
+
+ if (status == PCPU_REF_PTR) {
+ int cpu;
+ unsigned count = 0;
+
+ synchronize_sched();
+
+ for_each_possible_cpu(cpu)
+ count += *per_cpu_ptr((unsigned __percpu *) pcpu_count, cpu);
+
+ free_percpu((unsigned __percpu *) pcpu_count);
+
+ pr_debug("global %lli pcpu %i",
+ atomic64_read(&ref->count) & PCPU_COUNT_MASK,
+ (int) count);
+
+ atomic64_add((int) count - PCPU_COUNT_BIAS, &ref->count);
+ } else {
+ /* no percpu counters to collect - just remove the bias */
+ atomic64_sub(PCPU_COUNT_BIAS, &ref->count);
+ }
+
+ return 1;
+}
+
+/**
+ * percpu_ref_put_initial_ref - safely drop the initial ref
+ *
+ * A percpu refcount needs a shutdown sequence before dropping the initial ref,
+ * to put it back into single atomic_t mode with the appropriate barriers so
+ * that percpu_ref_put() can safely check for it hitting 0 - this does so.
+ *
+ * Returns true if @ref hit 0.
+ */
+int percpu_ref_put_initial_ref(struct percpu_ref *ref)
+{
+ if (percpu_ref_kill(ref)) {
+ return percpu_ref_put(ref);
+ } else {
+ __WARN();
+ return 0;
+ }
+}
+
+/**
+ * percpu_ref_dead - check if a dynamic percpu refcount is shutting down
+ *
+ * Returns true if percpu_ref_kill() has been called on @ref, false otherwise.
+ */
+int percpu_ref_dead(struct percpu_ref *ref)
+{
+ return REF_STATUS(ref->pcpu_count) == PCPU_REF_DEAD;
+}
* [PATCH] module: Convert to generic percpu refcounts
2013-01-28 21:50 ` Tejun Heo
2013-01-29 16:39 ` Kent Overstreet
@ 2013-01-29 18:04 ` Kent Overstreet
1 sibling, 0 replies; 27+ messages in thread
From: Kent Overstreet @ 2013-01-29 18:04 UTC (permalink / raw)
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
I started screwing around just to see how hard a conversion would be and
what it'd look like. I _think_ this is complete, but there's enough
going on I undoubtedly missed something.
Completely untested - builds and that's it. I'm sure it's broken.
Deletes almost 100 lines of code though. I like how it looks for the
most part... MODULE_STATE_GOING is messy since the percpu ref has to
track that, and I didn't want to track that in two places - but I
couldn't just delete the enum since module notifiers use it. It's
currently vestigial and only used for the notifiers.
Similarly with the other places module->state and percpu_ref_dead() (via
module_is_live()) are checked... bit ugly but ah well.
---
include/linux/module.h | 28 +++----
include/trace/events/module.h | 2 +-
kernel/debug/kdb/kdb_main.c | 8 +-
kernel/module.c | 165 +++++++++++++-----------------------------
4 files changed, 60 insertions(+), 143 deletions(-)
diff --git a/include/linux/module.h b/include/linux/module.h
index ead1b57..88f78f6 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -17,6 +17,7 @@
#include <linux/moduleparam.h>
#include <linux/tracepoint.h>
#include <linux/export.h>
+#include <linux/percpu-refcount.h>
#include <linux/percpu.h>
#include <asm/module.h>
@@ -206,19 +207,7 @@ enum module_state {
MODULE_STATE_UNFORMED, /* Still setting it up. */
};
-/**
- * struct module_ref - per cpu module reference counts
- * @incs: number of module get on this cpu
- * @decs: number of module put on this cpu
- *
- * We force an alignment on 8 or 16 bytes, so that alloc_percpu()
- * put @incs/@decs in same cache line, with no extra memory cost,
- * since alloc_percpu() is fine grained.
- */
-struct module_ref {
- unsigned long incs;
- unsigned long decs;
-} __attribute((aligned(2 * sizeof(unsigned long))));
+const char *module_state(struct module *mod);
struct module
{
@@ -361,13 +350,10 @@ struct module
/* What modules do I depend on? */
struct list_head target_list;
- /* Who is waiting for us to be unloaded */
- struct task_struct *waiter;
-
/* Destruction function. */
void (*exit)(void);
- struct module_ref __percpu *refptr;
+ struct percpu_ref ref;
#endif
#ifdef CONFIG_CONSTRUCTORS
@@ -387,7 +373,7 @@ extern struct mutex module_mutex;
(IDE & SCSI) require entry into the module during init.*/
static inline int module_is_live(struct module *mod)
{
- return mod->state != MODULE_STATE_GOING;
+ return !percpu_ref_dead(&mod->ref);
}
struct module *__module_text_address(unsigned long addr);
@@ -451,7 +437,11 @@ extern void __module_put_and_exit(struct module *mod, long code)
#define module_put_and_exit(code) __module_put_and_exit(THIS_MODULE, code);
#ifdef CONFIG_MODULE_UNLOAD
-unsigned long module_refcount(struct module *mod);
+static inline unsigned long module_refcount(struct module *mod)
+{
+ return percpu_ref_count(&mod->ref) - 1;
+}
+
void __symbol_put(const char *symbol);
#define symbol_put(x) __symbol_put(MODULE_SYMBOL_PREFIX #x)
void symbol_put_addr(void *addr);
diff --git a/include/trace/events/module.h b/include/trace/events/module.h
index 1619327..3de2241 100644
--- a/include/trace/events/module.h
+++ b/include/trace/events/module.h
@@ -78,7 +78,7 @@ DECLARE_EVENT_CLASS(module_refcnt,
TP_fast_assign(
__entry->ip = ip;
- __entry->refcnt = __this_cpu_read(mod->refptr->incs) + __this_cpu_read(mod->refptr->decs);
+ __entry->refcnt = module_refcount(mod);
__assign_str(name, mod->name);
),
diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c
index 8875254..813c0ed 100644
--- a/kernel/debug/kdb/kdb_main.c
+++ b/kernel/debug/kdb/kdb_main.c
@@ -1978,13 +1978,7 @@ static int kdb_lsmod(int argc, const char **argv)
#ifdef CONFIG_MODULE_UNLOAD
kdb_printf("%4ld ", module_refcount(mod));
#endif
- if (mod->state == MODULE_STATE_GOING)
- kdb_printf(" (Unloading)");
- else if (mod->state == MODULE_STATE_COMING)
- kdb_printf(" (Loading)");
- else
- kdb_printf(" (Live)");
- kdb_printf(" 0x%p", mod->module_core);
+ kdb_printf(" (%s) 0x%p", module_state(mod), mod->module_core);
#ifdef CONFIG_MODULE_UNLOAD
{
diff --git a/kernel/module.c b/kernel/module.c
index 921bed4..08aa83a 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -625,21 +625,15 @@ static char last_unloaded_module[MODULE_NAME_LEN+1];
EXPORT_TRACEPOINT_SYMBOL(module_get);
/* Init the unload section of the module. */
-static int module_unload_init(struct module *mod)
+static void module_unload_init(struct module *mod)
{
- mod->refptr = alloc_percpu(struct module_ref);
- if (!mod->refptr)
- return -ENOMEM;
+ percpu_ref_init(&mod->ref);
INIT_LIST_HEAD(&mod->source_list);
INIT_LIST_HEAD(&mod->target_list);
/* Hold reference count during initialization. */
- __this_cpu_write(mod->refptr->incs, 1);
- /* Backwards compatibility macros put refcount during init. */
- mod->waiter = current;
-
- return 0;
+ __module_get(mod);
}
/* Does a already use b? */
@@ -719,8 +713,6 @@ static void module_unload_free(struct module *mod)
kfree(use);
}
mutex_unlock(&module_mutex);
-
- free_percpu(mod->refptr);
}
#ifdef CONFIG_MODULE_FORCE_UNLOAD
@@ -745,6 +737,15 @@ struct stopref
int *forced;
};
+static void stop_module(struct module *mod)
+{
+ /* Mark it as dying, drop base ref */
+ if (percpu_ref_kill(&mod->ref))
+ module_put(mod);
+ else
+ WARN(1, "initial module ref already dropped");
+}
+
/* Whole machine is stopped with interrupts off when this runs. */
static int __try_stop_module(void *_sref)
{
@@ -756,8 +757,7 @@ static int __try_stop_module(void *_sref)
return -EWOULDBLOCK;
}
- /* Mark it as dying. */
- sref->mod->state = MODULE_STATE_GOING;
+ stop_module(sref->mod);
return 0;
}
@@ -769,57 +769,14 @@ static int try_stop_module(struct module *mod, int flags, int *forced)
return stop_machine(__try_stop_module, &sref, NULL);
} else {
/* We don't need to stop the machine for this. */
- mod->state = MODULE_STATE_GOING;
- synchronize_sched();
+ stop_module(mod);
return 0;
}
}
-unsigned long module_refcount(struct module *mod)
-{
- unsigned long incs = 0, decs = 0;
- int cpu;
-
- for_each_possible_cpu(cpu)
- decs += per_cpu_ptr(mod->refptr, cpu)->decs;
- /*
- * ensure the incs are added up after the decs.
- * module_put ensures incs are visible before decs with smp_wmb.
- *
- * This 2-count scheme avoids the situation where the refcount
- * for CPU0 is read, then CPU0 increments the module refcount,
- * then CPU1 drops that refcount, then the refcount for CPU1 is
- * read. We would record a decrement but not its corresponding
- * increment so we would see a low count (disaster).
- *
- * Rare situation? But module_refcount can be preempted, and we
- * might be tallying up 4096+ CPUs. So it is not impossible.
- */
- smp_rmb();
- for_each_possible_cpu(cpu)
- incs += per_cpu_ptr(mod->refptr, cpu)->incs;
- return incs - decs;
-}
-EXPORT_SYMBOL(module_refcount);
-
/* This exists whether we can unload or not */
static void free_module(struct module *mod);
-static void wait_for_zero_refcount(struct module *mod)
-{
- /* Since we might sleep for some time, release the mutex first */
- mutex_unlock(&module_mutex);
- for (;;) {
- pr_debug("Looking at refcount...\n");
- set_current_state(TASK_UNINTERRUPTIBLE);
- if (module_refcount(mod) == 0)
- break;
- schedule();
- }
- current->state = TASK_RUNNING;
- mutex_lock(&module_mutex);
-}
-
SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
unsigned int, flags)
{
@@ -850,7 +807,7 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
}
/* Doing init or already dying? */
- if (mod->state != MODULE_STATE_LIVE) {
+ if (mod->state != MODULE_STATE_LIVE || !module_is_live(mod)) {
/* FIXME: if (force), slam module count and wake up
waiter --RR */
pr_debug("%s already dying\n", mod->name);
@@ -868,19 +825,17 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
}
}
- /* Set this up before setting mod->state */
- mod->waiter = current;
-
/* Stop the machine so refcounts can't move and disable module. */
ret = try_stop_module(mod, flags, &forced);
if (ret != 0)
goto out;
+ mutex_unlock(&module_mutex);
+
/* Never wait if forced. */
- if (!forced && module_refcount(mod) != 0)
- wait_for_zero_refcount(mod);
+ if (!forced)
+ wait_event(module_wq, !percpu_ref_count(&mod->ref));
- mutex_unlock(&module_mutex);
/* Final destruction now no one is using it. */
if (mod->exit != NULL)
mod->exit();
@@ -962,45 +917,29 @@ static struct module_attribute modinfo_refcnt =
void __module_get(struct module *module)
{
if (module) {
- preempt_disable();
- __this_cpu_inc(module->refptr->incs);
trace_module_get(module, _RET_IP_);
- preempt_enable();
+ percpu_ref_get(&module->ref);
}
}
EXPORT_SYMBOL(__module_get);
bool try_module_get(struct module *module)
{
- bool ret = true;
-
if (module) {
- preempt_disable();
-
- if (likely(module_is_live(module))) {
- __this_cpu_inc(module->refptr->incs);
- trace_module_get(module, _RET_IP_);
- } else
- ret = false;
-
- preempt_enable();
- }
- return ret;
+ trace_module_get(module, _RET_IP_);
+ return percpu_ref_tryget(&module->ref);
+ } else
+ return true;
}
EXPORT_SYMBOL(try_module_get);
void module_put(struct module *module)
{
if (module) {
- preempt_disable();
- smp_wmb(); /* see comment in module_refcount */
- __this_cpu_inc(module->refptr->decs);
-
trace_module_put(module, _RET_IP_);
- /* Maybe they're waiting for us to drop reference? */
- if (unlikely(!module_is_live(module)))
- wake_up_process(module->waiter);
- preempt_enable();
+
+ if (percpu_ref_put(&module->ref))
+ wake_up_all(&module_wq);
}
}
EXPORT_SYMBOL(module_put);
@@ -1048,25 +987,27 @@ static size_t module_flags_taint(struct module *mod, char *buf)
return l;
}
-static ssize_t show_initstate(struct module_attribute *mattr,
- struct module_kobject *mk, char *buffer)
+const char *module_state(struct module *mod)
{
- const char *state = "unknown";
-
- switch (mk->mod->state) {
+ switch (mod->state) {
case MODULE_STATE_LIVE:
- state = "live";
- break;
+ if (module_is_live(mod))
+ return "live";
+ else
+ return "going";
case MODULE_STATE_COMING:
- state = "coming";
- break;
- case MODULE_STATE_GOING:
- state = "going";
- break;
+ return "coming";
+ case MODULE_STATE_UNFORMED:
+ return "unformed";
default:
- BUG();
+ return "unknown";
}
- return sprintf(buffer, "%s\n", state);
+}
+
+static ssize_t show_initstate(struct module_attribute *mattr,
+ struct module_kobject *mk, char *buffer)
+{
+ return sprintf(buffer, "%s\n", module_state(mk->mod));
}
static struct module_attribute modinfo_initstate =
@@ -3022,8 +2963,7 @@ static bool finished_loading(const char *name)
mutex_lock(&module_mutex);
mod = find_module_all(name, true);
- ret = !mod || mod->state == MODULE_STATE_LIVE
- || mod->state == MODULE_STATE_GOING;
+ ret = !mod || mod->state == MODULE_STATE_LIVE;
mutex_unlock(&module_mutex);
return ret;
@@ -3073,8 +3013,7 @@ static int do_init_module(struct module *mod)
if (ret < 0) {
/* Init routine failed: abort. Try to protect us from
buggy refcounters. */
- mod->state = MODULE_STATE_GOING;
- synchronize_sched();
+ stop_module(mod);
module_put(mod);
blocking_notifier_call_chain(&module_notify_list,
MODULE_STATE_GOING, mod);
@@ -3245,9 +3184,7 @@ static int load_module(struct load_info *info, const char __user *uargs,
#endif
/* Now module is in final location, initialize linked lists, etc. */
- err = module_unload_init(mod);
- if (err)
- goto unlink_mod;
+ module_unload_init(mod);
/* Now we've got everything in the final locations, we can
* find optional sections. */
@@ -3323,7 +3260,6 @@ static int load_module(struct load_info *info, const char __user *uargs,
free_modinfo(mod);
free_unload:
module_unload_free(mod);
- unlink_mod:
mutex_lock(&module_mutex);
/* Unlink carefully: kallsyms could be walking list. */
list_del_rcu(&mod->list);
@@ -3614,12 +3550,12 @@ static char *module_flags(struct module *mod, char *buf)
BUG_ON(mod->state == MODULE_STATE_UNFORMED);
if (mod->taints ||
- mod->state == MODULE_STATE_GOING ||
+ !module_is_live(mod) ||
mod->state == MODULE_STATE_COMING) {
buf[bx++] = '(';
bx += module_flags_taint(mod, buf + bx);
/* Show a - for module-is-being-unloaded */
- if (mod->state == MODULE_STATE_GOING)
+ if (!module_is_live(mod))
buf[bx++] = '-';
/* Show a + for module-is-being-loaded */
if (mod->state == MODULE_STATE_COMING)
@@ -3663,10 +3599,7 @@ static int m_show(struct seq_file *m, void *p)
print_unload_info(m, mod);
/* Informative for users. */
- seq_printf(m, " %s",
- mod->state == MODULE_STATE_GOING ? "Unloading":
- mod->state == MODULE_STATE_COMING ? "Loading":
- "Live");
+ seq_printf(m, " %s", module_state(mod));
/* Used by oprofile and other similar tools. */
seq_printf(m, " 0x%pK", mod->module_core);
--
1.7.12
^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-29 16:39 ` Kent Overstreet
@ 2013-01-29 19:29 ` Tejun Heo
2013-01-29 19:51 ` Kent Overstreet
0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2013-01-29 19:29 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
Hey, Kent.
On Tue, Jan 29, 2013 at 08:39:42AM -0800, Kent Overstreet wrote:
> Oh, if this is going to be widely used I should probably have a
> different implementation for archs that don't have atomic64_t in
> hardware. Don't suppose you know the CONFIG_ macro to test against? I
> couldn't find it when I was looking recently.
!CONFIG_GENERIC_ATOMIC64, I think. I don't think it's worthwhile to
worry about tho. Reverting to simple atomic_t ops on !CONFIG_SMP
should cover most relevant cases.
Very tentative review follows.
> +#define PCPU_REF_PTR 0
> +#define PCPU_REF_NONE 1
> +#define PCPU_REF_DEAD 2
Do we still need to distinguish between NONE and DEAD? It would be
nice if we can just test against NULL.
> +/**
> + * percpu_ref_get - increment a dynamic percpu refcount
> + *
> + * Analogous to atomic_inc().
> + */
> +static inline void percpu_ref_get(struct percpu_ref *ref)
> +{
> + unsigned long pcpu_count;
> +
> + preempt_disable();
> +
> + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
> + /* for rcu - we're not using rcu_dereference() */
> + smp_read_barrier_depends();
Let's explain more. :)
> + __this_cpu_inc(*((unsigned __percpu *) pcpu_count));
> What about overflow? Note that we can have systematic cases where ref
is gotten on one cpu and put on another transferring counts in a
specific direction.
> +#define PCPU_COUNT_BITS 50
> +#define PCPU_COUNT_MASK ((1LL << PCPU_COUNT_BITS) - 1)
> +
> +#define PCPU_COUNT_BIAS (1ULL << 32)
I really don't get this. Why use 1<<32 for bias which isn't too
difficult to overflow especially if you have many cpu threads and some
systematic drift in percpu refs? What's the point of not using the
higher bits? Is it to encode count usage statistics into the same
counter? If so, just add another field. 24bytes is fine.
I'd really like to see just basic percpu refcnt with async and sync
interfaces w/o dynamic allocation. At this point, we aren't really
sure if there are gonna be enough benefits or use cases from dynamic
allocation, so I don't really see the point in the added complexity.
Let's start with basic stuff and add on dynamic alloc if it actually
is necessary.
Thanks.
--
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-29 19:29 ` Tejun Heo
@ 2013-01-29 19:51 ` Kent Overstreet
2013-01-29 20:02 ` Tejun Heo
0 siblings, 1 reply; 27+ messages in thread
From: Kent Overstreet @ 2013-01-29 19:51 UTC (permalink / raw)
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
On Tue, Jan 29, 2013 at 11:29:04AM -0800, Tejun Heo wrote:
> Hey, Kent.
>
> On Tue, Jan 29, 2013 at 08:39:42AM -0800, Kent Overstreet wrote:
> > Oh, if this is going to be widely used I should probably have a
> > different implementation for archs that don't have atomic64_t in
> > hardware. Don't suppose you know the CONFIG_ macro to test against? I
> > couldn't find it when I was looking recently.
>
> !CONFIG_GENERIC_ATOMIC64, I think. I don't think it's worthwhile to
> worry about tho. Reverting to simple atomic_t ops on !CONFIG_SMP
> should cover most relevant cases.
>
> Very tentative review follows.
>
> > +#define PCPU_REF_PTR 0
> > +#define PCPU_REF_NONE 1
> > +#define PCPU_REF_DEAD 2
>
> Do we still need to distinguish between NONE and DEAD? It would be
> nice if we can just test against NULL.
With dynamic allocation, yes - we can't alloc percpu counters if it's
dead.
> > +/**
> > + * percpu_ref_get - increment a dynamic percpu refcount
> > + *
> > + * Analogous to atomic_inc().
> > + */
> > +static inline void percpu_ref_get(struct percpu_ref *ref)
> > +{
> > + unsigned long pcpu_count;
> > +
> > + preempt_disable();
> > +
> > + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > +
> > + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
> > + /* for rcu - we're not using rcu_dereference() */
> > + smp_read_barrier_depends();
>
> Let's explain more. :)
I'd like to come up with a way of using rcu_dereference() without sparse
choking on __percpu combined with __rcu, but ok.
> > + __this_cpu_inc(*((unsigned __percpu *) pcpu_count));
>
> What about overflow? Note that we can have systemetic cases where ref
> is gotten on one cpu and put on another transferring counts in a
> specific direction.
Heh, this keeps coming up.
It works because modular arithmetic is still associative for addition
and subtraction. It doesn't matter which cpu the gets and puts happen
on, the counters will still sum to the same value - even if individual
counters overflow/underflow.
(It's the same as doing a bunch of increments and decrements to a single
integer, but in random order. 0 - 1 - 1 + 1 + 1 + 1 will always equal 1,
no matter where you stick parentheses).
That's also why I use an unsigned integer for the per cpu counters...
since they're going to wrap and signed integer overflow is undefined.
Not that I really expect gcc to figure out a way to screw me for that,
but I tend to be anal about such things.
> > +#define PCPU_COUNT_BITS 50
> > +#define PCPU_COUNT_MASK ((1LL << PCPU_COUNT_BITS) - 1)
> > +
> > +#define PCPU_COUNT_BIAS (1ULL << 32)
>
> I really don't get this. Why use 1<<32 for bias which isn't too
> difficult to overflow especially if you have many cpu threads and some
> > systematic drift in percpu refs? What's the point of not using the
> higher bits? Is it to encode count usage statistics into the same
> counter? If so, just add another field. 24bytes is fine.
The systematic drift doesn't affect overflow here - the bias is only
applied to the atomic counter, which can only be off by at most whatever
the per cpu counters happen to sum to (and remember it's modular
arithmetic, so that's INT_MIN/INT_MAX).
Since I'm using 32 bit ints for the per cpu counters, 1 << 32 works just
fine. If we want to expand the per cpu counters to longs or 64 bit ints,
then yeah I'd use a larger bias.
It's nitpicky but using a larger bias feels like overly defensive
programming to me (if you know how big the bias has to be, great, use
that - if you aren't sure, wtf are you doing :P)
> I'd really like to see just basic percpu refcnt with async and sync
> interfaces w/o dynamic allocation. At this point, we aren't really
> sure if there are gonna be enough benefits or use cases from dynamic
> allocation, so I don't really see the point in the added complexity.
> Let's start with basic stuff and add on dynamic alloc if it actually
> is necessary.
Well, with the alloc rework the dynamic allocation is completely hidden
from the users - it's just get and put. Killing dynamic allocation would
just mean that percpu_ref_init() could fail, as far as the api is
concerned.
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-29 19:51 ` Kent Overstreet
@ 2013-01-29 20:02 ` Tejun Heo
2013-01-29 21:45 ` Kent Overstreet
0 siblings, 1 reply; 27+ messages in thread
From: Tejun Heo @ 2013-01-29 20:02 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
Hello, Kent.
On Tue, Jan 29, 2013 at 11:51:41AM -0800, Kent Overstreet wrote:
> > What about overflow? Note that we can have systematic cases where ref
> > is gotten on one cpu and put on another transferring counts in a
> > specific direction.
>
> Heh, this keeps coming up.
>
> It works because modular arithmetic is still associative for addition
> and subtraction. It doesn't matter which cpu the gets and puts happen
> on, the counters will still sum to the same value - even if individual
> counters overflow/underflow.
>
> (It's the same as doing a bunch of increments and decrements to a single
> integer, but in random order. 0 - 1 - 1 + 1 + 1 + 1 will always equal 1,
> no matter where you stick parentheses).
>
> That's also why I use an unsigned integer for the per cpu counters...
> since they're going to wrap and signed integer overflow is undefined.
> Not that I really expect gcc to figure out a way to screw me for that,
> but I tend to be anal about such things.
Hmmm... okay, but that only works if the summing up is also done in
32bits, right? Is that why the global counter is kept at 32bits?
> > I really don't get this. Why use 1<<32 for bias which isn't too
> > difficult to overflow especially if you have many cpu threads and some
> > systematic drift in percpu refs? What's the point of not using the
> > higher bits? Is it to encode count usage statistics into the same
> > counter? If so, just add another field. 24bytes is fine.
>
> The systematic drift doesn't affect overflow here - the bias is only
> applied to the atomic counter, which can only be off by at most whatever
> the per cpu counters happen to sum to (and remember it's modular
> arithmetic, so that's INT_MIN/INT_MAX).
Yeah if you don't have to worry about flushing percpu counters to
prevent over/underflows, this shouldn't matter. It would be nice to
document it tho.
> Since I'm using 32 bit ints for the per cpu counters, 1 << 32 works just
> fine. If we want to expand the per cpu counters to longs or 64 bit ints,
> then yeah I'd use a larger bias.
>
> It's nitpicky but using a larger bias feels like overly defensive
> programming to me (if you know how big the bias has to be, great, use
> that - if you aren't sure, wtf are you doing :P)
I don't know. I think it's a correctness issue. If you
over/underflow into global counter, you need buffer space to be
(practically) correct. If modular calculation makes it unnecessary to
worry about over/underflows, we can't use more than 32bits for global
sums, right? I'm still a bit hazy but it looks like using all 32bit
counters should work w/o worrying about over/underflows and that's
really nice.
> > I'd really like to see just basic percpu refcnt with async and sync
> > interfaces w/o dynamic allocation. At this point, we aren't really
> > sure if there are gonna be enough benefits or use cases from dynamic
> > allocation, so I don't really see the point in the added complexity.
> > Let's start with basic stuff and add on dynamic alloc if it actually
> > is necessary.
>
> Well, with the alloc rework the dynamic allocation is completely hidden
> from the users - it's just get and put. Killing dynamic allocation would
> just mean that percpu_ref_init() could fail, as far as the api is
> concerned.
There really is no point in adding complexity which isn't needed. It
slows down the code unnecessarily (however minute) and adds
maintenance overhead. Why do that when there's no need?
Thanks.
--
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-29 20:02 ` Tejun Heo
@ 2013-01-29 21:45 ` Kent Overstreet
2013-01-29 22:06 ` Tejun Heo
0 siblings, 1 reply; 27+ messages in thread
From: Kent Overstreet @ 2013-01-29 21:45 UTC (permalink / raw)
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
On Tue, Jan 29, 2013 at 12:02:18PM -0800, Tejun Heo wrote:
> Hello, Kent.
>
> On Tue, Jan 29, 2013 at 11:51:41AM -0800, Kent Overstreet wrote:
> > > What about overflow? Note that we can have systematic cases where ref
> > > is gotten on one cpu and put on another transferring counts in a
> > > specific direction.
> >
> > Heh, this keeps coming up.
> >
> > It works because modular arithmetic is still associative for addition
> > and subtraction. It doesn't matter which cpu the gets and puts happen
> > on, the counters will still sum to the same value - even if individual
> > counters overflow/underflow.
> >
> > (It's the same as doing a bunch of increments and decrements to a single
> > integer, but in random order. 0 - 1 - 1 + 1 + 1 + 1 will always equal 1,
> > no matter where you stick parentheses).
> >
> > That's also why I use an unsigned integer for the per cpu counters...
> > since they're going to wrap and signed integer overflow is undefined.
> > Not that I really expect gcc to figure out a way to screw me for that,
> > but I tend to be anal about such things.
>
> Hmmm... okay, but that only works if the summing up is also done in
> 32bits, right? Is that why the global counter is kept at 32bits?
Global counter's actually kept at 50 bits - arbitrarily, because the top
14 bits are used for the get counter.
It would definitely be cleaner if the global counter was also 32 bits (I
probably should've done it that way at first) but it works with it being
bigger _provided the sum of the percpu counters is sign extended_ when
collecting them up at the end.
With the bias though we kind of want the global counter to be bigger...
the bias needs to fit in the global counter.
> > > I really don't get this. Why use 1<<32 for bias which isn't too
> > > difficult to overflow especially if you have many cpu threads and some
> > > systematic drift in percpu refs? What's the point of not using the
> > > higher bits? Is it to encode count usage statistics into the same
> > > counter? If so, just add another field. 24bytes is fine.
> >
> > The systematic drift doesn't affect overflow here - the bias is only
> > applied to the atomic counter, which can only be off by at most whatever
> > the per cpu counters happen to sum to (and remember it's modular
> > arithmetic, so that's INT_MIN/INT_MAX).
>
> Yeah if you don't have to worry about flushing percpu counters to
> prevent over/underflows, this shouldn't matter. It would be nice to
> document it tho.
Good point, I'll add the explanation to the comments.
> > Since I'm using 32 bit ints for the per cpu counters, 1 << 32 works just
> > fine. If we want to expand the per cpu counters to longs or 64 bit ints,
> > then yeah I'd use a larger bias.
> >
> > It's nitpicky but using a larger bias feels like overly defensive
> > programming to me (if you know how big the bias has to be, great, use
> > that - if you aren't sure, wtf are you doing :P)
>
> I don't know. I think it's a correctness issue. If you
> over/underflow into global counter, you need buffer space to be
> (practically) correct.
Not quite sure what you mean...
> If modular calculation makes it unnecessary to
> worry about over/underflows, we can't use more than 32bits for global
> sums, right?
It doesn't really make sense to use more than 32 bits for the global sum
(ignoring the bias value... that complicates it) but it does work, with
the sign extending. I haven't figured out a good way of explaining why
though... I should probably look at what happens without it again. (I
missed the sign extending when I first coded it, had to debug that).
> I'm still a bit hazy but it looks like using all 32bit
> counters should work w/o worrying about over/underflows and that's
> really nice.
I know my math :D
> > > I'd really like to see just basic percpu refcnt with async and sync
> > > interfaces w/o dynamic allocation. At this point, we aren't really
> > > sure if there are gonna be enough benefits or use cases from dynamic
> > > allocation, so I don't really see the point in the added complexity.
> > > Let's start with basic stuff and add on dynamic alloc if it actually
> > > is necessary.
> >
> > Well, with the alloc rework the dynamic allocation is completely hidden
> > from the users - it's just get and put. Killing dynamic allocation would
> > just mean that percpu_ref_init() could fail, as far as the api is
> > concerned.
>
> There really is no point in adding complexity which isn't needed. It
> slows down the code unnecessarily (however minute) and adds
> maintenance overhead. Why do that when there's no need?
Well, like I said I want this to be a drop-in replacement for atomic_t
anyone can use without having to think about the potential overhead.
If it really does turn out there aren't enough uses for it to be worth
it I'll be ok with ripping out the dynamic alloc... but the slowdown is
pretty damn minute (one instruction in the fast path) and the additional
complexity is fairly minor and quite self contained so I'm just not
seeing the urgency.
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-29 21:45 ` Kent Overstreet
@ 2013-01-29 22:06 ` Tejun Heo
0 siblings, 0 replies; 27+ messages in thread
From: Tejun Heo @ 2013-01-29 22:06 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel
Hey, Kent.
On Tue, Jan 29, 2013 at 01:45:37PM -0800, Kent Overstreet wrote:
> It would definitely be cleaner if the global counter was also 32 bits (I
> probably should've done it that way at first) but it works with it being
> bigger _provided the sum of the percpu counters is sign extended_ when
> collecting them up at the end.
>
> With the bias though we kind of want the global counter to be bigger...
> the bias needs to fit in the global counter.
Yeah, the size being 64bit is fine as long as the summation and the
result is done in 32bits.
> > I don't know. I think it's a correctness issue. If you
> > over/underflow into global counter, you need buffer space to be
> > (practically) correct.
>
> Not quite sure what you mean...
If you compute w/o sign extension, it becomes buggy.
> It doesn't really make sense to use more than 32 bits for the global sum
> (ignoring the bias value... that complicates it) but it does work, with
> the sign extending. I haven't figured out a good way of explaining why
> though... I should probably look at what happens without it again. (I
> missed the sign extending when I first coded it, had to debug that).
I think it would be easier to understand if we use only the lower
32bits for the actual counting.
> > There really is no point in adding complexity which isn't needed. It
> > slows down the code unnecessarily (however minute) and adds
> > maintenance overhead. Why do that when there's no need?
>
> Well, like I said I want this to be a drop in replacement for atomic_t
> anyone can use without having to think about the potential overhead.
>
> If it really does turn out there aren't enough uses for it to be worth
> it I'll be ok with ripping out the dynamic alloc... but the slowdown is
> pretty damn minute (one instruction in the fast path) and the additional
> complexity is fairly minor and quite self contained so I'm just not
> seeing the urgency.
I really don't think this can be drop-in replacement given the
significant overhead on base ref put and you build up complexity with
use cases not the other way around. When you're looking at each
specific implementation, it might not look like much but it really
sucks to have a lot of unneeded / unused complexities around. It
raises maintenance overhead for no good reason. Like you said, it's
not like extending the design is difficult if we find out that the
dynamic allocation is actually useful, and there just is no reason to
do it the other way around.
Thanks.
--
tejun
end of thread, other threads:[~2013-01-29 22:06 UTC | newest]
Thread overview: 27+ messages
[not found] <20130124232024.GA584@google.com>
2013-01-25 18:09 ` [PATCH] generic dynamic per cpu refcounting Oleg Nesterov
2013-01-25 18:29 ` Oleg Nesterov
2013-01-28 18:10 ` Kent Overstreet
2013-01-28 18:50 ` Oleg Nesterov
2013-01-25 19:11 ` Oleg Nesterov
2013-01-28 18:15 ` Kent Overstreet
2013-01-28 18:27 ` Tejun Heo
2013-01-28 18:49 ` Kent Overstreet
2013-01-28 18:55 ` Tejun Heo
2013-01-28 20:22 ` Kent Overstreet
2013-01-28 20:27 ` Tejun Heo
2013-01-28 20:55 ` Kent Overstreet
2013-01-28 21:18 ` Tejun Heo
2013-01-28 21:24 ` Kent Overstreet
2013-01-28 21:28 ` Tejun Heo
2013-01-28 21:36 ` Tejun Heo
2013-01-28 21:48 ` Kent Overstreet
2013-01-28 21:45 ` Kent Overstreet
2013-01-28 21:50 ` Tejun Heo
2013-01-29 16:39 ` Kent Overstreet
2013-01-29 19:29 ` Tejun Heo
2013-01-29 19:51 ` Kent Overstreet
2013-01-29 20:02 ` Tejun Heo
2013-01-29 21:45 ` Kent Overstreet
2013-01-29 22:06 ` Tejun Heo
2013-01-29 18:04 ` [PATCH] module: Convert to generic percpu refcounts Kent Overstreet
2013-01-28 18:07 ` [PATCH] generic dynamic per cpu refcounting Kent Overstreet