* Re: [PATCH] generic dynamic per cpu refcounting
From: Oleg Nesterov @ 2013-01-25 18:09 UTC
To: Kent Overstreet; +Cc: tj, srivatsa.bhat, rusty, linux-kernel

(add lkml)

On 01/24, Kent Overstreet wrote:
>
> This has already been on lkml and is in Andrew's tree, Tejun just asked
> me to send it out again:

I'll try to read this code later, just a couple of questions after a
quick glance. Sorry if this was already discussed...

> +struct percpu_ref {
> +        atomic64_t count;
> +        unsigned long pcpu_count;
> +};

The code looks a bit tricky, mostly because you pack state/pointer/jiffies
into ->pcpu_count. The same for ->count.

I assume that you have a good reason to shrink the sizeof(percpu_ref),
but I am curious: who is the user of this thing?

> + * percpu_ref_get - increment a dynamic percpu refcount
> + *
> + * Increments @ref and possibly converts it to percpu counters. Must be called
> + * with rcu_read_lock() held, and may potentially drop/reacquire rcu_read_lock()
> + * to allocate percpu counters - if sleeping/allocation isn't safe for some
> + * other reason (e.g. a spinlock), see percpu_ref_get_noalloc().

And this looks strange. It must be called under rcu_read_lock(), but
->rcu_read_lock_nesting must be == 1. Otherwise rcu_read_unlock() in
percpu_ref_alloc() won't work.

Again, I think you have a reason, but could you explain? IOW, why can't
we make it might_sleep() instead? The fast path can do rcu_read_lock()
itself.

> +static inline void percpu_ref_get_noalloc(struct percpu_ref *ref)
> +{
> +        __percpu_ref_get(ref, false);
> +}

and this could be percpu_ref_get_atomic(). Once again, I am not arguing,
just can't understand.

> +void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
> +{
> +        unsigned long pcpu_count;
> +        uint64_t v;
> +
> +        pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> +        if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
> +                /* for rcu - we're not using rcu_dereference() */
> +                smp_read_barrier_depends();
> +                __this_cpu_inc(*((unsigned __percpu *) pcpu_count));

The comment looks a bit confusing... smp_read_barrier_depends() is not
for rcu; we obviously need it to access (unsigned __percpu *) pcpu_count.
But yes, since we didn't use rcu_dereference() we have to add it by hand.

> +int percpu_ref_kill(struct percpu_ref *ref)
> +{
>         ...
> +        if (status == PCPU_REF_PTR) {
> +                unsigned count = 0, cpu;
> +
> +                synchronize_rcu();
> +
> +                for_each_possible_cpu(cpu)
> +                        count += *per_cpu_ptr((unsigned __percpu *) pcpu_count, cpu);
> +
> +                pr_debug("global %lli pcpu %i",
> +                         atomic64_read(&ref->count) & PCPU_COUNT_MASK,
> +                         (int) count);
> +
> +                atomic64_add((int) count, &ref->count);
> +                smp_wmb();
> +                /* Between setting global count and setting PCPU_REF_DEAD */
> +                ref->pcpu_count = PCPU_REF_DEAD;

The comment explains what the code does, but not why ;)

I guess this is for percpu_ref_put(), and this wmb() pairs with the
implicit mb() implied by atomic64_dec_return().

> +                free_percpu((unsigned __percpu *) pcpu_count);

I guess it could be freed right after for_each_possible_cpu() above, but
this doesn't matter.

Oleg.
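The structure under discussion - percpu counters on the get fast path, with a shared atomic64_t once the ref is killed - can be modeled in userspace along these lines. This is an illustrative sketch only: C11 atomics stand in for the kernel primitives, a plain array stands in for alloc_percpu(), a bool stands in for the PCPU_REF_DEAD state packed into ->pcpu_count, there is no RCU so it is single-threaded-safe only, and all names are made up:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPU 4   /* stand-in for nr_cpu_ids */

struct pref {
        atomic_int_fast64_t count;      /* the shared atomic64_t part */
        int_fast64_t pcpu[NCPU];        /* the "percpu" counters */
        bool dead;                      /* stand-in for PCPU_REF_DEAD */
};

static void pref_init(struct pref *r)
{
        atomic_init(&r->count, 1);      /* the initial (base) reference */
        for (int i = 0; i < NCPU; i++)
                r->pcpu[i] = 0;
        r->dead = false;
}

/* Fast path: bump the local counter while in percpu mode (models
 * __this_cpu_inc()); fall back to the shared atomic once killed. */
static void pref_get(struct pref *r, int cpu)
{
        if (!r->dead)
                r->pcpu[cpu]++;
        else
                atomic_fetch_add(&r->count, 1);
}

/* Kill: fold the percpu counts into the atomic, then switch modes.
 * The real code needs synchronize_rcu() before the fold and barriers
 * around the mode switch; a single-threaded model cannot show that. */
static void pref_kill(struct pref *r)
{
        int_fast64_t sum = 0;
        for (int cpu = 0; cpu < NCPU; cpu++)
                sum += r->pcpu[cpu];
        atomic_fetch_add(&r->count, sum);
        r->dead = true;
}
```

The actual patch additionally packs the mode bits and the percpu pointer into one word, which is what the questions above are about.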
* Re: [PATCH] generic dynamic per cpu refcounting
From: Oleg Nesterov @ 2013-01-25 18:29 UTC
To: Kent Overstreet; +Cc: tj, srivatsa.bhat, rusty, linux-kernel

On 01/25, Oleg Nesterov wrote:
>
> > +void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
> > +{
> > +        unsigned long pcpu_count;
> > +        uint64_t v;
> > +
> > +        pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > +
> > +        if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
> > +                /* for rcu - we're not using rcu_dereference() */
> > +                smp_read_barrier_depends();
> > +                __this_cpu_inc(*((unsigned __percpu *) pcpu_count));
>
> The comment looks a bit confusing... smp_read_barrier_depends() is not
> for rcu; we obviously need it to access (unsigned __percpu *) pcpu_count.
> But yes, since we didn't use rcu_dereference() we have to add it by hand.

Hmm. Otoh, arch/alpha uses asm-generic/percpu.h, so in theory we need
smp_read_barrier_depends() after __this_cpu_generic_to_op() calculates
the "real" pointer, __this_cpu_ptr()?

Just curious...

Oleg.
* Re: [PATCH] generic dynamic per cpu refcounting
From: Kent Overstreet @ 2013-01-28 18:10 UTC
To: Oleg Nesterov; +Cc: tj, srivatsa.bhat, rusty, linux-kernel

On Fri, Jan 25, 2013 at 07:29:24PM +0100, Oleg Nesterov wrote:
> On 01/25, Oleg Nesterov wrote:
> >
> > > +void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
> > > [...]
> > > +        if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
> > > +                /* for rcu - we're not using rcu_dereference() */
> > > +                smp_read_barrier_depends();
> > > +                __this_cpu_inc(*((unsigned __percpu *) pcpu_count));
> >
> > The comment looks a bit confusing... smp_read_barrier_depends() is not
> > for rcu; we obviously need it to access (unsigned __percpu *) pcpu_count.
> > But yes, since we didn't use rcu_dereference() we have to add it by hand.
>
> Hmm. Otoh, arch/alpha uses asm-generic/percpu.h, so in theory we need
> smp_read_barrier_depends() after __this_cpu_generic_to_op() calculates
> the "real" pointer, __this_cpu_ptr()?
>
> Just curious...

Don't think I follow, but I don't see how... that barrier just needs to
be between the ACCESS_ONCE and touching the memory pointed to; the fact
that there's a calculation in there to get the real pointer is
irrelevant...
* Re: [PATCH] generic dynamic per cpu refcounting
From: Oleg Nesterov @ 2013-01-28 18:50 UTC
To: Kent Overstreet; +Cc: tj, srivatsa.bhat, rusty, linux-kernel

On 01/28, Kent Overstreet wrote:
>
> On Fri, Jan 25, 2013 at 07:29:24PM +0100, Oleg Nesterov wrote:
> > [...]
> > Hmm. Otoh, arch/alpha uses asm-generic/percpu.h, so in theory we need
> > smp_read_barrier_depends() after __this_cpu_generic_to_op() calculates
> > the "real" pointer, __this_cpu_ptr()?
> >
> > Just curious...
>
> Don't think I follow, but I don't see how... that barrier just needs to
> be between the ACCESS_ONCE and touching the memory pointed to; the fact
> that there's a calculation in there to get the real pointer is
> irrelevant...

Yes, I was wrong. I forgot that pcp is already the pointer even in the
_generic_ case; we only add the offset. Thanks for correcting me.

Oleg.
* Re: [PATCH] generic dynamic per cpu refcounting
From: Oleg Nesterov @ 2013-01-25 19:11 UTC
To: Kent Overstreet; +Cc: tj, srivatsa.bhat, rusty, linux-kernel

On 01/25, Oleg Nesterov wrote:
>
> > +int percpu_ref_kill(struct percpu_ref *ref)
> > +{
> >         ...
> > +        if (status == PCPU_REF_PTR) {
> > +                unsigned count = 0, cpu;
> > +
> > +                synchronize_rcu();
> > +
> > +                for_each_possible_cpu(cpu)
> > +                        count += *per_cpu_ptr((unsigned __percpu *) pcpu_count, cpu);
> > +
> > +                pr_debug("global %lli pcpu %i",
> > +                         atomic64_read(&ref->count) & PCPU_COUNT_MASK,
> > +                         (int) count);
> > +
> > +                atomic64_add((int) count, &ref->count);
> > +                smp_wmb();
> > +                /* Between setting global count and setting PCPU_REF_DEAD */
> > +                ref->pcpu_count = PCPU_REF_DEAD;
>
> The comment explains what the code does, but not why ;)
>
> I guess this is for percpu_ref_put(), and this wmb() pairs with the
> implicit mb() implied by atomic64_dec_return().

Hmm. Most probably I missed something, but it seems we need another
synchronize_rcu() _after_ we set PCPU_REF_DEAD.

To simplify, suppose that percpu_ref_put() is never called directly but
we have

        void put_and_dsetroy(...)
        {
                if (percpu_ref_put(...))
                        destroy(...);
        }

Suppose that ref->count == 2 after atomic64_add() above. IOW, we have
a "master" reference for _kill() and someone else did _get.

So the caller does

        percpu_ref_kill();
        put_and_dsetroy();

And this can race with another holder which drops the last reference;
its put_and_dsetroy() can see PCPU_REF_DYING and return false.

Or I misunderstood the code/interface?

Oleg.
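The constraint behind the race described above can be sketched in userspace: a put may only test for zero once the DEAD state is visible, so the kill path must publish DEAD (with a grace period in the real code) before dropping its base reference. A single-threaded illustrative model with C11 atomics; the names are invented, and the grace periods that make this safe concurrently appear only as comments:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* States as in the patch under discussion; everything else is a
 * made-up userspace model. */
enum dstate { DPCPU, DDYING, DDEAD };

struct dref {
        atomic_int count;
        _Atomic enum dstate state;
};

static bool destroyed;

/* A put may only test for zero once DEAD is visible: while the state
 * is still DYING, the kill path may not have folded the percpu counts
 * into ->count yet, so count == 0 would be a false "last reference". */
static bool dref_put(struct dref *r)
{
        int v = atomic_fetch_sub(&r->count, 1) - 1;

        return atomic_load(&r->state) == DDEAD && v == 0;
}

static void put_and_destroy(struct dref *r)
{
        if (dref_put(r))
                destroyed = true;
}
```

If the base reference were dropped while the state could still be observed as DYING, the last put would return false and nobody would destroy the object - which is exactly the missing-synchronize_rcu() bug being discussed.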
* Re: [PATCH] generic dynamic per cpu refcounting
From: Kent Overstreet @ 2013-01-28 18:15 UTC
To: Oleg Nesterov; +Cc: tj, srivatsa.bhat, rusty, linux-kernel

On Fri, Jan 25, 2013 at 08:11:39PM +0100, Oleg Nesterov wrote:
> On 01/25, Oleg Nesterov wrote:
> >
> > > +int percpu_ref_kill(struct percpu_ref *ref)
> > > [...]
> > > +                atomic64_add((int) count, &ref->count);
> > > +                smp_wmb();
> > > +                /* Between setting global count and setting PCPU_REF_DEAD */
> > > +                ref->pcpu_count = PCPU_REF_DEAD;
> >
> > The comment explains what the code does, but not why ;)
> >
> > I guess this is for percpu_ref_put(), and this wmb() pairs with the
> > implicit mb() implied by atomic64_dec_return().
>
> Hmm. Most probably I missed something, but it seems we need another
> synchronize_rcu() _after_ we set PCPU_REF_DEAD.

Yeah, correct - documentation bug. I originally had the
synchronize_rcu() there, but this is called by exit_aio() ->
kill_ioctx() when we're killing a process, and Ben LaHaise pointed out
that was less than ideal if a process had a bunch of ioctxs - so I left
the second one out there so the caller would have the option of using
call_rcu() instead.

> To simplify, suppose that percpu_ref_put() is never called directly but
> we have
>
>         void put_and_dsetroy(...)
>         {
>                 if (percpu_ref_put(...))
>                         destroy(...);
>         }
>
> Suppose that ref->count == 2 after atomic64_add() above. IOW, we have
> a "master" reference for _kill() and someone else did _get.
>
> So the caller does
>
>         percpu_ref_kill();
>         put_and_dsetroy();
>
> And this can race with another holder which drops the last reference;
> its put_and_dsetroy() can see PCPU_REF_DYING and return false.
>
> Or I misunderstood the code/interface?

Nope, nailed it :) That should _definitely_ be in the documentation.

Actually - I think it'd be better to have the default percpu_ref_kill()
do the second synchronize_rcu(), and have an unsafe version that skips
it.
* Re: [PATCH] generic dynamic per cpu refcounting
From: Tejun Heo @ 2013-01-28 18:27 UTC
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

Hello, guys.

On Mon, Jan 28, 2013 at 10:15:28AM -0800, Kent Overstreet wrote:
> >         percpu_ref_kill();
> >         put_and_dsetroy();
> >
> > And this can race with another holder which drops the last reference;
> > its put_and_dsetroy() can see PCPU_REF_DYING and return false.
> >
> > Or I misunderstood the code/interface?
>
> Nope, nailed it :) That should _definitely_ be in the documentation.

Can we just combine kill initiation and base ref put and make that the
responsibility of the owner? Extra features on basic constructs may
seem good for certain use cases but tend to bring more confusion than
good in the long run. If a user needs to synchronize among multiple
killers, let the user deal with the issue.

> Actually - I think it'd be better to have the default percpu_ref_kill()
> do the second synchronize_rcu(), and have an unsafe version that skips
> it.

Note that synchronize_rcu/sched() can be very slow and cause problems
in paths which are frequently traveled and visible to userland. It's
fine for things like module destruction but can be a problem even
during device destruction - blkcg had synchronize_rcu() in
request_queue destruction, which led to huge latencies during boot
because SCSI wants to create and then destroy request_queues for all
possible LUNs on certain configurations. So, if you put
synchronize_rcu/sched() in percpu_ref_kill(), that better not be used
from e.g. close(2).

Thanks.

-- 
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
From: Kent Overstreet @ 2013-01-28 18:49 UTC
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

On Mon, Jan 28, 2013 at 10:27:37AM -0800, Tejun Heo wrote:
> Hello, guys.
>
> On Mon, Jan 28, 2013 at 10:15:28AM -0800, Kent Overstreet wrote:
> > >         percpu_ref_kill();
> > >         put_and_dsetroy();
> > > [...]
>
> Can we just combine kill initiation and base ref put and make that the
> responsibility of the owner? Extra features on basic constructs may
> seem good for certain use cases but tend to bring more confusion than
> good in the long run. If a user needs to synchronize among multiple
> killers, let the user deal with the issue.

Don't follow...

Something I forgot to mention in the last mail, though, is that often
the caller will need its own synchronize_rcu()/call_rcu() -
percpu_ref_kill() corresponds to when you make the object unavailable
(i.e. deleting it from the rcu protected hash table in aio), and you
need a synchronize_rcu() before you drop your initial ref. So letting
the caller do it means the caller can merge the two synchronize_rcu()s.

> > Actually - I think it'd be better to have the default percpu_ref_kill()
> > do the second synchronize_rcu(), and have an unsafe version that skips
> > it.
>
> Note that synchronize_rcu/sched() can be very slow and cause problems
> in paths which are frequently traveled and visible to userland. It's
> fine for things like module destruction but can be a problem even
> during device destruction - blkcg had synchronize_rcu() in
> request_queue destruction, which led to huge latencies during boot
> because SCSI wants to create and then destroy request_queues for all
> possible LUNs on certain configurations. So, if you put
> synchronize_rcu/sched() in percpu_ref_kill(), that better not be used
> from e.g. close(2).

Yeah. It'd be really nice if it was doable without synchronize_rcu(),
but it'd definitely make get/put heavier.

Though, re. close() - considering we only need a synchronize_rcu() if
the ref was in percpu mode, I wonder if that would be a dealbreaker. I
have no clue myself.

Getting rid of synchronize_rcu() would basically require turning get and
put into cmpxchg() loops - even in the percpu fastpath. However, percpu
mode would still be getting rid of the shared cacheline contention; we'd
just be adding another branch that can be safely marked unlikely() - and
my current version has one of those already, so two branches instead of
one in the fast path.

I suppose I should give it a shot. As long as I'm going down that route
I could probably make the bare non-percpu ref 8 bytes instead of 16,
too...
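The cmpxchg-loop alternative mentioned above could look roughly like this in userspace: each get re-checks the mode and retries instead of doing a blind __this_cpu_inc(). This is only a sketch of the shape - a real version would have to pack the "percpu off" state into the same word the cmpxchg operates on, so a mode switch and an in-flight increment cannot interleave. C11 atomics, invented names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NCPU 4

struct cref {
        _Atomic bool dead;      /* percpu mode turned off? */
        atomic_int pcpu[NCPU];
        atomic_int count;
};

static void cref_get(struct cref *r, int cpu)
{
        for (;;) {
                if (atomic_load(&r->dead)) {
                        /* Percpu mode is off: use the shared atomic. */
                        atomic_fetch_add(&r->count, 1);
                        return;
                }
                int old = atomic_load(&r->pcpu[cpu]);
                /* cmpxchg loop: retry if the counter changed under us
                 * (stand-in for a cmpxchg that would also notice a
                 * mode change encoded in the same word). */
                if (atomic_compare_exchange_weak(&r->pcpu[cpu], &old, old + 1))
                        return;
        }
}
```

The extra cost is exactly what the mail above describes: the fast path gains a retry loop and one more (unlikely) branch, in exchange for not needing synchronize_rcu() on the mode switch.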
* Re: [PATCH] generic dynamic per cpu refcounting
From: Tejun Heo @ 2013-01-28 18:55 UTC
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

Hey, Kent.

On Mon, Jan 28, 2013 at 10:49:33AM -0800, Kent Overstreet wrote:
> Yeah. It'd be really nice if it was doable without synchronize_rcu(),
> but it'd definitely make get/put heavier.
>
> Though, re. close() - considering we only need a synchronize_rcu() if
> the ref was in percpu mode, I wonder if that would be a dealbreaker. I
> have no clue myself.

The problem is that the performance drop (or latency increase) in
pathological cases would be catastrophic. We're talking about possibly
quite a few millisecs of delay between each close(). When done
sequentially for a large number of files, it gets ugly. It becomes a
dangerous optimization to make.

> Getting rid of synchronize_rcu() would basically require turning get and
> put into cmpxchg() loops - even in the percpu fastpath. However, percpu
> mode would still be getting rid of the shared cacheline contention; we'd
> just be adding another branch that can be safely marked unlikely() - and
> my current version has one of those already, so two branches instead of
> one in the fast path.

Or offer an asynchronous interface so that high-frequency users don't
end up inserting synchronize_sched() between each call. It makes the
interface more complex and further away from a simple atomic_t
replacement tho.

Thanks.

-- 
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
From: Kent Overstreet @ 2013-01-28 20:22 UTC
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

On Mon, Jan 28, 2013 at 10:55:52AM -0800, Tejun Heo wrote:
> Hey, Kent.
>
> On Mon, Jan 28, 2013 at 10:49:33AM -0800, Kent Overstreet wrote:
> > Yeah. It'd be really nice if it was doable without synchronize_rcu(),
> > but it'd definitely make get/put heavier.
> >
> > Though, re. close() - considering we only need a synchronize_rcu() if
> > the ref was in percpu mode, I wonder if that would be a dealbreaker. I
> > have no clue myself.
>
> The problem is that the performance drop (or latency increase) in
> pathological cases would be catastrophic. We're talking about possibly
> quite a few millisecs of delay between each close(). When done
> sequentially for a large number of files, it gets ugly. It becomes a
> dangerous optimization to make.

Yeah, I tend to agree.

> > Getting rid of synchronize_rcu() would basically require turning get and
> > put into cmpxchg() loops - even in the percpu fastpath. [...]
>
> Or offer an asynchronous interface so that high-frequency users don't
> end up inserting synchronize_sched() between each call. It makes the
> interface more complex and further away from a simple atomic_t
> replacement tho.

Could do that too, but then teardown gets really messy for the user - we
need two synchronize_rcu()s:

        state := dying

        synchronize_rcu()

        /* Now nothing's changing the per cpu counters */

        Add per cpu counters to atomic counter

        /* Atomic counter is now consistent */

        state := dead

        synchronize_rcu()

        /* Now percpu_ref_put will check for ref == 0 */

        /* Drop initial ref */

        percpu_ref_put()

And note that the first synchronize_rcu() is only needed when we had
allocated per cpu counters; my current code skips it otherwise. (Which
is a reason for keeping dynamic allocation I hadn't thought of - if we
never allocate percpu counters, teardown is faster.)
* Re: [PATCH] generic dynamic per cpu refcounting
From: Tejun Heo @ 2013-01-28 20:27 UTC
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

Hey, Kent.

On Mon, Jan 28, 2013 at 12:22 PM, Kent Overstreet
<koverstreet@google.com> wrote:
> Could do that too, but then teardown gets really messy for the user - we
> need two synchronize_rcu()s:
>
>         state := dying
>
>         synchronize_rcu()
>
>         /* Now nothing's changing the per cpu counters */
>
>         Add per cpu counters to atomic counter
>
>         /* Atomic counter is now consistent */
>
>         state := dead
>
>         synchronize_rcu()

I don't understand why we need two stages. What prevents the killing
thread from fetching percpu counters after dying passes one
synchronize_sched()?

>         /* Now percpu_ref_put will check for ref == 0 */
>
>         /* Drop initial ref */
>
>         percpu_ref_put()
>
> And note that the first synchronize_rcu() is only needed when we had
> allocated per cpu counters; my current code skips it otherwise.

And regardless, at the interface level, can't it just provide
percpu_ref_put_base_ref(release_fn)?

Thanks.

-- 
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
From: Kent Overstreet @ 2013-01-28 20:55 UTC
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

On Mon, Jan 28, 2013 at 12:27:45PM -0800, Tejun Heo wrote:
> Hey, Kent.
>
> On Mon, Jan 28, 2013 at 12:22 PM, Kent Overstreet
> <koverstreet@google.com> wrote:
> > Could do that too, but then teardown gets really messy for the user - we
> > need two synchronize_rcu()s:
> > [...]
>
> I don't understand why we need two stages. What prevents the killing
> thread from fetching percpu counters after dying passes one
> synchronize_sched()?

It does. The second synchronize_sched() is needed after we set state :=
dead, and before we drop the initial ref. Otherwise the ref could hit 0
before percpu_ref_put knows to check for it.

> And regardless, at the interface level, can't it just provide
> percpu_ref_put_base_ref(release_fn)?

Yeah, can definitely provide one that wraps it all together.
* Re: [PATCH] generic dynamic per cpu refcounting
From: Tejun Heo @ 2013-01-28 21:18 UTC
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

Hello, Kent.

On Mon, Jan 28, 2013 at 12:55:40PM -0800, Kent Overstreet wrote:
> > I don't understand why we need two stages. What prevents the killing
> > thread from fetching percpu counters after dying passes one
> > synchronize_sched()?
>
> It does. The second synchronize_sched() is needed after we set state :=
> dead, and before we drop the initial ref. Otherwise the ref could hit 0
> before percpu_ref_put knows to check for it.

Still a bit confused. Why do we need to make the two steps separate?
What prevents us from doing the following?

        set dying;
        synchronize_sched();
        collect percpu refs into global atomic_t;
        put the base ref;

Thanks.

-- 
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
From: Kent Overstreet @ 2013-01-28 21:24 UTC
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

On Mon, Jan 28, 2013 at 01:18:32PM -0800, Tejun Heo wrote:
> Hello, Kent.
>
> On Mon, Jan 28, 2013 at 12:55:40PM -0800, Kent Overstreet wrote:
> > > I don't understand why we need two stages. What prevents the killing
> > > thread from fetching percpu counters after dying passes one
> > > synchronize_sched()?
> >
> > It does. The second synchronize_sched() is needed after we set state :=
> > dead, and before we drop the initial ref. Otherwise the ref could hit 0
> > before percpu_ref_put knows to check for it.
>
> Still a bit confused. Why do we need to make the two steps separate?
> What prevents us from doing the following?
>
>         set dying;
>         synchronize_sched();
>         collect percpu refs into global atomic_t;
>         put the base ref;

After you set state := dying, percpu_ref_put() decrements the atomic_t,
but it can't check if it's 0 yet because the thread that's collecting
the percpu refs might not be done yet.

So percpu_ref_put can't check for ref == 0 until after state == dead.
But the put in your example might have made ref 0. When did you set
state to dead?
* Re: [PATCH] generic dynamic per cpu refcounting
From: Tejun Heo @ 2013-01-28 21:28 UTC
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

On Mon, Jan 28, 2013 at 01:24:07PM -0800, Kent Overstreet wrote:
> >         set dying;
> >         synchronize_sched();
> >         collect percpu refs into global atomic_t;
> >         put the base ref;
>
> After you set state := dying, percpu_ref_put() decrements the atomic_t,
> but it can't check if it's 0 yet because the thread that's collecting
> the percpu refs might not be done yet.
>
> So percpu_ref_put can't check for ref == 0 until after state == dead.
> But the put in your example might have made ref 0. When did you set
> state to dead?

But at that point, the operation is already global, so there gotta be
a lighter way to synchronize stuff than going through a full grace
period. ie. You can add a bias value before marking dead so that the
counter never reaches zero before all percpu counters are collected,
and then unbias it right before putting the base ref; that way the
only way you can hit zero ref is if all refs are actually zero.

Thanks.

-- 
tejun
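The bias idea above can be modeled concretely: start the counter far from zero so every put can blindly dec-and-test even during the mode switch, and remove the bias only after the percpu counts have been folded in. A userspace sketch with C11 atomics; BIAS, the helper names, and the struct layout are all illustrative:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define BIAS ((int64_t)1 << 32)   /* illustrative; far from zero */
#define NCPU 4

static int released;

struct bref {
        atomic_int_fast64_t count;
        int_fast64_t pcpu[NCPU];  /* stand-in for the percpu counters */
};

static void bref_init(struct bref *r)
{
        /* Start biased: the count cannot reach zero until the bias is
         * removed, so every put can blindly dec-and-test. */
        atomic_init(&r->count, BIAS + 1);   /* bias + the base ref */
        for (int i = 0; i < NCPU; i++)
                r->pcpu[i] = 0;
}

static void bref_put(struct bref *r)
{
        if (atomic_fetch_sub(&r->count, 1) - 1 == 0)
                released++;                 /* the true last reference */
}

/* Kill: fold the percpu counts in, unbias, drop the base ref.  No
 * dying/dead distinction is needed; unbiasing is what re-arms the zero
 * test.  (The grace period that turns off the percpu fast path is
 * omitted from this single-threaded model.) */
static void bref_kill(struct bref *r)
{
        int_fast64_t sum = 0;
        for (int cpu = 0; cpu < NCPU; cpu++)
                sum += r->pcpu[cpu];
        atomic_fetch_add(&r->count, sum);
        atomic_fetch_sub(&r->count, BIAS);
        bref_put(r);
}
```

With the bias in place there is no window where a racing put can observe a false zero, which is why the second grace period can go away.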
* Re: [PATCH] generic dynamic per cpu refcounting
From: Tejun Heo @ 2013-01-28 21:36 UTC
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

On Mon, Jan 28, 2013 at 01:28:14PM -0800, Tejun Heo wrote:
> But at that point, the operation is already global, so there gotta be
> a lighter way to synchronize stuff than going through a full grace
> period. ie. You can add a bias value before marking dead so that the
> counter never reaches zero before all percpu counters are collected,
> and then unbias it right before putting the base ref; that way the
> only way you can hit zero ref is if all refs are actually zero.

Note that I'm saying that there's no need to distinguish between dying
and dead. The only thing the percpu part should care about is whether
percpu is on or off. We need a full grace period to turn off percpu
operations of any type, but that should be the only case where a full
grace period is necessary. The rest should be synchronizable via the
usual global synchronization. We probably can just test the percpu
variable against NULL and forget about the encoded state numbers.

Thanks.

-- 
tejun
* Re: [PATCH] generic dynamic per cpu refcounting
From: Kent Overstreet @ 2013-01-28 21:48 UTC
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

On Mon, Jan 28, 2013 at 01:36:22PM -0800, Tejun Heo wrote:
> On Mon, Jan 28, 2013 at 01:28:14PM -0800, Tejun Heo wrote:
> > But at that point, the operation is already global, so there gotta be
> > a lighter way to synchronize stuff than going through a full grace
> > period. ie. You can add a bias value before marking dead so that the
> > counter never reaches zero before all percpu counters are collected,
> > and then unbias it right before putting the base ref; that way the
> > only way you can hit zero ref is if all refs are actually zero.
>
> Note that I'm saying that there's no need to distinguish between dying
> and dead.

Yeah, I got that; I _think_ that makes me like the idea.

> The only thing the percpu part should care about is whether
> percpu is on or off. We need a full grace period to turn off percpu
> operations of any type, but that should be the only case where a full
> grace period is necessary. The rest should be synchronizable via the
> usual global synchronization. We probably can just test the percpu
> variable against NULL and forget about the encoded state numbers.

Can't quite do that because initially we use that variable for a
timestamp, and when we're shutting down we don't want percpu_ref_get()
allocating percpu counters again.

You'll probably use that as an argument against dynamic allocation :P
* Re: [PATCH] generic dynamic per cpu refcounting
From: Kent Overstreet @ 2013-01-28 21:45 UTC
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

On Mon, Jan 28, 2013 at 01:28:14PM -0800, Tejun Heo wrote:
> On Mon, Jan 28, 2013 at 01:24:07PM -0800, Kent Overstreet wrote:
> > >         set dying;
> > >         synchronize_sched();
> > >         collect percpu refs into global atomic_t;
> > >         put the base ref;
> >
> > After you set state := dying, percpu_ref_put() decrements the atomic_t,
> > but it can't check if it's 0 yet because the thread that's collecting
> > the percpu refs might not be done yet.
> >
> > So percpu_ref_put can't check for ref == 0 until after state == dead.
> > But the put in your example might have made ref 0. When did you set
> > state to dead?
>
> But at that point, the operation is already global, so there gotta be
> a lighter way to synchronize stuff than going through a full grace
> period. ie. You can add a bias value before marking dead so that the
> counter never reaches zero before all percpu counters are collected,
> and then unbias it right before putting the base ref; that way the
> only way you can hit zero ref is if all refs are actually zero.

Ahh. Bias value sounds... hacky (i.e. harder to convince myself it's
correct) but I see what you're getting at.

Something to consider is wrapping; after we set state to dying but
before we've collected the percpu counters, the atomic counter may be
negative. But since the atomic counter is 64 bits, we can use 1 << 32
for the bias value (and just include that when we first initialize it).
Which makes me feel like it's less of a hack too.

I'll have to think about it some more, but it seems like it ought to
work...
* Re: [PATCH] generic dynamic per cpu refcounting 2013-01-28 21:45 ` Kent Overstreet @ 2013-01-28 21:50 ` Tejun Heo 2013-01-29 16:39 ` Kent Overstreet 2013-01-29 18:04 ` [PATCH] module: Convert to generic percpu refcounts Kent Overstreet 0 siblings, 2 replies; 27+ messages in thread From: Tejun Heo @ 2013-01-28 21:50 UTC (permalink / raw) To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel Hello, Kent. On Mon, Jan 28, 2013 at 01:45:06PM -0800, Kent Overstreet wrote: > Ahh. Bias value sounds... hacky (i.e. harder to convince myself it's > correct) but I see what you're getting at. I don't think it's that hacky. Just push the base point to the opposite of zero - LLONG_MAX. The global counter can start biased so that it gets unbiased only once dying is confirmed and everything can blindly do atomic64_dec_and_test() and trust the result. > Something to consider is wrapping; after we set state to dying but > before we've collected the percpu counters, the atomic counter may be > negative. There's a reason why we use 64bit vars for this type of global counting. They virtually don't wrap. Here, you'll need 1<<63 for the counter to unintentionally reach 0, which we consider practically impossible. Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] generic dynamic per cpu refcounting 2013-01-28 21:50 ` Tejun Heo @ 2013-01-29 16:39 ` Kent Overstreet 2013-01-29 19:29 ` Tejun Heo 2013-01-29 18:04 ` [PATCH] module: Convert to generic percpu refcounts Kent Overstreet 1 sibling, 1 reply; 27+ messages in thread From: Kent Overstreet @ 2013-01-29 16:39 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel On Mon, Jan 28, 2013 at 01:50:42PM -0800, Tejun Heo wrote: > Hello, Kent. > > On Mon, Jan 28, 2013 at 01:45:06PM -0800, Kent Overstreet wrote: > > Ahh. Bias value sounds... hacky (i.e. harder to convince myself it's > > correct) but I see what you're getting at. > > I don't think it's that hacky. Just push the base point to the > opposite of zero - LLONG_MAX. The global counter can start biased so > that it gets unbiased only once dying is confirmed and everything can > blindly do atomic64_dec_and_test() and trust the result. Didn't mean objectively hacky, just my first impression having a clear model for how my first version works. I think I've come around - I coded it up last night and it seems to work. > > Something to consider is wrapping; after we set state to dying but > > before we've collected the percpu counters, the atomic counter may be > > negative. > > There's a reason why we use 64bit vars for this type of global > counting. They virtually don't wrap. Here, you'll need 1<<63 for the > counter to unintentionally reach 0, which we consider practically > impossible. It's not the 64 bit counters themselves that mean you don't have to consider underflow (which was what I meant, not wrapping), it's the bias value (though I'm not sure that's the best way to think about it... eh, quibbling with semantics). Also the bias value is just stealing one bit from the range of the counter, and I'm implementing a 32 bit refcount, not 64 (since I'm stealing some of the high bits of the atomic64_t for the get counter... 
though I could still use uint64_t for the percpu counters and implement a 50 bit refcount). Oh, if this is going to be widely used I should probably have a different implementation for archs that don't have atomic64_t in hardware. Don't suppose you know the CONFIG_ macro to test against? I couldn't find it when I was looking recently. Started screwing around with converting the module refcount, too. Looks like the end result will be cleaner and simpler, though there's some tricky bits of the conversion due to the weird semantics of module refcounts. Anyways, here's the new version: diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h new file mode 100644 index 0000000..9e14a09 --- /dev/null +++ b/include/linux/percpu-refcount.h @@ -0,0 +1,153 @@ +/* + * Dynamic percpu refcounts: + * (C) 2012 Google, Inc. + * Author: Kent Overstreet <koverstreet@google.com> + * + * This implements a refcount with similar semantics to atomic_t - atomic_inc(), + * atomic_dec_and_test() - but potentially percpu. + * + * There's one important difference between percpu refs and normal atomic_t + * refcounts; you have to keep track of your initial refcount, and then when you + * start shutting down you call percpu_ref_kill() _before_ dropping the initial + * refcount. + * + * Before you call percpu_ref_kill(), percpu_ref_put() does not check for the + * refcount hitting 0 - it can't, if it was in percpu mode. percpu_ref_kill() + * puts the ref back in single atomic_t mode, collecting the per cpu refs and + * issuing the appropriate barriers, and then marks the ref as shutting down so + * that percpu_ref_put() will check for the ref hitting 0. After it returns, + * it's safe to drop the initial ref. + * + * BACKGROUND: + * + * Percpu refcounts are quite useful for performance, but if we blindly + * converted all refcounts to percpu counters we'd waste quite a bit of memory. + * + * Think about all the refcounts embedded in kobjects, files, etc. 
most of which + * aren't used much. These start out as simple atomic counters - a little bigger + * than a bare atomic_t, 16 bytes instead of 4 - but if we exceed some arbitrary + * number of gets in one second, we then switch to percpu counters. + * + * This heuristic isn't perfect because it'll fire if the refcount was only + * being used on one cpu; ideally we'd be able to count the number of cache + * misses on percpu_ref_get() or something similar, but that'd make the non + * percpu path significantly heavier/more complex. We can count the number of + * gets() without any extra atomic instructions on arches that support + * atomic64_t - simply by changing the atomic_inc() to atomic_add_return(). + * + * USAGE: + * + * See fs/aio.c for some example usage; it's used there for struct kioctx, which + * is created when userspace calls io_setup(), and destroyed when userspace + * calls io_destroy() or the process exits. + * + * In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it + * calls percpu_ref_kill(), then hlist_del_rcu() and synchronize_rcu() to remove + * the kioctx from the process's list of kioctxs - after that, there can't be + * any new users of the kioctx (from lookup_ioctx()) and it's then safe to drop + * the initial ref with percpu_ref_put(). + * + * Code that does a two stage shutdown like this often needs some kind of + * explicit synchronization to ensure the initial refcount can only be dropped + * once - percpu_ref_kill() does this for you, it returns true once and false if + * someone else already called it. The aio code uses it this way, but it's not + * necessary if the code has some other mechanism to synchronize teardown. + * + * As mentioned previously, we decide when to convert a ref to percpu counters + * in percpu_ref_get(). 
However, since percpu_ref_get() will often be called + * with rcu_read_lock() held, it's not done there - percpu_ref_get() returns + * true if the ref should be converted to percpu counters. + * + * The caller should then call percpu_ref_alloc() after dropping + * rcu_read_lock(); if there is an uncommonly used codepath where it's + * inconvenient to call percpu_ref_alloc() after get(), it may be safely skipped + * and percpu_ref_get() will return true again the next time the counter wraps + * around. + */ + +#ifndef _LINUX_PERCPU_REFCOUNT_H +#define _LINUX_PERCPU_REFCOUNT_H + +#include <linux/atomic.h> +#include <linux/kernel.h> +#include <linux/percpu.h> +#include <linux/rcupdate.h> + +struct percpu_ref { + atomic64_t count; + unsigned long pcpu_count; +}; + +void percpu_ref_init(struct percpu_ref *ref); +void __percpu_ref_get(struct percpu_ref *ref, unsigned long pcpu_count); +int __percpu_ref_put(struct percpu_ref *ref, unsigned long pcpu_count); +int percpu_ref_tryget(struct percpu_ref *ref); +int percpu_ref_put_initial_ref(struct percpu_ref *ref); + +#define PCPU_STATUS_BITS 2 +#define PCPU_STATUS_MASK ((1 << PCPU_STATUS_BITS) - 1) + +#define PCPU_REF_PTR 0 +#define PCPU_REF_NONE 1 +#define PCPU_REF_DEAD 2 + +#define REF_STATUS(count) (count & PCPU_STATUS_MASK) + +/** + * percpu_ref_get - increment a dynamic percpu refcount + * + * Analogous to atomic_inc(). 
+ */ +static inline void percpu_ref_get(struct percpu_ref *ref) +{ + unsigned long pcpu_count; + + preempt_disable(); + + pcpu_count = ACCESS_ONCE(ref->pcpu_count); + + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) { + /* for rcu - we're not using rcu_dereference() */ + smp_read_barrier_depends(); + __this_cpu_inc(*((unsigned __percpu *) pcpu_count)); + } else { + __percpu_ref_get(ref, pcpu_count); + } + + preempt_enable(); +} + +/** + * percpu_ref_put - decrement a dynamic percpu refcount + * + * Returns true if the result is 0, otherwise false; only checks for the ref + * hitting 0 after percpu_ref_kill() has been called. Analogous to + * atomic_dec_and_test(). + */ +static inline int percpu_ref_put(struct percpu_ref *ref) +{ + unsigned long pcpu_count; + int ret = 0; + + preempt_disable(); + + pcpu_count = ACCESS_ONCE(ref->pcpu_count); + + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) { + /* for rcu - we're not using rcu_dereference() */ + smp_read_barrier_depends(); + __this_cpu_dec(*((unsigned __percpu *) pcpu_count)); + } else { + ret = __percpu_ref_put(ref, pcpu_count); + } + + preempt_enable(); + + return ret; +} + +unsigned percpu_ref_count(struct percpu_ref *ref); +int percpu_ref_kill(struct percpu_ref *ref); +int percpu_ref_dead(struct percpu_ref *ref); + +#endif diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c new file mode 100644 index 0000000..bf71d54 --- /dev/null +++ b/lib/percpu-refcount.c @@ -0,0 +1,280 @@ +#define pr_fmt(fmt) "%s: " fmt "\n", __func__ + +#include <linux/jiffies.h> +#include <linux/percpu-refcount.h> + +/* + * A percpu refcount can be in 4 different modes. The state is tracked in the + * low two bits of percpu_ref->pcpu_count: + * + * PCPU_REF_NONE - the initial state, no percpu counters allocated. + * + * PCPU_REF_PTR - using percpu counters for the refcount. 
+ * + * PCPU_REF_DYING - we're shutting down so get()/put() should use the embedded + * atomic counter, but we're not finished updating the atomic counter from the + * percpu counters - this means that percpu_ref_put() can't check for the ref + * hitting 0 yet. + * + * PCPU_REF_DEAD - we've finished the teardown sequence, percpu_ref_put() should + * now check for the ref hitting 0. + * + * In PCPU_REF_NONE mode, we need to count the number of times percpu_ref_get() + * is called; this is done with the high bits of the raw atomic counter. We also + * track the time, in jiffies, when the get count last wrapped - this is done + * with the remaining bits of percpu_ref->percpu_count. + * + * So, when percpu_ref_get() is called it increments the get count and checks if + * it wrapped; if it did, it checks if the last time it wrapped was less than + * one second ago; if so, we want to allocate percpu counters. + * + * PCPU_COUNT_BITS determines the threshold where we convert to percpu: of the + * raw 64 bit counter, we use PCPU_COUNT_BITS for the refcount, and the + * remaining (high) bits to count the number of times percpu_ref_get() has been + * called. It's currently (completely arbitrarily) 16384 times in one second. + * + * Percpu mode (PCPU_REF_PTR): + * + * In percpu mode all we do on get and put is increment or decrement the cpu + * local counter, which is a 32 bit unsigned int. + * + * Note that all the gets() could be happening on one cpu, and all the puts() on + * another - the individual cpu counters can wrap (potentially many times). + * + * But this is fine because we don't need to check for the ref hitting 0 in + * percpu mode; before we set the state to PCPU_REF_DEAD we simply sum up all + * the percpu counters and add them to the atomic counter. Since addition and + * subtraction in modular arithmetic is still associative, the result will be + * correct. 
+ */ + +#define PCPU_COUNT_BITS 50 +#define PCPU_COUNT_MASK ((1LL << PCPU_COUNT_BITS) - 1) + +#define PCPU_COUNT_BIAS (1ULL << 32) + +/* + * So that we don't have to call alloc_percpu() from percpu_ref_get(), we use a + * small pool and refill from a workqueue. + */ + +#define PCPU_REF_ALLOC_NR 8 + +static unsigned __percpu *percpu_ref_pool[PCPU_REF_ALLOC_NR]; +static unsigned percpu_ref_alloc_nr; +static DEFINE_SPINLOCK(percpu_ref_alloc_lock); + +static void percpu_ref_alloc_refill(struct work_struct *work) +{ + spin_lock_irq(&percpu_ref_alloc_lock); + + while (percpu_ref_alloc_nr < PCPU_REF_ALLOC_NR) { + unsigned __percpu *ref; + + ref = alloc_percpu(unsigned); + if (!ref) + break; + + percpu_ref_pool[percpu_ref_alloc_nr++] = ref; + } + + spin_unlock_irq(&percpu_ref_alloc_lock); +} + +static DECLARE_WORK(percpu_ref_alloc_work, percpu_ref_alloc_refill); + +static void percpu_ref_alloc(struct percpu_ref *ref, unsigned long pcpu_count) +{ + unsigned long flags, now = jiffies; + unsigned __percpu *new = NULL; + + now <<= PCPU_STATUS_BITS; + now |= PCPU_REF_NONE; + + if (now - pcpu_count <= HZ << PCPU_STATUS_BITS) { + spin_lock_irqsave(&percpu_ref_alloc_lock, flags); + + if (percpu_ref_alloc_nr) + new = percpu_ref_pool[--percpu_ref_alloc_nr]; + + if (percpu_ref_alloc_nr < PCPU_REF_ALLOC_NR / 2) + schedule_work(&percpu_ref_alloc_work); + + spin_unlock_irqrestore(&percpu_ref_alloc_lock, flags); + + if (!new) + goto update_time; + + BUG_ON((unsigned long) new & PCPU_STATUS_MASK); + + if (cmpxchg(&ref->pcpu_count, pcpu_count, + (unsigned long) new) != pcpu_count) + free_percpu(new); + } else { +update_time: + cmpxchg(&ref->pcpu_count, pcpu_count, now); + } +} + +/* Slowpath, i.e. 
non percpu */ +void __percpu_ref_get(struct percpu_ref *ref, unsigned long pcpu_count) +{ + uint64_t v; + + v = atomic64_add_return(1 + (1ULL << PCPU_COUNT_BITS), + &ref->count); + + if (unlikely(!(v >> PCPU_COUNT_BITS) && + REF_STATUS(pcpu_count) == PCPU_REF_NONE)) { + percpu_ref_alloc(ref, pcpu_count); + } +} + +int __percpu_ref_put(struct percpu_ref *ref, unsigned long pcpu_count) +{ + uint64_t v; + + v = atomic64_dec_return(&ref->count); + v &= PCPU_COUNT_MASK; + + return v == 0; +} + +int percpu_ref_tryget(struct percpu_ref *ref) +{ + int ret = 1; + + preempt_disable(); + + if (!percpu_ref_dead(ref)) + percpu_ref_get(ref); + else + ret = 0; + + preempt_enable(); + + return ret; +} + +unsigned percpu_ref_count(struct percpu_ref *ref) +{ + unsigned long pcpu_count; + unsigned count = 0; + int cpu; + + preempt_disable(); + + count = atomic64_read(&ref->count) & PCPU_COUNT_MASK; + + pcpu_count = ACCESS_ONCE(ref->pcpu_count); + + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) { + /* for rcu - we're not using rcu_dereference() */ + smp_read_barrier_depends(); + + for_each_possible_cpu(cpu) + count += *per_cpu_ptr((unsigned __percpu *) pcpu_count, cpu); + } + + preempt_enable(); + + return count; +} + +/** + * percpu_ref_init - initialize a dynamic percpu refcount + * + * Initializes the refcount in single atomic counter mode with a refcount of 1; + * analogous to atomic_set(ref, 1). + */ +void percpu_ref_init(struct percpu_ref *ref) +{ + unsigned long now = jiffies; + + atomic64_set(&ref->count, 1 + PCPU_COUNT_BIAS); + + now <<= PCPU_STATUS_BITS; + now |= PCPU_REF_NONE; + + ref->pcpu_count = now; +} + +/** + * percpu_ref_kill - prepare a dynamic percpu refcount for teardown + * + * Must be called before dropping the initial ref, so that percpu_ref_put() + * knows to check for the refcount hitting 0. If the refcount was in percpu + * mode, converts it back to single atomic counter mode. 
+ * + * The caller must issue a synchronize_rcu()/call_rcu() before calling + * percpu_ref_put() to drop the initial ref. + * + * Returns true the first time called on @ref and false if @ref is already + * shutting down, so it may be used by the caller for synchronizing other parts + * of a two stage shutdown. + */ +int percpu_ref_kill(struct percpu_ref *ref) +{ + unsigned long old, status, pcpu_count; + + pcpu_count = ACCESS_ONCE(ref->pcpu_count); + + do { + status = REF_STATUS(pcpu_count); + + if (status == PCPU_REF_DEAD) + return 0; + + old = pcpu_count; + pcpu_count = cmpxchg(&ref->pcpu_count, old, PCPU_REF_DEAD); + } while (pcpu_count != old); + + if (status == PCPU_REF_PTR) { + int cpu; + unsigned count = 0; + + synchronize_sched(); + + for_each_possible_cpu(cpu) + count += *per_cpu_ptr((unsigned __percpu *) pcpu_count, cpu); + + free_percpu((unsigned __percpu *) pcpu_count); + + pr_debug("global %lli pcpu %i", + atomic64_read(&ref->count) & PCPU_COUNT_MASK, + (int) count); + + atomic64_add((int) count - PCPU_COUNT_BIAS, &ref->count); + } + + return 1; +} + +/** + * percpu_ref_put_initial_ref - safely drop the initial ref + * + * A percpu refcount needs a shutdown sequence before dropping the initial ref, + * to put it back into single atomic_t mode with the appropriate barriers so + * that percpu_ref_put() can safely check for it hitting 0 - this does so. + * + * Returns true if @ref hit 0. + */ +int percpu_ref_put_initial_ref(struct percpu_ref *ref) +{ + if (percpu_ref_kill(ref)) { + return percpu_ref_put(ref); + } else { + __WARN(); + return 0; + } +} + +/** + * percpu_ref_dead - check if a dynamic percpu refcount is shutting down + * + * Returns true if percpu_ref_kill() has been called on @ref, false otherwise. + */ +int percpu_ref_dead(struct percpu_ref *ref) +{ + return REF_STATUS(ref->pcpu_count) == PCPU_REF_DEAD; +} ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] generic dynamic per cpu refcounting 2013-01-29 16:39 ` Kent Overstreet @ 2013-01-29 19:29 ` Tejun Heo 2013-01-29 19:51 ` Kent Overstreet 0 siblings, 1 reply; 27+ messages in thread From: Tejun Heo @ 2013-01-29 19:29 UTC (permalink / raw) To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel Hey, Kent. On Tue, Jan 29, 2013 at 08:39:42AM -0800, Kent Overstreet wrote: > Oh, if this is going to be widely used I should probably have a > different implementation for archs that don't have atomic64_t in > hardware. Don't suppose you know the CONFIG_ macro to test against? I > couldn't find it when I was looking recently. !CONFIG_GENERIC_ATOMIC64, I think. I don't think it's worthwhile to worry about tho. Reverting to simple atomic_t ops on !CONFIG_SMP should cover most relevant cases. Very tentative review follows. > +#define PCPU_REF_PTR 0 > +#define PCPU_REF_NONE 1 > +#define PCPU_REF_DEAD 2 Do we still need to distinguish between NONE and DEAD? It would be nice if we can just test against NULL. > +/** > + * percpu_ref_get - increment a dynamic percpu refcount > + * > + * Analogous to atomic_inc(). > + */ > +static inline void percpu_ref_get(struct percpu_ref *ref) > +{ > + unsigned long pcpu_count; > + > + preempt_disable(); > + > + pcpu_count = ACCESS_ONCE(ref->pcpu_count); > + > + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) { > + /* for rcu - we're not using rcu_dereference() */ > + smp_read_barrier_depends(); Let's explain more. :) > + __this_cpu_inc(*((unsigned __percpu *) pcpu_count)); What about overflow? Note that we can have systematic cases where ref is gotten on one cpu and put on another transferring counts in a specific direction. > +#define PCPU_COUNT_BITS 50 > +#define PCPU_COUNT_MASK ((1LL << PCPU_COUNT_BITS) - 1) > + > +#define PCPU_COUNT_BIAS (1ULL << 32) I really don't get this. 
Why use 1<<32 for bias which isn't too difficult to overflow especially if you have many cpu threads and some systematic drift in percpu refs? What's the point of not using the higher bits? Is it to encode count usage statistics into the same counter? If so, just add another field. 24bytes is fine. I'd really like to see just basic percpu refcnt with async and sync interfaces w/o dynamic allocation. At this point, we aren't really sure if there are gonna be enough benefits or use cases from dynamic allocation, so I don't really see the point in the added complexity. Let's start with basic stuff and add on dynamic alloc if it actually is necessary. Thanks. -- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] generic dynamic per cpu refcounting 2013-01-29 19:29 ` Tejun Heo @ 2013-01-29 19:51 ` Kent Overstreet 2013-01-29 20:02 ` Tejun Heo 0 siblings, 1 reply; 27+ messages in thread From: Kent Overstreet @ 2013-01-29 19:51 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel On Tue, Jan 29, 2013 at 11:29:04AM -0800, Tejun Heo wrote: > Hey, Kent. > > On Tue, Jan 29, 2013 at 08:39:42AM -0800, Kent Overstreet wrote: > > Oh, if this is going to be widely used I should probably have a > > different implementation for archs that don't have atomic64_t in > > hardware. Don't suppose you know the CONFIG_ macro to test against? I > > couldn't find it when I was looking recently. > > !CONFIG_GENERIC_ATOMIC64, I think. I don't think it's worthwhile to > worry about tho. Reverting to simple atomic_t ops on !CONFIG_SMP > should cover most relevant cases. > > Very tentative review follows. > > > +#define PCPU_REF_PTR 0 > > +#define PCPU_REF_NONE 1 > > +#define PCPU_REF_DEAD 2 > > Do we still need to distinguish between NONE and DEAD? It would be > nice if we can just test against NULL. With dynamic allocation, yes - we can't alloc percpu counters if it's dead. > > +/** > > + * percpu_ref_get - increment a dynamic percpu refcount > > + * > > + * Analagous to atomic_inc(). > > + */ > > +static inline void percpu_ref_get(struct percpu_ref *ref) > > +{ > > + unsigned long pcpu_count; > > + > > + preempt_disable(); > > + > > + pcpu_count = ACCESS_ONCE(ref->pcpu_count); > > + > > + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) { > > + /* for rcu - we're not using rcu_dereference() */ > > + smp_read_barrier_depends(); > > Let's explain more. :) I'd like to come up with a way of using rcu_dereference() without sparse choking on __percpu combined with __rcu, but ok. > > + __this_cpu_inc(*((unsigned __percpu *) pcpu_count)); > > What about overflow? 
Note that we can have systematic cases where ref > is gotten on one cpu and put on another transferring counts in a > specific direction. Heh, this keeps coming up. It works because modular arithmetic is still associative for addition and subtraction. It doesn't matter which cpu the gets and puts happen on, the counters will still sum to the same value - even if individual counters overflow/underflow. (It's the same as doing a bunch of increments and decrements to a single integer, but in random order. 0 - 1 - 1 + 1 + 1 + 1 will always equal 1, no matter where you stick parentheses). That's also why I use an unsigned integer for the per cpu counters... since they're going to wrap and signed integer overflow is undefined. Not that I really expect gcc to figure out a way to screw me for that, but I tend to be anal about such things. > > +#define PCPU_COUNT_BITS 50 > > +#define PCPU_COUNT_MASK ((1LL << PCPU_COUNT_BITS) - 1) > > + > > +#define PCPU_COUNT_BIAS (1ULL << 32) > > I really don't get this. Why use 1<<32 for bias which isn't too > difficult to overflow especially if you have many cpu threads and some > systematic drift in percpu refs? What's the point of not using the > higher bits? Is it to encode count usage statistics into the same > counter? If so, just add another field. 24bytes is fine. The systematic drift doesn't affect overflow here - the bias is only applied to the atomic counter, which can only be off by at most whatever the per cpu counters happen to sum to (and remember it's modular arithmetic, so that's INT_MIN/INT_MAX). Since I'm using 32 bit ints for the per cpu counters, 1 << 32 works just fine. If we want to expand the per cpu counters to longs or 64 bit ints, then yeah I'd use a larger bias. 
It's nitpicky but using a larger bias feels like overly defensive programming to me (if you know how big the bias has to be, great, use that - if you aren't sure, wtf are you doing :P) > I'd really like to see just basic percpu refcnt with async and sync > interfaces w/o dynamic allocation. At this point, we aren't really > sure if there are gonna be enough benefits or use cases from dynamic > allocation, so I don't really see the point in the added complexity. > Let's start with basic stuff and add on dynamic alloc if it actually > is necessary. Well, with the alloc rework the dynamic allocation is completely hidden from the users - it's just get and put. Killing dynamic allocation would just mean that percpu_ref_init() could fail, as far as the api is concerned. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] generic dynamic per cpu refcounting 2013-01-29 19:51 ` Kent Overstreet @ 2013-01-29 20:02 ` Tejun Heo 2013-01-29 21:45 ` Kent Overstreet 0 siblings, 1 reply; 27+ messages in thread From: Tejun Heo @ 2013-01-29 20:02 UTC (permalink / raw) To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel Hello, Kent. On Tue, Jan 29, 2013 at 11:51:41AM -0800, Kent Overstreet wrote: > > What about overflow? Note that we can have systemetic cases where ref > > is gotten on one cpu and put on another transferring counts in a > > specific direction. > > Heh, this keeps coming up. > > It works because modular arithmatic is still associative for addition > and subtraction. It doesn't matter which cpu the gets and puts happen > on, the counters will still sum to the same value - even if individual > counters overflow/underflow. > > (It's the same as doing a bunch of increments and decrements to a single > integer, but in random order. 0 - 1 - 1 + 1 + 1 + 1 will always equal 1, > no matter where you stick parentheses). > > That's also why I use an unsigned integer for the per cpu counters... > since they're going to wrap and signed integer overflow is undefined. > Not that I really expect gcc to figure out a way to screw me for that, > but I tend to be anal about such things. Hmmm... okay, but that only works if the summing up is also done in 32bits, right? Is that why the global counter is kept at 32bits? > > I really don't get this. Why use 1<<32 for bias which isn't too > > difficult to overflow especially if you have many cpu threads and some > > systemetic drift in percpu refs? What's the point of not using the > > higher bits? Is it to encode count usage statistics into the same > > counter? If so, just add another field. 24bytes is fine. 
> > The systematic drift doesn't affect overflow here - the bias is only > applied to the atomic counter, which can only be off by at most whatever > the per cpu counters happen to sum to (and remember it's modular > arithmatic, so that's INT_MIN/INT_MAX). Yeah if you don't have to worry about flushing percpu counters to prevent over/underflows, this shouldn't matter. It would be nice to document it tho. > Since I'm using 32 bit ints for the per cpu counters, 1 << 32 works just > fine. If we want to expand the per cpu counters to longs or 64 bit ints, > then yeah I'd use a larger bias. > > It's nitpicky but using a larger bias feels like overly defensive > programming to me (if you know how big the bias has to be, great, use > that - if you aren't sure, wtf are you doing :P) I don't know. I think it's a correctness issue. If you over/underflow into global counter, you need buffer space to be (practically) correct. If modular calculation makes it unnecessary to worry about over/underflows, we can't use more than 32bits for global sums, right? I'm still a bit hazy but it looks like using all 32bit counters should work w/o worrying about over/underflows and that's really nice. > > I'd really like to see just basic percpu refcnt with async and sync > > interfaces w/o dynamic allocation. At this point, we aren't really > > sure if there are gonna be enough benefits or use cases from dynamic > > allocation, so I don't really see the point in the added complexity. > > Let's start with basic stuff and add on dynamic alloc if it actually > > is necessary. > > Well, with the alloc rework the dynamic allocation is completely hidden > from the users - it's just get and put. Killing dynamic allocation would > just mean that percpu_ref_init() could fail, as far as the api is > concerned. There really is no point in adding complexity which isn't needed. It slows down the code unnecessarily (however minute) and adds maintenance overhead. Why do that when there's no need? Thanks. 
-- tejun ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] generic dynamic per cpu refcounting 2013-01-29 20:02 ` Tejun Heo @ 2013-01-29 21:45 ` Kent Overstreet 2013-01-29 22:06 ` Tejun Heo 0 siblings, 1 reply; 27+ messages in thread From: Kent Overstreet @ 2013-01-29 21:45 UTC (permalink / raw) To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel On Tue, Jan 29, 2013 at 12:02:18PM -0800, Tejun Heo wrote: > Hello, Kent. > > On Tue, Jan 29, 2013 at 11:51:41AM -0800, Kent Overstreet wrote: > > > What about overflow? Note that we can have systemetic cases where ref > > > is gotten on one cpu and put on another transferring counts in a > > > specific direction. > > > > Heh, this keeps coming up. > > > > It works because modular arithmatic is still associative for addition > > and subtraction. It doesn't matter which cpu the gets and puts happen > > on, the counters will still sum to the same value - even if individual > > counters overflow/underflow. > > > > (It's the same as doing a bunch of increments and decrements to a single > > integer, but in random order. 0 - 1 - 1 + 1 + 1 + 1 will always equal 1, > > no matter where you stick parentheses). > > > > That's also why I use an unsigned integer for the per cpu counters... > > since they're going to wrap and signed integer overflow is undefined. > > Not that I really expect gcc to figure out a way to screw me for that, > > but I tend to be anal about such things. > > Hmmm... okay, but that only works if the summing up is also done in > 32bits, right? Is that why the global counter is kept at 32bits? Global counter's actually kept at 50 bits - arbitrarily, because the top 14 bits are used for the get counter. It would definitely be cleaner if the global counter was also 32 bits (I probably should've done it that way at first) but it works with it being bigger _provided the sum of the percpu counters is sign extended_ when collecting them up at the end. With the bias though we kind of want the global counter to be bigger... 
the bias needs to fit in the global counter. > > > I really don't get this. Why use 1<<32 for bias which isn't too > > > difficult to overflow especially if you have many cpu threads and some > > > systemetic drift in percpu refs? What's the point of not using the > > > higher bits? Is it to encode count usage statistics into the same > > > counter? If so, just add another field. 24bytes is fine. > > > > The systematic drift doesn't affect overflow here - the bias is only > > applied to the atomic counter, which can only be off by at most whatever > > the per cpu counters happen to sum to (and remember it's modular > > arithmatic, so that's INT_MIN/INT_MAX). > > Yeah if you don't have to worry about flushing percpu counters to > prevent over/underflows, this shouldn't matter. It would be nice to > document it tho. Good point, I'll add the explanation to the comments. > > Since I'm using 32 bit ints for the per cpu counters, 1 << 32 works just > > fine. If we want to expand the per cpu counters to longs or 64 bit ints, > > then yeah I'd use a larger bias. > > > > It's nitpicky but using a larger bias feels like overly defensive > > programming to me (if you know how big the bias has to be, great, use > > that - if you aren't sure, wtf are you doing :P) > > I don't know. I think it's a correctness issue. If you > over/underflow into global counter, you need buffer space to be > (practically) correct. Not quite sure what you mean... > If modular calculation makes it unnecessary to > worry about over/underflows, we can't use more than 32bits for global > sums, right? It doesn't really make sense to use more than 32 bits for the global sum (ignoring the bias value... that complicates it) but it does work, with the sign extending. I haven't figured out a good way of explaining why though... I should probably look at what happens without it again. (I missed the sign extending when I first coded it, had to debug that). 
> I'm still a bit hazy but it looks like using all 32bit
> counters should work w/o worrying about over/underflows and that's
> really nice.

I know my math :D

> > > I'd really like to see just basic percpu refcnt with async and sync
> > > interfaces w/o dynamic allocation. At this point, we aren't really
> > > sure if there are gonna be enough benefits or use cases from dynamic
> > > allocation, so I don't really see the point in the added complexity.
> > > Let's start with basic stuff and add on dynamic alloc if it actually
> > > is necessary.
> >
> > Well, with the alloc rework the dynamic allocation is completely hidden
> > from the users - it's just get and put. Killing dynamic allocation would
> > just mean that percpu_ref_init() could fail, as far as the api is
> > concerned.
>
> There really is no point in adding complexity which isn't needed. It
> slows down the code unnecessarily (however minute) and adds
> maintenance overhead. Why do that when there's no need?

Well, like I said I want this to be a drop-in replacement for atomic_t
anyone can use without having to think about the potential overhead.

If it really does turn out there aren't enough uses for it to be worth
it I'll be ok with ripping out the dynamic alloc... but the slowdown is
pretty damn minute (one instruction in the fast path) and the additional
complexity is fairly minor and quite self-contained so I'm just not
seeing the urgency.

^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-29 21:45 ` Kent Overstreet
@ 2013-01-29 22:06 ` Tejun Heo
0 siblings, 0 replies; 27+ messages in thread
From: Tejun Heo @ 2013-01-29 22:06 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

Hey, Kent.

On Tue, Jan 29, 2013 at 01:45:37PM -0800, Kent Overstreet wrote:
> It would definitely be cleaner if the global counter was also 32 bits (I
> probably should've done it that way at first) but it works with it being
> bigger _provided the sum of the percpu counters is sign extended_ when
> collecting them up at the end.
>
> With the bias though we kind of want the global counter to be bigger...
> the bias needs to fit in the global counter.

Yeah, the size being 64-bit is fine as long as the summation and the
result are done in 32 bits.

> > I don't know. I think it's a correctness issue. If you
> > over/underflow into global counter, you need buffer space to be
> > (practically) correct.
>
> Not quite sure what you mean...

If you compute w/o sign extension, it becomes buggy.

> It doesn't really make sense to use more than 32 bits for the global sum
> (ignoring the bias value... that complicates it) but it does work, with
> the sign extending. I haven't figured out a good way of explaining why
> though... I should probably look at what happens without it again. (I
> missed the sign extending when I first coded it, had to debug that).

I think it would be easier to understand if we use only the lower 32 bits
for the actual counting.

> > There really is no point in adding complexity which isn't needed. It
> > slows down the code unnecessarily (however minute) and adds
> > maintenance overhead. Why do that when there's no need?
>
> Well, like I said I want this to be a drop-in replacement for atomic_t
> anyone can use without having to think about the potential overhead.
>
> If it really does turn out there aren't enough uses for it to be worth
> it I'll be ok with ripping out the dynamic alloc... but the slowdown is
> pretty damn minute (one instruction in the fast path) and the additional
> complexity is fairly minor and quite self-contained so I'm just not
> seeing the urgency.

I really don't think this can be a drop-in replacement given the
significant overhead on the base ref put, and you build up complexity
with use cases, not the other way around. When you're looking at each
specific implementation, it might not look like much but it really sucks
to have a lot of unneeded / unused complexities around. It raises
maintenance overhead for no good reason.

Like you said, it's not like extending the design is difficult if we
find out that the dynamic allocation is actually useful, and there just
is no reason to do it the other way around.

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH] module: Convert to generic percpu refcounts
2013-01-28 21:50 ` Tejun Heo
2013-01-29 16:39 ` Kent Overstreet
@ 2013-01-29 18:04 ` Kent Overstreet
1 sibling, 0 replies; 27+ messages in thread
From: Kent Overstreet @ 2013-01-29 18:04 UTC (permalink / raw)
To: Tejun Heo; +Cc: Oleg Nesterov, srivatsa.bhat, rusty, linux-kernel

I started screwing around just to see how hard a conversion would be and
what it'd look like. I _think_ this is complete, but there's enough going
on I undoubtedly missed something. Completely untested - builds and
that's it. I'm sure it's broken. Deletes almost 100 lines of code
though.

I like how it looks for the most part... MODULE_STATE_GOING is messy
since the percpu ref has to track that, and I didn't want to track that
in two places - but I couldn't just delete the enum since module
notifiers use it. It's currently vestigial and only used for the
notifiers. Similarly with the other places module->state and
percpu_ref_dead() (via module_is_live()) are checked... bit ugly but ah
well.

---
 include/linux/module.h | 28 +++----
 include/trace/events/module.h | 2 +-
 kernel/debug/kdb/kdb_main.c | 8 +-
 kernel/module.c | 165 +++++++++++++-----------------------
 4 files changed, 60 insertions(+), 143 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index ead1b57..88f78f6 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -17,6 +17,7 @@
 #include <linux/moduleparam.h>
 #include <linux/tracepoint.h>
 #include <linux/export.h>
+#include <linux/percpu-refcount.h>
 #include <linux/percpu.h>
 #include <asm/module.h>
@@ -206,19 +207,7 @@ enum module_state {
 MODULE_STATE_UNFORMED, /* Still setting it up.
*/ }; -/** - * struct module_ref - per cpu module reference counts - * @incs: number of module get on this cpu - * @decs: number of module put on this cpu - * - * We force an alignment on 8 or 16 bytes, so that alloc_percpu() - * put @incs/@decs in same cache line, with no extra memory cost, - * since alloc_percpu() is fine grained. - */ -struct module_ref { - unsigned long incs; - unsigned long decs; -} __attribute((aligned(2 * sizeof(unsigned long)))); +const char *module_state(struct module *mod); struct module { @@ -361,13 +350,10 @@ struct module /* What modules do I depend on? */ struct list_head target_list; - /* Who is waiting for us to be unloaded */ - struct task_struct *waiter; - /* Destruction function. */ void (*exit)(void); - struct module_ref __percpu *refptr; + struct percpu_ref ref; #endif #ifdef CONFIG_CONSTRUCTORS @@ -387,7 +373,7 @@ extern struct mutex module_mutex; (IDE & SCSI) require entry into the module during init.*/ static inline int module_is_live(struct module *mod) { - return mod->state != MODULE_STATE_GOING; + return !percpu_ref_dead(&mod->ref); } struct module *__module_text_address(unsigned long addr); @@ -451,7 +437,11 @@ extern void __module_put_and_exit(struct module *mod, long code) #define module_put_and_exit(code) __module_put_and_exit(THIS_MODULE, code); #ifdef CONFIG_MODULE_UNLOAD -unsigned long module_refcount(struct module *mod); +static inline unsigned long module_refcount(struct module *mod) +{ + return percpu_ref_count(&mod->ref) - 1; +} + void __symbol_put(const char *symbol); #define symbol_put(x) __symbol_put(MODULE_SYMBOL_PREFIX #x) void symbol_put_addr(void *addr); diff --git a/include/trace/events/module.h b/include/trace/events/module.h index 1619327..3de2241 100644 --- a/include/trace/events/module.h +++ b/include/trace/events/module.h @@ -78,7 +78,7 @@ DECLARE_EVENT_CLASS(module_refcnt, TP_fast_assign( __entry->ip = ip; - __entry->refcnt = __this_cpu_read(mod->refptr->incs) + __this_cpu_read(mod->refptr->decs); 
+ __entry->refcnt = module_refcount(mod); __assign_str(name, mod->name); ), diff --git a/kernel/debug/kdb/kdb_main.c b/kernel/debug/kdb/kdb_main.c index 8875254..813c0ed 100644 --- a/kernel/debug/kdb/kdb_main.c +++ b/kernel/debug/kdb/kdb_main.c @@ -1978,13 +1978,7 @@ static int kdb_lsmod(int argc, const char **argv) #ifdef CONFIG_MODULE_UNLOAD kdb_printf("%4ld ", module_refcount(mod)); #endif - if (mod->state == MODULE_STATE_GOING) - kdb_printf(" (Unloading)"); - else if (mod->state == MODULE_STATE_COMING) - kdb_printf(" (Loading)"); - else - kdb_printf(" (Live)"); - kdb_printf(" 0x%p", mod->module_core); + kdb_printf(" (%s) 0x%p", module_state(mod), mod->module_core); #ifdef CONFIG_MODULE_UNLOAD { diff --git a/kernel/module.c b/kernel/module.c index 921bed4..08aa83a 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -625,21 +625,15 @@ static char last_unloaded_module[MODULE_NAME_LEN+1]; EXPORT_TRACEPOINT_SYMBOL(module_get); /* Init the unload section of the module. */ -static int module_unload_init(struct module *mod) +static void module_unload_init(struct module *mod) { - mod->refptr = alloc_percpu(struct module_ref); - if (!mod->refptr) - return -ENOMEM; + percpu_ref_init(&mod->ref); INIT_LIST_HEAD(&mod->source_list); INIT_LIST_HEAD(&mod->target_list); /* Hold reference count during initialization. */ - __this_cpu_write(mod->refptr->incs, 1); - /* Backwards compatibility macros put refcount during init. */ - mod->waiter = current; - - return 0; + __module_get(mod); } /* Does a already use b? 
*/ @@ -719,8 +713,6 @@ static void module_unload_free(struct module *mod) kfree(use); } mutex_unlock(&module_mutex); - - free_percpu(mod->refptr); } #ifdef CONFIG_MODULE_FORCE_UNLOAD @@ -745,6 +737,15 @@ struct stopref int *forced; }; +static void stop_module(struct module *mod) +{ + /* Mark it as dying, drop base ref */ + if (percpu_ref_kill(&mod->ref)) + module_put(mod); + else + WARN(1, "initial module ref already dropped"); +} + /* Whole machine is stopped with interrupts off when this runs. */ static int __try_stop_module(void *_sref) { @@ -756,8 +757,7 @@ static int __try_stop_module(void *_sref) return -EWOULDBLOCK; } - /* Mark it as dying. */ - sref->mod->state = MODULE_STATE_GOING; + stop_module(sref->mod); return 0; } @@ -769,57 +769,14 @@ static int try_stop_module(struct module *mod, int flags, int *forced) return stop_machine(__try_stop_module, &sref, NULL); } else { /* We don't need to stop the machine for this. */ - mod->state = MODULE_STATE_GOING; - synchronize_sched(); + stop_module(mod); return 0; } } -unsigned long module_refcount(struct module *mod) -{ - unsigned long incs = 0, decs = 0; - int cpu; - - for_each_possible_cpu(cpu) - decs += per_cpu_ptr(mod->refptr, cpu)->decs; - /* - * ensure the incs are added up after the decs. - * module_put ensures incs are visible before decs with smp_wmb. - * - * This 2-count scheme avoids the situation where the refcount - * for CPU0 is read, then CPU0 increments the module refcount, - * then CPU1 drops that refcount, then the refcount for CPU1 is - * read. We would record a decrement but not its corresponding - * increment so we would see a low count (disaster). - * - * Rare situation? But module_refcount can be preempted, and we - * might be tallying up 4096+ CPUs. So it is not impossible. 
- */ - smp_rmb(); - for_each_possible_cpu(cpu) - incs += per_cpu_ptr(mod->refptr, cpu)->incs; - return incs - decs; -} -EXPORT_SYMBOL(module_refcount); - /* This exists whether we can unload or not */ static void free_module(struct module *mod); -static void wait_for_zero_refcount(struct module *mod) -{ - /* Since we might sleep for some time, release the mutex first */ - mutex_unlock(&module_mutex); - for (;;) { - pr_debug("Looking at refcount...\n"); - set_current_state(TASK_UNINTERRUPTIBLE); - if (module_refcount(mod) == 0) - break; - schedule(); - } - current->state = TASK_RUNNING; - mutex_lock(&module_mutex); -} - SYSCALL_DEFINE2(delete_module, const char __user *, name_user, unsigned int, flags) { @@ -850,7 +807,7 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user, } /* Doing init or already dying? */ - if (mod->state != MODULE_STATE_LIVE) { + if (mod->state != MODULE_STATE_LIVE || !module_is_live(mod)) { /* FIXME: if (force), slam module count and wake up waiter --RR */ pr_debug("%s already dying\n", mod->name); @@ -868,19 +825,17 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user, } } - /* Set this up before setting mod->state */ - mod->waiter = current; - /* Stop the machine so refcounts can't move and disable module. */ ret = try_stop_module(mod, flags, &forced); if (ret != 0) goto out; + mutex_unlock(&module_mutex); + /* Never wait if forced. */ - if (!forced && module_refcount(mod) != 0) - wait_for_zero_refcount(mod); + if (!forced) + wait_event(module_wq, !percpu_ref_count(&mod->ref)); - mutex_unlock(&module_mutex); /* Final destruction now no one is using it. 
*/ if (mod->exit != NULL) mod->exit(); @@ -962,45 +917,29 @@ static struct module_attribute modinfo_refcnt = void __module_get(struct module *module) { if (module) { - preempt_disable(); - __this_cpu_inc(module->refptr->incs); trace_module_get(module, _RET_IP_); - preempt_enable(); + percpu_ref_get(&module->ref); } } EXPORT_SYMBOL(__module_get); bool try_module_get(struct module *module) { - bool ret = true; - if (module) { - preempt_disable(); - - if (likely(module_is_live(module))) { - __this_cpu_inc(module->refptr->incs); - trace_module_get(module, _RET_IP_); - } else - ret = false; - - preempt_enable(); - } - return ret; + trace_module_get(module, _RET_IP_); + return percpu_ref_tryget(&module->ref); + } else + return true; } EXPORT_SYMBOL(try_module_get); void module_put(struct module *module) { if (module) { - preempt_disable(); - smp_wmb(); /* see comment in module_refcount */ - __this_cpu_inc(module->refptr->decs); - trace_module_put(module, _RET_IP_); - /* Maybe they're waiting for us to drop reference? 
*/ - if (unlikely(!module_is_live(module))) - wake_up_process(module->waiter); - preempt_enable(); + + if (percpu_ref_put(&module->ref)) + wake_up_all(&module_wq); } } EXPORT_SYMBOL(module_put); @@ -1048,25 +987,27 @@ static size_t module_flags_taint(struct module *mod, char *buf) return l; } -static ssize_t show_initstate(struct module_attribute *mattr, - struct module_kobject *mk, char *buffer) +const char *module_state(struct module *mod) { - const char *state = "unknown"; - - switch (mk->mod->state) { + switch (mod->state) { case MODULE_STATE_LIVE: - state = "live"; - break; + if (module_is_live(mod)) + return "live"; + else + return "going"; case MODULE_STATE_COMING: - state = "coming"; - break; - case MODULE_STATE_GOING: - state = "going"; - break; + return "coming"; + case MODULE_STATE_UNFORMED: + return "unformed"; default: - BUG(); + return "unknown"; } - return sprintf(buffer, "%s\n", state); +} + +static ssize_t show_initstate(struct module_attribute *mattr, + struct module_kobject *mk, char *buffer) +{ + return sprintf(buffer, "%s\n", module_state(mk->mod)); } static struct module_attribute modinfo_initstate = @@ -3022,8 +2963,7 @@ static bool finished_loading(const char *name) mutex_lock(&module_mutex); mod = find_module_all(name, true); - ret = !mod || mod->state == MODULE_STATE_LIVE - || mod->state == MODULE_STATE_GOING; + ret = !mod || mod->state == MODULE_STATE_LIVE; mutex_unlock(&module_mutex); return ret; @@ -3073,8 +3013,7 @@ static int do_init_module(struct module *mod) if (ret < 0) { /* Init routine failed: abort. Try to protect us from buggy refcounters. */ - mod->state = MODULE_STATE_GOING; - synchronize_sched(); + stop_module(mod); module_put(mod); blocking_notifier_call_chain(&module_notify_list, MODULE_STATE_GOING, mod); @@ -3245,9 +3184,7 @@ static int load_module(struct load_info *info, const char __user *uargs, #endif /* Now module is in final location, initialize linked lists, etc. 
*/ - err = module_unload_init(mod); - if (err) - goto unlink_mod; + module_unload_init(mod); /* Now we've got everything in the final locations, we can * find optional sections. */ @@ -3323,7 +3260,6 @@ static int load_module(struct load_info *info, const char __user *uargs, free_modinfo(mod); free_unload: module_unload_free(mod); - unlink_mod: mutex_lock(&module_mutex); /* Unlink carefully: kallsyms could be walking list. */ list_del_rcu(&mod->list); @@ -3614,12 +3550,12 @@ static char *module_flags(struct module *mod, char *buf) BUG_ON(mod->state == MODULE_STATE_UNFORMED); if (mod->taints || - mod->state == MODULE_STATE_GOING || + !module_is_live(mod) || mod->state == MODULE_STATE_COMING) { buf[bx++] = '('; bx += module_flags_taint(mod, buf + bx); /* Show a - for module-is-being-unloaded */ - if (mod->state == MODULE_STATE_GOING) + if (!module_is_live(mod)) buf[bx++] = '-'; /* Show a + for module-is-being-loaded */ if (mod->state == MODULE_STATE_COMING) @@ -3663,10 +3599,7 @@ static int m_show(struct seq_file *m, void *p) print_unload_info(m, mod); /* Informative for users. */ - seq_printf(m, " %s", - mod->state == MODULE_STATE_GOING ? "Unloading": - mod->state == MODULE_STATE_COMING ? "Loading": - "Live"); + seq_printf(m, " %s", module_state(mod)); /* Used by oprofile and other similar tools. */ seq_printf(m, " 0x%pK", mod->module_core); -- 1.7.12 ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] generic dynamic per cpu refcounting
2013-01-25 18:09 ` [PATCH] generic dynamic per cpu refcounting Oleg Nesterov
2013-01-25 18:29 ` Oleg Nesterov
2013-01-25 19:11 ` Oleg Nesterov
@ 2013-01-28 18:07 ` Kent Overstreet
2 siblings, 0 replies; 27+ messages in thread
From: Kent Overstreet @ 2013-01-28 18:07 UTC (permalink / raw)
To: Oleg Nesterov; +Cc: tj, srivatsa.bhat, rusty, linux-kernel

On Fri, Jan 25, 2013 at 07:09:41PM +0100, Oleg Nesterov wrote:
> (add lkml)
>
> On 01/24, Kent Overstreet wrote:
> >
> > This has already been on lkml and is in Andrew's tree, Tejun just asked
> > me to send it out again:
>
> I'll try to read this code later, just a couple of questions after a quick
> glance. Sorry this was already discussed...

No worries, it wasn't that widely circulated.

> > +struct percpu_ref {
> > + atomic64_t count;
> > + unsigned long pcpu_count;
> > +};
>
> The code looks a bit tricky mostly because you pack state/pointer/jiffies
> into ->pcpu_count. The same for ->count.

Yes, it is.

> I assume that you have a good reason to shrink the sizeof(percpu_ref), but
> I am curious: who is the user of this thing?

Right now - just the aio code, but the idea was to make it as close to a
drop-in replacement for atomic_t + atomic_get()/atomic_dec_and_test() as
possible.

> > + * percpu_ref_get - increment a dynamic percpu refcount
> > + *
> > + * Increments @ref and possibly converts it to percpu counters. Must be called
> > + * with rcu_read_lock() held, and may potentially drop/reacquire rcu_read_lock()
> > + * to allocate percpu counters - if sleeping/allocation isn't safe for some
> > + * other reason (e.g. a spinlock), see percpu_ref_get_noalloc().
>
> And this looks strange. It must be called under rcu_read_lock(), but
> ->rcu_read_lock_nesting must be == 1. Otherwise rcu_read_unlock() in
> percpu_ref_alloc() won't work.
>
> Again, I think you have a reason, but could you explain? IOW, why we
> can't make it might_sleep() instead?
> The fast path can do rcu_read_lock()
> itself.

It's stupid and contorted because I didn't have any better ideas when I
first wrote it and haven't fixed it yet.

> > +static inline void percpu_ref_get_noalloc(struct percpu_ref *ref)
> > +{
> > + __percpu_ref_get(ref, false);
> > +}
>
> and this could be percpu_ref_get_atomic().
>
> Once again, I am not arguing, just can't understand.

Same deal, I'm going to get rid of the two different versions.

> > +void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
> > +{
> > + unsigned long pcpu_count;
> > + uint64_t v;
> > +
> > + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > +
> > + if (REF_STATUS(pcpu_count) == PCPU_REF_PTR) {
> > + /* for rcu - we're not using rcu_dereference() */
> > + smp_read_barrier_depends();
> > + __this_cpu_inc(*((unsigned __percpu *) pcpu_count));
>
> The comment looks a bit confusing... smp_read_barrier_depends() is not
> for rcu, we obviously need it to access (unsigned __percpu *) pcpu_count.
> But yes, since we didn't use rcu_dereference() we have to add it by hand.

Yeah - originally I was using rcu_dereference(), but sparse hated
combining __percpu and __rcu and I couldn't get it to stop complaining.

>
> > +int percpu_ref_kill(struct percpu_ref *ref)
> > +{
> > ...
> > + if (status == PCPU_REF_PTR) {
> > + unsigned count = 0, cpu;
> > +
> > + synchronize_rcu();
> > +
> > + for_each_possible_cpu(cpu)
> > + count += *per_cpu_ptr((unsigned __percpu *) pcpu_count, cpu);
> > +
> > + pr_debug("global %lli pcpu %i",
> > + atomic64_read(&ref->count) & PCPU_COUNT_MASK,
> > + (int) count);
> > +
> > + atomic64_add((int) count, &ref->count);
> > + smp_wmb();
> > + /* Between setting global count and setting PCPU_REF_DEAD */
> > + ref->pcpu_count = PCPU_REF_DEAD;
>
> The comment explains what the code does, but not why ;)

That seems like a more straightforward barrier than most...
we need the refcount to be consistent before setting the state to dead :P

> I guess this is for percpu_ref_put(), and this wmb() pairs with implicit
> mb() implied by atomic64_dec_return().

Yeah. I expanded the comment there a bit...

>
> > + free_percpu((unsigned __percpu *) pcpu_count);
>
> I guess it could be freed right after for_each_possible_cpu() above, but
> this doesn't matter.

I think that'd be better though, I'll switch it.

^ permalink raw reply [flat|nested] 27+ messages in thread
Thread overview: 27+ messages
[not found] <20130124232024.GA584@google.com>
2013-01-25 18:09 ` [PATCH] generic dynamic per cpu refcounting Oleg Nesterov
2013-01-25 18:29 ` Oleg Nesterov
2013-01-28 18:10 ` Kent Overstreet
2013-01-28 18:50 ` Oleg Nesterov
2013-01-25 19:11 ` Oleg Nesterov
2013-01-28 18:15 ` Kent Overstreet
2013-01-28 18:27 ` Tejun Heo
2013-01-28 18:49 ` Kent Overstreet
2013-01-28 18:55 ` Tejun Heo
2013-01-28 20:22 ` Kent Overstreet
2013-01-28 20:27 ` Tejun Heo
2013-01-28 20:55 ` Kent Overstreet
2013-01-28 21:18 ` Tejun Heo
2013-01-28 21:24 ` Kent Overstreet
2013-01-28 21:28 ` Tejun Heo
2013-01-28 21:36 ` Tejun Heo
2013-01-28 21:48 ` Kent Overstreet
2013-01-28 21:45 ` Kent Overstreet
2013-01-28 21:50 ` Tejun Heo
2013-01-29 16:39 ` Kent Overstreet
2013-01-29 19:29 ` Tejun Heo
2013-01-29 19:51 ` Kent Overstreet
2013-01-29 20:02 ` Tejun Heo
2013-01-29 21:45 ` Kent Overstreet
2013-01-29 22:06 ` Tejun Heo
2013-01-29 18:04 ` [PATCH] module: Convert to generic percpu refcounts Kent Overstreet
2013-01-28 18:07 ` [PATCH] generic dynamic per cpu refcounting Kent Overstreet