From: Kent Overstreet <koverstreet@google.com>
To: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, linux-aio@kvack.org,
linux-fsdevel@vger.kernel.org, zab@redhat.com, bcrl@kvack.org,
jmoyer@redhat.com, axboe@kernel.dk, viro@zeniv.linux.org.uk,
tytso@mit.edu, Oleg Nesterov <oleg@redhat.com>,
srivatsa.bhat@linux.vnet.ibm.com,
Christoph Lameter <cl@linux.com>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Rusty Russell <rusty@rustcorp.com.au>
Subject: Re: [PATCH 23/32] Generic dynamic per cpu refcounting
Date: Thu, 24 Jan 2013 17:13:45 -0800 [thread overview]
Message-ID: <20130125011345.GT26407@google.com> (raw)
In-Reply-To: <20130125005136.GE2373@mtj.dyndns.org>
On Thu, Jan 24, 2013 at 04:51:36PM -0800, Tejun Heo wrote:
> (cc'ing percpu / rcu crowd)
>
> Hello, Kent.
>
> On Wed, Dec 26, 2012 at 06:00:02PM -0800, Kent Overstreet wrote:
> > This implements a refcount with similar semantics to
> > atomic_get()/atomic_dec_and_test(), that starts out as just an atomic_t
> > but dynamically switches to per cpu refcounting when the rate of
> > gets/puts becomes too high.
>
> I'm not sure this is necessary. Percpu memory is expensive but not
> that expensive. Perpcu memories are tightly packed and if you
> allocate, say, 4 bytes, it's really gonna be 4 bytes for each possible
> CPU, and the number of possible CPUs is determined during boot to the
> number of CPUs the platform may have while booted. ie. On machines w/
> 8 CPU threads which don't have extra CPU sockets or don't support CPU
> hotplugging (most don't), nr_possible_cpus would exactly be 8 and you
> would be allocating 32 bytes of memory per each 4 byte percpu
> allocation.
You could be right - and I'd be just fine with a simpler version.
I was envisioning something with low enough overhead that we could use
it for the refcounts in struct file and kref/kobject - I've seen both of
those show up in profiles (admittedly with the kobject ref some of it
was stupid usage, but it'd be nice if we could just hit it with a very
big hammer and make the problem go away once and for all).
I'm not sure what the memory overhead would be like if we made all those
refcounts percpu and whether people would find it acceptable.
> Memory size usually having very strong correlation with the number of
> CPUs on the system, it usually isn't worth optimizing out percpu
> allocation like this. Especially not for a single counter.
>
> Maybe this one is way more ambitious than I think but it seems quite a
> bit over engineered.
That said, there's not that much more code in this version than the
"always percpu" version, and not really any more overhead in the fast
path (we always need the branch if we're not in percpu mode so we can
shut down).
> > It also implements two stage shutdown, as we need it to tear down the
> > percpu counts. Before dropping the initial refcount, you must call
> > percpu_ref_kill(); this puts the refcount in "shutting down mode" and
> > switches back to a single atomic refcount with the appropriate barriers
> > (synchronize_rcu()).
>
> Maybe if we have tryget() which only succeeds if the counter is alive,
> we can replace moulde refcnt with this? Rusty?
Glancing try_module_get(), it looks like that'd correspend to
if (!percpu_ref_dead(ref))
percpu_ref_get(ref);
with some synchronization. That should be easy.
> > +static void percpu_ref_alloc(struct percpu_ref *ref, unsigned __user *pcpu_count)
> > +{
> > + unsigned __percpu *new;
> > + unsigned long last = (unsigned long) pcpu_count;
> > + unsigned long now = jiffies;
> > +
> > + now <<= PCPU_STATUS_BITS;
> > + now |= PCPU_REF_NONE;
> > +
> > + if (now - last <= HZ << PCPU_STATUS_BITS) {
> > + rcu_read_unlock();
> > + new = alloc_percpu(unsigned);
> > + rcu_read_lock();
>
> I suppose RCU is used to make sure the dying status is visible while
> trying to drain percpu counters?
Precisely.
> Requiring rcu locking for refcnt is
> very unusure and it would probably be better to use
> synchronize_sched[_expedited]() instead in combination w/ preemp or
> irq flipping.
I haven't come across synchronize_sched() before - is it less overhead
than synchronize_rcu()?
>
> > + if (!new)
> > + goto update_time;
> > +
> > + BUG_ON(((unsigned long) new) & PCPU_STATUS_MASK);
> > +
> > + if (cmpxchg(&ref->pcpu_count, pcpu_count, new) != pcpu_count)
> > + free_percpu(new);
> > + else
> > + pr_debug("created");
> > + } else {
> > +update_time: new = (void *) now;
> > + cmpxchg(&ref->pcpu_count, pcpu_count, new);
> > + }
> > +}
>
> The above function needs a lot more documentation on synchronization
> and just general operation.
I had a quite a bit of code documentation when I was writing the rest of
the documentation for Andrew, and then I flubbed git and lost it. Doh.
I'll rewrite it while it's on my mind, I suppose.
> Overall, I don't know. It feels quite over-engineered. I really
> don't think dynamic allocation is justified.
Dynamic allocation doesn't add that much complexity, imo. For something
small and generic, I think it's worthwhile if it means it can get used
more. But like I said I don't have a strong opinion on the memory
overhead vs. complexity.
> It also makes the
> interface prone to misuse. It'll be easy to have mixed alloc and
> noalloc sites and then lose alloc ones or just foget about the
> distinction and end up with refcnts which never convert to percpu one
> and there will be no way to easily identify those.
This is true, I'm not a huge fan of the interface.
The way percpu_ref_get() drops and retakes rcu_read_lock() is definitely
ugly. I had an idea when I was last looking at the code for that -
percpu_ref_get() could merely return whether the user should call
percpu_ref_alloc(), and then the caller can do that in the appropriate
context (or skip it if it's in a slowpath and can't).
This would also mean that users could just unconditionally call
percpu_ref_alloc() (or have an init function that does that too).
Just given that the code works and is tested I wasn't in a huge hurry to
screw with it more - sort of prefer to wait and see how it gets used.
Anyways, we can definitely change it.
--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org. For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>
WARNING: multiple messages have this Message-ID (diff)
From: Kent Overstreet <koverstreet@google.com>
To: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, linux-aio@kvack.org,
linux-fsdevel@vger.kernel.org, zab@redhat.com, bcrl@kvack.org,
jmoyer@redhat.com, axboe@kernel.dk, viro@zeniv.linux.org.uk,
tytso@mit.edu, Oleg Nesterov <oleg@redhat.com>,
srivatsa.bhat@linux.vnet.ibm.com,
Christoph Lameter <cl@linux.com>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Rusty Russell <rusty@rustcorp.com.au>
Subject: Re: [PATCH 23/32] Generic dynamic per cpu refcounting
Date: Thu, 24 Jan 2013 17:13:45 -0800 [thread overview]
Message-ID: <20130125011345.GT26407@google.com> (raw)
In-Reply-To: <20130125005136.GE2373@mtj.dyndns.org>
On Thu, Jan 24, 2013 at 04:51:36PM -0800, Tejun Heo wrote:
> (cc'ing percpu / rcu crowd)
>
> Hello, Kent.
>
> On Wed, Dec 26, 2012 at 06:00:02PM -0800, Kent Overstreet wrote:
> > This implements a refcount with similar semantics to
> > atomic_get()/atomic_dec_and_test(), that starts out as just an atomic_t
> > but dynamically switches to per cpu refcounting when the rate of
> > gets/puts becomes too high.
>
> I'm not sure this is necessary. Percpu memory is expensive but not
> that expensive. Perpcu memories are tightly packed and if you
> allocate, say, 4 bytes, it's really gonna be 4 bytes for each possible
> CPU, and the number of possible CPUs is determined during boot to the
> number of CPUs the platform may have while booted. ie. On machines w/
> 8 CPU threads which don't have extra CPU sockets or don't support CPU
> hotplugging (most don't), nr_possible_cpus would exactly be 8 and you
> would be allocating 32 bytes of memory per each 4 byte percpu
> allocation.
You could be right - and I'd be just fine with a simpler version.
I was envisioning something with low enough overhead that we could use
it for the refcounts in struct file and kref/kobject - I've seen both of
those show up in profiles (admittedly with the kobject ref some of it
was stupid usage, but it'd be nice if we could just hit it with a very
big hammer and make the problem go away once and for all).
I'm not sure what the memory overhead would be like if we made all those
refcounts percpu and whether people would find it acceptable.
> Memory size usually having very strong correlation with the number of
> CPUs on the system, it usually isn't worth optimizing out percpu
> allocation like this. Especially not for a single counter.
>
> Maybe this one is way more ambitious than I think but it seems quite a
> bit over engineered.
That said, there's not that much more code in this version than the
"always percpu" version, and not really any more overhead in the fast
path (we always need the branch if we're not in percpu mode so we can
shut down).
> > It also implements two stage shutdown, as we need it to tear down the
> > percpu counts. Before dropping the initial refcount, you must call
> > percpu_ref_kill(); this puts the refcount in "shutting down mode" and
> > switches back to a single atomic refcount with the appropriate barriers
> > (synchronize_rcu()).
>
> Maybe if we have tryget() which only succeeds if the counter is alive,
> we can replace moulde refcnt with this? Rusty?
Glancing try_module_get(), it looks like that'd correspend to
if (!percpu_ref_dead(ref))
percpu_ref_get(ref);
with some synchronization. That should be easy.
> > +static void percpu_ref_alloc(struct percpu_ref *ref, unsigned __user *pcpu_count)
> > +{
> > + unsigned __percpu *new;
> > + unsigned long last = (unsigned long) pcpu_count;
> > + unsigned long now = jiffies;
> > +
> > + now <<= PCPU_STATUS_BITS;
> > + now |= PCPU_REF_NONE;
> > +
> > + if (now - last <= HZ << PCPU_STATUS_BITS) {
> > + rcu_read_unlock();
> > + new = alloc_percpu(unsigned);
> > + rcu_read_lock();
>
> I suppose RCU is used to make sure the dying status is visible while
> trying to drain percpu counters?
Precisely.
> Requiring rcu locking for refcnt is
> very unusure and it would probably be better to use
> synchronize_sched[_expedited]() instead in combination w/ preemp or
> irq flipping.
I haven't come across synchronize_sched() before - is it less overhead
than synchronize_rcu()?
>
> > + if (!new)
> > + goto update_time;
> > +
> > + BUG_ON(((unsigned long) new) & PCPU_STATUS_MASK);
> > +
> > + if (cmpxchg(&ref->pcpu_count, pcpu_count, new) != pcpu_count)
> > + free_percpu(new);
> > + else
> > + pr_debug("created");
> > + } else {
> > +update_time: new = (void *) now;
> > + cmpxchg(&ref->pcpu_count, pcpu_count, new);
> > + }
> > +}
>
> The above function needs a lot more documentation on synchronization
> and just general operation.
I had a quite a bit of code documentation when I was writing the rest of
the documentation for Andrew, and then I flubbed git and lost it. Doh.
I'll rewrite it while it's on my mind, I suppose.
> Overall, I don't know. It feels quite over-engineered. I really
> don't think dynamic allocation is justified.
Dynamic allocation doesn't add that much complexity, imo. For something
small and generic, I think it's worthwhile if it means it can get used
more. But like I said I don't have a strong opinion on the memory
overhead vs. complexity.
> It also makes the
> interface prone to misuse. It'll be easy to have mixed alloc and
> noalloc sites and then lose alloc ones or just foget about the
> distinction and end up with refcnts which never convert to percpu one
> and there will be no way to easily identify those.
This is true, I'm not a huge fan of the interface.
The way percpu_ref_get() drops and retakes rcu_read_lock() is definitely
ugly. I had an idea when I was last looking at the code for that -
percpu_ref_get() could merely return whether the user should call
percpu_ref_alloc(), and then the caller can do that in the appropriate
context (or skip it if it's in a slowpath and can't).
This would also mean that users could just unconditionally call
percpu_ref_alloc() (or have an init function that does that too).
Just given that the code works and is tested I wasn't in a huge hurry to
screw with it more - sort of prefer to wait and see how it gets used.
Anyways, we can definitely change it.
next prev parent reply other threads:[~2013-01-25 1:13 UTC|newest]
Thread overview: 152+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-12-27 1:59 [PATCH 00/32] AIO performance improvements/cleanups, v3 Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 01/32] mm: remove old aio use_mm() comment Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 02/32] aio: remove dead code from aio.h Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 03/32] gadget: remove only user of aio retry Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 04/32] aio: remove retry-based AIO Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-29 7:36 ` Hillf Danton
2012-12-29 7:36 ` Hillf Danton
2013-01-07 22:12 ` Kent Overstreet
2013-01-07 22:12 ` Kent Overstreet
2012-12-29 7:47 ` Hillf Danton
2012-12-29 7:47 ` Hillf Danton
2013-01-07 22:15 ` Kent Overstreet
2013-01-07 22:15 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 05/32] char: add aio_{read,write} to /dev/{null,zero} Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 06/32] aio: Kill return value of aio_complete() Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 07/32] aio: kiocb_cancel() Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 08/32] aio: Move private stuff out of aio.h Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 09/32] aio: dprintk() -> pr_debug() Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 10/32] aio: do fget() after aio_get_req() Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 11/32] aio: Make aio_put_req() lockless Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 12/32] aio: Refcounting cleanup Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 13/32] wait: Add wait_event_hrtimeout() Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 10:37 ` Fubo Chen
2012-12-27 10:37 ` Fubo Chen
2013-01-03 23:08 ` Andrew Morton
2013-01-03 23:08 ` Andrew Morton
2013-01-08 0:09 ` Kent Overstreet
2013-01-08 0:09 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 14/32] aio: Make aio_read_evt() more efficient, convert to hrtimers Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2013-01-03 23:19 ` Andrew Morton
2013-01-03 23:19 ` Andrew Morton
2013-01-08 0:28 ` Kent Overstreet
2013-01-08 0:28 ` Kent Overstreet
2013-01-08 1:00 ` Andrew Morton
2013-01-08 1:00 ` Andrew Morton
2013-01-08 1:28 ` Kent Overstreet
2013-01-08 1:28 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 15/32] aio: Use flush_dcache_page() Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 16/32] aio: Use cancellation list lazily Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 17/32] aio: Change reqs_active to include unreaped completions Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 18/32] aio: Kill batch allocation Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 19/32] aio: Kill struct aio_ring_info Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2012-12-27 1:59 ` [PATCH 20/32] aio: Give shared kioctx fields their own cachelines Kent Overstreet
2012-12-27 1:59 ` Kent Overstreet
2013-01-03 23:25 ` Andrew Morton
2013-01-03 23:25 ` Andrew Morton
2013-01-07 23:48 ` Kent Overstreet
2013-01-07 23:48 ` Kent Overstreet
2012-12-27 2:00 ` [PATCH 21/32] aio: reqs_active -> reqs_available Kent Overstreet
2012-12-27 2:00 ` Kent Overstreet
2012-12-27 2:00 ` [PATCH 22/32] aio: percpu reqs_available Kent Overstreet
2012-12-27 2:00 ` Kent Overstreet
2012-12-27 2:00 ` [PATCH 23/32] Generic dynamic per cpu refcounting Kent Overstreet
2012-12-27 2:00 ` Kent Overstreet
2013-01-03 22:48 ` Andrew Morton
2013-01-03 22:48 ` Andrew Morton
2013-01-07 23:47 ` Kent Overstreet
2013-01-07 23:47 ` Kent Overstreet
2013-01-08 1:03 ` [PATCH] percpu-refcount: Sparse fixes Kent Overstreet
2013-01-08 1:03 ` Kent Overstreet
2013-01-25 0:51 ` [PATCH 23/32] Generic dynamic per cpu refcounting Tejun Heo
2013-01-25 0:51 ` Tejun Heo
2013-01-25 1:13 ` Kent Overstreet [this message]
2013-01-25 1:13 ` Kent Overstreet
2013-01-25 2:03 ` Tejun Heo
2013-01-25 2:03 ` Tejun Heo
2013-01-25 2:09 ` Tejun Heo
2013-01-25 2:09 ` Tejun Heo
2013-01-28 17:48 ` Kent Overstreet
2013-01-28 17:48 ` Kent Overstreet
2013-01-28 18:18 ` Tejun Heo
2013-01-28 18:18 ` Tejun Heo
2013-01-25 6:15 ` Rusty Russell
2013-01-28 17:53 ` Kent Overstreet
2013-01-28 17:53 ` Kent Overstreet
2013-01-28 17:59 ` Tejun Heo
2013-01-28 17:59 ` Tejun Heo
2013-01-28 18:32 ` Kent Overstreet
2013-01-28 18:32 ` Kent Overstreet
2013-01-28 18:57 ` Christoph Lameter
2013-01-28 18:57 ` Christoph Lameter
2013-02-08 14:44 ` Tejun Heo
2013-02-08 14:44 ` Tejun Heo
2013-02-08 14:49 ` Jens Axboe
2013-02-08 14:49 ` Jens Axboe
2013-02-08 17:50 ` Andrew Morton
2013-02-08 17:50 ` Andrew Morton
2013-02-08 21:27 ` Kent Overstreet
2013-02-08 21:27 ` Kent Overstreet
2013-02-11 14:21 ` Jeff Moyer
2013-02-11 14:21 ` Jeff Moyer
2013-02-08 21:17 ` Kent Overstreet
2013-02-08 21:17 ` Kent Overstreet
2012-12-27 2:00 ` [PATCH 24/32] aio: Percpu ioctx refcount Kent Overstreet
2012-12-27 2:00 ` Kent Overstreet
2012-12-27 2:00 ` [PATCH 25/32] aio: use xchg() instead of completion_lock Kent Overstreet
2012-12-27 2:00 ` Kent Overstreet
2013-01-03 23:34 ` Andrew Morton
2013-01-07 23:21 ` Kent Overstreet
2013-01-07 23:21 ` Kent Overstreet
2013-01-07 23:35 ` Andrew Morton
2013-01-07 23:35 ` Andrew Morton
2013-01-08 0:01 ` Kent Overstreet
2013-01-08 0:01 ` Kent Overstreet
2012-12-27 2:00 ` [PATCH 26/32] aio: Don't include aio.h in sched.h Kent Overstreet
2012-12-27 2:00 ` Kent Overstreet
2012-12-27 2:00 ` [PATCH 27/32] aio: Kill ki_key Kent Overstreet
2012-12-27 2:00 ` Kent Overstreet
2012-12-27 2:00 ` [PATCH 28/32] aio: Kill ki_retry Kent Overstreet
2012-12-27 2:00 ` Kent Overstreet
2012-12-27 2:00 ` [PATCH 29/32] block, aio: Batch completion for bios/kiocbs Kent Overstreet
2012-12-27 2:00 ` Kent Overstreet
2013-01-04 9:22 ` Jens Axboe
2013-01-04 9:22 ` Jens Axboe
2013-01-07 23:34 ` Kent Overstreet
2013-01-07 23:34 ` Kent Overstreet
2013-01-08 15:33 ` Jeff Moyer
2013-01-08 15:33 ` Jeff Moyer
2013-01-08 16:06 ` Kent Overstreet
2013-01-08 16:06 ` Kent Overstreet
2013-01-08 16:15 ` Jeff Moyer
2013-01-08 16:15 ` Jeff Moyer
2013-01-08 16:48 ` Kent Overstreet
2013-01-08 16:48 ` Kent Overstreet
2012-12-27 2:00 ` [PATCH 30/32] virtio-blk: Convert to batch completion Kent Overstreet
2012-12-27 2:00 ` Kent Overstreet
2012-12-27 2:00 ` [PATCH 31/32] mtip32xx: " Kent Overstreet
2012-12-27 2:00 ` Kent Overstreet
2012-12-27 2:00 ` [PATCH 32/32] aio: Smoosh struct kiocb Kent Overstreet
2012-12-27 2:00 ` Kent Overstreet
2013-01-04 9:22 ` [PATCH 00/32] AIO performance improvements/cleanups, v3 Jens Axboe
2013-01-04 9:22 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130125011345.GT26407@google.com \
--to=koverstreet@google.com \
--cc=axboe@kernel.dk \
--cc=bcrl@kvack.org \
--cc=cl@linux.com \
--cc=jmoyer@redhat.com \
--cc=linux-aio@kvack.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=oleg@redhat.com \
--cc=paulmck@linux.vnet.ibm.com \
--cc=rusty@rustcorp.com.au \
--cc=srivatsa.bhat@linux.vnet.ibm.com \
--cc=tj@kernel.org \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
--cc=zab@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.