From: Rusty Russell <rusty@rustcorp.com.au>
To: Kent Overstreet <koverstreet@google.com>, Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-aio@kvack.org, akpm@linux-foundation.org,
	Zach Brown <zab@redhat.com>, Felipe Balbi <balbi@ti.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Mark Fasheh <mfasheh@suse.com>, Joel Becker <jlbec@evilplan.org>,
	Jens Axboe <axboe@kernel.dk>,
	Asai Thambi S P <asamymuthupa@micron.com>,
	Selvan Mani <smani@micron.com>,
	Sam Bradshaw <sbradshaw@micron.com>,
	Jeff Moyer <jmoyer@redhat.com>, Al Viro <viro@zeniv.linux.org.uk>,
	Benjamin LaHaise <bcrl@kvack.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Christoph Lameter <cl@linux-foundation.org>,
	Ingo Molnar <mingo@redhat.com>
Subject: Re: [PATCH 04/21] Generic percpu refcounting
Date: Wed, 29 May 2013 14:29:56 +0930
Message-ID: <87hahmmldf.fsf@rustcorp.com.au>
In-Reply-To: <20130528234728.GB2291@google.com>

Kent Overstreet <koverstreet@google.com> writes:
> On Wed, May 15, 2013 at 10:37:20AM -0700, Tejun Heo wrote:
>> Can you please expand on it a bit and, more importantly, describe
>> within what limits it's safe?  This should be safe as long as the actual sum
>> of refcnts given out doesn't overflow the original type, right?
>
> Precisely.
>
>> It'd be great if that is explained clearly in a more intuitive way.  The
>> only actual explanation above is "modular arithmetic is commutative"
>> which is a very compact way to put it and I really think it deserves
>> an easier explanation.
>
> I'm not sure I know of any good way of explaining it intuitively, but
> here's this at least...
>
>  * (More precisely: because modular arithmetic is commutative the sum of all the
>  * pcpu_count vars will be equal to what it would have been if all the gets and
>  * puts were done to a single integer, even if some of the percpu integers
>  * overflow or underflow).

This seems intuitively obvious, so I wouldn't sweat it too much.  What
goes up, has to come down somewhere.
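
A minimal user-space sketch of that point, not code from the patch.  It
assumes a fixed, hypothetical NR_CPUS of 4 and uses uint32_t so that
wraparound is well defined in plain C; pcpu_sum() and skewed_count() are
names made up for the illustration.  All gets land on one CPU and all puts
on another, so the second counter underflows, yet the sum still comes out
as gets minus puts:

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4	/* hypothetical CPU count, for the sketch only */

/* Sum the per-CPU counters using unsigned (mod 2^32) arithmetic. */
static uint32_t pcpu_sum(const uint32_t counts[NR_CPUS])
{
	uint32_t sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += counts[cpu];
	return sum;
}

/*
 * Take `gets` references on CPU 0 and drop `puts` of them on CPU 1.
 * CPU 1's counter wraps below zero, but the wrapped values cancel
 * when everything is added up.
 */
static uint32_t skewed_count(uint32_t gets, uint32_t puts)
{
	uint32_t counts[NR_CPUS] = { 0 };

	for (uint32_t i = 0; i < gets; i++)
		counts[0]++;
	for (uint32_t i = 0; i < puts; i++)
		counts[1]--;
	return pcpu_sum(counts);
}
```

With 5 gets and 3 puts, CPU 1 holds a wrapped value close to 2^32, and
skewed_count(5, 3) still returns 2.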

>> > > Are we sure this is enough?  1<<31 is a fairly large number but it's
>> > > just easy enough to breach from time to time and it's gonna be hellish
>> > > to reproduce / debug when it actually overflows.  Maybe we want
>> > > atomic64_t w/ 1LLU << 63 bias?  Or is there something else which
>> > > guarantees that the bias can't over/underflow?
>> > 
>> > Well, it has the effect of halving the usable range of the refcount,
>> > which I think is probably ok - the thing is, the range of an atomic_t
>> > doesn't really correspond to anything useful on 64 bit machines so if
>> > you're concerned about overflow you probably need to be using an
>> > atomic_long_t. That is, if 32 bits is big enough 31 bits probably is
>> > too.
>> 
>> I'm not worrying about the total refcnt overflowing 31 bits, that's
>> fine.  What I'm worried about is the percpu refs having systematic
>> drift (got on certain cpus and put on others), and the total counter
>> being overflowed while percpu draining is in progress.  To me, the
>> problem is that the bias which tags that draining in progress can be
>> overflowed by percpu refs.  The summing can be the same but the tagging
>> should be put where summing can't overflow it.  It'd be great if you
>> can explain in the comment in what range it's safe and why, because
>> that'd make the limits clear to both you and other people reading the
>> code and would help a lot in deciding whether it's safe enough.
>
> (This is why I initially didn't (don't) like the bias method, it makes
> things harder to reason about).
>
> The fact that the counter is percpu is irrelevant w.r.t. the bias; we
> sum all the percpu counters up before adding them to the atomic counter
> and subtracting the bias, so when we go to add the percpu counters it's
> no different from if the percpu counter was a single integer all along.
>
> So there's only two counters we're adding together; there's the percpu
> counter (just think of it as a single integer) that we started out
> using, but then at some point in time we start applying the gets and
> puts to the atomic counter.
>
> Note that there's no systemic drift here; at time t all the gets and
> puts were happening to one counter, and then at time t+1 they switch to
> a different counter.
>
> We know the sum of the counters will be positive (again, because modular
> arithmetic is still commutative; when we sum the counters it's as if
> there was a single counter all along) but that doesn't mean either of
> the individual counters can't be negative.
>
> (Actually, unless I'm mistaken, in this version the percpu counter can
> never go negative - it definitely could with dynamic percpu allocation,
> as you need an atomic_t -> percpu transition when the atomic_t was > 0
> for the percpu counter to go negative; but in this version we start out
> using the percpu counters and the atomic_t at 0 (ignoring for the moment
> the bias and the initial ref).)
>
> So, the sum must be positive but the atomic_t could be negative. How
> negative?
>
> We can't do a get() to the percpu counters after we've seen that the ref
> is no longer in percpu mode - so after we've done one put to the
> atomic_t we can do more puts to atomic_t (or gets to the atomic_t) but
> we can't do a get to the percpu counter.
>
> And we can't do more puts than there have been gets - because the sum
> can't be negative. So the most puts() we can do at any given time is the
> real count, or sum of the percpu ref and atomic_t.
>
> Therefore, the amount the atomic_t can go negative is bounded by the
> maximum value of the refcount.
>
> So if we say arbitrarily that the maximum legal value of the refcount is
> - say - 1U << 31, then the atomic_t will always be greater than
> -((int) (1U << 31)).
>
> So as long as the total number of outstanding refs never exceeds the
> bias we're fine.

Yes.  We should note the 31 bit limit somewhere.  We could WARN_ON() if
count is >= BIAS in percpu_ref_kill(), perhaps.

>> I probably should have made it clearer.  Sorry about that.  tryget()
>> is fine.  I was curious about count() because a query interface that
>> is racy and can return something unexpected, like a false zero or an
>> underflowed refcnt, is always a bit dangerous.
>
> Yeah, it is, it was intended just for the module code where it's only
> used for the value lsmod shows.

Open code it there?

>> Let's just have percpu_ref_kill(ref, release) which puts the base ref
>> and invokes release whenever it's done.
>
> Release has to be stored in struct percpu_ref so it can be invoked
> after a call_rcu() (percpu_ref_kill -> call_rcu() ->
> percpu_ref_kill_rcu() -> percpu_ref_put()) so I'm passing it to
> percpu_ref_init(), but yeah.

Or hand it to percpu_ref_put(), too, as per kref_put().  I hate indirect
magic.
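
For comparison, a small user-space sketch of the kref_put() calling
convention, where the caller names the release function at the put site.
struct simple_ref, simple_ref_*(), obj_release() and the `released`
counter are all hypothetical, and the count is a plain int rather than an
atomic, so this only illustrates the shape of the interface:

```c
#include <assert.h>
#include <stdbool.h>

struct simple_ref {
	int count;	/* plain int for the sketch; real code needs atomics */
};

static void simple_ref_init(struct simple_ref *ref)
{
	ref->count = 1;		/* the initial reference */
}

static void simple_ref_get(struct simple_ref *ref)
{
	ref->count++;
}

/*
 * Drop a reference.  The release callback is passed in here, as with
 * kref_put(), instead of being stored in the ref at init time.
 * Returns true if this call released the object.
 */
static bool simple_ref_put(struct simple_ref *ref,
			   void (*release)(struct simple_ref *ref))
{
	if (--ref->count == 0) {
		release(ref);
		return true;
	}
	return false;
}

/* Example release function; it only records that it ran. */
static int released;

static void obj_release(struct simple_ref *ref)
{
	(void)ref;
	released++;
}
```

Every put site then shows which function frees the object, at the cost of
repeating the callback name at each call.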

Cheers,
Rusty.
