Linux NFS development
From: Jeff Layton <jeff.layton@primarydata.com>
To: Jeff Layton <jeff.layton@primarydata.com>
Cc: Trond Myklebust <trondmy@gmail.com>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	Chris Worley <chris.worley@primarydata.com>,
	linux-nfs@vger.kernel.org
Subject: Re: [PATCH 3/4] sunrpc: convert to lockless lookup of queued server threads
Date: Tue, 2 Dec 2014 07:14:22 -0500
Message-ID: <20141202071422.5b01585d@tlielax.poochiereds.net>
In-Reply-To: <20141202065750.283704a7@tlielax.poochiereds.net>

On Tue, 2 Dec 2014 06:57:50 -0500
Jeff Layton <jeff.layton@primarydata.com> wrote:

> On Mon, 1 Dec 2014 19:38:19 -0500
> Trond Myklebust <trondmy@gmail.com> wrote:
> 
> > On Mon, Dec 1, 2014 at 6:47 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > > On Fri, Nov 21, 2014 at 02:19:30PM -0500, Jeff Layton wrote:
> > >> Testing has shown that the pool->sp_lock can be a bottleneck on a busy
> > >> server. Every time data is received on a socket, the server must take
> > >> that lock in order to dequeue a thread from the sp_threads list.
> > >>
> > >> Address this problem by eliminating the sp_threads list (which contains
> > >> threads that are currently idle) and replacing it with a RQ_BUSY flag in
> > >> svc_rqst. This allows us to walk the sp_all_threads list under the
> > >> rcu_read_lock and find a suitable thread for the xprt by doing a
> > >> test_and_set_bit.
> > >>
> > >> Note that we do still have a potential atomicity problem however with
> > >> this approach.  We don't want svc_xprt_do_enqueue to set the
> > >> rqst->rq_xprt pointer unless a test_and_set_bit of RQ_BUSY returned
> > >> negative (which indicates that the thread was idle). But, by the time we
> > >> check that, the big could be flipped by a waking thread.
> > >
> > > (Nits: replacing "negative" by "zero" and "big" by "bit".)
> > >
> 

Sorry, hit send too quickly...

Thanks for fixing those.


> > >> To address this, we acquire a new per-rqst spinlock (rq_lock) and take
> > >> that before doing the test_and_set_bit. If that returns false, then we
> > >> can set rq_xprt and drop the spinlock. Then, when the thread wakes up,
> > >> it must set the bit under the same spinlock and can trust that if it was
> > >> already set then the rq_xprt is also properly set.
> > >>
> > >> With this scheme, the case where we have an idle thread no longer needs
> > >> to take the highly contended pool->sp_lock at all, and that removes the
> > >> bottleneck.
> > >>
> > >> That still leaves one issue: What of the case where we walk the whole
> > >> sp_all_threads list and don't find an idle thread? Because the search is
> > >> lockless, it's possible for the queueing to race with a thread that is
> > >> going to sleep. To address that, we queue the xprt and then search again.
> > >>
> > >> If we find an idle thread at that point, we can't attach the xprt to it
> > >> directly since that might race with a different thread waking up and
> > >> finding it.  All we can do is wake the idle thread back up and let it
> > >> attempt to find the now-queued xprt.
> > >
> > > I find it hard to think about how we expect this to affect performance.
> > > So it comes down to the observed results, I guess, but just trying to
> > > get an idea:
> > >
> > >         - this eliminates sp_lock.  I think the original idea here was
> > >           that if interrupts could be routed correctly then there
> > >           shouldn't normally be cross-cpu contention on this lock.  Do
> > >           we understand why that didn't pan out?  Is hardware capable of
> > >           doing this really rare, or is it just too hard to configure it
> > >           correctly?
> > 
> > One problem is that a 1MB incoming write will generate a lot of
> > interrupts. While that is not so noticeable on a 1GigE network, it is
> > on a 40GigE network. The other thing you should note is that this
> > workload was generated with ~100 clients pounding on that server, so
> > there are a fair number of TCP connections to service in parallel.
> > Playing with the interrupt routing doesn't necessarily help you so
> > much when all those connections are hot.
> > 

In principle though, the percpu pool_mode should have alleviated the
contention on the sp_lock. When an interrupt comes in, the xprt gets
queued to its pool. If there is a pool for each cpu then there should
be no sp_lock contention. The pernode pool mode might also have
alleviated the lock contention to a lesser degree in a NUMA
configuration.

Do we understand why that didn't help?

In any case, I think doing this with RCU is still preferable. We're
walking a very short list, so a lockless walk should improve
performance without requiring the percpu pool_mode.
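
To make the scheme from the patch description concrete, here is a rough
userspace model of the "claim an idle thread" step. It is only a sketch
and not the kernel code: struct rqst, struct xprt, rqst_init(),
assign_xprt(), thread_wake() and thread_idle() are hypothetical names,
a plain array stands in for walking sp_all_threads under
rcu_read_lock(), and C11 atomics stand in for test_and_set_bit() and
the rq_lock spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for svc_xprt and svc_rqst, not the kernel structs. */
struct xprt { int id; };

struct rqst {
	atomic_bool busy;	/* models the RQ_BUSY bit */
	atomic_flag lock;	/* models rq_lock as a tiny spinlock */
	struct xprt *xprt;	/* models rq_xprt */
};

static void rqst_init(struct rqst *r)
{
	atomic_init(&r->busy, false);
	atomic_flag_clear(&r->lock);
	r->xprt = NULL;
}

static void rq_lock(struct rqst *r)
{
	while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

static void rq_unlock(struct rqst *r)
{
	atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/* Enqueue side: hand x to the first idle thread, or return NULL if none. */
static struct rqst *assign_xprt(struct rqst *threads, size_t n, struct xprt *x)
{
	for (size_t i = 0; i < n; i++) {
		struct rqst *r = &threads[i];

		if (atomic_load(&r->busy))
			continue;	/* cheap unlocked peek */
		rq_lock(r);
		if (!atomic_exchange(&r->busy, true)) {
			r->xprt = x;	/* bit was clear: thread is ours */
			rq_unlock(r);
			return r;
		}
		rq_unlock(r);	/* lost the race with a waking thread */
	}
	return NULL;	/* caller would queue x and then search again */
}

/* Thread side: on wakeup, set the bit under the same lock. If it was
 * already set, an enqueuer claimed us and xprt can be trusted. */
static struct xprt *thread_wake(struct rqst *r)
{
	struct xprt *x = NULL;

	rq_lock(r);
	if (atomic_exchange(&r->busy, true))
		x = r->xprt;
	rq_unlock(r);
	return x;
}

/* Thread side: drop any finished xprt and mark ourselves idle again. */
static void thread_idle(struct rqst *r)
{
	rq_lock(r);
	r->xprt = NULL;
	rq_unlock(r);
	atomic_store(&r->busy, false);
}
```

The point of the per-thread lock is only to make "set the bit" and
"set the xprt pointer" appear as one step to the waking thread. The
idle-thread path never touches a pool-wide lock.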

> > >         - instead we're walking the list of all threads looking for an
> > >           idle one.  I suppose that's typically not more than a few
> > >           hundred.  Does this being fast depend on the fact that that
> > >           list is almost never changed?  Should we be rearranging
> > >           svc_rqst so frequently-written fields aren't nearby?
> > 
> > Given a 64-byte cache line, that is 8 pointers worth on a 64-bit processor.
> > 
> > - rq_all, rq_server, rq_pool, rq_task don't ever change, so perhaps
> > shove them together into the same cacheline?
> > 
> > - rq_xprt does get set often until we have a full RPC request worth of
> > data, so perhaps consider moving that.
> > 
> > - OTOH, rq_addr, rq_addrlen, rq_daddr, rq_daddrlen are only set once
> > we have a full RPC to process, and then keep their values until that
> > RPC call is finished. That doesn't look too bad.
> > 

That sounds reasonable to me.
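
Something along these lines, perhaps. This is only a layout sketch
under the 64-byte cache line assumption above, not a proposed patch:
struct rqst_layout, struct list_node and CACHELINE are made-up names,
the void pointers stand in for the real field types, and rq_flags is
shown only as an example of a frequently written field.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE 64	/* assumption: 64-byte cache lines */

/* Stand-in for struct list_head. */
struct list_node { struct list_node *next, *prev; };

struct rqst_layout {
	/* Read-mostly fields: set at thread creation, then left alone. */
	alignas(CACHELINE) struct list_node rq_all;
	void *rq_server;
	void *rq_pool;
	void *rq_task;

	/* Frequently written fields start on their own cache line so
	 * that writes here do not invalidate the line walked above. */
	alignas(CACHELINE) void *rq_xprt;
	unsigned long rq_flags;
};

/* The read-mostly group must fit inside the first cache line. */
static_assert(offsetof(struct rqst_layout, rq_task) + sizeof(void *)
	      <= CACHELINE, "read-mostly fields spill past one line");

/* The hot fields must begin on a cache line boundary. */
static_assert(offsetof(struct rqst_layout, rq_xprt) % CACHELINE == 0,
	      "rq_xprt shares a line with read-mostly fields");
```

The static_asserts make the compiler complain if someone later adds a
field that pushes the read-mostly group across a line boundary.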

-- 
Jeff Layton <jlayton@primarydata.com>
