From: NeilBrown <neilb@suse.de>
To: "J.Bruce Fields" <bfields@citi.umich.edu>
Cc: Olga Kornievskaia <aglo@citi.umich.edu>, NFS <linux-nfs@vger.kernel.org>
Subject: Re: Is tcp autotuning really what NFS wants?
Date: Wed, 10 Jul 2013 14:32:33 +1000
Message-ID: <20130710143233.77e35721@notabene.brown>
In-Reply-To: <20130710022735.GI8281@fieldses.org>
On Tue, 9 Jul 2013 22:27:35 -0400 "J.Bruce Fields" <bfields@citi.umich.edu>
wrote:
> On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> >
> > Hi,
> > I just noticed this commit:
> >
> > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > Author: Olga Kornievskaia <aglo@citi.umich.edu>
> > Date: Tue Oct 21 14:13:47 2008 -0400
> >
> > svcrpc: take advantage of tcp autotuning
> >
> >
> > which I must confess surprised me. I wonder if the full implications of
> > removing that functionality were understood.
> >
> > Previously nfsd would set the transmit buffer space for a connection to
> > ensure there is plenty to hold all replies. Now it doesn't.
> >
> > nfsd refuses to accept a request if there isn't enough space in the transmit
> > buffer to send a reply. This is important to ensure that each reply gets
> > sent atomically without blocking and there is no risk of replies getting
> > interleaved.
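For anyone following along: the check in question is the xpo_has_wspace callback, which for TCP is svc_tcp_has_wspace() in net/sunrpc/svcsock.c. From memory it does roughly this (abridged, so details may differ):

static int svc_tcp_has_wspace(struct svc_xprt *xprt)
{
        struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
        struct svc_serv *serv = svsk->sk_xprt.xpt_server;
        int required;

        if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
                return 1;
        /* space already promised to in-flight requests, plus room
         * for one more worst-case reply */
        required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
        if (sk_stream_wspace(svsk->sk_sk) >= required)
                return 1;
        set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
        return 0;
}

With autotuning, sk_stream_wspace() can now be much smaller than it used to be, which is what makes the 1M-per-reply reservation bite.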
> >
> > The server starts out with a large estimate of the reply space (1M) and for
> > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > it is much harder to estimate the space needed so it just assumes every
> > reply will require 1M of space.
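The "adjusts this down" step is svc_reserve(): once a request knows how big its reply can be, it shrinks its reservation and returns the difference to the transport-wide xpt_reserved total so other requests can get through. Paraphrasing (not quoting) the code in net/sunrpc/svc_xprt.c:

void svc_reserve(struct svc_rqst *rqstp, int space)
{
        space += rqstp->rq_res.head[0].iov_len;

        if (space < rqstp->rq_reserved) {
                struct svc_xprt *xprt = rqstp->rq_xprt;
                /* hand back the space we no longer need and poke the
                 * transport in case it was blocked waiting for wspace */
                atomic_sub((rqstp->rq_reserved - space), &xprt->xpt_reserved);
                rqstp->rq_reserved = space;

                svc_xprt_enqueue(xprt);
        }
}

So far as I remember, the v2/v3 procedures call this early with a realistic per-procedure maximum, while NFSv4 doesn't, so each compound holds the full 1M until the reply has actually been sent.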
> >
> > This means that with NFSv4, as soon as there are enough concurrent requests
> > that their 1M reservations consume the whole auto-tuned send buffer, new
> > requests on that connection will be ignored.
> >
> > This could significantly limit the amount of parallelism that can be achieved
> > for a single TCP connection (and given that the Linux client strongly prefers
> > a single connection now, this could become more of an issue).
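To put rough numbers on it: if autotuning settles on, say, a 4M send buffer for the connection, the 1M-per-reply reservation means only about four NFSv4 requests can be admitted at once on that connection, however many nfsd threads are available - and the buffer can easily be tuned smaller than that.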
>
> Worse, I believe it can deadlock completely if the transmit buffer
> shrinks too far, and people really have run into this:
>
> http://mid.gmane.org/<20130125185748.GC29596@fieldses.org>
>
> Trond's suggestion looked at the time like it might work and be doable:
>
> http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>
>
> but I dropped it.
I would probably generalise Trond's suggestion and allow "N" extra requests
through that exceed the reservation, where N is related to the number of idle
threads. A square root might be nice, but half is probably easiest (i.e. at
most one extra request for every two idle threads, which is what the patch
below does).
If any send takes more than 30 seconds the sk_sndtimeo will kick in and close
the connection, so a really bad connection won't block threads indefinitely.
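The "close the connection" half already exists, if I remember the code correctly: when a send errors out or comes up short, svc_tcp_sendto() does roughly

        sent = svc_sendto(rqstp, &rqstp->rq_res);
        if (sent != rqstp->rq_res.len) {
                /* short send or error - give up on the connection
                 * rather than risk interleaving replies */
                set_bit(XPT_CLOSE, &rqstp->rq_xprt->xpt_flags);
                svc_xprt_enqueue(rqstp->rq_xprt);
                sent = -EAGAIN;
        }

so a thread that we let through on one of these "extra" slots can be stuck for at most the sndtimeo before the transport is torn down and the thread released.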
And yes - a nice test case would be good.
What do you think of the following (totally untested - just for comment)?
NeilBrown
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index b05963f..2fc92f1 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -81,6 +81,10 @@ struct svc_xprt {
         struct net              *xpt_net;
         struct rpc_xprt         *xpt_bc_xprt;   /* NFSv4.1 backchannel */
+
+       atomic_t                xpt_extras;     /* Extra requests which
+                                                * might block on send
+                                                */
 };
 static inline void unregister_xpt_user(struct svc_xprt *xpt, struct svc_xpt_user *u)
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 80a6640..fc366ca 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -165,6 +165,7 @@ void svc_xprt_init(struct net *net, struct svc_xprt_class *xcl,
         set_bit(XPT_BUSY, &xprt->xpt_flags);
         rpc_init_wait_queue(&xprt->xpt_bc_pending, "xpt_bc_pending");
         xprt->xpt_net = get_net(net);
+       atomic_set(&xprt->xpt_extras, 0);
 }
 EXPORT_SYMBOL_GPL(svc_xprt_init);
@@ -326,13 +327,21 @@ static void svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
         list_del(&rqstp->rq_list);
 }
-static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
+static int svc_xprt_has_something_to_do(struct svc_xprt *xprt)
 {
         if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
-               return true;
-       if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
-               return xprt->xpt_ops->xpo_has_wspace(xprt);
-       return false;
+               return 1;
+       if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED))) {
+               if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
+                       if (atomic_read(&xprt->xpt_extras))
+                               atomic_set(&xprt->xpt_extras, 0);
+                       return 1;
+               } else {
+                       atomic_inc(&xprt->xpt_extras);
+                       return 2; /* admit only if enough threads are idle */
+               }
+       }
+       return 0;
 }
 /*
@@ -345,8 +354,9 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
         struct svc_pool *pool;
         struct svc_rqst *rqstp;
         int cpu;
+       int todo = svc_xprt_has_something_to_do(xprt);
-       if (!svc_xprt_has_something_to_do(xprt))
+       if (!todo)
                 return;
         cpu = get_cpu();
@@ -361,6 +371,19 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
                         "svc_xprt_enqueue: "
                         "threads and transports both waiting??\n");
+       if (todo == 2) {
+               int free_needed = atomic_read(&xprt->xpt_extras) * 2;
+               list_for_each_entry(rqstp, &pool->sp_threads, rq_list)
+                       if (--free_needed <= 0)
+                               break;
+
+               if (free_needed > 0) {
+                       /* Need more free threads before we allow this. */
+                       atomic_add_unless(&xprt->xpt_extras, -1, 0);
+                       goto out_unlock;
+               }
+       }
+
         pool->sp_stats.packets++;
         /* Mark transport as busy. It will remain in this state until
@@ -371,6 +394,8 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
         if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags)) {
                 /* Don't enqueue transport while already enqueued */
                 dprintk("svc: transport %p busy, not enqueued\n", xprt);
+               if (todo == 2)
+                       atomic_add_unless(&xprt->xpt_extras, -1, 0);
                 goto out_unlock;
         }
@@ -466,6 +491,7 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
                 printk(KERN_ERR "RPC request reserved %d but used %d\n",
                        rqstp->rq_reserved,
                        rqstp->rq_res.len);
+       atomic_add_unless(&xprt->xpt_extras, -1, 0);
         rqstp->rq_res.head[0].iov_len = 0;
         svc_reserve(rqstp, 0);