From: Jeff Moyer <jmoyer@redhat.com>
To: netdev@vger.kernel.org, Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <jens.axboe@oracle.com>,
	linux-kernel@vger.kernel.org, "Rafael J. Wysocki" <rjw@sisk.pl>,
	Olga Kornievskaia <aglo@citi.umich.edu>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	Jim Rees <rees@umich.edu>,
	linux-nfs@vger.kernel.org
Subject: Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
Date: Wed, 13 May 2009 15:29:09 -0400
Message-ID: <x49k54ku7i2.fsf@segfault.boston.devel.redhat.com>
In-Reply-To: <x49y6t1rqw0.fsf@segfault.boston.devel.redhat.com> (Jeff Moyer's message of "Wed, 13 May 2009 10:58:39 -0400")

Hi, netdev folks.  The summary here is:

A patch added in the 2.6.30 development cycle caused a performance
regression in my NFS iozone testing.  The patch in question is the
following:

commit 47a14ef1af48c696b214ac168f056ddc79793d0e
Author: Olga Kornievskaia <aglo@citi.umich.edu>
Date:   Tue Oct 21 14:13:47 2008 -0400

    svcrpc: take advantage of tcp autotuning
 
which is also quoted below.  Using 8 nfsd threads, a single client doing
2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
KB/s under 2.6.30-rc4.  I also see more run to run variation under
2.6.30-rc4 using the deadline I/O scheduler on the server.  That
variation disappears (as does the performance regression) when reverting
the above commit.

Olga had the following to say about the regression:

>> On Wed, 13 May 2009 12:20:57 -0400 Olga Kornievskaia <aglo@citi.umich.edu> wrote:
>>
>>> I believe what you are seeing is how well TCP autotuning performs.
>>> What the old NFS code did was disable autotuning and instead use the
>>> number of nfsd threads to scale the TCP receive window.  You are
>>> providing an example where setting TCP buffer sizes outperforms TCP
>>> autotuning.  While this is a valid example, there is also an
>>> alternative example where the old NFS design hurts performance.
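
For the netdev folks who haven't seen the sunrpc code: before the patch,
the server pinned sk_sndbuf/sk_rcvbuf and set SOCK_SNDBUF_LOCK and
SOCK_RCVBUF_LOCK in sk_userlocks (you can see this in the diff quoted
below).  The userspace equivalent of that is just a setsockopt(); here
is a minimal sketch (the 256k value is only an example, not a
recommendation):

    /* Pinning SO_RCVBUF from userspace sets SOCK_RCVBUF_LOCK in the
     * kernel, which disables receive autotuning for this socket --
     * the same effect the old sunrpc code achieved in-kernel. */
    #include <stdio.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int sz = 256 * 1024;            /* example size */
        socklen_t len = sizeof(sz);

        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));
        /* the kernel stores (and reports) double the requested
         * value, capped by net.core.rmem_max */
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, &len);
        printf("SO_RCVBUF is now %d\n", sz);
        return 0;
    }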

> We realize that decreased performance is a problem and understand that
> reverting the patch might be the appropriate course of action!

> But we are curious why this is happening.  Jeff, if it's not too much
> trouble, could you generate tcpdumps for both cases?  We are curious
> what the max window sizes are in each case.  Also, could you give us
> the TCP and network sysctl values for the testing environment (both
> client and server), which you can get with "sysctl -a | grep tcp" and
> "sysctl -a | grep net.core".

The requested data can be found here:
http://people.redhat.com/jmoyer/iozone-regression.tar

> Poor performance using TCP autotuning can be demonstrated outside of
> NFS by using iperf.  It can be shown that iperf performs better when
> the "-w" flag is used.  When this flag is set, iperf makes a
> setsockopt() call, which in the kernel turns off autotuning.
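
(For reference, that is an invocation like "iperf -c <server> -w 256k".
As I understand the iperf source, -w issues setsockopt(SO_SNDBUF) and
setsockopt(SO_RCVBUF) calls much like the sketch above, which is what
sets the SOCK_SNDBUF_LOCK/SOCK_RCVBUF_LOCK flags and turns autotuning
off for that connection.)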
>
> As for fixing this, it would be great if we could get some help from
> the TCP kernel folks.

And so I've CC'd netdev.

Jim Rees had the following to add:

> TCP autotuning can reduce performance by up to about 10% in some cases.
> Jeff found one of these cases.  While the autotuning penalty never exceeds
> 10% as far as I know, I can provide examples of other cases where autotuning
> improves nfsd performance by more than a factor of 100.
> 
> The right thing is to fix autotuning.  If autotuning is considered too
> broken to use, it should be turned off everywhere, not just in nfsd, as it
> hurts/benefits all TCP clients, not just nfs.

> This topic has been discussed before on netdev:
> http://www.spinics.net/lists/netdev/msg68650.html
> http://www.spinics.net/lists/netdev/msg68155.html

I'd like to point out that the penalty is much greater than 10% in my
case: with 8 nfsd threads, throughput drops from 107590 KB/s to 65558
KB/s, a regression of roughly 40%.

Here are the data points:

Jeff Moyer <jmoyer@redhat.com> writes:

>>> >> >> Jeff Moyer <jmoyer@redhat.com> wrote:
>>> >> >> 
>>> >> >> > Hi,
>>> >> >> > 
>>> >> >> > I've been working on CFQ improvements for interleaved I/Os between
>>> >> >> > processes, and noticed a regression in performance when using the
>>> >> >> > deadline I/O scheduler.  The test uses a server configured with a cciss
>>> >> >> > array and 1Gb/s ethernet.
>>> >> >> > 
>>> >> >> > The iozone command line was:
>>> >> >> >   iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
>>> >> >> > 
>>> >> >> > The numbers in the nfsd's row represent the number of nfsd "threads".
>>> >> >> > These numbers (in KB/s) represent the average of 5 runs.
>>> >> >> > 
>>> >> >> >                v2.6.29
>>> >> >> > 
>>> >> >> > nfsd's  |   1   |   2   |   4   |    8
>>> >> >> > --------+-------+-------+-------+--------
>>> >> >> > deadline| 43207 | 67436 | 96289 | 107590
>>> >> >> > 
>>> >> >> >               2.6.30-rc1
>>> >> >> > 
>>> >> >> > nfsd's  |   1   |   2   |   4   |   8
>>> >> >> > --------+-------+-------+-------+-------
>>> >> >> > deadline| 43732 | 68059 | 76659 | 83231
>>> >> >> > 
>>> >> >> >     2.6.30-rc3.block-for-linus
>>> >> >> > 
>>> >> >> > nfsd's  |   1   |   2   |   4   |   8
>>> >> >> > --------+-------+-------+-------+-------
>>> >> >> > deadline| 46102 | 71151 | 83120 | 82330
>>> >> >> > 
>>> >> >> > 
>>> >> >> > Notice the drop for 4 and 8 threads.  It may be worth noting that the
>>> >> >> > default number of NFSD threads is 8.
>
> Just following up with numbers:
>
>   2.6.30-rc4
>
> nfsd's  |   8
> --------+------
> cfq     | 51632   (49791 52436 52308 51488 52141)
> deadline| 65558   (41675 42559 74820 87518 81221)
>
>    2.6.30-rc4 reverting the sunrpc "fix"
>
> nfsd's  |   8
> --------+------
> cfq     |  82513  (81650 82762 83147 82935 82073)
> deadline| 107827  (109730 106077 107175 108524 107632)
>
> The numbers in parentheses are the individual runs.  Notice how
> 2.6.30-rc4 shows some pretty wide run-to-run variation for deadline.
>
>>> OK, I bisected this to the following commit.  The mount is done using
>>> NFSv3, by the way.
>>> 
>>> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
>>> Author: Olga Kornievskaia <aglo@citi.umich.edu>
>>> Date:   Tue Oct 21 14:13:47 2008 -0400
>>> 
>>>     svcrpc: take advantage of tcp autotuning
>>>     
>>>     Allow the NFSv4 server to make use of TCP autotuning behaviour, which
>>>     was previously disabled by setting the sk_userlocks variable.
>>>     
>>>     Set the receive buffers to be big enough to receive the whole RPC
>>>     request, and set this for the listening socket, not the accept socket.
>>>     
>>>     Remove the code that readjusts the receive/send buffer sizes for the
>>>     accepted socket. Previously this code was used to influence the TCP
>>>     window management behaviour, which is no longer needed when autotuning
>>>     is enabled.
>>>     
>>>     This can improve IO bandwidth on networks with high bandwidth-delay
>>>     products, where a large tcp window is required.  It also simplifies
>>>     performance tuning, since getting adequate tcp buffers previously
>>>     required increasing the number of nfsd threads.
>>>     
>>>     Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
>>>     Cc: Jim Rees <rees@umich.edu>
>>>     Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
>>> 
>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>> index 5763e64..7a2a90f 100644
>>> --- a/net/sunrpc/svcsock.c
>>> +++ b/net/sunrpc/svcsock.c
>>> @@ -345,7 +345,6 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
>>>  	lock_sock(sock->sk);
>>>  	sock->sk->sk_sndbuf = snd * 2;
>>>  	sock->sk->sk_rcvbuf = rcv * 2;
>>> -	sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
>>>  	release_sock(sock->sk);
>>>  #endif
>>>  }
>>> @@ -797,23 +796,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>>>  		test_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags),
>>>  		test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags));
>>>  
>>> -	if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
>>> -		/* sndbuf needs to have room for one request
>>> -		 * per thread, otherwise we can stall even when the
>>> -		 * network isn't a bottleneck.
>>> -		 *
>>> -		 * We count all threads rather than threads in a
>>> -		 * particular pool, which provides an upper bound
>>> -		 * on the number of threads which will access the socket.
>>> -		 *
>>> -		 * rcvbuf just needs to be able to hold a few requests.
>>> -		 * Normally they will be removed from the queue
>>> -		 * as soon a a complete request arrives.
>>> -		 */
>>> -		svc_sock_setbufsize(svsk->sk_sock,
>>> -				    (serv->sv_nrthreads+3) * serv->sv_max_mesg,
>>> -				    3 * serv->sv_max_mesg);
>>> -
>>>  	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>  
>>>  	/* Receive data. If we haven't got the record length yet, get
>>> @@ -1061,15 +1043,6 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
>>>  
>>>  		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>  
>>> -		/* initialise setting must have enough space to
>>> -		 * receive and respond to one request.
>>> -		 * svc_tcp_recvfrom will re-adjust if necessary
>>> -		 */
>>> -		svc_sock_setbufsize(svsk->sk_sock,
>>> -				    3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
>>> -				    3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
>>> -
>>> -		set_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags);
>>>  		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>  		if (sk->sk_state != TCP_ESTABLISHED)
>>>  			set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
>>> @@ -1140,8 +1113,14 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
>>>  	/* Initialize the socket */
>>>  	if (sock->type == SOCK_DGRAM)
>>>  		svc_udp_init(svsk, serv);
>>> -	else
>>> +	else {
>>> +		/* initialise setting must have enough space to
>>> +		 * receive and respond to one request.
>>> +		 */
>>> +		svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
>>> +					4 * serv->sv_max_mesg);
>>>  		svc_tcp_init(svsk, serv);
>>> +	}
>>>  
>>>  	/*
>>>  	 * We start one listener per sv_serv.  We want AF_INET
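
For anyone wondering where the buffer locks bite: the receive-side
autotuning backs off whenever userspace (or, as here, the kernel
itself) has locked the buffer size.  Condensed and paraphrased from
tcp_rcv_space_adjust() in net/ipv4/tcp_input.c (not a verbatim quote):

    /* only grow sk_rcvbuf when moderation is enabled and the buffer
     * size hasn't been locked via SO_RCVBUF / sk_userlocks */
    if (sysctl_tcp_moderate_rcvbuf &&
        !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
        /* ... grow sk_rcvbuf toward sysctl_tcp_rmem[2] ... */
    }

So with the sk_userlocks assignment removed, the nfsd sockets are
governed by tcp_moderate_rcvbuf and tcp_rmem again, for better or
worse.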
