* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: Jeff Moyer @ 2009-05-13 19:29 UTC
  To: netdev, Andrew Morton
  Cc: Jens Axboe, linux-kernel, Rafael J. Wysocki, Olga Kornievskaia,
      J. Bruce Fields, Jim Rees, linux-nfs

Hi, netdev folks.  The summary here is:

A patch added in the 2.6.30 development cycle caused a performance
regression in my NFS iozone testing.  The patch in question is the
following:

commit 47a14ef1af48c696b214ac168f056ddc79793d0e
Author: Olga Kornievskaia <aglo@citi.umich.edu>
Date:   Tue Oct 21 14:13:47 2008 -0400

    svcrpc: take advantage of tcp autotuning

which is also quoted below.  Using 8 nfsd threads, a single client doing
2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558 KB/s
under 2.6.30-rc4.  I also see more run-to-run variation under 2.6.30-rc4
using the deadline I/O scheduler on the server.  That variation disappears
(as does the performance regression) when reverting the above commit.

Olga had the following to say about the regression:

>> On Wed, 13 May 2009 12:20:57 -0400 Olga Kornievskaia <aglo@citi.umich.edu> wrote:
>>
>>> I believe what you are seeing is how well TCP autotuning performs.
>>> What the old NFS code was doing is disabling autotuning and instead
>>> using the number of nfsd threads to scale the TCP receive window.
>>> You are providing an example where setting TCP buffer sizes
>>> outperforms TCP autotuning.  While this is a valid example, there is
>>> also an alternative example of where the old NFS design hurts
>>> performance.
>
> We realize that decreased performance is a problem and understand that
> reverting the patch might be the appropriate course of action!
> But we are curious why this is happening.  Jeff, if it's not too much
> trouble, could you generate tcpdumps for both cases?  We are curious
> what the max window sizes are in both cases.  Also, could you give us
> your tcp and network sysctl values for the testing environment (both
> client and server values), which you can get with "sysctl -a | grep tcp"
> and also "| grep net.core".

The requested data can be found here:

  http://people.redhat.com/jmoyer/iozone-regression.tar

> Poor performance using TCP autotuning can be demonstrated outside of
> NFS using iperf.  It can be shown that iperf will work better if the
> "-w" flag is used.  When this flag is set, iperf makes a setsockopt()
> call, which in the kernel turns off autotuning.
>
> As for fixing this, it would be great if we could get some help from
> the TCP kernel folks.

And so I've CC'd netdev.  Jim Rees had the following to add:

> TCP autotuning can reduce performance by up to about 10% in some cases.
> Jeff found one of these cases.  While the autotuning penalty never
> exceeds 10% as far as I know, I can provide examples of other cases
> where autotuning improves nfsd performance by more than a factor of 100.
>
> The right thing is to fix autotuning.  If autotuning is considered too
> broken to use, it should be turned off everywhere, not just in nfsd, as
> it hurts/benefits all TCP clients, not just nfs.
>
> This topic has been discussed before on netdev:
>   http://www.spinics.net/lists/netdev/msg68650.html
>   http://www.spinics.net/lists/netdev/msg68155.html

I'd like to point out that the penalty is much greater than 10% in my
case.
Here are the data points:

Jeff Moyer <jmoyer@redhat.com> writes:

>>> >> >> Jeff Moyer <jmoyer@redhat.com> wrote:
>>> >> >>
>>> >> >> > Hi,
>>> >> >> >
>>> >> >> > I've been working on CFQ improvements for interleaved I/Os between
>>> >> >> > processes, and noticed a regression in performance when using the
>>> >> >> > deadline I/O scheduler.  The test uses a server configured with a
>>> >> >> > cciss array and 1Gb/s ethernet.
>>> >> >> >
>>> >> >> > The iozone command line was:
>>> >> >> > iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w
>>> >> >> >
>>> >> >> > The numbers in the nfsd's row represent the number of nfsd "threads".
>>> >> >> > These numbers (in KB/s) represent the average of 5 runs.
>>> >> >> >
>>> >> >> > v2.6.29
>>> >> >> >
>>> >> >> > nfsd's  |   1   |   2   |   4   |    8
>>> >> >> > --------+-------+-------+-------+--------
>>> >> >> > deadline| 43207 | 67436 | 96289 | 107590
>>> >> >> >
>>> >> >> > 2.6.30-rc1
>>> >> >> >
>>> >> >> > nfsd's  |   1   |   2   |   4   |    8
>>> >> >> > --------+-------+-------+-------+--------
>>> >> >> > deadline| 43732 | 68059 | 76659 |  83231
>>> >> >> >
>>> >> >> > 2.6.30-rc3.block-for-linus
>>> >> >> >
>>> >> >> > nfsd's  |   1   |   2   |   4   |    8
>>> >> >> > --------+-------+-------+-------+--------
>>> >> >> > deadline| 46102 | 71151 | 83120 |  82330
>>> >> >> >
>>> >> >> > Notice the drop for 4 and 8 threads.  It may be worth noting that
>>> >> >> > the default number of NFSD threads is 8.
>
> Just following up with numbers:
>
> 2.6.30-rc4
>
> nfsd's  |    8
> --------+---------------------------------------------
> cfq     |  51632 (49791 52436 52308 51488 52141)
> deadline|  65558 (41675 42559 74820 87518 81221)
>
> 2.6.30-rc4 reverting the sunrpc "fix"
>
> nfsd's  |    8
> --------+---------------------------------------------
> cfq     |  82513 (81650 82762 83147 82935 82073)
> deadline| 107827 (109730 106077 107175 108524 107632)
>
> The numbers in parentheses are the individual runs.  Notice how
> 2.6.30-rc4 has some pretty wide variations for deadline.

>>> OK, I bisected this to the following commit.  The mount is done using
>>> NFSv3, by the way.
>>>
>>> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
>>> Author: Olga Kornievskaia <aglo@citi.umich.edu>
>>> Date:   Tue Oct 21 14:13:47 2008 -0400
>>>
>>>     svcrpc: take advantage of tcp autotuning
>>>
>>>     Allow the NFSv4 server to make use of TCP autotuning behaviour, which
>>>     was previously disabled by setting the sk_userlocks variable.
>>>
>>>     Set the receive buffers to be big enough to receive the whole RPC
>>>     request, and set this for the listening socket, not the accept socket.
>>>
>>>     Remove the code that readjusts the receive/send buffer sizes for the
>>>     accepted socket.  Previously this code was used to influence the TCP
>>>     window management behaviour, which is no longer needed when autotuning
>>>     is enabled.
>>>
>>>     This can improve IO bandwidth on networks with high bandwidth-delay
>>>     products, where a large tcp window is required.  It also simplifies
>>>     performance tuning, since getting adequate tcp buffers previously
>>>     required increasing the number of nfsd threads.
>>>
>>>     Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu>
>>>     Cc: Jim Rees <rees@umich.edu>
>>>     Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
>>>
>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>> index 5763e64..7a2a90f 100644
>>> --- a/net/sunrpc/svcsock.c
>>> +++ b/net/sunrpc/svcsock.c
>>> @@ -345,7 +345,6 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
>>>  	lock_sock(sock->sk);
>>>  	sock->sk->sk_sndbuf = snd * 2;
>>>  	sock->sk->sk_rcvbuf = rcv * 2;
>>> -	sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
>>>  	release_sock(sock->sk);
>>>  #endif
>>>  }
>>> @@ -797,23 +796,6 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>>>  		test_bit(XPT_CONN, &svsk->sk_xprt.xpt_flags),
>>>  		test_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags));
>>>
>>> -	if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
>>> -		/* sndbuf needs to have room for one request
>>> -		 * per thread, otherwise we can stall even when the
>>> -		 * network isn't a bottleneck.
>>> -		 *
>>> -		 * We count all threads rather than threads in a
>>> -		 * particular pool, which provides an upper bound
>>> -		 * on the number of threads which will access the socket.
>>> -		 *
>>> -		 * rcvbuf just needs to be able to hold a few requests.
>>> -		 * Normally they will be removed from the queue
>>> -		 * as soon a a complete request arrives.
>>> -		 */
>>> -		svc_sock_setbufsize(svsk->sk_sock,
>>> -				    (serv->sv_nrthreads+3) * serv->sv_max_mesg,
>>> -				    3 * serv->sv_max_mesg);
>>> -
>>>  	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>
>>>  	/* Receive data. If we haven't got the record length yet, get
>>> @@ -1061,15 +1043,6 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
>>>
>>>  		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>
>>> -		/* initialise setting must have enough space to
>>> -		 * receive and respond to one request.
>>> -		 * svc_tcp_recvfrom will re-adjust if necessary
>>> -		 */
>>> -		svc_sock_setbufsize(svsk->sk_sock,
>>> -				    3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
>>> -				    3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
>>> -
>>> -		set_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags);
>>>  		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>  		if (sk->sk_state != TCP_ESTABLISHED)
>>>  			set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
>>> @@ -1140,8 +1113,14 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
>>>  	/* Initialize the socket */
>>>  	if (sock->type == SOCK_DGRAM)
>>>  		svc_udp_init(svsk, serv);
>>> -	else
>>> +	else {
>>> +		/* initialise setting must have enough space to
>>> +		 * receive and respond to one request.
>>> +		 */
>>> +		svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
>>> +					4 * serv->sv_max_mesg);
>>>  		svc_tcp_init(svsk, serv);
>>> +	}
>>>
>>>  	/*
>>>  	 * We start one listener per sv_serv. We want AF_INET
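
For context on the autotuning point above: what "iperf -w" does (and what
the sk_userlocks line removed by the commit used to do on the server side)
is pin the socket buffer sizes, which switches TCP autotuning off for that
socket.  A minimal user-space sketch of that effect follows; the helper
name and buffer size are purely illustrative and not taken from the thread:

    #include <sys/socket.h>

    /* Pin both socket buffers to a fixed size.  After this the kernel sets
     * SOCK_SNDBUF_LOCK/SOCK_RCVBUF_LOCK on the socket and will no longer
     * autotune its buffers (the same effect iperf -w has). */
    static int pin_socket_buffers(int fd, int bytes)
    {
            if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
                    return -1;
            return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
    }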
* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: Trond Myklebust @ 2009-05-13 23:45 UTC
  To: Jeff Moyer
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel, Rafael J. Wysocki,
      Olga Kornievskaia, J. Bruce Fields, Jim Rees, linux-nfs

On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
> Hi, netdev folks.  The summary here is:
>
> A patch added in the 2.6.30 development cycle caused a performance
> regression in my NFS iozone testing.  The patch in question is the
> following:
>
> commit 47a14ef1af48c696b214ac168f056ddc79793d0e
> Author: Olga Kornievskaia <aglo@citi.umich.edu>
> Date:   Tue Oct 21 14:13:47 2008 -0400
>
>     svcrpc: take advantage of tcp autotuning
>
> which is also quoted below.  Using 8 nfsd threads, a single client doing
> 2GB of streaming read I/O goes from 107590 KB/s under 2.6.29 to 65558
> KB/s under 2.6.30-rc4.  I also see more run-to-run variation under
> 2.6.30-rc4 using the deadline I/O scheduler on the server.  That
> variation disappears (as does the performance regression) when reverting
> the above commit.

It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
function.  I can see no reason why we should stop processing new incoming
RPC requests just because the send buffer happens to be 2/3 full.  If we
see that we have space for another reply, then we should just go for it.
OTOH, we do want to ensure that the SOCK_NOSPACE flag remains set, so
that the TCP layer knows that we're congested, and that we'd like it to
increase the send window size, please.

Could you therefore please see if the following (untested) patch helps?

Cheers
  Trond

---------------------------------------------------------------------
From 1545cbda1b1cda2500cb9db3c760a05fc4f6ed4d Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Wed, 13 May 2009 19:44:58 -0400
Subject: [PATCH] SUNRPC: Fix the TCP server's send buffer accounting

Currently, the sunrpc server is refusing to allow us to process new RPC
calls if the TCP send buffer is 2/3 full, even if we do actually have
enough free space to guarantee that we can send another request.

The following patch fixes svc_tcp_has_wspace() so that we only stop
processing requests if we know that the socket buffer cannot possibly fit
another reply.

It also fixes the tcp write_space() callback so that we only clear the
SOCK_NOSPACE flag when the TCP send buffer is less than 2/3 full.
This should ensure that the send window will grow as per the standard TCP
socket code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 net/sunrpc/svcsock.c |   32 ++++++++++++++++----------------
 1 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index af31988..8962355 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -386,6 +386,15 @@ static void svc_write_space(struct sock *sk)
 	}
 }
 
+static void svc_tcp_write_space(struct sock *sk)
+{
+	struct socket *sock = sk->sk_socket;
+
+	if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && sock)
+		clear_bit(SOCK_NOSPACE, &sock->flags);
+	svc_write_space(sk);
+}
+
 /*
  * Copy the UDP datagram's destination address to the rqstp structure.
  * The 'destination' address in this case is the address to which the
@@ -964,23 +973,14 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
 	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
 	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
 	int required;
-	int wspace;
-
-	/*
-	 * Set the SOCK_NOSPACE flag before checking the available
-	 * sock space.
-	 */
-	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
-	required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
-	wspace = sk_stream_wspace(svsk->sk_sk);
-
-	if (wspace < sk_stream_min_wspace(svsk->sk_sk))
-		return 0;
-	if (required * 2 > wspace)
-		return 0;
 
-	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
+	if (sk_stream_wspace(svsk->sk_sk) < required)
+		goto out_nospace;
 	return 1;
+out_nospace:
+	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	return 0;
 }
 
 static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
@@ -1036,7 +1036,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 		dprintk("setting up TCP socket for reading\n");
 		sk->sk_state_change = svc_tcp_state_change;
 		sk->sk_data_ready = svc_tcp_data_ready;
-		sk->sk_write_space = svc_write_space;
+		sk->sk_write_space = svc_tcp_write_space;
 
 		svsk->sk_reclen = 0;
 		svsk->sk_tcplen = 0;
-- 
1.6.0.4
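
To make the reworked check concrete, here is a small stand-alone sketch of
the arithmetic svc_tcp_has_wspace() performs after the patch above.  The
helper and the byte values are made up purely for illustration; they are
not part of the patch:

    #include <assert.h>

    #define MB (1024L * 1024)

    /* Same test as the patched svc_tcp_has_wspace(): keep accepting work
     * while twice (space already promised to in-flight replies + one
     * worst-case message) still fits in the free send-buffer space. */
    static int has_wspace(long reserved, long max_mesg, long wspace)
    {
            return wspace >= (reserved + max_mesg) * 2;
    }

    int main(void)
    {
            /* illustrative numbers: 1 MB max message, 2 MB already reserved */
            assert(has_wspace(2 * MB, 1 * MB, 6 * MB));   /* 6 MB free: accept   */
            assert(!has_wspace(2 * MB, 1 * MB, 5 * MB));  /* 5 MB free: back off */
            return 0;
    }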
* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: Jeff Moyer @ 2009-05-14 13:34 UTC
  To: Trond Myklebust
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel, Rafael J. Wysocki,
      Olga Kornievskaia, J. Bruce Fields, Jim Rees, linux-nfs

Trond Myklebust <trond.myklebust-41N18TsMXrtuMpJDpNschA@public.gmane.org> writes:

> On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
>> Hi, netdev folks.  The summary here is:
>> [...]
>
> It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> function.  I can see no reason why we should stop processing new incoming
> RPC requests just because the send buffer happens to be 2/3 full.  If we
> see that we have space for another reply, then we should just go for it.
> OTOH, we do want to ensure that the SOCK_NOSPACE flag remains set, so
> that the TCP layer knows that we're congested, and that we'd like it to
> increase the send window size, please.
>
> Could you therefore please see if the following (untested) patch helps?

I'm seeing slightly better results with the patch:

  71548
  75987
  71557
  87432
  83538

But that's still not up to the speeds we saw under 2.6.29.  The packet
capture for one run can be found here:

  http://people.redhat.com/jmoyer/trond.pcap.bz2

Cheers,
Jeff
* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: Trond Myklebust @ 2009-05-14 14:33 UTC
  To: Jeff Moyer
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel, Rafael J. Wysocki,
      Olga Kornievskaia, J. Bruce Fields, Jim Rees, linux-nfs

On Thu, 2009-05-14 at 09:34 -0400, Jeff Moyer wrote:
> Trond Myklebust writes:
>> [...]
>> Could you therefore please see if the following (untested) patch helps?
>
> I'm seeing slightly better results with the patch:
>
>   71548
>   75987
>   71557
>   87432
>   83538
>
> But that's still not up to the speeds we saw under 2.6.29.  The packet
> capture for one run can be found here:
>   http://people.redhat.com/jmoyer/trond.pcap.bz2

Yes.  Something is very wrong there...

See for instance frame 1195, where the client finishes sending a whole
series of READ requests, and we go into a flurry of ACKs passing
backwards and forwards, but no data.  It looks as if the NFS server isn't
processing anything, probably because the window size falls afoul of
svc_tcp_has_wspace()...

Does something like this help?

Cheers
  Trond

---------------------------------------------------------------------
From 85e3f5860a9063d193bdb45516b3d3d347b87301 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
Date: Thu, 14 May 2009 10:33:07 -0400
Subject: [PATCH] SUNRPC: Always allow the NFS server to process at least one request

Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
---
 net/sunrpc/svcsock.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 8962355..4837442 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -972,9 +972,16 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
 {
 	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
 	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
+	int reserved;
 	int required;
 
-	required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
+	reserved = atomic_read(&xprt->xpt_reserved);
+	/* Always allow the server to process at least one request, whether
+	 * or not the TCP window is large enough
+	 */
+	if (reserved == 0)
+		return 1;
+	required = (reserved + serv->sv_max_mesg) << 1;
 	if (sk_stream_wspace(svsk->sk_sk) < required)
 		goto out_nospace;
 	return 1;
-- 
1.6.0.4
* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: Jeff Moyer @ 2009-05-14 14:38 UTC
  To: Trond Myklebust
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel, Rafael J. Wysocki,
      Olga Kornievskaia, J. Bruce Fields, Jim Rees, linux-nfs

Trond Myklebust <trond.myklebust-41N18TsMXrtuMpJDpNschA@public.gmane.org> writes:

> [...]
>
> Yes.  Something is very wrong there...
>
> See for instance frame 1195, where the client finishes sending a whole
> series of READ requests, and we go into a flurry of ACKs passing
> backwards and forwards, but no data.  It looks as if the NFS server isn't
> processing anything, probably because the window size falls afoul of
> svc_tcp_has_wspace()...
>
> Does something like this help?

Is this in addition to the previous patch or instead of it?

Cheers,
Jeff
* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: Jeff Moyer @ 2009-05-14 15:00 UTC
  To: Trond Myklebust
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel, Rafael J. Wysocki,
      Olga Kornievskaia, J. Bruce Fields, Jim Rees, linux-nfs

Trond Myklebust <trond.myklebust-41N18TsMXrtuMpJDpNschA@public.gmane.org> writes:

> [...]
>
> Does something like this help?

Sorry for the previous, stupid question.  I applied the patch in addition
to the last one, and here are the results:

  70327
  71561
  68760
  69199
  65324

A packet capture for this run is available here:

  http://people.redhat.com/jmoyer/trond2.pcap.bz2

Any more ideas?  ;)

-Jeff
* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: Trond Myklebust @ 2009-05-17 19:10 UTC
  To: Jeff Moyer
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel, Rafael J. Wysocki,
      Olga Kornievskaia, J. Bruce Fields, Jim Rees, linux-nfs

On Thu, 2009-05-14 at 11:00 -0400, Jeff Moyer wrote:
> Sorry for the previous, stupid question.  I applied the patch in addition
> to the last one, and here are the results:
>
>   70327
>   71561
>   68760
>   69199
>   65324
>
> A packet capture for this run is available here:
>   http://people.redhat.com/jmoyer/trond2.pcap.bz2
>
> Any more ideas?  ;)

Yep.  I've got 2 more patches for you.  With both of them applied, I'm
seeing decent performance on my own test rig.  The first patch is
appended.  I'll send the second in another email (to avoid attachments).

Cheers
  Trond

-----------------------------------------------------------------------
From fcfdaf81eb21a996d83a2b68da2d62bb3697c1db Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
Date: Sun, 17 May 2009 12:40:05 -0400
Subject: [PATCH] SUNRPC: Fix svc_tcp_recvfrom()

Ensure that if the TCP receive window is smaller than the message length,
then we just buffer the existing data, in order to allow the client to
send more...

Signed-off-by: Trond Myklebust <Trond.Myklebust-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
---
 include/linux/sunrpc/svcsock.h |    1 +
 net/sunrpc/svcsock.c           |  167 +++++++++++++++++++++++++++++-----------
 2 files changed, 124 insertions(+), 44 deletions(-)

diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 483e103..b0b4546 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -28,6 +28,7 @@ struct svc_sock {
 	/* private TCP part */
 	u32		sk_reclen;	/* length of record */
 	u32		sk_tcplen;	/* current read length */
+	struct page *	sk_pages[RPCSVC_MAXPAGES];	/* received data */
 };
 
 /*
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index c4e7be5..6069489 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -323,6 +323,33 @@ static int svc_recvfrom(struct svc_rqst *rqstp, struct kvec *iov, int nr,
 	return len;
 }
 
+static int svc_partial_recvfrom(struct svc_rqst *rqstp,
+				struct kvec *iov, int nr,
+				int buflen, unsigned int base)
+{
+	size_t save_iovlen;
+	void __user *save_iovbase;
+	unsigned int i;
+	int ret;
+
+	if (base == 0)
+		return svc_recvfrom(rqstp, iov, nr, buflen);
+
+	for (i = 0; i < nr; i++) {
+		if (iov[i].iov_len > base)
+			break;
+		base -= iov[i].iov_len;
+	}
+	save_iovlen = iov[i].iov_len;
+	save_iovbase = iov[i].iov_base;
+	iov[i].iov_len -= base;
+	iov[i].iov_base += base;
+	ret = svc_recvfrom(rqstp, &iov[i], nr - i, buflen);
+	iov[i].iov_len = save_iovlen;
+	iov[i].iov_base = save_iovbase;
+	return ret;
+}
+
 /*
  * Set socket snd and rcv buffer lengths
  */
@@ -790,6 +817,56 @@ failed:
 	return NULL;
 }
 
+static unsigned int svc_tcp_restore_pages(struct svc_sock *svsk, struct svc_rqst *rqstp)
+{
+	unsigned int i, len, npages;
+
+	if (svsk->sk_tcplen <= sizeof(rpc_fraghdr))
+		return 0;
+	len = svsk->sk_tcplen - sizeof(rpc_fraghdr);
+	npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < npages; i++) {
+		if (rqstp->rq_pages[i] != NULL)
+			put_page(rqstp->rq_pages[i]);
+		BUG_ON(svsk->sk_pages[i] == NULL);
+		rqstp->rq_pages[i] = svsk->sk_pages[i];
+		svsk->sk_pages[i] = NULL;
+	}
+	rqstp->rq_arg.head[0].iov_base = page_address(rqstp->rq_pages[0]);
+	return len;
+}
+
+static void svc_tcp_save_pages(struct svc_sock *svsk, struct svc_rqst *rqstp)
+{
+	unsigned int i, len, npages;
+
+	if (svsk->sk_tcplen <= sizeof(rpc_fraghdr))
+		return;
+	len = svsk->sk_tcplen - sizeof(rpc_fraghdr);
+	npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < npages; i++) {
+		svsk->sk_pages[i] = rqstp->rq_pages[i];
+		rqstp->rq_pages[i] = NULL;
+	}
+}
+
+static void svc_tcp_clear_pages(struct svc_sock *svsk)
+{
+	unsigned int i, len, npages;
+
+	if (svsk->sk_tcplen <= sizeof(rpc_fraghdr))
+		goto out;
+	len = svsk->sk_tcplen - sizeof(rpc_fraghdr);
+	npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < npages; i++) {
+		BUG_ON(svsk->sk_pages[i] == NULL);
+		put_page(svsk->sk_pages[i]);
+		svsk->sk_pages[i] = NULL;
+	}
+out:
+	svsk->sk_tcplen = 0;
+}
+
 /*
  * Receive data from a TCP socket.
  */
@@ -800,7 +877,8 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
 	int		len;
 	struct kvec *vec;
-	int pnum, vlen;
+	unsigned int want, base, vlen;
+	int pnum;
 
 	dprintk("svc: tcp_recv %p data %d conn %d close %d\n",
 		svsk, test_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags),
@@ -814,9 +892,9 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	 * possible up to the complete record length.
 	 */
 	if (svsk->sk_tcplen < sizeof(rpc_fraghdr)) {
-		int want = sizeof(rpc_fraghdr) - svsk->sk_tcplen;
 		struct kvec	iov;
 
+		want = sizeof(rpc_fraghdr) - svsk->sk_tcplen;
 		iov.iov_base = ((char *) &svsk->sk_reclen) + svsk->sk_tcplen;
 		iov.iov_len  = want;
 		if ((len = svc_recvfrom(rqstp, &iov, 1, want)) < 0)
@@ -826,8 +904,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		if (len < want) {
 			dprintk("svc: short recvfrom while reading record "
 				"length (%d of %d)\n", len, want);
-			svc_xprt_received(&svsk->sk_xprt);
-			return -EAGAIN; /* record header not complete */
+			goto err_noclose;
 		}
 
 		svsk->sk_reclen = ntohl(svsk->sk_reclen);
@@ -853,25 +930,14 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		}
 	}
 
-	/* Check whether enough data is available */
-	len = svc_recv_available(svsk);
-	if (len < 0)
-		goto error;
-
-	if (len < svsk->sk_reclen) {
-		dprintk("svc: incomplete TCP record (%d of %d)\n",
-			len, svsk->sk_reclen);
-		svc_xprt_received(&svsk->sk_xprt);
-		return -EAGAIN;	/* record not complete */
-	}
-	len = svsk->sk_reclen;
-	set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+	base = svc_tcp_restore_pages(svsk, rqstp);
+	want = svsk->sk_reclen - base;
 
 	vec = rqstp->rq_vec;
 	vec[0] = rqstp->rq_arg.head[0];
 	vlen = PAGE_SIZE;
 	pnum = 1;
-	while (vlen < len) {
+	while (vlen < svsk->sk_reclen) {
 		vec[pnum].iov_base = page_address(rqstp->rq_pages[pnum]);
 		vec[pnum].iov_len = PAGE_SIZE;
 		pnum++;
@@ -880,19 +946,26 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	rqstp->rq_respages = &rqstp->rq_pages[pnum];
 
 	/* Now receive data */
-	len = svc_recvfrom(rqstp, vec, pnum, len);
-	if (len < 0)
-		goto error;
+	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+	len = svc_partial_recvfrom(rqstp, vec, pnum, want, base);
+	if (len != want) {
+		if (len >= 0)
+			svsk->sk_tcplen += len;
+		else if (len != -EAGAIN)
+			goto err_other;
+		svc_tcp_save_pages(svsk, rqstp);
+		dprintk("svc: incomplete TCP record (%d of %d)\n",
+			svsk->sk_tcplen, svsk->sk_reclen);
+		goto err_noclose;
+	}
 
-	dprintk("svc: TCP complete record (%d bytes)\n", len);
-	rqstp->rq_arg.len = len;
+	rqstp->rq_arg.len = svsk->sk_reclen;
 	rqstp->rq_arg.page_base = 0;
-	if (len <= rqstp->rq_arg.head[0].iov_len) {
-		rqstp->rq_arg.head[0].iov_len = len;
+	if (rqstp->rq_arg.len <= rqstp->rq_arg.head[0].iov_len) {
+		rqstp->rq_arg.head[0].iov_len = rqstp->rq_arg.len;
 		rqstp->rq_arg.page_len = 0;
-	} else {
-		rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
-	}
+	} else
+		rqstp->rq_arg.page_len = rqstp->rq_arg.len - rqstp->rq_arg.head[0].iov_len;
 
 	rqstp->rq_xprt_ctxt   = NULL;
 	rqstp->rq_prot	      = IPPROTO_TCP;
@@ -900,29 +973,32 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	/* Reset TCP read info */
 	svsk->sk_reclen = 0;
 	svsk->sk_tcplen = 0;
+	/* If we have more data, signal svc_xprt_enqueue() to try again */
+	if (svc_recv_available(svsk) > sizeof(rpc_fraghdr))
+		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+
 	svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
 	svc_xprt_received(&svsk->sk_xprt);
 	if (serv->sv_stats)
 		serv->sv_stats->nettcpcnt++;
 
-	return len;
-
- err_delete:
+	dprintk("svc: TCP complete record (%d bytes)\n", rqstp->rq_arg.len);
+	return rqstp->rq_arg.len;
+error:
+	if (len == -EAGAIN)
+		goto err_got_eagain;
+err_other:
+	printk(KERN_NOTICE "%s: recvfrom returned errno %d\n",
+	       svsk->sk_xprt.xpt_server->sv_name, -len);
+err_delete:
 	set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
 	return -EAGAIN;
-
- error:
-	if (len == -EAGAIN) {
-		dprintk("RPC: TCP recvfrom got EAGAIN\n");
-		svc_xprt_received(&svsk->sk_xprt);
-	} else {
-		printk(KERN_NOTICE "%s: recvfrom returned errno %d\n",
-		       svsk->sk_xprt.xpt_server->sv_name, -len);
-		goto err_delete;
-	}
-
-	return len;
+err_got_eagain:
+	dprintk("RPC: TCP recvfrom got EAGAIN\n");
+err_noclose:
+	svc_xprt_received(&svsk->sk_xprt);
+	return -EAGAIN;	/* record not complete */
 }
 
 /*
@@ -1043,6 +1119,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 
 		svsk->sk_reclen = 0;
 		svsk->sk_tcplen = 0;
+		memset(&svsk->sk_pages[0], 0, sizeof(svsk->sk_pages));
 
 		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
 
@@ -1291,8 +1368,10 @@ static void svc_tcp_sock_detach(struct svc_xprt *xprt)
 
 	svc_sock_detach(xprt);
 
-	if (!test_bit(XPT_LISTENER, &xprt->xpt_flags))
+	if (!test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
+		svc_tcp_clear_pages(svsk);
 		kernel_sock_shutdown(svsk->sk_sock, SHUT_RDWR);
+	}
 }
 
 /*
-- 
1.6.0.4
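
The core trick in svc_partial_recvfrom() above (resuming a receive at a
byte offset into an iovec array) can be sketched outside the kernel.  The
helper below is a hypothetical user-space analogue, assuming 0 <= base <
total iovec length, and mirrors the same skip/trim/restore bookkeeping:

    #include <sys/uio.h>
    #include <unistd.h>

    /* Resume a partially-completed read at byte offset "base" into
     * iov[0..nr-1]: skip whole entries, temporarily trim the entry that
     * contains the offset, read, then restore it so the caller can retry
     * later with an updated offset. */
    static ssize_t partial_readv(int fd, struct iovec *iov, int nr, size_t base)
    {
            struct iovec save;
            ssize_t ret;
            int i;

            for (i = 0; i < nr && iov[i].iov_len <= base; i++)
                    base -= iov[i].iov_len;

            save = iov[i];
            iov[i].iov_base = (char *)iov[i].iov_base + base;
            iov[i].iov_len -= base;
            ret = readv(fd, &iov[i], nr - i);
            iov[i] = save;
            return ret;
    }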
* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: Trond Myklebust @ 2009-05-17 19:12 UTC
  To: Jeff Moyer
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel, Rafael J. Wysocki,
      Olga Kornievskaia, J. Bruce Fields, Jim Rees, linux-nfs

On Sun, 2009-05-17 at 15:11 -0400, Trond Myklebust wrote:
> On Thu, 2009-05-14 at 11:00 -0400, Jeff Moyer wrote:
>> [...]
>> Any more ideas?  ;)
>
> Yep.  I've got 2 more patches for you.  With both of them applied, I'm
> seeing decent performance on my own test rig.  The first patch is
> appended.  I'll send the second in another email (to avoid attachments).

Here is number 2.  It is incremental to all the others...

-----------------------------------------------------------------------
From 1d11caba8bcfc8fe718bfa9a957715bf3819af09 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Sun, 17 May 2009 13:01:00 -0400
Subject: [PATCH] SUNRPC: Further congestion control tweaks...

Ensure that deferred requests are accounted for correctly by the write
space reservation mechanism.  In order to avoid double counting, remove
the reservation when we defer the request, and save any calculated value,
so that we can restore it when the request is requeued.

Also fix svc_tcp_has_wspace() so that it doesn't reserve twice the memory
that we expect to require in order to write out the data.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 include/linux/sunrpc/svc.h |    1 +
 net/sunrpc/svc_xprt.c      |   10 +++++-----
 net/sunrpc/svcsock.c       |   19 +++++++------------
 3 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 2a30775..2c373d8 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -341,6 +341,7 @@ struct svc_deferred_req {
 	union svc_addr_u	daddr;	/* where reply must come from */
 	struct cache_deferred_req handle;
 	size_t			xprt_hlen;
+	int			reserved_space;
 	int			argslen;
 	__be32			args[0];
 };
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index c200d92..daa1f27 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -299,7 +299,6 @@ static void svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
  */
 void svc_xprt_enqueue(struct svc_xprt *xprt)
 {
-	struct svc_serv	*serv = xprt->xpt_server;
 	struct svc_pool *pool;
 	struct svc_rqst	*rqstp;
 	int cpu;
@@ -376,8 +375,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 			rqstp, rqstp->rq_xprt);
 		rqstp->rq_xprt = xprt;
 		svc_xprt_get(xprt);
-		rqstp->rq_reserved = serv->sv_max_mesg;
-		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 		rqstp->rq_waking = 1;
 		pool->sp_nwaking++;
 		pool->sp_stats.threads_woken++;
@@ -657,8 +654,6 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 	if (xprt) {
 		rqstp->rq_xprt = xprt;
 		svc_xprt_get(xprt);
-		rqstp->rq_reserved = serv->sv_max_mesg;
-		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 	} else {
 		/* No data pending. Go to sleep */
 		svc_thread_enqueue(pool, rqstp);
@@ -741,6 +736,8 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 	dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
 		rqstp, pool->sp_id, xprt,
 		atomic_read(&xprt->xpt_ref.refcount));
+	rqstp->rq_reserved = serv->sv_max_mesg;
+	atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 	rqstp->rq_deferred = svc_deferred_dequeue(xprt);
 	if (rqstp->rq_deferred) {
 		svc_xprt_received(xprt);
@@ -1006,6 +1003,8 @@ static struct cache_deferred_req *svc_defer(struct cache_req *req)
 	}
 	svc_xprt_get(rqstp->rq_xprt);
 	dr->xprt = rqstp->rq_xprt;
+	dr->reserved_space = rqstp->rq_reserved;
+	svc_reserve(rqstp, 0);
 
 	dr->handle.revisit = svc_revisit;
 	return &dr->handle;
@@ -1018,6 +1017,7 @@ static int svc_deferred_recv(struct svc_rqst *rqstp)
 {
 	struct svc_deferred_req *dr = rqstp->rq_deferred;
 
+	svc_reserve(rqstp, dr->reserved_space);
 	/* setup iov_base past transport header */
 	rqstp->rq_arg.head[0].iov_base = dr->args + (dr->xprt_hlen>>2);
 	/* The iov_len does not include the transport header bytes */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 4837442..eed978e 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -345,6 +345,7 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
 	lock_sock(sock->sk);
 	sock->sk->sk_sndbuf = snd * 2;
 	sock->sk->sk_rcvbuf = rcv * 2;
+	sock->sk->sk_write_space(sock->sk);
 	release_sock(sock->sk);
 #endif
 }
@@ -626,6 +627,7 @@ static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
 	 * receive and respond to one request.
 	 * svc_udp_recvfrom will re-adjust if necessary
 	 */
+	svsk->sk_sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
 	svc_sock_setbufsize(svsk->sk_sock,
 			    3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
 			    3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
@@ -971,21 +973,14 @@ static void svc_tcp_prep_reply_hdr(struct svc_rqst *rqstp)
 static int svc_tcp_has_wspace(struct svc_xprt *xprt)
 {
 	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
-	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
-	int reserved;
+	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
 	int required;
 
-	reserved = atomic_read(&xprt->xpt_reserved);
-	/* Always allow the server to process at least one request, whether
-	 * or not the TCP window is large enough
-	 */
-	if (reserved == 0)
+	if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
+		return 1;
+	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
+	if (sk_stream_wspace(svsk->sk_sk) >= required)
 		return 1;
-	required = (reserved + serv->sv_max_mesg) << 1;
-	if (sk_stream_wspace(svsk->sk_sk) < required)
-		goto out_nospace;
-	return 1;
-out_nospace:
 	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
 	return 0;
 }
-- 
1.6.0.4
* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: Jeff Moyer @ 2009-05-18 14:15 UTC
  To: Trond Myklebust
  Cc: netdev, Andrew Morton, Jens Axboe, linux-kernel, Rafael J. Wysocki,
      Olga Kornievskaia, J. Bruce Fields, Jim Rees, linux-nfs

Trond Myklebust <trond.myklebust-41N18TsMXrtuMpJDpNschA@public.gmane.org> writes:

> On Sun, 2009-05-17 at 15:11 -0400, Trond Myklebust wrote:
>> Yep.  I've got 2 more patches for you.  With both of them applied, I'm
>> seeing decent performance on my own test rig.
>> [...]
>
> Here is number 2.  It is incremental to all the others...

With all 4 patches applied, these are the numbers for 5 runs:

  103168
  101212
  103346
  100842
  103172

It's looking much better, but we're still off by a few percent.  Thanks
for the quick turnaround on this, Trond!  If you submit these patches,
feel free to add:

Tested-by: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Cheers,
Jeff
* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: J. Bruce Fields @ 2009-05-22 23:45 UTC
  To: Jeff Moyer
  Cc: Trond Myklebust, netdev, Andrew Morton, Jens Axboe, linux-kernel,
      Rafael J. Wysocki, Olga Kornievskaia, Jim Rees, linux-nfs

On Mon, May 18, 2009 at 10:15:22AM -0400, Jeff Moyer wrote:
> With all 4 patches applied, these are the numbers for 5 runs:
>
>   103168
>   101212
>   103346
>   100842
>   103172
>
> It's looking much better, but we're still off by a few percent.  Thanks
> for the quick turnaround on this, Trond!  If you submit these patches,
> feel free to add:
>
> Tested-by: Jeff Moyer <jmoyer@redhat.com>

I'd like to take a look and run some tests of my own when I get back from
vacation next week.  Then assuming no problems I'm inclined to queue them
up for 2.6.31, and, in the meantime, revert the autotuning patch
temporarily for 2.6.30--under the assumption that autotuning is still the
right thing to do, but that this is too significant a regression to
ignore, and Trond's work is too involved to submit for 2.6.30 this late in
the process.

--b.
* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: J. Bruce Fields @ 2009-05-14 17:55 UTC
  To: Trond Myklebust
  Cc: Jeff Moyer, netdev, Andrew Morton, Jens Axboe, linux-kernel,
      Rafael J. Wysocki, Olga Kornievskaia, Jim Rees, linux-nfs

On Wed, May 13, 2009 at 07:45:38PM -0400, Trond Myklebust wrote:
> On Wed, 2009-05-13 at 15:29 -0400, Jeff Moyer wrote:
>> Hi, netdev folks.  The summary here is:
>> [...]
>
> It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
> function.  I can see no reason why we should stop processing new incoming
> RPC requests just because the send buffer happens to be 2/3 full.  If we

I agree, the calculation doesn't look right.  But where do you get the
2/3 number from?

...

> @@ -964,23 +973,14 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
>  	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
>  	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
>  	int required;
> -	int wspace;
> -
> -	/*
> -	 * Set the SOCK_NOSPACE flag before checking the available
> -	 * sock space.
> -	 */
> -	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> -	required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
> -	wspace = sk_stream_wspace(svsk->sk_sk);
> -
> -	if (wspace < sk_stream_min_wspace(svsk->sk_sk))
> -		return 0;
> -	if (required * 2 > wspace)
> -		return 0;
>
> -	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> +	required = (atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg) * 2;
> +	if (sk_stream_wspace(svsk->sk_sk) < required)

This calculation looks the same before and after--you've just moved the
"*2" into the calculation of "required".  Am I missing something?  Maybe
you meant to write:

	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg * 2;

without the parentheses?  That looks closer, assuming the calculation is
meant to be:

	atomic_read(..)        == amount of buffer space we think we
	                          already need
	serv->sv_max_mesg * 2  == space for worst-case request and reply?

--b.

> +		goto out_nospace;
>  	return 1;
> +out_nospace:
> +	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> +	return 0;
>  }
* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: Trond Myklebust @ 2009-05-14 18:26 UTC
  To: J. Bruce Fields
  Cc: Jeff Moyer, netdev, Andrew Morton, Jens Axboe, linux-kernel,
      Rafael J. Wysocki, Olga Kornievskaia, Jim Rees, linux-nfs

On Thu, 2009-05-14 at 13:55 -0400, J. Bruce Fields wrote:
> On Wed, May 13, 2009 at 07:45:38PM -0400, Trond Myklebust wrote:
>> It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
>> function.  I can see no reason why we should stop processing new incoming
>> RPC requests just because the send buffer happens to be 2/3 full.  If we
>
> I agree, the calculation doesn't look right.  But where do you get the
> 2/3 number from?

That's the sk_stream_wspace() vs. sk_stream_min_wspace() comparison.

> [...]
>
> This calculation looks the same before and after--you've just moved the
> "*2" into the calculation of "required".  Am I missing something?  Maybe
> you meant to write:
>
> 	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg * 2;
>
> without the parentheses?

I wasn't trying to change that part of the calculation.  I'm just
splitting out the stuff which has to do with TCP congestion (i.e. the
window size), and stuff which has to do with remaining socket buffer
space.

I do, however, agree that we should probably drop that *2.  However there
is (as usual) 'interesting behaviour' when it comes to deferred requests.
Their buffer space is already accounted for in the 'xpt_reserved'
calculation, but they cannot get re-scheduled unless svc_tcp_has_wspace()
thinks it has enough free socket space for yet another reply.  Can you
spell 'deadlock', children?

Trond
* Re: 2.6.30-rc deadline scheduler performance regression for iozone over NFS
  From: J. Bruce Fields @ 2009-05-15 21:37 UTC
  To: Trond Myklebust
  Cc: Jeff Moyer, netdev, Andrew Morton, Jens Axboe, linux-kernel,
      Rafael J. Wysocki, Olga Kornievskaia, Jim Rees, linux-nfs

On Thu, May 14, 2009 at 02:26:09PM -0400, Trond Myklebust wrote:
> On Thu, 2009-05-14 at 13:55 -0400, J. Bruce Fields wrote:
>> On Wed, May 13, 2009 at 07:45:38PM -0400, Trond Myklebust wrote:
>>> It looks to me as if we've got a bug in the svc_tcp_has_wspace() helper
>>> function.  I can see no reason why we should stop processing new incoming
>>> RPC requests just because the send buffer happens to be 2/3 full.  If we
>>
>> I agree, the calculation doesn't look right.  But where do you get the
>> 2/3 number from?
>
> That's the sk_stream_wspace() vs. sk_stream_min_wspace() comparison.

Oh, I see: looking at their implementations,

	sk_stream_wspace(sk) < sk_stream_min_wspace(sk)

is equivalent to

	sk->sk_sndbuf - sk_wmem_queued < sk_wmem_queued / 2,

or sk_wmem_queued > 2/3 of sndbuf; got it.  I didn't understand that the
point of this patch was just to rearrange that calculation--now I see.

--b.
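
For reference, the two helpers being compared were defined roughly as
follows in that era's include/net/sock.h (quoted from memory, so treat
this as an approximation rather than the exact source):

    /* Approximate 2.6.x definitions, to follow the arithmetic above. */
    static inline int sk_stream_min_wspace(struct sock *sk)
    {
            return sk->sk_wmem_queued >> 1;            /* half the queued bytes */
    }

    static inline int sk_stream_wspace(struct sock *sk)
    {
            return sk->sk_sndbuf - sk->sk_wmem_queued; /* free send-buffer space */
    }

    /* So "wspace < min_wspace" is
     *     sk_sndbuf - sk_wmem_queued < sk_wmem_queued / 2
     * i.e. sk_wmem_queued > (2/3) * sk_sndbuf,
     * which is the "send buffer is 2/3 full" threshold discussed above. */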