* [PATCH 0/3] Fix the Linux rpc-over-tcp server performance
@ 2009-05-18 21:47 Trond Myklebust
2009-08-12 2:43 ` J. Bruce Fields
From: Trond Myklebust @ 2009-05-18 21:47 UTC (permalink / raw)
To: bfields; +Cc: linux-nfs, nfsv4, Jeff Moyer
I squashed the previous set of 4 incremental patches into 3. Otherwise
there should be no differences w.r.t. the set that Jeff tested.
Cheers
Trond
---
Trond Myklebust (3):
SUNRPC: Fix svc_tcp_recvfrom()
SUNRPC: Fix the TCP write space reservations for deferred requests
SUNRPC: Fix the TCP server's send buffer accounting
 include/linux/sunrpc/svc.h     |    1 
 include/linux/sunrpc/svcsock.h |    1 
 net/sunrpc/svc_xprt.c          |   10 +-
 net/sunrpc/svcsock.c           |  203 ++++++++++++++++++++++++++++------------
 4 files changed, 149 insertions(+), 66 deletions(-)
--
Signature
* [PATCH 2/3] SUNRPC: Fix the TCP write space reservations for deferred requests
From: Trond Myklebust @ 2009-05-18 21:47 UTC (permalink / raw)
To: bfields; +Cc: linux-nfs, nfsv4, Jeff Moyer

Ensure that deferred requests are accounted for correctly by the write
space reservation mechanism. In order to avoid double counting, remove the
reservation when we defer the request, and save any calculated value, so
that we can restore it when the request is requeued.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---

 include/linux/sunrpc/svc.h |    1 +
 net/sunrpc/svc_xprt.c      |   10 +++++-----
 2 files changed, 6 insertions(+), 5 deletions(-)


diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 2a30775..2c373d8 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -341,6 +341,7 @@ struct svc_deferred_req {
 	union svc_addr_u	daddr;	/* where reply must come from */
 	struct cache_deferred_req handle;
 	size_t			xprt_hlen;
+	int			reserved_space;
 	int			argslen;
 	__be32			args[0];
 };
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index c200d92..daa1f27 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -299,7 +299,6 @@ static void svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
  */
 void svc_xprt_enqueue(struct svc_xprt *xprt)
 {
-	struct svc_serv	*serv = xprt->xpt_server;
 	struct svc_pool *pool;
 	struct svc_rqst	*rqstp;
 	int cpu;
@@ -376,8 +375,6 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 			rqstp, rqstp->rq_xprt);
 		rqstp->rq_xprt = xprt;
 		svc_xprt_get(xprt);
-		rqstp->rq_reserved = serv->sv_max_mesg;
-		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 		rqstp->rq_waking = 1;
 		pool->sp_nwaking++;
 		pool->sp_stats.threads_woken++;
@@ -657,8 +654,6 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 	if (xprt) {
 		rqstp->rq_xprt = xprt;
 		svc_xprt_get(xprt);
-		rqstp->rq_reserved = serv->sv_max_mesg;
-		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 	} else {
 		/* No data pending. Go to sleep */
 		svc_thread_enqueue(pool, rqstp);
@@ -741,6 +736,8 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 	dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
 		rqstp, pool->sp_id, xprt,
 		atomic_read(&xprt->xpt_ref.refcount));
+	rqstp->rq_reserved = serv->sv_max_mesg;
+	atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 	rqstp->rq_deferred = svc_deferred_dequeue(xprt);
 	if (rqstp->rq_deferred) {
 		svc_xprt_received(xprt);
@@ -1006,6 +1003,8 @@ static struct cache_deferred_req *svc_defer(struct cache_req *req)
 	}
 	svc_xprt_get(rqstp->rq_xprt);
 	dr->xprt = rqstp->rq_xprt;
+	dr->reserved_space = rqstp->rq_reserved;
+	svc_reserve(rqstp, 0);
 
 	dr->handle.revisit = svc_revisit;
 	return &dr->handle;
@@ -1018,6 +1017,7 @@ static int svc_deferred_recv(struct svc_rqst *rqstp)
 {
 	struct svc_deferred_req *dr = rqstp->rq_deferred;
 
+	svc_reserve(rqstp, dr->reserved_space);
 	/* setup iov_base past transport header */
 	rqstp->rq_arg.head[0].iov_base = dr->args + (dr->xprt_hlen>>2);
 	/* The iov_len does not include the transport header bytes */
* Re: [PATCH 2/3] SUNRPC: Fix the TCP write space reservations for deferred requests
From: J. Bruce Fields @ 2009-06-19 22:23 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, nfsv4, Jeff Moyer

On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote:
> Ensure that deferred requests are accounted for correctly by the write
> space reservation mechanism. In order to avoid double counting, remove the
> reservation when we defer the request, and save any calculated value, so
> that we can restore it when the request is requeued.

I like that it does the addition to xpt_reserved in just one place
instead of two, and carrying over the reserved_space is nice, but I
don't understand the "double counting" comment--where exactly is
something counted twice?

--b.
* Re: [PATCH 2/3] SUNRPC: Fix the TCP write space reservations for deferred requests
From: Trond Myklebust @ 2009-06-20 21:44 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: linux-nfs, nfsv4, Jeff Moyer

On Fri, 2009-06-19 at 18:23 -0400, J. Bruce Fields wrote:
> On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote:
> > Ensure that deferred requests are accounted for correctly by the write
> > space reservation mechanism. In order to avoid double counting, remove the
> > reservation when we defer the request, and save any calculated value, so
> > that we can restore it when the request is requeued.
> 
> I like that it does the addition to xpt_reserved in just one place
> instead of two, and carrying over the reserved_space is nice, but I
> don't understand the "double counting" comment--where exactly is
> something counted twice?

Correct me if I'm wrong, but as far as I can see, the current code bumps
xpt_reserved every time you call svc_xprt_enqueue() irrespective of
whether or not you are going to process a new RPC call, or whether you
are processing a deferred call.

In the latter case, there doesn't appear to be any code that subtracts
from xpt_reserved prior to the re-enqueue process, hence my belief that
we are double counting those events...

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
* Re: [PATCH 2/3] SUNRPC: Fix the TCP write space reservations for deferred requests
From: J. Bruce Fields @ 2009-08-24 21:32 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, nfsv4, Jeff Moyer

On Sat, Jun 20, 2009 at 05:44:08PM -0400, Trond Myklebust wrote:
> On Fri, 2009-06-19 at 18:23 -0400, J. Bruce Fields wrote:
> > On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote:
> > > Ensure that deferred requests are accounted for correctly by the write
> > > space reservation mechanism. In order to avoid double counting, remove the
> > > reservation when we defer the request, and save any calculated value, so
> > > that we can restore it when the request is requeued.
> > 
> > I like that it does the addition to xpt_reserved in just one place
> > instead of two, and carrying over the reserved_space is nice, but I
> > don't understand the "double counting" comment--where exactly is
> > something counted twice?
> 
> Correct me if I'm wrong, but as far as I can see, the current code bumps
> xpt_reserved every time you call svc_xprt_enqueue() irrespective of
> whether or not you are going to process a new RPC call, or whether you
> are processing a deferred call.
> 
> In the latter case, there doesn't appear to be any code that subtracts
> from xpt_reserved prior to the re-enqueue process, hence my belief that
> we are double counting those events...

Sorry for the slow response!  The original request is
dropped--cache_check() returns -EAGAIN on queueing up the deferred
request, the caller returns back to svc_process, and the reservation is
removed from xpt_reserved by svc_drop().  And a new reservation isn't
made until the deferred request is revisited, at which point it's
treated as any new request received from a socket would be.

Your patch would instead have the server keep that reservation while the
request is deferred.  I'm not really sure how to decide which is better.

--b.
* [PATCH 1/3] SUNRPC: Fix the TCP server's send buffer accounting
From: Trond Myklebust @ 2009-05-18 21:47 UTC (permalink / raw)
To: bfields; +Cc: linux-nfs, nfsv4, Jeff Moyer

Currently, the sunrpc server is refusing to allow us to process new RPC
calls if the TCP send buffer is 2/3 full, even if we do actually have
enough free space to guarantee that we can send another request.
The following patch fixes svc_tcp_has_wspace() so that we only stop
processing requests if we know that the socket buffer cannot possibly fit
another reply.

It also fixes the tcp write_space() callback so that we only clear the
SOCK_NOSPACE flag when the TCP send buffer is less than 2/3 full.
This should ensure that the send window will grow as per the standard TCP
socket code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---

 net/sunrpc/svcsock.c |   36 +++++++++++++++++++-----------------
 1 files changed, 19 insertions(+), 17 deletions(-)


diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index af31988..eed978e 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -345,6 +345,7 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
 	lock_sock(sock->sk);
 	sock->sk->sk_sndbuf = snd * 2;
 	sock->sk->sk_rcvbuf = rcv * 2;
+	sock->sk->sk_write_space(sock->sk);
 	release_sock(sock->sk);
 #endif
 }
@@ -386,6 +387,15 @@ static void svc_write_space(struct sock *sk)
 	}
 }
 
+static void svc_tcp_write_space(struct sock *sk)
+{
+	struct socket *sock = sk->sk_socket;
+
+	if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && sock)
+		clear_bit(SOCK_NOSPACE, &sock->flags);
+	svc_write_space(sk);
+}
+
 /*
  * Copy the UDP datagram's destination address to the rqstp structure.
  * The 'destination' address in this case is the address to which the
@@ -617,6 +627,7 @@ static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
 	 * receive and respond to one request.
 	 * svc_udp_recvfrom will re-adjust if necessary
 	 */
+	svsk->sk_sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
 	svc_sock_setbufsize(svsk->sk_sock,
 			    3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
 			    3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
@@ -962,25 +973,16 @@ static void svc_tcp_prep_reply_hdr(struct svc_rqst *rqstp)
 static int svc_tcp_has_wspace(struct svc_xprt *xprt)
 {
 	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
-	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
+	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
 	int required;
-	int wspace;
 
-	/*
-	 * Set the SOCK_NOSPACE flag before checking the available
-	 * sock space.
-	 */
+	if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
+		return 1;
+	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
+	if (sk_stream_wspace(svsk->sk_sk) >= required)
+		return 1;
 	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
-	required = atomic_read(&svsk->sk_xprt.xpt_reserved) + serv->sv_max_mesg;
-	wspace = sk_stream_wspace(svsk->sk_sk);
-
-	if (wspace < sk_stream_min_wspace(svsk->sk_sk))
-		return 0;
-	if (required * 2 > wspace)
-		return 0;
-
-	clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
-	return 1;
+	return 0;
 }
 
 static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
@@ -1036,7 +1038,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 		dprintk("setting up TCP socket for reading\n");
 		sk->sk_state_change = svc_tcp_state_change;
 		sk->sk_data_ready = svc_tcp_data_ready;
-		sk->sk_write_space = svc_write_space;
+		sk->sk_write_space = svc_tcp_write_space;
 
 		svsk->sk_reclen = 0;
 		svsk->sk_tcplen = 0;
* Re: [PATCH 1/3] SUNRPC: Fix the TCP server's send buffer accounting
From: J. Bruce Fields @ 2009-06-19 3:06 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, nfsv4, Jeff Moyer

On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote:
> Currently, the sunrpc server is refusing to allow us to process new RPC
> calls if the TCP send buffer is 2/3 full, even if we do actually have
> enough free space to guarantee that we can send another request.
> The following patch fixes svc_tcp_has_wspace() so that we only stop
> processing requests if we know that the socket buffer cannot possibly fit
> another reply.
> 
> It also fixes the tcp write_space() callback so that we only clear the
> SOCK_NOSPACE flag when the TCP send buffer is less than 2/3 full.
> This should ensure that the send window will grow as per the standard TCP
> socket code.
> 
> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> ---
> 
>  net/sunrpc/svcsock.c |   36 +++++++++++++++++++-----------------
>  1 files changed, 19 insertions(+), 17 deletions(-)
> 
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index af31988..eed978e 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -345,6 +345,7 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
>  	lock_sock(sock->sk);
>  	sock->sk->sk_sndbuf = snd * 2;
>  	sock->sk->sk_rcvbuf = rcv * 2;
> +	sock->sk->sk_write_space(sock->sk);
>  	release_sock(sock->sk);
>  #endif
>  }
> @@ -386,6 +387,15 @@ static void svc_write_space(struct sock *sk)
>  	}
>  }
>  
> +static void svc_tcp_write_space(struct sock *sk)
> +{
> +	struct socket *sock = sk->sk_socket;
> +
> +	if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && sock)
> +		clear_bit(SOCK_NOSPACE, &sock->flags);
> +	svc_write_space(sk);
> +}
> +
>  /*
>   * Copy the UDP datagram's destination address to the rqstp structure.
>   * The 'destination' address in this case is the address to which the
> @@ -617,6 +627,7 @@ static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
>  	 * receive and respond to one request.
>  	 * svc_udp_recvfrom will re-adjust if necessary
>  	 */
> +	svsk->sk_sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
>  	svc_sock_setbufsize(svsk->sk_sock,
>  			    3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
>  			    3 * svsk->sk_xprt.xpt_server->sv_max_mesg);

This chunk looks like an unrelated fix to the reverted tcp autotuning
patch.  If that's right: I'll apply the rest of this patch without this
chunk, then include this chunk in the tcp autotuning patch when
reapplying it.

--b.
* [PATCH 3/3] SUNRPC: Fix svc_tcp_recvfrom()
From: Trond Myklebust @ 2009-05-18 21:47 UTC (permalink / raw)
To: bfields; +Cc: linux-nfs, nfsv4, Jeff Moyer

Ensure that we immediately read and buffer data from the incoming TCP
stream so that we grow the receive window quickly, and don't deadlock on
large READ or WRITE requests.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---

 include/linux/sunrpc/svcsock.h |    1 
 net/sunrpc/svcsock.c           |  167 +++++++++++++++++++++++++++++-----------
 2 files changed, 124 insertions(+), 44 deletions(-)


diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index 483e103..b0b4546 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -28,6 +28,7 @@ struct svc_sock {
 	/* private TCP part */
 	u32			sk_reclen;	/* length of record */
 	u32			sk_tcplen;	/* current read length */
+	struct page *		sk_pages[RPCSVC_MAXPAGES];	/* received data */
 };
 
 /*
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index eed978e..7dd65b0 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -323,6 +323,33 @@ static int svc_recvfrom(struct svc_rqst *rqstp, struct kvec *iov, int nr,
 	return len;
 }
 
+static int svc_partial_recvfrom(struct svc_rqst *rqstp,
+				struct kvec *iov, int nr,
+				int buflen, unsigned int base)
+{
+	size_t save_iovlen;
+	void __user *save_iovbase;
+	unsigned int i;
+	int ret;
+
+	if (base == 0)
+		return svc_recvfrom(rqstp, iov, nr, buflen);
+
+	for (i = 0; i < nr; i++) {
+		if (iov[i].iov_len > base)
+			break;
+		base -= iov[i].iov_len;
+	}
+	save_iovlen = iov[i].iov_len;
+	save_iovbase = iov[i].iov_base;
+	iov[i].iov_len -= base;
+	iov[i].iov_base += base;
+	ret = svc_recvfrom(rqstp, &iov[i], nr - i, buflen);
+	iov[i].iov_len = save_iovlen;
+	iov[i].iov_base = save_iovbase;
+	return ret;
+}
+
 /*
  * Set socket snd and rcv buffer lengths
  */
@@ -790,6 +817,56 @@ failed:
 	return NULL;
 }
 
+static unsigned int svc_tcp_restore_pages(struct svc_sock *svsk, struct svc_rqst *rqstp)
+{
+	unsigned int i, len, npages;
+
+	if (svsk->sk_tcplen <= sizeof(rpc_fraghdr))
+		return 0;
+	len = svsk->sk_tcplen - sizeof(rpc_fraghdr);
+	npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < npages; i++) {
+		if (rqstp->rq_pages[i] != NULL)
+			put_page(rqstp->rq_pages[i]);
+		BUG_ON(svsk->sk_pages[i] == NULL);
+		rqstp->rq_pages[i] = svsk->sk_pages[i];
+		svsk->sk_pages[i] = NULL;
+	}
+	rqstp->rq_arg.head[0].iov_base = page_address(rqstp->rq_pages[0]);
+	return len;
+}
+
+static void svc_tcp_save_pages(struct svc_sock *svsk, struct svc_rqst *rqstp)
+{
+	unsigned int i, len, npages;
+
+	if (svsk->sk_tcplen <= sizeof(rpc_fraghdr))
+		return;
+	len = svsk->sk_tcplen - sizeof(rpc_fraghdr);
+	npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < npages; i++) {
+		svsk->sk_pages[i] = rqstp->rq_pages[i];
+		rqstp->rq_pages[i] = NULL;
+	}
+}
+
+static void svc_tcp_clear_pages(struct svc_sock *svsk)
+{
+	unsigned int i, len, npages;
+
+	if (svsk->sk_tcplen <= sizeof(rpc_fraghdr))
+		goto out;
+	len = svsk->sk_tcplen - sizeof(rpc_fraghdr);
+	npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	for (i = 0; i < npages; i++) {
+		BUG_ON(svsk->sk_pages[i] == NULL);
+		put_page(svsk->sk_pages[i]);
+		svsk->sk_pages[i] = NULL;
+	}
+out:
+	svsk->sk_tcplen = 0;
+}
+
 /*
  * Receive data from a TCP socket.
  */
@@ -800,7 +877,8 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	struct svc_serv	*serv = svsk->sk_xprt.xpt_server;
 	int		len;
 	struct kvec *vec;
-	int pnum, vlen;
+	unsigned int want, base, vlen;
+	int pnum;
 
 	dprintk("svc: tcp_recv %p data %d conn %d close %d\n",
 		svsk, test_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags),
@@ -814,9 +892,9 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	 * possible up to the complete record length.
 	 */
 	if (svsk->sk_tcplen < sizeof(rpc_fraghdr)) {
-		int		want = sizeof(rpc_fraghdr) - svsk->sk_tcplen;
 		struct kvec	iov;
 
+		want = sizeof(rpc_fraghdr) - svsk->sk_tcplen;
 		iov.iov_base = ((char *) &svsk->sk_reclen) + svsk->sk_tcplen;
 		iov.iov_len  = want;
 		if ((len = svc_recvfrom(rqstp, &iov, 1, want)) < 0)
@@ -826,8 +904,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		if (len < want) {
 			dprintk("svc: short recvfrom while reading record "
 				"length (%d of %d)\n", len, want);
-			svc_xprt_received(&svsk->sk_xprt);
-			return -EAGAIN; /* record header not complete */
+			goto err_noclose;
 		}
 
 		svsk->sk_reclen = ntohl(svsk->sk_reclen);
@@ -853,25 +930,14 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		}
 	}
 
-	/* Check whether enough data is available */
-	len = svc_recv_available(svsk);
-	if (len < 0)
-		goto error;
-
-	if (len < svsk->sk_reclen) {
-		dprintk("svc: incomplete TCP record (%d of %d)\n",
-			len, svsk->sk_reclen);
-		svc_xprt_received(&svsk->sk_xprt);
-		return -EAGAIN;	/* record not complete */
-	}
-	len = svsk->sk_reclen;
-	set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+	base = svc_tcp_restore_pages(svsk, rqstp);
+	want = svsk->sk_reclen - base;
 
 	vec = rqstp->rq_vec;
 	vec[0] = rqstp->rq_arg.head[0];
 	vlen = PAGE_SIZE;
 	pnum = 1;
-	while (vlen < len) {
+	while (vlen < svsk->sk_reclen) {
 		vec[pnum].iov_base = page_address(rqstp->rq_pages[pnum]);
 		vec[pnum].iov_len = PAGE_SIZE;
 		pnum++;
@@ -880,19 +946,26 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	rqstp->rq_respages = &rqstp->rq_pages[pnum];
 
 	/* Now receive data */
-	len = svc_recvfrom(rqstp, vec, pnum, len);
-	if (len < 0)
-		goto error;
+	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+	len = svc_partial_recvfrom(rqstp, vec, pnum, want, base);
+	if (len != want) {
+		if (len >= 0)
+			svsk->sk_tcplen += len;
+		else if (len != -EAGAIN)
+			goto err_other;
+		svc_tcp_save_pages(svsk, rqstp);
+		dprintk("svc: incomplete TCP record (%d of %d)\n",
+			svsk->sk_tcplen, svsk->sk_reclen);
+		goto err_noclose;
+	}
 
-	dprintk("svc: TCP complete record (%d bytes)\n", len);
-	rqstp->rq_arg.len = len;
+	rqstp->rq_arg.len = svsk->sk_reclen;
 	rqstp->rq_arg.page_base = 0;
-	if (len <= rqstp->rq_arg.head[0].iov_len) {
-		rqstp->rq_arg.head[0].iov_len = len;
+	if (rqstp->rq_arg.len <= rqstp->rq_arg.head[0].iov_len) {
+		rqstp->rq_arg.head[0].iov_len = rqstp->rq_arg.len;
 		rqstp->rq_arg.page_len = 0;
-	} else {
-		rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
-	}
+	} else
+		rqstp->rq_arg.page_len = rqstp->rq_arg.len - rqstp->rq_arg.head[0].iov_len;
 
 	rqstp->rq_xprt_ctxt   = NULL;
 	rqstp->rq_prot	      = IPPROTO_TCP;
@@ -900,29 +973,32 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	/* Reset TCP read info */
 	svsk->sk_reclen = 0;
 	svsk->sk_tcplen = 0;
+	/* If we have more data, signal svc_xprt_enqueue() to try again */
+	if (svc_recv_available(svsk) > sizeof(rpc_fraghdr))
+		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
+
 	svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
 	svc_xprt_received(&svsk->sk_xprt);
 	if (serv->sv_stats)
 		serv->sv_stats->nettcpcnt++;
 
-	return len;
-
- err_delete:
+	dprintk("svc: TCP complete record (%d bytes)\n", rqstp->rq_arg.len);
+	return rqstp->rq_arg.len;
+error:
+	if (len == -EAGAIN)
+		goto err_got_eagain;
+err_other:
+	printk(KERN_NOTICE "%s: recvfrom returned errno %d\n",
+	       svsk->sk_xprt.xpt_server->sv_name, -len);
+err_delete:
 	set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
 	return -EAGAIN;
-
- error:
-	if (len == -EAGAIN) {
-		dprintk("RPC: TCP recvfrom got EAGAIN\n");
-		svc_xprt_received(&svsk->sk_xprt);
-	} else {
-		printk(KERN_NOTICE "%s: recvfrom returned errno %d\n",
-		       svsk->sk_xprt.xpt_server->sv_name, -len);
-		goto err_delete;
-	}
-
-	return len;
+err_got_eagain:
+	dprintk("RPC: TCP recvfrom got EAGAIN\n");
+err_noclose:
+	svc_xprt_received(&svsk->sk_xprt);
+	return -EAGAIN;	/* record not complete */
 }
 
 /*
@@ -1042,6 +1118,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 
 		svsk->sk_reclen = 0;
 		svsk->sk_tcplen = 0;
+		memset(&svsk->sk_pages[0], 0, sizeof(svsk->sk_pages));
 
 		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
 
@@ -1290,8 +1367,10 @@ static void svc_tcp_sock_detach(struct svc_xprt *xprt)
 
 	svc_sock_detach(xprt);
 
-	if (!test_bit(XPT_LISTENER, &xprt->xpt_flags))
+	if (!test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
+		svc_tcp_clear_pages(svsk);
 		kernel_sock_shutdown(svsk->sk_sock, SHUT_RDWR);
+	}
 }
 
 /*
* Re: [PATCH 3/3] SUNRPC: Fix svc_tcp_recvfrom()
From: J. Bruce Fields @ 2010-03-18 21:21 UTC (permalink / raw)
To: Trond Myklebust; +Cc: linux-nfs, nfsv4, Jeff Moyer

On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote:
> Ensure that we immediately read and buffer data from the incoming TCP
> stream so that we grow the receive window quickly, and don't deadlock on
> large READ or WRITE requests.

So, I dropped this patch earlier because although I don't have a serious
problem with it, I also couldn't explain why exactly it would explain
the performance differences we were seeing, and I was afraid we might be
papering over some other (as yet unfound) problem--so it seemed safest
just to revert the original buffer changes for the time being and leave
this alone.

However: I'm recently reminded of a different problem, which is this:

		if (!(svsk->sk_reclen & RPC_LAST_STREAM_FRAGMENT)) {
			/* FIXME: technically, a record can be fragmented,
			 * and non-terminal fragments will not have the top
			 * bit set in the fragment length header. But
			 * apparently no known nfs clients send fragmented
			 * records. */
			if (net_ratelimit())
				printk(KERN_NOTICE "RPC: multiple fragments "
				       "per record not supported\n");
			goto err_delete;
		}

(Reminded because libtirpc does in fact send these fragmented records
(why, I have no idea), so the CITI windows client (which uses libtirpc)
hits this case.  Hah.  In any case, this is a server bug, and we really
should fix it.)

The current server behavior is to leave the data in the network buffers
until it's sure it has a full request.  It seems hard to do that if the
server has to go dig through the network buffers to find all the record
markers.  (Is there even a reasonable way to do that?)

So I wonder whether I should rebase this to the most recent kernel and
then try implementing the rpc stream fragment handling on top.

--b.
dprintk("svc: incomplete TCP record (%d of %d)\n", > + svsk->sk_tcplen, svsk->sk_reclen); > + goto err_noclose; > + } > > - dprintk("svc: TCP complete record (%d bytes)\n", len); > - rqstp->rq_arg.len = len; > + rqstp->rq_arg.len = svsk->sk_reclen; > rqstp->rq_arg.page_base = 0; > - if (len <= rqstp->rq_arg.head[0].iov_len) { > - rqstp->rq_arg.head[0].iov_len = len; > + if (rqstp->rq_arg.len <= rqstp->rq_arg.head[0].iov_len) { > + rqstp->rq_arg.head[0].iov_len = rqstp->rq_arg.len; > rqstp->rq_arg.page_len = 0; > - } else { > - rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len; > - } > + } else > + rqstp->rq_arg.page_len = rqstp->rq_arg.len - rqstp->rq_arg.head[0].iov_len; > > rqstp->rq_xprt_ctxt = NULL; > rqstp->rq_prot = IPPROTO_TCP; > @@ -900,29 +973,32 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp) > /* Reset TCP read info */ > svsk->sk_reclen = 0; > svsk->sk_tcplen = 0; > + /* If we have more data, signal svc_xprt_enqueue() to try again */ > + if (svc_recv_available(svsk) > sizeof(rpc_fraghdr)) > + set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); > + > > svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt); > svc_xprt_received(&svsk->sk_xprt); > if (serv->sv_stats) > serv->sv_stats->nettcpcnt++; > > - return len; > - > - err_delete: > + dprintk("svc: TCP complete record (%d bytes)\n", rqstp->rq_arg.len); > + return rqstp->rq_arg.len; > +error: > + if (len == -EAGAIN) > + goto err_got_eagain; > +err_other: > + printk(KERN_NOTICE "%s: recvfrom returned errno %d\n", > + svsk->sk_xprt.xpt_server->sv_name, -len); > +err_delete: > set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags); > return -EAGAIN; > - > - error: > - if (len == -EAGAIN) { > - dprintk("RPC: TCP recvfrom got EAGAIN\n"); > - svc_xprt_received(&svsk->sk_xprt); > - } else { > - printk(KERN_NOTICE "%s: recvfrom returned errno %d\n", > - svsk->sk_xprt.xpt_server->sv_name, -len); > - goto err_delete; > - } > - > - return len; > +err_got_eagain: > + dprintk("RPC: TCP recvfrom got EAGAIN\n"); > +err_noclose: > 
+ svc_xprt_received(&svsk->sk_xprt); > + return -EAGAIN; /* record not complete */ > } > > /* > @@ -1042,6 +1118,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv) > > svsk->sk_reclen = 0; > svsk->sk_tcplen = 0; > + memset(&svsk->sk_pages[0], 0, sizeof(svsk->sk_pages)); > > tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF; > > @@ -1290,8 +1367,10 @@ static void svc_tcp_sock_detach(struct svc_xprt *xprt) > > svc_sock_detach(xprt); > > - if (!test_bit(XPT_LISTENER, &xprt->xpt_flags)) > + if (!test_bit(XPT_LISTENER, &xprt->xpt_flags)) { > + svc_tcp_clear_pages(svsk); > kernel_sock_shutdown(svsk->sk_sock, SHUT_RDWR); > + } > } > > /* > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html _______________________________________________ NFSv4 mailing list NFSv4@linux-nfs.org http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4 ^ permalink raw reply [flat|nested] 18+ messages in thread
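For readers following the fragmentation discussion above: RPC record marking over TCP (RFC 5531) prefixes every fragment with a 4-byte big-endian header whose top bit (`RPC_LAST_STREAM_FRAGMENT`) flags the final fragment of a record, with the remaining 31 bits giving the fragment length. A minimal user-space sketch of reassembling one record from a fully buffered byte stream looks like this (illustrative only, not the kernel code; the kernel cannot assume the whole record is already buffered, which is exactly the difficulty raised above):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RPC_LAST_STREAM_FRAGMENT 0x80000000u

/* Reassemble one record-marked RPC record from buf into out.
 * Returns the record length, or -1 on truncated/oversized input. */
static int rpc_reassemble(const unsigned char *buf, size_t buflen,
			  unsigned char *out, size_t outlen)
{
	size_t off = 0, reclen = 0;

	for (;;) {
		uint32_t hdr, fraglen;

		if (off + 4 > buflen)
			return -1;	/* fragment header not complete */
		hdr = (uint32_t)buf[off] << 24 | (uint32_t)buf[off + 1] << 16 |
		      (uint32_t)buf[off + 2] << 8 | buf[off + 3];
		off += 4;
		fraglen = hdr & ~RPC_LAST_STREAM_FRAGMENT;
		if (off + fraglen > buflen || reclen + fraglen > outlen)
			return -1;	/* fragment body not complete */
		memcpy(out + reclen, buf + off, fraglen);
		off += fraglen;
		reclen += fraglen;
		if (hdr & RPC_LAST_STREAM_FRAGMENT)
			return (int)reclen;
		/* non-terminal fragment: keep reading -- this is the case
		 * the server code quoted above rejects */
	}
}
```

Note that a non-terminal fragment is indistinguishable from a terminal one except for the top bit, so a server that wants to defer reading until a full request is available has to walk every fragment header, which is why buffering the data as it arrives (as this patch does) makes fragment support tractable.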
* Re: [PATCH 3/3] SUNRPC: Fix svc_tcp_recvfrom() 2010-03-18 21:21 ` J. Bruce Fields @ 2010-04-02 21:00 ` J. Bruce Fields 0 siblings, 0 replies; 18+ messages in thread From: J. Bruce Fields @ 2010-04-02 21:00 UTC (permalink / raw) To: Trond Myklebust; +Cc: linux-nfs, nfsv4, Jeff Moyer On Thu, Mar 18, 2010 at 05:21:53PM -0400, J. Bruce Fields wrote: > On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote: > > Ensure that we immediately read and buffer data from the incoming TCP > > stream so that we grow the receive window quickly, and don't deadlock on > > large READ or WRITE requests. > > So, I dropped this patch earlier because although I don't have a serious > problem with it, I also couldn't explain why exactly it would explain > the performance differences we were seeing, and I was afraid we might be > papering over some other (as yet unfound) problem--so it seemed safest > just to revert the original buffer changes for the time being and leave > this alone. > > However: I'm recently reminded of a different problem, which is this: > > if (!(svsk->sk_reclen & RPC_LAST_STREAM_FRAGMENT)) { > /* FIXME: technically, a record can be fragmented, and > * non-terminal fragments will not have the top bit set > * in the fragment length header. But apparently no > * known nfs clients send fragmented records. */ > if (net_ratelimit()) > printk(KERN_NOTICE "RPC: multiple fragments " > "per record not supported\n"); > goto err_delete; > } > > (Reminded because libtirpc does in fact send these fragmented records > (why, I have no idea), so the CITI windows client (which uses libtirpc) > hits this case. Hah. In any case, this is a server bug, and we really > should fix it.) > > The current server behavior is to leave the data in the network buffers > until it's sure it has a full request. It seems hard to do that if the > server has to go dig through the network buffers to find all the record > markers. (Is there even a reasonable way to do that?) 
> > So I wonder whether I should rebase this to the most recent kernel and > then try implementing the rpc stream fragment handling on top. I think you said you had a copy of this patch that you've already updated? --b. > > --b. > > > > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> > > --- > > > > include/linux/sunrpc/svcsock.h | 1 > > net/sunrpc/svcsock.c | 167 +++++++++++++++++++++++++++++----------- > > 2 files changed, 124 insertions(+), 44 deletions(-) > > > > > > diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h > > index 483e103..b0b4546 100644 > > --- a/include/linux/sunrpc/svcsock.h > > +++ b/include/linux/sunrpc/svcsock.h > > @@ -28,6 +28,7 @@ struct svc_sock { > > /* private TCP part */ > > u32 sk_reclen; /* length of record */ > > u32 sk_tcplen; /* current read length */ > > + struct page * sk_pages[RPCSVC_MAXPAGES]; /* received data */ > > }; > > > > /* > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c > > index eed978e..7dd65b0 100644 > > --- a/net/sunrpc/svcsock.c > > +++ b/net/sunrpc/svcsock.c > > @@ -323,6 +323,33 @@ static int svc_recvfrom(struct svc_rqst *rqstp, struct kvec *iov, int nr, > > return len; > > } > > > > +static int svc_partial_recvfrom(struct svc_rqst *rqstp, > > + struct kvec *iov, int nr, > > + int buflen, unsigned int base) > > +{ > > + size_t save_iovlen; > > + void __user *save_iovbase; > > + unsigned int i; > > + int ret; > > + > > + if (base == 0) > > + return svc_recvfrom(rqstp, iov, nr, buflen); > > + > > + for (i = 0; i < nr; i++) { > > + if (iov[i].iov_len > base) > > + break; > > + base -= iov[i].iov_len; > > + } > > + save_iovlen = iov[i].iov_len; > > + save_iovbase = iov[i].iov_base; > > + iov[i].iov_len -= base; > > + iov[i].iov_base += base; > > + ret = svc_recvfrom(rqstp, &iov[i], nr - i, buflen); > > + iov[i].iov_len = save_iovlen; > > + iov[i].iov_base = save_iovbase; > > + return ret; > > +} > > + > > /* > > * Set socket snd and rcv buffer lengths > > */ > > 
@@ -790,6 +817,56 @@ failed: > > return NULL; > > } > > > > +static unsigned int svc_tcp_restore_pages(struct svc_sock *svsk, struct svc_rqst *rqstp) > > +{ > > + unsigned int i, len, npages; > > + > > + if (svsk->sk_tcplen <= sizeof(rpc_fraghdr)) > > + return 0; > > + len = svsk->sk_tcplen - sizeof(rpc_fraghdr); > > + npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT; > > + for (i = 0; i < npages; i++) { > > + if (rqstp->rq_pages[i] != NULL) > > + put_page(rqstp->rq_pages[i]); > > + BUG_ON(svsk->sk_pages[i] == NULL); > > + rqstp->rq_pages[i] = svsk->sk_pages[i]; > > + svsk->sk_pages[i] = NULL; > > + } > > + rqstp->rq_arg.head[0].iov_base = page_address(rqstp->rq_pages[0]); > > + return len; > > +} > > + > > +static void svc_tcp_save_pages(struct svc_sock *svsk, struct svc_rqst *rqstp) > > +{ > > + unsigned int i, len, npages; > > + > > + if (svsk->sk_tcplen <= sizeof(rpc_fraghdr)) > > + return; > > + len = svsk->sk_tcplen - sizeof(rpc_fraghdr); > > + npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT; > > + for (i = 0; i < npages; i++) { > > + svsk->sk_pages[i] = rqstp->rq_pages[i]; > > + rqstp->rq_pages[i] = NULL; > > + } > > +} > > + > > +static void svc_tcp_clear_pages(struct svc_sock *svsk) > > +{ > > + unsigned int i, len, npages; > > + > > + if (svsk->sk_tcplen <= sizeof(rpc_fraghdr)) > > + goto out; > > + len = svsk->sk_tcplen - sizeof(rpc_fraghdr); > > + npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT; > > + for (i = 0; i < npages; i++) { > > + BUG_ON(svsk->sk_pages[i] == NULL); > > + put_page(svsk->sk_pages[i]); > > + svsk->sk_pages[i] = NULL; > > + } > > +out: > > + svsk->sk_tcplen = 0; > > +} > > + > > /* > > * Receive data from a TCP socket. 
> > */ > > @@ -800,7 +877,8 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp) > > struct svc_serv *serv = svsk->sk_xprt.xpt_server; > > int len; > > struct kvec *vec; > > - int pnum, vlen; > > + unsigned int want, base, vlen; > > + int pnum; > > > > dprintk("svc: tcp_recv %p data %d conn %d close %d\n", > > svsk, test_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags), > > @@ -814,9 +892,9 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp) > > * possible up to the complete record length. > > */ > > if (svsk->sk_tcplen < sizeof(rpc_fraghdr)) { > > - int want = sizeof(rpc_fraghdr) - svsk->sk_tcplen; > > struct kvec iov; > > > > + want = sizeof(rpc_fraghdr) - svsk->sk_tcplen; > > iov.iov_base = ((char *) &svsk->sk_reclen) + svsk->sk_tcplen; > > iov.iov_len = want; > > if ((len = svc_recvfrom(rqstp, &iov, 1, want)) < 0) > > @@ -826,8 +904,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp) > > if (len < want) { > > dprintk("svc: short recvfrom while reading record " > > "length (%d of %d)\n", len, want); > > - svc_xprt_received(&svsk->sk_xprt); > > - return -EAGAIN; /* record header not complete */ > > + goto err_noclose; > > } > > > > svsk->sk_reclen = ntohl(svsk->sk_reclen); > > @@ -853,25 +930,14 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp) > > } > > } > > > > - /* Check whether enough data is available */ > > - len = svc_recv_available(svsk); > > - if (len < 0) > > - goto error; > > - > > - if (len < svsk->sk_reclen) { > > - dprintk("svc: incomplete TCP record (%d of %d)\n", > > - len, svsk->sk_reclen); > > - svc_xprt_received(&svsk->sk_xprt); > > - return -EAGAIN; /* record not complete */ > > - } > > - len = svsk->sk_reclen; > > - set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); > > + base = svc_tcp_restore_pages(svsk, rqstp); > > + want = svsk->sk_reclen - base; > > > > vec = rqstp->rq_vec; > > vec[0] = rqstp->rq_arg.head[0]; > > vlen = PAGE_SIZE; > > pnum = 1; > > - while (vlen < len) { > > + while (vlen < svsk->sk_reclen) { > > vec[pnum].iov_base = 
page_address(rqstp->rq_pages[pnum]); > > vec[pnum].iov_len = PAGE_SIZE; > > pnum++; > > @@ -880,19 +946,26 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp) > > rqstp->rq_respages = &rqstp->rq_pages[pnum]; > > > > /* Now receive data */ > > - len = svc_recvfrom(rqstp, vec, pnum, len); > > - if (len < 0) > > - goto error; > > + clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); > > + len = svc_partial_recvfrom(rqstp, vec, pnum, want, base); > > + if (len != want) { > > + if (len >= 0) > > + svsk->sk_tcplen += len; > > + else if (len != -EAGAIN) > > + goto err_other; > > + svc_tcp_save_pages(svsk, rqstp); > > + dprintk("svc: incomplete TCP record (%d of %d)\n", > > + svsk->sk_tcplen, svsk->sk_reclen); > > + goto err_noclose; > > + } > > > > - dprintk("svc: TCP complete record (%d bytes)\n", len); > > - rqstp->rq_arg.len = len; > > + rqstp->rq_arg.len = svsk->sk_reclen; > > rqstp->rq_arg.page_base = 0; > > - if (len <= rqstp->rq_arg.head[0].iov_len) { > > - rqstp->rq_arg.head[0].iov_len = len; > > + if (rqstp->rq_arg.len <= rqstp->rq_arg.head[0].iov_len) { > > + rqstp->rq_arg.head[0].iov_len = rqstp->rq_arg.len; > > rqstp->rq_arg.page_len = 0; > > - } else { > > - rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len; > > - } > > + } else > > + rqstp->rq_arg.page_len = rqstp->rq_arg.len - rqstp->rq_arg.head[0].iov_len; > > > > rqstp->rq_xprt_ctxt = NULL; > > rqstp->rq_prot = IPPROTO_TCP; > > @@ -900,29 +973,32 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp) > > /* Reset TCP read info */ > > svsk->sk_reclen = 0; > > svsk->sk_tcplen = 0; > > + /* If we have more data, signal svc_xprt_enqueue() to try again */ > > + if (svc_recv_available(svsk) > sizeof(rpc_fraghdr)) > > + set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags); > > + > > > > svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt); > > svc_xprt_received(&svsk->sk_xprt); > > if (serv->sv_stats) > > serv->sv_stats->nettcpcnt++; > > > > - return len; > > - > > - err_delete: > > + dprintk("svc: TCP complete record (%d 
bytes)\n", rqstp->rq_arg.len); > > + return rqstp->rq_arg.len; > > +error: > > + if (len == -EAGAIN) > > + goto err_got_eagain; > > +err_other: > > + printk(KERN_NOTICE "%s: recvfrom returned errno %d\n", > > + svsk->sk_xprt.xpt_server->sv_name, -len); > > +err_delete: > > set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags); > > return -EAGAIN; > > - > > - error: > > - if (len == -EAGAIN) { > > - dprintk("RPC: TCP recvfrom got EAGAIN\n"); > > - svc_xprt_received(&svsk->sk_xprt); > > - } else { > > - printk(KERN_NOTICE "%s: recvfrom returned errno %d\n", > > - svsk->sk_xprt.xpt_server->sv_name, -len); > > - goto err_delete; > > - } > > - > > - return len; > > +err_got_eagain: > > + dprintk("RPC: TCP recvfrom got EAGAIN\n"); > > +err_noclose: > > + svc_xprt_received(&svsk->sk_xprt); > > + return -EAGAIN; /* record not complete */ > > } > > > > /* > > @@ -1042,6 +1118,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv) > > > > svsk->sk_reclen = 0; > > svsk->sk_tcplen = 0; > > + memset(&svsk->sk_pages[0], 0, sizeof(svsk->sk_pages)); > > > > tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF; > > > > @@ -1290,8 +1367,10 @@ static void svc_tcp_sock_detach(struct svc_xprt *xprt) > > > > svc_sock_detach(xprt); > > > > - if (!test_bit(XPT_LISTENER, &xprt->xpt_flags)) > > + if (!test_bit(XPT_LISTENER, &xprt->xpt_flags)) { > > + svc_tcp_clear_pages(svsk); > > kernel_sock_shutdown(svsk->sk_sock, SHUT_RDWR); > > + } > > } > > > > /* > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > _______________________________________________ > NFSv4 mailing list > NFSv4@linux-nfs.org > http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4 ^ permalink raw reply [flat|nested] 18+ messages in thread
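The `svc_partial_recvfrom()` helper in the patch quoted above resumes a partial read by temporarily shortening the iovec element that contains offset `base`, doing the receive, then restoring the saved length and base pointer. The same bookkeeping can be sketched in user space, with a plain `memcpy` standing in for the socket receive (names here are illustrative, not kernel API; as in the kernel code, the caller must guarantee `base` is less than the total iovec length):

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

/* Stand-in for svc_recvfrom(): fill the iovec array from src. */
static ssize_t fill_iov(struct iovec *iov, int nr,
			const char *src, size_t srclen)
{
	size_t copied = 0;

	for (int i = 0; i < nr && copied < srclen; i++) {
		size_t n = iov[i].iov_len < srclen - copied ?
			   iov[i].iov_len : srclen - copied;
		memcpy(iov[i].iov_base, src + copied, n);
		copied += n;
	}
	return (ssize_t)copied;
}

/* Mirror of svc_partial_recvfrom(): skip `base` bytes already read,
 * transfer into the remainder, then restore the modified element. */
static ssize_t partial_fill(struct iovec *iov, int nr, unsigned int base,
			    const char *src, size_t srclen)
{
	size_t save_iovlen;
	void *save_iovbase;
	ssize_t ret;
	int i;

	if (base == 0)
		return fill_iov(iov, nr, src, srclen);

	/* Find the element containing offset `base`. */
	for (i = 0; i < nr; i++) {
		if (iov[i].iov_len > base)
			break;
		base -= iov[i].iov_len;
	}
	save_iovlen = iov[i].iov_len;
	save_iovbase = iov[i].iov_base;
	iov[i].iov_len -= base;
	iov[i].iov_base = (char *)iov[i].iov_base + base;
	ret = fill_iov(&iov[i], nr - i, src, srclen);
	iov[i].iov_len = save_iovlen;
	iov[i].iov_base = save_iovbase;
	return ret;
}
```

Restoring the saved values matters because the caller's iovec array describes the request pages and is reused on the next pass through `svc_tcp_recvfrom()`.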
* Re: [PATCH 0/3] Fix the Linux rpc-over-tcp server performance [not found] ` <20090518214756.786.28129.stgit-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> ` (2 preceding siblings ...) 2009-05-18 21:47 ` [PATCH 3/3] SUNRPC: Fix svc_tcp_recvfrom() Trond Myklebust @ 2009-05-19 15:14 ` Jeff Moyer 3 siblings, 0 replies; 18+ messages in thread From: Jeff Moyer @ 2009-05-19 15:14 UTC (permalink / raw) To: Trond Myklebust; +Cc: bfields, linux-nfs, nfsv4 Trond Myklebust <Trond.Myklebust@netapp.com> writes: > I squashed the previous set of 4 incremental patches into 3. Otherwise > there should be no differences w.r.t. the set that Jeff tested. I applied your original patches, then reverted these three, and the resulting tree shows that they are, in fact, the same. So... Tested-by: Jeff Moyer <jmoyer@redhat.com> > --- > > Trond Myklebust (3): > SUNRPC: Fix svc_tcp_recvfrom() > SUNRPC: Fix the TCP write space reservations for deferred requests > SUNRPC: Fix the TCP server's send buffer accounting > > > include/linux/sunrpc/svc.h | 1 > include/linux/sunrpc/svcsock.h | 1 > net/sunrpc/svc_xprt.c | 10 +- > net/sunrpc/svcsock.c | 203 ++++++++++++++++++++++++++++------------ > 4 files changed, 149 insertions(+), 66 deletions(-) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/3] Fix the Linux rpc-over-tcp server performance 2009-05-18 21:47 [PATCH 0/3] Fix the Linux rpc-over-tcp server performance Trond Myklebust [not found] ` <20090518214756.786.28129.stgit-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org> @ 2009-08-12 2:43 ` J. Bruce Fields 2009-08-12 13:22 ` Jeff Moyer 1 sibling, 1 reply; 18+ messages in thread From: J. Bruce Fields @ 2009-08-12 2:43 UTC (permalink / raw) To: Trond Myklebust; +Cc: linux-nfs, nfsv4, Jeff Moyer On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote: > I squashed the previous set of 4 incremental patches into 3. Otherwise > there should be no differences w.r.t. the set that Jeff tested. Apologies for the long delay.... Unfortunately, I can't reproduce any of this at all: I reliably get about 112MB/s regardless of what combination of these patches I apply (including none). This is over gigabit ethernet to a server exporting a filesystem on raid 0 over 3 sata disks which iozone locally reports getting just over 200MB/s reads from. Any suggestions? --b. _______________________________________________ NFSv4 mailing list NFSv4@linux-nfs.org http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/3] Fix the Linux rpc-over-tcp server performance 2009-08-12 2:43 ` J. Bruce Fields @ 2009-08-12 13:22 ` Jeff Moyer 2009-08-12 14:20 ` J. Bruce Fields 0 siblings, 1 reply; 18+ messages in thread From: Jeff Moyer @ 2009-08-12 13:22 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Trond Myklebust, linux-nfs, nfsv4 "J. Bruce Fields" <bfields@fieldses.org> writes: > On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote: >> I squashed the previous set of 4 incremental patches into 3. Otherwise >> there should be no differences w.r.t. the set that Jeff tested. > > Apologies for the long delay.... Unfortunately, I can't reproduce any of > this at all: I reliably get about 112MB/s regardless of what combination > of these patches I apply (including none). This is over gigabit > ethernet to a server exporting a filesystem on raid 0 over 3 sata disks > which iozone locally reports getting just over 200MB/s reads from. > > Any suggestions? Well, you gave me nothing to go on here, Bruce! I assume you're using the deadline I/O scheduler on the NFS server, is that right? If not, you should be. Second, are you using iozone to reproduce? iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w That's the command line I was using. Third, I reproduced this on 2.6.30-rc1. Perhaps you should start there and make sure you at least see the same problem on that kernel. Otherwise, maybe we've made up for the performance elsewhere. Finally, didn't you revert the autotuning patch? If so, you wouldn't see this problem. Let me know how this goes, and if you still can't reproduce it I'll setup for testing it here. Cheers, Jeff ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/3] Fix the Linux rpc-over-tcp server performance 2009-08-12 13:22 ` Jeff Moyer @ 2009-08-12 14:20 ` J. Bruce Fields 2009-08-12 17:02 ` Jeff Moyer 0 siblings, 1 reply; 18+ messages in thread From: J. Bruce Fields @ 2009-08-12 14:20 UTC (permalink / raw) To: Jeff Moyer; +Cc: nfsv4, linux-nfs, Trond Myklebust On Wed, Aug 12, 2009 at 09:22:17AM -0400, Jeff Moyer wrote: > "J. Bruce Fields" <bfields@fieldses.org> writes: > > > On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote: > >> I squashed the previous set of 4 incremental patches into 3. Otherwise > >> there should be no differences w.r.t. the set that Jeff tested. > > > > Apologies for the long delay.... Unfortunately, I can't reproduce any of > > this at all: I reliably get about 112MB/s regardless of what combination > > of these patches I apply (including none). This is over gigabit > > ethernet to a server exporting a filesystem on raid 0 over 3 sata disks > > which iozone locally reports getting just over 200MB/s reads from. > > > > Any suggestions? > > Well, you gave me nothing to go on here, Bruce! Apologies for the lack of details.... > I assume you're using > the deadline I/O scheduler on the NFS server, is that right? If not, > you should be. Oops, sorry, no. Looks like it doesn't allow setting a scheduler on md0, so I'm assuming I should be setting it on the component drives. > Second, are you using iozone to reproduce? > > iozone -s 2000000 -r 64 -f /mnt/test/testfile -i 1 -w Yes, I was using your commandline. > That's the command line I was using. Third, I reproduced this on > 2.6.30-rc1. Perhaps you should start there and make sure you at least > see the same problem on that kernel. Otherwise, maybe we've made up for > the performance elsewhere. Right, could be, I was working on top of 2.6.31-rc1. > Finally, didn't you revert the autotuning patch? If so, you wouldn't > see this problem. 
Let me know how this goes, and if you still can't > reproduce it I'll setup for testing it here. Right, I re-applied the autotuning patch before applying the others. Thanks for the suggestions. --b. _______________________________________________ NFSv4 mailing list NFSv4@linux-nfs.org http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/3] Fix the Linux rpc-over-tcp server performance 2009-08-12 14:20 ` J. Bruce Fields @ 2009-08-12 17:02 ` Jeff Moyer 2009-08-12 22:32 ` J. Bruce Fields 0 siblings, 1 reply; 18+ messages in thread From: Jeff Moyer @ 2009-08-12 17:02 UTC (permalink / raw) To: J. Bruce Fields; +Cc: nfsv4, linux-nfs, Trond Myklebust "J. Bruce Fields" <bfields@fieldses.org> writes: > On Wed, Aug 12, 2009 at 09:22:17AM -0400, Jeff Moyer wrote: >> "J. Bruce Fields" <bfields@fieldses.org> writes: >> >> > On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote: >> >> I squashed the previous set of 4 incremental patches into 3. Otherwise >> >> there should be no differences w.r.t. the set that Jeff tested. >> > >> > Apologies for the long delay.... Unfortunately, I can't reproduce any of >> > this at all: I reliably get about 112MB/s regardless of what combination >> > of these patches I apply (including none). This is over gigabit >> > ethernet to a server exporting a filesystem on raid 0 over 3 sata disks >> > which iozone locally reports getting just over 200MB/s reads from. >> > >> > Any suggestions? >> >> Well, you gave me nothing to go on here, Bruce! > > Apologies for the lack of details.... No worries. ;-) >> I assume you're using >> the deadline I/O scheduler on the NFS server, is that right? If not, >> you should be. > > Oops, sorry, no. Looks like it doesn't allow setting a scheduler on md0, > so I'm assuming I should be setting it on the component drives. Right, set the scheduler on the component drives. I was testing on hardware raid, fwiw. Again, let me know if you need me to reproduce this. It will give me an excuse to get back that really nice machine I was using for testing. ;-) -Jeff ^ permalink raw reply [flat|nested] 18+ messages in thread
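For anyone reproducing this setup: the deadline elevator suggested above is selected per component drive through sysfs. A small sketch, with the sysfs root parameterized so it can be exercised against a scratch directory (the device names below are illustrative):

```shell
#!/bin/sh
# Select the deadline I/O scheduler for one block device.
# $2 lets tests point at a scratch tree; real use defaults to /sys.
set_deadline() {
	dev=$1
	sysfs_root=${2:-/sys}
	echo deadline > "$sysfs_root/block/$dev/queue/scheduler"
}

# On the NFS server, apply to each drive backing the md/hardware array:
#   for d in sda sdb sdc; do set_deadline "$d"; done
```

Reading the scheduler file back shows the active choice in brackets, e.g. `noop [deadline] cfq`, which is a quick way to confirm the setting took effect.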
* Re: [PATCH 0/3] Fix the Linux rpc-over-tcp server performance 2009-08-12 17:02 ` Jeff Moyer @ 2009-08-12 22:32 ` J. Bruce Fields 2009-08-12 22:40 ` Trond Myklebust 2009-08-13 13:05 ` Jeff Moyer 0 siblings, 2 replies; 18+ messages in thread From: J. Bruce Fields @ 2009-08-12 22:32 UTC (permalink / raw) To: Jeff Moyer; +Cc: Trond Myklebust, linux-nfs, nfsv4 On Wed, Aug 12, 2009 at 01:02:13PM -0400, Jeff Moyer wrote: > "J. Bruce Fields" <bfields@fieldses.org> writes: > > > On Wed, Aug 12, 2009 at 09:22:17AM -0400, Jeff Moyer wrote: > >> "J. Bruce Fields" <bfields@fieldses.org> writes: > >> > >> > On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote: > >> >> I squashed the previous set of 4 incremental patches into 3. Otherwise > >> >> there should be no differences w.r.t. the set that Jeff tested. > >> > > >> > Apologies for the long delay.... Unfortunately, I can't reproduce any of > >> > this at all: I reliably get about 112MB/s regardless of what combination > >> > of these patches I apply (including none). This is over gigabit > >> > ethernet to a server exporting a filesystem on raid 0 over 3 sata disks > >> > which iozone locally reports getting just over 200MB/s reads from. > >> > > >> > Any suggestions? > >> > >> Well, you gave me nothing to go on here, Bruce! > > > > Apologies for the lack of details.... > > No worries. ;-) > > >> I assume you're using > >> the deadline I/O scheduler on the NFS server, is that right? If not, > >> you should be. > > > > Oops, sorry, no. Looks like it doesn't allow setting a scheduler on md0, > > so I'm assuming I should be setting it on the component drives. > > Right, set the scheduler on the component drives. I was testing on > hardware raid, fwiw. Not much difference in results. This is frustrating. 
For each test, I'm booting both client and server to the given kernel, running mount server:/exports/ /mnt/ iozone -s 2000000 -r 64 -f /mnt/testfile -w -i1 umount /mnt/ five times, then taking the average of the "read" columns. (I could stick to one client--not sure which you were using. Installing new kernels on both is just what my existing test scripts happened to do by default.) 2.6.30-rc1: 114113 2.6.30-rc1 + revert autotuning: 114159 2.6.30-rc1 + patch 1: 114168 2.6.30-rc1 + patch 1 & 2: 114136 2.6.30-rc1 + patch 1, 2, & 3: 114149 (Where patches 1, 2, 3 are, respectively, "SUNRPC: Fix the TCP server's send buffer accounting results", "SUNRPC: Fix the TCP write space reservations for deferred requests", and "Fix svc_tcp_recvfrom() results", respectively. And the average-of-5 is pointless, really: the individual results have very little variation. See anything obvious I've gotten wrong here? > Again, let me know if you need me to reproduce this. It will give me an > excuse to get back that really nice machine I was using for testing. ;-) I'd be happiest if I could figure out how to reproduce this myself. --b. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/3] Fix the Linux rpc-over-tcp server performance 2009-08-12 22:32 ` J. Bruce Fields @ 2009-08-12 22:40 ` Trond Myklebust 2009-08-13 13:05 ` Jeff Moyer 1 sibling, 0 replies; 18+ messages in thread From: Trond Myklebust @ 2009-08-12 22:40 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Jeff Moyer, nfsv4, linux-nfs On Wed, 2009-08-12 at 18:32 -0400, J. Bruce Fields wrote: > For each test, I'm booting both client and server to the given kernel, > running > > mount server:/exports/ /mnt/ > iozone -s 2000000 -r 64 -f /mnt/testfile -w -i1 > umount /mnt/ > > five times, then taking the average of the "read" columns. (I could > stick to one client--not sure which you were using. Installing new > kernels on both is just what my existing test scripts happened to do by > default.) > > 2.6.30-rc1: 114113 > 2.6.30-rc1 + revert autotuning: 114159 > 2.6.30-rc1 + patch 1: 114168 > 2.6.30-rc1 + patch 1 & 2: 114136 > 2.6.30-rc1 + patch 1, 2, & 3: 114149 My tests were made using hammerfest as the server: it would see fairly long stalls when it was running basic 2.6.30-rc1 w/ autotuning. Feel free to co-opt it for your testing... -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 0/3] Fix the Linux rpc-over-tcp server performance
  2009-08-12 22:32           ` J. Bruce Fields
  2009-08-12 22:40             ` Trond Myklebust
@ 2009-08-13 13:05             ` Jeff Moyer
  1 sibling, 0 replies; 18+ messages in thread
From: Jeff Moyer @ 2009-08-13 13:05 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: nfsv4, linux-nfs, Trond Myklebust

"J. Bruce Fields" <bfields@fieldses.org> writes:

> On Wed, Aug 12, 2009 at 01:02:13PM -0400, Jeff Moyer wrote:
>> "J. Bruce Fields" <bfields@fieldses.org> writes:
>>
>> > On Wed, Aug 12, 2009 at 09:22:17AM -0400, Jeff Moyer wrote:
>> >> "J. Bruce Fields" <bfields@fieldses.org> writes:
>> >>
>> >> > On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote:
>> >> >> I squashed the previous set of 4 incremental patches into 3. Otherwise
>> >> >> there should be no differences w.r.t. the set that Jeff tested.
>> >> >
>> >> > Apologies for the long delay.... Unfortunately, I can't reproduce any of
>> >> > this at all: I reliably get about 112MB/s regardless of what combination
>> >> > of these patches I apply (including none). This is over gigabit
>> >> > ethernet to a server exporting a filesystem on raid 0 over 3 sata disks
>> >> > which iozone locally reports getting just over 200MB/s reads from.
>> >> >
>> >> > Any suggestions?
>> >>
>> >> Well, you gave me nothing to go on here, Bruce!
>> >
>> > Apologies for the lack of details....
>>
>> No worries. ;-)
>>
>> >> I assume you're using
>> >> the deadline I/O scheduler on the NFS server, is that right? If not,
>> >> you should be.
>> >
>> > Oops, sorry, no. Looks like it doesn't allow setting a scheduler on md0,
>> > so I'm assuming I should be setting it on the component drives.
>>
>> Right, set the scheduler on the component drives. I was testing on
>> hardware raid, fwiw.
>
> Not much difference in results. This is frustrating.
>
> For each test, I'm booting both client and server to the given kernel,
> running
>
>     mount server:/exports/ /mnt/
>     iozone -s 2000000 -r 64 -f /mnt/testfile -w -i1
>     umount /mnt/
>
> five times, then taking the average of the "read" columns. (I could
> stick to one client--not sure which you were using. Installing new
> kernels on both is just what my existing test scripts happened to do by
> default.)

Hmm, I think I was using a RHEL 5 client, so 2.6.18-based.

> 2.6.30-rc1:                     114113
> 2.6.30-rc1 + revert autotuning: 114159
> 2.6.30-rc1 + patch 1:           114168
> 2.6.30-rc1 + patch 1 & 2:       114136
> 2.6.30-rc1 + patch 1, 2, & 3:   114149
>
> (Where patches 1, 2, and 3 are, respectively, "SUNRPC: Fix the TCP
> server's send buffer accounting", "SUNRPC: Fix the TCP write space
> reservations for deferred requests", and "SUNRPC: Fix
> svc_tcp_recvfrom()".)
>
> And the average-of-5 is pointless, really: the individual results have
> very little variation.
>
> See anything obvious I've gotten wrong here?

Aside from the client kernel, which I didn't think mattered at the time,
no.

>> Again, let me know if you need me to reproduce this. It will give me an
>> excuse to get back that really nice machine I was using for testing. ;-)
>
> I'd be happiest if I could figure out how to reproduce this myself.

Understood. Try an older client kernel. If that still doesn't work, then
let me reproduce it and get back to you with more specifics of my
configuration.

Thanks for being so diligent, Bruce!

-Jeff

_______________________________________________
NFSv4 mailing list
NFSv4@linux-nfs.org
http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4

^ permalink raw reply	[flat|nested] 18+ messages in thread
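The scheduler change discussed above (md0 exposes no elevator knob, so the scheduler must be set on the raid's component drives) goes through the block layer's sysfs interface. The drive names below are assumptions for illustration; the write is guarded so the loop is a harmless no-op on machines without those devices.

```shell
# Select the deadline I/O scheduler on each component drive of the md
# array. /sys/block/md0/queue has no scheduler attribute, so the
# elevator is set per component device. sdb/sdc/sdd are hypothetical
# names for the three sata disks backing the raid 0 export.
for dev in sdb sdc sdd; do
    sched="/sys/block/$dev/queue/scheduler"
    if [ -w "$sched" ]; then
        echo deadline > "$sched"
        cat "$sched"   # the active scheduler is shown in brackets
    fi
done
```

Reading the attribute back after the write is a quick sanity check; the kernel lists the available schedulers and marks the active one in square brackets.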
end of thread, other threads:[~2010-04-02 20:57 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-18 21:47 [PATCH 0/3] Fix the Linux rpc-over-tcp server performance Trond Myklebust
[not found] ` <20090518214756.786.28129.stgit-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-05-18 21:47 ` [PATCH 2/3] SUNRPC: Fix the TCP write space reservations for deferred requests Trond Myklebust
[not found] ` <20090518214756.786.33956.stgit-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-06-19 22:23 ` J. Bruce Fields
2009-06-20 21:44 ` Trond Myklebust
[not found] ` <1245534248.5182.45.camel-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-08-24 21:32 ` J. Bruce Fields
2009-05-18 21:47 ` [PATCH 1/3] SUNRPC: Fix the TCP server's send buffer accounting Trond Myklebust
[not found] ` <20090518214756.786.58191.stgit-rJ7iovZKK19ZJLDQqaL3InhyD016LWXt@public.gmane.org>
2009-06-19 3:06 ` J. Bruce Fields
2009-05-18 21:47 ` [PATCH 3/3] SUNRPC: Fix svc_tcp_recvfrom() Trond Myklebust
2010-03-18 21:21 ` J. Bruce Fields
2010-04-02 21:00 ` J. Bruce Fields
2009-05-19 15:14 ` [PATCH 0/3] Fix the Linux rpc-over-tcp server performance Jeff Moyer
2009-08-12 2:43 ` J. Bruce Fields
2009-08-12 13:22 ` Jeff Moyer
2009-08-12 14:20 ` J. Bruce Fields
2009-08-12 17:02 ` Jeff Moyer
2009-08-12 22:32 ` J. Bruce Fields
2009-08-12 22:40 ` Trond Myklebust
2009-08-13 13:05 ` Jeff Moyer