From: "J. Bruce Fields" <bfields@fieldses.org>
To: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: linux-nfs@vger.kernel.org, nfsv4@linux-nfs.org,
Jeff Moyer <jmoyer@redhat.com>
Subject: Re: [PATCH 3/3] SUNRPC: Fix svc_tcp_recvfrom()
Date: Thu, 18 Mar 2010 17:21:53 -0400
Message-ID: <20100318212153.GB23085@fieldses.org>
In-Reply-To: <20090518214756.786.64058.stgit@heimdal.trondhjem.org>
On Mon, May 18, 2009 at 05:47:56PM -0400, Trond Myklebust wrote:
> Ensure that we immediately read and buffer data from the incoming TCP
> stream so that we grow the receive window quickly, and don't deadlock on
> large READ or WRITE requests.

So, I dropped this patch earlier because, although I don't have a serious
problem with it, I also couldn't work out exactly why it would account
for the performance differences we were seeing, and I was afraid we
might be papering over some other (as yet unfound) problem--so it seemed
safest just to revert the original buffer changes for the time being and
leave this alone.

However, I was recently reminded of a different problem, which is this:

	if (!(svsk->sk_reclen & RPC_LAST_STREAM_FRAGMENT)) {
		/* FIXME: technically, a record can be fragmented, and
		 * non-terminal fragments will not have the top bit set
		 * in the fragment length header. But apparently no
		 * known nfs clients send fragmented records. */
		if (net_ratelimit())
			printk(KERN_NOTICE "RPC: multiple fragments "
				"per record not supported\n");
		goto err_delete;
	}

(Reminded because libtirpc does in fact send these fragmented records
(why, I have no idea), so the CITI Windows client, which uses libtirpc,
hits this case. Hah. In any case, this is a server bug, and we really
should fix it.)
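
For reference, the check quoted above tests the top bit of the 4-byte
record marker that precedes every fragment on an RPC-over-TCP stream.
The following is a minimal user-space sketch of decoding that marker,
not kernel code. The names `frag_marker`, `decode_marker`,
`LAST_FRAGMENT_BIT` and `FRAGMENT_LEN_MASK` are hypothetical and chosen
for illustration; the layout (top bit = last fragment, low 31 bits =
length, big-endian) is assumed from the ONC RPC record marking standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LAST_FRAGMENT_BIT  0x80000000u  /* top bit: final fragment of the record */
#define FRAGMENT_LEN_MASK  0x7fffffffu  /* low 31 bits: fragment payload length */

/* Hypothetical decoded form of one record marker. */
struct frag_marker {
	int last;      /* nonzero if this fragment ends the record */
	uint32_t len;  /* payload bytes that follow the marker */
};

/* Decode a 4-byte big-endian record marker as read off the wire. */
static struct frag_marker decode_marker(const unsigned char *raw)
{
	struct frag_marker m;
	uint32_t v;

	assert(raw != NULL);
	v = ((uint32_t)raw[0] << 24) | ((uint32_t)raw[1] << 16) |
	    ((uint32_t)raw[2] << 8) | (uint32_t)raw[3];
	m.last = (v & LAST_FRAGMENT_BIT) != 0;
	m.len = v & FRAGMENT_LEN_MASK;
	return m;
}
```

A marker with `last` clear is exactly the case the server currently
rejects: more fragments of the same record follow.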
The current server behavior is to leave the data in the network buffers
until it's sure it has a full request. It seems hard to do that if the
server has to go dig through the network buffers to find all the record
markers. (Is there even a reasonable way to do that?)
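
To make the difficulty concrete, here is a rough user-space sketch of
what "finding all the record markers" would involve if the pending
bytes were available as one flat buffer. That flat buffer is an
assumption made purely for illustration: in the kernel the data sits in
socket buffers, which is what makes this awkward, and the sketch does
not answer that question. `complete_record_span` and `be32` are
hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 32-bit value. */
static uint32_t be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Walk the fragment markers at the start of buf. If a complete record
 * (every fragment up to and including the one with the top bit set) is
 * present, return the number of bytes it occupies including markers
 * and store the payload length in *payload. Return 0 if more data is
 * needed.
 */
static size_t complete_record_span(const unsigned char *buf, size_t avail,
				   size_t *payload)
{
	size_t off = 0, total = 0;
	uint32_t marker, len;

	assert(buf != NULL && payload != NULL);
	for (;;) {
		if (avail - off < 4)
			return 0;	/* marker itself is incomplete */
		marker = be32(buf + off);
		len = marker & 0x7fffffffu;
		off += 4;
		if (avail - off < len)
			return 0;	/* fragment payload is incomplete */
		off += len;
		total += len;
		if (marker & 0x80000000u) {
			*payload = total;
			return off;
		}
	}
}
```

Note that each call has to re-walk every marker from the start of the
record, which is part of why peeking without consuming looks
unattractive.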
So I wonder whether I should rebase this patch onto the most recent
kernel and then try implementing the RPC stream fragment handling on top.
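
As a rough illustration of why reading and buffering eagerly makes
fragment handling easier, the sketch below consumes bytes as they
arrive and keeps per-connection state across calls, loosely analogous
to tracking sk_reclen and sk_tcplen. It is a user-space sketch under
stated assumptions, not a proposed implementation: `rec_state`,
`rec_reset`, `rec_feed` and the fixed `REC_MAX` limit are all
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REC_MAX 4096	/* arbitrary cap on reassembled record size */

struct rec_state {
	unsigned char hdr[4];	/* marker bytes received so far */
	size_t hdr_got;		/* how many of the 4 marker bytes we have */
	uint32_t frag_left;	/* payload bytes still owed by this fragment */
	int last;		/* current fragment has the top bit set */
	unsigned char data[REC_MAX];
	size_t data_len;	/* reassembled payload so far */
};

static void rec_reset(struct rec_state *st)
{
	memset(st, 0, sizeof(*st));
}

/*
 * Consume up to n bytes. Returns the number of bytes used. *done is set
 * to 1 when st->data holds a complete record, -1 if the record would
 * exceed REC_MAX, and 0 if more data is needed.
 */
static size_t rec_feed(struct rec_state *st, const unsigned char *in,
		       size_t n, int *done)
{
	size_t used = 0, take;
	uint32_t m;

	assert(st != NULL && in != NULL && done != NULL);
	*done = 0;
	while (used < n) {
		if (st->hdr_got < 4) {
			st->hdr[st->hdr_got++] = in[used++];
			if (st->hdr_got < 4)
				continue;
			m = ((uint32_t)st->hdr[0] << 24) |
			    ((uint32_t)st->hdr[1] << 16) |
			    ((uint32_t)st->hdr[2] << 8) | st->hdr[3];
			st->last = (m & 0x80000000u) != 0;
			st->frag_left = m & 0x7fffffffu;
			if (st->frag_left > REC_MAX - st->data_len) {
				*done = -1;
				return used;
			}
		} else {
			take = n - used;
			if (take > st->frag_left)
				take = st->frag_left;
			memcpy(st->data + st->data_len, in + used, take);
			st->data_len += take;
			st->frag_left -= (uint32_t)take;
			used += take;
		}
		if (st->hdr_got == 4 && st->frag_left == 0) {
			if (st->last) {
				*done = 1;
				return used;
			}
			st->hdr_got = 0;	/* expect the next marker */
		}
	}
	return used;
}
```

The caller would invoke `rec_reset()` after handling a completed record
and feed any unconsumed bytes back in, since they belong to the next
record.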
--b.
>
> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> ---
>
> include/linux/sunrpc/svcsock.h | 1
> net/sunrpc/svcsock.c | 167 +++++++++++++++++++++++++++++-----------
> 2 files changed, 124 insertions(+), 44 deletions(-)
>
>
> diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
> index 483e103..b0b4546 100644
> --- a/include/linux/sunrpc/svcsock.h
> +++ b/include/linux/sunrpc/svcsock.h
> @@ -28,6 +28,7 @@ struct svc_sock {
> /* private TCP part */
> u32 sk_reclen; /* length of record */
> u32 sk_tcplen; /* current read length */
> + struct page * sk_pages[RPCSVC_MAXPAGES]; /* received data */
> };
>
> /*
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index eed978e..7dd65b0 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -323,6 +323,33 @@ static int svc_recvfrom(struct svc_rqst *rqstp, struct kvec *iov, int nr,
> return len;
> }
>
> +static int svc_partial_recvfrom(struct svc_rqst *rqstp,
> + struct kvec *iov, int nr,
> + int buflen, unsigned int base)
> +{
> + size_t save_iovlen;
> + void __user *save_iovbase;
> + unsigned int i;
> + int ret;
> +
> + if (base == 0)
> + return svc_recvfrom(rqstp, iov, nr, buflen);
> +
> + for (i = 0; i < nr; i++) {
> + if (iov[i].iov_len > base)
> + break;
> + base -= iov[i].iov_len;
> + }
> + save_iovlen = iov[i].iov_len;
> + save_iovbase = iov[i].iov_base;
> + iov[i].iov_len -= base;
> + iov[i].iov_base += base;
> + ret = svc_recvfrom(rqstp, &iov[i], nr - i, buflen);
> + iov[i].iov_len = save_iovlen;
> + iov[i].iov_base = save_iovbase;
> + return ret;
> +}
> +
> /*
> * Set socket snd and rcv buffer lengths
> */
> @@ -790,6 +817,56 @@ failed:
> return NULL;
> }
>
> +static unsigned int svc_tcp_restore_pages(struct svc_sock *svsk, struct svc_rqst *rqstp)
> +{
> + unsigned int i, len, npages;
> +
> + if (svsk->sk_tcplen <= sizeof(rpc_fraghdr))
> + return 0;
> + len = svsk->sk_tcplen - sizeof(rpc_fraghdr);
> + npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + for (i = 0; i < npages; i++) {
> + if (rqstp->rq_pages[i] != NULL)
> + put_page(rqstp->rq_pages[i]);
> + BUG_ON(svsk->sk_pages[i] == NULL);
> + rqstp->rq_pages[i] = svsk->sk_pages[i];
> + svsk->sk_pages[i] = NULL;
> + }
> + rqstp->rq_arg.head[0].iov_base = page_address(rqstp->rq_pages[0]);
> + return len;
> +}
> +
> +static void svc_tcp_save_pages(struct svc_sock *svsk, struct svc_rqst *rqstp)
> +{
> + unsigned int i, len, npages;
> +
> + if (svsk->sk_tcplen <= sizeof(rpc_fraghdr))
> + return;
> + len = svsk->sk_tcplen - sizeof(rpc_fraghdr);
> + npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + for (i = 0; i < npages; i++) {
> + svsk->sk_pages[i] = rqstp->rq_pages[i];
> + rqstp->rq_pages[i] = NULL;
> + }
> +}
> +
> +static void svc_tcp_clear_pages(struct svc_sock *svsk)
> +{
> + unsigned int i, len, npages;
> +
> + if (svsk->sk_tcplen <= sizeof(rpc_fraghdr))
> + goto out;
> + len = svsk->sk_tcplen - sizeof(rpc_fraghdr);
> + npages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + for (i = 0; i < npages; i++) {
> + BUG_ON(svsk->sk_pages[i] == NULL);
> + put_page(svsk->sk_pages[i]);
> + svsk->sk_pages[i] = NULL;
> + }
> +out:
> + svsk->sk_tcplen = 0;
> +}
> +
> /*
> * Receive data from a TCP socket.
> */
> @@ -800,7 +877,8 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> struct svc_serv *serv = svsk->sk_xprt.xpt_server;
> int len;
> struct kvec *vec;
> - int pnum, vlen;
> + unsigned int want, base, vlen;
> + int pnum;
>
> dprintk("svc: tcp_recv %p data %d conn %d close %d\n",
> svsk, test_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags),
> @@ -814,9 +892,9 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> * possible up to the complete record length.
> */
> if (svsk->sk_tcplen < sizeof(rpc_fraghdr)) {
> - int want = sizeof(rpc_fraghdr) - svsk->sk_tcplen;
> struct kvec iov;
>
> + want = sizeof(rpc_fraghdr) - svsk->sk_tcplen;
> iov.iov_base = ((char *) &svsk->sk_reclen) + svsk->sk_tcplen;
> iov.iov_len = want;
> if ((len = svc_recvfrom(rqstp, &iov, 1, want)) < 0)
> @@ -826,8 +904,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> if (len < want) {
> dprintk("svc: short recvfrom while reading record "
> "length (%d of %d)\n", len, want);
> - svc_xprt_received(&svsk->sk_xprt);
> - return -EAGAIN; /* record header not complete */
> + goto err_noclose;
> }
>
> svsk->sk_reclen = ntohl(svsk->sk_reclen);
> @@ -853,25 +930,14 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> }
> }
>
> - /* Check whether enough data is available */
> - len = svc_recv_available(svsk);
> - if (len < 0)
> - goto error;
> -
> - if (len < svsk->sk_reclen) {
> - dprintk("svc: incomplete TCP record (%d of %d)\n",
> - len, svsk->sk_reclen);
> - svc_xprt_received(&svsk->sk_xprt);
> - return -EAGAIN; /* record not complete */
> - }
> - len = svsk->sk_reclen;
> - set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
> + base = svc_tcp_restore_pages(svsk, rqstp);
> + want = svsk->sk_reclen - base;
>
> vec = rqstp->rq_vec;
> vec[0] = rqstp->rq_arg.head[0];
> vlen = PAGE_SIZE;
> pnum = 1;
> - while (vlen < len) {
> + while (vlen < svsk->sk_reclen) {
> vec[pnum].iov_base = page_address(rqstp->rq_pages[pnum]);
> vec[pnum].iov_len = PAGE_SIZE;
> pnum++;
> @@ -880,19 +946,26 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> rqstp->rq_respages = &rqstp->rq_pages[pnum];
>
> /* Now receive data */
> - len = svc_recvfrom(rqstp, vec, pnum, len);
> - if (len < 0)
> - goto error;
> + clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
> + len = svc_partial_recvfrom(rqstp, vec, pnum, want, base);
> + if (len != want) {
> + if (len >= 0)
> + svsk->sk_tcplen += len;
> + else if (len != -EAGAIN)
> + goto err_other;
> + svc_tcp_save_pages(svsk, rqstp);
> + dprintk("svc: incomplete TCP record (%d of %d)\n",
> + svsk->sk_tcplen, svsk->sk_reclen);
> + goto err_noclose;
> + }
>
> - dprintk("svc: TCP complete record (%d bytes)\n", len);
> - rqstp->rq_arg.len = len;
> + rqstp->rq_arg.len = svsk->sk_reclen;
> rqstp->rq_arg.page_base = 0;
> - if (len <= rqstp->rq_arg.head[0].iov_len) {
> - rqstp->rq_arg.head[0].iov_len = len;
> + if (rqstp->rq_arg.len <= rqstp->rq_arg.head[0].iov_len) {
> + rqstp->rq_arg.head[0].iov_len = rqstp->rq_arg.len;
> rqstp->rq_arg.page_len = 0;
> - } else {
> - rqstp->rq_arg.page_len = len - rqstp->rq_arg.head[0].iov_len;
> - }
> + } else
> + rqstp->rq_arg.page_len = rqstp->rq_arg.len - rqstp->rq_arg.head[0].iov_len;
>
> rqstp->rq_xprt_ctxt = NULL;
> rqstp->rq_prot = IPPROTO_TCP;
> @@ -900,29 +973,32 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> /* Reset TCP read info */
> svsk->sk_reclen = 0;
> svsk->sk_tcplen = 0;
> + /* If we have more data, signal svc_xprt_enqueue() to try again */
> + if (svc_recv_available(svsk) > sizeof(rpc_fraghdr))
> + set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
> +
>
> svc_xprt_copy_addrs(rqstp, &svsk->sk_xprt);
> svc_xprt_received(&svsk->sk_xprt);
> if (serv->sv_stats)
> serv->sv_stats->nettcpcnt++;
>
> - return len;
> -
> - err_delete:
> + dprintk("svc: TCP complete record (%d bytes)\n", rqstp->rq_arg.len);
> + return rqstp->rq_arg.len;
> +error:
> + if (len == -EAGAIN)
> + goto err_got_eagain;
> +err_other:
> + printk(KERN_NOTICE "%s: recvfrom returned errno %d\n",
> + svsk->sk_xprt.xpt_server->sv_name, -len);
> +err_delete:
> set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
> return -EAGAIN;
> -
> - error:
> - if (len == -EAGAIN) {
> - dprintk("RPC: TCP recvfrom got EAGAIN\n");
> - svc_xprt_received(&svsk->sk_xprt);
> - } else {
> - printk(KERN_NOTICE "%s: recvfrom returned errno %d\n",
> - svsk->sk_xprt.xpt_server->sv_name, -len);
> - goto err_delete;
> - }
> -
> - return len;
> +err_got_eagain:
> + dprintk("RPC: TCP recvfrom got EAGAIN\n");
> +err_noclose:
> + svc_xprt_received(&svsk->sk_xprt);
> + return -EAGAIN; /* record not complete */
> }
>
> /*
> @@ -1042,6 +1118,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
>
> svsk->sk_reclen = 0;
> svsk->sk_tcplen = 0;
> + memset(&svsk->sk_pages[0], 0, sizeof(svsk->sk_pages));
>
> tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>
> @@ -1290,8 +1367,10 @@ static void svc_tcp_sock_detach(struct svc_xprt *xprt)
>
> svc_sock_detach(xprt);
>
> - if (!test_bit(XPT_LISTENER, &xprt->xpt_flags))
> + if (!test_bit(XPT_LISTENER, &xprt->xpt_flags)) {
> + svc_tcp_clear_pages(svsk);
> kernel_sock_shutdown(svsk->sk_sock, SHUT_RDWR);
> + }
> }
>
> /*
>