From: Sasha Levin <sashal@kernel.org>
Date: 2021-01-20  1:25 UTC
Subject: [PATCH AUTOSEL 5.10 03/45] SUNRPC: Handle TCP socket sends with kernel_sendpage() again
In-Reply-To: <20210120012602.769683-1-sashal@kernel.org>
To: linux-kernel, stable
Cc: Chuck Lever, Daire Byrne, Sasha Levin, linux-nfs, netdev

From: Chuck Lever <chuck.lever@oracle.com>

[ Upstream commit 4a85a6a3320b4a622315d2e0ea91a1d2b013bce4 ]

Daire Byrne reports a ~50% aggregate throughput regression on his
Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach server to
use xprt_sock_sendmsg for socket sends"), which replaced
kernel_sendpage() calls in NFSD's socket send path with calls to
sock_sendmsg() using iov_iter.

Investigation showed that tcp_sendmsg() was not using zero-copy to
send the xdr_buf's bvec pages, but instead was relying on memcpy.
This means copying every byte of a large NFS READ payload.

It looks like TLS sockets do indeed support a ->sendpage method,
so it's really not necessary to use xprt_sock_sendmsg() to support
TLS fully on the server. A mechanical reversion of da1661b93bf4 is
not possible at this point, but we can re-implement the server's
TCP socket sendmsg path using kernel_sendpage().
Reported-by: Daire Byrne <daire@dneg.com>
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 net/sunrpc/svcsock.c | 86 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 85 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index c2752e2b9ce34..4404c491eb388 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1062,6 +1062,90 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	return 0;	/* record not complete */
 }
 
+static int svc_tcp_send_kvec(struct socket *sock, const struct kvec *vec,
+			     int flags)
+{
+	return kernel_sendpage(sock, virt_to_page(vec->iov_base),
+			       offset_in_page(vec->iov_base),
+			       vec->iov_len, flags);
+}
+
+/*
+ * kernel_sendpage() is used exclusively to reduce the number of
+ * copy operations in this path. Therefore the caller must ensure
+ * that the pages backing @xdr are unchanging.
+ *
+ * In addition, the logic assumes that .bv_len is never larger
+ * than PAGE_SIZE.
+ */
+static int svc_tcp_sendmsg(struct socket *sock, struct msghdr *msg,
+			   struct xdr_buf *xdr, rpc_fraghdr marker,
+			   unsigned int *sentp)
+{
+	const struct kvec *head = xdr->head;
+	const struct kvec *tail = xdr->tail;
+	struct kvec rm = {
+		.iov_base	= &marker,
+		.iov_len	= sizeof(marker),
+	};
+	int flags, ret;
+
+	*sentp = 0;
+	xdr_alloc_bvec(xdr, GFP_KERNEL);
+
+	msg->msg_flags = MSG_MORE;
+	ret = kernel_sendmsg(sock, msg, &rm, 1, rm.iov_len);
+	if (ret < 0)
+		return ret;
+	*sentp += ret;
+	if (ret != rm.iov_len)
+		return -EAGAIN;
+
+	flags = head->iov_len < xdr->len ? MSG_MORE | MSG_SENDPAGE_NOTLAST : 0;
+	ret = svc_tcp_send_kvec(sock, head, flags);
+	if (ret < 0)
+		return ret;
+	*sentp += ret;
+	if (ret != head->iov_len)
+		goto out;
+
+	if (xdr->page_len) {
+		unsigned int offset, len, remaining;
+		struct bio_vec *bvec;
+
+		bvec = xdr->bvec;
+		offset = xdr->page_base;
+		remaining = xdr->page_len;
+		flags = MSG_MORE | MSG_SENDPAGE_NOTLAST;
+		while (remaining > 0) {
+			if (remaining <= PAGE_SIZE && tail->iov_len == 0)
+				flags = 0;
+			len = min(remaining, bvec->bv_len);
+			ret = kernel_sendpage(sock, bvec->bv_page,
+					      bvec->bv_offset + offset,
+					      len, flags);
+			if (ret < 0)
+				return ret;
+			*sentp += ret;
+			if (ret != len)
+				goto out;
+			remaining -= len;
+			offset = 0;
+			bvec++;
+		}
+	}
+
+	if (tail->iov_len) {
+		ret = svc_tcp_send_kvec(sock, tail, 0);
+		if (ret < 0)
+			return ret;
+		*sentp += ret;
+	}
+
+out:
+	return 0;
+}
+
 /**
  * svc_tcp_sendto - Send out a reply on a TCP socket
  * @rqstp: completed svc_rqst
@@ -1089,7 +1173,7 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
 	mutex_lock(&xprt->xpt_mutex);
 	if (svc_xprt_is_dead(xprt))
 		goto out_notconn;
-	err = xprt_sock_sendmsg(svsk->sk_sock, &msg, xdr, 0, marker, &sent);
+	err = svc_tcp_sendmsg(svsk->sk_sock, &msg, xdr, marker, &sent);
 	xdr_free_bvec(xdr);
 	trace_svcsock_tcp_send(xprt, err < 0 ? err : sent);
 	if (err < 0 || sent != (xdr->len + sizeof(marker)))
-- 
2.27.0
From: Trond Myklebust <trondmy@hammerspace.com>
Date: 2021-02-08 19:34 UTC
Subject: Re: [PATCH AUTOSEL 5.10 03/45] SUNRPC: Handle TCP socket sends with kernel_sendpage() again
To: sashal@kernel.org, stable@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: linux-nfs@vger.kernel.org, daire@dneg.com, netdev@vger.kernel.org, chuck.lever@oracle.com

On Tue, 2021-01-19 at 20:25 -0500, Sasha Levin wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> [ Upstream commit 4a85a6a3320b4a622315d2e0ea91a1d2b013bce4 ]
>
> [...]
>
> +static int svc_tcp_send_kvec(struct socket *sock, const struct kvec *vec,
> +			     int flags)
> +{
> +	return kernel_sendpage(sock, virt_to_page(vec->iov_base),
> +			       offset_in_page(vec->iov_base),
> +			       vec->iov_len, flags);

I'm having trouble with this line. This looks like it is trying to push
a slab page into kernel_sendpage(). What guarantees that the nfsd
thread won't call kfree() before the socket layer is done transmitting
the page?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com
From: Chuck Lever <chuck.lever@oracle.com>
Date: 2021-02-08 19:48 UTC
Subject: Re: [PATCH AUTOSEL 5.10 03/45] SUNRPC: Handle TCP socket sends with kernel_sendpage() again
To: Trond Myklebust
Cc: sashal@kernel.org, stable@vger.kernel.org, linux-kernel@vger.kernel.org, Linux NFS Mailing List, daire@dneg.com, netdev@vger.kernel.org

> On Feb 8, 2021, at 2:34 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Tue, 2021-01-19 at 20:25 -0500, Sasha Levin wrote:
>> [...]
>>
>> +static int svc_tcp_send_kvec(struct socket *sock, const struct kvec *vec,
>> +			     int flags)
>> +{
>> +	return kernel_sendpage(sock, virt_to_page(vec->iov_base),
>> +			       offset_in_page(vec->iov_base),
>> +			       vec->iov_len, flags);

Thanks for your review!

> I'm having trouble with this line. This looks like it is trying to push
> a slab page into kernel_sendpage().

The head and tail kvec's in rq_res are not kmalloc'd, they are
backed by pages in rqstp->rq_pages[].

> What guarantees that the nfsd
> thread won't call kfree() before the socket layer is done transmitting
> the page?

If I understand correctly what Neil told us last week, the page
reference count on those pages is set up so that one of
svc_xprt_release() or the network layer does the final put_page(),
in a safe fashion.

Before da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg
for socket sends"), the original svc_send_common() code did this:

-	/* send head */
-	if (slen == xdr->head[0].iov_len)
-		flags = 0;
-	len = kernel_sendpage(sock, headpage, headoffset,
-			      xdr->head[0].iov_len, flags);
-	if (len != xdr->head[0].iov_len)
-		goto out;
-	slen -= xdr->head[0].iov_len;
-	if (slen == 0)
-		goto out;

-- 
Chuck Lever
From: Trond Myklebust <trondmy@hammerspace.com>
Date: 2021-02-08 20:12 UTC
Subject: Re: [PATCH AUTOSEL 5.10 03/45] SUNRPC: Handle TCP socket sends with kernel_sendpage() again
To: chuck.lever@oracle.com
Cc: sashal@kernel.org, stable@vger.kernel.org, linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org, daire@dneg.com, netdev@vger.kernel.org

On Mon, 2021-02-08 at 19:48 +0000, Chuck Lever wrote:
> > On Feb 8, 2021, at 2:34 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> >
> > [...]
> >
> > I'm having trouble with this line. This looks like it is trying to
> > push a slab page into kernel_sendpage().
>
> The head and tail kvec's in rq_res are not kmalloc'd, they are
> backed by pages in rqstp->rq_pages[].
>
> > What guarantees that the nfsd
> > thread won't call kfree() before the socket layer is done
> > transmitting the page?
>
> If I understand correctly what Neil told us last week, the page
> reference count on those pages is set up so that one of
> svc_xprt_release() or the network layer does the final put_page(),
> in a safe fashion.
>
> Before da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg
> for socket sends"), the original svc_send_common() code did this:
>
> [...]

OK, so then only the argument kvec can be allocated on the slab (thanks
to svc_deferred_recv)? Is that correct?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com
From: Chuck Lever <chuck.lever@oracle.com>
Date: 2021-02-08 20:17 UTC
Subject: Re: [PATCH AUTOSEL 5.10 03/45] SUNRPC: Handle TCP socket sends with kernel_sendpage() again
To: Trond Myklebust
Cc: sashal@kernel.org, stable@vger.kernel.org, Linux NFS Mailing List, linux-kernel@vger.kernel.org, daire@dneg.com, netdev@vger.kernel.org

> On Feb 8, 2021, at 3:12 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Mon, 2021-02-08 at 19:48 +0000, Chuck Lever wrote:
>> [...]
>>
>> If I understand correctly what Neil told us last week, the page
>> reference count on those pages is set up so that one of
>> svc_xprt_release() or the network layer does the final put_page(),
>> in a safe fashion.
>
> OK, so then only the argument kvec can be allocated on the slab (thanks
> to svc_deferred_recv)? Is that correct?

The RPC/RDMA Receive buffer is kmalloc'd, that would be used for
rq_arg.head/tail. But for TCP, I believe the head kvec is always
pulled out of rq_pages[]. svc_process() sets up rq_res.head this way:

1508	resv->iov_base = page_address(rqstp->rq_respages[0]);
1509	resv->iov_len = 0;

I would need to audit the code to confirm that rq_res.tail is never
kmalloc'd.

-- 
Chuck Lever