* [PATCH v2 0/4] Send RPC-on-TCP with one sock_sendmsg() call
From: Chuck Lever @ 2023-07-14 18:10 UTC (permalink / raw)
To: linux-nfs, netdev; +Cc: Chuck Lever, David Howells, dhowells
After some discussion with David Howells at LSF/MM 2023, we arrived
at a plan to use a single sock_sendmsg() call for transmitting an
RPC message on socket-based transports. This is an initial part of
the transition to support handling folios with file content, but it
has scalability benefits as well.
This series passes the usual set of NFS-based tests. I've found no
behavior regressions so far. I'm interested to hear from the netdev
folks whether this approach is sensible. The next step is performance
benchmarking.
Changes since RFC:
* Moved xdr_buf-to-bio_vec array helper to generic XDR code
* Added bio_vec array bounds-checking
* Re-ordered patches
---
Chuck Lever (4):
SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly
SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call
SUNRPC: Convert svc_udp_sendto() to use the per-socket bio_vec array
SUNRPC: Use a per-transport receive bio_vec array
include/linux/sunrpc/svc.h | 1 -
include/linux/sunrpc/svcsock.h | 7 +++
include/linux/sunrpc/xdr.h | 2 +
net/sunrpc/svcsock.c | 111 ++++++++++++++-------------------
net/sunrpc/xdr.c | 50 +++++++++++++++
5 files changed, 107 insertions(+), 64 deletions(-)
--
Chuck Lever
* [PATCH v2 1/4] SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly
From: Chuck Lever @ 2023-07-14 18:10 UTC (permalink / raw)
To: linux-nfs, netdev; +Cc: Chuck Lever, dhowells
From: Chuck Lever <chuck.lever@oracle.com>
Add a helper to convert a whole xdr_buf directly into an array of
bio_vecs, then send this array instead of iterating piecemeal over
the xdr_buf containing the outbound RPC message.
Note that the rules of the RPC protocol mean there can be only one
outstanding send at a time on a transport socket. The kernel's
SunRPC server enforces this via the transport's xpt_mutex. Thus we
can use a per-transport shared array for the xdr_buf conversion
rather than allocate one every time or use one that is part of
struct svc_rqst.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svcsock.h | 3 ++
include/linux/sunrpc/xdr.h | 2 +
net/sunrpc/svcsock.c | 59 ++++++++++++++--------------------------
net/sunrpc/xdr.c | 50 ++++++++++++++++++++++++++++++++++
4 files changed, 75 insertions(+), 39 deletions(-)
diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index a7116048a4d4..a9bfeadf4cbe 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -40,6 +40,9 @@ struct svc_sock {
struct completion sk_handshake_done;
+ struct bio_vec sk_send_bvec[RPCSVC_MAXPAGES]
+ ____cacheline_aligned;
+
struct page * sk_pages[RPCSVC_MAXPAGES]; /* received data */
};
diff --git a/include/linux/sunrpc/xdr.h b/include/linux/sunrpc/xdr.h
index f89ec4b5ea16..42f9d7eb9a1a 100644
--- a/include/linux/sunrpc/xdr.h
+++ b/include/linux/sunrpc/xdr.h
@@ -139,6 +139,8 @@ void xdr_terminate_string(const struct xdr_buf *, const u32);
size_t xdr_buf_pagecount(const struct xdr_buf *buf);
int xdr_alloc_bvec(struct xdr_buf *buf, gfp_t gfp);
void xdr_free_bvec(struct xdr_buf *buf);
+unsigned int xdr_buf_to_bvec(struct bio_vec *bvec, unsigned int bvec_size,
+ const struct xdr_buf *xdr);
static inline __be32 *xdr_encode_array(__be32 *p, const void *s, unsigned int len)
{
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index e43f26382411..e35e5afe4b81 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -36,6 +36,8 @@
#include <linux/skbuff.h>
#include <linux/file.h>
#include <linux/freezer.h>
+#include <linux/bvec.h>
+
#include <net/sock.h>
#include <net/checksum.h>
#include <net/ip.h>
@@ -1194,72 +1196,52 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
return 0; /* record not complete */
}
-static int svc_tcp_send_kvec(struct socket *sock, const struct kvec *vec,
- int flags)
-{
- struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES | flags, };
-
- iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, vec, 1, vec->iov_len);
- return sock_sendmsg(sock, &msg);
-}
-
/*
* MSG_SPLICE_PAGES is used exclusively to reduce the number of
* copy operations in this path. Therefore the caller must ensure
* that the pages backing @xdr are unchanging.
*
- * In addition, the logic assumes that * .bv_len is never larger
- * than PAGE_SIZE.
+ * Note that the send is non-blocking. The caller has incremented
+ * the reference count on each page backing the RPC message, and
+ * the network layer will "put" these pages when transmission is
+ * complete.
+ *
+ * This is safe for our RPC services because the memory backing
+ * the head and tail components is never kmalloc'd. These always
+ * come from pages in the svc_rqst::rq_pages array.
*/
-static int svc_tcp_sendmsg(struct socket *sock, struct xdr_buf *xdr,
+static int svc_tcp_sendmsg(struct svc_sock *svsk, struct xdr_buf *xdr,
rpc_fraghdr marker, unsigned int *sentp)
{
- const struct kvec *head = xdr->head;
- const struct kvec *tail = xdr->tail;
struct kvec rm = {
.iov_base = &marker,
.iov_len = sizeof(marker),
};
struct msghdr msg = {
- .msg_flags = 0,
+ .msg_flags = MSG_MORE,
};
+ unsigned int count;
int ret;
*sentp = 0;
- ret = xdr_alloc_bvec(xdr, GFP_KERNEL);
- if (ret < 0)
- return ret;
- ret = kernel_sendmsg(sock, &msg, &rm, 1, rm.iov_len);
+ ret = kernel_sendmsg(svsk->sk_sock, &msg, &rm, 1, rm.iov_len);
if (ret < 0)
return ret;
*sentp += ret;
if (ret != rm.iov_len)
return -EAGAIN;
- ret = svc_tcp_send_kvec(sock, head, 0);
- if (ret < 0)
- return ret;
- *sentp += ret;
- if (ret != head->iov_len)
- goto out;
+ count = xdr_buf_to_bvec(svsk->sk_send_bvec,
+ ARRAY_SIZE(svsk->sk_send_bvec), xdr);
msg.msg_flags = MSG_SPLICE_PAGES;
- iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, xdr->bvec,
- xdr_buf_pagecount(xdr), xdr->page_len);
- ret = sock_sendmsg(sock, &msg);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, svsk->sk_send_bvec,
+ count, xdr->len);
+ ret = sock_sendmsg(svsk->sk_sock, &msg);
if (ret < 0)
return ret;
*sentp += ret;
-
- if (tail->iov_len) {
- ret = svc_tcp_send_kvec(sock, tail, 0);
- if (ret < 0)
- return ret;
- *sentp += ret;
- }
-
-out:
return 0;
}
@@ -1290,8 +1272,7 @@ static int svc_tcp_sendto(struct svc_rqst *rqstp)
if (svc_xprt_is_dead(xprt))
goto out_notconn;
tcp_sock_set_cork(svsk->sk_sk, true);
- err = svc_tcp_sendmsg(svsk->sk_sock, xdr, marker, &sent);
- xdr_free_bvec(xdr);
+ err = svc_tcp_sendmsg(svsk, xdr, marker, &sent);
trace_svcsock_tcp_send(xprt, err < 0 ? (long)err : sent);
if (err < 0 || sent != (xdr->len + sizeof(marker)))
goto out_close;
diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index 2a22e78af116..358e6de91775 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -164,6 +164,56 @@ xdr_free_bvec(struct xdr_buf *buf)
buf->bvec = NULL;
}
+/**
+ * xdr_buf_to_bvec - Copy components of an xdr_buf into a bio_vec array
+ * @bvec: bio_vec array to populate
+ * @bvec_size: element count of @bio_vec
+ * @xdr: xdr_buf to be copied
+ *
+ * Returns the number of entries consumed in @bvec.
+ */
+unsigned int xdr_buf_to_bvec(struct bio_vec *bvec, unsigned int bvec_size,
+ const struct xdr_buf *xdr)
+{
+ const struct kvec *head = xdr->head;
+ const struct kvec *tail = xdr->tail;
+ unsigned int count = 0;
+
+ if (head->iov_len) {
+ bvec_set_virt(bvec++, head->iov_base, head->iov_len);
+ ++count;
+ }
+
+ if (xdr->page_len) {
+ unsigned int offset, len, remaining;
+ struct page **pages = xdr->pages;
+
+ offset = offset_in_page(xdr->page_base);
+ remaining = xdr->page_len;
+ while (remaining > 0) {
+ len = min_t(unsigned int, remaining,
+ PAGE_SIZE - offset);
+ bvec_set_page(bvec++, *pages++, len, offset);
+ remaining -= len;
+ offset = 0;
+ if (unlikely(++count > bvec_size))
+ goto bvec_overflow;
+ }
+ }
+
+ if (tail->iov_len) {
+ bvec_set_virt(bvec, tail->iov_base, tail->iov_len);
+ if (unlikely(++count > bvec_size))
+ goto bvec_overflow;
+ }
+
+ return count;
+
+bvec_overflow:
+ pr_warn_once("%s: bio_vec array overflow\n", __func__);
+ return count - 1;
+}
+
/**
* xdr_inline_pages - Prepare receive buffer for a large reply
* @xdr: xdr_buf into which reply will be placed
* [PATCH v2 2/4] SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call
From: Chuck Lever @ 2023-07-14 18:10 UTC (permalink / raw)
To: linux-nfs, netdev; +Cc: David Howells, Chuck Lever, dhowells
From: Chuck Lever <chuck.lever@oracle.com>
There is now enough infrastructure in place to combine the stream
record marker into the bio_vec array used to send each outgoing RPC
message. The whole message can be more efficiently sent with a
single call to sock_sendmsg() using a bio_vec iterator.
Note that this also helps with RPC-with-TLS: the TLS implementation
can now clearly see where the upper layer message boundaries are.
Before, it would send each component of the xdr_buf in a separate
TLS record.
Suggested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svcsock.h | 2 ++
net/sunrpc/svcsock.c | 33 ++++++++++++++++++---------------
2 files changed, 20 insertions(+), 15 deletions(-)
diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index a9bfeadf4cbe..baea012e3b04 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -38,6 +38,8 @@ struct svc_sock {
/* Number of queued send requests */
atomic_t sk_sendqlen;
+ struct page_frag_cache sk_frag_cache;
+
struct completion sk_handshake_done;
struct bio_vec sk_send_bvec[RPCSVC_MAXPAGES]
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index e35e5afe4b81..bb185c0bb57c 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1213,31 +1213,30 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
static int svc_tcp_sendmsg(struct svc_sock *svsk, struct xdr_buf *xdr,
rpc_fraghdr marker, unsigned int *sentp)
{
- struct kvec rm = {
- .iov_base = &marker,
- .iov_len = sizeof(marker),
- };
struct msghdr msg = {
- .msg_flags = MSG_MORE,
+ .msg_flags = MSG_SPLICE_PAGES,
};
unsigned int count;
+ void *tmp;
int ret;
*sentp = 0;
- ret = kernel_sendmsg(svsk->sk_sock, &msg, &rm, 1, rm.iov_len);
- if (ret < 0)
- return ret;
- *sentp += ret;
- if (ret != rm.iov_len)
- return -EAGAIN;
+ /* The stream record marker is copied into a temporary page
+ * buffer so that it can be included in sk_send_bvec.
+ */
+ tmp = page_frag_alloc(&svsk->sk_frag_cache, sizeof(marker),
+ GFP_KERNEL);
+ if (!tmp)
+ return -ENOMEM;
+ memcpy(tmp, &marker, sizeof(marker));
+ bvec_set_virt(svsk->sk_send_bvec, tmp, sizeof(marker));
- count = xdr_buf_to_bvec(svsk->sk_send_bvec,
- ARRAY_SIZE(svsk->sk_send_bvec), xdr);
+ count = xdr_buf_to_bvec(svsk->sk_send_bvec + 1,
+ ARRAY_SIZE(svsk->sk_send_bvec) - 1, xdr);
- msg.msg_flags = MSG_SPLICE_PAGES;
iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, svsk->sk_send_bvec,
- count, xdr->len);
+ 1 + count, sizeof(marker) + xdr->len);
ret = sock_sendmsg(svsk->sk_sock, &msg);
if (ret < 0)
return ret;
@@ -1616,6 +1615,7 @@ static void svc_tcp_sock_detach(struct svc_xprt *xprt)
static void svc_sock_free(struct svc_xprt *xprt)
{
struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
+ struct page_frag_cache *pfc = &svsk->sk_frag_cache;
struct socket *sock = svsk->sk_sock;
trace_svcsock_free(svsk, sock);
@@ -1625,5 +1625,8 @@ static void svc_sock_free(struct svc_xprt *xprt)
sockfd_put(sock);
else
sock_release(sock);
+ if (pfc->va)
+ __page_frag_cache_drain(virt_to_head_page(pfc->va),
+ pfc->pagecnt_bias);
kfree(svsk);
}
* [PATCH v2 3/4] SUNRPC: Convert svc_udp_sendto() to use the per-socket bio_vec array
From: Chuck Lever @ 2023-07-14 18:10 UTC (permalink / raw)
To: linux-nfs, netdev; +Cc: Chuck Lever, dhowells
From: Chuck Lever <chuck.lever@oracle.com>
Commit da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg
for socket sends") modified svc_udp_sendto() to use xprt_sock_sendmsg()
because we originally believed xprt_sock_sendmsg() would be needed
for TLS support. That does not actually appear to be the case.
In addition, the linkage between the client and server send code has
been a bit of a maintenance headache because of the distinct ways
that the client and server handle memory allocation.
Going forward, the XDR layer will deal with its buffers
in the form of bio_vec arrays, so convert this function accordingly.
Once the use of bio_vecs is ubiquitous, the xdr_buf-to-bio_vec array
code can be hoisted into a path that is common for all transports.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/svcsock.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index bb185c0bb57c..e164ea4d0e0a 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -695,7 +695,7 @@ static int svc_udp_sendto(struct svc_rqst *rqstp)
.msg_control = cmh,
.msg_controllen = sizeof(buffer),
};
- unsigned int sent;
+ unsigned int count;
int err;
svc_udp_release_ctxt(xprt, rqstp->rq_xprt_ctxt);
@@ -708,22 +708,23 @@ static int svc_udp_sendto(struct svc_rqst *rqstp)
if (svc_xprt_is_dead(xprt))
goto out_notconn;
- err = xdr_alloc_bvec(xdr, GFP_KERNEL);
- if (err < 0)
- goto out_unlock;
+ count = xdr_buf_to_bvec(svsk->sk_send_bvec,
+ ARRAY_SIZE(svsk->sk_send_bvec), xdr);
- err = xprt_sock_sendmsg(svsk->sk_sock, &msg, xdr, 0, 0, &sent);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, svsk->sk_send_bvec,
+ count, 0);
+ err = sock_sendmsg(svsk->sk_sock, &msg);
if (err == -ECONNREFUSED) {
/* ICMP error on earlier request. */
- err = xprt_sock_sendmsg(svsk->sk_sock, &msg, xdr, 0, 0, &sent);
+ iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, svsk->sk_send_bvec,
+ count, 0);
+ err = sock_sendmsg(svsk->sk_sock, &msg);
}
- xdr_free_bvec(xdr);
+
trace_svcsock_udp_send(xprt, err);
-out_unlock:
+
mutex_unlock(&xprt->xpt_mutex);
- if (err < 0)
- return err;
- return sent;
+ return err;
out_notconn:
mutex_unlock(&xprt->xpt_mutex);
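One detail in the hunk above is easy to miss: sock_sendmsg() advances
msg.msg_iter as it transmits, so the iterator is spent after the first call.
That is why the -ECONNREFUSED retry path re-runs iov_iter_bvec() over
sk_send_bvec before calling sock_sendmsg() a second time, rather than reusing
the exhausted msghdr.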
* [PATCH v2 4/4] SUNRPC: Use a per-transport receive bio_vec array
From: Chuck Lever @ 2023-07-14 18:10 UTC (permalink / raw)
To: linux-nfs, netdev; +Cc: Chuck Lever, dhowells
From: Chuck Lever <chuck.lever@oracle.com>
TCP receives are serialized, so we need only one bio_vec array per
socket. This shrinks the size of struct svc_rqst by 4144 bytes on
x86_64.
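That figure matches the array being removed: RPCSVC_MAXPAGES works out to 259
with the default 1 MB maximum payload and 4 KB pages, and struct bio_vec is
16 bytes on x86_64, so the dropped rq_bvec[] costs 259 * 16 = 4144 bytes in
every struct svc_rqst.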
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
include/linux/sunrpc/svc.h | 1 -
include/linux/sunrpc/svcsock.h | 2 ++
net/sunrpc/svcsock.c | 2 +-
3 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index f8751118c122..36052188222d 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -224,7 +224,6 @@ struct svc_rqst {
struct folio_batch rq_fbatch;
struct kvec rq_vec[RPCSVC_MAXPAGES]; /* generally useful.. */
- struct bio_vec rq_bvec[RPCSVC_MAXPAGES];
__be32 rq_xid; /* transmission id */
u32 rq_prog; /* program number */
diff --git a/include/linux/sunrpc/svcsock.h b/include/linux/sunrpc/svcsock.h
index baea012e3b04..55446136499f 100644
--- a/include/linux/sunrpc/svcsock.h
+++ b/include/linux/sunrpc/svcsock.h
@@ -42,6 +42,8 @@ struct svc_sock {
struct completion sk_handshake_done;
+ struct bio_vec sk_recv_bvec[RPCSVC_MAXPAGES]
+ ____cacheline_aligned;
struct bio_vec sk_send_bvec[RPCSVC_MAXPAGES]
____cacheline_aligned;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index e164ea4d0e0a..5cbc35e23e4f 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -299,7 +299,7 @@ static ssize_t svc_tcp_read_msg(struct svc_rqst *rqstp, size_t buflen,
{
struct svc_sock *svsk =
container_of(rqstp->rq_xprt, struct svc_sock, sk_xprt);
- struct bio_vec *bvec = rqstp->rq_bvec;
+ struct bio_vec *bvec = svsk->sk_recv_bvec;
struct msghdr msg = { NULL };
unsigned int i;
ssize_t len;
* Re: [PATCH v2 1/4] SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly
From: Jeff Layton @ 2023-08-12 12:04 UTC (permalink / raw)
To: Chuck Lever, linux-nfs, netdev; +Cc: Chuck Lever, dhowells
On Fri, 2023-07-14 at 14:10 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Add a helper to convert a whole xdr_buf directly into an array of
> bio_vecs, then send this array instead of iterating piecemeal over
> the xdr_buf containing the outbound RPC message.
>
> Note that the rules of the RPC protocol mean there can be only one
> outstanding send at a time on a transport socket. The kernel's
> SunRPC server enforces this via the transport's xpt_mutex. Thus we
> can use a per-transport shared array for the xdr_buf conversion
> rather than allocate one every time or use one that is part of
> struct svc_rqst.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> include/linux/sunrpc/svcsock.h | 3 ++
> include/linux/sunrpc/xdr.h | 2 +
> net/sunrpc/svcsock.c | 59 ++++++++++++++--------------------------
> net/sunrpc/xdr.c | 50 ++++++++++++++++++++++++++++++++++
> 4 files changed, 75 insertions(+), 39 deletions(-)
>
I've seen some pynfs test regressions in mainline (v6.5-rc5-ish)
kernels. Here's one failing test:
_text = b'write data' # len=10
[...]
def testSimpleWrite2(t, env):
    """WRITE with stateid=zeros changing size

    FLAGS: write all
    DEPEND: MKFILE
    CODE: WRT1b
    """
    c = env.c1
    c.init_connection()
    attrs = {FATTR4_SIZE: 32, FATTR4_MODE: 0o644}
    fh, stateid = c.create_confirm(t.word(), attrs=attrs,
                                   deny=OPEN4_SHARE_DENY_NONE)
    res = c.write_file(fh, _text, 30)
    check(res, msg="WRITE with stateid=zeros changing size")
    res = c.read_file(fh, 25, 20)
    _compare(t, res, b'\0'*5 + _text, True)
This test writes 10 bytes of data to a file at offset 30, and then does
a 20-byte read starting at offset 25. The READ reply has NULs where the
written data should be.
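For anyone who wants to poke at the same I/O pattern outside pynfs, the sketch
below is roughly equivalent (the mount path is a placeholder). Note that a
client's page cache can satisfy the read locally, so actually exercising the
server's READ path on the wire needs O_DIRECT, a cache drop, or a second
mount:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        static const char text[] = "write data";        /* len = 10 */
        char buf[20] = { 0 }, expect[15] = { 0 };
        int fd = open("/mnt/nfs/testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
        ssize_t n;

        if (fd < 0 || ftruncate(fd, 32))        /* FATTR4_SIZE: 32 */
                return 1;
        if (pwrite(fd, text, 10, 30) != 10)     /* 10-byte WRITE at offset 30 */
                return 1;
        n = pread(fd, buf, 20, 25);             /* 20-byte READ at offset 25 */

        memcpy(expect + 5, text, 10);           /* expected: 5 NULs, then data */
        printf("%s (read %zd bytes)\n",
               n == 15 && !memcmp(buf, expect, 15) ? "ok" : "MISMATCH", n);
        close(fd);
        return 0;
}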
The patch that broke things is this one:
5df5dd03a8f7 sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
This patch fixes the problem and gets the test run "green" again. I
think we will probably want to send this patch to mainline for v6.5, but
it'd be good to understand what's broken and how this fixes it.
Do you (or David) have any insight?
It'd also be good to understand whether we also need to fix UDP. pynfs
is tcp-only, so I can't run the same test there as easily.
> [...]
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH v2 1/4] SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly
From: Chuck Lever @ 2023-08-13 16:04 UTC (permalink / raw)
To: Jeff Layton; +Cc: Chuck Lever, linux-nfs, netdev, dhowells
On Sat, Aug 12, 2023 at 08:04:57AM -0400, Jeff Layton wrote:
> On Fri, 2023-07-14 at 14:10 -0400, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> >
> > Add a helper to convert a whole xdr_buf directly into an array of
> > bio_vecs, then send this array instead of iterating piecemeal over
> > the xdr_buf containing the outbound RPC message.
> >
> > Note that the rules of the RPC protocol mean there can be only one
> > outstanding send at a time on a transport socket. The kernel's
> > SunRPC server enforces this via the transport's xpt_mutex. Thus we
> > can use a per-transport shared array for the xdr_buf conversion
> > rather than allocate one every time or use one that is part of
> > struct svc_rqst.
> >
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> > include/linux/sunrpc/svcsock.h | 3 ++
> > include/linux/sunrpc/xdr.h | 2 +
> > net/sunrpc/svcsock.c | 59 ++++++++++++++--------------------------
> > net/sunrpc/xdr.c | 50 ++++++++++++++++++++++++++++++++++
> > 4 files changed, 75 insertions(+), 39 deletions(-)
> >
>
> I've seen some pynfs test regressions in mainline (v6.5-rc5-ish)
> kernels. Here's one failing test:
>
> _text = b'write data' # len=10
>
> [...]
>
> def testSimpleWrite2(t, env):
> """WRITE with stateid=zeros changing size
>
> FLAGS: write all
> DEPEND: MKFILE
> CODE: WRT1b
> """
> c = env.c1
> c.init_connection()
> attrs = {FATTR4_SIZE: 32, FATTR4_MODE: 0o644}
> fh, stateid = c.create_confirm(t.word(), attrs=attrs,
> deny=OPEN4_SHARE_DENY_NONE)
> res = c.write_file(fh, _text, 30)
> check(res, msg="WRITE with stateid=zeros changing size")
> res = c.read_file(fh, 25, 20)
> _compare(t, res, b'\0'*5 + _text, True)
>
> This test writes 10 bytes of data (to a file at offset 30, and then does
> a 20 byte read starting at offset 25. The READ reply has NULs where the
> written data should be
Nice catch. I hope this is something you found with your nascent
kdevops rig...? :^)
> The patch that broke things is this one:
>
> 5df5dd03a8f7 sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
>
> This patch fixes the problem and gets the test run "green" again.
Note that the version of the patch in this reply is not the version
that is applied to nfsd-next. That might not make a difference,
though.
> I think we will probably want to send this patch to mainline for v6.5, but
> it'd be good to understand what's broken and how this fixes it.
I agree, I'd like to get a root cause before proceeding with a fix.
> Do you (or David) have any insight?
It's often the case that this kind of problem is because either the
send or receive code doesn't handle a non-zero xdr_buf::page_base
correctly. A failing READ at a non-page-aligned offset is a typical
feature of this kind of bug.
This class of bug slips through the cracks when testing with POSIX
pagecache-based clients because such clients only rarely send non-
aligned READ requests.
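As a purely illustrative example of that class of bug: a READ reply for a
request at a non-page-aligned offset typically lands in the reply xdr_buf
with a non-zero page_base, i.e. the payload starts partway into the first
reply page. A send path that walks xdr->pages but fails to apply
offset_in_page(xdr->page_base) to the first vector sends the right number of
bytes from the start of that page instead, so the client sees whatever
happens to be at the beginning of the page (here, zeroes) where the written
data should be — the same symptom as in the pynfs failure, though the actual
root cause here still needs to be confirmed.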
> It'd also be good to understand whether we also need to fix UDP. pynfs
> is tcp-only, so I can't run the same test there as easily.
We don't have a good story for testing UDP. We should either build
a proper test infrastructure, or cut bait and remove UDP support.
But once you spot the incorrect code in the TCP send path, I bet
it won't be hard to see whether UDP also has this problem.
> [...]
--
Chuck Lever
* Re: [PATCH v2 1/4] SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly
From: Mkrtchyan, Tigran @ 2023-08-14 12:56 UTC (permalink / raw)
To: Jeff Layton; +Cc: Chuck Lever, linux-nfs, netdev, Chuck Lever, dhowells
----- Original Message -----
> From: "Jeff Layton" <jlayton@kernel.org>
> To: "Chuck Lever" <cel@kernel.org>, "linux-nfs" <linux-nfs@vger.kernel.org>, netdev@vger.kernel.org
> Cc: "Chuck Lever" <chuck.lever@oracle.com>, dhowells@redhat.com
> Sent: Saturday, 12 August, 2023 14:04:57
> Subject: Re: [PATCH v2 1/4] SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly
> On Fri, 2023-07-14 at 14:10 -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Add a helper to convert a whole xdr_buf directly into an array of
>> bio_vecs, then send this array instead of iterating piecemeal over
>> the xdr_buf containing the outbound RPC message.
>>
>> Note that the rules of the RPC protocol mean there can be only one
>> outstanding send at a time on a transport socket. The kernel's
>> SunRPC server enforces this via the transport's xpt_mutex. Thus we
>> can use a per-transport shared array for the xdr_buf conversion
>> rather than allocate one every time or use one that is part of
>> struct svc_rqst.
>>
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> include/linux/sunrpc/svcsock.h | 3 ++
>> include/linux/sunrpc/xdr.h | 2 +
>> net/sunrpc/svcsock.c | 59 ++++++++++++++--------------------------
>> net/sunrpc/xdr.c | 50 ++++++++++++++++++++++++++++++++++
>> 4 files changed, 75 insertions(+), 39 deletions(-)
>>
>
> I've seen some pynfs test regressions in mainline (v6.5-rc5-ish)
> kernels. Here's one failing test:
BTW, we have built a container to run pynfs tests as part of your CI process.
podman run -ti --rm dcache/pynfs:0.3 /run-nfs4.0.sh --help
podman run -ti --rm dcache/pynfs:0.3 /run-nfs4.1.sh --help
Maybe others will find it useful as well.
Tigran.
> [...]