* [PATCH 0/3] three lockd fixes
@ 2026-05-14 20:56 Chuck Lever
2026-05-14 20:56 ` [PATCH 1/3] lockd: Plug nlm_file leak when nlm_do_fopen() fails Chuck Lever
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: Chuck Lever @ 2026-05-14 20:56 UTC (permalink / raw)
To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
Jeff reported an issue earlier today, and while I was crafting the
fix, LLM review turned up two more pre-existing problems nearby.
Chuck Lever (3):
lockd: Plug nlm_file leak when nlm_do_fopen() fails
lockd: Plug nlm_file refcount leak on cached nlm_do_fopen() failure
lockd: Avoid hashing uninitialized bytes in nlm4svc_lookup_file()
fs/lockd/lockd.h | 8 ++++++++
fs/lockd/svc4proc.c | 3 +++
fs/lockd/svcsubs.c | 7 ++++---
3 files changed, 15 insertions(+), 3 deletions(-)
--
2.54.0
^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/3] lockd: Plug nlm_file leak when nlm_do_fopen() fails
  2026-05-14 20:56 [PATCH 0/3] three lockd fixes Chuck Lever
@ 2026-05-14 20:56 ` Chuck Lever
  2026-05-14 20:56 ` [PATCH net] tls: Preserve sk_err across recvmsg() when data has been copied Chuck Lever
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Chuck Lever @ 2026-05-14 20:56 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

A client can repeatedly drive nlm_do_fopen() failures by presenting
file handles that the underlying export rejects. After kzalloc_obj()
succeeds in nlm_lookup_file(), the freshly allocated nlm_file is not
yet inserted into nlm_files[]. The nlm_do_fopen() failure path jumps
to out_unlock, which releases nlm_file_mutex and returns without
freeing the allocation, so each failure leaks one nlm_file.

Route the failure through out_free so kfree() runs before the
function returns.

Fixes: 7f024fcd5c97 ("Keep read and write fds with each nlm_file")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/lockd/svcsubs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
index e24bacea7e03..0b81d8db0919 100644
--- a/fs/lockd/svcsubs.c
+++ b/fs/lockd/svcsubs.c
@@ -166,7 +166,7 @@ nlm_lookup_file(struct svc_rqst *rqstp, struct nlm_file **result,
 
 	nfserr = nlm_do_fopen(rqstp, file, mode);
 	if (nfserr)
-		goto out_unlock;
+		goto out_free;
 
 	hlist_add_head(&file->f_list, &nlm_files[hash]);
-- 
2.54.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread
* [PATCH net] tls: Preserve sk_err across recvmsg() when data has been copied
  2026-05-14 20:56 [PATCH 0/3] three lockd fixes Chuck Lever
  2026-05-14 20:56 ` [PATCH 1/3] lockd: Plug nlm_file leak when nlm_do_fopen() fails Chuck Lever
@ 2026-05-14 20:56 ` Chuck Lever
  2026-05-14 20:57   ` Chuck Lever
  2026-05-14 20:56 ` [PATCH 2/3] lockd: Plug nlm_file refcount leak on cached nlm_do_fopen() failure Chuck Lever
  2026-05-14 20:56 ` [PATCH 3/3] lockd: Avoid hashing uninitialized bytes in nlm4svc_lookup_file() Chuck Lever
  3 siblings, 1 reply; 6+ messages in thread
From: Chuck Lever @ 2026-05-14 20:56 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Jakub Kicinski

From: Chuck Lever <chuck.lever@oracle.com>

The sk_err check in tls_rx_rec_wait() consumes the error via
sock_error(), which clears sk_err atomically. When the caller
(tls_sw_recvmsg, tls_sw_splice_read, or tls_sw_read_sock) already
has bytes copied to userspace, it returns those bytes and discards
the error from this call. sk_err is now zero on the socket, so the
next read syscall observes only RCV_SHUTDOWN and reports a clean
EOF instead of the actual error (typically -ECONNRESET).

The race is reachable when tls_read_flush_backlog()'s periodic
sk_flush_backlog() triggers tcp_reset() in the middle of a
multi-record read.

Pass a has_copied flag to tls_rx_rec_wait(). When has_copied is
false, consume sk_err via sock_error() as before. When has_copied
is true, report the error from READ_ONCE() but leave sk_err set:
the caller returns the byte count and discards the err from this
call, and the next read syscall surfaces the preserved sk_err.
This mirrors the tcp_recvmsg() preserve-and-surface pattern.

The decrypt-abort path is unaffected: tls_err_abort() raises
sk_err to EBADMSG after tls_rx_rec_wait() returns, and nothing on
the caller's return path consumes it, so the EBADMSG surfaces on
the next read.

tls_sw_splice_read() passes has_copied=false: it processes one
record per call, so no bytes have been copied within the function
when tls_rx_rec_wait() runs. A reset that arrives between
iterations of splice_direct_to_actor() (the sendfile() path) is
still consumed by sock_error() in the later call, and the outer
loop returns the prior iterations' byte count and drops the error.
tcp_splice_read() exhibits the same pattern at the iteration
boundary; addressing it belongs at the splice_direct_to_actor()
layer and is out of scope here.

Fixes: c46b01839f7a ("tls: rx: periodically flush socket backlog")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/tls/tls_sw.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 2590e855f6a5..c4cc4e357848 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1356,9 +1356,14 @@ void tls_sw_splice_eof(struct socket *sock)
 	mutex_unlock(&tls_ctx->tx_lock);
 }
 
+/* When has_copied is true the caller has already moved bytes to
+ * userspace. Report sk_err but leave it set so the next read
+ * surfaces it instead of a spurious EOF, otherwise sk_err is
+ * consumed via sock_error().
+ */
 static int
 tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
-		bool released)
+		bool released, bool has_copied)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
@@ -1376,8 +1381,11 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
 		if (!sk_psock_queue_empty(psock))
 			return 0;
 
-	if (sk->sk_err)
+	if (sk->sk_err) {
+		if (has_copied)
+			return -READ_ONCE(sk->sk_err);
 		return sock_error(sk);
+	}
 
 	if (ret < 0)
 		return ret;
@@ -1413,7 +1421,7 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
 	}
 
 	if (unlikely(!tls_strp_msg_load(&ctx->strp, released)))
-		return tls_rx_rec_wait(sk, psock, nonblock, false);
+		return tls_rx_rec_wait(sk, psock, nonblock, false, has_copied);
 
 	return 1;
 }
@@ -2100,7 +2108,7 @@ int tls_sw_recvmsg(struct sock *sk,
 		int to_decrypt, chunk;
 
 		err = tls_rx_rec_wait(sk, psock, flags & MSG_DONTWAIT,
-				      released);
+				      released, !!(decrypted + copied));
 		if (err <= 0) {
 			if (psock) {
 				chunk = sk_msg_recvmsg(sk, psock, msg, len,
@@ -2287,7 +2295,7 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
 		struct tls_decrypt_arg darg;
 
 		err = tls_rx_rec_wait(sk, NULL, flags & SPLICE_F_NONBLOCK,
-				      true);
+				      true, false);
 		if (err <= 0)
 			goto splice_read_end;
 
@@ -2373,7 +2381,7 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
 		} else {
 			struct tls_decrypt_arg darg;
 
-			err = tls_rx_rec_wait(sk, NULL, true, released);
+			err = tls_rx_rec_wait(sk, NULL, true, released, !!copied);
 			if (err <= 0)
 				goto read_sock_end;
 
-- 
2.54.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread
* Re: [PATCH net] tls: Preserve sk_err across recvmsg() when data has been copied
  2026-05-14 20:56 ` [PATCH net] tls: Preserve sk_err across recvmsg() when data has been copied Chuck Lever
@ 2026-05-14 20:57   ` Chuck Lever
  0 siblings, 0 replies; 6+ messages in thread
From: Chuck Lever @ 2026-05-14 20:57 UTC (permalink / raw)
  To: Chuck Lever, NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Jakub Kicinski

On 5/14/26 4:56 PM, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> The sk_err check in tls_rx_rec_wait() consumes the error via
> sock_error(), which clears sk_err atomically.

[ ... full patch snipped ... ]

Ignore this email. It looks like something left over from a previous
posting.

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 6+ messages in thread
* [PATCH 2/3] lockd: Plug nlm_file refcount leak on cached nlm_do_fopen() failure
  2026-05-14 20:56 [PATCH 0/3] three lockd fixes Chuck Lever
  2026-05-14 20:56 ` [PATCH 1/3] lockd: Plug nlm_file leak when nlm_do_fopen() fails Chuck Lever
  2026-05-14 20:56 ` [PATCH net] tls: Preserve sk_err across recvmsg() when data has been copied Chuck Lever
@ 2026-05-14 20:56 ` Chuck Lever
  2026-05-14 20:56 ` [PATCH 3/3] lockd: Avoid hashing uninitialized bytes in nlm4svc_lookup_file() Chuck Lever
  3 siblings, 0 replies; 6+ messages in thread
From: Chuck Lever @ 2026-05-14 20:56 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

The cached-file path in nlm_lookup_file() reaches the found: label
unconditionally, even when nlm_do_fopen() fails. At that label
*result and file->f_count are updated before the error is returned.
The wrappers nlmsvc_lookup_file() and nlm4svc_lookup_file() then
bail out of their switch without copying *result back to their
caller, so the proc handler's local nlm_file pointer remains NULL
and the cleanup path skips nlm_release_file(). The f_count
increment is never released, and nlm_traverse_files() can no longer
reap the file because its refcount never returns to zero between
requests.

Short-circuit the cached path so neither *result nor f_count is
touched when nlm_do_fopen() fails on a hashed nlm_file.

Fixes: 7f024fcd5c97 ("Keep read and write fds with each nlm_file")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/lockd/svcsubs.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
index 0b81d8db0919..58b87ec52930 100644
--- a/fs/lockd/svcsubs.c
+++ b/fs/lockd/svcsubs.c
@@ -150,6 +150,8 @@
 		mutex_lock(&file->f_mutex);
 		nfserr = nlm_do_fopen(rqstp, file, mode);
 		mutex_unlock(&file->f_mutex);
+		if (nfserr)
+			goto out_unlock;
 		goto found;
 	}
 	nlm_debug_print_fh("creating file for", &lock->fh);
-- 
2.54.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread
* [PATCH 3/3] lockd: Avoid hashing uninitialized bytes in nlm4svc_lookup_file()
  2026-05-14 20:56 [PATCH 0/3] three lockd fixes Chuck Lever
  ` (2 preceding siblings ...)
  2026-05-14 20:56 ` [PATCH 2/3] lockd: Plug nlm_file refcount leak on cached nlm_do_fopen() failure Chuck Lever
@ 2026-05-14 20:56 ` Chuck Lever
  3 siblings, 0 replies; 6+ messages in thread
From: Chuck Lever @ 2026-05-14 20:56 UTC (permalink / raw)
  To: NeilBrown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

file_hash() digests the first LOCKD_FH_HASH_SIZE bytes of
nfs_fh.data when bucketing nlm_files[], independent of fh.size.
Commit 3de744ee4e45 ("lockd: Use xdrgen XDR functions for the NLMv4
TEST procedure") set .pc_argzero to zero for the converted
procedures and moved file-handle population into
nlm4svc_lookup_file(), which copies only xdr_lock->fh.len bytes
into lock->fh.data.

When an NLMv4 client presents a file handle shorter than
LOCKD_FH_HASH_SIZE, bytes fh.len..31 retain whatever the argument
buffer held from an earlier request. The same wire handle then
hashes to different buckets across calls; nlm_lookup_file() misses
the existing nlm_file entry, and lock-state lookups fail.

Zero only the tail bytes that file_hash() would otherwise consume.
Handles of LOCKD_FH_HASH_SIZE or larger already populate every byte
that file_hash() reads.

Reported-by: Jeff Layton <jlayton@kernel.org>
Closes: https://lore.kernel.org/r/5229a9746d723a3f830120c0b966510f75badfc2.camel@kernel.org
Fixes: 3de744ee4e45 ("lockd: Use xdrgen XDR functions for the NLMv4 TEST procedure")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/lockd/lockd.h    | 8 ++++++++
 fs/lockd/svc4proc.c | 3 +++
 fs/lockd/svcsubs.c  | 3 +--
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/lockd/lockd.h b/fs/lockd/lockd.h
index 0be0dac59ea2..e418a50c4180 100644
--- a/fs/lockd/lockd.h
+++ b/fs/lockd/lockd.h
@@ -52,6 +52,14 @@
  */
 #define LOCKD_DFLT_TIMEO	10
 
+/*
+ * Number of leading bytes of nfs_fh.data that file_hash()
+ * digests when bucketing nlm_files[]. Sized for historical
+ * NFSv2 handles; nfs_fh.data must be initialized at least
+ * this far before lookup, regardless of fh.size.
+ */
+#define LOCKD_FH_HASH_SIZE	32
+
 /* error codes new to NLMv4 */
 #define nlm4_deadlock		cpu_to_be32(NLM_DEADLCK)
 #define nlm4_rofs		cpu_to_be32(NLM_ROFS)

diff --git a/fs/lockd/svc4proc.c b/fs/lockd/svc4proc.c
index 997f4f437997..78e675470c4b 100644
--- a/fs/lockd/svc4proc.c
+++ b/fs/lockd/svc4proc.c
@@ -156,6 +156,9 @@ nlm4svc_lookup_file(struct svc_rqst *rqstp, struct nlm_host *host,
 		return nlm_lck_denied_nolocks;
 
 	lock->fh.size = xdr_lock->fh.len;
 	memcpy(lock->fh.data, xdr_lock->fh.data, xdr_lock->fh.len);
+	if (xdr_lock->fh.len < LOCKD_FH_HASH_SIZE)
+		memset(lock->fh.data + xdr_lock->fh.len, 0,
+		       LOCKD_FH_HASH_SIZE - xdr_lock->fh.len);
 	lock->oh.len = xdr_lock->oh.len;
 	lock->oh.data = xdr_lock->oh.data;

diff --git a/fs/lockd/svcsubs.c b/fs/lockd/svcsubs.c
index 58b87ec52930..a0d1a6fbf61e 100644
--- a/fs/lockd/svcsubs.c
+++ b/fs/lockd/svcsubs.c
@@ -17,7 +17,6 @@
 #include <linux/sunrpc/addr.h>
 #include <linux/module.h>
 #include <linux/mount.h>
-#include <uapi/linux/nfs2.h>
 
 #include "lockd.h"
 #include "share.h"
@@ -67,7 +66,7 @@ static inline unsigned int file_hash(struct nfs_fh *f)
 {
 	unsigned int tmp=0;
 	int i;
-	for (i=0; i<NFS2_FHSIZE;i++)
+	for (i = 0; i < LOCKD_FH_HASH_SIZE; i++)
 		tmp += f->data[i];
 	return tmp & (FILE_NRHASH - 1);
 }
-- 
2.54.0

^ permalink raw reply related	[flat|nested] 6+ messages in thread