* [PATCH] ceph: fix hanging __ceph_get_caps() with stale `mds_wanted`
@ 2026-04-27 15:58 Max Kellermann
2026-04-28 18:46 ` Viacheslav Dubeyko
From: Max Kellermann @ 2026-04-27 15:58 UTC (permalink / raw)
To: idryomov, amarkuze, ceph-devel, linux-kernel; +Cc: Max Kellermann, stable
A reader can hang forever in __ceph_get_caps() when the client no
longer holds `FILE_RD`, but local cap state still says that the
capability is already wanted (via `mds_wanted`).
One way to trigger this is through MDS cap revocation. If another
client performs a conflicting operation, the MDS can revoke `FILE_RD`
from the reader; the next read then has to reacquire `FILE_RD`. If
the cap update that should request `FILE_RD` never reaches the MDS
after `cap->mds_wanted` was raised, the reader is left holding only
non-file caps while local `mds_wanted` still includes the file read
caps.
In that state, try_get_cap_refs() sees `need <= mds_wanted` and
returns 0, so __ceph_get_caps() just waits on `i_cap_wq`. Since the
cap update that was supposed to request `FILE_RD` never reached the
MDS, no further request is sent and the waiter can sleep indefinitely
until unrelated cap traffic happens to wake it up.
The ordering issue is that `cap->mds_wanted` is updated in
__prep_cap() before the `CEPH_MSG_CLIENT_CAPS` message is actually
queued for send. That makes one field serve two different meanings at
once: what this client wants, and what the client believes the MDS
already knows it wants.
A proper fix would be to split those states and track whether a cap
update is actually in flight or has been observed by the MDS.
However, simply moving the `cap->mds_wanted` assignment later would
not be sufficient: queueing the message in the messenger does not
guarantee that the MDS processed that specific wanted set, and
reconnect or message loss can still invalidate that assumption.
Fixing that properly would require a larger rework of the cap state
machine.
To allow easy backports to stable kernels, this patch implements a
simpler workaround:
- stop waiting forever in __ceph_get_caps(); after a bounded wait,
fall back to the renew path
- make ceph_renew_caps() issue a synchronous `OPEN` request whenever
the inode still does not actually hold the wanted caps, instead of
only calling ceph_check_caps()
The extra issued-vs-wanted check in ceph_renew_caps() is necessary
because the previous test only checked whether the inode still had any
real caps at all. That is not enough after revocation: the client can
still hold something like `pLs` and yet be missing `FILE_RD`
completely. In that case, falling back to ceph_check_caps() is not
sufficient, because it still trusts `cap->mds_wanted` and may resend
nothing. By requiring `(issued & wanted) == wanted` before taking the
asynchronous path, the code only uses ceph_check_caps() when the
wanted caps are already actually issued. Otherwise, it sends the
synchronous `OPEN` renew.
This preserves the existing asynchronous fast path when the wanted
caps are already issued, avoids changing cap-state semantics, and
fixes the hang by guaranteeing that a stalled waiter eventually
retries through a path that does not rely on the stale `mds_wanted`
state.
Cc: stable@vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
fs/ceph/caps.c | 20 +++++++++++++++++---
fs/ceph/file.c | 9 +++++----
2 files changed, 22 insertions(+), 7 deletions(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d51454e995a8..dd11611f250b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3087,11 +3087,24 @@ int __ceph_get_caps(struct inode *inode, struct ceph_file_info *fi, int need,
flags |= NON_BLOCKING;
while (!(ret = try_get_cap_refs(inode, need, want,
endoff, flags, &_got))) {
+ static const unsigned long wait_timeout = 5 * HZ;
+
if (signal_pending(current)) {
ret = -ERESTARTSYS;
break;
}
- wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
+
+ /*
+ * If a cap update is lost after
+ * mds_wanted was raised, waiting
+ * forever will never make progress.
+ * Retry the renew path periodically
+ * so we can resend synchronously.
+ */
+ if (!wait_woken(&wait, TASK_INTERRUPTIBLE, wait_timeout)) {
+ ret = -EUCLEAN;
+ break;
+ }
}
remove_wait_queue(&ci->i_cap_wq, &wait);
@@ -3125,8 +3138,9 @@ int __ceph_get_caps(struct inode *inode, struct ceph_file_info *fi, int need,
continue;
}
if (ret == -EUCLEAN) {
- /* session was killed, try renew caps */
- ret = ceph_renew_caps(inode, flags);
+ /* session was killed or a waited cap
+ * request needs a retry */
+ ret = ceph_renew_caps(inode, flags & CEPH_FILE_MODE_MASK);
if (ret == 0)
continue;
}
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d54d71669176..47c7d4a5ffed 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -314,7 +314,7 @@ static int ceph_init_file(struct inode *inode, struct file *file, int fmode)
}
/*
- * try renew caps after session gets killed.
+ * Retry cap acquisition after a stale session or a lost cap update.
*/
int ceph_renew_caps(struct inode *inode, int fmode)
{
@@ -322,14 +322,15 @@ int ceph_renew_caps(struct inode *inode, int fmode)
struct ceph_client *cl = mdsc->fsc->client;
struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_mds_request *req;
- int err, flags, wanted;
+ int err, flags, wanted, issued;
spin_lock(&ci->i_ceph_lock);
__ceph_touch_fmode(ci, mdsc, fmode);
wanted = __ceph_caps_file_wanted(ci);
+ issued = __ceph_caps_issued(ci, NULL);
if (__ceph_is_any_real_caps(ci) &&
- (!(wanted & CEPH_CAP_ANY_WR) || ci->i_auth_cap)) {
- int issued = __ceph_caps_issued(ci, NULL);
+ (!(wanted & CEPH_CAP_ANY_WR) || ci->i_auth_cap) &&
+ (issued & wanted) == wanted) {
spin_unlock(&ci->i_ceph_lock);
doutc(cl, "%p %llx.%llx want %s issued %s updating mds_wanted\n",
inode, ceph_vinop(inode), ceph_cap_string(wanted),
--
2.47.3
* Re: [PATCH] ceph: fix hanging __ceph_get_caps() with stale `mds_wanted`
2026-04-27 15:58 [PATCH] ceph: fix hanging __ceph_get_caps() with stale `mds_wanted` Max Kellermann
@ 2026-04-28 18:46 ` Viacheslav Dubeyko
From: Viacheslav Dubeyko @ 2026-04-28 18:46 UTC (permalink / raw)
To: max.kellermann@ionos.com, idryomov@gmail.com, Alex Markuze,
linux-kernel@vger.kernel.org, ceph-devel@vger.kernel.org
Cc: stable@vger.kernel.org
On Mon, 2026-04-27 at 17:58 +0200, Max Kellermann wrote:
> A reader can hang forever in __ceph_get_caps() when the client no
> longer holds `FILE_RD`, but local cap state still says that the
> capability is already wanted (via `mds_wanted`).
>
> One way to trigger this is through MDS cap revocation. If another
> client performs a conflicting operation, the MDS can revoke `FILE_RD`
> from the reader; the next read then has to reacquire `FILE_RD`. If
> the cap update that should request `FILE_RD` never reaches the MDS
> after `cap->mds_wanted` was raised, the reader is left holding only
> non-file caps while local `mds_wanted` still includes the file read
> caps.
>
> In that state, try_get_cap_refs() sees `need <= mds_wanted` and
> returns 0, so __ceph_get_caps() just waits on `i_cap_wq`. Since the
> cap update that was supposed to request `FILE_RD` never reached the
> MDS, no further request is sent and the waiter can sleep indefinitely
> until unrelated cap traffic happens to wake it up.
>
> The ordering issue is that `cap->mds_wanted` is updated in
> __prep_cap() before the `CEPH_MSG_CLIENT_CAPS` message is actually
> queued for send. That makes one field serve two different meanings at
> once: what this client wants, and what the client believes the MDS
> already knows it wants.
>
> A proper fix would be to split those states and track whether a cap
> update is actually in flight or has been observed by the MDS.
> However, simply moving the `cap->mds_wanted` assignment later would
> not be sufficient: queueing the message in the messenger does not
> guarantee that the MDS processed that specific wanted set, and
> reconnect or message loss can still invalidate that assumption.
> Fixing that properly would require a larger rework of the cap state
> machine.
>
> To allow easy backports to stable kernels, this patch implements a
> simpler workaround:
>
> - stop waiting forever in __ceph_get_caps(); after a bounded wait,
> fall back to the renew path
>
> - make ceph_renew_caps() issue a synchronous `OPEN` request whenever
> the inode still does not actually hold the wanted caps, instead of
> only calling ceph_check_caps()
>
> The extra issued-vs-wanted check in ceph_renew_caps() is necessary
> because the previous test only checked whether the inode still had any
> real caps at all. That is not enough after revocation: the client can
> still hold something like `pLs` and yet be missing `FILE_RD`
> completely. In that case, falling back to ceph_check_caps() is not
> sufficient, because it still trusts `cap->mds_wanted` and may resend
> nothing. By requiring `(issued & wanted) == wanted` before taking the
> asynchronous path, the code only uses ceph_check_caps() when the
> wanted caps are already actually issued. Otherwise, it sends the
> synchronous `OPEN` renew.
>
> This preserves the existing asynchronous fast path when the wanted
> caps are already issued, avoids changing cap-state semantics, and
> fixes the hang by guaranteeing that a stalled waiter eventually
> retries through a path that does not rely on the stale `mds_wanted`
> state.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
> ---
> fs/ceph/caps.c | 20 +++++++++++++++++---
> fs/ceph/file.c | 9 +++++----
> 2 files changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index d51454e995a8..dd11611f250b 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -3087,11 +3087,24 @@ int __ceph_get_caps(struct inode *inode, struct ceph_file_info *fi, int need,
> flags |= NON_BLOCKING;
> while (!(ret = try_get_cap_refs(inode, need, want,
> endoff, flags, &_got))) {
> + static const unsigned long wait_timeout = 5 * HZ;
Why exactly 5 * HZ? What is the basis for this timeout? Could we re-use any
available timeouts in CephFS declarations?
> +
> if (signal_pending(current)) {
> ret = -ERESTARTSYS;
> break;
> }
> - wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
> +
> + /*
> + * If a cap update is lost after
> + * mds_wanted was raised, waiting
> + * forever will never make progress.
> + * Retry the renew path periodically
> + * so we can resend synchronously.
> + */
> + if (!wait_woken(&wait, TASK_INTERRUPTIBLE, wait_timeout)) {
> + ret = -EUCLEAN;
> + break;
> + }
> }
>
> remove_wait_queue(&ci->i_cap_wq, &wait);
> @@ -3125,8 +3138,9 @@ int __ceph_get_caps(struct inode *inode, struct ceph_file_info *fi, int need,
> continue;
> }
> if (ret == -EUCLEAN) {
> - /* session was killed, try renew caps */
> - ret = ceph_renew_caps(inode, flags);
> + /* session was killed or a waited cap
> + * request needs a retry */
> + ret = ceph_renew_caps(inode, flags & CEPH_FILE_MODE_MASK);
Frankly speaking, I don't quite follow why do we need to add flags &
CEPH_FILE_MODE_MASK?
Thanks,
Slava.
> if (ret == 0)
> continue;
> }
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index d54d71669176..47c7d4a5ffed 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -314,7 +314,7 @@ static int ceph_init_file(struct inode *inode, struct file *file, int fmode)
> }
>
> /*
> - * try renew caps after session gets killed.
> + * Retry cap acquisition after a stale session or a lost cap update.
> */
> int ceph_renew_caps(struct inode *inode, int fmode)
> {
> @@ -322,14 +322,15 @@ int ceph_renew_caps(struct inode *inode, int fmode)
> struct ceph_client *cl = mdsc->fsc->client;
> struct ceph_inode_info *ci = ceph_inode(inode);
> struct ceph_mds_request *req;
> - int err, flags, wanted;
> + int err, flags, wanted, issued;
>
> spin_lock(&ci->i_ceph_lock);
> __ceph_touch_fmode(ci, mdsc, fmode);
> wanted = __ceph_caps_file_wanted(ci);
> + issued = __ceph_caps_issued(ci, NULL);
> if (__ceph_is_any_real_caps(ci) &&
> - (!(wanted & CEPH_CAP_ANY_WR) || ci->i_auth_cap)) {
> - int issued = __ceph_caps_issued(ci, NULL);
> + (!(wanted & CEPH_CAP_ANY_WR) || ci->i_auth_cap) &&
> + (issued & wanted) == wanted) {
> spin_unlock(&ci->i_ceph_lock);
> doutc(cl, "%p %llx.%llx want %s issued %s updating mds_wanted\n",
> inode, ceph_vinop(inode), ceph_cap_string(wanted),