* [PATCH 0/2] ceph: Fix r_parent staleness race and related deadlock
@ 2025-07-30 15:18 Alex Markuze
2025-07-30 15:18 ` [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state Alex Markuze
2025-07-30 15:19 ` [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message Alex Markuze
0 siblings, 2 replies; 7+ messages in thread
From: Alex Markuze @ 2025-07-30 15:18 UTC (permalink / raw)
To: ceph-devel; +Cc: Slava.Dubeyko, Alex Markuze
Hi,
This patchset addresses two related issues in CephFS client request handling.
**Patch 1/2 ("ceph: fix client race condition validating r_parent before applying state")**
This patch fixes a race condition in which the `req->r_parent` inode reference can become stale. Under specific conditions (e.g., expired dentry leases), the client performs lockless lookups, leaving a window in which a concurrent `rename` can invalidate `req->r_parent` between the initial VFS lookup and MDS request message creation. Previously, `create_request_message` trusted the cached `r_parent` without verification. This patch extends the path-building functions to track the full `ceph_vino` (inode number plus snapshot id) and adds a validation step in `create_request_message` that compares `req->r_parent` against the inode number just encoded for the request and, when the parent was not locked, corrects it on a mismatch.
**Patch 2/2 ("ceph: fix client race condition where r_parent becomes stale before sending message")**
This patch addresses the same staleness on the reply path. `ceph_fill_trace` previously applied directory state from an MDS reply directly to `req->r_parent`, which may no longer match the directory the MDS actually operated on. A new helper, `ceph_get_reply_dir`, validates the cached parent against the inode number and snapshot id in the reply's directory record and falls back to looking up the correct inode when they do not match, so state changes are not applied to the wrong directory inode.
Together, these patches improve the robustness and stability of CephFS client request handling by validating `r_parent` both when a request is sent and when its reply is processed.
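To make the staleness check concrete, here is a minimal userspace sketch of the decision patch 1 adds on the send path. All names and types here are illustrative stand-ins, not the kernel API: `struct vino` mimics `ceph_vino`, and the predicate mirrors the "only trust an unlocked parent if it matches the encoded path" rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for struct ceph_vino: inode number plus snapshot id. */
struct vino {
	uint64_t ino;
	uint64_t snap;
};

/*
 * Sketch of the send-path check: when the parent directory's i_rwsem is NOT
 * held, a concurrent rename may have invalidated the cached parent, so the
 * inode number just encoded into the request path is compared against it.
 * Returns true when the cached parent should be replaced.
 */
static bool parent_needs_update(bool parent_locked, bool have_parent,
				uint64_t cached_ino, uint64_t encoded_ino)
{
	if (parent_locked)		/* lock held: a rename could not have raced */
		return false;
	if (!have_parent || encoded_ino == 0)	/* nothing to compare against */
		return false;
	return cached_ino != encoded_ino;
}
```

In the actual patch, a positive result leads to fetching the correct inode from the cache and swapping it into the request before the message is sent.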
Alex Markuze (2):
ceph: fix client race condition validating r_parent before applying
state
ceph: fix client race condition where r_parent becomes stale before
sending message
fs/ceph/inode.c | 44 +++++++++++++++++++++++++++--
fs/ceph/mds_client.c | 67 +++++++++++++++++++++++++++++++-------------
2 files changed, 89 insertions(+), 22 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state
2025-07-30 15:18 [PATCH 0/2] ceph: Fix r_parent staleness race and related deadlock Alex Markuze
@ 2025-07-30 15:18 ` Alex Markuze
2025-07-30 22:09 ` Viacheslav Dubeyko
2025-07-30 15:19 ` [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message Alex Markuze
1 sibling, 1 reply; 7+ messages in thread
From: Alex Markuze @ 2025-07-30 15:18 UTC (permalink / raw)
To: ceph-devel; +Cc: Slava.Dubeyko, Alex Markuze
Add validation to ensure the cached parent directory inode matches the
directory info in MDS replies. This prevents client-side race conditions
where concurrent operations (e.g. rename) cause r_parent to become stale
between request initiation and reply processing, which could lead to
applying state changes to incorrect directory inodes.
---
fs/ceph/mds_client.c | 67 +++++++++++++++++++++++++++++++-------------
1 file changed, 47 insertions(+), 20 deletions(-)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 8d9fc5e18b17..a164783fc1e1 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2853,7 +2853,7 @@ char *ceph_mdsc_build_path(struct ceph_mds_client *mdsc, struct dentry *dentry,
static int build_dentry_path(struct ceph_mds_client *mdsc, struct dentry *dentry,
struct inode *dir, const char **ppath, int *ppathlen,
- u64 *pino, bool *pfreepath, bool parent_locked)
+ struct ceph_vino *pvino, bool *pfreepath, bool parent_locked)
{
char *path;
@@ -2862,23 +2862,29 @@ static int build_dentry_path(struct ceph_mds_client *mdsc, struct dentry *dentry
dir = d_inode_rcu(dentry->d_parent);
if (dir && parent_locked && ceph_snap(dir) == CEPH_NOSNAP &&
!IS_ENCRYPTED(dir)) {
- *pino = ceph_ino(dir);
+ pvino->ino = ceph_ino(dir);
+ pvino->snap = ceph_snap(dir);
rcu_read_unlock();
*ppath = dentry->d_name.name;
*ppathlen = dentry->d_name.len;
return 0;
}
rcu_read_unlock();
- path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, pino, 1);
+ path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, &pvino->ino, 1);
if (IS_ERR(path))
return PTR_ERR(path);
*ppath = path;
*pfreepath = true;
+ /* For paths built by ceph_mdsc_build_path, we need to get snap from dentry */
+ if (dentry && d_inode(dentry))
+ pvino->snap = ceph_snap(d_inode(dentry));
+ else
+ pvino->snap = CEPH_NOSNAP;
return 0;
}
static int build_inode_path(struct inode *inode,
- const char **ppath, int *ppathlen, u64 *pino,
+ const char **ppath, int *ppathlen, struct ceph_vino *pvino,
bool *pfreepath)
{
struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
@@ -2886,17 +2892,19 @@ static int build_inode_path(struct inode *inode,
char *path;
if (ceph_snap(inode) == CEPH_NOSNAP) {
- *pino = ceph_ino(inode);
+ pvino->ino = ceph_ino(inode);
+ pvino->snap = ceph_snap(inode);
*ppathlen = 0;
return 0;
}
dentry = d_find_alias(inode);
- path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, pino, 1);
+ path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, &pvino->ino, 1);
dput(dentry);
if (IS_ERR(path))
return PTR_ERR(path);
*ppath = path;
*pfreepath = true;
+ pvino->snap = ceph_snap(inode);
return 0;
}
@@ -2907,7 +2915,7 @@ static int build_inode_path(struct inode *inode,
static int set_request_path_attr(struct ceph_mds_client *mdsc, struct inode *rinode,
struct dentry *rdentry, struct inode *rdiri,
const char *rpath, u64 rino, const char **ppath,
- int *pathlen, u64 *ino, bool *freepath,
+ int *pathlen, struct ceph_vino *p_vino, bool *freepath,
bool parent_locked)
{
struct ceph_client *cl = mdsc->fsc->client;
@@ -2915,16 +2923,17 @@ static int set_request_path_attr(struct ceph_mds_client *mdsc, struct inode *rin
int r = 0;
if (rinode) {
- r = build_inode_path(rinode, ppath, pathlen, ino, freepath);
+ r = build_inode_path(rinode, ppath, pathlen, p_vino, freepath);
boutc(cl, " inode %p %llx.%llx\n", rinode, ceph_ino(rinode),
- ceph_snap(rinode));
+ ceph_snap(rinode));
} else if (rdentry) {
- r = build_dentry_path(mdsc, rdentry, rdiri, ppath, pathlen, ino,
- freepath, parent_locked);
+ r = build_dentry_path(mdsc, rdentry, rdiri, ppath, pathlen, p_vino,
+ freepath, parent_locked);
CEPH_SAN_STRNCPY(result_str, sizeof(result_str), *ppath, *pathlen);
- boutc(cl, " dentry %p %llx/%s\n", rdentry, *ino, result_str);
+ boutc(cl, " dentry %p %llx/%s\n", rdentry, p_vino->ino, result_str);
} else if (rpath || rino) {
- *ino = rino;
+ p_vino->ino = rino;
+ p_vino->snap = CEPH_NOSNAP;
*ppath = rpath;
*pathlen = rpath ? strlen(rpath) : 0;
CEPH_SAN_STRNCPY(result_str, sizeof(result_str), rpath, *pathlen);
@@ -3007,7 +3016,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
struct ceph_mds_request_head_legacy *lhead;
const char *path1 = NULL;
const char *path2 = NULL;
- u64 ino1 = 0, ino2 = 0;
+ struct ceph_vino vino1 = {0}, vino2 = {0};
int pathlen1 = 0, pathlen2 = 0;
bool freepath1 = false, freepath2 = false;
struct dentry *old_dentry = NULL;
@@ -3019,17 +3028,35 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
u16 request_head_version = mds_supported_head_version(session);
kuid_t caller_fsuid = req->r_cred->fsuid;
kgid_t caller_fsgid = req->r_cred->fsgid;
+ bool parent_locked = test_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
ret = set_request_path_attr(mdsc, req->r_inode, req->r_dentry,
req->r_parent, req->r_path1, req->r_ino1.ino,
- &path1, &pathlen1, &ino1, &freepath1,
- test_bit(CEPH_MDS_R_PARENT_LOCKED,
- &req->r_req_flags));
+ &path1, &pathlen1, &vino1, &freepath1,
+ parent_locked);
if (ret < 0) {
msg = ERR_PTR(ret);
goto out;
}
+ /*
+ * When the parent directory's i_rwsem is *not* locked, req->r_parent may
+ * have become stale (e.g. after a concurrent rename) between the time the
+ * dentry was looked up and now. If we detect that the stored r_parent
+ * does not match the inode number we just encoded for the request, switch
+ * to the correct inode so that the MDS receives a valid parent reference.
+ */
+ if (!parent_locked &&
+ req->r_parent && vino1.ino && ceph_ino(req->r_parent) != vino1.ino) {
+ struct inode *correct_dir = ceph_get_inode(mdsc->fsc->sb, vino1, NULL);
+ if (!IS_ERR(correct_dir)) {
+ WARN(1, "ceph: r_parent mismatch (had %llx wanted %llx) - updating\n",
+ ceph_ino(req->r_parent), vino1.ino);
+ iput(req->r_parent);
+ req->r_parent = correct_dir;
+ }
+ }
+
/* If r_old_dentry is set, then assume that its parent is locked */
if (req->r_old_dentry &&
!(req->r_old_dentry->d_flags & DCACHE_DISCONNECTED))
@@ -3037,7 +3064,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
ret = set_request_path_attr(mdsc, NULL, old_dentry,
req->r_old_dentry_dir,
req->r_path2, req->r_ino2.ino,
- &path2, &pathlen2, &ino2, &freepath2, true);
+ &path2, &pathlen2, &vino2, &freepath2, true);
if (ret < 0) {
msg = ERR_PTR(ret);
goto out_free1;
@@ -3191,8 +3218,8 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
lhead->ino = cpu_to_le64(req->r_deleg_ino);
lhead->args = req->r_args;
- ceph_encode_filepath(&p, end, ino1, path1);
- ceph_encode_filepath(&p, end, ino2, path2);
+ ceph_encode_filepath(&p, end, vino1.ino, path1);
+ ceph_encode_filepath(&p, end, vino2.ino, path2);
/* make note of release offset, in case we need to replay */
req->r_request_release_offset = p - msg->front.iov_base;
--
2.34.1
* [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message
2025-07-30 15:18 [PATCH 0/2] ceph: Fix r_parent staleness race and related deadlock Alex Markuze
2025-07-30 15:18 ` [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state Alex Markuze
@ 2025-07-30 15:19 ` Alex Markuze
2025-07-30 21:38 ` Viacheslav Dubeyko
1 sibling, 1 reply; 7+ messages in thread
From: Alex Markuze @ 2025-07-30 15:19 UTC (permalink / raw)
To: ceph-devel; +Cc: Slava.Dubeyko, Alex Markuze
When the parent directory's i_rwsem is not locked, req->r_parent may become
stale due to concurrent operations (e.g. rename) between dentry lookup and
message creation. Validate that r_parent matches the encoded parent inode
and update to the correct inode if a mismatch is detected.
---
fs/ceph/inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 42 insertions(+), 2 deletions(-)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 814f9e9656a0..49fb1e3a02e8 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -56,6 +56,46 @@ static int ceph_set_ino_cb(struct inode *inode, void *data)
return 0;
}
+/*
+ * Validate that the directory inode referenced by @req->r_parent matches the
+ * inode number and snapshot id contained in the reply's directory record. If
+ * they do not match – which can theoretically happen if the parent dentry was
+ * moved between the time the request was issued and the reply arrived – fall
+ * back to looking up the correct inode in the inode cache.
+ *
+ * A reference is *always* returned. Callers that receive a different inode
+ * than the original @parent are responsible for dropping the extra reference
+ * once the reply has been processed.
+ */
+static struct inode *ceph_get_reply_dir(struct super_block *sb,
+ struct inode *parent,
+ struct ceph_mds_reply_info_parsed *rinfo)
+{
+ struct ceph_vino vino;
+
+ if (unlikely(!rinfo->diri.in))
+ return parent; /* nothing to compare against */
+
+ /* If we didn't have a cached parent inode to begin with, just bail out. */
+ if (!parent)
+ return NULL;
+
+ vino.ino = le64_to_cpu(rinfo->diri.in->ino);
+ vino.snap = le64_to_cpu(rinfo->diri.in->snapid);
+
+ if (likely(parent && ceph_ino(parent) == vino.ino && ceph_snap(parent) == vino.snap))
+ return parent; /* matches – use the original reference */
+
+ /* Mismatch – this should be rare. Emit a WARN and obtain the correct inode. */
+ WARN(1, "ceph: reply dir mismatch (parent %s %llx.%llx reply %llx.%llx)\n",
+ parent ? "valid" : "NULL",
+ parent ? ceph_ino(parent) : 0ULL,
+ parent ? ceph_snap(parent) : 0ULL,
+ vino.ino, vino.snap);
+
+ return ceph_get_inode(sb, vino, NULL);
+}
+
/**
* ceph_new_inode - allocate a new inode in advance of an expected create
* @dir: parent directory for new inode
@@ -1548,8 +1588,8 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
}
if (rinfo->head->is_dentry) {
- struct inode *dir = req->r_parent;
-
+ /* r_parent may be stale, in cases when R_PARENT_LOCKED is not set, so we need to get the correct inode */
+ struct inode *dir = ceph_get_reply_dir(sb, req->r_parent, rinfo);
if (dir) {
err = ceph_fill_inode(dir, NULL, &rinfo->diri,
rinfo->dirfrag, session, -1,
--
2.34.1
* Re: [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message
2025-07-30 15:19 ` [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message Alex Markuze
@ 2025-07-30 21:38 ` Viacheslav Dubeyko
2025-07-31 7:24 ` Alex Markuze
0 siblings, 1 reply; 7+ messages in thread
From: Viacheslav Dubeyko @ 2025-07-30 21:38 UTC (permalink / raw)
To: Alex Markuze, ceph-devel@vger.kernel.org
On Wed, 2025-07-30 at 15:19 +0000, Alex Markuze wrote:
> When the parent directory's i_rwsem is not locked, req->r_parent may become
> stale due to concurrent operations (e.g. rename) between dentry lookup and
> message creation. Validate that r_parent matches the encoded parent inode
> and update to the correct inode if a mismatch is detected.
Could we share any description of a crash or workload misbehavior that can
illustrate the symptoms of the issue?
> ---
> fs/ceph/inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 814f9e9656a0..49fb1e3a02e8 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -56,6 +56,46 @@ static int ceph_set_ino_cb(struct inode *inode, void *data)
> return 0;
> }
>
> +/*
> + * Validate that the directory inode referenced by @req->r_parent matches the
> + * inode number and snapshot id contained in the reply's directory record. If
> + * they do not match – which can theoretically happen if the parent dentry was
> + * moved between the time the request was issued and the reply arrived – fall
> + * back to looking up the correct inode in the inode cache.
> + *
> + * A reference is *always* returned. Callers that receive a different inode
> + * than the original @parent are responsible for dropping the extra reference
> + * once the reply has been processed.
> + */
> +static struct inode *ceph_get_reply_dir(struct super_block *sb,
> + struct inode *parent,
> + struct ceph_mds_reply_info_parsed *rinfo)
> +{
> + struct ceph_vino vino;
> +
> + if (unlikely(!rinfo->diri.in))
> + return parent; /* nothing to compare against */
If we could receive parent == NULL, then is it possible to receive rinfo ==
NULL?
> +
> + /* If we didn't have a cached parent inode to begin with, just bail out. */
> + if (!parent)
> + return NULL;
Is it normal workflow that parent == NULL? Should we complain about it here?
> +
> + vino.ino = le64_to_cpu(rinfo->diri.in->ino);
> + vino.snap = le64_to_cpu(rinfo->diri.in->snapid);
> +
> + if (likely(parent && ceph_ino(parent) == vino.ino && ceph_snap(parent) == vino.snap))
This line looks long. Could we introduce an inline static function with a
good name here?
We already checked that parent is not NULL above. Does it make sense to have
this duplicated check here?
> + return parent; /* matches – use the original reference */
> +
> + /* Mismatch – this should be rare. Emit a WARN and obtain the correct inode. */
> + WARN(1, "ceph: reply dir mismatch (parent %s %llx.%llx reply %llx.%llx)\n",
> + parent ? "valid" : "NULL",
How can parent be NULL here? We already checked this pointer. And this
construction looks pretty complicated.
> + parent ? ceph_ino(parent) : 0ULL,
> + parent ? ceph_snap(parent) : 0ULL,
> + vino.ino, vino.snap);
> +
> + return ceph_get_inode(sb, vino, NULL);
> +}
> +
> /**
> * ceph_new_inode - allocate a new inode in advance of an expected create
> * @dir: parent directory for new inode
> @@ -1548,8 +1588,8 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
> }
>
> if (rinfo->head->is_dentry) {
> - struct inode *dir = req->r_parent;
> -
> + /* r_parent may be stale, in cases when R_PARENT_LOCKED is not set, so we need to get the correct inode */
The comment is too long for one line; we could use a multi-line comment here.
If R_PARENT_LOCKED is not set, is that normal execution flow or not? Should we
fix the use case of not setting R_PARENT_LOCKED?
Thanks,
Slava.
> + struct inode *dir = ceph_get_reply_dir(sb, req->r_parent, rinfo);
> if (dir) {
> err = ceph_fill_inode(dir, NULL, &rinfo->diri,
> rinfo->dirfrag, session, -1,
* Re: [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state
2025-07-30 15:18 ` [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state Alex Markuze
@ 2025-07-30 22:09 ` Viacheslav Dubeyko
2025-07-31 7:30 ` Alex Markuze
0 siblings, 1 reply; 7+ messages in thread
From: Viacheslav Dubeyko @ 2025-07-30 22:09 UTC (permalink / raw)
To: Alex Markuze, ceph-devel@vger.kernel.org
On Wed, 2025-07-30 at 15:18 +0000, Alex Markuze wrote:
> Add validation to ensure the cached parent directory inode matches the
> directory info in MDS replies. This prevents client-side race conditions
> where concurrent operations (e.g. rename) cause r_parent to become stale
> between request initiation and reply processing, which could lead to
> applying state changes to incorrect directory inodes.
> ---
> fs/ceph/mds_client.c | 67 +++++++++++++++++++++++++++++++-------------
> 1 file changed, 47 insertions(+), 20 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 8d9fc5e18b17..a164783fc1e1 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2853,7 +2853,7 @@ char *ceph_mdsc_build_path(struct ceph_mds_client *mdsc, struct dentry *dentry,
>
> static int build_dentry_path(struct ceph_mds_client *mdsc, struct dentry *dentry,
> struct inode *dir, const char **ppath, int *ppathlen,
> - u64 *pino, bool *pfreepath, bool parent_locked)
> + struct ceph_vino *pvino, bool *pfreepath, bool parent_locked)
We have too many arguments here. It looks like we need to introduce some
structure.
> {
> char *path;
>
> @@ -2862,23 +2862,29 @@ static int build_dentry_path(struct ceph_mds_client *mdsc, struct dentry *dentry
> dir = d_inode_rcu(dentry->d_parent);
> if (dir && parent_locked && ceph_snap(dir) == CEPH_NOSNAP &&
> !IS_ENCRYPTED(dir)) {
> - *pino = ceph_ino(dir);
> + pvino->ino = ceph_ino(dir);
> + pvino->snap = ceph_snap(dir);
> rcu_read_unlock();
> *ppath = dentry->d_name.name;
> *ppathlen = dentry->d_name.len;
> return 0;
> }
> rcu_read_unlock();
> - path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, pino, 1);
> + path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, &pvino->ino, 1);
> if (IS_ERR(path))
> return PTR_ERR(path);
> *ppath = path;
> *pfreepath = true;
> + /* For paths built by ceph_mdsc_build_path, we need to get snap from dentry */
I believe a multi-line comment would be good here.
> + if (dentry && d_inode(dentry))
Could dentry really be NULL here?
> + pvino->snap = ceph_snap(d_inode(dentry));
> + else
> + pvino->snap = CEPH_NOSNAP;
> return 0;
> }
>
> static int build_inode_path(struct inode *inode,
> - const char **ppath, int *ppathlen, u64 *pino,
> + const char **ppath, int *ppathlen, struct ceph_vino *pvino,
> bool *pfreepath)
We have too many arguments here too.
> {
> struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
> @@ -2886,17 +2892,19 @@ static int build_inode_path(struct inode *inode,
> char *path;
>
> if (ceph_snap(inode) == CEPH_NOSNAP) {
> - *pino = ceph_ino(inode);
> + pvino->ino = ceph_ino(inode);
> + pvino->snap = ceph_snap(inode);
> *ppathlen = 0;
> return 0;
> }
> dentry = d_find_alias(inode);
> - path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, pino, 1);
> + path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, &pvino->ino, 1);
> dput(dentry);
> if (IS_ERR(path))
> return PTR_ERR(path);
> *ppath = path;
> *pfreepath = true;
> + pvino->snap = ceph_snap(inode);
> return 0;
> }
>
> @@ -2907,7 +2915,7 @@ static int build_inode_path(struct inode *inode,
> static int set_request_path_attr(struct ceph_mds_client *mdsc, struct inode *rinode,
> struct dentry *rdentry, struct inode *rdiri,
> const char *rpath, u64 rino, const char **ppath,
> - int *pathlen, u64 *ino, bool *freepath,
> + int *pathlen, struct ceph_vino *p_vino, bool *freepath,
> bool parent_locked)
Ditto. :)
Thanks,
Slava.
> {
> struct ceph_client *cl = mdsc->fsc->client;
> @@ -2915,16 +2923,17 @@ static int set_request_path_attr(struct ceph_mds_client *mdsc, struct inode *rin
> int r = 0;
>
> if (rinode) {
> - r = build_inode_path(rinode, ppath, pathlen, ino, freepath);
> + r = build_inode_path(rinode, ppath, pathlen, p_vino, freepath);
> boutc(cl, " inode %p %llx.%llx\n", rinode, ceph_ino(rinode),
> - ceph_snap(rinode));
> + ceph_snap(rinode));
> } else if (rdentry) {
> - r = build_dentry_path(mdsc, rdentry, rdiri, ppath, pathlen, ino,
> - freepath, parent_locked);
> + r = build_dentry_path(mdsc, rdentry, rdiri, ppath, pathlen, p_vino,
> + freepath, parent_locked);
> CEPH_SAN_STRNCPY(result_str, sizeof(result_str), *ppath, *pathlen);
> - boutc(cl, " dentry %p %llx/%s\n", rdentry, *ino, result_str);
> + boutc(cl, " dentry %p %llx/%s\n", rdentry, p_vino->ino, result_str);
> } else if (rpath || rino) {
> - *ino = rino;
> + p_vino->ino = rino;
> + p_vino->snap = CEPH_NOSNAP;
> *ppath = rpath;
> *pathlen = rpath ? strlen(rpath) : 0;
> CEPH_SAN_STRNCPY(result_str, sizeof(result_str), rpath, *pathlen);
> @@ -3007,7 +3016,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
> struct ceph_mds_request_head_legacy *lhead;
> const char *path1 = NULL;
> const char *path2 = NULL;
> - u64 ino1 = 0, ino2 = 0;
> + struct ceph_vino vino1 = {0}, vino2 = {0};
> int pathlen1 = 0, pathlen2 = 0;
> bool freepath1 = false, freepath2 = false;
> struct dentry *old_dentry = NULL;
> @@ -3019,17 +3028,35 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
> u16 request_head_version = mds_supported_head_version(session);
> kuid_t caller_fsuid = req->r_cred->fsuid;
> kgid_t caller_fsgid = req->r_cred->fsgid;
> + bool parent_locked = test_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
>
> ret = set_request_path_attr(mdsc, req->r_inode, req->r_dentry,
> req->r_parent, req->r_path1, req->r_ino1.ino,
> - &path1, &pathlen1, &ino1, &freepath1,
> - test_bit(CEPH_MDS_R_PARENT_LOCKED,
> - &req->r_req_flags));
> + &path1, &pathlen1, &vino1, &freepath1,
> + parent_locked);
> if (ret < 0) {
> msg = ERR_PTR(ret);
> goto out;
> }
>
> + /*
> + * When the parent directory's i_rwsem is *not* locked, req->r_parent may
> + * have become stale (e.g. after a concurrent rename) between the time the
> + * dentry was looked up and now. If we detect that the stored r_parent
> + * does not match the inode number we just encoded for the request, switch
> + * to the correct inode so that the MDS receives a valid parent reference.
> + */
> + if (!parent_locked &&
> + req->r_parent && vino1.ino && ceph_ino(req->r_parent) != vino1.ino) {
> + struct inode *correct_dir = ceph_get_inode(mdsc->fsc->sb, vino1, NULL);
> + if (!IS_ERR(correct_dir)) {
> + WARN(1, "ceph: r_parent mismatch (had %llx wanted %llx) - updating\n",
> + ceph_ino(req->r_parent), vino1.ino);
> + iput(req->r_parent);
> + req->r_parent = correct_dir;
> + }
> + }
> +
> /* If r_old_dentry is set, then assume that its parent is locked */
> if (req->r_old_dentry &&
> !(req->r_old_dentry->d_flags & DCACHE_DISCONNECTED))
> @@ -3037,7 +3064,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
> ret = set_request_path_attr(mdsc, NULL, old_dentry,
> req->r_old_dentry_dir,
> req->r_path2, req->r_ino2.ino,
> - &path2, &pathlen2, &ino2, &freepath2, true);
> + &path2, &pathlen2, &vino2, &freepath2, true);
> if (ret < 0) {
> msg = ERR_PTR(ret);
> goto out_free1;
> @@ -3191,8 +3218,8 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
> lhead->ino = cpu_to_le64(req->r_deleg_ino);
> lhead->args = req->r_args;
>
> - ceph_encode_filepath(&p, end, ino1, path1);
> - ceph_encode_filepath(&p, end, ino2, path2);
> + ceph_encode_filepath(&p, end, vino1.ino, path1);
> + ceph_encode_filepath(&p, end, vino2.ino, path2);
>
> /* make note of release offset, in case we need to replay */
> req->r_request_release_offset = p - msg->front.iov_base;
* Re: [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message
2025-07-30 21:38 ` Viacheslav Dubeyko
@ 2025-07-31 7:24 ` Alex Markuze
0 siblings, 0 replies; 7+ messages in thread
From: Alex Markuze @ 2025-07-31 7:24 UTC (permalink / raw)
To: Viacheslav Dubeyko; +Cc: ceph-devel@vger.kernel.org
This fix comes after an analysis of a trace from a client bug.
RCA:
Usually, two operations can't interleave due to extensive VFS locking. But
Ceph maintains leases for cached dentry structures; when these leases
expire, the client may create a lookup request without taking VFS locks.
This is done by choice, but it does leave a window for the type of race we
have witnessed. The reply handler, in one specific place, doesn't make sure
that the appropriate locks are held and still trusts the r_parent pointer,
which by then holds an incorrect value due to the rename operation. I
prepared a fix that compares the cached r_parent against the ino from the
reply. I'll ask @André to prepare a new image; this image should also
resolve the boot issue for the previous debug version.
It's the message-handling code, so the request is always valid, but not
every request will have a parent. R_PARENT_LOCKED may not be set by design;
it's a valid state.
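The reply-side comparison described above reduces to a small predicate. Here is a userspace sketch; the names (`struct vino`, `reply_dir_matches`) are illustrative, not the kernel helpers, and it only models the decision of whether the cached parent can be trusted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for struct ceph_vino (inode number + snapshot id). */
struct vino {
	uint64_t ino;
	uint64_t snap;
};

/*
 * Decision modeled on the reply-dir validation: may the cached parent be
 * used as the reply's directory, or must the correct inode be looked up
 * from the vino carried in the reply's directory record?
 */
static bool reply_dir_matches(const struct vino *parent, const struct vino *reply)
{
	if (!reply)		/* reply carries no directory record */
		return true;	/* nothing to compare, keep the cached parent */
	if (!parent)		/* no cached parent at all */
		return false;	/* caller has to look the inode up anyway */
	return parent->ino == reply->ino && parent->snap == reply->snap;
}
```

A mismatch is the rare case the WARN in the patch flags; the caller then fetches the inode named by the reply instead of the cached one.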
On Thu, Jul 31, 2025 at 1:39 AM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Wed, 2025-07-30 at 15:19 +0000, Alex Markuze wrote:
> > When the parent directory's i_rwsem is not locked, req->r_parent may become
> > stale due to concurrent operations (e.g. rename) between dentry lookup and
> > message creation. Validate that r_parent matches the encoded parent inode
> > and update to the correct inode if a mismatch is detected.
>
> Could we share any description of a crash or workload misbehavior that can
> illustrate the symptoms of the issue?
>
> > ---
> > fs/ceph/inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 42 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index 814f9e9656a0..49fb1e3a02e8 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -56,6 +56,46 @@ static int ceph_set_ino_cb(struct inode *inode, void *data)
> > return 0;
> > }
> >
> > +/*
> > + * Validate that the directory inode referenced by @req->r_parent matches the
> > + * inode number and snapshot id contained in the reply's directory record. If
> > + * they do not match – which can theoretically happen if the parent dentry was
> > + * moved between the time the request was issued and the reply arrived – fall
> > + * back to looking up the correct inode in the inode cache.
> > + *
> > + * A reference is *always* returned. Callers that receive a different inode
> > + * than the original @parent are responsible for dropping the extra reference
> > + * once the reply has been processed.
> > + */
> > +static struct inode *ceph_get_reply_dir(struct super_block *sb,
> > + struct inode *parent,
> > + struct ceph_mds_reply_info_parsed *rinfo)
> > +{
> > + struct ceph_vino vino;
> > +
> > + if (unlikely(!rinfo->diri.in))
> > + return parent; /* nothing to compare against */
>
> If we could receive parent == NULL, then is it possible to receive rinfo ==
> NULL?
>
> > +
> > + /* If we didn't have a cached parent inode to begin with, just bail out. */
> > + if (!parent)
> > + return NULL;
>
> Is it normal workflow that parent == NULL? Should we complain about it here?
>
> > +
> > + vino.ino = le64_to_cpu(rinfo->diri.in->ino);
> > + vino.snap = le64_to_cpu(rinfo->diri.in->snapid);
> > +
> > + if (likely(parent && ceph_ino(parent) == vino.ino && ceph_snap(parent) == vino.snap))
>
> This line looks long. Could we introduce an inline static function with a
> good name here?
>
> We already checked that parent is not NULL above. Does it make sense to have
> this duplicated check here?
>
> > + return parent; /* matches – use the original reference */
> > +
> > + /* Mismatch – this should be rare. Emit a WARN and obtain the correct inode. */
> > + WARN(1, "ceph: reply dir mismatch (parent %s %llx.%llx reply %llx.%llx)\n",
> > + parent ? "valid" : "NULL",
>
> How can parent be NULL here? We already checked this pointer. And this
> construction looks pretty complicated.
>
> > + parent ? ceph_ino(parent) : 0ULL,
> > + parent ? ceph_snap(parent) : 0ULL,
> > + vino.ino, vino.snap);
> > +
> > + return ceph_get_inode(sb, vino, NULL);
> > +}
> > +
> > /**
> > * ceph_new_inode - allocate a new inode in advance of an expected create
> > * @dir: parent directory for new inode
> > @@ -1548,8 +1588,8 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
> > }
> >
> > if (rinfo->head->is_dentry) {
> > - struct inode *dir = req->r_parent;
> > -
> > + /* r_parent may be stale, in cases when R_PARENT_LOCKED is not set, so we need to get the correct inode */
>
> The comment is too long for one line; we could use a multi-line comment here.
> If R_PARENT_LOCKED is not set, is that normal execution flow or not? Should we
> fix the use case of not setting R_PARENT_LOCKED?
>
> Thanks,
> Slava.
>
> > + struct inode *dir = ceph_get_reply_dir(sb, req->r_parent, rinfo);
> > if (dir) {
> > err = ceph_fill_inode(dir, NULL, &rinfo->diri,
> > rinfo->dirfrag, session, -1,
* Re: [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state
2025-07-30 22:09 ` Viacheslav Dubeyko
@ 2025-07-31 7:30 ` Alex Markuze
0 siblings, 0 replies; 7+ messages in thread
From: Alex Markuze @ 2025-07-31 7:30 UTC (permalink / raw)
To: Viacheslav Dubeyko; +Cc: ceph-devel@vger.kernel.org
No new parameters were added; a u64 ino value was changed to struct ceph_vino.
I prefer not to mix functional changes with aesthetic ones. I can add a
separate patch, but it is unrelated to the bug fix here.
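As a toy illustration of why the series widens the plain u64 ino to struct ceph_vino (names here are hypothetical, not the kernel helpers): an inode-number-only comparison cannot distinguish a live directory from the same directory in a snapshot, while carrying the snapshot id alongside the ino keeps them apart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for struct ceph_vino. */
struct vino {
	uint64_t ino;
	uint64_t snap;
};

/* Comparison by inode number alone, as the old u64-based code did. */
static bool same_by_ino(const struct vino *a, const struct vino *b)
{
	return a->ino == b->ino;
}

/* Comparison by the full vino, as the ceph_vino-based code can do. */
static bool same_by_vino(const struct vino *a, const struct vino *b)
{
	return a->ino == b->ino && a->snap == b->snap;
}
```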
On Thu, Jul 31, 2025 at 2:10 AM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Wed, 2025-07-30 at 15:18 +0000, Alex Markuze wrote:
> > Add validation to ensure the cached parent directory inode matches the
> > directory info in MDS replies. This prevents client-side race conditions
> > where concurrent operations (e.g. rename) cause r_parent to become stale
> > between request initiation and reply processing, which could lead to
> > applying state changes to incorrect directory inodes.
> > ---
> > fs/ceph/mds_client.c | 67 +++++++++++++++++++++++++++++++-------------
> > 1 file changed, 47 insertions(+), 20 deletions(-)
> >
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 8d9fc5e18b17..a164783fc1e1 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -2853,7 +2853,7 @@ char *ceph_mdsc_build_path(struct ceph_mds_client *mdsc, struct dentry *dentry,
> >
> > static int build_dentry_path(struct ceph_mds_client *mdsc, struct dentry *dentry,
> > struct inode *dir, const char **ppath, int *ppathlen,
> > - u64 *pino, bool *pfreepath, bool parent_locked)
> > + struct ceph_vino *pvino, bool *pfreepath, bool parent_locked)
>
> We have too many arguments here. It looks like we need to introduce some
> structure.
>
> > {
> > char *path;
> >
> > @@ -2862,23 +2862,29 @@ static int build_dentry_path(struct ceph_mds_client *mdsc, struct dentry *dentry
> > dir = d_inode_rcu(dentry->d_parent);
> > if (dir && parent_locked && ceph_snap(dir) == CEPH_NOSNAP &&
> > !IS_ENCRYPTED(dir)) {
> > - *pino = ceph_ino(dir);
> > + pvino->ino = ceph_ino(dir);
> > + pvino->snap = ceph_snap(dir);
> > rcu_read_unlock();
> > *ppath = dentry->d_name.name;
> > *ppathlen = dentry->d_name.len;
> > return 0;
> > }
> > rcu_read_unlock();
> > - path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, pino, 1);
> > + path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, &pvino->ino, 1);
> > if (IS_ERR(path))
> > return PTR_ERR(path);
> > *ppath = path;
> > *pfreepath = true;
> > + /* For paths built by ceph_mdsc_build_path, we need to get snap from dentry */
>
> I believe a multi-line comment would be good here.
>
> > + if (dentry && d_inode(dentry))
>
> Could dentry be really NULL here?
>
> > + pvino->snap = ceph_snap(d_inode(dentry));
> > + else
> > + pvino->snap = CEPH_NOSNAP;
> > return 0;
> > }
> >
> > static int build_inode_path(struct inode *inode,
> > - const char **ppath, int *ppathlen, u64 *pino,
> > + const char **ppath, int *ppathlen, struct ceph_vino *pvino,
> > bool *pfreepath)
>
> We have too many arguments here too.
>
> > {
> > struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
> > @@ -2886,17 +2892,19 @@ static int build_inode_path(struct inode *inode,
> > char *path;
> >
> > if (ceph_snap(inode) == CEPH_NOSNAP) {
> > - *pino = ceph_ino(inode);
> > + pvino->ino = ceph_ino(inode);
> > + pvino->snap = ceph_snap(inode);
> > *ppathlen = 0;
> > return 0;
> > }
> > dentry = d_find_alias(inode);
> > - path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, pino, 1);
> > + path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, &pvino->ino, 1);
> > dput(dentry);
> > if (IS_ERR(path))
> > return PTR_ERR(path);
> > *ppath = path;
> > *pfreepath = true;
> > + pvino->snap = ceph_snap(inode);
> > return 0;
> > }
> >
> > @@ -2907,7 +2915,7 @@ static int build_inode_path(struct inode *inode,
> > static int set_request_path_attr(struct ceph_mds_client *mdsc, struct inode *rinode,
> > struct dentry *rdentry, struct inode *rdiri,
> > const char *rpath, u64 rino, const char **ppath,
> > - int *pathlen, u64 *ino, bool *freepath,
> > + int *pathlen, struct ceph_vino *p_vino, bool *freepath,
> > bool parent_locked)
>
> Ditto. :)
>
> Thanks,
> Slava.
>
> > {
> > struct ceph_client *cl = mdsc->fsc->client;
> > @@ -2915,16 +2923,17 @@ static int set_request_path_attr(struct ceph_mds_client *mdsc, struct inode *rin
> > int r = 0;
> >
> > if (rinode) {
> > - r = build_inode_path(rinode, ppath, pathlen, ino, freepath);
> > + r = build_inode_path(rinode, ppath, pathlen, p_vino, freepath);
> > boutc(cl, " inode %p %llx.%llx\n", rinode, ceph_ino(rinode),
> > - ceph_snap(rinode));
> > + ceph_snap(rinode));
> > } else if (rdentry) {
> > - r = build_dentry_path(mdsc, rdentry, rdiri, ppath, pathlen, ino,
> > - freepath, parent_locked);
> > + r = build_dentry_path(mdsc, rdentry, rdiri, ppath, pathlen, p_vino,
> > + freepath, parent_locked);
> > CEPH_SAN_STRNCPY(result_str, sizeof(result_str), *ppath, *pathlen);
> > - boutc(cl, " dentry %p %llx/%s\n", rdentry, *ino, result_str);
> > + boutc(cl, " dentry %p %llx/%s\n", rdentry, p_vino->ino, result_str);
> > } else if (rpath || rino) {
> > - *ino = rino;
> > + p_vino->ino = rino;
> > + p_vino->snap = CEPH_NOSNAP;
> > *ppath = rpath;
> > *pathlen = rpath ? strlen(rpath) : 0;
> > CEPH_SAN_STRNCPY(result_str, sizeof(result_str), rpath, *pathlen);
> > @@ -3007,7 +3016,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
> > struct ceph_mds_request_head_legacy *lhead;
> > const char *path1 = NULL;
> > const char *path2 = NULL;
> > - u64 ino1 = 0, ino2 = 0;
> > + struct ceph_vino vino1 = {0}, vino2 = {0};
> > int pathlen1 = 0, pathlen2 = 0;
> > bool freepath1 = false, freepath2 = false;
> > struct dentry *old_dentry = NULL;
> > @@ -3019,17 +3028,35 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
> > u16 request_head_version = mds_supported_head_version(session);
> > kuid_t caller_fsuid = req->r_cred->fsuid;
> > kgid_t caller_fsgid = req->r_cred->fsgid;
> > + bool parent_locked = test_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> >
> > ret = set_request_path_attr(mdsc, req->r_inode, req->r_dentry,
> > req->r_parent, req->r_path1, req->r_ino1.ino,
> > - &path1, &pathlen1, &ino1, &freepath1,
> > - test_bit(CEPH_MDS_R_PARENT_LOCKED,
> > - &req->r_req_flags));
> > + &path1, &pathlen1, &vino1, &freepath1,
> > + parent_locked);
> > if (ret < 0) {
> > msg = ERR_PTR(ret);
> > goto out;
> > }
> >
> > + /*
> > + * When the parent directory's i_rwsem is *not* locked, req->r_parent may
> > + * have become stale (e.g. after a concurrent rename) between the time the
> > + * dentry was looked up and now. If we detect that the stored r_parent
> > + * does not match the inode number we just encoded for the request, switch
> > + * to the correct inode so that the MDS receives a valid parent reference.
> > + */
> > + if (!parent_locked &&
> > + req->r_parent && vino1.ino && ceph_ino(req->r_parent) != vino1.ino) {
> > + struct inode *correct_dir = ceph_get_inode(mdsc->fsc->sb, vino1, NULL);
> > + if (!IS_ERR(correct_dir)) {
> > + WARN(1, "ceph: r_parent mismatch (had %llx wanted %llx) - updating\n",
> > + ceph_ino(req->r_parent), vino1.ino);
> > + iput(req->r_parent);
> > + req->r_parent = correct_dir;
> > + }
> > + }
> > +
> > /* If r_old_dentry is set, then assume that its parent is locked */
> > if (req->r_old_dentry &&
> > !(req->r_old_dentry->d_flags & DCACHE_DISCONNECTED))
> > @@ -3037,7 +3064,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
> > ret = set_request_path_attr(mdsc, NULL, old_dentry,
> > req->r_old_dentry_dir,
> > req->r_path2, req->r_ino2.ino,
> > - &path2, &pathlen2, &ino2, &freepath2, true);
> > + &path2, &pathlen2, &vino2, &freepath2, true);
> > if (ret < 0) {
> > msg = ERR_PTR(ret);
> > goto out_free1;
> > @@ -3191,8 +3218,8 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
> > lhead->ino = cpu_to_le64(req->r_deleg_ino);
> > lhead->args = req->r_args;
> >
> > - ceph_encode_filepath(&p, end, ino1, path1);
> > - ceph_encode_filepath(&p, end, ino2, path2);
> > + ceph_encode_filepath(&p, end, vino1.ino, path1);
> > + ceph_encode_filepath(&p, end, vino2.ino, path2);
> >
> > /* make note of release offset, in case we need to replay */
> > req->r_request_release_offset = p - msg->front.iov_base;
Thread overview: 7+ messages
-- links below jump to the message on this page --
2025-07-30 15:18 [PATCH 0/2] ceph: Fix r_parent staleness race and related deadlock Alex Markuze
2025-07-30 15:18 ` [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state Alex Markuze
2025-07-30 22:09 ` Viacheslav Dubeyko
2025-07-31 7:30 ` Alex Markuze
2025-07-30 15:19 ` [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message Alex Markuze
2025-07-30 21:38 ` Viacheslav Dubeyko
2025-07-31 7:24 ` Alex Markuze