ceph-devel.vger.kernel.org archive mirror
* [PATCH 0/2] ceph: Fix r_parent staleness race and related deadlock
@ 2025-07-30 15:18 Alex Markuze
  2025-07-30 15:18 ` [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state Alex Markuze
  2025-07-30 15:19 ` [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message Alex Markuze
  0 siblings, 2 replies; 7+ messages in thread
From: Alex Markuze @ 2025-07-30 15:18 UTC (permalink / raw)
  To: ceph-devel; +Cc: Slava.Dubeyko, Alex Markuze

Hi,

This patchset addresses two related issues in CephFS client request handling.

**Patch 1/2 ("ceph: fix client race condition validating r_parent before applying state")**

This patch fixes a race condition where the `req->r_parent` inode reference can become stale. Under specific conditions (e.g., expired dentry leases), the client can perform lockless lookups, creating a window where a concurrent `rename` can invalidate `req->r_parent` between the initial VFS lookup and MDS request message creation. `create_request_message()` previously trusted the cached `r_parent` without verification when encoding the request. This patch makes the path-building helpers track the full `ceph_vino` (inode number plus snapshot id) and adds a validation step in `create_request_message()` that, when the parent was not locked, compares `req->r_parent` against the vino just encoded and switches to the correct inode on a mismatch.
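
The core of the new check in `create_request_message()` (condensed from the
patch below) is roughly:

    if (!parent_locked && req->r_parent && vino1.ino &&
        ceph_ino(req->r_parent) != vino1.ino) {
            struct inode *correct_dir = ceph_get_inode(mdsc->fsc->sb, vino1, NULL);

            if (!IS_ERR(correct_dir)) {
                    iput(req->r_parent);
                    req->r_parent = correct_dir;
            }
    }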

**Patch 2/2 ("ceph: fix client race condition where r_parent becomes stale before sending message")**

This patch applies the same validation on the reply path. `ceph_fill_trace()` previously used `req->r_parent` directly when processing a dentry reply. A new helper, `ceph_get_reply_dir()`, compares the cached `r_parent` against the inode number and snapshot id in the reply's directory record (`rinfo->diri`) and, on a mismatch, looks up the correct directory inode via `ceph_get_inode()` before any state is applied. This prevents directory updates from landing on a stale parent inode.
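
The reply-side check in `ceph_get_reply_dir()` boils down to (condensed from
the patch below):

    vino.ino  = le64_to_cpu(rinfo->diri.in->ino);
    vino.snap = le64_to_cpu(rinfo->diri.in->snapid);

    if (ceph_ino(parent) == vino.ino && ceph_snap(parent) == vino.snap)
            return parent;                 /* cached r_parent still valid */

    return ceph_get_inode(sb, vino, NULL); /* fall back to reply's directory */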

Together, these patches improve the robustness of CephFS client request handling by validating `req->r_parent` on both the request-building and reply-handling paths.


Alex Markuze (2):
  ceph: fix client race condition validating r_parent before applying
    state
  ceph: fix client race condition where r_parent becomes stale before
    sending message

 fs/ceph/inode.c      | 44 +++++++++++++++++++++++++++--
 fs/ceph/mds_client.c | 67 +++++++++++++++++++++++++++++++-------------
 2 files changed, 89 insertions(+), 22 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state
  2025-07-30 15:18 [PATCH 0/2] ceph: Fix r_parent staleness race and related deadlock Alex Markuze
@ 2025-07-30 15:18 ` Alex Markuze
  2025-07-30 22:09   ` Viacheslav Dubeyko
  2025-07-30 15:19 ` [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message Alex Markuze
  1 sibling, 1 reply; 7+ messages in thread
From: Alex Markuze @ 2025-07-30 15:18 UTC (permalink / raw)
  To: ceph-devel; +Cc: Slava.Dubeyko, Alex Markuze

Add validation to ensure the cached parent directory inode matches the
directory info in MDS replies. This prevents client-side race conditions
where concurrent operations (e.g. rename) cause r_parent to become stale
between request initiation and reply processing, which could lead to
applying state changes to incorrect directory inodes.
---
 fs/ceph/mds_client.c | 67 +++++++++++++++++++++++++++++++-------------
 1 file changed, 47 insertions(+), 20 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 8d9fc5e18b17..a164783fc1e1 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2853,7 +2853,7 @@ char *ceph_mdsc_build_path(struct ceph_mds_client *mdsc, struct dentry *dentry,
 
 static int build_dentry_path(struct ceph_mds_client *mdsc, struct dentry *dentry,
 			     struct inode *dir, const char **ppath, int *ppathlen,
-			     u64 *pino, bool *pfreepath, bool parent_locked)
+			     struct ceph_vino *pvino, bool *pfreepath, bool parent_locked)
 {
 	char *path;
 
@@ -2862,23 +2862,29 @@ static int build_dentry_path(struct ceph_mds_client *mdsc, struct dentry *dentry
 		dir = d_inode_rcu(dentry->d_parent);
 	if (dir && parent_locked && ceph_snap(dir) == CEPH_NOSNAP &&
 	    !IS_ENCRYPTED(dir)) {
-		*pino = ceph_ino(dir);
+		pvino->ino = ceph_ino(dir);
+		pvino->snap = ceph_snap(dir);
 		rcu_read_unlock();
 		*ppath = dentry->d_name.name;
 		*ppathlen = dentry->d_name.len;
 		return 0;
 	}
 	rcu_read_unlock();
-	path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, pino, 1);
+	path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, &pvino->ino, 1);
 	if (IS_ERR(path))
 		return PTR_ERR(path);
 	*ppath = path;
 	*pfreepath = true;
+	/* For paths built by ceph_mdsc_build_path, we need to get snap from dentry */
+	if (dentry && d_inode(dentry))
+		pvino->snap = ceph_snap(d_inode(dentry));
+	else
+		pvino->snap = CEPH_NOSNAP;
 	return 0;
 }
 
 static int build_inode_path(struct inode *inode,
-			    const char **ppath, int *ppathlen, u64 *pino,
+			    const char **ppath, int *ppathlen, struct ceph_vino *pvino,
 			    bool *pfreepath)
 {
 	struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
@@ -2886,17 +2892,19 @@ static int build_inode_path(struct inode *inode,
 	char *path;
 
 	if (ceph_snap(inode) == CEPH_NOSNAP) {
-		*pino = ceph_ino(inode);
+		pvino->ino = ceph_ino(inode);
+		pvino->snap = ceph_snap(inode);
 		*ppathlen = 0;
 		return 0;
 	}
 	dentry = d_find_alias(inode);
-	path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, pino, 1);
+	path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, &pvino->ino, 1);
 	dput(dentry);
 	if (IS_ERR(path))
 		return PTR_ERR(path);
 	*ppath = path;
 	*pfreepath = true;
+	pvino->snap = ceph_snap(inode);
 	return 0;
 }
 
@@ -2907,7 +2915,7 @@ static int build_inode_path(struct inode *inode,
 static int set_request_path_attr(struct ceph_mds_client *mdsc, struct inode *rinode,
 				 struct dentry *rdentry, struct inode *rdiri,
 				 const char *rpath, u64 rino, const char **ppath,
-				 int *pathlen, u64 *ino, bool *freepath,
+				 int *pathlen, struct ceph_vino *p_vino, bool *freepath,
 				 bool parent_locked)
 {
 	struct ceph_client *cl = mdsc->fsc->client;
@@ -2915,16 +2923,17 @@ static int set_request_path_attr(struct ceph_mds_client *mdsc, struct inode *rin
 	int r = 0;
 
 	if (rinode) {
-		r = build_inode_path(rinode, ppath, pathlen, ino, freepath);
+		r = build_inode_path(rinode, ppath, pathlen, p_vino, freepath);
 		boutc(cl, " inode %p %llx.%llx\n", rinode, ceph_ino(rinode),
-		      ceph_snap(rinode));
+		ceph_snap(rinode));
 	} else if (rdentry) {
-		r = build_dentry_path(mdsc, rdentry, rdiri, ppath, pathlen, ino,
-					freepath, parent_locked);
+		r = build_dentry_path(mdsc, rdentry, rdiri, ppath, pathlen, p_vino,
+				       freepath, parent_locked);
 		CEPH_SAN_STRNCPY(result_str, sizeof(result_str), *ppath, *pathlen);
-		boutc(cl, " dentry %p %llx/%s\n", rdentry, *ino, result_str);
+		boutc(cl, " dentry %p %llx/%s\n", rdentry, p_vino->ino, result_str);
 	} else if (rpath || rino) {
-		*ino = rino;
+		p_vino->ino = rino;
+		p_vino->snap = CEPH_NOSNAP;
 		*ppath = rpath;
 		*pathlen = rpath ? strlen(rpath) : 0;
 		CEPH_SAN_STRNCPY(result_str, sizeof(result_str), rpath, *pathlen);
@@ -3007,7 +3016,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
 	struct ceph_mds_request_head_legacy *lhead;
 	const char *path1 = NULL;
 	const char *path2 = NULL;
-	u64 ino1 = 0, ino2 = 0;
+	struct ceph_vino vino1 = {0}, vino2 = {0};
 	int pathlen1 = 0, pathlen2 = 0;
 	bool freepath1 = false, freepath2 = false;
 	struct dentry *old_dentry = NULL;
@@ -3019,17 +3028,35 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
 	u16 request_head_version = mds_supported_head_version(session);
 	kuid_t caller_fsuid = req->r_cred->fsuid;
 	kgid_t caller_fsgid = req->r_cred->fsgid;
+	bool parent_locked = test_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
 
 	ret = set_request_path_attr(mdsc, req->r_inode, req->r_dentry,
 			      req->r_parent, req->r_path1, req->r_ino1.ino,
-			      &path1, &pathlen1, &ino1, &freepath1,
-			      test_bit(CEPH_MDS_R_PARENT_LOCKED,
-					&req->r_req_flags));
+			      &path1, &pathlen1, &vino1, &freepath1,
+			      parent_locked);
 	if (ret < 0) {
 		msg = ERR_PTR(ret);
 		goto out;
 	}
 
+	/*
+	 * When the parent directory's i_rwsem is *not* locked, req->r_parent may
+	 * have become stale (e.g. after a concurrent rename) between the time the
+	 * dentry was looked up and now.  If we detect that the stored r_parent
+	 * does not match the inode number we just encoded for the request, switch
+	 * to the correct inode so that the MDS receives a valid parent reference.
+	 */
+	if (!parent_locked &&
+	    req->r_parent && vino1.ino && ceph_ino(req->r_parent) != vino1.ino) {
+		struct inode *correct_dir = ceph_get_inode(mdsc->fsc->sb, vino1, NULL);
+		if (!IS_ERR(correct_dir)) {
+			WARN(1, "ceph: r_parent mismatch (had %llx wanted %llx) - updating\n",
+			     ceph_ino(req->r_parent), vino1.ino);
+			iput(req->r_parent);
+			req->r_parent = correct_dir;
+		}
+	}
+
 	/* If r_old_dentry is set, then assume that its parent is locked */
 	if (req->r_old_dentry &&
 	    !(req->r_old_dentry->d_flags & DCACHE_DISCONNECTED))
@@ -3037,7 +3064,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
 	ret = set_request_path_attr(mdsc, NULL, old_dentry,
 			      req->r_old_dentry_dir,
 			      req->r_path2, req->r_ino2.ino,
-			      &path2, &pathlen2, &ino2, &freepath2, true);
+			      &path2, &pathlen2, &vino2, &freepath2, true);
 	if (ret < 0) {
 		msg = ERR_PTR(ret);
 		goto out_free1;
@@ -3191,8 +3218,8 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
 	lhead->ino = cpu_to_le64(req->r_deleg_ino);
 	lhead->args = req->r_args;
 
-	ceph_encode_filepath(&p, end, ino1, path1);
-	ceph_encode_filepath(&p, end, ino2, path2);
+	ceph_encode_filepath(&p, end, vino1.ino, path1);
+	ceph_encode_filepath(&p, end, vino2.ino, path2);
 
 	/* make note of release offset, in case we need to replay */
 	req->r_request_release_offset = p - msg->front.iov_base;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message
  2025-07-30 15:18 [PATCH 0/2] ceph: Fix r_parent staleness race and related deadlock Alex Markuze
  2025-07-30 15:18 ` [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state Alex Markuze
@ 2025-07-30 15:19 ` Alex Markuze
  2025-07-30 21:38   ` Viacheslav Dubeyko
  1 sibling, 1 reply; 7+ messages in thread
From: Alex Markuze @ 2025-07-30 15:19 UTC (permalink / raw)
  To: ceph-devel; +Cc: Slava.Dubeyko, Alex Markuze

When the parent directory's i_rwsem is not locked, req->r_parent may become
stale due to concurrent operations (e.g. rename) between dentry lookup and
message creation. Validate that r_parent matches the encoded parent inode
and update to the correct inode if a mismatch is detected.
---
 fs/ceph/inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 814f9e9656a0..49fb1e3a02e8 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -56,6 +56,46 @@ static int ceph_set_ino_cb(struct inode *inode, void *data)
 	return 0;
 }
 
+/*
+ * Validate that the directory inode referenced by @req->r_parent matches the
+ * inode number and snapshot id contained in the reply's directory record.  If
+ * they do not match – which can theoretically happen if the parent dentry was
+ * moved between the time the request was issued and the reply arrived – fall
+ * back to looking up the correct inode in the inode cache.
+ *
+ * A reference is *always* returned.  Callers that receive a different inode
+ * than the original @parent are responsible for dropping the extra reference
+ * once the reply has been processed.
+ */
+static struct inode *ceph_get_reply_dir(struct super_block *sb,
+                                       struct inode *parent,
+                                       struct ceph_mds_reply_info_parsed *rinfo)
+{
+    struct ceph_vino vino;
+
+    if (unlikely(!rinfo->diri.in))
+        return parent; /* nothing to compare against */
+
+    /* If we didn't have a cached parent inode to begin with, just bail out. */
+    if (!parent)
+        return NULL;
+
+    vino.ino  = le64_to_cpu(rinfo->diri.in->ino);
+    vino.snap = le64_to_cpu(rinfo->diri.in->snapid);
+
+    if (likely(parent && ceph_ino(parent) == vino.ino && ceph_snap(parent) == vino.snap))
+        return parent; /* matches – use the original reference */
+
+    /* Mismatch – this should be rare.  Emit a WARN and obtain the correct inode. */
+    WARN(1, "ceph: reply dir mismatch (parent %s %llx.%llx reply %llx.%llx)\n",
+         parent ? "valid" : "NULL",
+         parent ? ceph_ino(parent) : 0ULL,
+         parent ? ceph_snap(parent) : 0ULL,
+         vino.ino, vino.snap);
+
+    return ceph_get_inode(sb, vino, NULL);
+}
+
 /**
  * ceph_new_inode - allocate a new inode in advance of an expected create
  * @dir: parent directory for new inode
@@ -1548,8 +1588,8 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 	}
 
 	if (rinfo->head->is_dentry) {
-		struct inode *dir = req->r_parent;
-
+		/* r_parent may be stale, in cases when R_PARENT_LOCKED is not set, so we need to get the correct inode */
+		struct inode *dir = ceph_get_reply_dir(sb, req->r_parent, rinfo);
 		if (dir) {
 			err = ceph_fill_inode(dir, NULL, &rinfo->diri,
 					      rinfo->dirfrag, session, -1,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re:  [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message
  2025-07-30 15:19 ` [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message Alex Markuze
@ 2025-07-30 21:38   ` Viacheslav Dubeyko
  2025-07-31  7:24     ` Alex Markuze
  0 siblings, 1 reply; 7+ messages in thread
From: Viacheslav Dubeyko @ 2025-07-30 21:38 UTC (permalink / raw)
  To: Alex Markuze, ceph-devel@vger.kernel.org

On Wed, 2025-07-30 at 15:19 +0000, Alex Markuze wrote:
> When the parent directory's i_rwsem is not locked, req->r_parent may become
> stale due to concurrent operations (e.g. rename) between dentry lookup and
> message creation. Validate that r_parent matches the encoded parent inode
> and update to the correct inode if a mismatch is detected.

Could we share any description of a crash or workload misbehavior that can
illustrate the symptoms of the issue?

> ---
>  fs/ceph/inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 814f9e9656a0..49fb1e3a02e8 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -56,6 +56,46 @@ static int ceph_set_ino_cb(struct inode *inode, void *data)
>  	return 0;
>  }
>  
> +/*
> + * Validate that the directory inode referenced by @req->r_parent matches the
> + * inode number and snapshot id contained in the reply's directory record.  If
> + * they do not match – which can theoretically happen if the parent dentry was
> + * moved between the time the request was issued and the reply arrived – fall
> + * back to looking up the correct inode in the inode cache.
> + *
> + * A reference is *always* returned.  Callers that receive a different inode
> + * than the original @parent are responsible for dropping the extra reference
> + * once the reply has been processed.
> + */
> +static struct inode *ceph_get_reply_dir(struct super_block *sb,
> +                                       struct inode *parent,
> +                                       struct ceph_mds_reply_info_parsed *rinfo)
> +{
> +    struct ceph_vino vino;
> +
> +    if (unlikely(!rinfo->diri.in))
> +        return parent; /* nothing to compare against */

If we could receive parent == NULL, then is it possible to receive rinfo ==
NULL?

> +
> +    /* If we didn't have a cached parent inode to begin with, just bail out. */
> +    if (!parent)
> +        return NULL;

Is it normal workflow that parent == NULL? Should we complain about it here?

> +
> +    vino.ino  = le64_to_cpu(rinfo->diri.in->ino);
> +    vino.snap = le64_to_cpu(rinfo->diri.in->snapid);
> +
> +    if (likely(parent && ceph_ino(parent) == vino.ino && ceph_snap(parent) == vino.snap))

It looks like a long line. Could we introduce an inline static function with a
good name here?

We already checked that parent is not NULL above. Does it make sense to have
this duplicated check here?
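
Something like this, maybe (untested sketch, the name is only a suggestion):

    static inline bool ceph_vino_matches_parent(struct inode *parent,
                                                struct ceph_vino vino)
    {
            return ceph_ino(parent) == vino.ino &&
                   ceph_snap(parent) == vino.snap;
    }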

> +        return parent; /* matches – use the original reference */
> +
> +    /* Mismatch – this should be rare.  Emit a WARN and obtain the correct inode. */
> +    WARN(1, "ceph: reply dir mismatch (parent %s %llx.%llx reply %llx.%llx)\n",
> +         parent ? "valid" : "NULL",

How can parent be NULL here? We already checked this pointer. And this
construction looks pretty complicated.

> +         parent ? ceph_ino(parent) : 0ULL,
> +         parent ? ceph_snap(parent) : 0ULL,
> +         vino.ino, vino.snap);
> +
> +    return ceph_get_inode(sb, vino, NULL);
> +}
> +
>  /**
>   * ceph_new_inode - allocate a new inode in advance of an expected create
>   * @dir: parent directory for new inode
> @@ -1548,8 +1588,8 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
>  	}
>  
>  	if (rinfo->head->is_dentry) {
> -		struct inode *dir = req->r_parent;
> -
> +		/* r_parent may be stale, in cases when R_PARENT_LOCKED is not set, so we need to get the correct inode */

The comment is too long for one line; we could have a multi-line comment here.
If R_PARENT_LOCKED is not set, is it normal execution flow or not? Should we
fix the use case of not setting R_PARENT_LOCKED?

Thanks,
Slava.

> +		struct inode *dir = ceph_get_reply_dir(sb, req->r_parent, rinfo);
>  		if (dir) {
>  			err = ceph_fill_inode(dir, NULL, &rinfo->diri,
>  					      rinfo->dirfrag, session, -1,

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re:  [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state
  2025-07-30 15:18 ` [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state Alex Markuze
@ 2025-07-30 22:09   ` Viacheslav Dubeyko
  2025-07-31  7:30     ` Alex Markuze
  0 siblings, 1 reply; 7+ messages in thread
From: Viacheslav Dubeyko @ 2025-07-30 22:09 UTC (permalink / raw)
  To: Alex Markuze, ceph-devel@vger.kernel.org

On Wed, 2025-07-30 at 15:18 +0000, Alex Markuze wrote:
> Add validation to ensure the cached parent directory inode matches the
> directory info in MDS replies. This prevents client-side race conditions
> where concurrent operations (e.g. rename) cause r_parent to become stale
> between request initiation and reply processing, which could lead to
> applying state changes to incorrect directory inodes.
> ---
>  fs/ceph/mds_client.c | 67 +++++++++++++++++++++++++++++++-------------
>  1 file changed, 47 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 8d9fc5e18b17..a164783fc1e1 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2853,7 +2853,7 @@ char *ceph_mdsc_build_path(struct ceph_mds_client *mdsc, struct dentry *dentry,
>  
>  static int build_dentry_path(struct ceph_mds_client *mdsc, struct dentry *dentry,
>  			     struct inode *dir, const char **ppath, int *ppathlen,
> -			     u64 *pino, bool *pfreepath, bool parent_locked)
> +			     struct ceph_vino *pvino, bool *pfreepath, bool parent_locked)

We have too many arguments here. It looks like we need to introduce some
structure.

>  {
>  	char *path;
>  
> @@ -2862,23 +2862,29 @@ static int build_dentry_path(struct ceph_mds_client *mdsc, struct dentry *dentry
>  		dir = d_inode_rcu(dentry->d_parent);
>  	if (dir && parent_locked && ceph_snap(dir) == CEPH_NOSNAP &&
>  	    !IS_ENCRYPTED(dir)) {
> -		*pino = ceph_ino(dir);
> +		pvino->ino = ceph_ino(dir);
> +		pvino->snap = ceph_snap(dir);
>  		rcu_read_unlock();
>  		*ppath = dentry->d_name.name;
>  		*ppathlen = dentry->d_name.len;
>  		return 0;
>  	}
>  	rcu_read_unlock();
> -	path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, pino, 1);
> +	path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, &pvino->ino, 1);
>  	if (IS_ERR(path))
>  		return PTR_ERR(path);
>  	*ppath = path;
>  	*pfreepath = true;
> +	/* For paths built by ceph_mdsc_build_path, we need to get snap from dentry */

I believe a multi-line comment would be good here.

> +	if (dentry && d_inode(dentry))

Could dentry be really NULL here?

> +		pvino->snap = ceph_snap(d_inode(dentry));
> +	else
> +		pvino->snap = CEPH_NOSNAP;
>  	return 0;
>  }
>  
>  static int build_inode_path(struct inode *inode,
> -			    const char **ppath, int *ppathlen, u64 *pino,
> +			    const char **ppath, int *ppathlen, struct ceph_vino *pvino,
>  			    bool *pfreepath)

We have too many arguments here too.

>  {
>  	struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
> @@ -2886,17 +2892,19 @@ static int build_inode_path(struct inode *inode,
>  	char *path;
>  
>  	if (ceph_snap(inode) == CEPH_NOSNAP) {
> -		*pino = ceph_ino(inode);
> +		pvino->ino = ceph_ino(inode);
> +		pvino->snap = ceph_snap(inode);
>  		*ppathlen = 0;
>  		return 0;
>  	}
>  	dentry = d_find_alias(inode);
> -	path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, pino, 1);
> +	path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, &pvino->ino, 1);
>  	dput(dentry);
>  	if (IS_ERR(path))
>  		return PTR_ERR(path);
>  	*ppath = path;
>  	*pfreepath = true;
> +	pvino->snap = ceph_snap(inode);
>  	return 0;
>  }
>  
> @@ -2907,7 +2915,7 @@ static int build_inode_path(struct inode *inode,
>  static int set_request_path_attr(struct ceph_mds_client *mdsc, struct inode *rinode,
>  				 struct dentry *rdentry, struct inode *rdiri,
>  				 const char *rpath, u64 rino, const char **ppath,
> -				 int *pathlen, u64 *ino, bool *freepath,
> +				 int *pathlen, struct ceph_vino *p_vino, bool *freepath,
>  				 bool parent_locked)

Ditto. :)

Thanks,
Slava.

>  {
>  	struct ceph_client *cl = mdsc->fsc->client;
> @@ -2915,16 +2923,17 @@ static int set_request_path_attr(struct ceph_mds_client *mdsc, struct inode *rin
>  	int r = 0;
>  
>  	if (rinode) {
> -		r = build_inode_path(rinode, ppath, pathlen, ino, freepath);
> +		r = build_inode_path(rinode, ppath, pathlen, p_vino, freepath);
>  		boutc(cl, " inode %p %llx.%llx\n", rinode, ceph_ino(rinode),
> -		      ceph_snap(rinode));
> +		ceph_snap(rinode));
>  	} else if (rdentry) {
> -		r = build_dentry_path(mdsc, rdentry, rdiri, ppath, pathlen, ino,
> -					freepath, parent_locked);
> +		r = build_dentry_path(mdsc, rdentry, rdiri, ppath, pathlen, p_vino,
> +				       freepath, parent_locked);
>  		CEPH_SAN_STRNCPY(result_str, sizeof(result_str), *ppath, *pathlen);
> -		boutc(cl, " dentry %p %llx/%s\n", rdentry, *ino, result_str);
> +		boutc(cl, " dentry %p %llx/%s\n", rdentry, p_vino->ino, result_str);
>  	} else if (rpath || rino) {
> -		*ino = rino;
> +		p_vino->ino = rino;
> +		p_vino->snap = CEPH_NOSNAP;
>  		*ppath = rpath;
>  		*pathlen = rpath ? strlen(rpath) : 0;
>  		CEPH_SAN_STRNCPY(result_str, sizeof(result_str), rpath, *pathlen);
> @@ -3007,7 +3016,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
>  	struct ceph_mds_request_head_legacy *lhead;
>  	const char *path1 = NULL;
>  	const char *path2 = NULL;
> -	u64 ino1 = 0, ino2 = 0;
> +	struct ceph_vino vino1 = {0}, vino2 = {0};
>  	int pathlen1 = 0, pathlen2 = 0;
>  	bool freepath1 = false, freepath2 = false;
>  	struct dentry *old_dentry = NULL;
> @@ -3019,17 +3028,35 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
>  	u16 request_head_version = mds_supported_head_version(session);
>  	kuid_t caller_fsuid = req->r_cred->fsuid;
>  	kgid_t caller_fsgid = req->r_cred->fsgid;
> +	bool parent_locked = test_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
>  
>  	ret = set_request_path_attr(mdsc, req->r_inode, req->r_dentry,
>  			      req->r_parent, req->r_path1, req->r_ino1.ino,
> -			      &path1, &pathlen1, &ino1, &freepath1,
> -			      test_bit(CEPH_MDS_R_PARENT_LOCKED,
> -					&req->r_req_flags));
> +			      &path1, &pathlen1, &vino1, &freepath1,
> +			      parent_locked);
>  	if (ret < 0) {
>  		msg = ERR_PTR(ret);
>  		goto out;
>  	}
>  
> +	/*
> +	 * When the parent directory's i_rwsem is *not* locked, req->r_parent may
> +	 * have become stale (e.g. after a concurrent rename) between the time the
> +	 * dentry was looked up and now.  If we detect that the stored r_parent
> +	 * does not match the inode number we just encoded for the request, switch
> +	 * to the correct inode so that the MDS receives a valid parent reference.
> +	 */
> +	if (!parent_locked &&
> +	    req->r_parent && vino1.ino && ceph_ino(req->r_parent) != vino1.ino) {
> +		struct inode *correct_dir = ceph_get_inode(mdsc->fsc->sb, vino1, NULL);
> +		if (!IS_ERR(correct_dir)) {
> +			WARN(1, "ceph: r_parent mismatch (had %llx wanted %llx) - updating\n",
> +			     ceph_ino(req->r_parent), vino1.ino);
> +			iput(req->r_parent);
> +			req->r_parent = correct_dir;
> +		}
> +	}
> +
>  	/* If r_old_dentry is set, then assume that its parent is locked */
>  	if (req->r_old_dentry &&
>  	    !(req->r_old_dentry->d_flags & DCACHE_DISCONNECTED))
> @@ -3037,7 +3064,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
>  	ret = set_request_path_attr(mdsc, NULL, old_dentry,
>  			      req->r_old_dentry_dir,
>  			      req->r_path2, req->r_ino2.ino,
> -			      &path2, &pathlen2, &ino2, &freepath2, true);
> +			      &path2, &pathlen2, &vino2, &freepath2, true);
>  	if (ret < 0) {
>  		msg = ERR_PTR(ret);
>  		goto out_free1;
> @@ -3191,8 +3218,8 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
>  	lhead->ino = cpu_to_le64(req->r_deleg_ino);
>  	lhead->args = req->r_args;
>  
> -	ceph_encode_filepath(&p, end, ino1, path1);
> -	ceph_encode_filepath(&p, end, ino2, path2);
> +	ceph_encode_filepath(&p, end, vino1.ino, path1);
> +	ceph_encode_filepath(&p, end, vino2.ino, path2);
>  
>  	/* make note of release offset, in case we need to replay */
>  	req->r_request_release_offset = p - msg->front.iov_base;

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message
  2025-07-30 21:38   ` Viacheslav Dubeyko
@ 2025-07-31  7:24     ` Alex Markuze
  0 siblings, 0 replies; 7+ messages in thread
From: Alex Markuze @ 2025-07-31  7:24 UTC (permalink / raw)
  To: Viacheslav Dubeyko; +Cc: ceph-devel@vger.kernel.org

This fix comes after an analysis of a trace from a client bug.
RCA:
Usually two operations can't interleave due to extensive VFS locking.
But CephFS maintains leases for cached dentry structures; when these
leases expire, the client may create a lookup request without taking
VFS locks. This is done by choice, but it does leave a window for the
type of race that we have witnessed. The reply handler in one
specific place doesn't make sure that the appropriate locks are held
and still trusts the r_parent pointer, which by then holds an incorrect
value due to the rename operation. I prepared a fix that now compares
the cached r_parent with the ino from the reply. I'll ask @André to
prepare a new image; this image should also resolve the boot issue for
the previous debug version.

It's the message handling code, so the request is always valid, but not
every request will have a parent. R_PARENT_LOCKED may not be set by
design; it's a valid state.
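
Roughly, the suspected interleaving is (simplified, illustrative only):

  1. The dentry lease for the affected entry expires on the client.
  2. The client issues a lookup without taking the parent's i_rwsem and
     caches req->r_parent = <old parent inode>.
  3. A concurrent rename moves the entry under a different directory.
  4. The request encoding and the MDS reply now refer to the new parent,
     while req->r_parent still points at the old one.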

On Thu, Jul 31, 2025 at 1:39 AM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Wed, 2025-07-30 at 15:19 +0000, Alex Markuze wrote:
> > When the parent directory's i_rwsem is not locked, req->r_parent may become
> > stale due to concurrent operations (e.g. rename) between dentry lookup and
> > message creation. Validate that r_parent matches the encoded parent inode
> > and update to the correct inode if a mismatch is detected.
>
> Could we share any description of a crash or workload misbehavior that can
> illustrate the symptoms of the issue?
>
> > ---
> >  fs/ceph/inode.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 42 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > index 814f9e9656a0..49fb1e3a02e8 100644
> > --- a/fs/ceph/inode.c
> > +++ b/fs/ceph/inode.c
> > @@ -56,6 +56,46 @@ static int ceph_set_ino_cb(struct inode *inode, void *data)
> >       return 0;
> >  }
> >
> > +/*
> > + * Validate that the directory inode referenced by @req->r_parent matches the
> > + * inode number and snapshot id contained in the reply's directory record.  If
> > + * they do not match – which can theoretically happen if the parent dentry was
> > + * moved between the time the request was issued and the reply arrived – fall
> > + * back to looking up the correct inode in the inode cache.
> > + *
> > + * A reference is *always* returned.  Callers that receive a different inode
> > + * than the original @parent are responsible for dropping the extra reference
> > + * once the reply has been processed.
> > + */
> > +static struct inode *ceph_get_reply_dir(struct super_block *sb,
> > +                                       struct inode *parent,
> > +                                       struct ceph_mds_reply_info_parsed *rinfo)
> > +{
> > +    struct ceph_vino vino;
> > +
> > +    if (unlikely(!rinfo->diri.in))
> > +        return parent; /* nothing to compare against */
>
> If we could receive parent == NULL, then is it possible to receive rinfo ==
> NULL?
>
> > +
> > +    /* If we didn't have a cached parent inode to begin with, just bail out. */
> > +    if (!parent)
> > +        return NULL;
>
> Is it normal workflow that parent == NULL? Should we complain about it here?
>
> > +
> > +    vino.ino  = le64_to_cpu(rinfo->diri.in->ino);
> > +    vino.snap = le64_to_cpu(rinfo->diri.in->snapid);
> > +
> > +    if (likely(parent && ceph_ino(parent) == vino.ino && ceph_snap(parent) == vino.snap))
>
> It looks like a long line. Could we introduce an inline static function with a
> good name here?
>
> We already checked that parent is not NULL above. Does it make sense to have
> this duplicated check here?
>
> > +        return parent; /* matches – use the original reference */
> > +
> > +    /* Mismatch – this should be rare.  Emit a WARN and obtain the correct inode. */
> > +    WARN(1, "ceph: reply dir mismatch (parent %s %llx.%llx reply %llx.%llx)\n",
> > +         parent ? "valid" : "NULL",
>
> How can parent be NULL here? We already checked this pointer. And this
> construction looks pretty complicated.
>
> > +         parent ? ceph_ino(parent) : 0ULL,
> > +         parent ? ceph_snap(parent) : 0ULL,
> > +         vino.ino, vino.snap);
> > +
> > +    return ceph_get_inode(sb, vino, NULL);
> > +}
> > +
> >  /**
> >   * ceph_new_inode - allocate a new inode in advance of an expected create
> >   * @dir: parent directory for new inode
> > @@ -1548,8 +1588,8 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
> >       }
> >
> >       if (rinfo->head->is_dentry) {
> > -             struct inode *dir = req->r_parent;
> > -
> > +             /* r_parent may be stale, in cases when R_PARENT_LOCKED is not set, so we need to get the correct inode */
>
> The comment is too long for one line; we could have a multi-line comment here.
> If R_PARENT_LOCKED is not set, is it normal execution flow or not? Should we
> fix the use case of not setting R_PARENT_LOCKED?
>
> Thanks,
> Slava.
>
> > +             struct inode *dir = ceph_get_reply_dir(sb, req->r_parent, rinfo);
> >               if (dir) {
> >                       err = ceph_fill_inode(dir, NULL, &rinfo->diri,
> >                                             rinfo->dirfrag, session, -1,


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state
  2025-07-30 22:09   ` Viacheslav Dubeyko
@ 2025-07-31  7:30     ` Alex Markuze
  0 siblings, 0 replies; 7+ messages in thread
From: Alex Markuze @ 2025-07-31  7:30 UTC (permalink / raw)
  To: Viacheslav Dubeyko; +Cc: ceph-devel@vger.kernel.org

No new parameters were added; a u64 ino value was changed to struct ceph_vino.
I prefer not to mix functional changes with aesthetic ones. I can add
a separate patch, but it is unrelated to the bug fix here.
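
If we do that follow-up, it could look roughly like this (sketch only,
naming and field layout to be decided):

    struct ceph_path_info {
            const char *path;
            int pathlen;
            struct ceph_vino vino;
            bool freepath;
    };

    static int build_dentry_path(struct ceph_mds_client *mdsc,
                                 struct dentry *dentry, struct inode *dir,
                                 struct ceph_path_info *info,
                                 bool parent_locked);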

On Thu, Jul 31, 2025 at 2:10 AM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Wed, 2025-07-30 at 15:18 +0000, Alex Markuze wrote:
> > Add validation to ensure the cached parent directory inode matches the
> > directory info in MDS replies. This prevents client-side race conditions
> > where concurrent operations (e.g. rename) cause r_parent to become stale
> > between request initiation and reply processing, which could lead to
> > applying state changes to incorrect directory inodes.
> > ---
> >  fs/ceph/mds_client.c | 67 +++++++++++++++++++++++++++++++-------------
> >  1 file changed, 47 insertions(+), 20 deletions(-)
> >
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 8d9fc5e18b17..a164783fc1e1 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -2853,7 +2853,7 @@ char *ceph_mdsc_build_path(struct ceph_mds_client *mdsc, struct dentry *dentry,
> >
> >  static int build_dentry_path(struct ceph_mds_client *mdsc, struct dentry *dentry,
> >                            struct inode *dir, const char **ppath, int *ppathlen,
> > -                          u64 *pino, bool *pfreepath, bool parent_locked)
> > +                          struct ceph_vino *pvino, bool *pfreepath, bool parent_locked)
>
> We have too many arguments here. It looks like we need to introduce some
> structure.
>
> >  {
> >       char *path;
> >
> > @@ -2862,23 +2862,29 @@ static int build_dentry_path(struct ceph_mds_client *mdsc, struct dentry *dentry
> >               dir = d_inode_rcu(dentry->d_parent);
> >       if (dir && parent_locked && ceph_snap(dir) == CEPH_NOSNAP &&
> >           !IS_ENCRYPTED(dir)) {
> > -             *pino = ceph_ino(dir);
> > +             pvino->ino = ceph_ino(dir);
> > +             pvino->snap = ceph_snap(dir);
> >               rcu_read_unlock();
> >               *ppath = dentry->d_name.name;
> >               *ppathlen = dentry->d_name.len;
> >               return 0;
> >       }
> >       rcu_read_unlock();
> > -     path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, pino, 1);
> > +     path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, &pvino->ino, 1);
> >       if (IS_ERR(path))
> >               return PTR_ERR(path);
> >       *ppath = path;
> >       *pfreepath = true;
> > +     /* For paths built by ceph_mdsc_build_path, we need to get snap from dentry */
>
> I believe a multi-line comment would be good here.
>
> > +     if (dentry && d_inode(dentry))
>
> Could dentry be really NULL here?
>
> > +             pvino->snap = ceph_snap(d_inode(dentry));
> > +     else
> > +             pvino->snap = CEPH_NOSNAP;
> >       return 0;
> >  }
> >
> >  static int build_inode_path(struct inode *inode,
> > -                         const char **ppath, int *ppathlen, u64 *pino,
> > +                         const char **ppath, int *ppathlen, struct ceph_vino *pvino,
> >                           bool *pfreepath)
>
> We have too many arguments here too.
>
> >  {
> >       struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
> > @@ -2886,17 +2892,19 @@ static int build_inode_path(struct inode *inode,
> >       char *path;
> >
> >       if (ceph_snap(inode) == CEPH_NOSNAP) {
> > -             *pino = ceph_ino(inode);
> > +             pvino->ino = ceph_ino(inode);
> > +             pvino->snap = ceph_snap(inode);
> >               *ppathlen = 0;
> >               return 0;
> >       }
> >       dentry = d_find_alias(inode);
> > -     path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, pino, 1);
> > +     path = ceph_mdsc_build_path(mdsc, dentry, ppathlen, &pvino->ino, 1);
> >       dput(dentry);
> >       if (IS_ERR(path))
> >               return PTR_ERR(path);
> >       *ppath = path;
> >       *pfreepath = true;
> > +     pvino->snap = ceph_snap(inode);
> >       return 0;
> >  }
> >
> > @@ -2907,7 +2915,7 @@ static int build_inode_path(struct inode *inode,
> >  static int set_request_path_attr(struct ceph_mds_client *mdsc, struct inode *rinode,
> >                                struct dentry *rdentry, struct inode *rdiri,
> >                                const char *rpath, u64 rino, const char **ppath,
> > -                              int *pathlen, u64 *ino, bool *freepath,
> > +                              int *pathlen, struct ceph_vino *p_vino, bool *freepath,
> >                                bool parent_locked)
>
> Ditto. :)
>
> Thanks,
> Slava.
>
> >  {
> >       struct ceph_client *cl = mdsc->fsc->client;
> > @@ -2915,16 +2923,17 @@ static int set_request_path_attr(struct ceph_mds_client *mdsc, struct inode *rin
> >       int r = 0;
> >
> >       if (rinode) {
> > -             r = build_inode_path(rinode, ppath, pathlen, ino, freepath);
> > +             r = build_inode_path(rinode, ppath, pathlen, p_vino, freepath);
> >               boutc(cl, " inode %p %llx.%llx\n", rinode, ceph_ino(rinode),
> > -                   ceph_snap(rinode));
> > +             ceph_snap(rinode));
> >       } else if (rdentry) {
> > -             r = build_dentry_path(mdsc, rdentry, rdiri, ppath, pathlen, ino,
> > -                                     freepath, parent_locked);
> > +             r = build_dentry_path(mdsc, rdentry, rdiri, ppath, pathlen, p_vino,
> > +                                    freepath, parent_locked);
> >               CEPH_SAN_STRNCPY(result_str, sizeof(result_str), *ppath, *pathlen);
> > -             boutc(cl, " dentry %p %llx/%s\n", rdentry, *ino, result_str);
> > +             boutc(cl, " dentry %p %llx/%s\n", rdentry, p_vino->ino, result_str);
> >       } else if (rpath || rino) {
> > -             *ino = rino;
> > +             p_vino->ino = rino;
> > +             p_vino->snap = CEPH_NOSNAP;
> >               *ppath = rpath;
> >               *pathlen = rpath ? strlen(rpath) : 0;
> >               CEPH_SAN_STRNCPY(result_str, sizeof(result_str), rpath, *pathlen);
> > @@ -3007,7 +3016,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
> >       struct ceph_mds_request_head_legacy *lhead;
> >       const char *path1 = NULL;
> >       const char *path2 = NULL;
> > -     u64 ino1 = 0, ino2 = 0;
> > +     struct ceph_vino vino1 = {0}, vino2 = {0};
> >       int pathlen1 = 0, pathlen2 = 0;
> >       bool freepath1 = false, freepath2 = false;
> >       struct dentry *old_dentry = NULL;
> > @@ -3019,17 +3028,35 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
> >       u16 request_head_version = mds_supported_head_version(session);
> >       kuid_t caller_fsuid = req->r_cred->fsuid;
> >       kgid_t caller_fsgid = req->r_cred->fsgid;
> > +     bool parent_locked = test_bit(CEPH_MDS_R_PARENT_LOCKED, &req->r_req_flags);
> >
> >       ret = set_request_path_attr(mdsc, req->r_inode, req->r_dentry,
> >                             req->r_parent, req->r_path1, req->r_ino1.ino,
> > -                           &path1, &pathlen1, &ino1, &freepath1,
> > -                           test_bit(CEPH_MDS_R_PARENT_LOCKED,
> > -                                     &req->r_req_flags));
> > +                           &path1, &pathlen1, &vino1, &freepath1,
> > +                           parent_locked);
> >       if (ret < 0) {
> >               msg = ERR_PTR(ret);
> >               goto out;
> >       }
> >
> > +     /*
> > +      * When the parent directory's i_rwsem is *not* locked, req->r_parent may
> > +      * have become stale (e.g. after a concurrent rename) between the time the
> > +      * dentry was looked up and now.  If we detect that the stored r_parent
> > +      * does not match the inode number we just encoded for the request, switch
> > +      * to the correct inode so that the MDS receives a valid parent reference.
> > +      */
> > +     if (!parent_locked &&
> > +         req->r_parent && vino1.ino && ceph_ino(req->r_parent) != vino1.ino) {
> > +             struct inode *correct_dir = ceph_get_inode(mdsc->fsc->sb, vino1, NULL);
> > +             if (!IS_ERR(correct_dir)) {
> > +                     WARN(1, "ceph: r_parent mismatch (had %llx wanted %llx) - updating\n",
> > +                          ceph_ino(req->r_parent), vino1.ino);
> > +                     iput(req->r_parent);
> > +                     req->r_parent = correct_dir;
> > +             }
> > +     }
> > +
> >       /* If r_old_dentry is set, then assume that its parent is locked */
> >       if (req->r_old_dentry &&
> >           !(req->r_old_dentry->d_flags & DCACHE_DISCONNECTED))
> > @@ -3037,7 +3064,7 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
> >       ret = set_request_path_attr(mdsc, NULL, old_dentry,
> >                             req->r_old_dentry_dir,
> >                             req->r_path2, req->r_ino2.ino,
> > -                           &path2, &pathlen2, &ino2, &freepath2, true);
> > +                           &path2, &pathlen2, &vino2, &freepath2, true);
> >       if (ret < 0) {
> >               msg = ERR_PTR(ret);
> >               goto out_free1;
> > @@ -3191,8 +3218,8 @@ static struct ceph_msg *create_request_message(struct ceph_mds_session *session,
> >       lhead->ino = cpu_to_le64(req->r_deleg_ino);
> >       lhead->args = req->r_args;
> >
> > -     ceph_encode_filepath(&p, end, ino1, path1);
> > -     ceph_encode_filepath(&p, end, ino2, path2);
> > +     ceph_encode_filepath(&p, end, vino1.ino, path1);
> > +     ceph_encode_filepath(&p, end, vino2.ino, path2);
> >
> >       /* make note of release offset, in case we need to replay */
> >       req->r_request_release_offset = p - msg->front.iov_base;


^ permalink raw reply	[flat|nested] 7+ messages in thread

Thread overview: 7+ messages
2025-07-30 15:18 [PATCH 0/2] ceph: Fix r_parent staleness race and related deadlock Alex Markuze
2025-07-30 15:18 ` [PATCH 1/2] ceph: fix client race condition validating r_parent before applying state Alex Markuze
2025-07-30 22:09   ` Viacheslav Dubeyko
2025-07-31  7:30     ` Alex Markuze
2025-07-30 15:19 ` [PATCH 2/2] ceph: fix client race condition where r_parent becomes stale before sending message Alex Markuze
2025-07-30 21:38   ` Viacheslav Dubeyko
2025-07-31  7:24     ` Alex Markuze
