public inbox for linux-kernel@vger.kernel.org
From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
To: "lilei24@kuaishou.com" <lilei24@kuaishou.com>,
	"idryomov@gmail.com" <idryomov@gmail.com>,
	"slava@dubeyko.com" <slava@dubeyko.com>,
	Alex Markuze <amarkuze@redhat.com>
Cc: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
	"noctis.akm@gmail.com" <noctis.akm@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"sunzhao03@kuaishou.com" <sunzhao03@kuaishou.com>
Subject: Re:  [PATCH] Revert "ceph: when filling trace, call ceph_get_inode outside of mutexes"
Date: Thu, 23 Apr 2026 19:50:18 +0000	[thread overview]
Message-ID: <8a17ed01f4d98c93ef53521e8ab3fc313678af70.camel@ibm.com> (raw)
In-Reply-To: <20260418041241.17892-1-lilei24@kuaishou.com>

On Sat, 2026-04-18 at 12:12 +0800, Li Lei wrote:
> This reverts commit bca9fc14c70fcbbebc84954cc39994e463fb9468.
> 
> Deadlock detected between mdsc->snap_rwsem and the I_NEW bit in
> handle_reply().
> 
> - kworker/u113:1 (stat inode)
>   1) Hold an inode with I_NEW set
>   2) Request for mdsc->snap_rwsem
> - kworker/u113:2 (readdir)
>   1) Hold mdsc->snap_rwsem
>   2) Wait for inode I_NEW flag to be cleared
> 
> task:kworker/u113:1  state:D stack:    0 pid:34454 ppid:     2
> flags:0x00004000
> Workqueue: ceph-msgr ceph_con_workfn [libceph]
> Call Trace:
> __schedule+0x3a9/0x8d0
> schedule+0x49/0xb0
> rwsem_down_write_slowpath+0x30a/0x5e0
> handle_reply+0x4d7/0x7f0 [ceph]
> ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
> mds_dispatch+0x10a/0x690 [ceph]
> ? calc_signature+0xdf/0x110 [libceph]
> ? ceph_x_check_message_signature+0x58/0xc0 [libceph]
> ceph_con_process_message+0x73/0x140 [libceph]
> ceph_con_v1_try_read+0x2f2/0x860 [libceph]
> ceph_con_workfn+0x31e/0x660 [libceph]
> process_one_work+0x1cb/0x370
> worker_thread+0x30/0x390
> ? process_one_work+0x370/0x370
> kthread+0x13e/0x160
> ? set_kthread_struct+0x50/0x50
> ret_from_fork+0x1f/0x30
> 
> task:kworker/u113:2  state:D stack:    0 pid:54267 ppid:     2
> flags:0x00004000
> Workqueue: ceph-msgr ceph_con_workfn [libceph]
> Call Trace:
> __schedule+0x3a9/0x8d0
> ? bit_wait_io+0x60/0x60
> ? bit_wait_io+0x60/0x60
> schedule+0x49/0xb0
> bit_wait+0xd/0x60
> __wait_on_bit+0x2a/0x90
> ? ceph_force_reconnect+0x90/0x90 [ceph]
> out_of_line_wait_on_bit+0x91/0xb0
> ? bitmap_empty+0x20/0x20
> ilookup5.part.29+0x69/0x90
> ? ceph_force_reconnect+0x90/0x90 [ceph]
> ? ceph_ino_compare+0x30/0x30 [ceph]
> iget5_locked+0x26/0x90
> ceph_get_inode+0x45/0x130 [ceph]
> ceph_readdir_prepopulate+0x59f/0xca0 [ceph]
> handle_reply+0x78d/0x7f0 [ceph]
> ? ceph_tcp_recvmsg+0x6f/0xa0 [libceph]
> mds_dispatch+0x10a/0x690 [ceph]
> ? calc_signature+0xdf/0x110 [libceph]
> ? ceph_x_check_message_signature+0x58/0xc0 [libceph]
> ceph_con_process_message+0x73/0x140 [libceph]
> ceph_con_v1_try_read+0x2f2/0x860 [libceph]
> ceph_con_workfn+0x31e/0x660 [libceph]
> process_one_work+0x1cb/0x370
> worker_thread+0x30/0x390
> ? process_one_work+0x370/0x370
> kthread+0x13e/0x160
> ? set_kthread_struct+0x50/0x50
> ret_from_fork+0x1f/0x30
> 
> It's rather rare to catch, but here are fast reproduction steps
> (multiple MDSes are needed):
> 1. Try to find 2 different directories (DIR_a DIR_b) in a cephfs cluster
>     and make sure they have different auth mds nodes. In this way, a client
>     may have chances to run handle_reply on different CPU for our test
>     (see step 5 and step 6).
> 2. In DIR_b, create a hard link of DIR_a/FILE_a, namely FILE_b.
>     DIR_a/FILE_a and DIR_b/FILE_b have the same ino (123456 e.g)
> 3. Hard-code the ino in the code below so that the stat request
>     sleeps in handle_reply():
>     ```
>     static void handle_reply(struct ceph_mds_session *session,
>                              struct ceph_msg *msg)
>     ...
>     			goto out_err;
>     		}
>     		req->r_target_inode = in;
>     +		if (in->i_ino == 123456) {
>     +			pr_err("inode %lu found, ready to wait 10 seconds.\n",
>     +			       in->i_ino);
>     +			msleep(10000);
>     +		}
>     ```
> 4. Execute echo 3 > /proc/sys/vm/drop_caches
> 5. In a shell, do `cd DIR_a; stat DIR_a/FILE_a`; this shell is expected to
>     get stuck because of the msleep() in handle_reply().
> 6. In another shell, do `cd DIR_b; ls DIR_b/` to trigger ceph_readdir_prepopulate().
> 
> Repeat steps 4-6; fewer than 10 iterations are enough to see the problem.
> 
> It turns out that commit bca9fc14c70f ("ceph: when filling trace, call ceph_get_inode outside of mutexes")
> moved ceph_get_inode() outside snap_rwsem and opened a window for the deadlock
> between ceph_get_inode() and snap_rwsem described above.
> 
> After the following commit, the original commit (bca9fc14c70f) can be reverted safely:
> commit 6a92b08fdad2 ("ceph: don't take s_mutex or snap_rwsem in ceph_check_caps")
> 
> Signed-off-by: Zhao Sun <sunzhao03@kuaishou.com>
> Signed-off-by: Li Lei <lilei24@kuaishou.com>
> ---
>  fs/ceph/inode.c      | 26 ++++++++++++++++++++++----
>  fs/ceph/mds_client.c | 29 -----------------------------
>  2 files changed, 22 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index d99e12d..0c241a4 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1667,10 +1667,28 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
>  	}
>  
>  	if (rinfo->head->is_target) {
> -		/* Should be filled in by handle_reply */
> -		BUG_ON(!req->r_target_inode);
> +		in = xchg(&req->r_new_inode, NULL);
> +		tvino.ino  = le64_to_cpu(rinfo->targeti.in->ino);
> +		tvino.snap = le64_to_cpu(rinfo->targeti.in->snapid);
> +
> +		/*
> +		 * If we ended up opening an existing inode, discard
> +		 * r_new_inode
> +		 */
> +		if (req->r_op == CEPH_MDS_OP_CREATE &&
> +			!req->r_reply_info.has_create_ino) {
> +			/* This should never happen on an async create */
> +			WARN_ON_ONCE(req->r_deleg_ino);
> +			iput(in);
> +			in = NULL;
> +		}
> +
> +		in = ceph_get_inode(fsc->sb, tvino, in);
> +		if (IS_ERR(in)) {
> +			err = PTR_ERR(in);
> +			goto done;
> +		}
>  
> -		in = req->r_target_inode;
>  		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
>  				NULL, session,
>  				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
> @@ -1680,13 +1698,13 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
>  		if (err < 0) {
>  			pr_err_client(cl, "badness %p %llx.%llx\n", in,
>  				      ceph_vinop(in));
> -			req->r_target_inode = NULL;
>  			if (inode_state_read_once(in) & I_NEW)
>  				discard_new_inode(in);
>  			else
>  				iput(in);
>  			goto done;
>  		}
> +		req->r_target_inode = in;
>  		if (inode_state_read_once(in) & I_NEW)
>  			unlock_new_inode(in);
>  	}
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index b174627..8a27775 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -3941,36 +3941,7 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>  				       session->s_con.peer_features);
>  	mutex_unlock(&mdsc->mutex);
>  
> -	/* Must find target inode outside of mutexes to avoid deadlocks */
>  	rinfo = &req->r_reply_info;
> -	if ((err >= 0) && rinfo->head->is_target) {
> -		struct inode *in = xchg(&req->r_new_inode, NULL);
> -		struct ceph_vino tvino = {
> -			.ino  = le64_to_cpu(rinfo->targeti.in->ino),
> -			.snap = le64_to_cpu(rinfo->targeti.in->snapid)
> -		};
> -
> -		/*
> -		 * If we ended up opening an existing inode, discard
> -		 * r_new_inode
> -		 */
> -		if (req->r_op == CEPH_MDS_OP_CREATE &&
> -		    !req->r_reply_info.has_create_ino) {
> -			/* This should never happen on an async create */
> -			WARN_ON_ONCE(req->r_deleg_ino);
> -			iput(in);
> -			in = NULL;
> -		}
> -
> -		in = ceph_get_inode(mdsc->fsc->sb, tvino, in);
> -		if (IS_ERR(in)) {
> -			err = PTR_ERR(in);
> -			mutex_lock(&session->s_mutex);
> -			goto out_err;
> -		}
> -		req->r_target_inode = in;
> -	}
> -
>  	mutex_lock(&session->s_mutex);
>  	if (err < 0) {
>  		pr_err_client(cl, "got corrupt reply mds%d(tid:%lld)\n",

Let me run xfstests with this patch applied to double-check that nothing is broken.
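
For the record, I will be using an xfstests setup along these lines; the monitor address, export paths, mount points, and client name below are assumptions and need adjusting per cluster:

```
# local.config sketch for running xfstests against CephFS
# (all addresses and paths are placeholders)
export FSTYP=ceph
export TEST_DEV=192.168.0.1:6789:/test
export TEST_DIR=/mnt/test
export SCRATCH_DEV=192.168.0.1:6789:/scratch
export SCRATCH_MNT=/mnt/scratch
export TEST_FS_MOUNT_OPTS="-o name=admin,secretfile=/etc/ceph/secret"
```

and then `./check -g quick` from the xfstests tree.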

Thanks,
Slava.

Thread overview: 9+ messages
2026-04-18  4:12 [PATCH] Revert "ceph: when filling trace, call ceph_get_inode outside of mutexes" Li Lei
2026-04-23 19:50 ` Viacheslav Dubeyko [this message]
2026-04-24 18:11   ` Viacheslav Dubeyko
2026-04-24 18:54     ` 李磊
2026-04-27 19:55       ` Viacheslav Dubeyko
2026-04-28  5:31         ` 李磊
2026-04-28 17:04           ` Viacheslav Dubeyko
2026-04-29 17:39             ` Viacheslav Dubeyko
2026-04-27 19:57 ` Viacheslav Dubeyko
