CEPH filesystem development
Subject: [PATCH v2] ceph: when filling trace, call ceph_get_inode outside of mutexes
From: Jeff Layton @ 2020-11-13 12:21 UTC
  To: ceph-devel; +Cc: gengjichao, lhenriques

Geng Jichao reported a rather complex deadlock involving several
moving parts:

1) readahead is issued against an inode and some of its pages are locked
   while the read is in flight.

2) the same inode is evicted from the cache, and this task gets stuck
   waiting for the page lock because of the above readahead.

3) another task is processing a reply trace, and looks up the inode
   being evicted while holding the s_mutex. That ends up waiting for the
   eviction to complete.

4) a write reply for an unrelated inode is then processed in the
   ceph_con_workfn job. It calls ceph_check_caps after putting wrbuffer
   caps, and that gets stuck waiting on the s_mutex held by 3.

The reply to the read issued in (1) is queued behind the write reply
being processed in (4), so the cycle closes and everything deadlocks.
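
In short, the wait cycle closes like this:

    (1) read reply     is stuck behind   (4) the write reply
    (4) write reply    waits on          s_mutex, held by (3)
    (3) trace handling waits on          the eviction in (2)
    (2) eviction       waits on          the page lock held for (1)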

This patch changes the trace processing to call ceph_get_inode outside
of the s_mutex and snap_rwsem, which should break the cycle above.
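
Roughly, the ordering changes like this (an illustrative sketch only,
with error handling elided; the real change is in the diff below):

    /* before: ceph_fill_trace did the lookup under s_mutex */
    mutex_lock(&session->s_mutex);
    ...
    in = ceph_get_inode(sb, tvino); /* can block until a concurrent
                                       eviction of this inode completes */

    /* after: handle_reply does the lookup first, then takes s_mutex */
    in = ceph_get_inode(mdsc->fsc->sb, tvino);
    req->r_target_inode = in;
    mutex_lock(&session->s_mutex);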

URL: https://tracker.ceph.com/issues/47998
Reported-by: Geng Jichao <gengjichao@jd.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/inode.c      | 13 ++++---------
 fs/ceph/mds_client.c | 15 +++++++++++++++
 2 files changed, 19 insertions(+), 9 deletions(-)

v2: don't call ceph_get_inode if err < 0

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 88bbeb05bd27..9b85d86d8efb 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1315,15 +1315,10 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 	}
 
 	if (rinfo->head->is_target) {
-		tvino.ino = le64_to_cpu(rinfo->targeti.in->ino);
-		tvino.snap = le64_to_cpu(rinfo->targeti.in->snapid);
-
-		in = ceph_get_inode(sb, tvino);
-		if (IS_ERR(in)) {
-			err = PTR_ERR(in);
-			goto done;
-		}
+		/* Should be filled in by handle_reply */
+		BUG_ON(!req->r_target_inode);
 
+		in = req->r_target_inode;
 		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
 				NULL, session,
 				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
@@ -1333,13 +1328,13 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
 		if (err < 0) {
 			pr_err("ceph_fill_inode badness %p %llx.%llx\n",
 				in, ceph_vinop(in));
+			req->r_target_inode = NULL;
 			if (in->i_state & I_NEW)
 				discard_new_inode(in);
 			else
 				iput(in);
 			goto done;
 		}
-		req->r_target_inode = in;
 		if (in->i_state & I_NEW)
 			unlock_new_inode(in);
 	}
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b3c941503f8c..4fab37d858ce 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3179,6 +3179,21 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
 		err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
 	mutex_unlock(&mdsc->mutex);
 
+	/* Must find target inode outside of mutexes to avoid deadlocks */
+	if ((err >= 0) && rinfo->head->is_target) {
+		struct inode *in;
+		struct ceph_vino tvino = { .ino  = le64_to_cpu(rinfo->targeti.in->ino),
+					   .snap = le64_to_cpu(rinfo->targeti.in->snapid) };
+
+		in = ceph_get_inode(mdsc->fsc->sb, tvino);
+		if (IS_ERR(in)) {
+			err = PTR_ERR(in);
+			mutex_lock(&session->s_mutex);
+			goto out_err;
+		}
+		req->r_target_inode = in;
+	}
+
 	mutex_lock(&session->s_mutex);
 	if (err < 0) {
 		pr_err("mdsc_handle_reply got corrupt reply mds%d(tid:%lld)\n", mds, tid);
-- 
2.28.0



Subject: Re: [PATCH v2] ceph: when filling trace, call ceph_get_inode outside of mutexes
From: Luis Henriques @ 2020-11-13 13:46 UTC
  To: Jeff Layton; +Cc: ceph-devel, gengjichao

Jeff Layton <jlayton@kernel.org> writes:

> Geng Jichao reported a rather complex deadlock involving several
> moving parts:
>
> 1) readahead is issued against an inode and some of its pages are locked
>    while the read is in flight.
>
> 2) the same inode is evicted from the cache, and this task gets stuck
>    waiting for the page lock because of the above readahead.
>
> 3) another task is processing a reply trace, and looks up the inode
>    being evicted while holding the s_mutex. That ends up waiting for the
>    eviction to complete.
>
> 4) a write reply for an unrelated inode is then processed in the
>    ceph_con_workfn job. It calls ceph_check_caps after putting wrbuffer
>    caps, and that gets stuck waiting on the s_mutex held by 3.
>
> The reply to the read issued in (1) is queued behind the write reply
> being processed in (4), so the cycle closes and everything deadlocks.
>
> This patch changes the trace processing to call ceph_get_inode outside
> of the s_mutex and snap_rwsem, which should break the cycle above.
>
> URL: https://tracker.ceph.com/issues/47998
> Reported-by: Geng Jichao <gengjichao@jd.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/ceph/inode.c      | 13 ++++---------
>  fs/ceph/mds_client.c | 15 +++++++++++++++
>  2 files changed, 19 insertions(+), 9 deletions(-)
>
> v2: don't call ceph_get_inode if err < 0

Thanks, looks good (AFAICT!).

Reviewed-by: Luis Henriques <lhenriques@suse.de>

Cheers,
-- 
Luis

