All of lore.kernel.org
 help / color / mirror / Atom feed
From: Luis Henriques <lhenriques@suse.de>
To: Jeff Layton <jlayton@kernel.org>
Cc: ceph-devel@vger.kernel.org, gengjichao@jd.com
Subject: Re: [PATCH] ceph: when filling trace, do ceph_get_inode outside of mutexes
Date: Fri, 13 Nov 2020 10:40:01 +0000	[thread overview]
Message-ID: <87361d8rr2.fsf@suse.de> (raw)
In-Reply-To: <720728ce40c340d5db2a4c003f89483a68876fe5.camel@kernel.org> (Jeff Layton's message of "Thu, 12 Nov 2020 14:24:10 -0500")

Jeff Layton <jlayton@kernel.org> writes:

> On Thu, 2020-11-12 at 13:15 -0500, Jeff Layton wrote:
>> Geng Jichao reported a rather complex deadlock involving several
>> moving parts:
>> 
>> 1) readahead is issued against an inode and some of its pages are locked
>>    while the read is in flight
>> 
>> 2) the same inode is evicted from the cache, and this task gets stuck
>>    waiting for the page lock because of the above readahead
>> 
>> 3) another task is processing a reply trace, and looks up the inode
>>    being evicted while holding the s_mutex. That ends up waiting for the
>>    eviction to complete
>> 
>> 4) a write reply for an unrelated inode is then processed in the
>>    ceph_con_workfn job. It calls ceph_check_caps after putting wrbuffer
>>    caps, and that gets stuck waiting on the s_mutex held by 3.
>> 
>> The reply to "1" is stuck behind the write reply in "4", so we deadlock
>> at that point.
>> 
>> This patch changes the trace processing to call ceph_get_inode outside
>> of the s_mutex and snap_rwsem, which should break the cycle above.
>> 
>> URL: https://tracker.ceph.com/issues/47998
>> Reported-by: Geng Jichao <gengjichao@jd.com>
>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>> ---
>>  fs/ceph/inode.c      | 13 ++++---------
>>  fs/ceph/mds_client.c | 15 +++++++++++++++
>>  2 files changed, 19 insertions(+), 9 deletions(-)
>> 
>
> My belief is that this patch probably won't hurt anything and might fix
> this specific deadlock. I do however wonder if we might have a whole
> class of deadlocks of this nature. I can't quite prove it by inspection
> though, so testing with this would be a good datapoint.

Yeah, I keep staring at this and can't decide either.  But I agree with
you: this may or may not fix the issue, but shouldn't hurt.  Just a minor
comment bellow:

>> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
>> index 88bbeb05bd27..9b85d86d8efb 100644
>> --- a/fs/ceph/inode.c
>> +++ b/fs/ceph/inode.c
>> @@ -1315,15 +1315,10 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
>>  	}
>>  
>> 
>> 
>> 
>>  	if (rinfo->head->is_target) {
>> -		tvino.ino = le64_to_cpu(rinfo->targeti.in->ino);
>> -		tvino.snap = le64_to_cpu(rinfo->targeti.in->snapid);
>> -
>> -		in = ceph_get_inode(sb, tvino);
>> -		if (IS_ERR(in)) {
>> -			err = PTR_ERR(in);
>> -			goto done;
>> -		}
>> +		/* Should be filled in by handle_reply */
>> +		BUG_ON(!req->r_target_inode);
>>  
>> 
>> 
>> 
>> +		in = req->r_target_inode;
>>  		err = ceph_fill_inode(in, req->r_locked_page, &rinfo->targeti,
>>  				NULL, session,
>>  				(!test_bit(CEPH_MDS_R_ABORTED, &req->r_req_flags) &&
>> @@ -1333,13 +1328,13 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req)
>>  		if (err < 0) {
>>  			pr_err("ceph_fill_inode badness %p %llx.%llx\n",
>>  				in, ceph_vinop(in));
>> +			req->r_target_inode = NULL;
>>  			if (in->i_state & I_NEW)
>>  				discard_new_inode(in);
>>  			else
>>  				iput(in);
>>  			goto done;
>>  		}
>> -		req->r_target_inode = in;
>>  		if (in->i_state & I_NEW)
>>  			unlock_new_inode(in);
>>  	}
>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>> index b3c941503f8c..48db1f593d81 100644
>> --- a/fs/ceph/mds_client.c
>> +++ b/fs/ceph/mds_client.c
>> @@ -3179,6 +3179,21 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>>  		err = parse_reply_info(session, msg, rinfo, session->s_con.peer_features);
>>  	mutex_unlock(&mdsc->mutex);
>>  
>> 
>> 
>> 
>> +	/* Must find target inode outside of mutexes to avoid deadlocks */
>> +	if (rinfo->head->is_target) {

This condition should probably be

	if ((err >= 0) && rinfo->head->is_target) {

This would allow a faster exit path and eventually avoid masking this
'err' value with another one from ceph_get_inode().

Cheers,
-- 
Luis

>> +		struct inode *in;
>> +		struct ceph_vino tvino = { .ino  = le64_to_cpu(rinfo->targeti.in->ino),
>> +					   .snap = le64_to_cpu(rinfo->targeti.in->snapid) };
>> +
>> +		in = ceph_get_inode(mdsc->fsc->sb, tvino);
>> +		if (IS_ERR(in)) {
>> +			err = PTR_ERR(in);
>> +			mutex_lock(&session->s_mutex);
>> +			goto out_err;
>> +		}
>> +		req->r_target_inode = in;
>> +	}
>> +
>>  	mutex_lock(&session->s_mutex);
>>  	if (err < 0) {
>>  		pr_err("mdsc_handle_reply got corrupt reply mds%d(tid:%lld)\n", mds, tid);
>
> -- 
> Jeff Layton <jlayton@kernel.org>
>

  reply	other threads:[~2020-11-13 10:39 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-12 18:15 [PATCH] ceph: when filling trace, do ceph_get_inode outside of mutexes Jeff Layton
2020-11-12 19:24 ` Jeff Layton
2020-11-13 10:40   ` Luis Henriques [this message]
2020-11-13 12:13     ` Jeff Layton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87361d8rr2.fsf@suse.de \
    --to=lhenriques@suse.de \
    --cc=ceph-devel@vger.kernel.org \
    --cc=gengjichao@jd.com \
    --cc=jlayton@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.