From: Viacheslav Dubeyko <vdubeyko@redhat.com>
To: Alex Markuze <amarkuze@redhat.com>, ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com
Subject: Re: [EXTERNAL] [PATCH v3 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset
Date: Wed, 29 Apr 2026 14:22:56 -0700 [thread overview]
Message-ID: <f3ff03de8cadd533fbe71f271f61890e856f3b06.camel@redhat.com> (raw)
In-Reply-To: <20260429125206.1512203-4-amarkuze@redhat.com>
On Wed, 2026-04-29 at 12:51 +0000, Alex Markuze wrote:
> Change send_mds_reconnect() to return an error code so callers can detect
> and report reconnect failures instead of silently ignoring them. Add early
> bailout checks for sessions that are already closed, rejected, or
> unregistered, which avoids sending reconnect messages for sessions that
> can no longer be recovered.
>
> The early -ESTALE and -ENOENT bailouts use a separate fail_return label
> that skips the pr_err_client diagnostic, since these codes indicate
> expected concurrent-teardown races rather than genuine reconnect build
> failures.
>
> Move the "reconnect start" log after the early-bailout checks so it
> only appears for sessions that actually proceed with reconnect.
>
> Save the prior session state before transitioning to RECONNECTING,
> and restore it in the failure path. Without this, a transient
> build or encoding failure (-ENOMEM, -ENOSPC) strands the session
> in RECONNECTING indefinitely because check_new_map() only retries
> sessions in RESTARTING state.
>
> Rewrite mds_peer_reset() to handle the case where the MDS is past its
> RECONNECT phase (i.e. active). An active MDS rejects CLIENT_RECONNECT
> messages because it only accepts them during its own RECONNECT window
> after restart. Previously, the client would send a doomed reconnect
> that the MDS would reject or ignore. Now, the client tears the session
> down locally and lets new requests re-open a fresh session, which is
> the correct recovery for this scenario. The RECONNECTING state is
> handled on the same teardown path, since the MDS will reject reconnect
> attempts from an active client regardless of the session's local state.
>
> Add explicit cases for CLOSED and REJECTED session states in
> mds_peer_reset() since these are terminal states where a connection
> drop is expected behavior.
>
> The session teardown path in mds_peer_reset() follows the established
> drop-and-reacquire locking pattern from check_new_map(): take
> mdsc->mutex for session unregistration, release it, then take s->s_mutex
> separately for cleanup. This avoids introducing a new simultaneous lock
> nesting pattern.
>
> Log reconnect failures from check_new_map() and mds_peer_reset() at
> pr_warn level rather than pr_err, since return codes like -ESTALE
> (closed/rejected session) and -ENOENT (unregistered session) are
> expected during concurrent teardown. Log dropped messages for
> unregistered sessions via doutc() (dynamic debug) rather than
> pr_info, as post-reset message arrival is routine and does not
> warrant unconditional logging.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> fs/ceph/mds_client.c | 169 +++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 155 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 871f0eef468d..b62abae72c4c 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -4416,9 +4416,14 @@ static void handle_session(struct ceph_mds_session *session,
> break;
>
> case CEPH_SESSION_REJECT:
> - WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING);
> - pr_info_client(cl, "mds%d rejected session\n",
> - session->s_mds);
> + WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING &&
> + session->s_state != CEPH_MDS_SESSION_RECONNECTING);
> + if (session->s_state == CEPH_MDS_SESSION_RECONNECTING)
> + pr_info_client(cl, "mds%d reconnect rejected\n",
> + session->s_mds);
> + else
> + pr_info_client(cl, "mds%d rejected session\n",
> + session->s_mds);
> session->s_state = CEPH_MDS_SESSION_REJECTED;
> cleanup_session_requests(mdsc, session);
> remove_session_caps(session);
> @@ -4678,6 +4683,14 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
> cap->mseq = 0; /* and migrate_seq */
> cap->cap_gen = atomic_read(&cap->session->s_cap_gen);
>
> + /*
> + * Note: CEPH_I_ERROR_FILELOCK is not set during reconnect.
> + * Instead, locks are submitted for best-effort MDS reclaim
> + * via the flock_len field below. If reclaim fails (e.g.,
> + * another client grabbed a conflicting lock), future lock
> + * operations will fail and set the error flag at that point.
> + */
> +
> /* These are lost when the session goes away */
> if (S_ISDIR(inode->i_mode)) {
> if (cap->issued & CEPH_CAP_DIR_CREATE) {
> @@ -4892,20 +4905,19 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
> *
> * This is a relatively heavyweight operation, but it's rare.
> */
> -static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> - struct ceph_mds_session *session)
> +static int send_mds_reconnect(struct ceph_mds_client *mdsc,
> + struct ceph_mds_session *session)
> {
> struct ceph_client *cl = mdsc->fsc->client;
> struct ceph_msg *reply;
> int mds = session->s_mds;
> int err = -ENOMEM;
> + int old_state;
> struct ceph_reconnect_state recon_state = {
> .session = session,
> };
> LIST_HEAD(dispose);
>
> - pr_info_client(cl, "mds%d reconnect start\n", mds);
> -
> recon_state.pagelist = ceph_pagelist_alloc(GFP_NOFS);
> if (!recon_state.pagelist)
> goto fail_nopagelist;
> @@ -4917,6 +4929,32 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> xa_destroy(&session->s_delegated_inos);
Is xa_destroy(&session->s_delegated_inos) thread-safe here? If an in-flight
async create is midway through ceph_get_deleg_ino() (which does xa_for_each +
xa_erase) while xa_destroy() runs on the same XArray from a different thread,
that is unsafe. What do you think?
>
> mutex_lock(&session->s_mutex);
> + if (session->s_state == CEPH_MDS_SESSION_CLOSED ||
> + session->s_state == CEPH_MDS_SESSION_REJECTED) {
> + pr_info_client(cl, "mds%d skipping reconnect, session %s\n",
> + mds,
> + ceph_session_state_name(session->s_state));
> + mutex_unlock(&session->s_mutex);
> + ceph_msg_put(reply);
> + err = -ESTALE;
> + goto fail_return;
> + }
> +
> + mutex_lock(&mdsc->mutex);
I started wondering whether it is safe to call mutex_lock(&mdsc->mutex) while
holding session->s_mutex. Could this introduce a potential deadlock?
> + if (mds >= mdsc->max_sessions || mdsc->sessions[mds] != session) {
> + mutex_unlock(&mdsc->mutex);
> + pr_info_client(cl,
> + "mds%d skipping reconnect, session unregistered\n",
> + mds);
> + mutex_unlock(&session->s_mutex);
> + ceph_msg_put(reply);
> + err = -ENOENT;
> + goto fail_return;
> + }
> + mutex_unlock(&mdsc->mutex);
> +
> + pr_info_client(cl, "mds%d reconnect start\n", mds);
> + old_state = session->s_state;
> session->s_state = CEPH_MDS_SESSION_RECONNECTING;
> session->s_seq = 0;
>
> @@ -5046,18 +5084,34 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
>
> up_read(&mdsc->snap_rwsem);
> ceph_pagelist_release(recon_state.pagelist);
> - return;
> + return 0;
>
> fail:
> ceph_msg_put(reply);
> up_read(&mdsc->snap_rwsem);
> + /*
> + * Restore prior session state so map-driven reconnect logic
> + * (check_new_map) can retry. Without this, a transient build
> + * failure strands the session in RECONNECTING indefinitely.
> + */
> + session->s_state = old_state;
> mutex_unlock(&session->s_mutex);
> fail_nomsg:
> ceph_pagelist_release(recon_state.pagelist);
> fail_nopagelist:
> pr_err_client(cl, "error %d preparing reconnect for mds%d\n",
> err, mds);
> - return;
> + return err;
> +
> +fail_return:
> + /*
> + * Early-exit path for expected concurrent-teardown races
> + * (-ESTALE for closed/rejected sessions, -ENOENT for
> + * unregistered sessions). Skip the pr_err_client diagnostic
> + * since these are not genuine reconnect build failures.
> + */
> + ceph_pagelist_release(recon_state.pagelist);
> + return err;
> }
>
>
> @@ -5138,9 +5192,15 @@ static void check_new_map(struct ceph_mds_client *mdsc,
> */
> if (s->s_state == CEPH_MDS_SESSION_RESTARTING &&
> newstate >= CEPH_MDS_STATE_RECONNECT) {
> + int rc;
> +
> mutex_unlock(&mdsc->mutex);
> clear_bit(i, targets);
> - send_mds_reconnect(mdsc, s);
> + rc = send_mds_reconnect(mdsc, s);
> + if (rc)
> + pr_warn_client(cl,
> + "mds%d reconnect failed: %d\n",
> + i, rc);
> mutex_lock(&mdsc->mutex);
> }
>
> @@ -5204,7 +5264,11 @@ static void check_new_map(struct ceph_mds_client *mdsc,
> }
> doutc(cl, "send reconnect to export target mds.%d\n", i);
> mutex_unlock(&mdsc->mutex);
> - send_mds_reconnect(mdsc, s);
> + err = send_mds_reconnect(mdsc, s);
> + if (err)
> + pr_warn_client(cl,
> + "mds%d export target reconnect failed: %d\n",
> + i, err);
> ceph_put_mds_session(s);
> mutex_lock(&mdsc->mutex);
> }
> @@ -6284,12 +6348,87 @@ static void mds_peer_reset(struct ceph_connection *con)
> {
> struct ceph_mds_session *s = con->private;
> struct ceph_mds_client *mdsc = s->s_mdsc;
> + int session_state;
>
> pr_warn_client(mdsc->fsc->client, "mds%d closed our session\n",
> s->s_mds);
> - if (READ_ONCE(mdsc->fsc->mount_state) != CEPH_MOUNT_FENCE_IO &&
> - ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) >= CEPH_MDS_STATE_RECONNECT)
> - send_mds_reconnect(mdsc, s);
> +
> + if (READ_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_FENCE_IO ||
> + ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) < CEPH_MDS_STATE_RECONNECT)
> + return;
> +
> + if (ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) == CEPH_MDS_STATE_RECONNECT) {
We had >= CEPH_MDS_STATE_RECONNECT before. When an MDS restarts, it moves
through these states in order:
RECONNECT (10) -> waits for clients to reconnect
REJOIN (11) -> rejoins the distributed MDS cache
CLIENTREPLAY (12) -> replays in-flight client operations
ACTIVE (13) -> fully operational
If the connection dropped while the MDS was in any of those four states, the
client used to send a CLIENT_RECONNECT message. With this change, REJOIN (11)
and CLIENTREPLAY (12) fall through to the teardown path, not the reconnect
path.
Thanks,
Slava.
> + int rc = send_mds_reconnect(mdsc, s);
> +
> + if (rc)
> + pr_warn_client(mdsc->fsc->client,
> + "mds%d reconnect failed: %d\n",
> + s->s_mds, rc);
> + return;
> + }
> +
> + /*
> + * MDS is active (past RECONNECT). It will not accept a
> + * CLIENT_RECONNECT from us, so tear the session down locally
> + * and let new requests re-open a fresh session.
> + *
> + * Snapshot session state with READ_ONCE, then revalidate under
> + * mdsc->mutex before acting. The subsequent mdsc->mutex
> + * section rechecks s_state to catch concurrent transitions, so
> + * the lockless snapshot here is safe. s->s_mutex is taken
> + * separately for cleanup after unregistration, which avoids
> + * introducing a new s->s_mutex + mdsc->mutex nesting.
> + */
> + session_state = READ_ONCE(s->s_state);
> +
> + switch (session_state) {
> + case CEPH_MDS_SESSION_RESTARTING:
> + case CEPH_MDS_SESSION_RECONNECTING:
> + case CEPH_MDS_SESSION_CLOSING:
> + case CEPH_MDS_SESSION_OPEN:
> + case CEPH_MDS_SESSION_HUNG:
> + case CEPH_MDS_SESSION_OPENING:
> + mutex_lock(&mdsc->mutex);
> + if (s->s_mds >= mdsc->max_sessions ||
> + mdsc->sessions[s->s_mds] != s ||
> + s->s_state != session_state) {
> + pr_info_client(mdsc->fsc->client,
> + "mds%d state changed to %s during peer reset\n",
> + s->s_mds,
> + ceph_session_state_name(s->s_state));
> + mutex_unlock(&mdsc->mutex);
> + return;
> + }
> +
> + ceph_get_mds_session(s);
> + s->s_state = CEPH_MDS_SESSION_CLOSED;
> + __unregister_session(mdsc, s);
> + __wake_requests(mdsc, &s->s_waiting);
> + mutex_unlock(&mdsc->mutex);
> +
> + mutex_lock(&s->s_mutex);
> + cleanup_session_requests(mdsc, s);
> + remove_session_caps(s);
> + mutex_unlock(&s->s_mutex);
> +
> + wake_up_all(&mdsc->session_close_wq);
> +
> + mutex_lock(&mdsc->mutex);
> + kick_requests(mdsc, s->s_mds);
> + mutex_unlock(&mdsc->mutex);
> +
> + ceph_put_mds_session(s);
> + break;
> + case CEPH_MDS_SESSION_CLOSED:
> + case CEPH_MDS_SESSION_REJECTED:
> + break;
> + default:
> + pr_warn_client(mdsc->fsc->client,
> + "mds%d peer reset in unexpected state %s\n",
> + s->s_mds,
> + ceph_session_state_name(session_state));
> + break;
> + }
> }
>
> static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
> @@ -6301,6 +6440,8 @@ static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
>
> mutex_lock(&mdsc->mutex);
> if (__verify_registered_session(mdsc, s) < 0) {
> + doutc(cl, "dropping tid %llu from unregistered session %d\n",
> + le64_to_cpu(msg->hdr.tid), s->s_mds);
> mutex_unlock(&mdsc->mutex);
> goto out;
> }
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
2026-04-29 12:51 ` [PATCH v3 01/11] ceph: convert inode flags to named bit positions and atomic bitops Alex Markuze
2026-04-29 19:31 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:51 ` [PATCH v3 02/11] ceph: use proper endian conversion for flock_len in reconnect Alex Markuze
2026-04-29 12:51 ` [PATCH v3 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset Alex Markuze
2026-04-29 21:22 ` Viacheslav Dubeyko [this message]
2026-04-29 12:51 ` [PATCH v3 04/11] ceph: add diagnostic timeout loop to wait_caps_flush() Alex Markuze
2026-04-29 21:41 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:52 ` [PATCH v3 05/11] ceph: add client reset state machine and session teardown Alex Markuze
2026-04-29 22:29 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:52 ` [PATCH v3 06/11] ceph: add manual reset debugfs control and tracepoints Alex Markuze
2026-04-30 18:38 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:52 ` [PATCH v3 07/11] selftests: ceph: add reset consistency checker Alex Markuze
2026-04-29 12:52 ` [PATCH v3 08/11] selftests: ceph: add reset stress test Alex Markuze
2026-04-29 12:52 ` [PATCH v3 09/11] selftests: ceph: add reset corner-case tests Alex Markuze
2026-04-29 12:52 ` [PATCH v3 10/11] selftests: ceph: add validation harness Alex Markuze
2026-04-29 12:52 ` [PATCH v3 11/11] selftests: ceph: wire up Ceph reset kselftests and documentation Alex Markuze