From: Viacheslav Dubeyko <vdubeyko@redhat.com>
To: Alex Markuze <amarkuze@redhat.com>, ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com
Subject: Re: [EXTERNAL] [PATCH v3 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset
Date: Wed, 29 Apr 2026 14:22:56 -0700 [thread overview]
Message-ID: <f3ff03de8cadd533fbe71f271f61890e856f3b06.camel@redhat.com> (raw)
In-Reply-To: <20260429125206.1512203-4-amarkuze@redhat.com>
On Wed, 2026-04-29 at 12:51 +0000, Alex Markuze wrote:
> Change send_mds_reconnect() to return an error code so callers can detect
> and report reconnect failures instead of silently ignoring them. Add early
> bailout checks for sessions that are already closed, rejected, or
> unregistered, which avoids sending reconnect messages for sessions that
> can no longer be recovered.
>
> The early -ESTALE and -ENOENT bailouts use a separate fail_return label
> that skips the pr_err_client diagnostic, since these codes indicate
> expected concurrent-teardown races rather than genuine reconnect build
> failures.
>
> Move the "reconnect start" log after the early-bailout checks so it
> only appears for sessions that actually proceed with reconnect.
>
> Save the prior session state before transitioning to RECONNECTING,
> and restore it in the failure path. Without this, a transient
> build or encoding failure (-ENOMEM, -ENOSPC) strands the session
> in RECONNECTING indefinitely because check_new_map() only retries
> sessions in RESTARTING state.
>
> Rewrite mds_peer_reset() to handle the case where the MDS is past its
> RECONNECT phase (i.e. active). An active MDS rejects CLIENT_RECONNECT
> messages because it only accepts them during its own RECONNECT window
> after restart. Previously, the client would send a doomed reconnect
> that the MDS would reject or ignore. Now, the client tears the session
> down locally and lets new requests re-open a fresh session, which is
> the correct recovery for this scenario. The RECONNECTING state is
> handled on the same teardown path, since the MDS will reject reconnect
> attempts from an active client regardless of the session's local state.
>
> Add explicit cases for CLOSED and REJECTED session states in
> mds_peer_reset() since these are terminal states where a connection
> drop is expected behavior.
>
> The session teardown path in mds_peer_reset() follows the established
> drop-and-reacquire locking pattern from check_new_map(): take
> mdsc->mutex for session unregistration, release it, then take s->s_mutex
> separately for cleanup. This avoids introducing a new simultaneous lock
> nesting pattern.
>
> Log reconnect failures from check_new_map() and mds_peer_reset() at
> pr_warn level rather than pr_err, since return codes like -ESTALE
> (closed/rejected session) and -ENOENT (unregistered session) are
> expected during concurrent teardown. Log dropped messages for
> unregistered sessions via doutc() (dynamic debug) rather than
> pr_info, as post-reset message arrival is routine and does not
> warrant unconditional logging.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> fs/ceph/mds_client.c | 169 +++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 155 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 871f0eef468d..b62abae72c4c 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -4416,9 +4416,14 @@ static void handle_session(struct ceph_mds_session *session,
> break;
>
> case CEPH_SESSION_REJECT:
> - WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING);
> - pr_info_client(cl, "mds%d rejected session\n",
> - session->s_mds);
> + WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING &&
> + session->s_state != CEPH_MDS_SESSION_RECONNECTING);
> + if (session->s_state == CEPH_MDS_SESSION_RECONNECTING)
> + pr_info_client(cl, "mds%d reconnect rejected\n",
> + session->s_mds);
> + else
> + pr_info_client(cl, "mds%d rejected session\n",
> + session->s_mds);
> session->s_state = CEPH_MDS_SESSION_REJECTED;
> cleanup_session_requests(mdsc, session);
> remove_session_caps(session);
> @@ -4678,6 +4683,14 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
> cap->mseq = 0; /* and migrate_seq */
> cap->cap_gen = atomic_read(&cap->session->s_cap_gen);
>
> + /*
> + * Note: CEPH_I_ERROR_FILELOCK is not set during reconnect.
> + * Instead, locks are submitted for best-effort MDS reclaim
> + * via the flock_len field below. If reclaim fails (e.g.,
> + * another client grabbed a conflicting lock), future lock
> + * operations will fail and set the error flag at that point.
> + */
> +
> /* These are lost when the session goes away */
> if (S_ISDIR(inode->i_mode)) {
> if (cap->issued & CEPH_CAP_DIR_CREATE) {
> @@ -4892,20 +4905,19 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
> *
> * This is a relatively heavyweight operation, but it's rare.
> */
> -static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> - struct ceph_mds_session *session)
> +static int send_mds_reconnect(struct ceph_mds_client *mdsc,
> + struct ceph_mds_session *session)
> {
> struct ceph_client *cl = mdsc->fsc->client;
> struct ceph_msg *reply;
> int mds = session->s_mds;
> int err = -ENOMEM;
> + int old_state;
> struct ceph_reconnect_state recon_state = {
> .session = session,
> };
> LIST_HEAD(dispose);
>
> - pr_info_client(cl, "mds%d reconnect start\n", mds);
> -
> recon_state.pagelist = ceph_pagelist_alloc(GFP_NOFS);
> if (!recon_state.pagelist)
> goto fail_nopagelist;
> @@ -4917,6 +4929,32 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> xa_destroy(&session->s_delegated_inos);
Is xa_destroy(&session->s_delegated_inos) thread-safe here? If an in-flight
async create is midway through ceph_get_deleg_ino() (which does xa_for_each +
xa_erase) while xa_destroy() runs on the same XArray from a different thread,
that is unsafe. What do you think?
>
> mutex_lock(&session->s_mutex);
> + if (session->s_state == CEPH_MDS_SESSION_CLOSED ||
> + session->s_state == CEPH_MDS_SESSION_REJECTED) {
> + pr_info_client(cl, "mds%d skipping reconnect, session %s\n",
> + mds,
> + ceph_session_state_name(session->s_state));
> + mutex_unlock(&session->s_mutex);
> + ceph_msg_put(reply);
> + err = -ESTALE;
> + goto fail_return;
> + }
> +
> + mutex_lock(&mdsc->mutex);
I started wondering whether it is safe to call mutex_lock(&mdsc->mutex) while
holding session->s_mutex. Could this introduce a potential deadlock?
> + if (mds >= mdsc->max_sessions || mdsc->sessions[mds] != session) {
> + mutex_unlock(&mdsc->mutex);
> + pr_info_client(cl,
> + "mds%d skipping reconnect, session unregistered\n",
> + mds);
> + mutex_unlock(&session->s_mutex);
> + ceph_msg_put(reply);
> + err = -ENOENT;
> + goto fail_return;
> + }
> + mutex_unlock(&mdsc->mutex);
> +
> + pr_info_client(cl, "mds%d reconnect start\n", mds);
> + old_state = session->s_state;
> session->s_state = CEPH_MDS_SESSION_RECONNECTING;
> session->s_seq = 0;
>
> @@ -5046,18 +5084,34 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
>
> up_read(&mdsc->snap_rwsem);
> ceph_pagelist_release(recon_state.pagelist);
> - return;
> + return 0;
>
> fail:
> ceph_msg_put(reply);
> up_read(&mdsc->snap_rwsem);
> + /*
> + * Restore prior session state so map-driven reconnect logic
> + * (check_new_map) can retry. Without this, a transient build
> + * failure strands the session in RECONNECTING indefinitely.
> + */
> + session->s_state = old_state;
> mutex_unlock(&session->s_mutex);
> fail_nomsg:
> ceph_pagelist_release(recon_state.pagelist);
> fail_nopagelist:
> pr_err_client(cl, "error %d preparing reconnect for mds%d\n",
> err, mds);
> - return;
> + return err;
> +
> +fail_return:
> + /*
> + * Early-exit path for expected concurrent-teardown races
> + * (-ESTALE for closed/rejected sessions, -ENOENT for
> + * unregistered sessions). Skip the pr_err_client diagnostic
> + * since these are not genuine reconnect build failures.
> + */
> + ceph_pagelist_release(recon_state.pagelist);
> + return err;
> }
>
>
> @@ -5138,9 +5192,15 @@ static void check_new_map(struct ceph_mds_client *mdsc,
> */
> if (s->s_state == CEPH_MDS_SESSION_RESTARTING &&
> newstate >= CEPH_MDS_STATE_RECONNECT) {
> + int rc;
> +
> mutex_unlock(&mdsc->mutex);
> clear_bit(i, targets);
> - send_mds_reconnect(mdsc, s);
> + rc = send_mds_reconnect(mdsc, s);
> + if (rc)
> + pr_warn_client(cl,
> + "mds%d reconnect failed: %d\n",
> + i, rc);
> mutex_lock(&mdsc->mutex);
> }
>
> @@ -5204,7 +5264,11 @@ static void check_new_map(struct ceph_mds_client *mdsc,
> }
> doutc(cl, "send reconnect to export target mds.%d\n", i);
> mutex_unlock(&mdsc->mutex);
> - send_mds_reconnect(mdsc, s);
> + err = send_mds_reconnect(mdsc, s);
> + if (err)
> + pr_warn_client(cl,
> + "mds%d export target reconnect failed: %d\n",
> + i, err);
> ceph_put_mds_session(s);
> mutex_lock(&mdsc->mutex);
> }
> @@ -6284,12 +6348,87 @@ static void mds_peer_reset(struct ceph_connection *con)
> {
> struct ceph_mds_session *s = con->private;
> struct ceph_mds_client *mdsc = s->s_mdsc;
> + int session_state;
>
> pr_warn_client(mdsc->fsc->client, "mds%d closed our session\n",
> s->s_mds);
> - if (READ_ONCE(mdsc->fsc->mount_state) != CEPH_MOUNT_FENCE_IO &&
> - ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) >= CEPH_MDS_STATE_RECONNECT)
> - send_mds_reconnect(mdsc, s);
> +
> + if (READ_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_FENCE_IO ||
> + ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) < CEPH_MDS_STATE_RECONNECT)
> + return;
> +
> + if (ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) == CEPH_MDS_STATE_RECONNECT) {
We had >= CEPH_MDS_STATE_RECONNECT before. When an MDS restarts, it moves
through these states in order:
RECONNECT (10) -> waits for clients to reconnect
REJOIN (11) -> rejoins the distributed MDS cache
CLIENTREPLAY (12) -> replays in-flight client operations
ACTIVE (13) -> fully operational
If the connection dropped while the MDS was in any of those four states, the
client used to send a CLIENT_RECONNECT message. With this change, REJOIN (11)
and CLIENTREPLAY (12) fall through to the teardown path, not the reconnect
path.
Thanks,
Slava.
> + int rc = send_mds_reconnect(mdsc, s);
> +
> + if (rc)
> + pr_warn_client(mdsc->fsc->client,
> + "mds%d reconnect failed: %d\n",
> + s->s_mds, rc);
> + return;
> + }
> +
> + /*
> + * MDS is active (past RECONNECT). It will not accept a
> + * CLIENT_RECONNECT from us, so tear the session down locally
> + * and let new requests re-open a fresh session.
> + *
> + * Snapshot session state with READ_ONCE, then revalidate under
> + * mdsc->mutex before acting. The subsequent mdsc->mutex
> + * section rechecks s_state to catch concurrent transitions, so
> + * the lockless snapshot here is safe. s->s_mutex is taken
> + * separately for cleanup after unregistration, which avoids
> + * introducing a new s->s_mutex + mdsc->mutex nesting.
> + */
> + session_state = READ_ONCE(s->s_state);
> +
> + switch (session_state) {
> + case CEPH_MDS_SESSION_RESTARTING:
> + case CEPH_MDS_SESSION_RECONNECTING:
> + case CEPH_MDS_SESSION_CLOSING:
> + case CEPH_MDS_SESSION_OPEN:
> + case CEPH_MDS_SESSION_HUNG:
> + case CEPH_MDS_SESSION_OPENING:
> + mutex_lock(&mdsc->mutex);
> + if (s->s_mds >= mdsc->max_sessions ||
> + mdsc->sessions[s->s_mds] != s ||
> + s->s_state != session_state) {
> + pr_info_client(mdsc->fsc->client,
> + "mds%d state changed to %s during peer reset\n",
> + s->s_mds,
> + ceph_session_state_name(s->s_state));
> + mutex_unlock(&mdsc->mutex);
> + return;
> + }
> +
> + ceph_get_mds_session(s);
> + s->s_state = CEPH_MDS_SESSION_CLOSED;
> + __unregister_session(mdsc, s);
> + __wake_requests(mdsc, &s->s_waiting);
> + mutex_unlock(&mdsc->mutex);
> +
> + mutex_lock(&s->s_mutex);
> + cleanup_session_requests(mdsc, s);
> + remove_session_caps(s);
> + mutex_unlock(&s->s_mutex);
> +
> + wake_up_all(&mdsc->session_close_wq);
> +
> + mutex_lock(&mdsc->mutex);
> + kick_requests(mdsc, s->s_mds);
> + mutex_unlock(&mdsc->mutex);
> +
> + ceph_put_mds_session(s);
> + break;
> + case CEPH_MDS_SESSION_CLOSED:
> + case CEPH_MDS_SESSION_REJECTED:
> + break;
> + default:
> + pr_warn_client(mdsc->fsc->client,
> + "mds%d peer reset in unexpected state %s\n",
> + s->s_mds,
> + ceph_session_state_name(session_state));
> + break;
> + }
> }
>
> static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
> @@ -6301,6 +6440,8 @@ static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
>
> mutex_lock(&mdsc->mutex);
> if (__verify_registered_session(mdsc, s) < 0) {
> + doutc(cl, "dropping tid %llu from unregistered session %d\n",
> + le64_to_cpu(msg->hdr.tid), s->s_mds);
> mutex_unlock(&mdsc->mutex);
> goto out;
> }
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
2026-04-29 12:51 ` [PATCH v3 01/11] ceph: convert inode flags to named bit positions and atomic bitops Alex Markuze
2026-04-29 19:31 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:51 ` [PATCH v3 02/11] ceph: use proper endian conversion for flock_len in reconnect Alex Markuze
2026-04-29 12:51 ` [PATCH v3 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset Alex Markuze
2026-04-29 21:22 ` Viacheslav Dubeyko [this message]
2026-04-29 12:51 ` [PATCH v3 04/11] ceph: add diagnostic timeout loop to wait_caps_flush() Alex Markuze
2026-04-29 21:41 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:52 ` [PATCH v3 05/11] ceph: add client reset state machine and session teardown Alex Markuze
2026-04-29 22:29 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:52 ` [PATCH v3 06/11] ceph: add manual reset debugfs control and tracepoints Alex Markuze
2026-04-30 18:38 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:52 ` [PATCH v3 07/11] selftests: ceph: add reset consistency checker Alex Markuze
2026-04-29 12:52 ` [PATCH v3 08/11] selftests: ceph: add reset stress test Alex Markuze
2026-04-29 12:52 ` [PATCH v3 09/11] selftests: ceph: add reset corner-case tests Alex Markuze
2026-04-29 12:52 ` [PATCH v3 10/11] selftests: ceph: add validation harness Alex Markuze
2026-04-29 12:52 ` [PATCH v3 11/11] selftests: ceph: wire up Ceph reset kselftests and documentation Alex Markuze