* [PATCH v3 01/11] ceph: convert inode flags to named bit positions and atomic bitops
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
@ 2026-04-29 12:51 ` Alex Markuze
2026-04-29 19:31 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:51 ` [PATCH v3 02/11] ceph: use proper endian conversion for flock_len in reconnect Alex Markuze
` (9 subsequent siblings)
10 siblings, 1 reply; 17+ messages in thread
From: Alex Markuze @ 2026-04-29 12:51 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Define named bit-position constants for all CEPH_I_* inode flags and
derive the bitmask values from them. This gives every flag a named
_BIT constant usable with the test_bit/set_bit/clear_bit family.
The intentionally unused bit position 1 is documented inline.
Convert all flag modifications to use atomic bitops (set_bit,
clear_bit, test_and_clear_bit). The previous code mixed lockless
atomic ops on some flags (ERROR_WRITE, ODIRECT) with non-atomic
read-modify-write (|= / &= ~) on other flags sharing the same
unsigned long. A concurrent non-atomic RMW can clobber an
adjacent lockless atomic update -- for example, a lockless
clear_bit(ERROR_WRITE) could be silently resurrected by a
concurrent ci->i_ceph_flags |= CEPH_I_FLUSH under the spinlock.
Using atomic bitops for all modifications eliminates this class
of race entirely.
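For illustration, the clobbering needs only a single overlap (the
interleaving below is sketched for this commit message, not code
from the tree; ERROR_WRITE is bit 9 per super.h):

  CPU A (lockless)                   CPU B (under i_ceph_lock)
  ----------------                   -------------------------
                                     tmp = ci->i_ceph_flags;  /* bit 9 set */
  clear_bit(CEPH_I_ERROR_WRITE_BIT,
            &ci->i_ceph_flags);      /* bit 9 cleared atomically */
                                     tmp |= CEPH_I_FLUSH;
                                     ci->i_ceph_flags = tmp;  /* bit 9 set again */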
Flags whose only users are now the _BIT form (ERROR_WRITE,
ASYNC_CHECK_CAPS) have their old mask defines removed to document
that callers must use the _BIT constant with the set_bit/test_bit
family. ERROR_FILELOCK and SHUTDOWN retain their mask defines
because they are still used via bitmask tests in lockless readers
(ceph_inode_is_shutdown, reconnect_caps_cb).
Flag reads under i_ceph_lock continue to use non-atomic bitmask
tests where the tested flag is only modified under the same lock;
this is safe and unchanged because the lock serialises both the
read and the write.
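For example, the converted ceph_check_caps() hunk below keeps exactly
this split -- a plain mask test for the locked read, an atomic bitop
for the write:

  spin_lock(&ci->i_ceph_lock);
  if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
          set_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT, &ci->i_ceph_flags);
          ...
  }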
The lockless reader ceph_inode_is_shutdown() retains the READ_ONCE()
snapshot plus bitmask test pattern -- the single atomic load into a
local variable is correct and avoids a second memory access that
test_bit() would require. It now uses the named CEPH_I_SHUTDOWN
mask constant instead of an inline BIT().
The direct assignment in ceph_finish_async_create() is converted
from i_ceph_flags = CEPH_I_ASYNC_CREATE to set_bit(). This
inode is I_NEW at this point -- still invisible to other threads
and guaranteed to have zero flags from alloc_inode -- so either
form is safe, but set_bit() keeps the conversion uniform.
The only remaining direct assignment (alloc_inode zeroing) operates
on an inode that is not yet visible to other threads, so it is safe
without atomic ops.
The dead precomputed flags variable in ceph_pool_perm_check() is
removed; the check: loop re-reads flags from i_ceph_flags after
the set_bit() calls, keeping a single source of truth.
Co-developed-by: Viacheslav Dubeyko <vdubeyko@redhat.com>
Signed-off-by: Viacheslav Dubeyko <vdubeyko@redhat.com>
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/addr.c | 17 ++++++------
fs/ceph/caps.c | 24 ++++++++---------
fs/ceph/file.c | 13 ++++-----
fs/ceph/inode.c | 4 +--
fs/ceph/locks.c | 22 ++++-----------
fs/ceph/mds_client.c | 3 ++-
fs/ceph/mds_client.h | 2 +-
fs/ceph/snap.c | 2 +-
fs/ceph/super.h | 64 +++++++++++++++++++++++---------------------
fs/ceph/xattr.c | 2 +-
10 files changed, 72 insertions(+), 81 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 2090fc78529c..35c5fdb5a448 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -2583,20 +2583,19 @@ int ceph_pool_perm_check(struct inode *inode, int need)
if (ret < 0)
return ret;
- flags = CEPH_I_POOL_PERM;
- if (ret & POOL_READ)
- flags |= CEPH_I_POOL_RD;
- if (ret & POOL_WRITE)
- flags |= CEPH_I_POOL_WR;
-
spin_lock(&ci->i_ceph_lock);
if (pool == ci->i_layout.pool_id &&
pool_ns == rcu_dereference_raw(ci->i_layout.pool_ns)) {
- ci->i_ceph_flags |= flags;
- } else {
+ set_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
+ if (ret & POOL_READ)
+ set_bit(CEPH_I_POOL_RD_BIT, &ci->i_ceph_flags);
+ if (ret & POOL_WRITE)
+ set_bit(CEPH_I_POOL_WR_BIT, &ci->i_ceph_flags);
+ } else {
pool = ci->i_layout.pool_id;
- flags = ci->i_ceph_flags;
}
+ /* Re-read flags under the lock so check: sees the updated bits. */
+ flags = ci->i_ceph_flags;
spin_unlock(&ci->i_ceph_lock);
goto check;
}
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d51454e995a8..cb9e78b713d9 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -549,7 +549,7 @@ static void __cap_delay_requeue_front(struct ceph_mds_client *mdsc,
doutc(mdsc->fsc->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
spin_lock(&mdsc->cap_delay_lock);
- ci->i_ceph_flags |= CEPH_I_FLUSH;
+ set_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
if (!list_empty(&ci->i_cap_delay_list))
list_del_init(&ci->i_cap_delay_list);
list_add(&ci->i_cap_delay_list, &mdsc->cap_delay_list);
@@ -1409,7 +1409,7 @@ static void __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
ceph_cap_string(revoking));
BUG_ON((retain & CEPH_CAP_PIN) == 0);
- ci->i_ceph_flags &= ~CEPH_I_FLUSH;
+ clear_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
cap->issued &= retain; /* drop bits we don't want */
/*
@@ -1666,7 +1666,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
last_tid = capsnap->cap_flush.tid;
}
- ci->i_ceph_flags &= ~CEPH_I_FLUSH_SNAPS;
+ clear_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
while (first_tid <= last_tid) {
struct ceph_cap *cap = ci->i_auth_cap;
@@ -2026,7 +2026,7 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags)
spin_lock(&ci->i_ceph_lock);
if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
- ci->i_ceph_flags |= CEPH_I_ASYNC_CHECK_CAPS;
+ set_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT, &ci->i_ceph_flags);
/* Don't send messages until we get async create reply */
spin_unlock(&ci->i_ceph_lock);
@@ -2577,7 +2577,7 @@ static void __kick_flushing_caps(struct ceph_mds_client *mdsc,
if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE)
return;
- ci->i_ceph_flags &= ~CEPH_I_KICK_FLUSH;
+ clear_bit(CEPH_I_KICK_FLUSH_BIT, &ci->i_ceph_flags);
list_for_each_entry_reverse(cf, &ci->i_cap_flush_list, i_list) {
if (cf->is_capsnap) {
@@ -2686,7 +2686,7 @@ void ceph_early_kick_flushing_caps(struct ceph_mds_client *mdsc,
__kick_flushing_caps(mdsc, session, ci,
oldest_flush_tid);
} else {
- ci->i_ceph_flags |= CEPH_I_KICK_FLUSH;
+ set_bit(CEPH_I_KICK_FLUSH_BIT, &ci->i_ceph_flags);
}
spin_unlock(&ci->i_ceph_lock);
@@ -2829,7 +2829,7 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
spin_lock(&ci->i_ceph_lock);
if ((flags & CHECK_FILELOCK) &&
- (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK)) {
+ test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
doutc(cl, "%p %llx.%llx error filelock\n", inode,
ceph_vinop(inode));
ret = -EIO;
@@ -3207,7 +3207,7 @@ static int ceph_try_drop_cap_snap(struct ceph_inode_info *ci,
BUG_ON(capsnap->cap_flush.tid > 0);
ceph_put_snap_context(capsnap->context);
if (!list_is_last(&capsnap->ci_item, &ci->i_cap_snaps))
- ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+ set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
list_del(&capsnap->ci_item);
ceph_put_cap_snap(capsnap);
@@ -3396,7 +3396,7 @@ void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
if (ceph_try_drop_cap_snap(ci, capsnap)) {
put++;
} else {
- ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+ set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
flush_snaps = true;
}
}
@@ -3648,7 +3648,7 @@ static void handle_cap_grant(struct inode *inode,
if (ci->i_layout.pool_id != old_pool ||
extra_info->pool_ns != old_ns)
- ci->i_ceph_flags &= ~CEPH_I_POOL_PERM;
+ clear_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
extra_info->pool_ns = old_ns;
@@ -4815,7 +4815,7 @@ int ceph_drop_caps_for_unlink(struct inode *inode)
doutc(mdsc->fsc->client, "%p %llx.%llx\n", inode,
ceph_vinop(inode));
spin_lock(&mdsc->cap_delay_lock);
- ci->i_ceph_flags |= CEPH_I_FLUSH;
+ set_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
if (!list_empty(&ci->i_cap_delay_list))
list_del_init(&ci->i_cap_delay_list);
list_add_tail(&ci->i_cap_delay_list,
@@ -5080,7 +5080,7 @@ int ceph_purge_inode_cap(struct inode *inode, struct ceph_cap *cap, bool *invali
if (atomic_read(&ci->i_filelock_ref) > 0) {
/* make further file lock syscall return -EIO */
- ci->i_ceph_flags |= CEPH_I_ERROR_FILELOCK;
+ set_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
pr_warn_ratelimited_client(cl,
" dropping file locks for %p %llx.%llx\n",
inode, ceph_vinop(inode));
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 5e7c73a29aa3..e2622f1cfbff 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -579,12 +579,12 @@ static void wake_async_create_waiters(struct inode *inode,
spin_lock(&ci->i_ceph_lock);
if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
- clear_and_wake_up_bit(CEPH_ASYNC_CREATE_BIT, &ci->i_ceph_flags);
+ /* Serialized by i_ceph_lock; the two ops touch different bits. */
+ clear_and_wake_up_bit(CEPH_I_ASYNC_CREATE_BIT, &ci->i_ceph_flags);
- if (ci->i_ceph_flags & CEPH_I_ASYNC_CHECK_CAPS) {
- ci->i_ceph_flags &= ~CEPH_I_ASYNC_CHECK_CAPS;
+ if (test_and_clear_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT,
+ &ci->i_ceph_flags))
check_cap = true;
- }
}
ceph_kick_flushing_inode_caps(session, ci);
spin_unlock(&ci->i_ceph_lock);
@@ -747,7 +747,8 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
* that point and don't worry about setting
* CEPH_I_ASYNC_CREATE.
*/
- ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
+ set_bit(CEPH_I_ASYNC_CREATE_BIT,
+ &ceph_inode(inode)->i_ceph_flags);
unlock_new_inode(inode);
}
if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
@@ -2422,7 +2423,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
(iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) ||
- (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
+ test_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags)) {
struct ceph_snap_context *snapc;
struct iov_iter data;
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index d99e12d1100b..f75d66760d54 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1142,7 +1142,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
rcu_assign_pointer(ci->i_layout.pool_ns, pool_ns);
if (ci->i_layout.pool_id != old_pool || pool_ns != old_ns)
- ci->i_ceph_flags &= ~CEPH_I_POOL_PERM;
+ clear_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
pool_ns = old_ns;
@@ -3199,7 +3199,7 @@ void ceph_inode_shutdown(struct inode *inode)
bool invalidate = false;
spin_lock(&ci->i_ceph_lock);
- ci->i_ceph_flags |= CEPH_I_SHUTDOWN;
+ set_bit(CEPH_I_SHUTDOWN_BIT, &ci->i_ceph_flags);
p = rb_first(&ci->i_caps);
while (p) {
struct ceph_cap *cap = rb_entry(p, struct ceph_cap, ci_node);
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index dd764f9c64b9..c4ff2266bb94 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -57,9 +57,7 @@ static void ceph_fl_release_lock(struct file_lock *fl)
ci = ceph_inode(inode);
if (atomic_dec_and_test(&ci->i_filelock_ref)) {
/* clear error when all locks are released */
- spin_lock(&ci->i_ceph_lock);
- ci->i_ceph_flags &= ~CEPH_I_ERROR_FILELOCK;
- spin_unlock(&ci->i_ceph_lock);
+ clear_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
}
fl->fl_u.ceph.inode = NULL;
iput(inode);
@@ -271,15 +269,10 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
else if (IS_SETLKW(cmd))
wait = 1;
- spin_lock(&ci->i_ceph_lock);
- if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
- err = -EIO;
- }
- spin_unlock(&ci->i_ceph_lock);
- if (err < 0) {
+ if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
if (op == CEPH_MDS_OP_SETFILELOCK && lock_is_unlock(fl))
posix_lock_file(file, fl, NULL);
- return err;
+ return -EIO;
}
if (lock_is_read(fl))
@@ -331,15 +324,10 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
doutc(cl, "fl_file: %p\n", fl->c.flc_file);
- spin_lock(&ci->i_ceph_lock);
- if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
- err = -EIO;
- }
- spin_unlock(&ci->i_ceph_lock);
- if (err < 0) {
+ if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
if (lock_is_unlock(fl))
locks_lock_file_wait(file, fl);
- return err;
+ return -EIO;
}
if (IS_SETLKW(cmd))
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b1746273f186..ccf0d53dde2b 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3613,7 +3613,8 @@ static void __do_request(struct ceph_mds_client *mdsc,
spin_lock(&ci->i_ceph_lock);
cap = ci->i_auth_cap;
- if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE && mds != cap->mds) {
+ if (test_bit(CEPH_I_ASYNC_CREATE_BIT, &ci->i_ceph_flags) &&
+ mds != cap->mds) {
doutc(cl, "session changed for auth cap %d -> %d\n",
cap->session->s_mds, session->s_mds);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 0428a5eaf28c..e91a199d56fd 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -658,7 +658,7 @@ static inline int ceph_wait_on_async_create(struct inode *inode)
{
struct ceph_inode_info *ci = ceph_inode(inode);
- return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
+ return wait_on_bit(&ci->i_ceph_flags, CEPH_I_ASYNC_CREATE_BIT,
TASK_KILLABLE);
}
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 52b4c2684f92..9b79a5eaca93 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -700,7 +700,7 @@ int __ceph_finish_cap_snap(struct ceph_inode_info *ci,
return 0;
}
- ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+ set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
doutc(cl, "%p %llx.%llx cap_snap %p snapc %p %llu %s s=%llu\n",
inode, ceph_vinop(inode), capsnap, capsnap->context,
capsnap->context->seq, ceph_cap_string(capsnap->dirty),
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 29a980e22dc2..66b047606d65 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -655,23 +655,34 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
/*
* Ceph inode.
*/
-#define CEPH_I_DIR_ORDERED (1 << 0) /* dentries in dir are ordered */
-#define CEPH_I_FLUSH (1 << 2) /* do not delay flush of dirty metadata */
-#define CEPH_I_POOL_PERM (1 << 3) /* pool rd/wr bits are valid */
-#define CEPH_I_POOL_RD (1 << 4) /* can read from pool */
-#define CEPH_I_POOL_WR (1 << 5) /* can write to pool */
-#define CEPH_I_SEC_INITED (1 << 6) /* security initialized */
-#define CEPH_I_KICK_FLUSH (1 << 7) /* kick flushing caps */
-#define CEPH_I_FLUSH_SNAPS (1 << 8) /* need flush snapss */
-#define CEPH_I_ERROR_WRITE (1 << 9) /* have seen write errors */
-#define CEPH_I_ERROR_FILELOCK (1 << 10) /* have seen file lock errors */
-#define CEPH_I_ODIRECT_BIT (11) /* inode in direct I/O mode */
-#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
-#define CEPH_ASYNC_CREATE_BIT (12) /* async create in flight for this */
-#define CEPH_I_ASYNC_CREATE (1 << CEPH_ASYNC_CREATE_BIT)
-#define CEPH_I_SHUTDOWN (1 << 13) /* inode is no longer usable */
-#define CEPH_I_ASYNC_CHECK_CAPS (1 << 14) /* check caps immediately after async
- creating finishes */
+#define CEPH_I_DIR_ORDERED_BIT (0) /* dentries in dir are ordered */
+ /* bit 1 historically unused */
+#define CEPH_I_FLUSH_BIT (2) /* do not delay flush of dirty metadata */
+#define CEPH_I_POOL_PERM_BIT (3) /* pool rd/wr bits are valid */
+#define CEPH_I_POOL_RD_BIT (4) /* can read from pool */
+#define CEPH_I_POOL_WR_BIT (5) /* can write to pool */
+#define CEPH_I_SEC_INITED_BIT (6) /* security initialized */
+#define CEPH_I_KICK_FLUSH_BIT (7) /* kick flushing caps */
+#define CEPH_I_FLUSH_SNAPS_BIT (8) /* need flush snaps */
+#define CEPH_I_ERROR_WRITE_BIT (9) /* have seen write errors */
+#define CEPH_I_ERROR_FILELOCK_BIT (10) /* have seen file lock errors */
+#define CEPH_I_ODIRECT_BIT (11) /* inode in direct I/O mode */
+#define CEPH_I_ASYNC_CREATE_BIT (12) /* async create in flight for this */
+#define CEPH_I_SHUTDOWN_BIT (13) /* inode is no longer usable */
+#define CEPH_I_ASYNC_CHECK_CAPS_BIT (14) /* check caps after async creating finishes */
+
+#define CEPH_I_DIR_ORDERED (1 << CEPH_I_DIR_ORDERED_BIT)
+#define CEPH_I_FLUSH (1 << CEPH_I_FLUSH_BIT)
+#define CEPH_I_POOL_PERM (1 << CEPH_I_POOL_PERM_BIT)
+#define CEPH_I_POOL_RD (1 << CEPH_I_POOL_RD_BIT)
+#define CEPH_I_POOL_WR (1 << CEPH_I_POOL_WR_BIT)
+#define CEPH_I_SEC_INITED (1 << CEPH_I_SEC_INITED_BIT)
+#define CEPH_I_KICK_FLUSH (1 << CEPH_I_KICK_FLUSH_BIT)
+#define CEPH_I_FLUSH_SNAPS (1 << CEPH_I_FLUSH_SNAPS_BIT)
+#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
+#define CEPH_I_ASYNC_CREATE (1 << CEPH_I_ASYNC_CREATE_BIT)
+#define CEPH_I_ERROR_FILELOCK (1 << CEPH_I_ERROR_FILELOCK_BIT)
+#define CEPH_I_SHUTDOWN (1 << CEPH_I_SHUTDOWN_BIT)
/*
* Masks of ceph inode work.
@@ -684,27 +695,18 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
/*
* We set the ERROR_WRITE bit when we start seeing write errors on an inode
- * and then clear it when they start succeeding. Note that we do a lockless
- * check first, and only take the lock if it looks like it needs to be changed.
- * The write submission code just takes this as a hint, so we're not too
- * worried if a few slip through in either direction.
+ * and then clear it when they start succeeding. The write submission code
+ * just takes this as a hint, so we're not too worried if a few slip through
+ * in either direction.
*/
static inline void ceph_set_error_write(struct ceph_inode_info *ci)
{
- if (!(READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE)) {
- spin_lock(&ci->i_ceph_lock);
- ci->i_ceph_flags |= CEPH_I_ERROR_WRITE;
- spin_unlock(&ci->i_ceph_lock);
- }
+ set_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
}
static inline void ceph_clear_error_write(struct ceph_inode_info *ci)
{
- if (READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE) {
- spin_lock(&ci->i_ceph_lock);
- ci->i_ceph_flags &= ~CEPH_I_ERROR_WRITE;
- spin_unlock(&ci->i_ceph_lock);
- }
+ clear_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
}
static inline void __ceph_dir_set_complete(struct ceph_inode_info *ci,
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 5f87f62091a1..7cf9e908c2fe 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1054,7 +1054,7 @@ ssize_t __ceph_getxattr(struct inode *inode, const char *name, void *value,
if (current->journal_info &&
!strncmp(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN) &&
security_ismaclabel(name + XATTR_SECURITY_PREFIX_LEN))
- ci->i_ceph_flags |= CEPH_I_SEC_INITED;
+ set_bit(CEPH_I_SEC_INITED_BIT, &ci->i_ceph_flags);
out:
spin_unlock(&ci->i_ceph_lock);
return err;
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread

* Re: [EXTERNAL] [PATCH v3 01/11] ceph: convert inode flags to named bit positions and atomic bitops
2026-04-29 12:51 ` [PATCH v3 01/11] ceph: convert inode flags to named bit positions and atomic bitops Alex Markuze
@ 2026-04-29 19:31 ` Viacheslav Dubeyko
0 siblings, 0 replies; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-04-29 19:31 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Wed, 2026-04-29 at 12:51 +0000, Alex Markuze wrote:
> Define named bit-position constants for all CEPH_I_* inode flags and
> derive the bitmask values from them. This gives every flag a named
> _BIT constant usable with the test_bit/set_bit/clear_bit family.
> The intentionally unused bit position 1 is documented inline.
>
> Convert all flag modifications to use atomic bitops (set_bit,
> clear_bit, test_and_clear_bit). The previous code mixed lockless
> atomic ops on some flags (ERROR_WRITE, ODIRECT) with non-atomic
> read-modify-write (|= / &= ~) on other flags sharing the same
> unsigned long. A concurrent non-atomic RMW can clobber an
> adjacent lockless atomic update -- for example, a lockless
> clear_bit(ERROR_WRITE) could be silently resurrected by a
> concurrent ci->i_ceph_flags |= CEPH_I_FLUSH under the spinlock.
> Using atomic bitops for all modifications eliminates this class
> of race entirely.
>
> Flags whose only users are now the _BIT form (ERROR_WRITE,
> ASYNC_CHECK_CAPS) have their old mask defines removed to document
> that callers must use the _BIT constant with the set_bit/test_bit
> family. ERROR_FILELOCK and SHUTDOWN retain their mask defines
> because they are still used via bitmask tests in lockless readers
> (ceph_inode_is_shutdown, reconnect_caps_cb).
>
> Flag reads under i_ceph_lock continue to use non-atomic bitmask
> tests where the tested flag is only modified under the same lock;
> this is safe and unchanged because the lock serialises both the
> read and the write.
>
> The lockless reader ceph_inode_is_shutdown() retains the READ_ONCE()
> snapshot plus bitmask test pattern -- the single atomic load into a
> local variable is correct and avoids a second memory access that
> test_bit() would require. It now uses the named CEPH_I_SHUTDOWN
> mask constant instead of an inline BIT().
>
> The direct assignment in ceph_finish_async_create() is converted
> from i_ceph_flags = CEPH_I_ASYNC_CREATE to set_bit(). This
> inode is I_NEW at this point -- still invisible to other threads
> and guaranteed to have zero flags from alloc_inode -- so either
> form is safe, but set_bit() keeps the conversion uniform.
>
> The only remaining direct assignment (alloc_inode zeroing) operates
> on an inode that is not yet visible to other threads, so it is safe
> without atomic ops.
>
> The dead precomputed flags variable in ceph_pool_perm_check() is
> removed; the check: loop re-reads flags from i_ceph_flags after
> the set_bit() calls, keeping a single source of truth.
>
> Co-developed-by: Viacheslav Dubeyko <vdubeyko@redhat.com>
> Signed-off-by: Viacheslav Dubeyko <vdubeyko@redhat.com>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> fs/ceph/addr.c | 17 ++++++------
> fs/ceph/caps.c | 24 ++++++++---------
> fs/ceph/file.c | 13 ++++-----
> fs/ceph/inode.c | 4 +--
> fs/ceph/locks.c | 22 ++++-----------
> fs/ceph/mds_client.c | 3 ++-
> fs/ceph/mds_client.h | 2 +-
> fs/ceph/snap.c | 2 +-
> fs/ceph/super.h | 64 +++++++++++++++++++++++---------------------
> fs/ceph/xattr.c | 2 +-
> 10 files changed, 72 insertions(+), 81 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 2090fc78529c..35c5fdb5a448 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -2583,20 +2583,19 @@ int ceph_pool_perm_check(struct inode *inode, int need)
I have realized that we have an issue here. We declare flags as int [1]:
int ret, flags;
However, we assign it from ci->i_ceph_flags [2], which has unsigned long type [3]:
spin_lock(&ci->i_ceph_lock);
flags = ci->i_ceph_flags;
pool = ci->i_layout.pool_id;
spin_unlock(&ci->i_ceph_lock);
I think we need to rework the declaration of the flags variable.
The rest of the patch looks good.
Thanks,
Slava.
[1] https://elixir.bootlin.com/linux/v7.0.1/source/fs/ceph/addr.c#L2544
[2] https://elixir.bootlin.com/linux/v7.0.1/source/fs/ceph/addr.c#L2564
[3] https://elixir.bootlin.com/linux/v7.0.1/source/fs/ceph/super.h#L383
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v3 02/11] ceph: use proper endian conversion for flock_len in reconnect
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
2026-04-29 12:51 ` [PATCH v3 01/11] ceph: convert inode flags to named bit positions and atomic bitops Alex Markuze
@ 2026-04-29 12:51 ` Alex Markuze
2026-04-29 12:51 ` [PATCH v3 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset Alex Markuze
` (8 subsequent siblings)
10 siblings, 0 replies; 17+ messages in thread
From: Alex Markuze @ 2026-04-29 12:51 UTC (permalink / raw)
To: ceph-devel
Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze,
Viacheslav Dubeyko
Replace the __force __le32 cast with cpu_to_le32() for the flock_len field
in reconnect_caps_cb(). The old code used a type-system bypass to silence
sparse; the new form uses the proper endian conversion macro.
Also switch from a raw bitmask test against i_ceph_flags to test_bit() on
the named CEPH_I_ERROR_FILELOCK_BIT, which is the correct accessor for the
unsigned long flags field after the bit-position conversion.
Remove the now-unused CEPH_I_ERROR_FILELOCK mask define since all callers
use the _BIT form with test_bit/set_bit/clear_bit.
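For reference, a sketch of the two forms (illustrative, not code from
the tree): __le32 is a sparse bitwise type, so assigning a plain
integer to it is flagged, and there are two ways to satisfy the
checker:

  rec.v2.flock_len = (__force __le32)v;  /* bypass: no byte swap */
  rec.v2.flock_len = cpu_to_le32(v);     /* converts: swaps on big-endian */

cpu_to_le32() compiles to a no-op on little-endian hosts, so the
generated code there is identical.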
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/mds_client.c | 5 +++--
fs/ceph/super.h | 1 -
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index ccf0d53dde2b..871f0eef468d 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4693,8 +4693,9 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
rec.v2.issued = cpu_to_le32(cap->issued);
rec.v2.snaprealm = cpu_to_le64(ci->i_snap_realm->ino);
rec.v2.pathbase = cpu_to_le64(path_info.vino.ino);
- rec.v2.flock_len = (__force __le32)
- ((ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) ? 0 : 1);
+ rec.v2.flock_len = cpu_to_le32(
+ test_bit(CEPH_I_ERROR_FILELOCK_BIT,
+ &ci->i_ceph_flags) ? 0 : 1);
} else {
struct timespec64 ts;
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 66b047606d65..30911ccf961e 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -681,7 +681,6 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
#define CEPH_I_FLUSH_SNAPS (1 << CEPH_I_FLUSH_SNAPS_BIT)
#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
#define CEPH_I_ASYNC_CREATE (1 << CEPH_I_ASYNC_CREATE_BIT)
-#define CEPH_I_ERROR_FILELOCK (1 << CEPH_I_ERROR_FILELOCK_BIT)
#define CEPH_I_SHUTDOWN (1 << CEPH_I_SHUTDOWN_BIT)
/*
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread

* [PATCH v3 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
2026-04-29 12:51 ` [PATCH v3 01/11] ceph: convert inode flags to named bit positions and atomic bitops Alex Markuze
2026-04-29 12:51 ` [PATCH v3 02/11] ceph: use proper endian conversion for flock_len in reconnect Alex Markuze
@ 2026-04-29 12:51 ` Alex Markuze
2026-04-29 21:22 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:51 ` [PATCH v3 04/11] ceph: add diagnostic timeout loop to wait_caps_flush() Alex Markuze
` (7 subsequent siblings)
10 siblings, 1 reply; 17+ messages in thread
From: Alex Markuze @ 2026-04-29 12:51 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Change send_mds_reconnect() to return an error code so callers can detect
and report reconnect failures instead of silently ignoring them. Add early
bailout checks for sessions that are already closed, rejected, or
unregistered, which avoids sending reconnect messages for sessions that
can no longer be recovered.
The early -ESTALE and -ENOENT bailouts use a separate fail_return label
that skips the pr_err_client diagnostic, since these codes indicate
expected concurrent-teardown races rather than genuine reconnect build
failures.
Move the "reconnect start" log after the early-bailout checks so it
only appears for sessions that actually proceed with reconnect.
Save the prior session state before transitioning to RECONNECTING,
and restore it in the failure path. Without this, a transient
build or encoding failure (-ENOMEM, -ENOSPC) strands the session
in RECONNECTING indefinitely because check_new_map() only retries
sessions in RESTARTING state.
Rewrite mds_peer_reset() to handle the case where the MDS is past its
RECONNECT phase (i.e. active). An active MDS rejects CLIENT_RECONNECT
messages because it only accepts them during its own RECONNECT window
after restart. Previously, the client would send a doomed reconnect
that the MDS would reject or ignore. Now, the client tears the session
down locally and lets new requests re-open a fresh session, which is
the correct recovery for this scenario. The RECONNECTING state is
handled on the same teardown path, since the MDS will reject reconnect
attempts from an active client regardless of the session's local state.
Add explicit cases for CLOSED and REJECTED session states in
mds_peer_reset() since these are terminal states where a connection
drop is expected behavior.
The session teardown path in mds_peer_reset() follows the established
drop-and-reacquire locking pattern from check_new_map(): take
mdsc->mutex for session unregistration, release it, then take s->s_mutex
separately for cleanup. This avoids introducing a new simultaneous lock
nesting pattern.
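Condensed, the teardown ordering is (full version in the diff below):

  mutex_lock(&mdsc->mutex);          /* revalidate and unregister */
  __unregister_session(mdsc, s);
  __wake_requests(mdsc, &s->s_waiting);
  mutex_unlock(&mdsc->mutex);

  mutex_lock(&s->s_mutex);           /* cleanup, taken separately */
  cleanup_session_requests(mdsc, s);
  remove_session_caps(s);
  mutex_unlock(&s->s_mutex);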
Log reconnect failures from check_new_map() and mds_peer_reset() at
pr_warn level rather than pr_err, since return codes like -ESTALE
(closed/rejected session) and -ENOENT (unregistered session) are
expected during concurrent teardown. Log dropped messages for
unregistered sessions via doutc() (dynamic debug) rather than
pr_info, as post-reset message arrival is routine and does not
warrant unconditional logging.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/mds_client.c | 169 +++++++++++++++++++++++++++++++++++++++----
1 file changed, 155 insertions(+), 14 deletions(-)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 871f0eef468d..b62abae72c4c 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4416,9 +4416,14 @@ static void handle_session(struct ceph_mds_session *session,
break;
case CEPH_SESSION_REJECT:
- WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING);
- pr_info_client(cl, "mds%d rejected session\n",
- session->s_mds);
+ WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING &&
+ session->s_state != CEPH_MDS_SESSION_RECONNECTING);
+ if (session->s_state == CEPH_MDS_SESSION_RECONNECTING)
+ pr_info_client(cl, "mds%d reconnect rejected\n",
+ session->s_mds);
+ else
+ pr_info_client(cl, "mds%d rejected session\n",
+ session->s_mds);
session->s_state = CEPH_MDS_SESSION_REJECTED;
cleanup_session_requests(mdsc, session);
remove_session_caps(session);
@@ -4678,6 +4683,14 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
cap->mseq = 0; /* and migrate_seq */
cap->cap_gen = atomic_read(&cap->session->s_cap_gen);
+ /*
+ * Note: CEPH_I_ERROR_FILELOCK is not set during reconnect.
+ * Instead, locks are submitted for best-effort MDS reclaim
+ * via the flock_len field below. If reclaim fails (e.g.,
+ * another client grabbed a conflicting lock), future lock
+ * operations will fail and set the error flag at that point.
+ */
+
/* These are lost when the session goes away */
if (S_ISDIR(inode->i_mode)) {
if (cap->issued & CEPH_CAP_DIR_CREATE) {
@@ -4892,20 +4905,19 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
*
* This is a relatively heavyweight operation, but it's rare.
*/
-static void send_mds_reconnect(struct ceph_mds_client *mdsc,
- struct ceph_mds_session *session)
+static int send_mds_reconnect(struct ceph_mds_client *mdsc,
+ struct ceph_mds_session *session)
{
struct ceph_client *cl = mdsc->fsc->client;
struct ceph_msg *reply;
int mds = session->s_mds;
int err = -ENOMEM;
+ int old_state;
struct ceph_reconnect_state recon_state = {
.session = session,
};
LIST_HEAD(dispose);
- pr_info_client(cl, "mds%d reconnect start\n", mds);
-
recon_state.pagelist = ceph_pagelist_alloc(GFP_NOFS);
if (!recon_state.pagelist)
goto fail_nopagelist;
@@ -4917,6 +4929,32 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
xa_destroy(&session->s_delegated_inos);
mutex_lock(&session->s_mutex);
+ if (session->s_state == CEPH_MDS_SESSION_CLOSED ||
+ session->s_state == CEPH_MDS_SESSION_REJECTED) {
+ pr_info_client(cl, "mds%d skipping reconnect, session %s\n",
+ mds,
+ ceph_session_state_name(session->s_state));
+ mutex_unlock(&session->s_mutex);
+ ceph_msg_put(reply);
+ err = -ESTALE;
+ goto fail_return;
+ }
+
+ mutex_lock(&mdsc->mutex);
+ if (mds >= mdsc->max_sessions || mdsc->sessions[mds] != session) {
+ mutex_unlock(&mdsc->mutex);
+ pr_info_client(cl,
+ "mds%d skipping reconnect, session unregistered\n",
+ mds);
+ mutex_unlock(&session->s_mutex);
+ ceph_msg_put(reply);
+ err = -ENOENT;
+ goto fail_return;
+ }
+ mutex_unlock(&mdsc->mutex);
+
+ pr_info_client(cl, "mds%d reconnect start\n", mds);
+ old_state = session->s_state;
session->s_state = CEPH_MDS_SESSION_RECONNECTING;
session->s_seq = 0;
@@ -5046,18 +5084,34 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
up_read(&mdsc->snap_rwsem);
ceph_pagelist_release(recon_state.pagelist);
- return;
+ return 0;
fail:
ceph_msg_put(reply);
up_read(&mdsc->snap_rwsem);
+ /*
+ * Restore prior session state so map-driven reconnect logic
+ * (check_new_map) can retry. Without this, a transient build
+ * failure strands the session in RECONNECTING indefinitely.
+ */
+ session->s_state = old_state;
mutex_unlock(&session->s_mutex);
fail_nomsg:
ceph_pagelist_release(recon_state.pagelist);
fail_nopagelist:
pr_err_client(cl, "error %d preparing reconnect for mds%d\n",
err, mds);
- return;
+ return err;
+
+fail_return:
+ /*
+ * Early-exit path for expected concurrent-teardown races
+ * (-ESTALE for closed/rejected sessions, -ENOENT for
+ * unregistered sessions). Skip the pr_err_client diagnostic
+ * since these are not genuine reconnect build failures.
+ */
+ ceph_pagelist_release(recon_state.pagelist);
+ return err;
}
@@ -5138,9 +5192,15 @@ static void check_new_map(struct ceph_mds_client *mdsc,
*/
if (s->s_state == CEPH_MDS_SESSION_RESTARTING &&
newstate >= CEPH_MDS_STATE_RECONNECT) {
+ int rc;
+
mutex_unlock(&mdsc->mutex);
clear_bit(i, targets);
- send_mds_reconnect(mdsc, s);
+ rc = send_mds_reconnect(mdsc, s);
+ if (rc)
+ pr_warn_client(cl,
+ "mds%d reconnect failed: %d\n",
+ i, rc);
mutex_lock(&mdsc->mutex);
}
@@ -5204,7 +5264,11 @@ static void check_new_map(struct ceph_mds_client *mdsc,
}
doutc(cl, "send reconnect to export target mds.%d\n", i);
mutex_unlock(&mdsc->mutex);
- send_mds_reconnect(mdsc, s);
+ err = send_mds_reconnect(mdsc, s);
+ if (err)
+ pr_warn_client(cl,
+ "mds%d export target reconnect failed: %d\n",
+ i, err);
ceph_put_mds_session(s);
mutex_lock(&mdsc->mutex);
}
@@ -6284,12 +6348,87 @@ static void mds_peer_reset(struct ceph_connection *con)
{
struct ceph_mds_session *s = con->private;
struct ceph_mds_client *mdsc = s->s_mdsc;
+ int session_state;
pr_warn_client(mdsc->fsc->client, "mds%d closed our session\n",
s->s_mds);
- if (READ_ONCE(mdsc->fsc->mount_state) != CEPH_MOUNT_FENCE_IO &&
- ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) >= CEPH_MDS_STATE_RECONNECT)
- send_mds_reconnect(mdsc, s);
+
+ if (READ_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_FENCE_IO ||
+ ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) < CEPH_MDS_STATE_RECONNECT)
+ return;
+
+ if (ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) == CEPH_MDS_STATE_RECONNECT) {
+ int rc = send_mds_reconnect(mdsc, s);
+
+ if (rc)
+ pr_warn_client(mdsc->fsc->client,
+ "mds%d reconnect failed: %d\n",
+ s->s_mds, rc);
+ return;
+ }
+
+ /*
+ * MDS is active (past RECONNECT). It will not accept a
+ * CLIENT_RECONNECT from us, so tear the session down locally
+ * and let new requests re-open a fresh session.
+ *
+ * Snapshot session state with READ_ONCE, then revalidate under
+ * mdsc->mutex before acting. The subsequent mdsc->mutex
+ * section rechecks s_state to catch concurrent transitions, so
+ * the lockless snapshot here is safe. s->s_mutex is taken
+ * separately for cleanup after unregistration, which avoids
+ * introducing a new s->s_mutex + mdsc->mutex nesting.
+ */
+ session_state = READ_ONCE(s->s_state);
+
+ switch (session_state) {
+ case CEPH_MDS_SESSION_RESTARTING:
+ case CEPH_MDS_SESSION_RECONNECTING:
+ case CEPH_MDS_SESSION_CLOSING:
+ case CEPH_MDS_SESSION_OPEN:
+ case CEPH_MDS_SESSION_HUNG:
+ case CEPH_MDS_SESSION_OPENING:
+ mutex_lock(&mdsc->mutex);
+ if (s->s_mds >= mdsc->max_sessions ||
+ mdsc->sessions[s->s_mds] != s ||
+ s->s_state != session_state) {
+ pr_info_client(mdsc->fsc->client,
+ "mds%d state changed to %s during peer reset\n",
+ s->s_mds,
+ ceph_session_state_name(s->s_state));
+ mutex_unlock(&mdsc->mutex);
+ return;
+ }
+
+ ceph_get_mds_session(s);
+ s->s_state = CEPH_MDS_SESSION_CLOSED;
+ __unregister_session(mdsc, s);
+ __wake_requests(mdsc, &s->s_waiting);
+ mutex_unlock(&mdsc->mutex);
+
+ mutex_lock(&s->s_mutex);
+ cleanup_session_requests(mdsc, s);
+ remove_session_caps(s);
+ mutex_unlock(&s->s_mutex);
+
+ wake_up_all(&mdsc->session_close_wq);
+
+ mutex_lock(&mdsc->mutex);
+ kick_requests(mdsc, s->s_mds);
+ mutex_unlock(&mdsc->mutex);
+
+ ceph_put_mds_session(s);
+ break;
+ case CEPH_MDS_SESSION_CLOSED:
+ case CEPH_MDS_SESSION_REJECTED:
+ break;
+ default:
+ pr_warn_client(mdsc->fsc->client,
+ "mds%d peer reset in unexpected state %s\n",
+ s->s_mds,
+ ceph_session_state_name(session_state));
+ break;
+ }
}
static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
@@ -6301,6 +6440,8 @@ static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
mutex_lock(&mdsc->mutex);
if (__verify_registered_session(mdsc, s) < 0) {
+ doutc(cl, "dropping tid %llu from unregistered session %d\n",
+ le64_to_cpu(msg->hdr.tid), s->s_mds);
mutex_unlock(&mdsc->mutex);
goto out;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread

* Re: [EXTERNAL] [PATCH v3 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset
2026-04-29 12:51 ` [PATCH v3 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset Alex Markuze
@ 2026-04-29 21:22 ` Viacheslav Dubeyko
0 siblings, 0 replies; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-04-29 21:22 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Wed, 2026-04-29 at 12:51 +0000, Alex Markuze wrote:
> Change send_mds_reconnect() to return an error code so callers can detect
> and report reconnect failures instead of silently ignoring them. Add early
> bailout checks for sessions that are already closed, rejected, or
> unregistered, which avoids sending reconnect messages for sessions that
> can no longer be recovered.
>
> The early -ESTALE and -ENOENT bailouts use a separate fail_return label
> that skips the pr_err_client diagnostic, since these codes indicate
> expected concurrent-teardown races rather than genuine reconnect build
> failures.
>
> Move the "reconnect start" log after the early-bailout checks so it
> only appears for sessions that actually proceed with reconnect.
>
> Save the prior session state before transitioning to RECONNECTING,
> and restore it in the failure path. Without this, a transient
> build or encoding failure (-ENOMEM, -ENOSPC) strands the session
> in RECONNECTING indefinitely because check_new_map() only retries
> sessions in RESTARTING state.
>
> Rewrite mds_peer_reset() to handle the case where the MDS is past its
> RECONNECT phase (i.e. active). An active MDS rejects CLIENT_RECONNECT
> messages because it only accepts them during its own RECONNECT window
> after restart. Previously, the client would send a doomed reconnect
> that the MDS would reject or ignore. Now, the client tears the session
> down locally and lets new requests re-open a fresh session, which is
> the correct recovery for this scenario. The RECONNECTING state is
> handled on the same teardown path, since the MDS will reject reconnect
> attempts from an active client regardless of the session's local state.
>
> Add explicit cases for CLOSED and REJECTED session states in
> mds_peer_reset() since these are terminal states where a connection
> drop is expected behavior.
>
> The session teardown path in mds_peer_reset() follows the established
> drop-and-reacquire locking pattern from check_new_map(): take
> mdsc->mutex for session unregistration, release it, then take s->s_mutex
> separately for cleanup. This avoids introducing a new simultaneous lock
> nesting pattern.
>
> Log reconnect failures from check_new_map() and mds_peer_reset() at
> pr_warn level rather than pr_err, since return codes like -ESTALE
> (closed/rejected session) and -ENOENT (unregistered session) are
> expected during concurrent teardown. Log dropped messages for
> unregistered sessions via doutc() (dynamic debug) rather than
> pr_info, as post-reset message arrival is routine and does not
> warrant unconditional logging.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> fs/ceph/mds_client.c | 169 +++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 155 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 871f0eef468d..b62abae72c4c 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -4416,9 +4416,14 @@ static void handle_session(struct ceph_mds_session *session,
> break;
>
> case CEPH_SESSION_REJECT:
> - WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING);
> - pr_info_client(cl, "mds%d rejected session\n",
> - session->s_mds);
> + WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING &&
> + session->s_state != CEPH_MDS_SESSION_RECONNECTING);
> + if (session->s_state == CEPH_MDS_SESSION_RECONNECTING)
> + pr_info_client(cl, "mds%d reconnect rejected\n",
> + session->s_mds);
> + else
> + pr_info_client(cl, "mds%d rejected session\n",
> + session->s_mds);
> session->s_state = CEPH_MDS_SESSION_REJECTED;
> cleanup_session_requests(mdsc, session);
> remove_session_caps(session);
> @@ -4678,6 +4683,14 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
> cap->mseq = 0; /* and migrate_seq */
> cap->cap_gen = atomic_read(&cap->session->s_cap_gen);
>
> + /*
> + * Note: CEPH_I_ERROR_FILELOCK is not set during reconnect.
> + * Instead, locks are submitted for best-effort MDS reclaim
> + * via the flock_len field below. If reclaim fails (e.g.,
> + * another client grabbed a conflicting lock), future lock
> + * operations will fail and set the error flag at that point.
> + */
> +
> /* These are lost when the session goes away */
> if (S_ISDIR(inode->i_mode)) {
> if (cap->issued & CEPH_CAP_DIR_CREATE) {
> @@ -4892,20 +4905,19 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
> *
> * This is a relatively heavyweight operation, but it's rare.
> */
> -static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> - struct ceph_mds_session *session)
> +static int send_mds_reconnect(struct ceph_mds_client *mdsc,
> + struct ceph_mds_session *session)
> {
> struct ceph_client *cl = mdsc->fsc->client;
> struct ceph_msg *reply;
> int mds = session->s_mds;
> int err = -ENOMEM;
> + int old_state;
> struct ceph_reconnect_state recon_state = {
> .session = session,
> };
> LIST_HEAD(dispose);
>
> - pr_info_client(cl, "mds%d reconnect start\n", mds);
> -
> recon_state.pagelist = ceph_pagelist_alloc(GFP_NOFS);
> if (!recon_state.pagelist)
> goto fail_nopagelist;
> @@ -4917,6 +4929,32 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
> xa_destroy(&session->s_delegated_inos);
Is xa_destroy(&session->s_delegated_inos) thread-safe now? If an in-flight async
create is midway through ceph_get_deleg_ino() (which does xa_for_each + xa_erase)
while xa_destroy() runs on the same XArray from a different thread, that is
unsafe. What do you think?
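To make the window concrete (illustrative interleaving):

   reconnect thread                   async create reply thread
   ----------------                   -------------------------
                                      xa_for_each(&s->s_delegated_inos, ino, p)
   xa_destroy(&s->s_delegated_inos);
                                          xa_erase(&s->s_delegated_inos, ino);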
>
> mutex_lock(&session->s_mutex);
> + if (session->s_state == CEPH_MDS_SESSION_CLOSED ||
> + session->s_state == CEPH_MDS_SESSION_REJECTED) {
> + pr_info_client(cl, "mds%d skipping reconnect, session %s\n",
> + mds,
> + ceph_session_state_name(session->s_state));
> + mutex_unlock(&session->s_mutex);
> + ceph_msg_put(reply);
> + err = -ESTALE;
> + goto fail_return;
> + }
> +
> + mutex_lock(&mdsc->mutex);
I started to wonder: is it safe to call mutex_lock(&mdsc->mutex) while holding
session->s_mutex? Could we introduce a potential deadlock here?
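The pattern to rule out would be the classic AB-BA inversion
(illustrative only, not claiming such a path exists today):

   send_mds_reconnect()               hypothetical other path
   --------------------               -----------------------
   mutex_lock(&session->s_mutex);     mutex_lock(&mdsc->mutex);
   mutex_lock(&mdsc->mutex);          mutex_lock(&session->s_mutex);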
> + if (mds >= mdsc->max_sessions || mdsc->sessions[mds] != session) {
> + mutex_unlock(&mdsc->mutex);
> + pr_info_client(cl,
> + "mds%d skipping reconnect, session unregistered\n",
> + mds);
> + mutex_unlock(&session->s_mutex);
> + ceph_msg_put(reply);
> + err = -ENOENT;
> + goto fail_return;
> + }
> + mutex_unlock(&mdsc->mutex);
> +
> + pr_info_client(cl, "mds%d reconnect start\n", mds);
> + old_state = session->s_state;
> session->s_state = CEPH_MDS_SESSION_RECONNECTING;
> session->s_seq = 0;
>
> @@ -5046,18 +5084,34 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
>
> up_read(&mdsc->snap_rwsem);
> ceph_pagelist_release(recon_state.pagelist);
> - return;
> + return 0;
>
> fail:
> ceph_msg_put(reply);
> up_read(&mdsc->snap_rwsem);
> + /*
> + * Restore prior session state so map-driven reconnect logic
> + * (check_new_map) can retry. Without this, a transient build
> + * failure strands the session in RECONNECTING indefinitely.
> + */
> + session->s_state = old_state;
> mutex_unlock(&session->s_mutex);
> fail_nomsg:
> ceph_pagelist_release(recon_state.pagelist);
> fail_nopagelist:
> pr_err_client(cl, "error %d preparing reconnect for mds%d\n",
> err, mds);
> - return;
> + return err;
> +
> +fail_return:
> + /*
> + * Early-exit path for expected concurrent-teardown races
> + * (-ESTALE for closed/rejected sessions, -ENOENT for
> + * unregistered sessions). Skip the pr_err_client diagnostic
> + * since these are not genuine reconnect build failures.
> + */
> + ceph_pagelist_release(recon_state.pagelist);
> + return err;
> }
>
>
> @@ -5138,9 +5192,15 @@ static void check_new_map(struct ceph_mds_client *mdsc,
> */
> if (s->s_state == CEPH_MDS_SESSION_RESTARTING &&
> newstate >= CEPH_MDS_STATE_RECONNECT) {
> + int rc;
> +
> mutex_unlock(&mdsc->mutex);
> clear_bit(i, targets);
> - send_mds_reconnect(mdsc, s);
> + rc = send_mds_reconnect(mdsc, s);
> + if (rc)
> + pr_warn_client(cl,
> + "mds%d reconnect failed: %d\n",
> + i, rc);
> mutex_lock(&mdsc->mutex);
> }
>
> @@ -5204,7 +5264,11 @@ static void check_new_map(struct ceph_mds_client *mdsc,
> }
> doutc(cl, "send reconnect to export target mds.%d\n", i);
> mutex_unlock(&mdsc->mutex);
> - send_mds_reconnect(mdsc, s);
> + err = send_mds_reconnect(mdsc, s);
> + if (err)
> + pr_warn_client(cl,
> + "mds%d export target reconnect failed: %d\n",
> + i, err);
> ceph_put_mds_session(s);
> mutex_lock(&mdsc->mutex);
> }
> @@ -6284,12 +6348,87 @@ static void mds_peer_reset(struct ceph_connection *con)
> {
> struct ceph_mds_session *s = con->private;
> struct ceph_mds_client *mdsc = s->s_mdsc;
> + int session_state;
>
> pr_warn_client(mdsc->fsc->client, "mds%d closed our session\n",
> s->s_mds);
> - if (READ_ONCE(mdsc->fsc->mount_state) != CEPH_MOUNT_FENCE_IO &&
> - ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) >= CEPH_MDS_STATE_RECONNECT)
> - send_mds_reconnect(mdsc, s);
> +
> + if (READ_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_FENCE_IO ||
> + ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) < CEPH_MDS_STATE_RECONNECT)
> + return;
> +
> + if (ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) == CEPH_MDS_STATE_RECONNECT) {
We had >= CEPH_MDS_STATE_RECONNECT before. When an MDS restarts, it moves
through these states in order:
RECONNECT (10) -> waits for clients to reconnect
REJOIN (11) -> rejoins the distributed MDS cache
CLIENTREPLAY (12) -> replays in-flight client operations
ACTIVE (13) -> fully operational
Previously, if the connection dropped while the MDS was in any of those four
states, the client sent a CLIENT_RECONNECT message. With this change, REJOIN (11)
and CLIENTREPLAY (12) fall through to the teardown path, not the reconnect path.
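In terms of the check itself (state numbers as above):

   before: ceph_mdsmap_get_state(...) >= CEPH_MDS_STATE_RECONNECT  /* 10..13 */
   after:  ceph_mdsmap_get_state(...) == CEPH_MDS_STATE_RECONNECT  /* 10 only */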
Thanks,
Slava.
> + int rc = send_mds_reconnect(mdsc, s);
> +
> + if (rc)
> + pr_warn_client(mdsc->fsc->client,
> + "mds%d reconnect failed: %d\n",
> + s->s_mds, rc);
> + return;
> + }
> +
> + /*
> + * MDS is active (past RECONNECT). It will not accept a
> + * CLIENT_RECONNECT from us, so tear the session down locally
> + * and let new requests re-open a fresh session.
> + *
> + * Snapshot session state with READ_ONCE, then revalidate under
> + * mdsc->mutex before acting. The subsequent mdsc->mutex
> + * section rechecks s_state to catch concurrent transitions, so
> + * the lockless snapshot here is safe. s->s_mutex is taken
> + * separately for cleanup after unregistration, which avoids
> + * introducing a new s->s_mutex + mdsc->mutex nesting.
> + */
> + session_state = READ_ONCE(s->s_state);
> +
> + switch (session_state) {
> + case CEPH_MDS_SESSION_RESTARTING:
> + case CEPH_MDS_SESSION_RECONNECTING:
> + case CEPH_MDS_SESSION_CLOSING:
> + case CEPH_MDS_SESSION_OPEN:
> + case CEPH_MDS_SESSION_HUNG:
> + case CEPH_MDS_SESSION_OPENING:
> + mutex_lock(&mdsc->mutex);
> + if (s->s_mds >= mdsc->max_sessions ||
> + mdsc->sessions[s->s_mds] != s ||
> + s->s_state != session_state) {
> + pr_info_client(mdsc->fsc->client,
> + "mds%d state changed to %s during peer reset\n",
> + s->s_mds,
> + ceph_session_state_name(s->s_state));
> + mutex_unlock(&mdsc->mutex);
> + return;
> + }
> +
> + ceph_get_mds_session(s);
> + s->s_state = CEPH_MDS_SESSION_CLOSED;
> + __unregister_session(mdsc, s);
> + __wake_requests(mdsc, &s->s_waiting);
> + mutex_unlock(&mdsc->mutex);
> +
> + mutex_lock(&s->s_mutex);
> + cleanup_session_requests(mdsc, s);
> + remove_session_caps(s);
> + mutex_unlock(&s->s_mutex);
> +
> + wake_up_all(&mdsc->session_close_wq);
> +
> + mutex_lock(&mdsc->mutex);
> + kick_requests(mdsc, s->s_mds);
> + mutex_unlock(&mdsc->mutex);
> +
> + ceph_put_mds_session(s);
> + break;
> + case CEPH_MDS_SESSION_CLOSED:
> + case CEPH_MDS_SESSION_REJECTED:
> + break;
> + default:
> + pr_warn_client(mdsc->fsc->client,
> + "mds%d peer reset in unexpected state %s\n",
> + s->s_mds,
> + ceph_session_state_name(session_state));
> + break;
> + }
> }
>
> static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
> @@ -6301,6 +6440,8 @@ static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
>
> mutex_lock(&mdsc->mutex);
> if (__verify_registered_session(mdsc, s) < 0) {
> + doutc(cl, "dropping tid %llu from unregistered session %d\n",
> + le64_to_cpu(msg->hdr.tid), s->s_mds);
> mutex_unlock(&mdsc->mutex);
> goto out;
> }
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v3 04/11] ceph: add diagnostic timeout loop to wait_caps_flush()
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
` (2 preceding siblings ...)
2026-04-29 12:51 ` [PATCH v3 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset Alex Markuze
@ 2026-04-29 12:51 ` Alex Markuze
2026-04-29 21:41 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:52 ` [PATCH v3 05/11] ceph: add client reset state machine and session teardown Alex Markuze
` (6 subsequent siblings)
10 siblings, 1 reply; 17+ messages in thread
From: Alex Markuze @ 2026-04-29 12:51 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Convert wait_caps_flush() from a silent indefinite wait into a diagnostic
wait loop that periodically dumps pending cap flush state.
The underlying wait semantics remain intact: callers still wait until the
requested cap flushes complete. The difference is that long stalls now
produce actionable diagnostics instead of looking like a silent hang.
CEPH_CAP_FLUSH_MAX_DUMP_COUNT bounds the diagnostics in two ways:
it limits the number of entries emitted per diagnostic dump, and it
limits the number of timed diagnostic dumps before the wait continues
silently. When more entries exist than the per-dump limit, a truncation
count is reported. When the dump iteration limit is reached, a final
suppression message is emitted so the transition to silence is explicit.
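Concretely, a stall dump assembled from these format strings would
look roughly like this (client prefix elided; ino/tid values are
made up):

  still waiting for cap flushes through 1042:
    10000000001.fffffffffffffffe Fw tid=1037 last_ack=1036 wake=0
    ... and 2 more pending flushes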
The diagnostic dump collects flush entry data under cap_dirty_lock into
a bounded on-stack array, then prints after releasing the lock. This
avoids holding the spinlock across printk calls.
A null cf->ci on the global flush list indicates a bug since all
cap_flush entries are initialized with a valid ci before being added.
Signal this with WARN_ON_ONCE while still printing enough context for
debugging.
READ_ONCE is used for the i_last_cap_flush_ack field, which is read
outside the inode lock domain. Flush tids are monotonically increasing
and acks are processed in order under i_ceph_lock, so the latest ack
tid is always the most recently written value.
Add a ci pointer to struct ceph_cap_flush so that the diagnostic
dump can identify which inode each pending flush belongs to. The
new i_last_cap_flush_ack field tracks the latest acknowledged flush
tid per inode for diagnostic correlation.
This improves reset-drain observability and is also useful for
existing sync and writeback troubleshooting paths.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/caps.c | 10 +++++
fs/ceph/inode.c | 1 +
fs/ceph/mds_client.c | 97 ++++++++++++++++++++++++++++++++++++++++++--
fs/ceph/super.h | 6 +++
4 files changed, 110 insertions(+), 4 deletions(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index cb9e78b713d9..4b37d9ffdf7f 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1648,6 +1648,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
spin_lock(&mdsc->cap_dirty_lock);
capsnap->cap_flush.tid = ++mdsc->last_cap_flush_tid;
+ capsnap->cap_flush.ci = ci;
list_add_tail(&capsnap->cap_flush.g_list,
&mdsc->cap_flush_list);
if (oldest_flush_tid == 0)
@@ -1846,6 +1847,7 @@ struct ceph_cap_flush *ceph_alloc_cap_flush(void)
return NULL;
cf->is_capsnap = false;
+ cf->ci = NULL;
return cf;
}
@@ -1931,6 +1933,7 @@ static u64 __mark_caps_flushing(struct inode *inode,
doutc(cl, "%p %llx.%llx now !dirty\n", inode, ceph_vinop(inode));
swap(cf, ci->i_prealloc_cap_flush);
+ cf->ci = ci;
cf->caps = flushing;
cf->wake = wake;
@@ -3826,6 +3829,13 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
bool wake_ci = false;
bool wake_mdsc = false;
+ /*
+ * Flush tids are monotonically increasing and acks arrive in
+ * order under i_ceph_lock, so this is always the latest tid.
+ * Diagnostic readers use READ_ONCE() without holding the lock.
+ */
+ WRITE_ONCE(ci->i_last_cap_flush_ack, flush_tid);
+
list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) {
/* Is this the one that was flushed? */
if (cf->tid == flush_tid)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index f75d66760d54..de465c7e96e8 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -670,6 +670,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
INIT_LIST_HEAD(&ci->i_cap_snaps);
ci->i_head_snapc = NULL;
ci->i_snap_caps = 0;
+ ci->i_last_cap_flush_ack = 0;
ci->i_last_rd = ci->i_last_wr = jiffies - 3600 * HZ;
for (i = 0; i < CEPH_FILE_MODE_BITS; i++)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b62abae72c4c..d83003acfb06 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -27,6 +27,8 @@
#include <trace/events/ceph.h>
#define RECONNECT_MAX_SIZE (INT_MAX - PAGE_SIZE)
+#define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60
+#define CEPH_CAP_FLUSH_MAX_DUMP_COUNT 5
/*
* A cluster of MDS (metadata server) daemons is responsible for
@@ -2286,19 +2288,106 @@ static int check_caps_flush(struct ceph_mds_client *mdsc,
}
/*
- * flush all dirty inode data to disk.
+ * Dump pending cap flushes for diagnostic purposes.
*
- * returns true if we've flushed through want_flush_tid
+ * cf->ci is safe to dereference here: cap_flush entries hold a
+ * reference on the inode (via the cap), and entries are removed from
+ * cap_flush_list under cap_dirty_lock before the cap (and thus the
+ * inode reference) is released. Holding cap_dirty_lock therefore
+ * guarantees the inode remains valid for the lifetime of the scan.
+ */
+struct flush_dump_entry {
+ u64 ino;
+ u64 snap;
+ int caps;
+ u64 tid;
+ u64 last_ack;
+ bool wake;
+ bool is_capsnap;
+ bool ci_null;
+};
+
+static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
+{
+ struct ceph_client *cl = mdsc->fsc->client;
+ struct flush_dump_entry entries[CEPH_CAP_FLUSH_MAX_DUMP_COUNT];
+ struct ceph_cap_flush *cf;
+ int n = 0, remaining = 0;
+
+ spin_lock(&mdsc->cap_dirty_lock);
+ list_for_each_entry(cf, &mdsc->cap_flush_list, g_list) {
+ if (cf->tid > want_tid)
+ break;
+ if (n < CEPH_CAP_FLUSH_MAX_DUMP_COUNT) {
+ struct flush_dump_entry *e = &entries[n++];
+
+ e->ci_null = WARN_ON_ONCE(!cf->ci);
+ if (!e->ci_null) {
+ e->ino = ceph_ino(&cf->ci->netfs.inode);
+ e->snap = ceph_snap(&cf->ci->netfs.inode);
+ e->last_ack = READ_ONCE(cf->ci->i_last_cap_flush_ack);
+ }
+ e->caps = cf->caps;
+ e->tid = cf->tid;
+ e->wake = cf->wake;
+ e->is_capsnap = cf->is_capsnap;
+ } else {
+ remaining++;
+ }
+ }
+ spin_unlock(&mdsc->cap_dirty_lock);
+
+ pr_info_client(cl, "still waiting for cap flushes through %llu:\n",
+ want_tid);
+ for (int i = 0; i < n; i++) {
+ struct flush_dump_entry *e = &entries[i];
+
+ if (e->ci_null)
+ pr_info_client(cl,
+ " (null ci) %s tid=%llu wake=%d%s\n",
+ ceph_cap_string(e->caps), e->tid,
+ e->wake,
+ e->is_capsnap ? " is_capsnap" : "");
+ else
+ pr_info_client(cl,
+ " %llx.%llx %s tid=%llu last_ack=%llu wake=%d%s\n",
+ e->ino, e->snap,
+ ceph_cap_string(e->caps), e->tid,
+ e->last_ack, e->wake,
+ e->is_capsnap ? " is_capsnap" : "");
+ }
+ if (remaining)
+ pr_info_client(cl, " ... and %d more pending flushes\n",
+ remaining);
+}
+
+/*
+ * Wait for all cap flushes through @want_flush_tid to complete.
+ * Periodically dumps pending cap flush state for diagnostics.
*/
static void wait_caps_flush(struct ceph_mds_client *mdsc,
u64 want_flush_tid)
{
struct ceph_client *cl = mdsc->fsc->client;
+ int i = 0;
+ long ret;
doutc(cl, "want %llu\n", want_flush_tid);
- wait_event(mdsc->cap_flushing_wq,
- check_caps_flush(mdsc, want_flush_tid));
+ do {
+ /* 60 * HZ fits in a long on all supported architectures. */
+ ret = wait_event_timeout(mdsc->cap_flushing_wq,
+ check_caps_flush(mdsc, want_flush_tid),
+ CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC * HZ);
+ if (ret == 0) {
+ if (i < CEPH_CAP_FLUSH_MAX_DUMP_COUNT)
+ dump_cap_flushes(mdsc, want_flush_tid);
+ else if (i == CEPH_CAP_FLUSH_MAX_DUMP_COUNT)
+ pr_info_client(cl,
+ "still waiting for cap flushes; suppressing further dumps\n");
+ i++;
+ }
+ } while (ret == 0);
doutc(cl, "ok, flushed thru %llu\n", want_flush_tid);
}
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 30911ccf961e..9aca42c89ea0 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -238,6 +238,7 @@ struct ceph_cap_flush {
bool is_capsnap; /* true means capsnap */
struct list_head g_list; // global
struct list_head i_list; // per inode
+ struct ceph_inode_info *ci;
};
/*
@@ -443,6 +444,11 @@ struct ceph_inode_info {
struct ceph_snap_context *i_head_snapc; /* set if wr_buffer_head > 0 or
dirty|flushing caps */
unsigned i_snap_caps; /* cap bits for snapped files */
+ /*
+ * Written under i_ceph_lock, read via READ_ONCE()
+ * from diagnostic paths.
+ */
+ u64 i_last_cap_flush_ack;
unsigned long i_last_rd;
unsigned long i_last_wr;
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [EXTERNAL] [PATCH v3 04/11] ceph: add diagnostic timeout loop to wait_caps_flush()
2026-04-29 12:51 ` [PATCH v3 04/11] ceph: add diagnostic timeout loop to wait_caps_flush() Alex Markuze
@ 2026-04-29 21:41 ` Viacheslav Dubeyko
0 siblings, 0 replies; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-04-29 21:41 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Wed, 2026-04-29 at 12:51 +0000, Alex Markuze wrote:
> Convert wait_caps_flush() from a silent indefinite wait into a diagnostic
> wait loop that periodically dumps pending cap flush state.
>
> The underlying wait semantics remain intact: callers still wait until the
> requested cap flushes complete. The difference is that long stalls now
> produce actionable diagnostics instead of looking like a silent hang.
>
> CEPH_CAP_FLUSH_MAX_DUMP_COUNT bounds the diagnostics in two ways:
> it limits the number of entries emitted per diagnostic dump, and it
> limits the number of timed diagnostic dumps before the wait continues
> silently. When more entries exist than the per-dump limit, a truncation
> count is reported. When the dump iteration limit is reached, a final
> suppression message is emitted so the transition to silence is explicit.
>
> The diagnostic dump collects flush entry data under cap_dirty_lock into
> a bounded on-stack array, then prints after releasing the lock. This
> avoids holding the spinlock across printk calls.
>
> A null cf->ci on the global flush list indicates a bug since all
> cap_flush entries are initialized with a valid ci before being added.
> Signal this with WARN_ON_ONCE while still printing enough context for
> debugging.
>
> READ_ONCE is used for the i_last_cap_flush_ack field, which is read
> outside the inode lock domain. Flush tids are monotonically increasing
> and acks are processed in order under i_ceph_lock, so the latest ack
> tid is always the most recently written value.
>
> Add a ci pointer to struct ceph_cap_flush so that the diagnostic
> dump can identify which inode each pending flush belongs to. The
> new i_last_cap_flush_ack field tracks the latest acknowledged flush
> tid per inode for diagnostic correlation.
>
> This improves reset-drain observability and is also useful for
> existing sync and writeback troubleshooting paths.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> fs/ceph/caps.c | 10 +++++
> fs/ceph/inode.c | 1 +
> fs/ceph/mds_client.c | 97 ++++++++++++++++++++++++++++++++++++++++++--
> fs/ceph/super.h | 6 +++
> 4 files changed, 110 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index cb9e78b713d9..4b37d9ffdf7f 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1648,6 +1648,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
>
> spin_lock(&mdsc->cap_dirty_lock);
> capsnap->cap_flush.tid = ++mdsc->last_cap_flush_tid;
> + capsnap->cap_flush.ci = ci;
> list_add_tail(&capsnap->cap_flush.g_list,
> &mdsc->cap_flush_list);
> if (oldest_flush_tid == 0)
> @@ -1846,6 +1847,7 @@ struct ceph_cap_flush *ceph_alloc_cap_flush(void)
> return NULL;
>
> cf->is_capsnap = false;
> + cf->ci = NULL;
> return cf;
> }
>
> @@ -1931,6 +1933,7 @@ static u64 __mark_caps_flushing(struct inode *inode,
> doutc(cl, "%p %llx.%llx now !dirty\n", inode, ceph_vinop(inode));
>
> swap(cf, ci->i_prealloc_cap_flush);
> + cf->ci = ci;
> cf->caps = flushing;
> cf->wake = wake;
>
> @@ -3826,6 +3829,13 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
> bool wake_ci = false;
> bool wake_mdsc = false;
>
> + /*
> + * Flush tids are monotonically increasing and acks arrive in
> + * order under i_ceph_lock, so this is always the latest tid.
> + * Diagnostic readers use READ_ONCE() without holding the lock.
> + */
> + WRITE_ONCE(ci->i_last_cap_flush_ack, flush_tid);
> +
> list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) {
> /* Is this the one that was flushed? */
> if (cf->tid == flush_tid)
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index f75d66760d54..de465c7e96e8 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -670,6 +670,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
> INIT_LIST_HEAD(&ci->i_cap_snaps);
> ci->i_head_snapc = NULL;
> ci->i_snap_caps = 0;
> + ci->i_last_cap_flush_ack = 0;
>
> ci->i_last_rd = ci->i_last_wr = jiffies - 3600 * HZ;
> for (i = 0; i < CEPH_FILE_MODE_BITS; i++)
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index b62abae72c4c..d83003acfb06 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -27,6 +27,8 @@
> #include <trace/events/ceph.h>
>
> #define RECONNECT_MAX_SIZE (INT_MAX - PAGE_SIZE)
> +#define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60
> +#define CEPH_CAP_FLUSH_MAX_DUMP_COUNT 5
I see the point of collecting all timeout declarations in one place. What about
placing these declarations here [1]?
The CEPH_CAP_FLUSH_MAX_DUMP_COUNT controls two unrelated limits: the number of
entries printed per dump, and the number of timed diagnostic dumps before
suppression. I think we need to have two separate constants with distinct names.
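Something like the following (untested sketch; the names are only a
suggestion):

#define CEPH_CAP_FLUSH_DUMP_MAX_ENTRIES 5 /* entries printed per dump */
#define CEPH_CAP_FLUSH_DUMP_MAX_ITERS   5 /* timed dumps before suppression */

with dump_cap_flushes() bounded by the first constant and the timeout loop
in wait_caps_flush() bounded by the second.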
>
> /*
> * A cluster of MDS (metadata server) daemons is responsible for
> @@ -2286,19 +2288,106 @@ static int check_caps_flush(struct ceph_mds_client *mdsc,
> }
>
> /*
> - * flush all dirty inode data to disk.
> + * Dump pending cap flushes for diagnostic purposes.
> *
> - * returns true if we've flushed through want_flush_tid
> + * cf->ci is safe to dereference here: cap_flush entries hold a
> + * reference on the inode (via the cap), and entries are removed from
> + * cap_flush_list under cap_dirty_lock before the cap (and thus the
> + * inode reference) is released. Holding cap_dirty_lock therefore
> + * guarantees the inode remains valid for the lifetime of the scan.
> + */
This comment placement looks very strange and confusing. I think we need a
comment for the function and a separate comment for the structure. Especially,
it's really important to have a comment for every field of the structure.
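For example, something like this (sketch only; exact wording is up to you,
the field descriptions below are taken from how the code uses them):

/* Snapshot of one pending cap flush, collected under cap_dirty_lock */
struct flush_dump_entry {
        u64 ino;                /* inode number (valid only if !ci_null) */
        u64 snap;               /* snap id (valid only if !ci_null) */
        int caps;               /* caps being flushed */
        u64 tid;                /* flush tid of this entry */
        u64 last_ack;           /* last acked flush tid on the inode */
        bool wake;              /* wake waiters when this flush completes */
        bool is_capsnap;        /* entry belongs to a capsnap */
        bool ci_null;           /* cf->ci was unexpectedly NULL */
};

plus a normal function comment on dump_cap_flushes() itself, keeping the
cf->ci lifetime explanation next to the function it documents.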
> +struct flush_dump_entry {
> + u64 ino;
> + u64 snap;
> + int caps;
> + u64 tid;
> + u64 last_ack;
> + bool wake;
> + bool is_capsnap;
> + bool ci_null;
> +};
> +
> +static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
> +{
> + struct ceph_client *cl = mdsc->fsc->client;
> + struct flush_dump_entry entries[CEPH_CAP_FLUSH_MAX_DUMP_COUNT];
> + struct ceph_cap_flush *cf;
> + int n = 0, remaining = 0;
> +
> + spin_lock(&mdsc->cap_dirty_lock);
> + list_for_each_entry(cf, &mdsc->cap_flush_list, g_list) {
> + if (cf->tid > want_tid)
> + break;
> + if (n < CEPH_CAP_FLUSH_MAX_DUMP_COUNT) {
> + struct flush_dump_entry *e = &entries[n++];
> +
> + e->ci_null = WARN_ON_ONCE(!cf->ci);
> + if (!e->ci_null) {
> + e->ino = ceph_ino(&cf->ci->netfs.inode);
> + e->snap = ceph_snap(&cf->ci->netfs.inode);
> + e->last_ack = READ_ONCE(cf->ci->i_last_cap_flush_ack);
> + }
> + e->caps = cf->caps;
> + e->tid = cf->tid;
> + e->wake = cf->wake;
> + e->is_capsnap = cf->is_capsnap;
> + } else {
> + remaining++;
> + }
> + }
> + spin_unlock(&mdsc->cap_dirty_lock);
> +
> + pr_info_client(cl, "still waiting for cap flushes through %llu:\n",
> + want_tid);
> + for (int i = 0; i < n; i++) {
> + struct flush_dump_entry *e = &entries[i];
> +
> + if (e->ci_null)
> + pr_info_client(cl,
> + " (null ci) %s tid=%llu wake=%d%s\n",
> + ceph_cap_string(e->caps), e->tid,
> + e->wake,
> + e->is_capsnap ? " is_capsnap" : "");
> + else
> + pr_info_client(cl,
> + " %llx.%llx %s tid=%llu last_ack=%llu wake=%d%s\n",
> + e->ino, e->snap,
> + ceph_cap_string(e->caps), e->tid,
> + e->last_ack, e->wake,
> + e->is_capsnap ? " is_capsnap" : "");
> + }
> + if (remaining)
> + pr_info_client(cl, " ... and %d more pending flushes\n",
> + remaining);
> +}
> +
> +/*
> + * Wait for all cap flushes through @want_flush_tid to complete.
> + * Periodically dumps pending cap flush state for diagnostics.
> */
> static void wait_caps_flush(struct ceph_mds_client *mdsc,
> u64 want_flush_tid)
> {
> struct ceph_client *cl = mdsc->fsc->client;
> + int i = 0;
Is an int really enough here? Could we overflow the i value?
Thanks,
Slava.
> + long ret;
>
> doutc(cl, "want %llu\n", want_flush_tid);
>
> - wait_event(mdsc->cap_flushing_wq,
> - check_caps_flush(mdsc, want_flush_tid));
> + do {
> + /* 60 * HZ fits in a long on all supported architectures. */
> + ret = wait_event_timeout(mdsc->cap_flushing_wq,
> + check_caps_flush(mdsc, want_flush_tid),
> + CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC * HZ);
> + if (ret == 0) {
> + if (i < CEPH_CAP_FLUSH_MAX_DUMP_COUNT)
> + dump_cap_flushes(mdsc, want_flush_tid);
> + else if (i == CEPH_CAP_FLUSH_MAX_DUMP_COUNT)
> + pr_info_client(cl,
> + "still waiting for cap flushes; suppressing further dumps\n");
> + i++;
> + }
> + } while (ret == 0);
>
> doutc(cl, "ok, flushed thru %llu\n", want_flush_tid);
> }
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 30911ccf961e..9aca42c89ea0 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -238,6 +238,7 @@ struct ceph_cap_flush {
> bool is_capsnap; /* true means capsnap */
> struct list_head g_list; // global
> struct list_head i_list; // per inode
> + struct ceph_inode_info *ci;
> };
>
> /*
> @@ -443,6 +444,11 @@ struct ceph_inode_info {
> struct ceph_snap_context *i_head_snapc; /* set if wr_buffer_head > 0 or
> dirty|flushing caps */
> unsigned i_snap_caps; /* cap bits for snapped files */
> + /*
> + * Written under i_ceph_lock, read via READ_ONCE()
> + * from diagnostic paths.
> + */
> + u64 i_last_cap_flush_ack;
>
> unsigned long i_last_rd;
> unsigned long i_last_wr;
[1]
https://elixir.bootlin.com/linux/v7.0.1/source/include/linux/ceph/libceph.h#L72
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v3 05/11] ceph: add client reset state machine and session teardown
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
` (3 preceding siblings ...)
2026-04-29 12:51 ` [PATCH v3 04/11] ceph: add diagnostic timeout loop to wait_caps_flush() Alex Markuze
@ 2026-04-29 12:52 ` Alex Markuze
2026-04-29 22:29 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:52 ` [PATCH v3 06/11] ceph: add manual reset debugfs control and tracepoints Alex Markuze
` (5 subsequent siblings)
10 siblings, 1 reply; 17+ messages in thread
From: Alex Markuze @ 2026-04-29 12:52 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Add the client-side reset state machine, request gating, and manual
session teardown implementation.
Manual reset is an operator-triggered escape hatch for client/MDS
stalemates in which caps, locks, or unsafe metadata state stop making
forward progress. The reset blocks new metadata work, attempts a
bounded best-effort drain of dirty client state while sessions are
still alive, and finally asks the MDS to close sessions before tearing
local session state down directly.
The reset state machine tracks four phases: IDLE -> QUIESCING ->
DRAINING -> TEARDOWN -> IDLE. QUIESCING is set synchronously by
schedule_reset() before the workqueue item is dispatched, so that new
metadata requests and file-lock acquisitions are gated immediately --
even before the work function begins running. All non-IDLE phases
block callers on blocked_wq, preventing races with session teardown.
The drain phase flushes mdlog state, dirty caps, and pending cap
releases for a bounded interval. State that still cannot make progress
within that interval is discarded during teardown, which is the point
of the reset: break the stalemate and allow fresh sessions to rebuild
clean state.
The session teardown follows the established check_new_map()
forced-close pattern: unregister sessions under mdsc->mutex, then clean
up caps and requests under s->s_mutex. Reconnect is not attempted
because the MDS only accepts reconnects during its own RECONNECT phase
after restart, not from an active client.
Blocked callers are released when reset completes and observe the final
result via -EIO (reset failed) or 0 (success). Internal work-function
errors such as -ENOMEM are not propagated to unrelated callers like
open() or flock(); the detailed error remains in debugfs and
tracepoints.
The work function checks st->shutdown before each phase transition
(DRAINING, TEARDOWN) so that a concurrent ceph_mdsc_destroy() is not
overwritten. If destroy already took ownership, the work function
releases session references and returns without touching the state.
The timeout calculation for blocked-request waiters uses max_t() to
prevent jiffies underflow when the deadline has already passed.
The close-grace sleep before teardown is a best-effort nudge to let
queued REQUEST_CLOSE messages egress; it is not a correctness
requirement since the MDS still has session_autoclose as a fallback.
The destroy path marks reset as failed and wakes blocked waiters before
cancel_work_sync() so unmount does not stall.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/locks.c | 16 ++
fs/ceph/mds_client.c | 455 +++++++++++++++++++++++++++++++++++++++++++
fs/ceph/mds_client.h | 42 ++++
3 files changed, 513 insertions(+)
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index c4ff2266bb94..677221bd64e0 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -249,6 +249,7 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
{
struct inode *inode = file_inode(file);
struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
struct ceph_client *cl = ceph_inode_to_client(inode);
int err = 0;
u16 op = CEPH_MDS_OP_SETFILELOCK;
@@ -275,6 +276,13 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
return -EIO;
}
+ /* Wait for reset to complete before acquiring new locks */
+ if (op == CEPH_MDS_OP_SETFILELOCK && !lock_is_unlock(fl)) {
+ err = ceph_mdsc_wait_for_reset(mdsc);
+ if (err)
+ return err;
+ }
+
if (lock_is_read(fl))
lock_cmd = CEPH_LOCK_SHARED;
else if (lock_is_write(fl))
@@ -311,6 +319,7 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
{
struct inode *inode = file_inode(file);
struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
struct ceph_client *cl = ceph_inode_to_client(inode);
int err = 0;
u8 wait = 0;
@@ -330,6 +339,13 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
return -EIO;
}
+ /* Wait for reset to complete before acquiring new locks */
+ if (!lock_is_unlock(fl)) {
+ err = ceph_mdsc_wait_for_reset(mdsc);
+ if (err)
+ return err;
+ }
+
if (IS_SETLKW(cmd))
wait = 1;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index d83003acfb06..777af51ec8d8 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -6,6 +6,7 @@
#include <linux/slab.h>
#include <linux/gfp.h>
#include <linux/sched.h>
+#include <linux/delay.h>
#include <linux/debugfs.h>
#include <linux/seq_file.h>
#include <linux/ratelimit.h>
@@ -67,6 +68,7 @@ static void __wake_requests(struct ceph_mds_client *mdsc,
struct list_head *head);
static void ceph_cap_release_work(struct work_struct *work);
static void ceph_cap_reclaim_work(struct work_struct *work);
+static void ceph_mdsc_reset_workfn(struct work_struct *work);
static const struct ceph_connection_operations mds_con_ops;
@@ -3797,6 +3799,22 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
struct ceph_client *cl = mdsc->fsc->client;
int err = 0;
+ /*
+ * If a reset is in progress, wait for it to complete.
+ *
+ * This is best-effort: a request can pass this check just
+ * before the phase leaves IDLE and proceed concurrently with
+ * reset. That is acceptable because (a) such requests will
+ * either complete normally or fail and be retried by the
+ * caller, and (b) adding lock serialization here would
+ * penalize every request for a rare manual operation.
+ */
+ err = ceph_mdsc_wait_for_reset(mdsc);
+ if (err) {
+ doutc(cl, "wait_for_reset failed: %d\n", err);
+ return err;
+ }
+
/* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
if (req->r_inode)
ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
@@ -5203,6 +5221,421 @@ static int send_mds_reconnect(struct ceph_mds_client *mdsc,
return err;
}
+const char *ceph_reset_phase_name(enum ceph_client_reset_phase phase)
+{
+ switch (phase) {
+ case CEPH_CLIENT_RESET_IDLE: return "idle";
+ case CEPH_CLIENT_RESET_QUIESCING: return "quiescing";
+ case CEPH_CLIENT_RESET_DRAINING: return "draining";
+ case CEPH_CLIENT_RESET_TEARDOWN: return "teardown";
+ default: return "unknown";
+ }
+}
+
+/**
+ * ceph_mdsc_wait_for_reset - wait for an active reset to complete
+ * @mdsc: MDS client
+ *
+ * Returns 0 if reset completed successfully or no reset was active.
+ * Returns -EIO if reset completed with an error.
+ * Returns -ETIMEDOUT if we timed out waiting.
+ * Returns -ERESTARTSYS if interrupted by signal.
+ *
+ * Internal work-function errors (e.g. -ENOMEM) are not propagated
+ * to callers; they are mapped to -EIO. The detailed error is
+ * available via debugfs status and tracepoints.
+ */
+int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
+{
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+ struct ceph_client *cl = mdsc->fsc->client;
+ unsigned long deadline = jiffies + CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC * HZ;
+ int blocked_count;
+ long remaining;
+ long wait_ret;
+ int ret;
+
+ if (READ_ONCE(st->phase) == CEPH_CLIENT_RESET_IDLE)
+ return 0;
+
+ blocked_count = atomic_inc_return(&st->blocked_requests);
+ doutc(cl, "request blocked during reset, %d total blocked\n",
+ blocked_count);
+
+retry:
+ remaining = max_t(long, deadline - jiffies, 1);
+ wait_ret = wait_event_interruptible_timeout(st->blocked_wq,
+ READ_ONCE(st->phase) ==
+ CEPH_CLIENT_RESET_IDLE,
+ remaining);
+
+ if (wait_ret == 0) {
+ atomic_dec(&st->blocked_requests);
+ pr_warn_client(cl, "timed out waiting for reset to complete\n");
+ return -ETIMEDOUT;
+ }
+ if (wait_ret < 0) {
+ atomic_dec(&st->blocked_requests);
+ return (int)wait_ret; /* -ERESTARTSYS */
+ }
+
+ /*
+ * Verify phase is still IDLE under the lock. If another reset
+ * was scheduled between the wake-up and this check, loop back
+ * and wait for it to finish rather than returning a stale result.
+ */
+ spin_lock(&st->lock);
+ if (st->phase != CEPH_CLIENT_RESET_IDLE) {
+ spin_unlock(&st->lock);
+ if (time_before(jiffies, deadline))
+ goto retry;
+ atomic_dec(&st->blocked_requests);
+ return -ETIMEDOUT;
+ }
+ ret = st->last_errno;
+ spin_unlock(&st->lock);
+
+ atomic_dec(&st->blocked_requests);
+ return ret ? -EIO : 0;
+}
+
+static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret)
+{
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+
+ spin_lock(&st->lock);
+ /*
+ * If destroy already marked us as shut down, it owns the
+ * final bookkeeping and waiter wakeup. Just bail so we
+ * don't overwrite its state.
+ */
+ if (st->shutdown) {
+ spin_unlock(&st->lock);
+ return;
+ }
+ st->last_finish = jiffies;
+ st->last_errno = ret;
+ st->phase = CEPH_CLIENT_RESET_IDLE;
+ if (ret)
+ st->failure_count++;
+ else
+ st->success_count++;
+ spin_unlock(&st->lock);
+
+ /* Wake up all requests that were blocked waiting for reset */
+ wake_up_all(&st->blocked_wq);
+}
+
+static void ceph_mdsc_reset_workfn(struct work_struct *work)
+{
+ struct ceph_mds_client *mdsc =
+ container_of(work, struct ceph_mds_client, reset_work);
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+ struct ceph_client *cl = mdsc->fsc->client;
+ struct ceph_mds_session **sessions = NULL;
+ char reason[CEPH_CLIENT_RESET_REASON_LEN];
+ int max_sessions, i, n = 0, torn_down = 0;
+ int ret = 0;
+
+ spin_lock(&st->lock);
+ strscpy(reason, st->last_reason, sizeof(reason));
+ spin_unlock(&st->lock);
+
+ mutex_lock(&mdsc->mutex);
+ max_sessions = mdsc->max_sessions;
+ if (max_sessions <= 0) {
+ mutex_unlock(&mdsc->mutex);
+ goto out_complete;
+ }
+
+ sessions = kcalloc(max_sessions, sizeof(*sessions), GFP_KERNEL);
+ if (!sessions) {
+ mutex_unlock(&mdsc->mutex);
+ ret = -ENOMEM;
+ pr_err_client(cl,
+ "manual session reset failed to allocate session array\n");
+ ceph_mdsc_reset_complete(mdsc, ret);
+ return;
+ }
+
+ for (i = 0; i < max_sessions; i++) {
+ struct ceph_mds_session *session = mdsc->sessions[i];
+
+ if (!session)
+ continue;
+
+ /*
+ * Read session state without s_mutex to avoid nesting
+ * mdsc->mutex -> s_mutex, which would invert the
+ * s_mutex -> mdsc->mutex order used by
+ * cleanup_session_requests(). s_state is an int
+ * so loads are atomic; the teardown loop below
+ * handles races with concurrent state transitions.
+ */
+ switch (READ_ONCE(session->s_state)) {
+ case CEPH_MDS_SESSION_OPEN:
+ case CEPH_MDS_SESSION_HUNG:
+ case CEPH_MDS_SESSION_OPENING:
+ case CEPH_MDS_SESSION_RESTARTING:
+ case CEPH_MDS_SESSION_RECONNECTING:
+ case CEPH_MDS_SESSION_CLOSING:
+ sessions[n++] = ceph_get_mds_session(session);
+ break;
+ default:
+ pr_info_client(cl,
+ "mds%d in state %s, skipping reset\n",
+ session->s_mds,
+ ceph_session_state_name(session->s_state));
+ break;
+ }
+ }
+ mutex_unlock(&mdsc->mutex);
+
+ pr_info_client(cl,
+ "manual session reset executing (sessions=%d, reason=\"%s\")\n",
+ n, reason);
+
+ if (n == 0) {
+ kfree(sessions);
+ goto out_complete;
+ }
+
+ spin_lock(&st->lock);
+ if (st->shutdown) {
+ spin_unlock(&st->lock);
+ goto out_sessions;
+ }
+ st->phase = CEPH_CLIENT_RESET_DRAINING;
+ spin_unlock(&st->lock);
+
+ /*
+ * Best-effort drain: flush dirty state while sessions are still
+ * alive. New requests are blocked while phase != IDLE.
+ * The sessions are functional, so non-stuck state drains normally.
+ * Stuck state (the cause of the stalemate the operator is trying
+ * to break) will not drain -- that is expected, and we proceed to
+ * forced teardown after the timeout.
+ *
+ * Three things are kicked off:
+ * 1. MDS journal -- send_flush_mdlog asks each MDS to journal
+ * pending unsafe operations (creates, renames, setattrs).
+ * This is best-effort: we do not wait for individual unsafe
+ * requests to reach safe status. Non-stuck ops typically
+ * complete within the bounded wait window below; stuck ops
+ * will not, and are force-dropped during teardown.
+ * 2. Dirty caps -- ceph_flush_dirty_caps triggers cap flush on
+ * all sessions. Non-stuck caps flush in milliseconds.
+ * 3. Cap releases -- push pending cap release messages.
+ *
+ * The cap-flush wait below provides the bounded drain window
+ * during which all three categories can make progress.
+ */
+ for (i = 0; i < n; i++)
+ send_flush_mdlog(sessions[i]);
+
+ ceph_flush_dirty_caps(mdsc);
+ ceph_flush_cap_releases(mdsc);
+
+ spin_lock(&mdsc->cap_dirty_lock);
+ if (!list_empty(&mdsc->cap_flush_list)) {
+ struct ceph_cap_flush *cf =
+ list_last_entry(&mdsc->cap_flush_list,
+ struct ceph_cap_flush, g_list);
+ u64 want_flush = mdsc->last_cap_flush_tid;
+ long drain_ret;
+
+ /*
+ * Setting wake on the last entry is sufficient: flush
+ * entries complete in order, so when this entry finishes
+ * all earlier ones are already done.
+ */
+ cf->wake = true;
+ spin_unlock(&mdsc->cap_dirty_lock);
+ pr_info_client(cl,
+ "draining (want_flush=%llu, %d sessions)\n",
+ want_flush, n);
+ drain_ret = wait_event_timeout(mdsc->cap_flushing_wq,
+ check_caps_flush(mdsc,
+ want_flush),
+ CEPH_CLIENT_RESET_DRAIN_SEC * HZ);
+ if (drain_ret == 0) {
+ pr_info_client(cl,
+ "drain timed out, proceeding with forced teardown\n");
+ spin_lock(&st->lock);
+ st->drain_timed_out = true;
+ spin_unlock(&st->lock);
+ } else {
+ pr_info_client(cl, "drain completed successfully\n");
+ spin_lock(&st->lock);
+ st->drain_timed_out = false;
+ spin_unlock(&st->lock);
+ }
+ } else {
+ spin_unlock(&mdsc->cap_dirty_lock);
+ spin_lock(&st->lock);
+ st->drain_timed_out = false;
+ spin_unlock(&st->lock);
+ }
+
+ spin_lock(&st->lock);
+ if (st->shutdown) {
+ spin_unlock(&st->lock);
+ goto out_sessions;
+ }
+ st->phase = CEPH_CLIENT_RESET_TEARDOWN;
+ spin_unlock(&st->lock);
+
+ /*
+ * Ask each MDS to close the session before we tear it down
+ * locally. Without this the MDS sees only a connection drop and
+ * waits for the client to reconnect (up to session_autoclose
+ * seconds) before evicting the session and releasing locks.
+ *
+ * Reuse the normal close machinery so the session state/sequence
+ * snapshot is serialized under s_mutex and a racing s_seq bump
+ * retransmits REQUEST_CLOSE while the session remains CLOSING.
+ * We send all close requests first, then yield briefly to let the
+ * network stack transmit them before __unregister_session()
+ * closes the connections.
+ */
+ for (i = 0; i < n; i++) {
+ int err;
+
+ mutex_lock(&sessions[i]->s_mutex);
+ err = __close_session(mdsc, sessions[i]);
+ mutex_unlock(&sessions[i]->s_mutex);
+ if (err < 0)
+ pr_warn_client(cl,
+ "mds%d failed to queue close request before reset: %d\n",
+ sessions[i]->s_mds, err);
+ }
+ /*
+ * Best-effort grace period: yield briefly so the network stack
+ * can transmit the queued REQUEST_CLOSE messages before we tear
+ * down connections. Not a correctness requirement -- the MDS
+ * will still evict via session_autoclose if it never receives
+ * the close request.
+ */
+ if (n > 0)
+ msleep(CEPH_CLIENT_RESET_CLOSE_GRACE_MS);
+
+ /*
+ * Tear down each session: close the connection, remove all
+ * caps, clean up requests, then kick pending requests so they
+ * re-open a fresh session on the next attempt.
+ *
+ * This is modeled on the check_new_map() forced-close path
+ * for stopped MDS ranks - a proven pattern for hard session
+ * teardown. We do NOT attempt send_mds_reconnect() because
+ * the MDS only accepts reconnects during its own RECONNECT
+ * phase (after MDS restart), not from an active client.
+ *
+ * Any state that did not drain (caps that didn't flush, unsafe
+ * requests that the MDS didn't journal) is force-dropped here.
+ * This is intentional: that state is stuck and is the reason
+ * the operator triggered the reset.
+ */
+ for (i = 0; i < n; i++) {
+ int mds = sessions[i]->s_mds;
+
+ pr_info_client(cl, "mds%d resetting session\n", mds);
+
+ mutex_lock(&mdsc->mutex);
+ if (mds >= mdsc->max_sessions ||
+ mdsc->sessions[mds] != sessions[i]) {
+ pr_info_client(cl,
+ "mds%d session already torn down, skipping\n",
+ mds);
+ mutex_unlock(&mdsc->mutex);
+ ceph_put_mds_session(sessions[i]);
+ continue;
+ }
+ sessions[i]->s_state = CEPH_MDS_SESSION_CLOSED;
+ __unregister_session(mdsc, sessions[i]);
+ __wake_requests(mdsc, &sessions[i]->s_waiting);
+ mutex_unlock(&mdsc->mutex);
+
+ mutex_lock(&sessions[i]->s_mutex);
+ cleanup_session_requests(mdsc, sessions[i]);
+ remove_session_caps(sessions[i]);
+ mutex_unlock(&sessions[i]->s_mutex);
+
+ wake_up_all(&mdsc->session_close_wq);
+
+ ceph_put_mds_session(sessions[i]);
+
+ mutex_lock(&mdsc->mutex);
+ kick_requests(mdsc, mds);
+ mutex_unlock(&mdsc->mutex);
+
+ torn_down++;
+ pr_info_client(cl, "mds%d session reset complete\n", mds);
+ }
+
+ kfree(sessions);
+
+ spin_lock(&st->lock);
+ st->sessions_reset = torn_down;
+ spin_unlock(&st->lock);
+
+out_complete:
+ ceph_mdsc_reset_complete(mdsc, ret);
+ return;
+
+out_sessions:
+ for (i = 0; i < n; i++)
+ ceph_put_mds_session(sessions[i]);
+ kfree(sessions);
+}
+
+int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
+ const char *reason)
+{
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+ struct ceph_fs_client *fsc = mdsc->fsc;
+ const char *msg = (reason && reason[0]) ? reason : "manual";
+ int mount_state;
+
+ mount_state = READ_ONCE(fsc->mount_state);
+ if (mount_state != CEPH_MOUNT_MOUNTED) {
+ pr_warn_client(fsc->client,
+ "reset rejected: mount_state=%d (not mounted)\n",
+ mount_state);
+ return -EINVAL;
+ }
+
+ spin_lock(&st->lock);
+ if (st->phase != CEPH_CLIENT_RESET_IDLE) {
+ spin_unlock(&st->lock);
+ return -EBUSY;
+ }
+
+ st->phase = CEPH_CLIENT_RESET_QUIESCING;
+ st->last_start = jiffies;
+ st->last_errno = 0;
+ st->drain_timed_out = false;
+ st->sessions_reset = 0;
+ st->trigger_count++;
+ strscpy(st->last_reason, msg, sizeof(st->last_reason));
+ spin_unlock(&st->lock);
+
+ if (WARN_ON_ONCE(!queue_work(system_unbound_wq, &mdsc->reset_work))) {
+ spin_lock(&st->lock);
+ st->phase = CEPH_CLIENT_RESET_IDLE;
+ st->last_errno = -EALREADY;
+ st->last_finish = jiffies;
+ st->failure_count++;
+ spin_unlock(&st->lock);
+ wake_up_all(&st->blocked_wq);
+ return -EALREADY;
+ }
+
+ pr_info_client(mdsc->fsc->client,
+ "manual session reset scheduled (reason=\"%s\")\n",
+ msg);
+ return 0;
+}
+
/*
* compare old and new mdsmaps, kicking requests
@@ -5742,6 +6175,11 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
INIT_LIST_HEAD(&mdsc->dentry_leases);
INIT_LIST_HEAD(&mdsc->dentry_dir_leases);
+ spin_lock_init(&mdsc->reset_state.lock);
+ init_waitqueue_head(&mdsc->reset_state.blocked_wq);
+ atomic_set(&mdsc->reset_state.blocked_requests, 0);
+ INIT_WORK(&mdsc->reset_work, ceph_mdsc_reset_workfn);
+
ceph_caps_init(mdsc);
ceph_adjust_caps_max_min(mdsc, fsc->mount_options);
@@ -6267,6 +6705,23 @@ void ceph_mdsc_destroy(struct ceph_fs_client *fsc)
/* flush out any connection work with references to us */
ceph_msgr_flush();
+ /*
+ * Mark reset as failed and wake any blocked waiters before
+ * cancelling, so unmount doesn't stall on blocked_wq timeout
+ * if cancel_work_sync() prevents the work from running.
+ */
+ spin_lock(&mdsc->reset_state.lock);
+ mdsc->reset_state.shutdown = true;
+ if (mdsc->reset_state.phase != CEPH_CLIENT_RESET_IDLE) {
+ mdsc->reset_state.phase = CEPH_CLIENT_RESET_IDLE;
+ mdsc->reset_state.last_errno = -ESHUTDOWN;
+ mdsc->reset_state.last_finish = jiffies;
+ mdsc->reset_state.failure_count++;
+ }
+ spin_unlock(&mdsc->reset_state.lock);
+ wake_up_all(&mdsc->reset_state.blocked_wq);
+
+ cancel_work_sync(&mdsc->reset_work);
ceph_mdsc_stop(mdsc);
ceph_metric_destroy(&mdsc->metric);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index e91a199d56fd..afc08b0abbd5 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -74,6 +74,42 @@ struct ceph_fs_client;
struct ceph_cap;
#define MDS_AUTH_UID_ANY -1
+#define CEPH_CLIENT_RESET_REASON_LEN 64
+#define CEPH_CLIENT_RESET_DRAIN_SEC 5
+#define CEPH_CLIENT_RESET_CLOSE_GRACE_MS 100
+#define CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC 120
+
+enum ceph_client_reset_phase {
+ CEPH_CLIENT_RESET_IDLE = 0,
+ /*
+ * QUIESCING is set synchronously by schedule_reset() before the
+ * workqueue item is dispatched. It gates new requests (any
+ * phase != IDLE blocks callers) during the window between
+ * scheduling and the work function's transition to DRAINING.
+ */
+ CEPH_CLIENT_RESET_QUIESCING,
+ CEPH_CLIENT_RESET_DRAINING,
+ CEPH_CLIENT_RESET_TEARDOWN,
+};
+
+struct ceph_client_reset_state {
+ spinlock_t lock;
+ u64 trigger_count;
+ u64 success_count;
+ u64 failure_count;
+ unsigned long last_start;
+ unsigned long last_finish;
+ int last_errno;
+ enum ceph_client_reset_phase phase;
+ bool drain_timed_out;
+ bool shutdown;
+ int sessions_reset;
+ char last_reason[CEPH_CLIENT_RESET_REASON_LEN];
+
+ /* Request blocking during reset */
+ wait_queue_head_t blocked_wq;
+ atomic_t blocked_requests;
+};
struct ceph_mds_cap_match {
s64 uid; /* default to MDS_AUTH_UID_ANY */
@@ -536,6 +572,8 @@ struct ceph_mds_client {
struct list_head dentry_dir_leases; /* lru list */
struct ceph_client_metric metric;
+ struct work_struct reset_work;
+ struct ceph_client_reset_state reset_state;
spinlock_t snapid_map_lock;
struct rb_root snapid_map_tree;
@@ -559,10 +597,14 @@ extern struct ceph_mds_session *
__ceph_lookup_mds_session(struct ceph_mds_client *, int mds);
extern const char *ceph_session_state_name(int s);
+extern const char *ceph_reset_phase_name(enum ceph_client_reset_phase phase);
extern struct ceph_mds_session *
ceph_get_mds_session(struct ceph_mds_session *s);
extern void ceph_put_mds_session(struct ceph_mds_session *s);
+int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
+ const char *reason);
+int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc);
extern int ceph_mdsc_init(struct ceph_fs_client *fsc);
extern void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc);
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [EXTERNAL] [PATCH v3 05/11] ceph: add client reset state machine and session teardown
2026-04-29 12:52 ` [PATCH v3 05/11] ceph: add client reset state machine and session teardown Alex Markuze
@ 2026-04-29 22:29 ` Viacheslav Dubeyko
0 siblings, 0 replies; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-04-29 22:29 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Wed, 2026-04-29 at 12:52 +0000, Alex Markuze wrote:
> Add the client-side reset state machine, request gating, and manual
> session teardown implementation.
>
> Manual reset is an operator-triggered escape hatch for client/MDS
> stalemates in which caps, locks, or unsafe metadata state stop making
> forward progress. The reset blocks new metadata work, attempts a
> bounded best-effort drain of dirty client state while sessions are
> still alive, and finally asks the MDS to close sessions before tearing
> local session state down directly.
>
> The reset state machine tracks four phases: IDLE -> QUIESCING ->
> DRAINING -> TEARDOWN -> IDLE. QUIESCING is set synchronously by
> schedule_reset() before the workqueue item is dispatched, so that new
> metadata requests and file-lock acquisitions are gated immediately --
> even before the work function begins running. All non-IDLE phases
> block callers on blocked_wq, preventing races with session teardown.
>
> The drain phase flushes mdlog state, dirty caps, and pending cap
> releases for a bounded interval. State that still cannot make progress
> within that interval is discarded during teardown, which is the point
> of the reset: break the stalemate and allow fresh sessions to rebuild
> clean state.
>
> The session teardown follows the established check_new_map()
> forced-close pattern: unregister sessions under mdsc->mutex, then clean
> up caps and requests under s->s_mutex. Reconnect is not attempted
> because the MDS only accepts reconnects during its own RECONNECT phase
> after restart, not from an active client.
>
> Blocked callers are released when reset completes and observe the final
> result via -EIO (reset failed) or 0 (success). Internal work-function
> errors such as -ENOMEM are not propagated to unrelated callers like
> open() or flock(); the detailed error remains in debugfs and
> tracepoints.
>
> The work function checks st->shutdown before each phase transition
> (DRAINING, TEARDOWN) so that a concurrent ceph_mdsc_destroy() is not
> overwritten. If destroy already took ownership, the work function
> releases session references and returns without touching the state.
>
> The timeout calculation for blocked-request waiters uses max_t() to
> prevent jiffies underflow when the deadline has already passed.
>
> The close-grace sleep before teardown is a best-effort nudge to let
> queued REQUEST_CLOSE messages egress; it is not a correctness
> requirement since the MDS still has session_autoclose as a fallback.
>
> The destroy path marks reset as failed and wakes blocked waiters before
> cancel_work_sync() so unmount does not stall.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> fs/ceph/locks.c | 16 ++
> fs/ceph/mds_client.c | 455 +++++++++++++++++++++++++++++++++++++++++++
> fs/ceph/mds_client.h | 42 ++++
> 3 files changed, 513 insertions(+)
>
> diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> index c4ff2266bb94..677221bd64e0 100644
> --- a/fs/ceph/locks.c
> +++ b/fs/ceph/locks.c
> @@ -249,6 +249,7 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
> {
> struct inode *inode = file_inode(file);
> struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
> struct ceph_client *cl = ceph_inode_to_client(inode);
> int err = 0;
> u16 op = CEPH_MDS_OP_SETFILELOCK;
> @@ -275,6 +276,13 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
> return -EIO;
> }
>
> + /* Wait for reset to complete before acquiring new locks */
> + if (op == CEPH_MDS_OP_SETFILELOCK && !lock_is_unlock(fl)) {
> + err = ceph_mdsc_wait_for_reset(mdsc);
> + if (err)
> + return err;
> + }
> +
> if (lock_is_read(fl))
> lock_cmd = CEPH_LOCK_SHARED;
> else if (lock_is_write(fl))
> @@ -311,6 +319,7 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
> {
> struct inode *inode = file_inode(file);
> struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
> struct ceph_client *cl = ceph_inode_to_client(inode);
> int err = 0;
> u8 wait = 0;
> @@ -330,6 +339,13 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
> return -EIO;
> }
>
> + /* Wait for reset to complete before acquiring new locks */
> + if (!lock_is_unlock(fl)) {
> + err = ceph_mdsc_wait_for_reset(mdsc);
> + if (err)
> + return err;
> + }
> +
> if (IS_SETLKW(cmd))
> wait = 1;
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index d83003acfb06..777af51ec8d8 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -6,6 +6,7 @@
> #include <linux/slab.h>
> #include <linux/gfp.h>
> #include <linux/sched.h>
> +#include <linux/delay.h>
> #include <linux/debugfs.h>
> #include <linux/seq_file.h>
> #include <linux/ratelimit.h>
> @@ -67,6 +68,7 @@ static void __wake_requests(struct ceph_mds_client *mdsc,
> struct list_head *head);
> static void ceph_cap_release_work(struct work_struct *work);
> static void ceph_cap_reclaim_work(struct work_struct *work);
> +static void ceph_mdsc_reset_workfn(struct work_struct *work);
>
> static const struct ceph_connection_operations mds_con_ops;
>
> @@ -3797,6 +3799,22 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
> struct ceph_client *cl = mdsc->fsc->client;
> int err = 0;
>
> + /*
> + * If a reset is in progress, wait for it to complete.
> + *
> + * This is best-effort: a request can pass this check just
> + * before the phase leaves IDLE and proceed concurrently with
> + * reset. That is acceptable because (a) such requests will
> + * either complete normally or fail and be retried by the
> + * caller, and (b) adding lock serialization here would
> + * penalize every request for a rare manual operation.
> + */
> + err = ceph_mdsc_wait_for_reset(mdsc);
> + if (err) {
> + doutc(cl, "wait_for_reset failed: %d\n", err);
> + return err;
> + }
> +
> /* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
> if (req->r_inode)
> ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
> @@ -5203,6 +5221,421 @@ static int send_mds_reconnect(struct ceph_mds_client *mdsc,
> return err;
> }
>
> +const char *ceph_reset_phase_name(enum ceph_client_reset_phase phase)
> +{
> + switch (phase) {
> + case CEPH_CLIENT_RESET_IDLE: return "idle";
> + case CEPH_CLIENT_RESET_QUIESCING: return "quiescing";
> + case CEPH_CLIENT_RESET_DRAINING: return "draining";
> + case CEPH_CLIENT_RESET_TEARDOWN: return "teardown";
> + default: return "unknown";
> + }
> +}
> +
> +/**
> + * ceph_mdsc_wait_for_reset - wait for an active reset to complete
> + * @mdsc: MDS client
> + *
> + * Returns 0 if reset completed successfully or no reset was active.
> + * Returns -EIO if reset completed with an error.
> + * Returns -ETIMEDOUT if we timed out waiting.
> + * Returns -ERESTARTSYS if interrupted by signal.
> + *
> + * Internal work-function errors (e.g. -ENOMEM) are not propagated
> + * to callers; they are mapped to -EIO. The detailed error is
> + * available via debugfs status and tracepoints.
> + */
> +int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
> +{
> + struct ceph_client_reset_state *st = &mdsc->reset_state;
> + struct ceph_client *cl = mdsc->fsc->client;
> + unsigned long deadline = jiffies + CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC * HZ;
> + int blocked_count;
> + long remaining;
> + long wait_ret;
> + int ret;
> +
> + if (READ_ONCE(st->phase) == CEPH_CLIENT_RESET_IDLE)
> + return 0;
> +
> + blocked_count = atomic_inc_return(&st->blocked_requests);
> + doutc(cl, "request blocked during reset, %d total blocked\n",
> + blocked_count);
> +
> +retry:
> + remaining = max_t(long, deadline - jiffies, 1);
> + wait_ret = wait_event_interruptible_timeout(st->blocked_wq,
> + READ_ONCE(st->phase) ==
> + CEPH_CLIENT_RESET_IDLE,
Maybe a static inline function for this check?
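Something like (untested; the name is just a suggestion):

static inline bool ceph_reset_is_idle(struct ceph_client_reset_state *st)
{
        return READ_ONCE(st->phase) == CEPH_CLIENT_RESET_IDLE;
}

It could then be used both for the early-return check above and for this
wait condition.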
> + remaining);
> +
> + if (wait_ret == 0) {
> + atomic_dec(&st->blocked_requests);
> + pr_warn_client(cl, "timed out waiting for reset to complete\n");
> + return -ETIMEDOUT;
> + }
> + if (wait_ret < 0) {
> + atomic_dec(&st->blocked_requests);
> + return (int)wait_ret; /* -ERESTARTSYS */
> + }
> +
> + /*
> + * Verify phase is still IDLE under the lock. If another reset
> + * was scheduled between the wake-up and this check, loop back
> + * and wait for it to finish rather than returning a stale result.
> + */
> + spin_lock(&st->lock);
> + if (st->phase != CEPH_CLIENT_RESET_IDLE) {
> + spin_unlock(&st->lock);
> + if (time_before(jiffies, deadline))
> + goto retry;
> + atomic_dec(&st->blocked_requests);
> + return -ETIMEDOUT;
> + }
> + ret = st->last_errno;
> + spin_unlock(&st->lock);
> +
> + atomic_dec(&st->blocked_requests);
> + return ret ? -EIO : 0;
ceph_mdsc_wait_for_reset() silently maps any non-zero last_errno to -EIO.
Callers seeing -EIO from open() or flock() won't be able to distinguish
"reset failed" from "session lost".
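Maybe return the recorded errno directly, e.g. (sketch):

        return ret; /* st->last_errno: -ENOMEM, -ESHUTDOWN, ... */

so callers could at least tell an unmount-time -ESHUTDOWN apart from a
generic -EIO.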
> +}
> +
> +static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret)
> +{
> + struct ceph_client_reset_state *st = &mdsc->reset_state;
> +
> + spin_lock(&st->lock);
> + /*
> + * If destroy already marked us as shut down, it owns the
> + * final bookkeeping and waiter wakeup. Just bail so we
> + * don't overwrite its state.
> + */
> + if (st->shutdown) {
> + spin_unlock(&st->lock);
> + return;
> + }
> + st->last_finish = jiffies;
> + st->last_errno = ret;
> + st->phase = CEPH_CLIENT_RESET_IDLE;
> + if (ret)
> + st->failure_count++;
> + else
> + st->success_count++;
> + spin_unlock(&st->lock);
> +
> + /* Wake up all requests that were blocked waiting for reset */
> + wake_up_all(&st->blocked_wq);
> +}
> +
> +static void ceph_mdsc_reset_workfn(struct work_struct *work)
> +{
> + struct ceph_mds_client *mdsc =
> + container_of(work, struct ceph_mds_client, reset_work);
> + struct ceph_client_reset_state *st = &mdsc->reset_state;
> + struct ceph_client *cl = mdsc->fsc->client;
> + struct ceph_mds_session **sessions = NULL;
> + char reason[CEPH_CLIENT_RESET_REASON_LEN];
> + int max_sessions, i, n = 0, torn_down = 0;
> + int ret = 0;
> +
> + spin_lock(&st->lock);
> + strscpy(reason, st->last_reason, sizeof(reason));
> + spin_unlock(&st->lock);
> +
> + mutex_lock(&mdsc->mutex);
> + max_sessions = mdsc->max_sessions;
> + if (max_sessions <= 0) {
> + mutex_unlock(&mdsc->mutex);
> + goto out_complete;
> + }
> +
> + sessions = kcalloc(max_sessions, sizeof(*sessions), GFP_KERNEL);
> + if (!sessions) {
> + mutex_unlock(&mdsc->mutex);
> + ret = -ENOMEM;
> + pr_err_client(cl,
> + "manual session reset failed to allocate session array\n");
> + ceph_mdsc_reset_complete(mdsc, ret);
> + return;
> + }
> +
> + for (i = 0; i < max_sessions; i++) {
> + struct ceph_mds_session *session = mdsc->sessions[i];
> +
> + if (!session)
> + continue;
> +
> + /*
> + * Read session state without s_mutex to avoid nesting
> + * mdsc->mutex -> s_mutex, which would invert the
> + * s_mutex -> mdsc->mutex order used by
> + * cleanup_session_requests(). s_state is an int
> + * so loads are atomic; the teardown loop below
> + * handles races with concurrent state transitions.
> + */
> + switch (READ_ONCE(session->s_state)) {
> + case CEPH_MDS_SESSION_OPEN:
> + case CEPH_MDS_SESSION_HUNG:
> + case CEPH_MDS_SESSION_OPENING:
> + case CEPH_MDS_SESSION_RESTARTING:
> + case CEPH_MDS_SESSION_RECONNECTING:
> + case CEPH_MDS_SESSION_CLOSING:
> + sessions[n++] = ceph_get_mds_session(session);
> + break;
> + default:
> + pr_info_client(cl,
> + "mds%d in state %s, skipping reset\n",
> + session->s_mds,
> + ceph_session_state_name(session->s_state));
> + break;
> + }
> + }
> + mutex_unlock(&mdsc->mutex);
> +
> + pr_info_client(cl,
> + "manual session reset executing (sessions=%d, reason=\"%s\")\n",
> + n, reason);
> +
> + if (n == 0) {
> + kfree(sessions);
> + goto out_complete;
> + }
> +
> + spin_lock(&st->lock);
> + if (st->shutdown) {
> + spin_unlock(&st->lock);
> + goto out_sessions;
The out_sessions path silently skips ceph_mdsc_reset_complete(). Is this
logic always correct?
> + }
> + st->phase = CEPH_CLIENT_RESET_DRAINING;
> + spin_unlock(&st->lock);
> +
> + /*
> + * Best-effort drain: flush dirty state while sessions are still
> + * alive. New requests are blocked while phase != IDLE.
> + * The sessions are functional, so non-stuck state drains normally.
> + * Stuck state (the cause of the stalemate the operator is trying
> + * to break) will not drain -- that is expected, and we proceed to
> + * forced teardown after the timeout.
> + *
> + * Three things are kicked off:
> + * 1. MDS journal -- send_flush_mdlog asks each MDS to journal
> + * pending unsafe operations (creates, renames, setattrs).
> + * This is best-effort: we do not wait for individual unsafe
> + * requests to reach safe status. Non-stuck ops typically
> + * complete within the bounded wait window below; stuck ops
> + * will not, and are force-dropped during teardown.
> + * 2. Dirty caps -- ceph_flush_dirty_caps triggers cap flush on
> + * all sessions. Non-stuck caps flush in milliseconds.
> + * 3. Cap releases -- push pending cap release messages.
> + *
> + * The cap-flush wait below provides the bounded drain window
> + * during which all three categories can make progress.
> + */
> + for (i = 0; i < n; i++)
> + send_flush_mdlog(sessions[i]);
> +
> + ceph_flush_dirty_caps(mdsc);
> + ceph_flush_cap_releases(mdsc);
> +
> + spin_lock(&mdsc->cap_dirty_lock);
> + if (!list_empty(&mdsc->cap_flush_list)) {
> + struct ceph_cap_flush *cf =
Why not declare the variable on one line and then assign it on another line?
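I.e. something like:

        struct ceph_cap_flush *cf;

        cf = list_last_entry(&mdsc->cap_flush_list,
                             struct ceph_cap_flush, g_list);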
> + list_last_entry(&mdsc->cap_flush_list,
> + struct ceph_cap_flush, g_list);
> + u64 want_flush = mdsc->last_cap_flush_tid;
> + long drain_ret;
> +
> + /*
> + * Setting wake on the last entry is sufficient: flush
> + * entries complete in order, so when this entry finishes
> + * all earlier ones are already done.
> + */
> + cf->wake = true;
> + spin_unlock(&mdsc->cap_dirty_lock);
> + pr_info_client(cl,
> + "draining (want_flush=%llu, %d sessions)\n",
> + want_flush, n);
> + drain_ret = wait_event_timeout(mdsc->cap_flushing_wq,
> + check_caps_flush(mdsc,
> + want_flush),
> + CEPH_CLIENT_RESET_DRAIN_SEC * HZ);
> + if (drain_ret == 0) {
> + pr_info_client(cl,
> + "drain timed out, proceeding with forced teardown\n");
> + spin_lock(&st->lock);
> + st->drain_timed_out = true;
Do we really need to use spin_lock() here? Could WRITE_ONCE() be enough for
changing one field?
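E.g. (sketch):

        WRITE_ONCE(st->drain_timed_out, true);

paired with READ_ONCE() on the reader side.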
> + spin_unlock(&st->lock);
> + } else {
> + pr_info_client(cl, "drain completed successfully\n");
> + spin_lock(&st->lock);
> + st->drain_timed_out = false;
Ditto.
> + spin_unlock(&st->lock);
> + }
> + } else {
> + spin_unlock(&mdsc->cap_dirty_lock);
> + spin_lock(&st->lock);
> + st->drain_timed_out = false;
Ditto.
> + spin_unlock(&st->lock);
> + }
> +
> + spin_lock(&st->lock);
> + if (st->shutdown) {
> + spin_unlock(&st->lock);
> + goto out_sessions;
> + }
> + st->phase = CEPH_CLIENT_RESET_TEARDOWN;
> + spin_unlock(&st->lock);
> +
> + /*
> + * Ask each MDS to close the session before we tear it down
> + * locally. Without this the MDS sees only a connection drop and
> + * waits for the client to reconnect (up to session_autoclose
> + * seconds) before evicting the session and releasing locks.
> + *
> + * Reuse the normal close machinery so the session state/sequence
> + * snapshot is serialized under s_mutex and a racing s_seq bump
> + * retransmits REQUEST_CLOSE while the session remains CLOSING.
> + * We send all close requests first, then yield briefly to let the
> + * network stack transmit them before __unregister_session()
> + * closes the connections.
> + */
> + for (i = 0; i < n; i++) {
> + int err;
> +
> + mutex_lock(&sessions[i]->s_mutex);
> + err = __close_session(mdsc, sessions[i]);
> + mutex_unlock(&sessions[i]->s_mutex);
> + if (err < 0)
> + pr_warn_client(cl,
> + "mds%d failed to queue close request before reset: %d\n",
> + sessions[i]->s_mds, err);
> + }
> + /*
> + * Best-effort grace period: yield briefly so the network stack
> + * can transmit the queued REQUEST_CLOSE messages before we tear
> + * down connections. Not a correctness requirement -- the MDS
> + * will still evict via session_autoclose if it never receives
> + * the close request.
> + */
> + if (n > 0)
> + msleep(CEPH_CLIENT_RESET_CLOSE_GRACE_MS);
I don't like using msleep() here. Can we wait on some event instead?
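For example, something like (untested sketch; all_sessions_closed() would be
a new helper that re-checks the snapshotted sessions' states):

        wait_event_timeout(mdsc->session_close_wq,
                           all_sessions_closed(mdsc, sessions, n),
                           msecs_to_jiffies(CEPH_CLIENT_RESET_CLOSE_GRACE_MS));

That way we stop sleeping as soon as the close acks actually arrive.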
> +
> + /*
> + * Tear down each session: close the connection, remove all
> + * caps, clean up requests, then kick pending requests so they
> + * re-open a fresh session on the next attempt.
> + *
> + * This is modeled on the check_new_map() forced-close path
> + * for stopped MDS ranks - a proven pattern for hard session
> + * teardown. We do NOT attempt send_mds_reconnect() because
> + * the MDS only accepts reconnects during its own RECONNECT
> + * phase (after MDS restart), not from an active client.
> + *
> + * Any state that did not drain (caps that didn't flush, unsafe
> + * requests that the MDS didn't journal) is force-dropped here.
> + * This is intentional: that state is stuck and is the reason
> + * the operator triggered the reset.
> + */
> + for (i = 0; i < n; i++) {
> + int mds = sessions[i]->s_mds;
> +
> + pr_info_client(cl, "mds%d resetting session\n", mds);
> +
> + mutex_lock(&mdsc->mutex);
> + if (mds >= mdsc->max_sessions ||
> + mdsc->sessions[mds] != sessions[i]) {
> + pr_info_client(cl,
> + "mds%d session already torn down, skipping\n",
> + mds);
> + mutex_unlock(&mdsc->mutex);
> + ceph_put_mds_session(sessions[i]);
If I understood correctly, ceph_put_mds_session() could free the session that
sessions[i] points to. Could we have a use-after-free issue here? Should we
set sessions[i] = NULL here?
> + continue;
> + }
> + sessions[i]->s_state = CEPH_MDS_SESSION_CLOSED;
> + __unregister_session(mdsc, sessions[i]);
> + __wake_requests(mdsc, &sessions[i]->s_waiting);
> + mutex_unlock(&mdsc->mutex);
> +
> + mutex_lock(&sessions[i]->s_mutex);
> + cleanup_session_requests(mdsc, sessions[i]);
> + remove_session_caps(sessions[i]);
> + mutex_unlock(&sessions[i]->s_mutex);
> +
> + wake_up_all(&mdsc->session_close_wq);
> +
> + ceph_put_mds_session(sessions[i]);
> +
> + mutex_lock(&mdsc->mutex);
> + kick_requests(mdsc, mds);
> + mutex_unlock(&mdsc->mutex);
> +
> + torn_down++;
> + pr_info_client(cl, "mds%d session reset complete\n", mds);
> + }
> +
> + kfree(sessions);
> +
> + spin_lock(&st->lock);
> + st->sessions_reset = torn_down;
> + spin_unlock(&st->lock);
> +
> +out_complete:
> + ceph_mdsc_reset_complete(mdsc, ret);
> + return;
> +
> +out_sessions:
> + for (i = 0; i < n; i++)
> + ceph_put_mds_session(sessions[i]);
> + kfree(sessions);
> +}
> +
> +int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
> + const char *reason)
> +{
> + struct ceph_client_reset_state *st = &mdsc->reset_state;
> + struct ceph_fs_client *fsc = mdsc->fsc;
> + const char *msg = (reason && reason[0]) ? reason : "manual";
> + int mount_state;
> +
> + mount_state = READ_ONCE(fsc->mount_state);
> + if (mount_state != CEPH_MOUNT_MOUNTED) {
> + pr_warn_client(fsc->client,
> + "reset rejected: mount_state=%d (not mounted)\n",
> + mount_state);
> + return -EINVAL;
> + }
> +
> + spin_lock(&st->lock);
> + if (st->phase != CEPH_CLIENT_RESET_IDLE) {
> + spin_unlock(&st->lock);
> + return -EBUSY;
> + }
> +
> + st->phase = CEPH_CLIENT_RESET_QUIESCING;
> + st->last_start = jiffies;
> + st->last_errno = 0;
> + st->drain_timed_out = false;
> + st->sessions_reset = 0;
> + st->trigger_count++;
> + strscpy(st->last_reason, msg, sizeof(st->last_reason));
> + spin_unlock(&st->lock);
> +
> + if (WARN_ON_ONCE(!queue_work(system_unbound_wq, &mdsc->reset_work))) {
> + spin_lock(&st->lock);
> + st->phase = CEPH_CLIENT_RESET_IDLE;
> + st->last_errno = -EALREADY;
> + st->last_finish = jiffies;
> + st->failure_count++;
> + spin_unlock(&st->lock);
> + wake_up_all(&st->blocked_wq);
> + return -EALREADY;
> + }
> +
> + pr_info_client(mdsc->fsc->client,
> + "manual session reset scheduled (reason=\"%s\")\n",
> + msg);
> + return 0;
> +}
> +
>
> /*
> * compare old and new mdsmaps, kicking requests
> @@ -5742,6 +6175,11 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
> INIT_LIST_HEAD(&mdsc->dentry_leases);
> INIT_LIST_HEAD(&mdsc->dentry_dir_leases);
>
> + spin_lock_init(&mdsc->reset_state.lock);
> + init_waitqueue_head(&mdsc->reset_state.blocked_wq);
> + atomic_set(&mdsc->reset_state.blocked_requests, 0);
> + INIT_WORK(&mdsc->reset_work, ceph_mdsc_reset_workfn);
> +
> ceph_caps_init(mdsc);
> ceph_adjust_caps_max_min(mdsc, fsc->mount_options);
>
> @@ -6267,6 +6705,23 @@ void ceph_mdsc_destroy(struct ceph_fs_client *fsc)
> /* flush out any connection work with references to us */
> ceph_msgr_flush();
>
> + /*
> + * Mark reset as failed and wake any blocked waiters before
> + * cancelling, so unmount doesn't stall on blocked_wq timeout
> + * if cancel_work_sync() prevents the work from running.
> + */
> + spin_lock(&mdsc->reset_state.lock);
> + mdsc->reset_state.shutdown = true;
> + if (mdsc->reset_state.phase != CEPH_CLIENT_RESET_IDLE) {
> + mdsc->reset_state.phase = CEPH_CLIENT_RESET_IDLE;
> + mdsc->reset_state.last_errno = -ESHUTDOWN;
> + mdsc->reset_state.last_finish = jiffies;
> + mdsc->reset_state.failure_count++;
> + }
> + spin_unlock(&mdsc->reset_state.lock);
> + wake_up_all(&mdsc->reset_state.blocked_wq);
> +
> + cancel_work_sync(&mdsc->reset_work);
> ceph_mdsc_stop(mdsc);
>
> ceph_metric_destroy(&mdsc->metric);
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index e91a199d56fd..afc08b0abbd5 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -74,6 +74,42 @@ struct ceph_fs_client;
> struct ceph_cap;
>
> #define MDS_AUTH_UID_ANY -1
> +#define CEPH_CLIENT_RESET_REASON_LEN 64
> +#define CEPH_CLIENT_RESET_DRAIN_SEC 5
Probably, this value is too short for production. Five seconds to flush dirty
caps across all sessions under any meaningful write load is very tight. The
existing wait_caps_flush() has no timeout at all. Maybe 30-60 seconds would be
more useful?
> +#define CEPH_CLIENT_RESET_CLOSE_GRACE_MS 100
> +#define CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC 120
I think we need to collect all timeout declarations in one place.
> +
> +enum ceph_client_reset_phase {
> + CEPH_CLIENT_RESET_IDLE = 0,
> + /*
> + * QUIESCING is set synchronously by schedule_reset() before the
> + * workqueue item is dispatched. It gates new requests (any
> + * phase != IDLE blocks callers) during the window between
> + * scheduling and the work function's transition to DRAINING.
> + */
> + CEPH_CLIENT_RESET_QUIESCING,
> + CEPH_CLIENT_RESET_DRAINING,
> + CEPH_CLIENT_RESET_TEARDOWN,
> +};
> +
> +struct ceph_client_reset_state {
> + spinlock_t lock;
> + u64 trigger_count;
> + u64 success_count;
> + u64 failure_count;
> + unsigned long last_start;
> + unsigned long last_finish;
> + int last_errno;
> + enum ceph_client_reset_phase phase;
> + bool drain_timed_out;
> + bool shutdown;
> + int sessions_reset;
> + char last_reason[CEPH_CLIENT_RESET_REASON_LEN];
> +
> + /* Request blocking during reset */
> + wait_queue_head_t blocked_wq;
> + atomic_t blocked_requests;
> +};
This structure is big enough that all of its fields should be commented.
Thanks,
Slava.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v3 06/11] ceph: add manual reset debugfs control and tracepoints
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
` (4 preceding siblings ...)
2026-04-29 12:52 ` [PATCH v3 05/11] ceph: add client reset state machine and session teardown Alex Markuze
@ 2026-04-29 12:52 ` Alex Markuze
2026-04-30 18:38 ` [EXTERNAL] " Viacheslav Dubeyko
2026-04-29 12:52 ` [PATCH v3 07/11] selftests: ceph: add reset consistency checker Alex Markuze
` (4 subsequent siblings)
10 siblings, 1 reply; 17+ messages in thread
From: Alex Markuze @ 2026-04-29 12:52 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Add the debugfs and trace plumbing used to trigger and observe
manual client reset.
The reset interface exposes a trigger file for operator-initiated
reset and a status file for tracking the most recent run. The
tracepoints record scheduling, completion, and blocked caller
behavior so reset progress can be diagnosed from the client side.
debugfs layout under /sys/kernel/debug/ceph/<client>/reset/:
trigger - write to initiate a manual reset
status - read to see the most recent reset result
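As a usage sketch (the client id is a placeholder, and the status
values shown are illustrative):

  # echo "operator-requested reset" > \
        /sys/kernel/debug/ceph/<client>/reset/trigger
  # cat /sys/kernel/debug/ceph/<client>/reset/status
  phase: idle
  trigger_count: 1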
The reset directory is cleaned up via debugfs_remove_recursive()
on the parent, so individual file dentries are not stored.
Tracepoints:
ceph_client_reset_schedule - reset queued
ceph_client_reset_complete - reset finished (success or failure)
ceph_client_reset_blocked - caller blocked waiting for reset
ceph_client_reset_unblocked - caller unblocked after reset
All tracepoints use a null-safe access for monc.auth->global_id
to guard against early-init or late-teardown edge cases.
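As a minimal observation sketch, assuming the events are registered
under the "ceph" trace system (per include/trace/events/ceph.h):

  # cd /sys/kernel/debug/tracing
  # echo 1 > events/ceph/ceph_client_reset_schedule/enable
  # echo 1 > events/ceph/ceph_client_reset_complete/enable
  # cat trace_pipe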
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/debugfs.c | 102 ++++++++++++++++++++++++++++++++++++
fs/ceph/mds_client.c | 8 +++
fs/ceph/super.h | 1 +
include/trace/events/ceph.h | 67 +++++++++++++++++++++++
4 files changed, 178 insertions(+)
diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index 7dc307790240..beee4cfe8b18 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -9,6 +9,7 @@
#include <linux/seq_file.h>
#include <linux/math64.h>
#include <linux/ktime.h>
+#include <linux/uaccess.h>
#include <linux/ceph/libceph.h>
#include <linux/ceph/mon_client.h>
@@ -360,16 +361,107 @@ static int status_show(struct seq_file *s, void *p)
return 0;
}
+static int reset_status_show(struct seq_file *s, void *p)
+{
+ struct ceph_fs_client *fsc = s->private;
+ struct ceph_mds_client *mdsc = fsc->mdsc;
+ struct ceph_client_reset_state *st;
+ u64 trigger = 0, success = 0, failure = 0;
+ unsigned long last_start = 0, last_finish = 0;
+ int last_errno = 0;
+ enum ceph_client_reset_phase phase = CEPH_CLIENT_RESET_IDLE;
+ bool drain_timed_out = false;
+ int sessions_reset = 0;
+ int blocked_requests = 0;
+ char reason[CEPH_CLIENT_RESET_REASON_LEN];
+
+ if (!mdsc)
+ return 0;
+
+ st = &mdsc->reset_state;
+
+ spin_lock(&st->lock);
+ trigger = st->trigger_count;
+ success = st->success_count;
+ failure = st->failure_count;
+ last_start = st->last_start;
+ last_finish = st->last_finish;
+ last_errno = st->last_errno;
+ phase = st->phase;
+ drain_timed_out = st->drain_timed_out;
+ sessions_reset = st->sessions_reset;
+ strscpy(reason, st->last_reason, sizeof(reason));
+ spin_unlock(&st->lock);
+
+ blocked_requests = atomic_read(&st->blocked_requests);
+
+ seq_printf(s, "phase: %s\n", ceph_reset_phase_name(phase));
+ seq_printf(s, "trigger_count: %llu\n", trigger);
+ seq_printf(s, "success_count: %llu\n", success);
+ seq_printf(s, "failure_count: %llu\n", failure);
+ if (last_start)
+ seq_printf(s, "last_start_ms_ago: %u\n",
+ jiffies_to_msecs(jiffies - last_start));
+ else
+ seq_puts(s, "last_start_ms_ago: (never)\n");
+ if (last_finish)
+ seq_printf(s, "last_finish_ms_ago: %u\n",
+ jiffies_to_msecs(jiffies - last_finish));
+ else
+ seq_puts(s, "last_finish_ms_ago: (never)\n");
+ seq_printf(s, "last_errno: %d\n", last_errno);
+ seq_printf(s, "last_reason: %s\n",
+ reason[0] ? reason : "(none)");
+ seq_printf(s, "drain_timed_out: %s\n",
+ drain_timed_out ? "yes" : "no");
+ seq_printf(s, "sessions_reset: %d\n", sessions_reset);
+ seq_printf(s, "blocked_requests: %d\n", blocked_requests);
+
+ return 0;
+}
+
+static ssize_t reset_trigger_write(struct file *file, const char __user *buf,
+ size_t len, loff_t *ppos)
+{
+ struct ceph_fs_client *fsc = file->private_data;
+ struct ceph_mds_client *mdsc = fsc->mdsc;
+ char reason[CEPH_CLIENT_RESET_REASON_LEN];
+ size_t copy;
+ int ret;
+
+ if (!mdsc)
+ return -ENODEV;
+
+ copy = min_t(size_t, len, sizeof(reason) - 1);
+ if (copy && copy_from_user(reason, buf, copy))
+ return -EFAULT;
+ reason[copy] = '\0';
+ strim(reason);
+
+ ret = ceph_mdsc_schedule_reset(mdsc, reason);
+ if (ret)
+ return ret;
+
+ return len;
+}
+
DEFINE_SHOW_ATTRIBUTE(mdsmap);
DEFINE_SHOW_ATTRIBUTE(mdsc);
DEFINE_SHOW_ATTRIBUTE(caps);
DEFINE_SHOW_ATTRIBUTE(mds_sessions);
DEFINE_SHOW_ATTRIBUTE(status);
+DEFINE_SHOW_ATTRIBUTE(reset_status);
DEFINE_SHOW_ATTRIBUTE(metrics_file);
DEFINE_SHOW_ATTRIBUTE(metrics_latency);
DEFINE_SHOW_ATTRIBUTE(metrics_size);
DEFINE_SHOW_ATTRIBUTE(metrics_caps);
+static const struct file_operations ceph_reset_trigger_fops = {
+ .owner = THIS_MODULE,
+ .open = simple_open,
+ .write = reset_trigger_write,
+ .llseek = noop_llseek,
+};
/*
* debugfs
@@ -404,6 +496,7 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
debugfs_remove(fsc->debugfs_caps);
debugfs_remove(fsc->debugfs_status);
debugfs_remove(fsc->debugfs_mdsc);
+ debugfs_remove_recursive(fsc->debugfs_reset_dir);
debugfs_remove_recursive(fsc->debugfs_metrics_dir);
doutc(fsc->client, "done\n");
}
@@ -451,6 +544,15 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
fsc,
&caps_fops);
+ fsc->debugfs_reset_dir = debugfs_create_dir("reset",
+ fsc->client->debugfs_dir);
+ debugfs_create_file("trigger", 0200,
+ fsc->debugfs_reset_dir, fsc,
+ &ceph_reset_trigger_fops);
+ debugfs_create_file("status", 0400,
+ fsc->debugfs_reset_dir, fsc,
+ &reset_status_fops);
+
fsc->debugfs_status = debugfs_create_file("status",
0400,
fsc->client->debugfs_dir,
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 777af51ec8d8..8339c2c72f9a 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -5261,6 +5261,7 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
blocked_count = atomic_inc_return(&st->blocked_requests);
doutc(cl, "request blocked during reset, %d total blocked\n",
blocked_count);
+ trace_ceph_client_reset_blocked(mdsc, blocked_count);
retry:
remaining = max_t(long, deadline - jiffies, 1);
@@ -5272,10 +5273,12 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
if (wait_ret == 0) {
atomic_dec(&st->blocked_requests);
pr_warn_client(cl, "timed out waiting for reset to complete\n");
+ trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
return -ETIMEDOUT;
}
if (wait_ret < 0) {
atomic_dec(&st->blocked_requests);
+ trace_ceph_client_reset_unblocked(mdsc, (int)wait_ret);
return (int)wait_ret; /* -ERESTARTSYS */
}
@@ -5290,12 +5293,14 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
if (time_before(jiffies, deadline))
goto retry;
atomic_dec(&st->blocked_requests);
+ trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
return -ETIMEDOUT;
}
ret = st->last_errno;
spin_unlock(&st->lock);
atomic_dec(&st->blocked_requests);
+ trace_ceph_client_reset_unblocked(mdsc, ret);
return ret ? -EIO : 0;
}
@@ -5324,6 +5329,8 @@ static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret)
/* Wake up all requests that were blocked waiting for reset */
wake_up_all(&st->blocked_wq);
+
+ trace_ceph_client_reset_complete(mdsc, ret);
}
static void ceph_mdsc_reset_workfn(struct work_struct *work)
@@ -5633,6 +5640,7 @@ int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
pr_info_client(mdsc->fsc->client,
"manual session reset scheduled (reason=\"%s\")\n",
msg);
+ trace_ceph_client_reset_schedule(mdsc, msg);
return 0;
}
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 9aca42c89ea0..5bf976b6c4fe 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -179,6 +179,7 @@ struct ceph_fs_client {
struct dentry *debugfs_status;
struct dentry *debugfs_mds_sessions;
struct dentry *debugfs_metrics_dir;
+ struct dentry *debugfs_reset_dir;
#endif
#ifdef CONFIG_CEPH_FSCACHE
diff --git a/include/trace/events/ceph.h b/include/trace/events/ceph.h
index 08cb0659fbfc..1b990632f62b 100644
--- a/include/trace/events/ceph.h
+++ b/include/trace/events/ceph.h
@@ -226,6 +226,73 @@ TRACE_EVENT(ceph_handle_caps,
__entry->mseq)
);
+/*
+ * Client reset tracepoints - identify the client by its monitor-
+ * assigned global_id so traces remain meaningful when kernel pointer
+ * hashing is enabled.
+ */
+TRACE_EVENT(ceph_client_reset_schedule,
+ TP_PROTO(const struct ceph_mds_client *mdsc, const char *reason),
+ TP_ARGS(mdsc, reason),
+ TP_STRUCT__entry(
+ __field(u64, client_id)
+ __string(reason, reason ? reason : "")
+ ),
+ TP_fast_assign(
+ __entry->client_id = mdsc->fsc->client->monc.auth ?
+ mdsc->fsc->client->monc.auth->global_id : 0;
+ __assign_str(reason);
+ ),
+ TP_printk("client_id=%llu reason=%s",
+ __entry->client_id, __get_str(reason))
+);
+
+TRACE_EVENT(ceph_client_reset_complete,
+ TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+ TP_ARGS(mdsc, ret),
+ TP_STRUCT__entry(
+ __field(u64, client_id)
+ __field(int, ret)
+ ),
+ TP_fast_assign(
+ __entry->client_id = mdsc->fsc->client->monc.auth ?
+ mdsc->fsc->client->monc.auth->global_id : 0;
+ __entry->ret = ret;
+ ),
+ TP_printk("client_id=%llu ret=%d", __entry->client_id, __entry->ret)
+);
+
+TRACE_EVENT(ceph_client_reset_blocked,
+ TP_PROTO(const struct ceph_mds_client *mdsc, int blocked_count),
+ TP_ARGS(mdsc, blocked_count),
+ TP_STRUCT__entry(
+ __field(u64, client_id)
+ __field(int, blocked_count)
+ ),
+ TP_fast_assign(
+ __entry->client_id = mdsc->fsc->client->monc.auth ?
+ mdsc->fsc->client->monc.auth->global_id : 0;
+ __entry->blocked_count = blocked_count;
+ ),
+ TP_printk("client_id=%llu blocked_count=%d", __entry->client_id,
+ __entry->blocked_count)
+);
+
+TRACE_EVENT(ceph_client_reset_unblocked,
+ TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+ TP_ARGS(mdsc, ret),
+ TP_STRUCT__entry(
+ __field(u64, client_id)
+ __field(int, ret)
+ ),
+ TP_fast_assign(
+ __entry->client_id = mdsc->fsc->client->monc.auth ?
+ mdsc->fsc->client->monc.auth->global_id : 0;
+ __entry->ret = ret;
+ ),
+ TP_printk("client_id=%llu ret=%d", __entry->client_id, __entry->ret)
+);
+
#undef EM
#undef E_
#endif /* _TRACE_CEPH_H */
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [EXTERNAL] [PATCH v3 06/11] ceph: add manual reset debugfs control and tracepoints
2026-04-29 12:52 ` [PATCH v3 06/11] ceph: add manual reset debugfs control and tracepoints Alex Markuze
@ 2026-04-30 18:38 ` Viacheslav Dubeyko
0 siblings, 0 replies; 17+ messages in thread
From: Viacheslav Dubeyko @ 2026-04-30 18:38 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Wed, 2026-04-29 at 12:52 +0000, Alex Markuze wrote:
> @@ -404,6 +496,7 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
> debugfs_remove(fsc->debugfs_caps);
> debugfs_remove(fsc->debugfs_status);
> debugfs_remove(fsc->debugfs_mdsc);
> + debugfs_remove_recursive(fsc->debugfs_reset_dir);
I started having trouble applying the patches from the 3rd one onward. And the
latest kernel version contains:
debugfs_remove(fsc->debugfs_subvolume_metrics);
So, the patchset needs to be rebased on the latest state of the CephFS kernel
client.
Thanks,
Slava.
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v3 07/11] selftests: ceph: add reset consistency checker
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
` (5 preceding siblings ...)
2026-04-29 12:52 ` [PATCH v3 06/11] ceph: add manual reset debugfs control and tracepoints Alex Markuze
@ 2026-04-29 12:52 ` Alex Markuze
2026-04-29 12:52 ` [PATCH v3 08/11] selftests: ceph: add reset stress test Alex Markuze
` (3 subsequent siblings)
10 siblings, 0 replies; 17+ messages in thread
From: Alex Markuze @ 2026-04-29 12:52 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Add a Python post-run validator for the CephFS client reset stress
test. The script reads data files written by the stress runner and
checks that every file was either written completely or is missing,
with no partial or corrupted content.
This is a prerequisite for the stress test script, which invokes it after
each run.
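A standalone invocation looks like this (all paths below are examples):

  $ ./validate_consistency.py \
        --data-dir /mnt/cephfs/ceph_reset_stress_run \
        --file-count 64 \
        --io-log io.log \
        --rename-log rename.log \
        --reset-log reset.log \
        --status-final reset_status.final \
        --expect-reset \
        --report-json report.json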
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
.../filesystems/ceph/validate_consistency.py | 297 ++++++++++++++++++
1 file changed, 297 insertions(+)
create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consistency.py
diff --git a/tools/testing/selftests/filesystems/ceph/validate_consistency.py b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
new file mode 100755
index 000000000000..c230a59bdb3a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
@@ -0,0 +1,297 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import argparse
+import bisect
+import hashlib
+import json
+import os
+from pathlib import Path
+
+
+def sha256_file(path: Path) -> str:
+ digest = hashlib.sha256()
+ with path.open("rb") as handle:
+ while True:
+ chunk = handle.read(1 << 20)
+ if not chunk:
+ break
+ digest.update(chunk)
+ return digest.hexdigest()
+
+
+def parse_io_log(path: Path):
+ records = []
+ if not path.exists():
+ return records
+ with path.open("r", encoding="utf-8") as handle:
+ for line_no, line in enumerate(handle, 1):
+ line = line.strip()
+ if not line:
+ continue
+ parts = line.split(",")
+ if len(parts) != 5:
+ raise ValueError(f"io log line {line_no}: expected 5 columns, got {len(parts)}")
+ ts_ms, seq, logical_id, relpath, digest = parts
+ records.append(
+ {
+ "ts_ms": int(ts_ms),
+ "seq": int(seq),
+ "logical_id": int(logical_id),
+ "relpath": relpath,
+ "digest": digest,
+ }
+ )
+ return records
+
+
+def parse_rename_log(path: Path):
+ records = []
+ if not path.exists():
+ return records
+ with path.open("r", encoding="utf-8") as handle:
+ for line_no, line in enumerate(handle, 1):
+ line = line.strip()
+ if not line:
+ continue
+ parts = line.split(",")
+ if len(parts) == 6:
+ ts_ms, seq, logical_id, src_rel, dst_rel, rc = parts
+ elif len(parts) == 7:
+ ts_ms, worker_id, seq, logical_id, src_rel, dst_rel, rc = parts
+ _ = worker_id # worker id is informational only
+ else:
+ raise ValueError(
+ f"rename log line {line_no}: expected 6 or 7 columns, got {len(parts)}"
+ )
+ records.append(
+ {
+ "ts_ms": int(ts_ms),
+ "seq": int(seq),
+ "logical_id": int(logical_id),
+ "src_rel": src_rel,
+ "dst_rel": dst_rel,
+ "rc": int(rc),
+ }
+ )
+ return records
+
+
+def parse_reset_log(path: Path):
+ records = []
+ if not path.exists():
+ return records
+ with path.open("r", encoding="utf-8") as handle:
+ for line_no, line in enumerate(handle, 1):
+ line = line.strip()
+ if not line:
+ continue
+ parts = line.split(",")
+ if len(parts) != 4:
+ raise ValueError(f"reset log line {line_no}: expected 4 columns, got {len(parts)}")
+ ts_ms, seq, reason, rc = parts
+ records.append(
+ {
+ "ts_ms": int(ts_ms),
+ "seq": int(seq),
+ "reason": reason,
+ "rc": int(rc),
+ }
+ )
+ return records
+
+
+def parse_status_file(path: Path):
+ status = {}
+ if not path.exists():
+ return status
+ with path.open("r", encoding="utf-8") as handle:
+ for line in handle:
+ line = line.strip()
+ if not line or ":" not in line:
+ continue
+ key, value = line.split(":", 1)
+ status[key.strip()] = value.strip()
+ return status
+
+
+def to_int(value: str, default: int = 0):
+ try:
+ return int(value)
+ except Exception:
+ return default
+
+
+def validate_namespace(data_dir: Path, file_count: int, issues):
+ actual_locations = {}
+ actual_paths = {}
+ for logical_id in range(file_count):
+ name = f"file_{logical_id:05d}"
+ found = []
+ for subdir in ("A", "B"):
+ candidate = data_dir / subdir / name
+ if candidate.exists():
+ found.append((subdir, candidate))
+ if len(found) != 1:
+ issues.append(
+ f"namespace invariant failed for logical_id={logical_id:05d}: expected exactly one file in A/B, found {len(found)}"
+ )
+ continue
+ actual_locations[logical_id] = found[0][0]
+ actual_paths[logical_id] = found[0][1]
+ return actual_locations, actual_paths
+
+
+def validate_rename_invariant(rename_records, actual_locations, issues):
+ expected_locations = {}
+ for rec in rename_records:
+ if rec["rc"] != 0:
+ continue
+ dst = rec["dst_rel"]
+ if "/" not in dst:
+ continue
+ expected_locations[rec["logical_id"]] = dst.split("/", 1)[0]
+
+ for logical_id, expected in expected_locations.items():
+ actual = actual_locations.get(logical_id)
+ if actual is None:
+ continue
+ if actual != expected:
+ issues.append(
+ f"rename invariant failed for logical_id={logical_id:05d}: expected location={expected}, actual={actual}"
+ )
+
+
+def validate_data_invariant(io_records, actual_paths, issues):
+ expected_hash = {}
+ for rec in io_records:
+ digest = rec["digest"]
+ if not digest:
+ continue
+ expected_hash[rec["logical_id"]] = digest
+
+ for logical_id, digest in expected_hash.items():
+ path = actual_paths.get(logical_id)
+ if path is None:
+ continue
+ actual_digest = sha256_file(path)
+ if digest != actual_digest:
+ issues.append(
+ f"data invariant failed for logical_id={logical_id:05d}: expected digest={digest}, actual digest={actual_digest}"
+ )
+
+
+def validate_reset_and_slo(args, reset_records, io_records, rename_records, status, issues):
+ if not args.expect_reset:
+ return
+
+ successful_reset_times = [rec["ts_ms"] for rec in reset_records if rec["rc"] == 0]
+ if not successful_reset_times:
+ issues.append("expected reset activity but no successful reset trigger was observed")
+
+ phase = status.get("phase")
+ blocked_requests = to_int(status.get("blocked_requests", "0"), default=-1)
+ last_errno = to_int(status.get("last_errno", "0"), default=1)
+ failure_count = to_int(status.get("failure_count", "0"), default=-1)
+
+ if phase is None:
+ issues.append("missing final reset status file or phase field")
+ elif phase.lower() != "idle":
+ issues.append(f"recovery invariant failed: phase={phase}, expected idle")
+
+ if blocked_requests != 0:
+ issues.append(f"recovery invariant failed: blocked_requests={blocked_requests}, expected 0")
+ if last_errno != 0:
+ issues.append(f"recovery invariant failed: last_errno={last_errno}, expected 0")
+ if failure_count > 0:
+ issues.append(
+ f"recovery invariant failed: failure_count={failure_count}, "
+ "one or more resets failed during the run"
+ )
+
+ op_times = [rec["ts_ms"] for rec in io_records]
+ op_times.extend(rec["ts_ms"] for rec in rename_records if rec["rc"] == 0)
+ op_times.sort()
+
+ if successful_reset_times and not op_times:
+ issues.append("recovery SLO failed: no workload completion events were recorded")
+ return
+
+ slo_ms = args.slo_seconds * 1000
+ for ts in successful_reset_times:
+ idx = bisect.bisect_left(op_times, ts)
+ if idx >= len(op_times):
+ issues.append(f"recovery SLO failed: no operation completion observed after reset at ts_ms={ts}")
+ continue
+ delta = op_times[idx] - ts
+ if delta > slo_ms:
+ issues.append(
+ f"recovery SLO failed: first post-reset completion at {delta}ms exceeds threshold {slo_ms}ms (reset ts_ms={ts})"
+ )
+
+
+def main():
+ parser = argparse.ArgumentParser(description="Validate Ceph reset stress artifacts")
+ parser.add_argument("--data-dir", required=True)
+ parser.add_argument("--file-count", required=True, type=int)
+ parser.add_argument("--io-log", required=True)
+ parser.add_argument("--rename-log", required=True)
+ parser.add_argument("--reset-log", required=True)
+ parser.add_argument("--status-final", required=False, default="")
+ parser.add_argument("--slo-seconds", required=False, type=int, default=30)
+ parser.add_argument("--expect-reset", action="store_true")
+ parser.add_argument("--report-json", required=False, default="")
+ args = parser.parse_args()
+
+ data_dir = Path(args.data_dir)
+ io_log = Path(args.io_log)
+ rename_log = Path(args.rename_log)
+ reset_log = Path(args.reset_log)
+ status_final = Path(args.status_final) if args.status_final else Path("__missing_status__")
+
+ issues = []
+
+ if not data_dir.exists():
+ issues.append(f"data directory is missing: {data_dir}")
+
+ try:
+ io_records = parse_io_log(io_log)
+ rename_records = parse_rename_log(rename_log)
+ reset_records = parse_reset_log(reset_log)
+ except Exception as exc:
+ issues.append(f"log parsing failed: {exc}")
+ io_records = []
+ rename_records = []
+ reset_records = []
+
+ status = parse_status_file(status_final)
+
+ actual_locations, actual_paths = validate_namespace(data_dir, args.file_count, issues)
+ validate_rename_invariant(rename_records, actual_locations, issues)
+ validate_data_invariant(io_records, actual_paths, issues)
+ validate_reset_and_slo(args, reset_records, io_records, rename_records, status, issues)
+
+ report = {
+ "file_count": args.file_count,
+ "io_records": len(io_records),
+ "rename_records": len(rename_records),
+ "reset_records": len(reset_records),
+ "expect_reset": args.expect_reset,
+ "issues": issues,
+ }
+
+ if args.report_json:
+ report_path = Path(args.report_json)
+ report_path.write_text(json.dumps(report, indent=2, sort_keys=True), encoding="utf-8")
+
+ if issues:
+ print("FAIL: consistency validation found issues")
+ for issue in issues:
+ print(f" - {issue}")
+ raise SystemExit(1)
+
+ print("PASS: consistency validation succeeded")
+
+
+if __name__ == "__main__":
+ main()
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v3 08/11] selftests: ceph: add reset stress test
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
` (6 preceding siblings ...)
2026-04-29 12:52 ` [PATCH v3 07/11] selftests: ceph: add reset consistency checker Alex Markuze
@ 2026-04-29 12:52 ` Alex Markuze
2026-04-29 12:52 ` [PATCH v3 09/11] selftests: ceph: add reset corner-case tests Alex Markuze
` (2 subsequent siblings)
10 siblings, 0 replies; 17+ messages in thread
From: Alex Markuze @ 2026-04-29 12:52 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Add a single-client stress test for the CephFS manual session reset
feature. The test runs concurrent I/O workers alongside periodic
reset injection, then validates data integrity via
validate_consistency.py.
The test supports four profiles (baseline, moderate, aggressive, soak) with
configurable duration, reset interval, and worker counts.
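For reference, the comma-separated log formats written by the workers
and consumed by the validator are:

  io.log:     ts_ms,seq,logical_id,relpath,sha256_digest
  rename.log: ts_ms,worker_id,seq,logical_id,src_rel,dst_rel,rc
  reset.log:  ts_ms,seq,reason,rc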
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
.../filesystems/ceph/reset_stress.sh | 694 ++++++++++++++++++
1 file changed, 694 insertions(+)
create mode 100755 tools/testing/selftests/filesystems/ceph/reset_stress.sh
diff --git a/tools/testing/selftests/filesystems/ceph/reset_stress.sh b/tools/testing/selftests/filesystems/ceph/reset_stress.sh
new file mode 100755
index 000000000000..c503c75a5f7a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/reset_stress.sh
@@ -0,0 +1,694 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS reset stress test:
+# - Runs concurrent I/O and rename workloads
+# - Triggers random client resets through debugfs
+# - Validates consistency and recovery behavior
+
+set -euo pipefail
+
+KSFT_SKIP=4
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+ MOUNT_POINT="$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)"
+ if [[ -z "$MOUNT_POINT" ]]; then
+ echo "SKIP: No CephFS mount found and --mount-point not specified"
+ exit "$KSFT_SKIP"
+ fi
+ exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+PROFILE="moderate"
+DURATION_SEC=""
+COOLDOWN_SEC=20
+FILE_COUNT=64
+IO_WORKERS=""
+RENAME_WORKERS=""
+MOUNT_POINT=""
+OUT_DIR=""
+CLIENT_ID=""
+DEBUGFS_ROOT="/sys/kernel/debug/ceph"
+SLO_SECONDS=30
+EXPECT_RESET=1
+DMESG_CMD=""
+SUDO=""
+
+RESET_MIN_SEC=5
+RESET_MAX_SEC=15
+
+RUN_ID="$(date +%Y%m%d-%H%M%S)"
+WORKLOAD_FLAG=""
+RESET_FLAG=""
+DATA_DIR=""
+
+IO_LOG=""
+RENAME_LOG=""
+RESET_LOG=""
+STATUS_LOG=""
+STATUS_BEFORE=""
+STATUS_FINAL=""
+DMESG_LOG=""
+SUMMARY_LOG=""
+REPORT_JSON=""
+
+RESET_PID=0
+STATUS_PID=0
+declare -a IO_WORKER_PIDS=()
+declare -a RENAME_WORKER_PIDS=()
+
+usage()
+{
+ cat <<EOF
+Usage: $0 --mount-point <cephfs_mount> [options]
+
+Required:
+ --mount-point PATH CephFS mount point to test under
+
+Options:
+ --profile NAME baseline|moderate|aggressive|soak (default: moderate)
+ --duration-sec N Override profile runtime in seconds
+ --cooldown-sec N Workload drain time after injector stop (default: 20)
+ --file-count N Number of logical files (default: 64)
+ --io-workers N Number of concurrent I/O workers (profile default)
+ --rename-workers N Number of concurrent rename workers (profile default)
+ --out-dir PATH Artifact directory (default: /tmp/ceph_reset_stress_<ts>)
+ --client-id ID Ceph debugfs client id; auto-detect if one client exists
+ --debugfs-root PATH Debugfs Ceph root (default: /sys/kernel/debug/ceph)
+ --slo-seconds N Max allowed post-reset stall window (default: 30)
+ --no-reset Disable reset injector (baseline mode helper)
+ --help Show this message
+
+Examples:
+ $0 --mount-point /mnt/cephfs --profile moderate
+ $0 --mount-point /mnt/cephfs --profile aggressive --duration-sec 300
+ $0 --mount-point /mnt/cephfs --profile baseline --no-reset
+EOF
+}
+
+now_ms()
+{
+ date +%s%3N
+}
+
+set_profile_defaults()
+{
+ case "$PROFILE" in
+ baseline)
+ RESET_MIN_SEC=0
+ RESET_MAX_SEC=0
+ EXPECT_RESET=0
+ : "${DURATION_SEC:=600}"
+ : "${IO_WORKERS:=1}"
+ : "${RENAME_WORKERS:=1}"
+ ;;
+ moderate)
+ RESET_MIN_SEC=5
+ RESET_MAX_SEC=15
+ : "${DURATION_SEC:=900}"
+ : "${IO_WORKERS:=2}"
+ : "${RENAME_WORKERS:=1}"
+ ;;
+ aggressive)
+ RESET_MIN_SEC=1
+ RESET_MAX_SEC=5
+ : "${DURATION_SEC:=900}"
+ : "${IO_WORKERS:=4}"
+ : "${RENAME_WORKERS:=2}"
+ ;;
+ soak)
+ RESET_MIN_SEC=5
+ RESET_MAX_SEC=15
+ : "${DURATION_SEC:=3600}"
+ : "${IO_WORKERS:=2}"
+ : "${RENAME_WORKERS:=1}"
+ ;;
+ *)
+ echo "Unknown profile: $PROFILE" >&2
+ exit 2
+ ;;
+ esac
+}
+
+log_summary()
+{
+ local msg="$1"
+ printf '[%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$msg" | tee -a "$SUMMARY_LOG"
+}
+
+discover_client_id()
+{
+ local candidates=()
+ local entry
+
+ if [[ -n "$CLIENT_ID" ]]; then
+ if ! $SUDO test -d "$DEBUGFS_ROOT/$CLIENT_ID/reset"; then
+ echo "SKIP: reset debugfs not found for client-id=$CLIENT_ID" >&2
+ exit "$KSFT_SKIP"
+ fi
+ return 0
+ fi
+
+ if ! $SUDO test -d "$DEBUGFS_ROOT"; then
+ echo "SKIP: Debugfs root not found: $DEBUGFS_ROOT" >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ while IFS= read -r entry; do
+ $SUDO test -d "$DEBUGFS_ROOT/$entry/reset" || continue
+ $SUDO test -w "$DEBUGFS_ROOT/$entry/reset/trigger" || continue
+ candidates+=("$entry")
+ done < <($SUDO ls -1 "$DEBUGFS_ROOT" 2>/dev/null || true)
+
+ if [[ ${#candidates[@]} -eq 1 ]]; then
+ CLIENT_ID="${candidates[0]}"
+ return 0
+ fi
+
+ if [[ ${#candidates[@]} -eq 0 ]]; then
+ echo "SKIP: No writable Ceph reset interface found under $DEBUGFS_ROOT" >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ echo "SKIP: Multiple Ceph clients found (${candidates[*]}). Use --client-id." >&2
+ exit "$KSFT_SKIP"
+}
+
+init_dataset()
+{
+ local i
+ mkdir -p "$DATA_DIR/A" "$DATA_DIR/B"
+
+ for ((i = 0; i < FILE_COUNT; i++)); do
+ printf 'seed logical_id=%05d ts_ms=%s\n' "$i" "$(now_ms)" > "$DATA_DIR/A/file_$(printf '%05d' "$i")"
+ done
+}
+
+io_worker()
+{
+ set +e
+ local worker_id="$1"
+ local seq=0
+ local id
+ local relpath
+ local abspath
+ local payload
+ local hash
+ local ts
+
+ while [[ -f "$WORKLOAD_FLAG" ]]; do
+ id="$(printf '%05d' $((RANDOM % FILE_COUNT)))"
+ if [[ -f "$DATA_DIR/A/file_$id" ]]; then
+ relpath="A/file_$id"
+ elif [[ -f "$DATA_DIR/B/file_$id" ]]; then
+ relpath="B/file_$id"
+ else
+ sleep 0.02
+ continue
+ fi
+
+ abspath="$DATA_DIR/$relpath"
+ alt_relpath=""
+ if [[ "$relpath" == A/* ]]; then
+ alt_relpath="B/file_$id"
+ else
+ alt_relpath="A/file_$id"
+ fi
+ alt_abspath="$DATA_DIR/$alt_relpath"
+ payload="worker=${worker_id} io_seq=${seq} id=${id} ts_ms=$(now_ms)"
+ result="$(
+ python3 - "$abspath" "$alt_abspath" "$payload" <<'PY'
+import hashlib
+import os
+import sys
+
+path = sys.argv[1]
+alt_path = sys.argv[2]
+payload = sys.argv[3]
+
+try:
+ fd = os.open(path, os.O_RDWR | os.O_APPEND)
+ actual = path
+except FileNotFoundError:
+ try:
+ fd = os.open(alt_path, os.O_RDWR | os.O_APPEND)
+ actual = alt_path
+ except FileNotFoundError:
+ sys.exit(1)
+
+try:
+ os.write(fd, (payload + "\n").encode())
+ os.fsync(fd)
+ os.lseek(fd, 0, os.SEEK_SET)
+ digest = hashlib.sha256()
+ while True:
+ chunk = os.read(fd, 1 << 20)
+ if not chunk:
+ break
+ digest.update(chunk)
+ print(actual + " " + digest.hexdigest())
+finally:
+ os.close(fd)
+PY
+ )" || {
+ sleep 0.02
+ continue
+ }
+
+ actual_abspath="${result%% *}"
+ hash="${result#* }"
+ if [[ "$actual_abspath" == "$alt_abspath" ]]; then
+ relpath="$alt_relpath"
+ fi
+
+ ts="$(now_ms)"
+ printf '%s,%s,%s,%s,%s\n' "$ts" "$seq" "$id" "$relpath" "$hash" >> "$IO_LOG"
+ seq=$((seq + 1))
+ sleep 0.02
+ done
+}
+
+rename_worker()
+{
+ set +e
+ local worker_id="$1"
+ local seq=0
+ local id
+ local src_rel
+ local dst_rel
+ local rc
+ local ts
+
+ while [[ -f "$WORKLOAD_FLAG" ]]; do
+ id="$(printf '%05d' $((RANDOM % FILE_COUNT)))"
+
+ if [[ -f "$DATA_DIR/A/file_$id" ]]; then
+ src_rel="A/file_$id"
+ dst_rel="B/file_$id"
+ elif [[ -f "$DATA_DIR/B/file_$id" ]]; then
+ src_rel="B/file_$id"
+ dst_rel="A/file_$id"
+ else
+ sleep 0.02
+ continue
+ fi
+
+ ts="$(now_ms)"
+ if mv -T "$DATA_DIR/$src_rel" "$DATA_DIR/$dst_rel" 2>/dev/null; then
+ rc=0
+ else
+ rc=$?
+ fi
+ printf '%s,%s,%s,%s,%s,%s,%s\n' "$ts" "$worker_id" "$seq" "$id" "$src_rel" "$dst_rel" "$rc" >> "$RENAME_LOG"
+ seq=$((seq + 1))
+ sleep 0.02
+ done
+}
+
+random_sleep_seconds()
+{
+ local min_sec="$1"
+ local max_sec="$2"
+ local wait_sec
+ local span
+
+ span=$((max_sec - min_sec + 1))
+ wait_sec=$((min_sec + RANDOM % span))
+ sleep "$wait_sec"
+}
+
+reset_injector()
+{
+ set +e
+ local trigger_path="$1"
+ local seq=0
+ local ts
+ local reason
+ local rc
+
+ while [[ -f "$RESET_FLAG" ]]; do
+ random_sleep_seconds "$RESET_MIN_SEC" "$RESET_MAX_SEC"
+ [[ -f "$RESET_FLAG" ]] || break
+
+ ts="$(now_ms)"
+ reason="stress_${seq}_${ts}"
+ if echo "$reason" | $SUDO tee "$trigger_path" > /dev/null 2>&1; then
+ rc=0
+ else
+ rc=$?
+ fi
+ printf '%s,%s,%s,%s\n' "$ts" "$seq" "$reason" "$rc" >> "$RESET_LOG"
+ seq=$((seq + 1))
+ done
+}
+
+status_sampler()
+{
+ set +e
+ local status_path="$1"
+ local ts
+ local kv_line
+
+ while [[ -f "$WORKLOAD_FLAG" || -f "$RESET_FLAG" ]]; do
+ ts="$(now_ms)"
+ if $SUDO test -r "$status_path"; then
+ kv_line="$($SUDO awk -F': ' 'NF>=2 {gsub(/ /, "", $1); gsub(/ /, "", $2); printf "%s=%s;", $1, $2}' "$status_path")"
+ printf '%s,%s\n' "$ts" "$kv_line" >> "$STATUS_LOG"
+ fi
+ sleep 1
+ done
+}
+
+stop_pid_with_timeout()
+{
+ local pid="$1"
+ local name="$2"
+ local timeout="$3"
+ local waited=0
+
+ if [[ "$pid" -le 0 ]]; then
+ return 0
+ fi
+
+ while kill -0 "$pid" 2>/dev/null; do
+ if (( waited >= timeout )); then
+ log_summary "Timeout waiting for $name (pid=$pid), sending SIGTERM/SIGKILL"
+ kill -TERM "$pid" 2>/dev/null || true
+ sleep 1
+ kill -KILL "$pid" 2>/dev/null || true
+ wait "$pid" 2>/dev/null || true
+ return 1
+ fi
+ sleep 1
+ waited=$((waited + 1))
+ done
+
+ wait "$pid" 2>/dev/null || true
+ return 0
+}
+
+detect_privileges()
+{
+ if [[ -r "$DEBUGFS_ROOT" ]]; then
+ SUDO=""
+ elif sudo -n true 2>/dev/null; then
+ SUDO="sudo"
+ else
+ echo "WARNING: $DEBUGFS_ROOT is not readable and passwordless sudo is not available" >&2
+ echo "WARNING: reset injection, debugfs status checks, and dmesg capture will not work" >&2
+ fi
+
+ if $SUDO dmesg > /dev/null 2>&1; then
+ DMESG_CMD="$SUDO dmesg"
+ else
+ DMESG_CMD=""
+ echo "WARNING: dmesg is not accessible; kernel errors (hung tasks) will not be detected" >&2
+ fi
+}
+
+check_dmesg()
+{
+ local start_epoch="$1"
+
+ if [[ -z "$DMESG_CMD" ]]; then
+ return 0
+ fi
+
+ if ! $DMESG_CMD --since "@$start_epoch" > "$DMESG_LOG" 2>/dev/null; then
+ if ! $DMESG_CMD > "$DMESG_LOG" 2>/dev/null; then
+ log_summary "WARNING: dmesg capture failed unexpectedly"
+ return 0
+ fi
+ log_summary "dmesg --since unsupported; captured full dmesg"
+ fi
+
+ if grep -qi "hung task" "$DMESG_LOG" 2>/dev/null; then
+ log_summary "ERROR: kernel log contains 'hung task' during test window"
+ return 1
+ fi
+
+ return 0
+}
+
+cleanup()
+{
+ rm -f "$WORKLOAD_FLAG" "$RESET_FLAG"
+ local pid
+ for pid in "${IO_WORKER_PIDS[@]}" "${RENAME_WORKER_PIDS[@]}" "$RESET_PID" "$STATUS_PID"; do
+ [[ "$pid" -gt 0 ]] 2>/dev/null && kill "$pid" 2>/dev/null || true
+ done
+ wait 2>/dev/null || true
+}
+
+parse_args()
+{
+ while [[ $# -gt 0 ]]; do
+ case "$1" in
+ --mount-point)
+ MOUNT_POINT="$2"
+ shift 2
+ ;;
+ --profile)
+ PROFILE="$2"
+ shift 2
+ ;;
+ --duration-sec)
+ DURATION_SEC="$2"
+ shift 2
+ ;;
+ --cooldown-sec)
+ COOLDOWN_SEC="$2"
+ shift 2
+ ;;
+ --file-count)
+ FILE_COUNT="$2"
+ shift 2
+ ;;
+ --io-workers)
+ IO_WORKERS="$2"
+ shift 2
+ ;;
+ --rename-workers)
+ RENAME_WORKERS="$2"
+ shift 2
+ ;;
+ --out-dir)
+ OUT_DIR="$2"
+ shift 2
+ ;;
+ --client-id)
+ CLIENT_ID="$2"
+ shift 2
+ ;;
+ --debugfs-root)
+ DEBUGFS_ROOT="$2"
+ shift 2
+ ;;
+ --slo-seconds)
+ SLO_SECONDS="$2"
+ shift 2
+ ;;
+ --no-reset)
+ EXPECT_RESET=0
+ shift
+ ;;
+ --help|-h)
+ usage
+ exit 0
+ ;;
+ *)
+ echo "Unknown option: $1" >&2
+ usage
+ exit 2
+ ;;
+ esac
+ done
+}
+
+main()
+{
+ local start_epoch
+ local trigger_path=""
+ local status_path=""
+ local final_rc=0
+ local reset_enabled=0
+ local i
+
+ parse_args "$@"
+
+ if [[ -z "$MOUNT_POINT" ]]; then
+ echo "--mount-point is required" >&2
+ usage
+ exit 2
+ fi
+
+ if [[ ! -d "$MOUNT_POINT" ]]; then
+ echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ if ! touch "$MOUNT_POINT/.ceph_reset_test_probe" 2>/dev/null; then
+ echo "SKIP: Mount point is not writable: $MOUNT_POINT" >&2
+ exit "$KSFT_SKIP"
+ fi
+ rm -f "$MOUNT_POINT/.ceph_reset_test_probe"
+
+ if ! command -v python3 > /dev/null 2>&1; then
+ echo "SKIP: python3 is required but not found in PATH" >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ if ! stat -f -c '%T' "$MOUNT_POINT" 2>/dev/null | grep -qi ceph; then
+ echo "WARNING: $MOUNT_POINT does not appear to be a CephFS mount" >&2
+ fi
+
+ detect_privileges
+
+ set_profile_defaults
+ if [[ "$EXPECT_RESET" -eq 0 ]]; then
+ PROFILE="baseline"
+ RESET_MIN_SEC=0
+ RESET_MAX_SEC=0
+ fi
+
+ if ! [[ "$IO_WORKERS" =~ ^[0-9]+$ && "$RENAME_WORKERS" =~ ^[0-9]+$ ]]; then
+ echo "io-workers and rename-workers must be integers" >&2
+ exit 2
+ fi
+
+ if [[ "$IO_WORKERS" -le 0 || "$RENAME_WORKERS" -le 0 ]]; then
+ echo "io-workers and rename-workers must be > 0" >&2
+ exit 2
+ fi
+
+ if [[ -z "$OUT_DIR" ]]; then
+ OUT_DIR="/tmp/ceph_reset_stress_${RUN_ID}"
+ fi
+ mkdir -p "$OUT_DIR"
+
+ WORKLOAD_FLAG="$OUT_DIR/workload.running"
+ RESET_FLAG="$OUT_DIR/reset.running"
+
+ DATA_DIR="$MOUNT_POINT/ceph_reset_stress_${RUN_ID}"
+ mkdir -p "$DATA_DIR"
+
+ IO_LOG="$OUT_DIR/io.log"
+ RENAME_LOG="$OUT_DIR/rename.log"
+ RESET_LOG="$OUT_DIR/reset.log"
+ STATUS_LOG="$OUT_DIR/status.log"
+ STATUS_BEFORE="$OUT_DIR/reset_status.before"
+ STATUS_FINAL="$OUT_DIR/reset_status.final"
+ DMESG_LOG="$OUT_DIR/dmesg.log"
+ SUMMARY_LOG="$OUT_DIR/summary.log"
+ REPORT_JSON="$OUT_DIR/validator_report.json"
+
+ : > "$IO_LOG"
+ : > "$RENAME_LOG"
+ : > "$RESET_LOG"
+ : > "$STATUS_LOG"
+ : > "$SUMMARY_LOG"
+
+ start_epoch="$(date +%s)"
+
+ log_summary "Starting Ceph reset stress test"
+ log_summary "Profile=$PROFILE duration=${DURATION_SEC}s cooldown=${COOLDOWN_SEC}s file_count=${FILE_COUNT} io_workers=${IO_WORKERS} rename_workers=${RENAME_WORKERS}"
+ [[ -n "$SUDO" ]] && log_summary "Using sudo for privileged operations"
+ [[ -z "$DMESG_CMD" ]] && log_summary "WARNING: dmesg not available; hung task detection disabled"
+ log_summary "Artifacts=$OUT_DIR"
+ log_summary "Data dir=$DATA_DIR"
+
+ init_dataset
+
+ if [[ "$EXPECT_RESET" -eq 1 ]]; then
+ discover_client_id
+ trigger_path="$DEBUGFS_ROOT/$CLIENT_ID/reset/trigger"
+ status_path="$DEBUGFS_ROOT/$CLIENT_ID/reset/status"
+ if ! $SUDO test -w "$trigger_path"; then
+ echo "SKIP: Reset trigger is not writable: $trigger_path" >&2
+ exit "$KSFT_SKIP"
+ fi
+ if ! $SUDO test -r "$status_path"; then
+ echo "SKIP: Reset status is not readable: $status_path" >&2
+ exit "$KSFT_SKIP"
+ fi
+ $SUDO cat "$status_path" > "$STATUS_BEFORE" || true
+ reset_enabled=1
+ log_summary "Using ceph client id: $CLIENT_ID"
+ fi
+
+ trap cleanup EXIT INT TERM
+
+ touch "$WORKLOAD_FLAG"
+ for ((i = 0; i < IO_WORKERS; i++)); do
+ io_worker "$i" &
+ IO_WORKER_PIDS+=("$!")
+ done
+
+ for ((i = 0; i < RENAME_WORKERS; i++)); do
+ rename_worker "$i" &
+ RENAME_WORKER_PIDS+=("$!")
+ done
+
+ if [[ "$reset_enabled" -eq 1 ]]; then
+ touch "$RESET_FLAG"
+ reset_injector "$trigger_path" &
+ RESET_PID=$!
+
+ status_sampler "$status_path" &
+ STATUS_PID=$!
+ fi
+
+ sleep "$DURATION_SEC"
+
+ if [[ "$reset_enabled" -eq 1 ]]; then
+ rm -f "$RESET_FLAG"
+ stop_pid_with_timeout "$RESET_PID" "reset_injector" 20 || final_rc=1
+ log_summary "Injector stopped; entering cooldown=${COOLDOWN_SEC}s"
+ fi
+
+ sleep "$COOLDOWN_SEC"
+
+ rm -f "$WORKLOAD_FLAG"
+ for i in "${!IO_WORKER_PIDS[@]}"; do
+ stop_pid_with_timeout "${IO_WORKER_PIDS[$i]}" "io_worker[$i]" 20 || final_rc=1
+ done
+ for i in "${!RENAME_WORKER_PIDS[@]}"; do
+ stop_pid_with_timeout "${RENAME_WORKER_PIDS[$i]}" "rename_worker[$i]" 20 || final_rc=1
+ done
+
+ if [[ "$reset_enabled" -eq 1 ]]; then
+ stop_pid_with_timeout "$STATUS_PID" "status_sampler" 10 || final_rc=1
+ $SUDO cat "$status_path" > "$STATUS_FINAL" || true
+ fi
+
+ if ! check_dmesg "$start_epoch"; then
+ final_rc=1
+ fi
+
+ if ! python3 "$SCRIPT_DIR/validate_consistency.py" \
+ --data-dir "$DATA_DIR" \
+ --file-count "$FILE_COUNT" \
+ --io-log "$IO_LOG" \
+ --rename-log "$RENAME_LOG" \
+ --reset-log "$RESET_LOG" \
+ --status-final "$STATUS_FINAL" \
+ --slo-seconds "$SLO_SECONDS" \
+ --report-json "$REPORT_JSON" \
+ $( [[ "$reset_enabled" -eq 1 ]] && echo "--expect-reset" ); then
+ final_rc=1
+ fi
+
+ if [[ "$final_rc" -eq 0 ]]; then
+ log_summary "PASS: stress run completed successfully"
+ else
+ log_summary "FAIL: stress run detected one or more failures"
+ fi
+
+ log_summary "Artifacts available in: $OUT_DIR"
+ exit "$final_rc"
+}
+
+main "$@"
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v3 09/11] selftests: ceph: add reset corner-case tests
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
` (7 preceding siblings ...)
2026-04-29 12:52 ` [PATCH v3 08/11] selftests: ceph: add reset stress test Alex Markuze
@ 2026-04-29 12:52 ` Alex Markuze
2026-04-29 12:52 ` [PATCH v3 10/11] selftests: ceph: add validation harness Alex Markuze
2026-04-29 12:52 ` [PATCH v3 11/11] selftests: ceph: wire up Ceph reset kselftests and documentation Alex Markuze
10 siblings, 0 replies; 17+ messages in thread
From: Alex Markuze @ 2026-04-29 12:52 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Add targeted corner-case tests for the CephFS manual session reset
feature. Four sequential tests cover:
[1/4] ebusy_rejection - second reset rejected while first in-flight
[2/4] dirty_caps_at_reset - reset with unflushed dirty caps
[3/4] flock_after_reset - stale lock EIO + fresh lock after holder exit
[4/4] unmount_during_reset - umount during active reset (destroy-path wakeup)
Requires: mounted CephFS, debugfs access (root), flock(1) utility.
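Example invocation (the mount point below is an example):

  # ./reset_corner_cases.sh --mount-point /mnt/cephfs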
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
.../filesystems/ceph/reset_corner_cases.sh | 646 ++++++++++++++++++
1 file changed, 646 insertions(+)
create mode 100755 tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
diff --git a/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh b/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
new file mode 100755
index 000000000000..a6dae84a616d
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
@@ -0,0 +1,646 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS client reset corner case tests.
+# Runs a checklist of targeted tests that exercise specific reset
+# code paths not covered by the stress tests.
+#
+# Requires: mounted CephFS, debugfs access (root), flock(1) utility.
+
+set -uo pipefail
+
+KSFT_SKIP=4
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+ MOUNT_POINT="$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)"
+ if [[ -z "$MOUNT_POINT" ]]; then
+ echo "SKIP: No CephFS mount found and --mount-point not specified"
+ exit "$KSFT_SKIP"
+ fi
+ exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+MOUNT_POINT=""
+DEBUGFS_ROOT="/sys/kernel/debug/ceph"
+DEBUGFS_CLIENT=""
+TRIGGER_PATH=""
+STATUS_PATH=""
+TEMP_MNT=""
+
+PASS_COUNT=0
+FAIL_COUNT=0
+SKIP_COUNT=0
+TOTAL=4
+
+log()
+{
+ printf '[%s] %s\n' "$(date -u +%H:%M:%S)" "$1"
+}
+
+result()
+{
+ local num="$1"
+ local name="$2"
+ local status="$3"
+ local detail="${4:-}"
+
+ case "$status" in
+ PASS) PASS_COUNT=$((PASS_COUNT + 1)) ;;
+ FAIL) FAIL_COUNT=$((FAIL_COUNT + 1)) ;;
+ SKIP) SKIP_COUNT=$((SKIP_COUNT + 1)) ;;
+ esac
+
+ if [[ -n "$detail" ]]; then
+ printf '[%d/%d] %-30s %s (%s)\n' "$num" "$TOTAL" "$name" "$status" "$detail"
+ else
+ printf '[%d/%d] %-30s %s\n' "$num" "$TOTAL" "$name" "$status"
+ fi
+}
+
+read_status_field()
+{
+ local field="$1"
+ awk -F': ' -v key="$field" '$1 == key {print $2}' "$STATUS_PATH" 2>/dev/null
+}
+
+wait_reset_done()
+{
+ local timeout="${1:-30}"
+ local elapsed=0
+
+ while [[ "$(read_status_field "phase")" != "idle" ]]; do
+ sleep 1
+ elapsed=$((elapsed + 1))
+ if [[ "$elapsed" -ge "$timeout" ]]; then
+ return 1
+ fi
+ done
+ return 0
+}
+
+list_reset_clients()
+{
+ local entry
+
+ for entry in "$DEBUGFS_ROOT"/*/; do
+ entry="$(basename "$entry")"
+ [[ -d "$DEBUGFS_ROOT/$entry/reset" ]] || continue
+ [[ -w "$DEBUGFS_ROOT/$entry/reset/trigger" ]] || continue
+ printf '%s\n' "$entry"
+ done
+}
+
+wait_status_nonidle()
+{
+ local status_path="$1"
+ local timeout="${2:-10}"
+ local polls=$((timeout * 10))
+ local phase
+
+ while [[ "$polls" -gt 0 ]]; do
+ phase="$(awk -F': ' '$1 == "phase" {print $2}' "$status_path" 2>/dev/null)"
+ if [[ -n "$phase" && "$phase" != "idle" ]]; then
+ return 0
+ fi
+ sleep 0.1
+ polls=$((polls - 1))
+ done
+
+ return 1
+}
+
+discover_debugfs()
+{
+ local candidates=()
+ local entry
+
+ if [[ -n "$DEBUGFS_CLIENT" ]]; then
+ if [[ ! -d "$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset" ]]; then
+ echo "SKIP: reset debugfs not found for $DEBUGFS_CLIENT" >&2
+ exit "$KSFT_SKIP"
+ fi
+ return 0
+ fi
+
+ for entry in "$DEBUGFS_ROOT"/*/; do
+ entry="$(basename "$entry")"
+ [[ -d "$DEBUGFS_ROOT/$entry/reset" ]] || continue
+ [[ -w "$DEBUGFS_ROOT/$entry/reset/trigger" ]] || continue
+ candidates+=("$entry")
+ done
+
+ if [[ ${#candidates[@]} -eq 0 ]]; then
+ echo "SKIP: No writable Ceph reset interface found under $DEBUGFS_ROOT" >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ if [[ ${#candidates[@]} -gt 1 ]]; then
+ echo "SKIP: Multiple Ceph clients found: ${candidates[*]}. Use --client-id." >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ DEBUGFS_CLIENT="${candidates[0]}"
+}
+
+# --- Test 1: ebusy_rejection ------------------------------------------------
+#
+# Trigger a reset while another is guaranteed in-flight. Creates
+# dirty state so the first reset enters DRAINING (which takes
+# measurable time), then polls until phase != idle and issues the
+# second trigger. The second trigger must fail (the kernel returns
+# -EBUSY), and only one reset must be counted in the accounting.
+
+test_ebusy_rejection()
+{
+ local num=1
+ local name="ebusy_rejection"
+ local testfile="$MOUNT_POINT/.reset_corner_ebusy_$$"
+ local tc_before tc_after sc_before sc_after second_rc phase elapsed
+
+ tc_before="$(read_status_field "trigger_count")"
+ sc_before="$(read_status_field "success_count")"
+
+ # Create dirty state so the first reset enters DRAINING
+ echo "ebusy_dirty_data" > "$testfile"
+ sync "$testfile"
+
+ python3 -c "
+import os, sys
+fd = os.open('$testfile', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'dirty_for_ebusy_test\n')
+sys.stdout.write('written')
+" 2>/dev/null || {
+ result "$num" "$name" FAIL "dirty write failed"
+ rm -f "$testfile"
+ return
+ }
+
+ # Trigger the first reset -- it will drain dirty state
+ echo "ebusy_first" > "$TRIGGER_PATH" 2>/dev/null || {
+ result "$num" "$name" FAIL "first trigger failed"
+ rm -f "$testfile"
+ return
+ }
+
+ # Poll until phase is non-idle (quiescing or draining)
+ elapsed=0
+ while true; do
+ phase="$(read_status_field "phase")"
+ if [[ "$phase" != "idle" ]]; then
+ break
+ fi
+ sleep 0.1
+ elapsed=$((elapsed + 1))
+ if [[ "$elapsed" -ge 50 ]]; then
+ result "$num" "$name" SKIP \
+ "first reset completed before overlap could be tested"
+ rm -f "$testfile" 2>/dev/null
+ return
+ fi
+ done
+
+ # Issue the second trigger -- should be rejected with EBUSY
+ second_rc=0
+ echo "ebusy_second" > "$TRIGGER_PATH" 2>/dev/null && second_rc=0 || second_rc=$?
+
+ if ! wait_reset_done 30; then
+ result "$num" "$name" FAIL "first reset never completed"
+ rm -f "$testfile"
+ return
+ fi
+
+ tc_after="$(read_status_field "trigger_count")"
+ sc_after="$(read_status_field "success_count")"
+
+ if [[ "$((tc_after - tc_before))" -ne 1 ]]; then
+ result "$num" "$name" FAIL "trigger_count +$((tc_after - tc_before)), expected +1"
+ rm -f "$testfile"
+ return
+ fi
+
+ if [[ "$((sc_after - sc_before))" -ne 1 ]]; then
+ result "$num" "$name" FAIL "success_count +$((sc_after - sc_before)), expected +1"
+ rm -f "$testfile"
+ return
+ fi
+
+ if [[ "$second_rc" -eq 0 ]]; then
+ result "$num" "$name" FAIL "second trigger did not return error"
+ rm -f "$testfile"
+ return
+ fi
+
+ rm -f "$testfile" 2>/dev/null
+ result "$num" "$name" PASS
+}
+
+# --- Test 2: dirty_caps_at_reset --------------------------------------------
+#
+# Write to a file without fsync (dirty caps), trigger reset, then
+# verify the file is not corrupt. Manual reset drains dirty caps
+# before teardown (best-effort, 5s timeout). For a non-stuck cap
+# the dirty write should be flushed during drain and persist.
+# If the drain window is too short, only the synced first line
+# persists -- that is acceptable (data loss is documented for
+# unflushed writes).
+
+test_dirty_caps_at_reset()
+{
+ local num=2
+ local name="dirty_caps_at_reset"
+ local testfile="$MOUNT_POINT/.reset_corner_dirty_caps_$$"
+ local content_after line_count sc_before sc_after le
+
+ sc_before="$(read_status_field "success_count")"
+
+ echo "line_1_before_dirty_write" > "$testfile"
+ sync "$testfile"
+
+ python3 -c "
+import os, sys
+fd = os.open('$testfile', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'line_2_dirty_no_fsync\n')
+# deliberately no fsync -- leave caps dirty
+sys.stdout.write('written')
+" 2>/dev/null || {
+ result "$num" "$name" FAIL "dirty write failed"
+ rm -f "$testfile"
+ return
+ }
+
+ echo "dirty_caps_test" > "$TRIGGER_PATH" 2>/dev/null || {
+ result "$num" "$name" FAIL "reset trigger failed"
+ rm -f "$testfile"
+ return
+ }
+
+ if ! wait_reset_done 30; then
+ result "$num" "$name" FAIL "reset did not complete"
+ rm -f "$testfile"
+ return
+ fi
+
+ sc_after="$(read_status_field "success_count")"
+ if [[ "$sc_after" -le "$sc_before" ]]; then
+ result "$num" "$name" FAIL "success_count did not increment (reset not exercised)"
+ rm -f "$testfile"
+ return
+ fi
+
+ sync "$testfile" 2>/dev/null || true
+ content_after="$(cat "$testfile" 2>/dev/null)" || {
+ result "$num" "$name" FAIL "cannot read file after reset"
+ rm -f "$testfile"
+ return
+ }
+
+ if [[ -z "$content_after" ]]; then
+ result "$num" "$name" FAIL "file is empty after reset"
+ rm -f "$testfile"
+ return
+ fi
+
+ line_count="$(echo "$content_after" | wc -l)"
+ if [[ "$line_count" -lt 1 ]]; then
+ result "$num" "$name" FAIL "file has $line_count lines, expected >= 1"
+ rm -f "$testfile"
+ return
+ fi
+
+ echo "$content_after" | head -1 | grep -q "line_1_before_dirty_write" || {
+ result "$num" "$name" FAIL "first line corrupted"
+ rm -f "$testfile"
+ return
+ }
+
+ le="$(read_status_field "last_errno")"
+ if [[ "$le" != "0" ]]; then
+ result "$num" "$name" FAIL "last_errno=$le, expected 0"
+ rm -f "$testfile"
+ return
+ fi
+
+ rm -f "$testfile"
+ result "$num" "$name" PASS "file intact ($line_count lines)"
+}
+
+# --- Test 3: flock_after_reset ----------------------------------------------
+#
+# Take an exclusive flock, trigger reset, verify stale lock state is
+# marked with CEPH_I_ERROR_FILELOCK (same-client flock attempt returns
+# EIO). After the original holder exits (releasing the local lock
+# reference and clearing the error flag), a fresh lock can be acquired.
+#
+# The lock holder uses the fd-based flock form with exec, so killing
+# $lock_pid closes the lock fd immediately (no orphaned child with an
+# inherited fd copy that would prevent the VFS flock release).
+
+test_flock_after_reset()
+{
+ local num=3
+ local name="flock_after_reset"
+ local testfile="$MOUNT_POINT/.reset_corner_flock_$$"
+ local lock_pid probe_rc sc_before sc_after
+
+ sc_before="$(read_status_field "success_count")"
+
+ echo "flock_test_content" > "$testfile"
+ sync "$testfile"
+
+ # Hold lock via fd in a subshell; exec ensures killing $lock_pid
+ # closes the lock fd directly (no fork/child fd inheritance).
+ (
+ exec 9<"$testfile"
+ flock --exclusive --nonblock 9 || exit 1
+ exec sleep 300
+ ) &
+ lock_pid=$!
+ sleep 0.5
+
+ if ! kill -0 "$lock_pid" 2>/dev/null; then
+ result "$num" "$name" FAIL "flock holder died immediately"
+ rm -f "$testfile"
+ return
+ fi
+
+ echo "flock_after_reset_test" > "$TRIGGER_PATH" 2>/dev/null || {
+ kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+ result "$num" "$name" FAIL "reset trigger failed"
+ rm -f "$testfile"
+ return
+ }
+
+ if ! wait_reset_done 30; then
+ kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+ result "$num" "$name" FAIL "reset did not complete"
+ rm -f "$testfile"
+ return
+ fi
+
+ sc_after="$(read_status_field "success_count")"
+ if [[ "$sc_after" -le "$sc_before" ]]; then
+ kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+ result "$num" "$name" FAIL "success_count did not increment"
+ rm -f "$testfile"
+ return
+ fi
+
+ # After teardown, CEPH_I_ERROR_FILELOCK is set on the inode.
+ # A same-client lock attempt should fail (EIO), NOT succeed.
+ probe_rc=0
+ flock --exclusive --nonblock "$testfile" true 2>/dev/null && probe_rc=0 || probe_rc=$?
+ if [[ "$probe_rc" -eq 0 ]]; then
+ kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+ result "$num" "$name" FAIL \
+ "same-client probe succeeded, expected EIO from stale lock state"
+ rm -f "$testfile"
+ return
+ fi
+
+ # Kill the holder -- the exec'd sleep IS $lock_pid, so killing it
+ # closes fd 9 directly. VFS flock release fires ceph_fl_release_lock(),
+ # which decrements i_filelock_ref to 0 and clears CEPH_I_ERROR_FILELOCK.
+ kill "$lock_pid" 2>/dev/null
+ wait "$lock_pid" 2>/dev/null
+
+ # After the holder exits, a fresh lock should be acquirable.
+ # The reset teardown sends SESSION_REQUEST_CLOSE so the MDS
+ # releases locks promptly, but retry briefly in case the
+ # message races with the connection close.
+ local attempt
+ probe_rc=1
+ for attempt in 1 2 3 4 5; do
+ probe_rc=0
+ flock --exclusive --nonblock "$testfile" true 2>/dev/null \
+ && probe_rc=0 || probe_rc=$?
+ [[ "$probe_rc" -eq 0 ]] && break
+ sleep 1
+ done
+ if [[ "$probe_rc" -ne 0 ]]; then
+ result "$num" "$name" FAIL \
+ "cannot acquire fresh lock after holder exit (rc=$probe_rc, ${attempt} attempts)"
+ rm -f "$testfile"
+ return
+ fi
+
+ # Verify file content survived
+ grep -q "flock_test_content" "$testfile" 2>/dev/null || {
+ result "$num" "$name" FAIL "file content corrupted after reset"
+ rm -f "$testfile"
+ return
+ }
+
+ rm -f "$testfile"
+ result "$num" "$name" PASS "stale lock detected, fresh lock acquired after holder exit"
+}
+
+# --- Test 4: unmount_during_reset -------------------------------------------
+#
+# Mount a fresh CephFS, trigger reset, immediately unmount. The
+# ceph_mdsc_destroy() path must wake blocked waiters with -ESHUTDOWN
+# and not hang.
+
+test_unmount_during_reset()
+{
+ local num=4
+ local name="unmount_during_reset"
+ local temp_mnt="/tmp/ceph_corner_mnt_$$"
+ local mount_opts=""
+ local mount_src=""
+ local temp_trigger=""
+ local temp_status=""
+ local temp_client=""
+ local temp_file="$temp_mnt/.reset_corner_umount_$$"
+ local phase=""
+ local trigger_ok=0
+ local attempt
+ local existing entry
+ local -a new_clients=()
+ declare -A existing_clients=()
+
+ mount_src="$(awk -v mp="$MOUNT_POINT" '$2 == mp && $3 == "ceph" {print $1; exit}' /proc/mounts 2>/dev/null)"
+ mount_opts="$(awk -v mp="$MOUNT_POINT" '$2 == mp && $3 == "ceph" {print $4; exit}' /proc/mounts 2>/dev/null)"
+
+ if [[ -z "$mount_src" ]]; then
+ result "$num" "$name" SKIP "cannot determine mount source from /proc/mounts"
+ return
+ fi
+
+ while IFS= read -r existing; do
+ [[ -n "$existing" ]] || continue
+ existing_clients["$existing"]=1
+ done < <(list_reset_clients)
+
+ mkdir -p "$temp_mnt"
+
+ if ! mount -t ceph "$mount_src" "$temp_mnt" -o "$mount_opts" 2>/dev/null; then
+ result "$num" "$name" SKIP "cannot mount additional CephFS instance"
+ rmdir "$temp_mnt" 2>/dev/null
+ return
+ fi
+
+ ls "$temp_mnt" > /dev/null 2>&1
+ sync
+ sleep 1
+
+ for attempt in $(seq 1 50); do
+ new_clients=()
+ while IFS= read -r entry; do
+ [[ -n "$entry" ]] || continue
+ if [[ -n "${existing_clients[$entry]+x}" ]]; then
+ continue
+ fi
+ new_clients+=("$entry")
+ done < <(list_reset_clients)
+
+ if [[ "${#new_clients[@]}" -eq 1 ]]; then
+ temp_client="${new_clients[0]}"
+ break
+ fi
+
+ if [[ "${#new_clients[@]}" -gt 1 ]]; then
+ break
+ fi
+
+ sleep 0.1
+ done
+
+ if [[ -z "$temp_client" ]]; then
+ umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" SKIP "cannot identify debugfs client for temp mount"
+ return
+ fi
+
+ if [[ "${#new_clients[@]}" -gt 1 ]]; then
+ umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" SKIP "multiple new debugfs clients appeared"
+ return
+ fi
+
+ temp_trigger="$DEBUGFS_ROOT/$temp_client/reset/trigger"
+ temp_status="$DEBUGFS_ROOT/$temp_client/reset/status"
+
+ echo "umount_dirty_seed" > "$temp_file" 2>/dev/null || {
+ umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" FAIL "cannot create dirty state on temp mount"
+ return
+ }
+ sync "$temp_file"
+ python3 -c "
+import os, sys
+fd = os.open('$temp_file', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'dirty_for_umount_test\\n')
+os.close(fd)
+" 2>/dev/null || {
+ umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" FAIL "cannot dirty temp mount for reset overlap"
+ return
+ }
+
+ echo "unmount_test" > "$temp_trigger" 2>/dev/null && trigger_ok=1 || trigger_ok=0
+ if [[ "$trigger_ok" -ne 1 ]]; then
+ umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" FAIL "cannot trigger reset on temp mount"
+ return
+ fi
+
+ if ! wait_status_nonidle "$temp_status" 10; then
+ phase="$(awk -F': ' '$1 == "phase" {print $2}' "$temp_status" 2>/dev/null)"
+ umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" FAIL \
+ "reset never became active before umount (phase=${phase:-unknown})"
+ return
+ fi
+
+ local umount_ok=0
+ timeout 30 umount "$temp_mnt" 2>/dev/null && umount_ok=1
+
+ if [[ "$umount_ok" -ne 1 ]]; then
+ umount -l "$temp_mnt" 2>/dev/null || true
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" FAIL "umount hung for >30s"
+ return
+ fi
+
+ rmdir "$temp_mnt" 2>/dev/null
+
+ ls "$MOUNT_POINT" > /dev/null 2>&1 || {
+ result "$num" "$name" FAIL "original mount unhealthy after test"
+ return
+ }
+
+ result "$num" "$name" PASS
+}
+
+# --- Main --------------------------------------------------------------------
+
+usage()
+{
+ cat <<EOF
+Usage: $0 --mount-point <path> [--client-id <id>] [--debugfs-root <path>]
+
+Runs targeted corner-case tests for the CephFS client reset feature.
+Requires root (debugfs access) and a mounted CephFS filesystem.
+
+Options:
+ --mount-point PATH CephFS mount point (required)
+ --client-id ID Ceph debugfs client id (auto-detect if one client)
+ --debugfs-root PATH Debugfs ceph root (default: /sys/kernel/debug/ceph)
+ --help Show this message
+EOF
+}
+
+main()
+{
+ while [[ $# -gt 0 ]]; do
+ case "$1" in
+ --mount-point) MOUNT_POINT="$2"; shift 2 ;;
+ --client-id) DEBUGFS_CLIENT="$2"; shift 2 ;;
+ --debugfs-root) DEBUGFS_ROOT="$2"; shift 2 ;;
+ --help|-h) usage; exit 0 ;;
+ *) echo "Unknown option: $1" >&2; usage; exit 2 ;;
+ esac
+ done
+
+ if [[ -z "$MOUNT_POINT" ]]; then
+ echo "--mount-point is required" >&2
+ usage
+ exit 2
+ fi
+
+ if [[ ! -d "$MOUNT_POINT" ]]; then
+ echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ discover_debugfs
+ TRIGGER_PATH="$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset/trigger"
+ STATUS_PATH="$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset/status"
+
+ log "CephFS client reset corner case tests"
+ log "Mount: $MOUNT_POINT"
+ log "Client: $DEBUGFS_CLIENT"
+ echo ""
+
+ test_ebusy_rejection
+ test_dirty_caps_at_reset
+ test_flock_after_reset
+ test_unmount_during_reset
+
+ echo ""
+ echo "Results: $PASS_COUNT passed, $FAIL_COUNT failed, $SKIP_COUNT skipped (of $TOTAL)"
+
+ if [[ "$FAIL_COUNT" -gt 0 ]]; then
+ exit 1
+ fi
+ exit 0
+}
+
+main "$@"
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v3 10/11] selftests: ceph: add validation harness
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
` (8 preceding siblings ...)
2026-04-29 12:52 ` [PATCH v3 09/11] selftests: ceph: add reset corner-case tests Alex Markuze
@ 2026-04-29 12:52 ` Alex Markuze
2026-04-29 12:52 ` [PATCH v3 11/11] selftests: ceph: wire up Ceph reset kselftests and documentation Alex Markuze
10 siblings, 0 replies; 17+ messages in thread
From: Alex Markuze @ 2026-04-29 12:52 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Add a one-shot validation wrapper that orchestrates the full reset
test suite with per-stage watchdog timeouts and a final status check.
The harness runs five stages: baseline (no resets), corner cases,
moderate stress, aggressive stress, and a post-run status validation.
Each stage runs with an independent timeout so a hang in one stage
does not block the entire run.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
.../filesystems/ceph/run_validation.sh | 350 ++++++++++++++++++
1 file changed, 350 insertions(+)
create mode 100755 tools/testing/selftests/filesystems/ceph/run_validation.sh
diff --git a/tools/testing/selftests/filesystems/ceph/run_validation.sh b/tools/testing/selftests/filesystems/ceph/run_validation.sh
new file mode 100755
index 000000000000..5d521e4f9e9b
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/run_validation.sh
@@ -0,0 +1,350 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS client reset - single-command validation.
+# Runs all test stages in sequence with per-stage timeouts.
+# If any stage hangs (filesystem stuck, process blocked), the
+# timeout kills it and reports failure.
+#
+# Usage:
+# sudo ./run_validation.sh --mount-point /mnt/mycephfs
+#
+# Expected output on success:
+#
+# === CephFS Client Reset Validation ===
+# [stage 1/5] baseline PASS (60s, no resets)
+# [stage 2/5] corner_cases PASS (4/4 passed)
+# [stage 3/5] moderate PASS (120s, resets every 5-15s)
+# [stage 4/5] aggressive PASS (120s, resets every 1-5s)
+# [stage 5/5] status_check PASS (phase=idle, last_errno=0)
+#
+# RESULT: 5/5 stages passed
+# Artifacts: /tmp/ceph_reset_validation_<timestamp>
+
+set -uo pipefail
+
+KSFT_SKIP=4
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+ MOUNT_POINT="$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)"
+ if [[ -z "$MOUNT_POINT" ]]; then
+ echo "SKIP: No CephFS mount found and --mount-point not specified"
+ exit "$KSFT_SKIP"
+ fi
+ exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+MOUNT_POINT=""
+CLIENT_ID=""
+declare -a CLIENT_ARGS=()
+declare -a DEBUGFS_ARGS=()
+RUN_ID="$(date +%Y%m%d-%H%M%S)"
+OUT_DIR="/tmp/ceph_reset_validation_${RUN_ID}"
+DEBUGFS_ROOT="/sys/kernel/debug/ceph"
+
+# Timeout margins: stage runtime + cooldown + validation + safety buffer
+STAGE1_TIMEOUT=120 # 60s run + 20s cooldown + 40s buffer
+STAGE2_TIMEOUT=300 # 4 corner cases, 30s each worst case + buffer
+STAGE3_TIMEOUT=240 # 120s run + 20s cooldown + 100s buffer
+STAGE4_TIMEOUT=240 # 120s run + 20s cooldown + 100s buffer
+STAGE5_TIMEOUT=10 # just reading debugfs
+
+PASS=0
+FAIL=0
+TOTAL=5
+
+usage()
+{
+ cat <<EOF
+Usage: $0 --mount-point <cephfs_mount> [options]
+
+Required:
+ --mount-point PATH CephFS mount point
+
+Options:
+ --out-dir PATH Artifact directory (default: /tmp/ceph_reset_validation_<ts>)
+ --client-id ID Ceph debugfs client id (optional)
+ --debugfs-root PATH Debugfs Ceph root (default: /sys/kernel/debug/ceph)
+ --help Show this message
+EOF
+}
+
+stage_result()
+{
+ local num="$1"
+ local name="$2"
+ local status="$3"
+ local detail="$4"
+
+ if [[ "$status" == "PASS" ]]; then
+ PASS=$((PASS + 1))
+ else
+ FAIL=$((FAIL + 1))
+ fi
+ printf '[stage %d/%d] %-16s %s (%s)\n' "$num" "$TOTAL" "$name" "$status" "$detail"
+}
+
+# Run a command with a timeout. Returns 0 on success, 1 on failure/timeout.
+# Sets RUN_TIMED_OUT=1 if killed by timeout.
+#
+# The stage command runs in its own session/process group (via setsid).
+# On timeout the entire process group is killed, not just the top-level
+# script PID. This is required because stage scripts (reset_stress.sh,
+# reset_corner_cases.sh) spawn child processes - I/O workers, rename
+# workers, reset injectors, samplers - that would otherwise survive the
+# timeout and bleed into later stages, invalidating results.
+RUN_TIMED_OUT=0
+
+run_with_timeout()
+{
+ local timeout_sec="$1"
+ local logfile="$2"
+ shift 2
+
+ RUN_TIMED_OUT=0
+
+ # Start the stage in its own session via setsid so all descendant
+ # processes share a process group that we can kill atomically.
+ # In a non-interactive script, background children are not process
+ # group leaders, so setsid(1) calls setsid(2) directly (no extra
+ # fork) and the PID we capture IS the group leader.
+ setsid "$@" > "$logfile" 2>&1 &
+ local pid=$!
+
+ # Watchdog: on timeout, kill the entire process group
+ (
+ sleep "$timeout_sec"
+ if kill -0 "$pid" 2>/dev/null; then
+ echo "TIMEOUT: stage exceeded ${timeout_sec}s, killing process group $pid" >> "$logfile"
+ kill -TERM -- -"$pid" 2>/dev/null
+ sleep 2
+ kill -KILL -- -"$pid" 2>/dev/null
+ fi
+ ) &
+ local watchdog_pid=$!
+
+ # Wait for the stage command
+ wait "$pid" 2>/dev/null
+ local rc=$?
+
+ # Kill the watchdog if it's still running
+ kill "$watchdog_pid" 2>/dev/null
+ wait "$watchdog_pid" 2>/dev/null
+
+ # Check if it was killed by timeout
+ if grep -q "^TIMEOUT:" "$logfile" 2>/dev/null; then
+ RUN_TIMED_OUT=1
+ return 1
+ fi
+
+ return "$rc"
+}
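+
+# Illustrative call (a sketch; the real stage invocations follow below):
+#
+#   run_with_timeout 120 "$OUT_DIR/stage.log" \
+#       ./reset_stress.sh --mount-point /mnt/cephfs
+#   rc=$?
+#   [[ "$RUN_TIMED_OUT" -eq 1 ]] && echo "stage hung and was killed"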
+
+find_status_path()
+{
+ local entry
+
+ if [[ -n "$CLIENT_ID" ]]; then
+ if [[ -r "$DEBUGFS_ROOT/$CLIENT_ID/reset/status" ]]; then
+ echo "$DEBUGFS_ROOT/$CLIENT_ID/reset/status"
+ return 0
+ fi
+ return 1
+ fi
+
+ for entry in "$DEBUGFS_ROOT"/*/; do
+ if [[ -r "${entry}reset/status" ]]; then
+ echo "${entry}reset/status"
+ return 0
+ fi
+ done
+ return 1
+}
+
+read_status_field()
+{
+ local status_path="$1"
+ local field="$2"
+ awk -F': ' -v key="$field" '$1 == key {print $2}' "$status_path" 2>/dev/null
+}
+
+# --- Parse arguments -------------------------------------------------------
+
+while [[ $# -gt 0 ]]; do
+ case "$1" in
+ --mount-point) MOUNT_POINT="$2"; shift 2 ;;
+ --out-dir) OUT_DIR="$2"; shift 2 ;;
+ --client-id) CLIENT_ID="$2"; shift 2 ;;
+ --debugfs-root) DEBUGFS_ROOT="$2"; shift 2 ;;
+ --help|-h) usage; exit 0 ;;
+ *) echo "Unknown option: $1" >&2; usage; exit 2 ;;
+ esac
+done
+
+if [[ -z "$MOUNT_POINT" ]]; then
+ echo "SKIP: --mount-point is required" >&2
+ usage
+ exit "$KSFT_SKIP"
+fi
+
+if [[ ! -d "$MOUNT_POINT" ]]; then
+ echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+ exit "$KSFT_SKIP"
+fi
+
+# Auto-detect client id when not specified, so all stages (including
+# stage 5 status check) use the same client consistently.
+if [[ -z "$CLIENT_ID" ]]; then
+ candidates=()
+ for entry in "$DEBUGFS_ROOT"/*/; do
+ name="$(basename "$entry")"
+ if [[ -r "${entry}reset/status" ]]; then
+ candidates+=("$name")
+ fi
+ done
+ if [[ ${#candidates[@]} -eq 1 ]]; then
+ CLIENT_ID="${candidates[0]}"
+ elif [[ ${#candidates[@]} -gt 1 ]]; then
+ echo "SKIP: Multiple Ceph clients found (${candidates[*]}). Use --client-id." >&2
+ exit "$KSFT_SKIP"
+ fi
+fi
+
+if [[ -n "$CLIENT_ID" ]]; then
+ CLIENT_ARGS=(--client-id "$CLIENT_ID")
+fi
+DEBUGFS_ARGS=(--debugfs-root "$DEBUGFS_ROOT")
+
+# Quick sanity: can we write to the mount?
+if ! touch "$MOUNT_POINT/.validation_probe_$$" 2>/dev/null; then
+ echo "SKIP: Mount point is not writable: $MOUNT_POINT" >&2
+ exit "$KSFT_SKIP"
+fi
+rm -f "$MOUNT_POINT/.validation_probe_$$"
+
+mkdir -p "$OUT_DIR"
+
+echo ""
+echo "=== CephFS Client Reset Validation ==="
+echo ""
+
+# --- Stage 1: Baseline (no resets) -----------------------------------------
+
+stage1_out="$OUT_DIR/stage1_baseline"
+if run_with_timeout "$STAGE1_TIMEOUT" "$stage1_out.log" \
+ "$SCRIPT_DIR/reset_stress.sh" \
+ --mount-point "$MOUNT_POINT" \
+ --profile baseline \
+ --no-reset \
+ --duration-sec 60 \
+ "${CLIENT_ARGS[@]}" \
+ "${DEBUGFS_ARGS[@]}" \
+ --out-dir "$stage1_out"; then
+ stage_result 1 "baseline" "PASS" "60s, no resets"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+ stage_result 1 "baseline" "FAIL" "HUNG: killed after ${STAGE1_TIMEOUT}s"
+else
+ stage_result 1 "baseline" "FAIL" "see $stage1_out.log"
+fi
+
+# --- Stage 2: Corner cases -------------------------------------------------
+
+stage2_out="$OUT_DIR/stage2_corner_cases"
+mkdir -p "$stage2_out"
+if run_with_timeout "$STAGE2_TIMEOUT" "$stage2_out/output.log" \
+ "$SCRIPT_DIR/reset_corner_cases.sh" \
+ "${CLIENT_ARGS[@]}" \
+ "${DEBUGFS_ARGS[@]}" \
+ --mount-point "$MOUNT_POINT"; then
+ pass_line=$(grep -Eo '[0-9]+ passed, [0-9]+ failed, [0-9]+ skipped' "$stage2_out/output.log" | tail -1)
+ stage_result 2 "corner_cases" "PASS" "${pass_line:-all tests passed}"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+ stage_result 2 "corner_cases" "FAIL" "HUNG: killed after ${STAGE2_TIMEOUT}s"
+else
+ fail_line=$(grep -c 'FAIL' "$stage2_out/output.log" 2>/dev/null || true)
+ stage_result 2 "corner_cases" "FAIL" "${fail_line:-?} failures, see $stage2_out/output.log"
+fi
+
+# --- Stage 3: Moderate resets -----------------------------------------------
+
+stage3_out="$OUT_DIR/stage3_moderate"
+if run_with_timeout "$STAGE3_TIMEOUT" "$stage3_out.log" \
+ "$SCRIPT_DIR/reset_stress.sh" \
+ --mount-point "$MOUNT_POINT" \
+ --profile moderate \
+ --duration-sec 120 \
+ "${CLIENT_ARGS[@]}" \
+ "${DEBUGFS_ARGS[@]}" \
+ --out-dir "$stage3_out"; then
+ stage_result 3 "moderate" "PASS" "120s, resets every 5-15s"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+ stage_result 3 "moderate" "FAIL" "HUNG: killed after ${STAGE3_TIMEOUT}s"
+else
+ stage_result 3 "moderate" "FAIL" "see $stage3_out.log"
+fi
+
+# --- Stage 4: Aggressive resets ---------------------------------------------
+
+stage4_out="$OUT_DIR/stage4_aggressive"
+if run_with_timeout "$STAGE4_TIMEOUT" "$stage4_out.log" \
+ "$SCRIPT_DIR/reset_stress.sh" \
+ --mount-point "$MOUNT_POINT" \
+ --profile aggressive \
+ --duration-sec 120 \
+ "${CLIENT_ARGS[@]}" \
+ "${DEBUGFS_ARGS[@]}" \
+ --out-dir "$stage4_out"; then
+ stage_result 4 "aggressive" "PASS" "120s, resets every 1-5s"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+ stage_result 4 "aggressive" "FAIL" "HUNG: killed after ${STAGE4_TIMEOUT}s"
+else
+ stage_result 4 "aggressive" "FAIL" "see $stage4_out.log"
+fi
+
+# --- Stage 5: Post-run status check ----------------------------------------
+
+status_path=""
+if status_path=$(find_status_path); then
+ phase=$(read_status_field "$status_path" "phase")
+ last_errno=$(read_status_field "$status_path" "last_errno")
+ failure_count=$(read_status_field "$status_path" "failure_count")
+ drain_timed_out=$(read_status_field "$status_path" "drain_timed_out")
+ sessions_reset=$(read_status_field "$status_path" "sessions_reset")
+ blocked=$(read_status_field "$status_path" "blocked_requests")
+
+ # Save full status
+ cat "$status_path" > "$OUT_DIR/final_status.txt" 2>/dev/null
+
+ errors=""
+ [[ "$phase" != "idle" ]] && errors="${errors}phase=$phase "
+ [[ "$last_errno" != "0" ]] && errors="${errors}last_errno=$last_errno "
+ [[ "$failure_count" != "0" && -n "$failure_count" ]] && errors="${errors}failure_count=$failure_count "
+ [[ "$blocked" != "0" ]] && errors="${errors}blocked_requests=$blocked "
+
+ if [[ -z "$errors" ]]; then
+ detail="phase=$phase, last_errno=$last_errno, failure_count=${failure_count:-0}"
+ [[ "$drain_timed_out" == "yes" ]] && detail="$detail, drain_timed_out=yes"
+ [[ -n "$sessions_reset" ]] && detail="$detail, sessions_reset=$sessions_reset"
+ stage_result 5 "status_check" "PASS" "$detail"
+ else
+ stage_result 5 "status_check" "FAIL" "$errors"
+ fi
+else
+ stage_result 5 "status_check" "FAIL" "cannot read reset/status"
+fi
+
+# --- Summary ----------------------------------------------------------------
+
+echo ""
+if [[ "$FAIL" -eq 0 ]]; then
+ echo "RESULT: $PASS/$TOTAL stages passed"
+else
+ echo "RESULT: $PASS/$TOTAL stages passed, $FAIL FAILED"
+fi
+echo "Artifacts: $OUT_DIR"
+echo ""
+
+exit "$FAIL"
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v3 11/11] selftests: ceph: wire up Ceph reset kselftests and documentation
2026-04-29 12:51 [PATCH v3 00/11] ceph: manual client session reset Alex Markuze
` (9 preceding siblings ...)
2026-04-29 12:52 ` [PATCH v3 10/11] selftests: ceph: add validation harness Alex Markuze
@ 2026-04-29 12:52 ` Alex Markuze
10 siblings, 0 replies; 17+ messages in thread
From: Alex Markuze @ 2026-04-29 12:52 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Wire the CephFS reset test suite into the kselftest build:
- Add filesystems/ceph to the top-level selftests Makefile.
- Add the per-suite Makefile with run_validation.sh as TEST_PROGS.
- Add the settings file (kselftest timeout).
- Add the MAINTAINERS entry for the test directory.
- Add README with prerequisites, usage, and troubleshooting.
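With this wired up, the suite can be driven through the usual kselftest
entry point, e.g. (assuming a CephFS mount is available for the
wrapper's auto-detection):
  make -C tools/testing/selftests TARGETS=filesystems/ceph run_tests
or invoked standalone via run_validation.sh as described in the README.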
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
MAINTAINERS | 1 +
tools/testing/selftests/Makefile | 1 +
.../selftests/filesystems/ceph/Makefile | 7 ++
.../testing/selftests/filesystems/ceph/README | 84 +++++++++++++++++++
.../selftests/filesystems/ceph/settings | 1 +
5 files changed, 94 insertions(+)
create mode 100644 tools/testing/selftests/filesystems/ceph/Makefile
create mode 100644 tools/testing/selftests/filesystems/ceph/README
create mode 100644 tools/testing/selftests/filesystems/ceph/settings
diff --git a/MAINTAINERS b/MAINTAINERS
index d1cc0e12fe1f..87c36a26c1f2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5917,6 +5917,7 @@ B: https://tracker.ceph.com/
T: git https://github.com/ceph/ceph-client.git
F: Documentation/filesystems/ceph.rst
F: fs/ceph/
+F: tools/testing/selftests/filesystems/ceph/
CERTIFICATE HANDLING
M: David Howells <dhowells@redhat.com>
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 450f13ba4cca..81c01a7062e0 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -32,6 +32,7 @@ TARGETS += exec
TARGETS += fchmodat2
TARGETS += filesystems
TARGETS += filesystems/binderfs
+TARGETS += filesystems/ceph
TARGETS += filesystems/epoll
TARGETS += filesystems/fat
TARGETS += filesystems/overlayfs
diff --git a/tools/testing/selftests/filesystems/ceph/Makefile b/tools/testing/selftests/filesystems/ceph/Makefile
new file mode 100644
index 000000000000..4ad3e8d40d90
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+TEST_PROGS := run_validation.sh
+TEST_FILES := reset_stress.sh reset_corner_cases.sh \
+ validate_consistency.py README settings
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/filesystems/ceph/README b/tools/testing/selftests/filesystems/ceph/README
new file mode 100644
index 000000000000..eb0092b38f80
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/README
@@ -0,0 +1,84 @@
+# CephFS Client Reset Test Suite
+
+Test suite for the CephFS kernel client manual session reset feature.
+This trimmed set contains the single-client stress test, the targeted
+corner-case test, and the one-shot validation harness used during
+feature bring-up.
+
+## Prerequisites
+
+- Linux kernel with the CephFS client reset feature (this branch)
+- A running Ceph cluster with at least one MDS
+- Root access (debugfs requires it)
+- Python 3 (for validators)
+- flock utility (for lock tests, usually in util-linux)
+
+## Test inventory
+
+| Test | Script(s) | What it covers |
+|------|-----------|----------------|
+| Single-client stress | `reset_stress.sh` | I/O + resets + data integrity on one mount |
+| Corner cases | `reset_corner_cases.sh` | EBUSY, dirty caps, flock reclaim, unmount-during-reset |
+| Validation harness | `run_validation.sh` | baseline + corner cases + moderate/aggressive stress + final status check |
+
+## Quick start
+
+Stress run:
+
+ sudo ./reset_stress.sh --mount-point /mnt/cephfs --profile moderate
+
+Corner cases:
+
+ sudo ./reset_corner_cases.sh --mount-point /mnt/cephfs
+
+End-to-end validation:
+
+ sudo ./run_validation.sh --mount-point /mnt/cephfs
+
+## Stress profiles
+
+ baseline - no resets, 1 IO + 1 rename, 600s
+ moderate - reset every 5-15s, 2 IO + 1 rename, 900s
+ aggressive - reset every 1-5s, 4 IO + 2 rename, 900s
+ soak - reset every 5-15s, 2 IO + 1 rename, 3600s
+
+## Key options (all scripts)
+
+ --mount-point PATH CephFS mount point (required)
+ --client-id ID Debugfs client id (auto-detected if one)
+
+reset_stress.sh additionally accepts:
+
+ --profile NAME baseline|moderate|aggressive|soak
+ --duration-sec N Override profile runtime
+ --no-reset Disable reset injection
+ --out-dir PATH Artifact directory
+
+## Corner case tests
+
+ [1/4] ebusy_rejection Second reset rejected while first in-flight
+ [2/4] dirty_caps_at_reset Reset with unflushed dirty caps
+ [3/4] flock_after_reset Stale lock EIO + fresh lock after holder exit
+ [4/4] unmount_during_reset umount during active reset (destroy-path wakeup)
+
+Test 4 requires creating a second CephFS mount instance and SKIPs if
+the host cannot do so. See `--help` output for details.
+
+## Troubleshooting
+
+**No writable Ceph reset interface found:**
+Kernel lacks the reset feature, debugfs not mounted, or not root.
+Check: `ls /sys/kernel/debug/ceph/*/reset/`
+
+**Multiple Ceph clients found:**
+Use `--client-id` to select one.
+List: `ls /sys/kernel/debug/ceph/`
+
+## Files
+
+| File | Role |
+|------|------|
+| `reset_stress.sh` | Single-client stress test runner |
+| `validate_consistency.py` | Single-client post-run validator |
+| `reset_corner_cases.sh` | Corner case harness (4 sequential tests) |
+| `run_validation.sh` | One-shot validation harness |
diff --git a/tools/testing/selftests/filesystems/ceph/settings b/tools/testing/selftests/filesystems/ceph/settings
new file mode 100644
index 000000000000..79b65bdf05db
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/settings
@@ -0,0 +1 @@
+timeout=1200
--
2.34.1
^ permalink raw reply related [flat|nested] 17+ messages in thread