* [RFC PATCH v1 0/4] ceph: manual client session reset via debugfs
@ 2026-02-25 12:59 Alex Markuze
2026-02-25 12:59 ` [RFC PATCH v1 1/4] ceph: convert inode flags to named bit positions Alex Markuze
` (4 more replies)
0 siblings, 5 replies; 10+ messages in thread
From: Alex Markuze @ 2026-02-25 12:59 UTC (permalink / raw)
To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko
In production CephFS deployments we regularly encounter situations
where MDS sessions become stuck or hung, requiring a full unmount/
remount cycle to recover. This is disruptive for workloads that
cannot tolerate the downtime or the loss of cached state that comes
with unmounting.
This series adds a mechanism to manually trigger MDS session
reconnection from the client side without unmounting, exposed via
debugfs:
echo "reason text" > /sys/kernel/debug/ceph/<fsid>/reset/trigger
cat /sys/kernel/debug/ceph/<fsid>/reset/status
The reset lifecycle:
1. Operator writes to the trigger file
2. A work item collects all active sessions and initiates
reconnection on each (via send_mds_reconnect with from_reset=true)
3. New metadata requests and lock acquisitions are blocked until
the reset completes (120s timeout)
4. Session completions are tracked via a per-session generation
counter to handle stale completions from timed-out prior resets
5. Lock reclamation is always attempted during reset-initiated
reconnects, regardless of prior CEPH_I_ERROR_FILELOCK state
Patch breakdown:
1. Convert CEPH_I_* flags to named bit positions so test_bit/
set_bit/clear_bit can be used in reconnect paths
2. Make wait_caps_flush() bounded with periodic diagnostic dumps
to aid debugging hung flush scenarios (independent improvement
that surfaced during development)
3. The core reset implementation: debugfs interface, reset work
function, request/lock blocking, tracepoints, and session
reconnect completion tracking
4. Rework mds_peer_reset() to properly handle all session states
and integrate with the reset completion tracking
Open questions / areas for review:
- Is debugfs the right interface, or should this be a mount option
/ sysfs attribute / netlink command?
- The request gating in ceph_mdsc_submit_request() is best-effort
(no lock serialization) to avoid penalizing the normal path.
Is this acceptable?
- Should the 60s reconnect timeout and 120s blocked-request timeout
be configurable (e.g. via mount options)?
- mds_peer_reset() rework (patch 4) is substantial -- would
reviewers prefer it split further?
Alex Markuze (4):
ceph: convert inode flags to named bit positions
ceph: add bounded timeout and diagnostics to wait_caps_flush()
ceph: implement manual client session reset via debugfs
ceph: rework mds_peer_reset() for robust session recovery
fs/ceph/caps.c | 7 +
fs/ceph/debugfs.c | 171 ++++++++++-
fs/ceph/locks.c | 24 +-
fs/ceph/mds_client.c | 577 ++++++++++++++++++++++++++++++++++--
fs/ceph/mds_client.h | 33 +++
fs/ceph/super.h | 60 ++--
include/trace/events/ceph.h | 60 ++++
7 files changed, 879 insertions(+), 53 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 10+ messages in thread
* [RFC PATCH v1 1/4] ceph: convert inode flags to named bit positions
2026-02-25 12:59 [RFC PATCH v1 0/4] ceph: manual client session reset via debugfs Alex Markuze
@ 2026-02-25 12:59 ` Alex Markuze
2026-02-25 20:56 ` [EXTERNAL] " Viacheslav Dubeyko
2026-02-25 12:59 ` [RFC PATCH v1 2/4] ceph: add bounded timeout and diagnostics to wait_caps_flush() Alex Markuze
` (3 subsequent siblings)
4 siblings, 1 reply; 10+ messages in thread
From: Alex Markuze @ 2026-02-25 12:59 UTC (permalink / raw)
To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko
Define all CEPH_I_* flags as named bit positions with derived
bitmask values, making them usable with test_bit/set_bit/clear_bit.
Previously only CEPH_I_ODIRECT_BIT and CEPH_ASYNC_CREATE_BIT had
named bit positions; the rest were bare bitmask constants.
Convert CEPH_I_ERROR_WRITE and CEPH_I_ERROR_FILELOCK usage sites
to use atomic bit operations (test_bit, set_bit, clear_bit) via
the new _BIT constants.
This is preparation for the client reset feature which needs
test_bit() on CEPH_I_ERROR_FILELOCK_BIT in reconnect paths.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/locks.c | 8 +++----
fs/ceph/mds_client.c | 3 ++-
fs/ceph/super.h | 54 +++++++++++++++++++++++++++-----------------
3 files changed, 38 insertions(+), 27 deletions(-)
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index dd764f9c64b9..2f21574dfb99 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -58,7 +58,7 @@ static void ceph_fl_release_lock(struct file_lock *fl)
if (atomic_dec_and_test(&ci->i_filelock_ref)) {
/* clear error when all locks are released */
spin_lock(&ci->i_ceph_lock);
- ci->i_ceph_flags &= ~CEPH_I_ERROR_FILELOCK;
+ clear_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
spin_unlock(&ci->i_ceph_lock);
}
fl->fl_u.ceph.inode = NULL;
@@ -272,9 +272,8 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
wait = 1;
spin_lock(&ci->i_ceph_lock);
- if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
+ if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags))
err = -EIO;
- }
spin_unlock(&ci->i_ceph_lock);
if (err < 0) {
if (op == CEPH_MDS_OP_SETFILELOCK && lock_is_unlock(fl))
@@ -332,9 +331,8 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
doutc(cl, "fl_file: %p\n", fl->c.flc_file);
spin_lock(&ci->i_ceph_lock);
- if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
+ if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags))
err = -EIO;
- }
spin_unlock(&ci->i_ceph_lock);
if (err < 0) {
if (lock_is_unlock(fl))
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 23b6d00643c9..28bb27b09b40 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3610,7 +3610,8 @@ static void __do_request(struct ceph_mds_client *mdsc,
spin_lock(&ci->i_ceph_lock);
cap = ci->i_auth_cap;
- if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE && mds != cap->mds) {
+ if (test_bit(CEPH_ASYNC_CREATE_BIT, &ci->i_ceph_flags) &&
+ mds != cap->mds) {
doutc(cl, "session changed for auth cap %d -> %d\n",
cap->session->s_mds, session->s_mds);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 29a980e22dc2..69a71848240f 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -655,23 +655,35 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
/*
* Ceph inode.
*/
-#define CEPH_I_DIR_ORDERED (1 << 0) /* dentries in dir are ordered */
-#define CEPH_I_FLUSH (1 << 2) /* do not delay flush of dirty metadata */
-#define CEPH_I_POOL_PERM (1 << 3) /* pool rd/wr bits are valid */
-#define CEPH_I_POOL_RD (1 << 4) /* can read from pool */
-#define CEPH_I_POOL_WR (1 << 5) /* can write to pool */
-#define CEPH_I_SEC_INITED (1 << 6) /* security initialized */
-#define CEPH_I_KICK_FLUSH (1 << 7) /* kick flushing caps */
-#define CEPH_I_FLUSH_SNAPS (1 << 8) /* need flush snapss */
-#define CEPH_I_ERROR_WRITE (1 << 9) /* have seen write errors */
-#define CEPH_I_ERROR_FILELOCK (1 << 10) /* have seen file lock errors */
-#define CEPH_I_ODIRECT_BIT (11) /* inode in direct I/O mode */
-#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
-#define CEPH_ASYNC_CREATE_BIT (12) /* async create in flight for this */
-#define CEPH_I_ASYNC_CREATE (1 << CEPH_ASYNC_CREATE_BIT)
-#define CEPH_I_SHUTDOWN (1 << 13) /* inode is no longer usable */
-#define CEPH_I_ASYNC_CHECK_CAPS (1 << 14) /* check caps immediately after async
- creating finishes */
+#define CEPH_I_DIR_ORDERED_BIT (0) /* dentries in dir are ordered */
+#define CEPH_I_FLUSH_BIT (2) /* do not delay flush of dirty metadata */
+#define CEPH_I_POOL_PERM_BIT (3) /* pool rd/wr bits are valid */
+#define CEPH_I_POOL_RD_BIT (4) /* can read from pool */
+#define CEPH_I_POOL_WR_BIT (5) /* can write to pool */
+#define CEPH_I_SEC_INITED_BIT (6) /* security initialized */
+#define CEPH_I_KICK_FLUSH_BIT (7) /* kick flushing caps */
+#define CEPH_I_FLUSH_SNAPS_BIT (8) /* need flush snaps */
+#define CEPH_I_ERROR_WRITE_BIT (9) /* have seen write errors */
+#define CEPH_I_ERROR_FILELOCK_BIT (10) /* have seen file lock errors */
+#define CEPH_I_ODIRECT_BIT (11) /* inode in direct I/O mode */
+#define CEPH_ASYNC_CREATE_BIT (12) /* async create in flight for this */
+#define CEPH_I_SHUTDOWN_BIT (13) /* inode is no longer usable */
+#define CEPH_I_ASYNC_CHECK_CAPS_BIT (14) /* check caps after async creating finishes */
+
+#define CEPH_I_DIR_ORDERED (1 << CEPH_I_DIR_ORDERED_BIT)
+#define CEPH_I_FLUSH (1 << CEPH_I_FLUSH_BIT)
+#define CEPH_I_POOL_PERM (1 << CEPH_I_POOL_PERM_BIT)
+#define CEPH_I_POOL_RD (1 << CEPH_I_POOL_RD_BIT)
+#define CEPH_I_POOL_WR (1 << CEPH_I_POOL_WR_BIT)
+#define CEPH_I_SEC_INITED (1 << CEPH_I_SEC_INITED_BIT)
+#define CEPH_I_KICK_FLUSH (1 << CEPH_I_KICK_FLUSH_BIT)
+#define CEPH_I_FLUSH_SNAPS (1 << CEPH_I_FLUSH_SNAPS_BIT)
+#define CEPH_I_ERROR_WRITE (1 << CEPH_I_ERROR_WRITE_BIT)
+#define CEPH_I_ERROR_FILELOCK (1 << CEPH_I_ERROR_FILELOCK_BIT)
+#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
+#define CEPH_I_ASYNC_CREATE (1 << CEPH_ASYNC_CREATE_BIT)
+#define CEPH_I_SHUTDOWN (1 << CEPH_I_SHUTDOWN_BIT)
+#define CEPH_I_ASYNC_CHECK_CAPS (1 << CEPH_I_ASYNC_CHECK_CAPS_BIT)
/*
* Masks of ceph inode work.
@@ -691,18 +703,18 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
*/
static inline void ceph_set_error_write(struct ceph_inode_info *ci)
{
- if (!(READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE)) {
+ if (!test_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags)) {
spin_lock(&ci->i_ceph_lock);
- ci->i_ceph_flags |= CEPH_I_ERROR_WRITE;
+ set_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
spin_unlock(&ci->i_ceph_lock);
}
}
static inline void ceph_clear_error_write(struct ceph_inode_info *ci)
{
- if (READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE) {
+ if (test_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags)) {
spin_lock(&ci->i_ceph_lock);
- ci->i_ceph_flags &= ~CEPH_I_ERROR_WRITE;
+ clear_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
spin_unlock(&ci->i_ceph_lock);
}
}
--
2.34.1
* [RFC PATCH v1 2/4] ceph: add bounded timeout and diagnostics to wait_caps_flush()
2026-02-25 12:59 [RFC PATCH v1 0/4] ceph: manual client session reset via debugfs Alex Markuze
2026-02-25 12:59 ` [RFC PATCH v1 1/4] ceph: convert inode flags to named bit positions Alex Markuze
@ 2026-02-25 12:59 ` Alex Markuze
2026-02-25 12:59 ` [RFC PATCH v1 3/4] ceph: implement manual client session reset via debugfs Alex Markuze
` (2 subsequent siblings)
4 siblings, 0 replies; 10+ messages in thread
From: Alex Markuze @ 2026-02-25 12:59 UTC (permalink / raw)
To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko
Convert wait_caps_flush() from an unbounded wait_event() to a
wait_event_timeout() loop with a 60-second period. Each time a
period expires without the flush completing, dump the pending
cap_flush list (up to 5 times in total) to aid debugging of hung
flush scenarios.
Add a ci back-pointer to struct ceph_cap_flush so the diagnostic
dump can identify which inodes have outstanding flushes. Add
i_last_cap_flush_ack to ceph_inode_info for tracking the latest
acknowledged flush tid per inode.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/caps.c | 7 +++++++
fs/ceph/mds_client.c | 41 +++++++++++++++++++++++++++++++++++++++--
fs/ceph/super.h | 2 ++
3 files changed, 48 insertions(+), 2 deletions(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d51454e995a8..be030fb8e864 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1648,6 +1648,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
spin_lock(&mdsc->cap_dirty_lock);
capsnap->cap_flush.tid = ++mdsc->last_cap_flush_tid;
+ capsnap->cap_flush.ci = ci;
list_add_tail(&capsnap->cap_flush.g_list,
&mdsc->cap_flush_list);
if (oldest_flush_tid == 0)
@@ -1846,6 +1847,9 @@ struct ceph_cap_flush *ceph_alloc_cap_flush(void)
return NULL;
cf->is_capsnap = false;
+ cf->ci = NULL;
+ cf->tid = 0;
+ cf->wake = false;
return cf;
}
@@ -1931,6 +1935,7 @@ static u64 __mark_caps_flushing(struct inode *inode,
doutc(cl, "%p %llx.%llx now !dirty\n", inode, ceph_vinop(inode));
swap(cf, ci->i_prealloc_cap_flush);
+ cf->ci = ci;
cf->caps = flushing;
cf->wake = wake;
@@ -3826,6 +3831,8 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
bool wake_ci = false;
bool wake_mdsc = false;
+ ci->i_last_cap_flush_ack = flush_tid;
+
list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) {
/* Is this the one that was flushed? */
if (cf->tid == flush_tid)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 28bb27b09b40..e27f2f148dea 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -27,6 +27,8 @@
#include <trace/events/ceph.h>
#define RECONNECT_MAX_SIZE (INT_MAX - PAGE_SIZE)
+#define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60
+#define CEPH_CAP_FLUSH_MAX_DUMP_COUNT 5
/*
* A cluster of MDS (metadata server) daemons is responsible for
@@ -2285,6 +2287,34 @@ static int check_caps_flush(struct ceph_mds_client *mdsc,
return ret;
}
+static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
+{
+ struct ceph_client *cl = mdsc->fsc->client;
+ struct ceph_cap_flush *cf;
+
+ pr_info_client(cl, "still waiting for cap flushes through %llu:\n",
+ want_tid);
+ spin_lock(&mdsc->cap_dirty_lock);
+ list_for_each_entry(cf, &mdsc->cap_flush_list, g_list) {
+ if (cf->tid > want_tid)
+ break;
+ if (!cf->ci) {
+ pr_info_client(cl,
+ "(null ci) %s tid=%llu wake=%d%s\n",
+ ceph_cap_string(cf->caps), cf->tid,
+ cf->wake,
+ cf->is_capsnap ? " is_capsnap" : "");
+ continue;
+ }
+		pr_info_client(cl, "%llx.%llx %s %llu %llu %d%s\n",
+ ceph_vinop(&cf->ci->netfs.inode),
+ ceph_cap_string(cf->caps), cf->tid,
+ cf->ci->i_last_cap_flush_ack, cf->wake,
+ cf->is_capsnap ? " is_capsnap" : "");
+ }
+ spin_unlock(&mdsc->cap_dirty_lock);
+}
+
/*
* flush all dirty inode data to disk.
*
@@ -2294,11 +2324,18 @@ static void wait_caps_flush(struct ceph_mds_client *mdsc,
u64 want_flush_tid)
{
struct ceph_client *cl = mdsc->fsc->client;
+ int i = 0;
+ long ret;
doutc(cl, "want %llu\n", want_flush_tid);
- wait_event(mdsc->cap_flushing_wq,
- check_caps_flush(mdsc, want_flush_tid));
+ do {
+ ret = wait_event_timeout(mdsc->cap_flushing_wq,
+ check_caps_flush(mdsc, want_flush_tid),
+ CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC * HZ);
+ if (ret == 0 && i++ < CEPH_CAP_FLUSH_MAX_DUMP_COUNT)
+ dump_cap_flushes(mdsc, want_flush_tid);
+ } while (ret == 0);
doutc(cl, "ok, flushed thru %llu\n", want_flush_tid);
}
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 69a71848240f..9e80c816aa7a 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -238,6 +238,7 @@ struct ceph_cap_flush {
bool is_capsnap; /* true means capsnap */
struct list_head g_list; // global
struct list_head i_list; // per inode
+ struct ceph_inode_info *ci;
};
/*
@@ -443,6 +444,7 @@ struct ceph_inode_info {
struct ceph_snap_context *i_head_snapc; /* set if wr_buffer_head > 0 or
dirty|flushing caps */
unsigned i_snap_caps; /* cap bits for snapped files */
+ u64 i_last_cap_flush_ack; /* latest cap flush_ack tid for this inode */
unsigned long i_last_rd;
unsigned long i_last_wr;
--
2.34.1
* [RFC PATCH v1 3/4] ceph: implement manual client session reset via debugfs
2026-02-25 12:59 [RFC PATCH v1 0/4] ceph: manual client session reset via debugfs Alex Markuze
2026-02-25 12:59 ` [RFC PATCH v1 1/4] ceph: convert inode flags to named bit positions Alex Markuze
2026-02-25 12:59 ` [RFC PATCH v1 2/4] ceph: add bounded timeout and diagnostics to wait_caps_flush() Alex Markuze
@ 2026-02-25 12:59 ` Alex Markuze
2026-02-25 12:59 ` [RFC PATCH v1 4/4] ceph: rework mds_peer_reset() for robust session recovery Alex Markuze
2026-02-25 20:51 ` [EXTERNAL] [RFC PATCH v1 0/4] ceph: manual client session reset via debugfs Viacheslav Dubeyko
4 siblings, 0 replies; 10+ messages in thread
From: Alex Markuze @ 2026-02-25 12:59 UTC (permalink / raw)
To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko
Add a mechanism for operators to trigger MDS session reconnection
without unmounting the filesystem. This is useful for recovering
from stuck or hung MDS sessions in production environments.
The reset lifecycle:
1. Operator writes to /sys/kernel/debug/ceph/<fsid>/reset/trigger
2. A work item collects all active sessions and calls
send_mds_reconnect() on each with from_reset=true
3. New metadata requests and lock acquisitions are blocked
(with a 120s timeout) until the reset completes
4. Session reconnect completions are tracked via a per-session
generation counter (s_reset_gen) to avoid counting stale
completions from timed-out prior resets
5. The reset work waits up to 60s for all sessions to reconnect
Debugfs interface under reset/:
- trigger (write-only): initiate a reset with optional reason
- status (read-only): show counters and current state
- inject_error (write-only): fault injection for testing
Lock handling during reset-initiated reconnects always attempts
optimistic lock reclamation (flock_len=1) regardless of the
CEPH_I_ERROR_FILELOCK state, since the reset is meant to recover
from exactly those error conditions.
Tracepoints are added for the reset lifecycle:
ceph_client_reset_schedule, ceph_client_reset_complete,
ceph_client_reset_blocked, ceph_client_reset_unblocked.
Note: the reset tracepoints are added to include/trace/events/ceph.h
(merged with the existing upstream tracepoint infrastructure) rather
than to the separate ceph_client.h used in the original patch.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/debugfs.c | 171 +++++++++++++-
fs/ceph/locks.c | 16 ++
fs/ceph/mds_client.c | 444 +++++++++++++++++++++++++++++++++++-
fs/ceph/mds_client.h | 33 +++
fs/ceph/super.h | 4 +
include/trace/events/ceph.h | 60 +++++
6 files changed, 707 insertions(+), 21 deletions(-)
diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index f3fe786b4143..b6d957a90043 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -9,6 +9,8 @@
#include <linux/seq_file.h>
#include <linux/math64.h>
#include <linux/ktime.h>
+#include <linux/uaccess.h>
+#include <linux/kstrtox.h>
#include <linux/ceph/libceph.h>
#include <linux/ceph/mon_client.h>
@@ -22,6 +24,10 @@
#include "mds_client.h"
#include "metric.h"
+#define CEPH_DEBUGFS_MODE_READONLY 0400
+#define CEPH_DEBUGFS_MODE_WRITEONLY 0200
+#define CEPH_DEBUGFS_MODE_READWRITE 0600
+
static int mdsmap_show(struct seq_file *s, void *p)
{
int i;
@@ -360,16 +366,140 @@ static int status_show(struct seq_file *s, void *p)
return 0;
}
+static int reset_status_show(struct seq_file *s, void *p)
+{
+ struct ceph_fs_client *fsc = s->private;
+ struct ceph_mds_client *mdsc = fsc->mdsc;
+ struct ceph_client_reset_state *st;
+ u64 trigger = 0, success = 0, failure = 0;
+ unsigned long last_start = 0, last_finish = 0;
+ int last_errno = 0;
+ bool in_progress = false;
+ bool inject_error = false;
+ int pending_reconnects = 0;
+ int blocked_requests = 0;
+ char reason[CEPH_CLIENT_RESET_REASON_LEN];
+
+ if (!mdsc)
+ return 0;
+
+ st = &mdsc->reset_state;
+
+ spin_lock(&st->lock);
+ trigger = st->trigger_count;
+ success = st->success_count;
+ failure = st->failure_count;
+ last_start = st->last_start;
+ last_finish = st->last_finish;
+ last_errno = st->last_errno;
+ in_progress = st->in_progress;
+ inject_error = st->inject_error;
+ strscpy(reason, st->last_reason, sizeof(reason));
+ spin_unlock(&st->lock);
+
+ pending_reconnects = atomic_read(&st->pending_reconnects);
+ blocked_requests = atomic_read(&st->blocked_requests);
+
+ seq_printf(s, "in_progress: %s\n", in_progress ? "yes" : "no");
+ seq_printf(s, "trigger_count: %llu\n", trigger);
+ seq_printf(s, "success_count: %llu\n", success);
+ seq_printf(s, "failure_count: %llu\n", failure);
+ if (last_start)
+ seq_printf(s, "last_start_ms_ago: %u\n",
+ jiffies_to_msecs(jiffies - last_start));
+ else
+ seq_puts(s, "last_start_ms_ago: (never)\n");
+ if (last_finish)
+ seq_printf(s, "last_finish_ms_ago: %u\n",
+ jiffies_to_msecs(jiffies - last_finish));
+ else
+ seq_puts(s, "last_finish_ms_ago: (never)\n");
+ seq_printf(s, "last_errno: %d\n", last_errno);
+ seq_printf(s, "last_reason: %s\n",
+ reason[0] ? reason : "(none)");
+ seq_printf(s, "inject_error_pending: %s\n",
+ inject_error ? "yes" : "no");
+ seq_printf(s, "pending_reconnects: %d\n", pending_reconnects);
+ seq_printf(s, "blocked_requests: %d\n", blocked_requests);
+
+ return 0;
+}
+
+static ssize_t reset_trigger_write(struct file *file, const char __user *buf,
+ size_t len, loff_t *ppos)
+{
+ struct ceph_fs_client *fsc = file->private_data;
+ struct ceph_mds_client *mdsc = fsc->mdsc;
+ char reason[CEPH_CLIENT_RESET_REASON_LEN];
+ size_t copy;
+ int ret;
+
+ if (!mdsc)
+ return -ENODEV;
+
+ copy = min_t(size_t, len, sizeof(reason) - 1);
+ if (copy && copy_from_user(reason, buf, copy))
+ return -EFAULT;
+ reason[copy] = '\0';
+ strim(reason);
+
+ ret = ceph_mdsc_schedule_reset(mdsc, reason);
+ if (ret)
+ return ret;
+
+ return len;
+}
+
+static ssize_t reset_inject_error_write(struct file *file,
+ const char __user *buf,
+ size_t len, loff_t *ppos)
+{
+ struct ceph_fs_client *fsc = file->private_data;
+ struct ceph_mds_client *mdsc = fsc->mdsc;
+ struct ceph_client_reset_state *st;
+ bool enable;
+ int ret;
+
+ if (!mdsc)
+ return -ENODEV;
+
+ ret = kstrtobool_from_user(buf, len, &enable);
+ if (ret)
+ return ret;
+
+ st = &mdsc->reset_state;
+ spin_lock(&st->lock);
+ st->inject_error = enable;
+ spin_unlock(&st->lock);
+
+ return len;
+}
+
DEFINE_SHOW_ATTRIBUTE(mdsmap);
DEFINE_SHOW_ATTRIBUTE(mdsc);
DEFINE_SHOW_ATTRIBUTE(caps);
DEFINE_SHOW_ATTRIBUTE(mds_sessions);
DEFINE_SHOW_ATTRIBUTE(status);
+DEFINE_SHOW_ATTRIBUTE(reset_status);
DEFINE_SHOW_ATTRIBUTE(metrics_file);
DEFINE_SHOW_ATTRIBUTE(metrics_latency);
DEFINE_SHOW_ATTRIBUTE(metrics_size);
DEFINE_SHOW_ATTRIBUTE(metrics_caps);
+static const struct file_operations ceph_reset_trigger_fops = {
+ .owner = THIS_MODULE,
+ .open = simple_open,
+ .write = reset_trigger_write,
+ .llseek = noop_llseek,
+};
+
+static const struct file_operations ceph_reset_inject_error_fops = {
+ .owner = THIS_MODULE,
+ .open = simple_open,
+ .write = reset_inject_error_write,
+ .llseek = noop_llseek,
+};
+
/*
* debugfs
@@ -404,6 +534,10 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
debugfs_remove(fsc->debugfs_caps);
debugfs_remove(fsc->debugfs_status);
debugfs_remove(fsc->debugfs_mdsc);
+ debugfs_remove(fsc->debugfs_reset_status);
+ debugfs_remove(fsc->debugfs_reset_force);
+ debugfs_remove(fsc->debugfs_reset_inject);
+ debugfs_remove_recursive(fsc->debugfs_reset_dir);
debugfs_remove_recursive(fsc->debugfs_metrics_dir);
doutc(fsc->client, "done\n");
}
@@ -415,7 +549,7 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
doutc(fsc->client, "begin\n");
fsc->debugfs_congestion_kb =
debugfs_create_file("writeback_congestion_kb",
- 0600,
+ CEPH_DEBUGFS_MODE_READWRITE,
fsc->client->debugfs_dir,
fsc,
&congestion_kb_fops);
@@ -428,31 +562,48 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
name);
fsc->debugfs_mdsmap = debugfs_create_file("mdsmap",
- 0400,
+ CEPH_DEBUGFS_MODE_READONLY,
fsc->client->debugfs_dir,
fsc,
&mdsmap_fops);
fsc->debugfs_mds_sessions = debugfs_create_file("mds_sessions",
- 0400,
+ CEPH_DEBUGFS_MODE_READONLY,
fsc->client->debugfs_dir,
fsc,
&mds_sessions_fops);
fsc->debugfs_mdsc = debugfs_create_file("mdsc",
- 0400,
+ CEPH_DEBUGFS_MODE_READONLY,
fsc->client->debugfs_dir,
fsc,
&mdsc_fops);
fsc->debugfs_caps = debugfs_create_file("caps",
- 0400,
+ CEPH_DEBUGFS_MODE_READONLY,
fsc->client->debugfs_dir,
fsc,
&caps_fops);
+ fsc->debugfs_reset_dir = debugfs_create_dir("reset",
+ fsc->client->debugfs_dir);
+	if (!IS_ERR(fsc->debugfs_reset_dir)) {
+ fsc->debugfs_reset_force =
+ debugfs_create_file("trigger", CEPH_DEBUGFS_MODE_WRITEONLY,
+ fsc->debugfs_reset_dir, fsc,
+ &ceph_reset_trigger_fops);
+ fsc->debugfs_reset_status =
+ debugfs_create_file("status", CEPH_DEBUGFS_MODE_READONLY,
+ fsc->debugfs_reset_dir, fsc,
+ &reset_status_fops);
+ fsc->debugfs_reset_inject =
+ debugfs_create_file("inject_error", CEPH_DEBUGFS_MODE_WRITEONLY,
+ fsc->debugfs_reset_dir, fsc,
+ &ceph_reset_inject_error_fops);
+ }
+
fsc->debugfs_status = debugfs_create_file("status",
- 0400,
+ CEPH_DEBUGFS_MODE_READONLY,
fsc->client->debugfs_dir,
fsc,
&status_fops);
@@ -460,13 +611,13 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
fsc->debugfs_metrics_dir = debugfs_create_dir("metrics",
fsc->client->debugfs_dir);
- debugfs_create_file("file", 0400, fsc->debugfs_metrics_dir, fsc,
+ debugfs_create_file("file", CEPH_DEBUGFS_MODE_READONLY, fsc->debugfs_metrics_dir, fsc,
&metrics_file_fops);
- debugfs_create_file("latency", 0400, fsc->debugfs_metrics_dir, fsc,
+ debugfs_create_file("latency", CEPH_DEBUGFS_MODE_READONLY, fsc->debugfs_metrics_dir, fsc,
&metrics_latency_fops);
- debugfs_create_file("size", 0400, fsc->debugfs_metrics_dir, fsc,
+ debugfs_create_file("size", CEPH_DEBUGFS_MODE_READONLY, fsc->debugfs_metrics_dir, fsc,
&metrics_size_fops);
- debugfs_create_file("caps", 0400, fsc->debugfs_metrics_dir, fsc,
+ debugfs_create_file("caps", CEPH_DEBUGFS_MODE_READONLY, fsc->debugfs_metrics_dir, fsc,
&metrics_caps_fops);
doutc(fsc->client, "done\n");
}
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index 2f21574dfb99..079947a5e42a 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -251,6 +251,7 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
{
struct inode *inode = file_inode(file);
struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
struct ceph_client *cl = ceph_inode_to_client(inode);
int err = 0;
u16 op = CEPH_MDS_OP_SETFILELOCK;
@@ -281,6 +282,13 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
return err;
}
+ /* Wait for reset to complete before acquiring new locks */
+ if (op == CEPH_MDS_OP_SETFILELOCK && !lock_is_unlock(fl)) {
+ err = ceph_mdsc_wait_for_reset(mdsc);
+ if (err)
+ return err;
+ }
+
if (lock_is_read(fl))
lock_cmd = CEPH_LOCK_SHARED;
else if (lock_is_write(fl))
@@ -317,6 +325,7 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
{
struct inode *inode = file_inode(file);
struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
struct ceph_client *cl = ceph_inode_to_client(inode);
int err = 0;
u8 wait = 0;
@@ -340,6 +349,13 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
return err;
}
+ /* Wait for reset to complete before acquiring new locks */
+ if (!lock_is_unlock(fl)) {
+ err = ceph_mdsc_wait_for_reset(mdsc);
+ if (err)
+ return err;
+ }
+
if (IS_SETLKW(cmd))
wait = 1;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index e27f2f148dea..d350bf502a15 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -12,6 +12,7 @@
#include <linux/bits.h>
#include <linux/ktime.h>
#include <linux/bitmap.h>
+#include <linux/workqueue.h>
#include <linux/mnt_idmapping.h>
#include "super.h"
@@ -61,12 +62,16 @@ struct ceph_reconnect_state {
struct ceph_pagelist *pagelist;
unsigned msg_version;
bool allow_multi;
+ bool from_reset;
};
static void __wake_requests(struct ceph_mds_client *mdsc,
struct list_head *head);
static void ceph_cap_release_work(struct work_struct *work);
static void ceph_cap_reclaim_work(struct work_struct *work);
+static void ceph_mdsc_reset_workfn(struct work_struct *work);
+static void ceph_mdsc_reconnect_session_done(struct ceph_mds_client *mdsc,
+ struct ceph_mds_session *session);
static const struct ceph_connection_operations mds_con_ops;
@@ -3742,6 +3747,22 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
struct ceph_client *cl = mdsc->fsc->client;
int err = 0;
+ /*
+ * If a reset is in progress, wait for it to complete.
+ *
+ * This is best-effort: a request can pass this check just
+ * before in_progress is set and proceed concurrently with
+ * reset. That is acceptable because (a) such requests will
+ * either complete normally or fail and be retried by the
+ * caller, and (b) adding lock serialization here would
+ * penalize every request for a rare manual operation.
+ */
+ err = ceph_mdsc_wait_for_reset(mdsc);
+ if (err) {
+ doutc(cl, "wait_for_reset failed: %d\n", err);
+ return err;
+ }
+
/* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
if (req->r_inode)
ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
@@ -4374,9 +4395,11 @@ static void handle_session(struct ceph_mds_session *session,
switch (op) {
case CEPH_SESSION_OPEN:
- if (session->s_state == CEPH_MDS_SESSION_RECONNECTING)
+ if (session->s_state == CEPH_MDS_SESSION_RECONNECTING) {
pr_info_client(cl, "mds%d reconnect success\n",
session->s_mds);
+ ceph_mdsc_reconnect_session_done(mdsc, session);
+ }
session->s_features = features;
if (session->s_state == CEPH_MDS_SESSION_OPEN) {
@@ -4409,9 +4432,11 @@ static void handle_session(struct ceph_mds_session *session,
break;
case CEPH_SESSION_CLOSE:
- if (session->s_state == CEPH_MDS_SESSION_RECONNECTING)
+ if (session->s_state == CEPH_MDS_SESSION_RECONNECTING) {
pr_info_client(cl, "mds%d reconnect denied\n",
session->s_mds);
+ ceph_mdsc_reconnect_session_done(mdsc, session);
+ }
session->s_state = CEPH_MDS_SESSION_CLOSED;
cleanup_session_requests(mdsc, session);
remove_session_caps(session);
@@ -4450,7 +4475,13 @@ static void handle_session(struct ceph_mds_session *session,
break;
case CEPH_SESSION_REJECT:
- WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING);
+ WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING &&
+ session->s_state != CEPH_MDS_SESSION_RECONNECTING);
+ if (session->s_state == CEPH_MDS_SESSION_RECONNECTING) {
+ pr_info_client(cl, "mds%d reconnect rejected\n",
+ session->s_mds);
+ ceph_mdsc_reconnect_session_done(mdsc, session);
+ }
pr_info_client(cl, "mds%d rejected session\n",
session->s_mds);
session->s_state = CEPH_MDS_SESSION_REJECTED;
@@ -4712,6 +4743,17 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
cap->mseq = 0; /* and migrate_seq */
cap->cap_gen = atomic_read(&cap->session->s_cap_gen);
+ /*
+ * Don't set CEPH_I_ERROR_FILELOCK here - instead, we attempt to
+ * reclaim locks by sending them in the reconnect message.
+ * The MDS will do best-effort reclaim. If locks can't be reclaimed
+ * (e.g., another client grabbed a conflicting lock), future lock
+ * operations will fail and the error flag will be set then.
+ *
+ * Note: i_filelock_ref > 0 means we have locks to reclaim.
+ * The lock info will be encoded below if flock_len is non-zero.
+ */
+
/* These are lost when the session goes away */
if (S_ISDIR(inode->i_mode)) {
if (cap->issued & CEPH_CAP_DIR_CREATE) {
@@ -4727,8 +4769,22 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
rec.v2.issued = cpu_to_le32(cap->issued);
rec.v2.snaprealm = cpu_to_le64(ci->i_snap_realm->ino);
rec.v2.pathbase = cpu_to_le64(path_info.vino.ino);
- rec.v2.flock_len = (__force __le32)
- ((ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) ? 0 : 1);
+ if (recon_state->from_reset) {
+ /*
+ * Reset-initiated reconnect: always try to reclaim
+ * locks. The MDS does best-effort; failures are
+ * handled on future lock operations.
+ */
+ rec.v2.flock_len = cpu_to_le32(1);
+ } else {
+ /*
+ * Normal reconnect: skip lock reclaim if locks
+ * were already known to be in error state.
+ */
+ rec.v2.flock_len = cpu_to_le32(
+ test_bit(CEPH_I_ERROR_FILELOCK_BIT,
+ &ci->i_ceph_flags) ? 0 : 1);
+ }
} else {
struct timespec64 ts;
@@ -4925,8 +4981,9 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
*
* This is a relatively heavyweight operation, but it's rare.
*/
-static void send_mds_reconnect(struct ceph_mds_client *mdsc,
- struct ceph_mds_session *session)
+static int send_mds_reconnect(struct ceph_mds_client *mdsc,
+ struct ceph_mds_session *session,
+ bool from_reset)
{
struct ceph_client *cl = mdsc->fsc->client;
struct ceph_msg *reply;
@@ -4934,6 +4991,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
int err = -ENOMEM;
struct ceph_reconnect_state recon_state = {
.session = session,
+ .from_reset = from_reset,
};
LIST_HEAD(dispose);
@@ -4950,6 +5008,30 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
xa_destroy(&session->s_delegated_inos);
mutex_lock(&session->s_mutex);
+ if (session->s_state == CEPH_MDS_SESSION_CLOSED ||
+ session->s_state == CEPH_MDS_SESSION_REJECTED) {
+ pr_info_client(cl, "mds%d skipping reconnect, session %s\n",
+ mds,
+ ceph_session_state_name(session->s_state));
+ mutex_unlock(&session->s_mutex);
+ ceph_msg_put(reply);
+ err = -ESTALE;
+ goto fail_nomsg;
+ }
+
+ mutex_lock(&mdsc->mutex);
+ if (mds >= mdsc->max_sessions || mdsc->sessions[mds] != session) {
+ mutex_unlock(&mdsc->mutex);
+ pr_info_client(cl,
+ "mds%d skipping reconnect, session unregistered\n",
+ mds);
+ mutex_unlock(&session->s_mutex);
+ ceph_msg_put(reply);
+ err = -ENOENT;
+ goto fail_nomsg;
+ }
+ mutex_unlock(&mdsc->mutex);
+
session->s_state = CEPH_MDS_SESSION_RECONNECTING;
session->s_seq = 0;
@@ -5079,7 +5161,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
up_read(&mdsc->snap_rwsem);
ceph_pagelist_release(recon_state.pagelist);
- return;
+ return 0;
fail:
ceph_msg_put(reply);
@@ -5090,7 +5172,302 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
fail_nopagelist:
pr_err_client(cl, "error %d preparing reconnect for mds%d\n",
err, mds);
- return;
+ return err;
+}
+
+/*
+ * Called when a session completes reconnection (success or failure).
+ * Only counts toward reset completion if this session's reconnect was
+ * initiated by the currently active reset generation. Late completions
+ * from a prior (timed-out) reset are silently ignored.
+ *
+ * Caller must hold session->s_mutex to serialize s_reset_gen updates.
+ */
+static void ceph_mdsc_reconnect_session_done(struct ceph_mds_client *mdsc,
+ struct ceph_mds_session *session)
+{
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+ u64 active_gen;
+
+ lockdep_assert_held(&session->s_mutex);
+
+ if (!session->s_reset_gen)
+ return;
+
+ spin_lock(&st->lock);
+ active_gen = st->active_reset_gen;
+ spin_unlock(&st->lock);
+
+ if (session->s_reset_gen != active_gen) {
+ session->s_reset_gen = 0;
+ return;
+ }
+
+ session->s_reset_gen = 0;
+
+ if (atomic_dec_and_test(&st->pending_reconnects))
+ complete(&st->reconnect_done);
+}
+
+/*
+ * Wait for a reset to complete if one is in progress.
+ * Returns 0 if reset completed successfully or wasn't in progress.
+ * Returns -ETIMEDOUT if we timed out waiting.
+ * Returns -ERESTARTSYS if interrupted by signal.
+ */
+int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
+{
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+ struct ceph_client *cl = mdsc->fsc->client;
+ long timeout = CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC * HZ;
+ int blocked_count;
+ int ret;
+
+ if (!READ_ONCE(st->in_progress))
+ return 0;
+
+ blocked_count = atomic_inc_return(&st->blocked_requests);
+ doutc(cl, "request blocked during reset, %d total blocked\n",
+ blocked_count);
+ trace_ceph_client_reset_blocked(mdsc, blocked_count);
+
+ ret = wait_event_interruptible_timeout(st->blocked_wq,
+ !READ_ONCE(st->in_progress),
+ timeout);
+
+ atomic_dec(&st->blocked_requests);
+
+ if (ret == 0) {
+ pr_warn_client(cl, "timed out waiting for reset to complete\n");
+ trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
+ return -ETIMEDOUT;
+ }
+ if (ret < 0) {
+ trace_ceph_client_reset_unblocked(mdsc, ret);
+ return ret; /* -ERESTARTSYS */
+ }
+
+ /* Reset completed, check if it was successful */
+ spin_lock(&st->lock);
+ ret = st->last_errno;
+ spin_unlock(&st->lock);
+
+ trace_ceph_client_reset_unblocked(mdsc, ret);
+ return ret;
+}
+
+static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret)
+{
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+
+ spin_lock(&st->lock);
+ st->last_finish = jiffies;
+ st->last_errno = ret;
+ st->in_progress = false;
+ if (ret)
+ st->failure_count++;
+ else
+ st->success_count++;
+ spin_unlock(&st->lock);
+
+ /* Wake up all requests that were blocked waiting for reset */
+ wake_up_all(&st->blocked_wq);
+
+ trace_ceph_client_reset_complete(mdsc, ret);
+}
+
+static void ceph_mdsc_reset_workfn(struct work_struct *work)
+{
+ struct ceph_mds_client *mdsc =
+ container_of(work, struct ceph_mds_client, reset_work);
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+ struct ceph_client *cl = mdsc->fsc->client;
+ struct ceph_mds_session **sessions = NULL;
+ char reason[CEPH_CLIENT_RESET_REASON_LEN];
+ int max_sessions, i, n = 0;
+ bool inject_error;
+ u64 reset_gen;
+ int ret = 0;
+
+ spin_lock(&st->lock);
+ strscpy(reason, st->last_reason, sizeof(reason));
+ inject_error = st->inject_error;
+ if (inject_error)
+ st->inject_error = false;
+ reset_gen = st->active_reset_gen;
+ spin_unlock(&st->lock);
+
+ if (inject_error) {
+ ret = -EIO;
+ goto out_complete;
+ }
+
+ mutex_lock(&mdsc->mutex);
+ max_sessions = mdsc->max_sessions;
+ if (max_sessions <= 0) {
+ mutex_unlock(&mdsc->mutex);
+ goto out_complete;
+ }
+
+ sessions = kcalloc(max_sessions, sizeof(*sessions), GFP_NOFS);
+ if (!sessions) {
+ mutex_unlock(&mdsc->mutex);
+ ret = -ENOMEM;
+ pr_err_client(cl,
+ "manual session reset failed to allocate session array\n");
+ ceph_mdsc_reset_complete(mdsc, ret);
+ return;
+ }
+
+ for (i = 0; i < max_sessions; i++) {
+ struct ceph_mds_session *session = mdsc->sessions[i];
+
+ if (!session)
+ continue;
+
+ /*
+ * Read session state without s_mutex to avoid nesting
+ * mdsc->mutex -> s_mutex, which would invert the
+ * s_mutex -> mdsc->mutex order used by
+ * cleanup_session_requests(). s_state is an int
+ * so loads are atomic; send_mds_reconnect() will
+ * reject sessions that transitioned to CLOSED or
+ * REJECTED under s_mutex before proceeding.
+ */
+ switch (READ_ONCE(session->s_state)) {
+ case CEPH_MDS_SESSION_OPEN:
+ case CEPH_MDS_SESSION_HUNG:
+ case CEPH_MDS_SESSION_OPENING:
+ case CEPH_MDS_SESSION_CLOSING:
+ sessions[n++] = ceph_get_mds_session(session);
+ break;
+ case CEPH_MDS_SESSION_RECONNECTING:
+ pr_info_client(cl,
+ "mds%d already reconnecting, skipping\n",
+ session->s_mds);
+ break;
+ default:
+ pr_info_client(cl,
+ "mds%d in state %s, skipping reconnect\n",
+ session->s_mds,
+ ceph_session_state_name(session->s_state));
+ break;
+ }
+ }
+ mutex_unlock(&mdsc->mutex);
+
+ pr_info_client(cl,
+ "manual session reset executing (sessions=%d, reason=\"%s\")\n",
+ n, reason);
+
+ if (n == 0) {
+ kfree(sessions);
+ goto out_complete;
+ }
+
+ /* Initialize completion tracking */
+ reinit_completion(&st->reconnect_done);
+ atomic_set(&st->pending_reconnects, n);
+
+ for (i = 0; i < n; i++) {
+ int err;
+
+ if (!sessions[i]) {
+ if (atomic_dec_and_test(&st->pending_reconnects))
+ complete(&st->reconnect_done);
+ continue;
+ }
+
+ mutex_lock(&sessions[i]->s_mutex);
+ sessions[i]->s_reset_gen = reset_gen;
+ mutex_unlock(&sessions[i]->s_mutex);
+
+ err = send_mds_reconnect(mdsc, sessions[i], true);
+ if (err) {
+ bool skipped = (err == -ESTALE || err == -ENOENT);
+
+ mutex_lock(&sessions[i]->s_mutex);
+ sessions[i]->s_reset_gen = 0;
+ mutex_unlock(&sessions[i]->s_mutex);
+
+ if (skipped) {
+ pr_info_client(cl,
+ "mds%d reconnect skipped during reset: %d\n",
+ sessions[i]->s_mds, err);
+ } else {
+ pr_err_client(cl,
+ "mds%d reconnect failed: %d\n",
+ sessions[i]->s_mds, err);
+ if (!ret)
+ ret = err;
+ }
+ if (atomic_dec_and_test(&st->pending_reconnects))
+ complete(&st->reconnect_done);
+ }
+ ceph_put_mds_session(sessions[i]);
+ }
+
+ kfree(sessions);
+
+ /* Wait for all sessions to complete reconnection */
+ if (!wait_for_completion_timeout(&st->reconnect_done,
+ CEPH_CLIENT_RESET_TIMEOUT_SEC * HZ)) {
+ pr_warn_client(cl,
+ "reset timed out waiting for %d sessions\n",
+ atomic_read(&st->pending_reconnects));
+ if (!ret)
+ ret = -ETIMEDOUT;
+ }
+
+out_complete:
+ ceph_mdsc_reset_complete(mdsc, ret);
+}
+
+int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
+ const char *reason)
+{
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+ struct ceph_fs_client *fsc = mdsc->fsc;
+ const char *msg = (reason && reason[0]) ? reason : "manual";
+ int mount_state;
+
+ mount_state = READ_ONCE(fsc->mount_state);
+ if (mount_state != CEPH_MOUNT_MOUNTED) {
+ pr_warn_client(fsc->client,
+ "reset rejected: mount_state=%d (not mounted)\n",
+ mount_state);
+ return -EINVAL;
+ }
+
+ spin_lock(&st->lock);
+ if (st->in_progress) {
+ spin_unlock(&st->lock);
+ return -EBUSY;
+ }
+
+ st->in_progress = true;
+ st->active_reset_gen++;
+ st->last_start = jiffies;
+ st->last_errno = 0;
+ st->trigger_count++;
+ strscpy(st->last_reason, msg, sizeof(st->last_reason));
+ spin_unlock(&st->lock);
+
+ if (!queue_work(system_unbound_wq, &mdsc->reset_work)) {
+ spin_lock(&st->lock);
+ st->in_progress = false;
+ st->last_errno = -EALREADY;
+ st->failure_count++;
+ spin_unlock(&st->lock);
+ wake_up_all(&st->blocked_wq);
+ return -EALREADY;
+ }
+
+ pr_info_client(mdsc->fsc->client,
+ "manual session reset scheduled (reason=\"%s\")\n",
+ msg);
+ trace_ceph_client_reset_schedule(mdsc, msg);
+ return 0;
}
@@ -5171,9 +5548,15 @@ static void check_new_map(struct ceph_mds_client *mdsc,
*/
if (s->s_state == CEPH_MDS_SESSION_RESTARTING &&
newstate >= CEPH_MDS_STATE_RECONNECT) {
+ int rc;
+
mutex_unlock(&mdsc->mutex);
clear_bit(i, targets);
- send_mds_reconnect(mdsc, s);
+ rc = send_mds_reconnect(mdsc, s, false);
+ if (rc)
+ pr_err_client(cl,
+ "mds%d reconnect failed: %d\n",
+ i, rc);
mutex_lock(&mdsc->mutex);
}
@@ -5237,7 +5620,11 @@ static void check_new_map(struct ceph_mds_client *mdsc,
}
doutc(cl, "send reconnect to export target mds.%d\n", i);
mutex_unlock(&mdsc->mutex);
- send_mds_reconnect(mdsc, s);
+ err = send_mds_reconnect(mdsc, s, false);
+ if (err)
+ pr_err_client(cl,
+ "mds%d export target reconnect failed: %d\n",
+ i, err);
ceph_put_mds_session(s);
mutex_lock(&mdsc->mutex);
}
@@ -5622,6 +6009,23 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
INIT_LIST_HEAD(&mdsc->dentry_leases);
INIT_LIST_HEAD(&mdsc->dentry_dir_leases);
+ spin_lock_init(&mdsc->reset_state.lock);
+ mdsc->reset_state.trigger_count = 0;
+ mdsc->reset_state.success_count = 0;
+ mdsc->reset_state.failure_count = 0;
+ mdsc->reset_state.last_start = 0;
+ mdsc->reset_state.last_finish = 0;
+ mdsc->reset_state.last_errno = 0;
+ mdsc->reset_state.in_progress = false;
+ mdsc->reset_state.inject_error = false;
+ mdsc->reset_state.active_reset_gen = 0;
+ mdsc->reset_state.last_reason[0] = '\0';
+ atomic_set(&mdsc->reset_state.pending_reconnects, 0);
+ init_completion(&mdsc->reset_state.reconnect_done);
+ init_waitqueue_head(&mdsc->reset_state.blocked_wq);
+ atomic_set(&mdsc->reset_state.blocked_requests, 0);
+ INIT_WORK(&mdsc->reset_work, ceph_mdsc_reset_workfn);
+
ceph_caps_init(mdsc);
ceph_adjust_caps_max_min(mdsc, fsc->mount_options);
@@ -6147,6 +6551,22 @@ void ceph_mdsc_destroy(struct ceph_fs_client *fsc)
/* flush out any connection work with references to us */
ceph_msgr_flush();
+ /*
+ * Mark reset as failed and wake any blocked waiters before
+ * cancelling, so unmount doesn't stall on blocked_wq timeout
+ * if cancel_work_sync() prevents the work from running.
+ */
+ spin_lock(&mdsc->reset_state.lock);
+ if (mdsc->reset_state.in_progress) {
+ mdsc->reset_state.in_progress = false;
+ mdsc->reset_state.last_errno = -ESHUTDOWN;
+ mdsc->reset_state.failure_count++;
+ }
+ spin_unlock(&mdsc->reset_state.lock);
+ wake_up_all(&mdsc->reset_state.blocked_wq);
+
+ cancel_work_sync(&mdsc->reset_work);
+
ceph_mdsc_stop(mdsc);
ceph_metric_destroy(&mdsc->metric);
@@ -6334,6 +6754,8 @@ static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
mutex_lock(&mdsc->mutex);
if (__verify_registered_session(mdsc, s) < 0) {
+ pr_info_client(cl, "dropping tid %llu from unregistered session %d\n",
+ msg->hdr.tid, s->s_mds);
mutex_unlock(&mdsc->mutex);
goto out;
}
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 0428a5eaf28c..1e9b45fe29ee 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -75,6 +75,32 @@ struct ceph_cap;
#define MDS_AUTH_UID_ANY -1
+#define CEPH_CLIENT_RESET_REASON_LEN 64
+#define CEPH_CLIENT_RESET_TIMEOUT_SEC 60
+#define CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC 120
+
+struct ceph_client_reset_state {
+ spinlock_t lock;
+ u64 trigger_count;
+ u64 success_count;
+ u64 failure_count;
+ unsigned long last_start;
+ unsigned long last_finish;
+ int last_errno;
+ bool in_progress;
+ bool inject_error;
+ char last_reason[CEPH_CLIENT_RESET_REASON_LEN];
+
+ /* Completion tracking for session reconnects */
+ u64 active_reset_gen;
+ atomic_t pending_reconnects;
+ struct completion reconnect_done;
+
+ /* Request blocking during reset */
+ wait_queue_head_t blocked_wq;
+ atomic_t blocked_requests;
+};
+
struct ceph_mds_cap_match {
s64 uid; /* default to MDS_AUTH_UID_ANY */
u32 num_gids;
@@ -251,6 +277,8 @@ struct ceph_mds_session {
struct list_head s_waiting; /* waiting requests */
struct list_head s_unsafe; /* unsafe requests */
struct xarray s_delegated_inos;
+
+ u64 s_reset_gen; /* generation of reset that initiated reconnect */
};
/*
@@ -536,6 +564,8 @@ struct ceph_mds_client {
struct list_head dentry_dir_leases; /* lru list */
struct ceph_client_metric metric;
+ struct work_struct reset_work;
+ struct ceph_client_reset_state reset_state;
spinlock_t snapid_map_lock;
struct rb_root snapid_map_tree;
@@ -563,6 +593,9 @@ extern const char *ceph_session_state_name(int s);
extern struct ceph_mds_session *
ceph_get_mds_session(struct ceph_mds_session *s);
extern void ceph_put_mds_session(struct ceph_mds_session *s);
+int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
+ const char *reason);
+int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc);
extern int ceph_mdsc_init(struct ceph_fs_client *fsc);
extern void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 9e80c816aa7a..f1b3502b32e5 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -179,6 +179,10 @@ struct ceph_fs_client {
struct dentry *debugfs_status;
struct dentry *debugfs_mds_sessions;
struct dentry *debugfs_metrics_dir;
+ struct dentry *debugfs_reset_dir;
+ struct dentry *debugfs_reset_force;
+ struct dentry *debugfs_reset_status;
+ struct dentry *debugfs_reset_inject;
#endif
#ifdef CONFIG_CEPH_FSCACHE
diff --git a/include/trace/events/ceph.h b/include/trace/events/ceph.h
index 08cb0659fbfc..22b755f4c7c7 100644
--- a/include/trace/events/ceph.h
+++ b/include/trace/events/ceph.h
@@ -226,6 +226,66 @@ TRACE_EVENT(ceph_handle_caps,
__entry->mseq)
);
+/*
+ * Client reset tracepoints
+ */
+TRACE_EVENT(ceph_client_reset_schedule,
+ TP_PROTO(const struct ceph_mds_client *mdsc, const char *reason),
+ TP_ARGS(mdsc, reason),
+ TP_STRUCT__entry(
+ __field(const void *, mdsc)
+ __string(reason, reason ? reason : "")
+ ),
+ TP_fast_assign(
+ __entry->mdsc = mdsc;
+ __assign_str(reason);
+ ),
+ TP_printk("mdsc=%p reason=%s", __entry->mdsc, __get_str(reason))
+);
+
+TRACE_EVENT(ceph_client_reset_complete,
+ TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+ TP_ARGS(mdsc, ret),
+ TP_STRUCT__entry(
+ __field(const void *, mdsc)
+ __field(int, ret)
+ ),
+ TP_fast_assign(
+ __entry->mdsc = mdsc;
+ __entry->ret = ret;
+ ),
+ TP_printk("mdsc=%p ret=%d", __entry->mdsc, __entry->ret)
+);
+
+TRACE_EVENT(ceph_client_reset_blocked,
+ TP_PROTO(const struct ceph_mds_client *mdsc, int blocked_count),
+ TP_ARGS(mdsc, blocked_count),
+ TP_STRUCT__entry(
+ __field(const void *, mdsc)
+ __field(int, blocked_count)
+ ),
+ TP_fast_assign(
+ __entry->mdsc = mdsc;
+ __entry->blocked_count = blocked_count;
+ ),
+ TP_printk("mdsc=%p blocked_count=%d", __entry->mdsc,
+ __entry->blocked_count)
+);
+
+TRACE_EVENT(ceph_client_reset_unblocked,
+ TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+ TP_ARGS(mdsc, ret),
+ TP_STRUCT__entry(
+ __field(const void *, mdsc)
+ __field(int, ret)
+ ),
+ TP_fast_assign(
+ __entry->mdsc = mdsc;
+ __entry->ret = ret;
+ ),
+ TP_printk("mdsc=%p ret=%d", __entry->mdsc, __entry->ret)
+);
+
#undef EM
#undef E_
#endif /* _TRACE_CEPH_H */
--
2.34.1
* [RFC PATCH v1 4/4] ceph: rework mds_peer_reset() for robust session recovery
@ 2026-02-25 12:59 ` Alex Markuze
From: Alex Markuze @ 2026-02-25 12:59 UTC (permalink / raw)
To: ceph-devel; +Cc: idryomov, linux-fsdevel, amarkuze, vdubeyko
Rewrite mds_peer_reset() to properly handle all session states when
the MDS resets our connection. Previously it unconditionally called
send_mds_reconnect() for any state >= RECONNECT, which could leave
sessions in inconsistent states.
The new implementation:
- Returns early for FENCE_IO or MDS not yet in reconnect state
- For MDS in RECONNECT state, directly reconnects (no locks)
- For active sessions (OPEN/OPENING/CLOSING), performs full
cleanup: unregister session, wake waiting requests, clean up
caps, and kick pending requests
- For RECONNECTING sessions, restarts the reconnect
- Snapshots session state under s_mutex, then re-acquires locks
in the correct order (s_mutex -> mdsc->mutex) matching the
convention used by check_new_map() and cleanup_session_requests()
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/mds_client.c | 89 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 86 insertions(+), 3 deletions(-)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index d350bf502a15..d9baa0073bbd 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -6737,12 +6737,95 @@ static void mds_peer_reset(struct ceph_connection *con)
{
struct ceph_mds_session *s = con->private;
struct ceph_mds_client *mdsc = s->s_mdsc;
+ int session_state;
pr_warn_client(mdsc->fsc->client, "mds%d closed our session\n",
s->s_mds);
- if (READ_ONCE(mdsc->fsc->mount_state) != CEPH_MOUNT_FENCE_IO &&
- ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) >= CEPH_MDS_STATE_RECONNECT)
- send_mds_reconnect(mdsc, s);
+
+ if (READ_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_FENCE_IO ||
+ ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) < CEPH_MDS_STATE_RECONNECT)
+ return;
+
+ if (ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) == CEPH_MDS_STATE_RECONNECT) {
+ int rc = send_mds_reconnect(mdsc, s, false);
+ if (rc)
+ pr_err_client(mdsc->fsc->client,
+ "mds%d reconnect failed: %d\n",
+ s->s_mds, rc);
+ return;
+ }
+
+ /*
+ * Snapshot session state under s->s_mutex, then release before
+ * re-acquiring in the correct order: s->s_mutex -> mdsc->mutex
+ * (matching check_new_map() and cleanup_session_requests()).
+ */
+ mutex_lock(&s->s_mutex);
+ session_state = s->s_state;
+ mutex_unlock(&s->s_mutex);
+
+ switch (session_state) {
+ case CEPH_MDS_SESSION_RECONNECTING: {
+ int rc;
+
+ pr_info_client(mdsc->fsc->client,
+ "mds%d reset during reconnect, restarting\n",
+ s->s_mds);
+ rc = send_mds_reconnect(mdsc, s, false);
+ if (rc) {
+ pr_err_client(mdsc->fsc->client,
+ "mds%d reconnect restart failed: %d\n",
+ s->s_mds, rc);
+ mutex_lock(&s->s_mutex);
+ ceph_mdsc_reconnect_session_done(mdsc, s);
+ mutex_unlock(&s->s_mutex);
+ }
+ return;
+ }
+ case CEPH_MDS_SESSION_CLOSING:
+ case CEPH_MDS_SESSION_OPEN:
+ case CEPH_MDS_SESSION_HUNG:
+ case CEPH_MDS_SESSION_OPENING:
+ mutex_lock(&s->s_mutex);
+ mutex_lock(&mdsc->mutex);
+ if (s->s_state != session_state) {
+ pr_info_client(mdsc->fsc->client,
+ "mds%d state changed to %s during peer reset\n",
+ s->s_mds,
+ ceph_session_state_name(s->s_state));
+ mutex_unlock(&mdsc->mutex);
+ mutex_unlock(&s->s_mutex);
+ return;
+ }
+
+ ceph_get_mds_session(s);
+ __unregister_session(mdsc, s);
+
+ s->s_state = CEPH_MDS_SESSION_CLOSED;
+ __wake_requests(mdsc, &s->s_waiting);
+ mutex_unlock(&mdsc->mutex);
+ mutex_unlock(&s->s_mutex);
+
+ mutex_lock(&s->s_mutex);
+ cleanup_session_requests(mdsc, s);
+ remove_session_caps(s);
+ mutex_unlock(&s->s_mutex);
+
+ wake_up_all(&mdsc->session_close_wq);
+
+ mutex_lock(&mdsc->mutex);
+ kick_requests(mdsc, s->s_mds);
+ mutex_unlock(&mdsc->mutex);
+
+ ceph_put_mds_session(s);
+ break;
+ default:
+ pr_warn_client(mdsc->fsc->client,
+ "mds%d peer reset in unexpected state %s\n",
+ s->s_mds,
+ ceph_session_state_name(session_state));
+ break;
+ }
}
static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
--
2.34.1
* Re: [EXTERNAL] [RFC PATCH v1 0/4] ceph: manual client session reset via debugfs
@ 2026-02-25 20:51 ` Viacheslav Dubeyko
From: Viacheslav Dubeyko @ 2026-02-25 20:51 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: idryomov, linux-fsdevel
On Wed, 2026-02-25 at 12:59 +0000, Alex Markuze wrote:
> In production CephFS deployments we regularly encounter situations
> where MDS sessions become stuck or hung, requiring a full unmount/
> remount cycle to recover. This is disruptive for workloads that
> cannot tolerate the downtime or the loss of cached state that comes
> with unmounting.
>
> This series adds a mechanism to manually trigger MDS session
> reconnection from the client side without unmounting, exposed via
> debugfs:
>
> echo "reason text" > /sys/kernel/debug/ceph/<fsid>/reset/trigger
What do you mean by "reason text"? Usually it should be some command. Do you
mean that any string will trigger the reset? Is that a good approach? What if
a user sends something by mistake, or any malicious garbage? Potentially, it
could be a security breach.
> cat /sys/kernel/debug/ceph/<fsid>/reset/status
This "reset/status" sounds slightly ambiguous. Is it the status of the reset
process? If we didn't request a reset, then what does this status mean?
>
> The reset lifecycle:
> 1. Operator writes to the trigger file
> 2. A work item collects all active sessions and initiates
> reconnection on each (via send_mds_reconnect with from_reset=true)
> 3. New metadata requests and lock acquisitions are blocked until
> the reset completes (120s timeout)
> 4. Session completions are tracked via a per-session generation
> counter to handle stale completions from timed-out prior resets
> 5. Lock reclamation is always attempted during reset-initiated
> reconnects, regardless of prior CEPH_I_ERROR_FILELOCK state
>
> Patch breakdown:
> 1. Convert CEPH_I_* flags to named bit positions so test_bit/
> set_bit/clear_bit can be used in reconnect paths
> 2. Make wait_caps_flush() bounded with periodic diagnostic dumps
> to aid debugging hung flush scenarios (independent improvement
> that surfaced during development)
> 3. The core reset implementation: debugfs interface, reset work
> function, request/lock blocking, tracepoints, and session
> reconnect completion tracking
> 4. Rework mds_peer_reset() to properly handle all session states
> and integrate with the reset completion tracking
>
> Open questions / areas for review:
> - Is debugfs the right interface, or should this be a mount option
> / sysfs attribute / netlink command?
Recently, I had a short discussion with Greg Kroah-Hartman. He was completely
against using sysfs for transferring commands from user space to kernel
space. So, we could have trouble using sysfs for this. And
/sys/kernel/debug/ sounds like a completely wrong place, because this is not
a debug feature or option. You are suggesting this as a regular feature for
production.
Maybe we can consider eBPF here?
> - The request gating in ceph_mdsc_submit_request() is best-effort
> (no lock serialization) to avoid penalizing the normal path.
> Is this acceptable?
So, we have "MDS sessions become stuck or hung" and we are trying to submit
another request. Is that sane enough? We already have a dead session. :)
> - Should the 60s reconnect timeout and 120s blocked-request timeout
> be configurable (e.g. via mount options)?
I think we already have some timeout option in the mount options. Are you
suggesting yet another one?
> - mds_peer_reset() rework (patch 4) is substantial -- would
> reviewers prefer it split further?
Do you mean patch 3 or 4? Because patch 3 is huge.
Thanks,
Slava.
>
>
> Alex Markuze (4):
> ceph: convert inode flags to named bit positions
> ceph: add bounded timeout and diagnostics to wait_caps_flush()
> ceph: implement manual client session reset via debugfs
> ceph: rework mds_peer_reset() for robust session recovery
>
> fs/ceph/caps.c | 7 +
> fs/ceph/debugfs.c | 171 ++++++++++-
> fs/ceph/locks.c | 24 +-
> fs/ceph/mds_client.c | 577 ++++++++++++++++++++++++++++++++++--
> fs/ceph/mds_client.h | 33 +++
> fs/ceph/super.h | 60 ++--
> include/trace/events/ceph.h | 60 ++++
> 7 files changed, 879 insertions(+), 53 deletions(-)
* Re: [EXTERNAL] [RFC PATCH v1 1/4] ceph: convert inode flags to named bit positions
@ 2026-02-25 20:56 ` Viacheslav Dubeyko
From: Viacheslav Dubeyko @ 2026-02-25 20:56 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: idryomov, linux-fsdevel
On Wed, 2026-02-25 at 12:59 +0000, Alex Markuze wrote:
> Define all CEPH_I_* flags as named bit positions with derived
> bitmask values, making them usable with test_bit/set_bit/clear_bit.
> Previously only CEPH_I_ODIRECT_BIT and CEPH_ASYNC_CREATE_BIT had
> named bit positions; the rest were bare bitmask constants.
>
As I remember, I reworked all these constants to have a name for every bit.
Probably that patch was ignored and never sent upstream. And I converted
pretty much everything to use test_bit/set_bit/clear_bit. :) You did this
work again. :)
Thanks,
Slava.
> Convert CEPH_I_ERROR_WRITE and CEPH_I_ERROR_FILELOCK usage sites
> to use atomic bit operations (test_bit, set_bit, clear_bit) via
> the new _BIT constants.
>
> This is preparation for the client reset feature which needs
> test_bit() on CEPH_I_ERROR_FILELOCK_BIT in reconnect paths.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> fs/ceph/locks.c | 8 +++----
> fs/ceph/mds_client.c | 3 ++-
> fs/ceph/super.h | 54 +++++++++++++++++++++++++++-----------------
> 3 files changed, 38 insertions(+), 27 deletions(-)
>
> diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> index dd764f9c64b9..2f21574dfb99 100644
> --- a/fs/ceph/locks.c
> +++ b/fs/ceph/locks.c
> @@ -58,7 +58,7 @@ static void ceph_fl_release_lock(struct file_lock *fl)
> if (atomic_dec_and_test(&ci->i_filelock_ref)) {
> /* clear error when all locks are released */
> spin_lock(&ci->i_ceph_lock);
> - ci->i_ceph_flags &= ~CEPH_I_ERROR_FILELOCK;
> + clear_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
> spin_unlock(&ci->i_ceph_lock);
> }
> fl->fl_u.ceph.inode = NULL;
> @@ -272,9 +272,8 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
> wait = 1;
>
> spin_lock(&ci->i_ceph_lock);
> - if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
> + if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags))
> err = -EIO;
> - }
> spin_unlock(&ci->i_ceph_lock);
> if (err < 0) {
> if (op == CEPH_MDS_OP_SETFILELOCK && lock_is_unlock(fl))
> @@ -332,9 +331,8 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
> doutc(cl, "fl_file: %p\n", fl->c.flc_file);
>
> spin_lock(&ci->i_ceph_lock);
> - if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
> + if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags))
> err = -EIO;
> - }
> spin_unlock(&ci->i_ceph_lock);
> if (err < 0) {
> if (lock_is_unlock(fl))
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 23b6d00643c9..28bb27b09b40 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -3610,7 +3610,8 @@ static void __do_request(struct ceph_mds_client *mdsc,
>
> spin_lock(&ci->i_ceph_lock);
> cap = ci->i_auth_cap;
> - if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE && mds != cap->mds) {
> + if (test_bit(CEPH_ASYNC_CREATE_BIT, &ci->i_ceph_flags) &&
> + mds != cap->mds) {
> doutc(cl, "session changed for auth cap %d -> %d\n",
> cap->session->s_mds, session->s_mds);
>
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 29a980e22dc2..69a71848240f 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -655,23 +655,35 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
> /*
> * Ceph inode.
> */
> -#define CEPH_I_DIR_ORDERED (1 << 0) /* dentries in dir are ordered */
> -#define CEPH_I_FLUSH (1 << 2) /* do not delay flush of dirty metadata */
> -#define CEPH_I_POOL_PERM (1 << 3) /* pool rd/wr bits are valid */
> -#define CEPH_I_POOL_RD (1 << 4) /* can read from pool */
> -#define CEPH_I_POOL_WR (1 << 5) /* can write to pool */
> -#define CEPH_I_SEC_INITED (1 << 6) /* security initialized */
> -#define CEPH_I_KICK_FLUSH (1 << 7) /* kick flushing caps */
> -#define CEPH_I_FLUSH_SNAPS (1 << 8) /* need flush snapss */
> -#define CEPH_I_ERROR_WRITE (1 << 9) /* have seen write errors */
> -#define CEPH_I_ERROR_FILELOCK (1 << 10) /* have seen file lock errors */
> -#define CEPH_I_ODIRECT_BIT (11) /* inode in direct I/O mode */
> -#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
> -#define CEPH_ASYNC_CREATE_BIT (12) /* async create in flight for this */
> -#define CEPH_I_ASYNC_CREATE (1 << CEPH_ASYNC_CREATE_BIT)
> -#define CEPH_I_SHUTDOWN (1 << 13) /* inode is no longer usable */
> -#define CEPH_I_ASYNC_CHECK_CAPS (1 << 14) /* check caps immediately after async
> - creating finishes */
> +#define CEPH_I_DIR_ORDERED_BIT (0) /* dentries in dir are ordered */
> +#define CEPH_I_FLUSH_BIT (2) /* do not delay flush of dirty metadata */
> +#define CEPH_I_POOL_PERM_BIT (3) /* pool rd/wr bits are valid */
> +#define CEPH_I_POOL_RD_BIT (4) /* can read from pool */
> +#define CEPH_I_POOL_WR_BIT (5) /* can write to pool */
> +#define CEPH_I_SEC_INITED_BIT (6) /* security initialized */
> +#define CEPH_I_KICK_FLUSH_BIT (7) /* kick flushing caps */
> +#define CEPH_I_FLUSH_SNAPS_BIT (8) /* need flush snaps */
> +#define CEPH_I_ERROR_WRITE_BIT (9) /* have seen write errors */
> +#define CEPH_I_ERROR_FILELOCK_BIT (10) /* have seen file lock errors */
> +#define CEPH_I_ODIRECT_BIT (11) /* inode in direct I/O mode */
> +#define CEPH_ASYNC_CREATE_BIT (12) /* async create in flight for this */
> +#define CEPH_I_SHUTDOWN_BIT (13) /* inode is no longer usable */
> +#define CEPH_I_ASYNC_CHECK_CAPS_BIT (14) /* check caps after async creating finishes */
> +
> +#define CEPH_I_DIR_ORDERED (1 << CEPH_I_DIR_ORDERED_BIT)
> +#define CEPH_I_FLUSH (1 << CEPH_I_FLUSH_BIT)
> +#define CEPH_I_POOL_PERM (1 << CEPH_I_POOL_PERM_BIT)
> +#define CEPH_I_POOL_RD (1 << CEPH_I_POOL_RD_BIT)
> +#define CEPH_I_POOL_WR (1 << CEPH_I_POOL_WR_BIT)
> +#define CEPH_I_SEC_INITED (1 << CEPH_I_SEC_INITED_BIT)
> +#define CEPH_I_KICK_FLUSH (1 << CEPH_I_KICK_FLUSH_BIT)
> +#define CEPH_I_FLUSH_SNAPS (1 << CEPH_I_FLUSH_SNAPS_BIT)
> +#define CEPH_I_ERROR_WRITE (1 << CEPH_I_ERROR_WRITE_BIT)
> +#define CEPH_I_ERROR_FILELOCK (1 << CEPH_I_ERROR_FILELOCK_BIT)
> +#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
> +#define CEPH_I_ASYNC_CREATE (1 << CEPH_ASYNC_CREATE_BIT)
> +#define CEPH_I_SHUTDOWN (1 << CEPH_I_SHUTDOWN_BIT)
> +#define CEPH_I_ASYNC_CHECK_CAPS (1 << CEPH_I_ASYNC_CHECK_CAPS_BIT)
>
> /*
> * Masks of ceph inode work.
> @@ -691,18 +703,18 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
> */
> static inline void ceph_set_error_write(struct ceph_inode_info *ci)
> {
> - if (!(READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE)) {
> + if (!test_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags)) {
> spin_lock(&ci->i_ceph_lock);
> - ci->i_ceph_flags |= CEPH_I_ERROR_WRITE;
> + set_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
> spin_unlock(&ci->i_ceph_lock);
> }
> }
>
> static inline void ceph_clear_error_write(struct ceph_inode_info *ci)
> {
> - if (READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE) {
> + if (test_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags)) {
> spin_lock(&ci->i_ceph_lock);
> - ci->i_ceph_flags &= ~CEPH_I_ERROR_WRITE;
> + clear_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
> spin_unlock(&ci->i_ceph_lock);
> }
> }
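For readers outside the kernel tree, the bit-position/mask pattern in the patch above can be sketched in plain userspace C. The `*_ul` helpers below are illustrative stand-ins for the kernel's `set_bit`/`test_bit`/`clear_bit` (which are atomic; these are not), and only two of the flags are reproduced:

```c
#include <assert.h>

/* Named bit positions, as introduced by the patch. */
#define CEPH_I_ERROR_WRITE_BIT     (9)
#define CEPH_I_ERROR_FILELOCK_BIT  (10)

/* Masks derived from the bit positions, so mask and bit can never drift. */
#define CEPH_I_ERROR_WRITE     (1 << CEPH_I_ERROR_WRITE_BIT)
#define CEPH_I_ERROR_FILELOCK  (1 << CEPH_I_ERROR_FILELOCK_BIT)

/* Userspace stand-ins for the kernel bitops (illustrative, not atomic). */
static inline int test_bit_ul(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

static inline void set_bit_ul(int nr, unsigned long *addr)
{
	*addr |= 1UL << nr;
}

static inline void clear_bit_ul(int nr, unsigned long *addr)
{
	*addr &= ~(1UL << nr);
}
```

Deriving the masks from the `_BIT` constants is the point of the conversion: mask-based tests under `i_ceph_lock` and `test_bit()`-style lockless reads are guaranteed to refer to the same bit.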
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [EXTERNAL] [RFC PATCH v1 1/4] ceph: convert inode flags to named bit positions
2026-02-25 20:56 ` [EXTERNAL] " Viacheslav Dubeyko
@ 2026-02-26 13:56 ` Alex Markuze
0 siblings, 0 replies; 10+ messages in thread
From: Alex Markuze @ 2026-02-26 13:56 UTC (permalink / raw)
To: Viacheslav Dubeyko; +Cc: ceph-devel, idryomov, linux-fsdevel
I think I remember reviewing it, now that you mention it.
Let's discuss offline; I'll set up a 1:1.
On Wed, Feb 25, 2026 at 10:56 PM Viacheslav Dubeyko <vdubeyko@redhat.com> wrote:
>
> On Wed, 2026-02-25 at 12:59 +0000, Alex Markuze wrote:
> > Define all CEPH_I_* flags as named bit positions with derived
> > bitmask values, making them usable with test_bit/set_bit/clear_bit.
> > Previously only CEPH_I_ODIRECT_BIT and CEPH_ASYNC_CREATE_BIT had
> > named bit positions; the rest were bare bitmask constants.
> >
>
> As I remember, I reworked all of these constants to have a name for every bit.
> That patch was probably overlooked and never sent upstream. And I
> converted pretty much everything to use test_bit/set_bit/clear_bit. :) You did
> this work again. :)
>
> Thanks,
> Slava.
>
> > Convert CEPH_I_ERROR_WRITE and CEPH_I_ERROR_FILELOCK usage sites
> > to use atomic bit operations (test_bit, set_bit, clear_bit) via
> > the new _BIT constants.
> >
> > This is preparation for the client reset feature which needs
> > test_bit() on CEPH_I_ERROR_FILELOCK_BIT in reconnect paths.
> >
> > Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> > ---
> > fs/ceph/locks.c | 8 +++----
> > fs/ceph/mds_client.c | 3 ++-
> > fs/ceph/super.h | 54 +++++++++++++++++++++++++++-----------------
> > 3 files changed, 38 insertions(+), 27 deletions(-)
> >
> > diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> > index dd764f9c64b9..2f21574dfb99 100644
> > --- a/fs/ceph/locks.c
> > +++ b/fs/ceph/locks.c
> > @@ -58,7 +58,7 @@ static void ceph_fl_release_lock(struct file_lock *fl)
> > if (atomic_dec_and_test(&ci->i_filelock_ref)) {
> > /* clear error when all locks are released */
> > spin_lock(&ci->i_ceph_lock);
> > - ci->i_ceph_flags &= ~CEPH_I_ERROR_FILELOCK;
> > + clear_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
> > spin_unlock(&ci->i_ceph_lock);
> > }
> > fl->fl_u.ceph.inode = NULL;
> > @@ -272,9 +272,8 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
> > wait = 1;
> >
> > spin_lock(&ci->i_ceph_lock);
> > - if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
> > + if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags))
> > err = -EIO;
> > - }
> > spin_unlock(&ci->i_ceph_lock);
> > if (err < 0) {
> > if (op == CEPH_MDS_OP_SETFILELOCK && lock_is_unlock(fl))
> > @@ -332,9 +331,8 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
> > doutc(cl, "fl_file: %p\n", fl->c.flc_file);
> >
> > spin_lock(&ci->i_ceph_lock);
> > - if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
> > + if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags))
> > err = -EIO;
> > - }
> > spin_unlock(&ci->i_ceph_lock);
> > if (err < 0) {
> > if (lock_is_unlock(fl))
> > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> > index 23b6d00643c9..28bb27b09b40 100644
> > --- a/fs/ceph/mds_client.c
> > +++ b/fs/ceph/mds_client.c
> > @@ -3610,7 +3610,8 @@ static void __do_request(struct ceph_mds_client *mdsc,
> >
> > spin_lock(&ci->i_ceph_lock);
> > cap = ci->i_auth_cap;
> > - if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE && mds != cap->mds) {
> > + if (test_bit(CEPH_ASYNC_CREATE_BIT, &ci->i_ceph_flags) &&
> > + mds != cap->mds) {
> > doutc(cl, "session changed for auth cap %d -> %d\n",
> > cap->session->s_mds, session->s_mds);
> >
> > diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> > index 29a980e22dc2..69a71848240f 100644
> > --- a/fs/ceph/super.h
> > +++ b/fs/ceph/super.h
> > @@ -655,23 +655,35 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
> > /*
> > * Ceph inode.
> > */
> > -#define CEPH_I_DIR_ORDERED (1 << 0) /* dentries in dir are ordered */
> > -#define CEPH_I_FLUSH (1 << 2) /* do not delay flush of dirty metadata */
> > -#define CEPH_I_POOL_PERM (1 << 3) /* pool rd/wr bits are valid */
> > -#define CEPH_I_POOL_RD (1 << 4) /* can read from pool */
> > -#define CEPH_I_POOL_WR (1 << 5) /* can write to pool */
> > -#define CEPH_I_SEC_INITED (1 << 6) /* security initialized */
> > -#define CEPH_I_KICK_FLUSH (1 << 7) /* kick flushing caps */
> > -#define CEPH_I_FLUSH_SNAPS (1 << 8) /* need flush snapss */
> > -#define CEPH_I_ERROR_WRITE (1 << 9) /* have seen write errors */
> > -#define CEPH_I_ERROR_FILELOCK (1 << 10) /* have seen file lock errors */
> > -#define CEPH_I_ODIRECT_BIT (11) /* inode in direct I/O mode */
> > -#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
> > -#define CEPH_ASYNC_CREATE_BIT (12) /* async create in flight for this */
> > -#define CEPH_I_ASYNC_CREATE (1 << CEPH_ASYNC_CREATE_BIT)
> > -#define CEPH_I_SHUTDOWN (1 << 13) /* inode is no longer usable */
> > -#define CEPH_I_ASYNC_CHECK_CAPS (1 << 14) /* check caps immediately after async
> > - creating finishes */
> > +#define CEPH_I_DIR_ORDERED_BIT (0) /* dentries in dir are ordered */
> > +#define CEPH_I_FLUSH_BIT (2) /* do not delay flush of dirty metadata */
> > +#define CEPH_I_POOL_PERM_BIT (3) /* pool rd/wr bits are valid */
> > +#define CEPH_I_POOL_RD_BIT (4) /* can read from pool */
> > +#define CEPH_I_POOL_WR_BIT (5) /* can write to pool */
> > +#define CEPH_I_SEC_INITED_BIT (6) /* security initialized */
> > +#define CEPH_I_KICK_FLUSH_BIT (7) /* kick flushing caps */
> > +#define CEPH_I_FLUSH_SNAPS_BIT (8) /* need flush snaps */
> > +#define CEPH_I_ERROR_WRITE_BIT (9) /* have seen write errors */
> > +#define CEPH_I_ERROR_FILELOCK_BIT (10) /* have seen file lock errors */
> > +#define CEPH_I_ODIRECT_BIT (11) /* inode in direct I/O mode */
> > +#define CEPH_ASYNC_CREATE_BIT (12) /* async create in flight for this */
> > +#define CEPH_I_SHUTDOWN_BIT (13) /* inode is no longer usable */
> > +#define CEPH_I_ASYNC_CHECK_CAPS_BIT (14) /* check caps after async creating finishes */
> > +
> > +#define CEPH_I_DIR_ORDERED (1 << CEPH_I_DIR_ORDERED_BIT)
> > +#define CEPH_I_FLUSH (1 << CEPH_I_FLUSH_BIT)
> > +#define CEPH_I_POOL_PERM (1 << CEPH_I_POOL_PERM_BIT)
> > +#define CEPH_I_POOL_RD (1 << CEPH_I_POOL_RD_BIT)
> > +#define CEPH_I_POOL_WR (1 << CEPH_I_POOL_WR_BIT)
> > +#define CEPH_I_SEC_INITED (1 << CEPH_I_SEC_INITED_BIT)
> > +#define CEPH_I_KICK_FLUSH (1 << CEPH_I_KICK_FLUSH_BIT)
> > +#define CEPH_I_FLUSH_SNAPS (1 << CEPH_I_FLUSH_SNAPS_BIT)
> > +#define CEPH_I_ERROR_WRITE (1 << CEPH_I_ERROR_WRITE_BIT)
> > +#define CEPH_I_ERROR_FILELOCK (1 << CEPH_I_ERROR_FILELOCK_BIT)
> > +#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
> > +#define CEPH_I_ASYNC_CREATE (1 << CEPH_ASYNC_CREATE_BIT)
> > +#define CEPH_I_SHUTDOWN (1 << CEPH_I_SHUTDOWN_BIT)
> > +#define CEPH_I_ASYNC_CHECK_CAPS (1 << CEPH_I_ASYNC_CHECK_CAPS_BIT)
> >
> > /*
> > * Masks of ceph inode work.
> > @@ -691,18 +703,18 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
> > */
> > static inline void ceph_set_error_write(struct ceph_inode_info *ci)
> > {
> > - if (!(READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE)) {
> > + if (!test_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags)) {
> > spin_lock(&ci->i_ceph_lock);
> > - ci->i_ceph_flags |= CEPH_I_ERROR_WRITE;
> > + set_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
> > spin_unlock(&ci->i_ceph_lock);
> > }
> > }
> >
> > static inline void ceph_clear_error_write(struct ceph_inode_info *ci)
> > {
> > - if (READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE) {
> > + if (test_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags)) {
> > spin_lock(&ci->i_ceph_lock);
> > - ci->i_ceph_flags &= ~CEPH_I_ERROR_WRITE;
> > + clear_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
> > spin_unlock(&ci->i_ceph_lock);
> > }
> > }
>
* Re: [EXTERNAL] [RFC PATCH v1 0/4] ceph: manual client session reset via debugfs
2026-02-25 20:51 ` [EXTERNAL] [RFC PATCH v1 0/4] ceph: manual client session reset via debugfs Viacheslav Dubeyko
@ 2026-02-26 14:54 ` Alex Markuze
2026-03-03 22:37 ` Viacheslav Dubeyko
0 siblings, 1 reply; 10+ messages in thread
From: Alex Markuze @ 2026-02-26 14:54 UTC (permalink / raw)
To: Viacheslav Dubeyko; +Cc: ceph-devel, idryomov, linux-fsdevel
> What do you mean by "reason text"? Usually, it should be some command. Do you
> mean that any string will trigger the reset? Is it a good approach? What if a user
> sends something by mistake or any malicious garbage? Potentially, it could be a
> security breach.
The string is an optional free-form operator note that gets logged and
recorded in the status output for post-mortem diagnostics (e.g.
"RHBZ#12345" or "mds0 stuck after failover"). Any write to the
trigger file initiates the reset -- the content is not interpreted as
a command.
The input length is capped at `CEPH_CLIENT_RESET_REASON_LEN - 1`
(63 bytes) via `min_t()` before `copy_from_user()`, then
null-terminated and `strim()`'d. Anything longer is silently
truncated -- no overflow is possible.
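A userspace sketch of that sanitization path (function name is mine; `CEPH_CLIENT_RESET_REASON_LEN` is the series' constant, and the whitespace trimming reimplements what the kernel side gets from `strim()`):

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define CEPH_CLIENT_RESET_REASON_LEN 64  /* buffer size, including the NUL */

/*
 * Sketch of the trigger-write sanitization: cap the copy at
 * REASON_LEN - 1 (the min_t() step), NUL-terminate, then trim leading
 * and trailing whitespace.  The kernel side does the copy with
 * copy_from_user() and the trim with strim().
 */
static void reset_reason_sanitize(const char *user_buf, size_t user_len,
				  char out[CEPH_CLIENT_RESET_REASON_LEN])
{
	size_t n = user_len < CEPH_CLIENT_RESET_REASON_LEN - 1 ?
		   user_len : CEPH_CLIENT_RESET_REASON_LEN - 1;
	size_t start = 0, end;

	memcpy(out, user_buf, n);
	out[n] = '\0';

	/* drop trailing, then leading, whitespace */
	end = strlen(out);
	while (end > 0 && isspace((unsigned char)out[end - 1]))
		out[--end] = '\0';
	while (out[start] && isspace((unsigned char)out[start]))
		start++;
	if (start)
		memmove(out, out + start, end - start + 1);
}
```

Because the copy length is clamped before it happens, an over-long write degrades to truncation rather than overflow, which is the property claimed above.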
Regarding security: the trigger file is created with mode 0200
(write-only, owner-only), and debugfs itself is typically mounted
root-only. So the write is restricted to a privileged operator, same
as any other debugfs knob. That said, I agree the interface question
is open -- see below.
> This "reset/status" sounds slightly ambiguous. Is it the status of the reset process? If
> we didn't request the reset, then what does this status mean?
It is the status of the reset subsystem. When no reset has been
requested, it shows the idle state:
```
in_progress: no
trigger_count: 0
success_count: 0
failure_count: 0
last_start_ms_ago: (never)
last_finish_ms_ago: (never)
last_errno: 0
last_reason: (none)
inject_error_pending: no
pending_reconnects: 0
blocked_requests: 0
```
Happy to rename it to something less ambiguous, maybe
`reset/reset_status`, or just move the counters into the existing
top-level `status` file. Open to suggestions.
> Recently, I had a short discussion with Greg Kroah-Hartman. He was completely
> against using sysfs for transferring commands from user space to kernel
> space. So, we could have trouble using sysfs for this. And
> /sys/kernel/debug/ sounds like a completely wrong place, because this is not a
> debug feature or option. You are suggesting this as a regular feature for
> production.
>
> Maybe, we can consider eBPF here?
Yeah, debugfs is not the right place for a production feature.
The debugfs interface is there for now as a development/testing
trigger while the core reset mechanism is reviewed. The longer-term
goal is automated recovery: the client would detect stuck/hung
sessions and trigger the reset internally without operator
intervention. Once that's in place, the manual trigger becomes a
fallback rather than the primary interface.
For production manual triggering (before automated recovery is
ready), we can consider other approaches, e.g., ioctl, netlink, or
extending the existing `recover_session=clean` path. That path
already handles automatic recovery from blocklisting via
`maybe_recover_session()` / `ceph_force_reconnect()`; the manual
reset solves a related but different problem (sessions stuck without
blocklisting), so the machinery could potentially be unified.
> So, we have "MDS sessions become stuck or hung" and we are trying to submit
> another request. Is it sane enough? We already have a dead session. :)
The gating does not submit requests to the dead session, it does
the opposite. When a reset is in progress,
`ceph_mdsc_wait_for_reset()` blocks new incoming requests (and new
lock acquisitions) until the reset completes and sessions are
re-established. Only then do the blocked requests proceed through
the normal `__do_request()` path, which will find the freshly
reconnected sessions.
The "best-effort" note refers to the fact that we don't take a lock
on the hot path; a request could slip past the check in the narrow
window before `in_progress` is set. That's acceptable because such
a request would either complete normally on the still-alive session
or fail and be retried by the caller after the reset.
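The lockless hot-path check can be modeled in userspace C11. The struct and field names here are my own minimal model, not the series' actual `ceph_client_reset_state` layout:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal model of the reset state (field names are assumptions). */
struct reset_state {
	atomic_bool in_progress;     /* set by the reset work item */
	atomic_int  blocked_requests;
};

/*
 * Hot-path check, no lock taken: a request racing with the flag being
 * set may slip through, which is acceptable because it either completes
 * on the still-alive session or fails and is retried after the reset.
 */
static bool request_must_wait(struct reset_state *rs)
{
	return atomic_load_explicit(&rs->in_progress, memory_order_acquire);
}

static void reset_begin(struct reset_state *rs)
{
	atomic_store_explicit(&rs->in_progress, true, memory_order_release);
}

static void reset_finish(struct reset_state *rs)
{
	atomic_store_explicit(&rs->in_progress, false, memory_order_release);
	/* the real code would also wake blocked waiters here */
}
```

The design choice is plain: one relaxed-cost atomic load on every request submission, versus a lock acquisition that would penalize the normal path for an operation that is expected to be rare.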
> I think we already have some timeout option in the mount options. Are you
> suggesting yet another one?
Fair point. The existing `mount_timeout` controls how long mount and
reconnect operations wait. The two timeouts here are:
- 60s for the reset work to wait for session reconnects to complete
- 120s for blocked requests to wait for the reset to finish
I think these can stay as compile-time constants
(`CEPH_CLIENT_RESET_TIMEOUT_SEC` / `CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC`)
rather than becoming mount options. They're operational bounds for a
rare manual operation, not something an admin would typically need to
tune. If reviewers feel otherwise I can tie them to `mount_timeout`.
> Do you mean patch 3 or 4? Because patch 3 is huge in size.
You're right, patch 3 is the large one (~700 lines). I can split it
further if that helps review. A natural split would be:
- 3a: `ceph_client_reset_state` struct, init/destroy, and
`ceph_mdsc_wait_for_reset()` blocking infrastructure
- 3b: `ceph_mdsc_reset_workfn()`, `send_mds_reconnect()` changes,
and session completion tracking
- 3c: debugfs interface (trigger, status, inject_error)
- 3d: tracepoints
Let me know if that granularity makes sense or if you'd prefer a
different split.
Thanks for the review, Slava.
On Wed, Feb 25, 2026 at 10:51 PM Viacheslav Dubeyko <vdubeyko@redhat.com> wrote:
>
> On Wed, 2026-02-25 at 12:59 +0000, Alex Markuze wrote:
> > In production CephFS deployments we regularly encounter situations
> > where MDS sessions become stuck or hung, requiring a full unmount/
> > remount cycle to recover. This is disruptive for workloads that
> > cannot tolerate the downtime or the loss of cached state that comes
> > with unmounting.
> >
> > This series adds a mechanism to manually trigger MDS session
> > reconnection from the client side without unmounting, exposed via
> > debugfs:
> >
> > echo "reason text" > /sys/kernel/debug/ceph/<fsid>/reset/trigger
>
> What do you mean by "reason text"? Usually, it should be some command. Do you
> mean that any string will trigger the reset? Is it a good approach? What if a user
> sends something by mistake or any malicious garbage? Potentially, it could be a
> security breach.
>
> > cat /sys/kernel/debug/ceph/<fsid>/reset/status
>
> This "reset/status" sounds slightly ambiguous. Is it the status of the reset process? If
> we didn't request the reset, then what does this status mean?
>
> >
> > The reset lifecycle:
> > 1. Operator writes to the trigger file
> > 2. A work item collects all active sessions and initiates
> > reconnection on each (via send_mds_reconnect with from_reset=true)
> > 3. New metadata requests and lock acquisitions are blocked until
> > the reset completes (120s timeout)
> > 4. Session completions are tracked via a per-session generation
> > counter to handle stale completions from timed-out prior resets
> > 5. Lock reclamation is always attempted during reset-initiated
> > reconnects, regardless of prior CEPH_I_ERROR_FILELOCK state
> >
> > Patch breakdown:
> > 1. Convert CEPH_I_* flags to named bit positions so test_bit/
> > set_bit/clear_bit can be used in reconnect paths
> > 2. Make wait_caps_flush() bounded with periodic diagnostic dumps
> > to aid debugging hung flush scenarios (independent improvement
> > that surfaced during development)
> > 3. The core reset implementation: debugfs interface, reset work
> > function, request/lock blocking, tracepoints, and session
> > reconnect completion tracking
> > 4. Rework mds_peer_reset() to properly handle all session states
> > and integrate with the reset completion tracking
> >
> > Open questions / areas for review:
> > - Is debugfs the right interface, or should this be a mount option
> > / sysfs attribute / netlink command?
>
> Recently, I had a short discussion with Greg Kroah-Hartman. He was completely
> against using sysfs for transferring commands from user space to kernel
> space. So, we could have trouble using sysfs for this. And
> /sys/kernel/debug/ sounds like a completely wrong place, because this is not a
> debug feature or option. You are suggesting this as a regular feature for
> production.
>
> Maybe, we can consider eBPF here?
>
> > - The request gating in ceph_mdsc_submit_request() is best-effort
> > (no lock serialization) to avoid penalizing the normal path.
> > Is this acceptable?
>
> So, we have "MDS sessions become stuck or hung" and we are trying to submit
> another request. Is it sane enough? We already have a dead session. :)
>
> > - Should the 60s reconnect timeout and 120s blocked-request timeout
> > be configurable (e.g. via mount options)?
>
> I think we already have some timeout option in the mount options. Are you
> suggesting yet another one?
>
> > - mds_peer_reset() rework (patch 4) is substantial -- would
> > reviewers prefer it split further?
>
> Do you mean patch 3 or 4? Because, patch 3 is huge in size.
>
> Thanks,
> Slava.
>
> >
> >
> > Alex Markuze (4):
> > ceph: convert inode flags to named bit positions
> > ceph: add bounded timeout and diagnostics to wait_caps_flush()
> > ceph: implement manual client session reset via debugfs
> > ceph: rework mds_peer_reset() for robust session recovery
> >
> > fs/ceph/caps.c | 7 +
> > fs/ceph/debugfs.c | 171 ++++++++++-
> > fs/ceph/locks.c | 24 +-
> > fs/ceph/mds_client.c | 577 ++++++++++++++++++++++++++++++++++--
> > fs/ceph/mds_client.h | 33 +++
> > fs/ceph/super.h | 60 ++--
> > include/trace/events/ceph.h | 60 ++++
> > 7 files changed, 879 insertions(+), 53 deletions(-)
>
>
* Re: [EXTERNAL] [RFC PATCH v1 0/4] ceph: manual client session reset via debugfs
2026-02-26 14:54 ` Alex Markuze
@ 2026-03-03 22:37 ` Viacheslav Dubeyko
0 siblings, 0 replies; 10+ messages in thread
From: Viacheslav Dubeyko @ 2026-03-03 22:37 UTC (permalink / raw)
To: Alex Markuze; +Cc: ceph-devel, idryomov, linux-fsdevel
On Thu, 2026-02-26 at 16:54 +0200, Alex Markuze wrote:
> > What do you mean by "reason text"? Usually, it should be some command. Do you
> > mean that any string will trigger the reset? Is it a good approach? What if a user
> > sends something by mistake or any malicious garbage? Potentially, it could be a
> > security breach.
>
> The string is an optional free-form operator note that gets logged and
> recorded in the status output for post-mortem diagnostics (e.g.
> "RHBZ#12345" or "mds0 stuck after failover"). Any write to the
> trigger file initiates the reset -- the content is not interpreted as
> a command.
First of all, people are lazy and nobody likes to compose a string, so you will
have completely useless strings at the end of the day ("1234", "dgdgdffgd",
or something like this). The string could also be a really confusing point, because
everybody will expect some clear command that triggers the reset. I expect that
end users will be confused by such a "free-form operator" note.
>
> The input length is capped at `CEPH_CLIENT_RESET_REASON_LEN - 1`
> (63 bytes) via `min_t()` before `copy_from_user()`, then
> null-terminated and `strim()`'d. Anything longer is silently
> truncated -- no overflow is possible.
>
> Regarding security: the trigger file is created with mode 0200
> (write-only, owner-only), and debugfs itself is typically mounted
> root-only. So the write is restricted to a privileged operator, same
> as any other debugfs knob. That said, I agree the interface question
> is open -- see below.
>
> > This "reset/status" sounds slightly ambiguous. Is it the status of the reset process? If
> > we didn't request the reset, then what does this status mean?
>
> It is the status of the reset subsystem. When no reset has been
> requested, it shows the idle state:
>
> ```
> in_progress: no
> trigger_count: 0
> success_count: 0
> failure_count: 0
> last_start_ms_ago: (never)
> last_finish_ms_ago: (never)
> last_errno: 0
> last_reason: (none)
> inject_error_pending: no
> pending_reconnects: 0
> blocked_requests: 0
> ```
>
> Happy to rename it to something less ambiguous, maybe
> `reset/reset_status` or just moving the counters into the existing
> top-level `status` file. Open to suggestions.
Frankly speaking, I expected to see something like:
/sys/fs/ceph/<info>
And if we are talking about MDS sessions, then maybe we need to have an MDS node
and some status there:
/sys/fs/ceph/<fsid>/mds/status
>
> > Recently, I had a short discussion with Greg Kroah-Hartman. He was completely
> > against using sysfs for transferring commands from user space to kernel
> > space. So, we could have trouble using sysfs for this. And
> > /sys/kernel/debug/ sounds like a completely wrong place, because this is not a
> > debug feature or option. You are suggesting this as a regular feature for
> > production.
> >
> > Maybe, we can consider eBPF here?
>
> Yeah, debugfs is not the right place for a production feature.
>
> The debugfs interface is there for now as a development/testing
> trigger while the core reset mechanism is reviewed. The longer-term
> goal is automated recovery: the client would detect stuck/hung
> sessions and trigger the reset internally without operator
> intervention. Once that's in place, the manual trigger becomes a
> fallback rather than the primary interface.
>
> For production manual triggering (before automated recovery is
> ready), we can consider other approaches, e.g., ioctl, netlink, or
> extending the existing `recover_session=clean` path. That path
> already handles automatic recovery from blocklisting via
> `maybe_recover_session()` / `ceph_force_reconnect()`; the manual
> reset solves a related but different problem (sessions stuck without
> blocklisting), so the machinery could potentially be unified.
An ioctl sounds like a more reasonable way. However, this /sys/../<trigger_reset>
file sounds like an easy way to initiate the reset from the command line.
>
> > So, we have "MDS sessions become stuck or hung" and we are trying to submit
> > another request. Is it sane enough? We already have a dead session. :)
>
> The gating does not submit requests to the dead session, it does
> the opposite. When a reset is in progress,
> `ceph_mdsc_wait_for_reset()` blocks new incoming requests (and new
> lock acquisitions) until the reset completes and sessions are
> re-established. Only then do the blocked requests proceed through
> the normal `__do_request()` path, which will find the freshly
> reconnected sessions.
>
> The "best-effort" note refers to the fact that we don't take a lock
> on the hot path, a request could slip past the check in the narrow
> window before `in_progress` is set. That's acceptable because such
> a request would either complete normally on the still-alive session
> or fail and be retried by the caller after the reset.
>
> > I think we already have some timeout option in the mount options. Are you
> > suggesting yet another one?
>
> Fair point. The existing `mount_timeout` controls how long mount and
> reconnect operations wait. The two timeouts here are:
>
> - 60s for the reset work to wait for session reconnects to complete
> - 120s for blocked requests to wait for the reset to finish
>
> I think these can stay as compile-time constants
> (`CEPH_CLIENT_RESET_TIMEOUT_SEC` / `CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC`)
> rather than becoming mount options. They're operational bounds for a
> rare manual operation, not something an admin would typically need to
> tune. If reviewers feel otherwise I can tie them to `mount_timeout`.
Potentially, we could have both options available.
>
> > Do you mean patch 3 or 4? Because patch 3 is huge in size.
>
> You're right, patch 3 is the large one (~700 lines). I can split it
> further if that helps review. A natural split would be:
>
> - 3a: `ceph_client_reset_state` struct, init/destroy, and
> `ceph_mdsc_wait_for_reset()` blocking infrastructure
> - 3b: `ceph_mdsc_reset_workfn()`, `send_mds_reconnect()` changes,
> and session completion tracking
> - 3c: debugfs interface (trigger, status, inject_error)
> - 3d: tracepoints
>
> Let me know if that granularity makes sense or if you'd prefer a
> different split.
I think it makes sense.
Thanks,
Slava.
Thread overview: 10+ messages
2026-02-25 12:59 [RFC PATCH v1 0/4] ceph: manual client session reset via debugfs Alex Markuze
2026-02-25 12:59 ` [RFC PATCH v1 1/4] ceph: convert inode flags to named bit positions Alex Markuze
2026-02-25 20:56 ` [EXTERNAL] " Viacheslav Dubeyko
2026-02-26 13:56 ` Alex Markuze
2026-02-25 12:59 ` [RFC PATCH v1 2/4] ceph: add bounded timeout and diagnostics to wait_caps_flush() Alex Markuze
2026-02-25 12:59 ` [RFC PATCH v1 3/4] ceph: implement manual client session reset via debugfs Alex Markuze
2026-02-25 12:59 ` [RFC PATCH v1 4/4] ceph: rework mds_peer_reset() for robust session recovery Alex Markuze
2026-02-25 20:51 ` [EXTERNAL] [RFC PATCH v1 0/4] ceph: manual client session reset via debugfs Viacheslav Dubeyko
2026-02-26 14:54 ` Alex Markuze
2026-03-03 22:37 ` Viacheslav Dubeyko