* [PATCH v4 01/11] ceph: convert inode flags to named bit positions and atomic bitops
2026-05-07 12:27 [PATCH v4 00/11] ceph: manual client session reset Alex Markuze
@ 2026-05-07 12:27 ` Alex Markuze
2026-05-07 18:35 ` Viacheslav Dubeyko
2026-05-07 12:27 ` [PATCH v4 02/11] ceph: use proper endian conversion for flock_len in reconnect Alex Markuze
` (10 subsequent siblings)
11 siblings, 1 reply; 24+ messages in thread
From: Alex Markuze @ 2026-05-07 12:27 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Define named bit-position constants for all CEPH_I_* inode flags and
derive the bitmask values from them. This gives every flag a named
_BIT constant usable with the test_bit/set_bit/clear_bit family.
The intentionally unused bit position 1 is documented inline.
Convert all flag modifications to use atomic bitops (set_bit,
clear_bit, test_and_clear_bit). The previous code mixed lockless
atomic ops on some flags (ERROR_WRITE, ODIRECT) with non-atomic
read-modify-write (|= / &= ~) on other flags sharing the same
unsigned long. A concurrent non-atomic RMW can clobber an
adjacent lockless atomic update -- for example, a lockless
clear_bit(ERROR_WRITE) could be silently resurrected by a
concurrent ci->i_ceph_flags |= CEPH_I_FLUSH under the spinlock.
Using atomic bitops for all modifications eliminates this class
of race entirely.
Flags whose only users are now the _BIT form (ERROR_WRITE,
ASYNC_CHECK_CAPS) have their old mask defines removed to document
that callers must use the _BIT constant with the set_bit/test_bit
family. ERROR_FILELOCK and SHUTDOWN retain their mask defines
because they are still used via bitmask tests in lockless readers
(ceph_inode_is_shutdown, reconnect_caps_cb).
The direct assignment in ceph_finish_async_create() is converted
from i_ceph_flags = CEPH_I_ASYNC_CREATE to set_bit(). This
inode is I_NEW at this point -- still invisible to other threads
and guaranteed to have zero flags from alloc_inode -- so either
form is safe, but set_bit() keeps the conversion uniform.
Co-developed-by: Viacheslav Dubeyko <vdubeyko@redhat.com>
Signed-off-by: Viacheslav Dubeyko <vdubeyko@redhat.com>
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/addr.c | 20 +++++++-------
fs/ceph/caps.c | 24 ++++++++---------
fs/ceph/file.c | 13 ++++-----
fs/ceph/inode.c | 4 +--
fs/ceph/locks.c | 22 ++++-----------
fs/ceph/mds_client.c | 3 ++-
fs/ceph/mds_client.h | 2 +-
fs/ceph/snap.c | 2 +-
fs/ceph/super.h | 64 +++++++++++++++++++++++---------------------
fs/ceph/xattr.c | 2 +-
10 files changed, 74 insertions(+), 82 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 94ffa127b1d3..1859a0c92d66 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -2563,7 +2563,8 @@ int ceph_pool_perm_check(struct inode *inode, int need)
struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_string *pool_ns;
s64 pool;
- int ret, flags;
+ int ret;
+ unsigned long flags;
/* Only need to do this for regular files */
if (!S_ISREG(inode->i_mode))
@@ -2605,20 +2606,19 @@ int ceph_pool_perm_check(struct inode *inode, int need)
if (ret < 0)
return ret;
- flags = CEPH_I_POOL_PERM;
- if (ret & POOL_READ)
- flags |= CEPH_I_POOL_RD;
- if (ret & POOL_WRITE)
- flags |= CEPH_I_POOL_WR;
-
spin_lock(&ci->i_ceph_lock);
if (pool == ci->i_layout.pool_id &&
pool_ns == rcu_dereference_raw(ci->i_layout.pool_ns)) {
- ci->i_ceph_flags |= flags;
- } else {
+ set_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
+ if (ret & POOL_READ)
+ set_bit(CEPH_I_POOL_RD_BIT, &ci->i_ceph_flags);
+ if (ret & POOL_WRITE)
+ set_bit(CEPH_I_POOL_WR_BIT, &ci->i_ceph_flags);
+ } else {
pool = ci->i_layout.pool_id;
- flags = ci->i_ceph_flags;
}
+ /* Re-read flags under the lock so check: sees the updated bits. */
+ flags = ci->i_ceph_flags;
spin_unlock(&ci->i_ceph_lock);
goto check;
}
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index d51454e995a8..cb9e78b713d9 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -549,7 +549,7 @@ static void __cap_delay_requeue_front(struct ceph_mds_client *mdsc,
doutc(mdsc->fsc->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
spin_lock(&mdsc->cap_delay_lock);
- ci->i_ceph_flags |= CEPH_I_FLUSH;
+ set_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
if (!list_empty(&ci->i_cap_delay_list))
list_del_init(&ci->i_cap_delay_list);
list_add(&ci->i_cap_delay_list, &mdsc->cap_delay_list);
@@ -1409,7 +1409,7 @@ static void __prep_cap(struct cap_msg_args *arg, struct ceph_cap *cap,
ceph_cap_string(revoking));
BUG_ON((retain & CEPH_CAP_PIN) == 0);
- ci->i_ceph_flags &= ~CEPH_I_FLUSH;
+ clear_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
cap->issued &= retain; /* drop bits we don't want */
/*
@@ -1666,7 +1666,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
last_tid = capsnap->cap_flush.tid;
}
- ci->i_ceph_flags &= ~CEPH_I_FLUSH_SNAPS;
+ clear_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
while (first_tid <= last_tid) {
struct ceph_cap *cap = ci->i_auth_cap;
@@ -2026,7 +2026,7 @@ void ceph_check_caps(struct ceph_inode_info *ci, int flags)
spin_lock(&ci->i_ceph_lock);
if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
- ci->i_ceph_flags |= CEPH_I_ASYNC_CHECK_CAPS;
+ set_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT, &ci->i_ceph_flags);
/* Don't send messages until we get async create reply */
spin_unlock(&ci->i_ceph_lock);
@@ -2577,7 +2577,7 @@ static void __kick_flushing_caps(struct ceph_mds_client *mdsc,
if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE)
return;
- ci->i_ceph_flags &= ~CEPH_I_KICK_FLUSH;
+ clear_bit(CEPH_I_KICK_FLUSH_BIT, &ci->i_ceph_flags);
list_for_each_entry_reverse(cf, &ci->i_cap_flush_list, i_list) {
if (cf->is_capsnap) {
@@ -2686,7 +2686,7 @@ void ceph_early_kick_flushing_caps(struct ceph_mds_client *mdsc,
__kick_flushing_caps(mdsc, session, ci,
oldest_flush_tid);
} else {
- ci->i_ceph_flags |= CEPH_I_KICK_FLUSH;
+ set_bit(CEPH_I_KICK_FLUSH_BIT, &ci->i_ceph_flags);
}
spin_unlock(&ci->i_ceph_lock);
@@ -2829,7 +2829,7 @@ static int try_get_cap_refs(struct inode *inode, int need, int want,
spin_lock(&ci->i_ceph_lock);
if ((flags & CHECK_FILELOCK) &&
- (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK)) {
+ test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
doutc(cl, "%p %llx.%llx error filelock\n", inode,
ceph_vinop(inode));
ret = -EIO;
@@ -3207,7 +3207,7 @@ static int ceph_try_drop_cap_snap(struct ceph_inode_info *ci,
BUG_ON(capsnap->cap_flush.tid > 0);
ceph_put_snap_context(capsnap->context);
if (!list_is_last(&capsnap->ci_item, &ci->i_cap_snaps))
- ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+ set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
list_del(&capsnap->ci_item);
ceph_put_cap_snap(capsnap);
@@ -3396,7 +3396,7 @@ void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
if (ceph_try_drop_cap_snap(ci, capsnap)) {
put++;
} else {
- ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+ set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
flush_snaps = true;
}
}
@@ -3648,7 +3648,7 @@ static void handle_cap_grant(struct inode *inode,
if (ci->i_layout.pool_id != old_pool ||
extra_info->pool_ns != old_ns)
- ci->i_ceph_flags &= ~CEPH_I_POOL_PERM;
+ clear_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
extra_info->pool_ns = old_ns;
@@ -4815,7 +4815,7 @@ int ceph_drop_caps_for_unlink(struct inode *inode)
doutc(mdsc->fsc->client, "%p %llx.%llx\n", inode,
ceph_vinop(inode));
spin_lock(&mdsc->cap_delay_lock);
- ci->i_ceph_flags |= CEPH_I_FLUSH;
+ set_bit(CEPH_I_FLUSH_BIT, &ci->i_ceph_flags);
if (!list_empty(&ci->i_cap_delay_list))
list_del_init(&ci->i_cap_delay_list);
list_add_tail(&ci->i_cap_delay_list,
@@ -5080,7 +5080,7 @@ int ceph_purge_inode_cap(struct inode *inode, struct ceph_cap *cap, bool *invali
if (atomic_read(&ci->i_filelock_ref) > 0) {
/* make further file lock syscall return -EIO */
- ci->i_ceph_flags |= CEPH_I_ERROR_FILELOCK;
+ set_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
pr_warn_ratelimited_client(cl,
" dropping file locks for %p %llx.%llx\n",
inode, ceph_vinop(inode));
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d54d71669176..7ca9f60fb0e5 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -598,12 +598,12 @@ static void wake_async_create_waiters(struct inode *inode,
spin_lock(&ci->i_ceph_lock);
if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
- clear_and_wake_up_bit(CEPH_ASYNC_CREATE_BIT, &ci->i_ceph_flags);
+ /* Serialized by i_ceph_lock; the two ops touch different bits. */
+ clear_and_wake_up_bit(CEPH_I_ASYNC_CREATE_BIT, &ci->i_ceph_flags);
- if (ci->i_ceph_flags & CEPH_I_ASYNC_CHECK_CAPS) {
- ci->i_ceph_flags &= ~CEPH_I_ASYNC_CHECK_CAPS;
+ if (test_and_clear_bit(CEPH_I_ASYNC_CHECK_CAPS_BIT,
+ &ci->i_ceph_flags))
check_cap = true;
- }
}
ceph_kick_flushing_inode_caps(session, ci);
spin_unlock(&ci->i_ceph_lock);
@@ -766,7 +766,8 @@ static int ceph_finish_async_create(struct inode *dir, struct inode *inode,
* that point and don't worry about setting
* CEPH_I_ASYNC_CREATE.
*/
- ceph_inode(inode)->i_ceph_flags = CEPH_I_ASYNC_CREATE;
+ set_bit(CEPH_I_ASYNC_CREATE_BIT,
+ &ceph_inode(inode)->i_ceph_flags);
unlock_new_inode(inode);
}
if (d_in_lookup(dentry) || d_really_is_negative(dentry)) {
@@ -2482,7 +2483,7 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
if ((got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO)) == 0 ||
(iocb->ki_flags & IOCB_DIRECT) || (fi->flags & CEPH_F_SYNC) ||
- (ci->i_ceph_flags & CEPH_I_ERROR_WRITE)) {
+ test_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags)) {
struct ceph_snap_context *snapc;
struct iov_iter data;
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 22c7da1ea61c..4871d7ab2730 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1180,7 +1180,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
rcu_assign_pointer(ci->i_layout.pool_ns, pool_ns);
if (ci->i_layout.pool_id != old_pool || pool_ns != old_ns)
- ci->i_ceph_flags &= ~CEPH_I_POOL_PERM;
+ clear_bit(CEPH_I_POOL_PERM_BIT, &ci->i_ceph_flags);
pool_ns = old_ns;
@@ -3240,7 +3240,7 @@ void ceph_inode_shutdown(struct inode *inode)
bool invalidate = false;
spin_lock(&ci->i_ceph_lock);
- ci->i_ceph_flags |= CEPH_I_SHUTDOWN;
+ set_bit(CEPH_I_SHUTDOWN_BIT, &ci->i_ceph_flags);
p = rb_first(&ci->i_caps);
while (p) {
struct ceph_cap *cap = rb_entry(p, struct ceph_cap, ci_node);
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index dd764f9c64b9..c4ff2266bb94 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -57,9 +57,7 @@ static void ceph_fl_release_lock(struct file_lock *fl)
ci = ceph_inode(inode);
if (atomic_dec_and_test(&ci->i_filelock_ref)) {
/* clear error when all locks are released */
- spin_lock(&ci->i_ceph_lock);
- ci->i_ceph_flags &= ~CEPH_I_ERROR_FILELOCK;
- spin_unlock(&ci->i_ceph_lock);
+ clear_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags);
}
fl->fl_u.ceph.inode = NULL;
iput(inode);
@@ -271,15 +269,10 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
else if (IS_SETLKW(cmd))
wait = 1;
- spin_lock(&ci->i_ceph_lock);
- if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
- err = -EIO;
- }
- spin_unlock(&ci->i_ceph_lock);
- if (err < 0) {
+ if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
if (op == CEPH_MDS_OP_SETFILELOCK && lock_is_unlock(fl))
posix_lock_file(file, fl, NULL);
- return err;
+ return -EIO;
}
if (lock_is_read(fl))
@@ -331,15 +324,10 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
doutc(cl, "fl_file: %p\n", fl->c.flc_file);
- spin_lock(&ci->i_ceph_lock);
- if (ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) {
- err = -EIO;
- }
- spin_unlock(&ci->i_ceph_lock);
- if (err < 0) {
+ if (test_bit(CEPH_I_ERROR_FILELOCK_BIT, &ci->i_ceph_flags)) {
if (lock_is_unlock(fl))
locks_lock_file_wait(file, fl);
- return err;
+ return -EIO;
}
if (IS_SETLKW(cmd))
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index ed17e0023705..53f1012a9e7d 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3657,7 +3657,8 @@ static void __do_request(struct ceph_mds_client *mdsc,
spin_lock(&ci->i_ceph_lock);
cap = ci->i_auth_cap;
- if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE && mds != cap->mds) {
+ if (test_bit(CEPH_I_ASYNC_CREATE_BIT, &ci->i_ceph_flags) &&
+ mds != cap->mds) {
doutc(cl, "session changed for auth cap %d -> %d\n",
cap->session->s_mds, session->s_mds);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 4e6c87f8414c..d873e784b025 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -670,7 +670,7 @@ static inline int ceph_wait_on_async_create(struct inode *inode)
{
struct ceph_inode_info *ci = ceph_inode(inode);
- return wait_on_bit(&ci->i_ceph_flags, CEPH_ASYNC_CREATE_BIT,
+ return wait_on_bit(&ci->i_ceph_flags, CEPH_I_ASYNC_CREATE_BIT,
TASK_KILLABLE);
}
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 52b4c2684f92..9b79a5eaca93 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -700,7 +700,7 @@ int __ceph_finish_cap_snap(struct ceph_inode_info *ci,
return 0;
}
- ci->i_ceph_flags |= CEPH_I_FLUSH_SNAPS;
+ set_bit(CEPH_I_FLUSH_SNAPS_BIT, &ci->i_ceph_flags);
doutc(cl, "%p %llx.%llx cap_snap %p snapc %p %llu %s s=%llu\n",
inode, ceph_vinop(inode), capsnap, capsnap->context,
capsnap->context->seq, ceph_cap_string(capsnap->dirty),
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index afc89ce91804..cb45a59dbb19 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -665,23 +665,34 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
/*
* Ceph inode.
*/
-#define CEPH_I_DIR_ORDERED (1 << 0) /* dentries in dir are ordered */
-#define CEPH_I_FLUSH (1 << 2) /* do not delay flush of dirty metadata */
-#define CEPH_I_POOL_PERM (1 << 3) /* pool rd/wr bits are valid */
-#define CEPH_I_POOL_RD (1 << 4) /* can read from pool */
-#define CEPH_I_POOL_WR (1 << 5) /* can write to pool */
-#define CEPH_I_SEC_INITED (1 << 6) /* security initialized */
-#define CEPH_I_KICK_FLUSH (1 << 7) /* kick flushing caps */
-#define CEPH_I_FLUSH_SNAPS (1 << 8) /* need flush snapss */
-#define CEPH_I_ERROR_WRITE (1 << 9) /* have seen write errors */
-#define CEPH_I_ERROR_FILELOCK (1 << 10) /* have seen file lock errors */
-#define CEPH_I_ODIRECT_BIT (11) /* inode in direct I/O mode */
-#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
-#define CEPH_ASYNC_CREATE_BIT (12) /* async create in flight for this */
-#define CEPH_I_ASYNC_CREATE (1 << CEPH_ASYNC_CREATE_BIT)
-#define CEPH_I_SHUTDOWN (1 << 13) /* inode is no longer usable */
-#define CEPH_I_ASYNC_CHECK_CAPS (1 << 14) /* check caps immediately after async
- creating finishes */
+#define CEPH_I_DIR_ORDERED_BIT (0) /* dentries in dir are ordered */
+ /* bit 1 historically unused */
+#define CEPH_I_FLUSH_BIT (2) /* do not delay flush of dirty metadata */
+#define CEPH_I_POOL_PERM_BIT (3) /* pool rd/wr bits are valid */
+#define CEPH_I_POOL_RD_BIT (4) /* can read from pool */
+#define CEPH_I_POOL_WR_BIT (5) /* can write to pool */
+#define CEPH_I_SEC_INITED_BIT (6) /* security initialized */
+#define CEPH_I_KICK_FLUSH_BIT (7) /* kick flushing caps */
+#define CEPH_I_FLUSH_SNAPS_BIT (8) /* need flush snaps */
+#define CEPH_I_ERROR_WRITE_BIT (9) /* have seen write errors */
+#define CEPH_I_ERROR_FILELOCK_BIT (10) /* have seen file lock errors */
+#define CEPH_I_ODIRECT_BIT (11) /* inode in direct I/O mode */
+#define CEPH_I_ASYNC_CREATE_BIT (12) /* async create in flight for this */
+#define CEPH_I_SHUTDOWN_BIT (13) /* inode is no longer usable */
+#define CEPH_I_ASYNC_CHECK_CAPS_BIT (14) /* check caps after async creating finishes */
+
+#define CEPH_I_DIR_ORDERED (1 << CEPH_I_DIR_ORDERED_BIT)
+#define CEPH_I_FLUSH (1 << CEPH_I_FLUSH_BIT)
+#define CEPH_I_POOL_PERM (1 << CEPH_I_POOL_PERM_BIT)
+#define CEPH_I_POOL_RD (1 << CEPH_I_POOL_RD_BIT)
+#define CEPH_I_POOL_WR (1 << CEPH_I_POOL_WR_BIT)
+#define CEPH_I_SEC_INITED (1 << CEPH_I_SEC_INITED_BIT)
+#define CEPH_I_KICK_FLUSH (1 << CEPH_I_KICK_FLUSH_BIT)
+#define CEPH_I_FLUSH_SNAPS (1 << CEPH_I_FLUSH_SNAPS_BIT)
+#define CEPH_I_ERROR_FILELOCK (1 << CEPH_I_ERROR_FILELOCK_BIT)
+#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
+#define CEPH_I_ASYNC_CREATE (1 << CEPH_I_ASYNC_CREATE_BIT)
+#define CEPH_I_SHUTDOWN (1 << CEPH_I_SHUTDOWN_BIT)
/*
* Masks of ceph inode work.
@@ -694,27 +705,18 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
/*
* We set the ERROR_WRITE bit when we start seeing write errors on an inode
- * and then clear it when they start succeeding. Note that we do a lockless
- * check first, and only take the lock if it looks like it needs to be changed.
- * The write submission code just takes this as a hint, so we're not too
- * worried if a few slip through in either direction.
+ * and then clear it when they start succeeding. The write submission code
+ * just takes this as a hint, so we're not too worried if a few slip through
+ * in either direction.
*/
static inline void ceph_set_error_write(struct ceph_inode_info *ci)
{
- if (!(READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE)) {
- spin_lock(&ci->i_ceph_lock);
- ci->i_ceph_flags |= CEPH_I_ERROR_WRITE;
- spin_unlock(&ci->i_ceph_lock);
- }
+ set_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
}
static inline void ceph_clear_error_write(struct ceph_inode_info *ci)
{
- if (READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE) {
- spin_lock(&ci->i_ceph_lock);
- ci->i_ceph_flags &= ~CEPH_I_ERROR_WRITE;
- spin_unlock(&ci->i_ceph_lock);
- }
+ clear_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
}
static inline void __ceph_dir_set_complete(struct ceph_inode_info *ci,
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index e773be07f767..860fc8e1867d 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1054,7 +1054,7 @@ ssize_t __ceph_getxattr(struct inode *inode, const char *name, void *value,
if (current->journal_info &&
!strncmp(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN) &&
security_ismaclabel(name + XATTR_SECURITY_PREFIX_LEN))
- ci->i_ceph_flags |= CEPH_I_SEC_INITED;
+ set_bit(CEPH_I_SEC_INITED_BIT, &ci->i_ceph_flags);
out:
spin_unlock(&ci->i_ceph_lock);
return err;
--
2.34.1
* Re: [PATCH v4 01/11] ceph: convert inode flags to named bit positions and atomic bitops
2026-05-07 12:27 ` [PATCH v4 01/11] ceph: convert inode flags to named bit positions and atomic bitops Alex Markuze
@ 2026-05-07 18:35 ` Viacheslav Dubeyko
0 siblings, 0 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-05-07 18:35 UTC (permalink / raw)
To: Alex Markuze, ceph-devel@vger.kernel.org
Cc: idryomov@gmail.com, linux-kernel@vger.kernel.org
> +#define CEPH_I_ASYNC_CHECK_CAPS_BIT (14) /* check caps after async creating finishes */
> +
> +#define CEPH_I_DIR_ORDERED (1 << CEPH_I_DIR_ORDERED_BIT)
> +#define CEPH_I_FLUSH (1 << CEPH_I_FLUSH_BIT)
> +#define CEPH_I_POOL_PERM (1 << CEPH_I_POOL_PERM_BIT)
> +#define CEPH_I_POOL_RD (1 << CEPH_I_POOL_RD_BIT)
> +#define CEPH_I_POOL_WR (1 << CEPH_I_POOL_WR_BIT)
> +#define CEPH_I_SEC_INITED (1 << CEPH_I_SEC_INITED_BIT)
> +#define CEPH_I_KICK_FLUSH (1 << CEPH_I_KICK_FLUSH_BIT)
> +#define CEPH_I_FLUSH_SNAPS (1 << CEPH_I_FLUSH_SNAPS_BIT)
> +#define CEPH_I_ERROR_FILELOCK (1 << CEPH_I_ERROR_FILELOCK_BIT)
> +#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
> +#define CEPH_I_ASYNC_CREATE (1 << CEPH_I_ASYNC_CREATE_BIT)
> +#define CEPH_I_SHUTDOWN (1 << CEPH_I_SHUTDOWN_BIT)
>
> /*
> * Masks of ceph inode work.
> @@ -694,27 +705,18 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
>
> /*
> * We set the ERROR_WRITE bit when we start seeing write errors on an inode
> - * and then clear it when they start succeeding. Note that we do a lockless
> - * check first, and only take the lock if it looks like it needs to be changed.
> - * The write submission code just takes this as a hint, so we're not too
> - * worried if a few slip through in either direction.
> + * and then clear it when they start succeeding. The write submission code
> + * just takes this as a hint, so we're not too worried if a few slip through
> + * in either direction.
> */
> static inline void ceph_set_error_write(struct ceph_inode_info *ci)
> {
> - if (!(READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE)) {
> - spin_lock(&ci->i_ceph_lock);
> - ci->i_ceph_flags |= CEPH_I_ERROR_WRITE;
> - spin_unlock(&ci->i_ceph_lock);
> - }
> + set_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
> }
>
> static inline void ceph_clear_error_write(struct ceph_inode_info *ci)
> {
> - if (READ_ONCE(ci->i_ceph_flags) & CEPH_I_ERROR_WRITE) {
> - spin_lock(&ci->i_ceph_lock);
> - ci->i_ceph_flags &= ~CEPH_I_ERROR_WRITE;
> - spin_unlock(&ci->i_ceph_lock);
> - }
> + clear_bit(CEPH_I_ERROR_WRITE_BIT, &ci->i_ceph_flags);
> }
>
> static inline void __ceph_dir_set_complete(struct ceph_inode_info *ci,
> diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
> index e773be07f767..860fc8e1867d 100644
> --- a/fs/ceph/xattr.c
> +++ b/fs/ceph/xattr.c
> @@ -1054,7 +1054,7 @@ ssize_t __ceph_getxattr(struct inode *inode, const char *name, void *value,
> if (current->journal_info &&
> !strncmp(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN) &&
> security_ismaclabel(name + XATTR_SECURITY_PREFIX_LEN))
> - ci->i_ceph_flags |= CEPH_I_SEC_INITED;
> + set_bit(CEPH_I_SEC_INITED_BIT, &ci->i_ceph_flags);
> out:
> spin_unlock(&ci->i_ceph_lock);
> return err;
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Thanks,
Slava.
^ permalink raw reply [flat|nested] 24+ messages in thread
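[Editorial aside, not part of the patch: the named-bit-position scheme above — _BIT constants plus masks derived from them, with all modifications going through atomic bitops — can be sketched in userspace C. The demo_* names are illustrative, and C11 atomic_fetch_or/atomic_fetch_and are only a stand-in for the kernel's arch-specific set_bit/clear_bit/test_bit; the point shown is that an atomic RMW on one bit cannot clobber a concurrent update to a neighboring bit of the same word, unlike a plain |= / &= ~.]

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace sketch of the patch's flag scheme: named bit positions,
 * masks derived from them, and atomic bitops for all modifications.
 * atomic_fetch_or/atomic_fetch_and approximate the kernel's
 * set_bit()/clear_bit(); they are not the real implementation.
 */
enum {
	DEMO_I_DIR_ORDERED_BIT = 0,
	/* bit 1 historically unused */
	DEMO_I_FLUSH_BIT       = 2,
	DEMO_I_ERROR_WRITE_BIT = 9,
};

#define DEMO_I_DIR_ORDERED (1UL << DEMO_I_DIR_ORDERED_BIT)
#define DEMO_I_FLUSH       (1UL << DEMO_I_FLUSH_BIT)

static void demo_set_bit(int nr, _Atomic unsigned long *addr)
{
	/* atomic read-modify-write: cannot clobber adjacent bits */
	atomic_fetch_or(addr, 1UL << nr);
}

static void demo_clear_bit(int nr, _Atomic unsigned long *addr)
{
	atomic_fetch_and(addr, ~(1UL << nr));
}

static int demo_test_bit(int nr, _Atomic unsigned long *addr)
{
	return (atomic_load(addr) >> nr) & 1;
}
```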
* [PATCH v4 02/11] ceph: use proper endian conversion for flock_len in reconnect
2026-05-07 12:27 [PATCH v4 00/11] ceph: manual client session reset Alex Markuze
2026-05-07 12:27 ` [PATCH v4 01/11] ceph: convert inode flags to named bit positions and atomic bitops Alex Markuze
@ 2026-05-07 12:27 ` Alex Markuze
2026-05-07 12:27 ` [PATCH v4 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset Alex Markuze
` (9 subsequent siblings)
11 siblings, 0 replies; 24+ messages in thread
From: Alex Markuze @ 2026-05-07 12:27 UTC (permalink / raw)
To: ceph-devel
Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze,
Viacheslav Dubeyko
Replace the __force __le32 cast with cpu_to_le32() for the flock_len field
in reconnect_caps_cb(). The old code used a type-system bypass to silence
sparse; the new form uses the proper endian conversion macro.
Also switch from a raw bitmask test against i_ceph_flags to test_bit() on
the named CEPH_I_ERROR_FILELOCK_BIT, which is the correct accessor for the
unsigned long flags field after the bit-position conversion.
Remove the now-unused CEPH_I_ERROR_FILELOCK mask define since all callers
use the _BIT form with test_bit/set_bit/clear_bit.
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/mds_client.c | 5 +++--
fs/ceph/super.h | 1 -
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 53f1012a9e7d..d9543399b129 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4747,8 +4747,9 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
rec.v2.issued = cpu_to_le32(cap->issued);
rec.v2.snaprealm = cpu_to_le64(ci->i_snap_realm->ino);
rec.v2.pathbase = cpu_to_le64(path_info.vino.ino);
- rec.v2.flock_len = (__force __le32)
- ((ci->i_ceph_flags & CEPH_I_ERROR_FILELOCK) ? 0 : 1);
+ rec.v2.flock_len = cpu_to_le32(
+ test_bit(CEPH_I_ERROR_FILELOCK_BIT,
+ &ci->i_ceph_flags) ? 0 : 1);
} else {
struct timespec64 ts;
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index cb45a59dbb19..8afc6f3a10da 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -689,7 +689,6 @@ static inline struct inode *ceph_find_inode(struct super_block *sb,
#define CEPH_I_SEC_INITED (1 << CEPH_I_SEC_INITED_BIT)
#define CEPH_I_KICK_FLUSH (1 << CEPH_I_KICK_FLUSH_BIT)
#define CEPH_I_FLUSH_SNAPS (1 << CEPH_I_FLUSH_SNAPS_BIT)
-#define CEPH_I_ERROR_FILELOCK (1 << CEPH_I_ERROR_FILELOCK_BIT)
#define CEPH_I_ODIRECT (1 << CEPH_I_ODIRECT_BIT)
#define CEPH_I_ASYNC_CREATE (1 << CEPH_I_ASYNC_CREATE_BIT)
#define CEPH_I_SHUTDOWN (1 << CEPH_I_SHUTDOWN_BIT)
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread
* [PATCH v4 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset
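[Editorial aside, not part of the patch: what cpu_to_le32() guarantees — the value's in-memory bytes are little-endian regardless of host endianness — can be shown with a portable userspace sketch. The kernel macro is a no-op on LE hosts and a byte swap on BE hosts; demo_cpu_to_le32 below is an illustrative stand-in that writes the bytes explicitly.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable sketch of cpu_to_le32() semantics: produce a 32-bit value
 * whose byte representation in memory is little-endian on any host.
 * Illustrative only; the kernel's macro is endianness-conditional.
 */
static uint32_t demo_cpu_to_le32(uint32_t v)
{
	uint8_t b[4] = {
		(uint8_t)(v & 0xff),
		(uint8_t)((v >> 8) & 0xff),
		(uint8_t)((v >> 16) & 0xff),
		(uint8_t)(v >> 24),
	};
	uint32_t out;

	/* reinterpret the LE byte sequence as the wire-format value */
	memcpy(&out, b, 4);
	return out;
}
```

For a field like flock_len above, the 0-or-1 flag value goes through this conversion so the MDS decodes the same number independent of client CPU byte order.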
2026-05-07 12:27 [PATCH v4 00/11] ceph: manual client session reset Alex Markuze
2026-05-07 12:27 ` [PATCH v4 01/11] ceph: convert inode flags to named bit positions and atomic bitops Alex Markuze
2026-05-07 12:27 ` [PATCH v4 02/11] ceph: use proper endian conversion for flock_len in reconnect Alex Markuze
@ 2026-05-07 12:27 ` Alex Markuze
2026-05-07 18:43 ` [EXTERNAL] " Viacheslav Dubeyko
2026-05-07 12:27 ` [PATCH v4 04/11] ceph: add diagnostic timeout loop to wait_caps_flush() Alex Markuze
` (8 subsequent siblings)
11 siblings, 1 reply; 24+ messages in thread
From: Alex Markuze @ 2026-05-07 12:27 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Change send_mds_reconnect() to return an error code so callers can detect
and report reconnect failures instead of silently ignoring them. Add early
bailout checks for sessions that are already closed, rejected, or
unregistered, which avoids sending reconnect messages for sessions that
can no longer be recovered.
The early -ESTALE and -ENOENT bailouts use a separate fail_return label
that skips the pr_err_client diagnostic, since these codes indicate
expected concurrent-teardown races rather than genuine reconnect build
failures.
Move the "reconnect start" log after the early-bailout checks so it
only appears for sessions that actually proceed with reconnect.
Save the prior session state before transitioning to RECONNECTING,
and restore it in the failure path. Without this, a transient
build or encoding failure (-ENOMEM, -ENOSPC) strands the session
in RECONNECTING indefinitely because check_new_map() only retries
sessions in RESTARTING state.
Rewrite mds_peer_reset() to handle the case where the MDS is past its
RECONNECT phase (i.e. active). An active MDS rejects CLIENT_RECONNECT
messages because it only accepts them during its own RECONNECT window
after restart. Previously, the client would send a doomed reconnect
that the MDS would reject or ignore. Now, the client tears the session
down locally and lets new requests re-open a fresh session, which is
the correct recovery for this scenario. The RECONNECTING state is
handled on the same teardown path, since the MDS will reject reconnect
attempts from an active client regardless of the session's local state.
Add explicit cases for CLOSED and REJECTED session states in
mds_peer_reset() since these are terminal states where a connection
drop is expected behavior.
The session teardown path in mds_peer_reset() follows the established
drop-and-reacquire locking pattern from check_new_map(): take
mdsc->mutex for session unregistration, release it, then take s->s_mutex
separately for cleanup. This avoids introducing a new simultaneous lock
nesting pattern.
Log reconnect failures from check_new_map() and mds_peer_reset() at
pr_warn level rather than pr_err, since return codes like -ESTALE
(closed/rejected session) and -ENOENT (unregistered session) are
expected during concurrent teardown. Log dropped messages for
unregistered sessions via doutc() (dynamic debug) rather than
pr_info, as post-reset message arrival is routine and does not
warrant unconditional logging.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/mds_client.c | 178 +++++++++++++++++++++++++++++++++++++++----
1 file changed, 163 insertions(+), 15 deletions(-)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index d9543399b129..249419c17d3c 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4470,9 +4470,14 @@ static void handle_session(struct ceph_mds_session *session,
break;
case CEPH_SESSION_REJECT:
- WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING);
- pr_info_client(cl, "mds%d rejected session\n",
- session->s_mds);
+ WARN_ON(session->s_state != CEPH_MDS_SESSION_OPENING &&
+ session->s_state != CEPH_MDS_SESSION_RECONNECTING);
+ if (session->s_state == CEPH_MDS_SESSION_RECONNECTING)
+ pr_info_client(cl, "mds%d reconnect rejected\n",
+ session->s_mds);
+ else
+ pr_info_client(cl, "mds%d rejected session\n",
+ session->s_mds);
session->s_state = CEPH_MDS_SESSION_REJECTED;
cleanup_session_requests(mdsc, session);
remove_session_caps(session);
@@ -4732,6 +4737,14 @@ static int reconnect_caps_cb(struct inode *inode, int mds, void *arg)
cap->mseq = 0; /* and migrate_seq */
cap->cap_gen = atomic_read(&cap->session->s_cap_gen);
+ /*
+ * Note: CEPH_I_ERROR_FILELOCK is not set during reconnect.
+ * Instead, locks are submitted for best-effort MDS reclaim
+ * via the flock_len field below. If reclaim fails (e.g.,
+ * another client grabbed a conflicting lock), future lock
+ * operations will fail and set the error flag at that point.
+ */
+
/* These are lost when the session goes away */
if (S_ISDIR(inode->i_mode)) {
if (cap->issued & CEPH_CAP_DIR_CREATE) {
@@ -4946,20 +4959,19 @@ static int encode_snap_realms(struct ceph_mds_client *mdsc,
*
* This is a relatively heavyweight operation, but it's rare.
*/
-static void send_mds_reconnect(struct ceph_mds_client *mdsc,
- struct ceph_mds_session *session)
+static int send_mds_reconnect(struct ceph_mds_client *mdsc,
+ struct ceph_mds_session *session)
{
struct ceph_client *cl = mdsc->fsc->client;
struct ceph_msg *reply;
int mds = session->s_mds;
int err = -ENOMEM;
+ int old_state;
struct ceph_reconnect_state recon_state = {
.session = session,
};
LIST_HEAD(dispose);
- pr_info_client(cl, "mds%d reconnect start\n", mds);
-
recon_state.pagelist = ceph_pagelist_alloc(GFP_NOFS);
if (!recon_state.pagelist)
goto fail_nopagelist;
@@ -4968,9 +4980,37 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
if (!reply)
goto fail_nomsg;
+ mutex_lock(&session->s_mutex);
+
+ /* Serialized by s_mutex against concurrent ceph_get_deleg_ino(). */
xa_destroy(&session->s_delegated_inos);
+ if (session->s_state == CEPH_MDS_SESSION_CLOSED ||
+ session->s_state == CEPH_MDS_SESSION_REJECTED) {
+ pr_info_client(cl, "mds%d skipping reconnect, session %s\n",
+ mds,
+ ceph_session_state_name(session->s_state));
+ mutex_unlock(&session->s_mutex);
+ ceph_msg_put(reply);
+ err = -ESTALE;
+ goto fail_return;
+ }
- mutex_lock(&session->s_mutex);
+ /* s_mutex -> mdsc->mutex matches cleanup_session_requests() order. */
+ mutex_lock(&mdsc->mutex);
+ if (mds >= mdsc->max_sessions || mdsc->sessions[mds] != session) {
+ mutex_unlock(&mdsc->mutex);
+ pr_info_client(cl,
+ "mds%d skipping reconnect, session unregistered\n",
+ mds);
+ mutex_unlock(&session->s_mutex);
+ ceph_msg_put(reply);
+ err = -ENOENT;
+ goto fail_return;
+ }
+ mutex_unlock(&mdsc->mutex);
+
+ pr_info_client(cl, "mds%d reconnect start\n", mds);
+ old_state = session->s_state;
session->s_state = CEPH_MDS_SESSION_RECONNECTING;
session->s_seq = 0;
@@ -5100,7 +5140,7 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
up_read(&mdsc->snap_rwsem);
ceph_pagelist_release(recon_state.pagelist);
- return;
+ return 0;
fail_clear_cap_reconnect:
spin_lock(&session->s_cap_lock);
@@ -5109,13 +5149,29 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
fail:
ceph_msg_put(reply);
up_read(&mdsc->snap_rwsem);
+ /*
+ * Restore prior session state so map-driven reconnect logic
+ * (check_new_map) can retry. Without this, a transient build
+ * failure strands the session in RECONNECTING indefinitely.
+ */
+ session->s_state = old_state;
mutex_unlock(&session->s_mutex);
fail_nomsg:
ceph_pagelist_release(recon_state.pagelist);
fail_nopagelist:
pr_err_client(cl, "error %d preparing reconnect for mds%d\n",
err, mds);
- return;
+ return err;
+
+fail_return:
+ /*
+ * Early-exit path for expected concurrent-teardown races
+ * (-ESTALE for closed/rejected sessions, -ENOENT for
+ * unregistered sessions). Skip the pr_err_client diagnostic
+ * since these are not genuine reconnect build failures.
+ */
+ ceph_pagelist_release(recon_state.pagelist);
+ return err;
}
@@ -5196,9 +5252,15 @@ static void check_new_map(struct ceph_mds_client *mdsc,
*/
if (s->s_state == CEPH_MDS_SESSION_RESTARTING &&
newstate >= CEPH_MDS_STATE_RECONNECT) {
+ int rc;
+
mutex_unlock(&mdsc->mutex);
clear_bit(i, targets);
- send_mds_reconnect(mdsc, s);
+ rc = send_mds_reconnect(mdsc, s);
+ if (rc)
+ pr_warn_client(cl,
+ "mds%d reconnect failed: %d\n",
+ i, rc);
mutex_lock(&mdsc->mutex);
}
@@ -5262,7 +5324,11 @@ static void check_new_map(struct ceph_mds_client *mdsc,
}
doutc(cl, "send reconnect to export target mds.%d\n", i);
mutex_unlock(&mdsc->mutex);
- send_mds_reconnect(mdsc, s);
+ err = send_mds_reconnect(mdsc, s);
+ if (err)
+ pr_warn_client(cl,
+ "mds%d export target reconnect failed: %d\n",
+ i, err);
ceph_put_mds_session(s);
mutex_lock(&mdsc->mutex);
}
@@ -6350,12 +6416,92 @@ static void mds_peer_reset(struct ceph_connection *con)
{
struct ceph_mds_session *s = con->private;
struct ceph_mds_client *mdsc = s->s_mdsc;
+ int session_state;
pr_warn_client(mdsc->fsc->client, "mds%d closed our session\n",
s->s_mds);
- if (READ_ONCE(mdsc->fsc->mount_state) != CEPH_MOUNT_FENCE_IO &&
- ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) >= CEPH_MDS_STATE_RECONNECT)
- send_mds_reconnect(mdsc, s);
+
+ if (READ_ONCE(mdsc->fsc->mount_state) == CEPH_MOUNT_FENCE_IO ||
+ ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) < CEPH_MDS_STATE_RECONNECT)
+ return;
+
+ /*
+ * Only reconnect if MDS is in its RECONNECT phase. An MDS past
+ * RECONNECT (REJOIN, CLIENTREPLAY, ACTIVE) will reject reconnect
+ * attempts, so those states fall through to session teardown below.
+ */
+ if (ceph_mdsmap_get_state(mdsc->mdsmap, s->s_mds) == CEPH_MDS_STATE_RECONNECT) {
+ int rc = send_mds_reconnect(mdsc, s);
+
+ if (rc)
+ pr_warn_client(mdsc->fsc->client,
+ "mds%d reconnect failed: %d\n",
+ s->s_mds, rc);
+ return;
+ }
+
+ /*
+ * MDS is active (past RECONNECT). It will not accept a
+ * CLIENT_RECONNECT from us, so tear the session down locally
+ * and let new requests re-open a fresh session.
+ *
+ * Snapshot session state with READ_ONCE, then revalidate under
+ * mdsc->mutex before acting. The subsequent mdsc->mutex
+ * section rechecks s_state to catch concurrent transitions, so
+ * the lockless snapshot here is safe. s->s_mutex is taken
+ * separately for cleanup after unregistration, which avoids
+ * introducing a new s->s_mutex + mdsc->mutex nesting.
+ */
+ session_state = READ_ONCE(s->s_state);
+
+ switch (session_state) {
+ case CEPH_MDS_SESSION_RESTARTING:
+ case CEPH_MDS_SESSION_RECONNECTING:
+ case CEPH_MDS_SESSION_CLOSING:
+ case CEPH_MDS_SESSION_OPEN:
+ case CEPH_MDS_SESSION_HUNG:
+ case CEPH_MDS_SESSION_OPENING:
+ mutex_lock(&mdsc->mutex);
+ if (s->s_mds >= mdsc->max_sessions ||
+ mdsc->sessions[s->s_mds] != s ||
+ s->s_state != session_state) {
+ pr_info_client(mdsc->fsc->client,
+ "mds%d state changed to %s during peer reset\n",
+ s->s_mds,
+ ceph_session_state_name(s->s_state));
+ mutex_unlock(&mdsc->mutex);
+ return;
+ }
+
+ ceph_get_mds_session(s);
+ s->s_state = CEPH_MDS_SESSION_CLOSED;
+ __unregister_session(mdsc, s);
+ __wake_requests(mdsc, &s->s_waiting);
+ mutex_unlock(&mdsc->mutex);
+
+ mutex_lock(&s->s_mutex);
+ cleanup_session_requests(mdsc, s);
+ remove_session_caps(s);
+ mutex_unlock(&s->s_mutex);
+
+ wake_up_all(&mdsc->session_close_wq);
+
+ mutex_lock(&mdsc->mutex);
+ kick_requests(mdsc, s->s_mds);
+ mutex_unlock(&mdsc->mutex);
+
+ ceph_put_mds_session(s);
+ break;
+ case CEPH_MDS_SESSION_CLOSED:
+ case CEPH_MDS_SESSION_REJECTED:
+ break;
+ default:
+ pr_warn_client(mdsc->fsc->client,
+ "mds%d peer reset in unexpected state %s\n",
+ s->s_mds,
+ ceph_session_state_name(session_state));
+ break;
+ }
}
static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
@@ -6367,6 +6513,8 @@ static void mds_dispatch(struct ceph_connection *con, struct ceph_msg *msg)
mutex_lock(&mdsc->mutex);
if (__verify_registered_session(mdsc, s) < 0) {
+ doutc(cl, "dropping tid %llu from unregistered session %d\n",
+ le64_to_cpu(msg->hdr.tid), s->s_mds);
mutex_unlock(&mdsc->mutex);
goto out;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread* Re: [EXTERNAL] [PATCH v4 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset
2026-05-07 12:27 ` [PATCH v4 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset Alex Markuze
@ 2026-05-07 18:43 ` Viacheslav Dubeyko
0 siblings, 0 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-05-07 18:43 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> [...]
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Thanks,
Slava.
* [PATCH v4 04/11] ceph: add diagnostic timeout loop to wait_caps_flush()
2026-05-07 12:27 [PATCH v4 00/11] ceph: manual client session reset Alex Markuze
` (2 preceding siblings ...)
2026-05-07 12:27 ` [PATCH v4 03/11] ceph: harden send_mds_reconnect and handle active-MDS peer reset Alex Markuze
@ 2026-05-07 12:27 ` Alex Markuze
2026-05-07 19:01 ` [EXTERNAL] " Viacheslav Dubeyko
2026-05-07 12:27 ` [PATCH v4 05/11] ceph: add client reset state machine and session teardown Alex Markuze
` (7 subsequent siblings)
11 siblings, 1 reply; 24+ messages in thread
From: Alex Markuze @ 2026-05-07 12:27 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Convert wait_caps_flush() from a silent indefinite wait into a diagnostic
wait loop that periodically dumps pending cap flush state.
The underlying wait semantics remain intact: callers still wait until the
requested cap flushes complete. The difference is that long stalls now
produce actionable diagnostics instead of looking like a silent hang.
CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES limits the number of entries
emitted per diagnostic dump, and CEPH_CAP_FLUSH_MAX_DUMP_ITERS
limits the number of timed diagnostic dumps before the wait
continues silently. When more entries exist than the per-dump
limit, a truncation count is reported. When the dump iteration
limit is reached, a final suppression message is emitted so the
transition to silence is explicit.
The diagnostic dump collects flush entry data under cap_dirty_lock into
a bounded on-stack array, then prints after releasing the lock. This
avoids holding the spinlock across printk calls.
A null cf->ci on the global flush list indicates a bug since all
cap_flush entries are initialized with a valid ci before being added.
Signal this with WARN_ON_ONCE while still printing enough context for
debugging.
READ_ONCE is used for the i_last_cap_flush_ack field, which is read
outside the inode lock domain. Flush tids are monotonically increasing
and acks are processed in order under i_ceph_lock, so the latest ack
tid is always the most recently written value.
Add a ci pointer to struct ceph_cap_flush so that the diagnostic
dump can identify which inode each pending flush belongs to. The
new i_last_cap_flush_ack field tracks the latest acknowledged flush
tid per inode for diagnostic correlation.
This improves reset-drain observability and is also useful for
existing sync and writeback troubleshooting paths.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/caps.c | 10 +++++
fs/ceph/inode.c | 1 +
fs/ceph/mds_client.c | 100 +++++++++++++++++++++++++++++++++++++++++--
fs/ceph/mds_client.h | 3 ++
fs/ceph/super.h | 6 +++
5 files changed, 116 insertions(+), 4 deletions(-)
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index cb9e78b713d9..4b37d9ffdf7f 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1648,6 +1648,7 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
spin_lock(&mdsc->cap_dirty_lock);
capsnap->cap_flush.tid = ++mdsc->last_cap_flush_tid;
+ capsnap->cap_flush.ci = ci;
list_add_tail(&capsnap->cap_flush.g_list,
&mdsc->cap_flush_list);
if (oldest_flush_tid == 0)
@@ -1846,6 +1847,7 @@ struct ceph_cap_flush *ceph_alloc_cap_flush(void)
return NULL;
cf->is_capsnap = false;
+ cf->ci = NULL;
return cf;
}
@@ -1931,6 +1933,7 @@ static u64 __mark_caps_flushing(struct inode *inode,
doutc(cl, "%p %llx.%llx now !dirty\n", inode, ceph_vinop(inode));
swap(cf, ci->i_prealloc_cap_flush);
+ cf->ci = ci;
cf->caps = flushing;
cf->wake = wake;
@@ -3826,6 +3829,13 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
bool wake_ci = false;
bool wake_mdsc = false;
+ /*
+ * Flush tids are monotonically increasing and acks arrive in
+ * order under i_ceph_lock, so this is always the latest tid.
+ * Diagnostic readers use READ_ONCE() without holding the lock.
+ */
+ WRITE_ONCE(ci->i_last_cap_flush_ack, flush_tid);
+
list_for_each_entry_safe(cf, tmp_cf, &ci->i_cap_flush_list, i_list) {
/* Is this the one that was flushed? */
if (cf->tid == flush_tid)
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 4871d7ab2730..61d7c0b8161f 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -671,6 +671,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
INIT_LIST_HEAD(&ci->i_cap_snaps);
ci->i_head_snapc = NULL;
ci->i_snap_caps = 0;
+ ci->i_last_cap_flush_ack = 0;
ci->i_last_rd = ci->i_last_wr = jiffies - 3600 * HZ;
for (i = 0; i < CEPH_FILE_MODE_BITS; i++)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 249419c17d3c..6ab5031e697a 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2330,19 +2330,111 @@ static int check_caps_flush(struct ceph_mds_client *mdsc,
}
/*
- * flush all dirty inode data to disk.
+ * Snapshot of a single cap_flush entry for diagnostic dump.
+ * Collected under cap_dirty_lock, printed after releasing it.
+ */
+struct flush_dump_entry {
+ u64 ino; /* inode number */
+ u64 snap; /* snap id */
+ int caps; /* dirty cap bits */
+ u64 tid; /* flush transaction id */
+ u64 last_ack; /* most recent ack tid for this inode */
+ bool wake; /* whether completion was requested */
+ bool is_capsnap; /* true if this is a cap snap flush */
+ bool ci_null; /* true if cf->ci was unexpectedly NULL */
+};
+
+/*
+ * Dump pending cap flushes for diagnostic purposes.
*
- * returns true if we've flushed through want_flush_tid
+ * cf->ci is safe to dereference here: cap_flush entries hold a
+ * reference on the inode (via the cap), and entries are removed from
+ * cap_flush_list under cap_dirty_lock before the cap (and thus the
+ * inode reference) is released. Holding cap_dirty_lock therefore
+ * guarantees the inode remains valid for the lifetime of the scan.
+ */
+
+static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
+{
+ struct ceph_client *cl = mdsc->fsc->client;
+ struct flush_dump_entry entries[CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES];
+ struct ceph_cap_flush *cf;
+ int n = 0, remaining = 0;
+
+ spin_lock(&mdsc->cap_dirty_lock);
+ list_for_each_entry(cf, &mdsc->cap_flush_list, g_list) {
+ if (cf->tid > want_tid)
+ break;
+ if (n < CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES) {
+ struct flush_dump_entry *e = &entries[n++];
+
+ e->ci_null = WARN_ON_ONCE(!cf->ci);
+ if (!e->ci_null) {
+ e->ino = ceph_ino(&cf->ci->netfs.inode);
+ e->snap = ceph_snap(&cf->ci->netfs.inode);
+ e->last_ack = READ_ONCE(cf->ci->i_last_cap_flush_ack);
+ }
+ e->caps = cf->caps;
+ e->tid = cf->tid;
+ e->wake = cf->wake;
+ e->is_capsnap = cf->is_capsnap;
+ } else {
+ remaining++;
+ }
+ }
+ spin_unlock(&mdsc->cap_dirty_lock);
+
+ pr_info_client(cl, "still waiting for cap flushes through %llu:\n",
+ want_tid);
+ for (int i = 0; i < n; i++) {
+ struct flush_dump_entry *e = &entries[i];
+
+ if (e->ci_null)
+ pr_info_client(cl,
+ " (null ci) %s tid=%llu wake=%d%s\n",
+ ceph_cap_string(e->caps), e->tid,
+ e->wake,
+ e->is_capsnap ? " is_capsnap" : "");
+ else
+ pr_info_client(cl,
+ " %llx.%llx %s tid=%llu last_ack=%llu wake=%d%s\n",
+ e->ino, e->snap,
+ ceph_cap_string(e->caps), e->tid,
+ e->last_ack, e->wake,
+ e->is_capsnap ? " is_capsnap" : "");
+ }
+ if (remaining)
+ pr_info_client(cl, " ... and %d more pending flushes\n",
+ remaining);
+}
+
+/*
+ * Wait for all cap flushes through @want_flush_tid to complete.
+ * Periodically dumps pending cap flush state for diagnostics.
*/
static void wait_caps_flush(struct ceph_mds_client *mdsc,
u64 want_flush_tid)
{
struct ceph_client *cl = mdsc->fsc->client;
+ int i = 0;
+ long ret;
doutc(cl, "want %llu\n", want_flush_tid);
- wait_event(mdsc->cap_flushing_wq,
- check_caps_flush(mdsc, want_flush_tid));
+ do {
+ /* 60 * HZ fits in a long on all supported architectures. */
+ ret = wait_event_timeout(mdsc->cap_flushing_wq,
+ check_caps_flush(mdsc, want_flush_tid),
+ CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC * HZ);
+ if (ret == 0) {
+ if (i < CEPH_CAP_FLUSH_MAX_DUMP_ITERS)
+ dump_cap_flushes(mdsc, want_flush_tid);
+ else if (i == CEPH_CAP_FLUSH_MAX_DUMP_ITERS)
+ pr_info_client(cl,
+ "still waiting for cap flushes; suppressing further dumps\n");
+ i++;
+ }
+ } while (ret == 0);
doutc(cl, "ok, flushed thru %llu\n", want_flush_tid);
}
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index d873e784b025..8208fdf02efe 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -77,6 +77,9 @@ struct ceph_fs_client;
struct ceph_cap;
#define MDS_AUTH_UID_ANY -1
+#define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60
+#define CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES 5
+#define CEPH_CAP_FLUSH_MAX_DUMP_ITERS 5
struct ceph_mds_cap_match {
s64 uid; /* default to MDS_AUTH_UID_ANY */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 8afc6f3a10da..a4993644d543 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -239,6 +239,7 @@ struct ceph_cap_flush {
bool is_capsnap; /* true means capsnap */
struct list_head g_list; // global
struct list_head i_list; // per inode
+ struct ceph_inode_info *ci;
};
/*
@@ -453,6 +454,11 @@ struct ceph_inode_info {
struct ceph_snap_context *i_head_snapc; /* set if wr_buffer_head > 0 or
dirty|flushing caps */
unsigned i_snap_caps; /* cap bits for snapped files */
+ /*
+ * Written under i_ceph_lock, read via READ_ONCE()
+ * from diagnostic paths.
+ */
+ u64 i_last_cap_flush_ack;
unsigned long i_last_rd;
unsigned long i_last_wr;
--
2.34.1
* Re: [EXTERNAL] [PATCH v4 04/11] ceph: add diagnostic timeout loop to wait_caps_flush()
2026-05-07 12:27 ` [PATCH v4 04/11] ceph: add diagnostic timeout loop to wait_caps_flush() Alex Markuze
@ 2026-05-07 19:01 ` Viacheslav Dubeyko
0 siblings, 0 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-05-07 19:01 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> Convert wait_caps_flush() from a silent indefinite wait into a diagnostic
> wait loop that periodically dumps pending cap flush state.

[...]
> + e->caps = cf->caps;
> + e->tid = cf->tid;
> + e->wake = cf->wake;
> + e->is_capsnap = cf->is_capsnap;
> + } else {
> + remaining++;
> + }
We don't need brackets here. :)
> + }
> + spin_unlock(&mdsc->cap_dirty_lock);
> +
> + pr_info_client(cl, "still waiting for cap flushes through %llu:\n",
> + want_tid);
> + for (int i = 0; i < n; i++) {
I am slightly worried that we could have warnings because of this declaration.
[...]
I have two minor remarks. The rest looks good.
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Thanks,
Slava.
* [PATCH v4 05/11] ceph: add client reset state machine and session teardown
2026-05-07 12:27 [PATCH v4 00/11] ceph: manual client session reset Alex Markuze
` (3 preceding siblings ...)
2026-05-07 12:27 ` [PATCH v4 04/11] ceph: add diagnostic timeout loop to wait_caps_flush() Alex Markuze
@ 2026-05-07 12:27 ` Alex Markuze
2026-05-07 19:17 ` [EXTERNAL] " Viacheslav Dubeyko
2026-05-07 12:27 ` [PATCH v4 06/11] ceph: add manual reset debugfs control and tracepoints Alex Markuze
` (6 subsequent siblings)
11 siblings, 1 reply; 24+ messages in thread
From: Alex Markuze @ 2026-05-07 12:27 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Add the client-side reset state machine, request gating, and manual
session teardown implementation.
Manual reset is an operator-triggered escape hatch for client/MDS
stalemates in which caps, locks, or unsafe metadata state stop making
forward progress. The reset blocks new metadata work, attempts a
bounded best-effort drain of dirty client state while sessions are
still alive, and finally asks the MDS to close sessions before tearing
local session state down directly.
The reset state machine tracks four phases: IDLE -> QUIESCING ->
DRAINING -> TEARDOWN -> IDLE. QUIESCING is set synchronously by
schedule_reset() before the workqueue item is dispatched, so that new
metadata requests and file-lock acquisitions are gated immediately --
even before the work function begins running. All non-IDLE phases
block callers on blocked_wq, preventing races with session teardown.
The drain phase flushes mdlog state, dirty caps, and pending cap
releases for a bounded interval. State that still cannot make progress
within that interval is discarded during teardown, which is the point
of the reset: break the stalemate and allow fresh sessions to rebuild
clean state.
The session teardown follows the established check_new_map()
forced-close pattern: unregister sessions under mdsc->mutex, then clean
up caps and requests under s->s_mutex. Reconnect is not attempted
because the MDS only accepts reconnects during its own RECONNECT phase
after restart, not from an active client.
Blocked callers are released when reset completes and observe the final
result via -EAGAIN (reset failed) or 0 (success). Internal work-function
errors such as -ENOMEM are not propagated to unrelated callers like
open() or flock(); the detailed error remains in debugfs and
tracepoints.
The work function checks st->shutdown before each phase transition
(DRAINING, TEARDOWN) so that state recorded by a concurrent
ceph_mdsc_destroy() is not overwritten. If destroy already took
ownership, the work function releases session references and returns
without touching the state.
The timeout calculation for blocked-request waiters uses max_t() to
prevent jiffies underflow when the deadline has already passed.
The close-grace sleep before teardown is a best-effort nudge to let
queued REQUEST_CLOSE messages egress; it is not a correctness
requirement since the MDS still has session_autoclose as a fallback.
The destroy path marks reset as failed and wakes blocked waiters before
cancel_work_sync() so unmount does not stall.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/locks.c | 16 ++
fs/ceph/mds_client.c | 508 +++++++++++++++++++++++++++++++++++++++++++
fs/ceph/mds_client.h | 46 ++++
3 files changed, 570 insertions(+)
diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
index c4ff2266bb94..677221bd64e0 100644
--- a/fs/ceph/locks.c
+++ b/fs/ceph/locks.c
@@ -249,6 +249,7 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
{
struct inode *inode = file_inode(file);
struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
struct ceph_client *cl = ceph_inode_to_client(inode);
int err = 0;
u16 op = CEPH_MDS_OP_SETFILELOCK;
@@ -275,6 +276,13 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
return -EIO;
}
+ /* Wait for reset to complete before acquiring new locks */
+ if (op == CEPH_MDS_OP_SETFILELOCK && !lock_is_unlock(fl)) {
+ err = ceph_mdsc_wait_for_reset(mdsc);
+ if (err)
+ return err;
+ }
+
if (lock_is_read(fl))
lock_cmd = CEPH_LOCK_SHARED;
else if (lock_is_write(fl))
@@ -311,6 +319,7 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
{
struct inode *inode = file_inode(file);
struct ceph_inode_info *ci = ceph_inode(inode);
+ struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
struct ceph_client *cl = ceph_inode_to_client(inode);
int err = 0;
u8 wait = 0;
@@ -330,6 +339,13 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
return -EIO;
}
+ /* Wait for reset to complete before acquiring new locks */
+ if (!lock_is_unlock(fl)) {
+ err = ceph_mdsc_wait_for_reset(mdsc);
+ if (err)
+ return err;
+ }
+
if (IS_SETLKW(cmd))
wait = 1;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 6ab5031e697a..ce773b1095da 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -6,6 +6,7 @@
#include <linux/slab.h>
#include <linux/gfp.h>
#include <linux/sched.h>
+#include <linux/delay.h>
#include <linux/debugfs.h>
#include <linux/seq_file.h>
#include <linux/ratelimit.h>
@@ -65,6 +66,7 @@ static void __wake_requests(struct ceph_mds_client *mdsc,
struct list_head *head);
static void ceph_cap_release_work(struct work_struct *work);
static void ceph_cap_reclaim_work(struct work_struct *work);
+static void ceph_mdsc_reset_workfn(struct work_struct *work);
static const struct ceph_connection_operations mds_con_ops;
@@ -3844,6 +3846,22 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
struct ceph_client *cl = mdsc->fsc->client;
int err = 0;
+ /*
+ * If a reset is in progress, wait for it to complete.
+ *
+ * This is best-effort: a request can pass this check just
+ * before the phase leaves IDLE and proceed concurrently with
+ * reset. That is acceptable because (a) such requests will
+ * either complete normally or fail and be retried by the
+ * caller, and (b) adding lock serialization here would
+ * penalize every request for a rare manual operation.
+ */
+ err = ceph_mdsc_wait_for_reset(mdsc);
+ if (err) {
+ doutc(cl, "wait_for_reset failed: %d\n", err);
+ return err;
+ }
+
/* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
if (req->r_inode)
ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
@@ -5266,6 +5284,474 @@ static int send_mds_reconnect(struct ceph_mds_client *mdsc,
return err;
}
+const char *ceph_reset_phase_name(enum ceph_client_reset_phase phase)
+{
+ switch (phase) {
+ case CEPH_CLIENT_RESET_IDLE: return "idle";
+ case CEPH_CLIENT_RESET_QUIESCING: return "quiescing";
+ case CEPH_CLIENT_RESET_DRAINING: return "draining";
+ case CEPH_CLIENT_RESET_TEARDOWN: return "teardown";
+ default: return "unknown";
+ }
+}
+
+/**
+ * ceph_mdsc_wait_for_reset - wait for an active reset to complete
+ * @mdsc: MDS client
+ *
+ * Returns 0 if reset completed successfully or no reset was active.
+ * Returns -EAGAIN if reset completed with an error, signalling the
+ * caller to retry. The internal error (e.g. -ENOMEM) is not propagated
+ * because callers like open() or flock() have no way to act on
+ * work-function internals. The detailed error is available via debugfs
+ * reset/status and tracepoints.
+ * Returns -ETIMEDOUT if we timed out waiting.
+ * Returns -ERESTARTSYS if interrupted by signal.
+ */
+int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
+{
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+ struct ceph_client *cl = mdsc->fsc->client;
+ unsigned long deadline = jiffies + CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC * HZ;
+ int blocked_count;
+ long remaining;
+ long wait_ret;
+ int ret;
+
+ if (ceph_reset_is_idle(st))
+ return 0;
+
+ blocked_count = atomic_inc_return(&st->blocked_requests);
+ doutc(cl, "request blocked during reset, %d total blocked\n",
+ blocked_count);
+
+retry:
+ remaining = max_t(long, deadline - jiffies, 1);
+ wait_ret = wait_event_interruptible_timeout(st->blocked_wq,
+ ceph_reset_is_idle(st),
+ remaining);
+
+ if (wait_ret == 0) {
+ atomic_dec(&st->blocked_requests);
+ pr_warn_client(cl, "timed out waiting for reset to complete\n");
+ return -ETIMEDOUT;
+ }
+ if (wait_ret < 0) {
+ atomic_dec(&st->blocked_requests);
+ return (int)wait_ret; /* -ERESTARTSYS */
+ }
+
+ /*
+ * Verify phase is still IDLE under the lock. If another reset
+ * was scheduled between the wake-up and this check, loop back
+ * and wait for it to finish rather than returning a stale result.
+ */
+ spin_lock(&st->lock);
+ if (st->phase != CEPH_CLIENT_RESET_IDLE) {
+ spin_unlock(&st->lock);
+ if (time_before(jiffies, deadline))
+ goto retry;
+ atomic_dec(&st->blocked_requests);
+ return -ETIMEDOUT;
+ }
+ ret = st->last_errno;
+ spin_unlock(&st->lock);
+
+ atomic_dec(&st->blocked_requests);
+ return ret ? -EAGAIN : 0;
+}
+
+static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret)
+{
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+
+ spin_lock(&st->lock);
+ /*
+ * If destroy already marked us as shut down, it owns the
+ * final bookkeeping and waiter wakeup. Just bail so we
+ * don't overwrite its state.
+ */
+ if (st->shutdown) {
+ spin_unlock(&st->lock);
+ return;
+ }
+ st->last_finish = jiffies;
+ st->last_errno = ret;
+ st->phase = CEPH_CLIENT_RESET_IDLE;
+ if (ret)
+ st->failure_count++;
+ else
+ st->success_count++;
+ spin_unlock(&st->lock);
+
+ /* Wake up all requests that were blocked waiting for reset */
+ wake_up_all(&st->blocked_wq);
+
+}
+
+static void ceph_mdsc_reset_workfn(struct work_struct *work)
+{
+ struct ceph_mds_client *mdsc =
+ container_of(work, struct ceph_mds_client, reset_work);
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+ struct ceph_client *cl = mdsc->fsc->client;
+ struct ceph_mds_session **sessions = NULL;
+ char reason[CEPH_CLIENT_RESET_REASON_LEN];
+ unsigned long drain_deadline;
+ int max_sessions, i, n = 0, torn_down = 0;
+ int ret = 0;
+
+ spin_lock(&st->lock);
+ strscpy(reason, st->last_reason, sizeof(reason));
+ spin_unlock(&st->lock);
+
+ mutex_lock(&mdsc->mutex);
+ max_sessions = mdsc->max_sessions;
+ if (max_sessions <= 0) {
+ mutex_unlock(&mdsc->mutex);
+ goto out_complete;
+ }
+
+ sessions = kcalloc(max_sessions, sizeof(*sessions), GFP_KERNEL);
+ if (!sessions) {
+ mutex_unlock(&mdsc->mutex);
+ ret = -ENOMEM;
+ pr_err_client(cl,
+ "manual session reset failed to allocate session array\n");
+ ceph_mdsc_reset_complete(mdsc, ret);
+ return;
+ }
+
+ for (i = 0; i < max_sessions; i++) {
+ struct ceph_mds_session *session = mdsc->sessions[i];
+
+ if (!session)
+ continue;
+
+ /*
+ * Read session state without s_mutex to avoid nesting
+ * mdsc->mutex -> s_mutex, which would invert the
+ * s_mutex -> mdsc->mutex order used by
+ * cleanup_session_requests(). s_state is an int
+ * so loads are atomic; the teardown loop below
+ * handles races with concurrent state transitions.
+ */
+ switch (READ_ONCE(session->s_state)) {
+ case CEPH_MDS_SESSION_OPEN:
+ case CEPH_MDS_SESSION_HUNG:
+ case CEPH_MDS_SESSION_OPENING:
+ case CEPH_MDS_SESSION_RESTARTING:
+ case CEPH_MDS_SESSION_RECONNECTING:
+ case CEPH_MDS_SESSION_CLOSING:
+ sessions[n++] = ceph_get_mds_session(session);
+ break;
+ default:
+ pr_info_client(cl,
+ "mds%d in state %s, skipping reset\n",
+ session->s_mds,
+ ceph_session_state_name(session->s_state));
+ break;
+ }
+ }
+ mutex_unlock(&mdsc->mutex);
+
+ pr_info_client(cl,
+ "manual session reset executing (sessions=%d, reason=\"%s\")\n",
+ n, reason);
+
+ if (n == 0) {
+ kfree(sessions);
+ goto out_complete;
+ }
+
+ spin_lock(&st->lock);
+ if (st->shutdown) {
+ spin_unlock(&st->lock);
+ goto out_sessions;
+ }
+ st->phase = CEPH_CLIENT_RESET_DRAINING;
+ spin_unlock(&st->lock);
+
+ /*
+ * Best-effort drain: flush dirty state while sessions are still
+ * alive. New requests are blocked while phase != IDLE.
+ * The sessions are functional, so non-stuck state drains normally.
+ * Stuck state (the cause of the stalemate the operator is trying
+ * to break) will not drain -- that is expected, and we proceed to
+ * forced teardown after the timeout.
+ *
+ * Four things are drained:
+ * 1. MDS journal -- send_flush_mdlog asks each MDS to journal
+ * pending unsafe operations (creates, renames, setattrs).
+ * 2. Unsafe requests -- bounded wait for each unsafe write
+ * request to reach safe status via r_safe_completion.
+ * 3. Dirty caps -- ceph_flush_dirty_caps triggers cap flush on
+ * all sessions. Non-stuck caps flush in milliseconds.
+ * 4. Cap releases -- push pending cap release messages.
+ *
+ * The unsafe-request wait and cap-flush wait below provide
+ * the bounded drain window during which all categories can
+ * make progress.
+ */
+ for (i = 0; i < n; i++)
+ send_flush_mdlog(sessions[i]);
+
+ /*
+ * Both drain legs (unsafe requests and cap flushes) share a
+ * single deadline so the total drain time is bounded at
+ * CEPH_CLIENT_RESET_DRAIN_SEC.
+ */
+ drain_deadline = jiffies + CEPH_CLIENT_RESET_DRAIN_SEC * HZ;
+
+ /*
+ * Wait for unsafe write requests (creates, renames, setattrs)
+ * to reach safe status. Uses the same pattern as
+ * flush_mdlog_and_wait_mdsc_unsafe_requests() but bounded by
+ * the shared drain deadline. Requests that do not complete within
+ * the window are force-dropped during teardown.
+ */
+ {
+ struct ceph_mds_request *req;
+ struct rb_node *rn;
+ u64 last_tid;
+
+ mutex_lock(&mdsc->mutex);
+ last_tid = mdsc->last_tid;
+ rn = rb_first(&mdsc->request_tree);
+ while (rn) {
+ req = rb_entry(rn, struct ceph_mds_request, r_node);
+ if (req->r_tid > last_tid)
+ break;
+ if (req->r_op == CEPH_MDS_OP_SETFILELOCK ||
+ !(req->r_op & CEPH_MDS_OP_WRITE)) {
+ rn = rb_next(rn);
+ continue;
+ }
+ ceph_mdsc_get_request(req);
+ mutex_unlock(&mdsc->mutex);
+
+ wait_for_completion_timeout(&req->r_safe_completion,
+ max_t(long, drain_deadline - jiffies, 1));
+
+ mutex_lock(&mdsc->mutex);
+ ceph_mdsc_put_request(req);
+ if (time_after(jiffies, drain_deadline))
+ break;
+ rn = rb_first(&mdsc->request_tree);
+ }
+ mutex_unlock(&mdsc->mutex);
+
+ if (time_after_eq(jiffies, drain_deadline))
+ WRITE_ONCE(st->drain_timed_out, true);
+ }
+
+ ceph_flush_dirty_caps(mdsc);
+ ceph_flush_cap_releases(mdsc);
+
+ spin_lock(&mdsc->cap_dirty_lock);
+ if (!list_empty(&mdsc->cap_flush_list)) {
+ struct ceph_cap_flush *cf =
+ list_last_entry(&mdsc->cap_flush_list,
+ struct ceph_cap_flush, g_list);
+ u64 want_flush = mdsc->last_cap_flush_tid;
+ long drain_ret;
+
+ /*
+ * Setting wake on the last entry is sufficient: flush
+ * entries complete in order, so when this entry finishes
+ * all earlier ones are already done.
+ */
+ cf->wake = true;
+ spin_unlock(&mdsc->cap_dirty_lock);
+ pr_info_client(cl,
+ "draining (want_flush=%llu, %d sessions)\n",
+ want_flush, n);
+ drain_ret = wait_event_timeout(mdsc->cap_flushing_wq,
+ check_caps_flush(mdsc,
+ want_flush),
+ max_t(long,
+ drain_deadline - jiffies,
+ 1));
+ if (drain_ret == 0) {
+ pr_info_client(cl,
+ "drain timed out, proceeding with forced teardown\n");
+ WRITE_ONCE(st->drain_timed_out, true);
+ } else {
+ pr_info_client(cl, "drain completed successfully\n");
+ }
+ } else {
+ spin_unlock(&mdsc->cap_dirty_lock);
+ }
+
+ spin_lock(&st->lock);
+ if (st->shutdown) {
+ spin_unlock(&st->lock);
+ goto out_sessions;
+ }
+ st->phase = CEPH_CLIENT_RESET_TEARDOWN;
+ spin_unlock(&st->lock);
+
+ /*
+ * Ask each MDS to close the session before we tear it down
+ * locally. Without this the MDS sees only a connection drop and
+ * waits for the client to reconnect (up to session_autoclose
+ * seconds) before evicting the session and releasing locks.
+ *
+ * Reuse the normal close machinery so the session state/sequence
+ * snapshot is serialized under s_mutex and a racing s_seq bump
+ * retransmits REQUEST_CLOSE while the session remains CLOSING.
+ * We send all close requests first, then yield briefly to let the
+ * network stack transmit them before __unregister_session()
+ * closes the connections.
+ */
+ for (i = 0; i < n; i++) {
+ int err;
+
+ mutex_lock(&sessions[i]->s_mutex);
+ err = __close_session(mdsc, sessions[i]);
+ mutex_unlock(&sessions[i]->s_mutex);
+ if (err < 0)
+ pr_warn_client(cl,
+ "mds%d failed to queue close request before reset: %d\n",
+ sessions[i]->s_mds, err);
+ }
+ /*
+ * Best-effort grace period: yield briefly so the network stack
+ * can transmit the queued REQUEST_CLOSE messages before we tear
+ * down connections. Not a correctness requirement -- the MDS
+ * will still evict via session_autoclose if it never receives
+ * the close request.
+ *
+ * Event-based waiting is not viable here: there is no completion
+ * event for "message left the NIC," and waiting for the MDS
+ * SESSION_CLOSE response would re-create the stalemate that the
+ * reset is meant to break.
+ */
+ if (n > 0)
+ msleep(CEPH_CLIENT_RESET_CLOSE_GRACE_MS);
+
+ /*
+ * Tear down each session: close the connection, remove all
+ * caps, clean up requests, then kick pending requests so they
+ * re-open a fresh session on the next attempt.
+ *
+ * This is modeled on the check_new_map() forced-close path
+ * for stopped MDS ranks -- a proven pattern for hard session
+ * teardown. We do NOT attempt send_mds_reconnect() because
+ * the MDS only accepts reconnects during its own RECONNECT
+ * phase (after MDS restart), not from an active client.
+ *
+ * Any state that did not drain (caps that didn't flush, unsafe
+ * requests that the MDS didn't journal) is force-dropped here.
+ * This is intentional: that state is stuck and is the reason
+ * the operator triggered the reset.
+ */
+ for (i = 0; i < n; i++) {
+ int mds = sessions[i]->s_mds;
+
+ pr_info_client(cl, "mds%d resetting session\n", mds);
+
+ mutex_lock(&mdsc->mutex);
+ if (mds >= mdsc->max_sessions ||
+ mdsc->sessions[mds] != sessions[i]) {
+ pr_info_client(cl,
+ "mds%d session already torn down, skipping\n",
+ mds);
+ mutex_unlock(&mdsc->mutex);
+ ceph_put_mds_session(sessions[i]);
+ sessions[i] = NULL;
+ continue;
+ }
+ sessions[i]->s_state = CEPH_MDS_SESSION_CLOSED;
+ __unregister_session(mdsc, sessions[i]);
+ __wake_requests(mdsc, &sessions[i]->s_waiting);
+ mutex_unlock(&mdsc->mutex);
+
+ mutex_lock(&sessions[i]->s_mutex);
+ cleanup_session_requests(mdsc, sessions[i]);
+ remove_session_caps(sessions[i]);
+ mutex_unlock(&sessions[i]->s_mutex);
+
+ wake_up_all(&mdsc->session_close_wq);
+
+ ceph_put_mds_session(sessions[i]);
+
+ mutex_lock(&mdsc->mutex);
+ kick_requests(mdsc, mds);
+ mutex_unlock(&mdsc->mutex);
+
+ torn_down++;
+ pr_info_client(cl, "mds%d session reset complete\n", mds);
+ }
+
+ kfree(sessions);
+
+ spin_lock(&st->lock);
+ st->sessions_reset = torn_down;
+ spin_unlock(&st->lock);
+
+out_complete:
+ ceph_mdsc_reset_complete(mdsc, ret);
+ return;
+
+out_sessions:
+ /* shutdown == true: ceph_mdsc_destroy() owns the final transition. */
+ for (i = 0; i < n; i++)
+ ceph_put_mds_session(sessions[i]);
+ kfree(sessions);
+}
+
+int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
+ const char *reason)
+{
+ struct ceph_client_reset_state *st = &mdsc->reset_state;
+ struct ceph_fs_client *fsc = mdsc->fsc;
+ const char *msg = (reason && reason[0]) ? reason : "manual";
+ int mount_state;
+
+ mount_state = READ_ONCE(fsc->mount_state);
+ if (mount_state != CEPH_MOUNT_MOUNTED) {
+ pr_warn_client(fsc->client,
+ "reset rejected: mount_state=%d (not mounted)\n",
+ mount_state);
+ return -EINVAL;
+ }
+
+ spin_lock(&st->lock);
+ if (st->phase != CEPH_CLIENT_RESET_IDLE) {
+ spin_unlock(&st->lock);
+ return -EBUSY;
+ }
+
+ st->phase = CEPH_CLIENT_RESET_QUIESCING;
+ st->last_start = jiffies;
+ st->last_errno = 0;
+ st->drain_timed_out = false;
+ st->sessions_reset = 0;
+ st->trigger_count++;
+ strscpy(st->last_reason, msg, sizeof(st->last_reason));
+ spin_unlock(&st->lock);
+
+ if (WARN_ON_ONCE(!queue_work(system_unbound_wq, &mdsc->reset_work))) {
+ spin_lock(&st->lock);
+ st->phase = CEPH_CLIENT_RESET_IDLE;
+ st->last_errno = -EALREADY;
+ st->last_finish = jiffies;
+ st->failure_count++;
+ spin_unlock(&st->lock);
+ wake_up_all(&st->blocked_wq);
+ return -EALREADY;
+ }
+
+ pr_info_client(mdsc->fsc->client,
+ "manual session reset scheduled (reason=\"%s\")\n",
+ msg);
+ return 0;
+}
+
/*
* compare old and new mdsmaps, kicking requests
@@ -5811,6 +6297,11 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
INIT_LIST_HEAD(&mdsc->dentry_leases);
INIT_LIST_HEAD(&mdsc->dentry_dir_leases);
+ spin_lock_init(&mdsc->reset_state.lock);
+ init_waitqueue_head(&mdsc->reset_state.blocked_wq);
+ atomic_set(&mdsc->reset_state.blocked_requests, 0);
+ INIT_WORK(&mdsc->reset_work, ceph_mdsc_reset_workfn);
+
ceph_caps_init(mdsc);
ceph_adjust_caps_max_min(mdsc, fsc->mount_options);
@@ -6336,6 +6827,23 @@ void ceph_mdsc_destroy(struct ceph_fs_client *fsc)
/* flush out any connection work with references to us */
ceph_msgr_flush();
+ /*
+ * Mark reset as failed and wake any blocked waiters before
+ * cancelling, so unmount doesn't stall on blocked_wq timeout
+ * if cancel_work_sync() prevents the work from running.
+ */
+ spin_lock(&mdsc->reset_state.lock);
+ mdsc->reset_state.shutdown = true;
+ if (mdsc->reset_state.phase != CEPH_CLIENT_RESET_IDLE) {
+ mdsc->reset_state.phase = CEPH_CLIENT_RESET_IDLE;
+ mdsc->reset_state.last_errno = -ESHUTDOWN;
+ mdsc->reset_state.last_finish = jiffies;
+ mdsc->reset_state.failure_count++;
+ }
+ spin_unlock(&mdsc->reset_state.lock);
+ wake_up_all(&mdsc->reset_state.blocked_wq);
+
+ cancel_work_sync(&mdsc->reset_work);
ceph_mdsc_stop(mdsc);
ceph_metric_destroy(&mdsc->metric);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 8208fdf02efe..b1a0621cd37e 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -80,7 +80,47 @@ struct ceph_cap;
#define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60
#define CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES 5
#define CEPH_CAP_FLUSH_MAX_DUMP_ITERS 5
+#define CEPH_CLIENT_RESET_REASON_LEN 64
+#define CEPH_CLIENT_RESET_DRAIN_SEC 30
+#define CEPH_CLIENT_RESET_CLOSE_GRACE_MS 100
+#define CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC 120
+enum ceph_client_reset_phase {
+ CEPH_CLIENT_RESET_IDLE = 0,
+ /*
+ * QUIESCING is set synchronously by schedule_reset() before the
+ * workqueue item is dispatched. It gates new requests (any
+ * phase != IDLE blocks callers) during the window between
+ * scheduling and the work function's transition to DRAINING.
+ */
+ CEPH_CLIENT_RESET_QUIESCING,
+ CEPH_CLIENT_RESET_DRAINING,
+ CEPH_CLIENT_RESET_TEARDOWN,
+};
+
+struct ceph_client_reset_state {
+ spinlock_t lock; /* protects all fields below */
+ u64 trigger_count; /* number of resets triggered */
+ u64 success_count; /* number of successful resets */
+ u64 failure_count; /* number of failed resets */
+ unsigned long last_start; /* jiffies when last reset started */
+ unsigned long last_finish; /* jiffies when last reset finished */
+ int last_errno; /* result of most recent reset */
+ enum ceph_client_reset_phase phase; /* current reset phase */
+ bool drain_timed_out; /* drain exceeded timeout */
+ bool shutdown; /* destroy in progress */
+ int sessions_reset; /* sessions torn down in last reset */
+ char last_reason[CEPH_CLIENT_RESET_REASON_LEN]; /* operator-supplied reason */
+
+ /* Request blocking during reset */
+ wait_queue_head_t blocked_wq; /* waitqueue for blocked callers */
+ atomic_t blocked_requests; /* count of blocked callers */
+};
+
+static inline bool ceph_reset_is_idle(struct ceph_client_reset_state *st)
+{
+ return READ_ONCE(st->phase) == CEPH_CLIENT_RESET_IDLE;
+}
struct ceph_mds_cap_match {
s64 uid; /* default to MDS_AUTH_UID_ANY */
u32 num_gids;
@@ -543,6 +583,8 @@ struct ceph_mds_client {
struct list_head dentry_dir_leases; /* lru list */
struct ceph_client_metric metric;
+ struct work_struct reset_work;
+ struct ceph_client_reset_state reset_state;
struct ceph_subvolume_metrics_tracker subvol_metrics;
/* Subvolume metrics send tracking */
@@ -574,10 +616,14 @@ extern struct ceph_mds_session *
__ceph_lookup_mds_session(struct ceph_mds_client *, int mds);
extern const char *ceph_session_state_name(int s);
+extern const char *ceph_reset_phase_name(enum ceph_client_reset_phase phase);
extern struct ceph_mds_session *
ceph_get_mds_session(struct ceph_mds_session *s);
extern void ceph_put_mds_session(struct ceph_mds_session *s);
+int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
+ const char *reason);
+int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc);
extern int ceph_mdsc_init(struct ceph_fs_client *fsc);
extern void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc);
--
2.34.1
* Re: [EXTERNAL] [PATCH v4 05/11] ceph: add client reset state machine and session teardown
2026-05-07 12:27 ` [PATCH v4 05/11] ceph: add client reset state machine and session teardown Alex Markuze
@ 2026-05-07 19:17 ` Viacheslav Dubeyko
0 siblings, 0 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-05-07 19:17 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> Add the client-side reset state machine, request gating, and manual
> session teardown implementation.
>
> Manual reset is an operator-triggered escape hatch for client/MDS
> stalemates in which caps, locks, or unsafe metadata state stop making
> forward progress. The reset blocks new metadata work, attempts a
> bounded best-effort drain of dirty client state while sessions are
> still alive, and finally asks the MDS to close sessions before tearing
> local session state down directly.
>
> The reset state machine tracks four phases: IDLE -> QUIESCING ->
> DRAINING -> TEARDOWN -> IDLE. QUIESCING is set synchronously by
> schedule_reset() before the workqueue item is dispatched, so that new
> metadata requests and file-lock acquisitions are gated immediately --
> even before the work function begins running. All non-IDLE phases
> block callers on blocked_wq, preventing races with session teardown.
>
> The drain phase flushes mdlog state, dirty caps, and pending cap
> releases for a bounded interval. State that still cannot make progress
> within that interval is discarded during teardown, which is the point
> of the reset: break the stalemate and allow fresh sessions to rebuild
> clean state.
>
> The session teardown follows the established check_new_map()
> forced-close pattern: unregister sessions under mdsc->mutex, then clean
> up caps and requests under s->s_mutex. Reconnect is not attempted
> because the MDS only accepts reconnects during its own RECONNECT phase
> after restart, not from an active client.
>
> Blocked callers are released when reset completes and observe the final
> result via -EAGAIN (reset failed) or 0 (success). Internal work-function
> errors such as -ENOMEM are not propagated to unrelated callers like
> open() or flock(); the detailed error remains in debugfs and
> tracepoints.
>
> The work function checks st->shutdown before each phase transition
> (DRAINING, TEARDOWN) so that a concurrent ceph_mdsc_destroy() is not
> overwritten. If destroy already took ownership, the work function
> releases session references and returns without touching the state.
>
> The timeout calculation for blocked-request waiters uses max_t() to
> prevent jiffies underflow when the deadline has already passed.
>
> The close-grace sleep before teardown is a best-effort nudge to let
> queued REQUEST_CLOSE messages egress; it is not a correctness
> requirement since the MDS still has session_autoclose as a fallback.
>
> The destroy path marks reset as failed and wakes blocked waiters before
> cancel_work_sync() so unmount does not stall.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> fs/ceph/locks.c | 16 ++
> fs/ceph/mds_client.c | 508 +++++++++++++++++++++++++++++++++++++++++++
> fs/ceph/mds_client.h | 46 ++++
> 3 files changed, 570 insertions(+)
>
> diff --git a/fs/ceph/locks.c b/fs/ceph/locks.c
> index c4ff2266bb94..677221bd64e0 100644
> --- a/fs/ceph/locks.c
> +++ b/fs/ceph/locks.c
> @@ -249,6 +249,7 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
> {
> struct inode *inode = file_inode(file);
> struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
> struct ceph_client *cl = ceph_inode_to_client(inode);
> int err = 0;
> u16 op = CEPH_MDS_OP_SETFILELOCK;
> @@ -275,6 +276,13 @@ int ceph_lock(struct file *file, int cmd, struct file_lock *fl)
> return -EIO;
> }
>
> + /* Wait for reset to complete before acquiring new locks */
> + if (op == CEPH_MDS_OP_SETFILELOCK && !lock_is_unlock(fl)) {
> + err = ceph_mdsc_wait_for_reset(mdsc);
> + if (err)
> + return err;
> + }
> +
> if (lock_is_read(fl))
> lock_cmd = CEPH_LOCK_SHARED;
> else if (lock_is_write(fl))
> @@ -311,6 +319,7 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
> {
> struct inode *inode = file_inode(file);
> struct ceph_inode_info *ci = ceph_inode(inode);
> + struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
> struct ceph_client *cl = ceph_inode_to_client(inode);
> int err = 0;
> u8 wait = 0;
> @@ -330,6 +339,13 @@ int ceph_flock(struct file *file, int cmd, struct file_lock *fl)
> return -EIO;
> }
>
> + /* Wait for reset to complete before acquiring new locks */
> + if (!lock_is_unlock(fl)) {
> + err = ceph_mdsc_wait_for_reset(mdsc);
> + if (err)
> + return err;
> + }
> +
> if (IS_SETLKW(cmd))
> wait = 1;
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 6ab5031e697a..ce773b1095da 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -6,6 +6,7 @@
> #include <linux/slab.h>
> #include <linux/gfp.h>
> #include <linux/sched.h>
> +#include <linux/delay.h>
> #include <linux/debugfs.h>
> #include <linux/seq_file.h>
> #include <linux/ratelimit.h>
> @@ -65,6 +66,7 @@ static void __wake_requests(struct ceph_mds_client *mdsc,
> struct list_head *head);
> static void ceph_cap_release_work(struct work_struct *work);
> static void ceph_cap_reclaim_work(struct work_struct *work);
> +static void ceph_mdsc_reset_workfn(struct work_struct *work);
>
> static const struct ceph_connection_operations mds_con_ops;
>
> @@ -3844,6 +3846,22 @@ int ceph_mdsc_submit_request(struct ceph_mds_client *mdsc, struct inode *dir,
> struct ceph_client *cl = mdsc->fsc->client;
> int err = 0;
>
> + /*
> + * If a reset is in progress, wait for it to complete.
> + *
> + * This is best-effort: a request can pass this check just
> + * before the phase leaves IDLE and proceed concurrently with
> + * reset. That is acceptable because (a) such requests will
> + * either complete normally or fail and be retried by the
> + * caller, and (b) adding lock serialization here would
> + * penalize every request for a rare manual operation.
> + */
> + err = ceph_mdsc_wait_for_reset(mdsc);
> + if (err) {
> + doutc(cl, "wait_for_reset failed: %d\n", err);
> + return err;
> + }
> +
> /* take CAP_PIN refs for r_inode, r_parent, r_old_dentry */
> if (req->r_inode)
> ceph_get_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
> @@ -5266,6 +5284,474 @@ static int send_mds_reconnect(struct ceph_mds_client *mdsc,
> return err;
> }
>
> +const char *ceph_reset_phase_name(enum ceph_client_reset_phase phase)
> +{
> + switch (phase) {
> + case CEPH_CLIENT_RESET_IDLE: return "idle";
> + case CEPH_CLIENT_RESET_QUIESCING: return "quiescing";
> + case CEPH_CLIENT_RESET_DRAINING: return "draining";
> + case CEPH_CLIENT_RESET_TEARDOWN: return "teardown";
> + default: return "unknown";
> + }
> +}
> +
> +/**
> + * ceph_mdsc_wait_for_reset - wait for an active reset to complete
> + * @mdsc: MDS client
> + *
> + * Returns 0 if reset completed successfully or no reset was active.
> + * Returns -EAGAIN if reset completed with an error, signalling the
> + * caller to retry. The internal error (e.g. -ENOMEM) is not propagated
> + * because callers like open() or flock() have no way to act on
> + * work-function internals. The detailed error is available via debugfs
> + * reset/status and tracepoints.
> + * Returns -ETIMEDOUT if we timed out waiting.
> + * Returns -ERESTARTSYS if interrupted by signal.
> + */
> +int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
> +{
> + struct ceph_client_reset_state *st = &mdsc->reset_state;
> + struct ceph_client *cl = mdsc->fsc->client;
> + unsigned long deadline = jiffies + CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC * HZ;
> + int blocked_count;
> + long remaining;
> + long wait_ret;
> + int ret;
> +
> + if (ceph_reset_is_idle(st))
> + return 0;
> +
> + blocked_count = atomic_inc_return(&st->blocked_requests);
> + doutc(cl, "request blocked during reset, %d total blocked\n",
> + blocked_count);
> +
> +retry:
> + remaining = max_t(long, deadline - jiffies, 1);
> + wait_ret = wait_event_interruptible_timeout(st->blocked_wq,
> + ceph_reset_is_idle(st),
> + remaining);
> +
> + if (wait_ret == 0) {
> + atomic_dec(&st->blocked_requests);
> + pr_warn_client(cl, "timed out waiting for reset to complete\n");
> + return -ETIMEDOUT;
> + }
> + if (wait_ret < 0) {
> + atomic_dec(&st->blocked_requests);
> + return (int)wait_ret; /* -ERESTARTSYS */
> + }
> +
> + /*
> + * Verify phase is still IDLE under the lock. If another reset
> + * was scheduled between the wake-up and this check, loop back
> + * and wait for it to finish rather than returning a stale result.
> + */
> + spin_lock(&st->lock);
> + if (st->phase != CEPH_CLIENT_RESET_IDLE) {
> + spin_unlock(&st->lock);
> + if (time_before(jiffies, deadline))
> + goto retry;
> + atomic_dec(&st->blocked_requests);
> + return -ETIMEDOUT;
> + }
> + ret = st->last_errno;
> + spin_unlock(&st->lock);
> +
> + atomic_dec(&st->blocked_requests);
> + return ret ? -EAGAIN : 0;
> +}
> +
> +static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret)
> +{
> + struct ceph_client_reset_state *st = &mdsc->reset_state;
> +
> + spin_lock(&st->lock);
> + /*
> + * If destroy already marked us as shut down, it owns the
> + * final bookkeeping and waiter wakeup. Just bail so we
> + * don't overwrite its state.
> + */
> + if (st->shutdown) {
> + spin_unlock(&st->lock);
> + return;
> + }
> + st->last_finish = jiffies;
> + st->last_errno = ret;
> + st->phase = CEPH_CLIENT_RESET_IDLE;
> + if (ret)
> + st->failure_count++;
> + else
> + st->success_count++;
> + spin_unlock(&st->lock);
> +
> + /* Wake up all requests that were blocked waiting for reset */
> + wake_up_all(&st->blocked_wq);
> +
> +}
> +
> +static void ceph_mdsc_reset_workfn(struct work_struct *work)
> +{
> + struct ceph_mds_client *mdsc =
> + container_of(work, struct ceph_mds_client, reset_work);
> + struct ceph_client_reset_state *st = &mdsc->reset_state;
> + struct ceph_client *cl = mdsc->fsc->client;
> + struct ceph_mds_session **sessions = NULL;
> + char reason[CEPH_CLIENT_RESET_REASON_LEN];
> + unsigned long drain_deadline;
> + int max_sessions, i, n = 0, torn_down = 0;
> + int ret = 0;
> +
> + spin_lock(&st->lock);
> + strscpy(reason, st->last_reason, sizeof(reason));
> + spin_unlock(&st->lock);
> +
> + mutex_lock(&mdsc->mutex);
> + max_sessions = mdsc->max_sessions;
> + if (max_sessions <= 0) {
> + mutex_unlock(&mdsc->mutex);
> + goto out_complete;
> + }
> +
> + sessions = kcalloc(max_sessions, sizeof(*sessions), GFP_KERNEL);
> + if (!sessions) {
> + mutex_unlock(&mdsc->mutex);
> + ret = -ENOMEM;
> + pr_err_client(cl,
> + "manual session reset failed to allocate session array\n");
> + ceph_mdsc_reset_complete(mdsc, ret);
> + return;
> + }
> +
> + for (i = 0; i < max_sessions; i++) {
> + struct ceph_mds_session *session = mdsc->sessions[i];
> +
> + if (!session)
> + continue;
> +
> + /*
> + * Read session state without s_mutex to avoid nesting
> + * mdsc->mutex -> s_mutex, which would invert the
> + * s_mutex -> mdsc->mutex order used by
> + * cleanup_session_requests(). s_state is an int
> + * so loads are atomic; the teardown loop below
> + * handles races with concurrent state transitions.
> + */
> + switch (READ_ONCE(session->s_state)) {
> + case CEPH_MDS_SESSION_OPEN:
> + case CEPH_MDS_SESSION_HUNG:
> + case CEPH_MDS_SESSION_OPENING:
> + case CEPH_MDS_SESSION_RESTARTING:
> + case CEPH_MDS_SESSION_RECONNECTING:
> + case CEPH_MDS_SESSION_CLOSING:
> + sessions[n++] = ceph_get_mds_session(session);
> + break;
> + default:
> + pr_info_client(cl,
> + "mds%d in state %s, skipping reset\n",
> + session->s_mds,
> + ceph_session_state_name(session->s_state));
> + break;
> + }
> + }
> + mutex_unlock(&mdsc->mutex);
> +
> + pr_info_client(cl,
> + "manual session reset executing (sessions=%d, reason=\"%s\")\n",
> + n, reason);
> +
> + if (n == 0) {
> + kfree(sessions);
> + goto out_complete;
> + }
> +
> + spin_lock(&st->lock);
> + if (st->shutdown) {
> + spin_unlock(&st->lock);
> + goto out_sessions;
> + }
> + st->phase = CEPH_CLIENT_RESET_DRAINING;
> + spin_unlock(&st->lock);
> +
> + /*
> + * Best-effort drain: flush dirty state while sessions are still
> + * alive. New requests are blocked while phase != IDLE.
> + * The sessions are functional, so non-stuck state drains normally.
> + * Stuck state (the cause of the stalemate the operator is trying
> + * to break) will not drain -- that is expected, and we proceed to
> + * forced teardown after the timeout.
> + *
> + * Four things are drained:
> + * 1. MDS journal -- send_flush_mdlog asks each MDS to journal
> + * pending unsafe operations (creates, renames, setattrs).
> + * 2. Unsafe requests -- bounded wait for each unsafe write
> + * request to reach safe status via r_safe_completion.
> + * 3. Dirty caps -- ceph_flush_dirty_caps triggers cap flush on
> + * all sessions. Non-stuck caps flush in milliseconds.
> + * 4. Cap releases -- push pending cap release messages.
> + *
> + * The unsafe-request wait and cap-flush wait below provide
> + * the bounded drain window during which all categories can
> + * make progress.
> + */
> + for (i = 0; i < n; i++)
> + send_flush_mdlog(sessions[i]);
> +
> + /*
> + * Both drain legs (unsafe requests and cap flushes) share a
> + * single deadline so the total drain time is bounded at
> + * CEPH_CLIENT_RESET_DRAIN_SEC.
> + */
> + drain_deadline = jiffies + CEPH_CLIENT_RESET_DRAIN_SEC * HZ;
> +
> + /*
> + * Wait for unsafe write requests (creates, renames, setattrs)
> + * to reach safe status. Uses the same pattern as
> + * flush_mdlog_and_wait_mdsc_unsafe_requests() but bounded by
> + * the shared drain deadline. Requests that do not complete within
> + * the window are force-dropped during teardown.
> + */
> + {
> + struct ceph_mds_request *req;
> + struct rb_node *rn;
> + u64 last_tid;
> +
> + mutex_lock(&mdsc->mutex);
> + last_tid = mdsc->last_tid;
> + rn = rb_first(&mdsc->request_tree);
> + while (rn) {
> + req = rb_entry(rn, struct ceph_mds_request, r_node);
> + if (req->r_tid > last_tid)
> + break;
> + if (req->r_op == CEPH_MDS_OP_SETFILELOCK ||
> + !(req->r_op & CEPH_MDS_OP_WRITE)) {
> + rn = rb_next(rn);
> + continue;
> + }
> + ceph_mdsc_get_request(req);
> + mutex_unlock(&mdsc->mutex);
> +
> + wait_for_completion_timeout(&req->r_safe_completion,
> + max_t(long, drain_deadline - jiffies, 1));
> +
> + mutex_lock(&mdsc->mutex);
> + ceph_mdsc_put_request(req);
> + if (time_after(jiffies, drain_deadline))
> + break;
> + rn = rb_first(&mdsc->request_tree);
> + }
> + mutex_unlock(&mdsc->mutex);
> +
> + if (time_after_eq(jiffies, drain_deadline))
> + WRITE_ONCE(st->drain_timed_out, true);
> + }
> +
> + ceph_flush_dirty_caps(mdsc);
> + ceph_flush_cap_releases(mdsc);
> +
> + spin_lock(&mdsc->cap_dirty_lock);
> + if (!list_empty(&mdsc->cap_flush_list)) {
> + struct ceph_cap_flush *cf =
> + list_last_entry(&mdsc->cap_flush_list,
> + struct ceph_cap_flush, g_list);
> + u64 want_flush = mdsc->last_cap_flush_tid;
> + long drain_ret;
> +
> + /*
> + * Setting wake on the last entry is sufficient: flush
> + * entries complete in order, so when this entry finishes
> + * all earlier ones are already done.
> + */
> + cf->wake = true;
> + spin_unlock(&mdsc->cap_dirty_lock);
> + pr_info_client(cl,
> + "draining (want_flush=%llu, %d sessions)\n",
> + want_flush, n);
> + drain_ret = wait_event_timeout(mdsc->cap_flushing_wq,
> + check_caps_flush(mdsc,
> + want_flush),
> + max_t(long,
> + drain_deadline - jiffies,
> + 1));
> + if (drain_ret == 0) {
> + pr_info_client(cl,
> + "drain timed out, proceeding with forced teardown\n");
> + WRITE_ONCE(st->drain_timed_out, true);
> + } else {
> + pr_info_client(cl, "drain completed successfully\n");
> + }
> + } else {
> + spin_unlock(&mdsc->cap_dirty_lock);
> + }
> +
> + spin_lock(&st->lock);
> + if (st->shutdown) {
> + spin_unlock(&st->lock);
> + goto out_sessions;
> + }
> + st->phase = CEPH_CLIENT_RESET_TEARDOWN;
> + spin_unlock(&st->lock);
> +
> + /*
> + * Ask each MDS to close the session before we tear it down
> + * locally. Without this the MDS sees only a connection drop and
> + * waits for the client to reconnect (up to session_autoclose
> + * seconds) before evicting the session and releasing locks.
> + *
> + * Reuse the normal close machinery so the session state/sequence
> + * snapshot is serialized under s_mutex and a racing s_seq bump
> + * retransmits REQUEST_CLOSE while the session remains CLOSING.
> + * We send all close requests first, then yield briefly to let the
> + * network stack transmit them before __unregister_session()
> + * closes the connections.
> + */
> + for (i = 0; i < n; i++) {
> + int err;
> +
> + mutex_lock(&sessions[i]->s_mutex);
> + err = __close_session(mdsc, sessions[i]);
> + mutex_unlock(&sessions[i]->s_mutex);
> + if (err < 0)
> + pr_warn_client(cl,
> + "mds%d failed to queue close request before reset: %d\n",
> + sessions[i]->s_mds, err);
> + }
> + /*
> + * Best-effort grace period: yield briefly so the network stack
> + * can transmit the queued REQUEST_CLOSE messages before we tear
> + * down connections. Not a correctness requirement -- the MDS
> + * will still evict via session_autoclose if it never receives
> + * the close request.
> + *
> + * Event-based waiting is not viable here: there is no completion
> + * event for "message left the NIC," and waiting for the MDS
> + * SESSION_CLOSE response would re-create the stalemate that the
> + * reset is meant to break.
> + */
> + if (n > 0)
> + msleep(CEPH_CLIENT_RESET_CLOSE_GRACE_MS);
> +
> + /*
> + * Tear down each session: close the connection, remove all
> + * caps, clean up requests, then kick pending requests so they
> + * re-open a fresh session on the next attempt.
> + *
> + * This is modeled on the check_new_map() forced-close path
> + * for stopped MDS ranks -- a proven pattern for hard session
> + * teardown. We do NOT attempt send_mds_reconnect() because
> + * the MDS only accepts reconnects during its own RECONNECT
> + * phase (after MDS restart), not from an active client.
> + *
> + * Any state that did not drain (caps that didn't flush, unsafe
> + * requests that the MDS didn't journal) is force-dropped here.
> + * This is intentional: that state is stuck and is the reason
> + * the operator triggered the reset.
> + */
> + for (i = 0; i < n; i++) {
> + int mds = sessions[i]->s_mds;
> +
> + pr_info_client(cl, "mds%d resetting session\n", mds);
> +
> + mutex_lock(&mdsc->mutex);
> + if (mds >= mdsc->max_sessions ||
> + mdsc->sessions[mds] != sessions[i]) {
> + pr_info_client(cl,
> + "mds%d session already torn down, skipping\n",
> + mds);
> + mutex_unlock(&mdsc->mutex);
> + ceph_put_mds_session(sessions[i]);
> + sessions[i] = NULL;
> + continue;
> + }
> + sessions[i]->s_state = CEPH_MDS_SESSION_CLOSED;
> + __unregister_session(mdsc, sessions[i]);
> + __wake_requests(mdsc, &sessions[i]->s_waiting);
> + mutex_unlock(&mdsc->mutex);
> +
> + mutex_lock(&sessions[i]->s_mutex);
> + cleanup_session_requests(mdsc, sessions[i]);
> + remove_session_caps(sessions[i]);
> + mutex_unlock(&sessions[i]->s_mutex);
> +
> + wake_up_all(&mdsc->session_close_wq);
> +
> + ceph_put_mds_session(sessions[i]);
> +
> + mutex_lock(&mdsc->mutex);
> + kick_requests(mdsc, mds);
> + mutex_unlock(&mdsc->mutex);
> +
> + torn_down++;
> + pr_info_client(cl, "mds%d session reset complete\n", mds);
> + }
> +
> + kfree(sessions);
> +
> + spin_lock(&st->lock);
> + st->sessions_reset = torn_down;
> + spin_unlock(&st->lock);
> +
> +out_complete:
> + ceph_mdsc_reset_complete(mdsc, ret);
> + return;
> +
> +out_sessions:
> + /* shutdown == true: ceph_mdsc_destroy() owns the final transition. */
> + for (i = 0; i < n; i++)
> + ceph_put_mds_session(sessions[i]);
> + kfree(sessions);
> +}
This function contains several code blocks that could be extracted into
static inline functions. That would make ceph_mdsc_reset_workfn() shorter
and easier to understand.
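
For example, the session-snapshot loop at the top could move into a
helper along these lines (name and exact shape are only a sketch, not
compiled or tested):

static int reset_snapshot_sessions(struct ceph_mds_client *mdsc,
				   struct ceph_mds_session **sessions)
{
	struct ceph_client *cl = mdsc->fsc->client;
	int i, n = 0;

	lockdep_assert_held(&mdsc->mutex);

	for (i = 0; i < mdsc->max_sessions; i++) {
		struct ceph_mds_session *s = mdsc->sessions[i];

		if (!s)
			continue;

		/* Same lockless s_state check as in the patch. */
		switch (READ_ONCE(s->s_state)) {
		case CEPH_MDS_SESSION_OPEN:
		case CEPH_MDS_SESSION_HUNG:
		case CEPH_MDS_SESSION_OPENING:
		case CEPH_MDS_SESSION_RESTARTING:
		case CEPH_MDS_SESSION_RECONNECTING:
		case CEPH_MDS_SESSION_CLOSING:
			sessions[n++] = ceph_get_mds_session(s);
			break;
		default:
			pr_info_client(cl,
				       "mds%d in state %s, skipping reset\n",
				       s->s_mds,
				       ceph_session_state_name(s->s_state));
			break;
		}
	}

	return n;
}

The unsafe-request drain and the cap-flush drain look like natural
helpers too; then ceph_mdsc_reset_workfn() would read as a short
sequence of named phases.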
> +
> +int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
> + const char *reason)
> +{
> + struct ceph_client_reset_state *st = &mdsc->reset_state;
> + struct ceph_fs_client *fsc = mdsc->fsc;
> + const char *msg = (reason && reason[0]) ? reason : "manual";
> + int mount_state;
> +
> + mount_state = READ_ONCE(fsc->mount_state);
> + if (mount_state != CEPH_MOUNT_MOUNTED) {
> + pr_warn_client(fsc->client,
> + "reset rejected: mount_state=%d (not mounted)\n",
> + mount_state);
> + return -EINVAL;
> + }
> +
> + spin_lock(&st->lock);
> + if (st->phase != CEPH_CLIENT_RESET_IDLE) {
> + spin_unlock(&st->lock);
> + return -EBUSY;
> + }
> +
> + st->phase = CEPH_CLIENT_RESET_QUIESCING;
> + st->last_start = jiffies;
> + st->last_errno = 0;
> + st->drain_timed_out = false;
> + st->sessions_reset = 0;
> + st->trigger_count++;
> + strscpy(st->last_reason, msg, sizeof(st->last_reason));
> + spin_unlock(&st->lock);
> +
> + if (WARN_ON_ONCE(!queue_work(system_unbound_wq, &mdsc->reset_work))) {
> + spin_lock(&st->lock);
> + st->phase = CEPH_CLIENT_RESET_IDLE;
> + st->last_errno = -EALREADY;
> + st->last_finish = jiffies;
> + st->failure_count++;
> + spin_unlock(&st->lock);
> + wake_up_all(&st->blocked_wq);
> + return -EALREADY;
> + }
> +
> + pr_info_client(mdsc->fsc->client,
> + "manual session reset scheduled (reason=\"%s\")\n",
> + msg);
> + return 0;
> +}
> +
>
> /*
> * compare old and new mdsmaps, kicking requests
> @@ -5811,6 +6297,11 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
> INIT_LIST_HEAD(&mdsc->dentry_leases);
> INIT_LIST_HEAD(&mdsc->dentry_dir_leases);
>
> + spin_lock_init(&mdsc->reset_state.lock);
> + init_waitqueue_head(&mdsc->reset_state.blocked_wq);
> + atomic_set(&mdsc->reset_state.blocked_requests, 0);
> + INIT_WORK(&mdsc->reset_work, ceph_mdsc_reset_workfn);
> +
> ceph_caps_init(mdsc);
> ceph_adjust_caps_max_min(mdsc, fsc->mount_options);
>
> @@ -6336,6 +6827,23 @@ void ceph_mdsc_destroy(struct ceph_fs_client *fsc)
> /* flush out any connection work with references to us */
> ceph_msgr_flush();
>
> + /*
> + * Mark reset as failed and wake any blocked waiters before
> + * cancelling, so unmount doesn't stall on blocked_wq timeout
> + * if cancel_work_sync() prevents the work from running.
> + */
> + spin_lock(&mdsc->reset_state.lock);
> + mdsc->reset_state.shutdown = true;
> + if (mdsc->reset_state.phase != CEPH_CLIENT_RESET_IDLE) {
> + mdsc->reset_state.phase = CEPH_CLIENT_RESET_IDLE;
> + mdsc->reset_state.last_errno = -ESHUTDOWN;
> + mdsc->reset_state.last_finish = jiffies;
> + mdsc->reset_state.failure_count++;
> + }
> + spin_unlock(&mdsc->reset_state.lock);
> + wake_up_all(&mdsc->reset_state.blocked_wq);
> +
> + cancel_work_sync(&mdsc->reset_work);
> ceph_mdsc_stop(mdsc);
>
> ceph_metric_destroy(&mdsc->metric);
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index 8208fdf02efe..b1a0621cd37e 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -80,7 +80,47 @@ struct ceph_cap;
> #define CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC 60
> #define CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES 5
> #define CEPH_CAP_FLUSH_MAX_DUMP_ITERS 5
> +#define CEPH_CLIENT_RESET_REASON_LEN 64
> +#define CEPH_CLIENT_RESET_DRAIN_SEC 30
> +#define CEPH_CLIENT_RESET_CLOSE_GRACE_MS 100
> +#define CEPH_CLIENT_RESET_WAIT_TIMEOUT_SEC 120
>
> +enum ceph_client_reset_phase {
> + CEPH_CLIENT_RESET_IDLE = 0,
> + /*
> + * QUIESCING is set synchronously by schedule_reset() before the
> + * workqueue item is dispatched. It gates new requests (any
> + * phase != IDLE blocks callers) during the window between
> + * scheduling and the work function's transition to DRAINING.
> + */
> + CEPH_CLIENT_RESET_QUIESCING,
> + CEPH_CLIENT_RESET_DRAINING,
> + CEPH_CLIENT_RESET_TEARDOWN,
> +};
> +
> +struct ceph_client_reset_state {
> + spinlock_t lock; /* protects all fields below */
> + u64 trigger_count; /* number of resets triggered */
> + u64 success_count; /* number of successful resets */
> + u64 failure_count; /* number of failed resets */
> + unsigned long last_start; /* jiffies when last reset started */
> + unsigned long last_finish; /* jiffies when last reset finished */
> + int last_errno; /* result of most recent reset */
> + enum ceph_client_reset_phase phase; /* current reset phase */
> + bool drain_timed_out; /* drain exceeded timeout */
> + bool shutdown; /* destroy in progress */
> + int sessions_reset; /* sessions torn down in last reset */
> + char last_reason[CEPH_CLIENT_RESET_REASON_LEN]; /* operator-supplied reason */
> +
> + /* Request blocking during reset */
> + wait_queue_head_t blocked_wq; /* waitqueue for blocked callers */
> + atomic_t blocked_requests; /* count of blocked callers */
> +};
> +
> +static inline bool ceph_reset_is_idle(struct ceph_client_reset_state *st)
> +{
> + return READ_ONCE(st->phase) == CEPH_CLIENT_RESET_IDLE;
> +}
> struct ceph_mds_cap_match {
> s64 uid; /* default to MDS_AUTH_UID_ANY */
> u32 num_gids;
> @@ -543,6 +583,8 @@ struct ceph_mds_client {
> struct list_head dentry_dir_leases; /* lru list */
>
> struct ceph_client_metric metric;
> + struct work_struct reset_work;
> + struct ceph_client_reset_state reset_state;
> struct ceph_subvolume_metrics_tracker subvol_metrics;
>
> /* Subvolume metrics send tracking */
> @@ -574,10 +616,14 @@ extern struct ceph_mds_session *
> __ceph_lookup_mds_session(struct ceph_mds_client *, int mds);
>
> extern const char *ceph_session_state_name(int s);
> +extern const char *ceph_reset_phase_name(enum ceph_client_reset_phase phase);
>
> extern struct ceph_mds_session *
> ceph_get_mds_session(struct ceph_mds_session *s);
> extern void ceph_put_mds_session(struct ceph_mds_session *s);
> +int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
> + const char *reason);
> +int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc);
>
> extern int ceph_mdsc_init(struct ceph_fs_client *fsc);
> extern void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc);
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Thanks,
Slava.
* [PATCH v4 06/11] ceph: add manual reset debugfs control and tracepoints
2026-05-07 12:27 [PATCH v4 00/11] ceph: manual client session reset Alex Markuze
` (4 preceding siblings ...)
2026-05-07 12:27 ` [PATCH v4 05/11] ceph: add client reset state machine and session teardown Alex Markuze
@ 2026-05-07 12:27 ` Alex Markuze
2026-05-07 19:22 ` [EXTERNAL] " Viacheslav Dubeyko
2026-05-07 12:27 ` [PATCH v4 07/11] selftests: ceph: add reset consistency checker Alex Markuze
` (5 subsequent siblings)
11 siblings, 1 reply; 24+ messages in thread
From: Alex Markuze @ 2026-05-07 12:27 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Add the debugfs and trace plumbing used to trigger and observe
manual client reset.
The reset interface exposes a trigger file for operator-initiated
reset and a status file for tracking the most recent run. The
tracepoints record scheduling, completion, and blocked caller
behavior so reset progress can be diagnosed from the client side.
debugfs layout under /sys/kernel/debug/ceph/<client>/reset/:
trigger - write to initiate a manual reset
status - read to see the most recent reset result
The reset directory is cleaned up via debugfs_remove_recursive()
on the parent, so individual file dentries are not stored.
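As an illustration, an operator or test harness could drive this interface
as follows (a Python sketch; the debugfs path follows the layout above, and
the two helper functions are hypothetical, not part of the patch):

```python
from pathlib import Path

def trigger_reset(reset_dir, reason="manual"):
    # Write an operator-supplied reason to the trigger file; the kernel
    # trims whitespace and truncates it to the reason buffer size.
    (Path(reset_dir) / "trigger").write_text(reason + "\n")

def read_reset_status(reset_dir):
    # Parse the "key: value" lines emitted by reset_status_show()
    # into a plain dict of strings.
    status = {}
    for line in (Path(reset_dir) / "status").read_text().splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            status[key.strip()] = value.strip()
    return status
```

Typical use would be trigger_reset("/sys/kernel/debug/ceph/<client>/reset",
"stuck caps"), then polling read_reset_status(...)["phase"] until the state
machine returns to idle.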
Tracepoints:
ceph_client_reset_schedule - reset queued
ceph_client_reset_complete - reset finished (success or failure)
ceph_client_reset_blocked - caller blocked waiting for reset
ceph_client_reset_unblocked - caller unblocked after reset
All tracepoints use a null-safe access for monc.auth->global_id
to guard against early-init or late-teardown edge cases.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
fs/ceph/debugfs.c | 103 ++++++++++++++++++++++++++++++++++++
fs/ceph/mds_client.c | 7 +++
fs/ceph/super.h | 1 +
include/trace/events/ceph.h | 67 +++++++++++++++++++++++
4 files changed, 178 insertions(+)
diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index e2463f93cf6b..18eb5da03411 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -9,6 +9,7 @@
#include <linux/seq_file.h>
#include <linux/math64.h>
#include <linux/ktime.h>
+#include <linux/uaccess.h>
#include <linux/atomic.h>
#include <linux/ceph/libceph.h>
@@ -392,6 +393,90 @@ static int status_show(struct seq_file *s, void *p)
return 0;
}
+static int reset_status_show(struct seq_file *s, void *p)
+{
+ struct ceph_fs_client *fsc = s->private;
+ struct ceph_mds_client *mdsc = fsc->mdsc;
+ struct ceph_client_reset_state *st;
+ u64 trigger = 0, success = 0, failure = 0;
+ unsigned long last_start = 0, last_finish = 0;
+ int last_errno = 0;
+ enum ceph_client_reset_phase phase = CEPH_CLIENT_RESET_IDLE;
+ bool drain_timed_out = false;
+ int sessions_reset = 0;
+ int blocked_requests = 0;
+ char reason[CEPH_CLIENT_RESET_REASON_LEN];
+
+ if (!mdsc)
+ return 0;
+
+ st = &mdsc->reset_state;
+
+ spin_lock(&st->lock);
+ trigger = st->trigger_count;
+ success = st->success_count;
+ failure = st->failure_count;
+ last_start = st->last_start;
+ last_finish = st->last_finish;
+ last_errno = st->last_errno;
+ phase = st->phase;
+ drain_timed_out = st->drain_timed_out;
+ sessions_reset = st->sessions_reset;
+ strscpy(reason, st->last_reason, sizeof(reason));
+ spin_unlock(&st->lock);
+
+ blocked_requests = atomic_read(&st->blocked_requests);
+
+ seq_printf(s, "phase: %s\n", ceph_reset_phase_name(phase));
+ seq_printf(s, "trigger_count: %llu\n", trigger);
+ seq_printf(s, "success_count: %llu\n", success);
+ seq_printf(s, "failure_count: %llu\n", failure);
+ if (last_start)
+ seq_printf(s, "last_start_ms_ago: %u\n",
+ jiffies_to_msecs(jiffies - last_start));
+ else
+ seq_puts(s, "last_start_ms_ago: (never)\n");
+ if (last_finish)
+ seq_printf(s, "last_finish_ms_ago: %u\n",
+ jiffies_to_msecs(jiffies - last_finish));
+ else
+ seq_puts(s, "last_finish_ms_ago: (never)\n");
+ seq_printf(s, "last_errno: %d\n", last_errno);
+ seq_printf(s, "last_reason: %s\n",
+ reason[0] ? reason : "(none)");
+ seq_printf(s, "drain_timed_out: %s\n",
+ drain_timed_out ? "yes" : "no");
+ seq_printf(s, "sessions_reset: %d\n", sessions_reset);
+ seq_printf(s, "blocked_requests: %d\n", blocked_requests);
+
+ return 0;
+}
+
+static ssize_t reset_trigger_write(struct file *file, const char __user *buf,
+ size_t len, loff_t *ppos)
+{
+ struct ceph_fs_client *fsc = file->private_data;
+ struct ceph_mds_client *mdsc = fsc->mdsc;
+ char reason[CEPH_CLIENT_RESET_REASON_LEN];
+ size_t copy;
+ int ret;
+
+ if (!mdsc)
+ return -ENODEV;
+
+ copy = min_t(size_t, len, sizeof(reason) - 1);
+ if (copy && copy_from_user(reason, buf, copy))
+ return -EFAULT;
+ reason[copy] = '\0';
+ strim(reason);
+
+ ret = ceph_mdsc_schedule_reset(mdsc, reason);
+ if (ret)
+ return ret;
+
+ return len;
+}
+
static int subvolume_metrics_show(struct seq_file *s, void *p)
{
struct ceph_fs_client *fsc = s->private;
@@ -450,6 +535,7 @@ DEFINE_SHOW_ATTRIBUTE(mdsc);
DEFINE_SHOW_ATTRIBUTE(caps);
DEFINE_SHOW_ATTRIBUTE(mds_sessions);
DEFINE_SHOW_ATTRIBUTE(status);
+DEFINE_SHOW_ATTRIBUTE(reset_status);
DEFINE_SHOW_ATTRIBUTE(metrics_file);
DEFINE_SHOW_ATTRIBUTE(metrics_latency);
DEFINE_SHOW_ATTRIBUTE(metrics_size);
@@ -521,6 +607,13 @@ static int metric_features_show(struct seq_file *s, void *p)
DEFINE_SHOW_ATTRIBUTE(metric_features);
+static const struct file_operations ceph_reset_trigger_fops = {
+ .owner = THIS_MODULE,
+ .open = simple_open,
+ .write = reset_trigger_write,
+ .llseek = noop_llseek,
+};
+
/*
* debugfs
*/
@@ -554,6 +647,7 @@ void ceph_fs_debugfs_cleanup(struct ceph_fs_client *fsc)
debugfs_remove(fsc->debugfs_caps);
debugfs_remove(fsc->debugfs_status);
debugfs_remove(fsc->debugfs_mdsc);
+ debugfs_remove_recursive(fsc->debugfs_reset_dir);
debugfs_remove(fsc->debugfs_subvolume_metrics);
debugfs_remove_recursive(fsc->debugfs_metrics_dir);
doutc(fsc->client, "done\n");
@@ -602,6 +696,15 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
fsc,
&caps_fops);
+ fsc->debugfs_reset_dir = debugfs_create_dir("reset",
+ fsc->client->debugfs_dir);
+ debugfs_create_file("trigger", 0200,
+ fsc->debugfs_reset_dir, fsc,
+ &ceph_reset_trigger_fops);
+ debugfs_create_file("status", 0400,
+ fsc->debugfs_reset_dir, fsc,
+ &reset_status_fops);
+
fsc->debugfs_status = debugfs_create_file("status",
0400,
fsc->client->debugfs_dir,
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index ce773b1095da..b16638ebff7f 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -5324,6 +5324,7 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
blocked_count = atomic_inc_return(&st->blocked_requests);
doutc(cl, "request blocked during reset, %d total blocked\n",
blocked_count);
+ trace_ceph_client_reset_blocked(mdsc, blocked_count);
retry:
remaining = max_t(long, deadline - jiffies, 1);
@@ -5334,10 +5335,12 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
if (wait_ret == 0) {
atomic_dec(&st->blocked_requests);
pr_warn_client(cl, "timed out waiting for reset to complete\n");
+ trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
return -ETIMEDOUT;
}
if (wait_ret < 0) {
atomic_dec(&st->blocked_requests);
+ trace_ceph_client_reset_unblocked(mdsc, (int)wait_ret);
return (int)wait_ret; /* -ERESTARTSYS */
}
@@ -5352,12 +5355,14 @@ int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
if (time_before(jiffies, deadline))
goto retry;
atomic_dec(&st->blocked_requests);
+ trace_ceph_client_reset_unblocked(mdsc, -ETIMEDOUT);
return -ETIMEDOUT;
}
ret = st->last_errno;
spin_unlock(&st->lock);
atomic_dec(&st->blocked_requests);
+ trace_ceph_client_reset_unblocked(mdsc, ret);
return ret ? -EAGAIN : 0;
}
@@ -5387,6 +5392,7 @@ static void ceph_mdsc_reset_complete(struct ceph_mds_client *mdsc, int ret)
/* Wake up all requests that were blocked waiting for reset */
wake_up_all(&st->blocked_wq);
+ trace_ceph_client_reset_complete(mdsc, ret);
}
static void ceph_mdsc_reset_workfn(struct work_struct *work)
@@ -5749,6 +5755,7 @@ int ceph_mdsc_schedule_reset(struct ceph_mds_client *mdsc,
pr_info_client(mdsc->fsc->client,
"manual session reset scheduled (reason=\"%s\")\n",
msg);
+ trace_ceph_client_reset_schedule(mdsc, msg);
return 0;
}
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index a4993644d543..1d6aab060780 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -179,6 +179,7 @@ struct ceph_fs_client {
struct dentry *debugfs_status;
struct dentry *debugfs_mds_sessions;
struct dentry *debugfs_metrics_dir;
+ struct dentry *debugfs_reset_dir;
struct dentry *debugfs_subvolume_metrics;
#endif
diff --git a/include/trace/events/ceph.h b/include/trace/events/ceph.h
index 08cb0659fbfc..1b990632f62b 100644
--- a/include/trace/events/ceph.h
+++ b/include/trace/events/ceph.h
@@ -226,6 +226,73 @@ TRACE_EVENT(ceph_handle_caps,
__entry->mseq)
);
+/*
+ * Client reset tracepoints - identify the client by its monitor-
+ * assigned global_id so traces remain meaningful when kernel pointer
+ * hashing is enabled.
+ */
+TRACE_EVENT(ceph_client_reset_schedule,
+ TP_PROTO(const struct ceph_mds_client *mdsc, const char *reason),
+ TP_ARGS(mdsc, reason),
+ TP_STRUCT__entry(
+ __field(u64, client_id)
+ __string(reason, reason ? reason : "")
+ ),
+ TP_fast_assign(
+ __entry->client_id = mdsc->fsc->client->monc.auth ?
+ mdsc->fsc->client->monc.auth->global_id : 0;
+ __assign_str(reason);
+ ),
+ TP_printk("client_id=%llu reason=%s",
+ __entry->client_id, __get_str(reason))
+);
+
+TRACE_EVENT(ceph_client_reset_complete,
+ TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+ TP_ARGS(mdsc, ret),
+ TP_STRUCT__entry(
+ __field(u64, client_id)
+ __field(int, ret)
+ ),
+ TP_fast_assign(
+ __entry->client_id = mdsc->fsc->client->monc.auth ?
+ mdsc->fsc->client->monc.auth->global_id : 0;
+ __entry->ret = ret;
+ ),
+ TP_printk("client_id=%llu ret=%d", __entry->client_id, __entry->ret)
+);
+
+TRACE_EVENT(ceph_client_reset_blocked,
+ TP_PROTO(const struct ceph_mds_client *mdsc, int blocked_count),
+ TP_ARGS(mdsc, blocked_count),
+ TP_STRUCT__entry(
+ __field(u64, client_id)
+ __field(int, blocked_count)
+ ),
+ TP_fast_assign(
+ __entry->client_id = mdsc->fsc->client->monc.auth ?
+ mdsc->fsc->client->monc.auth->global_id : 0;
+ __entry->blocked_count = blocked_count;
+ ),
+ TP_printk("client_id=%llu blocked_count=%d", __entry->client_id,
+ __entry->blocked_count)
+);
+
+TRACE_EVENT(ceph_client_reset_unblocked,
+ TP_PROTO(const struct ceph_mds_client *mdsc, int ret),
+ TP_ARGS(mdsc, ret),
+ TP_STRUCT__entry(
+ __field(u64, client_id)
+ __field(int, ret)
+ ),
+ TP_fast_assign(
+ __entry->client_id = mdsc->fsc->client->monc.auth ?
+ mdsc->fsc->client->monc.auth->global_id : 0;
+ __entry->ret = ret;
+ ),
+ TP_printk("client_id=%llu ret=%d", __entry->client_id, __entry->ret)
+);
+
#undef EM
#undef E_
#endif /* _TRACE_CEPH_H */
--
2.34.1
* Re: [EXTERNAL] [PATCH v4 06/11] ceph: add manual reset debugfs control and tracepoints
2026-05-07 12:27 ` [PATCH v4 06/11] ceph: add manual reset debugfs control and tracepoints Alex Markuze
@ 2026-05-07 19:22 ` Viacheslav Dubeyko
0 siblings, 0 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-05-07 19:22 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> Add the debugfs and trace plumbing used to trigger and observe
> manual client reset.
>
> [...]
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Thanks,
Slava.
* [PATCH v4 07/11] selftests: ceph: add reset consistency checker
2026-05-07 12:27 [PATCH v4 00/11] ceph: manual client session reset Alex Markuze
` (5 preceding siblings ...)
2026-05-07 12:27 ` [PATCH v4 06/11] ceph: add manual reset debugfs control and tracepoints Alex Markuze
@ 2026-05-07 12:27 ` Alex Markuze
2026-05-07 19:24 ` [EXTERNAL] " Viacheslav Dubeyko
2026-05-07 12:27 ` [PATCH v4 08/11] selftests: ceph: add reset stress test Alex Markuze
` (4 subsequent siblings)
11 siblings, 1 reply; 24+ messages in thread
From: Alex Markuze @ 2026-05-07 12:27 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Add a Python post-run validator for the CephFS client reset stress
test. The script reads data files written by the stress runner and
checks that every file was either written completely or is missing,
with no partial or corrupted content.
This is a prerequisite for the stress test script which invokes it
after each run.
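The log formats themselves are produced by the stress runner; for
orientation, the validator's io log expects one
"ts_ms,seq,logical_id,relpath,digest" record per line, as in this sketch
(the sample values are made up; parse_io_line() here is a single-line
analogue of the patch's parse_io_log()):

```python
def parse_io_line(line):
    # One io-log record: ts_ms,seq,logical_id,relpath,digest
    parts = line.strip().split(",")
    if len(parts) != 5:
        raise ValueError(f"expected 5 columns, got {len(parts)}")
    ts_ms, seq, logical_id, relpath, digest = parts
    return {
        "ts_ms": int(ts_ms),
        "seq": int(seq),
        "logical_id": int(logical_id),
        "relpath": relpath,   # e.g. "A/file_00007" -- subdir/name
        "digest": digest,     # sha256 of the file at write completion
    }
```

Malformed lines (wrong column count) raise ValueError, matching the
validator's strict parsing.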
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
.../filesystems/ceph/validate_consistency.py | 297 ++++++++++++++++++
1 file changed, 297 insertions(+)
create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consistency.py
diff --git a/tools/testing/selftests/filesystems/ceph/validate_consistency.py b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
new file mode 100755
index 000000000000..c230a59bdb3a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/validate_consistency.py
@@ -0,0 +1,297 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import argparse
+import bisect
+import hashlib
+import json
+import os
+from pathlib import Path
+
+
+def sha256_file(path: Path) -> str:
+ digest = hashlib.sha256()
+ with path.open("rb") as handle:
+ while True:
+ chunk = handle.read(1 << 20)
+ if not chunk:
+ break
+ digest.update(chunk)
+ return digest.hexdigest()
+
+
+def parse_io_log(path: Path):
+ records = []
+ if not path.exists():
+ return records
+ with path.open("r", encoding="utf-8") as handle:
+ for line_no, line in enumerate(handle, 1):
+ line = line.strip()
+ if not line:
+ continue
+ parts = line.split(",")
+ if len(parts) != 5:
+ raise ValueError(f"io log line {line_no}: expected 5 columns, got {len(parts)}")
+ ts_ms, seq, logical_id, relpath, digest = parts
+ records.append(
+ {
+ "ts_ms": int(ts_ms),
+ "seq": int(seq),
+ "logical_id": int(logical_id),
+ "relpath": relpath,
+ "digest": digest,
+ }
+ )
+ return records
+
+
+def parse_rename_log(path: Path):
+ records = []
+ if not path.exists():
+ return records
+ with path.open("r", encoding="utf-8") as handle:
+ for line_no, line in enumerate(handle, 1):
+ line = line.strip()
+ if not line:
+ continue
+ parts = line.split(",")
+ if len(parts) == 6:
+ ts_ms, seq, logical_id, src_rel, dst_rel, rc = parts
+ elif len(parts) == 7:
+ ts_ms, worker_id, seq, logical_id, src_rel, dst_rel, rc = parts
+ _ = worker_id # worker id is informational only
+ else:
+ raise ValueError(
+ f"rename log line {line_no}: expected 6 or 7 columns, got {len(parts)}"
+ )
+ records.append(
+ {
+ "ts_ms": int(ts_ms),
+ "seq": int(seq),
+ "logical_id": int(logical_id),
+ "src_rel": src_rel,
+ "dst_rel": dst_rel,
+ "rc": int(rc),
+ }
+ )
+ return records
+
+
+def parse_reset_log(path: Path):
+ records = []
+ if not path.exists():
+ return records
+ with path.open("r", encoding="utf-8") as handle:
+ for line_no, line in enumerate(handle, 1):
+ line = line.strip()
+ if not line:
+ continue
+ parts = line.split(",")
+ if len(parts) != 4:
+ raise ValueError(f"reset log line {line_no}: expected 4 columns, got {len(parts)}")
+ ts_ms, seq, reason, rc = parts
+ records.append(
+ {
+ "ts_ms": int(ts_ms),
+ "seq": int(seq),
+ "reason": reason,
+ "rc": int(rc),
+ }
+ )
+ return records
+
+
+def parse_status_file(path: Path):
+ status = {}
+ if not path.exists():
+ return status
+ with path.open("r", encoding="utf-8") as handle:
+ for line in handle:
+ line = line.strip()
+ if not line or ":" not in line:
+ continue
+ key, value = line.split(":", 1)
+ status[key.strip()] = value.strip()
+ return status
+
+
+def to_int(value: str, default: int = 0):
+ try:
+ return int(value)
+ except Exception:
+ return default
+
+
+def validate_namespace(data_dir: Path, file_count: int, issues):
+ actual_locations = {}
+ actual_paths = {}
+ for logical_id in range(file_count):
+ name = f"file_{logical_id:05d}"
+ found = []
+ for subdir in ("A", "B"):
+ candidate = data_dir / subdir / name
+ if candidate.exists():
+ found.append((subdir, candidate))
+ if len(found) != 1:
+ issues.append(
+ f"namespace invariant failed for logical_id={logical_id:05d}: expected exactly one file in A/B, found {len(found)}"
+ )
+ continue
+ actual_locations[logical_id] = found[0][0]
+ actual_paths[logical_id] = found[0][1]
+ return actual_locations, actual_paths
+
+
+def validate_rename_invariant(rename_records, actual_locations, issues):
+ expected_locations = {}
+ for rec in sorted(rename_records, key=lambda r: (r["ts_ms"], r["seq"])):
+ if rec["rc"] != 0:
+ continue
+ dst = rec["dst_rel"]
+ if "/" not in dst:
+ continue
+ expected_locations[rec["logical_id"]] = dst.split("/", 1)[0]
+
+ for logical_id, expected in expected_locations.items():
+ actual = actual_locations.get(logical_id)
+ if actual is None:
+ continue
+ if actual != expected:
+ issues.append(
+ f"rename invariant failed for logical_id={logical_id:05d}: expected location={expected}, actual={actual}"
+ )
+
+
+def validate_data_invariant(io_records, actual_paths, issues):
+ expected_hash = {}
+ for rec in sorted(io_records, key=lambda r: (r["ts_ms"], r["seq"])):
+ digest = rec["digest"]
+ if not digest:
+ continue
+ expected_hash[rec["logical_id"]] = digest
+
+ for logical_id, digest in expected_hash.items():
+ path = actual_paths.get(logical_id)
+ if path is None:
+ continue
+ actual_digest = sha256_file(path)
+ if digest != actual_digest:
+ issues.append(
+ f"data invariant failed for logical_id={logical_id:05d}: expected digest={digest}, actual digest={actual_digest}"
+ )
+
+
+def validate_reset_and_slo(args, reset_records, io_records, rename_records, status, issues):
+ if not args.expect_reset:
+ return
+
+ successful_reset_times = [rec["ts_ms"] for rec in reset_records if rec["rc"] == 0]
+ if not successful_reset_times:
+ issues.append("expected reset activity but no successful reset trigger was observed")
+
+ phase = status.get("phase")
+ blocked_requests = to_int(status.get("blocked_requests", "0"), default=-1)
+ last_errno = to_int(status.get("last_errno", "0"), default=1)
+ failure_count = to_int(status.get("failure_count", "0"), default=-1)
+
+ if phase is None:
+ issues.append("missing final reset status file or phase field")
+ elif phase.lower() != "idle":
+ issues.append(f"recovery invariant failed: phase={phase}, expected idle")
+
+ if blocked_requests != 0:
+ issues.append(f"recovery invariant failed: blocked_requests={blocked_requests}, expected 0")
+ if last_errno != 0:
+ issues.append(f"recovery invariant failed: last_errno={last_errno}, expected 0")
+ if failure_count > 0:
+ issues.append(
+ f"recovery invariant failed: failure_count={failure_count}, "
+ "one or more resets failed during the run"
+ )
+
+ op_times = [rec["ts_ms"] for rec in io_records]
+ op_times.extend(rec["ts_ms"] for rec in rename_records if rec["rc"] == 0)
+ op_times.sort()
+
+ if successful_reset_times and not op_times:
+ issues.append("recovery SLO failed: no workload completion events were recorded")
+ return
+
+ slo_ms = args.slo_seconds * 1000
+ for ts in successful_reset_times:
+ idx = bisect.bisect_left(op_times, ts)
+ if idx >= len(op_times):
+ issues.append(f"recovery SLO failed: no operation completion observed after reset at ts_ms={ts}")
+ continue
+ delta = op_times[idx] - ts
+ if delta > slo_ms:
+ issues.append(
+ f"recovery SLO failed: first post-reset completion at {delta}ms exceeds threshold {slo_ms}ms (reset ts_ms={ts})"
+ )
+
+
+def main():
+ parser = argparse.ArgumentParser(description="Validate Ceph reset stress artifacts")
+ parser.add_argument("--data-dir", required=True)
+ parser.add_argument("--file-count", required=True, type=int)
+ parser.add_argument("--io-log", required=True)
+ parser.add_argument("--rename-log", required=True)
+ parser.add_argument("--reset-log", required=True)
+ parser.add_argument("--status-final", required=False, default="")
+ parser.add_argument("--slo-seconds", required=False, type=int, default=30)
+ parser.add_argument("--expect-reset", action="store_true")
+ parser.add_argument("--report-json", required=False, default="")
+ args = parser.parse_args()
+
+ data_dir = Path(args.data_dir)
+ io_log = Path(args.io_log)
+ rename_log = Path(args.rename_log)
+ reset_log = Path(args.reset_log)
+ status_final = Path(args.status_final) if args.status_final else Path("__missing_status__")
+
+ issues = []
+
+ if not data_dir.exists():
+ issues.append(f"data directory is missing: {data_dir}")
+
+ try:
+ io_records = parse_io_log(io_log)
+ rename_records = parse_rename_log(rename_log)
+ reset_records = parse_reset_log(reset_log)
+ except Exception as exc:
+ issues.append(f"log parsing failed: {exc}")
+ io_records = []
+ rename_records = []
+ reset_records = []
+
+ status = parse_status_file(status_final)
+
+ actual_locations, actual_paths = validate_namespace(data_dir, args.file_count, issues)
+ validate_rename_invariant(rename_records, actual_locations, issues)
+ validate_data_invariant(io_records, actual_paths, issues)
+ validate_reset_and_slo(args, reset_records, io_records, rename_records, status, issues)
+
+ report = {
+ "file_count": args.file_count,
+ "io_records": len(io_records),
+ "rename_records": len(rename_records),
+ "reset_records": len(reset_records),
+ "expect_reset": args.expect_reset,
+ "issues": issues,
+ }
+
+ if args.report_json:
+ report_path = Path(args.report_json)
+ report_path.write_text(json.dumps(report, indent=2, sort_keys=True), encoding="utf-8")
+
+ if issues:
+ print("FAIL: consistency validation found issues")
+ for issue in issues:
+ print(f" - {issue}")
+ raise SystemExit(1)
+
+ print("PASS: consistency validation succeeded")
+
+
+if __name__ == "__main__":
+ main()
--
2.34.1
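[Editor's note: the post-reset SLO check in validate_consistency.py's validate_reset_and_slo() reduces to a windowed first-completion search over sorted timestamps. A minimal standalone sketch of that logic follows; the function name is illustrative and not part of the patch.]

```python
import bisect


def first_completion_delays(reset_times_ms, op_times_ms):
    """For each reset, return the delay in ms until the first operation
    that completed at or after it, or None when no completion follows."""
    ops = sorted(op_times_ms)
    delays = []
    for ts in reset_times_ms:
        # bisect_left finds the first completion at or after the reset
        idx = bisect.bisect_left(ops, ts)
        delays.append(ops[idx] - ts if idx < len(ops) else None)
    return delays


# A reset at t=1000 is followed by a completion at t=1500 (delay 500 ms);
# a reset at t=9000 has no later completion, which the validator flags.
print(first_completion_delays([1000, 9000], [500, 1500, 3000]))
```

Each delay would then be compared against `slo_seconds * 1000`, exactly as the validator does per successful reset.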
^ permalink raw reply related [flat|nested] 24+ messages in thread

* Re: [EXTERNAL] [PATCH v4 07/11] selftests: ceph: add reset consistency checker
2026-05-07 12:27 ` [PATCH v4 07/11] selftests: ceph: add reset consistency checker Alex Markuze
@ 2026-05-07 19:24 ` Viacheslav Dubeyko
0 siblings, 0 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-05-07 19:24 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> Add a Python post-run validator for the CephFS client reset stress
> test. The script parses the runner's I/O, rename, and reset logs and
> checks four invariants: every logical file exists in exactly one of
> the A/B directories, file locations match the last successful rename,
> on-disk digests match the last logged write, and the first post-reset
> completion lands within the recovery SLO window.
>
> This is a prerequisite for the stress test script which invokes it
> after each run.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> .../filesystems/ceph/validate_consistency.py | 297 ++++++++++++++++++
> 1 file changed, 297 insertions(+)
> create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consistency.py
>
> [...]
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Thanks,
Slava.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v4 08/11] selftests: ceph: add reset stress test
2026-05-07 12:27 [PATCH v4 00/11] ceph: manual client session reset Alex Markuze
` (6 preceding siblings ...)
2026-05-07 12:27 ` [PATCH v4 07/11] selftests: ceph: add reset consistency checker Alex Markuze
@ 2026-05-07 12:27 ` Alex Markuze
2026-05-07 19:29 ` [EXTERNAL] " Viacheslav Dubeyko
2026-05-07 12:27 ` [PATCH v4 09/11] selftests: ceph: add reset corner-case tests Alex Markuze
` (3 subsequent siblings)
11 siblings, 1 reply; 24+ messages in thread
From: Alex Markuze @ 2026-05-07 12:27 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Add a single-client stress test for the CephFS manual session reset
feature. The test runs concurrent I/O and rename workers alongside
periodic reset injection through debugfs, then validates namespace,
data, and recovery invariants via validate_consistency.py.
It supports four profiles (baseline, moderate, aggressive, soak) with
configurable duration, reset interval, and worker counts.
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
.../filesystems/ceph/reset_stress.sh | 694 ++++++++++++++++++
1 file changed, 694 insertions(+)
create mode 100755 tools/testing/selftests/filesystems/ceph/reset_stress.sh
diff --git a/tools/testing/selftests/filesystems/ceph/reset_stress.sh b/tools/testing/selftests/filesystems/ceph/reset_stress.sh
new file mode 100755
index 000000000000..c503c75a5f7a
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/reset_stress.sh
@@ -0,0 +1,694 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS reset stress test:
+# - Runs concurrent I/O and rename workloads
+# - Triggers random client resets through debugfs
+# - Validates consistency and recovery behavior
+
+set -euo pipefail
+
+KSFT_SKIP=4
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+ MOUNT_POINT="$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)"
+ if [[ -z "$MOUNT_POINT" ]]; then
+ echo "SKIP: No CephFS mount found and --mount-point not specified"
+ exit "$KSFT_SKIP"
+ fi
+ exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+PROFILE="moderate"
+DURATION_SEC=""
+COOLDOWN_SEC=20
+FILE_COUNT=64
+IO_WORKERS=""
+RENAME_WORKERS=""
+MOUNT_POINT=""
+OUT_DIR=""
+CLIENT_ID=""
+DEBUGFS_ROOT="/sys/kernel/debug/ceph"
+SLO_SECONDS=30
+EXPECT_RESET=1
+DMESG_CMD=""
+SUDO=""
+
+RESET_MIN_SEC=5
+RESET_MAX_SEC=15
+
+RUN_ID="$(date +%Y%m%d-%H%M%S)"
+WORKLOAD_FLAG=""
+RESET_FLAG=""
+DATA_DIR=""
+
+IO_LOG=""
+RENAME_LOG=""
+RESET_LOG=""
+STATUS_LOG=""
+STATUS_BEFORE=""
+STATUS_FINAL=""
+DMESG_LOG=""
+SUMMARY_LOG=""
+REPORT_JSON=""
+
+RESET_PID=0
+STATUS_PID=0
+declare -a IO_WORKER_PIDS=()
+declare -a RENAME_WORKER_PIDS=()
+
+usage()
+{
+ cat <<EOF
+Usage: $0 --mount-point <cephfs_mount> [options]
+
+Required:
+ --mount-point PATH CephFS mount point to test under
+
+Options:
+ --profile NAME baseline|moderate|aggressive|soak (default: moderate)
+ --duration-sec N Override profile runtime in seconds
+ --cooldown-sec N Workload drain time after injector stop (default: 20)
+ --file-count N Number of logical files (default: 64)
+ --io-workers N Number of concurrent I/O workers (profile default)
+ --rename-workers N Number of concurrent rename workers (profile default)
+ --out-dir PATH Artifact directory (default: /tmp/ceph_reset_stress_<ts>)
+ --client-id ID Ceph debugfs client id; auto-detect if one client exists
+ --debugfs-root PATH Debugfs Ceph root (default: /sys/kernel/debug/ceph)
+ --slo-seconds N Max allowed post-reset stall window (default: 30)
+ --no-reset Disable reset injector (baseline mode helper)
+ --help Show this message
+
+Examples:
+ $0 --mount-point /mnt/cephfs --profile moderate
+ $0 --mount-point /mnt/cephfs --profile aggressive --duration-sec 300
+ $0 --mount-point /mnt/cephfs --profile baseline --no-reset
+EOF
+}
+
+now_ms()
+{
+ date +%s%3N
+}
+
+set_profile_defaults()
+{
+ case "$PROFILE" in
+ baseline)
+ RESET_MIN_SEC=0
+ RESET_MAX_SEC=0
+ EXPECT_RESET=0
+ : "${DURATION_SEC:=600}"
+ : "${IO_WORKERS:=1}"
+ : "${RENAME_WORKERS:=1}"
+ ;;
+ moderate)
+ RESET_MIN_SEC=5
+ RESET_MAX_SEC=15
+ : "${DURATION_SEC:=900}"
+ : "${IO_WORKERS:=2}"
+ : "${RENAME_WORKERS:=1}"
+ ;;
+ aggressive)
+ RESET_MIN_SEC=1
+ RESET_MAX_SEC=5
+ : "${DURATION_SEC:=900}"
+ : "${IO_WORKERS:=4}"
+ : "${RENAME_WORKERS:=2}"
+ ;;
+ soak)
+ RESET_MIN_SEC=5
+ RESET_MAX_SEC=15
+ : "${DURATION_SEC:=3600}"
+ : "${IO_WORKERS:=2}"
+ : "${RENAME_WORKERS:=1}"
+ ;;
+ *)
+ echo "Unknown profile: $PROFILE" >&2
+ exit 2
+ ;;
+ esac
+}
+
+log_summary()
+{
+ local msg="$1"
+ printf '[%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$msg" | tee -a "$SUMMARY_LOG"
+}
+
+discover_client_id()
+{
+ local candidates=()
+ local entry
+
+ if [[ -n "$CLIENT_ID" ]]; then
+ if ! $SUDO test -d "$DEBUGFS_ROOT/$CLIENT_ID/reset"; then
+ echo "SKIP: reset debugfs not found for client-id=$CLIENT_ID" >&2
+ exit "$KSFT_SKIP"
+ fi
+ return 0
+ fi
+
+ if ! $SUDO test -d "$DEBUGFS_ROOT"; then
+ echo "SKIP: Debugfs root not found: $DEBUGFS_ROOT" >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ while IFS= read -r entry; do
+ $SUDO test -d "$DEBUGFS_ROOT/$entry/reset" || continue
+ $SUDO test -w "$DEBUGFS_ROOT/$entry/reset/trigger" || continue
+ candidates+=("$entry")
+ done < <($SUDO ls -1 "$DEBUGFS_ROOT" 2>/dev/null || true)
+
+ if [[ ${#candidates[@]} -eq 1 ]]; then
+ CLIENT_ID="${candidates[0]}"
+ return 0
+ fi
+
+ if [[ ${#candidates[@]} -eq 0 ]]; then
+ echo "SKIP: No writable Ceph reset interface found under $DEBUGFS_ROOT" >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ echo "SKIP: Multiple Ceph clients found (${candidates[*]}). Use --client-id." >&2
+ exit "$KSFT_SKIP"
+}
+
+init_dataset()
+{
+ local i
+ mkdir -p "$DATA_DIR/A" "$DATA_DIR/B"
+
+ for ((i = 0; i < FILE_COUNT; i++)); do
+ printf 'seed logical_id=%05d ts_ms=%s\n' "$i" "$(now_ms)" > "$DATA_DIR/A/file_$(printf '%05d' "$i")"
+ done
+}
+
+io_worker()
+{
+ set +e
+ local worker_id="$1"
+ local seq=0
+ local id
+ local relpath
+ local abspath
+ local payload
+ local hash
+ local ts
+ local alt_relpath alt_abspath result actual_abspath
+
+ while [[ -f "$WORKLOAD_FLAG" ]]; do
+ id="$(printf '%05d' $((RANDOM % FILE_COUNT)))"
+ if [[ -f "$DATA_DIR/A/file_$id" ]]; then
+ relpath="A/file_$id"
+ elif [[ -f "$DATA_DIR/B/file_$id" ]]; then
+ relpath="B/file_$id"
+ else
+ sleep 0.02
+ continue
+ fi
+
+ abspath="$DATA_DIR/$relpath"
+ alt_relpath=""
+ if [[ "$relpath" == A/* ]]; then
+ alt_relpath="B/file_$id"
+ else
+ alt_relpath="A/file_$id"
+ fi
+ alt_abspath="$DATA_DIR/$alt_relpath"
+ payload="worker=${worker_id} io_seq=${seq} id=${id} ts_ms=$(now_ms)"
+ result="$(
+ python3 - "$abspath" "$alt_abspath" "$payload" <<'PY'
+import hashlib
+import os
+import sys
+
+path = sys.argv[1]
+alt_path = sys.argv[2]
+payload = sys.argv[3]
+
+try:
+ fd = os.open(path, os.O_RDWR | os.O_APPEND)
+ actual = path
+except FileNotFoundError:
+ try:
+ fd = os.open(alt_path, os.O_RDWR | os.O_APPEND)
+ actual = alt_path
+ except FileNotFoundError:
+ sys.exit(1)
+
+try:
+ os.write(fd, (payload + "\n").encode())
+ os.fsync(fd)
+ os.lseek(fd, 0, os.SEEK_SET)
+ digest = hashlib.sha256()
+ while True:
+ chunk = os.read(fd, 1 << 20)
+ if not chunk:
+ break
+ digest.update(chunk)
+ print(actual + " " + digest.hexdigest())
+finally:
+ os.close(fd)
+PY
+ )" || {
+ sleep 0.02
+ continue
+ }
+
+ actual_abspath="${result%% *}"
+ hash="${result#* }"
+ if [[ "$actual_abspath" == "$alt_abspath" ]]; then
+ relpath="$alt_relpath"
+ fi
+
+ ts="$(now_ms)"
+ printf '%s,%s,%s,%s,%s\n' "$ts" "$seq" "$id" "$relpath" "$hash" >> "$IO_LOG"
+ seq=$((seq + 1))
+ sleep 0.02
+ done
+}
+
+rename_worker()
+{
+ set +e
+ local worker_id="$1"
+ local seq=0
+ local id
+ local src_rel
+ local dst_rel
+ local rc
+ local ts
+
+ while [[ -f "$WORKLOAD_FLAG" ]]; do
+ id="$(printf '%05d' $((RANDOM % FILE_COUNT)))"
+
+ if [[ -f "$DATA_DIR/A/file_$id" ]]; then
+ src_rel="A/file_$id"
+ dst_rel="B/file_$id"
+ elif [[ -f "$DATA_DIR/B/file_$id" ]]; then
+ src_rel="B/file_$id"
+ dst_rel="A/file_$id"
+ else
+ sleep 0.02
+ continue
+ fi
+
+ if mv -T "$DATA_DIR/$src_rel" "$DATA_DIR/$dst_rel" 2>/dev/null; then
+ rc=0
+ else
+ rc=$?
+ fi
+ ts="$(now_ms)"
+ printf '%s,%s,%s,%s,%s,%s,%s\n' "$ts" "$worker_id" "$seq" "$id" "$src_rel" "$dst_rel" "$rc" >> "$RENAME_LOG"
+ seq=$((seq + 1))
+ sleep 0.02
+ done
+}
+
+random_sleep_seconds()
+{
+ local min_sec="$1"
+ local max_sec="$2"
+ local wait_sec
+ local span
+
+ span=$((max_sec - min_sec + 1))
+ wait_sec=$((min_sec + RANDOM % span))
+ sleep "$wait_sec"
+}
+
+reset_injector()
+{
+ set +e
+ local trigger_path="$1"
+ local seq=0
+ local ts
+ local reason
+ local rc
+
+ while [[ -f "$RESET_FLAG" ]]; do
+ random_sleep_seconds "$RESET_MIN_SEC" "$RESET_MAX_SEC"
+ [[ -f "$RESET_FLAG" ]] || break
+
+ ts="$(now_ms)"
+ reason="stress_${seq}_${ts}"
+ if echo "$reason" | $SUDO tee "$trigger_path" > /dev/null 2>&1; then
+ rc=0
+ else
+ rc=$?
+ fi
+ printf '%s,%s,%s,%s\n' "$ts" "$seq" "$reason" "$rc" >> "$RESET_LOG"
+ seq=$((seq + 1))
+ done
+}
+
+status_sampler()
+{
+ set +e
+ local status_path="$1"
+ local ts
+ local kv_line
+
+ while [[ -f "$WORKLOAD_FLAG" || -f "$RESET_FLAG" ]]; do
+ ts="$(now_ms)"
+ if $SUDO test -r "$status_path"; then
+ kv_line="$($SUDO awk -F': ' 'NF>=2 {gsub(/ /, "", $1); gsub(/ /, "", $2); printf "%s=%s;", $1, $2}' "$status_path")"
+ printf '%s,%s\n' "$ts" "$kv_line" >> "$STATUS_LOG"
+ fi
+ sleep 1
+ done
+}
+
+stop_pid_with_timeout()
+{
+ local pid="$1"
+ local name="$2"
+ local timeout="$3"
+ local waited=0
+
+ if [[ "$pid" -le 0 ]]; then
+ return 0
+ fi
+
+ while kill -0 "$pid" 2>/dev/null; do
+ if (( waited >= timeout )); then
+ log_summary "Timeout waiting for $name (pid=$pid), sending SIGTERM/SIGKILL"
+ kill -TERM "$pid" 2>/dev/null || true
+ sleep 1
+ kill -KILL "$pid" 2>/dev/null || true
+ wait "$pid" 2>/dev/null || true
+ return 1
+ fi
+ sleep 1
+ waited=$((waited + 1))
+ done
+
+ wait "$pid" 2>/dev/null || true
+ return 0
+}
+
+detect_privileges()
+{
+ if [[ -r "$DEBUGFS_ROOT" ]]; then
+ SUDO=""
+ elif sudo -n true 2>/dev/null; then
+ SUDO="sudo"
+ else
+ echo "WARNING: $DEBUGFS_ROOT is not readable and passwordless sudo is not available" >&2
+ echo "WARNING: reset injection, debugfs status checks, and dmesg capture will not work" >&2
+ fi
+
+ if $SUDO dmesg > /dev/null 2>&1; then
+ DMESG_CMD="$SUDO dmesg"
+ else
+ DMESG_CMD=""
+ echo "WARNING: dmesg is not accessible; kernel errors (hung tasks) will not be detected" >&2
+ fi
+}
+
+check_dmesg()
+{
+ local start_epoch="$1"
+
+ if [[ -z "$DMESG_CMD" ]]; then
+ return 0
+ fi
+
+ if ! $DMESG_CMD --since "@$start_epoch" > "$DMESG_LOG" 2>/dev/null; then
+ if ! $DMESG_CMD > "$DMESG_LOG" 2>/dev/null; then
+ log_summary "WARNING: dmesg capture failed unexpectedly"
+ return 0
+ fi
+ log_summary "dmesg --since unsupported; captured full dmesg"
+ fi
+
+ if grep -qi "hung task" "$DMESG_LOG" 2>/dev/null; then
+ log_summary "ERROR: kernel log contains 'hung task' during test window"
+ return 1
+ fi
+
+ return 0
+}
+
+cleanup()
+{
+ rm -f "$WORKLOAD_FLAG" "$RESET_FLAG"
+ local pid
+ for pid in "${IO_WORKER_PIDS[@]}" "${RENAME_WORKER_PIDS[@]}" "$RESET_PID" "$STATUS_PID"; do
+ [[ "$pid" -gt 0 ]] 2>/dev/null && kill "$pid" 2>/dev/null || true
+ done
+ wait 2>/dev/null || true
+}
+
+parse_args()
+{
+ while [[ $# -gt 0 ]]; do
+ case "$1" in
+ --mount-point)
+ MOUNT_POINT="$2"
+ shift 2
+ ;;
+ --profile)
+ PROFILE="$2"
+ shift 2
+ ;;
+ --duration-sec)
+ DURATION_SEC="$2"
+ shift 2
+ ;;
+ --cooldown-sec)
+ COOLDOWN_SEC="$2"
+ shift 2
+ ;;
+ --file-count)
+ FILE_COUNT="$2"
+ shift 2
+ ;;
+ --io-workers)
+ IO_WORKERS="$2"
+ shift 2
+ ;;
+ --rename-workers)
+ RENAME_WORKERS="$2"
+ shift 2
+ ;;
+ --out-dir)
+ OUT_DIR="$2"
+ shift 2
+ ;;
+ --client-id)
+ CLIENT_ID="$2"
+ shift 2
+ ;;
+ --debugfs-root)
+ DEBUGFS_ROOT="$2"
+ shift 2
+ ;;
+ --slo-seconds)
+ SLO_SECONDS="$2"
+ shift 2
+ ;;
+ --no-reset)
+ EXPECT_RESET=0
+ shift
+ ;;
+ --help|-h)
+ usage
+ exit 0
+ ;;
+ *)
+ echo "Unknown option: $1" >&2
+ usage
+ exit 2
+ ;;
+ esac
+ done
+}
+
+main()
+{
+ local start_epoch
+ local trigger_path=""
+ local status_path=""
+ local final_rc=0
+ local reset_enabled=0
+ local i
+
+ parse_args "$@"
+
+ if [[ -z "$MOUNT_POINT" ]]; then
+ echo "--mount-point is required" >&2
+ usage
+ exit 2
+ fi
+
+ if [[ ! -d "$MOUNT_POINT" ]]; then
+ echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ if ! touch "$MOUNT_POINT/.ceph_reset_test_probe" 2>/dev/null; then
+ echo "SKIP: Mount point is not writable: $MOUNT_POINT" >&2
+ exit "$KSFT_SKIP"
+ fi
+ rm -f "$MOUNT_POINT/.ceph_reset_test_probe"
+
+ if ! command -v python3 > /dev/null 2>&1; then
+ echo "SKIP: python3 is required but not found in PATH" >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ if ! stat -f -c '%T' "$MOUNT_POINT" 2>/dev/null | grep -qi ceph; then
+ echo "WARNING: $MOUNT_POINT does not appear to be a CephFS mount" >&2
+ fi
+
+ detect_privileges
+
+ set_profile_defaults
+ if [[ "$EXPECT_RESET" -eq 0 ]]; then
+ PROFILE="baseline"
+ RESET_MIN_SEC=0
+ RESET_MAX_SEC=0
+ fi
+
+ if ! [[ "$IO_WORKERS" =~ ^[0-9]+$ && "$RENAME_WORKERS" =~ ^[0-9]+$ ]]; then
+ echo "io-workers and rename-workers must be integers" >&2
+ exit 2
+ fi
+
+ if [[ "$IO_WORKERS" -le 0 || "$RENAME_WORKERS" -le 0 ]]; then
+ echo "io-workers and rename-workers must be > 0" >&2
+ exit 2
+ fi
+
+ if [[ -z "$OUT_DIR" ]]; then
+ OUT_DIR="/tmp/ceph_reset_stress_${RUN_ID}"
+ fi
+ mkdir -p "$OUT_DIR"
+
+ WORKLOAD_FLAG="$OUT_DIR/workload.running"
+ RESET_FLAG="$OUT_DIR/reset.running"
+
+ DATA_DIR="$MOUNT_POINT/ceph_reset_stress_${RUN_ID}"
+ mkdir -p "$DATA_DIR"
+
+ IO_LOG="$OUT_DIR/io.log"
+ RENAME_LOG="$OUT_DIR/rename.log"
+ RESET_LOG="$OUT_DIR/reset.log"
+ STATUS_LOG="$OUT_DIR/status.log"
+ STATUS_BEFORE="$OUT_DIR/reset_status.before"
+ STATUS_FINAL="$OUT_DIR/reset_status.final"
+ DMESG_LOG="$OUT_DIR/dmesg.log"
+ SUMMARY_LOG="$OUT_DIR/summary.log"
+ REPORT_JSON="$OUT_DIR/validator_report.json"
+
+ : > "$IO_LOG"
+ : > "$RENAME_LOG"
+ : > "$RESET_LOG"
+ : > "$STATUS_LOG"
+ : > "$SUMMARY_LOG"
+
+ start_epoch="$(date +%s)"
+
+ log_summary "Starting Ceph reset stress test"
+ log_summary "Profile=$PROFILE duration=${DURATION_SEC}s cooldown=${COOLDOWN_SEC}s file_count=${FILE_COUNT} io_workers=${IO_WORKERS} rename_workers=${RENAME_WORKERS}"
+ [[ -n "$SUDO" ]] && log_summary "Using sudo for privileged operations"
+ [[ -z "$DMESG_CMD" ]] && log_summary "WARNING: dmesg not available; hung task detection disabled"
+ log_summary "Artifacts=$OUT_DIR"
+ log_summary "Data dir=$DATA_DIR"
+
+ init_dataset
+
+ if [[ "$EXPECT_RESET" -eq 1 ]]; then
+ discover_client_id
+ trigger_path="$DEBUGFS_ROOT/$CLIENT_ID/reset/trigger"
+ status_path="$DEBUGFS_ROOT/$CLIENT_ID/reset/status"
+ if ! $SUDO test -w "$trigger_path"; then
+ echo "SKIP: Reset trigger is not writable: $trigger_path" >&2
+ exit "$KSFT_SKIP"
+ fi
+ if ! $SUDO test -r "$status_path"; then
+ echo "SKIP: Reset status is not readable: $status_path" >&2
+ exit "$KSFT_SKIP"
+ fi
+ $SUDO cat "$status_path" > "$STATUS_BEFORE" || true
+ reset_enabled=1
+ log_summary "Using ceph client id: $CLIENT_ID"
+ fi
+
+ trap cleanup EXIT INT TERM
+
+ touch "$WORKLOAD_FLAG"
+ for ((i = 0; i < IO_WORKERS; i++)); do
+ io_worker "$i" &
+ IO_WORKER_PIDS+=("$!")
+ done
+
+ for ((i = 0; i < RENAME_WORKERS; i++)); do
+ rename_worker "$i" &
+ RENAME_WORKER_PIDS+=("$!")
+ done
+
+ if [[ "$reset_enabled" -eq 1 ]]; then
+ touch "$RESET_FLAG"
+ reset_injector "$trigger_path" &
+ RESET_PID=$!
+
+ status_sampler "$status_path" &
+ STATUS_PID=$!
+ fi
+
+ sleep "$DURATION_SEC"
+
+ if [[ "$reset_enabled" -eq 1 ]]; then
+ rm -f "$RESET_FLAG"
+ stop_pid_with_timeout "$RESET_PID" "reset_injector" 20 || final_rc=1
+ log_summary "Injector stopped; entering cooldown=${COOLDOWN_SEC}s"
+ fi
+
+ sleep "$COOLDOWN_SEC"
+
+ rm -f "$WORKLOAD_FLAG"
+ for i in "${!IO_WORKER_PIDS[@]}"; do
+ stop_pid_with_timeout "${IO_WORKER_PIDS[$i]}" "io_worker[$i]" 20 || final_rc=1
+ done
+ for i in "${!RENAME_WORKER_PIDS[@]}"; do
+ stop_pid_with_timeout "${RENAME_WORKER_PIDS[$i]}" "rename_worker[$i]" 20 || final_rc=1
+ done
+
+ if [[ "$reset_enabled" -eq 1 ]]; then
+ stop_pid_with_timeout "$STATUS_PID" "status_sampler" 10 || final_rc=1
+ $SUDO cat "$status_path" > "$STATUS_FINAL" || true
+ fi
+
+ if ! check_dmesg "$start_epoch"; then
+ final_rc=1
+ fi
+
+ if ! python3 "$SCRIPT_DIR/validate_consistency.py" \
+ --data-dir "$DATA_DIR" \
+ --file-count "$FILE_COUNT" \
+ --io-log "$IO_LOG" \
+ --rename-log "$RENAME_LOG" \
+ --reset-log "$RESET_LOG" \
+ --status-final "$STATUS_FINAL" \
+ --slo-seconds "$SLO_SECONDS" \
+ --report-json "$REPORT_JSON" \
+ $( [[ "$reset_enabled" -eq 1 ]] && echo "--expect-reset" ); then
+ final_rc=1
+ fi
+
+ if [[ "$final_rc" -eq 0 ]]; then
+ log_summary "PASS: stress run completed successfully"
+ else
+ log_summary "FAIL: stress run detected one or more failures"
+ fi
+
+ log_summary "Artifacts available in: $OUT_DIR"
+ exit "$final_rc"
+}
+
+main "$@"
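A side note on the validator invocation above: the optional `--expect-reset` flag is passed through an unquoted `$( [[ ... ]] && echo ... )` substitution, which works but leans on word splitting. A bash array expresses the same conditional flag more robustly. This is a minimal standalone sketch, not part of the patch; the `printf` stands in for the real `validate_consistency.py` command line:

```shell
#!/bin/bash
# Collect optional validator flags in an array; "${extra_args[@]}"
# expands to zero words when the array is empty, so the command line
# stays clean in the --no-reset case.
reset_enabled=1

extra_args=()
if [[ "$reset_enabled" -eq 1 ]]; then
    extra_args+=("--expect-reset")
fi

# Stand-in for: python3 validate_consistency.py ... "${extra_args[@]}"
cmdline="$(printf '%s ' validate_consistency.py "${extra_args[@]}")"
echo "$cmdline"
```

With `reset_enabled=0` the array stays empty and no stray empty argument is appended, which is the failure mode the quoted-substitution idiom has to avoid by leaving the expansion unquoted.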
--
2.34.1
^ permalink raw reply related [flat|nested] 24+ messages in thread

* Re: [EXTERNAL] [PATCH v4 08/11] selftests: ceph: add reset stress test
2026-05-07 12:27 ` [PATCH v4 08/11] selftests: ceph: add reset stress test Alex Markuze
@ 2026-05-07 19:29 ` Viacheslav Dubeyko
0 siblings, 0 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-05-07 19:29 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> Add a single-client stress test for the CephFS manual session reset
> feature. The test runs concurrent I/O workers alongside periodic
> reset injection, then validates data integrity via
> validate_consistency.py.
>
> Supports four profiles (baseline, moderate, aggressive, soak) with
> configurable duration, reset interval, and worker counts.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> .../filesystems/ceph/reset_stress.sh | 694 ++++++++++++++++++
> 1 file changed, 694 insertions(+)
> create mode 100755 tools/testing/selftests/filesystems/ceph/reset_stress.sh
>
> [... full 694-line patch quoted verbatim; snipped here, see the original posting above ...]
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Thanks,
Slava.
^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v4 09/11] selftests: ceph: add reset corner-case tests
2026-05-07 12:27 [PATCH v4 00/11] ceph: manual client session reset Alex Markuze
` (7 preceding siblings ...)
2026-05-07 12:27 ` [PATCH v4 08/11] selftests: ceph: add reset stress test Alex Markuze
@ 2026-05-07 12:27 ` Alex Markuze
2026-05-07 19:31 ` [EXTERNAL] " Viacheslav Dubeyko
2026-05-07 12:27 ` [PATCH v4 10/11] selftests: ceph: add validation harness Alex Markuze
` (2 subsequent siblings)
11 siblings, 1 reply; 24+ messages in thread
From: Alex Markuze @ 2026-05-07 12:27 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Add targeted corner-case tests for the CephFS manual session reset
feature. Four sequential tests cover:
[1/4] ebusy_rejection - second reset rejected while first in-flight
[2/4] dirty_caps_at_reset - reset with unflushed dirty caps
[3/4] flock_after_reset - stale lock EIO + fresh lock after holder exit
[4/4] unmount_during_reset - umount during active reset (ESHUTDOWN path)
Requires: mounted CephFS, debugfs access (root), flock(1) utility.
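All four tests below poll the debugfs reset `status` file, which exposes `key: value` lines, and the extraction helper in the script reduces to a one-line awk program. This is a hedged standalone sketch of that parsing against a synthetic status file; the field names mirror the patch, but the temp file and `read_field` name are purely illustrative:

```shell
#!/bin/bash
# Synthetic stand-in for /sys/kernel/debug/ceph/<client>/reset/status.
status_file="$(mktemp)"
cat > "$status_file" <<'EOF'
phase: idle
trigger_count: 3
success_count: 3
last_errno: 0
EOF

# Same shape as read_status_field() in the test: split on ": " and
# print the value whose key matches exactly.
read_field() {
    awk -F': ' -v key="$1" '$1 == key {print $2}' "$status_file"
}

phase="$(read_field phase)"
triggers="$(read_field trigger_count)"
rm -f "$status_file"
```

Splitting on the two-character separator `": "` rather than `":"` keeps values with embedded colons intact and avoids a leading space in the printed value.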
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
.../filesystems/ceph/reset_corner_cases.sh | 646 ++++++++++++++++++
1 file changed, 646 insertions(+)
create mode 100755 tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
diff --git a/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh b/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
new file mode 100755
index 000000000000..a6dae84a616d
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
@@ -0,0 +1,646 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS client reset corner case tests.
+# Runs a checklist of targeted tests that exercise specific reset
+# code paths not covered by the stress tests.
+#
+# Requires: mounted CephFS, debugfs access (root), flock(1) utility.
+
+set -uo pipefail
+
+KSFT_SKIP=4
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+ MOUNT_POINT="$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)"
+ if [[ -z "$MOUNT_POINT" ]]; then
+ echo "SKIP: No CephFS mount found and --mount-point not specified"
+ exit "$KSFT_SKIP"
+ fi
+ exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+MOUNT_POINT=""
+DEBUGFS_ROOT="/sys/kernel/debug/ceph"
+DEBUGFS_CLIENT=""
+TRIGGER_PATH=""
+STATUS_PATH=""
+TEMP_MNT=""
+
+PASS_COUNT=0
+FAIL_COUNT=0
+SKIP_COUNT=0
+TOTAL=4
+
+log()
+{
+ printf '[%s] %s\n' "$(date -u +%H:%M:%S)" "$1"
+}
+
+result()
+{
+ local num="$1"
+ local name="$2"
+ local status="$3"
+ local detail="${4:-}"
+
+ case "$status" in
+ PASS) PASS_COUNT=$((PASS_COUNT + 1)) ;;
+ FAIL) FAIL_COUNT=$((FAIL_COUNT + 1)) ;;
+ SKIP) SKIP_COUNT=$((SKIP_COUNT + 1)) ;;
+ esac
+
+ if [[ -n "$detail" ]]; then
+ printf '[%d/%d] %-30s %s (%s)\n' "$num" "$TOTAL" "$name" "$status" "$detail"
+ else
+ printf '[%d/%d] %-30s %s\n' "$num" "$TOTAL" "$name" "$status"
+ fi
+}
+
+read_status_field()
+{
+ local field="$1"
+ awk -F': ' -v key="$field" '$1 == key {print $2}' "$STATUS_PATH" 2>/dev/null
+}
+
+wait_reset_done()
+{
+ local timeout="${1:-30}"
+ local elapsed=0
+
+ while [[ "$(read_status_field "phase")" != "idle" ]]; do
+ sleep 1
+ elapsed=$((elapsed + 1))
+ if [[ "$elapsed" -ge "$timeout" ]]; then
+ return 1
+ fi
+ done
+ return 0
+}
+
+list_reset_clients()
+{
+ local entry
+
+ for entry in "$DEBUGFS_ROOT"/*/; do
+ entry="$(basename "$entry")"
+ [[ -d "$DEBUGFS_ROOT/$entry/reset" ]] || continue
+ [[ -w "$DEBUGFS_ROOT/$entry/reset/trigger" ]] || continue
+ printf '%s\n' "$entry"
+ done
+}
+
+wait_status_nonidle()
+{
+ local status_path="$1"
+ local timeout="${2:-10}"
+ local polls=$((timeout * 10))
+ local phase
+
+ while [[ "$polls" -gt 0 ]]; do
+ phase="$(awk -F': ' '$1 == "phase" {print $2}' "$status_path" 2>/dev/null)"
+ if [[ -n "$phase" && "$phase" != "idle" ]]; then
+ return 0
+ fi
+ sleep 0.1
+ polls=$((polls - 1))
+ done
+
+ return 1
+}
+
+discover_debugfs()
+{
+ local candidates=()
+ local entry
+
+ if [[ -n "$DEBUGFS_CLIENT" ]]; then
+ if [[ ! -d "$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset" ]]; then
+ echo "SKIP: reset debugfs not found for $DEBUGFS_CLIENT" >&2
+ exit "$KSFT_SKIP"
+ fi
+ return 0
+ fi
+
+ for entry in "$DEBUGFS_ROOT"/*/; do
+ entry="$(basename "$entry")"
+ [[ -d "$DEBUGFS_ROOT/$entry/reset" ]] || continue
+ [[ -w "$DEBUGFS_ROOT/$entry/reset/trigger" ]] || continue
+ candidates+=("$entry")
+ done
+
+ if [[ ${#candidates[@]} -eq 0 ]]; then
+ echo "SKIP: No writable Ceph reset interface found under $DEBUGFS_ROOT" >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ if [[ ${#candidates[@]} -gt 1 ]]; then
+ echo "SKIP: Multiple Ceph clients found: ${candidates[*]}. Use --client-id." >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ DEBUGFS_CLIENT="${candidates[0]}"
+}
+
+# --- Test 1: ebusy_rejection ------------------------------------------------
+#
+# Trigger a reset while another is guaranteed in-flight. Creates
+# dirty state so the first reset enters DRAINING (which takes
+# measurable time), then polls until phase != idle and issues the
+# second trigger. The second trigger must fail (the kernel returns
+# -EBUSY), and only one reset must be counted in the accounting.
+
+test_ebusy_rejection()
+{
+ local num=1
+ local name="ebusy_rejection"
+ local testfile="$MOUNT_POINT/.reset_corner_ebusy_$$"
+ local tc_before tc_after sc_before sc_after second_rc phase elapsed
+
+ tc_before="$(read_status_field "trigger_count")"
+ sc_before="$(read_status_field "success_count")"
+
+ # Create dirty state so the first reset enters DRAINING
+ echo "ebusy_dirty_data" > "$testfile"
+ sync "$testfile"
+
+ python3 -c "
+import os, sys
+fd = os.open('$testfile', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'dirty_for_ebusy_test\n')
+sys.stdout.write('written')
+" 2>/dev/null || {
+ result "$num" "$name" FAIL "dirty write failed"
+ rm -f "$testfile"
+ return
+ }
+
+ # Trigger the first reset -- it will drain dirty state
+ echo "ebusy_first" > "$TRIGGER_PATH" 2>/dev/null || {
+ result "$num" "$name" FAIL "first trigger failed"
+ rm -f "$testfile"
+ return
+ }
+
+ # Poll until phase is non-idle (quiescing or draining)
+ elapsed=0
+ while true; do
+ phase="$(read_status_field "phase")"
+ if [[ "$phase" != "idle" ]]; then
+ break
+ fi
+ sleep 0.1
+ elapsed=$((elapsed + 1))
+ if [[ "$elapsed" -ge 50 ]]; then	# 50 polls x 0.1s = 5s budget
+ result "$num" "$name" SKIP \
+ "first reset completed before overlap could be tested"
+ rm -f "$testfile" 2>/dev/null
+ return
+ fi
+ done
+
+ # Issue the second trigger -- should be rejected with EBUSY
+ second_rc=0
+ echo "ebusy_second" > "$TRIGGER_PATH" 2>/dev/null && second_rc=0 || second_rc=$?
+
+ if ! wait_reset_done 30; then
+ result "$num" "$name" FAIL "first reset never completed"
+ rm -f "$testfile"
+ return
+ fi
+
+ tc_after="$(read_status_field "trigger_count")"
+ sc_after="$(read_status_field "success_count")"
+
+ if [[ "$((tc_after - tc_before))" -ne 1 ]]; then
+ result "$num" "$name" FAIL "trigger_count +$((tc_after - tc_before)), expected +1"
+ rm -f "$testfile"
+ return
+ fi
+
+ if [[ "$((sc_after - sc_before))" -ne 1 ]]; then
+ result "$num" "$name" FAIL "success_count +$((sc_after - sc_before)), expected +1"
+ rm -f "$testfile"
+ return
+ fi
+
+ if [[ "$second_rc" -eq 0 ]]; then
+ result "$num" "$name" FAIL "second trigger did not return error"
+ rm -f "$testfile"
+ return
+ fi
+
+ rm -f "$testfile" 2>/dev/null
+ result "$num" "$name" PASS
+}
+
+# --- Test 2: dirty_caps_at_reset --------------------------------------------
+#
+# Write to a file without fsync (dirty caps), trigger reset, then
+# verify the file is not corrupt. Manual reset drains dirty caps
+# before teardown (best-effort, 5s timeout). For a non-stuck cap
+# the dirty write should be flushed during drain and persist.
+# If the drain window is too short, only the synced first line
+# persists -- that is acceptable (data loss is documented for
+# unflushed writes).
+
+test_dirty_caps_at_reset()
+{
+ local num=2
+ local name="dirty_caps_at_reset"
+ local testfile="$MOUNT_POINT/.reset_corner_dirty_caps_$$"
+ local content_after line_count sc_before sc_after le
+
+ sc_before="$(read_status_field "success_count")"
+
+ echo "line_1_before_dirty_write" > "$testfile"
+ sync "$testfile"
+
+ python3 -c "
+import os, sys
+fd = os.open('$testfile', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'line_2_dirty_no_fsync\n')
+# deliberately no fsync -- leave caps dirty
+sys.stdout.write('written')
+" 2>/dev/null || {
+ result "$num" "$name" FAIL "dirty write failed"
+ rm -f "$testfile"
+ return
+ }
+
+ echo "dirty_caps_test" > "$TRIGGER_PATH" 2>/dev/null || {
+ result "$num" "$name" FAIL "reset trigger failed"
+ rm -f "$testfile"
+ return
+ }
+
+ if ! wait_reset_done 30; then
+ result "$num" "$name" FAIL "reset did not complete"
+ rm -f "$testfile"
+ return
+ fi
+
+ sc_after="$(read_status_field "success_count")"
+ if [[ "$sc_after" -le "$sc_before" ]]; then
+ result "$num" "$name" FAIL "success_count did not increment (reset not exercised)"
+ rm -f "$testfile"
+ return
+ fi
+
+ sync "$testfile" 2>/dev/null || true
+ content_after="$(cat "$testfile" 2>/dev/null)" || {
+ result "$num" "$name" FAIL "cannot read file after reset"
+ rm -f "$testfile"
+ return
+ }
+
+ if [[ -z "$content_after" ]]; then
+ result "$num" "$name" FAIL "file is empty after reset"
+ rm -f "$testfile"
+ return
+ fi
+
+ line_count="$(echo "$content_after" | wc -l)"
+ if [[ "$line_count" -lt 1 ]]; then
+ result "$num" "$name" FAIL "file has $line_count lines, expected >= 1"
+ rm -f "$testfile"
+ return
+ fi
+
+ echo "$content_after" | head -1 | grep -q "line_1_before_dirty_write" || {
+ result "$num" "$name" FAIL "first line corrupted"
+ rm -f "$testfile"
+ return
+ }
+
+ le="$(read_status_field "last_errno")"
+ if [[ "$le" != "0" ]]; then
+ result "$num" "$name" FAIL "last_errno=$le, expected 0"
+ rm -f "$testfile"
+ return
+ fi
+
+ rm -f "$testfile"
+ result "$num" "$name" PASS "file intact ($line_count lines)"
+}
+
+# --- Test 3: flock_after_reset ----------------------------------------------
+#
+# Take an exclusive flock, trigger reset, verify stale lock state is
+# marked with CEPH_I_ERROR_FILELOCK (same-client flock attempt returns
+# EIO). After the original holder exits (releasing the local lock
+# reference and clearing the error flag), a fresh lock can be acquired.
+#
+# The lock holder uses the fd-based flock form with exec, so killing
+# $lock_pid closes the lock fd immediately (no orphaned child with an
+# inherited fd copy that would prevent the VFS flock release).
+
+test_flock_after_reset()
+{
+ local num=3
+ local name="flock_after_reset"
+ local testfile="$MOUNT_POINT/.reset_corner_flock_$$"
+ local lock_pid probe_rc sc_before sc_after
+
+ sc_before="$(read_status_field "success_count")"
+
+ echo "flock_test_content" > "$testfile"
+ sync "$testfile"
+
+ # Hold lock via fd in a subshell; exec ensures killing $lock_pid
+ # closes the lock fd directly (no fork/child fd inheritance).
+ (
+ exec 9<"$testfile"
+ flock --exclusive --nonblock 9 || exit 1
+ exec sleep 300
+ ) &
+ lock_pid=$!
+ sleep 0.5
+
+ if ! kill -0 "$lock_pid" 2>/dev/null; then
+ result "$num" "$name" FAIL "flock holder died immediately"
+ rm -f "$testfile"
+ return
+ fi
+
+ echo "flock_after_reset_test" > "$TRIGGER_PATH" 2>/dev/null || {
+ kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+ result "$num" "$name" FAIL "reset trigger failed"
+ rm -f "$testfile"
+ return
+ }
+
+ if ! wait_reset_done 30; then
+ kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+ result "$num" "$name" FAIL "reset did not complete"
+ rm -f "$testfile"
+ return
+ fi
+
+ sc_after="$(read_status_field "success_count")"
+ if [[ "$sc_after" -le "$sc_before" ]]; then
+ kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+ result "$num" "$name" FAIL "success_count did not increment"
+ rm -f "$testfile"
+ return
+ fi
+
+ # After teardown, CEPH_I_ERROR_FILELOCK is set on the inode.
+ # A same-client lock attempt should fail (EIO), NOT succeed.
+ probe_rc=0
+ flock --exclusive --nonblock "$testfile" true 2>/dev/null && probe_rc=0 || probe_rc=$?
+ if [[ "$probe_rc" -eq 0 ]]; then
+ kill "$lock_pid" 2>/dev/null; wait "$lock_pid" 2>/dev/null
+ result "$num" "$name" FAIL \
+ "same-client probe succeeded, expected EIO from stale lock state"
+ rm -f "$testfile"
+ return
+ fi
+
+ # Kill the holder -- the exec'd sleep IS $lock_pid, so killing it
+ # closes fd 9 directly. VFS flock release fires ceph_fl_release_lock(),
+ # which decrements i_filelock_ref to 0 and clears CEPH_I_ERROR_FILELOCK.
+ kill "$lock_pid" 2>/dev/null
+ wait "$lock_pid" 2>/dev/null
+
+ # After the holder exits, a fresh lock should be acquirable.
+ # The reset teardown sends SESSION_REQUEST_CLOSE so the MDS
+ # releases locks promptly, but retry briefly in case the
+ # message races with the connection close.
+ local attempt
+ probe_rc=1
+ for attempt in 1 2 3 4 5; do
+ probe_rc=0
+ flock --exclusive --nonblock "$testfile" true 2>/dev/null \
+ && probe_rc=0 || probe_rc=$?
+ [[ "$probe_rc" -eq 0 ]] && break
+ sleep 1
+ done
+ if [[ "$probe_rc" -ne 0 ]]; then
+ result "$num" "$name" FAIL \
+ "cannot acquire fresh lock after holder exit (rc=$probe_rc, ${attempt} attempts)"
+ rm -f "$testfile"
+ return
+ fi
+
+ # Verify file content survived
+ grep -q "flock_test_content" "$testfile" 2>/dev/null || {
+ result "$num" "$name" FAIL "file content corrupted after reset"
+ rm -f "$testfile"
+ return
+ }
+
+ rm -f "$testfile"
+ result "$num" "$name" PASS "stale lock detected, fresh lock acquired after holder exit"
+}
+
+# --- Test 4: unmount_during_reset -------------------------------------------
+#
+# Mount a fresh CephFS, trigger reset, immediately unmount. The
+# ceph_mdsc_destroy() path must wake blocked waiters with -ESHUTDOWN
+# and not hang.
+
+test_unmount_during_reset()
+{
+ local num=4
+ local name="unmount_during_reset"
+ local temp_mnt="/tmp/ceph_corner_mnt_$$"
+ local mount_opts=""
+ local mount_src=""
+ local temp_trigger=""
+ local temp_status=""
+ local temp_client=""
+ local temp_file="$temp_mnt/.reset_corner_umount_$$"
+ local phase=""
+ local trigger_ok=0
+ local attempt
+ local -a new_clients=()
+ declare -A existing_clients=()
+
+ mount_src="$(awk -v mp="$MOUNT_POINT" '$2 == mp && $3 == "ceph" {print $1; exit}' /proc/mounts 2>/dev/null)"
+ mount_opts="$(awk -v mp="$MOUNT_POINT" '$2 == mp && $3 == "ceph" {print $4; exit}' /proc/mounts 2>/dev/null)"
+
+ if [[ -z "$mount_src" ]]; then
+ result "$num" "$name" SKIP "cannot determine mount source from /proc/mounts"
+ return
+ fi
+
+ while IFS= read -r existing; do
+ [[ -n "$existing" ]] || continue
+ existing_clients["$existing"]=1
+ done < <(list_reset_clients)
+
+ mkdir -p "$temp_mnt"
+
+ if ! mount -t ceph "$mount_src" "$temp_mnt" -o "$mount_opts" 2>/dev/null; then
+ result "$num" "$name" SKIP "cannot mount additional CephFS instance"
+ rmdir "$temp_mnt" 2>/dev/null
+ return
+ fi
+
+ ls "$temp_mnt" > /dev/null 2>&1
+ sync
+ sleep 1
+
+ for attempt in $(seq 1 50); do
+ new_clients=()
+ while IFS= read -r entry; do
+ [[ -n "$entry" ]] || continue
+ if [[ -n "${existing_clients[$entry]+x}" ]]; then
+ continue
+ fi
+ new_clients+=("$entry")
+ done < <(list_reset_clients)
+
+ if [[ "${#new_clients[@]}" -eq 1 ]]; then
+ temp_client="${new_clients[0]}"
+ break
+ fi
+
+ if [[ "${#new_clients[@]}" -gt 1 ]]; then
+ break
+ fi
+
+ sleep 0.1
+ done
+
+ if [[ -z "$temp_client" ]]; then
+ umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" SKIP "cannot identify debugfs client for temp mount"
+ return
+ fi
+
+ if [[ "${#new_clients[@]}" -gt 1 ]]; then
+ umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" SKIP "multiple new debugfs clients appeared"
+ return
+ fi
+
+ temp_trigger="$DEBUGFS_ROOT/$temp_client/reset/trigger"
+ temp_status="$DEBUGFS_ROOT/$temp_client/reset/status"
+
+ echo "umount_dirty_seed" > "$temp_file" 2>/dev/null || {
+ umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" FAIL "cannot create dirty state on temp mount"
+ return
+ }
+ sync "$temp_file"
+ python3 -c "
+import os, sys
+fd = os.open('$temp_file', os.O_WRONLY | os.O_APPEND)
+os.write(fd, b'dirty_for_umount_test\\n')
+os.close(fd)
+" 2>/dev/null || {
+ umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" FAIL "cannot dirty temp mount for reset overlap"
+ return
+ }
+
+ echo "unmount_test" > "$temp_trigger" 2>/dev/null && trigger_ok=1 || trigger_ok=0
+ if [[ "$trigger_ok" -ne 1 ]]; then
+ umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" FAIL "cannot trigger reset on temp mount"
+ return
+ fi
+
+ if ! wait_status_nonidle "$temp_status" 10; then
+ phase="$(awk -F': ' '$1 == "phase" {print $2}' "$temp_status" 2>/dev/null)"
+ umount "$temp_mnt" 2>/dev/null || umount -l "$temp_mnt" 2>/dev/null
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" FAIL \
+ "reset never became active before umount (phase=${phase:-unknown})"
+ return
+ fi
+
+ local umount_ok=0
+ timeout 30 umount "$temp_mnt" 2>/dev/null && umount_ok=1
+
+ if [[ "$umount_ok" -ne 1 ]]; then
+ umount -l "$temp_mnt" 2>/dev/null || true
+ rmdir "$temp_mnt" 2>/dev/null
+ result "$num" "$name" FAIL "umount hung for >30s"
+ return
+ fi
+
+ rmdir "$temp_mnt" 2>/dev/null
+
+ ls "$MOUNT_POINT" > /dev/null 2>&1 || {
+ result "$num" "$name" FAIL "original mount unhealthy after test"
+ return
+ }
+
+ result "$num" "$name" PASS
+}
+
+# --- Main --------------------------------------------------------------------
+
+usage()
+{
+ cat <<EOF
+Usage: $0 --mount-point <path> [--client-id <id>] [--debugfs-root <path>]
+
+Runs targeted corner-case tests for the CephFS client reset feature.
+Requires root (debugfs access) and a mounted CephFS filesystem.
+
+Options:
+ --mount-point PATH CephFS mount point (required)
+ --client-id ID Ceph debugfs client id (auto-detect if one client)
+ --debugfs-root PATH Debugfs ceph root (default: /sys/kernel/debug/ceph)
+ --help Show this message
+EOF
+}
+
+main()
+{
+ while [[ $# -gt 0 ]]; do
+ case "$1" in
+ --mount-point) MOUNT_POINT="$2"; shift 2 ;;
+ --client-id) DEBUGFS_CLIENT="$2"; shift 2 ;;
+ --debugfs-root) DEBUGFS_ROOT="$2"; shift 2 ;;
+ --help|-h) usage; exit 0 ;;
+ *) echo "Unknown option: $1" >&2; usage; exit 2 ;;
+ esac
+ done
+
+ if [[ -z "$MOUNT_POINT" ]]; then
+ echo "--mount-point is required" >&2
+ usage
+ exit 2
+ fi
+
+ if [[ ! -d "$MOUNT_POINT" ]]; then
+ echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+ exit "$KSFT_SKIP"
+ fi
+
+ discover_debugfs
+ TRIGGER_PATH="$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset/trigger"
+ STATUS_PATH="$DEBUGFS_ROOT/$DEBUGFS_CLIENT/reset/status"
+
+ log "CephFS client reset corner case tests"
+ log "Mount: $MOUNT_POINT"
+ log "Client: $DEBUGFS_CLIENT"
+ echo ""
+
+ test_ebusy_rejection
+ test_dirty_caps_at_reset
+ test_flock_after_reset
+ test_unmount_during_reset
+
+ echo ""
+ echo "Results: $PASS_COUNT passed, $FAIL_COUNT failed, $SKIP_COUNT skipped (of $TOTAL)"
+
+ if [[ "$FAIL_COUNT" -gt 0 ]]; then
+ exit 1
+ fi
+ exit 0
+}
+
+main "$@"
--
2.34.1
* Re: [EXTERNAL] [PATCH v4 09/11] selftests: ceph: add reset corner-case tests
2026-05-07 12:27 ` [PATCH v4 09/11] selftests: ceph: add reset corner-case tests Alex Markuze
@ 2026-05-07 19:31 ` Viacheslav Dubeyko
0 siblings, 0 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-05-07 19:31 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> Add targeted corner-case tests for the CephFS manual session reset
> feature. Four sequential tests cover:
>
> [1/4] ebusy_rejection - second reset rejected while first in-flight
> [2/4] dirty_caps_at_reset - reset with unflushed dirty caps
> [3/4] flock_after_reset - stale lock EIO + fresh lock after holder exit
> [4/4] unmount_during_reset - umount during active reset (ESHUTDOWN path)
>
> Requires: mounted CephFS, debugfs access (root), flock(1) utility.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> .../filesystems/ceph/reset_corner_cases.sh | 646 ++++++++++++++++++
> 1 file changed, 646 insertions(+)
> create mode 100755 tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
>
> [...]
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Thanks,
Slava.
* [PATCH v4 10/11] selftests: ceph: add validation harness
2026-05-07 12:27 [PATCH v4 00/11] ceph: manual client session reset Alex Markuze
` (8 preceding siblings ...)
2026-05-07 12:27 ` [PATCH v4 09/11] selftests: ceph: add reset corner-case tests Alex Markuze
@ 2026-05-07 12:27 ` Alex Markuze
2026-05-07 19:33 ` [EXTERNAL] " Viacheslav Dubeyko
2026-05-07 12:27 ` [PATCH v4 11/11] selftests: ceph: wire up Ceph reset kselftests and documentation Alex Markuze
2026-05-07 18:28 ` [EXTERNAL] [PATCH v4 00/11] ceph: manual client session reset Viacheslav Dubeyko
11 siblings, 1 reply; 24+ messages in thread
From: Alex Markuze @ 2026-05-07 12:27 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Add a one-shot validation wrapper that orchestrates the full reset
test suite with per-stage watchdog timeouts and a final status check.
The harness runs five stages: baseline (no resets), corner cases,
moderate stress, aggressive stress, and a post-run status validation.
Each stage runs with an independent timeout so a hang in one stage
does not block the entire run.
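The watchdog mechanism can be sketched as follows (a simplified illustration
of the idea, not the harness code itself; run_stage and its arguments are
made up for this sketch):

```shell
#!/bin/bash
# Minimal sketch of a per-stage watchdog: run each stage in its own
# process group via setsid, so a timeout can kill the stage together
# with every descendant it spawned.
run_stage() {
	local timeout_sec="$1"; shift

	# In a non-interactive script, a background child is not a process
	# group leader, so setsid(1) does not fork and $! IS the leader.
	setsid "$@" &
	local pid=$!

	# Watchdog: after timeout_sec, TERM the whole group (-pid).
	( sleep "$timeout_sec"; kill -TERM -- -"$pid" 2>/dev/null ) &
	local watchdog=$!

	wait "$pid"
	local rc=$?

	# Stage finished (or was killed): cancel and reap the watchdog.
	kill "$watchdog" 2>/dev/null
	wait "$watchdog" 2>/dev/null
	return "$rc"
}
```

A stage that finishes in time returns its own exit status; a hung stage is
killed as a group, so helper processes it forked cannot leak into later
stages.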
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
.../filesystems/ceph/run_validation.sh | 350 ++++++++++++++++++
1 file changed, 350 insertions(+)
create mode 100755 tools/testing/selftests/filesystems/ceph/run_validation.sh
diff --git a/tools/testing/selftests/filesystems/ceph/run_validation.sh b/tools/testing/selftests/filesystems/ceph/run_validation.sh
new file mode 100755
index 000000000000..5d521e4f9e9b
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/run_validation.sh
@@ -0,0 +1,350 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# CephFS client reset - single-command validation.
+# Runs all test stages in sequence with per-stage timeouts.
+# If any stage hangs (filesystem stuck, process blocked), the
+# timeout kills it and reports failure.
+#
+# Usage:
+# sudo ./run_validation.sh --mount-point /mnt/mycephfs
+#
+# Expected output on success:
+#
+# === CephFS Client Reset Validation ===
+# [stage 1/5] baseline PASS (60s, no resets)
+# [stage 2/5] corner_cases PASS (4/4 passed)
+# [stage 3/5] moderate PASS (120s, resets every 5-15s)
+# [stage 4/5] aggressive PASS (120s, resets every 1-5s)
+# [stage 5/5] status_check PASS (phase=idle, last_errno=0)
+#
+# RESULT: 5/5 stages passed
+# Artifacts: /tmp/ceph_reset_validation_<timestamp>
+
+set -uo pipefail
+
+KSFT_SKIP=4
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# kselftest auto-detect: when invoked with no arguments (e.g. by
+# "make run_tests"), find a CephFS mount automatically or skip.
+if [[ $# -eq 0 ]]; then
+ MOUNT_POINT="$(findmnt -t ceph -n -o TARGET 2>/dev/null | head -1)"
+ if [[ -z "$MOUNT_POINT" ]]; then
+ echo "SKIP: No CephFS mount found and --mount-point not specified"
+ exit "$KSFT_SKIP"
+ fi
+ exec "$0" --mount-point "$MOUNT_POINT"
+fi
+
+MOUNT_POINT=""
+CLIENT_ID=""
+declare -a CLIENT_ARGS=()
+declare -a DEBUGFS_ARGS=()
+RUN_ID="$(date +%Y%m%d-%H%M%S)"
+OUT_DIR="/tmp/ceph_reset_validation_${RUN_ID}"
+DEBUGFS_ROOT="/sys/kernel/debug/ceph"
+
+# Timeout margins: stage runtime + cooldown + validation + safety buffer
+STAGE1_TIMEOUT=120 # 60s run + 20s cooldown + 40s buffer
+STAGE2_TIMEOUT=300 # 4 corner cases, 30s each worst case + buffer
+STAGE3_TIMEOUT=240 # 120s run + 20s cooldown + 100s buffer
+STAGE4_TIMEOUT=240 # 120s run + 20s cooldown + 100s buffer
+STAGE5_TIMEOUT=10 # just reading debugfs
+
+PASS=0
+FAIL=0
+TOTAL=5
+
+usage()
+{
+ cat <<EOF
+Usage: $0 --mount-point <cephfs_mount> [options]
+
+Required:
+ --mount-point PATH CephFS mount point
+
+Options:
+ --out-dir PATH Artifact directory (default: /tmp/ceph_reset_validation_<ts>)
+ --client-id ID Ceph debugfs client id (optional)
+ --debugfs-root PATH Debugfs Ceph root (default: /sys/kernel/debug/ceph)
+ --help Show this message
+EOF
+}
+
+stage_result()
+{
+ local num="$1"
+ local name="$2"
+ local status="$3"
+ local detail="$4"
+
+ if [[ "$status" == "PASS" ]]; then
+ PASS=$((PASS + 1))
+ else
+ FAIL=$((FAIL + 1))
+ fi
+ printf '[stage %d/%d] %-16s %s (%s)\n' "$num" "$TOTAL" "$name" "$status" "$detail"
+}
+
+# Run a command with a timeout. Returns 0 on success, 1 on failure/timeout.
+# Sets RUN_TIMED_OUT=1 if killed by timeout.
+#
+# The stage command runs in its own session/process group (via setsid).
+# On timeout the entire process group is killed, not just the top-level
+# script PID. This is required because stage scripts (reset_stress.sh,
+# reset_corner_cases.sh) spawn child processes - I/O workers, rename
+# workers, reset injectors, samplers - that would otherwise survive the
+# timeout and bleed into later stages, invalidating results.
+RUN_TIMED_OUT=0
+
+run_with_timeout()
+{
+ local timeout_sec="$1"
+ local logfile="$2"
+ shift 2
+
+ RUN_TIMED_OUT=0
+
+ # Start the stage in its own session via setsid so all descendant
+ # processes share a process group that we can kill atomically.
+ # In a non-interactive script, background children are not process
+ # group leaders, so setsid(1) calls setsid(2) directly (no extra
+ # fork) and the PID we capture IS the group leader.
+ setsid "$@" > "$logfile" 2>&1 &
+ local pid=$!
+
+ # Watchdog: on timeout, kill the entire process group
+ (
+ sleep "$timeout_sec"
+ if kill -0 "$pid" 2>/dev/null; then
+ echo "TIMEOUT: stage exceeded ${timeout_sec}s, killing process group $pid" >> "$logfile"
+ kill -TERM -- -"$pid" 2>/dev/null
+ sleep 2
+ kill -KILL -- -"$pid" 2>/dev/null
+ fi
+ ) &
+ local watchdog_pid=$!
+
+ # Wait for the stage command
+ wait "$pid" 2>/dev/null
+ local rc=$?
+
+ # Kill the watchdog if it's still running
+ kill "$watchdog_pid" 2>/dev/null
+ wait "$watchdog_pid" 2>/dev/null
+
+ # Check if it was killed by timeout
+ if grep -q "^TIMEOUT:" "$logfile" 2>/dev/null; then
+ RUN_TIMED_OUT=1
+ return 1
+ fi
+
+ return "$rc"
+}
+
+find_status_path()
+{
+ local entry
+
+ if [[ -n "$CLIENT_ID" ]]; then
+ if [[ -r "$DEBUGFS_ROOT/$CLIENT_ID/reset/status" ]]; then
+ echo "$DEBUGFS_ROOT/$CLIENT_ID/reset/status"
+ return 0
+ fi
+ return 1
+ fi
+
+ for entry in "$DEBUGFS_ROOT"/*/; do
+ if [[ -r "${entry}reset/status" ]]; then
+ echo "${entry}reset/status"
+ return 0
+ fi
+ done
+ return 1
+}
+
+read_status_field()
+{
+ local status_path="$1"
+ local field="$2"
+ awk -F': ' -v key="$field" '$1 == key {print $2}' "$status_path" 2>/dev/null
+}
+
+# --- Parse arguments -------------------------------------------------------
+
+while [[ $# -gt 0 ]]; do
+ case "$1" in
+ --mount-point) MOUNT_POINT="$2"; shift 2 ;;
+ --out-dir) OUT_DIR="$2"; shift 2 ;;
+ --client-id) CLIENT_ID="$2"; shift 2 ;;
+ --debugfs-root) DEBUGFS_ROOT="$2"; shift 2 ;;
+ --help|-h) usage; exit 0 ;;
+ *) echo "Unknown option: $1" >&2; usage; exit 2 ;;
+ esac
+done
+
+if [[ -z "$MOUNT_POINT" ]]; then
+ echo "SKIP: --mount-point is required" >&2
+ usage
+ exit "$KSFT_SKIP"
+fi
+
+if [[ ! -d "$MOUNT_POINT" ]]; then
+ echo "SKIP: Mount point does not exist: $MOUNT_POINT" >&2
+ exit "$KSFT_SKIP"
+fi
+
+# Auto-detect client id when not specified, so all stages (including
+# stage 5 status check) use the same client consistently.
+if [[ -z "$CLIENT_ID" ]]; then
+ candidates=()
+ for entry in "$DEBUGFS_ROOT"/*/; do
+ name="$(basename "$entry")"
+ if [[ -r "${entry}reset/status" ]]; then
+ candidates+=("$name")
+ fi
+ done
+ if [[ ${#candidates[@]} -eq 1 ]]; then
+ CLIENT_ID="${candidates[0]}"
+ elif [[ ${#candidates[@]} -gt 1 ]]; then
+ echo "SKIP: Multiple Ceph clients found (${candidates[*]}). Use --client-id." >&2
+ exit "$KSFT_SKIP"
+ fi
+fi
+
+if [[ -n "$CLIENT_ID" ]]; then
+ CLIENT_ARGS=(--client-id "$CLIENT_ID")
+fi
+DEBUGFS_ARGS=(--debugfs-root "$DEBUGFS_ROOT")
+
+# Quick sanity: can we write to the mount?
+if ! touch "$MOUNT_POINT/.validation_probe_$$" 2>/dev/null; then
+ echo "SKIP: Mount point is not writable: $MOUNT_POINT" >&2
+ exit "$KSFT_SKIP"
+fi
+rm -f "$MOUNT_POINT/.validation_probe_$$"
+
+mkdir -p "$OUT_DIR"
+
+echo ""
+echo "=== CephFS Client Reset Validation ==="
+echo ""
+
+# --- Stage 1: Baseline (no resets) -----------------------------------------
+
+stage1_out="$OUT_DIR/stage1_baseline"
+if run_with_timeout "$STAGE1_TIMEOUT" "$stage1_out.log" \
+ "$SCRIPT_DIR/reset_stress.sh" \
+ --mount-point "$MOUNT_POINT" \
+ --profile baseline \
+ --no-reset \
+ --duration-sec 60 \
+ "${CLIENT_ARGS[@]}" \
+ "${DEBUGFS_ARGS[@]}" \
+ --out-dir "$stage1_out"; then
+ stage_result 1 "baseline" "PASS" "60s, no resets"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+ stage_result 1 "baseline" "FAIL" "HUNG: killed after ${STAGE1_TIMEOUT}s"
+else
+ stage_result 1 "baseline" "FAIL" "see $stage1_out.log"
+fi
+
+# --- Stage 2: Corner cases -------------------------------------------------
+
+stage2_out="$OUT_DIR/stage2_corner_cases"
+mkdir -p "$stage2_out"
+if run_with_timeout "$STAGE2_TIMEOUT" "$stage2_out/output.log" \
+ "$SCRIPT_DIR/reset_corner_cases.sh" \
+ "${CLIENT_ARGS[@]}" \
+ "${DEBUGFS_ARGS[@]}" \
+ --mount-point "$MOUNT_POINT"; then
+ pass_line=$(grep -Eo '[0-9]+ passed, [0-9]+ failed, [0-9]+ skipped' "$stage2_out/output.log" | tail -1)
+ stage_result 2 "corner_cases" "PASS" "${pass_line:-all tests passed}"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+ stage_result 2 "corner_cases" "FAIL" "HUNG: killed after ${STAGE2_TIMEOUT}s"
+else
+ fail_line=$(grep -c 'FAIL' "$stage2_out/output.log" 2>/dev/null)
+ [[ -n "$fail_line" ]] || fail_line="?"
+ stage_result 2 "corner_cases" "FAIL" "${fail_line} failures, see $stage2_out/output.log"
+fi
+
+# --- Stage 3: Moderate resets -----------------------------------------------
+
+stage3_out="$OUT_DIR/stage3_moderate"
+if run_with_timeout "$STAGE3_TIMEOUT" "$stage3_out.log" \
+ "$SCRIPT_DIR/reset_stress.sh" \
+ --mount-point "$MOUNT_POINT" \
+ --profile moderate \
+ --duration-sec 120 \
+ "${CLIENT_ARGS[@]}" \
+ "${DEBUGFS_ARGS[@]}" \
+ --out-dir "$stage3_out"; then
+ stage_result 3 "moderate" "PASS" "120s, resets every 5-15s"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+ stage_result 3 "moderate" "FAIL" "HUNG: killed after ${STAGE3_TIMEOUT}s"
+else
+ stage_result 3 "moderate" "FAIL" "see $stage3_out.log"
+fi
+
+# --- Stage 4: Aggressive resets ---------------------------------------------
+
+stage4_out="$OUT_DIR/stage4_aggressive"
+if run_with_timeout "$STAGE4_TIMEOUT" "$stage4_out.log" \
+ "$SCRIPT_DIR/reset_stress.sh" \
+ --mount-point "$MOUNT_POINT" \
+ --profile aggressive \
+ --duration-sec 120 \
+ "${CLIENT_ARGS[@]}" \
+ "${DEBUGFS_ARGS[@]}" \
+ --out-dir "$stage4_out"; then
+ stage_result 4 "aggressive" "PASS" "120s, resets every 1-5s"
+elif [[ "$RUN_TIMED_OUT" -eq 1 ]]; then
+ stage_result 4 "aggressive" "FAIL" "HUNG: killed after ${STAGE4_TIMEOUT}s"
+else
+ stage_result 4 "aggressive" "FAIL" "see $stage4_out.log"
+fi
+
+# --- Stage 5: Post-run status check ----------------------------------------
+
+status_path=""
+if status_path=$(find_status_path); then
+ phase=$(read_status_field "$status_path" "phase")
+ last_errno=$(read_status_field "$status_path" "last_errno")
+ failure_count=$(read_status_field "$status_path" "failure_count")
+ drain_timed_out=$(read_status_field "$status_path" "drain_timed_out")
+ sessions_reset=$(read_status_field "$status_path" "sessions_reset")
+ blocked=$(read_status_field "$status_path" "blocked_requests")
+
+ # Save full status
+ cat "$status_path" > "$OUT_DIR/final_status.txt" 2>/dev/null
+
+ errors=""
+ [[ "$phase" != "idle" ]] && errors="${errors}phase=$phase "
+ [[ "$last_errno" != "0" ]] && errors="${errors}last_errno=$last_errno "
+ [[ "$failure_count" != "0" && -n "$failure_count" ]] && errors="${errors}failure_count=$failure_count "
+ [[ -n "$blocked" && "$blocked" != "0" ]] && errors="${errors}blocked_requests=$blocked "
+
+ if [[ -z "$errors" ]]; then
+ detail="phase=$phase, last_errno=$last_errno, failure_count=${failure_count:-0}"
+ [[ "$drain_timed_out" == "yes" ]] && detail="$detail, drain_timed_out=yes"
+ [[ -n "$sessions_reset" ]] && detail="$detail, sessions_reset=$sessions_reset"
+ stage_result 5 "status_check" "PASS" "$detail"
+ else
+ stage_result 5 "status_check" "FAIL" "$errors"
+ fi
+else
+ stage_result 5 "status_check" "FAIL" "cannot read reset/status"
+fi
+
+# --- Summary ----------------------------------------------------------------
+
+echo ""
+if [[ "$FAIL" -eq 0 ]]; then
+ echo "RESULT: $PASS/$TOTAL stages passed"
+else
+ echo "RESULT: $PASS/$TOTAL stages passed, $FAIL FAILED"
+fi
+echo "Artifacts: $OUT_DIR"
+echo ""
+
+exit "$FAIL"
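For readers who want to reuse the watchdog pattern outside this harness: below is a minimal standalone sketch of the setsid + process-group kill that run_with_timeout() above implements (the demo commands are arbitrary, not part of the patch). It works because the backgrounded child of a non-interactive script is not a process group leader, so setsid(1) makes it the leader of a fresh group that `kill -- -$pid` can sweep whole.

```shell
#!/bin/sh
# Standalone sketch of the setsid + process-group watchdog used by
# run_with_timeout() above: the victim runs in its own session, so
# "kill -TERM -- -$pid" reaches every descendant, not just the
# captured top-level PID.
run_bounded() {
    timeout_sec="$1"; shift

    setsid "$@" &                  # child becomes group leader
    pid=$!

    (                              # watchdog: kill the whole group
        sleep "$timeout_sec"
        kill -0 "$pid" 2>/dev/null && kill -TERM -- -"$pid" 2>/dev/null
    ) &
    wd=$!

    wait "$pid"; rc=$?             # reap the stage command
    kill "$wd" 2>/dev/null         # cancel the watchdog if unused
    wait "$wd" 2>/dev/null
    return "$rc"
}

# A parent that spawns a sleeping child; both die at the 1s timeout.
run_bounded 1 sh -c 'sleep 30 & wait'
echo "rc=$?"                       # rc=143 (128+SIGTERM): group killed
```

The real run_with_timeout() additionally escalates TERM to KILL after a grace period and distinguishes "timed out" from "failed" via the log marker.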
--
2.34.1
* Re: [EXTERNAL] [PATCH v4 10/11] selftests: ceph: add validation harness
2026-05-07 12:27 ` [PATCH v4 10/11] selftests: ceph: add validation harness Alex Markuze
@ 2026-05-07 19:33 ` Viacheslav Dubeyko
0 siblings, 0 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-05-07 19:33 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> Add a one-shot validation wrapper that orchestrates the full reset
> test suite with per-stage watchdog timeouts and a final status check.
>
> The harness runs five stages: baseline (no resets), corner cases,
> moderate stress, aggressive stress, and a post-run status validation.
> Each stage runs with an independent timeout so a hang in one stage
> does not block the entire run.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> .../filesystems/ceph/run_validation.sh | 350 ++++++++++++++++++
> 1 file changed, 350 insertions(+)
> create mode 100755 tools/testing/selftests/filesystems/ceph/run_validation.sh
>
> [...]
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Thanks,
Slava.
* [PATCH v4 11/11] selftests: ceph: wire up Ceph reset kselftests and documentation
2026-05-07 12:27 [PATCH v4 00/11] ceph: manual client session reset Alex Markuze
` (9 preceding siblings ...)
2026-05-07 12:27 ` [PATCH v4 10/11] selftests: ceph: add validation harness Alex Markuze
@ 2026-05-07 12:27 ` Alex Markuze
2026-05-07 19:38 ` [EXTERNAL] " Viacheslav Dubeyko
2026-05-07 18:28 ` [EXTERNAL] [PATCH v4 00/11] ceph: manual client session reset Viacheslav Dubeyko
11 siblings, 1 reply; 24+ messages in thread
From: Alex Markuze @ 2026-05-07 12:27 UTC (permalink / raw)
To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze
Wire the CephFS reset test suite into the kselftest build:
- Add filesystems/ceph to the top-level selftests Makefile.
- Add the per-suite Makefile with run_validation.sh as TEST_PROGS.
- Add the settings file (kselftest timeout).
- Add the MAINTAINERS entry for the test directory.
- Add README with prerequisites, usage, and troubleshooting.
- Declare the dump_cap_flushes() loop counter at function scope and
  add a missing blank line in mds_client.h (incidental cleanups).
Signed-off-by: Alex Markuze <amarkuze@redhat.com>
---
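Note for reviewers unfamiliar with the kselftest glue being wired up here: the suite's scripts signal unmet prerequisites with exit code 4 (KSFT_SKIP), which the kselftest runner reports as a skip rather than a failure. A minimal self-contained sketch of that convention follows (the directory names are illustrative only, not taken from the patch):

```shell
#!/bin/sh
# Sketch of the kselftest SKIP convention: exit code 4 (KSFT_SKIP)
# tells the kselftest runner that an environment prerequisite is
# missing, so the test is reported as skipped, not failed.
KSFT_SKIP=4

check_prereq() {
    # Exit the calling (sub)shell with KSFT_SKIP when absent.
    if [ ! -d "$1" ]; then
        echo "SKIP: $1 not present"
        exit "$KSFT_SKIP"
    fi
    echo "prerequisite present: $1"
}

# Probe in subshells so a SKIP does not abort this demonstration.
( check_prereq /tmp )
( check_prereq /no/such/cephfs/probe ); echo "rc=$?"
```

run_validation.sh uses exactly this pattern for its mount-point, writability, and debugfs probes, which is why the `settings` timeout only ever bounds real runs, never environment skips.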
MAINTAINERS | 1 +
fs/ceph/mds_client.c | 3 +-
fs/ceph/mds_client.h | 1 +
tools/testing/selftests/Makefile | 1 +
.../selftests/filesystems/ceph/Makefile | 7 ++
.../testing/selftests/filesystems/ceph/README | 84 +++++++++++++++++++
.../selftests/filesystems/ceph/settings | 1 +
7 files changed, 97 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/filesystems/ceph/Makefile
create mode 100644 tools/testing/selftests/filesystems/ceph/README
create mode 100644 tools/testing/selftests/filesystems/ceph/settings
diff --git a/MAINTAINERS b/MAINTAINERS
index 2fb1c75afd16..bf6d973ac3fb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5905,6 +5905,7 @@ B: https://tracker.ceph.com/
T: git https://github.com/ceph/ceph-client.git
F: Documentation/filesystems/ceph.rst
F: fs/ceph/
+F: tools/testing/selftests/filesystems/ceph/
CERTIFICATE HANDLING
M: David Howells <dhowells@redhat.com>
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b16638ebff7f..3b6560da8c4e 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2359,6 +2359,7 @@ struct flush_dump_entry {
static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
{
struct ceph_client *cl = mdsc->fsc->client;
+ int i;
struct flush_dump_entry entries[CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES];
struct ceph_cap_flush *cf;
int n = 0, remaining = 0;
@@ -2388,7 +2389,7 @@ static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
pr_info_client(cl, "still waiting for cap flushes through %llu:\n",
want_tid);
- for (int i = 0; i < n; i++) {
+ for (i = 0; i < n; i++) {
struct flush_dump_entry *e = &entries[i];
if (e->ci_null)
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index b1a0621cd37e..731d6ad04956 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -121,6 +121,7 @@ static inline bool ceph_reset_is_idle(struct ceph_client_reset_state *st)
{
return READ_ONCE(st->phase) == CEPH_CLIENT_RESET_IDLE;
}
+
struct ceph_mds_cap_match {
s64 uid; /* default to MDS_AUTH_UID_ANY */
u32 num_gids;
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6e59b8f63e41..ab254ae793a9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -32,6 +32,7 @@ TARGETS += exec
TARGETS += fchmodat2
TARGETS += filesystems
TARGETS += filesystems/binderfs
+TARGETS += filesystems/ceph
TARGETS += filesystems/epoll
TARGETS += filesystems/fat
TARGETS += filesystems/overlayfs
diff --git a/tools/testing/selftests/filesystems/ceph/Makefile b/tools/testing/selftests/filesystems/ceph/Makefile
new file mode 100644
index 000000000000..4ad3e8d40d90
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+TEST_PROGS := run_validation.sh
+TEST_FILES := reset_stress.sh reset_corner_cases.sh \
+ validate_consistency.py README settings
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/filesystems/ceph/README b/tools/testing/selftests/filesystems/ceph/README
new file mode 100644
index 000000000000..eb0092b38f80
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/README
@@ -0,0 +1,84 @@
+# CephFS Client Reset Test Suite
+
+Test suite for the CephFS kernel client manual session reset feature.
+This trimmed set contains the single-client stress test, the targeted
+corner-case test, and the one-shot validation harness used during
+feature bring-up.
+
+## Prerequisites
+
+- Linux kernel with the CephFS client reset feature (this branch)
+- A running Ceph cluster with at least one MDS
+- Root access (debugfs requires it)
+- Python 3 (for validators)
+- flock utility (for lock tests, usually in util-linux)
+
+## Test inventory
+
+| Test | Script(s) | What it covers |
+|------|-----------|----------------|
+| Single-client stress | `reset_stress.sh` | I/O + resets + data integrity on one mount |
+| Corner cases | `reset_corner_cases.sh` | EBUSY, dirty caps, flock reclaim, unmount-during-reset |
+| Validation harness | `run_validation.sh` | baseline + corner cases + moderate/aggressive stress + final status check |
+
+## Quick start
+
+Stress run:
+
+ sudo ./reset_stress.sh --mount-point /mnt/cephfs --profile moderate
+
+Corner cases:
+
+ sudo ./reset_corner_cases.sh --mount-point /mnt/cephfs
+
+End-to-end validation:
+
+ sudo ./run_validation.sh --mount-point /mnt/cephfs
+
+## Stress profiles
+
+ baseline - no resets, 1 IO + 1 rename, 600s
+ moderate - reset every 5-15s, 2 IO + 1 rename, 900s
+ aggressive - reset every 1-5s, 4 IO + 2 rename, 900s
+ soak - reset every 5-15s, 2 IO + 1 rename, 3600s
+
+## Key options (all scripts)
+
+ --mount-point PATH CephFS mount point (required)
+ --client-id ID Debugfs client id (auto-detected if one)
+
+reset_stress.sh additionally accepts:
+
+ --profile NAME baseline|moderate|aggressive|soak
+ --duration-sec N Override profile runtime
+ --no-reset Disable reset injection
+ --out-dir PATH Artifact directory
+
+## Corner case tests
+
+ [1/4] ebusy_rejection Second reset rejected while first in-flight
+ [2/4] dirty_caps_at_reset Reset with unflushed dirty caps
+ [3/4] flock_after_reset Stale lock EIO + fresh lock after holder exit
+ [4/4] unmount_during_reset umount during active reset (destroy-path wakeup)
+
+Test 4 requires creating a second CephFS mount instance and SKIPs if
+the host cannot do so. See `--help` output for details.
+
+## Troubleshooting
+
+**No writable Ceph reset interface found:**
+Kernel lacks the reset feature, debugfs not mounted, or not root.
+Check: `ls /sys/kernel/debug/ceph/*/reset/`
+
+**Multiple Ceph clients found:**
+Use `--client-id` to select one.
+List: `ls /sys/kernel/debug/ceph/`
+
+## Files
+
+| File | Role |
+|------|------|
+| `reset_stress.sh` | Single-client stress test runner |
+| `validate_consistency.py` | Single-client post-run validator |
+| `reset_corner_cases.sh` | Corner case harness (4 sequential tests) |
+| `run_validation.sh` | One-shot validation harness |
diff --git a/tools/testing/selftests/filesystems/ceph/settings b/tools/testing/selftests/filesystems/ceph/settings
new file mode 100644
index 000000000000..79b65bdf05db
--- /dev/null
+++ b/tools/testing/selftests/filesystems/ceph/settings
@@ -0,0 +1 @@
+timeout=1200
--
2.34.1
* Re: [EXTERNAL] [PATCH v4 11/11] selftests: ceph: wire up Ceph reset kselftests and documentation
2026-05-07 12:27 ` [PATCH v4 11/11] selftests: ceph: wire up Ceph reset kselftests and documentation Alex Markuze
@ 2026-05-07 19:38 ` Viacheslav Dubeyko
0 siblings, 0 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-05-07 19:38 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> Wire the CephFS reset test suite into the kselftest build:
>
> - Add filesystems/ceph to the top-level selftests Makefile.
> - Add the per-suite Makefile with run_validation.sh as TEST_PROGS.
> - Add the settings file (kselftest timeout).
> - Add the MAINTAINERS entry for the test directory.
> - Add README with prerequisites, usage, and troubleshooting.
>
> Signed-off-by: Alex Markuze <amarkuze@redhat.com>
> ---
> MAINTAINERS | 1 +
> fs/ceph/mds_client.c | 3 +-
> fs/ceph/mds_client.h | 1 +
> tools/testing/selftests/Makefile | 1 +
> .../selftests/filesystems/ceph/Makefile | 7 ++
> .../testing/selftests/filesystems/ceph/README | 84 +++++++++++++++++++
> .../selftests/filesystems/ceph/settings | 1 +
> 7 files changed, 97 insertions(+), 1 deletion(-)
> create mode 100644 tools/testing/selftests/filesystems/ceph/Makefile
> create mode 100644 tools/testing/selftests/filesystems/ceph/README
> create mode 100644 tools/testing/selftests/filesystems/ceph/settings
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2fb1c75afd16..bf6d973ac3fb 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5905,6 +5905,7 @@ B:	https://tracker.ceph.com/
> T:	git https://github.com/ceph/ceph-client.git
> F: Documentation/filesystems/ceph.rst
> F: fs/ceph/
> +F: tools/testing/selftests/filesystems/ceph/
>
> CERTIFICATE HANDLING
> M: David Howells <dhowells@redhat.com>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index b16638ebff7f..3b6560da8c4e 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2359,6 +2359,7 @@ struct flush_dump_entry {
> static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
> {
> struct ceph_client *cl = mdsc->fsc->client;
> + int i;
> struct flush_dump_entry entries[CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES];
> struct ceph_cap_flush *cf;
> int n = 0, remaining = 0;
> @@ -2388,7 +2389,7 @@ static void dump_cap_flushes(struct ceph_mds_client *mdsc, u64 want_tid)
>
> pr_info_client(cl, "still waiting for cap flushes through %llu:\n",
> want_tid);
> - for (int i = 0; i < n; i++) {
> + for (i = 0; i < n; i++) {
> struct flush_dump_entry *e = &entries[i];
>
> if (e->ci_null)
> diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
> index b1a0621cd37e..731d6ad04956 100644
> --- a/fs/ceph/mds_client.h
> +++ b/fs/ceph/mds_client.h
> @@ -121,6 +121,7 @@ static inline bool ceph_reset_is_idle(struct ceph_client_reset_state *st)
> {
> return READ_ONCE(st->phase) == CEPH_CLIENT_RESET_IDLE;
> }
> +
> struct ceph_mds_cap_match {
> s64 uid; /* default to MDS_AUTH_UID_ANY */
> u32 num_gids;
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index 6e59b8f63e41..ab254ae793a9 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -32,6 +32,7 @@ TARGETS += exec
> TARGETS += fchmodat2
> TARGETS += filesystems
> TARGETS += filesystems/binderfs
> +TARGETS += filesystems/ceph
> TARGETS += filesystems/epoll
> TARGETS += filesystems/fat
> TARGETS += filesystems/overlayfs
> diff --git a/tools/testing/selftests/filesystems/ceph/Makefile b/tools/testing/selftests/filesystems/ceph/Makefile
> new file mode 100644
> index 000000000000..4ad3e8d40d90
> --- /dev/null
> +++ b/tools/testing/selftests/filesystems/ceph/Makefile
> @@ -0,0 +1,7 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +TEST_PROGS := run_validation.sh
> +TEST_FILES := reset_stress.sh reset_corner_cases.sh \
> + validate_consistency.py README settings
> +
> +include ../../lib.mk
> diff --git a/tools/testing/selftests/filesystems/ceph/README b/tools/testing/selftests/filesystems/ceph/README
> new file mode 100644
> index 000000000000..eb0092b38f80
> --- /dev/null
> +++ b/tools/testing/selftests/filesystems/ceph/README
> @@ -0,0 +1,84 @@
> +# CephFS Client Reset Test Suite
> +
> +Test suite for the CephFS kernel client manual session reset feature.
> +This trimmed set contains the single-client stress test, the targeted
> +corner-case test, and the one-shot validation harness used during
> +feature bring-up.
> +
> +## Prerequisites
> +
> +- Linux kernel with the CephFS client reset feature (this branch)
> +- A running Ceph cluster with at least one MDS
> +- Root access (debugfs requires it)
> +- Python 3 (for validators)
> +- flock utility (for lock tests, usually in util-linux)
> +
> +## Test inventory
> +
> +| Test | Script(s) | What it covers |
> +|------|-----------|----------------|
> +| Single-client stress | `reset_stress.sh` | I/O + resets + data integrity on one mount |
> +| Corner cases | `reset_corner_cases.sh` | EBUSY, dirty caps, flock reclaim, unmount-during-reset |
> +| Validation harness | `run_validation.sh` | baseline + corner cases + moderate/aggressive stress + final status check |
> +
> +## Quick start
> +
> +Stress run:
> +
> + sudo ./reset_stress.sh --mount-point /mnt/cephfs --profile moderate
> +
> +Corner cases:
> +
> + sudo ./reset_corner_cases.sh --mount-point /mnt/cephfs
> +
> +End-to-end validation:
> +
> + sudo ./run_validation.sh --mount-point /mnt/cephfs
> +
> +## Stress profiles
> +
> + baseline - no resets, 1 IO + 1 rename, 600s
> + moderate - reset every 5-15s, 2 IO + 1 rename, 900s
> + aggressive - reset every 1-5s, 4 IO + 2 rename, 900s
> + soak - reset every 5-15s, 2 IO + 1 rename, 3600s
> +
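The reset intervals above are ranges drawn per iteration. A minimal sketch of how such an interval might be computed (variable names are assumptions for illustration, not the script's actual code):

```shell
# Illustrative only: draw a random reset interval for the "moderate"
# profile (5-15s inclusive). RESET_MIN_SEC/RESET_MAX_SEC are assumed
# names; ${RANDOM:-7} falls back to a constant on shells without RANDOM.
RESET_MIN_SEC=5
RESET_MAX_SEC=15
interval=$(( ${RANDOM:-7} % (RESET_MAX_SEC - RESET_MIN_SEC + 1) + RESET_MIN_SEC ))
```

Each stress iteration would then sleep for $interval seconds before the next injected reset.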
> +## Key options (all scripts)
> +
> + --mount-point PATH CephFS mount point (required)
> + --client-id ID Debugfs client id (auto-detected if one)
> +
> +reset_stress.sh additionally accepts:
> +
> + --profile NAME baseline|moderate|aggressive|soak
> + --duration-sec N Override profile runtime
> + --no-reset Disable reset injection
> + --out-dir PATH Artifact directory
> +
> +## Corner case tests
> +
> + [1/4] ebusy_rejection Second reset rejected while first in-flight
> + [2/4] dirty_caps_at_reset Reset with unflushed dirty caps
> + [3/4] flock_after_reset Stale lock EIO + fresh lock after holder exit
> + [4/4] unmount_during_reset umount during active reset (destroy-path wakeup)
> +
> +Test 4 requires creating a second CephFS mount instance and SKIPs if
> +the host cannot do so. See `--help` output for details.
> +
> +## Troubleshooting
> +
> +**No writable Ceph reset interface found:**
> +Kernel lacks the reset feature, debugfs not mounted, or not root.
> +Check: `ls /sys/kernel/debug/ceph/*/reset/`
> +
> +**Multiple Ceph clients found:**
> +Use `--client-id` to select one.
> +List: `ls /sys/kernel/debug/ceph/`
> +
> +## Files
> +
> +| File | Role |
> +|------|------|
> +| `reset_stress.sh` | Single-client stress test runner |
> +| `validate_consistency.py` | Single-client post-run validator |
> +| `reset_corner_cases.sh` | Corner case harness (4 sequential tests) |
> +| `run_validation.sh` | One-shot validation harness |
> diff --git a/tools/testing/selftests/filesystems/ceph/settings b/tools/testing/selftests/filesystems/ceph/settings
> new file mode 100644
> index 000000000000..79b65bdf05db
> --- /dev/null
> +++ b/tools/testing/selftests/filesystems/ceph/settings
> @@ -0,0 +1 @@
> +timeout=1200
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Thanks,
Slava.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [EXTERNAL] [PATCH v4 00/11] ceph: manual client session reset
2026-05-07 12:27 [PATCH v4 00/11] ceph: manual client session reset Alex Markuze
` (10 preceding siblings ...)
2026-05-07 12:27 ` [PATCH v4 11/11] selftests: ceph: wire up Ceph reset kselftests and documentation Alex Markuze
@ 2026-05-07 18:28 ` Viacheslav Dubeyko
2026-05-08 17:49 ` Viacheslav Dubeyko
11 siblings, 1 reply; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-05-07 18:28 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> This series adds operator-initiated manual client session reset for
> CephFS, providing a controlled escape hatch for client/MDS stalemates
> in which caps, locks, or unsafe metadata state stop making forward
> progress.
>
> Motivation
>
> When a CephFS client enters a stalemate with the MDS -- stuck cap
> flushes, hung file locks, or unsafe requests that cannot be journaled --
> the only current recovery options are client eviction from the MDS side
> or a full client node restart. Both are disruptive and can cascade to
> other workloads on the same node.
>
> Manual reset gives the operator a targeted tool: block new metadata
> work, attempt a bounded best-effort drain of dirty client state while
> sessions are still alive, then tear sessions down and let new requests
> re-open fresh sessions. State that cannot drain (the stuck state
> causing the stalemate) is force-dropped -- that is the point of the
> reset.
>
> Design
>
> The reset is triggered via debugfs:
>
> echo "reason" > /sys/kernel/debug/ceph/<client>/reset/trigger
> cat /sys/kernel/debug/ceph/<client>/reset/status
>
> The state machine tracks four phases:
>
> IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE
>
> QUIESCING is set synchronously by schedule_reset() before the workqueue
> item is dispatched. This provides immediate request gating from the
> caller's context -- new metadata requests and file-lock acquisitions
> block the moment the operator triggers the reset, with no race window
> between scheduling and the work function starting. All non-IDLE phases
> block callers on blocked_wq; the hot path adds only a single READ_ONCE
> per request.
>
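The gating described above can be modeled in plain C11 for illustration. This is a sketch under assumed names, not the series' code; atomic_load_explicit(..., memory_order_relaxed) stands in for the kernel's READ_ONCE(), and the waitqueue is elided:

```c
/* Userspace model of the request gate: one lockless load per request
 * on the hot path; any non-IDLE phase blocks callers.  Names are
 * assumptions for illustration only. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum reset_phase { RESET_IDLE, RESET_QUIESCING, RESET_DRAINING, RESET_TEARDOWN };

static _Atomic int reset_phase_var = RESET_IDLE;

/* Hot path: a single relaxed load, no locks while idle. */
static bool request_must_block(void)
{
	return atomic_load_explicit(&reset_phase_var,
				    memory_order_relaxed) != RESET_IDLE;
}

/* schedule_reset() sets QUIESCING synchronously in the caller's
 * context, so gating takes effect before the work item runs. */
static void schedule_reset(void)
{
	atomic_store(&reset_phase_var, RESET_QUIESCING);
}

/* Work function returns the machine to IDLE and wakes waiters. */
static void reset_complete(void)
{
	atomic_store(&reset_phase_var, RESET_IDLE);
}
```

The point of the model: there is no window between "operator triggered" and "requests gated", because QUIESCING is published before the work is dispatched.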
> The drain phase uses a single shared deadline (bounded at 30 seconds)
> across all drain legs. It first waits for unsafe write requests
> (creates, renames, setattrs) to reach safe status, then flushes dirty
> caps and pushes pending cap releases, using whatever time remains
> within the shared deadline. Non-stuck state drains in milliseconds;
> stuck state times out and is force-dropped during teardown. The
> drain_timed_out flag is monotonic: once set by any drain leg, it stays
> true for the lifetime of the reset.
>
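A minimal sketch of the shared-deadline bookkeeping, under assumed helper names (illustrative only, not the patch's actual functions):

```c
/* One deadline shared by all drain legs; each leg consumes whatever
 * time remains.  drain_timed_out is monotonic for a reset cycle:
 * once any leg sets it, no later leg clears it. */
#include <assert.h>
#include <stdbool.h>

#define RESET_DRAIN_SEC 30	/* matches the 30s bound described above */

static bool drain_timed_out;	/* monotonic for one reset cycle */

static long drain_remaining(long now, long deadline)
{
	return deadline > now ? deadline - now : 0;
}

static void note_leg_result(long now, long deadline)
{
	if (drain_remaining(now, deadline) == 0)
		drain_timed_out = true;	/* set once, never cleared mid-reset */
}
```

Non-stuck state finishes with most of the budget unspent; a stuck leg exhausts the remainder, latches the flag, and teardown force-drops what is left.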
> The session teardown follows the established check_new_map()
> forced-close pattern: unregister sessions under mdsc->mutex, then
> clean up caps and requests under s->s_mutex. Reconnect is not
> attempted because the MDS only accepts CLIENT_RECONNECT during its
> own RECONNECT phase after restart, not from an active client. A
> SESSION_REQUEST_CLOSE is sent to each MDS before local teardown so
> the MDS can release server-side state promptly rather than waiting
> for session_autoclose timeout.
>
> Blocked callers are released when reset completes and observe the
> final result via -EAGAIN (reset failed, retry later) or 0 (success).
> Internal work-function errors such as -ENOMEM are not propagated to
> unrelated callers like open() or flock(); the detailed error remains
> in debugfs and tracepoints.
>
> The work function checks st->shutdown before each phase transition
> (DRAINING, TEARDOWN) so that state set by a concurrent
> ceph_mdsc_destroy() is not overwritten. If destroy already took
> ownership, the work function
> releases session references and returns without touching the state.
>
> The destroy path marks reset as failed and wakes blocked waiters
> before cancel_work_sync() so unmount does not stall.
>
> Patch breakdown
>
> Prep / cleanup:
>
> 1. Convert all CEPH_I_* inode flags to named bit-position constants
> and switch all flag modifications to atomic bitops (set_bit,
> clear_bit, test_and_clear_bit). The previous code mixed lockless
> atomics with non-atomic read-modify-write on the same unsigned
> long, which is a correctness hazard. Flag reads under i_ceph_lock
> that only test lock-serialised flags retain bitmask tests.
>
> 2. Fix a __force endian cast in reconnect_caps_cb() to use the
> proper cpu_to_le32() macro and the new test_bit() accessor.
>
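The flag conversion in patch 1 follows a pattern that can be sketched in userspace C11, with atomic_fetch_or()/atomic_fetch_and() standing in for the kernel's set_bit()/clear_bit(). Constants and names here are illustrative assumptions, not the series' definitions:

```c
/* Userspace model of the _BIT-constant pattern: bit positions are the
 * primary names, masks are derived, and every modification is an
 * atomic RMW so updates to one bit cannot clobber a concurrent
 * update to another bit in the same word. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CEPH_I_ERROR_WRITE_BIT	0
/* bit 1 intentionally unused, mirroring the series' inline comment */
#define CEPH_I_FLUSH_BIT	2
/* mask kept only where lockless readers still test by mask */
#define CEPH_I_FLUSH		(1UL << CEPH_I_FLUSH_BIT)

static _Atomic unsigned long i_ceph_flags;

static void set_flag(int bit)
{
	atomic_fetch_or(&i_ceph_flags, 1UL << bit);	/* like set_bit() */
}

static void clear_flag(int bit)
{
	atomic_fetch_and(&i_ceph_flags, ~(1UL << bit));	/* like clear_bit() */
}

static bool test_flag(int bit)
{
	return atomic_load(&i_ceph_flags) & (1UL << bit);
}
```

With a plain `flags |= CEPH_I_FLUSH` instead of the atomic OR, a concurrent clear of ERROR_WRITE could be lost; the RMW form above cannot lose it.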
> Hardening / diagnostics:
>
> 3. Harden send_mds_reconnect() with error return, early bailout for
> closed/rejected/unregistered sessions, state restoration on
> transient failure. Rewrite mds_peer_reset() to handle active-MDS
> (past RECONNECT phase) by tearing the session down locally.
>
> 4. Convert wait_caps_flush() to a diagnostic timeout loop that
> periodically dumps pending flush state, improving observability
> for reset-drain stalls and existing sync/writeback hangs.
>
> Core feature:
>
> 5. Add the reset state machine, request gating, session teardown
> work function, scheduling, and destroy-path coordination.
>
> 6. Add the debugfs trigger/status interface and four tracepoints
> (schedule, complete, blocked, unblocked).
>
> Testing:
>
> 7-11. kselftest-integrated shell tests split into five patches:
> data integrity checker (7), stress test with concurrent I/O and
> random-interval reset injection (8), targeted corner cases --
> overlapping resets, dirty data across reset, stale locks, unmount
> during reset (9), five-stage validation wrapper with per-stage
> timeouts (10), and kselftest Makefile/MAINTAINERS wiring (11).
> All 5 validation stages pass on a real CephFS cluster.
>
> Changes since v3
>
> - Rebased onto testing (7.1-rc1 + ceph fixes).
> - Dropped v3 patch 7 ("add trace points to the MDS client") --
> already upstream as d927a595ab2f.
> - Patch 1: fixed flags type from int to unsigned long in
> ceph_pool_perm_check() (Slava). Added commit message paragraph
> documenting the set_bit() conversion in ceph_finish_async_create().
> - Patch 3: moved xa_destroy() under s_mutex with comment explaining
> serialization against ceph_get_deleg_ino() (Slava). Added lock
> ordering comment at mdsc->mutex acquisition. Added comment
> explaining why mds_peer_reset() narrows the RECONNECT state check
> from >= to ==.
> - Patch 4: split CEPH_CAP_FLUSH_MAX_DUMP_COUNT into separate
> CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES (array bound) and
> CEPH_CAP_FLUSH_MAX_DUMP_ITERS (iteration limit) (Slava). Moved
> all flush timeout defines to mds_client.h alongside reset defines
> (Slava). Split comment block into per-field struct documentation
> and separate function safety comment for dump_cap_flushes() (Slava).
> Fixed for-loop variable declaration to match fs/ceph/ convention.
> Fixed commit message to reference the correct macro names and to
> stay within 72-column body width.
> - Patch 5: added bounded wait for unsafe write requests during the
> drain phase, using a shared deadline across all drain legs so the
> total drain time stays within CEPH_CLIENT_RESET_DRAIN_SEC. Made
> drain_timed_out monotonic (once set, stays true for the reset).
> Replaced spin_lock/spin_unlock around drain_timed_out writes with
> WRITE_ONCE() (Slava). Added ceph_reset_is_idle() inline helper
> (Slava). Added per-field comments to struct ceph_client_reset_state
> (Slava). Changed -EIO return to -EAGAIN for reset-failure
> signalling to callers (Slava). Increased CEPH_CLIENT_RESET_DRAIN_SEC
> from 5s to 30s (Slava). Added sessions[i] = NULL after
> ceph_put_mds_session() in teardown skip path (Slava). Added comment
> at out_sessions label explaining destroy ownership. Expanded
> msleep() comment explaining why event-based waiting is not viable.
> - Patch 6: tracepoint placement fixed to fire before -EAGAIN return.
> - Patch 11: added MAINTAINERS F: entry for the test directory and
> the filesystems/ceph line in the top-level selftests Makefile.
>
> Alex Markuze (11):
> ceph: convert inode flags to named bit positions and atomic bitops
> ceph: use proper endian conversion for flock_len in reconnect
> ceph: harden send_mds_reconnect and handle active-MDS peer reset
> ceph: add diagnostic timeout loop to wait_caps_flush()
> ceph: add client reset state machine and session teardown
> ceph: add manual reset debugfs control and tracepoints
> selftests: ceph: add reset consistency checker
> selftests: ceph: add reset stress test
> selftests: ceph: add reset corner-case tests
> selftests: ceph: add validation harness
> selftests: ceph: wire up Ceph reset kselftests and documentation
>
> MAINTAINERS | 1 +
> fs/ceph/addr.c | 20 +-
> fs/ceph/caps.c | 34 +-
> fs/ceph/debugfs.c | 103 +++
> fs/ceph/file.c | 13 +-
> fs/ceph/inode.c | 5 +-
> fs/ceph/locks.c | 38 +-
> fs/ceph/mds_client.c | 800 +++++++++++++++++-
> fs/ceph/mds_client.h | 52 +-
> fs/ceph/snap.c | 2 +-
> fs/ceph/super.h | 70 +-
> fs/ceph/xattr.c | 2 +-
> include/trace/events/ceph.h | 67 ++
> tools/testing/selftests/Makefile | 1 +
> .../selftests/filesystems/ceph/Makefile | 7 +
> .../testing/selftests/filesystems/ceph/README | 84 ++
> .../filesystems/ceph/reset_corner_cases.sh | 646 ++++++++++++++
> .../filesystems/ceph/reset_stress.sh | 694 +++++++++++++++
> .../filesystems/ceph/run_validation.sh | 350 ++++++++
> .../selftests/filesystems/ceph/settings | 1 +
> .../filesystems/ceph/validate_consistency.py | 297 +++++++
> 21 files changed, 3185 insertions(+), 102 deletions(-)
> create mode 100644 tools/testing/selftests/filesystems/ceph/Makefile
> create mode 100644 tools/testing/selftests/filesystems/ceph/README
> create mode 100755 tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
> create mode 100755 tools/testing/selftests/filesystems/ceph/reset_stress.sh
> create mode 100755 tools/testing/selftests/filesystems/ceph/run_validation.sh
> create mode 100644 tools/testing/selftests/filesystems/ceph/settings
> create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consistency.py
I was able to apply the patchset on the v.7.1-rc2 successfully. Let me run
xfstests for the patchset. I'll be back with results ASAP.
Thanks,
Slava.
^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [EXTERNAL] [PATCH v4 00/11] ceph: manual client session reset
2026-05-07 18:28 ` [EXTERNAL] [PATCH v4 00/11] ceph: manual client session reset Viacheslav Dubeyko
@ 2026-05-08 17:49 ` Viacheslav Dubeyko
0 siblings, 0 replies; 24+ messages in thread
From: Viacheslav Dubeyko @ 2026-05-08 17:49 UTC (permalink / raw)
To: Alex Markuze, ceph-devel; +Cc: linux-kernel, idryomov
On Thu, 2026-05-07 at 11:28 -0700, Viacheslav Dubeyko wrote:
> On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> > [...]
>
> I was able to apply the patchset on the v.7.1-rc2 successfully. Let me run
> xfstests for the patchset. I'll be back with results ASAP.
>
>
The xfstests run was successful. I don't see any critical issues with the
patchset.
Tested-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Thanks,
Slava.
^ permalink raw reply [flat|nested] 24+ messages in thread