From: Alex Markuze <amarkuze@redhat.com>
To: ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com,
vdubeyko@redhat.com, Alex Markuze <amarkuze@redhat.com>
Subject: [PATCH v4 00/11] ceph: manual client session reset
Date: Thu, 7 May 2026 12:27:26 +0000
Message-ID: <20260507122737.2804094-1-amarkuze@redhat.com>
This series adds operator-initiated manual client session reset for
CephFS, providing a controlled escape hatch for client/MDS stalemates
in which caps, locks, or unsafe metadata state stop making forward
progress.

Motivation

When a CephFS client enters a stalemate with the MDS -- stuck cap
flushes, hung file locks, or unsafe requests that cannot be journaled --
the only current recovery options are client eviction from the MDS side
or a full client node restart. Both are disruptive and can cascade to
other workloads on the same node.
Manual reset gives the operator a targeted tool: block new metadata
work, attempt a bounded best-effort drain of dirty client state while
sessions are still alive, then tear sessions down and let new requests
re-open fresh sessions. State that cannot drain (the stuck state
causing the stalemate) is force-dropped -- that is the point of the
reset.

Design

The reset is triggered via debugfs:
    echo "reason" > /sys/kernel/debug/ceph/<client>/reset/trigger
    cat /sys/kernel/debug/ceph/<client>/reset/status
The state machine tracks four phases:
    IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE
QUIESCING is set synchronously by schedule_reset() before the workqueue
item is dispatched. This provides immediate request gating from the
caller's context -- new metadata requests and file-lock acquisitions
block the moment the operator triggers the reset, with no race window
between scheduling and the work function starting. All non-IDLE phases
block callers on blocked_wq; the hot path adds only a single READ_ONCE
per request.
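The gating described above can be modelled in userspace. The following
is a minimal Python sketch, not the kernel code: the class and method
names are hypothetical, a threading.Condition stands in for blocked_wq,
and rejecting a second reset while one is in flight is an assumption
made for the example.

```python
import threading
from enum import Enum


class Phase(Enum):
    IDLE = 0
    QUIESCING = 1
    DRAINING = 2
    TEARDOWN = 3


class ResetGate:
    """Userspace model of the gate; names are illustrative, not kernel APIs."""

    def __init__(self):
        self.phase = Phase.IDLE
        self._cond = threading.Condition()  # stands in for blocked_wq

    def schedule_reset(self):
        """Enter QUIESCING in the caller's context, before any work runs."""
        with self._cond:
            if self.phase is not Phase.IDLE:
                return False  # assumption: an in-flight reset is not restarted
            self.phase = Phase.QUIESCING
            return True

    def advance(self, new_phase):
        """Called by the modelled work function on each phase transition."""
        with self._cond:
            self.phase = new_phase
            if new_phase is Phase.IDLE:
                self._cond.notify_all()  # release every blocked caller

    def wait_if_resetting(self, timeout=None):
        """Fast path is one unlocked read; block only while non-IDLE."""
        if self.phase is Phase.IDLE:  # models the single READ_ONCE
            return True
        with self._cond:
            return self._cond.wait_for(
                lambda: self.phase is Phase.IDLE, timeout)
```

Because schedule_reset() changes the phase before returning, a request
that arrives immediately afterwards already takes the slow path, which
is the "no race window" property the paragraph above describes.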
The drain phase uses a single shared deadline (bounded at 30 seconds)
across all drain legs. It first waits for unsafe write requests
(creates, renames, setattrs) to reach safe status, then flushes dirty
caps and pushes pending cap releases, using whatever time remains
within the shared deadline. Non-stuck state drains in milliseconds;
stuck state times out and is force-dropped during teardown. The
drain_timed_out flag is monotonic: once set by any drain leg, it stays
true for the lifetime of the reset.
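As an illustration of the shared deadline, here is a small Python
sketch under stated assumptions: run_drain() and the leg callables are
hypothetical names, and the sketch assumes that later legs still run
(with whatever time is left, possibly zero) after an earlier leg has
timed out. It is a model of the idea, not the kernel implementation.

```python
import time


def run_drain(legs, budget_sec, clock=time.monotonic):
    """Run drain legs against one shared deadline.

    Each leg is a callable that takes the remaining seconds and returns
    True if it drained in time. Returns the sticky timed-out flag.
    """
    deadline = clock() + budget_sec  # computed once, shared by all legs
    timed_out = False
    for leg in legs:
        remaining = max(0.0, deadline - clock())
        if not leg(remaining):
            timed_out = True  # monotonic: never cleared by a later leg
    return timed_out
```

Computing the deadline once, rather than giving each leg its own
timeout, is what keeps the total drain time within the single bound
regardless of how many legs there are.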
The session teardown follows the established check_new_map()
forced-close pattern: unregister sessions under mdsc->mutex, then
clean up caps and requests under s->s_mutex. Reconnect is not
attempted because the MDS only accepts CLIENT_RECONNECT during its
own RECONNECT phase after restart, not from an active client. A
SESSION_REQUEST_CLOSE is sent to each MDS before local teardown so
the MDS can release server-side state promptly rather than waiting
for session_autoclose timeout.
Blocked callers are released when reset completes and observe the
final result via -EAGAIN (reset failed, retry later) or 0 (success).
Internal work-function errors such as -ENOMEM are not propagated to
unrelated callers like open() or flock(); the detailed error remains
in debugfs and tracepoints.
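The split between what callers see and what diagnostics record can be
sketched as follows. This is a hedged Python illustration only;
finish_reset() is a hypothetical helper, not a function from the
series.

```python
import errno


def finish_reset(work_err):
    """Return (caller_result, diagnostic_detail) for a finished reset.

    work_err is 0 on success or a negative errno from the work function.
    """
    # Callers only ever observe 0 or -EAGAIN.
    caller_result = 0 if work_err == 0 else -errno.EAGAIN
    # The detailed error is kept for the status/trace side.
    return caller_result, work_err
```

For example, finish_reset(-errno.ENOMEM) hands -EAGAIN to the blocked
open() or flock() caller while preserving -ENOMEM for diagnostics.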
The work function checks st->shutdown before each phase transition
(DRAINING, TEARDOWN) so that state set by a concurrent
ceph_mdsc_destroy() is not overwritten. If destroy has already taken
ownership, the work function releases its session references and
returns without touching the state.
The destroy path marks reset as failed and wakes blocked waiters
before cancel_work_sync() so unmount does not stall.
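The ownership check can be modelled like this. The Python sketch below
is a simplified userspace illustration: ResetState, try_advance() and
reset_work() are hypothetical names, a plain lock replaces the kernel
locking, and the waking of blocked waiters is left out.

```python
import threading


class ResetState:
    def __init__(self):
        self.lock = threading.Lock()
        self.phase = "QUIESCING"  # set by the scheduler before work starts
        self.shutdown = False
        self.failed = False

    def try_advance(self, new_phase):
        """Move to new_phase unless destroy has taken ownership."""
        with self.lock:
            if self.shutdown:
                return False
            self.phase = new_phase
            return True

    def destroy(self):
        """Model of the destroy path: claim ownership, fail the reset."""
        with self.lock:
            self.shutdown = True
            self.failed = True


def reset_work(st, drain, teardown, put_sessions):
    """Modelled work function: re-check ownership before each phase."""
    if not st.try_advance("DRAINING"):
        put_sessions()  # destroy owns the state; only drop references
        return
    drain()
    if not st.try_advance("TEARDOWN"):
        put_sessions()
        return
    teardown()
    put_sessions()
    st.try_advance("IDLE")
```

If destroy() runs while the drain is in progress, the second check
fails and the work function returns after dropping its references,
leaving the phase and the failed flag exactly as destroy left them.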

Patch breakdown

Prep / cleanup:
1. Convert all CEPH_I_* inode flags to named bit-position constants
and switch all flag modifications to atomic bitops (set_bit,
clear_bit, test_and_clear_bit). The previous code mixed lockless
atomics with non-atomic read-modify-write on the same unsigned
long, which is a correctness hazard. Flag reads under i_ceph_lock
that only test lock-serialised flags retain bitmask tests.
2. Fix a __force endian cast in reconnect_caps_cb() to use the
proper cpu_to_le32() macro and the new test_bit() accessor.
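To make the hazard addressed by patch 1 concrete, the following Python
sketch replays one bad interleaving deterministically. It is a model
only: FLAG_A and FLAG_B are hypothetical names rather than real
CEPH_I_* flags, and the "threads" are simulated by ordering the
statements by hand.

```python
from enum import IntEnum


class FlagBit(IntEnum):
    # Hypothetical bit positions for illustration; not the CEPH_I_* set.
    FLAG_A_BIT = 0
    FLAG_B_BIT = 1


# Masks derived from the named bit positions.
FLAG_A = 1 << FlagBit.FLAG_A_BIT
FLAG_B = 1 << FlagBit.FLAG_B_BIT


def racy_interleaving():
    """Non-atomic 'flags |= FLAG_A' racing with an atomic set of FLAG_B."""
    flags = 0
    snapshot = flags           # writer 1: load
    flags |= FLAG_B            # writer 2: atomic set lands in between
    flags = snapshot | FLAG_A  # writer 1: store, clobbering writer 2
    return flags


def atomic_interleaving():
    """Both writers use an indivisible read-modify-write (models set_bit)."""
    flags = 0
    flags |= FLAG_B
    flags |= FLAG_A
    return flags
```

In the racy ordering FLAG_B is silently lost, which is the kind of
lost update that mixing atomic and non-atomic writers on one word can
produce; with every writer using an atomic bitop both bits survive.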
Hardening / diagnostics:
3. Harden send_mds_reconnect() with an error return, early bailout
for closed/rejected/unregistered sessions, and state restoration
on transient failure. Rewrite mds_peer_reset() to handle an active
MDS (one past its RECONNECT phase) by tearing the session down
locally.
4. Convert wait_caps_flush() to a diagnostic timeout loop that
periodically dumps pending flush state, improving observability
for reset-drain stalls and existing sync/writeback hangs.
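The diagnostic wait in patch 4 follows a common pattern: wait with a
timeout, and on each timeout report what is still pending instead of
sleeping silently. A minimal Python sketch of that pattern follows;
wait_with_diagnostics() and its parameters are hypothetical, and the
cap on the number of dumps is an assumption made for the example.

```python
import threading


def wait_with_diagnostics(done, pending, interval_sec, max_dumps, log=print):
    """Wait for `done` (a threading.Event), dumping state on each timeout.

    `pending` is a callable returning a printable description of what
    is still outstanding. Returns the number of dumps emitted.
    """
    dumps = 0
    while not done.wait(interval_sec):  # False means the interval elapsed
        if dumps < max_dumps:           # bound the diagnostic output
            log(f"still waiting; pending: {pending()}")
            dumps += 1
    return dumps
```

The wait itself stays unbounded, as in an ordinary blocking wait; only
the diagnostic output is bounded, so a long stall does not flood the
log.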
Core feature:
5. Add the reset state machine, request gating, session teardown
work function, scheduling, and destroy-path coordination.
6. Add the debugfs trigger/status interface and four tracepoints
(schedule, complete, blocked, unblocked).
Testing:
7-11. kselftest-integrated tests split into five patches:
data integrity checker (7), stress test with concurrent I/O and
random-interval reset injection (8), targeted corner cases --
overlapping resets, dirty data across reset, stale locks, unmount
during reset (9), five-stage validation wrapper with per-stage
timeouts (10), and kselftest Makefile/MAINTAINERS wiring (11).
All 5 validation stages pass on a real CephFS cluster.

Changes since v3

- Rebased onto testing (7.1-rc1 + ceph fixes).
- Dropped v3 patch 7 ("add trace points to the MDS client") --
already upstream as d927a595ab2f.
- Patch 1: fixed flags type from int to unsigned long in
ceph_pool_perm_check() (Slava). Added commit message paragraph
documenting the set_bit() conversion in ceph_finish_async_create().
- Patch 3: moved xa_destroy() under s_mutex with comment explaining
serialization against ceph_get_deleg_ino() (Slava). Added lock
ordering comment at mdsc->mutex acquisition. Added comment
explaining why mds_peer_reset() narrows the RECONNECT state check
from >= to ==.
- Patch 4: split CEPH_CAP_FLUSH_MAX_DUMP_COUNT into separate
CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES (array bound) and
CEPH_CAP_FLUSH_MAX_DUMP_ITERS (iteration limit) (Slava). Moved
all flush timeout defines to mds_client.h alongside reset defines
(Slava). Split comment block into per-field struct documentation
and separate function safety comment for dump_cap_flushes() (Slava).
Fixed for-loop variable declaration to match fs/ceph/ convention.
Fixed commit message to reference the correct macro names and to
stay within 72-column body width.
- Patch 5: added bounded wait for unsafe write requests during the
drain phase, using a shared deadline across all drain legs so the
total drain time stays within CEPH_CLIENT_RESET_DRAIN_SEC. Made
drain_timed_out monotonic (once set, stays true for the reset).
Replaced spin_lock/spin_unlock around drain_timed_out writes with
WRITE_ONCE() (Slava). Added ceph_reset_is_idle() inline helper
(Slava). Added per-field comments to struct ceph_client_reset_state
(Slava). Changed -EIO return to -EAGAIN for reset-failure
signalling to callers (Slava). Increased CEPH_CLIENT_RESET_DRAIN_SEC
from 5s to 30s (Slava). Added sessions[i] = NULL after
ceph_put_mds_session() in teardown skip path (Slava). Added comment
at out_sessions label explaining destroy ownership. Expanded
msleep() comment explaining why event-based waiting is not viable.
- Patch 6: tracepoint placement fixed to fire before -EAGAIN return.
- Patch 11: added MAINTAINERS F: entry for the test directory and
the filesystems/ceph line in the top-level selftests Makefile.
Alex Markuze (11):
ceph: convert inode flags to named bit positions and atomic bitops
ceph: use proper endian conversion for flock_len in reconnect
ceph: harden send_mds_reconnect and handle active-MDS peer reset
ceph: add diagnostic timeout loop to wait_caps_flush()
ceph: add client reset state machine and session teardown
ceph: add manual reset debugfs control and tracepoints
selftests: ceph: add reset consistency checker
selftests: ceph: add reset stress test
selftests: ceph: add reset corner-case tests
selftests: ceph: add validation harness
selftests: ceph: wire up Ceph reset kselftests and documentation
MAINTAINERS | 1 +
fs/ceph/addr.c | 20 +-
fs/ceph/caps.c | 34 +-
fs/ceph/debugfs.c | 103 +++
fs/ceph/file.c | 13 +-
fs/ceph/inode.c | 5 +-
fs/ceph/locks.c | 38 +-
fs/ceph/mds_client.c | 800 +++++++++++++++++-
fs/ceph/mds_client.h | 52 +-
fs/ceph/snap.c | 2 +-
fs/ceph/super.h | 70 +-
fs/ceph/xattr.c | 2 +-
include/trace/events/ceph.h | 67 ++
tools/testing/selftests/Makefile | 1 +
.../selftests/filesystems/ceph/Makefile | 7 +
.../testing/selftests/filesystems/ceph/README | 84 ++
.../filesystems/ceph/reset_corner_cases.sh | 646 ++++++++++++++
.../filesystems/ceph/reset_stress.sh | 694 +++++++++++++++
.../filesystems/ceph/run_validation.sh | 350 ++++++++
.../selftests/filesystems/ceph/settings | 1 +
.../filesystems/ceph/validate_consistency.py | 297 +++++++
21 files changed, 3185 insertions(+), 102 deletions(-)
create mode 100644 tools/testing/selftests/filesystems/ceph/Makefile
create mode 100644 tools/testing/selftests/filesystems/ceph/README
create mode 100755 tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
create mode 100755 tools/testing/selftests/filesystems/ceph/reset_stress.sh
create mode 100755 tools/testing/selftests/filesystems/ceph/run_validation.sh
create mode 100644 tools/testing/selftests/filesystems/ceph/settings
create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consistency.py
--
2.34.1