[PATCH v3 00/11] ceph: manual client session reset

From: Alex Markuze @ 2026-04-29 12:51 UTC
  To: ceph-devel; +Cc: linux-kernel, idryomov, vdubeyko, Alex Markuze

This series adds an operator-initiated manual client session reset for
CephFS, providing a controlled escape hatch for client/MDS stalemates
in which caps, locks, or unsafe metadata state can no longer make
forward progress.

Motivation

When a CephFS client enters a stalemate with the MDS -- stuck cap
flushes, hung file locks, or unsafe requests that cannot be journaled --
the only current recovery options are client eviction from the MDS side
or a full client node restart.  Both are disruptive and can cascade to
other workloads on the same node.

Manual reset gives the operator a targeted tool: block new metadata
work, attempt a bounded best-effort drain of dirty client state while
sessions are still alive, then tear sessions down and let new requests
re-open fresh sessions.  State that cannot drain (the stuck state
causing the stalemate) is force-dropped -- that is the point of the
reset.

Design

The reset is triggered via debugfs:

    echo "reason" > /sys/kernel/debug/ceph/<client>/reset/trigger
    cat /sys/kernel/debug/ceph/<client>/reset/status

The state machine tracks four phases:

    IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE

QUIESCING is set synchronously by schedule_reset() before the workqueue
item is dispatched.  This provides immediate request gating from the
caller's context -- new metadata requests and file-lock acquisitions
block the moment the operator triggers the reset, with no race window
between scheduling and the work function starting.  All non-IDLE phases
block callers on blocked_wq; the hot path adds only a single READ_ONCE
per request.
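
Roughly, the phase set and the hot-path gate could look like this (the
enum and the reset_phase field name are illustrative, not the series'
verbatim identifiers; ceph_mdsc_wait_for_reset() is named later in this
letter):

    enum ceph_reset_phase {
            CEPH_RESET_IDLE,
            CEPH_RESET_QUIESCING,
            CEPH_RESET_DRAINING,
            CEPH_RESET_TEARDOWN,
    };

    /* Called at the top of metadata submission and file-lock paths.
     * One READ_ONCE and no lock taken when no reset is in flight. */
    static int ceph_reset_gate(struct ceph_mds_client *mdsc)
    {
            if (likely(READ_ONCE(mdsc->reset_phase) == CEPH_RESET_IDLE))
                    return 0;
            return ceph_mdsc_wait_for_reset(mdsc); /* sleeps on blocked_wq */
    }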

The drain phase (bounded at 5 seconds) flushes the MDS journal, dirty
caps, and pending cap releases.  Non-stuck state drains in milliseconds;
stuck state times out and is force-dropped during teardown.
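
The bounded drain has roughly this shape (ceph_flush_dirty_caps() is an
existing helper; the completion predicate here is hypothetical):

    #define CEPH_RESET_DRAIN_TIMEOUT (5 * HZ)

    static void ceph_reset_drain(struct ceph_mds_client *mdsc)
    {
            unsigned long deadline = jiffies + CEPH_RESET_DRAIN_TIMEOUT;

            ceph_flush_dirty_caps(mdsc);    /* kick the flushes */
            while (time_before(jiffies, deadline)) {
                    if (ceph_reset_drained(mdsc))   /* hypothetical */
                            break;
                    msleep(100);
            }
            /* Anything still pending here is the stuck state; it is
             * force-dropped in TEARDOWN. */
    }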

The session teardown follows the established check_new_map() forced-close
pattern: unregister sessions under mdsc->mutex, then clean up caps and
requests under s->s_mutex.  Reconnect is not attempted because the MDS
only accepts CLIENT_RECONNECT during its own RECONNECT phase after
restart, not from an active client.  A SESSION_REQUEST_CLOSE is sent to
each MDS before local teardown so the MDS can release server-side state
promptly rather than waiting for session_autoclose timeout.
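
In sketch form (the lock ordering is the point; __unregister_session(),
cleanup_session_requests() and remove_session_caps() are existing
mds_client.c helpers, but this body is illustrative, not the series'
verbatim code):

    static void ceph_reset_teardown(struct ceph_mds_client *mdsc)
    {
            int i;

            mutex_lock(&mdsc->mutex);
            for (i = 0; i < mdsc->max_sessions; i++) {
                    struct ceph_mds_session *s =
                            __ceph_lookup_mds_session(mdsc, i);

                    if (!s)
                            continue;
                    request_close_session(s);   /* SESSION_REQUEST_CLOSE */
                    __unregister_session(mdsc, s);  /* under mdsc->mutex */
                    mutex_unlock(&mdsc->mutex);

                    mutex_lock(&s->s_mutex);
                    cleanup_session_requests(mdsc, s);
                    remove_session_caps(s);     /* force-drop stuck caps */
                    mutex_unlock(&s->s_mutex);

                    ceph_put_mds_session(s);
                    mutex_lock(&mdsc->mutex);
            }
            mutex_unlock(&mdsc->mutex);
    }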

Blocked callers are released when the reset completes and observe the
final result.  The wait path re-verifies that the phase is still IDLE
under the lock after wakeup, looping back if a new reset was scheduled
in the interim.
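
The wait loop, roughly (reset_lock and reset_result are illustrative
field names; blocked_wq and the -EIO mapping are described elsewhere in
this letter):

    int ceph_mdsc_wait_for_reset(struct ceph_mds_client *mdsc)
    {
            int ret;

            for (;;) {
                    wait_event(mdsc->blocked_wq,
                               READ_ONCE(mdsc->reset_phase) ==
                               CEPH_RESET_IDLE);

                    spin_lock(&mdsc->reset_lock);
                    if (mdsc->reset_phase == CEPH_RESET_IDLE) {
                            ret = mdsc->reset_result;   /* 0 or -EIO */
                            spin_unlock(&mdsc->reset_lock);
                            return ret;
                    }
                    /* A new reset was scheduled between wakeup and
                     * taking the lock; wait again. */
                    spin_unlock(&mdsc->reset_lock);
            }
    }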

Patch breakdown

Prep / cleanup:

 1. Convert all CEPH_I_* inode flags to named bit-position constants
    and switch all flag modifications to atomic bitops (set_bit,
    clear_bit, test_and_clear_bit).  The previous code mixed lockless
    atomics with non-atomic read-modify-write on the same unsigned long,
    which is a correctness hazard.  Flag reads under i_ceph_lock that
    only test lock-serialised flags retain bitmask tests.  (A short
    sketch of the conversion follows this list.)

 2. Fix a __force endian cast in reconnect_caps_cb() to use the proper
    cpu_to_le32() macro and the new test_bit() accessor.
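
To make patch 1 concrete, a minimal before/after (the flag chosen and
its bit value are just an example; the series converts every CEPH_I_*
flag this way):

    #define CEPH_ASYNC_CREATE_BIT   12      /* named bit position */
    #define CEPH_I_ASYNC_CREATE     (1 << CEPH_ASYNC_CREATE_BIT)

    /* before: non-atomic RMW racing with lockless atomic bitops */
    ci->i_ceph_flags |= CEPH_I_ASYNC_CREATE;

    /* after: atomic bitop on the shared unsigned long */
    set_bit(CEPH_ASYNC_CREATE_BIT, &ci->i_ceph_flags);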

Hardening / diagnostics:

 3. Harden send_mds_reconnect() with error return, early bailout for
    closed/rejected/unregistered sessions, state restoration on
    transient failure.  Rewrite mds_peer_reset() to handle active-MDS
    (past RECONNECT phase) by tearing the session down locally.

 4. Convert wait_caps_flush() to a diagnostic timeout loop that
    periodically dumps pending flush state, improving observability
    for reset-drain stalls and existing sync/writeback hangs (sketched
    after this list).
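
The wait_caps_flush() change amounts to the following shape
(CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC and dump_cap_flushes() come from the
series; check_caps_flush() is the existing predicate):

    static void wait_caps_flush(struct ceph_mds_client *mdsc,
                                u64 want_flush_tid)
    {
            while (!wait_event_timeout(mdsc->cap_flushing_wq,
                            check_caps_flush(mdsc, want_flush_tid),
                            CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC * HZ)) {
                    /* Still stuck: dump pending state, keep waiting. */
                    dump_cap_flushes(mdsc, want_flush_tid);
            }
    }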

Core feature:

 5. Add the reset state machine, request gating, session teardown
    work function, scheduling, and destroy-path coordination.

 6. Add the debugfs trigger/status interface and four tracepoints
    (schedule, complete, blocked, unblocked); a sketch of the trigger
    wiring follows this list.
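
A sketch of the trigger side (handler and tracepoint names follow the
event list above but are assumptions, not the series' verbatim code):

    static ssize_t reset_trigger_write(struct file *file,
                                       const char __user *ubuf,
                                       size_t count, loff_t *ppos)
    {
            /* mdsc passed as data to debugfs_create_file() */
            struct ceph_mds_client *mdsc = file_inode(file)->i_private;
            char reason[64] = "";

            if (copy_from_user(reason, ubuf,
                               min(count, sizeof(reason) - 1)))
                    return -EFAULT;
            trace_ceph_reset_schedule(mdsc, reason);
            /* schedule_reset() sets QUIESCING synchronously, then
             * queues the work item. */
            ceph_mdsc_schedule_reset(mdsc, reason);
            return count;
    }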

Testing:

 7-11. kselftest-integrated shell tests split into five patches:
    data integrity checker (7), stress test with concurrent I/O and
    random-interval reset injection (8), targeted corner cases --
    overlapping resets, dirty data across reset, stale locks, unmount
    during reset (9), five-stage validation wrapper with per-stage
    timeouts (10), and kselftest Makefile/MAINTAINERS wiring (11).
    All five validation stages pass on a real CephFS cluster.

Changes since v2

 - Patch 1: restored CEPH_I_SHUTDOWN mask define for
   ceph_inode_is_shutdown().  Kept CEPH_I_ERROR_FILELOCK alive for
   bisectability (patch 2 removes it).  Unconditional set_bit() is
   kept for hint flags like ERROR_WRITE -- the extra test-before-set
   branch is redundant on a plain atomic bitop.  Added comments
   clarifying the under-lock flag re-read in ceph_pool_perm_check()
   and the bit-operation sequence in wake_async_create_waiters().
 - Patch 2: unchanged from v2.
 - Patch 3: moved the "reconnect start" log after the early-bailout
   checks.  Added explicit CLOSED/REJECTED cases in the
   mds_peer_reset() switch.
 - Patch 4: reworked dump_cap_flushes() to collect data under
   cap_dirty_lock into an on-stack array and print after releasing
   the lock (see the sketch after this list).  Added truncation count
   and a final suppression message.  Changed null cf->ci to
   WARN_ON_ONCE.  Widened the snap field to u64 to match the
   ceph_snap() return type.  Documented the cf->ci inode-lifetime
   guarantee and flush-tid monotonic ordering (acks are processed in
   order under i_ceph_lock, so the latest-ack store is correct without
   max()).  The timeout expression CEPH_CAP_FLUSH_WAIT_TIMEOUT_SEC * HZ
   is kept because 60 * HZ is bounded and no conversion helper is
   needed here.
 - Patch 5: ceph_mdsc_wait_for_reset() maps all internal errors to
   -EIO so unrelated callers (open, flock) do not see raw
   work-function failures like -ENOMEM.  Added max_t() guard on
   deadline-jiffies underflow.  Added st->shutdown checks before
   DRAINING and TEARDOWN phase transitions so a concurrent
   ceph_mdsc_destroy() is not overwritten by the work function.
   Documented the close-grace sleep as a best-effort nudge.
 - Patch 6: removed unused debugfs_reset_trigger and
   debugfs_reset_status struct fields.  Added null-safe access for
   monc.auth->global_id in all four reset tracepoints.
 - Old patch 7: split into five smaller patches (7-11) ordered by
   file dependency.
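
For reference, the collect-then-print shape from the patch 4 rework
above (cf->tid and cf->caps exist today, cf->ci is added by the
series, and the message format here is made up for illustration):

    static void dump_cap_flushes(struct ceph_mds_client *mdsc,
                                 u64 want_tid)
    {
            struct { u64 tid; int caps; } snap[32];
            struct ceph_cap_flush *cf;
            int i, n = 0, truncated = 0;

            spin_lock(&mdsc->cap_dirty_lock);
            list_for_each_entry(cf, &mdsc->cap_flush_list, g_list) {
                    if (WARN_ON_ONCE(!cf->ci))
                            continue;
                    if (n == ARRAY_SIZE(snap)) {
                            truncated++;
                            continue;
                    }
                    snap[n].tid = cf->tid;
                    snap[n].caps = cf->caps;
                    n++;
            }
            spin_unlock(&mdsc->cap_dirty_lock);

            /* Print only after cap_dirty_lock is dropped. */
            pr_info("ceph: waiting for flush tid %llu\n", want_tid);
            for (i = 0; i < n; i++)
                    pr_info("ceph: pending flush tid %llu caps %s\n",
                            snap[i].tid, ceph_cap_string(snap[i].caps));
            if (truncated)
                    pr_info("ceph: %d more flushes suppressed\n",
                            truncated);
    }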

Changes since v1

 - Patch 1 now converts ALL flag modifications to atomic bitops,
   eliminating the mixed atomic/non-atomic RMW hazard on the shared
   i_ceph_flags field.  Removed mask defines for flags that are only
   accessed via the _BIT form.  Fixed Co-authored-by -> Co-developed-by.
 - Rewrote mds_peer_reset() to handle active-MDS state correctly
   instead of sending a doomed reconnect (patch 3).
 - Added state restoration in send_mds_reconnect() failure path to
   prevent sessions stranding in RECONNECTING (patch 3).
 - Added early bailout checks in send_mds_reconnect() for
   closed/rejected/unregistered sessions (patch 3).
 - Added diagnostic timeout loop to wait_caps_flush() (patch 4).
 - Patch 5 commit message now documents the QUIESCING phase purpose
   (synchronous request gating before async work dispatch).
 - Fixed TOCTOU in ceph_mdsc_wait_for_reset(): the phase is now
   re-verified under the lock after wakeup, with a deadline-based
   retry loop if a new reset was scheduled in the interim (patch 5).
 - Added tracepoints for reset lifecycle events (patch 6).
 - Added selftest validation harness with kselftest integration
   (patch 7).
 - Rebased onto v7.0.

Alex Markuze (11):
  ceph: convert inode flags to named bit positions and atomic bitops
  ceph: use proper endian conversion for flock_len in reconnect
  ceph: harden send_mds_reconnect and handle active-MDS peer reset
  ceph: add diagnostic timeout loop to wait_caps_flush()
  ceph: add client reset state machine and session teardown
  ceph: add manual reset debugfs control and tracepoints
  selftests: ceph: add reset consistency checker
  selftests: ceph: add reset stress test
  selftests: ceph: add reset corner-case tests
  selftests: ceph: add validation harness
  selftests: ceph: wire up Ceph reset kselftests and documentation

 MAINTAINERS                                   |   1 +
 fs/ceph/addr.c                                |  17 +-
 fs/ceph/caps.c                                |  34 +-
 fs/ceph/debugfs.c                             | 102 +++
 fs/ceph/file.c                                |  13 +-
 fs/ceph/inode.c                               |   5 +-
 fs/ceph/locks.c                               |  38 +-
 fs/ceph/mds_client.c                          | 735 +++++++++++++++++-
 fs/ceph/mds_client.h                          |  44 +-
 fs/ceph/snap.c                                |   2 +-
 fs/ceph/super.h                               |  70 +-
 fs/ceph/xattr.c                               |   2 +-
 include/trace/events/ceph.h                   |  67 ++
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/filesystems/ceph/Makefile       |   7 +
 .../testing/selftests/filesystems/ceph/README |  84 ++
 .../filesystems/ceph/reset_corner_cases.sh    | 646 +++++++++++++++
 .../filesystems/ceph/reset_stress.sh          | 694 +++++++++++++++++
 .../filesystems/ceph/run_validation.sh        | 350 +++++++++
 .../selftests/filesystems/ceph/settings       |   1 +
 .../filesystems/ceph/validate_consistency.py  | 297 +++++++
 21 files changed, 3110 insertions(+), 100 deletions(-)
 create mode 100644 tools/testing/selftests/filesystems/ceph/Makefile
 create mode 100644 tools/testing/selftests/filesystems/ceph/README
 create mode 100755 tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
 create mode 100755 tools/testing/selftests/filesystems/ceph/reset_stress.sh
 create mode 100755 tools/testing/selftests/filesystems/ceph/run_validation.sh
 create mode 100644 tools/testing/selftests/filesystems/ceph/settings
 create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consistency.py

-- 
2.34.1

