The Linux Kernel Mailing List
From: Viacheslav Dubeyko <vdubeyko@redhat.com>
To: Alex Markuze <amarkuze@redhat.com>, ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com
Subject: Re: [EXTERNAL] [PATCH v4 00/11] ceph: manual client session reset
Date: Thu, 07 May 2026 11:28:20 -0700	[thread overview]
Message-ID: <3bbabebd992c4efef81e4653ee2e04f4644ee57a.camel@redhat.com> (raw)
In-Reply-To: <20260507122737.2804094-1-amarkuze@redhat.com>

On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> This series adds operator-initiated manual client session reset for
> CephFS, providing a controlled escape hatch for client/MDS stalemates
> in which caps, locks, or unsafe metadata state stop making forward
> progress.
> 
> Motivation
> 
> When a CephFS client enters a stalemate with the MDS -- stuck cap
> flushes, hung file locks, or unsafe requests that cannot be journaled --
> the only current recovery options are client eviction from the MDS side
> or a full client node restart.  Both are disruptive and can cascade to
> other workloads on the same node.
> 
> Manual reset gives the operator a targeted tool: block new metadata
> work, attempt a bounded best-effort drain of dirty client state while
> sessions are still alive, then tear sessions down and let new requests
> re-open fresh sessions.  State that cannot drain (the stuck state
> causing the stalemate) is force-dropped -- that is the point of the
> reset.
> 
> Design
> 
> The reset is triggered via debugfs:
> 
>     echo "reason" > /sys/kernel/debug/ceph/<client>/reset/trigger
>     cat /sys/kernel/debug/ceph/<client>/reset/status
> 
> The state machine tracks four phases:
> 
>     IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE
> 
> QUIESCING is set synchronously by schedule_reset() before the workqueue
> item is dispatched.  This provides immediate request gating from the
> caller's context -- new metadata requests and file-lock acquisitions
> block the moment the operator triggers the reset, with no race window
> between scheduling and the work function starting.  All non-IDLE phases
> block callers on blocked_wq; the hot path adds only a single READ_ONCE
> per request.
> 
> The drain phase uses a single shared deadline (bounded at 30 seconds)
> across all drain legs.  It first waits for unsafe write requests
> (creates, renames, setattrs) to reach safe status, then flushes dirty
> caps and pushes pending cap releases, using whatever time remains
> within the shared deadline.  Non-stuck state drains in milliseconds;
> stuck state times out and is force-dropped during teardown.  The
> drain_timed_out flag is monotonic: once set by any drain leg, it stays
> true for the lifetime of the reset.
> 
> The session teardown follows the established check_new_map()
> forced-close pattern: unregister sessions under mdsc->mutex, then
> clean up caps and requests under s->s_mutex.  Reconnect is not
> attempted because the MDS only accepts CLIENT_RECONNECT during its
> own RECONNECT phase after restart, not from an active client.  A
> SESSION_REQUEST_CLOSE is sent to each MDS before local teardown so
> the MDS can release server-side state promptly rather than waiting
> for session_autoclose timeout.
> 
> Blocked callers are released when reset completes and observe the
> final result via -EAGAIN (reset failed, retry later) or 0 (success).
> Internal work-function errors such as -ENOMEM are not propagated to
> unrelated callers like open() or flock(); the detailed error remains
> in debugfs and tracepoints.
> 
> The work function checks st->shutdown before each phase transition
> (DRAINING, TEARDOWN) so that state owned by a concurrent
> ceph_mdsc_destroy() is not overwritten.  If destroy has already taken
> ownership, the work function releases its session references and
> returns without touching the state.
> 
> The destroy path marks reset as failed and wakes blocked waiters
> before cancel_work_sync() so unmount does not stall.
> 
> Patch breakdown
> 
> Prep / cleanup:
> 
>  1. Convert all CEPH_I_* inode flags to named bit-position constants
>     and switch all flag modifications to atomic bitops (set_bit,
>     clear_bit, test_and_clear_bit).  The previous code mixed lockless
>     atomics with non-atomic read-modify-write on the same unsigned
>     long, which is a correctness hazard.  Flag reads under i_ceph_lock
>     that only test lock-serialised flags retain bitmask tests.
> 
>  2. Fix a __force endian cast in reconnect_caps_cb() to use the
>     proper cpu_to_le32() macro and the new test_bit() accessor.
> 
> Hardening / diagnostics:
> 
>  3. Harden send_mds_reconnect() with error return, early bailout for
>     closed/rejected/unregistered sessions, state restoration on
>     transient failure.  Rewrite mds_peer_reset() to handle active-MDS
>     (past RECONNECT phase) by tearing the session down locally.
> 
>  4. Convert wait_caps_flush() to a diagnostic timeout loop that
>     periodically dumps pending flush state, improving observability
>     for reset-drain stalls and existing sync/writeback hangs.
> 
> Core feature:
> 
>  5. Add the reset state machine, request gating, session teardown
>     work function, scheduling, and destroy-path coordination.
> 
>  6. Add the debugfs trigger/status interface and four tracepoints
>     (schedule, complete, blocked, unblocked).
> 
> Testing:
> 
>  7-11. kselftest-integrated shell tests split into five patches:
>     data integrity checker (7), stress test with concurrent I/O and
>     random-interval reset injection (8), targeted corner cases --
>     overlapping resets, dirty data across reset, stale locks, unmount
>     during reset (9), five-stage validation wrapper with per-stage
>     timeouts (10), and kselftest Makefile/MAINTAINERS wiring (11).
>     All 5 validation stages pass on a real CephFS cluster.
> 
> Changes since v3
> 
>  - Rebased onto testing (7.1-rc1 + ceph fixes).
>  - Dropped v3 patch 7 ("add trace points to the MDS client") --
>    already upstream as d927a595ab2f.
>  - Patch 1: fixed flags type from int to unsigned long in
>    ceph_pool_perm_check() (Slava).  Added commit message paragraph
>    documenting the set_bit() conversion in ceph_finish_async_create().
>  - Patch 3: moved xa_destroy() under s_mutex with comment explaining
>    serialization against ceph_get_deleg_ino() (Slava).  Added lock
>    ordering comment at mdsc->mutex acquisition.  Added comment
>    explaining why mds_peer_reset() narrows the RECONNECT state check
>    from >= to ==.
>  - Patch 4: split CEPH_CAP_FLUSH_MAX_DUMP_COUNT into separate
>    CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES (array bound) and
>    CEPH_CAP_FLUSH_MAX_DUMP_ITERS (iteration limit) (Slava).  Moved
>    all flush timeout defines to mds_client.h alongside reset defines
>    (Slava).  Split comment block into per-field struct documentation
>    and separate function safety comment for dump_cap_flushes() (Slava).
>    Fixed for-loop variable declaration to match fs/ceph/ convention.
>    Fixed commit message to reference the correct macro names and to
>    stay within 72-column body width.
>  - Patch 5: added bounded wait for unsafe write requests during the
>    drain phase, using a shared deadline across all drain legs so the
>    total drain time stays within CEPH_CLIENT_RESET_DRAIN_SEC.  Made
>    drain_timed_out monotonic (once set, stays true for the reset).
>    Replaced spin_lock/spin_unlock around drain_timed_out writes with
>    WRITE_ONCE() (Slava).  Added ceph_reset_is_idle() inline helper
>    (Slava).  Added per-field comments to struct ceph_client_reset_state
>    (Slava).  Changed -EIO return to -EAGAIN for reset-failure
>    signalling to callers (Slava).  Increased CEPH_CLIENT_RESET_DRAIN_SEC
>    from 5s to 30s (Slava).  Added sessions[i] = NULL after
>    ceph_put_mds_session() in teardown skip path (Slava).  Added comment
>    at out_sessions label explaining destroy ownership.  Expanded
>    msleep() comment explaining why event-based waiting is not viable.
>  - Patch 6: tracepoint placement fixed to fire before -EAGAIN return.
>  - Patch 11: added MAINTAINERS F: entry for the test directory and
>    the filesystems/ceph line in the top-level selftests Makefile.
> 
> Alex Markuze (11):
>   ceph: convert inode flags to named bit positions and atomic bitops
>   ceph: use proper endian conversion for flock_len in reconnect
>   ceph: harden send_mds_reconnect and handle active-MDS peer reset
>   ceph: add diagnostic timeout loop to wait_caps_flush()
>   ceph: add client reset state machine and session teardown
>   ceph: add manual reset debugfs control and tracepoints
>   selftests: ceph: add reset consistency checker
>   selftests: ceph: add reset stress test
>   selftests: ceph: add reset corner-case tests
>   selftests: ceph: add validation harness
>   selftests: ceph: wire up Ceph reset kselftests and documentation
> 
>  MAINTAINERS                                   |   1 +
>  fs/ceph/addr.c                                |  20 +-
>  fs/ceph/caps.c                                |  34 +-
>  fs/ceph/debugfs.c                             | 103 +++
>  fs/ceph/file.c                                |  13 +-
>  fs/ceph/inode.c                               |   5 +-
>  fs/ceph/locks.c                               |  38 +-
>  fs/ceph/mds_client.c                          | 800 +++++++++++++++++-
>  fs/ceph/mds_client.h                          |  52 +-
>  fs/ceph/snap.c                                |   2 +-
>  fs/ceph/super.h                               |  70 +-
>  fs/ceph/xattr.c                               |   2 +-
>  include/trace/events/ceph.h                   |  67 ++
>  tools/testing/selftests/Makefile              |   1 +
>  .../selftests/filesystems/ceph/Makefile       |   7 +
>  .../testing/selftests/filesystems/ceph/README |  84 ++
>  .../filesystems/ceph/reset_corner_cases.sh    | 646 ++++++++++++++
>  .../filesystems/ceph/reset_stress.sh          | 694 +++++++++++++++
>  .../filesystems/ceph/run_validation.sh        | 350 ++++++++
>  .../selftests/filesystems/ceph/settings       |   1 +
>  .../filesystems/ceph/validate_consistency.py  | 297 +++++++
>  21 files changed, 3185 insertions(+), 102 deletions(-)
>  create mode 100644 tools/testing/selftests/filesystems/ceph/Makefile
>  create mode 100644 tools/testing/selftests/filesystems/ceph/README
>  create mode 100755 tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
>  create mode 100755 tools/testing/selftests/filesystems/ceph/reset_stress.sh
>  create mode 100755 tools/testing/selftests/filesystems/ceph/run_validation.sh
>  create mode 100644 tools/testing/selftests/filesystems/ceph/settings
>  create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consistency.py

I was able to apply the patchset on v7.1-rc2 successfully. Let me run
xfstests for the patchset. I'll be back with results ASAP.

Thanks,
Slava.


