From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [EXTERNAL] [PATCH v4 00/11] ceph: manual client session reset
From: Viacheslav Dubeyko
To: Alex Markuze, ceph-devel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, idryomov@gmail.com
Date: Fri, 08 May 2026 10:49:48 -0700
In-Reply-To: <3bbabebd992c4efef81e4653ee2e04f4644ee57a.camel@redhat.com>
References: <20260507122737.2804094-1-amarkuze@redhat.com>
	 <3bbabebd992c4efef81e4653ee2e04f4644ee57a.camel@redhat.com>
Content-Type: text/plain; charset="UTF-8"
MIME-Version: 1.0
User-Agent: Evolution 3.60.0 (3.60.0-1.fc44app2)

On Thu, 2026-05-07 at 11:28 -0700, Viacheslav Dubeyko wrote:
> On Thu, 2026-05-07 at 12:27 +0000, Alex Markuze wrote:
> > This series adds operator-initiated manual client session reset for
> > CephFS, providing a controlled escape hatch for client/MDS stalemates
> > in which caps, locks, or unsafe metadata state stop making forward
> > progress.
> > 
> > Motivation
> > 
> > When a CephFS client enters a stalemate with the MDS -- stuck cap
> > flushes, hung file locks, or unsafe requests that cannot be
> > journaled -- the only current recovery options are client eviction
> > from the MDS side or a full client node restart. Both are disruptive
> > and can cascade to other workloads on the same node.
> > 
> > Manual reset gives the operator a targeted tool: block new metadata
> > work, attempt a bounded best-effort drain of dirty client state while
> > sessions are still alive, then tear sessions down and let new requests
> > re-open fresh sessions. State that cannot drain (the stuck state
> > causing the stalemate) is force-dropped -- that is the point of the
> > reset.
> > 
> > Design
> > 
> > The reset is triggered via debugfs:
> > 
> >   echo "reason" > /sys/kernel/debug/ceph/<fsid.client-id>/reset/trigger
> >   cat /sys/kernel/debug/ceph/<fsid.client-id>/reset/status
> > 
> > The state machine tracks four phases:
> > 
> >   IDLE -> QUIESCING -> DRAINING -> TEARDOWN -> IDLE
> > 
> > QUIESCING is set synchronously by schedule_reset() before the workqueue
> > item is dispatched. This provides immediate request gating from the
> > caller's context -- new metadata requests and file-lock acquisitions
> > block the moment the operator triggers the reset, with no race window
> > between scheduling and the work function starting. All non-IDLE phases
> > block callers on blocked_wq; the hot path adds only a single READ_ONCE
> > per request.
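> > 
> > As a rough sketch of what that gating looks like (illustrative only:
> > the exact struct layout, the st->lock spinlock, and the helper name
> > ceph_reset_gate() are assumptions, not the literal patch code):
> > 
> >   #include <linux/compiler.h>
> >   #include <linux/spinlock.h>
> >   #include <linux/types.h>
> >   #include <linux/wait.h>
> > 
> >   enum ceph_reset_phase {
> >   	CEPH_RESET_IDLE,
> >   	CEPH_RESET_QUIESCING,
> >   	CEPH_RESET_DRAINING,
> >   	CEPH_RESET_TEARDOWN,
> >   };
> > 
> >   struct ceph_client_reset_state {
> >   	enum ceph_reset_phase phase;	/* QUIESCING set by schedule_reset() */
> >   	bool drain_timed_out;		/* monotonic within one reset */
> >   	bool shutdown;			/* destroy path took ownership */
> >   	int result;			/* 0 or -EAGAIN, seen by waiters */
> >   	spinlock_t lock;		/* assumed: guards phase/shutdown */
> >   	wait_queue_head_t blocked_wq;	/* non-IDLE phases park callers */
> >   };
> > 
> >   /* Hot path: a single READ_ONCE; callers sleep only while a reset
> >    * is in flight, then observe its outcome.
> >    */
> >   static int ceph_reset_gate(struct ceph_client_reset_state *st)
> >   {
> >   	if (likely(READ_ONCE(st->phase) == CEPH_RESET_IDLE))
> >   		return 0;
> >   	wait_event(st->blocked_wq,
> >   		   READ_ONCE(st->phase) == CEPH_RESET_IDLE);
> >   	return READ_ONCE(st->result);
> >   }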
> > 
> > The drain phase uses a single shared deadline (bounded at 30 seconds)
> > across all drain legs. It first waits for unsafe write requests
> > (creates, renames, setattrs) to reach safe status, then flushes dirty
> > caps and pushes pending cap releases, using whatever time remains
> > within the shared deadline. Non-stuck state drains in milliseconds;
> > stuck state times out and is force-dropped during teardown. The
> > drain_timed_out flag is monotonic: once set by any drain leg, it stays
> > true for the lifetime of the reset.
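> > 
> > Shape-wise the drain leg sequencing is roughly the following sketch
> > (drain_unsafe_requests(), drain_dirty_caps() and push_cap_releases()
> > are illustrative stand-ins for the real drain helpers):
> > 
> >   #include <linux/jiffies.h>
> > 
> >   #define CEPH_CLIENT_RESET_DRAIN_SEC 30	/* shared drain budget */
> > 
> >   struct ceph_mds_client;	/* fs/ceph MDS client, defined elsewhere */
> > 
> >   bool drain_unsafe_requests(struct ceph_mds_client *mdsc,
> >   			     unsigned long deadline);
> >   bool drain_dirty_caps(struct ceph_mds_client *mdsc,
> >   			unsigned long deadline);
> >   void push_cap_releases(struct ceph_mds_client *mdsc);
> > 
> >   static void ceph_reset_drain(struct ceph_mds_client *mdsc,
> >   			       struct ceph_client_reset_state *st)
> >   {
> >   	/* One deadline shared by every drain leg. */
> >   	unsigned long deadline =
> >   		jiffies + CEPH_CLIENT_RESET_DRAIN_SEC * HZ;
> > 
> >   	/* Leg 1: wait for unsafe creates/renames/setattrs to go safe. */
> >   	if (!drain_unsafe_requests(mdsc, deadline))
> >   		WRITE_ONCE(st->drain_timed_out, true);	/* never cleared */
> > 
> >   	/* Leg 2: flush dirty caps with whatever budget remains. */
> >   	if (!drain_dirty_caps(mdsc, deadline))
> >   		WRITE_ONCE(st->drain_timed_out, true);
> > 
> >   	/* Leg 3: push pending cap releases; best effort, no new budget. */
> >   	push_cap_releases(mdsc);
> >   }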
> > 
> > The session teardown follows the established check_new_map()
> > forced-close pattern: unregister sessions under mdsc->mutex, then
> > clean up caps and requests under s->s_mutex. Reconnect is not
> > attempted because the MDS only accepts CLIENT_RECONNECT during its
> > own RECONNECT phase after restart, not from an active client. A
> > SESSION_REQUEST_CLOSE is sent to each MDS before local teardown so
> > the MDS can release server-side state promptly rather than waiting
> > for the session_autoclose timeout.
> > 
> > Blocked callers are released when the reset completes and observe the
> > final result via -EAGAIN (reset failed, retry later) or 0 (success).
> > Internal work-function errors such as -ENOMEM are not propagated to
> > unrelated callers like open() or flock(); the detailed error remains
> > in debugfs and tracepoints.
> > 
> > The work function checks st->shutdown before each phase transition
> > (DRAINING, TEARDOWN) so that state already owned by a concurrent
> > ceph_mdsc_destroy() is not overwritten. If destroy already took
> > ownership, the work function releases its session references and
> > returns without touching the state.
> > 
> > The destroy path marks the reset as failed and wakes blocked waiters
> > before cancel_work_sync() so unmount does not stall.
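> > 
> > In code the destroy coordination is roughly this sketch (again with
> > illustrative names; st->lock is the assumed spinlock from the gating
> > sketch above):
> > 
> >   #include <linux/errno.h>
> >   #include <linux/workqueue.h>
> > 
> >   /* Work function side: re-check ownership at each phase transition. */
> >   static bool ceph_reset_advance(struct ceph_client_reset_state *st,
> >   			         enum ceph_reset_phase next)
> >   {
> >   	bool ours;
> > 
> >   	spin_lock(&st->lock);
> >   	ours = !st->shutdown;	/* destroy may have taken ownership */
> >   	if (ours)
> >   		WRITE_ONCE(st->phase, next);
> >   	spin_unlock(&st->lock);
> >   	return ours;
> >   }
> > 
> >   /* Destroy side: fail the reset and wake waiters *before* flushing
> >    * the work item, so unmount never stalls behind blocked callers.
> >    */
> >   static void ceph_reset_shutdown(struct ceph_client_reset_state *st,
> >   				  struct work_struct *reset_work)
> >   {
> >   	spin_lock(&st->lock);
> >   	st->shutdown = true;
> >   	st->result = -EAGAIN;
> >   	WRITE_ONCE(st->phase, CEPH_RESET_IDLE);
> >   	spin_unlock(&st->lock);
> >   	wake_up_all(&st->blocked_wq);
> >   	cancel_work_sync(reset_work);
> >   }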
> > 
> > Patch breakdown
> > 
> > Prep / cleanup:
> > 
> >  1. Convert all CEPH_I_* inode flags to named bit-position constants
> >     and switch all flag modifications to atomic bitops (set_bit,
> >     clear_bit, test_and_clear_bit). The previous code mixed lockless
> >     atomics with non-atomic read-modify-write on the same unsigned
> >     long, which is a correctness hazard. Flag reads under i_ceph_lock
> >     that only test lock-serialised flags retain bitmask tests.
> > 
> >  2. Fix a __force endian cast in reconnect_caps_cb() to use the
> >     proper cpu_to_le32() macro and the new test_bit() accessor.
> > 
> > Hardening / diagnostics:
> > 
> >  3. Harden send_mds_reconnect() with an error return, an early bailout
> >     for closed/rejected/unregistered sessions, and state restoration
> >     on transient failure. Rewrite mds_peer_reset() to handle an active
> >     MDS (past its RECONNECT phase) by tearing the session down locally.
> > 
> >  4. Convert wait_caps_flush() to a diagnostic timeout loop that
> >     periodically dumps pending flush state, improving observability
> >     for reset-drain stalls and existing sync/writeback hangs.
> > 
> > Core feature:
> > 
> >  5. Add the reset state machine, request gating, session teardown
> >     work function, scheduling, and destroy-path coordination.
> > 
> >  6. Add the debugfs trigger/status interface and four tracepoints
> >     (schedule, complete, blocked, unblocked).
> > 
> > Testing:
> > 
> >  7-11. kselftest-integrated shell tests split into five patches:
> >     data integrity checker (7), stress test with concurrent I/O and
> >     random-interval reset injection (8), targeted corner cases --
> >     overlapping resets, dirty data across reset, stale locks, unmount
> >     during reset (9), five-stage validation wrapper with per-stage
> >     timeouts (10), and kselftest Makefile/MAINTAINERS wiring (11).
> >     All five validation stages pass on a real CephFS cluster.
> > 
> > Changes since v3
> > 
> > - Rebased onto testing (7.1-rc1 + ceph fixes).
> > - Dropped v3 patch 7 ("add trace points to the MDS client") --
> >   already upstream as d927a595ab2f.
> > - Patch 1: fixed flags type from int to unsigned long in
> >   ceph_pool_perm_check() (Slava). Added a commit message paragraph
> >   documenting the set_bit() conversion in ceph_finish_async_create().
> > - Patch 3: moved xa_destroy() under s_mutex with a comment explaining
> >   serialization against ceph_get_deleg_ino() (Slava). Added a lock
> >   ordering comment at the mdsc->mutex acquisition. Added a comment
> >   explaining why mds_peer_reset() narrows the RECONNECT state check
> >   from >= to ==.
> > - Patch 4: split CEPH_CAP_FLUSH_MAX_DUMP_COUNT into separate
> >   CEPH_CAP_FLUSH_MAX_DUMP_ENTRIES (array bound) and
> >   CEPH_CAP_FLUSH_MAX_DUMP_ITERS (iteration limit) (Slava). Moved
> >   all flush timeout defines to mds_client.h alongside the reset
> >   defines (Slava). Split the comment block into per-field struct
> >   documentation and a separate function safety comment for
> >   dump_cap_flushes() (Slava). Fixed the for-loop variable declaration
> >   to match fs/ceph/ convention. Fixed the commit message to reference
> >   the correct macro names and to stay within 72-column body width.
> > - Patch 5: added a bounded wait for unsafe write requests during the
> >   drain phase, using a shared deadline across all drain legs so the
> >   total drain time stays within CEPH_CLIENT_RESET_DRAIN_SEC. Made
> >   drain_timed_out monotonic (once set, it stays true for the reset).
> >   Replaced spin_lock/spin_unlock around drain_timed_out writes with
> >   WRITE_ONCE() (Slava). Added a ceph_reset_is_idle() inline helper
> >   (Slava). Added per-field comments to struct ceph_client_reset_state
> >   (Slava). Changed the -EIO return to -EAGAIN for reset-failure
> >   signalling to callers (Slava). Increased CEPH_CLIENT_RESET_DRAIN_SEC
> >   from 5s to 30s (Slava). Added sessions[i] = NULL after
> >   ceph_put_mds_session() in the teardown skip path (Slava). Added a
> >   comment at the out_sessions label explaining destroy ownership.
> >   Expanded the msleep() comment explaining why event-based waiting is
> >   not viable.
> > - Patch 6: tracepoint placement fixed to fire before the -EAGAIN
> >   return.
> > - Patch 11: added a MAINTAINERS F: entry for the test directory and
> >   the filesystems/ceph line in the top-level selftests Makefile.
> > 
> > Alex Markuze (11):
> >   ceph: convert inode flags to named bit positions and atomic bitops
> >   ceph: use proper endian conversion for flock_len in reconnect
> >   ceph: harden send_mds_reconnect and handle active-MDS peer reset
> >   ceph: add diagnostic timeout loop to wait_caps_flush()
> >   ceph: add client reset state machine and session teardown
> >   ceph: add manual reset debugfs control and tracepoints
> >   selftests: ceph: add reset consistency checker
> >   selftests: ceph: add reset stress test
> >   selftests: ceph: add reset corner-case tests
> >   selftests: ceph: add validation harness
> >   selftests: ceph: wire up Ceph reset kselftests and documentation
> > 
> >  MAINTAINERS                                   |   1 +
> >  fs/ceph/addr.c                                |  20 +-
> >  fs/ceph/caps.c                                |  34 +-
> >  fs/ceph/debugfs.c                             | 103 +++
> >  fs/ceph/file.c                                |  13 +-
> >  fs/ceph/inode.c                               |   5 +-
> >  fs/ceph/locks.c                               |  38 +-
> >  fs/ceph/mds_client.c                          | 800 +++++++++++++++++-
> >  fs/ceph/mds_client.h                          |  52 +-
> >  fs/ceph/snap.c                                |   2 +-
> >  fs/ceph/super.h                               |  70 +-
> >  fs/ceph/xattr.c                               |   2 +-
> >  include/trace/events/ceph.h                   |  67 ++
> >  tools/testing/selftests/Makefile              |   1 +
> >  .../selftests/filesystems/ceph/Makefile       |   7 +
> >  .../testing/selftests/filesystems/ceph/README |  84 ++
> >  .../filesystems/ceph/reset_corner_cases.sh    | 646 ++++++++++++++
> >  .../filesystems/ceph/reset_stress.sh          | 694 +++++++++++++++
> >  .../filesystems/ceph/run_validation.sh        | 350 ++++++++
> >  .../selftests/filesystems/ceph/settings      |   1 +
> >  .../filesystems/ceph/validate_consistency.py  | 297 +++++++
> >  21 files changed, 3185 insertions(+), 102 deletions(-)
> >  create mode 100644 tools/testing/selftests/filesystems/ceph/Makefile
> >  create mode 100644 tools/testing/selftests/filesystems/ceph/README
> >  create mode 100755 tools/testing/selftests/filesystems/ceph/reset_corner_cases.sh
> >  create mode 100755 tools/testing/selftests/filesystems/ceph/reset_stress.sh
> >  create mode 100755 tools/testing/selftests/filesystems/ceph/run_validation.sh
> >  create mode 100644 tools/testing/selftests/filesystems/ceph/settings
> >  create mode 100755 tools/testing/selftests/filesystems/ceph/validate_consistency.py
> 
> I was able to apply the patchset on v7.1-rc2 successfully. Let me run
> xfstests for the patchset. I'll be back with results ASAP.
> 

The xfstests run was successful. I don't see any critical issues with the
patchset.

Tested-by: Viacheslav Dubeyko

Thanks,
Slava.