public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Alexander Aring <aahringo@redhat.com>,
	David Teigland <teigland@redhat.com>,
	Sasha Levin <sashal@kernel.org>,
	gfs2@lists.linux.dev
Subject: [PATCH AUTOSEL 6.19-6.18] dlm: fix recovery pending middle conversion
Date: Tue, 10 Feb 2026 18:30:48 -0500	[thread overview]
Message-ID: <20260210233123.2905307-3-sashal@kernel.org> (raw)
In-Reply-To: <20260210233123.2905307-1-sashal@kernel.org>

From: Alexander Aring <aahringo@redhat.com>

[ Upstream commit 1416bd508c78bdfdb9ae0b4511369e5581f348ea ]

During a workload involving conversions between lock modes PR and CW,
lock recovery can create a "conversion deadlock" state between locks
that have been recovered.  When this occurs, kernel warning messages
are logged, e.g.

  "dlm: WARN: pending deadlock 1e node 0 2 1bf21"

  "dlm: receive_rcom_lock_args 2e middle convert gr 3 rq 2 remote 2 1e"

After this occurs, the deadlocked conversions both appear on the convert
queue of the resource being locked, and the conversion requests do not
complete.

Outside of recovery, conversions that would produce a deadlock are
resolved immediately, and return -EDEADLK.  The locks are not placed
on the convert queue in the deadlocked state.

To fix this problem, an lkb under conversion between PR/CW is rebuilt
during recovery on a new master's granted queue, with the currently
granted mode, rather than being rebuilt on the new master's convert
queue, with the currently granted mode and the newly requested mode.
The in-progress convert is then resent to the new master after
recovery, so the conversion deadlock will be processed outside of
the recovery context and handled as described above.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Now I have a complete picture. Let me provide my detailed analysis.

---

## Detailed Analysis

### 1. COMMIT MESSAGE ANALYSIS

The commit subject clearly says "fix" and the message describes a
concrete, reproducible bug:

- **Symptom**: During a workload involving conversions between lock
  modes PR (Protected Read) and CW (Concurrent Write), lock recovery
  creates a "conversion deadlock" state between recovered locks.
- **Visible effects**: Kernel warning messages (`"dlm: WARN: pending
  deadlock..."`) and permanently stuck lock conversions on the convert
  queue that never complete.
- **This affects clustered filesystem users** (GFS2, OCFS2) running on
  systems where DLM node recovery occurs during PR/CW conversion
  workloads.

The authors are **Alexander Aring** (Red Hat, DLM subsystem developer)
and **David Teigland** (the DLM maintainer who signs off all DLM
patches). This is authoritative.

### 2. CODE CHANGE ANALYSIS

The change is in `recover_convert_waiter()` in `fs/dlm/lock.c`. Here's
what changed:

**Before (buggy code introduced by f74dacb4):**

```5017:5035:fs/dlm/lock.c
        if (middle_conversion(lkb)) {
                log_rinfo(ls, "%s %x middle convert in progress",
__func__,
                         lkb->lkb_id);
                /* We sent this lock to the new master. The new master
will
   - tell us when it's granted.  We no longer need a reply, so
   - use a fake reply to put the lkb into the right state.
                 */
                hold_lkb(lkb);
                memset(ms_local, 0, sizeof(struct dlm_message));
                ms_local->m_type = cpu_to_le32(DLM_MSG_CONVERT_REPLY);
                ms_local->m_result =
cpu_to_le32(to_dlm_errno(-EINPROGRESS));
                ms_local->m_header.h_nodeid =
cpu_to_le32(lkb->lkb_nodeid);
                _receive_convert_reply(lkb, ms_local, true);
                unhold_lkb(lkb);
        } else if (lkb->lkb_rqmode >= lkb->lkb_grmode) {
                set_bit(DLM_IFL_RESEND_BIT, &lkb->lkb_iflags);
        }
```

**After (the fix):**

```c
        if (middle_conversion(lkb) || lkb->lkb_rqmode >=
lkb->lkb_grmode)
                set_bit(DLM_IFL_RESEND_BIT, &lkb->lkb_iflags);
```

### 3. BUG MECHANISM

The recovery sequence in `fs/dlm/recoverd.c` runs these steps in order:

1. **Line 218**: `dlm_recover_waiters_pre()` — handles lkbs waiting for
   replies from failed nodes
2. **Line 247**: `dlm_recover_locks()` — sends locks to new masters (via
   rcom)
3. **Line 270**: `dlm_recover_rsbs()` — calls `recover_conversion()` on
   flagged resources
4. **Line 320**: `dlm_recover_waiters_post()` — resends operations
   marked with RESEND bit

**The bug in the old code**: When `recover_convert_waiter` faked an
`-EINPROGRESS` reply for a middle conversion:
- `_receive_convert_reply()` → `__receive_convert_reply()` would handle
  the `-EINPROGRESS` case by calling `del_lkb(r, lkb)` + `add_lkb(r,
  lkb, DLM_LKSTS_CONVERT)`, moving the lkb to the **local convert
  queue**
- The lkb was also removed from the waiters list via
  `remove_from_waiters_ms()`
- When the lock was subsequently sent to the **new master** via rcom
  (step 2), it was sent as a converting lock
- `receive_rcom_lock_args()` on the new master placed it on the convert
  queue and set `RSB_RECOVER_CONVERT`
- If **two locks** had middle conversions (e.g., A: PR→CW, B: CW→PR),
  both ended up on the convert queue of the new master resource in a
  **deadlocked state**
- `recover_conversion()` couldn't resolve this; the normal `-EDEADLK`
  detection doesn't run during recovery
- Result: permanent deadlock, kernel warnings, stuck I/O

**How the fix resolves it**: By setting `DLM_IFL_RESEND_BIT` instead of
faking `-EINPROGRESS`:
- The lkb stays on the waiters list with its original granted status
  unchanged
- When sent to the new master via rcom, it's sent as a **granted lock**
  (not converting), placed on the **granted queue**
- No conversion deadlock can form during recovery
- After recovery, `dlm_recover_waiters_post()` finds the lkb (via
  `find_resend_waiter`), resets its state, and calls `_convert_lock()`
  to resend the conversion to the new master through **normal channels**
- If the conversion would deadlock, the normal code returns `-EDEADLK`
  immediately, preventing the stuck state

### 4. CLASSIFICATION

This is a **clear bug fix** — it fixes a conversion deadlock in DLM lock
recovery. There are no new features, APIs, or behavioral changes. The
fix actually **removes** complex code and replaces it with the simpler,
correct approach.

### 5. SCOPE AND RISK ASSESSMENT

- **Lines changed**: Net removal of ~17 lines; replaces ~20 lines with 2
  lines
- **Files changed**: 1 (`fs/dlm/lock.c`)
- **Function modified**: Only `recover_convert_waiter()`
- **Risk**: **Very low** — the fix simplifies the code and uses the same
  `DLM_IFL_RESEND_BIT` mechanism that is already used for normal up-
  conversions (REQUEST, LOOKUP, and non-middle CONVERT cases). The
  RESEND path via `dlm_recover_waiters_post()` → `_convert_lock()` is
  well-tested existing code.
- **Subsystem**: DLM (fs/dlm/) — critical for clustered filesystems

### 6. USER IMPACT

- **Who is affected**: Users of DLM-based clustered filesystems (GFS2,
  OCFS2) in production clusters
- **Severity**: **High** — when the bug triggers, lock conversions
  permanently hang, which would block I/O on clustered filesystems
- **Trigger**: PR↔CW conversions during node recovery (when a cluster
  node fails). This is a realistic scenario in production clusters.

### 7. DEPENDENCY CHECK

- **Requires f74dacb4** ("dlm: fix recovery of middle conversions"):
  This commit introduced the code being modified (the fake -EINPROGRESS
  approach). f74dacb4 entered mainline in v6.13-rc1 and is present in
  v6.13+ stable trees.
- **Does NOT require a8abcff1** ("dlm: move to rinfo for all middle
  conversion cases"): a8abcff1 only changed `log_limit` → `log_rinfo` in
  `receive_rcom_lock_args` and `recover_conversion`, not in
  `recover_convert_waiter`. The code being removed already used
  `log_rinfo` since f74dacb4.
- The patch should apply cleanly on any tree containing f74dacb4.

### 8. STABILITY INDICATORS

- Authored and signed-off by the DLM maintainers (Aring + Teigland)
- The fix simplifies code (reduction in complexity = lower regression
  risk)
- Uses existing well-tested RESEND mechanism
- The commit message thoroughly explains the problem and solution

### Summary

This is a clear-cut bug fix for a conversion deadlock in DLM lock
recovery that causes permanent hangs and kernel warnings. The fix is
small, surgical (net code removal), contained to a single function in a
single file, and uses existing well-tested mechanisms. It affects
production clustered filesystem users. The prerequisite commit
(f74dacb4) is in v6.13+ stable trees, so this fix is needed in those
same trees. The risk is very low — the fix actually simplifies the
recovery path.

**YES**

 fs/dlm/lock.c | 19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index be938fdf17d96..c01a291db401b 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -5014,25 +5014,8 @@ void dlm_receive_buffer(const union dlm_packet *p, int nodeid)
 static void recover_convert_waiter(struct dlm_ls *ls, struct dlm_lkb *lkb,
 				   struct dlm_message *ms_local)
 {
-	if (middle_conversion(lkb)) {
-		log_rinfo(ls, "%s %x middle convert in progress", __func__,
-			 lkb->lkb_id);
-
-		/* We sent this lock to the new master. The new master will
-		 * tell us when it's granted.  We no longer need a reply, so
-		 * use a fake reply to put the lkb into the right state.
-		 */
-		hold_lkb(lkb);
-		memset(ms_local, 0, sizeof(struct dlm_message));
-		ms_local->m_type = cpu_to_le32(DLM_MSG_CONVERT_REPLY);
-		ms_local->m_result = cpu_to_le32(to_dlm_errno(-EINPROGRESS));
-		ms_local->m_header.h_nodeid = cpu_to_le32(lkb->lkb_nodeid);
-		_receive_convert_reply(lkb, ms_local, true);
-		unhold_lkb(lkb);
-
-	} else if (lkb->lkb_rqmode >= lkb->lkb_grmode) {
+	if (middle_conversion(lkb) || lkb->lkb_rqmode >= lkb->lkb_grmode)
 		set_bit(DLM_IFL_RESEND_BIT, &lkb->lkb_iflags);
-	}
 
 	/* lkb->lkb_rqmode < lkb->lkb_grmode shouldn't happen since down
 	   conversions are async; there's no reply from the remote master */
-- 
2.51.0


  parent reply	other threads:[~2026-02-10 23:31 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-10 23:30 [PATCH AUTOSEL 6.19-6.12] i3c: mipi-i3c-hci: Reset RING_OPERATION1 fields during init Sasha Levin
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-5.15] gfs2: fiemap page fault fix Sasha Levin
2026-02-10 23:30 ` Sasha Levin [this message]
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-6.6] smb: client: prevent races in ->query_interfaces() Sasha Levin
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-6.12] i3c: mipi-i3c-hci: Ensure proper bus clean-up Sasha Levin
2026-02-11  7:56   ` Adrian Hunter
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-5.10] audit: add fchmodat2() to change attributes class Sasha Levin
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-6.12] btrfs: fallback to buffered IO if the data profile has duplication Sasha Levin
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19] btrfs: don't BUG() on unexpected delayed ref type in run_one_delayed_ref() Sasha Levin
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-6.12] i3c: mipi-i3c-hci-pci: Add System Suspend support Sasha Levin
2026-02-11  7:57   ` Adrian Hunter
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-6.18] hfsplus: fix volume corruption issue for generic/480 Sasha Levin
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-6.18] kselftest/kublk: include message in _Static_assert for C11 compatibility Sasha Levin
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-6.12] dlm: validate length in dlm_search_rsb_tree Sasha Levin
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-6.18] i3c: mipi-i3c-hci: Stop reading Extended Capabilities if capability ID is 0 Sasha Levin
2026-02-10 23:30 ` [PATCH AUTOSEL 6.19-6.1] fs/buffer: add alert in try_to_free_buffers() for folios without buffers Sasha Levin
2026-02-10 23:31 ` [PATCH AUTOSEL 6.19-5.15] i3c: master: svc: Initialize 'dev' to NULL in svc_i3c_master_ibi_isr() Sasha Levin
2026-02-10 23:31 ` [PATCH AUTOSEL 6.19-6.12] statmount: permission check should return EPERM Sasha Levin
2026-02-10 23:31 ` [PATCH AUTOSEL 6.19-5.10] audit: add missing syscalls to read class Sasha Levin
2026-02-10 23:31 ` [PATCH AUTOSEL 6.19-5.10] hfsplus: pretend special inodes as regular files Sasha Levin
2026-02-10 23:31 ` [PATCH AUTOSEL 6.19-5.10] hfsplus: fix volume corruption issue for generic/498 Sasha Levin
2026-02-10 23:31 ` [PATCH AUTOSEL 6.19-6.18] netfs: when subreq is marked for retry, do not check if it faced an error Sasha Levin
2026-02-10 23:31 ` [PATCH AUTOSEL 6.19] hfs: Replace BUG_ON with error handling for CNID count checks Sasha Levin
2026-02-10 23:31 ` [PATCH AUTOSEL 6.19-6.1] smb: client: add proper locking around ses->iface_last_update Sasha Levin
2026-02-10 23:31 ` [PATCH AUTOSEL 6.19-6.6] btrfs: handle user interrupt properly in btrfs_trim_fs() Sasha Levin
2026-02-10 23:31 ` [PATCH AUTOSEL 6.19-5.10] minix: Add required sanity checking to minix_check_superblock() Sasha Levin
2026-02-11  7:56 ` [PATCH AUTOSEL 6.19-6.12] i3c: mipi-i3c-hci: Reset RING_OPERATION1 fields during init Adrian Hunter

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260210233123.2905307-3-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=aahringo@redhat.com \
    --cc=gfs2@lists.linux.dev \
    --cc=patches@lists.linux.dev \
    --cc=stable@vger.kernel.org \
    --cc=teigland@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox