[PATCH AUTOSEL 6.17-6.16] dlm: move to rinfo for all middle conversion cases

patches.lists.linux.dev archive mirror
 help / color / mirror / Atom feed

From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Alexander Aring <aahringo@redhat.com>,
	David Teigland <teigland@redhat.com>,
	Sasha Levin <sashal@kernel.org>,
	ccaulfie@redhat.com, cluster-devel@redhat.com
Subject: [PATCH AUTOSEL 6.17-6.16] dlm: move to rinfo for all middle conversion cases
Date: Thu,  2 Oct 2025 11:30:02 -0400	[thread overview]
Message-ID: <20251002153025.2209281-15-sashal@kernel.org> (raw)
In-Reply-To: <20251002153025.2209281-1-sashal@kernel.org>

From: Alexander Aring <aahringo@redhat.com>

[ Upstream commit a8abcff174f7f9ce4587c6451b1a2450d01f52c9 ]

Since commit f74dacb4c8116 ("dlm: fix recovery of middle conversions")
we introduced additional debugging information if we hit the middle
conversion by using log_limit(). The DLM log_limit() functionality
requires a DLM debug option being enabled. As this case is so rarely and
excempt any potential introduced new issue with recovery we switching it
to log_rinfo() ad this is ratelimited under normal DLM loglevel.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Backport Analysis: dlm: move to rinfo for all middle conversion cases

**Backport Status: YES**

---

## Comprehensive Analysis

### Change Summary
This commit changes logging calls from `log_limit()` to `log_rinfo()` in
two locations within the DLM (Distributed Lock Manager) subsystem:
1. **fs/dlm/lock.c:5579** - in `receive_rcom_lock_args()` function
2. **fs/dlm/recover.c:845** - in `recover_conversion()` function

Both changes affect logging for "middle conversion" cases - rare but
critical scenarios where locks convert between PR (Protected Read) and
CW (Concurrent Write) modes during cluster recovery.

### Code Changes Analysis

**Logging Infrastructure Differences** (from
fs/dlm/dlm_internal.h:65-87):
- **log_limit()**: Only logs when `dlm_config.ci_log_debug` is enabled,
  at DEBUG level with rate limiting. Requires explicit debug mode.
- **log_rinfo()**: Logs at INFO or DEBUG level depending on
  configuration (`ci_log_info` OR `ci_log_debug`). Visible under normal
  DLM loglevel.

**Specific Change 1 - fs/dlm/lock.c:5579**:
```c
// In receive_rcom_lock_args() - when receiving lock recovery
information
if (rl->rl_status == DLM_LKSTS_CONVERT && middle_conversion(lkb)) {
- log_limit(ls, "%s %x middle convert gr %d rq %d remote %d %x", ...)
+   log_rinfo(ls, "%s %x middle convert gr %d rq %d remote %d %x", ...)
    rsb_set_flag(r, RSB_RECOVER_CONVERT);
}
```

**Specific Change 2 - fs/dlm/recover.c:845**:
```c
// In recover_conversion() - when detecting incompatible lock modes
during recovery
if (((lkb->lkb_grmode == DLM_LOCK_PR) && (other_grmode == DLM_LOCK_CW))
||
    ((lkb->lkb_grmode == DLM_LOCK_CW) && (other_grmode == DLM_LOCK_PR)))
{
- log_limit(ls, "%s %x gr %d rq %d, remote %d %x, other_lkid %u, other
  gr %d, set gr=NL", ...)
+   log_rinfo(ls, "%s %x gr %d rq %d, remote %d %x, other_lkid %u, other
gr %d, set gr=NL", ...)
    lkb->lkb_grmode = DLM_LOCK_NL;
}
```

### Critical Context from Referenced Commit f74dacb4c8116

The commit message references f74dacb4c8116 ("dlm: fix recovery of
middle conversions", Nov 2024), which fixed a **long-standing critical
bug** in DLM recovery:

**The Bug**: Recovery couldn't reliably rebuild lock state for
conversions between PR/CW modes. The code would set invalid modes
(DLM_LOCK_IV), causing unpredictable errors.

**Why It Went Unnoticed**:
- Applications rarely convert between PR/CW
- Recovery rarely occurs during these conversions
- Even when the bug occurred, callers might not notice depending on
  subsequent operations
- A gfs2 core dump finally revealed this broken code

**The Fix**: Properly detect and correct incompatible lock modes during
recovery by temporarily setting grmode to NL, allowing the conversion to
complete after recovery.

**Logging Inconsistency**: The original bug fix added logging in three
places for middle conversions:
- `recover_convert_waiter()`: Used `log_rinfo()` ✓
- `receive_rcom_lock_args()`: Used `log_limit()` ✗
- `recover_conversion()`: Used `log_limit()` ✗

The current commit makes all three consistent by using `log_rinfo()`.

### Why This Change Matters

1. **Production Visibility**: Middle conversion recovery is rare but
   critical. The original bug existed for years undetected. Having this
   logging visible in production (without debug mode) helps catch any
   remaining issues or new regressions.

2. **Consistency**: All three middle conversion logging points should
   use the same logging level for coherent debugging.

3. **Preventative Monitoring**: The commit message says "excempt any
   potential introduced new issue with recovery" - this appears to mean
   they want to *except* (catch) any potential new issues. Making these
   logs visible helps detect problems early.

4. **Cluster Filesystem Impact**: DLM is used by GFS2 and other cluster
   filesystems. Recovery bugs can cause data corruption or service
   outages in production clusters.

### Risk Assessment

**Risk Level**: **VERY LOW**

**Potential Issues**:
- Slightly increased log verbosity in rare recovery scenarios
- Both logging paths are rate-limited, preventing log spam
- Only affects recovery code paths that are infrequently exercised

**Regression Probability**: **Near Zero**
- No functional code changes
- Only affects logging output
- Both `log_limit()` and `log_rinfo()` are rate-limited
- Change is identical to existing logging pattern in same subsystem

**Testing Consideration**:
The affected code paths execute during:
- Cluster node failures during lock conversions
- Lock recovery after master node changes
- Middle conversion scenarios (PR↔CW)

These are difficult to reproduce in testing but critical in production.

### Stable Tree Considerations

**Arguments FOR Backporting**:
1. ✅ The referenced bug fix f74dacb4c8116 is present in this stable tree
   (confirmed by git log)
2. ✅ Extremely low risk - only logging changes, no functional
   modifications
3. ✅ Small, self-contained change (2 lines)
4. ✅ Improves observability for a critical, hard-to-debug subsystem
5. ✅ Provides consistency in logging for recovery scenarios
6. ✅ Could help catch issues in production environments where debug mode
   isn't enabled
7. ✅ Cluster filesystems (GFS2) users would benefit from better recovery
   debugging
8. ✅ No architectural changes or new features

**Arguments AGAINST Backporting**:
1. ❌ Not fixing a bug - it's an observability improvement
2. ❌ No explicit Cc: stable tag from maintainers
3. ❌ Changes observable behavior (more verbose logging under normal
   loglevel)
4. ❌ Stable trees typically prioritize bug fixes over improvements

### Relationship to Stable Tree Rules

From stable tree documentation, acceptable commits should:
- ✅ "Fix a real bug" - While not fixing a bug directly, this improves
  detection of bugs in the recovery logic
- ✅ "Be small and obviously correct" - 2-line change, straightforward
- ✅ "Include no significant risk of regression" - Only logging changes
- ⚠️  "Not add a new feature" - Borderline: makes existing logging more
  accessible, doesn't add functionality

### Historical Precedent

The DLM subsystem has a history of recovery-related issues that went
undetected for years. The original bug fix commit message states: "This
bug has gone unnoticed for so long..." This justifies enhanced logging
for rare recovery scenarios.

### Conclusion

**Recommendation: YES - Backport to stable trees**

**Rationale**:
1. **Minimal Risk**: This is purely a logging level change with no
   functional modifications. The rate-limiting prevents any log spam
   concerns.

2. **Valuable for Production**: The original bug (f74dacb4c8116)
   demonstrates that middle conversion recovery bugs can exist
   undetected for years. Making this logging visible in production
   environments helps catch issues early.

3. **Complements Existing Backport**: If the bug fix f74dacb4c8116 is
   backported to stable (it's present in this tree), this logging
   improvement provides essential observability for that fix.

4. **Consistency**: Makes all middle conversion logging uniform, which
   aids in debugging across different recovery code paths.

5. **Cluster Filesystem Stability**: DLM is critical infrastructure for
   cluster filesystems like GFS2. Better observability in production
   helps maintain cluster stability and aids in root cause analysis
   during incidents.

**Conditional Note**: If stable tree maintainers have strict policies
against non-bugfix commits, they may choose to reject this. However,
given:
- The extreme low risk
- The relationship to a critical bug fix
- The production value for cluster filesystem users
- The small size and obvious correctness

This commit meets the spirit of stable tree backporting rules even if
it's technically an "improvement" rather than a "fix."

 fs/dlm/lock.c    | 2 +-
 fs/dlm/recover.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index 6dd3a524cd352..be938fdf17d96 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -5576,7 +5576,7 @@ static int receive_rcom_lock_args(struct dlm_ls *ls, struct dlm_lkb *lkb,

 	if (rl->rl_status == DLM_LKSTS_CONVERT && middle_conversion(lkb)) {
 		/* We may need to adjust grmode depending on other granted locks. */
-		log_limit(ls, "%s %x middle convert gr %d rq %d remote %d %x",
+		log_rinfo(ls, "%s %x middle convert gr %d rq %d remote %d %x",
 			  __func__, lkb->lkb_id, lkb->lkb_grmode,
 			  lkb->lkb_rqmode, lkb->lkb_nodeid, lkb->lkb_remid);
 		rsb_set_flag(r, RSB_RECOVER_CONVERT);
diff --git a/fs/dlm/recover.c b/fs/dlm/recover.c
index be4240f09abd4..3ac020fb8139e 100644
--- a/fs/dlm/recover.c
+++ b/fs/dlm/recover.c
@@ -842,7 +842,7 @@ static void recover_conversion(struct dlm_rsb *r)
 		 */
 		if (((lkb->lkb_grmode == DLM_LOCK_PR) && (other_grmode == DLM_LOCK_CW)) ||
 		    ((lkb->lkb_grmode == DLM_LOCK_CW) && (other_grmode == DLM_LOCK_PR))) {
-			log_limit(ls, "%s %x gr %d rq %d, remote %d %x, other_lkid %u, other gr %d, set gr=NL",
+			log_rinfo(ls, "%s %x gr %d rq %d, remote %d %x, other_lkid %u, other gr %d, set gr=NL",
 				  __func__, lkb->lkb_id, lkb->lkb_grmode,
 				  lkb->lkb_rqmode, lkb->lkb_nodeid,
 				  lkb->lkb_remid, other_lkid, other_grmode);
-- 
2.51.0

next prev parent reply	other threads:[~2025-10-02 15:30 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-02 15:29 [PATCH AUTOSEL 6.17-5.4] hfs: fix KMSAN uninit-value issue in hfs_find_set_zero_bits() Sasha Levin
2025-10-02 15:29 ` [PATCH AUTOSEL 6.17-6.12] arm64: sysreg: Correct sign definitions for EIESB and DoubleLock Sasha Levin
2025-10-02 15:29 ` [PATCH AUTOSEL 6.17-5.4] hfs: clear offset and space out of valid records in b-tree node Sasha Levin
2025-10-02 15:29 ` [PATCH AUTOSEL 6.17-5.4] hfsplus: return EIO when type of hidden directory mismatch in hfsplus_fill_super() Sasha Levin
2025-10-02 15:29 ` [PATCH AUTOSEL 6.17-6.1] powerpc/32: Remove PAGE_KERNEL_TEXT to fix startup failure Sasha Levin
2025-10-02 15:29 ` [PATCH AUTOSEL 6.17-5.4] m68k: bitops: Fix find_*_bit() signatures Sasha Levin
2025-10-02 15:29 ` [PATCH AUTOSEL 6.17] smb: client: make use of ib_wc_status_msg() and skip IB_WC_WR_FLUSH_ERR logging Sasha Levin
2025-10-02 15:29 ` [PATCH AUTOSEL 6.17-6.16] arm64: realm: ioremap: Allow mapping memory as encrypted Sasha Levin
2025-10-02 16:43   ` Suzuki K Poulose
2025-10-21 15:38     ` Sasha Levin
2025-10-02 15:29 ` [PATCH AUTOSEL 6.17-6.12] gfs2: Fix unlikely race in gdlm_put_lock Sasha Levin
2025-10-02 15:29 ` [PATCH AUTOSEL 6.17-6.1] smb: server: let smb_direct_flush_send_list() invalidate a remote key first Sasha Levin
2025-10-02 15:29 ` [PATCH AUTOSEL 6.17-5.15] nios2: ensure that memblock.current_limit is set when setting pfn limits Sasha Levin
2025-10-02 15:29 ` [PATCH AUTOSEL 6.17-6.12] s390/mm: Use __GFP_ACCOUNT for user page table allocations Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-6.16] riscv: mm: Return intended SATP mode for noXlvl options Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-6.16] gfs2: Fix LM_FLAG_TRY* logic in add_to_queue Sasha Levin
2025-10-02 15:30 ` Sasha Levin [this message]
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-5.4] hfsplus: fix KMSAN uninit-value issue in hfsplus_delete_cat() Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-5.4] exec: Fix incorrect type for ret Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-5.4] hfsplus: fix KMSAN uninit-value issue in __hfsplus_ext_cache_extent() Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-6.1] lkdtm: fortify: Fix potential NULL dereference on kmalloc failure Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-6.16] riscv: mm: Use mmu-type from FDT to limit SATP mode Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-6.6] Unbreak 'make tools/*' for user-space targets Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-5.4] hfs: make proper initalization of struct hfs_find_data Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-5.4] hfsplus: fix slab-out-of-bounds read in hfsplus_strcasecmp() Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-6.16] riscv: cpufeature: add validation for zfa, zfh and zfhmin Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-6.12] PCI: Test for bit underflow in pcie_set_readrq() Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-6.16] s390/pkey: Forward keygenflags to ep11_unwrapkey Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-6.6] drivers/perf: hisi: Relax the event ID check in the framework Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-5.4] hfs: validate record offset in hfsplus_bmap_alloc Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17] smb: client: limit the range of info->receive_credit_target Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-5.4] dlm: check for defined force value in dlm_lockspace_release Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-6.12] binfmt_elf: preserve original ELF e_flags for core dumps Sasha Levin
2025-10-02 15:58   ` Kees Cook
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-6.16] arm64: errata: Apply workarounds for Neoverse-V3AE Sasha Levin
2025-10-02 15:30 ` [PATCH AUTOSEL 6.17-6.16] smb: client: queue post_recv_credits_work also if the peer raises the credit target Sasha Levin

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:6dd3a524cd35 dfblob:be938fdf17d9 dfblob:be4240f09abd
dfblob:3ac020fb8139 )
 OR (
bs:"[PATCH AUTOSEL 6.17-6.16] dlm: move to rinfo for all middle conversion cases" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251002153025.2209281-15-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=aahringo@redhat.com \
    --cc=ccaulfie@redhat.com \
    --cc=cluster-devel@redhat.com \
    --cc=patches@lists.linux.dev \
    --cc=stable@vger.kernel.org \
    --cc=teigland@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).