[PATCH AUTOSEL 7.0-5.10] scsi: scsi_dh_alua: Increase default ALUA timeout to maximum spec value

Linux SCSI subsystem development

* [PATCH AUTOSEL 7.0-5.10] scsi: scsi_dh_alua: Increase default ALUA timeout to maximum spec value
       [not found] <20260511221931.2370053-1-sashal@kernel.org>
@ 2026-05-11 22:19 ` Sasha Levin
  2026-05-11 22:19 ` [PATCH AUTOSEL 7.0-5.15] scsi: smartpqi: Silence a recursive lock warning Sasha Levin
  1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2026-05-11 22:19 UTC (permalink / raw)
  To: patches, stable
  Cc: Brian Bunker, Krishna Kant, Riya Savla, Hannes Reinecke,
	Martin K. Petersen, Sasha Levin, jejb, linux-scsi, linux-kernel

From: Brian Bunker <brian@purestorage.com>

[ Upstream commit 68c3a65a5a8e85643745fdde02cb63904e165620 ]

The ALUA handler maps a 0 value (no implicit transition timeout provided
by the target) to the ALUA_FAILOVER_TIMEOUT constant, currently 60
seconds. This means the kernel already does not accept an infinite
transition time.

However, 60 seconds is insufficient for some arrays that may take longer
to complete ALUA transitions. Since the highest value allowed by the
SCSI specification for the implicit transition timeout is a single byte
(255 seconds), change the default to 255. This way, when a target does
not provide an explicit transition timeout, we default to the maximum
value the spec allows rather than an arbitrary 60 second limit.

Co-developed-by: Krishna Kant <krishna.kant@purestorage.com>
Signed-off-by: Krishna Kant <krishna.kant@purestorage.com>
Co-developed-by: Riya Savla <rsavla@purestorage.com>
Signed-off-by: Riya Savla <rsavla@purestorage.com>
Signed-off-by: Brian Bunker <brian@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20260416165512.26497-2-brian@purestorage.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Analysis Walkthrough

Phase 1 Record: Subsystem is `scsi: scsi_dh_alua`; action is “increase”;
intent is to raise the default ALUA implicit transition timeout from 60s
to 255s. Tags verified from commit
`68c3a65a5a8e85643745fdde02cb63904e165620`: co-developed/SOB by Krishna
Kant and Riya Savla, SOB by Brian Bunker, `Reviewed-by: Hannes
Reinecke`, `Link:` to the v4 posting, SOB by Martin K. Petersen. No
`Fixes:`, `Reported-by:`, `Tested-by:`, or `Cc: stable`. Body describes
a real behavior problem: targets that omit an explicit ALUA transition
timeout get capped at 60s, which is too short for some arrays.

Phase 2 Record: One file changed,
`drivers/scsi/device_handler/scsi_dh_alua.c`, 1 insertion/1 deletion. No
function body is modified; only `ALUA_FAILOVER_TIMEOUT` changes. The
macro is used by `submit_rtpg()`, `submit_stpg()`, `alua_tur()`, and
`alua_rtpg()` for command and transition expiry timing. Before: missing
target timeout defaults to 60s. After: defaults to 255s. Bug category is
logic/correctness for storage failover timing. The fix is very small
and obviously correct; the main regression risk is slower failure
detection for arrays that omit the timeout and remain stuck.
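
The fallback described above can be sketched as a small userspace model.
This is an illustration, not the driver code: only
`ALUA_FAILOVER_TIMEOUT` and its new value come from the patch, while
`effective_transition_timeout()` is a hypothetical helper, and the
assumption that the target-reported value arrives as a single byte
follows the commit text.

```c
#include <assert.h>
#include <stdint.h>

/* Value from the patch; everything else here is an illustrative model. */
#define ALUA_FAILOVER_TIMEOUT 255 /* max 255 (8-bit value) */

/* The default must fit in the one-byte field a target would report. */
static_assert(ALUA_FAILOVER_TIMEOUT <= UINT8_MAX,
	      "default must fit in a single byte");

/*
 * Hypothetical helper: pick the transition timeout in seconds.
 * reported == 0 means the target supplied no transition timeout,
 * so the compile-time default is used instead.
 */
static unsigned int effective_transition_timeout(uint8_t reported)
{
	return reported ? reported : ALUA_FAILOVER_TIMEOUT;
}
```

With the old default, a target reporting 0 would get 60 seconds from
this helper; a target that reports any nonzero value is unaffected by
the patch.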

Phase 3 Record: `git blame` shows the 60s default came from
`3588c5a21aef8c` (`[SCSI] scsi_dh_alua: implement 'implied transition
timeout'`), first contained in `v3.6`. That original commit added the
implicit transition timeout machinery and made 60s the finite fallback.
Recent local history shows ALUA-related fixes but no prerequisite for
this one. Author Brian Bunker previously authored ALUA transition-state
fix `6056a92ceb2a7`, so this is from a contributor with direct ALUA
history. No standalone dependency was found.

Phase 4 Record: `b4 dig -c 68c3a65a5a8e8` found the v4 lore submission
at
`https://patch.msgid.link/20260416165512.26497-2-brian@purestorage.com`.
`b4 dig -a` found v3 and v4; v4 is the applied revision. `b4 dig -w`
shows Brian Bunker, `linux-scsi`, Hannes Reinecke, Krishna Kant, and
Riya Savla were included. The v4 thread has Hannes’s `Reviewed-by` and
Martin Petersen’s “Applied to 7.1/scsi-staging”. Earlier v2 discussion
shows Hannes objected to tying the ALUA transition timeout to the device
command timeout, and the patch evolved into the simpler 255s default. I
found no stable-list discussion.

Phase 5 Record: Modified function list is empty, but impacted code paths
are the ALUA RTPG/STPG/TUR and transition expiry paths. Call tracing
verified `alua_rtpg_work()` calls `alua_tur()` and `alua_rtpg()`,
`alua_activate()` queues RTPG from dm-multipath activation,
`alua_check_sense()` is invoked from SCSI error handling, and
`alua_prep_fn()` is called from SCSI request setup. This is reachable
from SCSI disk/device-handler attach, error handling, and dm-multipath
path activation. Similar pattern search found the same 60s fallback in
active stable tags.

Phase 6 Record: The buggy 60s default exists in `v4.14`, `v4.19`,
`v5.10`, `v5.15`, `v6.1`, `v6.6`, `v6.12`, `v6.16`, `v6.17`, and `v7.0`
tags in this repo. The exact macro line is present, so backport
difficulty should be clean or trivial for those trees. `b4 am` also
reported the v4 patch “applies clean to current tree.” No alternate
stable fix was found.

Phase 7 Record: Subsystem is SCSI device handler / ALUA multipath
storage. Criticality is IMPORTANT: it affects systems using ALUA-capable
SCSI storage, especially enterprise multipath arrays. MAINTAINERS
verifies SCSI is maintained by James Bottomley and Martin Petersen, and
the patch was committed by Martin Petersen.

Phase 8 Record: Affected users are config/hardware-specific: ALUA SCSI
disk users, commonly multipath enterprise storage. Trigger is an ALUA
transition where the target omits an explicit transition timeout and
takes more than 60s. Failure mode is premature transition expiry,
leading `alua_rtpg()` to mark the port group standby and return I/O/path
failure. Severity is HIGH for affected systems because it can break
failover or storage availability. Benefit is high for affected storage
users; risk is low because this is a one-line bounded timeout increase
and 255s matches the implementation’s `unsigned char`/`buff[5]` timeout
representation.
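
The failure mode can be illustrated with a simplified expiry check. This
is a sketch under stated assumptions: `model_state` and
`model_check_expiry()` are hypothetical names, times are plain seconds
rather than jiffies, and the real `alua_rtpg()` handles more states and
retries than shown here.

```c
#include <stdbool.h>

/* Hypothetical, simplified states; the driver tracks more than these. */
enum model_state { MODEL_TRANSITIONING, MODEL_STANDBY };

/*
 * Decide how a still-transitioning port group is treated at time `now`,
 * given when the transition was first seen and the timeout in seconds.
 * Once the timeout has elapsed the group is treated as standby.
 */
static enum model_state model_check_expiry(unsigned long start,
					   unsigned long now,
					   unsigned int timeout_secs)
{
	bool expired = (now - start) >= timeout_secs;

	return expired ? MODEL_STANDBY : MODEL_TRANSITIONING;
}
```

In this model, an array that needs 90 seconds to finish a transition is
declared standby under a 60-second timeout but is still given time under
a 255-second one.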

Phase 9 Record: Evidence for backporting: real storage failover
correctness issue, long-lived bug since v3.6, affects many stable trees,
one-line bounded fix, reviewed by Hannes Reinecke, committed by SCSI
maintainer Martin Petersen, no new API or feature. Evidence against: no
formal `Reported-by` or `Tested-by`; behavior may wait longer before
declaring a nonresponsive target failed. Stable checklist: obviously
correct yes; real bug yes; important issue yes for storage
availability/path failure; small and contained yes; no new APIs yes;
applies to stable trees yes/trivial. Exception category: none, this is
not a device ID/quirk/build/doc fix.

## Verification

- Phase 1: `git show 68c3a65a5a8e8` verified commit message, tags,
  author, committer, and one-line diff.
- Phase 2: `git show` and source read verified only
  `ALUA_FAILOVER_TIMEOUT` changes from `60` to `255`.
- Phase 3: `git blame` verified the 60s default came from
  `3588c5a21aef8c`; `git describe --contains` verified `v3.6` ancestry.
- Phase 4: `b4 dig`, `b4 dig -a`, `b4 dig -w`, and saved mboxes verified
  v3/v4 review history, Hannes review, and Martin’s apply note. Direct
  WebFetch of lore was blocked/timed out; b4 succeeded.
- Phase 5: `git grep` and file reads verified ALUA call paths through
  SCSI request setup, SCSI error handling, and dm-multipath activation.
- Phase 6: `git grep` against stable tags verified the 60s default
  exists across listed stable releases.
- Phase 7: `MAINTAINERS` search verified SCSI maintainer/list context.
- Phase 8: Source inspection verified the failure path: timeout expiry
  in `alua_rtpg()` changes transitioning state handling to standby/I/O
  error.
- Unverified: I did not independently fetch the SCSI SPC text; the “255
  maximum spec value” claim is supported by the reviewed commit text and
  by the kernel implementation storing the timeout as a single byte.

This should be backported: it fixes a real ALUA multipath storage
availability problem with a tiny, bounded, maintainer-reviewed change
and minimal regression risk.

**YES**

 drivers/scsi/device_handler/scsi_dh_alua.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index efb08b9b145a1..80ab0ff921d43 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -37,7 +37,7 @@
 #define TPGS_MODE_EXPLICIT		0x2

 #define ALUA_RTPG_SIZE			128
-#define ALUA_FAILOVER_TIMEOUT		60
+#define ALUA_FAILOVER_TIMEOUT		255	/* max 255 (8-bit value) */
 #define ALUA_FAILOVER_RETRIES		5
 #define ALUA_RTPG_DELAY_MSECS		5
 #define ALUA_RTPG_RETRY_DELAY		2
-- 
2.53.0


* [PATCH AUTOSEL 7.0-5.15] scsi: smartpqi: Silence a recursive lock warning
       [not found] <20260511221931.2370053-1-sashal@kernel.org>
  2026-05-11 22:19 ` [PATCH AUTOSEL 7.0-5.10] scsi: scsi_dh_alua: Increase default ALUA timeout to maximum spec value Sasha Levin
@ 2026-05-11 22:19 ` Sasha Levin
  1 sibling, 0 replies; 2+ messages in thread
From: Sasha Levin @ 2026-05-11 22:19 UTC (permalink / raw)
  To: patches, stable
  Cc: Tomas Henzl, Don Brace, Martin K. Petersen, Sasha Levin, jejb,
	storagedev, linux-scsi, linux-kernel

From: Tomas Henzl <thenzl@redhat.com>

[ Upstream commit d65efdf467ff935e35dfe6aa9a7ab93f17ac07ee ]

On systems with multiple controllers, a debug kernel shows

  WARNING: possible recursive locking detected

during shutdown.

Each controller has its own ctrl_info (and mutex), which the debug
kernel does not correctly recognize.  Suppress the warning by releasing
the mutex at the end of pqi_shutdown().

Signed-off-by: Tomas Henzl <thenzl@redhat.com>
Acked-by: Don Brace <don.brace@microchip.com>
Link: https://patch.msgid.link/20260414124118.23661-1-thenzl@redhat.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Decision Summary
Backport this to stable trees that contain the mutex-based
`pqi_ctrl_block_device_reset()` shutdown path, verified in `v5.15`,
`v6.1`, `v6.6`, `v6.12`, `v6.16`, and `v6.19`. Do not apply it as-is to
`v5.10`, where that helper only sets `block_device_reset = true` and no
`pqi_ctrl_unblock_device_reset()` helper exists.

This is a one-line fix for an imbalanced `lun_reset_mutex` lock in
`pqi_shutdown()`. The direct user-visible symptom is a lockdep “possible
recursive locking detected” warning during shutdown/reboot on systems
with multiple smartpqi controllers. The impact is mostly debug-kernel
noise rather than data loss or a production crash, but the patch
corrects real locking hygiene, is tiny, subsystem-reviewed, and has very
low risk.

## Phase Walkthrough
Phase 1:
- Record 1.1: subsystem `scsi: smartpqi`; action `Silence`; intent is to
  stop a recursive lock warning during shutdown.
- Record 1.2: tags present: `Signed-off-by: Tomas Henzl`, `Acked-by: Don
  Brace`, `Link:
  https://patch.msgid.link/20260414124118.23661-1-thenzl@redhat.com`,
  `Signed-off-by: Martin K. Petersen`. No `Fixes:`, no `Reported-by:`,
  no `Cc: stable`.
- Record 1.3: message describes debug kernels warning on multi-
  controller systems because distinct per-controller mutexes are not
  recognized as distinct after shutdown leaves one held.
- Record 1.4: hidden bug fix: yes. It is described as silencing a
  warning, but the code adds a missing unlock for a mutex acquired
  earlier in the same function.

Phase 2:
- Record 2.1: one file, `drivers/scsi/smartpqi/smartpqi_init.c`, one
  insertion in `pqi_shutdown()`. Single-file surgical fix.
- Record 2.2: before, `pqi_shutdown()` locked
  `ctrl_info->lun_reset_mutex` via `pqi_ctrl_block_device_reset()` and
  returned after `pqi_reset()` without unlocking. After, it unlocks via
  `pqi_ctrl_unblock_device_reset()`.
- Record 2.3: bug category is synchronization/lock balancing. The
  changed helper is verified as
  `mutex_unlock(&ctrl_info->lun_reset_mutex)`.
- Record 2.4: fix quality is high: one existing helper call, no new API,
  no refactor. Main risk is allowing a reset waiter to proceed late in
  shutdown; Tomas explicitly discussed this risk on-list and said he
  checked it.
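
The lock imbalance in Record 2.2 can be modelled in userspace as
follows. This is a minimal sketch, not the driver: `model_ctrl` and the
`model_*` functions are hypothetical stand-ins, a boolean replaces the
real mutex, and the `with_fix` flag exists only to show both behaviors
side by side.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ctrl_info; only the lock state is modelled. */
struct model_ctrl {
	bool lun_reset_locked;
};

static void model_block_device_reset(struct model_ctrl *c)
{
	assert(!c->lun_reset_locked);	/* a real mutex would self-deadlock */
	c->lun_reset_locked = true;
}

static void model_unblock_device_reset(struct model_ctrl *c)
{
	assert(c->lun_reset_locked);	/* unlocking an unheld mutex is a bug */
	c->lun_reset_locked = false;
}

/* Shutdown path: with the fix, the lock taken early is dropped at the end. */
static void model_shutdown(struct model_ctrl *c, bool with_fix)
{
	model_block_device_reset(c);
	/* ... quiesce, flush cache, reset controller ... */
	if (with_fix)
		model_unblock_device_reset(c);
}
```

Without the final unlock, the first controller's mutex is still held
when shutdown reaches the second controller. If both mutexes share one
lockdep class, which is the usual result of initializing them at the
same call site, taking the second one looks like recursive locking.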

Phase 3:
- Record 3.1: blame shows the shutdown call to
  `pqi_ctrl_block_device_reset()` is old, but `9fa8202336096` changed
  the helper to a mutex-based block/unblock model. That is the relevant
  introduction point for the missing unlock.
- Record 3.2: no `Fixes:` tag, so no tagged introducing commit to
  follow.
- Record 3.3: recent file history shows normal smartpqi churn, including
  fixes and device-ID updates; no prerequisite for this one-line helper
  call was identified for v5.15+ style code.
- Record 3.4: Tomas Henzl has SCSI commits in history but no recent
  smartpqi commits found; Don Brace is listed as smartpqi maintainer and
  acked the patch.
- Record 3.5: dependency is the existing
  `pqi_ctrl_unblock_device_reset()` helper. It exists in v5.15+ verified
  tags, not in v5.10.

Phase 4:
- Record 4.1: candidate commit hash was not available locally, so `b4
  dig -c` could not be used for this candidate. `b4 mbox` and `b4 am`
  using the Link fetched the original thread.
- Record 4.2: original recipients were `linux-scsi` and Don Brace; Don
  Brace acked it; Martin Petersen applied it.
- Record 4.3: external thread and an earlier related LKML post show a
  real lockdep splat with call trace through `__do_sys_reboot ->
  device_shutdown -> pci_device_shutdown -> pqi_shutdown`.
- Record 4.4: no newer v2/v3 was reported by `b4 mbox -c`; thread had
  six messages. A separate 2025 lockdep-key proposal for the same
  warning was found, but it is not present in this tree.
- Record 4.5: web search found no relevant stable-list discussion.

Phase 5:
- Record 5.1: modified function: `pqi_shutdown()`.
- Record 5.2: caller is PCI driver `.shutdown = pqi_shutdown`; this is
  reached from PCI/device shutdown during reboot/poweroff paths.
- Record 5.3: relevant callees are `pqi_wait_until_ofa_finished()`,
  `pqi_scsi_block_requests()`, `pqi_ctrl_block_device_reset()`,
  `pqi_ctrl_block_requests()`, `pqi_ctrl_wait_until_quiesced()`,
  `pqi_flush_cache()`, `pqi_crash_if_pending_command()`, `pqi_reset()`,
  and now `pqi_ctrl_unblock_device_reset()`.
- Record 5.4: verified external call trace reaches `pqi_shutdown()` from
  reboot. Trigger requires multiple smartpqi controllers and a
  debug/lockdep kernel.
- Record 5.5: similar lock/unlock pairing exists in OFA and
  suspend/resume paths; shutdown was the unmatched case.

Phase 6:
- Record 6.1: verified `v5.15`, `v6.1`, `v6.6`, `v6.12`, `v6.16`, and
  `v6.19` have mutex-based block/unblock helpers and shutdown lacks the
  final unblock. Verified `v5.10` does not have the mutex helper.
- Record 6.2: expected backport difficulty is clean or trivial for
  v5.15+ style trees because the exact helper and shutdown context
  exist. v5.10 is not applicable as-is.
- Record 6.3: no related fix already present in the checked local tree;
  `lun_reset_key` proposal is absent.

Phase 7:
- Record 7.1: subsystem is SCSI storage driver, `smartpqi`; criticality
  is driver-specific but storage-related.
- Record 7.2: subsystem is active; recent history shows ongoing fixes,
  device IDs, and driver updates.

Phase 8:
- Record 8.1: affected users are systems with Microchip/Microsemi
  SmartPQI controllers, especially multiple controllers with
  debug/lockdep kernels.
- Record 8.2: trigger is shutdown/reboot. The verified external trace
  shows reboot path; unprivileged triggerability was not verified.
- Record 8.3: failure mode is lockdep warning/lock imbalance, severity
  medium-low in production terms but valid for debug-kernel correctness.
- Record 8.4: benefit is moderate for affected systems and CI/debug
  kernels; risk is very low because this is one line using an existing
  helper after a matching lock.

Phase 9:
- Record 9.1: evidence for backporting: real lock imbalance,
  reproducible lockdep warning, one-line fix, maintainer ack, existing
  helper, verified affected stable baselines v5.15+. Evidence against:
  symptom is mainly debug warning, not crash/data corruption; v5.10 not
  applicable as-is.
- Record 9.2: stable rules: obviously correct yes; fixes a real bug yes;
  important issue borderline but acceptable given the lockdep warning
  and tiny risk; small/contained yes; no new features/APIs yes; applies
  to v5.15+ style trees, likely clean/trivial.
- Record 9.3: no exception category applies.
- Record 9.4: risk-benefit favors backporting for applicable stable
  trees.

## Verification
- Phase 1: Parsed supplied commit message and `b4 am` output; confirmed
  tags and absence of `Fixes:`/stable/Reported-by.
- Phase 2: Read `smartpqi_init.c`; confirmed
  `pqi_ctrl_block_device_reset()` is `mutex_lock()` and
  `pqi_ctrl_unblock_device_reset()` is `mutex_unlock()`.
- Phase 3: Used `git blame`, `git show 0530736e40a069`, and `git show
  9fa8202336096d`; confirmed helper semantics changed to mutex model in
  the shutdown/suspend update.
- Phase 4: `WebFetch` to `patch.msgid.link` was blocked by Anubis; `b4
  mbox`/`b4 am` fetched the lore thread successfully. `b4 am` reported
  the patch applies cleanly to current tree.
- Phase 4: Read lore mirror; confirmed Bart’s “patch looks fine”
  comment, Tomas’s risk discussion, Don Brace’s ack, and Martin
  Petersen’s apply notice.
- Phase 5: Used exact searches and file reads to trace `.shutdown =
  pqi_shutdown`, SCSI reset handlers, and related lock users.
- Phase 6: Used version tags to verify affected code in `v5.15+` and
  non-applicability to `v5.10`.
- Phase 7: Checked `MAINTAINERS`; confirmed Don Brace maintains
  smartpqi.
- Phase 8: External LKML mirror provided the concrete lockdep call trace
  and trigger conditions.
- Unverified: I did not build-test the patch and did not verify stable
  branch-specific conflicts beyond version-tag code presence.

**YES**

 drivers/scsi/smartpqi/smartpqi_init.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index b4ed991976d06..2026ac645d6ab 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -9427,6 +9427,7 @@ static void pqi_shutdown(struct pci_dev *pci_dev)
 
 	pqi_crash_if_pending_command(ctrl_info);
 	pqi_reset(ctrl_info);
+	pqi_ctrl_unblock_device_reset(ctrl_info);
 }
 
 static void pqi_process_lockup_action_param(void)
-- 
2.53.0

