* [PATCH 0/2] scsi: smartpqi: fix PCIe hot reset recovery
@ 2026-05-06 14:01 Mateusz Nowicki
  2026-05-06 14:01 ` [PATCH 1/2] scsi: smartpqi: add pci_error_handlers for bus " Mateusz Nowicki
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Mateusz Nowicki @ 2026-05-06 14:01 UTC (permalink / raw)
  To: don.brace
  Cc: martin.petersen, James.Bottomley, storagedev, linux-scsi,
	linux-kernel

A PCIe bus reset (e.g. "echo 1 > /sys/bus/pci/devices/<bdf>/reset") on a
controller without FLR support leaves the HPE SR932i-p Gen10+ unusable
until reboot: smartpqi registers no pci_error_handlers, so the driver
is not notified, firmware reverts to SIS mode, and all queue mappings
are dropped while the driver still drives PQI.

Patch 1 adds .reset_prepare / .reset_done reusing
pqi_ofa_ctrl_quiesce() / _unquiesce() / pqi_ctrl_init_resume().

Patch 2 raises SIS_CTRL_READY_RESUME_TIMEOUT_SECS from 90s to 180s,
matching the cold-boot path; without it, patch 1 fails at the SIS
ready check because firmware boot after reset takes ~125s on the
SR932i-p Gen10+.

Tested on HPE SR932i-p Gen10+ against Linus' master at 74fe02ce122a.

Note: the From: header is my Posteo address because my employer's SMTP
is unavailable for external mailing lists.  The Signed-off-by carries
the Microchip attribution.

Mateusz Nowicki (2):
  scsi: smartpqi: add pci_error_handlers for bus reset recovery
  scsi: smartpqi: increase SIS ctrl ready resume timeout to 180s

 drivers/scsi/smartpqi/smartpqi_init.c | 47 +++++++++++++++++++++++++++
 drivers/scsi/smartpqi/smartpqi_sis.c  |  2 +-
 2 files changed, 48 insertions(+), 1 deletion(-)

--
2.43.0



* [PATCH 1/2] scsi: smartpqi: add pci_error_handlers for bus reset recovery
  2026-05-06 14:01 [PATCH 0/2] scsi: smartpqi: fix PCIe hot reset recovery Mateusz Nowicki
@ 2026-05-06 14:01 ` Mateusz Nowicki
  2026-05-06 14:01 ` [PATCH 2/2] scsi: smartpqi: increase SIS ctrl ready resume timeout to 180s Mateusz Nowicki
  2026-05-06 22:21 ` [PATCH 0/2] scsi: smartpqi: fix PCIe hot reset recovery Laurence Oberman
  2 siblings, 0 replies; 5+ messages in thread
From: Mateusz Nowicki @ 2026-05-06 14:01 UTC (permalink / raw)
  To: don.brace
  Cc: martin.petersen, James.Bottomley, storagedev, linux-scsi,
	linux-kernel, Mateusz Nowicki

The smartpqi driver does not register pci_error_handlers.  When the PCI
subsystem performs a bus reset (e.g.
"echo 1 > /sys/bus/pci/devices/<bdf>/reset") on a controller without FLR
support, the driver is not notified.  Firmware reverts to SIS mode and
drops admin and operational queue mappings while the driver still
believes PQI is active; SCSI I/O hangs until reboot.

Add .reset_prepare and .reset_done callbacks reusing the existing
SIS -> PQI recovery helpers.

  reset_prepare:
    - pqi_wait_until_ofa_finished()
    - pqi_ofa_ctrl_quiesce()
    - clear controller_online and pqi_mode_enabled

  reset_done:
    - ssleep(PQI_POST_RESET_DELAY_SECS)
    - pqi_ofa_ctrl_unquiesce()
    - pqi_ctrl_init_resume() to drive SIS -> PQI, recreate queues,
      re-enable events and rescan
    - pqi_take_ctrl_offline() on failure

No new helpers or exports.  Tested on HPE SR932i-p Gen10+.

Signed-off-by: Mateusz Nowicki <mateusz.nowicki@microchip.com>
---
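Note (not part of the commit): the prepare/done sequence described above
can be sketched in userspace as a small state model.  This is only an
illustration under stated assumptions, not driver code: struct mock_ctrl
and every mock_* function are hypothetical stand-ins for pqi_ctrl_info
and the pqi_* helpers, the post-reset delay is omitted, and the mocked
re-init is assumed to set both flags again on success.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pqi_ctrl_info (illustration only). */
struct mock_ctrl {
	bool controller_online;
	bool pqi_mode_enabled;
	bool quiesced;
	int fw_init_result;	/* 0 or negative errno the mocked re-init returns */
};

/* Stand-ins for the quiesce/unquiesce helpers: only track the state. */
static void mock_quiesce(struct mock_ctrl *c)
{
	c->quiesced = true;
}

static void mock_unquiesce(struct mock_ctrl *c)
{
	c->quiesced = false;
}

/* Stand-in for the SIS -> PQI re-init; assumed to set both flags on success. */
static int mock_init_resume(struct mock_ctrl *c)
{
	if (c->fw_init_result)
		return c->fw_init_result;

	c->pqi_mode_enabled = true;
	c->controller_online = true;
	return 0;
}

/* Mirrors the reset_prepare steps: quiesce, then clear both flags. */
static void mock_reset_prepare(struct mock_ctrl *c)
{
	assert(c);
	mock_quiesce(c);
	c->controller_online = false;
	c->pqi_mode_enabled = false;
}

/* Mirrors the reset_done steps: unquiesce, re-init, go offline on failure. */
static int mock_reset_done(struct mock_ctrl *c)
{
	int rc;

	assert(c);
	mock_unquiesce(c);

	rc = mock_init_resume(c);
	if (rc) {
		/* Stands in for taking the controller offline. */
		c->controller_online = false;
		return rc;
	}

	return 0;
}
```

Calling mock_reset_prepare() followed by mock_reset_done() on a
controller whose mocked re-init succeeds brings both flags back to true;
a non-zero fw_init_result leaves the controller offline, which is the
failure path the commit message describes.
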
 drivers/scsi/smartpqi/smartpqi_init.c | 47 +++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index 2026ac645d6a..c4003d3cda7e 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -10677,12 +10677,59 @@ static const struct pci_device_id pqi_pci_id_table[] = {
 
 MODULE_DEVICE_TABLE(pci, pqi_pci_id_table);
 
+static void pqi_reset_prepare(struct pci_dev *pci_dev)
+{
+	struct pqi_ctrl_info *ctrl_info = pci_get_drvdata(pci_dev);
+
+	if (!ctrl_info)
+		return;
+
+	dev_info(&pci_dev->dev, "PCI reset prepare\n");
+
+	pqi_wait_until_ofa_finished(ctrl_info);
+
+	pqi_ofa_ctrl_quiesce(ctrl_info);
+
+	ctrl_info->controller_online = false;
+	ctrl_info->pqi_mode_enabled = false;
+}
+
+static void pqi_reset_done(struct pci_dev *pci_dev)
+{
+	int rc;
+	struct pqi_ctrl_info *ctrl_info = pci_get_drvdata(pci_dev);
+
+	if (!ctrl_info)
+		return;
+
+	dev_info(&pci_dev->dev, "PCI reset done - reinitializing\n");
+
+	ssleep(PQI_POST_RESET_DELAY_SECS);
+
+	pqi_ofa_ctrl_unquiesce(ctrl_info);
+
+	rc = pqi_ctrl_init_resume(ctrl_info);
+	if (rc) {
+		dev_err(&pci_dev->dev, "reset recovery failed: %d\n", rc);
+		pqi_take_ctrl_offline(ctrl_info, PQI_FIRMWARE_KERNEL_NOT_UP);
+		return;
+	}
+
+	dev_info(&pci_dev->dev, "reset recovery complete\n");
+}
+
+static const struct pci_error_handlers pqi_pci_error_handlers = {
+	.reset_prepare	= pqi_reset_prepare,
+	.reset_done	= pqi_reset_done,
+};
+
 static struct pci_driver pqi_pci_driver = {
 	.name = DRIVER_NAME_SHORT,
 	.id_table = pqi_pci_id_table,
 	.probe = pqi_pci_probe,
 	.remove = pqi_pci_remove,
 	.shutdown = pqi_shutdown,
+	.err_handler = &pqi_pci_error_handlers,
 #if defined(CONFIG_PM)
 	.driver = {
 		.pm = &pqi_pm_ops
-- 
2.43.0



* [PATCH 2/2] scsi: smartpqi: increase SIS ctrl ready resume timeout to 180s
  2026-05-06 14:01 [PATCH 0/2] scsi: smartpqi: fix PCIe hot reset recovery Mateusz Nowicki
  2026-05-06 14:01 ` [PATCH 1/2] scsi: smartpqi: add pci_error_handlers for bus " Mateusz Nowicki
@ 2026-05-06 14:01 ` Mateusz Nowicki
  2026-05-06 22:21 ` [PATCH 0/2] scsi: smartpqi: fix PCIe hot reset recovery Laurence Oberman
  2 siblings, 0 replies; 5+ messages in thread
From: Mateusz Nowicki @ 2026-05-06 14:01 UTC (permalink / raw)
  To: don.brace
  Cc: martin.petersen, James.Bottomley, storagedev, linux-scsi,
	linux-kernel, Mateusz Nowicki

After a PCIe hot reset, firmware boot can exceed the 90 second timeout
in sis_wait_for_ctrl_ready_resume().  On HPE SR932i-p Gen10+ boot takes
~125s, causing pqi_ctrl_init_resume() to fail with -ETIMEDOUT:

    smartpqi 0000:84:00.0: PCI reset prepare
    smartpqi 0000:84:00.0: PCI reset done - reinitializing
    smartpqi 0000:84:00.0: controller not ready after 90 seconds
    smartpqi 0000:84:00.0: reset recovery failed: -110

Match SIS_CTRL_READY_TIMEOUT_SECS (180s) used on the cold-boot path.

Signed-off-by: Mateusz Nowicki <mateusz.nowicki@microchip.com>
---
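Note (not part of the commit): the effect of the timeout change can be
illustrated with a simulated poll loop.  This is a sketch under stated
assumptions, not the body of sis_wait_for_ctrl_ready_resume():
sim_wait_for_ctrl_ready() is a hypothetical function, time is a plain
counter rather than a real clock, and "firmware" simply becomes ready a
fixed number of seconds after the reset.

```c
#include <assert.h>
#include <errno.h>

#define POLL_INTERVAL_MSECS 10	/* same value as SIS_CTRL_READY_POLL_INTERVAL_MSECS */

/*
 * Simulated ready poll.  Returns 0 if the simulated firmware becomes ready
 * before timeout_secs elapse, -ETIMEDOUT otherwise.  No sleeping is done;
 * now_ms advances by one poll interval per iteration.
 */
static int sim_wait_for_ctrl_ready(unsigned int timeout_secs,
				   unsigned int fw_ready_after_secs)
{
	unsigned long now_ms = 0;
	unsigned long timeout_ms = timeout_secs * 1000UL;
	unsigned long ready_ms = fw_ready_after_secs * 1000UL;

	assert(timeout_secs > 0);

	for (;;) {
		if (now_ms >= ready_ms)
			return 0;		/* firmware reported ready */
		if (now_ms >= timeout_ms)
			return -ETIMEDOUT;	/* gave up waiting */
		now_ms += POLL_INTERVAL_MSECS;
	}
}
```

With a firmware boot time of 125s, sim_wait_for_ctrl_ready(90, 125)
returns -ETIMEDOUT (the -110 seen in the log above on x86 Linux), while
sim_wait_for_ctrl_ready(180, 125) returns 0.
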
 drivers/scsi/smartpqi/smartpqi_sis.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/smartpqi/smartpqi_sis.c b/drivers/scsi/smartpqi/smartpqi_sis.c
index ae5a264d062d..df06302cec38 100644
--- a/drivers/scsi/smartpqi/smartpqi_sis.c
+++ b/drivers/scsi/smartpqi/smartpqi_sis.c
@@ -58,7 +58,7 @@
 #define SIS_CTRL_KERNEL_UP			0x80
 #define SIS_CTRL_KERNEL_PANIC			0x100
 #define SIS_CTRL_READY_TIMEOUT_SECS		180
-#define SIS_CTRL_READY_RESUME_TIMEOUT_SECS	90
+#define SIS_CTRL_READY_RESUME_TIMEOUT_SECS	180
 #define SIS_CTRL_READY_POLL_INTERVAL_MSECS	10
 
 enum sis_fw_triage_status {
-- 
2.43.0



* Re: [PATCH 0/2] scsi: smartpqi: fix PCIe hot reset recovery
  2026-05-06 14:01 [PATCH 0/2] scsi: smartpqi: fix PCIe hot reset recovery Mateusz Nowicki
  2026-05-06 14:01 ` [PATCH 1/2] scsi: smartpqi: add pci_error_handlers for bus " Mateusz Nowicki
  2026-05-06 14:01 ` [PATCH 2/2] scsi: smartpqi: increase SIS ctrl ready resume timeout to 180s Mateusz Nowicki
@ 2026-05-06 22:21 ` Laurence Oberman
  2026-05-07  1:45   ` Laurence Oberman
  2 siblings, 1 reply; 5+ messages in thread
From: Laurence Oberman @ 2026-05-06 22:21 UTC (permalink / raw)
  To: Mateusz Nowicki, don.brace
  Cc: martin.petersen, James.Bottomley, storagedev, linux-scsi,
	linux-kernel

On Wed, 2026-05-06 at 14:01 +0000, Mateusz Nowicki wrote:
> A PCIe bus reset (e.g. "echo 1 > /sys/bus/pci/devices/<bdf>/reset")
> on a
> controller without FLR support leaves the HPE SR932i-p Gen10+
> unusable
> until reboot: smartpqi registers no pci_error_handlers, so the driver
> is not notified, firmware reverts to SIS mode, and all queue mappings
> are dropped while the driver still drives PQI.
> 
> Patch 1 adds .reset_prepare / .reset_done reusing
> pqi_ofa_ctrl_quiesce() / _unquiesce() / pqi_ctrl_init_resume().
> 
> Patch 2 raises SIS_CTRL_READY_RESUME_TIMEOUT_SECS from 90s to 180s,
> matching the cold-boot path; without it, patch 1 fails at the SIS
> ready check because firmware boot after reset takes ~125s on the
> SR932i-p Gen10+.
> 
> Tested on HPE SR932i-p Gen10+ against Linus' master at 74fe02ce122a.
> 
> Note: the From: header is my Posteo address because my employer's
> SMTP
> is unavailable for external mailing lists.  The Signed-off-by carries
> the Microchip attribution.
> 
> Mateusz Nowicki (2):
>   scsi: smartpqi: add pci_error_handlers for bus reset recovery
>   scsi: smartpqi: increase SIS ctrl ready resume timeout to 180s
> 
>  drivers/scsi/smartpqi/smartpqi_init.c | 47
> +++++++++++++++++++++++++++
>  drivers/scsi/smartpqi/smartpqi_sis.c  |  2 +-
>  2 files changed, 48 insertions(+), 1 deletion(-)
> 
> --
> 2.43.0
> 
> 
> 
Hello

I reproduced this, so I am testing the patches as well.
They look correct to me; I will reply again with a review after
testing.

Thanks
Laurence


[2513778.140012] smartpqi 0000:64:00.0: no heartbeat detected - last heartbeat count: 4207808511
[2513778.140031] smartpqi 0000:64:00.0: controller offline: reason code 0x4 (no controller heartbeat detected)
[2513778.141346] sd 1:0:0:0: [sda] tag#549 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=18s
[2513778.141355] sd 1:0:0:0: [sda] tag#550 FAILED Result: 

"xfs_buf_ioend_handle_error+0xd5/0x3f0 [xfs]" at daddr 0x9f78 len 8
error 5
[2513778.141526] XFS (dm-0): log I/O error -5




* Re: [PATCH 0/2] scsi: smartpqi: fix PCIe hot reset recovery
  2026-05-06 22:21 ` [PATCH 0/2] scsi: smartpqi: fix PCIe hot reset recovery Laurence Oberman
@ 2026-05-07  1:45   ` Laurence Oberman
  0 siblings, 0 replies; 5+ messages in thread
From: Laurence Oberman @ 2026-05-07  1:45 UTC (permalink / raw)
  To: Mateusz Nowicki, don.brace
  Cc: martin.petersen, James.Bottomley, storagedev, linux-scsi,
	linux-kernel

On Wed, 2026-05-06 at 18:21 -0400, Laurence Oberman wrote:
> On Wed, 2026-05-06 at 14:01 +0000, Mateusz Nowicki wrote:
> > A PCIe bus reset (e.g. "echo 1 > /sys/bus/pci/devices/<bdf>/reset")
> > on a
> > controller without FLR support leaves the HPE SR932i-p Gen10+
> > unusable
> > until reboot: smartpqi registers no pci_error_handlers, so the
> > driver
> > is not notified, firmware reverts to SIS mode, and all queue
> > mappings
> > are dropped while the driver still drives PQI.
> > 
> > Patch 1 adds .reset_prepare / .reset_done reusing
> > pqi_ofa_ctrl_quiesce() / _unquiesce() / pqi_ctrl_init_resume().
> > 
> > Patch 2 raises SIS_CTRL_READY_RESUME_TIMEOUT_SECS from 90s to 180s,
> > matching the cold-boot path; without it, patch 1 fails at the SIS
> > ready check because firmware boot after reset takes ~125s on the
> > SR932i-p Gen10+.
> > 
> > Tested on HPE SR932i-p Gen10+ against Linus' master at
> > 74fe02ce122a.
> > 
> > Note: the From: header is my Posteo address because my employer's
> > SMTP
> > is unavailable for external mailing lists.  The Signed-off-by
> > carries
> > the Microchip attribution.
> > 
> > Mateusz Nowicki (2):
> >   scsi: smartpqi: add pci_error_handlers for bus reset recovery
> >   scsi: smartpqi: increase SIS ctrl ready resume timeout to 180s
> > 
> >  drivers/scsi/smartpqi/smartpqi_init.c | 47
> > +++++++++++++++++++++++++++
> >  drivers/scsi/smartpqi/smartpqi_sis.c  |  2 +-
> >  2 files changed, 48 insertions(+), 1 deletion(-)
> > 
> > --
> > 2.43.0
> > 
> > 
> > 
> Hello
> 
> I did reproduce this so I am testing the patches as well.
> They look correct to me, I will reply again after testing with a
> review.
> 
> Thanks
> Laurence
> 
> 
> [2513778.140012] smartpqi 0000:64:00.0: no heartbeat detected - last
> heartbeat count: 4207808511
> [2513778.140031] smartpqi 0000:64:00.0: controller offline: reason
> code
> 0x4 (no controller heartbeat detected)
> [2513778.141346] sd 1:0:0:0: [sda] tag#549 FAILED Result:
> hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=18s
> [2513778.141355] sd 1:0:0:0: [sda] tag#550 FAILED Result: 
> 
> "xfs_buf_ioend_handle_error+0xd5/0x3f0 [xfs]" at daddr 0x9f78 len 8
> error 5
> [2513778.141526] XFS (dm-0): log I/O error -5
> 

Hello 

For the series:

I tested the patches, and the controller recovers with them applied.
The patches look good.

Tested-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>

