* [PATCH v3 0/1] nvme-pci: recover from NVM subsystem reset
@ 2024-06-04 9:10 Nilay Shroff
2024-06-04 9:10 ` [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after " Nilay Shroff
0 siblings, 1 reply; 11+ messages in thread
From: Nilay Shroff @ 2024-06-04 9:10 UTC (permalink / raw)
To: kbusch; +Cc: linux-nvme, hch, sagi, gjoyce, axboe, Nilay Shroff
Hi Keith,
My previous attempt to get attention for this patch didn't garner enough
eyeballs, so I have rewritten the text and included some more background.
For those interested, I have also copied below the link to the previous
email thread where we had some discussion about this patch.
The NVM subsystem reset command may be needed to activate an NVMe
controller firmware image after the image is committed to a slot, or in
some cases to recover from a controller fatal error. When executed, an
NVM subsystem reset may cause the loss of communication with the NVMe
controller, and the only way to re-establish communication with the
NVMe adapter is to re-enumerate the PCI bus, hot-plug the NVMe disk, or
reboot the OS. Fortunately, the PPC architecture supports an extended
PCI error handling capability which can help recover from such a loss
of PCI adapter communication. The EEH (Enhanced Error Handling)
hardware feature on PPC machines allows PCI bus errors to be cleared
and a PCI card to be "rebooted" without actually having to reboot the
OS, re-enumerate the PCI bus, or hot-plug the NVMe disk.
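For readers less familiar with the PCI error recovery framework: a driver
opts in by registering pci_error_handlers callbacks, which the platform's
recovery code (EEH on PPC, and likewise AER/DPC elsewhere) invokes. A
minimal sketch follows; the callback names, signatures and return codes
are the real ones from include/linux/pci.h, while the demo_* bodies are
illustrative only:

static pci_ers_result_t demo_error_detected(struct pci_dev *pdev,
                                            pci_channel_state_t state)
{
        /* The platform reports the PCI channel as frozen or permanently dead. */
        if (state == pci_channel_io_perm_failure)
                return PCI_ERS_RESULT_DISCONNECT;
        return PCI_ERS_RESULT_NEED_RESET;
}

static pci_ers_result_t demo_slot_reset(struct pci_dev *pdev)
{
        /* The slot has been reset; re-initialize the device here. */
        return PCI_ERS_RESULT_RECOVERED;
}

static void demo_resume(struct pci_dev *pdev)
{
        /* Recovery finished; resume normal I/O. */
}

static const struct pci_error_handlers demo_err_handler = {
        .error_detected = demo_error_detected,
        .slot_reset     = demo_slot_reset,
        .resume         = demo_resume,
};

While recovery is in progress the device is marked offline from the
driver's point of view; pci_channel_offline() (include/linux/pci.h) is
essentially:

static inline int pci_channel_offline(struct pci_dev *pdev)
{
        return (pdev->error_state != pci_channel_io_normal);
}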
In the current implementation, when the user executes the NVM subsystem
reset command, the kernel programs the NVM Subsystem Reset register
(NSSR) and then initiates the nvme reset work. The nvme reset work first
shuts down the controller, and that requires access to PCIe config
space. As programming NSSR typically causes the loss of communication
with the NVMe controller, the nvme reset work that immediately follows
fails to read/write the PCIe config space. That failure makes the nvme
driver believe the controller is dead, so the driver cleans up all
resources associated with that NVMe controller and marks the controller
dead. As a result, PCI error recovery (EEH on PPC) never gets a chance
to recover the device from the lost adapter communication.
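For reference, the flow described above is driven by nvme_reset_subsystem()
in drivers/nvme/host/nvme.h, which looks roughly like this (paraphrased
here; the same helper appears in the diff Keith posts later in this thread):

static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
{
        int ret;

        if (!ctrl->subsystem)
                return -ENOTTY;
        if (!nvme_wait_reset(ctrl))
                return -EBUSY;

        /* 0x4E564D65 is "NVMe" in ASCII, the NSSR magic value. */
        ret = ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, 0x4E564D65);
        if (ret)
                return ret;

        /*
         * Schedule the reset work that subsequently fails its config
         * space accesses once the link is gone.
         */
        return nvme_try_sched_reset(ctrl);
}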
This patch detects the case where communication with the NVMe adapter
is lost and PCI error recovery has been initiated by the platform; in
that case it allows the error recovery to make forward progress and
prevents the nvme reset work (initiated after the NVM subsystem reset)
from marking the controller dead. The reset work detects an in-progress
recovery with the pci_channel_offline() helper shown above. If PCI
error recovery is unable to recover the device, it sets the PCI channel
state to "permanent failure" and the device is removed.
I have tested the following cases with this patch applied:

1. NVM subsystem reset while no IO is running
2. NVM subsystem reset while IO is ongoing
3. Inject a PCI error while reset work is scheduled and no IO is running
4. Inject a PCI error while reset work is scheduled and IO is ongoing

For all of the above cases (1-4), I verified that PCI error recovery
could successfully recover the NVMe disk.

5. NVM subsystem reset followed immediately by hot removal of the NVMe
disk:
In this case, though PCI error recovery is initiated, it cannot make
forward progress (as the disk is hot removed), so the controller is
deleted and all of its associated resources are freed.

6. NVM subsystem reset where PCI error recovery is unable to recover
the device:
In this case the controller is deleted and all of its associated
resources are freed.

7. NVM subsystem reset on a platform which doesn't support PCI error
recovery:
In this case the nvme reset work frees the resources associated with
the controller and marks it dead.
Changelog:
==========
Changes from v2:
  - Formatting cleanup
  - Updated commit changelog to better describe the issue
  - Added the cover letter to provide more details about NVM
    subsystem reset and error recovery (EEH)

Changes from v1:
  - Allow a controller to move from the CONNECTING state to the
    RESETTING state (Keith)
  - Fix a race condition between the reset work and the pci error
    handler code which could prevent the reset work and pci recovery
    from making forward progress (Keith)
Link: https://lore.kernel.org/all/20240209050342.406184-1-nilay@linux.ibm.com/
Nilay Shroff (1):
nvme-pci : Fix EEH failure on ppc after subsystem reset
drivers/nvme/host/core.c | 1 +
drivers/nvme/host/pci.c | 20 +++++++++++++++++---
2 files changed, 18 insertions(+), 3 deletions(-)
--
2.45.1
* [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after subsystem reset
2024-06-04 9:10 [PATCH v3 0/1] nvme-pci: recover from NVM subsystem reset Nilay Shroff
@ 2024-06-04 9:10 ` Nilay Shroff
2024-06-10 12:32 ` Maurizio Lombardi
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Nilay Shroff @ 2024-06-04 9:10 UTC (permalink / raw)
To: kbusch; +Cc: linux-nvme, hch, sagi, gjoyce, axboe, Nilay Shroff
The NVMe subsystem reset command, when executed, may cause the loss of
the NVMe adapter's communication with the kernel. And the only way
today to recover the adapter is to either re-enumerate the PCI bus,
hot-plug the NVMe disk, or reboot the OS.

The PPC architecture supports a mechanism called EEH (Enhanced Error
Handling) which allows PCI bus errors to be cleared and a PCI card to
be rebooted, without having to physically hot-plug the NVMe disk or
reboot the OS.

In the current implementation, when the user executes the nvme
subsystem reset command and the kernel loses communication with the
NVMe adapter, subsequent reads and writes to the PCIe config space of
the device fail. Failing to read/write the PCI config space makes the
NVMe driver assume a permanent loss of communication with the device,
so the driver marks the NVMe controller dead and frees all resources
associated with that controller. As the NVMe controller goes dead, the
EEH recovery can't succeed.

This patch fixes the issue so that if, after the user executes the
subsystem reset command, communication with the NVMe adapter is lost
and EEH recovery is initiated, we allow the EEH recovery to make
forward progress and give the EEH thread a fair chance to recover the
adapter. If the EEH thread can't recover the adapter communication, it
sets the pci channel state of the erring adapter to "permanent
failure" and removes the device.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
drivers/nvme/host/core.c | 1 +
drivers/nvme/host/pci.c | 21 ++++++++++++++++++---
2 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f5d150c62955..afb8419566a9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -562,6 +562,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
                 switch (old_state) {
                 case NVME_CTRL_NEW:
                 case NVME_CTRL_LIVE:
+                case NVME_CTRL_CONNECTING:
                         changed = true;
                         fallthrough;
                 default:
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 102a9fb0c65f..f1bb8df20701 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2789,6 +2789,17 @@ static void nvme_reset_work(struct work_struct *work)
 out_unlock:
         mutex_unlock(&dev->shutdown_lock);
 out:
+        /*
+         * If PCI recovery is ongoing then let it finish first
+         */
+        if (pci_channel_offline(to_pci_dev(dev->dev))) {
+                if (nvme_ctrl_state(&dev->ctrl) == NVME_CTRL_RESETTING ||
+                    nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+                        dev_warn(dev->ctrl.device,
+                                 "Let pci error recovery finish!\n");
+                        return;
+                }
+        }
         /*
          * Set state to deleting now to avoid blocking nvme_wait_reset(), which
          * may be holding this pci_dev's device lock.
@@ -3308,10 +3319,14 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
         case pci_channel_io_frozen:
                 dev_warn(dev->ctrl.device,
                         "frozen state error detected, reset controller\n");
-                if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
-                        nvme_dev_disable(dev, true);
-                        return PCI_ERS_RESULT_DISCONNECT;
+                if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
+                        if (!nvme_change_ctrl_state(&dev->ctrl,
+                                                    NVME_CTRL_RESETTING)) {
+                                nvme_dev_disable(dev, true);
+                                return PCI_ERS_RESULT_DISCONNECT;
+                        }
                 }
+                flush_work(&dev->ctrl.reset_work);
                 nvme_dev_disable(dev, false);
                 return PCI_ERS_RESULT_NEED_RESET;
         case pci_channel_io_perm_failure:
--
2.45.1
* Re: [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after subsystem reset
2024-06-04 9:10 ` [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after " Nilay Shroff
@ 2024-06-10 12:32 ` Maurizio Lombardi
2024-06-12 11:07 ` Nilay Shroff
2024-06-14 9:51 ` Hannes Reinecke
2024-06-21 16:37 ` Keith Busch
2 siblings, 1 reply; 11+ messages in thread
From: Maurizio Lombardi @ 2024-06-10 12:32 UTC (permalink / raw)
To: Nilay Shroff; +Cc: kbusch, linux-nvme, hch, sagi, gjoyce, axboe
On Tue, 4 Jun 2024 at 11:16, Nilay Shroff <nilay@linux.ibm.com> wrote:
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f5d150c62955..afb8419566a9 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -562,6 +562,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> switch (old_state) {
> case NVME_CTRL_NEW:
> case NVME_CTRL_LIVE:
> + case NVME_CTRL_CONNECTING:
> changed = true;
> fallthrough;
> default:
Side note:
I believe this could fix a race condition in the nvme-tcp driver.
Some time ago we received a kernel dump where the controller was
flagged as NVME_CTRL_LIVE but the sockets were marked as TCP_CLOSED,
and the kernel was hanging in io_work.

It might be possible that the socket's nvme_tcp_error_recovery
callback was executed while the ctrl was still marked as
NVME_CTRL_CONNECTING; nvme_change_ctrl_state() doesn't allow a
reconnecting controller to switch to the resetting state, and
therefore nvme_tcp_error_recovery() simply returns without triggering
an error recovery.

nvme_tcp_setup_ctrl() then moves the ctrl to the LIVE state, but with
closed TCP sockets.
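For reference, the constraint in play is in nvme_change_ctrl_state():
before this patch a transition to NVME_CTRL_RESETTING is only accepted
from NEW or LIVE, roughly as below (the enclosing case label is inferred
from the hunk quoted above and the v1 changelog):

        case NVME_CTRL_RESETTING:
                switch (old_state) {
                case NVME_CTRL_NEW:
                case NVME_CTRL_LIVE:
                        changed = true;
                        fallthrough;
                default:
                        break;
                }
                break;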
I guess that this change would fix the possible race condition, even
if I am not completely sure what could happen in case of a race
between nvme_tcp_setup_ctrl() and the error recovery code.
Maurizio
* Re: [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after subsystem reset
2024-06-10 12:32 ` Maurizio Lombardi
@ 2024-06-12 11:07 ` Nilay Shroff
2024-06-12 13:10 ` Maurizio Lombardi
0 siblings, 1 reply; 11+ messages in thread
From: Nilay Shroff @ 2024-06-12 11:07 UTC (permalink / raw)
To: Maurizio Lombardi; +Cc: kbusch, linux-nvme, hch, sagi, gjoyce, axboe
On 6/10/24 18:02, Maurizio Lombardi wrote:
> On Tue, 4 Jun 2024 at 11:16, Nilay Shroff <nilay@linux.ibm.com> wrote:
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index f5d150c62955..afb8419566a9 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -562,6 +562,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>> switch (old_state) {
>> case NVME_CTRL_NEW:
>> case NVME_CTRL_LIVE:
>> + case NVME_CTRL_CONNECTING:
>> changed = true;
>> fallthrough;
>> default:
>
> Side note:
>
> I believe this could fix a race condition in the nvme-tcp driver.
> Some time ago we received a kernel dump where the controller was
> flagged as NVME_CTRL_LIVE but the sockets were marked as TCP_CLOSED,
> and the kernel was hanging in io_work.
> It might be possible that the socket's nvme_tcp_error_recovery
> callback was executed while the ctrl was still marked as
> NVME_CTRL_CONNECTING; nvme_change_ctrl_state() doesn't allow a
> reconnecting controller to switch to the resetting state, and
> therefore nvme_tcp_error_recovery() simply returns without triggering
> an error recovery.
>
> nvme_tcp_setup_ctrl() then moves the ctrl to the LIVE state, but with
> closed TCP sockets.
>
> I guess that this change would fix the possible race condition, even
> if I am not completely sure what could happen in case of a race
> between nvme_tcp_setup_ctrl() and the error recovery code.
>
I think for your case you should consider using nvme_wait_reset() before
queuing the error recovery work. For instance, it could be done something
like the below, assuming that all callers of nvme_tcp_error_recovery() are
allowed to sleep (not fully tested):
static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
{
        if (!nvme_wait_reset(ctrl))
                return;

        dev_warn(ctrl->device, "starting error recovery\n");
        queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
}
nvme_wait_reset() would wait until nvme_tcp_setup_ctrl() finishes its
work, and only then would the TCP error recovery work be initiated.
Thanks,
--Nilay
* Re: [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after subsystem reset
2024-06-12 11:07 ` Nilay Shroff
@ 2024-06-12 13:10 ` Maurizio Lombardi
2024-06-12 17:07 ` Nilay Shroff
0 siblings, 1 reply; 11+ messages in thread
From: Maurizio Lombardi @ 2024-06-12 13:10 UTC (permalink / raw)
To: Nilay Shroff; +Cc: kbusch, linux-nvme, hch, sagi, gjoyce, axboe
On Wed, 12 Jun 2024 at 13:07, Nilay Shroff <nilay@linux.ibm.com> wrote:
> I think for your case you should consider using nvme_wait_reset() before
> queuing the error recovery work. For instance, it could be done something
> like the below, assuming that all callers of nvme_tcp_error_recovery() are
> allowed to sleep (not fully tested):
>
> static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> {
>         if (!nvme_wait_reset(ctrl))
>                 return;
Nice idea, but in this case I don't think it can work,
because the socket's callbacks run in interrupt context.
Maurizio
* Re: [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after subsystem reset
2024-06-12 13:10 ` Maurizio Lombardi
@ 2024-06-12 17:07 ` Nilay Shroff
2024-06-13 7:02 ` Maurizio Lombardi
0 siblings, 1 reply; 11+ messages in thread
From: Nilay Shroff @ 2024-06-12 17:07 UTC (permalink / raw)
To: Maurizio Lombardi; +Cc: kbusch, linux-nvme, hch, sagi, gjoyce, axboe
On 6/12/24 18:40, Maurizio Lombardi wrote:
> On Wed, 12 Jun 2024 at 13:07, Nilay Shroff <nilay@linux.ibm.com> wrote:
>> I think for your case you should consider using nvme_wait_reset() before
>> queuing the error recovery work. For instance, it could be done something
>> like the below, assuming that all callers of nvme_tcp_error_recovery() are
>> allowed to sleep (not fully tested):
>>
>> static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
>> {
>>         if (!nvme_wait_reset(ctrl))
>>                 return;
>
> Nice idea, but in this case I don't think it can work,
> because the socket's callbacks run in interrupt context.
>
Ok so in that case how about calling nvme_wait_reset()
from nvme_tcp_error_recovery_work()?
Thanks,
--Nilay
* Re: [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after subsystem reset
2024-06-12 17:07 ` Nilay Shroff
@ 2024-06-13 7:02 ` Maurizio Lombardi
0 siblings, 0 replies; 11+ messages in thread
From: Maurizio Lombardi @ 2024-06-13 7:02 UTC (permalink / raw)
To: Nilay Shroff; +Cc: kbusch, linux-nvme, hch, sagi, gjoyce, axboe
On Wed, 12 Jun 2024 at 19:08, Nilay Shroff <nilay@linux.ibm.com> wrote:
>
> Ok so in that case how about calling nvme_wait_reset()
> from nvme_tcp_error_recovery_work()?
That would work, but if your patches get merged it won't be necessary,
because it will become possible to switch the ctrl state directly
from "connecting" to "resetting".
Maurizio
* Re: [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after subsystem reset
2024-06-04 9:10 ` [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after " Nilay Shroff
2024-06-10 12:32 ` Maurizio Lombardi
@ 2024-06-14 9:51 ` Hannes Reinecke
2024-06-21 16:37 ` Keith Busch
2 siblings, 0 replies; 11+ messages in thread
From: Hannes Reinecke @ 2024-06-14 9:51 UTC (permalink / raw)
To: Nilay Shroff, kbusch; +Cc: linux-nvme, hch, sagi, gjoyce, axboe
On 6/4/24 11:10, Nilay Shroff wrote:
> The NVMe subsystem reset command, when executed, may cause the loss of
> the NVMe adapter's communication with the kernel. And the only way
> today to recover the adapter is to either re-enumerate the PCI bus,
> hot-plug the NVMe disk, or reboot the OS.
>
> The PPC architecture supports a mechanism called EEH (Enhanced Error
> Handling) which allows PCI bus errors to be cleared and a PCI card to
> be rebooted, without having to physically hot-plug the NVMe disk or
> reboot the OS.
>
> In the current implementation, when the user executes the nvme
> subsystem reset command and the kernel loses communication with the
> NVMe adapter, subsequent reads and writes to the PCIe config space of
> the device fail. Failing to read/write the PCI config space makes the
> NVMe driver assume a permanent loss of communication with the device,
> so the driver marks the NVMe controller dead and frees all resources
> associated with that controller. As the NVMe controller goes dead, the
> EEH recovery can't succeed.
>
> This patch fixes the issue so that if, after the user executes the
> subsystem reset command, communication with the NVMe adapter is lost
> and EEH recovery is initiated, we allow the EEH recovery to make
> forward progress and give the EEH thread a fair chance to recover the
> adapter. If the EEH thread can't recover the adapter communication, it
> sets the pci channel state of the erring adapter to "permanent
> failure" and removes the device.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
> ---
> drivers/nvme/host/core.c | 1 +
> drivers/nvme/host/pci.c | 21 ++++++++++++++++++---
> 2 files changed, 19 insertions(+), 3 deletions(-)
>
Looks odd, but I cannot see a better way out here.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich
* Re: [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after subsystem reset
2024-06-04 9:10 ` [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after " Nilay Shroff
2024-06-10 12:32 ` Maurizio Lombardi
2024-06-14 9:51 ` Hannes Reinecke
@ 2024-06-21 16:37 ` Keith Busch
2024-06-22 15:07 ` Nilay Shroff
2 siblings, 1 reply; 11+ messages in thread
From: Keith Busch @ 2024-06-21 16:37 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-nvme, hch, sagi, gjoyce, axboe
On Tue, Jun 04, 2024 at 02:40:04PM +0530, Nilay Shroff wrote:
> The NVMe subsystem reset command, when executed, may cause the loss of
> the NVMe adapter's communication with the kernel. And the only way
> today to recover the adapter is to either re-enumerate the PCI bus,
> hot-plug the NVMe disk, or reboot the OS.
>
> The PPC architecture supports a mechanism called EEH (Enhanced Error
> Handling) which allows PCI bus errors to be cleared and a PCI card to
> be rebooted, without having to physically hot-plug the NVMe disk or
> reboot the OS.
>
> In the current implementation, when the user executes the nvme
> subsystem reset command and the kernel loses communication with the
> NVMe adapter, subsequent reads and writes to the PCIe config space of
> the device fail. Failing to read/write the PCI config space makes the
> NVMe driver assume a permanent loss of communication with the device,
> so the driver marks the NVMe controller dead and frees all resources
> associated with that controller. As the NVMe controller goes dead, the
> EEH recovery can't succeed.
>
> This patch fixes the issue so that if, after the user executes the
> subsystem reset command, communication with the NVMe adapter is lost
> and EEH recovery is initiated, we allow the EEH recovery to make
> forward progress and give the EEH thread a fair chance to recover the
> adapter. If the EEH thread can't recover the adapter communication, it
> sets the pci channel state of the erring adapter to "permanent
> failure" and removes the device.
I think the driver is trying to do too much here by trying to handle the
subsystem reset inline with the reset request. It would surely fail, but
that was the idea because we had been expecting pciehp to re-enumerate.
But there are other possibilities, like your EEH, or others like DPC and
AER could do different handling instead of a bus rescan. So, perhaps the
problem is just the subsystem reset handling. Maybe just don't proceed
with the reset handling, and it'll be fine?
---
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 68b400f9c42d5..97ed33d9046d4 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -646,9 +646,12 @@ static inline void nvme_should_fail(struct request *req) {}
 
 bool nvme_wait_reset(struct nvme_ctrl *ctrl);
 int nvme_try_sched_reset(struct nvme_ctrl *ctrl);
+bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
+                enum nvme_ctrl_state new_state);
 
 static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
 {
+        u32 val;
         int ret;
 
         if (!ctrl->subsystem)
@@ -657,10 +660,10 @@ static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
                 return -EBUSY;
 
         ret = ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, 0x4E564D65);
-        if (ret)
-                return ret;
-
-        return nvme_try_sched_reset(ctrl);
+        nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE);
+        if (!ret)
+                ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &val);
+        return ret;
 }
 
 /*
@@ -786,8 +789,6 @@ blk_status_t nvme_host_path_error(struct request *req);
 bool nvme_cancel_request(struct request *req, void *data);
 void nvme_cancel_tagset(struct nvme_ctrl *ctrl);
 void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl);
-bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
-                enum nvme_ctrl_state new_state);
 int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown);
 int nvme_enable_ctrl(struct nvme_ctrl *ctrl);
 int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
--
* Re: [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after subsystem reset
2024-06-21 16:37 ` Keith Busch
@ 2024-06-22 15:07 ` Nilay Shroff
2024-06-24 16:07 ` Keith Busch
0 siblings, 1 reply; 11+ messages in thread
From: Nilay Shroff @ 2024-06-22 15:07 UTC (permalink / raw)
To: Keith Busch; +Cc: linux-nvme, hch, sagi, gjoyce, axboe
On 6/21/24 22:07, Keith Busch wrote:
> On Tue, Jun 04, 2024 at 02:40:04PM +0530, Nilay Shroff wrote:
>> The NVMe subsystem reset command, when executed, may cause the loss of
>> the NVMe adapter's communication with the kernel. And the only way
>> today to recover the adapter is to either re-enumerate the PCI bus,
>> hot-plug the NVMe disk, or reboot the OS.
>>
>> The PPC architecture supports a mechanism called EEH (Enhanced Error
>> Handling) which allows PCI bus errors to be cleared and a PCI card to
>> be rebooted, without having to physically hot-plug the NVMe disk or
>> reboot the OS.
>>
>> In the current implementation, when the user executes the nvme
>> subsystem reset command and the kernel loses communication with the
>> NVMe adapter, subsequent reads and writes to the PCIe config space of
>> the device fail. Failing to read/write the PCI config space makes the
>> NVMe driver assume a permanent loss of communication with the device,
>> so the driver marks the NVMe controller dead and frees all resources
>> associated with that controller. As the NVMe controller goes dead, the
>> EEH recovery can't succeed.
>>
>> This patch fixes the issue so that if, after the user executes the
>> subsystem reset command, communication with the NVMe adapter is lost
>> and EEH recovery is initiated, we allow the EEH recovery to make
>> forward progress and give the EEH thread a fair chance to recover the
>> adapter. If the EEH thread can't recover the adapter communication, it
>> sets the pci channel state of the erring adapter to "permanent
>> failure" and removes the device.
>
> I think the driver is trying to do too much here by trying to handle the
> subsystem reset inline with the reset request. It would surely fail, but
> that was the idea because we had been expecting pciehp to re-enumerate.
> But there are other possibilities, like your EEH, or others like DPC and
> AER could do different handling instead of a bus rescan. So, perhaps the
> problem is just the subsystem reset handling. Maybe just don't proceed
> with the reset handling, and it'll be fine?
>
> ---
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 68b400f9c42d5..97ed33d9046d4 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -646,9 +646,12 @@ static inline void nvme_should_fail(struct request *req) {}
>
> bool nvme_wait_reset(struct nvme_ctrl *ctrl);
> int nvme_try_sched_reset(struct nvme_ctrl *ctrl);
> +bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> + enum nvme_ctrl_state new_state);
>
> static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
> {
> + u32 val;
> int ret;
>
> if (!ctrl->subsystem)
> @@ -657,10 +660,10 @@ static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
> return -EBUSY;
>
> ret = ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, 0x4E564D65);
> - if (ret)
> - return ret;
> -
> - return nvme_try_sched_reset(ctrl);
> + nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE);
> + if (!ret)
> + ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &val);
> + return ret;
> }
>
> /*
> @@ -786,8 +789,6 @@ blk_status_t nvme_host_path_error(struct request *req);
> bool nvme_cancel_request(struct request *req, void *data);
> void nvme_cancel_tagset(struct nvme_ctrl *ctrl);
> void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl);
> -bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> - enum nvme_ctrl_state new_state);
> int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown);
> int nvme_enable_ctrl(struct nvme_ctrl *ctrl);
> int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
> --
>
This is a nice idea! These changes look good. I have tested it on powerpc
with EEH and I observed that, post nvme subsystem-reset, EEH is able to
recover the disk. I have also tested it on a platform which *doesn't*
support EEH or PCI error recovery, and on that platform I observed that
the nvme disk falls into the dead state.

So I think you may submit a formal patch with this change.
Thanks,
--Nilay
* Re: [PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after subsystem reset
2024-06-22 15:07 ` Nilay Shroff
@ 2024-06-24 16:07 ` Keith Busch
0 siblings, 0 replies; 11+ messages in thread
From: Keith Busch @ 2024-06-24 16:07 UTC (permalink / raw)
To: Nilay Shroff; +Cc: linux-nvme, hch, sagi, gjoyce, axboe
On Sat, Jun 22, 2024 at 08:37:02PM +0530, Nilay Shroff wrote:
> On 6/21/24 22:07, Keith Busch wrote:
> > static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
> > {
> > + u32 val;
> > int ret;
> >
> > if (!ctrl->subsystem)
> > @@ -657,10 +660,10 @@ static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
> > return -EBUSY;
> >
> > ret = ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, 0x4E564D65);
> > - if (ret)
> > - return ret;
> > -
> > - return nvme_try_sched_reset(ctrl);
> > + nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE);
> > + if (!ret)
> > + ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &val);
> > + return ret;
> This is a nice idea! These changes look good. I have tested it on powerpc with
> EEH and I observed that, post nvme subsystem-reset, EEH is able to recover the disk.
> I have also tested it on a platform which *doesn't* support EEH or PCI error recovery,
> and on that platform I observed that the nvme disk falls into the dead state.
>
> So I think you may submit a formal patch with this change.
Just a little concerned about the reg_read32 at the end there. A hot
plug event is a potentially expected outcome of the reg write, and that
may unmap the PCI BAR before the read.

And come to think of it, a hot plug could occur before the reg_write32,
too, for a reason unrelated to the requested subsys-reset operation...

Anyway, I think this needs a driver-specific op to handle it safely.
I'll send a patch.
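To sketch what such a driver-specific op might look like (purely
illustrative, not the actual patch; the callback name, the
bar_mapped_size check and the use of shutdown_lock are assumptions):

/* Hypothetical nvme_ctrl_ops callback implemented by nvme-pci. */
static int nvme_pci_subsystem_reset(struct nvme_ctrl *ctrl)
{
        struct nvme_dev *dev = to_nvme_dev(ctrl);
        int ret = 0;

        /*
         * Hold shutdown_lock so a concurrent reset/remove cannot unmap
         * the BAR underneath us while we poke NSSR.
         */
        mutex_lock(&dev->shutdown_lock);
        if (!dev->bar_mapped_size) {
                ret = -ENODEV;
                goto unlock;
        }

        writel(0x4E564D65, dev->bar + NVME_REG_NSSR);

        /*
         * Read CSTS to flush the write; if the link is gone, this is
         * what lets the platform (EEH/AER/DPC) notice the error.
         */
        readl(dev->bar + NVME_REG_CSTS);
unlock:
        mutex_unlock(&dev->shutdown_lock);
        return ret;
}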