Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset

From: Greg Joyce <gjoyce@linux.ibm.com>
To: Nilay Shroff <nilay@linux.ibm.com>,
	kbusch@kernel.org, axboe@fb.com, hch@lst.de, sagi@grimberg.me,
	hare@suse.de, dwagner@suse.de,
	Wendy Xiong <wenxiong@linux.ibm.com>
Cc: linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
Subject: Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
Date: Thu, 22 Feb 2024 15:00:08 -0600
Message-ID: <14c739bedccbf5a4cf21cf3fec724bc17fa265b6.camel@linux.ibm.com>
In-Reply-To: <20240209050342.406184-1-nilay@linux.ibm.com>

On Fri, 2024-02-09 at 10:32 +0530, Nilay Shroff wrote:
> If the nvme subsystem reset causes the loss of communication to the
> nvme adapter then EEH could potentially recover the adapter. The
> detection of communication loss to the adapter only happens when the
> nvme driver attempts to read an MMIO register.
> 
> The nvme subsystem reset command writes 0x4E564D65 to the NSSR
> register and schedules an adapter reset. If the nvme subsystem reset
> caused the loss of communication to the nvme adapter then either the
> IO timeout event or the adapter reset handler could detect it. If the
> IO timeout event detects the loss of communication then the EEH
> handler is able to recover the communication to the adapter. This
> change was implemented in 651438bb0af5 (nvme-pci: Fix EEH failure on
> ppc). However, if the adapter communication loss is detected in the
> nvme reset work handler then EEH is unable to successfully finish the
> adapter recovery.
> 
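As a side note on the value quoted above: 0x4E564D65 is the four ASCII bytes "NVMe", most significant byte first. A minimal sketch that decodes it follows; the helper name nssr_magic_to_ascii is made up for illustration and is not a kernel function.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the nvme driver): unpack a 32-bit
 * value into four characters, most significant byte first, and
 * NUL-terminate the result.
 */
static void nssr_magic_to_ascii(uint32_t value, char out[5])
{
	out[0] = (char)((value >> 24) & 0xff);
	out[1] = (char)((value >> 16) & 0xff);
	out[2] = (char)((value >> 8) & 0xff);
	out[3] = (char)(value & 0xff);
	out[4] = '\0';
}
```

Calling nssr_magic_to_ascii(0x4E564D65, buf) with a 5-byte buffer leaves the string "NVMe" in buf.
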
> This patch ensures that:
> - the nvme driver reset handler observes that the pci channel went
>   offline after a failed MMIO read and avoids marking the controller
>   state DEAD, thus giving the EEH handler a fair chance to recover
>   the nvme adapter.
> 
> - if the nvme controller is already in the RESETTING state and a pci
>   channel frozen error is detected, then the nvme driver
>   pci-error-handler code sends the correct error code
>   (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler so that the EEH
>   handler can proceed with the pci slot reset.
> 
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>

Note that while in this case the issue was discovered via the Power EEH
handler, this is not a Power-specific issue. The problem and fix are
applicable to all architectures.
 

> 
> [  131.415601] EEH: Recovering PHB#40-PE#10000
> [  131.415619] EEH: PE location: N/A, PHB location: N/A
> [  131.415623] EEH: Frozen PHB#40-PE#10000 detected
> [  131.415627] EEH: Call Trace:
> [  131.415629] EEH: [c000000000051078] __eeh_send_failure_event+0x7c/0x15c
> [  131.415782] EEH: [c000000000049bdc] eeh_dev_check_failure.part.0+0x27c/0x6b0
> [  131.415789] EEH: [c000000000cb665c] nvme_pci_reg_read32+0x78/0x9c
> [  131.415802] EEH: [c000000000ca07f8] nvme_wait_ready+0xa8/0x18c
> [  131.415814] EEH: [c000000000cb7070] nvme_dev_disable+0x368/0x40c
> [  131.415823] EEH: [c000000000cb9970] nvme_reset_work+0x198/0x348
> [  131.415830] EEH: [c00000000017b76c] process_one_work+0x1f0/0x4f4
> [  131.415841] EEH: [c00000000017be2c] worker_thread+0x3bc/0x590
> [  131.415846] EEH: [c00000000018a46c] kthread+0x138/0x140
> [  131.415854] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18
> [  131.415864] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
> [  131.415874] EEH: Notify device drivers to shutdown
> [  131.415882] EEH: Beginning: 'error_detected(IO frozen)'
> [  131.415888] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
> [  131.415891] nvme nvme1: frozen state error detected, reset controller
> [  131.515358] nvme 0040:01:00.0: enabling device (0000 -> 0002)
> [  131.515778] nvme nvme1: Disabling device after reset failure: -19
> [  131.555336] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
> [  131.555343] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'disconnect'
> [  131.555371] EEH: Unable to recover from failure from PHB#40-PE#10000.
> [  131.555371] Please try reseating or replacing it
> [  131.556296] EEH: of node=0040:01:00.0
> [  131.556351] EEH: PCI device/vendor: 00251e0f
> [  131.556421] EEH: PCI cmd/status register: 00100142
> [  131.556428] EEH: PCI-E capabilities and status follow:
> [  131.556678] EEH: PCI-E 00: 0002b010 10008fe3 00002910 00436044
> [  131.556859] EEH: PCI-E 10: 10440000 00000000 00000000 00000000
> [  131.556869] EEH: PCI-E 20: 00000000
> [  131.556875] EEH: PCI-E AER capability register set follows:
> [  131.557115] EEH: PCI-E AER 00: 14820001 00000000 00400000 00462030
> [  131.557294] EEH: PCI-E AER 10: 00000000 0000e000 000002a0 00000000
> [  131.557469] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
> [  131.557523] EEH: PCI-E AER 30: 00000000 00000000
> [  131.558807] EEH: Beginning: 'error_detected(permanent failure)'
> [  131.558815] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(permanent failure)
> [  131.558818] nvme nvme1: failure state error detected, request disconnect
> [  131.558839] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
> ---
>  drivers/nvme/host/pci.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index c1d6357ec98a..a6ba46e727ba 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
>   out_unlock:
>  	mutex_unlock(&dev->shutdown_lock);
>   out:
> +	/*
> +	 * If PCI recovery is ongoing then let it finish first
> +	 */
> +	if (pci_channel_offline(to_pci_dev(dev->dev))) {
> +		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
> +		return;
> +	}
> +
>  	/*
>  	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
>  	 * may be holding this pci_dev's device lock.
> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>  	case pci_channel_io_frozen:
>  		dev_warn(dev->ctrl.device,
>  			"frozen state error detected, reset controller\n");
> -		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> -			nvme_dev_disable(dev, true);
> -			return PCI_ERS_RESULT_DISCONNECT;
> +		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
> +			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> +				nvme_dev_disable(dev, true);
> +				return PCI_ERS_RESULT_DISCONNECT;
> +			}
>  		}
>  		nvme_dev_disable(dev, false);
>  		return PCI_ERS_RESULT_NEED_RESET;
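
To make the effect of the second hunk easier to see, here is a rough userspace model of the frozen-state branch before and after the change. It is only a sketch: the enums and function names are invented for illustration and are not the kernel definitions, the nvme_dev_disable() calls are left out, and the rule that only NEW and LIVE controllers may move to RESETTING is a simplifying assumption about nvme_change_ctrl_state().

```c
#include <stdbool.h>

/* Illustrative stand-ins, NOT the kernel's enums. */
enum ctrl_state { CTRL_NEW, CTRL_LIVE, CTRL_RESETTING, CTRL_DELETING, CTRL_DEAD };
enum ers_result { ERS_DISCONNECT, ERS_NEED_RESET };

/*
 * Assumption: a transition to RESETTING succeeds only from NEW or LIVE.
 * A controller that is already RESETTING therefore fails the transition.
 */
static bool change_to_resetting(enum ctrl_state *state)
{
	if (*state == CTRL_NEW || *state == CTRL_LIVE) {
		*state = CTRL_RESETTING;
		return true;
	}
	return false;
}

/* Old flow: any failed transition, including "already RESETTING",
 * is reported as a disconnect. */
static enum ers_result frozen_before_patch(enum ctrl_state *state)
{
	if (!change_to_resetting(state))
		return ERS_DISCONNECT;
	return ERS_NEED_RESET;
}

/* Patched flow: skip the transition when the controller is already
 * RESETTING, so the slot reset can still be requested. */
static enum ers_result frozen_after_patch(enum ctrl_state *state)
{
	if (*state != CTRL_RESETTING) {
		if (!change_to_resetting(state))
			return ERS_DISCONNECT;
	}
	return ERS_NEED_RESET;
}
```

In this model a controller that is already in CTRL_RESETTING gets ERS_DISCONNECT from the old flow and ERS_NEED_RESET from the patched one, which matches the 'disconnect' result seen in the log above when the error is detected from nvme_reset_work.
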


Thread overview: 15+ messages
2024-02-09  5:02 [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset Nilay Shroff
2024-02-22 21:00 ` Greg Joyce [this message]
     [not found] ` <2c76725c-7bb6-4827-b45a-dbe1acbefba7@imap.linux.ibm.com>
2024-02-27 18:14   ` Nilay Shroff
2024-02-27 18:29 ` Keith Busch
2024-02-28 11:19   ` Nilay Shroff
2024-02-29 12:27     ` Nilay Shroff
2024-03-06 11:20   ` Nilay Shroff
2024-03-06 15:19     ` Keith Busch
2024-03-08 15:41 ` Keith Busch
     [not found]   ` <039541c8-2e13-442e-bd5b-90a799a9851a@linux.ibm.com>
     [not found]     ` <ZeyD6xh0LGZyRBfO@kbusch-mbp>
2024-03-09 19:05       ` Nilay Shroff
2024-03-11  4:41         ` Keith Busch
2024-03-11 12:58           ` Nilay Shroff
2024-03-12 14:30             ` Keith Busch
2024-03-13 11:59               ` Nilay Shroff
2024-03-22  5:02                 ` Nilay Shroff