From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <2fc14fe8-7d39-42b4-b963-415714ece38c@linux.ibm.com>
Date: Thu, 29 Feb 2024 17:57:56 +0530
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
From: Nilay Shroff
To: Keith Busch
Cc: axboe@fb.com, hch@lst.de, sagi@grimberg.me, linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, gjoyce@linux.ibm.com, Srimannarayana Murthy Maram
References: <20240209050342.406184-1-nilay@linux.ibm.com> <07b92c25-d0dd-455c-8fb9-a4f2709677ba@linux.ibm.com>
In-Reply-To: <07b92c25-d0dd-455c-8fb9-a4f2709677ba@linux.ibm.com>
Content-Language: en-US
Content-Type: text/plain; charset=UTF-8
X-Mailing-List: linux-block@vger.kernel.org
MIME-Version: 1.0

Hi Keith,

On 2/28/24 16:49, Nilay Shroff wrote:
>
>
> On 2/27/24 23:59, Keith Busch wrote:
>> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
>>> If the nvme subsystem reset causes the loss of communication to the nvme
>>> adapter then EEH could potentially recover the adapter. The detection of
>>> communication loss to the adapter only happens when the nvme driver
>>> attempts to read an MMIO register.
>>>
>>> The nvme subsystem reset command writes 0x4E564D65 to the NSSR register and
>>> schedules an adapter reset. If the nvme subsystem reset causes the loss
>>> of communication to the nvme adapter, then either an IO timeout event or
>>> the adapter reset handler could detect it. If the IO timeout event detects
>>> the loss of communication, then the EEH handler is able to recover the
>>> communication to the adapter. This change was implemented in 651438bb0af5
>>> (nvme-pci: Fix EEH failure on ppc). However, if the adapter communication
>>> loss is detected in the nvme reset work handler, then EEH is unable to
>>> successfully finish the adapter recovery.
>>>
>>> This patch ensures that:
>>>
>>> - the nvme driver reset handler observes that the pci channel is offline
>>>   after a failed MMIO read and avoids marking the controller state DEAD,
>>>   thus giving the EEH handler a fair chance to recover the nvme adapter;
>>>
>>> - if the nvme controller is already in the RESETTING state and a pci
>>>   channel frozen error is detected, then the nvme driver pci-error-handler
>>>   code sends the correct error code (PCI_ERS_RESULT_NEED_RESET) back to
>>>   the EEH handler so that the EEH handler can proceed with the pci slot
>>>   reset.
>>
>> A subsystem reset takes the link down. I'm pretty sure the proper way to
>> recover from it requires pcie hotplug support.
>>
> Yes, you're correct. We require pcie hotplugging to recover. However, the
> powerpc EEH handler is able to recover the pcie adapter without physically
> removing and re-inserting it; in other words, it can reset the adapter
> without hotplug activity. In fact, powerpc EEH can isolate the pcie slot and
> reset it (i.e. reset the PCI device by holding the PCI #RST line high for
> two seconds), followed by setting up the device config space (the base
> address registers (BARs), latency timer, cache line size, interrupt line,
> and so on).
>
> You may find more information about EEH recovery here:
> https://www.kernel.org/doc/Documentation/powerpc/eeh-pci-error-recovery.txt
>
> Typically, when a pcie error is detected and EEH is able to recover the
> device, the EEH handler code goes through the sequence below (assuming the
> driver is EEH aware):
>
> eeh_handle_normal_event()
>   eeh_set_channel_state() -> set state to pci_channel_io_frozen
>   eeh_report_error()
>     nvme_error_detected() -> channel state pci_channel_io_frozen; returns PCI_ERS_RESULT_NEED_RESET
>   eeh_slot_reset() -> recovery successful
>     nvme_slot_reset() -> returns PCI_ERS_RESULT_RECOVERED
>   eeh_set_channel_state() -> set state to pci_channel_io_normal
>   nvme_error_resume()
>
> In case a pcie error is detected and EEH is unable to recover the device,
> the EEH handler code goes through the sequence below:
>
> eeh_handle_normal_event()
>   eeh_set_channel_state() -> set state to pci_channel_io_frozen
>   eeh_report_error()
>     nvme_error_detected() -> channel state pci_channel_io_frozen; returns PCI_ERS_RESULT_NEED_RESET
>   eeh_slot_reset() -> recovery failed
>   eeh_report_failure()
>     nvme_error_detected() -> channel state pci_channel_io_perm_failure; returns PCI_ERS_RESULT_DISCONNECT
>   eeh_set_channel_state() -> set state to pci_channel_io_perm_failure
>   nvme_remove()
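>
> For reference, these callbacks are the driver's struct pci_error_handlers
> hooks, and the return codes shown above come from the channel-state switch
> in the driver's error_detected callback. A simplified sketch of the shape of
> that code (not the verbatim upstream source) looks like:
>
> static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
> 		pci_channel_state_t state)
> {
> 	struct nvme_dev *dev = pci_get_drvdata(pdev);
>
> 	switch (state) {
> 	case pci_channel_io_normal:
> 		/* Transient error; no slot reset required. */
> 		return PCI_ERS_RESULT_CAN_RECOVER;
> 	case pci_channel_io_frozen:
> 		/* MMIO is blocked; quiesce and request a slot reset. */
> 		dev_warn(dev->ctrl.device,
> 			 "frozen state error detected, reset controller\n");
> 		nvme_dev_disable(dev, false);
> 		return PCI_ERS_RESULT_NEED_RESET;
> 	case pci_channel_io_perm_failure:
> 		/* Recovery failed; the device will be removed. */
> 		dev_warn(dev->ctrl.device,
> 			 "failure state error detected, request disconnect\n");
> 		return PCI_ERS_RESULT_DISCONNECT;
> 	}
> 	return PCI_ERS_RESULT_NEED_RESET;
> }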
>
> If we execute the command "nvme subsystem-reset ..." and adapter
> communication is lost, then in the current code (under nvme_reset_work()) we
> simply disable the device and mark the controller DEAD. However, we may have
> a chance to recover the controller if the driver is EEH aware and EEH
> recovery is underway. We already handle one such case in nvme_timeout(). So
> this patch ensures that if we fall through nvme_reset_work() after a
> subsystem reset while EEH recovery is in progress, then we give the EEH
> mechanism a chance to recover the adapter. If the EEH recovery is
> unsuccessful, we anyway fall through the code path mentioned above, where we
> invoke nvme_remove() at the end and delete the erring controller.
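>
> Concretely, the fix amounts to a guard of roughly this shape in the
> nvme_reset_work() failure path (an illustrative sketch of the approach
> described above, not the literal patch hunk):
>
> 	/*
> 	 * The MMIO read may have failed because EEH isolated the slot.
> 	 * If PCI error recovery is in progress, don't disable the device
> 	 * or mark the controller DEAD here; back off and let the
> 	 * pci_error_handlers callbacks drive the recovery to completion.
> 	 */
> 	if (pci_channel_offline(to_pci_dev(dev->dev)))
> 		return;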
>
> BTW, a similar issue was fixed earlier in 651438bb0af5 (nvme-pci: Fix EEH
> failure on ppc). That fix was needed because controller health-check polling
> was removed in b2a0eb1a0ac72869 (nvme-pci: Remove watchdog timer). In fact,
> today we may be able to recover the NVMe adapter if a subsystem reset or any
> other PCI error occurs while some I/O request is in flight. The recovery is
> possible because the in-flight I/O request would eventually time out, and
> nvme_timeout() has special code (added in 651438bb0af5) that gives EEH a
> chance to recover the adapter. However, later, in 1e866afd4bcdd (nvme:
> ensure subsystem reset is single threaded), the nvme subsystem reset code
> was reworked, so now when the user executes the subsystem-reset command, the
> kernel first writes 0x4E564D65 ("NVMe" in ASCII) to the NSSR register and
> then schedules the adapter reset. It's quite possible that when
> subsystem-reset is executed there is no I/O in flight, and hence we may
> never hit nvme_timeout(). Later, when the adapter reset code (under
> nvme_reset_work()) starts executing, it accesses MMIO registers. Hence, IMO,
> nvme_reset_work() would also need changes similar to those implemented in
> nvme_timeout() so that EEH recovery becomes possible.
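>
> For comparison, the special case that 651438bb0af5 added to nvme_timeout()
> is roughly of this form (a sketch, not necessarily the exact upstream hunk):
>
> 	/*
> 	 * If PCI error recovery is in progress (the channel is offline),
> 	 * a reset from the timeout path would race with and break the
> 	 * recovery, so just restart the request timer and let EEH finish.
> 	 */
> 	mb();
> 	if (pci_channel_offline(to_pci_dev(dev->dev)))
> 		return BLK_EH_RESET_TIMER;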
>
> With the proposed patch, we find that EEH recovery is successful post
> subsystem-reset. Please find below the relevant output:
>
> # lspci
> 0524:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01)
>
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
>                iopolicy=numa
> \
>  +- nvme0 pcie 0524:28:00.0 live
>
> # nvme subsystem-reset /dev/nvme0
>
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
>                iopolicy=numa
> \
>  +- nvme0 pcie 0524:28:00.0 resetting
>
> [10556.034082] EEH: Recovering PHB#524-PE#280000
> [10556.034108] EEH: PE location: N/A, PHB location: N/A
> [10556.034112] EEH: Frozen PHB#524-PE#280000 detected
> [10556.034115] EEH: Call Trace:
> [10556.034117] EEH: [c000000000051068] __eeh_send_failure_event+0x7c/0x15c
> [10556.034304] EEH: [c000000000049bcc] eeh_dev_check_failure.part.0+0x27c/0x6b0
> [10556.034310] EEH: [c008000004753d3c] nvme_pci_reg_read32+0x80/0xac [nvme]
> [10556.034319] EEH: [c0080000045f365c] nvme_wait_ready+0xa4/0x18c [nvme_core]
> [10556.034333] EEH: [c008000004754750] nvme_dev_disable+0x370/0x41c [nvme]
> [10556.034338] EEH: [c008000004757184] nvme_reset_work+0x1f4/0x3cc [nvme]
> [10556.034344] EEH: [c00000000017bb8c] process_one_work+0x1f0/0x4f4
> [10556.034350] EEH: [c00000000017c24c] worker_thread+0x3bc/0x590
> [10556.034355] EEH: [c00000000018a87c] kthread+0x138/0x140
> [10556.034358] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18
> [10556.034363] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
> [10556.034368] EEH: Notify device drivers to shutdown
> [10556.034371] EEH: Beginning: 'error_detected(IO frozen)'
> [10556.034376] PCI 0524:28:00.0#280000: EEH: Invoking nvme->error_detected(IO frozen)
> [10556.034379] nvme nvme0: frozen state error detected, reset controller
> [10556.102654] nvme 0524:28:00.0: enabling device (0000 -> 0002)
> [10556.103171] nvme nvme0: PCI recovery is ongoing so let it finish
> [10556.142532] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'need reset'
> [10556.142535] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
> [...]
> [...]
> [10556.148172] EEH: Reset without hotplug activity
> [10558.298672] EEH: Beginning: 'slot_reset'
> [10558.298692] PCI 0524:28:00.0#280000: EEH: Invoking nvme->slot_reset()
> [10558.298696] nvme nvme0: restart after slot reset
> [10558.301925] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'recovered'
> [10558.301928] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
> [10558.301939] EEH: Notify device driver to resume
> [10558.301944] EEH: Beginning: 'resume'
> [10558.301947] PCI 0524:28:00.0#280000: EEH: Invoking nvme->resume()
> [10558.331051] nvme nvme0: Shutdown timeout set to 10 seconds
> [10558.356679] nvme nvme0: 16/0/0 default/read/poll queues
> [10558.357026] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'none'
> [10558.357028] EEH: Finished:'resume'
> [10558.357035] EEH: Recovery successful.
>
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
>                iopolicy=numa
> \
>  +- nvme0 pcie 0524:28:00.0 live
>
> Thanks,
> --Nilay