Date: Fri, 27 Feb 2026 20:28:17 +0800
Subject: Re: [PATCH v7 2/5] PCI/DPC: Run recovery on device that detected the error
To: Lukas Wunner
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, bhelgaas@google.com, kbusch@kernel.org, sathyanarayanan.kuppuswamy@linux.intel.com, mahesh@linux.ibm.com, oohall@gmail.com, Jonathan.Cameron@huawei.com, terry.bowman@amd.com, tianruidong@linux.alibaba.com
From: Shuai Xue

On 2/27/26 6:47 PM, Lukas Wunner wrote:
> On Fri, Feb 27, 2026 at 04:28:59PM +0800, Shuai Xue wrote:
>> On 2/7/26 3:48 PM, Shuai Xue wrote:
>>> Regarding pci_restore_state() in slot_reset(): I see now that it does
>>> call pci_aer_clear_status(dev) (at line 1844 in pci.c), which will
>>> clear the AER Status registers. So if we walk the hierarchy after
>>> the slot_reset callbacks, the error bits accumulated during DPC will
>>> already be cleared.
>>>
>>> To avoid losing those errors, I think the walk should happen after
>>> dpc_reset_link() succeeds but *before* pcie_do_recovery() invokes the
>>> slot_reset callbacks. That way, we can capture the AER Status bits
>>> before pci_restore_state() clears them.
>>>
>>> Does that sound like the right approach, or would you prefer a
>>> different placement?
>
> The problem is that if the hierarchy that was reset is deeper than
> one level, you first need to call pci_restore_state() on all the
> PCIe Upstream and Downstream Ports that were reset before you can
> access the Endpoints at the bottom of the hierarchy.
>
> E.g. if DPC occurs at a Root Port with multiple nested PCIe switches
> below, the Endpoints at the "leafs" of that tree are only accessible
> once Config Space has been restored at all the PCIe switches
> in-between the Endpoints and the DPC-capable Root Port.
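To make the ordering constraint concrete, here is a minimal userspace model (not kernel code), assuming a simplified rule: a function's Config Space is reachable only once every port above it has been restored, and the DPC-capable Root Port itself is not reset. All `model_*` names are hypothetical and are not kernel APIs; the `restored` flag merely stands in for "pci_restore_state() has run on this function".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_CHILDREN 4

/*
 * Hypothetical userspace model, not kernel code. Each node stands for one
 * PCIe function: the DPC-capable Root Port, a switch Upstream/Downstream
 * Port, or an Endpoint.
 */
struct model_dev {
	struct model_dev *parent;       /* NULL for the Root Port */
	struct model_dev *children[MODEL_MAX_CHILDREN];
	size_t nr_children;
	bool restored;                  /* stand-in for "state restored" */
	unsigned int aer_status;        /* error bits latched during DPC */
};

static void model_add_child(struct model_dev *parent, struct model_dev *child)
{
	assert(parent->nr_children < MODEL_MAX_CHILDREN);
	child->parent = parent;
	parent->children[parent->nr_children++] = child;
}

/* Assumed rule: Config Space is reachable only if every port above is restored. */
static bool model_reachable(const struct model_dev *dev)
{
	for (const struct model_dev *p = dev->parent; p != NULL; p = p->parent)
		if (!p->restored)
			return false;
	return true;
}

/* Returns false while the function's AER Status cannot be read yet. */
static bool model_read_aer_status(const struct model_dev *dev,
				  unsigned int *status)
{
	if (!model_reachable(dev))
		return false;
	*status = dev->aer_status;
	return true;
}

/* Pre-order walk: each port is restored before anything below it. */
static void model_restore_hierarchy(struct model_dev *dev)
{
	assert(model_reachable(dev));
	dev->restored = true;
	for (size_t i = 0; i < dev->nr_children; i++)
		model_restore_hierarchy(dev->children[i]);
}
```

In this model, reading an Endpoint's status right after the link reset fails while the switch ports above it are still unrestored; it succeeds only after the top-down walk, which in the current code is exactly the point at which pci_restore_state() would already have cleared the bits.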
>
> Hence your proposal unfortunately won't work.
>
> I think the solution is to move pci_aer_clear_status() out of
> pci_restore_state() into the callers that actually need it.
> But that requires going through every single caller.
> I've begun doing that last week and am about 60% done.
>
> Once pci_restore_state() no longer clears the error bits, we can
> report and clear them after the "report_slot_reset" stage (which
> is where drivers call pci_restore_state()).
>
> I've also changed my mind and I think reporting and clearing
> the error bits *could* happen in pcie_do_recovery() even if it
> were used for EEH and s390 because those platforms may plug in
> AER-capable devices as well and so we do need to clear the bits
> regardless of the error recovery mechanism used.
>
> Let me get back to you once I've gone through all the callers of
> pci_restore_state(). Please be patient.
>

Sure, glad to hear you have been working on that.

Thanks.
Shuai
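A minimal sketch of why moving pci_aer_clear_status() out of pci_restore_state() helps, assuming the slot_reset stage restores functions strictly top-down and that reporting happens only after that stage completes. The `model_*` names and the `clear_aer` switch are hypothetical: `clear_aer` toggles between the current behaviour, where restoring state also clears the AER Status bits, and the proposed one, where callers clear them explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model, not kernel code. Functions are listed top-down, the
 * order in which the slot_reset stage is assumed to restore them.
 */
struct model_fn {
	unsigned int aer_status;  /* error bits latched during DPC */
	bool restored;
};

/*
 * Stand-in for pci_restore_state(). clear_aer = true mimics the current
 * behaviour (status cleared inside restore); false mimics the proposal.
 */
static void model_restore_state(struct model_fn *fn, bool clear_aer)
{
	fn->restored = true;
	if (clear_aer)
		fn->aer_status = 0;
}

/* Slot-reset stage: restore every function, parents first. */
static void model_slot_reset(struct model_fn *fns, size_t n, bool clear_aer)
{
	for (size_t i = 0; i < n; i++)
		model_restore_state(&fns[i], clear_aer);
}

/*
 * After slot reset: collect (here simply OR together) the bits that are
 * still visible, then clear them explicitly.
 */
static unsigned int model_report_and_clear(struct model_fn *fns, size_t n)
{
	unsigned int seen = 0;

	for (size_t i = 0; i < n; i++) {
		assert(fns[i].restored);  /* Config Space must be reachable */
		seen |= fns[i].aer_status;
		fns[i].aer_status = 0;
	}
	return seen;
}
```

With `clear_aer` set, the report pass finds nothing because restoring state already wiped the bits; without it, the bits latched during DPC survive until the report pass records them and clears them itself.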