From mboxrd@z Thu Jan 1 00:00:00 1970 From: okaya@codeaurora.org (Sinan Kaya) Date: Thu, 16 Nov 2017 15:52:47 -0500 Subject: [PATCH v2 4/4] PCI/AER: Dont do recovery when DPC is enabled In-Reply-To: <20171116201745.GB7266@bhelgaas-glaptop.roam.corp.google.com> References: <1510721808-27164-1-git-send-email-poza@codeaurora.org> <1510721808-27164-5-git-send-email-poza@codeaurora.org> <20171115211447.GA7266@bhelgaas-glaptop.roam.corp.google.com> <16e82fb4-dda8-3790-e831-675f9a6fd388@codeaurora.org> <20171116201745.GB7266@bhelgaas-glaptop.roam.corp.google.com> Message-ID: <16b9ac67-042d-1010-0c8c-205b2fa40a7f@codeaurora.org> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org >> >> Whether the AER driver reads ~0 or not really depends on timing. The >> link may come up from the DPC driver by the time AER driver reaches >> here as an example. >> >> Bad things do happen. We have seen this with e1000e driver. > > I don't doubt that bad things happen. I'm just trying to understand > exactly *what* bad things happen and how, so we can fix them cleanly. > This was random crashes in the e1000e drivers accompanied with stack traces coming from WARN and msi allocation routines. > I don't know exactly what you mean by "DPC stops the drivers > immediately". Since the DPC hardware disables the Link, I *think* > you probably mean that driver accesses to the device start failing > (whether the driver notices this is a whole different question). Sorry for not being clear. I was talking about the pci_stop_and_remove_bus_device() call in DPC's interrupt_event_handler() function as the "stop command". > > When the DPC hardware disables the Link, it causes a hot reset for > downstream components. The DPC interrupt_event_handler() doesn't do > much except remove the device (which detaches the driver) and clear > the DPC Trigger Status bit (which allows hardware to try to retrain > the Link). > That's true. > So the "stop" and "recover" commands you mention must be related to > AER. I was talking about pci_stop_and_remove_bus_device() vs. error_detected() as "stop" and "recover" > I guess these would be some of the driver callbacks > (error_detected(), mmio_enabled(), slot_reset(), reset_prepare(), > reset_done(), resume())? > > In any case, I agree that it probably doesn't make sense to call any > of these callbacks if the DPC driver has already detached the driver > and re-attached it. The device state is gone because of the hot reset > and the driver state is gone because of the detach/re-attach. > > However, I'm not so sure about the period *before* the DPC driver > detaches the driver. The description of error_detected() says it > cannot assume the device is accessible, so I think there might be an > argument that AER *should* call this for DPC events so the driver has > a chance to clean up before being unceremoniously detached. Makes sense. Let us work on this. > > I suspect this all probably requires tighter integration between DPC > and AER, and I'm totally fine with that. I think the current > separation as separate "drivers" is pretty artificial anyway. Got it. We will try to plumb DPC error handling into AER driver's error handling mechanism. Another question: What do you think about the rescan following link up? The only entity that does rescan today is hotplug after DPC recovery. There could be a platform with DPC support but no hotplug support. How should we handle it? We can call rescan from DPC all the time. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.