From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932934AbbI3SGY (ORCPT ); Wed, 30 Sep 2015 14:06:24 -0400 Received: from 8bytes.org ([81.169.241.247]:42887 "EHLO theia.8bytes.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932190AbbI3SGX (ORCPT ); Wed, 30 Sep 2015 14:06:23 -0400 Date: Wed, 30 Sep 2015 20:06:21 +0200 From: Joerg Roedel To: Jiang Liu Cc: Borislav Petkov , Daniel Vetter , Thomas Gleixner , Bjorn Helgaas , Alex Deucher , Alex Deucher , Christian =?iso-8859-1?Q?K=F6nig?= , Maling list - DRI developers , lkml Subject: Re: WARNING: CPU: 4 PID: 863 at include/drm/drm_crtc.h:1577 drm_helper_choose_encoder_dpms+0x88/0x90() - evildoer found and neutralized Message-ID: <20150930180621.GV3036@8bytes.org> References: <20150923085950.GA3440@pd.tnic> <20150923144450.GD3383@phenom.ffwll.local> <20150923160621.GA3446@pd.tnic> <20150923161839.GB3446@pd.tnic> <20150926164651.GA3640@pd.tnic> <560A50DC.1040505@linux.intel.com> <20150929105138.GA12037@nazgul.tnic> <560B9323.6000309@linux.intel.com> <20150930124432.GS3036@8bytes.org> <560C153C.10600@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <560C153C.10600@linux.intel.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Oct 01, 2015 at 01:00:44AM +0800, Jiang Liu wrote: > Thanks Joerg, that makes sense. If some driver tries to binding to the > IOMMU device, it will trigger the scenario as you described. For > example, Xen backend driver will try to probe all PCI devices > if enabled. I will do more investigation tomorrow. Not only that, the probe code looks like this in __pci_device_probe: error = -ENODEV; id = pci_match_device(drv, pci_dev); if (id) error = pci_call_probe(drv, pci_dev, id); if (error >= 0) error = 0; The pci_match_device() function will always return NULL for the iommu pci_dev, because no driver matches the ids of it. So the function returns -ENODEV, which will be handled in the caller (pci_device_probe): error = pcibios_alloc_irq(pci_dev); if (error < 0) return error; pci_dev_get(pci_dev); error = __pci_device_probe(drv, pci_dev); if (error) { pcibios_free_irq(pci_dev); pci_dev_put(pci_dev); } For the IOMMU pci_dev a pcibios-irq will be allocated (if there is one, like on Boris' system) and because __pci_device_probe returns -ENODEV it will be freed again with pcibios_free_irq(). The pcibios_free_irq() function will set dev->irq = 0, which overwrites the value that pci_enable_msi() wrote there. So later in suspend/resume code the msi-handling part tries to fetch the irq-descriptor for the wrong irq (which is NULL) and causes the crash. The issue got introduced because with your changes pci_enable_msi() is only allowed after a pci-device was successfully probed by the driver. But this assumption is not true, as the AMD IOMMU driver does not register as a pci-driver. Registering a pci-driver would actually be harmful, because a device can be forcibly unbound from its driver, which would be pretty bad for an IOMMU in the running system. So the right fix is to allow pci_enable_msi() for pci-devices not registered against a driver. The fix I sent Boris has issues (I think it leaks pcibios irqs when MSI is in use), but was thinking about fixing it in pci_device_probe by not allocating a pcibios-irq when MSI is already active. What do you think? Regards, Joerg