From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:42218)
	by lists.gnu.org with esmtp (Exim 4.71) id 1Y0XUA-0006ul-55
	for qemu-devel@nongnu.org; Mon, 15 Dec 2014 10:23:07 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	id 1Y0XU1-00070z-2X
	for qemu-devel@nongnu.org; Mon, 15 Dec 2014 10:22:58 -0500
Received: from mx-v6.kamp.de ([2a02:248:0:51::16]:48897 helo=mx01.kamp.de)
	by eggs.gnu.org with esmtp (Exim 4.71) id 1Y0XU0-00070Q-Ms
	for qemu-devel@nongnu.org; Mon, 15 Dec 2014 10:22:48 -0500
Message-ID: <548EFCC4.8040104@kamp.de>
Date: Mon, 15 Dec 2014 16:22:44 +0100
From: Peter Lieven
MIME-Version: 1.0
References: <548B605B.6030002@kamp.de> <1418422905.1095.189.camel@bling.home>
	<548CA4FE.4090004@kamp.de> <1418656297.1095.252.camel@bling.home>
In-Reply-To: <1418656297.1095.252.camel@bling.home>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] vfio-pci issues with multiple devices on the same root port
To: Alex Williamson
Cc: "qemu-devel@nongnu.org"

On 15.12.2014 16:11, Alex Williamson wrote:
> On Sat, 2014-12-13 at 21:43 +0100, Peter Lieven wrote:
>> On 12.12.2014 at 23:21, Alex Williamson wrote:
>>> On Fri, 2014-12-12 at 22:38 +0100, Peter Lieven wrote:
>>>> Hi,
>>>>
>>>> we have a Cisco UCS infrastructure where we have fnic Fibre-Channel Adapters that we expose to guests. The UCS
>>>> infrastructure allows creating virtual HBAs that can be exposed to a host, so it's possible to have quite a lot of them.
>>>>
>>>> We ran into a strange issue when we started having more than one vServer with a Fibre-Channel Adapter passed
>>>> through with vfio-pci.
>>>>
>>>> When the hypervisor shuts one of them down, the kernel sees the following error:
>>>>
>>>> pcieport 0000:00:07.0: AER: Uncorrected (Non-Fatal) error received: id=0038
>>>> pcieport 0000:00:07.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0038(Receiver ID)
>>>> pcieport 0000:00:07.0: device [8086:340e] error status/mask=00200000/00100000
>>>> pcieport 0000:00:07.0: [21] Unknown Error Bit (First)
>>>> pcieport 0000:00:07.0: broadcast error_detected message
>>>> pcieport 0000:00:07.0: AER: Device recovery failed
>>>>
>>>> Bit 21 seems to be ACS Violation, and 0000:00:07.0 is the PCIe Root Port on that system.
>>>>
>>>> This wouldn't be a big problem, although I would like to find out what causes the ACS Violation.
>>>>
>>>> The real problem is that all other vfio-pci cards on that root port get notified of this error and the connected vServers are suspended
>>>> with RUN_STATE_INTERNAL_ERROR.
>>>>
>>>> Any ideas to work around this other than hacking qemu to not register an error handler or modifying vfio_err_notifier_handler
>>>> to not suspend the vServer?
>>> You could set bit 21 in the AER uncorrected error mask register to avoid
>>> the root port signaling the error. Is bit 21 already clear in the
>>> severity register to make this non-fatal?
>>>
>>>> Is it correct that all children of a root port are notified? Should qemu distinguish between fatal and non-fatal errors when
>>>> suspending a vServer?
>>> Yes, each child is notified. QEMU only gets an eventfd signal, which is
>>> supposed to occur only for fatal errors. I don't quite understand why
>>> this apparently non-fatal error is getting through. The kernel-side
>>> VFIO code is where filtering of fatal vs non-fatal should occur.
>> Had a look at vfio-pci.c from master. I can't see where there is any filtering of fatal vs.
>> non-fatal
> I'm under the impression that fatal vs non-fatal would be determined
> somewhere in the PCI layers and the driver would only be notified for
> uncorrected/fatal. Are we missing that filtering? Thanks,

As far as I understand, vfio_pci_aer_err_detected in drivers/vfio/pci/vfio_pci.c is called to recover from
potentially recoverable errors, and the driver indicates through its return code whether the error was recoverable.

Peter
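
For reference, the status/mask pair from the log quoted above can be decoded by hand. The following is only a minimal sketch: the bit names are the standard PCIe AER uncorrectable error bits (the table is deliberately incomplete), and the helper names decode and parse_status_mask are made up for this illustration. It does not touch any hardware; it only computes what the mask value would be after setting bit 21 as suggested in the thread. In the AER extended capability, the uncorrectable status, mask and severity registers sit at offsets 0x04, 0x08 and 0x0C.

```python
# Sketch only: decode the AER uncorrectable status/mask values from the log
# and compute the mask with bit 21 (ACS Violation) set. Nothing is written
# to the device; the helper names here are hypothetical.

# Subset of the PCIe AER Uncorrectable Error Status/Mask/Severity bits.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode(value):
    """Return the names of all bits set in a 32-bit AER register value."""
    return [
        AER_UNCOR_BITS.get(bit, f"bit {bit} (not in table)")
        for bit in range(32)
        if value & (1 << bit)
    ]


def parse_status_mask(line):
    """Extract (status, mask) as integers from a 'status/mask=XXXX/YYYY' log line."""
    field = line.rsplit("status/mask=", 1)[1].split()[0]
    status_hex, mask_hex = field.split("/")
    return int(status_hex, 16), int(mask_hex, 16)


LOG_LINE = "pcieport 0000:00:07.0: device [8086:340e] error status/mask=00200000/00100000"

status, mask = parse_status_mask(LOG_LINE)
new_mask = mask | (1 << 21)  # mask value after additionally masking ACS Violation

print("status:", decode(status))
print("mask:  ", decode(mask))
print("unmasked errors now:", decode(status & ~mask))
print("unmasked errors with bit 21 masked:", decode(status & ~new_mask))
print(f"new mask value: {new_mask:08x}")
```

With the values from the log, the reported error is bit 21 and the current mask only covers bit 20, so the error is signalled; with bit 21 added to the mask, nothing in this status value would remain unmasked.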