From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:33544)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from <pl@kamp.de>)
	id 1XzXvG-0005o5-F6
	for qemu-devel@nongnu.org; Fri, 12 Dec 2014 16:38:59 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <pl@kamp.de>) id 1XzXv7-0003dr-DQ
	for qemu-devel@nongnu.org; Fri, 12 Dec 2014 16:38:50 -0500
Received: from mx-v6.kamp.de ([2a02:248:0:51::16]:45351 helo=mx01.kamp.de)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <pl@kamp.de>)
	id 1XzXv7-0003dY-2G
	for qemu-devel@nongnu.org; Fri, 12 Dec 2014 16:38:41 -0500
Message-ID: <548B605B.6030002@kamp.de>
Date: Fri, 12 Dec 2014 22:38:35 +0100
From: Peter Lieven <pl@kamp.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Subject: [Qemu-devel] vfio-pci issues with multiple devices on the same root
	port
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, alex.williamson@redhat.com

Hi,

we have a Cisco UCS infrastructure with fnic Fibre Channel adapters that we expose to guests. The UCS
infrastructure allows creating virtual HBAs that can be exposed to a host, so it's possible to have quite a lot of them.

We ran into a strange issue when we started having more than one vServer with a Fibre Channel adapter passed
through with vfio-pci.

When a hypervisor shuts it down, the kernel sees the following error:

 pcieport 0000:00:07.0: AER: Uncorrected (Non-Fatal) error received: id=0038
 pcieport 0000:00:07.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0038(Receiver ID)
 pcieport 0000:00:07.0:   device [8086:340e] error status/mask=00200000/00100000
 pcieport 0000:00:07.0:    [21] Unknown Error Bit (First)
 pcieport 0000:00:07.0: broadcast error_detected message
 pcieport 0000:00:07.0: AER: Device recovery failed

Bit 21 seems to be ACS Violation, and 0000:00:07.0 is the PCIe root port on that system.
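
For reference, a minimal sketch of how the values in the log above decode. The bit names follow the AER Uncorrectable
Error Status register layout from the PCIe specification as I understand it; the helper names (decode_uncor,
decode_requester_id) are made up for illustration and are not part of any kernel or qemu interface.

```python
# Bit positions in the AER Uncorrectable Error Status / Mask registers.
# Only a subset is listed; any other bit is reported as unknown.
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
    21: "ACS Violation",
}


def decode_uncor(value):
    """Return the names of all bits set in a 32-bit status or mask value."""
    return [
        UNCOR_BITS.get(bit, f"Unknown bit {bit}")
        for bit in range(32)
        if value & (1 << bit)
    ]


def decode_requester_id(rid):
    """Split a 16-bit requester ID into bus:device.function notation."""
    bus = (rid >> 8) & 0xFF
    dev = (rid >> 3) & 0x1F
    fn = rid & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"


# Values taken from the log: error status/mask=00200000/00100000, id=0038
print(decode_uncor(0x00200000))     # status: ['ACS Violation']
print(decode_uncor(0x00100000))     # mask:   ['Unsupported Request Error']
print(decode_requester_id(0x0038))  # 00:07.0
```

So the status value has only bit 21 set, the mask hides only bit 20, and id=0038 is the root port 00:07.0 itself.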

This wouldn't be a big problem, although I would like to find out what causes the ACS Violation.

The real problem is that all other vfio-pci cards on that root port get notified of this error and the connected vServers are suspended
with RUN_STATE_INTERNAL_ERROR.

Any ideas to work around this other than hacking qemu to not register an error handler or modifying vfio_err_notifier_handler
to not suspend the vServer?

Is it correct that all children of a root port are notified? Should qemu distinguish between fatal and non-fatal errors when
suspending a vServer?

Thanks,
Peter