From: Alex Williamson <alex.williamson@redhat.com>
To: Peter Lieven <pl@kamp.de>
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] vfio-pci issues with multiple devices on the same root port
Date: Fri, 12 Dec 2014 15:21:45 -0700
Message-ID: <1418422905.1095.189.camel@bling.home>
In-Reply-To: <548B605B.6030002@kamp.de>
On Fri, 2014-12-12 at 22:38 +0100, Peter Lieven wrote:
> Hi,
>
> We have a Cisco UCS infrastructure with fnic Fibre Channel adapters that we expose to guests. The UCS
> infrastructure allows creating virtual HBAs that can be exposed to a host, so it's possible to have quite a lot of them.
>
> We ran into a strange issue once we had more than one vServer with a Fibre Channel adapter passed
> through with vfio-pci.
>
> When a hypervisor shuts down, the kernel sees the following error:
>
> pcieport 0000:00:07.0: AER: Uncorrected (Non-Fatal) error received: id=0038
> pcieport 0000:00:07.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0038(Receiver ID)
> pcieport 0000:00:07.0: device [8086:340e] error status/mask=00200000/00100000
> pcieport 0000:00:07.0: [21] Unknown Error Bit (First)
> pcieport 0000:00:07.0: broadcast error_detected message
> pcieport 0000:00:07.0: AER: Device recovery failed
>
> Bit 21 appears to be ACS Violation, and 0000:00:07.0 is the PCIe root port on that system.
>
> This by itself wouldn't be a big problem, although I would like to find out what causes the ACS violation.
>
> The real problem is that all other vfio-pci cards on that root port get notified of this error, and the connected vServers are suspended
> with RUN_STATE_INTERNAL_ERROR.
>
> Any ideas to work around this other than hacking qemu to not register an error handler or modifying vfio_err_notifier_handler
> to not suspend the vServer?
You could set bit 21 in the AER Uncorrectable Error Mask register to keep
the root port from signaling the error. Is bit 21 already clear in the
Uncorrectable Error Severity register, making this non-fatal?
> Is it correct that all children of a root port are notified? Should qemu distinguish between fatal and non-fatal errors when
> suspending a vServer?
Yes, each child is notified. QEMU only gets an eventfd signal, which is
supposed to occur only for fatal errors. I don't quite understand why
this apparently non-fatal error is getting through. The kernel-side
VFIO code is where filtering of fatal vs non-fatal should occur.
Thanks,
Alex