From: Alex Williamson <alex.williamson@redhat.com>
To: Peter Lieven <pl@kamp.de>
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] vfio-pci issues with multiple devices on the same root port
Date: Mon, 15 Dec 2014 08:11:37 -0700
Message-ID: <1418656297.1095.252.camel@bling.home>
In-Reply-To: <548CA4FE.4090004@kamp.de>
On Sat, 2014-12-13 at 21:43 +0100, Peter Lieven wrote:
> On 12.12.2014 at 23:21, Alex Williamson wrote:
> > On Fri, 2014-12-12 at 22:38 +0100, Peter Lieven wrote:
> >> Hi,
> >>
> >> we have a Cisco UCS infrastructure where we have fnic Fibre Channel adapters that we expose to guests. The UCS
> >> infrastructure allows us to create virtual HBAs that can be exposed to a host, so it's possible to have quite a lot of them.
> >>
> >> We ran into a strange issue when we started having more than one vServer with a Fibre Channel adapter passed
> >> through with vfio-pci.
> >>
> >> When a hypervisor shuts it down, the kernel sees the following error:
> >>
> >> pcieport 0000:00:07.0: AER: Uncorrected (Non-Fatal) error received: id=0038
> >> pcieport 0000:00:07.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0038(Receiver ID)
> >> pcieport 0000:00:07.0: device [8086:340e] error status/mask=00200000/00100000
> >> pcieport 0000:00:07.0: [21] Unknown Error Bit (First)
> >> pcieport 0000:00:07.0: broadcast error_detected message
> >> pcieport 0000:00:07.0: AER: Device recovery failed
> >>
> >> Bit 21 seems to be ACS Violation. And 0000:00:07.0 is the PCIE Root Port on that System.
> >>
> >> This wouldn't be a big problem, although I would like to find out what causes the ACS Violation.
> >>
> >> The real problem is that all other vfio-pci cards on that root port get notified of this error and the connected vServers are suspended
> >> with RUN_STATE_INTERNAL_ERROR.
> >>
> >> Any ideas to work around this other than hacking qemu to not register an error handler or modifying vfio_err_notifier_handler
> >> to not suspend the vServer?
> > You could set bit 21 in the AER uncorrected error mask register to avoid
> > the root port signaling the error. Is bit 21 already clear in the
> > severity register to make this non-fatal?
> >
> >> Is it correct that all children of a root port are notified? Should qemu distinguish between fatal and non-fatal errors when
> >> suspending a vServer?
> > Yes, each child is notified. QEMU only gets an eventfd signal, which is
> > supposed to occur only for fatal errors. I don't quite understand why
> > this apparently non-fatal error is getting through. The kernel-side
> > VFIO code is where filtering of fatal vs non-fatal should occur.
>
> Had a look at vfio-pci.c from master. I can't see where there is any filtering of fatal vs. non-fatal.
I'm under the impression that fatal vs non-fatal would be determined
somewhere in the PCI layers and the driver would only be notified for
uncorrected/fatal. Are we missing that filtering? Thanks,
Alex
> static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>                                                   pci_channel_state_t state)
> {
>         struct vfio_pci_device *vdev;
>         struct vfio_device *device;
>
>         device = vfio_device_get_from_dev(&pdev->dev);
>         if (device == NULL)
>                 return PCI_ERS_RESULT_DISCONNECT;
>
>         vdev = vfio_device_data(device);
>         if (vdev == NULL) {
>                 vfio_device_put(device);
>                 return PCI_ERS_RESULT_DISCONNECT;
>         }
>
>         mutex_lock(&vdev->igate);
>
>         if (vdev->err_trigger)
>                 eventfd_signal(vdev->err_trigger, 1);
>
>         mutex_unlock(&vdev->igate);
>
>         vfio_device_put(device);
>
>         return PCI_ERS_RESULT_CAN_RECOVER;
> }
>
> static struct pci_error_handlers vfio_err_handlers = {
>         .error_detected = vfio_pci_aer_err_detected,
> };
>
> static struct pci_driver vfio_pci_driver = {
>         .name = "vfio-pci",
>         .id_table = NULL, /* only dynamic ids */
>         .probe = vfio_pci_probe,
>         .remove = vfio_pci_remove,
>         .err_handler = &vfio_err_handlers,
> };
>
> Peter
>
>