From: Alex Williamson <alex.williamson@redhat.com>
To: Matthew Ruffell <matthew.ruffell@canonical.com>
Cc: linux-pci@vger.kernel.org, lkml <linux-kernel@vger.kernel.org>,
	kvm@vger.kernel.org,
	nathan.langford@xcelesunifiedtechnologies.com
Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio
Date: Wed, 15 Sep 2021 10:32:35 -0600
Message-ID: <20210915103235.097202d2.alex.williamson@redhat.com>
In-Reply-To: <9e8d0e9e-1d94-35e8-be1f-cf66916c24b2@canonical.com>

On Wed, 15 Sep 2021 16:44:38 +1200
Matthew Ruffell <matthew.ruffell@canonical.com> wrote:
> On 15/09/21 4:43 am, Alex Williamson wrote:
> >
> > FWIW, I have access to a system with an NVIDIA K1 and M60, both use
> > this same switch on-card and I've not experienced any issues assigning
> > all the GPUs to a single VM. Topo:
> >
> > +-[0000:40]-+-02.0-[42-47]----00.0-[43-47]--+-08.0-[44]----00.0
> > |                                           +-09.0-[45]----00.0
> > |                                           +-10.0-[46]----00.0
> > |                                           \-11.0-[47]----00.0
> > \-[0000:00]-+-03.0-[04-07]----00.0-[05-07]--+-08.0-[06]----00.0
> >                                             \-10.0-[07]----00.0

I've actually found that the above configuration, with all 6 GPUs
assigned to a VM, reproduces this pretty readily by simply rebooting the
VM. In my case, I don't have the panic-on-warn/oops that must be set on
your kernel, so the result is far more benign: the IRQ gets masked until
it's re-registered.

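For readers unfamiliar with where "nobody cared" comes from, the masking described above can be modeled roughly as follows. This is a simplified sketch of the spurious-interrupt accounting in the kernel's generic IRQ code, not the actual implementation: the `IrqLine` class and its method are hypothetical names, the window and threshold values are assumptions based on mainline behavior, and the time-based resetting of the unhandled counter is omitted.

```python
# Simplified model (assumption): the kernel counts interrupts per line in
# windows of 100,000 and masks the line if nearly all went unhandled.
WINDOW = 100_000     # interrupts per evaluation window (assumed value)
THRESHOLD = 99_900   # unhandled count that triggers "nobody cared" (assumed)


class IrqLine:
    """Hypothetical stand-in for the per-IRQ descriptor counters."""

    def __init__(self) -> None:
        self.irq_count = 0
        self.irqs_unhandled = 0
        self.disabled = False

    def note_interrupt(self, handled: bool) -> None:
        """Account for one interrupt; mask the line if the window is bad."""
        if self.disabled:
            return
        if not handled:
            self.irqs_unhandled += 1
        self.irq_count += 1
        if self.irq_count < WINDOW:
            return
        # End of window: evaluate, then start a fresh window.
        self.irq_count = 0
        if self.irqs_unhandled > THRESHOLD:
            self.disabled = True  # "irq N: nobody cared", line is masked
        self.irqs_unhandled = 0


if __name__ == "__main__":
    line = IrqLine()
    for _ in range(WINDOW):
        line.note_interrupt(handled=False)  # every handler declines
    print("masked:", line.disabled)
```

In this model a storm in which no handler ever claims the interrupt masks the line after one full window, which matches the benign outcome described above when panic-on-warn is not set.
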
The fact that my upstream ports are using MSI seems irrelevant.

Adding debugging to the vfio-pci interrupt handler shows that it's
correctly deferring the interrupt, as the GPU device is not identifying
itself as the source of the interrupt via the status register. In fact,
setting the disable INTx bit in the GPU command register while the
interrupt storm occurs does not stop the interrupts.

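The two register bits mentioned above can also be inspected from userspace. The following is a minimal sketch, not the debugging actually used here: it decodes the INTx disable bit (command register, offset 0x04, bit 10) and the interrupt status bit (status register, offset 0x06, bit 3) from a device's configuration header. The helper names `intx_state` and `read_config` are hypothetical, and the device address in the usage note is only an assumption taken from the topology quoted earlier.

```python
import struct

# Standard PCI configuration header offsets and bit masks.
PCI_COMMAND = 0x04
PCI_COMMAND_INTX_DISABLE = 0x0400  # command register bit 10
PCI_STATUS_INTERRUPT = 0x0008      # status register bit 3 (at offset 0x06)


def intx_state(config: bytes) -> dict:
    """Decode INTx disable/pending bits from raw config-space bytes."""
    # Command (0x04) and status (0x06) are adjacent little-endian u16s.
    command, status = struct.unpack_from("<HH", config, PCI_COMMAND)
    return {
        "intx_disabled": bool(command & PCI_COMMAND_INTX_DISABLE),
        "intx_pending": bool(status & PCI_STATUS_INTERRUPT),
    }


def read_config(bdf: str) -> bytes:
    """Read the first 64 bytes of config space via sysfs (Linux only)."""
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as f:
        return f.read(64)
```

For example, `intx_state(read_config("0000:44:00.0"))` would report the state of one GPU function, assuming that address exists on the host. A device that shows `intx_pending` as false while the line keeps firing is consistent with the handler deferring the interrupt as described above.
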
The interrupt storm does seem to be related to the bus resets, but I
can't yet figure out how having multiple devices per switch factors into
the issue. Serializing all bus resets via a mutex doesn't seem to change
the behavior.

I'm still investigating, but if anyone knows how to get access to the
Broadcom datasheet or errata for this switch, please let me know.

Thanks,
Alex