From: Alex Williamson <alex.williamson@redhat.com>
To: Heyi Guo <guoheyi@huawei.com>
Cc: "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
Peter Maydell <peter.maydell@linaro.org>,
"Wanghaibin (D)" <wanghaibin.wang@huawei.com>
Subject: Re: [Qemu-devel] Questions about VFIO enabling MSIX vector
Date: Thu, 17 Jan 2019 08:21:02 -0700
Message-ID: <20190117082102.236fd268@x1.home>
In-Reply-To: <a317f715-e146-d63e-5da8-0f6beb85d93a@huawei.com>
On Thu, 17 Jan 2019 20:55:05 +0800
Heyi Guo <guoheyi@huawei.com> wrote:
> On 2019/1/16 0:18, Alex Williamson wrote:
> > On Tue, 15 Jan 2019 11:21:51 +0800
> > Heyi Guo <guoheyi@huawei.com> wrote:
> >
> >> Hi Alex,
> >>
> >> Really appreciate your comments. I have some more questions below.
> >>
> >>
> >> On 2019/1/15 0:07, Alex Williamson wrote:
> >>> On Sat, 12 Jan 2019 10:30:40 +0800
> >>> Heyi Guo <guoheyi@huawei.com> wrote:
> >>>
> >>>> Hi folks,
> >>>>
> >>>> I have some questions about vfio_msix_vector_do_use() in
> >>>> hw/vfio/pci.c, could you help to explain?
> >>>>
> >>>> We can see that when the guest tries to enable one specific MSI-X
> >>>> vector by unmasking its MSI-X Vector Control field, the access is
> >>>> trapped and handled in vfio_msix_vector_do_use(). There we may take
> >>>> the branch below at line 525:
> >>>>
> >>>>
> >>>> 520     /*
> >>>> 521      * We don't want to have the host allocate all possible MSI vectors
> >>>> 522      * for a device if they're not in use, so we shutdown and incrementally
> >>>> 523      * increase them as needed.
> >>>> 524      */
> >>>> 525     if (vdev->nr_vectors < nr + 1) {
> >>>> 526         vfio_disable_irqindex(&vdev->vbasedev, VFIO_PCI_MSIX_IRQ_INDEX);
> >>>> 527         vdev->nr_vectors = nr + 1;
> >>>> 528         ret = vfio_enable_vectors(vdev, true);
> >>>> 529         if (ret) {
> >>>> 530             error_report("vfio: failed to enable vectors, %d", ret);
> >>>> 531         }
> >>>>
> >>>> Here all MSI-X vectors are first disabled and then re-enabled, with
> >>>> one more vector than before. The comment is there, but I still don't
> >>>> quite understand it. It makes sense not to allocate all possible MSI
> >>>> vectors, but why do we shut down the whole MSI-X index when asked to
> >>>> enable one specific vector? Can't we just enable the user-specified
> >>>> vector indexed by "nr"?
> >>> What internal kernel API would vfio-pci make use of to achieve that?
> >>> We can't know the intentions of the guest and we have a limited set of
> >>> tools to manage the MSI-X state in the host kernel. It's generally the
> >>> case that MSI-X vectors are only manipulated during driver setup and
> >>> shutdown, so while the current behavior isn't necessarily efficient, it
> >>> works within the constraints and generally doesn't have a runtime impact
> >>> on performance. We also recognize that this behavior may change in the
> >>> future depending on more dynamic internal host kernel interfaces, thus
> >>> we export the VFIO_IRQ_INFO_NORESIZE flag to indicate to userspace
> >>> whether this procedure is required.
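A minimal sketch of how userspace can check that flag, assuming the
standard linux/vfio.h UAPI and an already-open VFIO device fd (the
helper name is hypothetical and error handling is elided):

    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Returns 1 if growing the MSI-X vector count requires tearing the
     * whole index down and re-enabling it (NORESIZE), 0 if it can be
     * resized in place, or -1 if the ioctl fails. */
    static int msix_noresize(int device_fd)
    {
        struct vfio_irq_info info = {
            .argsz = sizeof(info),
            .index = VFIO_PCI_MSIX_IRQ_INDEX,
        };

        if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &info))
            return -1;

        /* info.count reports how many MSI-X vectors the device exposes. */
        return !!(info.flags & VFIO_IRQ_INFO_NORESIZE);
    }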
> >> I tried to enable only the specified MSI-X vector, and finally
> >> understood the original QEMU code after repeatedly getting EINVAL
> >> from the VFIO_DEVICE_SET_IRQS ioctl. Then I tried allocating all
> >> MSI-X vectors in vfio_msix_enable(), not only MSI-X vector 0. The
> >> change works, but it clearly doesn't follow the comment in the code:
> >> "We don't want to have the host allocate all possible MSI vectors
> >> for a device if they're not in use". May I ask what the side effect
> >> would be if we really allocated all possible vectors at the
> >> beginning?
> > Host IRQ vector exhaustion. Very few devices seem to make use of
> > their full complement of available vectors, and some devices support
> > a very, very large number of vectors. The current approach only
> > consumes the host resources that the guest is actually making use of.
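For context, the kind of full-range call that does succeed looks
roughly like the following sketch, again against the linux/vfio.h
UAPI, with an open device fd and one eventfd per vector assumed to
already exist (helper name hypothetical, error handling elided). With
NORESIZE set, the kernel expects a single VFIO_DEVICE_SET_IRQS call
covering vectors [0, nvec) rather than piecemeal additions, which is
where the EINVAL above comes from:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static int msix_set_vectors(int device_fd, const int32_t *eventfds,
                                uint32_t nvec)
    {
        size_t sz = sizeof(struct vfio_irq_set) + nvec * sizeof(int32_t);
        struct vfio_irq_set *set = malloc(sz);
        int ret;

        set->argsz = sz;
        set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
                     VFIO_IRQ_SET_ACTION_TRIGGER;
        set->index = VFIO_PCI_MSIX_IRQ_INDEX;
        set->start = 0;      /* the enable covers the range from vector 0 */
        set->count = nvec;
        /* An eventfd of -1 leaves that vector unassigned on the host. */
        memcpy(set->data, eventfds, nvec * sizeof(int32_t));

        ret = ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set);
        free(set);
        return ret;
    }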
> >
> >>>> What's more, on ARM64 systems with a GIC ITS, the kernel issues an
> >>>> ITS DISCARD command when disabling an MSI vector, which drops any
> >>>> currently pending MSI interrupt. If the device driver in the guest
> >>>> enables some MSIs first, so that interrupts may arrive at any time,
> >>>> and then tries to enable other MSIs, is it possible for the above
> >>>> code to cause missing interrupts?
> >>> Interrupt reconfiguration generally happens during driver setup or
> >>> teardown, when the device is considered idle or lost interrupts are
> >>> not a concern. If you have a device which manipulates interrupt
> >>> configuration at runtime, you can expect that lost interrupts won't
> >>> be the only problem.
> >> On a physical machine, if the driver enables vector 0 first and then
> >> vector 1 later, is there any chance that a vector 0 interrupt will
> >> be lost? In my opinion the manipulation of vector 1 should not
> >> affect interrupts on vector 0, should it? But if this is possible in
> >> the virtual world, what do we consider it to be? A bug to be fixed
> >> in the future, or a known limitation of virtualization that won't be
> >> fixed and that all driver developers should pay attention to?
> > Are you experiencing an actual problem with lost interrupts or is this
> > a theoretical question? There is a difference in programming of MSI-X
> > for an assigned device because we cannot know the intentions of the
> > guest with respect to how many vectors they intend to use. We can only
> > know as individual vectors are unmasked that a vector is now used.
> > Allocating the maximum complement of interrupts supported on the device
> > is wasteful of host resources. Given this, combined with the fact that
> > drivers generally configure vectors only at device setup time, the
> > existing code generally works well. Ideally drivers should not be
> > required to modify their behavior to support device assignment, but
> > certain actions, like high-frequency masking/unmasking of interrupts
> > through the vector table, impose unavoidable virtualization overhead
> > and should be avoided. I believe older Linux guests (RHEL5 time frame)
> > had such behavior.
>
> Yes, we are experiencing occasional missing interrupts with a Huawei
> Hi1822 NIC VF used in the guest. At first I suspected the workflow in
> QEMU caused this issue, but after more investigation over the past two
> days, we found it is likely only the trigger. The root cause may be in
> the kernel or in the ARM GICR on the chip: the kernel always maps a
> vPE to the first CPU (normally CPU 0) and then moves the mapping to
> the scheduled CPU. There is a time window between these two actions,
> and interrupts triggered within that short window may get lost.
> However, we have not come to a final conclusion yet.
>
> >
> > It's possible that current IRQ domains alleviate some of the concern
> > for host IRQ exhaustion and that vfio-pci could be made to allocate
> > the maximum supported vectors per device, removing the use of the
> > NORESIZE flag for the index. Userspace would need to continue to
> > support both modes of operation and the kernel would need to
> > continue to be compatible with that. Thanks,
>
> How about we set a threshold (maybe 16 or 32), initially allocate
> min(maximum supported vectors, threshold) vectors, and fall back to
> the original workflow only if the requested vector index exceeds the
> initially allocated count?
I would not support such a solution; it's not only still wasteful of
interrupt vectors, but it also introduces an inflection point in the
behavior, meaning that "high-end" configurations exercise code paths
that "low-end" configurations do not. Besides, it seems you haven't yet
even proven that this is an issue. If we wanted to remove the NORESIZE
flag for this vector index, I think the host would need to allocate all
vectors supported by the MSI-X capability in advance, which means we
need to verify that vector exhaustion is no longer a concern on the
host. Thanks,
Alex
Thread overview:
2019-01-12 2:30 [Qemu-devel] Questions about VFIO enabling MSIX vector Heyi Guo
2019-01-14 11:26 ` Auger Eric
2019-01-14 16:07 ` Alex Williamson
2019-01-15 3:21 ` Heyi Guo
2019-01-15 16:18 ` Alex Williamson
2019-01-17 12:55 ` Heyi Guo
2019-01-17 15:21 ` Alex Williamson [this message]
2019-01-18 0:26 ` Heyi Guo