From: Auger Eric <eric.auger@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org
Subject: Re: [RFC PATCH 3/5] vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks
Date: Thu, 8 Feb 2018 21:52:31 +0100
Message-ID: <0dd10589-73b1-0ec8-e13e-b6d8c39bc2d2@redhat.com>
In-Reply-To: <20180208112436.35ac74a1@w520.home>

Hi Alex,

On 08/02/18 19:24, Alex Williamson wrote:
> On Thu, 8 Feb 2018 12:10:02 +0100
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Alex,
>>
>> On 07/02/18 01:26, Alex Williamson wrote:
>>> Record data writes that come through the NVIDIA BAR0 quirk, if we get
>>> enough in a row that we're only passing through, automatically enable
>>> an ioeventfd for it.  The primary target for this is the MSI-ACK
>>> that NVIDIA uses to allow the MSI interrupt to re-trigger, which is a
>>> 4-byte write, data value 0x0 to offset 0x704 into the quirk, 0x88704
>>> into BAR0 MMIO space.  For an interrupt latency sensitive micro-
>>> benchmark, this takes us from 83% of performance versus disabling the
>>> quirk entirely (which GeForce cannot do), to almost 90%.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>  hw/vfio/pci-quirks.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++-
>>>  hw/vfio/pci.h        |    2 +
>>>  2 files changed, 89 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
>>> index e4cf4ea2dd9c..e739efe601b1 100644
>>> --- a/hw/vfio/pci-quirks.c
>>> +++ b/hw/vfio/pci-quirks.c
>>> @@ -203,6 +203,7 @@ typedef struct VFIOConfigMirrorQuirk {
>>>      uint32_t offset;
>>>      uint8_t bar;
>>>      MemoryRegion *mem;
>>> +    uint8_t data[];  
>> Do you foresee other uses of data besides LastDataSet?
> 
> Sure, LastDataSet is just a very crude tracking mechanism; I could
> imagine some kind of device specific LRU array.  Any quirk wanting to
> add additional analytics on top of this generic quirk could make use of
> it.  I didn't want to pollute the generic quirk with LastDataSet since
> there are other users, but I also didn't want to complicate the shared
> freeing path we have by simply adding a pointer.  Thus embedding quirk-
> specific data is what I went with here.

OK
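
If I read the patch right, the embedding then looks roughly like this
(a sketch of my own, not copied from your code):

    /* one allocation covers the generic mirror quirk plus the trailing
     * quirk-specific data, so the shared freeing path stays untouched */
    mirror = g_malloc0(sizeof(*mirror) + sizeof(LastDataSet));
    ...
    /* and the NVIDIA write handler recovers it with a cast */
    LastDataSet *last = (LastDataSet *)&mirror->data;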
> 
>>>  } VFIOConfigMirrorQuirk;
>>>  
>>>  static uint64_t vfio_generic_quirk_mirror_read(void *opaque,
>>> @@ -297,6 +298,50 @@ static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
>>>      g_free(ioeventfd);
>>>  }
>>>    
>> Add a comment? Is this the user handler for the case where the KVM
>> ioeventfd setup failed?
> 
> As you correctly updated later, we're only setting up the kvm-user
> ioeventfd handler in this patch.  This is the handler when that
> succeeds.  Once we add vfio ioeventfd support, this handler will
> hopefully never get used, but we still need to support kernels that
> won't have vfio ioeventfd support and this alone does still provide a
> performance improvement.
> 
>>> +static void vfio_ioeventfd_handler(void *opaque)
>>> +{
>>> +    VFIOIOEventFD *ioeventfd = opaque;
>>> +
>>> +    if (event_notifier_test_and_clear(&ioeventfd->e)) {
>>> +        vfio_region_write(ioeventfd->region, ioeventfd->region_addr,
>>> +                          ioeventfd->data, ioeventfd->size);
>>> +    }
>>> +}
>>> +
>>> +static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
>>> +                                          MemoryRegion *mr, hwaddr addr,
>>> +                                          unsigned size, uint64_t data,
>>> +                                          VFIORegion *region,
>>> +                                          hwaddr region_addr)
>>> +{
>>> +    VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
>>> +
>>> +    if (event_notifier_init(&ioeventfd->e, 0)) {
>>> +        g_free(ioeventfd);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    ioeventfd->mr = mr;
>>> +    ioeventfd->addr = addr;
>>> +    ioeventfd->size = size;
>>> +    ioeventfd->match_data = true;
>>> +    ioeventfd->data = data;
>>> +    ioeventfd->region = region;
>>> +    ioeventfd->region_addr = region_addr;  
>> I found it difficult to follow the different addr semantics.
>> I understand region_addr is the offset relative to the BAR and addr is
>> the offset relative to the mirror region. Maybe more explicit names
>> would help (region = bar_region and region_addr = bar_offset).
> 
> I was trying to avoid PCI specific fields for VFIOIOEventFD, but
> basically we're dealing with two different sets of base and offset.
> The first is the MemoryRegion of the quirk and the offset relative to
> that MemoryRegion.  Those are used by memory_region_add_eventfd() for
> establishing the kvm ioeventfd.  The second is the region and
> region_addr, which identify the vfio region and offset relative to the
> region so that we can actually handle it with vfio_region_write().
> 
> I agree though that these are confusing and maybe comments are a
> solution as I can't think of better names:
> 
>     /*
>      * MemoryRegion and relative offset, plus additional ioeventfd setup
>      * parameters for configuring and later tearing down KVM ioeventfd.
>      */
>     ioeventfd->mr = mr;
>     ioeventfd->addr = addr;
>     ioeventfd->size = size;
>     ioeventfd->match_data = true;
>     ioeventfd->data = data;
>     /*
>      * VFIORegion and relative offset for implementing the userspace
>      * handler.  data & size fields shared for both uses.
>      */
>     ioeventfd->region = region;
>     ioeventfd->region_addr = region_addr;
> 
> 
>>> +
>>> +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
>>> +                        vfio_ioeventfd_handler, NULL, ioeventfd);
>>> +    memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
>>> +                              ioeventfd->size, ioeventfd->match_data,
>>> +                              ioeventfd->data, &ioeventfd->e);
>>> +
>>> +    info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
>>> +                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
>>> +                vdev->vbasedev.name, region->nr, region_addr, data, size);
>>> +
>>> +    return ioeventfd;
>>> +}
>>> +
>>>  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>>>  {
>>>      VFIOQuirk *quirk;
>>> @@ -732,6 +777,13 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
>>>      trace_vfio_quirk_nvidia_bar5_probe(vdev->vbasedev.name);
>>>  }
>>>  
>>> +typedef struct LastDataSet {
>>> +    hwaddr addr;
>>> +    uint64_t data;
>>> +    unsigned size;
>>> +    int count;
>>> +} LastDataSet;
>>> +
>>>  /*
>>>   * Finally, BAR0 itself.  We want to redirect any accesses to either
>>>   * 0x1800 or 0x88000 through the PCI config space access functions.
>>> @@ -742,6 +794,7 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
>>>      VFIOConfigMirrorQuirk *mirror = opaque;
>>>      VFIOPCIDevice *vdev = mirror->vdev;
>>>      PCIDevice *pdev = &vdev->pdev;
>>> +    LastDataSet *last = (LastDataSet *)&mirror->data;
>>>  
>>>      vfio_generic_quirk_mirror_write(opaque, addr, data, size);
>>>  
>>> @@ -756,6 +809,38 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
>>>                            addr + mirror->offset, data, size);
>>>          trace_vfio_quirk_nvidia_bar0_msi_ack(vdev->vbasedev.name);
>>>      }
>>> +
>>> +    /*
>>> +     * Automatically add an ioeventfd to handle any repeated write with the
>>> +     * same data and size above the standard PCI config space header.  This is
>>> +     * primarily expected to accelerate the MSI-ACK behavior, such as noted
>>> +     * above.  Current hardware/drivers should trigger an ioeventfd at config
>>> +     * offset 0x704 (region offset 0x88704), with data 0x0, size 4.
>>> +     */
>>> +    if (addr > PCI_STD_HEADER_SIZEOF) {
>>> +        if (addr != last->addr || data != last->data || size != last->size) {
>>> +            last->addr = addr;
>>> +            last->data = data;
>>> +            last->size = size;
>>> +            last->count = 1;
>>> +        } else if (++last->count > 10) {  
>> So here is a naive question about the "10" choice, and also about the
>> fact that these need to be consecutive accesses.  I guess you observed
>> that this works, but at first sight it is not obvious to me.
> 
> Initially when I started working on this I used the previously known
> MSI-ACK behavior, which was a write to the MSI capability ID register,
> but I quickly figured out it didn't do anything because my current guest
> setup uses the 0x704 offset noted in the comment above.  So I continued
> developing with 0x704 hard coded.  But I have no idea if everything
> uses 0x704 now or some drivers might still use the capability ID, or
> what NVIDIA might switch to next.  So I wanted to make this dynamic and
> I figured: shouldn't anything that gets repeated calls through here,
> with the same data that we're not modifying on the way to hardware,
> benefit from this acceleration?  Then I started thinking about what
> frequency of write should trigger this switch-over, but the time
> component of that is hard to deal with, so I found that I could just
> drop it.  The MSI-ACK behavior means that we see a pretty continuous
> stream of writes to the same offset, same data, so that triggers right
> away.  Interestingly, after the MSI-ACK triggers the ioeventfd, we even
> add another for a relatively low frequency write.  I wouldn't have
> bothered with this one otherwise, but I don't see the harm in leaving
> it: since we're set up for a datamatch, we'll only trigger the
> ioeventfd if the data remains the same.  So basically, to answer your
> question, it's 10
> because it seemed like a reasonable watermark and it works.  If we had
> to, we could use some sort of LRU array, but this is simple.

OK
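
For what it's worth, if a single LastDataSet ever proves too crude, I
imagine the LRU array you mention would be along these lines (purely
hypothetical, all names below are invented):

    /* hypothetical: track the N most recent (addr, data, size) tuples
     * instead of only the last one, so interleaved writers can still
     * reach the promotion threshold */
    #define DATA_SET_SLOTS 4

    typedef struct DataSetLRU {
        LastDataSet slots[DATA_SET_SLOTS];  /* slots[0] = most recent */
    } DataSetLRU;

    static LastDataSet *data_set_lookup(DataSetLRU *lru, hwaddr addr,
                                        uint64_t data, unsigned size)
    {
        int i;

        for (i = 0; i < DATA_SET_SLOTS; i++) {
            if (lru->slots[i].addr == addr &&
                lru->slots[i].data == data &&
                lru->slots[i].size == size) {
                LastDataSet hit = lru->slots[i];

                /* move the hit to the front, shifting newer entries down */
                memmove(&lru->slots[1], &lru->slots[0], i * sizeof(hit));
                lru->slots[0] = hit;
                return &lru->slots[0];
            }
        }

        /* miss: evict the oldest slot and start counting afresh */
        memmove(&lru->slots[1], &lru->slots[0],
                (DATA_SET_SLOTS - 1) * sizeof(lru->slots[0]));
        lru->slots[0] = (LastDataSet){
            .addr = addr, .data = data, .size = size, .count = 0,
        };
        return &lru->slots[0];
    }

The caller would then keep the same ++last->count > 10 check as today.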
> 
>> Does anyone check for potential overlaps between the ioeventfds'
>> [addr, addr + size - 1] ranges?
> 
> Do we need to?  KVM will only trigger the ioeventfd if we match the
> address, data, and size, so AIUI we could have multiple ioeventfds
> covering the same area with different data or size and there's nothing
> wrong with that.  I wondered whether we need some sort of pruning
> algorithm, but once we've enabled kvm-vfio acceleration, QEMU no longer
> has any idea which are being used.  Maybe we could arbitrarily decide
> some fixed number limit for a device, but currently I think the limit
> is based on the user's open file limit.  Thanks,

OK. I was just thinking about a device potentially writing the same area
with different width/offset patterns. But as you said, the only drawback
is registering several KVM ioeventfds, and their corresponding VFIO
kernel listeners, for the same area.
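
E.g., if I follow the datamatch semantics, two registrations on the very
same window would coexist happily; KVM fires whichever notifier matches
the written value and exits to QEMU for anything else (a sketch reusing
the API call from your patch, with made-up values and notifier names):

    /* two ioeventfds on the same 4-byte window, distinguished purely by
     * datamatch: a guest write of 0x0 fires ack_notifier, 0x1 fires
     * other_notifier, and any other value takes the normal MMIO exit */
    memory_region_add_eventfd(mr, 0x704, 4, true, 0x0, &ack_notifier);
    memory_region_add_eventfd(mr, 0x704, 4, true, 0x1, &other_notifier);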

Thanks

Eric
> 
> Alex
> 
