All of lore.kernel.org
 help / color / mirror / Atom feed
From: Avi Kivity <avi@redhat.com>
To: "Yang, Sheng" <sheng.yang@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: Mask bit support's API
Date: Thu, 02 Dec 2010 15:09:43 +0200	[thread overview]
Message-ID: <4CF79A97.30407@redhat.com> (raw)
In-Reply-To: <201012011036.08926.sheng.yang@intel.com>

On 12/01/2010 04:36 AM, Yang, Sheng wrote:
> On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote:
> >  On 11/26/2010 04:35 AM, Yang, Sheng wrote:
> >  >  >   >   Shouldn't kvm also service reads from the pending bitmask?
> >  >  >
> >  >  >   Of course KVM should service reading from pending bitmask. For
> >  >  >   assigned device, it's kernel who would set the pending bit; but I am
> >  >  >   not sure for virtio. This interface is GET_ENTRY, so reading is fine
> >  >  >   with it.
> >
> >  The kernel should manage it in the same way.  Virtio raises irq (via
> >  KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit.
> >
> >  Note we need to be able to read and write the pending bitmask for live
> >  migration.
>
> Then seems we still need to an writing interface for it. And I think we can work
> on it later since it got no user now.

I'm not sure.  Suppose a guest starts using pending bits.  Does it mean 
it will not work with kernel msix acceleration?

If that is the case, then we need pending bit support now.  I don't want 
to knowingly merge something that doesn't conform to the spec, forcing 
users to choose whether they want to enable kernel msix acceleration or not.



> >
> >  Because we have an interface where you get an exit if (addr % 4)<  3 and
> >  don't get an exit if (addr % 4) == 3.  There is a gpa range which is
> >  partially maintained by the kernel and partially in userspace.  It's a
> >  confusing interface.  Things like 64-bit reads or writes need to be
> >  broken up and serviced in two different places.
> >
> >  We already need to support this (for unaligned writes which hit two
> >  regions), but let's at least make a contiguous region behave sanely.
>
> Oh, I didn't mean to handle this kind of unaligned writing in userspace. They're
> illegal according to the PCI spec(otherwise the result is undefined according to
> the spec). I would cover all contiguous writing(32-bit and 64-bit) in the next
> version, and discard all illegal writing. And 64-bit accessing would be broken up
> in qemu as you said, as they do currently.
>
> In fact I think we can handle all data for 64-bit to qemu, because it should still
> sync the mask bit with kernel, which make the maskbit writing in userspace useless
> and can be ignored.

What about reads?  Split those as well?

> >  The reason is to keep a sane interface.  Like we emulate instructions
> >  and msrs in the kernel and don't do half a job.  I don't think there's a
> >  real need to accelerate the first three words of an msi-x entry.
>
> Here is the case we've observed with Xen. It can only be reproduced by large scale
> testing. When the interrupt intensity is very high, even new kernels would try to
> make it lower, then it would access mask bit very frequently. And in the kernel,
> msi_set_mask_bit() is like this:
>
> static void msi_set_mask_bit(struct irq_data *data, u32 flag)
> {
>          struct msi_desc *desc = irq_data_get_msi(data);
>
>          if (desc->msi_attrib.is_msix) {
>                  msix_mask_irq(desc, flag);
>                  readl(desc->mask_base);         /* Flush write to device */
>          } else {
>                  unsigned offset = data->irq - desc->dev->irq;
>                  msi_mask_irq(desc, 1<<  offset, flag<<  offset);
>          }
> }
>
> That flush by readl() would complied with each MSI-X mask writing. That is the only
> place we want to accelerate, otherwise it would still fall back to qemu each time
> when guest want to mask the MSI-X entry.

So it seems we do want to accelerate reads to the entire entry.  This 
seems to give more weight to making the kernel own the entire entry.

> >
> >  >  And BTW, we can take routing table as a kind of *cache*, if the content
> >  >  is in the cache, then we can fetch it from the cache, otherwise we need
> >  >  to go back to fetch it from memory(userspace).
> >
> >  If it's guaranteed by the spec that addr/data pairs are always
> >  interpreted in the same way, sure.  But there no reason to do it,
> >  really, it isn't a fast path.
>
> I am not quite understand the first sentence... But it's a fast path, in the case I
> showed above.

Which case?  the readl() doesn't need access to the routing table, just 
the entry.

Oh, I think there is a terminology problem, I was talking about kvm's 
irq routing table, you were talking about the msix entries.

I think treating it as a cache causes more problems, since there are now 
two paths for reads (in cache and not in cache) and more things for 
writes to manage.

Here's my proposed API:

KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, 
pending_bitmap_base_gpa)

  - called when the guest enables msix

KVM_REMOVE_MSIX_TABLE(table_id)

   - called when the guest disables msix

KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)

   - called when the guest enables msix (to initialize it), or after 
live migration

KVM_GET_MSIX_ENTRY(table_id, entry_id, contents)

   - called when the guest updates an msix table (triggered by 
KVM_EXIT_MSIX_ENTRY_UPDATED)

KVM_RUN -> KVM_EXIT_MSIX_ENTRY_UPDATED

   - returned from KVM_RUN after the guest touches the first three words 
of an msix entry

KVM_SET_MSIX_ENTRY_IRQ(table_id, entry_id, msix_irq_desc)

msix_irq_desc could describe an assigned device irq, or gsi that is 
mapped to an msix interrupt in the kvm routing table.

Michael?  I think that should work for virtio and vfio assigned 
devices?  Not sure about pending bits.

-- 
error compiling committee.c: too many arguments to function


  reply	other threads:[~2010-12-02 13:09 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-11-23  6:09 Mask bit support's API Yang, Sheng
2010-11-23  6:17 ` Avi Kivity
2010-11-23  6:35   ` Yang, Sheng
2010-11-23  7:54     ` Avi Kivity
2010-11-23  8:30       ` Yang, Sheng
2010-11-23 12:47         ` Avi Kivity
2010-11-23 12:56           ` Michael S. Tsirkin
2010-11-23 13:57           ` Yang, Sheng
2010-11-23 14:06             ` Avi Kivity
2010-11-23 15:11               ` Michael S. Tsirkin
2010-11-23 15:24                 ` Gleb Natapov
2010-11-23 16:10                   ` Michael S. Tsirkin
2010-11-24  1:59               ` Yang, Sheng
2010-11-26  2:35                 ` Yang, Sheng
2010-11-30 14:15                   ` Avi Kivity
2010-12-01  2:36                     ` Yang, Sheng
2010-12-02 13:09                       ` Avi Kivity [this message]
2010-12-02 13:47                         ` Michael S. Tsirkin
2010-12-02 13:56                           ` Avi Kivity
2010-12-02 14:26                             ` Michael S. Tsirkin
2010-12-02 14:54                               ` Sheng Yang
2010-12-02 16:55                                 ` Michael S. Tsirkin
2010-12-03  3:03                                   ` Yang, Sheng
2010-11-23 12:04 ` Michael S. Tsirkin
2010-11-23 14:02   ` Yang, Sheng

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4CF79A97.30407@redhat.com \
    --to=avi@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=mtosatti@redhat.com \
    --cc=sheng.yang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.