From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jan Kiszka <jan.kiszka@web.de>
Cc: Avi Kivity <avi@redhat.com>,
Marcelo Tosatti <mtosatti@redhat.com>, kvm <kvm@vger.kernel.org>,
Alex Williamson <alex.williamson@redhat.com>,
Jan Kiszka <jan.kiszka@siemens.com>
Subject: Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
Date: Mon, 1 Nov 2010 19:06:43 +0200 [thread overview]
Message-ID: <20101101170643.GA1220@redhat.com> (raw)
In-Reply-To: <4CCEEB1C.3060704@web.de>
On Mon, Nov 01, 2010 at 05:30:20PM +0100, Jan Kiszka wrote:
> Am 01.11.2010 16:52, Michael S. Tsirkin wrote:
> > On Mon, Nov 01, 2010 at 04:41:08PM +0100, Jan Kiszka wrote:
> >> Am 01.11.2010 16:24, Michael S. Tsirkin wrote:
> >>> On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote:
> >>>> From: Jan Kiszka <jan.kiszka@siemens.com>
> >>>>
> >>>> PCI 2.3 allows to generically disable IRQ sources at device level. This
> >>>> enables us to share IRQs of such devices between on the host side when
> >>>> passing them to a guest.
> >>>>
> >>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> >>>> ---
> >>>> include/linux/kvm_host.h | 1 +
> >>>> virt/kvm/assigned-dev.c | 153 +++++++++++++++++++++++++++++++++++++++++----
> >>>> 2 files changed, 140 insertions(+), 14 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >>>> index df5497f..fcdc849 100644
> >>>> --- a/include/linux/kvm_host.h
> >>>> +++ b/include/linux/kvm_host.h
> >>>> @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
> >>>> unsigned int entries_nr;
> >>>> int host_irq;
> >>>> bool host_irq_disabled;
> >>>> + bool pci_2_3;
> >>>> struct msix_entry *host_msix_entries;
> >>>> int guest_irq;
> >>>> struct kvm_guest_msix_entry *guest_msix_entries;
> >>>> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> >>>> index d3ddfea..411643c 100644
> >>>> --- a/virt/kvm/assigned-dev.c
> >>>> +++ b/virt/kvm/assigned-dev.c
> >>>> @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
> >>>> return index;
> >>>> }
> >>>>
> >>>> +/*
> >>>> + * Verify that the device supports Interrupt Disable bit in command register,
> >>>> + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
> >>>> + * in PCI 2.2.
> >>>> + */
> >>>> +static bool pci_2_3_supported(struct pci_dev *pdev)
> >>>> +{
> >>>> + u16 orig, new;
> >>>> + bool supported = false;
> >>>> +
> >>>> + pci_block_user_cfg_access(pdev);
> >>>> + pci_read_config_word(pdev, PCI_COMMAND, &orig);
> >>>> + pci_write_config_word(pdev, PCI_COMMAND,
> >>>> + orig ^ PCI_COMMAND_INTX_DISABLE);
> >>>> + pci_read_config_word(pdev, PCI_COMMAND, &new);
> >>>> +
> >>>> + /*
> >>>> + * There's no way to protect against
> >>>> + * hardware bugs or detect them reliably, but as long as we know
> >>>> + * what the value should be, let's go ahead and check it.
> >>>> + */
> >>>> + if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
> >>>> + dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
> >>>> + "driver or HW bug?\n", orig, new);
> >>>> + goto out;
> >>>> + }
> >>>> + if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
> >>>> + dev_warn(&pdev->dev, "Device does not support "
> >>>> + "disabling interrupts: unable to bind.\n");
> >>>> + goto out;
> >>>> + }
> >>>> + supported = true;
> >>>> +
> >>>> + /* Now restore the original value. */
> >>>> + pci_write_config_word(pdev, PCI_COMMAND, orig);
> >>>> +
> >>>> +out:
> >>>> + pci_unblock_user_cfg_access(pdev);
> >>>> + return supported;
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status)
> >>>> +{
> >>>> + u32 cmd_status_dword;
> >>>> + u16 origcmd, newcmd;
> >>>> +
> >>>> + /*
> >>>> + * We do a single dword read to retrieve both command and status.
> >>>> + * Document assumptions that make this possible.
> >>>> + */
> >>>> + BUILD_BUG_ON(PCI_COMMAND % 4);
> >>>> + BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
> >>>> +
> >>>> + pci_block_user_cfg_access(dev);
> >>>> +
> >>>> + /*
> >>>> + * Read both command and status registers in a single 32-bit operation.
> >>>> + * Note: we could cache the value for command and move the status read
> >>>> + * out of the lock if there was a way to get notified of user changes
> >>>> + * to command register through sysfs. Should be good for shared irqs.
> >>>> + */
> >>>> + pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
> >>>> + origcmd = cmd_status_dword;
> >>>> +
> >>>> + if (irq_status) {
> >>>> + /*
> >>>> + * Check interrupt status register to see whether our device triggered
> >>>> + * the interrupt.
> >>>> + */
> >>>> + *irq_status = (cmd_status_dword >> 16) & PCI_STATUS_INTERRUPT;
> >>>> + if (*irq_status == 0)
> >>>> + goto done;
> >>>> + }
> >>>> +
> >>>> + newcmd = origcmd & ~PCI_COMMAND_INTX_DISABLE;
> >>>> + if (mask)
> >>>> + newcmd |= PCI_COMMAND_INTX_DISABLE;
> >>>> + if (newcmd != origcmd)
> >>>> + pci_write_config_word(dev, PCI_COMMAND, newcmd);
> >>>> +
> >>>> +done:
> >>>> + pci_unblock_user_cfg_access(dev);
> >>>> +}
> >>>> +
> >>>
> >>> Let's return irq_status instead of returning through a pointer?
> >>> Will save a branch and generally make the code a bit cleaner.
> >>
> >> I'm open for a better API suggestion,
> >
> > Maybe use separate functions for this.
> > pci_2_3_mask_irq
> > pci_2_3_unmask_irq
> > pci_2_3_disable_irq
> >
> > Common code can go into subfunctions.
> >
> >> but the current one goes like
> >> this: if irq_status is non-null, only mask the IRQ if the status bit
> >> indicates an interrupt. But we also have a user that wants to mask
> >> unconditionally.
> >
> > Why do you ever want to do that?
>
> During device shutdown (was disable_irq in the unshared case so far).
> The alternative would be to reset the device first, clearing any
> potentially pending events. If we can reorder the reset, that is likely
> better.
Not sure I understand completely, but sounds good.
> >
> >>>
> >>>> static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
> >>>> {
> >>>> struct kvm_assigned_dev_kernel *assigned_dev =
> >>>> (struct kvm_assigned_dev_kernel *) dev_id;
> >>>> + int ret = IRQ_HANDLED;
> >>>> unsigned long flags;
> >>>>
> >>>> spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
> >>>> @@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
> >>>> guest_entries[i].vector, 1);
> >>>> }
> >>>> } else {
> >>>> - kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> >>>> - assigned_dev->guest_irq, 1);
> >>>> -
> >>>> if (assigned_dev->irq_requested_type &
> >>>> KVM_DEV_IRQ_GUEST_INTX) {
> >>>> - disable_irq_nosync(irq);
> >>>> + if (assigned_dev->pci_2_3) {
> >>>> + unsigned int irq_status;
> >>>> +
> >>>> + if (assigned_dev->host_irq_disabled) {
> >>>> + ret = IRQ_NONE;
> >>>> + goto out;
> >>>> + }
> >>>> +
> >>>> + pci_2_3_mask_irq(assigned_dev->dev, 1,
> >>>> + &irq_status);
> >>>> + if (irq_status == 0) {
> >>>> + ret = IRQ_NONE;
> >>>> + goto out;
> >>>> + }
> >>>
> >>>
> >>> This will be cleaner if pci_2_3_mask_irq returns irqreturn_t.
> >>>
> >>>> + } else
> >>>> + disable_irq_nosync(irq);
> >>>> assigned_dev->host_irq_disabled = true;
> >>>> }
> >>>> +
> >>>> + kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> >>>> + assigned_dev->guest_irq, 1);
> >>>> }
> >>>>
> >>>> out:
> >>>> spin_unlock_irqrestore(&assigned_dev->assigned_dev_lock, flags);
> >>>> - return IRQ_HANDLED;
> >>>> + return ret;
> >>>> }
> >>>>
> >>>> /* Ack the irq line for an assigned device */
> >>>> @@ -117,7 +218,10 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
> >>>> */
> >>>> spin_lock_irqsave(&dev->assigned_dev_lock, flags);
> >>>> if (dev->host_irq_disabled) {
> >>>> - enable_irq(dev->host_irq);
> >>>> + if (dev->pci_2_3)
> >>>> + pci_2_3_mask_irq(dev->dev, 0, NULL);
> >>>> + else
> >>>> + enable_irq(dev->host_irq);
> >>>> dev->host_irq_disabled = false;
> >>>
> >>> So what happens here is that if interrupt is still pending
> >>> we will set level to 0, then get another interrupt from device
> >>> which will set it back to 1. An obvious optimization is avoid
> >>> all this, check pending bit and just keep level at 1.
> >>
> >> Isn't this an unrelated optimization, independent of this patch? But
> >> I'll think about it. What pending bit are you referring to?
> >
> > The one in PCI_STATUS.
>
> Ah, OK.
>
> >
> >>>
> >>>> }
> >>>> spin_unlock_irqrestore(&dev->assigned_dev_lock, flags);
> >>>> @@ -166,7 +270,11 @@ static void deassign_host_irq(struct kvm *kvm,
> >>>> pci_disable_msix(assigned_dev->dev);
> >>>> } else {
> >>>> /* Deal with MSI and INTx */
> >>>> - disable_irq(assigned_dev->host_irq);
> >>>> + if (assigned_dev->pci_2_3) {
> >>>> + pci_2_3_mask_irq(assigned_dev->dev, 1, NULL);
> >>>> + synchronize_irq(assigned_dev->host_irq);
> >>>> + } else
> >>>> + disable_irq(assigned_dev->host_irq);
> >>>>
> >>>> free_irq(assigned_dev->host_irq, (void *)assigned_dev);
> >>>>
> >>>> @@ -214,6 +322,13 @@ static void kvm_free_assigned_device(struct kvm *kvm,
> >>>>
> >>>> pci_reset_function(assigned_dev->dev);
> >>>>
> >>>> + /*
> >>>> + * Unmask the IRQ at PCI level once the reset is done - the next user
> >>>> + * may not expect the IRQ being masked.
> >>>> + */
> >>>> + if (assigned_dev->pci_2_3)
> >>>> + pci_2_3_mask_irq(assigned_dev->dev, 0, NULL);
> >>>> +
> >>>> pci_release_regions(assigned_dev->dev);
> >>>> pci_disable_device(assigned_dev->dev);
> >>>> pci_dev_put(assigned_dev->dev);
> >>>> @@ -239,15 +354,26 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
> >>>> static int assigned_device_enable_host_intx(struct kvm *kvm,
> >>>> struct kvm_assigned_dev_kernel *dev)
> >>>> {
> >>>> + unsigned long flags = 0;
> >>>> +
> >>>> dev->host_irq = dev->dev->irq;
> >>>> - /* Even though this is PCI, we don't want to use shared
> >>>> - * interrupts. Sharing host devices with guest-assigned devices
> >>>> - * on the same interrupt line is not a happy situation: there
> >>>> - * are going to be long delays in accepting, acking, etc.
> >>>> +
> >>>> + /*
> >>>> + * We can only share the IRQ line with other host devices if we are
> >>>> + * able to disable the IRQ source at device-level - independently of
> >>>> + * the guest driver. Otherwise host devices may suffer from unbounded
> >>>> + * IRQ latencies when the guest keeps the line asserted.
> >>>> */
> >>>> + dev->pci_2_3 = pci_2_3_supported(dev->dev);
> >>>> + if (dev->pci_2_3)
> >>>> + flags = IRQF_SHARED;
> >>>> +
> >>>> if (request_irq(dev->host_irq, kvm_assigned_dev_intr,
> >>>> - 0, "kvm_assigned_intx_device", (void *)dev))
> >>>> + flags, "kvm_assigned_intx_device", (void *)dev))
> >>>> return -EIO;
> >>>> +
> >>>> + if (dev->pci_2_3)
> >>>> + pci_2_3_mask_irq(dev->dev, 0, NULL);
> >>>> return 0;
> >>>> }
> >>>>
> >>>
> >>> Let's reverse the logic and try non-shared first, 2.3 is that fails?
> >>> This way we are backwards compatible ...
> >>
> >> Compatible with what?
> >
> > With the status quo :)
>
> There is no incompatibility except for a potential slow-down of a path
> that was often usable (exclusive legacy interrupts belong to a very rare
> species).
That's already an issue if it's real. I think there might also be an
issue if guest accesses command/status itself.
> >
> >> I thought about trying non-shared IRQs first, but that would break host
> >> devices arriving later - including other VMs that want to pass a device
> >> sitting on the same IRQ line. It's better (from management POV) to have
> >> IRQF_SHARED available.
> >
> > OTOH non-shared might be faster as we don't need to do
> > config writes/reads on data path ... Add a knob for management
> > to control this?
>
> Depends on how fast config writes actually are. I know they were slow
> (up to awfully slow if bios was involved) on old hardware, but does this
> still apply? I mean, a config knob would involve the whole stack, so it
> should be worth that effort.
>
> Jan
>
You tell me :)
prev parent reply other threads:[~2010-11-01 17:06 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-11-01 14:08 [PATCH 0/3] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
2010-11-01 14:08 ` [PATCH 1/3] KVM: Fold assigned interrupt work into IRQ handler Jan Kiszka
2010-11-01 16:34 ` Jan Kiszka
2010-11-01 14:08 ` [PATCH 2/3] KVM: Clear assigned guest IRQ on release Jan Kiszka
2010-11-01 14:08 ` [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices Jan Kiszka
2010-11-01 15:24 ` Michael S. Tsirkin
2010-11-01 15:41 ` Jan Kiszka
2010-11-01 15:52 ` Michael S. Tsirkin
2010-11-01 16:30 ` Jan Kiszka
2010-11-01 17:06 ` Michael S. Tsirkin [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20101101170643.GA1220@redhat.com \
--to=mst@redhat.com \
--cc=alex.williamson@redhat.com \
--cc=avi@redhat.com \
--cc=jan.kiszka@siemens.com \
--cc=jan.kiszka@web.de \
--cc=kvm@vger.kernel.org \
--cc=mtosatti@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).