* [PATCH 0/3] KVM: Improve IRQ assignment for device passthrough
@ 2010-11-01 14:08 Jan Kiszka
2010-11-01 14:08 ` [PATCH 1/3] KVM: Fold assigned interrupt work into IRQ handler Jan Kiszka
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Jan Kiszka @ 2010-11-01 14:08 UTC (permalink / raw)
To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Alex Williamson, Michael S. Tsirkin
Three patches to improve "classic" device assigment /wrt IRQs. Highlight
is the last one that resolves the host IRQ sharing issue for all PCI 2.3
devices. Quite essential when passing non-MSI-ready devices like many
USB host controllers.
Jan Kiszka (3):
KVM: Fold assigned interrupt work into IRQ handler
KVM: Clear assigned guest IRQ on release
KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
include/linux/kvm_host.h | 2 +-
virt/kvm/assigned-dev.c | 215 ++++++++++++++++++++++++++++++++++------------
2 files changed, 160 insertions(+), 57 deletions(-)
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH 1/3] KVM: Fold assigned interrupt work into IRQ handler
2010-11-01 14:08 [PATCH 0/3] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
@ 2010-11-01 14:08 ` Jan Kiszka
2010-11-01 16:34 ` Jan Kiszka
2010-11-01 14:08 ` [PATCH 2/3] KVM: Clear assigned guest IRQ on release Jan Kiszka
2010-11-01 14:08 ` [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices Jan Kiszka
2 siblings, 1 reply; 10+ messages in thread
From: Jan Kiszka @ 2010-11-01 14:08 UTC (permalink / raw)
To: Avi Kivity, Marcelo Tosatti
Cc: kvm, Alex Williamson, Michael S. Tsirkin, Jan Kiszka
From: Jan Kiszka <jan.kiszka@siemens.com>
The complete work handler runs with assigned_dev_lock acquired and
interrupts disabled, so there is nothing to gain pushing this work out
of the actually IRQ handler. Fold them together.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
include/linux/kvm_host.h | 1 -
virt/kvm/assigned-dev.c | 71 +++++++++++++++-------------------------------
2 files changed, 23 insertions(+), 49 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 462b982..df5497f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -465,7 +465,6 @@ struct kvm_guest_msix_entry {
struct kvm_assigned_dev_kernel {
struct kvm_irq_ack_notifier ack_notifier;
- struct work_struct interrupt_work;
struct list_head list;
int assigned_dev_id;
int host_segnr;
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 7c98928..5c1b56a 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -55,18 +55,24 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
return index;
}
-static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
+static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
{
- struct kvm_assigned_dev_kernel *assigned_dev;
- int i;
-
- assigned_dev = container_of(work, struct kvm_assigned_dev_kernel,
- interrupt_work);
+ struct kvm_assigned_dev_kernel *assigned_dev =
+ (struct kvm_assigned_dev_kernel *) dev_id;
+ unsigned long flags;
- spin_lock_irq(&assigned_dev->assigned_dev_lock);
+ spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
struct kvm_guest_msix_entry *guest_entries =
assigned_dev->guest_msix_entries;
+ int index = find_index_from_host_irq(assigned_dev, irq);
+ int i;
+
+ if (index < 0)
+ goto out;
+
+ guest_entries[index].flags |= KVM_ASSIGNED_MSIX_PENDING;
+
for (i = 0; i < assigned_dev->entries_nr; i++) {
if (!(guest_entries[i].flags &
KVM_ASSIGNED_MSIX_PENDING))
@@ -76,33 +82,15 @@ static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
assigned_dev->irq_source_id,
guest_entries[i].vector, 1);
}
- } else
+ } else {
kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
assigned_dev->guest_irq, 1);
- spin_unlock_irq(&assigned_dev->assigned_dev_lock);
-}
-
-static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
-{
- unsigned long flags;
- struct kvm_assigned_dev_kernel *assigned_dev =
- (struct kvm_assigned_dev_kernel *) dev_id;
-
- spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
- int index = find_index_from_host_irq(assigned_dev, irq);
- if (index < 0)
- goto out;
- assigned_dev->guest_msix_entries[index].flags |=
- KVM_ASSIGNED_MSIX_PENDING;
- }
-
- schedule_work(&assigned_dev->interrupt_work);
-
- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_GUEST_INTX) {
- disable_irq_nosync(irq);
- assigned_dev->host_irq_disabled = true;
+ if (assigned_dev->irq_requested_type &
+ KVM_DEV_IRQ_GUEST_INTX) {
+ disable_irq_nosync(irq);
+ assigned_dev->host_irq_disabled = true;
+ }
}
out:
@@ -147,33 +135,23 @@ static void deassign_guest_irq(struct kvm *kvm,
assigned_dev->irq_requested_type &= ~(KVM_DEV_IRQ_GUEST_MASK);
}
-/* The function implicit hold kvm->lock mutex due to cancel_work_sync() */
static void deassign_host_irq(struct kvm *kvm,
struct kvm_assigned_dev_kernel *assigned_dev)
{
/*
- * In kvm_free_device_irq, cancel_work_sync return true if:
- * 1. work is scheduled, and then cancelled.
- * 2. work callback is executed.
- *
- * The first one ensured that the irq is disabled and no more events
- * would happen. But for the second one, the irq may be enabled (e.g.
- * for MSI). So we disable irq here to prevent further events.
+ * We disable irq here to prevent further events.
*
* Notice this maybe result in nested disable if the interrupt type is
* INTx, but it's OK for we are going to free it.
*
* If this function is a part of VM destroy, please ensure that till
* now, the kvm state is still legal for probably we also have to wait
- * interrupt_work done.
+ * on a currently running IRQ handler.
*/
if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
int i;
for (i = 0; i < assigned_dev->entries_nr; i++)
- disable_irq_nosync(assigned_dev->
- host_msix_entries[i].vector);
-
- cancel_work_sync(&assigned_dev->interrupt_work);
+ disable_irq(assigned_dev->host_msix_entries[i].vector);
for (i = 0; i < assigned_dev->entries_nr; i++)
free_irq(assigned_dev->host_msix_entries[i].vector,
@@ -185,8 +163,7 @@ static void deassign_host_irq(struct kvm *kvm,
pci_disable_msix(assigned_dev->dev);
} else {
/* Deal with MSI and INTx */
- disable_irq_nosync(assigned_dev->host_irq);
- cancel_work_sync(&assigned_dev->interrupt_work);
+ disable_irq(assigned_dev->host_irq);
free_irq(assigned_dev->host_irq, (void *)assigned_dev);
@@ -558,8 +535,6 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
match->irq_source_id = -1;
match->kvm = kvm;
match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;
- INIT_WORK(&match->interrupt_work,
- kvm_assigned_dev_interrupt_work_handler);
list_add(&match->list, &kvm->arch.assigned_dev_head);
--
1.7.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 2/3] KVM: Clear assigned guest IRQ on release
2010-11-01 14:08 [PATCH 0/3] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
2010-11-01 14:08 ` [PATCH 1/3] KVM: Fold assigned interrupt work into IRQ handler Jan Kiszka
@ 2010-11-01 14:08 ` Jan Kiszka
2010-11-01 14:08 ` [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices Jan Kiszka
2 siblings, 0 replies; 10+ messages in thread
From: Jan Kiszka @ 2010-11-01 14:08 UTC (permalink / raw)
To: Avi Kivity, Marcelo Tosatti
Cc: kvm, Alex Williamson, Michael S. Tsirkin, Jan Kiszka
From: Jan Kiszka <jan.kiszka@siemens.com>
When we deassign a guest IRQ, clear the potentially asserted guest line.
There might be no chance for the guest to do this, specifically if we
switch from INTx to MSI mode.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
virt/kvm/assigned-dev.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 5c1b56a..d3ddfea 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -129,6 +129,9 @@ static void deassign_guest_irq(struct kvm *kvm,
kvm_unregister_irq_ack_notifier(kvm, &assigned_dev->ack_notifier);
assigned_dev->ack_notifier.gsi = -1;
+ kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
+ assigned_dev->guest_irq, 0);
+
if (assigned_dev->irq_source_id != -1)
kvm_free_irq_source_id(kvm, assigned_dev->irq_source_id);
assigned_dev->irq_source_id = -1;
--
1.7.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
2010-11-01 14:08 [PATCH 0/3] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
2010-11-01 14:08 ` [PATCH 1/3] KVM: Fold assigned interrupt work into IRQ handler Jan Kiszka
2010-11-01 14:08 ` [PATCH 2/3] KVM: Clear assigned guest IRQ on release Jan Kiszka
@ 2010-11-01 14:08 ` Jan Kiszka
2010-11-01 15:24 ` Michael S. Tsirkin
2 siblings, 1 reply; 10+ messages in thread
From: Jan Kiszka @ 2010-11-01 14:08 UTC (permalink / raw)
To: Avi Kivity, Marcelo Tosatti
Cc: kvm, Alex Williamson, Michael S. Tsirkin, Jan Kiszka
From: Jan Kiszka <jan.kiszka@siemens.com>
PCI 2.3 allows to generically disable IRQ sources at device level. This
enables us to share IRQs of such devices between on the host side when
passing them to a guest.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
include/linux/kvm_host.h | 1 +
virt/kvm/assigned-dev.c | 153 +++++++++++++++++++++++++++++++++++++++++----
2 files changed, 140 insertions(+), 14 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index df5497f..fcdc849 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
unsigned int entries_nr;
int host_irq;
bool host_irq_disabled;
+ bool pci_2_3;
struct msix_entry *host_msix_entries;
int guest_irq;
struct kvm_guest_msix_entry *guest_msix_entries;
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index d3ddfea..411643c 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
return index;
}
+/*
+ * Verify that the device supports Interrupt Disable bit in command register,
+ * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
+ * in PCI 2.2.
+ */
+static bool pci_2_3_supported(struct pci_dev *pdev)
+{
+ u16 orig, new;
+ bool supported = false;
+
+ pci_block_user_cfg_access(pdev);
+ pci_read_config_word(pdev, PCI_COMMAND, &orig);
+ pci_write_config_word(pdev, PCI_COMMAND,
+ orig ^ PCI_COMMAND_INTX_DISABLE);
+ pci_read_config_word(pdev, PCI_COMMAND, &new);
+
+ /*
+ * There's no way to protect against
+ * hardware bugs or detect them reliably, but as long as we know
+ * what the value should be, let's go ahead and check it.
+ */
+ if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
+ dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
+ "driver or HW bug?\n", orig, new);
+ goto out;
+ }
+ if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
+ dev_warn(&pdev->dev, "Device does not support "
+ "disabling interrupts: unable to bind.\n");
+ goto out;
+ }
+ supported = true;
+
+ /* Now restore the original value. */
+ pci_write_config_word(pdev, PCI_COMMAND, orig);
+
+out:
+ pci_unblock_user_cfg_access(pdev);
+ return supported;
+}
+
+static void
+pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status)
+{
+ u32 cmd_status_dword;
+ u16 origcmd, newcmd;
+
+ /*
+ * We do a single dword read to retrieve both command and status.
+ * Document assumptions that make this possible.
+ */
+ BUILD_BUG_ON(PCI_COMMAND % 4);
+ BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
+
+ pci_block_user_cfg_access(dev);
+
+ /*
+ * Read both command and status registers in a single 32-bit operation.
+ * Note: we could cache the value for command and move the status read
+ * out of the lock if there was a way to get notified of user changes
+ * to command register through sysfs. Should be good for shared irqs.
+ */
+ pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
+ origcmd = cmd_status_dword;
+
+ if (irq_status) {
+ /*
+ * Check interrupt status register to see whether our device triggered
+ * the interrupt.
+ */
+ *irq_status = (cmd_status_dword >> 16) & PCI_STATUS_INTERRUPT;
+ if (*irq_status == 0)
+ goto done;
+ }
+
+ newcmd = origcmd & ~PCI_COMMAND_INTX_DISABLE;
+ if (mask)
+ newcmd |= PCI_COMMAND_INTX_DISABLE;
+ if (newcmd != origcmd)
+ pci_write_config_word(dev, PCI_COMMAND, newcmd);
+
+done:
+ pci_unblock_user_cfg_access(dev);
+}
+
static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
{
struct kvm_assigned_dev_kernel *assigned_dev =
(struct kvm_assigned_dev_kernel *) dev_id;
+ int ret = IRQ_HANDLED;
unsigned long flags;
spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
@@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
guest_entries[i].vector, 1);
}
} else {
- kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
- assigned_dev->guest_irq, 1);
-
if (assigned_dev->irq_requested_type &
KVM_DEV_IRQ_GUEST_INTX) {
- disable_irq_nosync(irq);
+ if (assigned_dev->pci_2_3) {
+ unsigned int irq_status;
+
+ if (assigned_dev->host_irq_disabled) {
+ ret = IRQ_NONE;
+ goto out;
+ }
+
+ pci_2_3_mask_irq(assigned_dev->dev, 1,
+ &irq_status);
+ if (irq_status == 0) {
+ ret = IRQ_NONE;
+ goto out;
+ }
+ } else
+ disable_irq_nosync(irq);
assigned_dev->host_irq_disabled = true;
}
+
+ kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
+ assigned_dev->guest_irq, 1);
}
out:
spin_unlock_irqrestore(&assigned_dev->assigned_dev_lock, flags);
- return IRQ_HANDLED;
+ return ret;
}
/* Ack the irq line for an assigned device */
@@ -117,7 +218,10 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
*/
spin_lock_irqsave(&dev->assigned_dev_lock, flags);
if (dev->host_irq_disabled) {
- enable_irq(dev->host_irq);
+ if (dev->pci_2_3)
+ pci_2_3_mask_irq(dev->dev, 0, NULL);
+ else
+ enable_irq(dev->host_irq);
dev->host_irq_disabled = false;
}
spin_unlock_irqrestore(&dev->assigned_dev_lock, flags);
@@ -166,7 +270,11 @@ static void deassign_host_irq(struct kvm *kvm,
pci_disable_msix(assigned_dev->dev);
} else {
/* Deal with MSI and INTx */
- disable_irq(assigned_dev->host_irq);
+ if (assigned_dev->pci_2_3) {
+ pci_2_3_mask_irq(assigned_dev->dev, 1, NULL);
+ synchronize_irq(assigned_dev->host_irq);
+ } else
+ disable_irq(assigned_dev->host_irq);
free_irq(assigned_dev->host_irq, (void *)assigned_dev);
@@ -214,6 +322,13 @@ static void kvm_free_assigned_device(struct kvm *kvm,
pci_reset_function(assigned_dev->dev);
+ /*
+ * Unmask the IRQ at PCI level once the reset is done - the next user
+ * may not expect the IRQ being masked.
+ */
+ if (assigned_dev->pci_2_3)
+ pci_2_3_mask_irq(assigned_dev->dev, 0, NULL);
+
pci_release_regions(assigned_dev->dev);
pci_disable_device(assigned_dev->dev);
pci_dev_put(assigned_dev->dev);
@@ -239,15 +354,26 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
static int assigned_device_enable_host_intx(struct kvm *kvm,
struct kvm_assigned_dev_kernel *dev)
{
+ unsigned long flags = 0;
+
dev->host_irq = dev->dev->irq;
- /* Even though this is PCI, we don't want to use shared
- * interrupts. Sharing host devices with guest-assigned devices
- * on the same interrupt line is not a happy situation: there
- * are going to be long delays in accepting, acking, etc.
+
+ /*
+ * We can only share the IRQ line with other host devices if we are
+ * able to disable the IRQ source at device-level - independently of
+ * the guest driver. Otherwise host devices may suffer from unbounded
+ * IRQ latencies when the guest keeps the line asserted.
*/
+ dev->pci_2_3 = pci_2_3_supported(dev->dev);
+ if (dev->pci_2_3)
+ flags = IRQF_SHARED;
+
if (request_irq(dev->host_irq, kvm_assigned_dev_intr,
- 0, "kvm_assigned_intx_device", (void *)dev))
+ flags, "kvm_assigned_intx_device", (void *)dev))
return -EIO;
+
+ if (dev->pci_2_3)
+ pci_2_3_mask_irq(dev->dev, 0, NULL);
return 0;
}
@@ -324,7 +450,6 @@ static int assigned_device_enable_guest_msi(struct kvm *kvm,
{
dev->guest_irq = irq->guest_irq;
dev->ack_notifier.gsi = -1;
- dev->host_irq_disabled = false;
return 0;
}
#endif
@@ -336,7 +461,6 @@ static int assigned_device_enable_guest_msix(struct kvm *kvm,
{
dev->guest_irq = irq->guest_irq;
dev->ack_notifier.gsi = -1;
- dev->host_irq_disabled = false;
return 0;
}
#endif
@@ -367,6 +491,7 @@ static int assign_host_irq(struct kvm *kvm,
default:
r = -EINVAL;
}
+ dev->host_irq_disabled = false;
if (!r)
dev->irq_requested_type |= host_irq_type;
--
1.7.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
2010-11-01 14:08 ` [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices Jan Kiszka
@ 2010-11-01 15:24 ` Michael S. Tsirkin
2010-11-01 15:41 ` Jan Kiszka
0 siblings, 1 reply; 10+ messages in thread
From: Michael S. Tsirkin @ 2010-11-01 15:24 UTC (permalink / raw)
To: Jan Kiszka; +Cc: Avi Kivity, Marcelo Tosatti, kvm, Alex Williamson, Jan Kiszka
On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
>
> PCI 2.3 allows to generically disable IRQ sources at device level. This
> enables us to share IRQs of such devices between on the host side when
> passing them to a guest.
>
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
> include/linux/kvm_host.h | 1 +
> virt/kvm/assigned-dev.c | 153 +++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 140 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index df5497f..fcdc849 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
> unsigned int entries_nr;
> int host_irq;
> bool host_irq_disabled;
> + bool pci_2_3;
> struct msix_entry *host_msix_entries;
> int guest_irq;
> struct kvm_guest_msix_entry *guest_msix_entries;
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index d3ddfea..411643c 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
> return index;
> }
>
> +/*
> + * Verify that the device supports Interrupt Disable bit in command register,
> + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
> + * in PCI 2.2.
> + */
> +static bool pci_2_3_supported(struct pci_dev *pdev)
> +{
> + u16 orig, new;
> + bool supported = false;
> +
> + pci_block_user_cfg_access(pdev);
> + pci_read_config_word(pdev, PCI_COMMAND, &orig);
> + pci_write_config_word(pdev, PCI_COMMAND,
> + orig ^ PCI_COMMAND_INTX_DISABLE);
> + pci_read_config_word(pdev, PCI_COMMAND, &new);
> +
> + /*
> + * There's no way to protect against
> + * hardware bugs or detect them reliably, but as long as we know
> + * what the value should be, let's go ahead and check it.
> + */
> + if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
> + dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
> + "driver or HW bug?\n", orig, new);
> + goto out;
> + }
> + if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
> + dev_warn(&pdev->dev, "Device does not support "
> + "disabling interrupts: unable to bind.\n");
> + goto out;
> + }
> + supported = true;
> +
> + /* Now restore the original value. */
> + pci_write_config_word(pdev, PCI_COMMAND, orig);
> +
> +out:
> + pci_unblock_user_cfg_access(pdev);
> + return supported;
> +}
> +
> +static void
> +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status)
> +{
> + u32 cmd_status_dword;
> + u16 origcmd, newcmd;
> +
> + /*
> + * We do a single dword read to retrieve both command and status.
> + * Document assumptions that make this possible.
> + */
> + BUILD_BUG_ON(PCI_COMMAND % 4);
> + BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
> +
> + pci_block_user_cfg_access(dev);
> +
> + /*
> + * Read both command and status registers in a single 32-bit operation.
> + * Note: we could cache the value for command and move the status read
> + * out of the lock if there was a way to get notified of user changes
> + * to command register through sysfs. Should be good for shared irqs.
> + */
> + pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
> + origcmd = cmd_status_dword;
> +
> + if (irq_status) {
> + /*
> + * Check interrupt status register to see whether our device triggered
> + * the interrupt.
> + */
> + *irq_status = (cmd_status_dword >> 16) & PCI_STATUS_INTERRUPT;
> + if (*irq_status == 0)
> + goto done;
> + }
> +
> + newcmd = origcmd & ~PCI_COMMAND_INTX_DISABLE;
> + if (mask)
> + newcmd |= PCI_COMMAND_INTX_DISABLE;
> + if (newcmd != origcmd)
> + pci_write_config_word(dev, PCI_COMMAND, newcmd);
> +
> +done:
> + pci_unblock_user_cfg_access(dev);
> +}
> +
Let's return irq_status instead of returning through a pointer?
Will save a branch and generally make the code a bit cleaner.
> static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
> {
> struct kvm_assigned_dev_kernel *assigned_dev =
> (struct kvm_assigned_dev_kernel *) dev_id;
> + int ret = IRQ_HANDLED;
> unsigned long flags;
>
> spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
> @@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
> guest_entries[i].vector, 1);
> }
> } else {
> - kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> - assigned_dev->guest_irq, 1);
> -
> if (assigned_dev->irq_requested_type &
> KVM_DEV_IRQ_GUEST_INTX) {
> - disable_irq_nosync(irq);
> + if (assigned_dev->pci_2_3) {
> + unsigned int irq_status;
> +
> + if (assigned_dev->host_irq_disabled) {
> + ret = IRQ_NONE;
> + goto out;
> + }
> +
> + pci_2_3_mask_irq(assigned_dev->dev, 1,
> + &irq_status);
> + if (irq_status == 0) {
> + ret = IRQ_NONE;
> + goto out;
> + }
This will be cleaner if pci_2_3_mask_irq returns irqreturn_t.
> + } else
> + disable_irq_nosync(irq);
> assigned_dev->host_irq_disabled = true;
> }
> +
> + kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> + assigned_dev->guest_irq, 1);
> }
>
> out:
> spin_unlock_irqrestore(&assigned_dev->assigned_dev_lock, flags);
> - return IRQ_HANDLED;
> + return ret;
> }
>
> /* Ack the irq line for an assigned device */
> @@ -117,7 +218,10 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
> */
> spin_lock_irqsave(&dev->assigned_dev_lock, flags);
> if (dev->host_irq_disabled) {
> - enable_irq(dev->host_irq);
> + if (dev->pci_2_3)
> + pci_2_3_mask_irq(dev->dev, 0, NULL);
> + else
> + enable_irq(dev->host_irq);
> dev->host_irq_disabled = false;
So what happens here is that if interrupt is still pending
we will set level to 0, then get another interrupt from device
which will set it back to 1. An obvious optimization is avoid
all this, check pending bit and just keep level at 1.
> }
> spin_unlock_irqrestore(&dev->assigned_dev_lock, flags);
> @@ -166,7 +270,11 @@ static void deassign_host_irq(struct kvm *kvm,
> pci_disable_msix(assigned_dev->dev);
> } else {
> /* Deal with MSI and INTx */
> - disable_irq(assigned_dev->host_irq);
> + if (assigned_dev->pci_2_3) {
> + pci_2_3_mask_irq(assigned_dev->dev, 1, NULL);
> + synchronize_irq(assigned_dev->host_irq);
> + } else
> + disable_irq(assigned_dev->host_irq);
>
> free_irq(assigned_dev->host_irq, (void *)assigned_dev);
>
> @@ -214,6 +322,13 @@ static void kvm_free_assigned_device(struct kvm *kvm,
>
> pci_reset_function(assigned_dev->dev);
>
> + /*
> + * Unmask the IRQ at PCI level once the reset is done - the next user
> + * may not expect the IRQ being masked.
> + */
> + if (assigned_dev->pci_2_3)
> + pci_2_3_mask_irq(assigned_dev->dev, 0, NULL);
> +
> pci_release_regions(assigned_dev->dev);
> pci_disable_device(assigned_dev->dev);
> pci_dev_put(assigned_dev->dev);
> @@ -239,15 +354,26 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
> static int assigned_device_enable_host_intx(struct kvm *kvm,
> struct kvm_assigned_dev_kernel *dev)
> {
> + unsigned long flags = 0;
> +
> dev->host_irq = dev->dev->irq;
> - /* Even though this is PCI, we don't want to use shared
> - * interrupts. Sharing host devices with guest-assigned devices
> - * on the same interrupt line is not a happy situation: there
> - * are going to be long delays in accepting, acking, etc.
> +
> + /*
> + * We can only share the IRQ line with other host devices if we are
> + * able to disable the IRQ source at device-level - independently of
> + * the guest driver. Otherwise host devices may suffer from unbounded
> + * IRQ latencies when the guest keeps the line asserted.
> */
> + dev->pci_2_3 = pci_2_3_supported(dev->dev);
> + if (dev->pci_2_3)
> + flags = IRQF_SHARED;
> +
> if (request_irq(dev->host_irq, kvm_assigned_dev_intr,
> - 0, "kvm_assigned_intx_device", (void *)dev))
> + flags, "kvm_assigned_intx_device", (void *)dev))
> return -EIO;
> +
> + if (dev->pci_2_3)
> + pci_2_3_mask_irq(dev->dev, 0, NULL);
> return 0;
> }
>
Let's reverse the logic and try non-shared first, 2.3 is that fails?
This way we are backwards compatible ...
> @@ -324,7 +450,6 @@ static int assigned_device_enable_guest_msi(struct kvm *kvm,
> {
> dev->guest_irq = irq->guest_irq;
> dev->ack_notifier.gsi = -1;
> - dev->host_irq_disabled = false;
> return 0;
> }
> #endif
> @@ -336,7 +461,6 @@ static int assigned_device_enable_guest_msix(struct kvm *kvm,
> {
> dev->guest_irq = irq->guest_irq;
> dev->ack_notifier.gsi = -1;
> - dev->host_irq_disabled = false;
> return 0;
> }
> #endif
> @@ -367,6 +491,7 @@ static int assign_host_irq(struct kvm *kvm,
> default:
> r = -EINVAL;
> }
> + dev->host_irq_disabled = false;
>
> if (!r)
> dev->irq_requested_type |= host_irq_type;
> --
> 1.7.1
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
2010-11-01 15:24 ` Michael S. Tsirkin
@ 2010-11-01 15:41 ` Jan Kiszka
2010-11-01 15:52 ` Michael S. Tsirkin
0 siblings, 1 reply; 10+ messages in thread
From: Jan Kiszka @ 2010-11-01 15:41 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Avi Kivity, Marcelo Tosatti, kvm, Alex Williamson, Jan Kiszka
[-- Attachment #1: Type: text/plain, Size: 10270 bytes --]
Am 01.11.2010 16:24, Michael S. Tsirkin wrote:
> On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> PCI 2.3 allows to generically disable IRQ sources at device level. This
>> enables us to share IRQs of such devices between on the host side when
>> passing them to a guest.
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>> ---
>> include/linux/kvm_host.h | 1 +
>> virt/kvm/assigned-dev.c | 153 +++++++++++++++++++++++++++++++++++++++++----
>> 2 files changed, 140 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index df5497f..fcdc849 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
>> unsigned int entries_nr;
>> int host_irq;
>> bool host_irq_disabled;
>> + bool pci_2_3;
>> struct msix_entry *host_msix_entries;
>> int guest_irq;
>> struct kvm_guest_msix_entry *guest_msix_entries;
>> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
>> index d3ddfea..411643c 100644
>> --- a/virt/kvm/assigned-dev.c
>> +++ b/virt/kvm/assigned-dev.c
>> @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
>> return index;
>> }
>>
>> +/*
>> + * Verify that the device supports Interrupt Disable bit in command register,
>> + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
>> + * in PCI 2.2.
>> + */
>> +static bool pci_2_3_supported(struct pci_dev *pdev)
>> +{
>> + u16 orig, new;
>> + bool supported = false;
>> +
>> + pci_block_user_cfg_access(pdev);
>> + pci_read_config_word(pdev, PCI_COMMAND, &orig);
>> + pci_write_config_word(pdev, PCI_COMMAND,
>> + orig ^ PCI_COMMAND_INTX_DISABLE);
>> + pci_read_config_word(pdev, PCI_COMMAND, &new);
>> +
>> + /*
>> + * There's no way to protect against
>> + * hardware bugs or detect them reliably, but as long as we know
>> + * what the value should be, let's go ahead and check it.
>> + */
>> + if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
>> + dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
>> + "driver or HW bug?\n", orig, new);
>> + goto out;
>> + }
>> + if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
>> + dev_warn(&pdev->dev, "Device does not support "
>> + "disabling interrupts: unable to bind.\n");
>> + goto out;
>> + }
>> + supported = true;
>> +
>> + /* Now restore the original value. */
>> + pci_write_config_word(pdev, PCI_COMMAND, orig);
>> +
>> +out:
>> + pci_unblock_user_cfg_access(pdev);
>> + return supported;
>> +}
>> +
>> +static void
>> +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status)
>> +{
>> + u32 cmd_status_dword;
>> + u16 origcmd, newcmd;
>> +
>> + /*
>> + * We do a single dword read to retrieve both command and status.
>> + * Document assumptions that make this possible.
>> + */
>> + BUILD_BUG_ON(PCI_COMMAND % 4);
>> + BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
>> +
>> + pci_block_user_cfg_access(dev);
>> +
>> + /*
>> + * Read both command and status registers in a single 32-bit operation.
>> + * Note: we could cache the value for command and move the status read
>> + * out of the lock if there was a way to get notified of user changes
>> + * to command register through sysfs. Should be good for shared irqs.
>> + */
>> + pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
>> + origcmd = cmd_status_dword;
>> +
>> + if (irq_status) {
>> + /*
>> + * Check interrupt status register to see whether our device triggered
>> + * the interrupt.
>> + */
>> + *irq_status = (cmd_status_dword >> 16) & PCI_STATUS_INTERRUPT;
>> + if (*irq_status == 0)
>> + goto done;
>> + }
>> +
>> + newcmd = origcmd & ~PCI_COMMAND_INTX_DISABLE;
>> + if (mask)
>> + newcmd |= PCI_COMMAND_INTX_DISABLE;
>> + if (newcmd != origcmd)
>> + pci_write_config_word(dev, PCI_COMMAND, newcmd);
>> +
>> +done:
>> + pci_unblock_user_cfg_access(dev);
>> +}
>> +
>
> Let's return irq_status instead of returning through a pointer?
> Will save a branch and generally make the code a bit cleaner.
I'm open for a better API suggestion, but the current one goes like
this: if irq_status is non-null, only mask the IRQ if the status bit
indicates an interrupt. But we also have a user that wants to mask
unconditionally.
>
>> static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
>> {
>> struct kvm_assigned_dev_kernel *assigned_dev =
>> (struct kvm_assigned_dev_kernel *) dev_id;
>> + int ret = IRQ_HANDLED;
>> unsigned long flags;
>>
>> spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
>> @@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
>> guest_entries[i].vector, 1);
>> }
>> } else {
>> - kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
>> - assigned_dev->guest_irq, 1);
>> -
>> if (assigned_dev->irq_requested_type &
>> KVM_DEV_IRQ_GUEST_INTX) {
>> - disable_irq_nosync(irq);
>> + if (assigned_dev->pci_2_3) {
>> + unsigned int irq_status;
>> +
>> + if (assigned_dev->host_irq_disabled) {
>> + ret = IRQ_NONE;
>> + goto out;
>> + }
>> +
>> + pci_2_3_mask_irq(assigned_dev->dev, 1,
>> + &irq_status);
>> + if (irq_status == 0) {
>> + ret = IRQ_NONE;
>> + goto out;
>> + }
>
>
> This will be cleaner if pci_2_3_mask_irq returns irqreturn_t.
>
>> + } else
>> + disable_irq_nosync(irq);
>> assigned_dev->host_irq_disabled = true;
>> }
>> +
>> + kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
>> + assigned_dev->guest_irq, 1);
>> }
>>
>> out:
>> spin_unlock_irqrestore(&assigned_dev->assigned_dev_lock, flags);
>> - return IRQ_HANDLED;
>> + return ret;
>> }
>>
>> /* Ack the irq line for an assigned device */
>> @@ -117,7 +218,10 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
>> */
>> spin_lock_irqsave(&dev->assigned_dev_lock, flags);
>> if (dev->host_irq_disabled) {
>> - enable_irq(dev->host_irq);
>> + if (dev->pci_2_3)
>> + pci_2_3_mask_irq(dev->dev, 0, NULL);
>> + else
>> + enable_irq(dev->host_irq);
>> dev->host_irq_disabled = false;
>
> So what happens here is that if interrupt is still pending
> we will set level to 0, then get another interrupt from device
> which will set it back to 1. An obvious optimization is avoid
> all this, check pending bit and just keep level at 1.
Isn't this an unrelated optimization, independent of this patch? But
I'll think about it. What pending bit are you referring to?
>
>> }
>> spin_unlock_irqrestore(&dev->assigned_dev_lock, flags);
>> @@ -166,7 +270,11 @@ static void deassign_host_irq(struct kvm *kvm,
>> pci_disable_msix(assigned_dev->dev);
>> } else {
>> /* Deal with MSI and INTx */
>> - disable_irq(assigned_dev->host_irq);
>> + if (assigned_dev->pci_2_3) {
>> + pci_2_3_mask_irq(assigned_dev->dev, 1, NULL);
>> + synchronize_irq(assigned_dev->host_irq);
>> + } else
>> + disable_irq(assigned_dev->host_irq);
>>
>> free_irq(assigned_dev->host_irq, (void *)assigned_dev);
>>
>> @@ -214,6 +322,13 @@ static void kvm_free_assigned_device(struct kvm *kvm,
>>
>> pci_reset_function(assigned_dev->dev);
>>
>> + /*
>> + * Unmask the IRQ at PCI level once the reset is done - the next user
>> + * may not expect the IRQ being masked.
>> + */
>> + if (assigned_dev->pci_2_3)
>> + pci_2_3_mask_irq(assigned_dev->dev, 0, NULL);
>> +
>> pci_release_regions(assigned_dev->dev);
>> pci_disable_device(assigned_dev->dev);
>> pci_dev_put(assigned_dev->dev);
>> @@ -239,15 +354,26 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
>> static int assigned_device_enable_host_intx(struct kvm *kvm,
>> struct kvm_assigned_dev_kernel *dev)
>> {
>> + unsigned long flags = 0;
>> +
>> dev->host_irq = dev->dev->irq;
>> - /* Even though this is PCI, we don't want to use shared
>> - * interrupts. Sharing host devices with guest-assigned devices
>> - * on the same interrupt line is not a happy situation: there
>> - * are going to be long delays in accepting, acking, etc.
>> +
>> + /*
>> + * We can only share the IRQ line with other host devices if we are
>> + * able to disable the IRQ source at device-level - independently of
>> + * the guest driver. Otherwise host devices may suffer from unbounded
>> + * IRQ latencies when the guest keeps the line asserted.
>> */
>> + dev->pci_2_3 = pci_2_3_supported(dev->dev);
>> + if (dev->pci_2_3)
>> + flags = IRQF_SHARED;
>> +
>> if (request_irq(dev->host_irq, kvm_assigned_dev_intr,
>> - 0, "kvm_assigned_intx_device", (void *)dev))
>> + flags, "kvm_assigned_intx_device", (void *)dev))
>> return -EIO;
>> +
>> + if (dev->pci_2_3)
>> + pci_2_3_mask_irq(dev->dev, 0, NULL);
>> return 0;
>> }
>>
>
> Let's reverse the logic and try non-shared first, 2.3 is that fails?
> This way we are backwards compatible ...
Compatible with what?
I thought about trying non-shared IRQs first, but that would break host
devices arriving later - including other VMs that want to pass a device
sitting on the same IRQ line. It's better (from management POV) to have
IRQF_SHARED available.
>
>> @@ -324,7 +450,6 @@ static int assigned_device_enable_guest_msi(struct kvm *kvm,
>> {
>> dev->guest_irq = irq->guest_irq;
>> dev->ack_notifier.gsi = -1;
>> - dev->host_irq_disabled = false;
>> return 0;
>> }
>> #endif
>> @@ -336,7 +461,6 @@ static int assigned_device_enable_guest_msix(struct kvm *kvm,
>> {
>> dev->guest_irq = irq->guest_irq;
>> dev->ack_notifier.gsi = -1;
>> - dev->host_irq_disabled = false;
>> return 0;
>> }
>> #endif
>> @@ -367,6 +491,7 @@ static int assign_host_irq(struct kvm *kvm,
>> default:
>> r = -EINVAL;
>> }
>> + dev->host_irq_disabled = false;
>>
>> if (!r)
>> dev->irq_requested_type |= host_irq_type;
>> --
>> 1.7.1
Thanks for the comment,
Jan
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
2010-11-01 15:41 ` Jan Kiszka
@ 2010-11-01 15:52 ` Michael S. Tsirkin
2010-11-01 16:30 ` Jan Kiszka
0 siblings, 1 reply; 10+ messages in thread
From: Michael S. Tsirkin @ 2010-11-01 15:52 UTC (permalink / raw)
To: Jan Kiszka; +Cc: Avi Kivity, Marcelo Tosatti, kvm, Alex Williamson, Jan Kiszka
On Mon, Nov 01, 2010 at 04:41:08PM +0100, Jan Kiszka wrote:
> Am 01.11.2010 16:24, Michael S. Tsirkin wrote:
> > On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote:
> >> From: Jan Kiszka <jan.kiszka@siemens.com>
> >>
> >> PCI 2.3 allows to generically disable IRQ sources at device level. This
> >> enables us to share IRQs of such devices between on the host side when
> >> passing them to a guest.
> >>
> >> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> >> ---
> >> include/linux/kvm_host.h | 1 +
> >> virt/kvm/assigned-dev.c | 153 +++++++++++++++++++++++++++++++++++++++++----
> >> 2 files changed, 140 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >> index df5497f..fcdc849 100644
> >> --- a/include/linux/kvm_host.h
> >> +++ b/include/linux/kvm_host.h
> >> @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
> >> unsigned int entries_nr;
> >> int host_irq;
> >> bool host_irq_disabled;
> >> + bool pci_2_3;
> >> struct msix_entry *host_msix_entries;
> >> int guest_irq;
> >> struct kvm_guest_msix_entry *guest_msix_entries;
> >> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> >> index d3ddfea..411643c 100644
> >> --- a/virt/kvm/assigned-dev.c
> >> +++ b/virt/kvm/assigned-dev.c
> >> @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
> >> return index;
> >> }
> >>
> >> +/*
> >> + * Verify that the device supports Interrupt Disable bit in command register,
> >> + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
> >> + * in PCI 2.2.
> >> + */
> >> +static bool pci_2_3_supported(struct pci_dev *pdev)
> >> +{
> >> + u16 orig, new;
> >> + bool supported = false;
> >> +
> >> + pci_block_user_cfg_access(pdev);
> >> + pci_read_config_word(pdev, PCI_COMMAND, &orig);
> >> + pci_write_config_word(pdev, PCI_COMMAND,
> >> + orig ^ PCI_COMMAND_INTX_DISABLE);
> >> + pci_read_config_word(pdev, PCI_COMMAND, &new);
> >> +
> >> + /*
> >> + * There's no way to protect against
> >> + * hardware bugs or detect them reliably, but as long as we know
> >> + * what the value should be, let's go ahead and check it.
> >> + */
> >> + if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
> >> + dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
> >> + "driver or HW bug?\n", orig, new);
> >> + goto out;
> >> + }
> >> + if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
> >> + dev_warn(&pdev->dev, "Device does not support "
> >> + "disabling interrupts: unable to bind.\n");
> >> + goto out;
> >> + }
> >> + supported = true;
> >> +
> >> + /* Now restore the original value. */
> >> + pci_write_config_word(pdev, PCI_COMMAND, orig);
> >> +
> >> +out:
> >> + pci_unblock_user_cfg_access(pdev);
> >> + return supported;
> >> +}
> >> +
> >> +static void
> >> +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status)
> >> +{
> >> + u32 cmd_status_dword;
> >> + u16 origcmd, newcmd;
> >> +
> >> + /*
> >> + * We do a single dword read to retrieve both command and status.
> >> + * Document assumptions that make this possible.
> >> + */
> >> + BUILD_BUG_ON(PCI_COMMAND % 4);
> >> + BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
> >> +
> >> + pci_block_user_cfg_access(dev);
> >> +
> >> + /*
> >> + * Read both command and status registers in a single 32-bit operation.
> >> + * Note: we could cache the value for command and move the status read
> >> + * out of the lock if there was a way to get notified of user changes
> >> + * to command register through sysfs. Should be good for shared irqs.
> >> + */
> >> + pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
> >> + origcmd = cmd_status_dword;
> >> +
> >> + if (irq_status) {
> >> + /*
> >> + * Check interrupt status register to see whether our device triggered
> >> + * the interrupt.
> >> + */
> >> + *irq_status = (cmd_status_dword >> 16) & PCI_STATUS_INTERRUPT;
> >> + if (*irq_status == 0)
> >> + goto done;
> >> + }
> >> +
> >> + newcmd = origcmd & ~PCI_COMMAND_INTX_DISABLE;
> >> + if (mask)
> >> + newcmd |= PCI_COMMAND_INTX_DISABLE;
> >> + if (newcmd != origcmd)
> >> + pci_write_config_word(dev, PCI_COMMAND, newcmd);
> >> +
> >> +done:
> >> + pci_unblock_user_cfg_access(dev);
> >> +}
> >> +
> >
> > Let's return irq_status instead of returning through a pointer?
> > Will save a branch and generally make the code a bit cleaner.
>
> I'm open for a better API suggestion,
Maybe use separate functions for this.
pci_2_3_mask_irq
pci_2_3_unmask_irq
pci_2_3_disable_irq
Common code can go into subfunctions.
> but the current one goes like
> this: if irq_status is non-null, only mask the IRQ if the status bit
> indicates an interrupt. But we also have a user that wants to mask
> unconditionally.
Why do you ever want to do that?
> >
> >> static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
> >> {
> >> struct kvm_assigned_dev_kernel *assigned_dev =
> >> (struct kvm_assigned_dev_kernel *) dev_id;
> >> + int ret = IRQ_HANDLED;
> >> unsigned long flags;
> >>
> >> spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
> >> @@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
> >> guest_entries[i].vector, 1);
> >> }
> >> } else {
> >> - kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> >> - assigned_dev->guest_irq, 1);
> >> -
> >> if (assigned_dev->irq_requested_type &
> >> KVM_DEV_IRQ_GUEST_INTX) {
> >> - disable_irq_nosync(irq);
> >> + if (assigned_dev->pci_2_3) {
> >> + unsigned int irq_status;
> >> +
> >> + if (assigned_dev->host_irq_disabled) {
> >> + ret = IRQ_NONE;
> >> + goto out;
> >> + }
> >> +
> >> + pci_2_3_mask_irq(assigned_dev->dev, 1,
> >> + &irq_status);
> >> + if (irq_status == 0) {
> >> + ret = IRQ_NONE;
> >> + goto out;
> >> + }
> >
> >
> > This will be cleaner if pci_2_3_mask_irq returns irqreturn_t.
> >
> >> + } else
> >> + disable_irq_nosync(irq);
> >> assigned_dev->host_irq_disabled = true;
> >> }
> >> +
> >> + kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> >> + assigned_dev->guest_irq, 1);
> >> }
> >>
> >> out:
> >> spin_unlock_irqrestore(&assigned_dev->assigned_dev_lock, flags);
> >> - return IRQ_HANDLED;
> >> + return ret;
> >> }
> >>
> >> /* Ack the irq line for an assigned device */
> >> @@ -117,7 +218,10 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
> >> */
> >> spin_lock_irqsave(&dev->assigned_dev_lock, flags);
> >> if (dev->host_irq_disabled) {
> >> - enable_irq(dev->host_irq);
> >> + if (dev->pci_2_3)
> >> + pci_2_3_mask_irq(dev->dev, 0, NULL);
> >> + else
> >> + enable_irq(dev->host_irq);
> >> dev->host_irq_disabled = false;
> >
> > So what happens here is that if interrupt is still pending
> > we will set level to 0, then get another interrupt from device
> > which will set it back to 1. An obvious optimization is avoid
> > all this, check pending bit and just keep level at 1.
>
> Isn't this an unrelated optimization, independent of this patch? But
> I'll think about it. What pending bit are you referring to?
The one in PCI_STATUS.
> >
> >> }
> >> spin_unlock_irqrestore(&dev->assigned_dev_lock, flags);
> >> @@ -166,7 +270,11 @@ static void deassign_host_irq(struct kvm *kvm,
> >> pci_disable_msix(assigned_dev->dev);
> >> } else {
> >> /* Deal with MSI and INTx */
> >> - disable_irq(assigned_dev->host_irq);
> >> + if (assigned_dev->pci_2_3) {
> >> + pci_2_3_mask_irq(assigned_dev->dev, 1, NULL);
> >> + synchronize_irq(assigned_dev->host_irq);
> >> + } else
> >> + disable_irq(assigned_dev->host_irq);
> >>
> >> free_irq(assigned_dev->host_irq, (void *)assigned_dev);
> >>
> >> @@ -214,6 +322,13 @@ static void kvm_free_assigned_device(struct kvm *kvm,
> >>
> >> pci_reset_function(assigned_dev->dev);
> >>
> >> + /*
> >> + * Unmask the IRQ at PCI level once the reset is done - the next user
> >> + * may not expect the IRQ being masked.
> >> + */
> >> + if (assigned_dev->pci_2_3)
> >> + pci_2_3_mask_irq(assigned_dev->dev, 0, NULL);
> >> +
> >> pci_release_regions(assigned_dev->dev);
> >> pci_disable_device(assigned_dev->dev);
> >> pci_dev_put(assigned_dev->dev);
> >> @@ -239,15 +354,26 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
> >> static int assigned_device_enable_host_intx(struct kvm *kvm,
> >> struct kvm_assigned_dev_kernel *dev)
> >> {
> >> + unsigned long flags = 0;
> >> +
> >> dev->host_irq = dev->dev->irq;
> >> - /* Even though this is PCI, we don't want to use shared
> >> - * interrupts. Sharing host devices with guest-assigned devices
> >> - * on the same interrupt line is not a happy situation: there
> >> - * are going to be long delays in accepting, acking, etc.
> >> +
> >> + /*
> >> + * We can only share the IRQ line with other host devices if we are
> >> + * able to disable the IRQ source at device-level - independently of
> >> + * the guest driver. Otherwise host devices may suffer from unbounded
> >> + * IRQ latencies when the guest keeps the line asserted.
> >> */
> >> + dev->pci_2_3 = pci_2_3_supported(dev->dev);
> >> + if (dev->pci_2_3)
> >> + flags = IRQF_SHARED;
> >> +
> >> if (request_irq(dev->host_irq, kvm_assigned_dev_intr,
> >> - 0, "kvm_assigned_intx_device", (void *)dev))
> >> + flags, "kvm_assigned_intx_device", (void *)dev))
> >> return -EIO;
> >> +
> >> + if (dev->pci_2_3)
> >> + pci_2_3_mask_irq(dev->dev, 0, NULL);
> >> return 0;
> >> }
> >>
> >
> > Let's reverse the logic and try non-shared first, 2.3 is that fails?
> > This way we are backwards compatible ...
>
> Compatible with what?
With the status quo :)
> I thought about trying non-shared IRQs first, but that would break host
> devices arriving later - including other VMs that want to pass a device
> sitting on the same IRQ line. It's better (from management POV) to have
> IRQF_SHARED available.
OTOH non-shared might be faster as we don't need to do
config writes/reads on data path ... Add a knob for management
to control this?
> >
> >> @@ -324,7 +450,6 @@ static int assigned_device_enable_guest_msi(struct kvm *kvm,
> >> {
> >> dev->guest_irq = irq->guest_irq;
> >> dev->ack_notifier.gsi = -1;
> >> - dev->host_irq_disabled = false;
> >> return 0;
> >> }
> >> #endif
> >> @@ -336,7 +461,6 @@ static int assigned_device_enable_guest_msix(struct kvm *kvm,
> >> {
> >> dev->guest_irq = irq->guest_irq;
> >> dev->ack_notifier.gsi = -1;
> >> - dev->host_irq_disabled = false;
> >> return 0;
> >> }
> >> #endif
> >> @@ -367,6 +491,7 @@ static int assign_host_irq(struct kvm *kvm,
> >> default:
> >> r = -EINVAL;
> >> }
> >> + dev->host_irq_disabled = false;
> >>
> >> if (!r)
> >> dev->irq_requested_type |= host_irq_type;
> >> --
> >> 1.7.1
>
> Thanks for the comment,
> Jan
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
2010-11-01 15:52 ` Michael S. Tsirkin
@ 2010-11-01 16:30 ` Jan Kiszka
2010-11-01 17:06 ` Michael S. Tsirkin
0 siblings, 1 reply; 10+ messages in thread
From: Jan Kiszka @ 2010-11-01 16:30 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Avi Kivity, Marcelo Tosatti, kvm, Alex Williamson, Jan Kiszka
[-- Attachment #1: Type: text/plain, Size: 11245 bytes --]
Am 01.11.2010 16:52, Michael S. Tsirkin wrote:
> On Mon, Nov 01, 2010 at 04:41:08PM +0100, Jan Kiszka wrote:
>> Am 01.11.2010 16:24, Michael S. Tsirkin wrote:
>>> On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote:
>>>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>>>
>>>> PCI 2.3 allows to generically disable IRQ sources at device level. This
>>>> enables us to share IRQs of such devices between on the host side when
>>>> passing them to a guest.
>>>>
>>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>>>> ---
>>>> include/linux/kvm_host.h | 1 +
>>>> virt/kvm/assigned-dev.c | 153 +++++++++++++++++++++++++++++++++++++++++----
>>>> 2 files changed, 140 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>>>> index df5497f..fcdc849 100644
>>>> --- a/include/linux/kvm_host.h
>>>> +++ b/include/linux/kvm_host.h
>>>> @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
>>>> unsigned int entries_nr;
>>>> int host_irq;
>>>> bool host_irq_disabled;
>>>> + bool pci_2_3;
>>>> struct msix_entry *host_msix_entries;
>>>> int guest_irq;
>>>> struct kvm_guest_msix_entry *guest_msix_entries;
>>>> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
>>>> index d3ddfea..411643c 100644
>>>> --- a/virt/kvm/assigned-dev.c
>>>> +++ b/virt/kvm/assigned-dev.c
>>>> @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
>>>> return index;
>>>> }
>>>>
>>>> +/*
>>>> + * Verify that the device supports Interrupt Disable bit in command register,
>>>> + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
>>>> + * in PCI 2.2.
>>>> + */
>>>> +static bool pci_2_3_supported(struct pci_dev *pdev)
>>>> +{
>>>> + u16 orig, new;
>>>> + bool supported = false;
>>>> +
>>>> + pci_block_user_cfg_access(pdev);
>>>> + pci_read_config_word(pdev, PCI_COMMAND, &orig);
>>>> + pci_write_config_word(pdev, PCI_COMMAND,
>>>> + orig ^ PCI_COMMAND_INTX_DISABLE);
>>>> + pci_read_config_word(pdev, PCI_COMMAND, &new);
>>>> +
>>>> + /*
>>>> + * There's no way to protect against
>>>> + * hardware bugs or detect them reliably, but as long as we know
>>>> + * what the value should be, let's go ahead and check it.
>>>> + */
>>>> + if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
>>>> + dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
>>>> + "driver or HW bug?\n", orig, new);
>>>> + goto out;
>>>> + }
>>>> + if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
>>>> + dev_warn(&pdev->dev, "Device does not support "
>>>> + "disabling interrupts: unable to bind.\n");
>>>> + goto out;
>>>> + }
>>>> + supported = true;
>>>> +
>>>> + /* Now restore the original value. */
>>>> + pci_write_config_word(pdev, PCI_COMMAND, orig);
>>>> +
>>>> +out:
>>>> + pci_unblock_user_cfg_access(pdev);
>>>> + return supported;
>>>> +}
>>>> +
>>>> +static void
>>>> +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status)
>>>> +{
>>>> + u32 cmd_status_dword;
>>>> + u16 origcmd, newcmd;
>>>> +
>>>> + /*
>>>> + * We do a single dword read to retrieve both command and status.
>>>> + * Document assumptions that make this possible.
>>>> + */
>>>> + BUILD_BUG_ON(PCI_COMMAND % 4);
>>>> + BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
>>>> +
>>>> + pci_block_user_cfg_access(dev);
>>>> +
>>>> + /*
>>>> + * Read both command and status registers in a single 32-bit operation.
>>>> + * Note: we could cache the value for command and move the status read
>>>> + * out of the lock if there was a way to get notified of user changes
>>>> + * to command register through sysfs. Should be good for shared irqs.
>>>> + */
>>>> + pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
>>>> + origcmd = cmd_status_dword;
>>>> +
>>>> + if (irq_status) {
>>>> + /*
>>>> + * Check interrupt status register to see whether our device triggered
>>>> + * the interrupt.
>>>> + */
>>>> + *irq_status = (cmd_status_dword >> 16) & PCI_STATUS_INTERRUPT;
>>>> + if (*irq_status == 0)
>>>> + goto done;
>>>> + }
>>>> +
>>>> + newcmd = origcmd & ~PCI_COMMAND_INTX_DISABLE;
>>>> + if (mask)
>>>> + newcmd |= PCI_COMMAND_INTX_DISABLE;
>>>> + if (newcmd != origcmd)
>>>> + pci_write_config_word(dev, PCI_COMMAND, newcmd);
>>>> +
>>>> +done:
>>>> + pci_unblock_user_cfg_access(dev);
>>>> +}
>>>> +
>>>
>>> Let's return irq_status instead of returning through a pointer?
>>> Will save a branch and generally make the code a bit cleaner.
>>
>> I'm open for a better API suggestion,
>
> Maybe use separate functions for this.
> pci_2_3_mask_irq
> pci_2_3_unmask_irq
> pci_2_3_disable_irq
>
> Common code can go into subfunctions.
>
>> but the current one goes like
>> this: if irq_status is non-null, only mask the IRQ if the status bit
>> indicates an interrupt. But we also have a user that wants to mask
>> unconditionally.
>
> Why do you ever want to do that?
During device shutdown (was disable_irq in the unshared case so far).
The alternative would be to reset the device first, clearing any
potentially pending events. If we can reorder the reset, that is likely
better.
>
>>>
>>>> static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
>>>> {
>>>> struct kvm_assigned_dev_kernel *assigned_dev =
>>>> (struct kvm_assigned_dev_kernel *) dev_id;
>>>> + int ret = IRQ_HANDLED;
>>>> unsigned long flags;
>>>>
>>>> spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
>>>> @@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
>>>> guest_entries[i].vector, 1);
>>>> }
>>>> } else {
>>>> - kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
>>>> - assigned_dev->guest_irq, 1);
>>>> -
>>>> if (assigned_dev->irq_requested_type &
>>>> KVM_DEV_IRQ_GUEST_INTX) {
>>>> - disable_irq_nosync(irq);
>>>> + if (assigned_dev->pci_2_3) {
>>>> + unsigned int irq_status;
>>>> +
>>>> + if (assigned_dev->host_irq_disabled) {
>>>> + ret = IRQ_NONE;
>>>> + goto out;
>>>> + }
>>>> +
>>>> + pci_2_3_mask_irq(assigned_dev->dev, 1,
>>>> + &irq_status);
>>>> + if (irq_status == 0) {
>>>> + ret = IRQ_NONE;
>>>> + goto out;
>>>> + }
>>>
>>>
>>> This will be cleaner if pci_2_3_mask_irq returns irqreturn_t.
>>>
>>>> + } else
>>>> + disable_irq_nosync(irq);
>>>> assigned_dev->host_irq_disabled = true;
>>>> }
>>>> +
>>>> + kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
>>>> + assigned_dev->guest_irq, 1);
>>>> }
>>>>
>>>> out:
>>>> spin_unlock_irqrestore(&assigned_dev->assigned_dev_lock, flags);
>>>> - return IRQ_HANDLED;
>>>> + return ret;
>>>> }
>>>>
>>>> /* Ack the irq line for an assigned device */
>>>> @@ -117,7 +218,10 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
>>>> */
>>>> spin_lock_irqsave(&dev->assigned_dev_lock, flags);
>>>> if (dev->host_irq_disabled) {
>>>> - enable_irq(dev->host_irq);
>>>> + if (dev->pci_2_3)
>>>> + pci_2_3_mask_irq(dev->dev, 0, NULL);
>>>> + else
>>>> + enable_irq(dev->host_irq);
>>>> dev->host_irq_disabled = false;
>>>
>>> So what happens here is that if interrupt is still pending
>>> we will set level to 0, then get another interrupt from device
>>> which will set it back to 1. An obvious optimization is avoid
>>> all this, check pending bit and just keep level at 1.
>>
>> Isn't this an unrelated optimization, independent of this patch? But
>> I'll think about it. What pending bit are you referring to?
>
> The one in PCI_STATUS.
Ah, OK.
>
>>>
>>>> }
>>>> spin_unlock_irqrestore(&dev->assigned_dev_lock, flags);
>>>> @@ -166,7 +270,11 @@ static void deassign_host_irq(struct kvm *kvm,
>>>> pci_disable_msix(assigned_dev->dev);
>>>> } else {
>>>> /* Deal with MSI and INTx */
>>>> - disable_irq(assigned_dev->host_irq);
>>>> + if (assigned_dev->pci_2_3) {
>>>> + pci_2_3_mask_irq(assigned_dev->dev, 1, NULL);
>>>> + synchronize_irq(assigned_dev->host_irq);
>>>> + } else
>>>> + disable_irq(assigned_dev->host_irq);
>>>>
>>>> free_irq(assigned_dev->host_irq, (void *)assigned_dev);
>>>>
>>>> @@ -214,6 +322,13 @@ static void kvm_free_assigned_device(struct kvm *kvm,
>>>>
>>>> pci_reset_function(assigned_dev->dev);
>>>>
>>>> + /*
>>>> + * Unmask the IRQ at PCI level once the reset is done - the next user
>>>> + * may not expect the IRQ being masked.
>>>> + */
>>>> + if (assigned_dev->pci_2_3)
>>>> + pci_2_3_mask_irq(assigned_dev->dev, 0, NULL);
>>>> +
>>>> pci_release_regions(assigned_dev->dev);
>>>> pci_disable_device(assigned_dev->dev);
>>>> pci_dev_put(assigned_dev->dev);
>>>> @@ -239,15 +354,26 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
>>>> static int assigned_device_enable_host_intx(struct kvm *kvm,
>>>> struct kvm_assigned_dev_kernel *dev)
>>>> {
>>>> + unsigned long flags = 0;
>>>> +
>>>> dev->host_irq = dev->dev->irq;
>>>> - /* Even though this is PCI, we don't want to use shared
>>>> - * interrupts. Sharing host devices with guest-assigned devices
>>>> - * on the same interrupt line is not a happy situation: there
>>>> - * are going to be long delays in accepting, acking, etc.
>>>> +
>>>> + /*
>>>> + * We can only share the IRQ line with other host devices if we are
>>>> + * able to disable the IRQ source at device-level - independently of
>>>> + * the guest driver. Otherwise host devices may suffer from unbounded
>>>> + * IRQ latencies when the guest keeps the line asserted.
>>>> */
>>>> + dev->pci_2_3 = pci_2_3_supported(dev->dev);
>>>> + if (dev->pci_2_3)
>>>> + flags = IRQF_SHARED;
>>>> +
>>>> if (request_irq(dev->host_irq, kvm_assigned_dev_intr,
>>>> - 0, "kvm_assigned_intx_device", (void *)dev))
>>>> + flags, "kvm_assigned_intx_device", (void *)dev))
>>>> return -EIO;
>>>> +
>>>> + if (dev->pci_2_3)
>>>> + pci_2_3_mask_irq(dev->dev, 0, NULL);
>>>> return 0;
>>>> }
>>>>
>>>
>>> Let's reverse the logic and try non-shared first, 2.3 is that fails?
>>> This way we are backwards compatible ...
>>
>> Compatible with what?
>
> With the status quo :)
There is no incompatibility except for a potential slow-down of a path
that was often usable (exclusive legacy interrupts belong to a very rare
species).
>
>> I thought about trying non-shared IRQs first, but that would break host
>> devices arriving later - including other VMs that want to pass a device
>> sitting on the same IRQ line. It's better (from management POV) to have
>> IRQF_SHARED available.
>
> OTOH non-shared might be faster as we don't need to do
> config writes/reads on data path ... Add a knob for management
> to control this?
Depends on how fast config writes actually are. I know they were slow
(up to awfully slow if bios was involved) on old hardware, but does this
still apply? I mean, a config knob would involve the whole stack, so it
should be worth that effort.
Jan
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 1/3] KVM: Fold assigned interrupt work into IRQ handler
2010-11-01 14:08 ` [PATCH 1/3] KVM: Fold assigned interrupt work into IRQ handler Jan Kiszka
@ 2010-11-01 16:34 ` Jan Kiszka
0 siblings, 0 replies; 10+ messages in thread
From: Jan Kiszka @ 2010-11-01 16:34 UTC (permalink / raw)
To: Avi Kivity, Marcelo Tosatti; +Cc: kvm, Alex Williamson, Michael S. Tsirkin
[-- Attachment #1: Type: text/plain, Size: 6308 bytes --]
Am 01.11.2010 15:08, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
>
> The complete work handler runs with assigned_dev_lock acquired and
> interrupts disabled, so there is nothing to gain pushing this work out
> of the actually IRQ handler. Fold them together.
Err, forget it. kvm_set_irq pulls in the famous pic lock, and that one
is not prepared to be called from IRQ context (lockdep just complained).
I will try to clean this up via a threaded IRQ handler, maybe using the
slow-path only for INTx-type IRQs and pushing MSIs directly from the
hard handler.
Jan
>
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
> include/linux/kvm_host.h | 1 -
> virt/kvm/assigned-dev.c | 71 +++++++++++++++-------------------------------
> 2 files changed, 23 insertions(+), 49 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 462b982..df5497f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -465,7 +465,6 @@ struct kvm_guest_msix_entry {
>
> struct kvm_assigned_dev_kernel {
> struct kvm_irq_ack_notifier ack_notifier;
> - struct work_struct interrupt_work;
> struct list_head list;
> int assigned_dev_id;
> int host_segnr;
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index 7c98928..5c1b56a 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -55,18 +55,24 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
> return index;
> }
>
> -static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
> +static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
> {
> - struct kvm_assigned_dev_kernel *assigned_dev;
> - int i;
> -
> - assigned_dev = container_of(work, struct kvm_assigned_dev_kernel,
> - interrupt_work);
> + struct kvm_assigned_dev_kernel *assigned_dev =
> + (struct kvm_assigned_dev_kernel *) dev_id;
> + unsigned long flags;
>
> - spin_lock_irq(&assigned_dev->assigned_dev_lock);
> + spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
> if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
> struct kvm_guest_msix_entry *guest_entries =
> assigned_dev->guest_msix_entries;
> + int index = find_index_from_host_irq(assigned_dev, irq);
> + int i;
> +
> + if (index < 0)
> + goto out;
> +
> + guest_entries[index].flags |= KVM_ASSIGNED_MSIX_PENDING;
> +
> for (i = 0; i < assigned_dev->entries_nr; i++) {
> if (!(guest_entries[i].flags &
> KVM_ASSIGNED_MSIX_PENDING))
> @@ -76,33 +82,15 @@ static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
> assigned_dev->irq_source_id,
> guest_entries[i].vector, 1);
> }
> - } else
> + } else {
> kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> assigned_dev->guest_irq, 1);
>
> - spin_unlock_irq(&assigned_dev->assigned_dev_lock);
> -}
> -
> -static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
> -{
> - unsigned long flags;
> - struct kvm_assigned_dev_kernel *assigned_dev =
> - (struct kvm_assigned_dev_kernel *) dev_id;
> -
> - spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
> - if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
> - int index = find_index_from_host_irq(assigned_dev, irq);
> - if (index < 0)
> - goto out;
> - assigned_dev->guest_msix_entries[index].flags |=
> - KVM_ASSIGNED_MSIX_PENDING;
> - }
> -
> - schedule_work(&assigned_dev->interrupt_work);
> -
> - if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_GUEST_INTX) {
> - disable_irq_nosync(irq);
> - assigned_dev->host_irq_disabled = true;
> + if (assigned_dev->irq_requested_type &
> + KVM_DEV_IRQ_GUEST_INTX) {
> + disable_irq_nosync(irq);
> + assigned_dev->host_irq_disabled = true;
> + }
> }
>
> out:
> @@ -147,33 +135,23 @@ static void deassign_guest_irq(struct kvm *kvm,
> assigned_dev->irq_requested_type &= ~(KVM_DEV_IRQ_GUEST_MASK);
> }
>
> -/* The function implicit hold kvm->lock mutex due to cancel_work_sync() */
> static void deassign_host_irq(struct kvm *kvm,
> struct kvm_assigned_dev_kernel *assigned_dev)
> {
> /*
> - * In kvm_free_device_irq, cancel_work_sync return true if:
> - * 1. work is scheduled, and then cancelled.
> - * 2. work callback is executed.
> - *
> - * The first one ensured that the irq is disabled and no more events
> - * would happen. But for the second one, the irq may be enabled (e.g.
> - * for MSI). So we disable irq here to prevent further events.
> + * We disable irq here to prevent further events.
> *
> * Notice this maybe result in nested disable if the interrupt type is
> * INTx, but it's OK for we are going to free it.
> *
> * If this function is a part of VM destroy, please ensure that till
> * now, the kvm state is still legal for probably we also have to wait
> - * interrupt_work done.
> + * on a currently running IRQ handler.
> */
> if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
> int i;
> for (i = 0; i < assigned_dev->entries_nr; i++)
> - disable_irq_nosync(assigned_dev->
> - host_msix_entries[i].vector);
> -
> - cancel_work_sync(&assigned_dev->interrupt_work);
> + disable_irq(assigned_dev->host_msix_entries[i].vector);
>
> for (i = 0; i < assigned_dev->entries_nr; i++)
> free_irq(assigned_dev->host_msix_entries[i].vector,
> @@ -185,8 +163,7 @@ static void deassign_host_irq(struct kvm *kvm,
> pci_disable_msix(assigned_dev->dev);
> } else {
> /* Deal with MSI and INTx */
> - disable_irq_nosync(assigned_dev->host_irq);
> - cancel_work_sync(&assigned_dev->interrupt_work);
> + disable_irq(assigned_dev->host_irq);
>
> free_irq(assigned_dev->host_irq, (void *)assigned_dev);
>
> @@ -558,8 +535,6 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> match->irq_source_id = -1;
> match->kvm = kvm;
> match->ack_notifier.irq_acked = kvm_assigned_dev_ack_irq;
> - INIT_WORK(&match->interrupt_work,
> - kvm_assigned_dev_interrupt_work_handler);
>
> list_add(&match->list, &kvm->arch.assigned_dev_head);
>
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 259 bytes --]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
2010-11-01 16:30 ` Jan Kiszka
@ 2010-11-01 17:06 ` Michael S. Tsirkin
0 siblings, 0 replies; 10+ messages in thread
From: Michael S. Tsirkin @ 2010-11-01 17:06 UTC (permalink / raw)
To: Jan Kiszka; +Cc: Avi Kivity, Marcelo Tosatti, kvm, Alex Williamson, Jan Kiszka
On Mon, Nov 01, 2010 at 05:30:20PM +0100, Jan Kiszka wrote:
> Am 01.11.2010 16:52, Michael S. Tsirkin wrote:
> > On Mon, Nov 01, 2010 at 04:41:08PM +0100, Jan Kiszka wrote:
> >> Am 01.11.2010 16:24, Michael S. Tsirkin wrote:
> >>> On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote:
> >>>> From: Jan Kiszka <jan.kiszka@siemens.com>
> >>>>
> >>>> PCI 2.3 allows to generically disable IRQ sources at device level. This
> >>>> enables us to share IRQs of such devices between on the host side when
> >>>> passing them to a guest.
> >>>>
> >>>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> >>>> ---
> >>>> include/linux/kvm_host.h | 1 +
> >>>> virt/kvm/assigned-dev.c | 153 +++++++++++++++++++++++++++++++++++++++++----
> >>>> 2 files changed, 140 insertions(+), 14 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >>>> index df5497f..fcdc849 100644
> >>>> --- a/include/linux/kvm_host.h
> >>>> +++ b/include/linux/kvm_host.h
> >>>> @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
> >>>> unsigned int entries_nr;
> >>>> int host_irq;
> >>>> bool host_irq_disabled;
> >>>> + bool pci_2_3;
> >>>> struct msix_entry *host_msix_entries;
> >>>> int guest_irq;
> >>>> struct kvm_guest_msix_entry *guest_msix_entries;
> >>>> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> >>>> index d3ddfea..411643c 100644
> >>>> --- a/virt/kvm/assigned-dev.c
> >>>> +++ b/virt/kvm/assigned-dev.c
> >>>> @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
> >>>> return index;
> >>>> }
> >>>>
> >>>> +/*
> >>>> + * Verify that the device supports Interrupt Disable bit in command register,
> >>>> + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
> >>>> + * in PCI 2.2.
> >>>> + */
> >>>> +static bool pci_2_3_supported(struct pci_dev *pdev)
> >>>> +{
> >>>> + u16 orig, new;
> >>>> + bool supported = false;
> >>>> +
> >>>> + pci_block_user_cfg_access(pdev);
> >>>> + pci_read_config_word(pdev, PCI_COMMAND, &orig);
> >>>> + pci_write_config_word(pdev, PCI_COMMAND,
> >>>> + orig ^ PCI_COMMAND_INTX_DISABLE);
> >>>> + pci_read_config_word(pdev, PCI_COMMAND, &new);
> >>>> +
> >>>> + /*
> >>>> + * There's no way to protect against
> >>>> + * hardware bugs or detect them reliably, but as long as we know
> >>>> + * what the value should be, let's go ahead and check it.
> >>>> + */
> >>>> + if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
> >>>> + dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
> >>>> + "driver or HW bug?\n", orig, new);
> >>>> + goto out;
> >>>> + }
> >>>> + if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
> >>>> + dev_warn(&pdev->dev, "Device does not support "
> >>>> + "disabling interrupts: unable to bind.\n");
> >>>> + goto out;
> >>>> + }
> >>>> + supported = true;
> >>>> +
> >>>> + /* Now restore the original value. */
> >>>> + pci_write_config_word(pdev, PCI_COMMAND, orig);
> >>>> +
> >>>> +out:
> >>>> + pci_unblock_user_cfg_access(pdev);
> >>>> + return supported;
> >>>> +}
> >>>> +
> >>>> +static void
> >>>> +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status)
> >>>> +{
> >>>> + u32 cmd_status_dword;
> >>>> + u16 origcmd, newcmd;
> >>>> +
> >>>> + /*
> >>>> + * We do a single dword read to retrieve both command and status.
> >>>> + * Document assumptions that make this possible.
> >>>> + */
> >>>> + BUILD_BUG_ON(PCI_COMMAND % 4);
> >>>> + BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
> >>>> +
> >>>> + pci_block_user_cfg_access(dev);
> >>>> +
> >>>> + /*
> >>>> + * Read both command and status registers in a single 32-bit operation.
> >>>> + * Note: we could cache the value for command and move the status read
> >>>> + * out of the lock if there was a way to get notified of user changes
> >>>> + * to command register through sysfs. Should be good for shared irqs.
> >>>> + */
> >>>> + pci_read_config_dword(dev, PCI_COMMAND, &cmd_status_dword);
> >>>> + origcmd = cmd_status_dword;
> >>>> +
> >>>> + if (irq_status) {
> >>>> + /*
> >>>> + * Check interrupt status register to see whether our device triggered
> >>>> + * the interrupt.
> >>>> + */
> >>>> + *irq_status = (cmd_status_dword >> 16) & PCI_STATUS_INTERRUPT;
> >>>> + if (*irq_status == 0)
> >>>> + goto done;
> >>>> + }
> >>>> +
> >>>> + newcmd = origcmd & ~PCI_COMMAND_INTX_DISABLE;
> >>>> + if (mask)
> >>>> + newcmd |= PCI_COMMAND_INTX_DISABLE;
> >>>> + if (newcmd != origcmd)
> >>>> + pci_write_config_word(dev, PCI_COMMAND, newcmd);
> >>>> +
> >>>> +done:
> >>>> + pci_unblock_user_cfg_access(dev);
> >>>> +}
> >>>> +
> >>>
> >>> Let's return irq_status instead of returning through a pointer?
> >>> Will save a branch and generally make the code a bit cleaner.
> >>
> >> I'm open for a better API suggestion,
> >
> > Maybe use separate functions for this.
> > pci_2_3_mask_irq
> > pci_2_3_unmask_irq
> > pci_2_3_disable_irq
> >
> > Common code can go into subfunctions.
> >
> >> but the current one goes like
> >> this: if irq_status is non-null, only mask the IRQ if the status bit
> >> indicates an interrupt. But we also have a user that wants to mask
> >> unconditionally.
> >
> > Why do you ever want to do that?
>
> During device shutdown (was disable_irq in the unshared case so far).
> The alternative would be to reset the device first, clearing any
> potentially pending events. If we can reorder the reset, that is likely
> better.
Not sure I understand completely, but sounds good.
> >
> >>>
> >>>> static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
> >>>> {
> >>>> struct kvm_assigned_dev_kernel *assigned_dev =
> >>>> (struct kvm_assigned_dev_kernel *) dev_id;
> >>>> + int ret = IRQ_HANDLED;
> >>>> unsigned long flags;
> >>>>
> >>>> spin_lock_irqsave(&assigned_dev->assigned_dev_lock, flags);
> >>>> @@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
> >>>> guest_entries[i].vector, 1);
> >>>> }
> >>>> } else {
> >>>> - kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> >>>> - assigned_dev->guest_irq, 1);
> >>>> -
> >>>> if (assigned_dev->irq_requested_type &
> >>>> KVM_DEV_IRQ_GUEST_INTX) {
> >>>> - disable_irq_nosync(irq);
> >>>> + if (assigned_dev->pci_2_3) {
> >>>> + unsigned int irq_status;
> >>>> +
> >>>> + if (assigned_dev->host_irq_disabled) {
> >>>> + ret = IRQ_NONE;
> >>>> + goto out;
> >>>> + }
> >>>> +
> >>>> + pci_2_3_mask_irq(assigned_dev->dev, 1,
> >>>> + &irq_status);
> >>>> + if (irq_status == 0) {
> >>>> + ret = IRQ_NONE;
> >>>> + goto out;
> >>>> + }
> >>>
> >>>
> >>> This will be cleaner if pci_2_3_mask_irq returns irqreturn_t.
> >>>
> >>>> + } else
> >>>> + disable_irq_nosync(irq);
> >>>> assigned_dev->host_irq_disabled = true;
> >>>> }
> >>>> +
> >>>> + kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> >>>> + assigned_dev->guest_irq, 1);
> >>>> }
> >>>>
> >>>> out:
> >>>> spin_unlock_irqrestore(&assigned_dev->assigned_dev_lock, flags);
> >>>> - return IRQ_HANDLED;
> >>>> + return ret;
> >>>> }
> >>>>
> >>>> /* Ack the irq line for an assigned device */
> >>>> @@ -117,7 +218,10 @@ static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
> >>>> */
> >>>> spin_lock_irqsave(&dev->assigned_dev_lock, flags);
> >>>> if (dev->host_irq_disabled) {
> >>>> - enable_irq(dev->host_irq);
> >>>> + if (dev->pci_2_3)
> >>>> + pci_2_3_mask_irq(dev->dev, 0, NULL);
> >>>> + else
> >>>> + enable_irq(dev->host_irq);
> >>>> dev->host_irq_disabled = false;
> >>>
> >>> So what happens here is that if interrupt is still pending
> >>> we will set level to 0, then get another interrupt from device
> >>> which will set it back to 1. An obvious optimization is avoid
> >>> all this, check pending bit and just keep level at 1.
> >>
> >> Isn't this an unrelated optimization, independent of this patch? But
> >> I'll think about it. What pending bit are you referring to?
> >
> > The one in PCI_STATUS.
>
> Ah, OK.
>
> >
> >>>
> >>>> }
> >>>> spin_unlock_irqrestore(&dev->assigned_dev_lock, flags);
> >>>> @@ -166,7 +270,11 @@ static void deassign_host_irq(struct kvm *kvm,
> >>>> pci_disable_msix(assigned_dev->dev);
> >>>> } else {
> >>>> /* Deal with MSI and INTx */
> >>>> - disable_irq(assigned_dev->host_irq);
> >>>> + if (assigned_dev->pci_2_3) {
> >>>> + pci_2_3_mask_irq(assigned_dev->dev, 1, NULL);
> >>>> + synchronize_irq(assigned_dev->host_irq);
> >>>> + } else
> >>>> + disable_irq(assigned_dev->host_irq);
> >>>>
> >>>> free_irq(assigned_dev->host_irq, (void *)assigned_dev);
> >>>>
> >>>> @@ -214,6 +322,13 @@ static void kvm_free_assigned_device(struct kvm *kvm,
> >>>>
> >>>> pci_reset_function(assigned_dev->dev);
> >>>>
> >>>> + /*
> >>>> + * Unmask the IRQ at PCI level once the reset is done - the next user
> >>>> + * may not expect the IRQ being masked.
> >>>> + */
> >>>> + if (assigned_dev->pci_2_3)
> >>>> + pci_2_3_mask_irq(assigned_dev->dev, 0, NULL);
> >>>> +
> >>>> pci_release_regions(assigned_dev->dev);
> >>>> pci_disable_device(assigned_dev->dev);
> >>>> pci_dev_put(assigned_dev->dev);
> >>>> @@ -239,15 +354,26 @@ void kvm_free_all_assigned_devices(struct kvm *kvm)
> >>>> static int assigned_device_enable_host_intx(struct kvm *kvm,
> >>>> struct kvm_assigned_dev_kernel *dev)
> >>>> {
> >>>> + unsigned long flags = 0;
> >>>> +
> >>>> dev->host_irq = dev->dev->irq;
> >>>> - /* Even though this is PCI, we don't want to use shared
> >>>> - * interrupts. Sharing host devices with guest-assigned devices
> >>>> - * on the same interrupt line is not a happy situation: there
> >>>> - * are going to be long delays in accepting, acking, etc.
> >>>> +
> >>>> + /*
> >>>> + * We can only share the IRQ line with other host devices if we are
> >>>> + * able to disable the IRQ source at device-level - independently of
> >>>> + * the guest driver. Otherwise host devices may suffer from unbounded
> >>>> + * IRQ latencies when the guest keeps the line asserted.
> >>>> */
> >>>> + dev->pci_2_3 = pci_2_3_supported(dev->dev);
> >>>> + if (dev->pci_2_3)
> >>>> + flags = IRQF_SHARED;
> >>>> +
> >>>> if (request_irq(dev->host_irq, kvm_assigned_dev_intr,
> >>>> - 0, "kvm_assigned_intx_device", (void *)dev))
> >>>> + flags, "kvm_assigned_intx_device", (void *)dev))
> >>>> return -EIO;
> >>>> +
> >>>> + if (dev->pci_2_3)
> >>>> + pci_2_3_mask_irq(dev->dev, 0, NULL);
> >>>> return 0;
> >>>> }
> >>>>
> >>>
> >>> Let's reverse the logic and try non-shared first, 2.3 is that fails?
> >>> This way we are backwards compatible ...
> >>
> >> Compatible with what?
> >
> > With the status quo :)
>
> There is no incompatibility except for a potential slow-down of a path
> that was often usable (exclusive legacy interrupts belong to a very rare
> species).
That's already an issue if it's real. I think there might also be an
issue if guest accesses command/status itself.
> >
> >> I thought about trying non-shared IRQs first, but that would break host
> >> devices arriving later - including other VMs that want to pass a device
> >> sitting on the same IRQ line. It's better (from management POV) to have
> >> IRQF_SHARED available.
> >
> > OTOH non-shared might be faster as we don't need to do
> > config writes/reads on data path ... Add a knob for management
> > to control this?
>
> Depends on how fast config writes actually are. I know they were slow
> (up to awfully slow if bios was involved) on old hardware, but does this
> still apply? I mean, a config knob would involve the whole stack, so it
> should be worth that effort.
>
> Jan
>
You tell me :)
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2010-11-01 17:06 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-01 14:08 [PATCH 0/3] KVM: Improve IRQ assignment for device passthrough Jan Kiszka
2010-11-01 14:08 ` [PATCH 1/3] KVM: Fold assigned interrupt work into IRQ handler Jan Kiszka
2010-11-01 16:34 ` Jan Kiszka
2010-11-01 14:08 ` [PATCH 2/3] KVM: Clear assigned guest IRQ on release Jan Kiszka
2010-11-01 14:08 ` [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices Jan Kiszka
2010-11-01 15:24 ` Michael S. Tsirkin
2010-11-01 15:41 ` Jan Kiszka
2010-11-01 15:52 ` Michael S. Tsirkin
2010-11-01 16:30 ` Jan Kiszka
2010-11-01 17:06 ` Michael S. Tsirkin
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).