public inbox for kvm@vger.kernel.org
* [PATCH 0/7 v4] MSI-X mask support for assigned device
@ 2010-11-11  7:46 Sheng Yang
  2010-11-11  7:46 ` [PATCH 1/7] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
                   ` (6 more replies)
  0 siblings, 7 replies; 22+ messages in thread
From: Sheng Yang @ 2010-11-11  7:46 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin; +Cc: kvm, Sheng Yang

Change from v3:
1. Redesign the userspace API.
2. Add big-endian support for msix_mmio_read/write() (untested!).

Change from v2:
1. Move all mask handling into the kernel; userspace has to access it through the API.
2. Discard the userspace mask bit operation API.

Sheng Yang (7):
  PCI: MSI: Move MSI-X entry definition to pci_regs.h
  PCI: Add mask bit definition for MSI-X table
  KVM: Move struct kvm_io_device to kvm_host.h
  KVM: Add kvm_get_irq_routing_entry() func
  KVM: assigned dev: Clean up assigned_device's flag
  KVM: assigned dev: MSI-X mask support
  KVM: assigned dev: Big endian support for MSI-X MMIO

 arch/x86/kvm/x86.c       |    1 +
 drivers/pci/msi.c        |    5 +-
 drivers/pci/msi.h        |    6 -
 include/linux/kvm.h      |   32 +++++
 include/linux/kvm_host.h |   31 +++++
 include/linux/pci_regs.h |    8 +
 virt/kvm/assigned-dev.c  |  324 +++++++++++++++++++++++++++++++++++++++++++++-
 virt/kvm/iodev.h         |   25 +----
 virt/kvm/irq_comm.c      |   20 +++
 9 files changed, 416 insertions(+), 36 deletions(-)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/7] PCI: MSI: Move MSI-X entry definition to pci_regs.h
  2010-11-11  7:46 [PATCH 0/7 v4] MSI-X mask support for assigned device Sheng Yang
@ 2010-11-11  7:46 ` Sheng Yang
  2010-11-11  7:46 ` [PATCH 2/7] PCI: Add mask bit definition for MSI-X table Sheng Yang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Sheng Yang @ 2010-11-11  7:46 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin
  Cc: kvm, Sheng Yang, Jesse Barnes, linux-pci

Then it can be used by others.

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 drivers/pci/msi.h        |    6 ------
 include/linux/pci_regs.h |    7 +++++++
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/msi.h b/drivers/pci/msi.h
index de27c1c..28a3c52 100644
--- a/drivers/pci/msi.h
+++ b/drivers/pci/msi.h
@@ -6,12 +6,6 @@
 #ifndef MSI_H
 #define MSI_H
 
-#define PCI_MSIX_ENTRY_SIZE		16
-#define  PCI_MSIX_ENTRY_LOWER_ADDR	0
-#define  PCI_MSIX_ENTRY_UPPER_ADDR	4
-#define  PCI_MSIX_ENTRY_DATA		8
-#define  PCI_MSIX_ENTRY_VECTOR_CTRL	12
-
 #define msi_control_reg(base)		(base + PCI_MSI_FLAGS)
 #define msi_lower_address_reg(base)	(base + PCI_MSI_ADDRESS_LO)
 #define msi_upper_address_reg(base)	(base + PCI_MSI_ADDRESS_HI)
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 455b9cc..acfc224 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -307,6 +307,13 @@
 #define  PCI_MSIX_FLAGS_MASKALL	(1 << 14)
 #define PCI_MSIX_FLAGS_BIRMASK	(7 << 0)
 
+/* MSI-X entry's format */
+#define PCI_MSIX_ENTRY_SIZE		16
+#define  PCI_MSIX_ENTRY_LOWER_ADDR	0
+#define  PCI_MSIX_ENTRY_UPPER_ADDR	4
+#define  PCI_MSIX_ENTRY_DATA		8
+#define  PCI_MSIX_ENTRY_VECTOR_CTRL	12
+
 /* CompactPCI Hotswap Register */
 
 #define PCI_CHSWP_CSR		2	/* Control and Status Register */
-- 
1.7.0.1



* [PATCH 2/7] PCI: Add mask bit definition for MSI-X table
  2010-11-11  7:46 [PATCH 0/7 v4] MSI-X mask support for assigned device Sheng Yang
  2010-11-11  7:46 ` [PATCH 1/7] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
@ 2010-11-11  7:46 ` Sheng Yang
  2010-11-11 17:29   ` Jesse Barnes
  2010-11-11  7:46 ` [PATCH 3/7] KVM: Move struct kvm_io_device to kvm_host.h Sheng Yang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 22+ messages in thread
From: Sheng Yang @ 2010-11-11  7:46 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin
  Cc: kvm, Sheng Yang, Matthew Wilcox, Jesse Barnes, linux-pci

Then we can use it instead of the magic number 1.

Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 drivers/pci/msi.c        |    5 +++--
 include/linux/pci_regs.h |    1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 69b7be3..095634e 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -158,8 +158,9 @@ static u32 __msix_mask_irq(struct msi_desc *desc, u32 flag)
 	u32 mask_bits = desc->masked;
 	unsigned offset = desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
 						PCI_MSIX_ENTRY_VECTOR_CTRL;
-	mask_bits &= ~1;
-	mask_bits |= flag;
+	mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
+	if (flag)
+		mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
 	writel(mask_bits, desc->mask_base + offset);
 
 	return mask_bits;
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index acfc224..ff51632 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -313,6 +313,7 @@
 #define  PCI_MSIX_ENTRY_UPPER_ADDR	4
 #define  PCI_MSIX_ENTRY_DATA		8
 #define  PCI_MSIX_ENTRY_VECTOR_CTRL	12
+#define   PCI_MSIX_ENTRY_CTRL_MASKBIT	1
 
 /* CompactPCI Hotswap Register */
 
-- 
1.7.0.1


* [PATCH 3/7] KVM: Move struct kvm_io_device to kvm_host.h
  2010-11-11  7:46 [PATCH 0/7 v4] MSI-X mask support for assigned device Sheng Yang
  2010-11-11  7:46 ` [PATCH 1/7] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
  2010-11-11  7:46 ` [PATCH 2/7] PCI: Add mask bit definition for MSI-X table Sheng Yang
@ 2010-11-11  7:46 ` Sheng Yang
  2010-11-11  7:46 ` [PATCH 4/7] KVM: Add kvm_get_irq_routing_entry() func Sheng Yang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Sheng Yang @ 2010-11-11  7:46 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin; +Cc: kvm, Sheng Yang

Then it can be used in struct kvm_assigned_dev_kernel later.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 include/linux/kvm_host.h |   23 +++++++++++++++++++++++
 virt/kvm/iodev.h         |   25 +------------------------
 2 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0b89d00..7f9e4b7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -74,6 +74,29 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 			      struct kvm_io_device *dev);
 
+struct kvm_io_device;
+
+/**
+ * kvm_io_device_ops are called under kvm slots_lock.
+ * read and write handlers return 0 if the transaction has been handled,
+ * or non-zero to have it passed to the next device.
+ **/
+struct kvm_io_device_ops {
+	int (*read)(struct kvm_io_device *this,
+		    gpa_t addr,
+		    int len,
+		    void *val);
+	int (*write)(struct kvm_io_device *this,
+		     gpa_t addr,
+		     int len,
+		     const void *val);
+	void (*destructor)(struct kvm_io_device *this);
+};
+
+struct kvm_io_device {
+	const struct kvm_io_device_ops *ops;
+};
+
 struct kvm_vcpu {
 	struct kvm *kvm;
 #ifdef CONFIG_PREEMPT_NOTIFIERS
diff --git a/virt/kvm/iodev.h b/virt/kvm/iodev.h
index 12fd3ca..d1f5651 100644
--- a/virt/kvm/iodev.h
+++ b/virt/kvm/iodev.h
@@ -17,32 +17,9 @@
 #define __KVM_IODEV_H__
 
 #include <linux/kvm_types.h>
+#include <linux/kvm_host.h>
 #include <asm/errno.h>
 
-struct kvm_io_device;
-
-/**
- * kvm_io_device_ops are called under kvm slots_lock.
- * read and write handlers return 0 if the transaction has been handled,
- * or non-zero to have it passed to the next device.
- **/
-struct kvm_io_device_ops {
-	int (*read)(struct kvm_io_device *this,
-		    gpa_t addr,
-		    int len,
-		    void *val);
-	int (*write)(struct kvm_io_device *this,
-		     gpa_t addr,
-		     int len,
-		     const void *val);
-	void (*destructor)(struct kvm_io_device *this);
-};
-
-
-struct kvm_io_device {
-	const struct kvm_io_device_ops *ops;
-};
-
 static inline void kvm_iodevice_init(struct kvm_io_device *dev,
 				     const struct kvm_io_device_ops *ops)
 {
-- 
1.7.0.1



* [PATCH 4/7] KVM: Add kvm_get_irq_routing_entry() func
  2010-11-11  7:46 [PATCH 0/7 v4] MSI-X mask support for assigned device Sheng Yang
                   ` (2 preceding siblings ...)
  2010-11-11  7:46 ` [PATCH 3/7] KVM: Move struct kvm_io_device to kvm_host.h Sheng Yang
@ 2010-11-11  7:46 ` Sheng Yang
  2010-11-11  7:46 ` [PATCH 5/7] KVM: assigned dev: Clean up assigned_device's flag Sheng Yang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 22+ messages in thread
From: Sheng Yang @ 2010-11-11  7:46 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin; +Cc: kvm, Sheng Yang

We need to query the entry later.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 include/linux/kvm_host.h |    2 ++
 virt/kvm/irq_comm.c      |   20 ++++++++++++++++++++
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7f9e4b7..e2ecbac 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -614,6 +614,8 @@ int kvm_set_irq_routing(struct kvm *kvm,
 			const struct kvm_irq_routing_entry *entries,
 			unsigned nr,
 			unsigned flags);
+int kvm_get_irq_routing_entry(struct kvm *kvm, int gsi,
+		struct kvm_kernel_irq_routing_entry *entry);
 void kvm_free_irq_routing(struct kvm *kvm);
 
 #else
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 8edca91..ae1dc7c 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -421,6 +421,26 @@ out:
 	return r;
 }
 
+int kvm_get_irq_routing_entry(struct kvm *kvm, int gsi,
+		struct kvm_kernel_irq_routing_entry *entry)
+{
+	int count = 0;
+	struct kvm_kernel_irq_routing_entry *ei = NULL;
+	struct kvm_irq_routing_table *irq_rt;
+	struct hlist_node *n;
+
+	rcu_read_lock();
+	irq_rt = rcu_dereference(kvm->irq_routing);
+	if (gsi < irq_rt->nr_rt_entries)
+		hlist_for_each_entry(ei, n, &irq_rt->map[gsi], link)
+			count++;
+	if (count == 1)
+		*entry = *ei;
+	rcu_read_unlock();
+
+	return (count != 1);
+}
+
 #define IOAPIC_ROUTING_ENTRY(irq) \
 	{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,	\
 	  .u.irqchip.irqchip = KVM_IRQCHIP_IOAPIC, .u.irqchip.pin = (irq) }
-- 
1.7.0.1



* [PATCH 5/7] KVM: assigned dev: Clean up assigned_device's flag
  2010-11-11  7:46 [PATCH 0/7 v4] MSI-X mask support for assigned device Sheng Yang
                   ` (3 preceding siblings ...)
  2010-11-11  7:46 ` [PATCH 4/7] KVM: Add kvm_get_irq_routing_entry() func Sheng Yang
@ 2010-11-11  7:46 ` Sheng Yang
  2010-11-11  7:46 ` [PATCH 6/7] KVM: assigned dev: MSI-X mask support Sheng Yang
  2010-11-11  7:47 ` [PATCH 7/7] KVM: assigned dev: Big endian support for MSI-X MMIO Sheng Yang
  6 siblings, 0 replies; 22+ messages in thread
From: Sheng Yang @ 2010-11-11  7:46 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin; +Cc: kvm, Sheng Yang

Reusing KVM_DEV_ASSIGN_ENABLE_IOMMU for an in-kernel struct didn't make much
sense.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 include/linux/kvm_host.h |    1 +
 virt/kvm/assigned-dev.c  |    7 ++++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e2ecbac..4b31539 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -444,6 +444,7 @@ struct kvm_guest_msix_entry {
 	u16 flags;
 };
 
+#define KVM_ASSIGNED_ENABLED_IOMMU		(1 << 0)
 struct kvm_assigned_dev_kernel {
 	struct kvm_irq_ack_notifier ack_notifier;
 	struct work_struct interrupt_work;
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 7c98928..5c6b96d 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -552,7 +552,8 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 	match->host_segnr = assigned_dev->segnr;
 	match->host_busnr = assigned_dev->busnr;
 	match->host_devfn = assigned_dev->devfn;
-	match->flags = assigned_dev->flags;
+	if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
+		match->flags |= KVM_ASSIGNED_ENABLED_IOMMU;
 	match->dev = dev;
 	spin_lock_init(&match->assigned_dev_lock);
 	match->irq_source_id = -1;
@@ -563,7 +564,7 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 
 	list_add(&match->list, &kvm->arch.assigned_dev_head);
 
-	if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU) {
+	if (assigned_dev->flags & KVM_ASSIGNED_ENABLED_IOMMU) {
 		if (!kvm->arch.iommu_domain) {
 			r = kvm_iommu_map_guest(kvm);
 			if (r)
@@ -609,7 +610,7 @@ static int kvm_vm_ioctl_deassign_device(struct kvm *kvm,
 		goto out;
 	}
 
-	if (match->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU)
+	if (match->flags & KVM_ASSIGNED_ENABLED_IOMMU)
 		kvm_deassign_device(kvm, match);
 
 	kvm_free_assigned_device(kvm, match);
-- 
1.7.0.1



* [PATCH 6/7] KVM: assigned dev: MSI-X mask support
  2010-11-11  7:46 [PATCH 0/7 v4] MSI-X mask support for assigned device Sheng Yang
                   ` (4 preceding siblings ...)
  2010-11-11  7:46 ` [PATCH 5/7] KVM: assigned dev: Clean up assigned_device's flag Sheng Yang
@ 2010-11-11  7:46 ` Sheng Yang
  2010-11-12  9:53   ` Michael S. Tsirkin
  2010-11-11  7:47 ` [PATCH 7/7] KVM: assigned dev: Big endian support for MSI-X MMIO Sheng Yang
  6 siblings, 1 reply; 22+ messages in thread
From: Sheng Yang @ 2010-11-11  7:46 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin; +Cc: kvm, Sheng Yang

This patch enables per-vector masking for assigned devices using MSI-X.

It provides two new APIs: one for userspace to specify the device's MSI-X
table address in guest MMIO space, the other for userspace to query the
state of a mask bit.

All mask bit operations are kept in the kernel in order to accelerate them.
Userspace shouldn't access the device MMIO directly for this information;
instead it should use the provided API.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 arch/x86/kvm/x86.c       |    1 +
 include/linux/kvm.h      |   32 +++++
 include/linux/kvm_host.h |    5 +
 virt/kvm/assigned-dev.c  |  316 +++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 353 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f3f86b2..847f1e1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_DEBUGREGS:
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
+	case KVM_CAP_MSIX_MASK:
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 919ae53..32cd244 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
 #endif
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
+#ifdef __KVM_HAVE_MSIX
+#define KVM_CAP_MSIX_MASK 59
+#endif
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -671,6 +674,9 @@ struct kvm_clock_data {
 #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
 #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b, struct kvm_clock_data)
 #define KVM_GET_CLOCK             _IOR(KVMIO,  0x7c, struct kvm_clock_data)
+/* Available with KVM_CAP_MSIX_MASK */
+#define KVM_GET_MSIX_ENTRY        _IOWR(KVMIO,  0x7d, struct kvm_msix_entry)
+#define KVM_UPDATE_MSIX_MMIO      _IOW(KVMIO,  0x7e, struct kvm_msix_mmio)
 /* Available with KVM_CAP_PIT_STATE2 */
 #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
 #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
@@ -794,4 +800,30 @@ struct kvm_assigned_msix_entry {
 	__u16 padding[3];
 };
 
+#define KVM_MSIX_TYPE_ASSIGNED_DEV	1
+
+#define KVM_MSIX_FLAG_MASKBIT		(1 << 0)
+#define KVM_MSIX_FLAG_QUERY_MASKBIT	(1 << 0)
+
+struct kvm_msix_entry {
+	__u32 id;
+	__u32 type;
+	__u32 entry; /* The index of entry in the MSI-X table */
+	__u32 flags;
+	__u32 query_flags;
+	__u32 reserved[5];
+};
+
+#define KVM_MSIX_MMIO_FLAG_REGISTER	(1 << 0)
+#define KVM_MSIX_MMIO_FLAG_UNREGISTER	(1 << 1)
+
+struct kvm_msix_mmio {
+	__u32 id;
+	__u32 type;
+	__u64 base_addr;
+	__u32 max_entries_nr;
+	__u32 flags;
+	__u32 reserved[6];
+};
+
 #endif /* __LINUX_KVM_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4b31539..9d49074 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -445,6 +445,7 @@ struct kvm_guest_msix_entry {
 };
 
 #define KVM_ASSIGNED_ENABLED_IOMMU		(1 << 0)
+#define KVM_ASSIGNED_ENABLED_MSIX_MMIO		(1 << 1)
 struct kvm_assigned_dev_kernel {
 	struct kvm_irq_ack_notifier ack_notifier;
 	struct work_struct interrupt_work;
@@ -465,6 +466,10 @@ struct kvm_assigned_dev_kernel {
 	struct pci_dev *dev;
 	struct kvm *kvm;
 	spinlock_t assigned_dev_lock;
+	DECLARE_BITMAP(msix_mask_bitmap, KVM_MAX_MSIX_PER_DEV);
+	gpa_t msix_mmio_base;
+	struct kvm_io_device msix_mmio_dev;
+	int msix_max_entries_nr;
 };
 
 struct kvm_irq_mask_notifier {
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 5c6b96d..3010d7d 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -226,12 +226,27 @@ static void kvm_free_assigned_irq(struct kvm *kvm,
 	kvm_deassign_irq(kvm, assigned_dev, assigned_dev->irq_requested_type);
 }
 
+static void unregister_msix_mmio(struct kvm *kvm,
+				 struct kvm_assigned_dev_kernel *adev)
+{
+	if (adev->flags & KVM_ASSIGNED_ENABLED_MSIX_MMIO) {
+		mutex_lock(&kvm->slots_lock);
+		kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
+				&adev->msix_mmio_dev);
+		mutex_unlock(&kvm->slots_lock);
+		adev->flags &= ~KVM_ASSIGNED_ENABLED_MSIX_MMIO;
+	}
+}
+
 static void kvm_free_assigned_device(struct kvm *kvm,
 				     struct kvm_assigned_dev_kernel
 				     *assigned_dev)
 {
 	kvm_free_assigned_irq(kvm, assigned_dev);
 
+#ifdef __KVM_HAVE_MSIX
+	unregister_msix_mmio(kvm, assigned_dev);
+#endif
 	pci_reset_function(assigned_dev->dev);
 
 	pci_release_regions(assigned_dev->dev);
@@ -504,7 +519,7 @@ out:
 static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 				      struct kvm_assigned_pci_dev *assigned_dev)
 {
-	int r = 0, idx;
+	int r = 0, idx, i;
 	struct kvm_assigned_dev_kernel *match;
 	struct pci_dev *dev;
 
@@ -564,6 +579,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
 
 	list_add(&match->list, &kvm->arch.assigned_dev_head);
 
+	/* The state after reset of MSI-X table is all masked */
+	for (i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
+		set_bit(i, match->msix_mask_bitmap);
+
 	if (assigned_dev->flags & KVM_ASSIGNED_ENABLED_IOMMU) {
 		if (!kvm->arch.iommu_domain) {
 			r = kvm_iommu_map_guest(kvm);
@@ -667,6 +686,43 @@ msix_nr_out:
 	return r;
 }
 
+static void update_msix_mask(struct kvm_assigned_dev_kernel *adev,
+			     int idx, bool new_mask_flag)
+{
+	int irq;
+	bool old_mask_flag, need_flush = false;
+
+	spin_lock_irq(&adev->assigned_dev_lock);
+
+	if (!adev->dev->msix_enabled ||
+	    !(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
+		goto out;
+
+	old_mask_flag = test_bit(adev->guest_msix_entries[idx].entry,
+			adev->msix_mask_bitmap);
+	if (old_mask_flag == new_mask_flag)
+		goto out;
+
+	irq = adev->host_msix_entries[idx].vector;
+	BUG_ON(irq == 0);
+
+	if (new_mask_flag) {
+		set_bit(adev->guest_msix_entries[idx].entry,
+				adev->msix_mask_bitmap);
+		disable_irq_nosync(irq);
+		need_flush = true;
+	} else {
+		clear_bit(adev->guest_msix_entries[idx].entry,
+				adev->msix_mask_bitmap);
+		enable_irq(irq);
+	}
+out:
+	spin_unlock_irq(&adev->assigned_dev_lock);
+
+	if (need_flush)
+		flush_work(&adev->interrupt_work);
+}
+
 static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
 				       struct kvm_assigned_msix_entry *entry)
 {
@@ -701,6 +757,233 @@ msix_entry_out:
 
 	return r;
 }
+
+static int kvm_vm_ioctl_get_msix_entry(struct kvm *kvm,
+				       struct kvm_msix_entry *entry)
+{
+	int r = 0;
+	struct kvm_assigned_dev_kernel *adev;
+
+	if (entry->type != KVM_MSIX_TYPE_ASSIGNED_DEV)
+		return -EINVAL;
+
+	if (!entry->query_flags)
+		return -EINVAL;
+
+	mutex_lock(&kvm->lock);
+
+	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+				      entry->id);
+
+	if (!adev) {
+		r = -EINVAL;
+		goto out;
+	}
+
+	if (entry->entry >= adev->msix_max_entries_nr) {
+		r = -ENOSPC;
+		goto out;
+	}
+
+	if (entry->query_flags & KVM_MSIX_FLAG_QUERY_MASKBIT) {
+		if (test_bit(entry->entry, adev->msix_mask_bitmap))
+			entry->flags |= KVM_MSIX_FLAG_MASKBIT;
+		else
+			entry->flags &= ~KVM_MSIX_FLAG_MASKBIT;
+	}
+
+out:
+	mutex_unlock(&kvm->lock);
+
+	return r;
+}
+
+static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
+			      gpa_t addr, int len)
+{
+	gpa_t start, end;
+
+	BUG_ON(adev->msix_mmio_base == 0);
+	start = adev->msix_mmio_base;
+	end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
+		adev->msix_max_entries_nr;
+	if (addr >= start && addr + len <= end)
+		return true;
+
+	return false;
+}
+
+static int msix_get_enabled_idx(struct kvm_assigned_dev_kernel *adev,
+			      gpa_t addr, int len)
+{
+	int i, index = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
+
+	for (i = 0; i < adev->entries_nr; i++)
+		if (adev->guest_msix_entries[i].entry == index)
+			return i;
+
+	return -EINVAL;
+}
+
+static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
+			  void *val)
+{
+	struct kvm_assigned_dev_kernel *adev =
+			container_of(this, struct kvm_assigned_dev_kernel,
+				     msix_mmio_dev);
+	int idx, r = 0;
+	u32 entry[4];
+	struct kvm_kernel_irq_routing_entry e;
+
+	mutex_lock(&adev->kvm->lock);
+	if (!msix_mmio_in_range(adev, addr, len)) {
+		r = -EOPNOTSUPP;
+		goto out;
+	}
+	if ((addr & 0x3) || len != 4)
+		goto out;
+
+	idx = msix_get_enabled_idx(adev, addr, len);
+	if (idx < 0) {
+		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
+		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
+				PCI_MSIX_ENTRY_VECTOR_CTRL)
+			*(unsigned long *)val =
+				test_bit(idx, adev->msix_mask_bitmap) ?
+				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
+		else
+			r = -EOPNOTSUPP;
+		goto out;
+	}
+
+	r = kvm_get_irq_routing_entry(adev->kvm,
+			adev->guest_msix_entries[idx].vector, &e);
+	if (r || e.type != KVM_IRQ_ROUTING_MSI) {
+		r = -EOPNOTSUPP;
+		goto out;
+	}
+	entry[0] = e.msi.address_lo;
+	entry[1] = e.msi.address_hi;
+	entry[2] = e.msi.data;
+	entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
+			adev->msix_mask_bitmap);
+	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry], len);
+
+out:
+	mutex_unlock(&adev->kvm->lock);
+	return r;
+}
+
+static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
+			   const void *val)
+{
+	struct kvm_assigned_dev_kernel *adev =
+			container_of(this, struct kvm_assigned_dev_kernel,
+				     msix_mmio_dev);
+	int idx, r = 0;
+	unsigned long new_val = *(unsigned long *)val;
+
+	mutex_lock(&adev->kvm->lock);
+	if (!msix_mmio_in_range(adev, addr, len)) {
+		r = -EOPNOTSUPP;
+		goto out;
+	}
+	if ((addr & 0x3) || len != 4)
+		goto out;
+
+	idx = msix_get_enabled_idx(adev, addr, len);
+	if (idx < 0) {
+		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
+		if (((addr % PCI_MSIX_ENTRY_SIZE) ==
+				PCI_MSIX_ENTRY_VECTOR_CTRL)) {
+			if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
+				goto out;
+			if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
+				set_bit(idx, adev->msix_mask_bitmap);
+			else
+				clear_bit(idx, adev->msix_mask_bitmap);
+			/* It's possible that we need re-enable MSI-X, so go
+			 * back to userspace */
+		}
+		/* Userspace would handle other MMIO writing */
+		r = -EOPNOTSUPP;
+		goto out;
+	}
+	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
+		r = -EOPNOTSUPP;
+		goto out;
+	}
+	if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
+		goto out;
+	update_msix_mask(adev, idx, !!(new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT));
+out:
+	mutex_unlock(&adev->kvm->lock);
+
+	return r;
+}
+
+static const struct kvm_io_device_ops msix_mmio_ops = {
+	.read     = msix_mmio_read,
+	.write    = msix_mmio_write,
+};
+
+static int kvm_vm_ioctl_update_msix_mmio(struct kvm *kvm,
+				struct kvm_msix_mmio *msix_mmio)
+{
+	int r = 0;
+	struct kvm_assigned_dev_kernel *adev;
+
+	if (msix_mmio->type != KVM_MSIX_TYPE_ASSIGNED_DEV)
+		return -EINVAL;
+
+	if (!msix_mmio->flags)
+		return -EINVAL;
+
+	mutex_lock(&kvm->lock);
+	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+				      msix_mmio->id);
+	if (!adev) {
+		r = -EINVAL;
+		goto out;
+	}
+	if (msix_mmio->base_addr == 0) {
+		r = -EINVAL;
+		goto out;
+	}
+	if (msix_mmio->max_entries_nr == 0 ||
+			msix_mmio->max_entries_nr > KVM_MAX_MSIX_PER_DEV) {
+		r = -EINVAL;
+		goto out;
+	}
+
+	if ((msix_mmio->flags & KVM_MSIX_MMIO_FLAG_REGISTER) &&
+			(msix_mmio->flags & KVM_MSIX_MMIO_FLAG_UNREGISTER)) {
+		r = -EINVAL;
+		goto out;
+	}
+
+	if (msix_mmio->flags & KVM_MSIX_MMIO_FLAG_REGISTER) {
+		if (!(adev->flags & KVM_ASSIGNED_ENABLED_MSIX_MMIO)) {
+			mutex_lock(&kvm->slots_lock);
+			kvm_iodevice_init(&adev->msix_mmio_dev,
+					&msix_mmio_ops);
+			r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
+					&adev->msix_mmio_dev);
+			if (!r)
+				adev->flags |= KVM_ASSIGNED_ENABLED_MSIX_MMIO;
+			mutex_unlock(&kvm->slots_lock);
+		}
+		if (!r) {
+			adev->msix_mmio_base = msix_mmio->base_addr;
+			adev->msix_max_entries_nr = msix_mmio->max_entries_nr;
+		}
+	} else if (msix_mmio->flags & KVM_MSIX_MMIO_FLAG_UNREGISTER)
+		unregister_msix_mmio(kvm, adev);
+out:
+	mutex_unlock(&kvm->lock);
+
+	return r;
+}
 #endif
 
 long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
@@ -813,6 +1096,37 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
 			goto out;
 		break;
 	}
+	case KVM_GET_MSIX_ENTRY: {
+		struct kvm_msix_entry entry;
+		r = -EFAULT;
+		if (copy_from_user(&entry, argp, sizeof entry))
+			goto out;
+		r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
+		if (r)
+			goto out;
+		r = -EFAULT;
+		if (copy_to_user(argp, &entry, sizeof entry))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_UPDATE_MSIX_MMIO: {
+		struct kvm_msix_mmio msix_mmio;
+
+		r = -EFAULT;
+		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
+			goto out;
+
+		r = -EINVAL;
+		if (find_first_bit((unsigned long *)msix_mmio.reserved,
+		    sizeof(msix_mmio.reserved)) < sizeof(msix_mmio.reserved))
+			goto out;
+
+		r = kvm_vm_ioctl_update_msix_mmio(kvm, &msix_mmio);
+		if (r)
+			goto out;
+		break;
+	}
 #endif
 	}
 out:
-- 
1.7.0.1



* [PATCH 7/7] KVM: assigned dev: Big endian support for MSI-X MMIO
  2010-11-11  7:46 [PATCH 0/7 v4] MSI-X mask support for assigned device Sheng Yang
                   ` (5 preceding siblings ...)
  2010-11-11  7:46 ` [PATCH 6/7] KVM: assigned dev: MSI-X mask support Sheng Yang
@ 2010-11-11  7:47 ` Sheng Yang
  2010-11-11 16:12   ` Michael S. Tsirkin
  6 siblings, 1 reply; 22+ messages in thread
From: Sheng Yang @ 2010-11-11  7:47 UTC (permalink / raw)
  To: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin; +Cc: kvm, Sheng Yang

Add byte-order conversion for big-endian machines. (Untested!)

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
 virt/kvm/assigned-dev.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 3010d7d..15b5f74 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -848,9 +848,9 @@ static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
 		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
 		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
 				PCI_MSIX_ENTRY_VECTOR_CTRL)
-			*(unsigned long *)val =
+			*(unsigned long *)val = le32_to_cpu(
 				test_bit(idx, adev->msix_mask_bitmap) ?
-				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
+				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0);
 		else
 			r = -EOPNOTSUPP;
 		goto out;
@@ -869,6 +869,7 @@ static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
 			adev->msix_mask_bitmap);
 	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry], len);
 
+	*(unsigned long *)val = le32_to_cpu(*(unsigned long *)val);
 out:
 	mutex_unlock(&adev->kvm->lock);
 	return r;
@@ -881,7 +882,7 @@ static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
 			container_of(this, struct kvm_assigned_dev_kernel,
 				     msix_mmio_dev);
 	int idx, r = 0;
-	unsigned long new_val = *(unsigned long *)val;
+	unsigned long new_val = cpu_to_le32(*(unsigned long *)val);
 
 	mutex_lock(&adev->kvm->lock);
 	if (!msix_mmio_in_range(adev, addr, len)) {
-- 
1.7.0.1



* Re: [PATCH 7/7] KVM: assigned dev: Big endian support for MSI-X MMIO
  2010-11-11  7:47 ` [PATCH 7/7] KVM: assigned dev: Big endian support for MSI-X MMIO Sheng Yang
@ 2010-11-11 16:12   ` Michael S. Tsirkin
  2010-11-12  2:47     ` Sheng Yang
  0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2010-11-11 16:12 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Thu, Nov 11, 2010 at 03:47:00PM +0800, Sheng Yang wrote:
> Add byte-order conversion for big-endian machines. (Untested!)
> 
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>

I presume this is tested at the same level as the previous patch?
If so, you'll want to fold this into the previous patch.

Also, please build with sparse (C=1) to find some endian-ness issues.

> ---
>  virt/kvm/assigned-dev.c |    7 ++++---
>  1 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index 3010d7d..15b5f74 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -848,9 +848,9 @@ static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
>  		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
>  		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
>  				PCI_MSIX_ENTRY_VECTOR_CTRL)
> -			*(unsigned long *)val =
> +			*(unsigned long *)val = le32_to_cpu(

This should be cpu_to_le32. And val cast to __le32.

>  				test_bit(idx, adev->msix_mask_bitmap) ?
> -				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> +				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0);
>  		else
>  			r = -EOPNOTSUPP;
>  		goto out;

In this function,
entry must be __le32 too, and fix up endian-ness where you fill it in.

> @@ -869,6 +869,7 @@ static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
>  			adev->msix_mask_bitmap);
>  	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry], len);
>  
> +	*(unsigned long *)val = le32_to_cpu(*(unsigned long *)val);
>  out:
>  	mutex_unlock(&adev->kvm->lock);
>  	return r;
> @@ -881,7 +882,7 @@ static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
>  			container_of(this, struct kvm_assigned_dev_kernel,
>  				     msix_mmio_dev);
>  	int idx, r = 0;
> -	unsigned long new_val = *(unsigned long *)val;
> +	unsigned long new_val = cpu_to_le32(*(unsigned long *)val);

__le32

>  
>  	mutex_lock(&adev->kvm->lock);
>  	if (!msix_mmio_in_range(adev, addr, len)) {
> -- 
> 1.7.0.1


* Re: [PATCH 2/7] PCI: Add mask bit definition for MSI-X table
  2010-11-11  7:46 ` [PATCH 2/7] PCI: Add mask bit definition for MSI-X table Sheng Yang
@ 2010-11-11 17:29   ` Jesse Barnes
  2010-11-15  8:02     ` Sheng Yang
  0 siblings, 1 reply; 22+ messages in thread
From: Jesse Barnes @ 2010-11-11 17:29 UTC (permalink / raw)
  To: Sheng Yang
  Cc: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin, kvm,
	Matthew Wilcox, linux-pci

On Thu, 11 Nov 2010 15:46:55 +0800
Sheng Yang <sheng@linux.intel.com> wrote:

> Then we can use it instead of magic number 1.
> 
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> ---
>  drivers/pci/msi.c        |    5 +++--
>  include/linux/pci_regs.h |    1 +
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 69b7be3..095634e 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -158,8 +158,9 @@ static u32 __msix_mask_irq(struct msi_desc *desc, u32 flag)
>  	u32 mask_bits = desc->masked;
>  	unsigned offset = desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
>  						PCI_MSIX_ENTRY_VECTOR_CTRL;
> -	mask_bits &= ~1;
> -	mask_bits |= flag;
> +	mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
> +	if (flag)
> +		mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
>  	writel(mask_bits, desc->mask_base + offset);
>  
>  	return mask_bits;
> diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
> index acfc224..ff51632 100644
> --- a/include/linux/pci_regs.h
> +++ b/include/linux/pci_regs.h
> @@ -313,6 +313,7 @@
>  #define  PCI_MSIX_ENTRY_UPPER_ADDR	4
>  #define  PCI_MSIX_ENTRY_DATA		8
>  #define  PCI_MSIX_ENTRY_VECTOR_CTRL	12
> +#define   PCI_MSIX_ENTRY_CTRL_MASKBIT	1
>  
>  /* CompactPCI Hotswap Register */
>  

Applied 1/7 and 2/7 to my linux-next tree, thanks.

If it's easier to push them both through the kvm tree let me know; you
can just add my acked-by in that case.

-- 
Jesse Barnes, Intel Open Source Technology Center


* Re: [PATCH 7/7] KVM: assigned dev: Big endian support for MSI-X MMIO
  2010-11-11 16:12   ` Michael S. Tsirkin
@ 2010-11-12  2:47     ` Sheng Yang
  0 siblings, 0 replies; 22+ messages in thread
From: Sheng Yang @ 2010-11-12  2:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Friday 12 November 2010 00:12:21 Michael S. Tsirkin wrote:
> On Thu, Nov 11, 2010 at 03:47:00PM +0800, Sheng Yang wrote:
> > Add macro for big-endian machines. (Untested!)
> > 
> > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> 
> I presume this is tested at the same level as the previous patch?
> So you want to fold this into the previous patch.

Of course it was tested on Intel's platform, but that doesn't mean much for a 
big-endian patch...

I'd like to make it separate still, because I can't test it properly.
> 
> Also, please build with sparse (C=1) to find some endian-ness issues.

Sure. (I have run it with the patches; nothing new was found...)

> 
> > ---
> > 
> >  virt/kvm/assigned-dev.c |    7 ++++---
> >  1 files changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > index 3010d7d..15b5f74 100644
> > --- a/virt/kvm/assigned-dev.c
> > +++ b/virt/kvm/assigned-dev.c
> > @@ -848,9 +848,9 @@ static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
> > 
> >  		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> >  		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> >  		
> >  				PCI_MSIX_ENTRY_VECTOR_CTRL)
> > 
> > -			*(unsigned long *)val =
> > +			*(unsigned long *)val = le32_to_cpu(
> 
> This should be cpu_to_le32. And val cast to __le32.

I think we can just cast val to __le32 here? The result should be big-endian.

No, no. I suppose all the parameters of msix_mmio_read/write should be 
little-endian, because they are written into the PCI MSI-X table?
> 
> >  				test_bit(idx, adev->msix_mask_bitmap) ?
> > 
> > -				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > +				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0);
> > 
> >  		else
> >  		
> >  			r = -EOPNOTSUPP;
> >  		
> >  		goto out;
> 
> In this function, entry must be __le32 too; fix up the endianness where you fill it in.
> 
> > @@ -869,6 +869,7 @@ static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
> > 
> >  			adev->msix_mask_bitmap);
> >  	
> >  	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry], len);
> > 
> > +	*(unsigned long *)val = le32_to_cpu(*(unsigned long *)val);
> > 
> >  out:
> >  	mutex_unlock(&adev->kvm->lock);
> >  	return r;
> > 
> > @@ -881,7 +882,7 @@ static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
> > 
> >  			container_of(this, struct kvm_assigned_dev_kernel,
> >  			
> >  				     msix_mmio_dev);
> >  	
> >  	int idx, r = 0;
> > 
> > -	unsigned long new_val = *(unsigned long *)val;
> > +	unsigned long new_val = cpu_to_le32(*(unsigned long *)val);
> 
> __le32

Should it be 

unsigned long new_val = le32_to_cpu(*(unsigned long *)val);

*val should be little-endian, I guess?

Or Michael, could you provide a patch for this? I really don't think I can 
guarantee correctness of this patch...

--
regards
Yang, Sheng

> 
> >  	mutex_lock(&adev->kvm->lock);
> >  	if (!msix_mmio_in_range(adev, addr, len)) {


* Re: [PATCH 6/7] KVM: assigned dev: MSI-X mask support
  2010-11-11  7:46 ` [PATCH 6/7] KVM: assigned dev: MSI-X mask support Sheng Yang
@ 2010-11-12  9:53   ` Michael S. Tsirkin
  2010-11-12 10:13     ` Sheng Yang
  0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2010-11-12  9:53 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Thu, Nov 11, 2010 at 03:46:59PM +0800, Sheng Yang wrote:
> This patch enables per-vector masking for assigned devices using MSI-X.
> 
> This patch provides two new APIs: one to specify the device's MSI-X table
> address in guest MMIO space, the other for userspace to get information about
> the mask bit.
> 
> All mask bit operations are kept in the kernel in order to accelerate them.
> Userspace shouldn't access the device MMIO directly for this information;
> instead it should use the provided API.
> 
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> ---
>  arch/x86/kvm/x86.c       |    1 +
>  include/linux/kvm.h      |   32 +++++
>  include/linux/kvm_host.h |    5 +
>  virt/kvm/assigned-dev.c  |  316 +++++++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 353 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f3f86b2..847f1e1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
>  	case KVM_CAP_DEBUGREGS:
>  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
>  	case KVM_CAP_XSAVE:
> +	case KVM_CAP_MSIX_MASK:
>  		r = 1;
>  		break;
>  	case KVM_CAP_COALESCED_MMIO:
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 919ae53..32cd244 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
>  #endif
>  #define KVM_CAP_PPC_GET_PVINFO 57
>  #define KVM_CAP_PPC_IRQ_LEVEL 58
> +#ifdef __KVM_HAVE_MSIX
> +#define KVM_CAP_MSIX_MASK 59
> +#endif
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -671,6 +674,9 @@ struct kvm_clock_data {
>  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
>  #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b, struct kvm_clock_data)
>  #define KVM_GET_CLOCK             _IOR(KVMIO,  0x7c, struct kvm_clock_data)
> +/* Available with KVM_CAP_MSIX_MASK */
> +#define KVM_GET_MSIX_ENTRY        _IOWR(KVMIO,  0x7d, struct kvm_msix_entry)
> +#define KVM_UPDATE_MSIX_MMIO      _IOW(KVMIO,  0x7e, struct kvm_msix_mmio)
>  /* Available with KVM_CAP_PIT_STATE2 */
>  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
>  #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
> @@ -794,4 +800,30 @@ struct kvm_assigned_msix_entry {
>  	__u16 padding[3];
>  };
>  
> +#define KVM_MSIX_TYPE_ASSIGNED_DEV	1
> +
> +#define KVM_MSIX_FLAG_MASKBIT		(1 << 0)
> +#define KVM_MSIX_FLAG_QUERY_MASKBIT	(1 << 0)
> +
> +struct kvm_msix_entry {
> +	__u32 id;
> +	__u32 type;
> +	__u32 entry; /* The index of entry in the MSI-X table */
> +	__u32 flags;
> +	__u32 query_flags;
> +	__u32 reserved[5];
> +};
> +
> +#define KVM_MSIX_MMIO_FLAG_REGISTER	(1 << 0)
> +#define KVM_MSIX_MMIO_FLAG_UNREGISTER	(1 << 1)
> +
> +struct kvm_msix_mmio {
> +	__u32 id;
> +	__u32 type;
> +	__u64 base_addr;
> +	__u32 max_entries_nr;
> +	__u32 flags;
> +	__u32 reserved[6];
> +};
> +
>  #endif /* __LINUX_KVM_H */
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4b31539..9d49074 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -445,6 +445,7 @@ struct kvm_guest_msix_entry {
>  };
>  
>  #define KVM_ASSIGNED_ENABLED_IOMMU		(1 << 0)
> +#define KVM_ASSIGNED_ENABLED_MSIX_MMIO		(1 << 1)
>  struct kvm_assigned_dev_kernel {
>  	struct kvm_irq_ack_notifier ack_notifier;
>  	struct work_struct interrupt_work;
> @@ -465,6 +466,10 @@ struct kvm_assigned_dev_kernel {
>  	struct pci_dev *dev;
>  	struct kvm *kvm;
>  	spinlock_t assigned_dev_lock;
> +	DECLARE_BITMAP(msix_mask_bitmap, KVM_MAX_MSIX_PER_DEV);
> +	gpa_t msix_mmio_base;
> +	struct kvm_io_device msix_mmio_dev;
> +	int msix_max_entries_nr;
>  };
>  
>  struct kvm_irq_mask_notifier {
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index 5c6b96d..3010d7d 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -226,12 +226,27 @@ static void kvm_free_assigned_irq(struct kvm *kvm,
>  	kvm_deassign_irq(kvm, assigned_dev, assigned_dev->irq_requested_type);
>  }
>  
> +static void unregister_msix_mmio(struct kvm *kvm,
> +				 struct kvm_assigned_dev_kernel *adev)
> +{
> +	if (adev->flags & KVM_ASSIGNED_ENABLED_MSIX_MMIO) {
> +		mutex_lock(&kvm->slots_lock);
> +		kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> +				&adev->msix_mmio_dev);
> +		mutex_unlock(&kvm->slots_lock);
> +		adev->flags &= ~KVM_ASSIGNED_ENABLED_MSIX_MMIO;
> +	}
> +}
> +
>  static void kvm_free_assigned_device(struct kvm *kvm,
>  				     struct kvm_assigned_dev_kernel
>  				     *assigned_dev)
>  {
>  	kvm_free_assigned_irq(kvm, assigned_dev);
>  
> +#ifdef __KVM_HAVE_MSIX
> +	unregister_msix_mmio(kvm, assigned_dev);
> +#endif
>  	pci_reset_function(assigned_dev->dev);
>  
>  	pci_release_regions(assigned_dev->dev);
> @@ -504,7 +519,7 @@ out:
>  static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>  				      struct kvm_assigned_pci_dev *assigned_dev)
>  {
> -	int r = 0, idx;
> +	int r = 0, idx, i;
>  	struct kvm_assigned_dev_kernel *match;
>  	struct pci_dev *dev;
>  
> @@ -564,6 +579,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>  
>  	list_add(&match->list, &kvm->arch.assigned_dev_head);
>  
> +	/* The state after reset of MSI-X table is all masked */
> +	for (i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
> +		set_bit(i, match->msix_mask_bitmap);
> +
>  	if (assigned_dev->flags & KVM_ASSIGNED_ENABLED_IOMMU) {
>  		if (!kvm->arch.iommu_domain) {
>  			r = kvm_iommu_map_guest(kvm);
> @@ -667,6 +686,43 @@ msix_nr_out:
>  	return r;
>  }
>  
> +static void update_msix_mask(struct kvm_assigned_dev_kernel *adev,
> +			     int idx, bool new_mask_flag)
> +{
> +	int irq;
> +	bool old_mask_flag, need_flush = false;
> +
> +	spin_lock_irq(&adev->assigned_dev_lock);
> +
> +	if (!adev->dev->msix_enabled ||
> +	    !(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> +		goto out;
> +
> +	old_mask_flag = test_bit(adev->guest_msix_entries[idx].entry,
> +			adev->msix_mask_bitmap);
> +	if (old_mask_flag == new_mask_flag)
> +		goto out;
> +
> +	irq = adev->host_msix_entries[idx].vector;
> +	BUG_ON(irq == 0);
> +
> +	if (new_mask_flag) {
> +		set_bit(adev->guest_msix_entries[idx].entry,
> +				adev->msix_mask_bitmap);
> +		disable_irq_nosync(irq);
> +		need_flush = true;
> +	} else {
> +		clear_bit(adev->guest_msix_entries[idx].entry,
> +				adev->msix_mask_bitmap);
> +		enable_irq(irq);
> +	}
> +out:
> +	spin_unlock_irq(&adev->assigned_dev_lock);
> +
> +	if (need_flush)
> +		flush_work(&adev->interrupt_work);
> +}
> +
>  static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
>  				       struct kvm_assigned_msix_entry *entry)
>  {
> @@ -701,6 +757,233 @@ msix_entry_out:
>  
>  	return r;
>  }
> +
> +static int kvm_vm_ioctl_get_msix_entry(struct kvm *kvm,
> +				       struct kvm_msix_entry *entry)
> +{
> +	int r = 0;
> +	struct kvm_assigned_dev_kernel *adev;
> +
> +	if (entry->type != KVM_MSIX_TYPE_ASSIGNED_DEV)
> +		return -EINVAL;
> +
> +	if (!entry->query_flags)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->lock);
> +
> +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> +				      entry->id);
> +
> +	if (!adev) {
> +		r = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (entry->entry >= adev->msix_max_entries_nr) {
> +		r = -ENOSPC;
> +		goto out;
> +	}
> +
> +	if (entry->query_flags & KVM_MSIX_FLAG_QUERY_MASKBIT) {
> +		if (test_bit(entry->entry, adev->msix_mask_bitmap))
> +			entry->flags |= KVM_MSIX_FLAG_MASKBIT;
> +		else
> +			entry->flags &= ~KVM_MSIX_FLAG_MASKBIT;
> +	}
> +
> +out:
> +	mutex_unlock(&kvm->lock);
> +
> +	return r;
> +}
> +
> +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> +			      gpa_t addr, int len)
> +{
> +	gpa_t start, end;
> +
> +	BUG_ON(adev->msix_mmio_base == 0);

I thought we wanted to use a separate flag for this?

> +	start = adev->msix_mmio_base;
> +	end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> +		adev->msix_max_entries_nr;
> +	if (addr >= start && addr + len <= end)
> +		return true;
> +
> +	return false;
> +}
> +
> +static int msix_get_enabled_idx(struct kvm_assigned_dev_kernel *adev,
> +			      gpa_t addr, int len)
> +{
> +	int i, index = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> +
> +	for (i = 0; i < adev->entries_nr; i++)
> +		if (adev->guest_msix_entries[i].entry == index)
> +			return i;
> +
> +	return -EINVAL;
> +}
> +

Hmm, we still have a linear scan on each write.  Is the point to detect
which entries need to be handled in kernel? Maybe another bitmap for
this?  Or just handle whole entry write in kernel as well, then we won't
need this at all.

> +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
> +			  void *val)
> +{
> +	struct kvm_assigned_dev_kernel *adev =
> +			container_of(this, struct kvm_assigned_dev_kernel,
> +				     msix_mmio_dev);
> +	int idx, r = 0;
> +	u32 entry[4];
> +	struct kvm_kernel_irq_routing_entry e;
> +
> +	mutex_lock(&adev->kvm->lock);
> +	if (!msix_mmio_in_range(adev, addr, len)) {
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if ((addr & 0x3) || len != 4)
> +		goto out;
> +
> +	idx = msix_get_enabled_idx(adev, addr, len);
> +	if (idx < 0) {
> +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> +		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> +				PCI_MSIX_ENTRY_VECTOR_CTRL)
> +			*(unsigned long *)val =
> +				test_bit(idx, adev->msix_mask_bitmap) ?
> +				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> +		else
> +			r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +
> +	r = kvm_get_irq_routing_entry(adev->kvm,
> +			adev->guest_msix_entries[idx].vector, &e);
> +	if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	entry[0] = e.msi.address_lo;
> +	entry[1] = e.msi.address_hi;
> +	entry[2] = e.msi.data;
> +	entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> +			adev->msix_mask_bitmap);
> +	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry], len);

It seems weird to pass writes to userspace but do reads
in kernel.
Either we should support entry read and write or neither.

> +
> +out:
> +	mutex_unlock(&adev->kvm->lock);
> +	return r;
> +}
> +
> +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
> +			   const void *val)
> +{

> +	struct kvm_assigned_dev_kernel *adev =
> +			container_of(this, struct kvm_assigned_dev_kernel,
> +				     msix_mmio_dev);
> +	int idx, r = 0;
> +	unsigned long new_val = *(unsigned long *)val;

Move this to after the length and alignment checks; it might not be safe here.

> +
> +	mutex_lock(&adev->kvm->lock);
> +	if (!msix_mmio_in_range(adev, addr, len)) {
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if ((addr & 0x3) || len != 4)

The length could be 8:

For all accesses to MSI-X Table and MSI-X PBA fields, software must use aligned full
DWORD or aligned full QWORD transactions; otherwise, the result is undefined.

and

Software is permitted to fill in MSI-X Table entry DWORD fields individually with
DWORD writes, or software in certain cases is permitted to fill in appropriate pairs of
DWORDs with a single QWORD write. Specifically, software is always permitted to fill in
the Message Address and Message Upper Address fields with a single QWORD write. If a
given entry is currently masked (via its Mask bit or the Function Mask bit), software is
permitted to fill in the Message Data and Vector Control fields with a single QWORD write,
taking advantage of the fact the Message Data field is guaranteed to become visible to
hardware no later than the Vector Control field. However, if software wishes to mask a
currently unmasked entry (without setting the Function Mask bit), software must set the
entry’s Mask bit using a DWORD write to the Vector Control field, since performing a
QWORD write to the Message Data and Vector Control fields might result in the Message
Data field being modified before the Mask bit in the Vector Control field becomes set.

> +		goto out;


Why don't we pass it to userspace on error?
That would make it easier to debug ...

> +
> +	idx = msix_get_enabled_idx(adev, addr, len);
> +	if (idx < 0) {
> +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> +		if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> +				PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> +			if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> +				goto out;
> +			if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> +				set_bit(idx, adev->msix_mask_bitmap);
> +			else
> +				clear_bit(idx, adev->msix_mask_bitmap);
> +			/* It's possible that we need re-enable MSI-X, so go
> +			 * back to userspace */
> +		}
> +		/* Userspace would handle other MMIO writing */
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> +		r = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> +		goto out;

So if the write touches any other bit, we ignore the mask bit write
completely? And don't even pass it to userspace? Why?

> +	update_msix_mask(adev, idx, !!(new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT));
> +out:
> +	mutex_unlock(&adev->kvm->lock);
> +
> +	return r;
> +}
> +
> +static const struct kvm_io_device_ops msix_mmio_ops = {
> +	.read     = msix_mmio_read,
> +	.write    = msix_mmio_write,
> +};
> +
> +static int kvm_vm_ioctl_update_msix_mmio(struct kvm *kvm,
> +				struct kvm_msix_mmio *msix_mmio)
> +{
> +	int r = 0;
> +	struct kvm_assigned_dev_kernel *adev;
> +
> +	if (msix_mmio->type != KVM_MSIX_TYPE_ASSIGNED_DEV)
> +		return -EINVAL;
> +
> +	if (!msix_mmio->flags)
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->lock);
> +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> +				      msix_mmio->id);
> +	if (!adev) {
> +		r = -EINVAL;
> +		goto out;
> +	}
> +	if (msix_mmio->base_addr == 0) {
> +		r = -EINVAL;
> +		goto out;
> +	}
> +	if (msix_mmio->max_entries_nr == 0 ||
> +			msix_mmio->max_entries_nr > KVM_MAX_MSIX_PER_DEV) {
> +		r = -EINVAL;
> +		goto out;
> +	}
> +
> +	if ((msix_mmio->flags & KVM_MSIX_MMIO_FLAG_REGISTER) &&
> +			(msix_mmio->flags & KVM_MSIX_MMIO_FLAG_UNREGISTER)) {
> +		r = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (msix_mmio->flags & KVM_MSIX_MMIO_FLAG_REGISTER) {
> +		if (!(adev->flags & KVM_ASSIGNED_ENABLED_MSIX_MMIO)) {
> +			mutex_lock(&kvm->slots_lock);
> +			kvm_iodevice_init(&adev->msix_mmio_dev,
> +					&msix_mmio_ops);
> +			r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> +					&adev->msix_mmio_dev);
> +			if (!r)
> +				adev->flags |= KVM_ASSIGNED_ENABLED_MSIX_MMIO;
> +			mutex_unlock(&kvm->slots_lock);
> +		}
> +		if (!r) {
> +			adev->msix_mmio_base = msix_mmio->base_addr;
> +			adev->msix_max_entries_nr = msix_mmio->max_entries_nr;
> +		}
> +	} else if (msix_mmio->flags & KVM_MSIX_MMIO_FLAG_UNREGISTER)
> +		unregister_msix_mmio(kvm, adev);
> +out:
> +	mutex_unlock(&kvm->lock);
> +
> +	return r;
> +}
>  #endif
>  
>  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> @@ -813,6 +1096,37 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
>  			goto out;
>  		break;
>  	}
> +	case KVM_GET_MSIX_ENTRY: {
> +		struct kvm_msix_entry entry;
> +		r = -EFAULT;
> +		if (copy_from_user(&entry, argp, sizeof entry))
> +			goto out;
> +		r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> +		if (r)
> +			goto out;
> +		r = -EFAULT;
> +		if (copy_to_user(argp, &entry, sizeof entry))
> +			goto out;
> +		r = 0;
> +		break;
> +	}
> +	case KVM_UPDATE_MSIX_MMIO: {
> +		struct kvm_msix_mmio msix_mmio;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> +			goto out;
> +
> +		r = -EINVAL;
> +		if (find_first_bit((unsigned long *)msix_mmio.reserved,
> +		    sizeof(msix_mmio.reserved)) < sizeof(msix_mmio.reserved))
> +			goto out;
> +
> +		r = kvm_vm_ioctl_update_msix_mmio(kvm, &msix_mmio);
> +		if (r)
> +			goto out;
> +		break;
> +	}
>  #endif
>  	}
>  out:
> -- 
> 1.7.0.1


* Re: [PATCH 6/7] KVM: assigned dev: MSI-X mask support
  2010-11-12  9:53   ` Michael S. Tsirkin
@ 2010-11-12 10:13     ` Sheng Yang
  2010-11-12 10:47       ` Michael S. Tsirkin
  0 siblings, 1 reply; 22+ messages in thread
From: Sheng Yang @ 2010-11-12 10:13 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Friday 12 November 2010 17:53:13 Michael S. Tsirkin wrote:
> On Thu, Nov 11, 2010 at 03:46:59PM +0800, Sheng Yang wrote:
> > This patch enables per-vector masking for assigned devices using MSI-X.
> > 
> > This patch provides two new APIs: one to specify the device's MSI-X table
> > address in guest MMIO space, the other for userspace to get information
> > about the mask bit.
> > 
> > All mask bit operations are kept in the kernel in order to accelerate them.
> > Userspace shouldn't access the device MMIO directly for this information;
> > instead it should use the provided API.
> > 
> > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > ---
> > 
> >  arch/x86/kvm/x86.c       |    1 +
> >  include/linux/kvm.h      |   32 +++++
> >  include/linux/kvm_host.h |    5 +
> >  virt/kvm/assigned-dev.c  |  316 +++++++++++++++++++++++++++++++++++++++++++++-
> >  4 files changed, 353 insertions(+), 1 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index f3f86b2..847f1e1 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > 
> >  	case KVM_CAP_DEBUGREGS:
> >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > 
> >  	case KVM_CAP_XSAVE:
> > +	case KVM_CAP_MSIX_MASK:
> >  		r = 1;
> >  		break;
> >  	
> >  	case KVM_CAP_COALESCED_MMIO:
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index 919ae53..32cd244 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > 
> >  #endif
> >  #define KVM_CAP_PPC_GET_PVINFO 57
> >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > 
> > +#ifdef __KVM_HAVE_MSIX
> > +#define KVM_CAP_MSIX_MASK 59
> > +#endif
> > 
> >  #ifdef KVM_CAP_IRQ_ROUTING
> > 
> > @@ -671,6 +674,9 @@ struct kvm_clock_data {
> > 
> >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
> >  #define KVM_SET_CLOCK             _IOW(KVMIO,  0x7b, struct kvm_clock_data)
> >  #define KVM_GET_CLOCK             _IOR(KVMIO,  0x7c, struct kvm_clock_data)
> > 
> > +/* Available with KVM_CAP_MSIX_MASK */
> > +#define KVM_GET_MSIX_ENTRY        _IOWR(KVMIO,  0x7d, struct kvm_msix_entry)
> > +#define KVM_UPDATE_MSIX_MMIO      _IOW(KVMIO,  0x7e, struct kvm_msix_mmio)
> > 
> >  /* Available with KVM_CAP_PIT_STATE2 */
> >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
> >  #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
> > 
> > @@ -794,4 +800,30 @@ struct kvm_assigned_msix_entry {
> > 
> >  	__u16 padding[3];
> >  
> >  };
> > 
> > +#define KVM_MSIX_TYPE_ASSIGNED_DEV	1
> > +
> > +#define KVM_MSIX_FLAG_MASKBIT		(1 << 0)
> > +#define KVM_MSIX_FLAG_QUERY_MASKBIT	(1 << 0)
> > +
> > +struct kvm_msix_entry {
> > +	__u32 id;
> > +	__u32 type;
> > +	__u32 entry; /* The index of entry in the MSI-X table */
> > +	__u32 flags;
> > +	__u32 query_flags;
> > +	__u32 reserved[5];
> > +};
> > +
> > +#define KVM_MSIX_MMIO_FLAG_REGISTER	(1 << 0)
> > +#define KVM_MSIX_MMIO_FLAG_UNREGISTER	(1 << 1)
> > +
> > +struct kvm_msix_mmio {
> > +	__u32 id;
> > +	__u32 type;
> > +	__u64 base_addr;
> > +	__u32 max_entries_nr;
> > +	__u32 flags;
> > +	__u32 reserved[6];
> > +};
> > +
> > 
> >  #endif /* __LINUX_KVM_H */
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 4b31539..9d49074 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -445,6 +445,7 @@ struct kvm_guest_msix_entry {
> > 
> >  };
> >  
> >  #define KVM_ASSIGNED_ENABLED_IOMMU		(1 << 0)
> > 
> > +#define KVM_ASSIGNED_ENABLED_MSIX_MMIO		(1 << 1)
> > 
> >  struct kvm_assigned_dev_kernel {
> >  
> >  	struct kvm_irq_ack_notifier ack_notifier;
> >  	struct work_struct interrupt_work;
> > 
> > @@ -465,6 +466,10 @@ struct kvm_assigned_dev_kernel {
> > 
> >  	struct pci_dev *dev;
> >  	struct kvm *kvm;
> >  	spinlock_t assigned_dev_lock;
> > 
> > +	DECLARE_BITMAP(msix_mask_bitmap, KVM_MAX_MSIX_PER_DEV);
> > +	gpa_t msix_mmio_base;
> > +	struct kvm_io_device msix_mmio_dev;
> > +	int msix_max_entries_nr;
> > 
> >  };
> >  
> >  struct kvm_irq_mask_notifier {
> > 
> > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > index 5c6b96d..3010d7d 100644
> > --- a/virt/kvm/assigned-dev.c
> > +++ b/virt/kvm/assigned-dev.c
> > @@ -226,12 +226,27 @@ static void kvm_free_assigned_irq(struct kvm *kvm,
> > 
> >  	kvm_deassign_irq(kvm, assigned_dev, assigned_dev->irq_requested_type);
> >  
> >  }
> > 
> > +static void unregister_msix_mmio(struct kvm *kvm,
> > +				 struct kvm_assigned_dev_kernel *adev)
> > +{
> > +	if (adev->flags & KVM_ASSIGNED_ENABLED_MSIX_MMIO) {
> > +		mutex_lock(&kvm->slots_lock);
> > +		kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > +				&adev->msix_mmio_dev);
> > +		mutex_unlock(&kvm->slots_lock);
> > +		adev->flags &= ~KVM_ASSIGNED_ENABLED_MSIX_MMIO;
> > +	}
> > +}
> > +
> > 
> >  static void kvm_free_assigned_device(struct kvm *kvm,
> >  
> >  				     struct kvm_assigned_dev_kernel
> >  				     *assigned_dev)
> >  
> >  {
> >  
> >  	kvm_free_assigned_irq(kvm, assigned_dev);
> > 
> > +#ifdef __KVM_HAVE_MSIX
> > +	unregister_msix_mmio(kvm, assigned_dev);
> > +#endif
> > 
> >  	pci_reset_function(assigned_dev->dev);
> >  	
> >  	pci_release_regions(assigned_dev->dev);
> > 
> > @@ -504,7 +519,7 @@ out:
> >  static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> >  
> >  				      struct kvm_assigned_pci_dev *assigned_dev)
> >  
> >  {
> > 
> > -	int r = 0, idx;
> > +	int r = 0, idx, i;
> > 
> >  	struct kvm_assigned_dev_kernel *match;
> >  	struct pci_dev *dev;
> > 
> > @@ -564,6 +579,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> > 
> >  	list_add(&match->list, &kvm->arch.assigned_dev_head);
> > 
> > +	/* The state after reset of MSI-X table is all masked */
> > +	for (i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
> > +		set_bit(i, match->msix_mask_bitmap);
> > +
> > 
> >  	if (assigned_dev->flags & KVM_ASSIGNED_ENABLED_IOMMU) {
> >  	
> >  		if (!kvm->arch.iommu_domain) {
> >  		
> >  			r = kvm_iommu_map_guest(kvm);
> > 
> > @@ -667,6 +686,43 @@ msix_nr_out:
> >  	return r;
> >  
> >  }
> > 
> > +static void update_msix_mask(struct kvm_assigned_dev_kernel *adev,
> > +			     int idx, bool new_mask_flag)
> > +{
> > +	int irq;
> > +	bool old_mask_flag, need_flush = false;
> > +
> > +	spin_lock_irq(&adev->assigned_dev_lock);
> > +
> > +	if (!adev->dev->msix_enabled ||
> > +	    !(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> > +		goto out;
> > +
> > +	old_mask_flag = test_bit(adev->guest_msix_entries[idx].entry,
> > +			adev->msix_mask_bitmap);
> > +	if (old_mask_flag == new_mask_flag)
> > +		goto out;
> > +
> > +	irq = adev->host_msix_entries[idx].vector;
> > +	BUG_ON(irq == 0);
> > +
> > +	if (new_mask_flag) {
> > +		set_bit(adev->guest_msix_entries[idx].entry,
> > +				adev->msix_mask_bitmap);
> > +		disable_irq_nosync(irq);
> > +		need_flush = true;
> > +	} else {
> > +		clear_bit(adev->guest_msix_entries[idx].entry,
> > +				adev->msix_mask_bitmap);
> > +		enable_irq(irq);
> > +	}
> > +out:
> > +	spin_unlock_irq(&adev->assigned_dev_lock);
> > +
> > +	if (need_flush)
> > +		flush_work(&adev->interrupt_work);
> > +}
> > +
> > 
> >  static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
> >  
> >  				       struct kvm_assigned_msix_entry *entry)
> >  
> >  {
> > 
> > @@ -701,6 +757,233 @@ msix_entry_out:
> >  	return r;
> >  
> >  }
> > 
> > +
> > +static int kvm_vm_ioctl_get_msix_entry(struct kvm *kvm,
> > +				       struct kvm_msix_entry *entry)
> > +{
> > +	int r = 0;
> > +	struct kvm_assigned_dev_kernel *adev;
> > +
> > +	if (entry->type != KVM_MSIX_TYPE_ASSIGNED_DEV)
> > +		return -EINVAL;
> > +
> > +	if (!entry->query_flags)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&kvm->lock);
> > +
> > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > +				      entry->id);
> > +
> > +	if (!adev) {
> > +		r = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	if (entry->entry >= adev->msix_max_entries_nr) {
> > +		r = -ENOSPC;
> > +		goto out;
> > +	}
> > +
> > +	if (entry->query_flags & KVM_MSIX_FLAG_QUERY_MASKBIT) {
> > +		if (test_bit(entry->entry, adev->msix_mask_bitmap))
> > +			entry->flags |= KVM_MSIX_FLAG_MASKBIT;
> > +		else
> > +			entry->flags &= ~KVM_MSIX_FLAG_MASKBIT;
> > +	}
> > +
> > +out:
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	return r;
> > +}
> > +
> > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> > +			      gpa_t addr, int len)
> > +{
> > +	gpa_t start, end;
> > +
> > +	BUG_ON(adev->msix_mmio_base == 0);
> 
> I thought we wanted to use a separate flag for this?

Oh, the flag was added, but I forgot to update this one.
> 
> > +	start = adev->msix_mmio_base;
> > +	end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> > +		adev->msix_max_entries_nr;
> > +	if (addr >= start && addr + len <= end)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> > +static int msix_get_enabled_idx(struct kvm_assigned_dev_kernel *adev,
> > +			      gpa_t addr, int len)
> > +{
> > +	int i, index = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > +
> > +	for (i = 0; i < adev->entries_nr; i++)
> > +		if (adev->guest_msix_entries[i].entry == index)
> > +			return i;
> > +
> > +	return -EINVAL;
> > +}
> > +
> 
> Hmm, we still have a linear scan on each write.  Is the point to detect
> which entries need to be handled in kernel? Maybe another bitmap for
> this?  Or just handle whole entry write in kernel as well, then we won't
> need this at all.

I still don't like handling the whole entry write in the kernel. In fact here 
we have two questions:
1. Is the entry already enabled?
2. If it is, what is its sequence number?

A bitmap can solve question 1, but we still need an O(n) scan for the second. 
So I think leaving it like this should be fine.
> 
> > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
> > +			  void *val)
> > +{
> > +	struct kvm_assigned_dev_kernel *adev =
> > +			container_of(this, struct kvm_assigned_dev_kernel,
> > +				     msix_mmio_dev);
> > +	int idx, r = 0;
> > +	u32 entry[4];
> > +	struct kvm_kernel_irq_routing_entry e;
> > +
> > +	mutex_lock(&adev->kvm->lock);
> > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	if ((addr & 0x3) || len != 4)
> > +		goto out;
> > +
> > +	idx = msix_get_enabled_idx(adev, addr, len);
> > +	if (idx < 0) {
> > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > +		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > +				PCI_MSIX_ENTRY_VECTOR_CTRL)
> > +			*(unsigned long *)val =
> > +				test_bit(idx, adev->msix_mask_bitmap) ?
> > +				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > +		else
> > +			r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +
> > +	r = kvm_get_irq_routing_entry(adev->kvm,
> > +			adev->guest_msix_entries[idx].vector, &e);
> > +	if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	entry[0] = e.msi.address_lo;
> > +	entry[1] = e.msi.address_hi;
> > +	entry[2] = e.msi.data;
> > +	entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > +			adev->msix_mask_bitmap);
> > +	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry], len);
> 
> It seems weird to pass writes to userspace but do reads
> in kernel.
> Either we should support entry read and write or neither.

Entry reads are handled in kernel for speed: the kernel writes the mask bit and 
then reads it back to flush the write. I don't think reads matter much here, as 
long as userspace still owns the routing table.
 
> > +
> > +out:
> > +	mutex_unlock(&adev->kvm->lock);
> > +	return r;
> > +}
> > +
> > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
> > +			   const void *val)
> > +{
> > 
> > +	struct kvm_assigned_dev_kernel *adev =
> > +			container_of(this, struct kvm_assigned_dev_kernel,
> > +				     msix_mmio_dev);
> > +	int idx, r = 0;
> > +	unsigned long new_val = *(unsigned long *)val;
> 
> Move this to after you did length and alignment checks,
> it might not be safe here.
> 
> > +
> > +	mutex_lock(&adev->kvm->lock);
> > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	if ((addr & 0x3) || len != 4)
> 
> The length could be 8:
> 
> For all accesses to MSI-X Table and MSI-X PBA fields, software must use
> aligned full DWORD or aligned full QWORD transactions; otherwise, the
> result is undefined.
> 
> and
> 
> Software is permitted to fill in MSI-X Table entry DWORD fields
> individually with DWORD writes, or software in certain cases is permitted
> to fill in appropriate pairs of DWORDs with a single QWORD write.
> Specifically, software is always permitted to fill in the Message Address
> and Message Upper Address fields with a single QWORD write. If a given
> entry is currently masked (via its Mask bit or the Function Mask bit),
> software is permitted to fill in the Message Data and Vector Control
> fields with a single QWORD write, taking advantage of the fact the Message
> Data field is guaranteed to become visible to hardware no later than the
> Vector Control field. However, if software wishes to mask a currently
> unmasked entry (without setting the Function Mask bit), software must set
> the entry’s Mask bit using a DWORD write to the Vector Control field,
> since performing a QWORD write to the Message Data and Vector Control
> fields might result in the Message Data field being modified before the
> Mask bit in the Vector Control field becomes set.
 
I haven't seen any device use QWORD accesses, and QEMU can't handle QWORD MMIO 
either. So I guess we can leave that for later.

> > +		goto out;
> 
> Why don't we pass to userspace on error?
> Will make it easy to debug ...

OK, at least then we would know if the guest tries to access it with an 8-byte transaction.
> 
> > +
> > +	idx = msix_get_enabled_idx(adev, addr, len);
> > +	if (idx < 0) {
> > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > +		if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> > +				PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> > +			if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > +				goto out;
> > +			if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > +				set_bit(idx, adev->msix_mask_bitmap);
> > +			else
> > +				clear_bit(idx, adev->msix_mask_bitmap);
> > +			/* It's possible that we need re-enable MSI-X, so go
> > +			 * back to userspace */
> > +		}
> > +		/* Userspace would handle other MMIO writing */
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > +		r = -EOPNOTSUPP;
> > +		goto out;
> > +	}
> > +	if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > +		goto out;
> 
> So if the write touches any other bit, we ignore mask bit write
> completely? Don't even pass to userspace? why?

Because the other bits are reserved, such a write is illegal; the spec says the 
result is undefined.

--
regards
Yang, Sheng

> 
> > +	update_msix_mask(adev, idx, !!(new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT));
> > +out:
> > +	mutex_unlock(&adev->kvm->lock);
> > +
> > +	return r;
> > +}
> > +
> > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > +	.read     = msix_mmio_read,
> > +	.write    = msix_mmio_write,
> > +};
> > +
> > +static int kvm_vm_ioctl_update_msix_mmio(struct kvm *kvm,
> > +				struct kvm_msix_mmio *msix_mmio)
> > +{
> > +	int r = 0;
> > +	struct kvm_assigned_dev_kernel *adev;
> > +
> > +	if (msix_mmio->type != KVM_MSIX_TYPE_ASSIGNED_DEV)
> > +		return -EINVAL;
> > +
> > +	if (!msix_mmio->flags)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&kvm->lock);
> > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > +				      msix_mmio->id);
> > +	if (!adev) {
> > +		r = -EINVAL;
> > +		goto out;
> > +	}
> > +	if (msix_mmio->base_addr == 0) {
> > +		r = -EINVAL;
> > +		goto out;
> > +	}
> > +	if (msix_mmio->max_entries_nr == 0 ||
> > +			msix_mmio->max_entries_nr > KVM_MAX_MSIX_PER_DEV) {
> > +		r = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	if ((msix_mmio->flags & KVM_MSIX_MMIO_FLAG_REGISTER) &&
> > +			(msix_mmio->flags & KVM_MSIX_MMIO_FLAG_UNREGISTER)) {
> > +		r = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	if (msix_mmio->flags & KVM_MSIX_MMIO_FLAG_REGISTER) {
> > +		if (!(adev->flags & KVM_ASSIGNED_ENABLED_MSIX_MMIO)) {
> > +			mutex_lock(&kvm->slots_lock);
> > +			kvm_iodevice_init(&adev->msix_mmio_dev,
> > +					&msix_mmio_ops);
> > +			r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > +					&adev->msix_mmio_dev);
> > +			if (!r)
> > +				adev->flags |= KVM_ASSIGNED_ENABLED_MSIX_MMIO;
> > +			mutex_unlock(&kvm->slots_lock);
> > +		}
> > +		if (!r) {
> > +			adev->msix_mmio_base = msix_mmio->base_addr;
> > +			adev->msix_max_entries_nr = msix_mmio->max_entries_nr;
> > +		}
> > +	} else if (msix_mmio->flags & KVM_MSIX_MMIO_FLAG_UNREGISTER)
> > +		unregister_msix_mmio(kvm, adev);
> > +out:
> > +	mutex_unlock(&kvm->lock);
> > +
> > +	return r;
> > +}
> > 
> >  #endif
> >  
> >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > 
> > @@ -813,6 +1096,37 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm,
> > unsigned ioctl,
> > 
> >  			goto out;
> >  		
> >  		break;
> >  	
> >  	}
> > 
> > +	case KVM_GET_MSIX_ENTRY: {
> > +		struct kvm_msix_entry entry;
> > +		r = -EFAULT;
> > +		if (copy_from_user(&entry, argp, sizeof entry))
> > +			goto out;
> > +		r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> > +		if (r)
> > +			goto out;
> > +		r = -EFAULT;
> > +		if (copy_to_user(argp, &entry, sizeof entry))
> > +			goto out;
> > +		r = 0;
> > +		break;
> > +	}
> > +	case KVM_UPDATE_MSIX_MMIO: {
> > +		struct kvm_msix_mmio msix_mmio;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > +			goto out;
> > +
> > +		r = -EINVAL;
> > +		if (find_first_bit((unsigned long *)msix_mmio.reserved,
> > +		    sizeof(msix_mmio.reserved)) < sizeof(msix_mmio.reserved))
> > +			goto out;
> > +
> > +		r = kvm_vm_ioctl_update_msix_mmio(kvm, &msix_mmio);
> > +		if (r)
> > +			goto out;
> > +		break;
> > +	}
> > 
> >  #endif
> >  
> >  	}
> >  
> >  out:

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/7] KVM: assigned dev: MSI-X mask support
  2010-11-12 10:13     ` Sheng Yang
@ 2010-11-12 10:47       ` Michael S. Tsirkin
  2010-11-12 10:54         ` Sheng Yang
  0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2010-11-12 10:47 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Fri, Nov 12, 2010 at 06:13:48PM +0800, Sheng Yang wrote:
> On Friday 12 November 2010 17:53:13 Michael S. Tsirkin wrote:
> > On Thu, Nov 11, 2010 at 03:46:59PM +0800, Sheng Yang wrote:
> > > This patch enable per-vector mask for assigned devices using MSI-X.
> > > 
> > > This patch provided two new APIs: one is for guest to specific device's
> > > MSI-X table address in MMIO, the other is for userspace to get
> > > information about mask bit.
> > > 
> > > All the mask bit operation are kept in kernel, in order to accelerate.
> > > Userspace shouldn't access the device MMIO directly for the information,
> > > instead it should uses provided API to do so.
> > > 
> > > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > > ---
> > > 
> > >  arch/x86/kvm/x86.c       |    1 +
> > >  include/linux/kvm.h      |   32 +++++
> > >  include/linux/kvm_host.h |    5 +
> > >  virt/kvm/assigned-dev.c  |  316 +++++++++++++++++++++++++++++++++++++++++++++-
> > >  4 files changed, 353 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index f3f86b2..847f1e1 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > 
> > >  	case KVM_CAP_DEBUGREGS:
> > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > 
> > >  	case KVM_CAP_XSAVE:
> > > +	case KVM_CAP_MSIX_MASK:
> > >  		r = 1;
> > >  		break;
> > >  	
> > >  	case KVM_CAP_COALESCED_MMIO:
> > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > index 919ae53..32cd244 100644
> > > --- a/include/linux/kvm.h
> > > +++ b/include/linux/kvm.h
> > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > 
> > >  #endif
> > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > 
> > > +#ifdef __KVM_HAVE_MSIX
> > > +#define KVM_CAP_MSIX_MASK 59
> > > +#endif
> > > 
> > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > 
> > > @@ -671,6 +674,9 @@ struct kvm_clock_data {
> > > 
> > >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct
> > >  kvm_xen_hvm_config) #define KVM_SET_CLOCK             _IOW(KVMIO, 
> > >  0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK            
> > >  _IOR(KVMIO,  0x7c, struct kvm_clock_data)
> > > 
> > > +/* Available with KVM_CAP_MSIX_MASK */
> > > +#define KVM_GET_MSIX_ENTRY        _IOWR(KVMIO,  0x7d, struct
> > > kvm_msix_entry) +#define KVM_UPDATE_MSIX_MMIO      _IOW(KVMIO,  0x7e,
> > > struct kvm_msix_mmio)
> > > 
> > >  /* Available with KVM_CAP_PIT_STATE2 */
> > >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct
> > >  kvm_pit_state2) #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0,
> > >  struct kvm_pit_state2)
> > > 
> > > @@ -794,4 +800,30 @@ struct kvm_assigned_msix_entry {
> > > 
> > >  	__u16 padding[3];
> > >  
> > >  };
> > > 
> > > +#define KVM_MSIX_TYPE_ASSIGNED_DEV	1
> > > +
> > > +#define KVM_MSIX_FLAG_MASKBIT		(1 << 0)
> > > +#define KVM_MSIX_FLAG_QUERY_MASKBIT	(1 << 0)
> > > +
> > > +struct kvm_msix_entry {
> > > +	__u32 id;
> > > +	__u32 type;
> > > +	__u32 entry; /* The index of entry in the MSI-X table */
> > > +	__u32 flags;
> > > +	__u32 query_flags;
> > > +	__u32 reserved[5];
> > > +};
> > > +
> > > +#define KVM_MSIX_MMIO_FLAG_REGISTER	(1 << 0)
> > > +#define KVM_MSIX_MMIO_FLAG_UNREGISTER	(1 << 1)
> > > +
> > > +struct kvm_msix_mmio {
> > > +	__u32 id;
> > > +	__u32 type;
> > > +	__u64 base_addr;
> > > +	__u32 max_entries_nr;
> > > +	__u32 flags;
> > > +	__u32 reserved[6];
> > > +};
> > > +
> > > 
> > >  #endif /* __LINUX_KVM_H */
> > > 
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 4b31539..9d49074 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -445,6 +445,7 @@ struct kvm_guest_msix_entry {
> > > 
> > >  };
> > >  
> > >  #define KVM_ASSIGNED_ENABLED_IOMMU		(1 << 0)
> > > 
> > > +#define KVM_ASSIGNED_ENABLED_MSIX_MMIO		(1 << 1)
> > > 
> > >  struct kvm_assigned_dev_kernel {
> > >  
> > >  	struct kvm_irq_ack_notifier ack_notifier;
> > >  	struct work_struct interrupt_work;
> > > 
> > > @@ -465,6 +466,10 @@ struct kvm_assigned_dev_kernel {
> > > 
> > >  	struct pci_dev *dev;
> > >  	struct kvm *kvm;
> > >  	spinlock_t assigned_dev_lock;
> > > 
> > > +	DECLARE_BITMAP(msix_mask_bitmap, KVM_MAX_MSIX_PER_DEV);
> > > +	gpa_t msix_mmio_base;
> > > +	struct kvm_io_device msix_mmio_dev;
> > > +	int msix_max_entries_nr;
> > > 
> > >  };
> > >  
> > >  struct kvm_irq_mask_notifier {
> > > 
> > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > index 5c6b96d..3010d7d 100644
> > > --- a/virt/kvm/assigned-dev.c
> > > +++ b/virt/kvm/assigned-dev.c
> > > @@ -226,12 +226,27 @@ static void kvm_free_assigned_irq(struct kvm *kvm,
> > > 
> > >  	kvm_deassign_irq(kvm, assigned_dev, assigned_dev->irq_requested_type);
> > >  
> > >  }
> > > 
> > > +static void unregister_msix_mmio(struct kvm *kvm,
> > > +				 struct kvm_assigned_dev_kernel *adev)
> > > +{
> > > +	if (adev->flags & KVM_ASSIGNED_ENABLED_MSIX_MMIO) {
> > > +		mutex_lock(&kvm->slots_lock);
> > > +		kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > > +				&adev->msix_mmio_dev);
> > > +		mutex_unlock(&kvm->slots_lock);
> > > +		adev->flags &= ~KVM_ASSIGNED_ENABLED_MSIX_MMIO;
> > > +	}
> > > +}
> > > +
> > > 
> > >  static void kvm_free_assigned_device(struct kvm *kvm,
> > >  
> > >  				     struct kvm_assigned_dev_kernel
> > >  				     *assigned_dev)
> > >  
> > >  {
> > >  
> > >  	kvm_free_assigned_irq(kvm, assigned_dev);
> > > 
> > > +#ifdef __KVM_HAVE_MSIX
> > > +	unregister_msix_mmio(kvm, assigned_dev);
> > > +#endif
> > > 
> > >  	pci_reset_function(assigned_dev->dev);
> > >  	
> > >  	pci_release_regions(assigned_dev->dev);
> > > 
> > > @@ -504,7 +519,7 @@ out:
> > >  static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> > >  
> > >  				      struct kvm_assigned_pci_dev *assigned_dev)
> > >  
> > >  {
> > > 
> > > -	int r = 0, idx;
> > > +	int r = 0, idx, i;
> > > 
> > >  	struct kvm_assigned_dev_kernel *match;
> > >  	struct pci_dev *dev;
> > > 
> > > @@ -564,6 +579,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm
> > > *kvm,
> > > 
> > >  	list_add(&match->list, &kvm->arch.assigned_dev_head);
> > > 
> > > +	/* The state after reset of MSI-X table is all masked */
> > > +	for (i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
> > > +		set_bit(i, match->msix_mask_bitmap);
> > > +
> > > 
> > >  	if (assigned_dev->flags & KVM_ASSIGNED_ENABLED_IOMMU) {
> > >  	
> > >  		if (!kvm->arch.iommu_domain) {
> > >  		
> > >  			r = kvm_iommu_map_guest(kvm);
> > > 
> > > @@ -667,6 +686,43 @@ msix_nr_out:
> > >  	return r;
> > >  
> > >  }
> > > 
> > > +static void update_msix_mask(struct kvm_assigned_dev_kernel *adev,
> > > +			     int idx, bool new_mask_flag)
> > > +{
> > > +	int irq;
> > > +	bool old_mask_flag, need_flush = false;
> > > +
> > > +	spin_lock_irq(&adev->assigned_dev_lock);
> > > +
> > > +	if (!adev->dev->msix_enabled ||
> > > +	    !(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> > > +		goto out;
> > > +
> > > +	old_mask_flag = test_bit(adev->guest_msix_entries[idx].entry,
> > > +			adev->msix_mask_bitmap);
> > > +	if (old_mask_flag == new_mask_flag)
> > > +		goto out;
> > > +
> > > +	irq = adev->host_msix_entries[idx].vector;
> > > +	BUG_ON(irq == 0);
> > > +
> > > +	if (new_mask_flag) {
> > > +		set_bit(adev->guest_msix_entries[idx].entry,
> > > +				adev->msix_mask_bitmap);
> > > +		disable_irq_nosync(irq);
> > > +		need_flush = true;
> > > +	} else {
> > > +		clear_bit(adev->guest_msix_entries[idx].entry,
> > > +				adev->msix_mask_bitmap);
> > > +		enable_irq(irq);
> > > +	}
> > > +out:
> > > +	spin_unlock_irq(&adev->assigned_dev_lock);
> > > +
> > > +	if (need_flush)
> > > +		flush_work(&adev->interrupt_work);
> > > +}
> > > +
> > > 
> > >  static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
> > >  
> > >  				       struct kvm_assigned_msix_entry *entry)
> > >  
> > >  {
> > > 
> > > @@ -701,6 +757,233 @@ msix_entry_out:
> > >  	return r;
> > >  
> > >  }
> > > 
> > > +
> > > +static int kvm_vm_ioctl_get_msix_entry(struct kvm *kvm,
> > > +				       struct kvm_msix_entry *entry)
> > > +{
> > > +	int r = 0;
> > > +	struct kvm_assigned_dev_kernel *adev;
> > > +
> > > +	if (entry->type != KVM_MSIX_TYPE_ASSIGNED_DEV)
> > > +		return -EINVAL;
> > > +
> > > +	if (!entry->query_flags)
> > > +		return -EINVAL;
> > > +
> > > +	mutex_lock(&kvm->lock);
> > > +
> > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > +				      entry->id);
> > > +
> > > +	if (!adev) {
> > > +		r = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +
> > > +	if (entry->entry >= adev->msix_max_entries_nr) {
> > > +		r = -ENOSPC;
> > > +		goto out;
> > > +	}
> > > +
> > > +	if (entry->query_flags & KVM_MSIX_FLAG_QUERY_MASKBIT) {
> > > +		if (test_bit(entry->entry, adev->msix_mask_bitmap))
> > > +			entry->flags |= KVM_MSIX_FLAG_MASKBIT;
> > > +		else
> > > +			entry->flags &= ~KVM_MSIX_FLAG_MASKBIT;
> > > +	}
> > > +
> > > +out:
> > > +	mutex_unlock(&kvm->lock);
> > > +
> > > +	return r;
> > > +}
> > > +
> > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> > > +			      gpa_t addr, int len)
> > > +{
> > > +	gpa_t start, end;
> > > +
> > > +	BUG_ON(adev->msix_mmio_base == 0);
> > 
> > I thought we wanted to use a separate flag for this?
> 
> Oh, the flag was added, but I forgot to update this one.
> > 
> > > +	start = adev->msix_mmio_base;
> > > +	end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> > > +		adev->msix_max_entries_nr;
> > > +	if (addr >= start && addr + len <= end)
> > > +		return true;
> > > +
> > > +	return false;
> > > +}
> > > +
> > > +static int msix_get_enabled_idx(struct kvm_assigned_dev_kernel *adev,
> > > +			      gpa_t addr, int len)
> > > +{
> > > +	int i, index = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > +
> > > +	for (i = 0; i < adev->entries_nr; i++)
> > > +		if (adev->guest_msix_entries[i].entry == index)
> > > +			return i;
> > > +
> > > +	return -EINVAL;
> > > +}
> > > +
> > 
> > Hmm, we still have a linear scan on each write.  Is the point to detect
> > which entries need to be handled in kernel? Maybe another bitmap for
> > this?  Or just handle whole entry write in kernel as well, then we won't
> > need this at all.
> 
> I still don't like handling the whole entry write in kernel. In fact here we have two 
> questions:
> 1. Is the entry already enabled?
> 2. If it is, what is its sequence number?
> 
> A bitmap can answer question 1, but we would still need an O(n) scan for the 
> second, so I think leaving it as it is should be fine.

The spec quoted above implies we must handle all entry writes:
a single QWORD write can update both the mask bit and the message data.

> > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
> > > +			  void *val)
> > > +{
> > > +	struct kvm_assigned_dev_kernel *adev =
> > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > +				     msix_mmio_dev);
> > > +	int idx, r = 0;
> > > +	u32 entry[4];
> > > +	struct kvm_kernel_irq_routing_entry e;
> > > +
> > > +	mutex_lock(&adev->kvm->lock);
> > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > +		r = -EOPNOTSUPP;
> > > +		goto out;
> > > +	}
> > > +	if ((addr & 0x3) || len != 4)
> > > +		goto out;
> > > +
> > > +	idx = msix_get_enabled_idx(adev, addr, len);
> > > +	if (idx < 0) {
> > > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > +		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > +				PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > +			*(unsigned long *)val =
> > > +				test_bit(idx, adev->msix_mask_bitmap) ?
> > > +				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > > +		else
> > > +			r = -EOPNOTSUPP;
> > > +		goto out;
> > > +	}
> > > +
> > > +	r = kvm_get_irq_routing_entry(adev->kvm,
> > > +			adev->guest_msix_entries[idx].vector, &e);
> > > +	if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > +		r = -EOPNOTSUPP;
> > > +		goto out;
> > > +	}
> > > +	entry[0] = e.msi.address_lo;
> > > +	entry[1] = e.msi.address_hi;
> > > +	entry[2] = e.msi.data;
> > > +	entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > > +			adev->msix_mask_bitmap);
> > > +	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry], len);
> > 
> > It seems weird to pass writes to userspace but do reads
> > in kernel.
> > Either we should support entry read and write or neither.
> 
> Entry reads are handled in kernel for speed: the kernel writes the mask bit and 
> then reads it back to flush the write. I don't think reads matter much here, as 
> long as userspace still owns the routing table.
>  
> > > +
> > > +out:
> > > +	mutex_unlock(&adev->kvm->lock);
> > > +	return r;
> > > +}
> > > +
> > > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
> > > +			   const void *val)
> > > +{
> > > 
> > > +	struct kvm_assigned_dev_kernel *adev =
> > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > +				     msix_mmio_dev);
> > > +	int idx, r = 0;
> > > +	unsigned long new_val = *(unsigned long *)val;
> > 
> > Move this to after you did length and alignment checks,
> > it might not be safe here.
> > 
> > > +
> > > +	mutex_lock(&adev->kvm->lock);
> > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > +		r = -EOPNOTSUPP;
> > > +		goto out;
> > > +	}
> > > +	if ((addr & 0x3) || len != 4)
> > 
> > The length could be 8:
> > 
> > For all accesses to MSI-X Table and MSI-X PBA fields, software must use
> > aligned full DWORD or aligned full QWORD transactions; otherwise, the
> > result is undefined.
> > 
> > and
> > 
> > Software is permitted to fill in MSI-X Table entry DWORD fields
> > individually with DWORD writes, or software in certain cases is permitted
> > to fill in appropriate pairs of DWORDs with a single QWORD write.
> > Specifically, software is always permitted to fill in the Message Address
> > and Message Upper Address fields with a single QWORD write. If a given
> > entry is currently masked (via its Mask bit or the Function Mask bit),
> > software is permitted to fill in the Message Data and Vector Control
> > fields with a single QWORD write, taking advantage of the fact the Message
> > Data field is guaranteed to become visible to hardware no later than the
> > Vector Control field. However, if software wishes to mask a currently
> > unmasked entry (without setting the Function Mask bit), software must set
> > the entry’s Mask bit using a DWORD write to the Vector Control field,
> > since performing a QWORD write to the Message Data and Vector Control
> > fields might result in the Message Data field being modified before the
> > Mask bit in the Vector Control field becomes set.
>  
> I haven't seen any device use QWORD accesses, and QEMU can't handle QWORD MMIO 
> either. So I guess we can leave that for later.
> 
> > > +		goto out;
> > 
> > Why don't we pass to userspace on error?
> > Will make it easy to debug ...
> 
> OK, at least then we would know if the guest tries to access it with an 8-byte transaction.
> > 
> > > +
> > > +	idx = msix_get_enabled_idx(adev, addr, len);
> > > +	if (idx < 0) {
> > > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > +		if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > +				PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> > > +			if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > +				goto out;
> > > +			if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > +				set_bit(idx, adev->msix_mask_bitmap);
> > > +			else
> > > +				clear_bit(idx, adev->msix_mask_bitmap);
> > > +			/* It's possible that we need re-enable MSI-X, so go
> > > +			 * back to userspace */
> > > +		}
> > > +		/* Userspace would handle other MMIO writing */
> > > +		r = -EOPNOTSUPP;
> > > +		goto out;
> > > +	}
> > > +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > > +		r = -EOPNOTSUPP;
> > > +		goto out;
> > > +	}
> > > +	if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > +		goto out;
> > 
> > So if the write touches any other bit, we ignore mask bit write
> > completely? Don't even pass to userspace? why?
> 
> Because the other bits are reserved, such a write is illegal; the spec says the 
> result is undefined.
> 
> --
> regards
> Yang, Sheng
> 
> > 
> > > +	update_msix_mask(adev, idx, !!(new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT));
> > > +out:
> > > +	mutex_unlock(&adev->kvm->lock);
> > > +
> > > +	return r;
> > > +}
> > > +
> > > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > > +	.read     = msix_mmio_read,
> > > +	.write    = msix_mmio_write,
> > > +};
> > > +
> > > +static int kvm_vm_ioctl_update_msix_mmio(struct kvm *kvm,
> > > +				struct kvm_msix_mmio *msix_mmio)
> > > +{
> > > +	int r = 0;
> > > +	struct kvm_assigned_dev_kernel *adev;
> > > +
> > > +	if (msix_mmio->type != KVM_MSIX_TYPE_ASSIGNED_DEV)
> > > +		return -EINVAL;
> > > +
> > > +	if (!msix_mmio->flags)
> > > +		return -EINVAL;
> > > +
> > > +	mutex_lock(&kvm->lock);
> > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > +				      msix_mmio->id);
> > > +	if (!adev) {
> > > +		r = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +	if (msix_mmio->base_addr == 0) {
> > > +		r = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +	if (msix_mmio->max_entries_nr == 0 ||
> > > +			msix_mmio->max_entries_nr > KVM_MAX_MSIX_PER_DEV) {
> > > +		r = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +
> > > +	if ((msix_mmio->flags & KVM_MSIX_MMIO_FLAG_REGISTER) &&
> > > +			(msix_mmio->flags & KVM_MSIX_MMIO_FLAG_UNREGISTER)) {
> > > +		r = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +
> > > +	if (msix_mmio->flags & KVM_MSIX_MMIO_FLAG_REGISTER) {
> > > +		if (!(adev->flags & KVM_ASSIGNED_ENABLED_MSIX_MMIO)) {
> > > +			mutex_lock(&kvm->slots_lock);
> > > +			kvm_iodevice_init(&adev->msix_mmio_dev,
> > > +					&msix_mmio_ops);
> > > +			r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > +					&adev->msix_mmio_dev);
> > > +			if (!r)
> > > +				adev->flags |= KVM_ASSIGNED_ENABLED_MSIX_MMIO;
> > > +			mutex_unlock(&kvm->slots_lock);
> > > +		}
> > > +		if (!r) {
> > > +			adev->msix_mmio_base = msix_mmio->base_addr;
> > > +			adev->msix_max_entries_nr = msix_mmio->max_entries_nr;
> > > +		}
> > > +	} else if (msix_mmio->flags & KVM_MSIX_MMIO_FLAG_UNREGISTER)
> > > +		unregister_msix_mmio(kvm, adev);
> > > +out:
> > > +	mutex_unlock(&kvm->lock);
> > > +
> > > +	return r;
> > > +}
> > > 
> > >  #endif
> > >  
> > >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > 
> > > @@ -813,6 +1096,37 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm,
> > > unsigned ioctl,
> > > 
> > >  			goto out;
> > >  		
> > >  		break;
> > >  	
> > >  	}
> > > 
> > > +	case KVM_GET_MSIX_ENTRY: {
> > > +		struct kvm_msix_entry entry;
> > > +		r = -EFAULT;
> > > +		if (copy_from_user(&entry, argp, sizeof entry))
> > > +			goto out;
> > > +		r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> > > +		if (r)
> > > +			goto out;
> > > +		r = -EFAULT;
> > > +		if (copy_to_user(argp, &entry, sizeof entry))
> > > +			goto out;
> > > +		r = 0;
> > > +		break;
> > > +	}
> > > +	case KVM_UPDATE_MSIX_MMIO: {
> > > +		struct kvm_msix_mmio msix_mmio;
> > > +
> > > +		r = -EFAULT;
> > > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > +			goto out;
> > > +
> > > +		r = -EINVAL;
> > > +		if (find_first_bit((unsigned long *)msix_mmio.reserved,
> > > +		    sizeof(msix_mmio.reserved)) < sizeof(msix_mmio.reserved))
> > > +			goto out;
> > > +
> > > +		r = kvm_vm_ioctl_update_msix_mmio(kvm, &msix_mmio);
> > > +		if (r)
> > > +			goto out;
> > > +		break;
> > > +	}
> > > 
> > >  #endif
> > >  
> > >  	}
> > >  
> > >  out:

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/7] KVM: assigned dev: MSI-X mask support
  2010-11-12 10:47       ` Michael S. Tsirkin
@ 2010-11-12 10:54         ` Sheng Yang
  2010-11-12 11:25           ` Michael S. Tsirkin
  0 siblings, 1 reply; 22+ messages in thread
From: Sheng Yang @ 2010-11-12 10:54 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Friday 12 November 2010 18:47:29 Michael S. Tsirkin wrote:
> On Fri, Nov 12, 2010 at 06:13:48PM +0800, Sheng Yang wrote:
> > On Friday 12 November 2010 17:53:13 Michael S. Tsirkin wrote:
> > > On Thu, Nov 11, 2010 at 03:46:59PM +0800, Sheng Yang wrote:
> > > > This patch enable per-vector mask for assigned devices using MSI-X.
> > > > 
> > > > This patch provided two new APIs: one is for guest to specific
> > > > device's MSI-X table address in MMIO, the other is for userspace to
> > > > get information about mask bit.
> > > > 
> > > > All the mask bit operation are kept in kernel, in order to
> > > > accelerate. Userspace shouldn't access the device MMIO directly for
> > > > the information, instead it should uses provided API to do so.
> > > > 
> > > > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > > > ---
> > > > 
> > > >  arch/x86/kvm/x86.c       |    1 +
> > > >  include/linux/kvm.h      |   32 +++++
> > > >  include/linux/kvm_host.h |    5 +
> > > >  virt/kvm/assigned-dev.c  |  316 +++++++++++++++++++++++++++++++++++++++++++++-
> > > >  4 files changed, 353 insertions(+), 1 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index f3f86b2..847f1e1 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > 
> > > >  	case KVM_CAP_DEBUGREGS:
> > > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > 
> > > >  	case KVM_CAP_XSAVE:
> > > > +	case KVM_CAP_MSIX_MASK:
> > > >  		r = 1;
> > > >  		break;
> > > >  	
> > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > index 919ae53..32cd244 100644
> > > > --- a/include/linux/kvm.h
> > > > +++ b/include/linux/kvm.h
> > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > > 
> > > >  #endif
> > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > 
> > > > +#ifdef __KVM_HAVE_MSIX
> > > > +#define KVM_CAP_MSIX_MASK 59
> > > > +#endif
> > > > 
> > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > 
> > > > @@ -671,6 +674,9 @@ struct kvm_clock_data {
> > > > 
> > > >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct
> > > >  kvm_xen_hvm_config) #define KVM_SET_CLOCK             _IOW(KVMIO,
> > > >  0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK
> > > >  _IOR(KVMIO,  0x7c, struct kvm_clock_data)
> > > > 
> > > > +/* Available with KVM_CAP_MSIX_MASK */
> > > > +#define KVM_GET_MSIX_ENTRY        _IOWR(KVMIO,  0x7d, struct
> > > > kvm_msix_entry) +#define KVM_UPDATE_MSIX_MMIO      _IOW(KVMIO,  0x7e,
> > > > struct kvm_msix_mmio)
> > > > 
> > > >  /* Available with KVM_CAP_PIT_STATE2 */
> > > >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct
> > > >  kvm_pit_state2) #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0,
> > > >  struct kvm_pit_state2)
> > > > 
> > > > @@ -794,4 +800,30 @@ struct kvm_assigned_msix_entry {
> > > > 
> > > >  	__u16 padding[3];
> > > >  
> > > >  };
> > > > 
> > > > +#define KVM_MSIX_TYPE_ASSIGNED_DEV	1
> > > > +
> > > > +#define KVM_MSIX_FLAG_MASKBIT		(1 << 0)
> > > > +#define KVM_MSIX_FLAG_QUERY_MASKBIT	(1 << 0)
> > > > +
> > > > +struct kvm_msix_entry {
> > > > +	__u32 id;
> > > > +	__u32 type;
> > > > +	__u32 entry; /* The index of entry in the MSI-X table */
> > > > +	__u32 flags;
> > > > +	__u32 query_flags;
> > > > +	__u32 reserved[5];
> > > > +};
> > > > +
> > > > +#define KVM_MSIX_MMIO_FLAG_REGISTER	(1 << 0)
> > > > +#define KVM_MSIX_MMIO_FLAG_UNREGISTER	(1 << 1)
> > > > +
> > > > +struct kvm_msix_mmio {
> > > > +	__u32 id;
> > > > +	__u32 type;
> > > > +	__u64 base_addr;
> > > > +	__u32 max_entries_nr;
> > > > +	__u32 flags;
> > > > +	__u32 reserved[6];
> > > > +};
> > > > +
> > > > 
> > > >  #endif /* __LINUX_KVM_H */
> > > > 
> > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > index 4b31539..9d49074 100644
> > > > --- a/include/linux/kvm_host.h
> > > > +++ b/include/linux/kvm_host.h
> > > > @@ -445,6 +445,7 @@ struct kvm_guest_msix_entry {
> > > > 
> > > >  };
> > > >  
> > > >  #define KVM_ASSIGNED_ENABLED_IOMMU		(1 << 0)
> > > > 
> > > > +#define KVM_ASSIGNED_ENABLED_MSIX_MMIO		(1 << 1)
> > > > 
> > > >  struct kvm_assigned_dev_kernel {
> > > >  
> > > >  	struct kvm_irq_ack_notifier ack_notifier;
> > > >  	struct work_struct interrupt_work;
> > > > 
> > > > @@ -465,6 +466,10 @@ struct kvm_assigned_dev_kernel {
> > > > 
> > > >  	struct pci_dev *dev;
> > > >  	struct kvm *kvm;
> > > >  	spinlock_t assigned_dev_lock;
> > > > 
> > > > +	DECLARE_BITMAP(msix_mask_bitmap, KVM_MAX_MSIX_PER_DEV);
> > > > +	gpa_t msix_mmio_base;
> > > > +	struct kvm_io_device msix_mmio_dev;
> > > > +	int msix_max_entries_nr;
> > > > 
> > > >  };
> > > >  
> > > >  struct kvm_irq_mask_notifier {
> > > > 
> > > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > > index 5c6b96d..3010d7d 100644
> > > > --- a/virt/kvm/assigned-dev.c
> > > > +++ b/virt/kvm/assigned-dev.c
> > > > @@ -226,12 +226,27 @@ static void kvm_free_assigned_irq(struct kvm
> > > > *kvm,
> > > > 
> > > >  	kvm_deassign_irq(kvm, assigned_dev,
> > > >  	assigned_dev->irq_requested_type);
> > > >  
> > > >  }
> > > > 
> > > > +static void unregister_msix_mmio(struct kvm *kvm,
> > > > +				 struct kvm_assigned_dev_kernel *adev)
> > > > +{
> > > > +	if (adev->flags & KVM_ASSIGNED_ENABLED_MSIX_MMIO) {
> > > > +		mutex_lock(&kvm->slots_lock);
> > > > +		kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > > > +				&adev->msix_mmio_dev);
> > > > +		mutex_unlock(&kvm->slots_lock);
> > > > +		adev->flags &= ~KVM_ASSIGNED_ENABLED_MSIX_MMIO;
> > > > +	}
> > > > +}
> > > > +
> > > > 
> > > >  static void kvm_free_assigned_device(struct kvm *kvm,
> > > >  
> > > >  				     struct kvm_assigned_dev_kernel
> > > >  				     *assigned_dev)
> > > >  
> > > >  {
> > > >  
> > > >  	kvm_free_assigned_irq(kvm, assigned_dev);
> > > > 
> > > > +#ifdef __KVM_HAVE_MSIX
> > > > +	unregister_msix_mmio(kvm, assigned_dev);
> > > > +#endif
> > > > 
> > > >  	pci_reset_function(assigned_dev->dev);
> > > >  	
> > > >  	pci_release_regions(assigned_dev->dev);
> > > > 
> > > > @@ -504,7 +519,7 @@ out:
> > > >  static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> > > >  
> > > >  				      struct kvm_assigned_pci_dev *assigned_dev)
> > > >  
> > > >  {
> > > > 
> > > > -	int r = 0, idx;
> > > > +	int r = 0, idx, i;
> > > > 
> > > >  	struct kvm_assigned_dev_kernel *match;
> > > >  	struct pci_dev *dev;
> > > > 
> > > > @@ -564,6 +579,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm
> > > > *kvm,
> > > > 
> > > >  	list_add(&match->list, &kvm->arch.assigned_dev_head);
> > > > 
> > > > +	/* The state after reset of MSI-X table is all masked */
> > > > +	for (i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
> > > > +		set_bit(i, match->msix_mask_bitmap);
> > > > +
> > > > 
> > > >  	if (assigned_dev->flags & KVM_ASSIGNED_ENABLED_IOMMU) {
> > > >  	
> > > >  		if (!kvm->arch.iommu_domain) {
> > > >  		
> > > >  			r = kvm_iommu_map_guest(kvm);
> > > > 
> > > > @@ -667,6 +686,43 @@ msix_nr_out:
> > > >  	return r;
> > > >  
> > > >  }
> > > > 
> > > > +static void update_msix_mask(struct kvm_assigned_dev_kernel *adev,
> > > > +			     int idx, bool new_mask_flag)
> > > > +{
> > > > +	int irq;
> > > > +	bool old_mask_flag, need_flush = false;
> > > > +
> > > > +	spin_lock_irq(&adev->assigned_dev_lock);
> > > > +
> > > > +	if (!adev->dev->msix_enabled ||
> > > > +	    !(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> > > > +		goto out;
> > > > +
> > > > +	old_mask_flag = test_bit(adev->guest_msix_entries[idx].entry,
> > > > +			adev->msix_mask_bitmap);
> > > > +	if (old_mask_flag == new_mask_flag)
> > > > +		goto out;
> > > > +
> > > > +	irq = adev->host_msix_entries[idx].vector;
> > > > +	BUG_ON(irq == 0);
> > > > +
> > > > +	if (new_mask_flag) {
> > > > +		set_bit(adev->guest_msix_entries[idx].entry,
> > > > +				adev->msix_mask_bitmap);
> > > > +		disable_irq_nosync(irq);
> > > > +		need_flush = true;
> > > > +	} else {
> > > > +		clear_bit(adev->guest_msix_entries[idx].entry,
> > > > +				adev->msix_mask_bitmap);
> > > > +		enable_irq(irq);
> > > > +	}
> > > > +out:
> > > > +	spin_unlock_irq(&adev->assigned_dev_lock);
> > > > +
> > > > +	if (need_flush)
> > > > +		flush_work(&adev->interrupt_work);
> > > > +}
> > > > +
> > > > 
> > > >  static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
> > > >  
> > > >  				       struct kvm_assigned_msix_entry *entry)
> > > >  
> > > >  {
> > > > 
> > > > @@ -701,6 +757,233 @@ msix_entry_out:
> > > >  	return r;
> > > >  
> > > >  }
> > > > 
> > > > +
> > > > +static int kvm_vm_ioctl_get_msix_entry(struct kvm *kvm,
> > > > +				       struct kvm_msix_entry *entry)
> > > > +{
> > > > +	int r = 0;
> > > > +	struct kvm_assigned_dev_kernel *adev;
> > > > +
> > > > +	if (entry->type != KVM_MSIX_TYPE_ASSIGNED_DEV)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (!entry->query_flags)
> > > > +		return -EINVAL;
> > > > +
> > > > +	mutex_lock(&kvm->lock);
> > > > +
> > > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > +				      entry->id);
> > > > +
> > > > +	if (!adev) {
> > > > +		r = -EINVAL;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	if (entry->entry >= adev->msix_max_entries_nr) {
> > > > +		r = -ENOSPC;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	if (entry->query_flags & KVM_MSIX_FLAG_QUERY_MASKBIT) {
> > > > +		if (test_bit(entry->entry, adev->msix_mask_bitmap))
> > > > +			entry->flags |= KVM_MSIX_FLAG_MASKBIT;
> > > > +		else
> > > > +			entry->flags &= ~KVM_MSIX_FLAG_MASKBIT;
> > > > +	}
> > > > +
> > > > +out:
> > > > +	mutex_unlock(&kvm->lock);
> > > > +
> > > > +	return r;
> > > > +}
> > > > +
> > > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> > > > +			      gpa_t addr, int len)
> > > > +{
> > > > +	gpa_t start, end;
> > > > +
> > > > +	BUG_ON(adev->msix_mmio_base == 0);
> > > 
> > > I thought we wanted to use a separate flag for this?
> > 
> > Oh, the flag was added, but I forgot to update this one.
> > 
> > > > +	start = adev->msix_mmio_base;
> > > > +	end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> > > > +		adev->msix_max_entries_nr;
> > > > +	if (addr >= start && addr + len <= end)
> > > > +		return true;
> > > > +
> > > > +	return false;
> > > > +}
> > > > +
> > > > +static int msix_get_enabled_idx(struct kvm_assigned_dev_kernel
> > > > *adev, +			      gpa_t addr, int len)
> > > > +{
> > > > +	int i, index = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > +
> > > > +	for (i = 0; i < adev->entries_nr; i++)
> > > > +		if (adev->guest_msix_entries[i].entry == index)
> > > > +			return i;
> > > > +
> > > > +	return -EINVAL;
> > > > +}
> > > > +
> > > 
> > > Hmm, we still have a linear scan on each write.  Is the point to detect
> > > which entries need to be handled in kernel? Maybe another bitmap for
> > > this?  Or just handle whole entry write in kernel as well, then we
> > > won't need this at all.
> > 
> > I still don't like handling the whole entry write. In fact here we have
> > two questions:
> > 1. Whether the entry is already enabled.
> > 2. If it is, what its sequence number is.
> > 
> > A bitmap can solve question 1, but we still need an O(n) scan for the second.
> > So I think leaving it like this should be fine.
> 
> The spec quoted above implies we must handle all entry writes:
> a single write can update mask and data.

Anyway, that's work to be done by QEmu. If we want to cover the situation where a 
single write can update both mask and data - that is surely a QWORD access, which 
hasn't been supported yet. 

We can come back to this if someone really does it that way. But for all 
hypervisors using QEmu, I don't think we have seen that kind of behavior yet. So we 
can leave it for later.
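For reference, the access rule quoted from the spec (aligned full-DWORD or aligned full-QWORD transactions only) boils down to a small check. The following is a user-space sketch with a hypothetical helper name, not part of the posted patch, which currently accepts len == 4 only:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: validate an MSI-X table access per the PCI spec, which permits
 * only aligned full-DWORD (4-byte) or aligned full-QWORD (8-byte)
 * transactions; anything else is undefined. */
static bool msix_access_ok(uint64_t addr, int len)
{
	if (len != 4 && len != 8)
		return false;			/* neither DWORD nor QWORD */
	return (addr & (uint64_t)(len - 1)) == 0;	/* naturally aligned */
}
```

Extending the in-kernel handler this way would still leave the QWORD data+mask update semantics to be dealt with, which is the open question in this subthread.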

--
regards
Yang, Sheng

> 
> > > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr,
> > > > int len, +			  void *val)
> > > > +{
> > > > +	struct kvm_assigned_dev_kernel *adev =
> > > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > > +				     msix_mmio_dev);
> > > > +	int idx, r = 0;
> > > > +	u32 entry[4];
> > > > +	struct kvm_kernel_irq_routing_entry e;
> > > > +
> > > > +	mutex_lock(&adev->kvm->lock);
> > > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > > +		r = -EOPNOTSUPP;
> > > > +		goto out;
> > > > +	}
> > > > +	if ((addr & 0x3) || len != 4)
> > > > +		goto out;
> > > > +
> > > > +	idx = msix_get_enabled_idx(adev, addr, len);
> > > > +	if (idx < 0) {
> > > > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > +		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > +				PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > > +			*(unsigned long *)val =
> > > > +				test_bit(idx, adev->msix_mask_bitmap) ?
> > > > +				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > > > +		else
> > > > +			r = -EOPNOTSUPP;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	r = kvm_get_irq_routing_entry(adev->kvm,
> > > > +			adev->guest_msix_entries[idx].vector, &e);
> > > > +	if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > > +		r = -EOPNOTSUPP;
> > > > +		goto out;
> > > > +	}
> > > > +	entry[0] = e.msi.address_lo;
> > > > +	entry[1] = e.msi.address_hi;
> > > > +	entry[2] = e.msi.data;
> > > > +	entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > > > +			adev->msix_mask_bitmap);
> > > > +	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry],
> > > > len);
> > > 
> > > It seems weird to pass writes to userspace but do reads
> > > in kernel.
> > > Either we should support entry read and write or neither.
> > 
> > Entry read is for speed-up, because the kernel would write to the mask bit
> > and then try to read from it for flushing. I don't think reading matters much
> > here, as long as userspace still owns the routing table.
> > 
> > > > +
> > > > +out:
> > > > +	mutex_unlock(&adev->kvm->lock);
> > > > +	return r;
> > > > +}
> > > > +
> > > > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr,
> > > > int len, +			   const void *val)
> > > > +{
> > > > 
> > > > +	struct kvm_assigned_dev_kernel *adev =
> > > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > > +				     msix_mmio_dev);
> > > > +	int idx, r = 0;
> > > > +	unsigned long new_val = *(unsigned long *)val;
> > > 
> > > Move this to after you did length and alignment checks,
> > > it might not be safe here.
> > > 
> > > > +
> > > > +	mutex_lock(&adev->kvm->lock);
> > > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > > +		r = -EOPNOTSUPP;
> > > > +		goto out;
> > > > +	}
> > > > +	if ((addr & 0x3) || len != 4)
> > > 
> > > The lenght could be 8:
> > > 
> > > For all accesses to MSI-X Table and MSI-X PBA fields, software must use
> > > aligned full DWORD or aligned full QWORD transactions; otherwise, the
> > > result is undefined.
> > > 
> > > and
> > > 
> > > Software is permitted to fill in MSI-X Table entry DWORD fields
> > > individually with DWORD writes, or software in certain cases is
> > > permitted to fill in appropriate pairs of DWORDs with a single QWORD
> > > write. Specifically, software is always permitted to fill in the
> > > Message Address and Message Upper Address fields with a single QWORD
> > > write. If a given entry is currently masked (via its Mask bit or the
> > > Function Mask bit), software is permitted to fill in the Message Data
> > > and Vector Control fields with a single QWORD write, taking advantage
> > > of the fact the Message Data field is guaranteed to become visible to
> > > hardware no later than the Vector Control field. However, if software
> > > wishes to mask a currently unmasked entry (without setting the
> > > Function Mask bit), software must set the entry’s Mask bit using a
> > > DWORD write to the Vector Control field, since performing a QWORD
> > > write to the Message Data and Vector Control fields might result in
> > > the Message Data field being modified before the Mask bit in the
> > > Vector Control field becomes set.
> > 
> > I haven't seen any device use QWORD accesses. Also, QEmu can't handle QWORD
> > MMIO either. So I guess we can leave it for later.
> > 
> > > > +		goto out;
> > > 
> > > Why don't we pass to userspace on error?
> > > Will make it easy to debug ...
> > 
> > OK, at least we can know if the guest wants to access it using 8 bytes.
> > 
> > > > +
> > > > +	idx = msix_get_enabled_idx(adev, addr, len);
> > > > +	if (idx < 0) {
> > > > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > +		if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > +				PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> > > > +			if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > +				goto out;
> > > > +			if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > +				set_bit(idx, adev->msix_mask_bitmap);
> > > > +			else
> > > > +				clear_bit(idx, adev->msix_mask_bitmap);
> > > > +			/* It's possible that we need re-enable MSI-X, so go
> > > > +			 * back to userspace */
> > > > +		}
> > > > +		/* Userspace would handle other MMIO writing */
> > > > +		r = -EOPNOTSUPP;
> > > > +		goto out;
> > > > +	}
> > > > +	if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > > > +		r = -EOPNOTSUPP;
> > > > +		goto out;
> > > > +	}
> > > > +	if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > +		goto out;
> > > 
> > > So if the write touches any other bit, we ignore mask bit write
> > > completely? Don't even pass to userspace? why?
> > 
> > Because the other bits are reserved; such a write is illegal. The spec says
> > the result is undefined.
> > 
> > --
> > regards
> > Yang, Sheng
> > 
> > > > +	update_msix_mask(adev, idx, !!(new_val &
> > > > PCI_MSIX_ENTRY_CTRL_MASKBIT)); +out:
> > > > +	mutex_unlock(&adev->kvm->lock);
> > > > +
> > > > +	return r;
> > > > +}
> > > > +
> > > > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > > > +	.read     = msix_mmio_read,
> > > > +	.write    = msix_mmio_write,
> > > > +};
> > > > +
> > > > +static int kvm_vm_ioctl_update_msix_mmio(struct kvm *kvm,
> > > > +				struct kvm_msix_mmio *msix_mmio)
> > > > +{
> > > > +	int r = 0;
> > > > +	struct kvm_assigned_dev_kernel *adev;
> > > > +
> > > > +	if (msix_mmio->type != KVM_MSIX_TYPE_ASSIGNED_DEV)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (!msix_mmio->flags)
> > > > +		return -EINVAL;
> > > > +
> > > > +	mutex_lock(&kvm->lock);
> > > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > +				      msix_mmio->id);
> > > > +	if (!adev) {
> > > > +		r = -EINVAL;
> > > > +		goto out;
> > > > +	}
> > > > +	if (msix_mmio->base_addr == 0) {
> > > > +		r = -EINVAL;
> > > > +		goto out;
> > > > +	}
> > > > +	if (msix_mmio->max_entries_nr == 0 ||
> > > > +			msix_mmio->max_entries_nr > KVM_MAX_MSIX_PER_DEV) {
> > > > +		r = -EINVAL;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	if ((msix_mmio->flags & KVM_MSIX_MMIO_FLAG_REGISTER) &&
> > > > +			(msix_mmio->flags & KVM_MSIX_MMIO_FLAG_UNREGISTER)) {
> > > > +		r = -EINVAL;
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	if (msix_mmio->flags & KVM_MSIX_MMIO_FLAG_REGISTER) {
> > > > +		if (!(adev->flags & KVM_ASSIGNED_ENABLED_MSIX_MMIO)) {
> > > > +			mutex_lock(&kvm->slots_lock);
> > > > +			kvm_iodevice_init(&adev->msix_mmio_dev,
> > > > +					&msix_mmio_ops);
> > > > +			r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > > +					&adev->msix_mmio_dev);
> > > > +			if (!r)
> > > > +				adev->flags |= KVM_ASSIGNED_ENABLED_MSIX_MMIO;
> > > > +			mutex_unlock(&kvm->slots_lock);
> > > > +		}
> > > > +		if (!r) {
> > > > +			adev->msix_mmio_base = msix_mmio->base_addr;
> > > > +			adev->msix_max_entries_nr = msix_mmio->max_entries_nr;
> > > > +		}
> > > > +	} else if (msix_mmio->flags & KVM_MSIX_MMIO_FLAG_UNREGISTER)
> > > > +		unregister_msix_mmio(kvm, adev);
> > > > +out:
> > > > +	mutex_unlock(&kvm->lock);
> > > > +
> > > > +	return r;
> > > > +}
> > > > 
> > > >  #endif
> > > >  
> > > >  long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > > 
> > > > @@ -813,6 +1096,37 @@ long kvm_vm_ioctl_assigned_device(struct kvm
> > > > *kvm, unsigned ioctl,
> > > > 
> > > >  			goto out;
> > > >  		
> > > >  		break;
> > > >  	
> > > >  	}
> > > > 
> > > > +	case KVM_GET_MSIX_ENTRY: {
> > > > +		struct kvm_msix_entry entry;
> > > > +		r = -EFAULT;
> > > > +		if (copy_from_user(&entry, argp, sizeof entry))
> > > > +			goto out;
> > > > +		r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> > > > +		if (r)
> > > > +			goto out;
> > > > +		r = -EFAULT;
> > > > +		if (copy_to_user(argp, &entry, sizeof entry))
> > > > +			goto out;
> > > > +		r = 0;
> > > > +		break;
> > > > +	}
> > > > +	case KVM_UPDATE_MSIX_MMIO: {
> > > > +		struct kvm_msix_mmio msix_mmio;
> > > > +
> > > > +		r = -EFAULT;
> > > > +		if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > > +			goto out;
> > > > +
> > > > +		r = -EINVAL;
> > > > +		if (find_first_bit((unsigned long *)msix_mmio.reserved,
> > > > +		    sizeof(msix_mmio.reserved)) < sizeof(msix_mmio.reserved))
> > > > +			goto out;
> > > > +
> > > > +		r = kvm_vm_ioctl_update_msix_mmio(kvm, &msix_mmio);
> > > > +		if (r)
> > > > +			goto out;
> > > > +		break;
> > > > +	}
> > > > 
> > > >  #endif
> > > >  
> > > >  	}
> > > >  
> > > >  out:

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/7] KVM: assigned dev: MSI-X mask support
  2010-11-12 10:54         ` Sheng Yang
@ 2010-11-12 11:25           ` Michael S. Tsirkin
  2010-11-15  7:37             ` Sheng Yang
  0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2010-11-12 11:25 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Fri, Nov 12, 2010 at 06:54:01PM +0800, Sheng Yang wrote:
> On Friday 12 November 2010 18:47:29 Michael S. Tsirkin wrote:
> > On Fri, Nov 12, 2010 at 06:13:48PM +0800, Sheng Yang wrote:
> > > On Friday 12 November 2010 17:53:13 Michael S. Tsirkin wrote:
> > > > On Thu, Nov 11, 2010 at 03:46:59PM +0800, Sheng Yang wrote:
> > > > > This patch enable per-vector mask for assigned devices using MSI-X.
> > > > > 
> > > > > This patch provided two new APIs: one is for guest to specific
> > > > > device's MSI-X table address in MMIO, the other is for userspace to
> > > > > get information about mask bit.
> > > > > 
> > > > > All the mask bit operation are kept in kernel, in order to
> > > > > accelerate. Userspace shouldn't access the device MMIO directly for
> > > > > the information, instead it should uses provided API to do so.
> > > > > 
> > > > > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > > > > ---
> > > > > 
> > > > >  arch/x86/kvm/x86.c       |    1 +
> > > > >  include/linux/kvm.h      |   32 +++++
> > > > >  include/linux/kvm_host.h |    5 +
> > > > >  virt/kvm/assigned-dev.c  |  316
> > > > >  +++++++++++++++++++++++++++++++++++++++++++++- 4 files 
> changed,
> > > 
> > > 353
> > > 
> > > > >  insertions(+), 1 deletions(-)
> > > > > 
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index f3f86b2..847f1e1 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > > 
> > > > >  	case KVM_CAP_DEBUGREGS:
> > > > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > > 
> > > > >  	case KVM_CAP_XSAVE:
> > > > > +	case KVM_CAP_MSIX_MASK:
> > > > >  		r = 1;
> > > > >  		break;
> > > > >  	
> > > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > index 919ae53..32cd244 100644
> > > > > --- a/include/linux/kvm.h
> > > > > +++ b/include/linux/kvm.h
> > > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > > > 
> > > > >  #endif
> > > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > > 
> > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > +#define KVM_CAP_MSIX_MASK 59
> > > > > +#endif
> > > > > 
> > > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > > 
> > > > > @@ -671,6 +674,9 @@ struct kvm_clock_data {
> > > > > 
> > > > >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct
> > > > >  kvm_xen_hvm_config) #define KVM_SET_CLOCK             _IOW(KVMIO,
> > > > >  0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK
> > > > >  _IOR(KVMIO,  0x7c, struct kvm_clock_data)
> > > > > 
> > > > > +/* Available with KVM_CAP_MSIX_MASK */
> > > > > +#define KVM_GET_MSIX_ENTRY        _IOWR(KVMIO,  0x7d, struct
> > > > > kvm_msix_entry) +#define KVM_UPDATE_MSIX_MMIO      _IOW(KVMIO,  0x7e,
> > > > > struct kvm_msix_mmio)
> > > > > 
> > > > >  /* Available with KVM_CAP_PIT_STATE2 */
> > > > >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct
> > > > >  kvm_pit_state2) #define KVM_SET_PIT2              _IOW(KVMIO,  0xa0,
> > > > >  struct kvm_pit_state2)
> > > > > 
> > > > > @@ -794,4 +800,30 @@ struct kvm_assigned_msix_entry {
> > > > > 
> > > > >  	__u16 padding[3];
> > > > >  
> > > > >  };
> > > > > 
> > > > > +#define KVM_MSIX_TYPE_ASSIGNED_DEV	1
> > > > > +
> > > > > +#define KVM_MSIX_FLAG_MASKBIT		(1 << 0)
> > > > > +#define KVM_MSIX_FLAG_QUERY_MASKBIT	(1 << 0)
> > > > > +
> > > > > +struct kvm_msix_entry {
> > > > > +	__u32 id;
> > > > > +	__u32 type;
> > > > > +	__u32 entry; /* The index of entry in the MSI-X table */
> > > > > +	__u32 flags;
> > > > > +	__u32 query_flags;
> > > > > +	__u32 reserved[5];
> > > > > +};
> > > > > +
> > > > > +#define KVM_MSIX_MMIO_FLAG_REGISTER	(1 << 0)
> > > > > +#define KVM_MSIX_MMIO_FLAG_UNREGISTER	(1 << 1)
> > > > > +
> > > > > +struct kvm_msix_mmio {
> > > > > +	__u32 id;
> > > > > +	__u32 type;
> > > > > +	__u64 base_addr;
> > > > > +	__u32 max_entries_nr;
> > > > > +	__u32 flags;
> > > > > +	__u32 reserved[6];
> > > > > +};
> > > > > +
> > > > > 
> > > > >  #endif /* __LINUX_KVM_H */
> > > > > 
> > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > index 4b31539..9d49074 100644
> > > > > --- a/include/linux/kvm_host.h
> > > > > +++ b/include/linux/kvm_host.h
> > > > > @@ -445,6 +445,7 @@ struct kvm_guest_msix_entry {
> > > > > 
> > > > >  };
> > > > >  
> > > > >  #define KVM_ASSIGNED_ENABLED_IOMMU		(1 << 0)
> > > > > 
> > > > > +#define KVM_ASSIGNED_ENABLED_MSIX_MMIO		(1 << 1)
> > > > > 
> > > > >  struct kvm_assigned_dev_kernel {
> > > > >  
> > > > >  	struct kvm_irq_ack_notifier ack_notifier;
> > > > >  	struct work_struct interrupt_work;
> > > > > 
> > > > > @@ -465,6 +466,10 @@ struct kvm_assigned_dev_kernel {
> > > > > 
> > > > >  	struct pci_dev *dev;
> > > > >  	struct kvm *kvm;
> > > > >  	spinlock_t assigned_dev_lock;
> > > > > 
> > > > > +	DECLARE_BITMAP(msix_mask_bitmap, KVM_MAX_MSIX_PER_DEV);
> > > > > +	gpa_t msix_mmio_base;
> > > > > +	struct kvm_io_device msix_mmio_dev;
> > > > > +	int msix_max_entries_nr;
> > > > > 
> > > > >  };
> > > > >  
> > > > >  struct kvm_irq_mask_notifier {
> > > > > 
> > > > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > > > index 5c6b96d..3010d7d 100644
> > > > > --- a/virt/kvm/assigned-dev.c
> > > > > +++ b/virt/kvm/assigned-dev.c
> > > > > @@ -226,12 +226,27 @@ static void kvm_free_assigned_irq(struct kvm
> > > > > *kvm,
> > > > > 
> > > > >  	kvm_deassign_irq(kvm, assigned_dev,
> > > > >  	assigned_dev->irq_requested_type);
> > > > >  
> > > > >  }
> > > > > 
> > > > > +static void unregister_msix_mmio(struct kvm *kvm,
> > > > > +				 struct kvm_assigned_dev_kernel *adev)
> > > > > +{
> > > > > +	if (adev->flags & KVM_ASSIGNED_ENABLED_MSIX_MMIO) {
> > > > > +		mutex_lock(&kvm->slots_lock);
> > > > > +		kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > > > > +				&adev->msix_mmio_dev);
> > > > > +		mutex_unlock(&kvm->slots_lock);
> > > > > +		adev->flags &= ~KVM_ASSIGNED_ENABLED_MSIX_MMIO;
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > 
> > > > >  static void kvm_free_assigned_device(struct kvm *kvm,
> > > > >  
> > > > >  				     struct kvm_assigned_dev_kernel
> > > > >  				     *assigned_dev)
> > > > >  
> > > > >  {
> > > > >  
> > > > >  	kvm_free_assigned_irq(kvm, assigned_dev);
> > > > > 
> > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > +	unregister_msix_mmio(kvm, assigned_dev);
> > > > > +#endif
> > > > > 
> > > > >  	pci_reset_function(assigned_dev->dev);
> > > > >  	
> > > > >  	pci_release_regions(assigned_dev->dev);
> > > > > 
> > > > > @@ -504,7 +519,7 @@ out:
> > > > >  static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> > > > >  
> > > > >  				      struct kvm_assigned_pci_dev *assigned_dev)
> > > > >  
> > > > >  {
> > > > > 
> > > > > -	int r = 0, idx;
> > > > > +	int r = 0, idx, i;
> > > > > 
> > > > >  	struct kvm_assigned_dev_kernel *match;
> > > > >  	struct pci_dev *dev;
> > > > > 
> > > > > @@ -564,6 +579,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm
> > > > > *kvm,
> > > > > 
> > > > >  	list_add(&match->list, &kvm->arch.assigned_dev_head);
> > > > > 
> > > > > +	/* The state after reset of MSI-X table is all masked */
> > > > > +	for (i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
> > > > > +		set_bit(i, match->msix_mask_bitmap);
> > > > > +
> > > > > 
> > > > >  	if (assigned_dev->flags & KVM_ASSIGNED_ENABLED_IOMMU) {
> > > > >  	
> > > > >  		if (!kvm->arch.iommu_domain) {
> > > > >  		
> > > > >  			r = kvm_iommu_map_guest(kvm);
> > > > > 
> > > > > @@ -667,6 +686,43 @@ msix_nr_out:
> > > > >  	return r;
> > > > >  
> > > > >  }
> > > > > 
> > > > > +static void update_msix_mask(struct kvm_assigned_dev_kernel *adev,
> > > > > +			     int idx, bool new_mask_flag)
> > > > > +{
> > > > > +	int irq;
> > > > > +	bool old_mask_flag, need_flush = false;
> > > > > +
> > > > > +	spin_lock_irq(&adev->assigned_dev_lock);
> > > > > +
> > > > > +	if (!adev->dev->msix_enabled ||
> > > > > +	    !(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> > > > > +		goto out;
> > > > > +
> > > > > +	old_mask_flag = test_bit(adev->guest_msix_entries[idx].entry,
> > > > > +			adev->msix_mask_bitmap);
> > > > > +	if (old_mask_flag == new_mask_flag)
> > > > > +		goto out;
> > > > > +
> > > > > +	irq = adev->host_msix_entries[idx].vector;
> > > > > +	BUG_ON(irq == 0);
> > > > > +
> > > > > +	if (new_mask_flag) {
> > > > > +		set_bit(adev->guest_msix_entries[idx].entry,
> > > > > +				adev->msix_mask_bitmap);
> > > > > +		disable_irq_nosync(irq);
> > > > > +		need_flush = true;
> > > > > +	} else {
> > > > > +		clear_bit(adev->guest_msix_entries[idx].entry,
> > > > > +				adev->msix_mask_bitmap);
> > > > > +		enable_irq(irq);
> > > > > +	}
> > > > > +out:
> > > > > +	spin_unlock_irq(&adev->assigned_dev_lock);
> > > > > +
> > > > > +	if (need_flush)
> > > > > +		flush_work(&adev->interrupt_work);
> > > > > +}
> > > > > +
> > > > > 
> > > > >  static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
> > > > >  
> > > > >  				       struct kvm_assigned_msix_entry *entry)
> > > > >  
> > > > >  {
> > > > > 
> > > > > @@ -701,6 +757,233 @@ msix_entry_out:
> > > > >  	return r;
> > > > >  
> > > > >  }
> > > > > 
> > > > > +
> > > > > +static int kvm_vm_ioctl_get_msix_entry(struct kvm *kvm,
> > > > > +				       struct kvm_msix_entry *entry)
> > > > > +{
> > > > > +	int r = 0;
> > > > > +	struct kvm_assigned_dev_kernel *adev;
> > > > > +
> > > > > +	if (entry->type != KVM_MSIX_TYPE_ASSIGNED_DEV)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	if (!entry->query_flags)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	mutex_lock(&kvm->lock);
> > > > > +
> > > > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > > +				      entry->id);
> > > > > +
> > > > > +	if (!adev) {
> > > > > +		r = -EINVAL;
> > > > > +		goto out;
> > > > > +	}
> > > > > +
> > > > > +	if (entry->entry >= adev->msix_max_entries_nr) {
> > > > > +		r = -ENOSPC;
> > > > > +		goto out;
> > > > > +	}
> > > > > +
> > > > > +	if (entry->query_flags & KVM_MSIX_FLAG_QUERY_MASKBIT) {
> > > > > +		if (test_bit(entry->entry, adev->msix_mask_bitmap))
> > > > > +			entry->flags |= KVM_MSIX_FLAG_MASKBIT;
> > > > > +		else
> > > > > +			entry->flags &= ~KVM_MSIX_FLAG_MASKBIT;
> > > > > +	}
> > > > > +
> > > > > +out:
> > > > > +	mutex_unlock(&kvm->lock);
> > > > > +
> > > > > +	return r;
> > > > > +}
> > > > > +
> > > > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> > > > > +			      gpa_t addr, int len)
> > > > > +{
> > > > > +	gpa_t start, end;
> > > > > +
> > > > > +	BUG_ON(adev->msix_mmio_base == 0);
> > > > 
> > > > I thought we wanted to use a separate flag for this?
> > > 
> > > Oh, the flag was added, but I forgot to update this one.
> > > 
> > > > > +	start = adev->msix_mmio_base;
> > > > > +	end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> > > > > +		adev->msix_max_entries_nr;
> > > > > +	if (addr >= start && addr + len <= end)
> > > > > +		return true;
> > > > > +
> > > > > +	return false;
> > > > > +}
> > > > > +
> > > > > +static int msix_get_enabled_idx(struct kvm_assigned_dev_kernel
> > > > > *adev, +			      gpa_t addr, int len)
> > > > > +{
> > > > > +	int i, index = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > > +
> > > > > +	for (i = 0; i < adev->entries_nr; i++)
> > > > > +		if (adev->guest_msix_entries[i].entry == index)
> > > > > +			return i;
> > > > > +
> > > > > +	return -EINVAL;
> > > > > +}
> > > > > +
> > > > 
> > > > Hmm, we still have a linear scan on each write.  Is the point to detect
> > > > which entries need to be handled in kernel? Maybe another bitmap for
> > > > this?  Or just handle whole entry write in kernel as well, then we
> > > > won't need this at all.
> > > 
> > > I still don't like handling the whole entry write. In fact here we have
> > > two questions:
> > > 1. Whether the entry is already enabled.
> > > 2. If it is, what its sequence number is.
> > > 
> > > A bitmap can solve question 1, but we still need an O(n) scan for the second.
> > > So I think leaving it like this should be fine.
> > 
> > The spec quoted above implies we must handle all entry writes:
> > a single write can update mask and data.
> 
> Anyway, that's work to be done by QEmu. If we want to cover the situation where a 
> single write can update both mask and data - that is surely a QWORD access, which 
> hasn't been supported yet. 

Must pass the transactions to qemu then :)

> We can come back to this if someone really does it that way. But for all 
> hypervisors using QEmu, I don't think we have seen that kind of behavior yet.

I would rather stick to the spec than go figure out what BSD/Sun/Mac do,
or will do.

> So we 
> can leave it for later.
> 
> --
> regards
> Yang, Sheng
> 
> > 
> > > > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr,
> > > > > int len, +			  void *val)
> > > > > +{
> > > > > +	struct kvm_assigned_dev_kernel *adev =
> > > > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > > > +				     msix_mmio_dev);
> > > > > +	int idx, r = 0;
> > > > > +	u32 entry[4];
> > > > > +	struct kvm_kernel_irq_routing_entry e;
> > > > > +
> > > > > +	mutex_lock(&adev->kvm->lock);
> > > > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > > > +		r = -EOPNOTSUPP;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	if ((addr & 0x3) || len != 4)
> > > > > +		goto out;
> > > > > +
> > > > > +	idx = msix_get_enabled_idx(adev, addr, len);
> > > > > +	if (idx < 0) {
> > > > > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > > +		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > > +				PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > > > +			*(unsigned long *)val =
> > > > > +				test_bit(idx, adev->msix_mask_bitmap) ?
> > > > > +				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > > > > +		else
> > > > > +			r = -EOPNOTSUPP;
> > > > > +		goto out;
> > > > > +	}
> > > > > +
> > > > > +	r = kvm_get_irq_routing_entry(adev->kvm,
> > > > > +			adev->guest_msix_entries[idx].vector, &e);
> > > > > +	if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > > > +		r = -EOPNOTSUPP;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	entry[0] = e.msi.address_lo;
> > > > > +	entry[1] = e.msi.address_hi;
> > > > > +	entry[2] = e.msi.data;
> > > > > +	entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > > > > +			adev->msix_mask_bitmap);
> > > > > +	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry],
> > > > > len);
> > > > 
> > > > It seems weird to pass writes to userspace but do reads
> > > > in kernel.
> > > > Either we should support entry read and write or neither.
> > > 
> > > Entry reads are for speedup, because the kernel writes to the mask bit
> > > and then reads it back for flushing.

What I see Linux doing is reading the 1st entry, always.
So this means we must handle all table reads in kernel;
we can't pass any to userspace. Anyway, only new kernels do this,
and presumably they touch the MSI-X table rarely, so it should not
matter for performance?
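The mask-then-read-back pattern behind this can be sketched as follows. This is only a hypothetical user-space model (plain memory instead of device MMIO; the accessor names merely mirror the kernel's readl/writel), not the actual drivers/pci/msi.c code:

```c
#include <stdint.h>

#define PCI_MSIX_ENTRY_SIZE         16
#define PCI_MSIX_ENTRY_VECTOR_CTRL  12
#define PCI_MSIX_ENTRY_CTRL_MASKBIT 1

/* Stand-ins for the kernel MMIO accessors (plain memory here). */
static inline uint32_t readl(const volatile void *a)
{
	return *(const volatile uint32_t *)a;
}
static inline void writel(uint32_t v, volatile void *a)
{
	*(volatile uint32_t *)a = v;
}

/* Mask or unmask one table entry, then read the table back to flush
 * the posted write - the table read MST describes above. */
static void msix_mask_entry(void *table, unsigned entry, int mask)
{
	char *ctrl = (char *)table + entry * PCI_MSIX_ENTRY_SIZE +
		     PCI_MSIX_ENTRY_VECTOR_CTRL;
	uint32_t bits = readl(ctrl);

	if (mask)
		bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
	else
		bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
	writel(bits, ctrl);
	(void)readl(table);	/* flush: guest read hits the MSI-X table */
}
```

On real hardware the final readl() forces the posted mask write out to the device, which is why the subsequent table read has to be served consistently in kernel.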

> Don't think reading matters much here, as long as userspace still owns
> the routing table.

Why insist on using the routing table then?  Since we handle the whole
entry for reads, let's do it for writes too?
The only justification for the split was that supposedly the registers were
always programmed separately. But now we see that the spec does not mandate
this.

> > > > > +
> > > > > +out:
> > > > > +	mutex_unlock(&adev->kvm->lock);
> > > > > +	return r;
> > > > > +}
> > > > > +
> > > > > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr,
> > > > > int len, +			   const void *val)
> > > > > +{
> > > > > 
> > > > > +	struct kvm_assigned_dev_kernel *adev =
> > > > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > > > +				     msix_mmio_dev);
> > > > > +	int idx, r = 0;
> > > > > +	unsigned long new_val = *(unsigned long *)val;
> > > > 
> > > > Move this to after you did length and alignment checks,
> > > > it might not be safe here.
> > > > 
> > > > > +
> > > > > +	mutex_lock(&adev->kvm->lock);
> > > > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > > > +		r = -EOPNOTSUPP;
> > > > > +		goto out;
> > > > > +	}
> > > > > +	if ((addr & 0x3) || len != 4)
> > > > 
> > > > The length could be 8:
> > > > 
> > > > For all accesses to MSI-X Table and MSI-X PBA fields, software must use
> > > > aligned full DWORD or aligned full QWORD transactions; otherwise, the
> > > > result is undefined.
> > > > 
> > > > and
> > > > 
> > > > Software is permitted to fill in MSI-X Table entry DWORD fields
> > > > individually with DWORD writes, or software in certain cases is
> > > > permitted to fill in appropriate pairs of DWORDs with a single QWORD
> > > > write. Specifically, software is always permitted to fill in the
> > > > Message Address and Message Upper Address fields with a single QWORD
> > > > write. If a given entry is currently masked (via its Mask bit or the
> > > > Function Mask bit), software is permitted to fill in the Message Data
> > > > and Vector Control fields with a single QWORD write, taking advantage
> > > > of the fact the Message Data field is guaranteed to become visible to
> > > > hardware no later than the Vector Control field. However, if software
> > > > wishes to mask a currently unmasked entry (without setting the
> > > > Function Mask bit), software must set the entry’s Mask bit using a
> > > > DWORD write to the Vector Control field, since performing a QWORD
> > > > write to the Message Data and Vector Control fields might result in
> > > > the Message Data field being modified before the Mask bit in the
> > > > Vector Control field becomes set.
> > > 
> > > Haven't seen any device use QWORD accesses. Also QEmu can't handle QWORD
> > > MMIO either.

That's a userspace bug.  Not sure whether this justifies copying the bug
into the kernel.
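For reference, the entry layout the quoted spec text applies to is four DWORDs per entry; a sketch matching the pci_regs.h offsets used elsewhere in this thread:

```c
#include <stdint.h>
#include <stddef.h>

/* One MSI-X table entry: four DWORDs, 16 bytes total. A QWORD access
 * covers either (addr_lo, addr_hi) or (data, vector_ctrl) as a pair. */
struct msix_entry_layout {
	uint32_t msg_addr_lo;	/* offset 0:  PCI_MSIX_ENTRY_LOWER_ADDR */
	uint32_t msg_addr_hi;	/* offset 4:  PCI_MSIX_ENTRY_UPPER_ADDR */
	uint32_t msg_data;	/* offset 8:  PCI_MSIX_ENTRY_DATA */
	uint32_t vector_ctrl;	/* offset 12: PCI_MSIX_ENTRY_VECTOR_CTRL,
				 * bit 0 = PCI_MSIX_ENTRY_CTRL_MASKBIT */
};
```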

> > >  So I guess we can leave it for later.

Maybe what we can do is handle the mask in kernel +
pass to userspace to handle the address/vector update?
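That split could look roughly like this: a hypothetical dispatch in which only aligned DWORD writes to the Vector Control field are claimed by the kernel, while everything else (including QWORD writes) is bounced to userspace via -EOPNOTSUPP. The function name and shape are illustrative, not from the patch:

```c
#include <errno.h>
#include <stdint.h>

#define PCI_MSIX_ENTRY_SIZE         16
#define PCI_MSIX_ENTRY_VECTOR_CTRL  12
#define PCI_MSIX_ENTRY_CTRL_MASKBIT 1

/* offset is relative to the MSI-X table base.  Returns 0 when the kernel
 * consumed the access (mask bit), -EOPNOTSUPP when userspace should. */
static int msix_table_write(uint64_t offset, int len, uint32_t val,
			    int *masked)
{
	if (len != 4 || (offset & 0x3))
		return -EOPNOTSUPP;	/* QWORD/unaligned: let userspace handle */
	if (offset % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL)
		return -EOPNOTSUPP;	/* address/data DWORD: let userspace handle */
	*masked = !!(val & PCI_MSIX_ENTRY_CTRL_MASKBIT);
	return 0;			/* mask bit handled in kernel */
}
```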

-- 
MST

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/7] KVM: assigned dev: MSI-X mask support
  2010-11-12 11:25           ` Michael S. Tsirkin
@ 2010-11-15  7:37             ` Sheng Yang
  2010-11-15  7:42               ` Michael S. Tsirkin
  0 siblings, 1 reply; 22+ messages in thread
From: Sheng Yang @ 2010-11-15  7:37 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Friday 12 November 2010 19:25:16 Michael S. Tsirkin wrote:
> On Fri, Nov 12, 2010 at 06:54:01PM +0800, Sheng Yang wrote:
> > On Friday 12 November 2010 18:47:29 Michael S. Tsirkin wrote:
> > > On Fri, Nov 12, 2010 at 06:13:48PM +0800, Sheng Yang wrote:
> > > > On Friday 12 November 2010 17:53:13 Michael S. Tsirkin wrote:
> > > > > On Thu, Nov 11, 2010 at 03:46:59PM +0800, Sheng Yang wrote:
> > > > > > This patch enables per-vector masking for assigned devices using
> > > > > > MSI-X.
> > > > > > 
> > > > > > This patch provides two new APIs: one for the guest to specify the
> > > > > > device's MSI-X table address in MMIO, the other for userspace to
> > > > > > get information about the mask bit.
> > > > > > 
> > > > > > All the mask bit operations are kept in the kernel, in order to
> > > > > > accelerate them. Userspace shouldn't access the device MMIO directly
> > > > > > for this information; instead it should use the provided API.
> > > > > > 
> > > > > > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > > > > > ---
> > > > > > 
> > > > > >  arch/x86/kvm/x86.c       |    1 +
> > > > > >  include/linux/kvm.h      |   32 +++++
> > > > > >  include/linux/kvm_host.h |    5 +
> > > > > >  virt/kvm/assigned-dev.c  |  316
> > > > > >  +++++++++++++++++++++++++++++++++++++++++++++- 4 files
> > 
> > changed,
> > 
> > > > 353
> > > > 
> > > > > >  insertions(+), 1 deletions(-)
> > > > > > 
> > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > index f3f86b2..847f1e1 100644
> > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > > > 
> > > > > >  	case KVM_CAP_DEBUGREGS:
> > > > > >  	case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > > > 
> > > > > >  	case KVM_CAP_XSAVE:
> > > > > > +	case KVM_CAP_MSIX_MASK:
> > > > > >  		r = 1;
> > > > > >  		break;
> > > > > >  	
> > > > > >  	case KVM_CAP_COALESCED_MMIO:
> > > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > > index 919ae53..32cd244 100644
> > > > > > --- a/include/linux/kvm.h
> > > > > > +++ b/include/linux/kvm.h
> > > > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > > > > 
> > > > > >  #endif
> > > > > >  #define KVM_CAP_PPC_GET_PVINFO 57
> > > > > >  #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > > > 
> > > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > > +#define KVM_CAP_MSIX_MASK 59
> > > > > > +#endif
> > > > > > 
> > > > > >  #ifdef KVM_CAP_IRQ_ROUTING
> > > > > > 
> > > > > > @@ -671,6 +674,9 @@ struct kvm_clock_data {
> > > > > > 
> > > > > >  #define KVM_XEN_HVM_CONFIG        _IOW(KVMIO,  0x7a, struct
> > > > > >  kvm_xen_hvm_config) #define KVM_SET_CLOCK            
> > > > > >  _IOW(KVMIO, 0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK
> > > > > >  _IOR(KVMIO,  0x7c, struct kvm_clock_data)
> > > > > > 
> > > > > > +/* Available with KVM_CAP_MSIX_MASK */
> > > > > > +#define KVM_GET_MSIX_ENTRY        _IOWR(KVMIO,  0x7d, struct
> > > > > > kvm_msix_entry) +#define KVM_UPDATE_MSIX_MMIO      _IOW(KVMIO, 
> > > > > > 0x7e, struct kvm_msix_mmio)
> > > > > > 
> > > > > >  /* Available with KVM_CAP_PIT_STATE2 */
> > > > > >  #define KVM_GET_PIT2              _IOR(KVMIO,  0x9f, struct
> > > > > >  kvm_pit_state2) #define KVM_SET_PIT2              _IOW(KVMIO, 
> > > > > >  0xa0, struct kvm_pit_state2)
> > > > > > 
> > > > > > @@ -794,4 +800,30 @@ struct kvm_assigned_msix_entry {
> > > > > > 
> > > > > >  	__u16 padding[3];
> > > > > >  
> > > > > >  };
> > > > > > 
> > > > > > +#define KVM_MSIX_TYPE_ASSIGNED_DEV	1
> > > > > > +
> > > > > > +#define KVM_MSIX_FLAG_MASKBIT		(1 << 0)
> > > > > > +#define KVM_MSIX_FLAG_QUERY_MASKBIT	(1 << 0)
> > > > > > +
> > > > > > +struct kvm_msix_entry {
> > > > > > +	__u32 id;
> > > > > > +	__u32 type;
> > > > > > +	__u32 entry; /* The index of entry in the MSI-X table */
> > > > > > +	__u32 flags;
> > > > > > +	__u32 query_flags;
> > > > > > +	__u32 reserved[5];
> > > > > > +};
> > > > > > +
> > > > > > +#define KVM_MSIX_MMIO_FLAG_REGISTER	(1 << 0)
> > > > > > +#define KVM_MSIX_MMIO_FLAG_UNREGISTER	(1 << 1)
> > > > > > +
> > > > > > +struct kvm_msix_mmio {
> > > > > > +	__u32 id;
> > > > > > +	__u32 type;
> > > > > > +	__u64 base_addr;
> > > > > > +	__u32 max_entries_nr;
> > > > > > +	__u32 flags;
> > > > > > +	__u32 reserved[6];
> > > > > > +};
> > > > > > +
> > > > > > 
> > > > > >  #endif /* __LINUX_KVM_H */
> > > > > > 
> > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > > > > index 4b31539..9d49074 100644
> > > > > > --- a/include/linux/kvm_host.h
> > > > > > +++ b/include/linux/kvm_host.h
> > > > > > @@ -445,6 +445,7 @@ struct kvm_guest_msix_entry {
> > > > > > 
> > > > > >  };
> > > > > >  
> > > > > >  #define KVM_ASSIGNED_ENABLED_IOMMU		(1 << 0)
> > > > > > 
> > > > > > +#define KVM_ASSIGNED_ENABLED_MSIX_MMIO		(1 << 1)
> > > > > > 
> > > > > >  struct kvm_assigned_dev_kernel {
> > > > > >  
> > > > > >  	struct kvm_irq_ack_notifier ack_notifier;
> > > > > >  	struct work_struct interrupt_work;
> > > > > > 
> > > > > > @@ -465,6 +466,10 @@ struct kvm_assigned_dev_kernel {
> > > > > > 
> > > > > >  	struct pci_dev *dev;
> > > > > >  	struct kvm *kvm;
> > > > > >  	spinlock_t assigned_dev_lock;
> > > > > > 
> > > > > > +	DECLARE_BITMAP(msix_mask_bitmap, KVM_MAX_MSIX_PER_DEV);
> > > > > > +	gpa_t msix_mmio_base;
> > > > > > +	struct kvm_io_device msix_mmio_dev;
> > > > > > +	int msix_max_entries_nr;
> > > > > > 
> > > > > >  };
> > > > > >  
> > > > > >  struct kvm_irq_mask_notifier {
> > > > > > 
> > > > > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > > > > index 5c6b96d..3010d7d 100644
> > > > > > --- a/virt/kvm/assigned-dev.c
> > > > > > +++ b/virt/kvm/assigned-dev.c
> > > > > > @@ -226,12 +226,27 @@ static void kvm_free_assigned_irq(struct
> > > > > > kvm *kvm,
> > > > > > 
> > > > > >  	kvm_deassign_irq(kvm, assigned_dev,
> > > > > >  	assigned_dev->irq_requested_type);
> > > > > >  
> > > > > >  }
> > > > > > 
> > > > > > +static void unregister_msix_mmio(struct kvm *kvm,
> > > > > > +				 struct kvm_assigned_dev_kernel *adev)
> > > > > > +{
> > > > > > +	if (adev->flags & KVM_ASSIGNED_ENABLED_MSIX_MMIO) {
> > > > > > +		mutex_lock(&kvm->slots_lock);
> > > > > > +		kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > > > > > +				&adev->msix_mmio_dev);
> > > > > > +		mutex_unlock(&kvm->slots_lock);
> > > > > > +		adev->flags &= ~KVM_ASSIGNED_ENABLED_MSIX_MMIO;
> > > > > > +	}
> > > > > > +}
> > > > > > +
> > > > > > 
> > > > > >  static void kvm_free_assigned_device(struct kvm *kvm,
> > > > > >  
> > > > > >  				     struct kvm_assigned_dev_kernel
> > > > > >  				     *assigned_dev)
> > > > > >  
> > > > > >  {
> > > > > >  
> > > > > >  	kvm_free_assigned_irq(kvm, assigned_dev);
> > > > > > 
> > > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > > +	unregister_msix_mmio(kvm, assigned_dev);
> > > > > > +#endif
> > > > > > 
> > > > > >  	pci_reset_function(assigned_dev->dev);
> > > > > >  	
> > > > > >  	pci_release_regions(assigned_dev->dev);
> > > > > > 
> > > > > > @@ -504,7 +519,7 @@ out:
> > > > > >  static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> > > > > >  
> > > > > >  				      struct kvm_assigned_pci_dev *assigned_dev)
> > > > > >  
> > > > > >  {
> > > > > > 
> > > > > > -	int r = 0, idx;
> > > > > > +	int r = 0, idx, i;
> > > > > > 
> > > > > >  	struct kvm_assigned_dev_kernel *match;
> > > > > >  	struct pci_dev *dev;
> > > > > > 
> > > > > > @@ -564,6 +579,10 @@ static int kvm_vm_ioctl_assign_device(struct
> > > > > > kvm *kvm,
> > > > > > 
> > > > > >  	list_add(&match->list, &kvm->arch.assigned_dev_head);
> > > > > > 
> > > > > > +	/* The state after reset of MSI-X table is all masked */
> > > > > > +	for (i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
> > > > > > +		set_bit(i, match->msix_mask_bitmap);
> > > > > > +
> > > > > > 
> > > > > >  	if (assigned_dev->flags & KVM_ASSIGNED_ENABLED_IOMMU) {
> > > > > >  	
> > > > > >  		if (!kvm->arch.iommu_domain) {
> > > > > >  		
> > > > > >  			r = kvm_iommu_map_guest(kvm);
> > > > > > 
> > > > > > @@ -667,6 +686,43 @@ msix_nr_out:
> > > > > >  	return r;
> > > > > >  
> > > > > >  }
> > > > > > 
> > > > > > +static void update_msix_mask(struct kvm_assigned_dev_kernel
> > > > > > *adev, +			     int idx, bool new_mask_flag)
> > > > > > +{
> > > > > > +	int irq;
> > > > > > +	bool old_mask_flag, need_flush = false;
> > > > > > +
> > > > > > +	spin_lock_irq(&adev->assigned_dev_lock);
> > > > > > +
> > > > > > +	if (!adev->dev->msix_enabled ||
> > > > > > +	    !(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> > > > > > +		goto out;
> > > > > > +
> > > > > > +	old_mask_flag = test_bit(adev->guest_msix_entries[idx].entry,
> > > > > > +			adev->msix_mask_bitmap);
> > > > > > +	if (old_mask_flag == new_mask_flag)
> > > > > > +		goto out;
> > > > > > +
> > > > > > +	irq = adev->host_msix_entries[idx].vector;
> > > > > > +	BUG_ON(irq == 0);
> > > > > > +
> > > > > > +	if (new_mask_flag) {
> > > > > > +		set_bit(adev->guest_msix_entries[idx].entry,
> > > > > > +				adev->msix_mask_bitmap);
> > > > > > +		disable_irq_nosync(irq);
> > > > > > +		need_flush = true;
> > > > > > +	} else {
> > > > > > +		clear_bit(adev->guest_msix_entries[idx].entry,
> > > > > > +				adev->msix_mask_bitmap);
> > > > > > +		enable_irq(irq);
> > > > > > +	}
> > > > > > +out:
> > > > > > +	spin_unlock_irq(&adev->assigned_dev_lock);
> > > > > > +
> > > > > > +	if (need_flush)
> > > > > > +		flush_work(&adev->interrupt_work);
> > > > > > +}
> > > > > > +
> > > > > > 
> > > > > >  static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
> > > > > >  
> > > > > >  				       struct kvm_assigned_msix_entry *entry)
> > > > > >  
> > > > > >  {
> > > > > > 
> > > > > > @@ -701,6 +757,233 @@ msix_entry_out:
> > > > > >  	return r;
> > > > > >  
> > > > > >  }
> > > > > > 
> > > > > > +
> > > > > > +static int kvm_vm_ioctl_get_msix_entry(struct kvm *kvm,
> > > > > > +				       struct kvm_msix_entry *entry)
> > > > > > +{
> > > > > > +	int r = 0;
> > > > > > +	struct kvm_assigned_dev_kernel *adev;
> > > > > > +
> > > > > > +	if (entry->type != KVM_MSIX_TYPE_ASSIGNED_DEV)
> > > > > > +		return -EINVAL;
> > > > > > +
> > > > > > +	if (!entry->query_flags)
> > > > > > +		return -EINVAL;
> > > > > > +
> > > > > > +	mutex_lock(&kvm->lock);
> > > > > > +
> > > > > > +	adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > > > +				      entry->id);
> > > > > > +
> > > > > > +	if (!adev) {
> > > > > > +		r = -EINVAL;
> > > > > > +		goto out;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (entry->entry >= adev->msix_max_entries_nr) {
> > > > > > +		r = -ENOSPC;
> > > > > > +		goto out;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (entry->query_flags & KVM_MSIX_FLAG_QUERY_MASKBIT) {
> > > > > > +		if (test_bit(entry->entry, adev->msix_mask_bitmap))
> > > > > > +			entry->flags |= KVM_MSIX_FLAG_MASKBIT;
> > > > > > +		else
> > > > > > +			entry->flags &= ~KVM_MSIX_FLAG_MASKBIT;
> > > > > > +	}
> > > > > > +
> > > > > > +out:
> > > > > > +	mutex_unlock(&kvm->lock);
> > > > > > +
> > > > > > +	return r;
> > > > > > +}
> > > > > > +
> > > > > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel
> > > > > > *adev, +			      gpa_t addr, int len)
> > > > > > +{
> > > > > > +	gpa_t start, end;
> > > > > > +
> > > > > > +	BUG_ON(adev->msix_mmio_base == 0);
> > > > > 
> > > > > I thought we wanted to use a separate flag for this?
> > > > 
> > > > O, flag added, but forgot to update this one.
> > > > 
> > > > > > +	start = adev->msix_mmio_base;
> > > > > > +	end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> > > > > > +		adev->msix_max_entries_nr;
> > > > > > +	if (addr >= start && addr + len <= end)
> > > > > > +		return true;
> > > > > > +
> > > > > > +	return false;
> > > > > > +}
> > > > > > +
> > > > > > +static int msix_get_enabled_idx(struct kvm_assigned_dev_kernel
> > > > > > *adev, +			      gpa_t addr, int len)
> > > > > > +{
> > > > > > +	int i, index = (addr - adev->msix_mmio_base) /
> > > > > > PCI_MSIX_ENTRY_SIZE; +
> > > > > > +	for (i = 0; i < adev->entries_nr; i++)
> > > > > > +		if (adev->guest_msix_entries[i].entry == index)
> > > > > > +			return i;
> > > > > > +
> > > > > > +	return -EINVAL;
> > > > > > +}
> > > > > > +
> > > > > 
> > > > > Hmm, we still have a linear scan on each write.  Is the point to
> > > > > detect which entries need to be handled in kernel? Maybe another
> > > > > bitmap for this?  Or just handle whole entry write in kernel as
> > > > > well, then we won't need this at all.
> > > > 
> > > > Still don't like handling the whole entry write. In fact we have
> > > > two questions here:
> > > > 1. Is the entry already enabled?
> > > > 2. If it is, what is its sequence number?
> > > > 
> > > > A bitmap can solve question 1, but we still need an O(n) scan for the
> > > > second. So I think leaving it like this should be fine.
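For what it's worth, both questions could be answered in O(1) with a single reverse map, at the cost of one small per-device array; this is only an illustrative sketch, not code from the patch:

```c
/* Hypothetical alternative to the linear scan in msix_get_enabled_idx():
 * a reverse map from guest MSI-X table index to the enabled-entry slot.
 * One array answers both "is it enabled?" and "which slot?" in O(1). */
#define KVM_MAX_MSIX_PER_DEV 256

struct msix_reverse_map {
	int entry_to_slot[KVM_MAX_MSIX_PER_DEV];	/* -1 = not enabled */
};

static void reverse_map_init(struct msix_reverse_map *m)
{
	for (int i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
		m->entry_to_slot[i] = -1;
}

/* Returns the slot number for an enabled entry, or -1. */
static int msix_get_enabled_slot(const struct msix_reverse_map *m,
				 unsigned index)
{
	return index < KVM_MAX_MSIX_PER_DEV ? m->entry_to_slot[index] : -1;
}
```

The map would be updated wherever guest_msix_entries[] is updated, so it stays a pure lookup on the MMIO fast path.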
> > > 
> > > The spec quoted above implies we must handle all entry writes:
> > > a single write can update mask and data.
> > 
> > Anyway, that's work to be done by QEmu. If we want to cover the
> > situation where a single write can update mask and data - that's surely
> > a QWORD write, which isn't supported yet.
> 
> Must pass the transactions to qemu then :)
> 
> > We can come back to this if someone really does it that way. But
> > for all hypervisors using QEmu, I don't think we have seen that kind of
> > behavior yet.
> 
> I would rather stick to the spec than go figure out what BSD/Sun/Mac do,
> or will do.

Sure, but there's no hurry for that. It's not like the API case, so we can
achieve it incrementally.
> 
> > So we
> > can leave it later.
> > 
> > --
> > regards
> > Yang, Sheng
> > 
> > > > > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t
> > > > > > addr, int len, +			  void *val)
> > > > > > +{
> > > > > > +	struct kvm_assigned_dev_kernel *adev =
> > > > > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > > > > +				     msix_mmio_dev);
> > > > > > +	int idx, r = 0;
> > > > > > +	u32 entry[4];
> > > > > > +	struct kvm_kernel_irq_routing_entry e;
> > > > > > +
> > > > > > +	mutex_lock(&adev->kvm->lock);
> > > > > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > > > > +		r = -EOPNOTSUPP;
> > > > > > +		goto out;
> > > > > > +	}
> > > > > > +	if ((addr & 0x3) || len != 4)
> > > > > > +		goto out;
> > > > > > +
> > > > > > +	idx = msix_get_enabled_idx(adev, addr, len);
> > > > > > +	if (idx < 0) {
> > > > > > +		idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > > > +		if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > > > +				PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > > > > +			*(unsigned long *)val =
> > > > > > +				test_bit(idx, adev->msix_mask_bitmap) ?
> > > > > > +				PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > > > > > +		else
> > > > > > +			r = -EOPNOTSUPP;
> > > > > > +		goto out;
> > > > > > +	}
> > > > > > +
> > > > > > +	r = kvm_get_irq_routing_entry(adev->kvm,
> > > > > > +			adev->guest_msix_entries[idx].vector, &e);
> > > > > > +	if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > > > > +		r = -EOPNOTSUPP;
> > > > > > +		goto out;
> > > > > > +	}
> > > > > > +	entry[0] = e.msi.address_lo;
> > > > > > +	entry[1] = e.msi.address_hi;
> > > > > > +	entry[2] = e.msi.data;
> > > > > > +	entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > > > > > +			adev->msix_mask_bitmap);
> > > > > > +	memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry],
> > > > > > len);
> > > > > 
> > > > > It seems weird to pass writes to userspace but do reads
> > > > > in kernel.
> > > > > Either we should support entry read and write or neither.
> > > > 
> > > > Entry reads are for speedup, because the kernel writes to the mask
> > > > bit and then reads it back for flushing.
> 
> What I see Linux doing is reading the 1st entry, always.
> So this means we must handle all table reads in kernel;
> we can't pass any to userspace. Anyway, only new kernels do this,
> and presumably they touch the MSI-X table rarely, so it should not
> matter for performance?

This change is still due to the performance issue we identified. The case
is when many SR-IOV VFs run on one machine with a new kernel: interrupts
would be very frequent, so the host may decide to mask some vectors
temporarily. That masking can also happen very frequently.
> 
> > Don't think reading matters much here, as long as userspace still owns
> > the routing table.
> 
> Why insist on using the routing table then?  Since we handle the whole
> entry for reads, let's do it for writes too?
> The only justification for the split was that supposedly the registers
> were always programmed separately. But now we see that the spec does not
> mandate this.

My point is, I don't really care whether we use the routing table. In this
case it does no more than provide a data/address pair. But I do want to push
the patch forward. The patch only needs a data/address pair, and right now
the routing table provides it. If you want to change this, you can modify it
later; it has nothing to do with the current patch's logic. You can simply
modify it as needed in the future.
> 
> > > > > > +
> > > > > > +out:
> > > > > > +	mutex_unlock(&adev->kvm->lock);
> > > > > > +	return r;
> > > > > > +}
> > > > > > +
> > > > > > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t
> > > > > > addr, int len, +			   const void *val)
> > > > > > +{
> > > > > > 
> > > > > > +	struct kvm_assigned_dev_kernel *adev =
> > > > > > +			container_of(this, struct kvm_assigned_dev_kernel,
> > > > > > +				     msix_mmio_dev);
> > > > > > +	int idx, r = 0;
> > > > > > +	unsigned long new_val = *(unsigned long *)val;
> > > > > 
> > > > > Move this to after you did length and alignment checks,
> > > > > it might not be safe here.
> > > > > 
> > > > > > +
> > > > > > +	mutex_lock(&adev->kvm->lock);
> > > > > > +	if (!msix_mmio_in_range(adev, addr, len)) {
> > > > > > +		r = -EOPNOTSUPP;
> > > > > > +		goto out;
> > > > > > +	}
> > > > > > +	if ((addr & 0x3) || len != 4)
> > > > > 
> > > > > The lenght could be 8:
> > > > > 
> > > > > For all accesses to MSI-X Table and MSI-X PBA fields, software must
> > > > > use aligned full DWORD or aligned full QWORD transactions;
> > > > > otherwise, the result is undefined.
> > > > > 
> > > > > and
> > > > > 
> > > > > Software is permitted to fill in MSI-X Table entry DWORD fields
> > > > > individually with DWORD writes, or software in certain cases is
> > > > > permitted to fill in appropriate pairs of DWORDs with a single
> > > > > QWORD write. Specifically, software is always permitted to fill in
> > > > > the Message Address and Message Upper Address fields with a single
> > > > > QWORD write. If a given entry is currently masked (via its Mask
> > > > > bit or the Function Mask bit), software is permitted to fill in
> > > > > the Message Data and Vector Control fields with a single QWORD
> > > > > write, taking advantage of the fact the Message Data field is
> > > > > guaranteed to become visible to hardware no later than the Vector
> > > > > Control field. However, if software wishes to mask a currently
> > > > > unmasked entry (without setting the Function Mask bit), software
> > > > > must set the entry’s Mask bit using a DWORD write to the Vector
> > > > > Control field, since performing a QWORD write to the Message Data
> > > > > and Vector Control fields might result in the Message Data field
> > > > > being modified before the Mask bit in the Vector Control field
> > > > > becomes set.
> > > > 
> > > > Haven't seen any device use QWORD accesses. Also QEmu can't handle
> > > > QWORD MMIO either.
> 
> That's a userspace bug.  Not sure whether this justifies copying the bug
> into the kernel.

We just haven't implemented the full spec, but we can still cover all known
situations now. So we can work on that later, incrementally.
> 
> > > >  So I guess we can leave it for later.
> 
> Maybe what we can do is handle the mask in kernel +
> pass to userspace to handle the address/vector update?

Maybe, and we can do that later. 

It's somewhat like a customer's demands changing all the time: we have to do
some things first and others later, and cut a release in between. It would
be incremental, not all-or-nothing. I'd like to have some code base checked
in first.

--
regards
Yang, Sheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/7] KVM: assigned dev: MSI-X mask support
  2010-11-15  7:37             ` Sheng Yang
@ 2010-11-15  7:42               ` Michael S. Tsirkin
  2010-11-15  7:48                 ` Sheng Yang
  0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2010-11-15  7:42 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Mon, Nov 15, 2010 at 03:37:21PM +0800, Sheng Yang wrote:
> > > We can come back to this if someone really does it that way. But
> > > for all hypervisors using QEmu, I don't think we have seen that kind
> > > of behavior yet.
> > 
> > I would rather stick to the spec than go figure out what BSD/Sun/Mac do,
> > or will do.
> 
> Sure, but there's no hurry for that. It's not like the API case, so we can
> achieve it incrementally.

Isn't the proposed way to solve this to move vector address/data
handling into kernel too? If yes it does affect the API.

-- 
MST

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/7] KVM: assigned dev: MSI-X mask support
  2010-11-15  7:42               ` Michael S. Tsirkin
@ 2010-11-15  7:48                 ` Sheng Yang
  2010-11-15  8:03                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 22+ messages in thread
From: Sheng Yang @ 2010-11-15  7:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Monday 15 November 2010 15:42:50 Michael S. Tsirkin wrote:
> On Mon, Nov 15, 2010 at 03:37:21PM +0800, Sheng Yang wrote:
> > > > We can come back to this if someone really does it that way.
> > > > But for all hypervisors using QEmu, I don't think we have seen that
> > > > kind of behavior yet.
> > > 
> > > I would rather stick to the spec than go figure out what BSD/Sun/Mac
> > > do, or will do.
> > 
> > Sure, but there's no hurry for that. It's not like the API case, so we
> > can achieve it incrementally.
> 
> Isn't the proposed way to solve this to move vector address/data
> handling into kernel too? If yes it does affect the API.

It doesn't affect the API used by this patch. So the code can still be
modified later.

--
regards
Yang, Sheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/7] PCI: Add mask bit definition for MSI-X table
  2010-11-11 17:29   ` Jesse Barnes
@ 2010-11-15  8:02     ` Sheng Yang
  0 siblings, 0 replies; 22+ messages in thread
From: Sheng Yang @ 2010-11-15  8:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Jesse Barnes, Marcelo Tosatti, Michael S. Tsirkin, kvm,
	Matthew Wilcox, linux-pci

On Friday 12 November 2010 01:29:29 Jesse Barnes wrote:
> On Thu, 11 Nov 2010 15:46:55 +0800
> 
> Sheng Yang <sheng@linux.intel.com> wrote:
> > Then we can use it instead of magic number 1.
> > 
> > Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
> > Cc: Matthew Wilcox <willy@linux.intel.com>
> > Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> > Cc: linux-pci@vger.kernel.org
> > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > ---
> > 
> >  drivers/pci/msi.c        |    5 +++--
> >  include/linux/pci_regs.h |    1 +
> >  2 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 69b7be3..095634e 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -158,8 +158,9 @@ static u32 __msix_mask_irq(struct msi_desc *desc, u32
> > flag)
> > 
> >  	u32 mask_bits = desc->masked;
> >  	unsigned offset = desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
> >  	
> >  						PCI_MSIX_ENTRY_VECTOR_CTRL;
> > 
> > -	mask_bits &= ~1;
> > -	mask_bits |= flag;
> > +	mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
> > +	if (flag)
> > +		mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
> > 
> >  	writel(mask_bits, desc->mask_base + offset);
> >  	
> >  	return mask_bits;
> > 
> > diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
> > index acfc224..ff51632 100644
> > --- a/include/linux/pci_regs.h
> > +++ b/include/linux/pci_regs.h
> > @@ -313,6 +313,7 @@
> > 
> >  #define  PCI_MSIX_ENTRY_UPPER_ADDR	4
> >  #define  PCI_MSIX_ENTRY_DATA		8
> >  #define  PCI_MSIX_ENTRY_VECTOR_CTRL	12
> > 
> > +#define   PCI_MSIX_ENTRY_CTRL_MASKBIT	1
> > 
> >  /* CompactPCI Hotswap Register */
> 
> Applied 1/7 and 2/7 to my linux-next tree, thanks.
> 
> If it's easier to push them both through the kvm tree let me know; you
> can just add my acked-by in that case.

Thanks Jesse!

Avi, which way do you prefer?

--
regards
Yang, Sheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/7] KVM: assigned dev: MSI-X mask support
  2010-11-15  7:48                 ` Sheng Yang
@ 2010-11-15  8:03                   ` Michael S. Tsirkin
  2010-11-15  8:22                     ` Sheng Yang
  0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2010-11-15  8:03 UTC (permalink / raw)
  To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm

On Mon, Nov 15, 2010 at 03:48:46PM +0800, Sheng Yang wrote:
> On Monday 15 November 2010 15:42:50 Michael S. Tsirkin wrote:
> > On Mon, Nov 15, 2010 at 03:37:21PM +0800, Sheng Yang wrote:
> > > > > We can come back to this if someone really does it that way.
> > > > > But for all hypervisors using QEmu, I don't think we have seen
> > > > > that kind of behavior yet.
> > > > 
> > > > I would rather stick to the spec than go figure out what BSD/Sun/Mac
> > > > do, or will do.
> > > 
> > > Sure, but there's no hurry for that. It's not like the API case, so we
> > > can achieve it incrementally.
> > 
> > Isn't the proposed way to solve this to move vector address/data
> > handling into kernel too? If yes it does affect the API.
> 
> It doesn't affect the API used by this patch. So the code can still be
> modified later.

Then won't we have to support two APIs, forever?

> --
> regards
> Yang, Sheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/7] KVM: assigned dev: MSI-X mask support
  2010-11-15  8:03                   ` Michael S. Tsirkin
@ 2010-11-15  8:22                     ` Sheng Yang
  0 siblings, 0 replies; 22+ messages in thread
From: Sheng Yang @ 2010-11-15  8:22 UTC (permalink / raw)
  To: Michael S. Tsirkin, Avi Kivity; +Cc: Marcelo Tosatti, kvm

On Monday 15 November 2010 16:03:53 Michael S. Tsirkin wrote:
> On Mon, Nov 15, 2010 at 03:48:46PM +0800, Sheng Yang wrote:
> > On Monday 15 November 2010 15:42:50 Michael S. Tsirkin wrote:
> > > On Mon, Nov 15, 2010 at 03:37:21PM +0800, Sheng Yang wrote:
> > > > > > We can come back to this if someone really does it that
> > > > > > way. But for all hypervisors using QEmu, I don't think we have
> > > > > > seen that kind of behavior yet.
> > > > > 
> > > > > I would rather stick to the spec than go figure out what
> > > > > BSD/Sun/Mac do, or will do.
> > > > 
> > > > Sure, but there's no hurry for that. It's not like the API case, so
> > > > we can achieve it incrementally.
> > > 
> > > Isn't the proposed way to solve this to move vector address/data
> > > handling into kernel too? If yes it does affect the API.
> > 
> > It doesn't affect the API used by this patch. So the code can still be
> > modified later.
> 
> Then won't we have to support two APIs, forever?

In fact, the logic in this patch is pretty straightforward. I don't think this
patch would trouble us if we really have to support two APIs (userspace and
in-kernel routing table) in the end. Just check msix_mmio_write(): it would
take only a few lines (maybe just one line, "r = -EOPNOTSUPP") to have the
modification handled in userspace instead. All the other logic would remain
mostly the same as in this patch.
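The in-kernel vs. userspace split described above can be illustrated with a
standalone sketch. This is not the actual patch code: struct msix_entry_state
and this simplified msix_mmio_write() are invented for illustration; only the
16-byte entry size and the vector-control dword at offset 12 come from the
MSI-X spec.

```c
#include <errno.h>
#include <stdint.h>

#define PCI_MSIX_ENTRY_SIZE        16  /* per the MSI-X spec */
#define PCI_MSIX_ENTRY_VECTOR_CTRL 12  /* dword holding the mask bit */

/* Hypothetical mirror of one MSI-X table entry (not the kernel's struct). */
struct msix_entry_state {
	uint32_t addr_lo, addr_hi, data, ctrl;
};

/*
 * Dispatch a 32-bit guest write into the MSI-X table: the vector-control
 * dword (which holds the mask bit) is handled here "in kernel"; any other
 * dword returns -EOPNOTSUPP so the caller can punt it to userspace.
 */
static int msix_mmio_write(struct msix_entry_state *table,
			   uint64_t offset, uint32_t val)
{
	uint64_t entry = offset / PCI_MSIX_ENTRY_SIZE;
	uint64_t dword = offset % PCI_MSIX_ENTRY_SIZE;

	if (dword != PCI_MSIX_ENTRY_VECTOR_CTRL)
		return -EOPNOTSUPP;  /* address/data writes: userspace's job */

	table[entry].ctrl = val;     /* mask/unmask handled in kernel */
	return 0;
}
```

Making the function return -EOPNOTSUPP unconditionally would be roughly the
one-line change referred to above, pushing all table writes to userspace.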

IMO mask bit support and irq routing are two separate things. It makes no
sense to block the mask bit support patch because of one idea for a possible
irq routing change in the future.

--
regards
Yang, Sheng

> 
> > --
> > regards
> > Yang, Sheng

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2010-11-15  8:21 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-11  7:46 [PATCH 0/7 v4] MSI-X mask support for assigned device Sheng Yang
2010-11-11  7:46 ` [PATCH 1/7] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
2010-11-11  7:46 ` [PATCH 2/7] PCI: Add mask bit definition for MSI-X table Sheng Yang
2010-11-11 17:29   ` Jesse Barnes
2010-11-15  8:02     ` Sheng Yang
2010-11-11  7:46 ` [PATCH 3/7] KVM: Move struct kvm_io_device to kvm_host.h Sheng Yang
2010-11-11  7:46 ` [PATCH 4/7] KVM: Add kvm_get_irq_routing_entry() func Sheng Yang
2010-11-11  7:46 ` [PATCH 5/7] KVM: assigned dev: Clean up assigned_device's flag Sheng Yang
2010-11-11  7:46 ` [PATCH 6/7] KVM: assigned dev: MSI-X mask support Sheng Yang
2010-11-12  9:53   ` Michael S. Tsirkin
2010-11-12 10:13     ` Sheng Yang
2010-11-12 10:47       ` Michael S. Tsirkin
2010-11-12 10:54         ` Sheng Yang
2010-11-12 11:25           ` Michael S. Tsirkin
2010-11-15  7:37             ` Sheng Yang
2010-11-15  7:42               ` Michael S. Tsirkin
2010-11-15  7:48                 ` Sheng Yang
2010-11-15  8:03                   ` Michael S. Tsirkin
2010-11-15  8:22                     ` Sheng Yang
2010-11-11  7:47 ` [PATCH 7/7] KVM: assigned dev: Big endian support for MSI-X MMIO Sheng Yang
2010-11-11 16:12   ` Michael S. Tsirkin
2010-11-12  2:47     ` Sheng Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox