* [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel)
@ 2010-11-04 6:15 Sheng Yang
2010-11-04 6:15 ` [PATCH 1/5] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
` (5 more replies)
0 siblings, 6 replies; 29+ messages in thread
From: Sheng Yang @ 2010-11-04 6:15 UTC (permalink / raw)
To: Avi Kivity, Marcelo Tosatti; +Cc: Michael S. Tsirkin, kvm, Sheng Yang
Here is the latest series of MSI-X mask support patches.
The biggest change from the last version is that, to reduce complexity, I
moved all mask bit operations into the kernel, including those for disabled
entries. This addresses two concerns:
1. KVM and QEmu each owning part of the mask bit handling.
2. QEmu needing to access the real hardware to get the mask bit state.
So now QEmu has to use a kernel API to get mask bit information. Though this
is slower than accessing the real hardware's MMIO directly, direct access is
unacceptable because the host OS in fact owns the device. The host OS can
access the device without notifying the guest (and does not need to).
Userspace shouldn't reach past the host OS layer to operate on the real
hardware directly; doing so would confuse the guest.
I also dropped the userspace mask operation, after posting the performance
numbers.
This version also removes the capability enabling mechanism: since we use
struct kvm_assigned_msix_entry with a new ioctl, there is no compatibility
issue.
Please review. I will run more tests with the current patches; so far so good.
--
regards
Yang, Sheng
Sheng Yang (5):
PCI: MSI: Move MSI-X entry definition to pci_regs.h
PCI: Add mask bit definition for MSI-X table
KVM: Move struct kvm_io_device to kvm_host.h
KVM: Add kvm_get_irq_routing_entry() func
KVM: assigned dev: MSI-X mask support
Documentation/kvm/api.txt | 46 +++++++
arch/x86/kvm/x86.c | 1 +
drivers/pci/msi.c | 4 +-
drivers/pci/msi.h | 6 -
include/linux/kvm.h | 21 +++-
include/linux/kvm_host.h | 28 +++++
include/linux/pci_regs.h | 8 ++
virt/kvm/assigned-dev.c | 285 ++++++++++++++++++++++++++++++++++++++++++++-
virt/kvm/iodev.h | 25 +----
virt/kvm/irq_comm.c | 20 +++
10 files changed, 410 insertions(+), 34 deletions(-)
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 1/5] PCI: MSI: Move MSI-X entry definition to pci_regs.h
2010-11-04 6:15 [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel) Sheng Yang
@ 2010-11-04 6:15 ` Sheng Yang
2010-11-05 0:43 ` Hidetoshi Seto
2010-11-04 6:15 ` [PATCH 2/5] PCI: Add mask bit definition for MSI-X table Sheng Yang
` (4 subsequent siblings)
5 siblings, 1 reply; 29+ messages in thread
From: Sheng Yang @ 2010-11-04 6:15 UTC (permalink / raw)
To: Avi Kivity, Marcelo Tosatti
Cc: Michael S. Tsirkin, kvm, Sheng Yang, Jesse Barnes, linux-pci
Then it can be used by others.
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
drivers/pci/msi.h | 6 ------
include/linux/pci_regs.h | 7 +++++++
2 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/msi.h b/drivers/pci/msi.h
index de27c1c..28a3c52 100644
--- a/drivers/pci/msi.h
+++ b/drivers/pci/msi.h
@@ -6,12 +6,6 @@
#ifndef MSI_H
#define MSI_H
-#define PCI_MSIX_ENTRY_SIZE 16
-#define PCI_MSIX_ENTRY_LOWER_ADDR 0
-#define PCI_MSIX_ENTRY_UPPER_ADDR 4
-#define PCI_MSIX_ENTRY_DATA 8
-#define PCI_MSIX_ENTRY_VECTOR_CTRL 12
-
#define msi_control_reg(base) (base + PCI_MSI_FLAGS)
#define msi_lower_address_reg(base) (base + PCI_MSI_ADDRESS_LO)
#define msi_upper_address_reg(base) (base + PCI_MSI_ADDRESS_HI)
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 455b9cc..acfc224 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -307,6 +307,13 @@
#define PCI_MSIX_FLAGS_MASKALL (1 << 14)
#define PCI_MSIX_FLAGS_BIRMASK (7 << 0)
+/* MSI-X entry's format */
+#define PCI_MSIX_ENTRY_SIZE 16
+#define PCI_MSIX_ENTRY_LOWER_ADDR 0
+#define PCI_MSIX_ENTRY_UPPER_ADDR 4
+#define PCI_MSIX_ENTRY_DATA 8
+#define PCI_MSIX_ENTRY_VECTOR_CTRL 12
+
/* CompactPCI Hotswap Register */
#define PCI_CHSWP_CSR 2 /* Control and Status Register */
--
1.7.0.1
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH 2/5] PCI: Add mask bit definition for MSI-X table
2010-11-04 6:15 [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel) Sheng Yang
2010-11-04 6:15 ` [PATCH 1/5] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
@ 2010-11-04 6:15 ` Sheng Yang
2010-11-04 9:50 ` Michael S. Tsirkin
2010-11-05 0:48 ` Hidetoshi Seto
2010-11-04 6:15 ` [PATCH 3/5] KVM: Move struct kvm_io_device to kvm_host.h Sheng Yang
` (3 subsequent siblings)
5 siblings, 2 replies; 29+ messages in thread
From: Sheng Yang @ 2010-11-04 6:15 UTC (permalink / raw)
To: Avi Kivity, Marcelo Tosatti
Cc: Michael S. Tsirkin, kvm, Sheng Yang, Matthew Wilcox, Jesse Barnes,
linux-pci
Then we can use it instead of magic number 1.
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: linux-pci@vger.kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
drivers/pci/msi.c | 4 ++--
include/linux/pci_regs.h | 1 +
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 69b7be3..673e7dc 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -158,7 +158,7 @@ static u32 __msix_mask_irq(struct msi_desc *desc, u32 flag)
u32 mask_bits = desc->masked;
unsigned offset = desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_VECTOR_CTRL;
- mask_bits &= ~1;
+ mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
mask_bits |= flag;
writel(mask_bits, desc->mask_base + offset);
@@ -185,7 +185,7 @@ static void msi_set_mask_bit(unsigned irq, u32 flag)
void mask_msi_irq(unsigned int irq)
{
- msi_set_mask_bit(irq, 1);
+ msi_set_mask_bit(irq, PCI_MSIX_ENTRY_CTRL_MASKBIT);
}
void unmask_msi_irq(unsigned int irq)
diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index acfc224..ff51632 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -313,6 +313,7 @@
#define PCI_MSIX_ENTRY_UPPER_ADDR 4
#define PCI_MSIX_ENTRY_DATA 8
#define PCI_MSIX_ENTRY_VECTOR_CTRL 12
+#define PCI_MSIX_ENTRY_CTRL_MASKBIT 1
/* CompactPCI Hotswap Register */
--
1.7.0.1
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH 3/5] KVM: Move struct kvm_io_device to kvm_host.h
2010-11-04 6:15 [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel) Sheng Yang
2010-11-04 6:15 ` [PATCH 1/5] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
2010-11-04 6:15 ` [PATCH 2/5] PCI: Add mask bit definition for MSI-X table Sheng Yang
@ 2010-11-04 6:15 ` Sheng Yang
2010-11-04 6:15 ` [PATCH 4/5] KVM: Add kvm_get_irq_routing_entry() func Sheng Yang
` (2 subsequent siblings)
5 siblings, 0 replies; 29+ messages in thread
From: Sheng Yang @ 2010-11-04 6:15 UTC (permalink / raw)
To: Avi Kivity, Marcelo Tosatti; +Cc: Michael S. Tsirkin, kvm, Sheng Yang
Then it can be used in struct kvm_assigned_dev_kernel later.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
include/linux/kvm_host.h | 23 +++++++++++++++++++++++
virt/kvm/iodev.h | 25 +------------------------
2 files changed, 24 insertions(+), 24 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0b89d00..7f9e4b7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -74,6 +74,29 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx,
int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
struct kvm_io_device *dev);
+struct kvm_io_device;
+
+/**
+ * kvm_io_device_ops are called under kvm slots_lock.
+ * read and write handlers return 0 if the transaction has been handled,
+ * or non-zero to have it passed to the next device.
+ **/
+struct kvm_io_device_ops {
+ int (*read)(struct kvm_io_device *this,
+ gpa_t addr,
+ int len,
+ void *val);
+ int (*write)(struct kvm_io_device *this,
+ gpa_t addr,
+ int len,
+ const void *val);
+ void (*destructor)(struct kvm_io_device *this);
+};
+
+struct kvm_io_device {
+ const struct kvm_io_device_ops *ops;
+};
+
struct kvm_vcpu {
struct kvm *kvm;
#ifdef CONFIG_PREEMPT_NOTIFIERS
diff --git a/virt/kvm/iodev.h b/virt/kvm/iodev.h
index 12fd3ca..d1f5651 100644
--- a/virt/kvm/iodev.h
+++ b/virt/kvm/iodev.h
@@ -17,32 +17,9 @@
#define __KVM_IODEV_H__
#include <linux/kvm_types.h>
+#include <linux/kvm_host.h>
#include <asm/errno.h>
-struct kvm_io_device;
-
-/**
- * kvm_io_device_ops are called under kvm slots_lock.
- * read and write handlers return 0 if the transaction has been handled,
- * or non-zero to have it passed to the next device.
- **/
-struct kvm_io_device_ops {
- int (*read)(struct kvm_io_device *this,
- gpa_t addr,
- int len,
- void *val);
- int (*write)(struct kvm_io_device *this,
- gpa_t addr,
- int len,
- const void *val);
- void (*destructor)(struct kvm_io_device *this);
-};
-
-
-struct kvm_io_device {
- const struct kvm_io_device_ops *ops;
-};
-
static inline void kvm_iodevice_init(struct kvm_io_device *dev,
const struct kvm_io_device_ops *ops)
{
--
1.7.0.1
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH 4/5] KVM: Add kvm_get_irq_routing_entry() func
2010-11-04 6:15 [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel) Sheng Yang
` (2 preceding siblings ...)
2010-11-04 6:15 ` [PATCH 3/5] KVM: Move struct kvm_io_device to kvm_host.h Sheng Yang
@ 2010-11-04 6:15 ` Sheng Yang
2010-11-04 6:15 ` [PATCH 5/5] KVM: assigned dev: MSI-X mask support Sheng Yang
2010-11-06 0:27 ` [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel) Marcelo Tosatti
5 siblings, 0 replies; 29+ messages in thread
From: Sheng Yang @ 2010-11-04 6:15 UTC (permalink / raw)
To: Avi Kivity, Marcelo Tosatti; +Cc: Michael S. Tsirkin, kvm, Sheng Yang
We need to query the entry later.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/irq_comm.c | 20 ++++++++++++++++++++
2 files changed, 22 insertions(+), 0 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7f9e4b7..e2ecbac 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -614,6 +614,8 @@ int kvm_set_irq_routing(struct kvm *kvm,
const struct kvm_irq_routing_entry *entries,
unsigned nr,
unsigned flags);
+int kvm_get_irq_routing_entry(struct kvm *kvm, int gsi,
+ struct kvm_kernel_irq_routing_entry *entry);
void kvm_free_irq_routing(struct kvm *kvm);
#else
diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 8edca91..ae1dc7c 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -421,6 +421,26 @@ out:
return r;
}
+int kvm_get_irq_routing_entry(struct kvm *kvm, int gsi,
+ struct kvm_kernel_irq_routing_entry *entry)
+{
+ int count = 0;
+ struct kvm_kernel_irq_routing_entry *ei = NULL;
+ struct kvm_irq_routing_table *irq_rt;
+ struct hlist_node *n;
+
+ rcu_read_lock();
+ irq_rt = rcu_dereference(kvm->irq_routing);
+ if (gsi < irq_rt->nr_rt_entries)
+ hlist_for_each_entry(ei, n, &irq_rt->map[gsi], link)
+ count++;
+ if (count == 1)
+ *entry = *ei;
+ rcu_read_unlock();
+
+ return (count != 1);
+}
+
#define IOAPIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP, \
.u.irqchip.irqchip = KVM_IRQCHIP_IOAPIC, .u.irqchip.pin = (irq) }
--
1.7.0.1
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-04 6:15 [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel) Sheng Yang
` (3 preceding siblings ...)
2010-11-04 6:15 ` [PATCH 4/5] KVM: Add kvm_get_irq_routing_entry() func Sheng Yang
@ 2010-11-04 6:15 ` Sheng Yang
2010-11-04 10:43 ` Michael S. Tsirkin
2010-11-06 0:27 ` [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel) Marcelo Tosatti
5 siblings, 1 reply; 29+ messages in thread
From: Sheng Yang @ 2010-11-04 6:15 UTC (permalink / raw)
To: Avi Kivity, Marcelo Tosatti; +Cc: Michael S. Tsirkin, kvm, Sheng Yang
This patch enables per-vector masking for assigned devices using MSI-X.
It provides two new APIs: one for userspace to tell the kernel where the
device's MSI-X table sits in guest MMIO space, the other for userspace to
query mask bit state.
All mask bit operations are kept in the kernel for speed. Userspace
shouldn't access the device MMIO directly for this information; it should
use the provided API instead.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
Documentation/kvm/api.txt | 46 +++++++
arch/x86/kvm/x86.c | 1 +
include/linux/kvm.h | 21 +++-
include/linux/kvm_host.h | 3 +
virt/kvm/assigned-dev.c | 285 ++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 354 insertions(+), 2 deletions(-)
diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index b336266..69b993c 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -1085,6 +1085,52 @@ of 4 instructions that make up a hypercall.
If any additional field gets added to this structure later on, a bit for that
additional piece of information will be set in the flags bitmap.
+4.47 KVM_ASSIGN_REG_MSIX_MMIO
+
+Capability: KVM_CAP_DEVICE_MSIX_MASK
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_assigned_msix_mmio (in)
+Returns: 0 on success, !0 on error
+
+struct kvm_assigned_msix_mmio {
+ /* Assigned device's ID */
+ __u32 assigned_dev_id;
+ /* Must be 0 */
+ __u32 flags;
+ /* MSI-X table MMIO address */
+ __u64 base_addr;
+ /* Must be 0, reserved for future use */
+ __u64 reserved;
+};
+
+This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X
+mask bit in the kernel.
+
+4.48 KVM_ASSIGN_GET_MSIX_ENTRY
+
+Capability: KVM_CAP_DEVICE_MSIX_MASK
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_assigned_msix_entry (in and out)
+Returns: 0 on success, !0 on error
+
+struct kvm_assigned_msix_entry {
+ /* Assigned device's ID */
+ __u32 assigned_dev_id;
+ /* Ignored */
+ __u32 gsi;
+ /* The index of entry in the MSI-X table */
+ __u16 entry;
+ /* Querying flags and returning status */
+ __u16 flags;
+ /* Must be 0 */
+ __u16 padding[2];
+};
+
+This ioctl would allow userspace to get the status of one specific MSI-X
+entry. Currently we support mask bit status querying.
+
5. The kvm_run structure
Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f3f86b2..8fd5121 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_DEBUGREGS:
case KVM_CAP_X86_ROBUST_SINGLESTEP:
case KVM_CAP_XSAVE:
+ case KVM_CAP_DEVICE_MSIX_MASK:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 919ae53..eafafb1 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
#endif
#define KVM_CAP_PPC_GET_PVINFO 57
#define KVM_CAP_PPC_IRQ_LEVEL 58
+#ifdef __KVM_HAVE_MSIX
+#define KVM_CAP_DEVICE_MSIX_MASK 59
+#endif
#ifdef KVM_CAP_IRQ_ROUTING
@@ -671,6 +674,10 @@ struct kvm_clock_data {
#define KVM_XEN_HVM_CONFIG _IOW(KVMIO, 0x7a, struct kvm_xen_hvm_config)
#define KVM_SET_CLOCK _IOW(KVMIO, 0x7b, struct kvm_clock_data)
#define KVM_GET_CLOCK _IOR(KVMIO, 0x7c, struct kvm_clock_data)
+#define KVM_ASSIGN_GET_MSIX_ENTRY _IOWR(KVMIO, 0x7d, \
+ struct kvm_assigned_msix_entry)
+#define KVM_ASSIGN_REG_MSIX_MMIO _IOW(KVMIO, 0x7e, \
+ struct kvm_assigned_msix_mmio)
/* Available with KVM_CAP_PIT_STATE2 */
#define KVM_GET_PIT2 _IOR(KVMIO, 0x9f, struct kvm_pit_state2)
#define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2)
@@ -787,11 +794,23 @@ struct kvm_assigned_msix_nr {
};
#define KVM_MAX_MSIX_PER_DEV 256
+
+#define KVM_MSIX_FLAG_MASK (1 << 0)
+#define KVM_MSIX_FLAG_QUERY_MASK (1 << 15)
+
struct kvm_assigned_msix_entry {
__u32 assigned_dev_id;
__u32 gsi;
__u16 entry; /* The index of entry in the MSI-X table */
- __u16 padding[3];
+ __u16 flags;
+ __u16 padding[2];
+};
+
+struct kvm_assigned_msix_mmio {
+ __u32 assigned_dev_id;
+ __u32 flags;
+ __u64 base_addr;
+ __u64 reserved;
};
#endif /* __LINUX_KVM_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e2ecbac..5ac4db9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -464,6 +464,9 @@ struct kvm_assigned_dev_kernel {
struct pci_dev *dev;
struct kvm *kvm;
spinlock_t assigned_dev_lock;
+ DECLARE_BITMAP(msix_mask_bitmap, KVM_MAX_MSIX_PER_DEV);
+ gpa_t msix_mmio_base;
+ struct kvm_io_device msix_mmio_dev;
};
struct kvm_irq_mask_notifier {
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 7c98928..9398c27 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -232,6 +232,14 @@ static void kvm_free_assigned_device(struct kvm *kvm,
{
kvm_free_assigned_irq(kvm, assigned_dev);
+#ifdef __KVM_HAVE_MSIX
+ if (assigned_dev->msix_mmio_base) {
+ mutex_lock(&kvm->slots_lock);
+ kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
+ &assigned_dev->msix_mmio_dev);
+ mutex_unlock(&kvm->slots_lock);
+ }
+#endif
pci_reset_function(assigned_dev->dev);
pci_release_regions(assigned_dev->dev);
@@ -504,7 +512,7 @@ out:
static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
struct kvm_assigned_pci_dev *assigned_dev)
{
- int r = 0, idx;
+ int r = 0, idx, i;
struct kvm_assigned_dev_kernel *match;
struct pci_dev *dev;
@@ -563,6 +571,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
list_add(&match->list, &kvm->arch.assigned_dev_head);
+ /* The state after reset of MSI-X table is all masked */
+ for (i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
+ set_bit(i, match->msix_mask_bitmap);
+
if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU) {
if (!kvm->arch.iommu_domain) {
r = kvm_iommu_map_guest(kvm);
@@ -666,6 +678,43 @@ msix_nr_out:
return r;
}
+static void update_msix_mask(struct kvm_assigned_dev_kernel *adev,
+ int idx, bool new_mask_flag)
+{
+ int irq;
+ bool old_mask_flag, need_flush = false;
+
+ spin_lock_irq(&adev->assigned_dev_lock);
+
+ if (!adev->dev->msix_enabled ||
+ !(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
+ goto out;
+
+ old_mask_flag = test_bit(adev->guest_msix_entries[idx].entry,
+ adev->msix_mask_bitmap);
+ if (old_mask_flag == new_mask_flag)
+ goto out;
+
+ irq = adev->host_msix_entries[idx].vector;
+ BUG_ON(irq == 0);
+
+ if (new_mask_flag) {
+ set_bit(adev->guest_msix_entries[idx].entry,
+ adev->msix_mask_bitmap);
+ disable_irq_nosync(irq);
+ need_flush = true;
+ } else {
+ clear_bit(adev->guest_msix_entries[idx].entry,
+ adev->msix_mask_bitmap);
+ enable_irq(irq);
+ }
+out:
+ spin_unlock_irq(&adev->assigned_dev_lock);
+
+ if (need_flush)
+ flush_work(&adev->interrupt_work);
+}
+
static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
struct kvm_assigned_msix_entry *entry)
{
@@ -700,6 +749,210 @@ msix_entry_out:
return r;
}
+
+static int kvm_vm_ioctl_get_msix_entry(struct kvm *kvm,
+ struct kvm_assigned_msix_entry *entry)
+{
+ int r = 0;
+ struct kvm_assigned_dev_kernel *adev;
+
+ mutex_lock(&kvm->lock);
+
+ adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ entry->assigned_dev_id);
+
+ if (!adev) {
+ r = -EINVAL;
+ goto out;
+ }
+
+ if (entry->entry >= KVM_MAX_MSIX_PER_DEV) {
+ r = -ENOSPC;
+ goto out;
+ }
+
+ if (entry->flags & KVM_MSIX_FLAG_QUERY_MASK) {
+ if (test_bit(entry->entry, adev->msix_mask_bitmap))
+ entry->flags |= KVM_MSIX_FLAG_MASK;
+ else
+ entry->flags &= ~KVM_MSIX_FLAG_MASK;
+ }
+
+out:
+ mutex_unlock(&kvm->lock);
+
+ return r;
+}
+
+static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
+ gpa_t addr, int len)
+{
+ gpa_t start, end;
+
+ BUG_ON(adev->msix_mmio_base == 0);
+ start = adev->msix_mmio_base;
+ end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE * KVM_MAX_MSIX_PER_DEV;
+ if (addr >= start && addr + len <= end)
+ return true;
+
+ return false;
+}
+
+static int msix_get_enabled_idx(struct kvm_assigned_dev_kernel *adev,
+ gpa_t addr, int len)
+{
+ int i;
+ gpa_t start, end;
+
+ for (i = 0; i < adev->entries_nr; i++) {
+ start = adev->msix_mmio_base +
+ adev->guest_msix_entries[i].entry * PCI_MSIX_ENTRY_SIZE;
+ end = start + PCI_MSIX_ENTRY_SIZE;
+ if (addr >= start && addr + len <= end) {
+ return i;
+ }
+ }
+
+ return -EINVAL;
+}
+
+static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
+ void *val)
+{
+ struct kvm_assigned_dev_kernel *adev =
+ container_of(this, struct kvm_assigned_dev_kernel,
+ msix_mmio_dev);
+ int idx, r = 0;
+ u32 entry[4];
+ struct kvm_kernel_irq_routing_entry e;
+
+ mutex_lock(&adev->kvm->lock);
+ if (!msix_mmio_in_range(adev, addr, len)) {
+ r = -EOPNOTSUPP;
+ goto out;
+ }
+ if ((addr & 0x3) || len != 4)
+ goto out;
+
+ idx = msix_get_enabled_idx(adev, addr, len);
+ if (idx < 0) {
+ idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
+ if ((addr % PCI_MSIX_ENTRY_SIZE) ==
+ PCI_MSIX_ENTRY_VECTOR_CTRL)
+ *(unsigned long *)val =
+ test_bit(idx, adev->msix_mask_bitmap) ?
+ PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
+ else
+ r = -EOPNOTSUPP;
+ goto out;
+ }
+
+ r = kvm_get_irq_routing_entry(adev->kvm,
+ adev->guest_msix_entries[idx].vector, &e);
+ if (r || e.type != KVM_IRQ_ROUTING_MSI) {
+ printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
+ "idx %d, addr 0x%llx, len %d\n", idx, addr, len);
+ r = -EOPNOTSUPP;
+ goto out;
+ }
+ entry[0] = e.msi.address_lo;
+ entry[1] = e.msi.address_hi;
+ entry[2] = e.msi.data;
+ entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
+ adev->msix_mask_bitmap);
+ memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4], len);
+
+out:
+ mutex_unlock(&adev->kvm->lock);
+ return r;
+}
+
+static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
+ const void *val)
+{
+ struct kvm_assigned_dev_kernel *adev =
+ container_of(this, struct kvm_assigned_dev_kernel,
+ msix_mmio_dev);
+ int idx, r = 0;
+ unsigned long new_val = *(unsigned long *)val;
+
+ mutex_lock(&adev->kvm->lock);
+ if (!msix_mmio_in_range(adev, addr, len)) {
+ r = -EOPNOTSUPP;
+ goto out;
+ }
+ if ((addr & 0x3) || len != 4)
+ goto out;
+
+ idx = msix_get_enabled_idx(adev, addr, len);
+ if (idx < 0) {
+ idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
+ if (((addr % PCI_MSIX_ENTRY_SIZE) ==
+ PCI_MSIX_ENTRY_VECTOR_CTRL)) {
+ if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
+ goto out;
+ if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
+ set_bit(idx, adev->msix_mask_bitmap);
+ else
+ clear_bit(idx, adev->msix_mask_bitmap);
+ } else
+ /* Userspace would handle other MMIO writing */
+ r = -EOPNOTSUPP;
+ goto out;
+ }
+ if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
+ r = -EOPNOTSUPP;
+ goto out;
+ }
+ if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
+ goto out;
+ update_msix_mask(adev, idx, !!(new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT));
+out:
+ mutex_unlock(&adev->kvm->lock);
+
+ return r;
+}
+
+static const struct kvm_io_device_ops msix_mmio_ops = {
+ .read = msix_mmio_read,
+ .write = msix_mmio_write,
+};
+
+static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
+ struct kvm_assigned_msix_mmio *msix_mmio)
+{
+ int r = 0;
+ struct kvm_assigned_dev_kernel *adev;
+
+ mutex_lock(&kvm->lock);
+ adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ msix_mmio->assigned_dev_id);
+ if (!adev) {
+ r = -EINVAL;
+ goto out;
+ }
+ if (msix_mmio->base_addr == 0) {
+ r = -EINVAL;
+ goto out;
+ }
+
+ mutex_lock(&kvm->slots_lock);
+ if (adev->msix_mmio_base == 0) {
+ kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
+ r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
+ &adev->msix_mmio_dev);
+ if (r)
+ goto out2;
+ }
+
+ adev->msix_mmio_base = msix_mmio->base_addr;
+out2:
+ mutex_unlock(&kvm->slots_lock);
+out:
+ mutex_unlock(&kvm->lock);
+
+ return r;
+}
#endif
long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
@@ -812,6 +1065,36 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
goto out;
break;
}
+ case KVM_ASSIGN_GET_MSIX_ENTRY: {
+ struct kvm_assigned_msix_entry entry;
+ r = -EFAULT;
+ if (copy_from_user(&entry, argp, sizeof entry))
+ goto out;
+ r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
+ if (r)
+ goto out;
+ r = -EFAULT;
+ if (copy_to_user(argp, &entry, sizeof entry))
+ goto out;
+ r = 0;
+ break;
+ }
+ case KVM_ASSIGN_REG_MSIX_MMIO: {
+ struct kvm_assigned_msix_mmio msix_mmio;
+
+ r = -EFAULT;
+ if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
+ goto out;
+
+ r = -EINVAL;
+ if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
+ goto out;
+
+ r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
+ if (r)
+ goto out;
+ break;
+ }
#endif
}
out:
--
1.7.0.1
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH 2/5] PCI: Add mask bit definition for MSI-X table
2010-11-04 6:15 ` [PATCH 2/5] PCI: Add mask bit definition for MSI-X table Sheng Yang
@ 2010-11-04 9:50 ` Michael S. Tsirkin
2010-11-05 2:48 ` Sheng Yang
2010-11-05 0:48 ` Hidetoshi Seto
1 sibling, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2010-11-04 9:50 UTC (permalink / raw)
To: Sheng Yang
Cc: Avi Kivity, Marcelo Tosatti, kvm, Matthew Wilcox, Jesse Barnes,
linux-pci
On Thu, Nov 04, 2010 at 02:15:18PM +0800, Sheng Yang wrote:
> Then we can use it instead of magic number 1.
>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Looks good to me
> ---
> drivers/pci/msi.c | 4 ++--
> include/linux/pci_regs.h | 1 +
> 2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 69b7be3..673e7dc 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -158,7 +158,7 @@ static u32 __msix_mask_irq(struct msi_desc *desc, u32 flag)
> u32 mask_bits = desc->masked;
> unsigned offset = desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
> PCI_MSIX_ENTRY_VECTOR_CTRL;
> - mask_bits &= ~1;
> + mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
> mask_bits |= flag;
> writel(mask_bits, desc->mask_base + offset);
>
> @@ -185,7 +185,7 @@ static void msi_set_mask_bit(unsigned irq, u32 flag)
>
> void mask_msi_irq(unsigned int irq)
> {
> - msi_set_mask_bit(irq, 1);
> + msi_set_mask_bit(irq, PCI_MSIX_ENTRY_CTRL_MASKBIT);
> }
>
> void unmask_msi_irq(unsigned int irq)
> diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
> index acfc224..ff51632 100644
> --- a/include/linux/pci_regs.h
> +++ b/include/linux/pci_regs.h
> @@ -313,6 +313,7 @@
> #define PCI_MSIX_ENTRY_UPPER_ADDR 4
> #define PCI_MSIX_ENTRY_DATA 8
> #define PCI_MSIX_ENTRY_VECTOR_CTRL 12
> +#define PCI_MSIX_ENTRY_CTRL_MASKBIT 1
minor alignment issue
> /* CompactPCI Hotswap Register */
>
> --
> 1.7.0.1
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-04 6:15 ` [PATCH 5/5] KVM: assigned dev: MSI-X mask support Sheng Yang
@ 2010-11-04 10:43 ` Michael S. Tsirkin
2010-11-05 2:44 ` Sheng Yang
0 siblings, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2010-11-04 10:43 UTC (permalink / raw)
To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Thu, Nov 04, 2010 at 02:15:21PM +0800, Sheng Yang wrote:
> This patch enables per-vector masking for assigned devices using MSI-X.
>
> It provides two new APIs: one for userspace to tell the kernel where the
> device's MSI-X table sits in guest MMIO space, the other for userspace to
> query mask bit state.
>
> All mask bit operations are kept in the kernel for speed. Userspace
> shouldn't access the device MMIO directly for this information; it should
> use the provided API instead.
>
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> ---
> Documentation/kvm/api.txt | 46 +++++++
> arch/x86/kvm/x86.c | 1 +
> include/linux/kvm.h | 21 +++-
> include/linux/kvm_host.h | 3 +
> virt/kvm/assigned-dev.c | 285 ++++++++++++++++++++++++++++++++++++++++++++-
> 5 files changed, 354 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> index b336266..69b993c 100644
> --- a/Documentation/kvm/api.txt
> +++ b/Documentation/kvm/api.txt
> @@ -1085,6 +1085,52 @@ of 4 instructions that make up a hypercall.
> If any additional field gets added to this structure later on, a bit for that
> additional piece of information will be set in the flags bitmap.
>
> +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> +
> +Capability: KVM_CAP_DEVICE_MSIX_MASK
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_assigned_msix_mmio (in)
> +Returns: 0 on success, !0 on error
> +
> +struct kvm_assigned_msix_mmio {
> + /* Assigned device's ID */
> + __u32 assigned_dev_id;
> + /* Must be 0 */
> + __u32 flags;
> + /* MSI-X table MMIO address */
> + __u64 base_addr;
> + /* Must be 0, reserved for future use */
> + __u64 reserved;
We also want length I think.
> +};
> +
> +This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X
> +mask bit in the kernel.
What happens on repeated calls when it's already enabled?
How does one disable after it's enabled?
> +
This is very specialized for assigned devices. I would like an
interface not tied to assigned devices explicitly
(even if the implementation at the moment only works
for assigned devices, I can patch that in later). E.g. vhost-net would
benefit from such an extension too. Maybe tie this to a special
GSI and this GSI can be an assigned device or not?
> +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> +
> +Capability: KVM_CAP_DEVICE_MSIX_MASK
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_assigned_msix_entry (in and out)
> +Returns: 0 on success, !0 on error
> +
> +struct kvm_assigned_msix_entry {
> + /* Assigned device's ID */
> + __u32 assigned_dev_id;
> + /* Ignored */
> + __u32 gsi;
> + /* The index of entry in the MSI-X table */
> + __u16 entry;
> + /* Querying flags and returning status */
> + __u16 flags;
> + /* Must be 0 */
> + __u16 padding[2];
> +};
> +
> +This ioctl would allow userspace to get the status of one specific MSI-X
> +entry. Currently we support mask bit status querying.
> +
> 5. The kvm_run structure
>
> Application code obtains a pointer to the kvm_run structure by
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f3f86b2..8fd5121 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> case KVM_CAP_DEBUGREGS:
> case KVM_CAP_X86_ROBUST_SINGLESTEP:
> case KVM_CAP_XSAVE:
> + case KVM_CAP_DEVICE_MSIX_MASK:
> r = 1;
> break;
> case KVM_CAP_COALESCED_MMIO:
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 919ae53..eafafb1 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> #endif
> #define KVM_CAP_PPC_GET_PVINFO 57
> #define KVM_CAP_PPC_IRQ_LEVEL 58
> +#ifdef __KVM_HAVE_MSIX
> +#define KVM_CAP_DEVICE_MSIX_MASK 59
> +#endif
>
We seem to have these HAVE macros all over.
Avi, what's the idea? Let's drop them?
> #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -671,6 +674,10 @@ struct kvm_clock_data {
> #define KVM_XEN_HVM_CONFIG _IOW(KVMIO, 0x7a, struct kvm_xen_hvm_config)
> #define KVM_SET_CLOCK _IOW(KVMIO, 0x7b, struct kvm_clock_data)
> #define KVM_GET_CLOCK _IOR(KVMIO, 0x7c, struct kvm_clock_data)
> +#define KVM_ASSIGN_GET_MSIX_ENTRY _IOWR(KVMIO, 0x7d, \
> + struct kvm_assigned_msix_entry)
> +#define KVM_ASSIGN_REG_MSIX_MMIO _IOW(KVMIO, 0x7e, \
> + struct kvm_assigned_msix_mmio)
> /* Available with KVM_CAP_PIT_STATE2 */
> #define KVM_GET_PIT2 _IOR(KVMIO, 0x9f, struct kvm_pit_state2)
> #define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2)
> @@ -787,11 +794,23 @@ struct kvm_assigned_msix_nr {
> };
>
> #define KVM_MAX_MSIX_PER_DEV 256
> +
> +#define KVM_MSIX_FLAG_MASK (1 << 0)
> +#define KVM_MSIX_FLAG_QUERY_MASK (1 << 15)
> +
> struct kvm_assigned_msix_entry {
> __u32 assigned_dev_id;
> __u32 gsi;
> __u16 entry; /* The index of entry in the MSI-X table */
> - __u16 padding[3];
> + __u16 flags;
> + __u16 padding[2];
> +};
> +
> +struct kvm_assigned_msix_mmio {
> + __u32 assigned_dev_id;
> + __u32 flags;
> + __u64 base_addr;
> + __u64 reserved;
> };
>
> #endif /* __LINUX_KVM_H */
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index e2ecbac..5ac4db9 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -464,6 +464,9 @@ struct kvm_assigned_dev_kernel {
> struct pci_dev *dev;
> struct kvm *kvm;
> spinlock_t assigned_dev_lock;
> + DECLARE_BITMAP(msix_mask_bitmap, KVM_MAX_MSIX_PER_DEV);
> + gpa_t msix_mmio_base;
> + struct kvm_io_device msix_mmio_dev;
> };
>
> struct kvm_irq_mask_notifier {
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index 7c98928..9398c27 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -232,6 +232,14 @@ static void kvm_free_assigned_device(struct kvm *kvm,
> {
> kvm_free_assigned_irq(kvm, assigned_dev);
>
> +#ifdef __KVM_HAVE_MSIX
> + if (assigned_dev->msix_mmio_base) {
> + mutex_lock(&kvm->slots_lock);
> + kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> + &assigned_dev->msix_mmio_dev);
> + mutex_unlock(&kvm->slots_lock);
> + }
> +#endif
> pci_reset_function(assigned_dev->dev);
>
> pci_release_regions(assigned_dev->dev);
> @@ -504,7 +512,7 @@ out:
> static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> struct kvm_assigned_pci_dev *assigned_dev)
> {
> - int r = 0, idx;
> + int r = 0, idx, i;
> struct kvm_assigned_dev_kernel *match;
> struct pci_dev *dev;
>
> @@ -563,6 +571,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
>
> list_add(&match->list, &kvm->arch.assigned_dev_head);
>
> + /* The state after reset of MSI-X table is all masked */
> + for (i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
> + set_bit(i, match->msix_mask_bitmap);
> +
> if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU) {
> if (!kvm->arch.iommu_domain) {
> r = kvm_iommu_map_guest(kvm);
> @@ -666,6 +678,43 @@ msix_nr_out:
> return r;
> }
>
> +static void update_msix_mask(struct kvm_assigned_dev_kernel *adev,
> + int idx, bool new_mask_flag)
> +{
> + int irq;
> + bool old_mask_flag, need_flush = false;
> +
> + spin_lock_irq(&adev->assigned_dev_lock);
> +
> + if (!adev->dev->msix_enabled ||
> + !(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> + goto out;
> +
> + old_mask_flag = test_bit(adev->guest_msix_entries[idx].entry,
> + adev->msix_mask_bitmap);
> + if (old_mask_flag == new_mask_flag)
> + goto out;
> +
> + irq = adev->host_msix_entries[idx].vector;
> + BUG_ON(irq == 0);
> +
> + if (new_mask_flag) {
> + set_bit(adev->guest_msix_entries[idx].entry,
> + adev->msix_mask_bitmap);
> + disable_irq_nosync(irq);
> + need_flush = true;
> + } else {
> + clear_bit(adev->guest_msix_entries[idx].entry,
> + adev->msix_mask_bitmap);
> + enable_irq(irq);
> + }
> +out:
> + spin_unlock_irq(&adev->assigned_dev_lock);
> +
> + if (need_flush)
> + flush_work(&adev->interrupt_work);
> +}
> +
> static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
> struct kvm_assigned_msix_entry *entry)
> {
> @@ -700,6 +749,210 @@ msix_entry_out:
>
> return r;
> }
> +
> +static int kvm_vm_ioctl_get_msix_entry(struct kvm *kvm,
> + struct kvm_assigned_msix_entry *entry)
> +{
> + int r = 0;
> + struct kvm_assigned_dev_kernel *adev;
> +
> + mutex_lock(&kvm->lock);
> +
> + adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> + entry->assigned_dev_id);
> +
> + if (!adev) {
> + r = -EINVAL;
> + goto out;
> + }
> +
> + if (entry->entry >= KVM_MAX_MSIX_PER_DEV) {
> + r = -ENOSPC;
> + goto out;
> + }
> +
> + if (entry->flags & KVM_MSIX_FLAG_QUERY_MASK) {
> + if (test_bit(entry->entry, adev->msix_mask_bitmap))
> + entry->flags |= KVM_MSIX_FLAG_MASK;
> + else
> + entry->flags &= ~KVM_MSIX_FLAG_MASK;
> + }
> +
> +out:
> + mutex_unlock(&kvm->lock);
> +
> + return r;
> +}
> +
> +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> + gpa_t addr, int len)
> +{
> + gpa_t start, end;
> +
> + BUG_ON(adev->msix_mmio_base == 0);
Below I see
adev->msix_mmio_base = msix_mmio->base_addr;
which comes from userspace. BUG_ON is an unfriendly way to
tell user about a bug in qemu.
Anyway, I don't think we should special-case 0 gpa.
It's up to the user where to base the table.
Use a separate flag to signal that table was initialized.
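A minimal userspace sketch of the reviewer's suggestion (hypothetical struct and function names, not the posted kernel code): track registration with an explicit flag instead of overloading base == 0, so any guest physical address is a valid table base and the handler can return a friendly error instead of BUG_ON() when the table was never set up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device state with an explicit "registered" flag. */
struct msix_mmio_state {
	uint64_t base;
	bool registered;
};

static void msix_mmio_set_base(struct msix_mmio_state *s, uint64_t base)
{
	s->base = base;
	s->registered = true;
}

/* Returns 0 on success, -1 ("not supported") if the table isn't set up —
 * an error to userspace rather than a kernel crash. */
static int msix_mmio_access(const struct msix_mmio_state *s)
{
	if (!s->registered)
		return -1;
	return 0;
}
```

With this scheme even gpa 0 is a legal base; "uninitialized" is a distinct state rather than a magic address value.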
> + start = adev->msix_mmio_base;
> + end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE * KVM_MAX_MSIX_PER_DEV;
This seems wrong. What if the device has a smaller msix table?
Can't we simply pass in base and length?
> + if (addr >= start && addr + len <= end)
interesting wrap-around cases come to mind. Likely not a real
problem but maybe better use addr + len - 1 all over
to make it robust.
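The wrap-around concern can be demonstrated with a small userspace sketch (`gpa_t` here is just a stand-in typedef, and the helper names are hypothetical): the naive "addr + len <= end" form passes for an address near the top of the address space, while a variant that only uses guarded subtractions does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;	/* stand-in for the kernel's gpa_t */

/* The check as written in the patch: "addr + len" can wrap past the
 * top of the address space and slip under "end". */
static bool in_range_naive(gpa_t start, gpa_t end, gpa_t addr, gpa_t len)
{
	return addr >= start && addr + len <= end;
}

/* Wrap-safe variant: every subtraction is guarded by the comparison
 * before it, so nothing can underflow or overflow. */
static bool in_range_robust(gpa_t start, gpa_t size, gpa_t addr, gpa_t len)
{
	return len >= 1 && len <= size &&
	       addr >= start && addr - start <= size - len;
}
```

For example, with start = 0x1000 and a 16-byte window, a 4-byte access at UINT64_MAX - 3 wraps "addr + len" to 0 and falsely satisfies the naive check, but fails the robust one.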
> + return true;
> +
> + return false;
> +}
> +
> +static int msix_get_enabled_idx(struct kvm_assigned_dev_kernel *adev,
> + gpa_t addr, int len)
> +{
> + int i;
> + gpa_t start, end;
> +
> + for (i = 0; i < adev->entries_nr; i++) {
> + start = adev->msix_mmio_base +
> + adev->guest_msix_entries[i].entry * PCI_MSIX_ENTRY_SIZE;
> + end = start + PCI_MSIX_ENTRY_SIZE;
> + if (addr >= start && addr + len <= end) {
> + return i;
> + }
> + }
That's a scary thing to do on each access.
Can we build an interface that does not force us
to scan the whole list each time?
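One way to avoid the per-access scan (a hypothetical sketch, not the posted code): keep a reverse map from guest MSI-X entry number to the index in the enabled-entries array, rebuilt whenever entries are registered. The hot MMIO path then does one division and one array load instead of walking all entries.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ENTRY_SIZE	16	/* PCI_MSIX_ENTRY_SIZE */
#define MAX_ENTRIES	256	/* KVM_MAX_MSIX_PER_DEV */

struct msix_rmap {
	int16_t entry_to_idx[MAX_ENTRIES];	/* -1 = entry not enabled */
};

static void rmap_init(struct msix_rmap *m)
{
	/* 0xff in every byte of an int16_t yields -1 */
	memset(m->entry_to_idx, 0xff, sizeof(m->entry_to_idx));
}

/* Called from the (cold) entry-registration path. */
static void rmap_add(struct msix_rmap *m, int guest_entry, int idx)
{
	m->entry_to_idx[guest_entry] = (int16_t)idx;
}

/* O(1) replacement for the linear msix_get_enabled_idx() scan. */
static int rmap_lookup(const struct msix_rmap *m, uint64_t base, uint64_t addr)
{
	uint64_t entry = (addr - base) / ENTRY_SIZE;

	if (entry >= MAX_ENTRIES)
		return -1;
	return m->entry_to_idx[entry];
}
```

The extra state is one 512-byte table per device; the trade-off is keeping it coherent with the guest_msix_entries array on every (rare) registration change.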
> +
> + return -EINVAL;
> +}
> +
> +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
> + void *val)
> +{
> + struct kvm_assigned_dev_kernel *adev =
> + container_of(this, struct kvm_assigned_dev_kernel,
> + msix_mmio_dev);
> + int idx, r = 0;
> + u32 entry[4];
> + struct kvm_kernel_irq_routing_entry e;
> +
> + mutex_lock(&adev->kvm->lock);
> + if (!msix_mmio_in_range(adev, addr, len)) {
> + r = -EOPNOTSUPP;
> + goto out;
> + }
> + if ((addr & 0x3) || len != 4)
> + goto out;
> +
> + idx = msix_get_enabled_idx(adev, addr, len);
> + if (idx < 0) {
> + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> + if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> + PCI_MSIX_ENTRY_VECTOR_CTRL)
> + *(unsigned long *)val =
> + test_bit(idx, adev->msix_mask_bitmap) ?
> + PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
I suspect this is wrong for big endian platforms: PCI returns
all values in little endian format.
> + else
> + r = -EOPNOTSUPP;
> + goto out;
> + }
> +
> + r = kvm_get_irq_routing_entry(adev->kvm,
> + adev->guest_msix_entries[idx].vector, &e);
> + if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> + printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> + "idx %d, addr 0x%llx, len %d\n", idx, addr, len);
Can a malicious userspace trigger this intentionally?
> + r = -EOPNOTSUPP;
> + goto out;
> + }
> + entry[0] = e.msi.address_lo;
> + entry[1] = e.msi.address_hi;
> + entry[2] = e.msi.data;
> + entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> + adev->msix_mask_bitmap);
I suspect this is wrong for big endian platforms: PCI returns
all values in little endian format.
> + memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4
4 is sizeof *entry?
], len);
> +
> +out:
> + mutex_unlock(&adev->kvm->lock);
> + return r;
> +}
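To illustrate the endianness concern raised above: MSI-X table registers are little-endian, so a big-endian host would need explicit byte-swapping in these handlers. These portable userspace helpers mimic what the kernel's cpu_to_le32()/le32_to_cpu() (or put_unaligned_le32()) would do; they are a sketch of the fix, not the posted code.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit value in little-endian byte order, regardless of
 * host endianness. */
static void store_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)v;
	p[1] = (uint8_t)(v >> 8);
	p[2] = (uint8_t)(v >> 16);
	p[3] = (uint8_t)(v >> 24);
}

/* Load a 32-bit little-endian value, regardless of host endianness. */
static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```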
> +
> +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
> + const void *val)
> +{
> + struct kvm_assigned_dev_kernel *adev =
> + container_of(this, struct kvm_assigned_dev_kernel,
> + msix_mmio_dev);
> + int idx, r = 0;
> + unsigned long new_val = *(unsigned long *)val;
I suspect this is wrong for big endian platforms: PCI returns
all values in little endian format.
> +
> + mutex_lock(&adev->kvm->lock);
> + if (!msix_mmio_in_range(adev, addr, len)) {
> + r = -EOPNOTSUPP;
> + goto out;
> + }
> + if ((addr & 0x3) || len != 4)
> + goto out;
> +
> + idx = msix_get_enabled_idx(adev, addr, len);
> + if (idx < 0) {
> + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> + if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> + PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> + if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> + goto out;
> + if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> + set_bit(idx, adev->msix_mask_bitmap);
> + else
> + clear_bit(idx, adev->msix_mask_bitmap);
> + } else
> + /* Userspace would handle other MMIO writing */
> + r = -EOPNOTSUPP;
> + goto out;
> + }
> + if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> + r = -EOPNOTSUPP;
> + goto out;
> + }
> + if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> + goto out;
> + update_msix_mask(adev, idx, !!(new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT));
> +out:
> + mutex_unlock(&adev->kvm->lock);
> +
> + return r;
> +}
> +
> +static const struct kvm_io_device_ops msix_mmio_ops = {
> + .read = msix_mmio_read,
> + .write = msix_mmio_write,
> +};
> +
> +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> + struct kvm_assigned_msix_mmio *msix_mmio)
> +{
> + int r = 0;
> + struct kvm_assigned_dev_kernel *adev;
> +
> + mutex_lock(&kvm->lock);
> + adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> + msix_mmio->assigned_dev_id);
> + if (!adev) {
> + r = -EINVAL;
> + goto out;
> + }
> + if (msix_mmio->base_addr == 0) {
> + r = -EINVAL;
> + goto out;
> + }
> +
> + mutex_lock(&kvm->slots_lock);
> + if (adev->msix_mmio_base == 0) {
What does this do? Handle the case where we are already registered?
Also: !adev->msix_mmio_base
> + kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> + r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> + &adev->msix_mmio_dev);
> + if (r)
> + goto out2;
> + }
> +
> + adev->msix_mmio_base = msix_mmio->base_addr;
> +out2:
> + mutex_unlock(&kvm->slots_lock);
> +out:
> + mutex_unlock(&kvm->lock);
> +
> + return r;
> +}
> #endif
>
> long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> @@ -812,6 +1065,36 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> goto out;
> break;
> }
> + case KVM_ASSIGN_GET_MSIX_ENTRY: {
> + struct kvm_assigned_msix_entry entry;
> + r = -EFAULT;
> + if (copy_from_user(&entry, argp, sizeof entry))
> + goto out;
> + r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> + if (r)
> + goto out;
> + r = -EFAULT;
> + if (copy_to_user(argp, &entry, sizeof entry))
> + goto out;
> + r = 0;
> + break;
> + }
> + case KVM_ASSIGN_REG_MSIX_MMIO: {
> + struct kvm_assigned_msix_mmio msix_mmio;
> +
> + r = -EFAULT;
> + if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> + goto out;
> +
> + r = -EINVAL;
> + if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> + goto out;
> +
> + r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> + if (r)
> + goto out;
> + break;
> + }
> #endif
> }
> out:
> --
> 1.7.0.1
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/5] PCI: MSI: Move MSI-X entry definition to pci_regs.h
2010-11-04 6:15 ` [PATCH 1/5] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
@ 2010-11-05 0:43 ` Hidetoshi Seto
0 siblings, 0 replies; 29+ messages in thread
From: Hidetoshi Seto @ 2010-11-05 0:43 UTC (permalink / raw)
To: Sheng Yang
Cc: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin, kvm,
Jesse Barnes, linux-pci
Yeah, I think there is a lot of virtualization work awaiting
this change these days.
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
(2010/11/04 15:15), Sheng Yang wrote:
> Then it can be used by others.
>
> Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
> Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> ---
> drivers/pci/msi.h | 6 ------
> include/linux/pci_regs.h | 7 +++++++
> 2 files changed, 7 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/pci/msi.h b/drivers/pci/msi.h
> index de27c1c..28a3c52 100644
> --- a/drivers/pci/msi.h
> +++ b/drivers/pci/msi.h
> @@ -6,12 +6,6 @@
> #ifndef MSI_H
> #define MSI_H
>
> -#define PCI_MSIX_ENTRY_SIZE 16
> -#define PCI_MSIX_ENTRY_LOWER_ADDR 0
> -#define PCI_MSIX_ENTRY_UPPER_ADDR 4
> -#define PCI_MSIX_ENTRY_DATA 8
> -#define PCI_MSIX_ENTRY_VECTOR_CTRL 12
> -
> #define msi_control_reg(base) (base + PCI_MSI_FLAGS)
> #define msi_lower_address_reg(base) (base + PCI_MSI_ADDRESS_LO)
> #define msi_upper_address_reg(base) (base + PCI_MSI_ADDRESS_HI)
> diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
> index 455b9cc..acfc224 100644
> --- a/include/linux/pci_regs.h
> +++ b/include/linux/pci_regs.h
> @@ -307,6 +307,13 @@
> #define PCI_MSIX_FLAGS_MASKALL (1 << 14)
> #define PCI_MSIX_FLAGS_BIRMASK (7 << 0)
>
> +/* MSI-X entry's format */
> +#define PCI_MSIX_ENTRY_SIZE 16
> +#define PCI_MSIX_ENTRY_LOWER_ADDR 0
> +#define PCI_MSIX_ENTRY_UPPER_ADDR 4
> +#define PCI_MSIX_ENTRY_DATA 8
> +#define PCI_MSIX_ENTRY_VECTOR_CTRL 12
> +
> /* CompactPCI Hotswap Register */
>
> #define PCI_CHSWP_CSR 2 /* Control and Status Register */
* Re: [PATCH 2/5] PCI: Add mask bit definition for MSI-X table
2010-11-04 6:15 ` [PATCH 2/5] PCI: Add mask bit definition for MSI-X table Sheng Yang
2010-11-04 9:50 ` Michael S. Tsirkin
@ 2010-11-05 0:48 ` Hidetoshi Seto
2010-11-05 2:49 ` Sheng Yang
1 sibling, 1 reply; 29+ messages in thread
From: Hidetoshi Seto @ 2010-11-05 0:48 UTC (permalink / raw)
To: Sheng Yang
Cc: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin, kvm,
Matthew Wilcox, Jesse Barnes, linux-pci
(2010/11/04 15:15), Sheng Yang wrote:
> Then we can use it instead of magic number 1.
>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> Cc: linux-pci@vger.kernel.org
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> ---
> drivers/pci/msi.c | 4 ++--
> include/linux/pci_regs.h | 1 +
> 2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 69b7be3..673e7dc 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -158,7 +158,7 @@ static u32 __msix_mask_irq(struct msi_desc *desc, u32 flag)
> u32 mask_bits = desc->masked;
> unsigned offset = desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
> PCI_MSIX_ENTRY_VECTOR_CTRL;
> - mask_bits &= ~1;
> + mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
> mask_bits |= flag;
> writel(mask_bits, desc->mask_base + offset);
>
> @@ -185,7 +185,7 @@ static void msi_set_mask_bit(unsigned irq, u32 flag)
>
> void mask_msi_irq(unsigned int irq)
> {
> - msi_set_mask_bit(irq, 1);
> + msi_set_mask_bit(irq, PCI_MSIX_ENTRY_CTRL_MASKBIT);
> }
>
> void unmask_msi_irq(unsigned int irq)
This hunk is not good, because the function msi_set_mask_bit() is
not only for MSI-X but also for MSI, so the numbers 0/1 here work
like false/true:
static void msi_set_mask_bit(struct irq_data *data, u32 flag)
{
struct msi_desc *desc = irq_data_get_msi(data);
if (desc->msi_attrib.is_msix) {
msix_mask_irq(desc, flag);
readl(desc->mask_base); /* Flush write to device */
} else {
unsigned offset = data->irq - desc->dev->irq;
msi_mask_irq(desc, 1 << offset, flag << offset);
}
}
> diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
> index acfc224..ff51632 100644
> --- a/include/linux/pci_regs.h
> +++ b/include/linux/pci_regs.h
> @@ -313,6 +313,7 @@
> #define PCI_MSIX_ENTRY_UPPER_ADDR 4
> #define PCI_MSIX_ENTRY_DATA 8
> #define PCI_MSIX_ENTRY_VECTOR_CTRL 12
> +#define PCI_MSIX_ENTRY_CTRL_MASKBIT 1
>
> /* CompactPCI Hotswap Register */
>
So I recommend a patch with the above hunk for the header,
plus a change like:
- mask_bits &= ~1;
- mask_bits |= flag;
+ mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
+ mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT & !!flag;
Anyway thank you very much for doing this work.
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Thanks,
H.Seto
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-04 10:43 ` Michael S. Tsirkin
@ 2010-11-05 2:44 ` Sheng Yang
2010-11-05 4:20 ` Sheng Yang
2010-11-05 8:43 ` Michael S. Tsirkin
0 siblings, 2 replies; 29+ messages in thread
From: Sheng Yang @ 2010-11-05 2:44 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Thursday 04 November 2010 18:43:22 Michael S. Tsirkin wrote:
> On Thu, Nov 04, 2010 at 02:15:21PM +0800, Sheng Yang wrote:
> > This patch enables per-vector masking for assigned devices using MSI-X.
> >
> > This patch provides two new APIs: one for the guest to specify the
> > device's MSI-X table address in MMIO, the other for userspace to get
> > information about the mask bit.
> >
> > All the mask bit operations are kept in the kernel, in order to accelerate
> > them. Userspace shouldn't access the device MMIO directly for the
> > information; instead it should use the provided API to do so.
> >
> > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > ---
> >
> > Documentation/kvm/api.txt | 46 +++++++
> > arch/x86/kvm/x86.c | 1 +
> > include/linux/kvm.h | 21 +++-
> > include/linux/kvm_host.h | 3 +
> > virt/kvm/assigned-dev.c | 285 ++++++++++++++++++++++++++++++++++++++++++++-
> > 5 files changed, 354 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> > index b336266..69b993c 100644
> > --- a/Documentation/kvm/api.txt
> > +++ b/Documentation/kvm/api.txt
> > @@ -1085,6 +1085,52 @@ of 4 instructions that make up a hypercall.
> >
> > If any additional field gets added to this structure later on, a bit for
> > that additional piece of information will be set in the flags bitmap.
> >
> > +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> > +
> > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > +Architectures: x86
> > +Type: vm ioctl
> > +Parameters: struct kvm_assigned_msix_mmio (in)
> > +Returns: 0 on success, !0 on error
> > +
> > +struct kvm_assigned_msix_mmio {
> > + /* Assigned device's ID */
> > + __u32 assigned_dev_id;
> > + /* Must be 0 */
> > + __u32 flags;
> > + /* MSI-X table MMIO address */
> > + __u64 base_addr;
> > + /* Must be 0, reserved for future use */
> > + __u64 reserved;
>
> We also want length I think.
OK, everybody asks for it...
>
> > +};
> > +
> > +This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X
> > +mask bit in the kernel.
>
> What happens on repeated calls when it's already enabled?
> How does one disable after it's enabled?
It's supposed to be called only once, but calling it again would update the MMIO
base. It would be disabled along with device deassignment. We enable it explicitly
because not all devices have MSI-X capability.
>
> > +
>
> This is very specialized for assigned devices. I would like an
> interface not tied to assigned devices explicitly
> (even if the implementation at the moment only works
> for assigned devices, I can patch that in later). E.g. vhost-net would
> benefit from such an extension too. Maybe tie this to a special
> GSI and this GSI can be an assigned device or not?
You're talking about KVM_ASSIGN_GET_MSIX_ENTRY? We can't tie it to GSI, because we
should also cover entries without GSI. The entry number should be fine for all
kinds of devices using MSI-X.
I am not sure what a PV one would look like. I used kvm_assigned_msix_entry
just because it can be easily reused.
>
> > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> > +
> > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > +Architectures: x86
> > +Type: vm ioctl
> > +Parameters: struct kvm_assigned_msix_entry (in and out)
> > +Returns: 0 on success, !0 on error
> > +
> > +struct kvm_assigned_msix_entry {
> > + /* Assigned device's ID */
> > + __u32 assigned_dev_id;
> > + /* Ignored */
> > + __u32 gsi;
> > + /* The index of entry in the MSI-X table */
> > + __u16 entry;
> > + /* Querying flags and returning status */
> > + __u16 flags;
> > + /* Must be 0 */
> > + __u16 padding[2];
> > +};
> > +
> > +This ioctl would allow userspace to get the status of one specific MSI-X
> > +entry. Currently we support mask bit status querying.
> > +
> >
> > 5. The kvm_run structure
> >
> > Application code obtains a pointer to the kvm_run structure by
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index f3f86b2..8fd5121 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> >
> > case KVM_CAP_DEBUGREGS:
> > case KVM_CAP_X86_ROBUST_SINGLESTEP:
> >
> > case KVM_CAP_XSAVE:
> > + case KVM_CAP_DEVICE_MSIX_MASK:
> > r = 1;
> > break;
> >
> > case KVM_CAP_COALESCED_MMIO:
> > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > index 919ae53..eafafb1 100644
> > --- a/include/linux/kvm.h
> > +++ b/include/linux/kvm.h
> > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> >
> > #endif
> > #define KVM_CAP_PPC_GET_PVINFO 57
> > #define KVM_CAP_PPC_IRQ_LEVEL 58
> >
> > +#ifdef __KVM_HAVE_MSIX
> > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > +#endif
>
> We seem to have these HAVE macros all over.
> Avi, what's the idea? Let's drop them?
Um? This kind of macro is used here to avoid vendor-specific strings in the
header file.
> > #ifdef KVM_CAP_IRQ_ROUTING
> >
> > @@ -671,6 +674,10 @@ struct kvm_clock_data {
> >
> > #define KVM_XEN_HVM_CONFIG       _IOW(KVMIO,  0x7a, struct kvm_xen_hvm_config)
> > #define KVM_SET_CLOCK            _IOW(KVMIO,  0x7b, struct kvm_clock_data)
> > #define KVM_GET_CLOCK            _IOR(KVMIO,  0x7c, struct kvm_clock_data)
> >
> > +#define KVM_ASSIGN_GET_MSIX_ENTRY _IOWR(KVMIO, 0x7d, \
> > + struct kvm_assigned_msix_entry)
> > +#define KVM_ASSIGN_REG_MSIX_MMIO _IOW(KVMIO, 0x7e, \
> > + struct kvm_assigned_msix_mmio)
> >
> > /* Available with KVM_CAP_PIT_STATE2 */
> > #define KVM_GET_PIT2             _IOR(KVMIO,  0x9f, struct kvm_pit_state2)
> > #define KVM_SET_PIT2             _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
> >
> > @@ -787,11 +794,23 @@ struct kvm_assigned_msix_nr {
> >
> > };
> >
> > #define KVM_MAX_MSIX_PER_DEV 256
> >
> > +
> > +#define KVM_MSIX_FLAG_MASK (1 << 0)
> > +#define KVM_MSIX_FLAG_QUERY_MASK (1 << 15)
> > +
> >
> > struct kvm_assigned_msix_entry {
> >
> > __u32 assigned_dev_id;
> > __u32 gsi;
> > __u16 entry; /* The index of entry in the MSI-X table */
> >
> > - __u16 padding[3];
> > + __u16 flags;
> > + __u16 padding[2];
> > +};
> > +
> > +struct kvm_assigned_msix_mmio {
> > + __u32 assigned_dev_id;
> > + __u32 flags;
> > + __u64 base_addr;
> > + __u64 reserved;
> >
> > };
> >
> > #endif /* __LINUX_KVM_H */
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index e2ecbac..5ac4db9 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -464,6 +464,9 @@ struct kvm_assigned_dev_kernel {
> >
> > struct pci_dev *dev;
> > struct kvm *kvm;
> > spinlock_t assigned_dev_lock;
> >
> > + DECLARE_BITMAP(msix_mask_bitmap, KVM_MAX_MSIX_PER_DEV);
> > + gpa_t msix_mmio_base;
> > + struct kvm_io_device msix_mmio_dev;
> >
> > };
> >
> > struct kvm_irq_mask_notifier {
> >
> > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > index 7c98928..9398c27 100644
> > --- a/virt/kvm/assigned-dev.c
> > +++ b/virt/kvm/assigned-dev.c
> > @@ -232,6 +232,14 @@ static void kvm_free_assigned_device(struct kvm *kvm,
> >
> > {
> >
> > kvm_free_assigned_irq(kvm, assigned_dev);
> >
> > +#ifdef __KVM_HAVE_MSIX
> > + if (assigned_dev->msix_mmio_base) {
> > + mutex_lock(&kvm->slots_lock);
> > + kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > + &assigned_dev->msix_mmio_dev);
> > + mutex_unlock(&kvm->slots_lock);
> > + }
> > +#endif
> >
> > pci_reset_function(assigned_dev->dev);
> >
> > pci_release_regions(assigned_dev->dev);
> >
> > @@ -504,7 +512,7 @@ out:
> > static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> >
> > struct kvm_assigned_pci_dev *assigned_dev)
> >
> > {
> >
> > - int r = 0, idx;
> > + int r = 0, idx, i;
> >
> > struct kvm_assigned_dev_kernel *match;
> > struct pci_dev *dev;
> >
> > @@ -563,6 +571,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> >
> > list_add(&match->list, &kvm->arch.assigned_dev_head);
> >
> > + /* The state after reset of MSI-X table is all masked */
> > + for (i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
> > + set_bit(i, match->msix_mask_bitmap);
> > +
> >
> > if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU) {
> >
> > if (!kvm->arch.iommu_domain) {
> >
> > r = kvm_iommu_map_guest(kvm);
> >
> > @@ -666,6 +678,43 @@ msix_nr_out:
> > return r;
> >
> > }
> >
> > +static void update_msix_mask(struct kvm_assigned_dev_kernel *adev,
> > + int idx, bool new_mask_flag)
> > +{
> > + int irq;
> > + bool old_mask_flag, need_flush = false;
> > +
> > + spin_lock_irq(&adev->assigned_dev_lock);
> > +
> > + if (!adev->dev->msix_enabled ||
> > + !(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> > + goto out;
> > +
> > + old_mask_flag = test_bit(adev->guest_msix_entries[idx].entry,
> > + adev->msix_mask_bitmap);
> > + if (old_mask_flag == new_mask_flag)
> > + goto out;
> > +
> > + irq = adev->host_msix_entries[idx].vector;
> > + BUG_ON(irq == 0);
> > +
> > + if (new_mask_flag) {
> > + set_bit(adev->guest_msix_entries[idx].entry,
> > + adev->msix_mask_bitmap);
> > + disable_irq_nosync(irq);
> > + need_flush = true;
> > + } else {
> > + clear_bit(adev->guest_msix_entries[idx].entry,
> > + adev->msix_mask_bitmap);
> > + enable_irq(irq);
> > + }
> > +out:
> > + spin_unlock_irq(&adev->assigned_dev_lock);
> > +
> > + if (need_flush)
> > + flush_work(&adev->interrupt_work);
> > +}
> > +
> >
> > static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
> >
> > struct kvm_assigned_msix_entry *entry)
> >
> > {
> >
> > @@ -700,6 +749,210 @@ msix_entry_out:
> > return r;
> >
> > }
> >
> > +
> > +static int kvm_vm_ioctl_get_msix_entry(struct kvm *kvm,
> > + struct kvm_assigned_msix_entry *entry)
> > +{
> > + int r = 0;
> > + struct kvm_assigned_dev_kernel *adev;
> > +
> > + mutex_lock(&kvm->lock);
> > +
> > + adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > + entry->assigned_dev_id);
> > +
> > + if (!adev) {
> > + r = -EINVAL;
> > + goto out;
> > + }
> > +
> > + if (entry->entry >= KVM_MAX_MSIX_PER_DEV) {
> > + r = -ENOSPC;
> > + goto out;
> > + }
> > +
> > + if (entry->flags & KVM_MSIX_FLAG_QUERY_MASK) {
> > + if (test_bit(entry->entry, adev->msix_mask_bitmap))
> > + entry->flags |= KVM_MSIX_FLAG_MASK;
> > + else
> > + entry->flags &= ~KVM_MSIX_FLAG_MASK;
> > + }
> > +
> > +out:
> > + mutex_unlock(&kvm->lock);
> > +
> > + return r;
> > +}
> > +
> > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> > + gpa_t addr, int len)
> > +{
> > + gpa_t start, end;
> > +
> > + BUG_ON(adev->msix_mmio_base == 0);
>
> Below I see
> adev->msix_mmio_base = msix_mmio->base_addr;
> which comes from userspace. BUG_ON is an unfriendly way to
> tell user about a bug in qemu.
>
> Anyway, I don't think we should special-case 0 gpa.
> It's up to the user where to base the table.
> Use a separate flag to signal that table was initialized.
Base_addr can't be 0 in any case; only unimplemented BARs are hardwired to
zero. Also, if we get here with base_addr = 0, it is surely a bug.
>
> > + start = adev->msix_mmio_base;
> > + end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> > KVM_MAX_MSIX_PER_DEV;
>
> This seems wrong. What if the device has a smaller msix table?
> Can't we simply pass in base and length?
OK, OK, you'll have it...
> > + if (addr >= start && addr + len <= end)
>
> interesting wrap-around cases come to mind. Likely not a real
> problem but maybe better use addr + len - 1 all over
> to make it robust.
Quite good point. :)
>
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +static int msix_get_enabled_idx(struct kvm_assigned_dev_kernel *adev,
> > + gpa_t addr, int len)
> > +{
> > + int i;
> > + gpa_t start, end;
> > +
> > + for (i = 0; i < adev->entries_nr; i++) {
> > + start = adev->msix_mmio_base +
> > + adev->guest_msix_entries[i].entry * PCI_MSIX_ENTRY_SIZE;
> > + end = start + PCI_MSIX_ENTRY_SIZE;
> > + if (addr >= start && addr + len <= end) {
> > + return i;
> > + }
> > + }
>
> That's a scary thing to do on each access.
> Can we build an interface that does not force us
> to scan the whole list each time?
We can reduce the calculation, but it seems it can't be O(1) without introducing
more variables to help.
>
> > +
> > + return -EINVAL;
> > +}
> > +
> > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
> > +			  void *val)
> > +{
> > + struct kvm_assigned_dev_kernel *adev =
> > + container_of(this, struct kvm_assigned_dev_kernel,
> > + msix_mmio_dev);
> > + int idx, r = 0;
> > + u32 entry[4];
> > + struct kvm_kernel_irq_routing_entry e;
> > +
> > + mutex_lock(&adev->kvm->lock);
> > + if (!msix_mmio_in_range(adev, addr, len)) {
> > + r = -EOPNOTSUPP;
> > + goto out;
> > + }
> > + if ((addr & 0x3) || len != 4)
> > + goto out;
> > +
> > + idx = msix_get_enabled_idx(adev, addr, len);
> > + if (idx < 0) {
> > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > + if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > + PCI_MSIX_ENTRY_VECTOR_CTRL)
> > + *(unsigned long *)val =
> > + test_bit(idx, adev->msix_mask_bitmap) ?
> > + PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
>
> I suspect this is wrong for big endian platforms: PCI returns
> all values in little endian format.
Let big endian platforms take care of it later... Only x86 has
__KVM_HAVE_MSIX now.
>
> > + else
> > + r = -EOPNOTSUPP;
> > + goto out;
> > + }
> > +
> > + r = kvm_get_irq_routing_entry(adev->kvm,
> > + adev->guest_msix_entries[idx].vector, &e);
> > + if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > + printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > + "idx %d, addr 0x%llx, len %d\n", idx, addr, len);
>
> Can a malicious userspace trigger this intentionally?
No.
>
> > + r = -EOPNOTSUPP;
> > + goto out;
> > + }
> > + entry[0] = e.msi.address_lo;
> > + entry[1] = e.msi.address_hi;
> > + entry[2] = e.msi.data;
> > + entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > + adev->msix_mask_bitmap);
>
> I suspect this is wrong for big endian platforms: PCI returns
> all values in little endian format.
>
> > + memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4
>
> 4 is sizeof *entry?
4 elements; in fact, "PCI_MSIX_ENTRY_SIZE / sizeof(u32)".
>
> ], len);
>
> > +
> > +out:
> > + mutex_unlock(&adev->kvm->lock);
> > + return r;
> > +}
> > +
> > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
> > +			   const void *val)
> > +{
> > + struct kvm_assigned_dev_kernel *adev =
> > + container_of(this, struct kvm_assigned_dev_kernel,
> > + msix_mmio_dev);
> > + int idx, r = 0;
> > + unsigned long new_val = *(unsigned long *)val;
>
> I suspect this is wrong for big endian platforms: PCI returns
> all values in little endian format.
>
> > +
> > + mutex_lock(&adev->kvm->lock);
> > + if (!msix_mmio_in_range(adev, addr, len)) {
> > + r = -EOPNOTSUPP;
> > + goto out;
> > + }
> > + if ((addr & 0x3) || len != 4)
> > + goto out;
> > +
> > + idx = msix_get_enabled_idx(adev, addr, len);
> > + if (idx < 0) {
> > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > + if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> > + PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> > + if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > + goto out;
> > + if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > + set_bit(idx, adev->msix_mask_bitmap);
> > + else
> > + clear_bit(idx, adev->msix_mask_bitmap);
> > + } else
> > + /* Userspace would handle other MMIO writing */
> > + r = -EOPNOTSUPP;
> > + goto out;
> > + }
> > + if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > + r = -EOPNOTSUPP;
> > + goto out;
> > + }
> > + if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > + goto out;
> > + update_msix_mask(adev, idx, !!(new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT));
> > +out:
> > + mutex_unlock(&adev->kvm->lock);
> > +
> > + return r;
> > +}
> > +
> > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > + .read = msix_mmio_read,
> > + .write = msix_mmio_write,
> > +};
> > +
> > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > + struct kvm_assigned_msix_mmio *msix_mmio)
> > +{
> > + int r = 0;
> > + struct kvm_assigned_dev_kernel *adev;
> > +
> > + mutex_lock(&kvm->lock);
> > + adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > + msix_mmio->assigned_dev_id);
> > + if (!adev) {
> > + r = -EINVAL;
> > + goto out;
> > + }
> > + if (msix_mmio->base_addr == 0) {
> > + r = -EINVAL;
> > + goto out;
> > + }
> > +
> > + mutex_lock(&kvm->slots_lock);
> > + if (adev->msix_mmio_base == 0) {
>
> What does this do? Handle the case where we are already registered?
> Also: !adev->msix_mmio_base
We don't need to register again, just update the address.
Well, msix_mmio_base is an address; I think == 0 is more appropriate.
--
regards
Yang, Sheng
>
> > + kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > + r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > + &adev->msix_mmio_dev);
> > + if (r)
> > + goto out2;
> > + }
> > +
> > + adev->msix_mmio_base = msix_mmio->base_addr;
> > +out2:
> > + mutex_unlock(&kvm->slots_lock);
> > +out:
> > + mutex_unlock(&kvm->lock);
> > +
> > + return r;
> > +}
> >
> > #endif
> >
> > long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> >
> > @@ -812,6 +1065,36 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm,
> > unsigned ioctl,
> >
> > goto out;
> >
> > break;
> >
> > }
> >
> > + case KVM_ASSIGN_GET_MSIX_ENTRY: {
> > + struct kvm_assigned_msix_entry entry;
> > + r = -EFAULT;
> > + if (copy_from_user(&entry, argp, sizeof entry))
> > + goto out;
> > + r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> > + if (r)
> > + goto out;
> > + r = -EFAULT;
> > + if (copy_to_user(argp, &entry, sizeof entry))
> > + goto out;
> > + r = 0;
> > + break;
> > + }
> > + case KVM_ASSIGN_REG_MSIX_MMIO: {
> > + struct kvm_assigned_msix_mmio msix_mmio;
> > +
> > + r = -EFAULT;
> > + if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > + goto out;
> > +
> > + r = -EINVAL;
> > + if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > + goto out;
> > +
> > + r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > + if (r)
> > + goto out;
> > + break;
> > + }
> >
> > #endif
> >
> > }
> >
> > out:
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2/5] PCI: Add mask bit definition for MSI-X table
2010-11-04 9:50 ` Michael S. Tsirkin
@ 2010-11-05 2:48 ` Sheng Yang
0 siblings, 0 replies; 29+ messages in thread
From: Sheng Yang @ 2010-11-05 2:48 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Avi Kivity, Marcelo Tosatti, kvm, Matthew Wilcox, Jesse Barnes,
linux-pci
On Thursday 04 November 2010 17:50:29 Michael S. Tsirkin wrote:
> On Thu, Nov 04, 2010 at 02:15:18PM +0800, Sheng Yang wrote:
> > Then we can use it instead of magic number 1.
> >
> > Cc: Matthew Wilcox <willy@linux.intel.com>
> > Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> > Cc: linux-pci@vger.kernel.org
> > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
>
> Looks good to me
>
> > ---
> >
> > drivers/pci/msi.c | 4 ++--
> > include/linux/pci_regs.h | 1 +
> > 2 files changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 69b7be3..673e7dc 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -158,7 +158,7 @@ static u32 __msix_mask_irq(struct msi_desc *desc, u32
> > flag)
> >
> > u32 mask_bits = desc->masked;
> > unsigned offset = desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
> >
> > PCI_MSIX_ENTRY_VECTOR_CTRL;
> >
> > - mask_bits &= ~1;
> > + mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
> >
> > mask_bits |= flag;
> > writel(mask_bits, desc->mask_base + offset);
> >
> > @@ -185,7 +185,7 @@ static void msi_set_mask_bit(unsigned irq, u32 flag)
> >
> > void mask_msi_irq(unsigned int irq)
> > {
> >
> > - msi_set_mask_bit(irq, 1);
> > + msi_set_mask_bit(irq, PCI_MSIX_ENTRY_CTRL_MASKBIT);
> >
> > }
> >
> > void unmask_msi_irq(unsigned int irq)
> >
> > diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
> > index acfc224..ff51632 100644
> > --- a/include/linux/pci_regs.h
> > +++ b/include/linux/pci_regs.h
> > @@ -313,6 +313,7 @@
> >
> > #define PCI_MSIX_ENTRY_UPPER_ADDR 4
> > #define PCI_MSIX_ENTRY_DATA 8
> > #define PCI_MSIX_ENTRY_VECTOR_CTRL 12
> >
> > +#define PCI_MSIX_ENTRY_CTRL_MASKBIT 1
>
> minor alignment issue
The two spaces after #define? They are used to tell the definition (which is a bit)
apart from the others (which are fields)...
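The distinction can be sketched in a small helper (illustrative; the offset values follow the quoted pci_regs.h hunk, and PCI_MSIX_ENTRY_SIZE is assumed from msi.h): the ENTRY_* values are byte offsets of fields within a 16-byte table entry, while PCI_MSIX_ENTRY_CTRL_MASKBIT is a bit inside the Vector Control dword.

```c
#include <stdint.h>

#define PCI_MSIX_ENTRY_SIZE          16
#define PCI_MSIX_ENTRY_LOWER_ADDR     0
#define PCI_MSIX_ENTRY_UPPER_ADDR     4
#define PCI_MSIX_ENTRY_DATA           8
#define PCI_MSIX_ENTRY_VECTOR_CTRL   12
#define PCI_MSIX_ENTRY_CTRL_MASKBIT   1

/* Byte offset of entry_nr's Vector Control dword within the MSI-X table;
 * the mask bit lives inside that dword, not at its own offset. */
static uint64_t msix_vector_ctrl_offset(unsigned int entry_nr)
{
	return (uint64_t)entry_nr * PCI_MSIX_ENTRY_SIZE +
	       PCI_MSIX_ENTRY_VECTOR_CTRL;
}
```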
--
regards
Yang, Sheng
>
> > /* CompactPCI Hotswap Register */
* Re: [PATCH 2/5] PCI: Add mask bit definition for MSI-X table
2010-11-05 0:48 ` Hidetoshi Seto
@ 2010-11-05 2:49 ` Sheng Yang
0 siblings, 0 replies; 29+ messages in thread
From: Sheng Yang @ 2010-11-05 2:49 UTC (permalink / raw)
To: Hidetoshi Seto
Cc: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin, kvm,
Matthew Wilcox, Jesse Barnes, linux-pci
On Friday 05 November 2010 08:48:43 Hidetoshi Seto wrote:
> (2010/11/04 15:15), Sheng Yang wrote:
> > Then we can use it instead of magic number 1.
> >
> > Cc: Matthew Wilcox <willy@linux.intel.com>
> > Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
> > Cc: linux-pci@vger.kernel.org
> > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > ---
> >
> > drivers/pci/msi.c | 4 ++--
> > include/linux/pci_regs.h | 1 +
> > 2 files changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 69b7be3..673e7dc 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -158,7 +158,7 @@ static u32 __msix_mask_irq(struct msi_desc *desc, u32
> > flag)
> >
> > u32 mask_bits = desc->masked;
> > unsigned offset = desc->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
> >
> > PCI_MSIX_ENTRY_VECTOR_CTRL;
> >
> > - mask_bits &= ~1;
> > + mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
> >
> > mask_bits |= flag;
> > writel(mask_bits, desc->mask_base + offset);
> >
> > @@ -185,7 +185,7 @@ static void msi_set_mask_bit(unsigned irq, u32 flag)
> >
> > void mask_msi_irq(unsigned int irq)
> > {
> >
> > - msi_set_mask_bit(irq, 1);
> > + msi_set_mask_bit(irq, PCI_MSIX_ENTRY_CTRL_MASKBIT);
> >
> > }
> >
> > void unmask_msi_irq(unsigned int irq)
>
> This hunk is not good, because the function msi_set_mask_bit() is
> not only for MSI-X but also for MSI, so the number 0/1 here works
> like false/true:
>
> static void msi_set_mask_bit(struct irq_data *data, u32 flag)
> {
> struct msi_desc *desc = irq_data_get_msi(data);
>
> if (desc->msi_attrib.is_msix) {
> msix_mask_irq(desc, flag);
> readl(desc->mask_base); /* Flush write to device */
> } else {
> unsigned offset = data->irq - desc->dev->irq;
> msi_mask_irq(desc, 1 << offset, flag << offset);
> }
> }
>
> > diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
> > index acfc224..ff51632 100644
> > --- a/include/linux/pci_regs.h
> > +++ b/include/linux/pci_regs.h
> > @@ -313,6 +313,7 @@
> >
> > #define PCI_MSIX_ENTRY_UPPER_ADDR 4
> > #define PCI_MSIX_ENTRY_DATA 8
> > #define PCI_MSIX_ENTRY_VECTOR_CTRL 12
> >
> > +#define PCI_MSIX_ENTRY_CTRL_MASKBIT 1
> >
> > /* CompactPCI Hotswap Register */
>
> So I recommend you to have a patch with the above hunk for header
> plus a change like:
>
> - mask_bits &= ~1;
> - mask_bits |= flag;
> + mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
> + mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT & !!flag;
>
> Anyway thank you very much for doing this work.
Good suggestion. Would update it.
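The suggested change can be sketched as a pure helper (a sketch only; the real __msix_mask_irq writes the result back with writel): normalizing flag with !! keeps the MSI-X path correct even though MSI callers pass 0/1 as a boolean.

```c
#include <stdint.h>

#define PCI_MSIX_ENTRY_CTRL_MASKBIT 1

/* Update only the per-vector mask bit, treating any nonzero flag as
 * "masked"; all other Vector Control bits are preserved. */
static uint32_t msix_update_mask_bits(uint32_t mask_bits, uint32_t flag)
{
	mask_bits &= ~PCI_MSIX_ENTRY_CTRL_MASKBIT;
	mask_bits |= PCI_MSIX_ENTRY_CTRL_MASKBIT & !!flag;
	return mask_bits;
}
```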
>
> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
--
regards
Yang, Sheng
>
> Thanks,
> H.Seto
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-05 2:44 ` Sheng Yang
@ 2010-11-05 4:20 ` Sheng Yang
2010-11-05 7:16 ` Sheng Yang
2010-11-05 8:43 ` Michael S. Tsirkin
1 sibling, 1 reply; 29+ messages in thread
From: Sheng Yang @ 2010-11-05 4:20 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Friday 05 November 2010 10:44:19 Sheng Yang wrote:
> On Thursday 04 November 2010 18:43:22 Michael S. Tsirkin wrote:
> > On Thu, Nov 04, 2010 at 02:15:21PM +0800, Sheng Yang wrote:
> > > This patch enables per-vector masking for assigned devices using MSI-X.
> > >
> > > This patch provides two new APIs: one for the guest to specify the
> > > device's MSI-X table address in MMIO, the other for userspace to get
> > > information about the mask bit.
> > >
> > > All mask bit operations are kept in the kernel for performance.
> > > Userspace shouldn't access the device MMIO directly for this
> > > information; instead it should use the provided API.
> > >
> > > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
> > > ---
> > >
> > > Documentation/kvm/api.txt | 46 +++++++
> > > arch/x86/kvm/x86.c | 1 +
> > > include/linux/kvm.h | 21 +++-
> > > include/linux/kvm_host.h | 3 +
> > > virt/kvm/assigned-dev.c | 285
> > > ++++++++++++++++++++++++++++++++++++++++++++- 5 files changed,
>
> 354
>
> > > insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
> > > index b336266..69b993c 100644
> > > --- a/Documentation/kvm/api.txt
> > > +++ b/Documentation/kvm/api.txt
> > > @@ -1085,6 +1085,52 @@ of 4 instructions that make up a hypercall.
> > >
> > > If any additional field gets added to this structure later on, a bit
> > > for that additional piece of information will be set in the flags
> > > bitmap.
> > >
> > > +4.47 KVM_ASSIGN_REG_MSIX_MMIO
> > > +
> > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > +Architectures: x86
> > > +Type: vm ioctl
> > > +Parameters: struct kvm_assigned_msix_mmio (in)
> > > +Returns: 0 on success, !0 on error
> > > +
> > > +struct kvm_assigned_msix_mmio {
> > > + /* Assigned device's ID */
> > > + __u32 assigned_dev_id;
> > > + /* Must be 0 */
> > > + __u32 flags;
> > > + /* MSI-X table MMIO address */
> > > + __u64 base_addr;
> > > + /* Must be 0, reserved for future use */
> > > + __u64 reserved;
> >
> > We also want length I think.
>
> OK, everybody asks for it...
>
> > > +};
> > > +
> > > +This ioctl would enable in-kernel MSI-X emulation, which would handle
> > > MSI-X +mask bit in the kernel.
> >
> > What happens on repeated calls when it's already enabled?
> > How does one disable after it's enabled?
>
> This should only be called once; a repeated call simply updates the MMIO
> base. It is disabled along with device deassignment. We enable it
> explicitly because not all devices have MSI-X capability.
>
> > > +
> >
> > This is very specialized for assigned devices. I would like an
> > interface not tied to assigned devices explicitly
> > (even if the implementation at the moment only works
> > for assigned devices, I can patch that in later). E.g. vhost-net would
> > benefit from such an extension too. Maybe tie this to a special
> > GSI and this GSI can be an assigned device or not?
>
> You're talking about KVM_ASSIGN_GET_MSIX_ENTRY? We can't tie it to GSI,
> because we should also cover entries without GSI. The entry number should
> be fine for all kinds of devices using MSI-X.
>
> I am not sure what the PV one would look like. I use
> kvm_assigned_msix_entry just because it can be easily reused.
>
> > > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> > > +
> > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > +Architectures: x86
> > > +Type: vm ioctl
> > > +Parameters: struct kvm_assigned_msix_entry (in and out)
> > > +Returns: 0 on success, !0 on error
> > > +
> > > +struct kvm_assigned_msix_entry {
> > > + /* Assigned device's ID */
> > > + __u32 assigned_dev_id;
> > > + /* Ignored */
> > > + __u32 gsi;
> > > + /* The index of entry in the MSI-X table */
> > > + __u16 entry;
> > > + /* Querying flags and returning status */
> > > + __u16 flags;
> > > + /* Must be 0 */
> > > + __u16 padding[2];
> > > +};
> > > +
> > > +This ioctl would allow userspace to get the status of one specific
> > > MSI-X +entry. Currently we support mask bit status querying.
> > > +
> > >
> > > 5. The kvm_run structure
> > >
> > > Application code obtains a pointer to the kvm_run structure by
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index f3f86b2..8fd5121 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > >
> > > case KVM_CAP_DEBUGREGS:
> > > case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > >
> > > case KVM_CAP_XSAVE:
> > > + case KVM_CAP_DEVICE_MSIX_MASK:
> > > r = 1;
> > > break;
> > >
> > > case KVM_CAP_COALESCED_MMIO:
> > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > index 919ae53..eafafb1 100644
> > > --- a/include/linux/kvm.h
> > > +++ b/include/linux/kvm.h
> > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > >
> > > #endif
> > > #define KVM_CAP_PPC_GET_PVINFO 57
> > > #define KVM_CAP_PPC_IRQ_LEVEL 58
> > >
> > > +#ifdef __KVM_HAVE_MSIX
> > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > +#endif
> >
> > We seem to have these HAVE macros all over.
> > Avi, what's the idea? Let's drop them?
>
> Um? This kind of macro is used here to avoid vendor-specific strings in
> the header file.
>
> > > #ifdef KVM_CAP_IRQ_ROUTING
> > >
> > > @@ -671,6 +674,10 @@ struct kvm_clock_data {
> > >
> > > #define KVM_XEN_HVM_CONFIG _IOW(KVMIO, 0x7a, struct
> > > kvm_xen_hvm_config) #define KVM_SET_CLOCK _IOW(KVMIO,
> > > 0x7b, struct kvm_clock_data) #define KVM_GET_CLOCK
> > > _IOR(KVMIO, 0x7c, struct kvm_clock_data)
> > >
> > > +#define KVM_ASSIGN_GET_MSIX_ENTRY _IOWR(KVMIO, 0x7d, \
> > > + struct kvm_assigned_msix_entry)
> > > +#define KVM_ASSIGN_REG_MSIX_MMIO _IOW(KVMIO, 0x7e, \
> > > + struct kvm_assigned_msix_mmio)
> > >
> > > /* Available with KVM_CAP_PIT_STATE2 */
> > > #define KVM_GET_PIT2 _IOR(KVMIO, 0x9f, struct
> > > kvm_pit_state2) #define KVM_SET_PIT2 _IOW(KVMIO, 0xa0,
> > > struct kvm_pit_state2)
> > >
> > > @@ -787,11 +794,23 @@ struct kvm_assigned_msix_nr {
> > >
> > > };
> > >
> > > #define KVM_MAX_MSIX_PER_DEV 256
> > >
> > > +
> > > +#define KVM_MSIX_FLAG_MASK (1 << 0)
> > > +#define KVM_MSIX_FLAG_QUERY_MASK (1 << 15)
> > > +
> > >
> > > struct kvm_assigned_msix_entry {
> > >
> > > __u32 assigned_dev_id;
> > > __u32 gsi;
> > > __u16 entry; /* The index of entry in the MSI-X table */
> > >
> > > - __u16 padding[3];
> > > + __u16 flags;
> > > + __u16 padding[2];
> > > +};
> > > +
> > > +struct kvm_assigned_msix_mmio {
> > > + __u32 assigned_dev_id;
> > > + __u32 flags;
> > > + __u64 base_addr;
> > > + __u64 reserved;
> > >
> > > };
> > >
> > > #endif /* __LINUX_KVM_H */
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index e2ecbac..5ac4db9 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -464,6 +464,9 @@ struct kvm_assigned_dev_kernel {
> > >
> > > struct pci_dev *dev;
> > > struct kvm *kvm;
> > > spinlock_t assigned_dev_lock;
> > >
> > > + DECLARE_BITMAP(msix_mask_bitmap, KVM_MAX_MSIX_PER_DEV);
> > > + gpa_t msix_mmio_base;
> > > + struct kvm_io_device msix_mmio_dev;
> > >
> > > };
> > >
> > > struct kvm_irq_mask_notifier {
> > >
> > > diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> > > index 7c98928..9398c27 100644
> > > --- a/virt/kvm/assigned-dev.c
> > > +++ b/virt/kvm/assigned-dev.c
> > > @@ -232,6 +232,14 @@ static void kvm_free_assigned_device(struct kvm
> > > *kvm,
> > >
> > > {
> > >
> > > kvm_free_assigned_irq(kvm, assigned_dev);
> > >
> > > +#ifdef __KVM_HAVE_MSIX
> > > + if (assigned_dev->msix_mmio_base) {
> > > + mutex_lock(&kvm->slots_lock);
> > > + kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
> > > + &assigned_dev->msix_mmio_dev);
> > > + mutex_unlock(&kvm->slots_lock);
> > > + }
> > > +#endif
> > >
> > > pci_reset_function(assigned_dev->dev);
> > >
> > > pci_release_regions(assigned_dev->dev);
> > >
> > > @@ -504,7 +512,7 @@ out:
> > > static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
> > >
> > > struct kvm_assigned_pci_dev *assigned_dev)
> > >
> > > {
> > >
> > > - int r = 0, idx;
> > > + int r = 0, idx, i;
> > >
> > > struct kvm_assigned_dev_kernel *match;
> > > struct pci_dev *dev;
> > >
> > > @@ -563,6 +571,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm
> > > *kvm,
> > >
> > > list_add(&match->list, &kvm->arch.assigned_dev_head);
> > >
> > > + /* The state after reset of MSI-X table is all masked */
> > > + for (i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
> > > + set_bit(i, match->msix_mask_bitmap);
> > > +
> > >
> > > if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU) {
> > >
> > > if (!kvm->arch.iommu_domain) {
> > >
> > > r = kvm_iommu_map_guest(kvm);
> > >
> > > @@ -666,6 +678,43 @@ msix_nr_out:
> > > return r;
> > >
> > > }
> > >
> > > +static void update_msix_mask(struct kvm_assigned_dev_kernel *adev,
> > > + int idx, bool new_mask_flag)
> > > +{
> > > + int irq;
> > > + bool old_mask_flag, need_flush = false;
> > > +
> > > + spin_lock_irq(&adev->assigned_dev_lock);
> > > +
> > > + if (!adev->dev->msix_enabled ||
> > > + !(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
> > > + goto out;
> > > +
> > > + old_mask_flag = test_bit(adev->guest_msix_entries[idx].entry,
> > > + adev->msix_mask_bitmap);
> > > + if (old_mask_flag == new_mask_flag)
> > > + goto out;
> > > +
> > > + irq = adev->host_msix_entries[idx].vector;
> > > + BUG_ON(irq == 0);
> > > +
> > > + if (new_mask_flag) {
> > > + set_bit(adev->guest_msix_entries[idx].entry,
> > > + adev->msix_mask_bitmap);
> > > + disable_irq_nosync(irq);
> > > + need_flush = true;
> > > + } else {
> > > + clear_bit(adev->guest_msix_entries[idx].entry,
> > > + adev->msix_mask_bitmap);
> > > + enable_irq(irq);
> > > + }
> > > +out:
> > > + spin_unlock_irq(&adev->assigned_dev_lock);
> > > +
> > > + if (need_flush)
> > > + flush_work(&adev->interrupt_work);
> > > +}
> > > +
> > >
> > > static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
> > >
> > > struct kvm_assigned_msix_entry *entry)
> > >
> > > {
> > >
> > > @@ -700,6 +749,210 @@ msix_entry_out:
> > > return r;
> > >
> > > }
> > >
> > > +
> > > +static int kvm_vm_ioctl_get_msix_entry(struct kvm *kvm,
> > > + struct kvm_assigned_msix_entry *entry)
> > > +{
> > > + int r = 0;
> > > + struct kvm_assigned_dev_kernel *adev;
> > > +
> > > + mutex_lock(&kvm->lock);
> > > +
> > > + adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > + entry->assigned_dev_id);
> > > +
> > > + if (!adev) {
> > > + r = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + if (entry->entry >= KVM_MAX_MSIX_PER_DEV) {
> > > + r = -ENOSPC;
> > > + goto out;
> > > + }
> > > +
> > > + if (entry->flags & KVM_MSIX_FLAG_QUERY_MASK) {
> > > + if (test_bit(entry->entry, adev->msix_mask_bitmap))
> > > + entry->flags |= KVM_MSIX_FLAG_MASK;
> > > + else
> > > + entry->flags &= ~KVM_MSIX_FLAG_MASK;
> > > + }
> > > +
> > > +out:
> > > + mutex_unlock(&kvm->lock);
> > > +
> > > + return r;
> > > +}
> > > +
> > > +static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
> > > + gpa_t addr, int len)
> > > +{
> > > + gpa_t start, end;
> > > +
> > > + BUG_ON(adev->msix_mmio_base == 0);
> >
> > Below I see
> >
> > adev->msix_mmio_base = msix_mmio->base_addr;
> >
> > which comes from userspace. BUG_ON is an unfriendly way to
> > tell user about a bug in qemu.
> >
> > Anyway, I don't think we should special-case 0 gpa.
> > It's up to the user where to base the table.
> > Use a separate flag to signal that table was initialized.
>
> base_addr can't be 0 in any case; only unimplemented BARs are hardwired
> to zero. So if we get here with base_addr == 0, it is surely a bug.
>
> > > + start = adev->msix_mmio_base;
> > > + end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
> > > KVM_MAX_MSIX_PER_DEV;
> >
> > This seems wrong. What if the device has a smaller msix table?
> > Can't we simply pass in base and length?
>
> OK, OK, you would have it...
>
> > > + if (addr >= start && addr + len <= end)
> >
> > interesting wrap-around cases come to mind. Likely not a real
> > problem but maybe better use addr + len - 1 all over
> > to make it robust.
>
> Quite good point. :)
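The wrap-around concern can be sketched as follows (a hypothetical helper, not the patch's actual check, which uses gpa_t and the registered base): comparing against addr + len - 1 and rejecting lengths that wrap keeps the range check robust near the top of the address space.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;

/* start..end_incl describes the MMIO region as an inclusive range. */
static bool mmio_in_range(gpa_t start, gpa_t end_incl, gpa_t addr, int len)
{
	gpa_t last = addr + (gpa_t)len - 1;

	if (len <= 0 || last < addr)	/* reject wrap-around */
		return false;
	return addr >= start && last <= end_incl;
}
```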
>
> > > + return true;
> > > +
> > > + return false;
> > > +}
> > > +
> > > +static int msix_get_enabled_idx(struct kvm_assigned_dev_kernel *adev,
> > > + gpa_t addr, int len)
> > > +{
> > > + int i;
> > > + gpa_t start, end;
> > > +
> > > + for (i = 0; i < adev->entries_nr; i++) {
> > > + start = adev->msix_mmio_base +
> > > + adev->guest_msix_entries[i].entry * PCI_MSIX_ENTRY_SIZE;
> > > + end = start + PCI_MSIX_ENTRY_SIZE;
> > > + if (addr >= start && addr + len <= end) {
> > > + return i;
> > > + }
> > > + }
> >
> > That's a scary thing to do on each access.
> > Can we build an interface that does not force us
> > to scan the whole list each time?
>
> We can reduce the calculation, but it seems it can't be O(1) without
> introducing more variables to help.
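One such extra variable could be a reverse map from table-entry index to enabled-vector index, maintained when vectors are (de)assigned. A sketch with illustrative names (entry_to_vector and msix_reverse_map are not in the patch):

```c
#include <stdint.h>

#define KVM_MAX_MSIX_PER_DEV	256
#define PCI_MSIX_ENTRY_SIZE	16

/* Hypothetical reverse map: entry_to_vector[i] holds the index into
 * guest_msix_entries for table entry i, or -1 when that entry has no
 * enabled vector.  Lookup then becomes O(1) instead of a list scan. */
struct msix_reverse_map {
	int entry_to_vector[KVM_MAX_MSIX_PER_DEV];
};

static int msix_lookup_enabled(const struct msix_reverse_map *map,
			       uint64_t mmio_base, uint64_t addr)
{
	uint64_t idx = (addr - mmio_base) / PCI_MSIX_ENTRY_SIZE;

	if (idx >= KVM_MAX_MSIX_PER_DEV)
		return -1;
	return map->entry_to_vector[idx];
}
```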
>
> > > +
> > > + return -EINVAL;
> > > +}
> > > +
> > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int
> > > len, + void *val)
> > > +{
> > > + struct kvm_assigned_dev_kernel *adev =
> > > + container_of(this, struct kvm_assigned_dev_kernel,
> > > + msix_mmio_dev);
> > > + int idx, r = 0;
> > > + u32 entry[4];
> > > + struct kvm_kernel_irq_routing_entry e;
> > > +
> > > + mutex_lock(&adev->kvm->lock);
> > > + if (!msix_mmio_in_range(adev, addr, len)) {
> > > + r = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > + if ((addr & 0x3) || len != 4)
> > > + goto out;
> > > +
> > > + idx = msix_get_enabled_idx(adev, addr, len);
> > > + if (idx < 0) {
> > > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > + if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > + PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > + *(unsigned long *)val =
> > > + test_bit(idx, adev->msix_mask_bitmap) ?
> > > + PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> >
> > I suspect this is wrong for big endian platforms: PCI returns
> > all values in little endian format.
>
> Let big-endian platforms take care of it later... Only x86 has
> __KVM_HAVE_MSIX now.
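For reference, the endianness fix is mechanical: in-kernel code would use cpu_to_le32()/le32_to_cpu() when filling the emulated table. A portable byte-level sketch of the same idea:

```c
#include <stdint.h>
#include <string.h>

/* Serialize a 32-bit register value in PCI (little-endian) byte order,
 * independent of host endianness.  The kernel equivalents would be
 * cpu_to_le32() and le32_to_cpu(). */
static void put_le32(void *dst, uint32_t v)
{
	uint8_t b[4] = { v & 0xff, (v >> 8) & 0xff,
			 (v >> 16) & 0xff, (v >> 24) & 0xff };
	memcpy(dst, b, sizeof(b));
}

static uint32_t get_le32(const void *src)
{
	const uint8_t *b = src;
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```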
>
> > > + else
> > > + r = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > +
> > > + r = kvm_get_irq_routing_entry(adev->kvm,
> > > + adev->guest_msix_entries[idx].vector, &e);
> > > + if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > + printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > > + "idx %d, addr 0x%llx, len %d\n", idx, addr, len);
> >
> > Can a malicious userspace trigger this intentionally?
>
> No.
>
> > > + r = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > + entry[0] = e.msi.address_lo;
> > > + entry[1] = e.msi.address_hi;
> > > + entry[2] = e.msi.data;
> > > + entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > > + adev->msix_mask_bitmap);
> >
> > I suspect this is wrong for big endian platforms: PCI returns
> > all values in little endian format.
> >
> > > + memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4
> >
> > 4 is sizeof *entry?
>
> 4 elements; in fact, PCI_MSIX_ENTRY_SIZE / sizeof(u32).
Well, sizeof *entry is semantically correct. Would update it.
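The index the read path computes can be written out explicitly (a sketch; function and parameter names are illustrative): an offset into the table decomposes into an entry index and a dword index within that 16-byte entry, and dword 3 is Vector Control.

```c
#include <stdint.h>

#define PCI_MSIX_ENTRY_SIZE 16

/* Decompose a table offset into (entry index, dword index within the
 * entry); the read path then copies from &entry[dword]. */
static void msix_offset_decode(uint64_t offset, unsigned int *entry,
			       unsigned int *dword)
{
	*entry = offset / PCI_MSIX_ENTRY_SIZE;
	*dword = (offset % PCI_MSIX_ENTRY_SIZE) / sizeof(uint32_t);
}
```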
--
regards
Yang, Sheng
>
> > ], len);
> >
> > > +
> > > +out:
> > > + mutex_unlock(&adev->kvm->lock);
> > > + return r;
> > > +}
> > > +
> > > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int
> > > len, + const void *val)
> > > +{
> > > + struct kvm_assigned_dev_kernel *adev =
> > > + container_of(this, struct kvm_assigned_dev_kernel,
> > > + msix_mmio_dev);
> > > + int idx, r = 0;
> > > + unsigned long new_val = *(unsigned long *)val;
> >
> > I suspect this is wrong for big endian platforms: PCI returns
> > all values in little endian format.
> >
> > > +
> > > + mutex_lock(&adev->kvm->lock);
> > > + if (!msix_mmio_in_range(adev, addr, len)) {
> > > + r = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > + if ((addr & 0x3) || len != 4)
> > > + goto out;
> > > +
> > > + idx = msix_get_enabled_idx(adev, addr, len);
> > > + if (idx < 0) {
> > > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > + if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > + PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> > > + if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > + goto out;
> > > + if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > + set_bit(idx, adev->msix_mask_bitmap);
> > > + else
> > > + clear_bit(idx, adev->msix_mask_bitmap);
> > > + } else
> > > + /* Userspace would handle other MMIO writing */
> > > + r = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > + if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > > + r = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > + if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > + goto out;
> > > + update_msix_mask(adev, idx, !!(new_val &
> > > PCI_MSIX_ENTRY_CTRL_MASKBIT)); +out:
> > > + mutex_unlock(&adev->kvm->lock);
> > > +
> > > + return r;
> > > +}
> > > +
> > > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > > + .read = msix_mmio_read,
> > > + .write = msix_mmio_write,
> > > +};
> > > +
> > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > + struct kvm_assigned_msix_mmio *msix_mmio)
> > > +{
> > > + int r = 0;
> > > + struct kvm_assigned_dev_kernel *adev;
> > > +
> > > + mutex_lock(&kvm->lock);
> > > + adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > + msix_mmio->assigned_dev_id);
> > > + if (!adev) {
> > > + r = -EINVAL;
> > > + goto out;
> > > + }
> > > + if (msix_mmio->base_addr == 0) {
> > > + r = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + mutex_lock(&kvm->slots_lock);
> > > + if (adev->msix_mmio_base == 0) {
> >
> > What does this do? Handles case we are already registered?
> > Also ! adev->msix_mmio_base
>
> We don't need to register again. Just update the address.
>
> Well, msix_mmio_base is an address. I think == 0 is more proper.
>
> --
> regards
> Yang, Sheng
>
> > > + kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > + r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > + &adev->msix_mmio_dev);
> > > + if (r)
> > > + goto out2;
> > > + }
> > > +
> > > + adev->msix_mmio_base = msix_mmio->base_addr;
> > > +out2:
> > > + mutex_unlock(&kvm->slots_lock);
> > > +out:
> > > + mutex_unlock(&kvm->lock);
> > > +
> > > + return r;
> > > +}
> > >
> > > #endif
> > >
> > > long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > >
> > > @@ -812,6 +1065,36 @@ long kvm_vm_ioctl_assigned_device(struct kvm
> > > *kvm, unsigned ioctl,
> > >
> > > goto out;
> > >
> > > break;
> > >
> > > }
> > >
> > > + case KVM_ASSIGN_GET_MSIX_ENTRY: {
> > > + struct kvm_assigned_msix_entry entry;
> > > + r = -EFAULT;
> > > + if (copy_from_user(&entry, argp, sizeof entry))
> > > + goto out;
> > > + r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> > > + if (r)
> > > + goto out;
> > > + r = -EFAULT;
> > > + if (copy_to_user(argp, &entry, sizeof entry))
> > > + goto out;
> > > + r = 0;
> > > + break;
> > > + }
> > > + case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > + struct kvm_assigned_msix_mmio msix_mmio;
> > > +
> > > + r = -EFAULT;
> > > + if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > + goto out;
> > > +
> > > + r = -EINVAL;
> > > + if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > + goto out;
> > > +
> > > + r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > + if (r)
> > > + goto out;
> > > + break;
> > > + }
> > >
> > > #endif
> > >
> > > }
> > >
> > > out:
* [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-05 4:20 ` Sheng Yang
@ 2010-11-05 7:16 ` Sheng Yang
2010-11-05 8:51 ` Michael S. Tsirkin
0 siblings, 1 reply; 29+ messages in thread
From: Sheng Yang @ 2010-11-05 7:16 UTC (permalink / raw)
To: Avi Kivity, Marcelo Tosatti, Michael S. Tsirkin; +Cc: kvm, Sheng Yang
This patch enables per-vector masking for assigned devices using MSI-X.
This patch provides two new APIs: one for the guest to specify the device's MSI-X
table address in MMIO, the other for userspace to get information about the mask
bit.
All mask bit operations are kept in the kernel for performance.
Userspace shouldn't access the device MMIO directly for this information;
instead it should use the provided API.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
---
Documentation/kvm/api.txt | 48 ++++++++
arch/x86/kvm/x86.c | 1 +
include/linux/kvm.h | 22 ++++-
include/linux/kvm_host.h | 4 +
virt/kvm/assigned-dev.c | 286 ++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 359 insertions(+), 2 deletions(-)
diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index b336266..76f800b 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -1085,6 +1085,54 @@ of 4 instructions that make up a hypercall.
If any additional field gets added to this structure later on, a bit for that
additional piece of information will be set in the flags bitmap.
+4.47 KVM_ASSIGN_REG_MSIX_MMIO
+
+Capability: KVM_CAP_DEVICE_MSIX_MASK
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_assigned_msix_mmio (in)
+Returns: 0 on success, !0 on error
+
+struct kvm_assigned_msix_mmio {
+ /* Assigned device's ID */
+ __u32 assigned_dev_id;
+ /* Must be 0 */
+ __u32 flags;
+ /* MSI-X table MMIO address */
+ __u64 base_addr;
+ /* Maximum entries contained in the table, <= KVM_MAX_MSIX_PER_DEV */
+ __u32 max_entries_nr;
+ /* Must be 0, reserved for future use */
+ __u32 reserved;
+};
+
+This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X
+mask bit in the kernel.
+
+4.48 KVM_ASSIGN_GET_MSIX_ENTRY
+
+Capability: KVM_CAP_DEVICE_MSIX_MASK
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_assigned_msix_entry (in and out)
+Returns: 0 on success, !0 on error
+
+struct kvm_assigned_msix_entry {
+ /* Assigned device's ID */
+ __u32 assigned_dev_id;
+ /* Ignored */
+ __u32 gsi;
+ /* The index of entry in the MSI-X table */
+ __u16 entry;
+ /* Querying flags and returning status */
+ __u16 flags;
+ /* Must be 0 */
+ __u16 padding[2];
+};
+
+This ioctl would allow userspace to get the status of one specific MSI-X
+entry. Currently we support mask bit status querying.
+
5. The kvm_run structure
Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f3f86b2..8fd5121 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_DEBUGREGS:
case KVM_CAP_X86_ROBUST_SINGLESTEP:
case KVM_CAP_XSAVE:
+ case KVM_CAP_DEVICE_MSIX_MASK:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 919ae53..bfe5707 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
#endif
#define KVM_CAP_PPC_GET_PVINFO 57
#define KVM_CAP_PPC_IRQ_LEVEL 58
+#ifdef __KVM_HAVE_MSIX
+#define KVM_CAP_DEVICE_MSIX_MASK 59
+#endif
#ifdef KVM_CAP_IRQ_ROUTING
@@ -671,6 +674,10 @@ struct kvm_clock_data {
#define KVM_XEN_HVM_CONFIG _IOW(KVMIO, 0x7a, struct kvm_xen_hvm_config)
#define KVM_SET_CLOCK _IOW(KVMIO, 0x7b, struct kvm_clock_data)
#define KVM_GET_CLOCK _IOR(KVMIO, 0x7c, struct kvm_clock_data)
+#define KVM_ASSIGN_GET_MSIX_ENTRY _IOWR(KVMIO, 0x7d, \
+ struct kvm_assigned_msix_entry)
+#define KVM_ASSIGN_REG_MSIX_MMIO _IOW(KVMIO, 0x7e, \
+ struct kvm_assigned_msix_mmio)
/* Available with KVM_CAP_PIT_STATE2 */
#define KVM_GET_PIT2 _IOR(KVMIO, 0x9f, struct kvm_pit_state2)
#define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2)
@@ -787,11 +794,24 @@ struct kvm_assigned_msix_nr {
};
#define KVM_MAX_MSIX_PER_DEV 256
+
+#define KVM_MSIX_FLAG_MASK (1 << 0)
+#define KVM_MSIX_FLAG_QUERY_MASK (1 << 15)
+
struct kvm_assigned_msix_entry {
__u32 assigned_dev_id;
__u32 gsi;
__u16 entry; /* The index of entry in the MSI-X table */
- __u16 padding[3];
+ __u16 flags;
+ __u16 padding[2];
+};
+
+struct kvm_assigned_msix_mmio {
+ __u32 assigned_dev_id;
+ __u32 flags;
+ __u64 base_addr;
+ __u32 max_entries_nr;
+ __u32 reserved;
};
#endif /* __LINUX_KVM_H */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e2ecbac..f58aaca 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -464,6 +464,10 @@ struct kvm_assigned_dev_kernel {
struct pci_dev *dev;
struct kvm *kvm;
spinlock_t assigned_dev_lock;
+ DECLARE_BITMAP(msix_mask_bitmap, KVM_MAX_MSIX_PER_DEV);
+ gpa_t msix_mmio_base;
+ struct kvm_io_device msix_mmio_dev;
+ int msix_max_entries_nr;
};
struct kvm_irq_mask_notifier {
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 7c98928..e3b5530 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -232,6 +232,14 @@ static void kvm_free_assigned_device(struct kvm *kvm,
{
kvm_free_assigned_irq(kvm, assigned_dev);
+#ifdef __KVM_HAVE_MSIX
+ if (assigned_dev->msix_mmio_base) {
+ mutex_lock(&kvm->slots_lock);
+ kvm_io_bus_unregister_dev(kvm, KVM_MMIO_BUS,
+ &assigned_dev->msix_mmio_dev);
+ mutex_unlock(&kvm->slots_lock);
+ }
+#endif
pci_reset_function(assigned_dev->dev);
pci_release_regions(assigned_dev->dev);
@@ -504,7 +512,7 @@ out:
static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
struct kvm_assigned_pci_dev *assigned_dev)
{
- int r = 0, idx;
+ int r = 0, idx, i;
struct kvm_assigned_dev_kernel *match;
struct pci_dev *dev;
@@ -563,6 +571,10 @@ static int kvm_vm_ioctl_assign_device(struct kvm *kvm,
list_add(&match->list, &kvm->arch.assigned_dev_head);
+ /* The state after reset of MSI-X table is all masked */
+ for (i = 0; i < KVM_MAX_MSIX_PER_DEV; i++)
+ set_bit(i, match->msix_mask_bitmap);
+
if (assigned_dev->flags & KVM_DEV_ASSIGN_ENABLE_IOMMU) {
if (!kvm->arch.iommu_domain) {
r = kvm_iommu_map_guest(kvm);
@@ -666,6 +678,43 @@ msix_nr_out:
return r;
}
+static void update_msix_mask(struct kvm_assigned_dev_kernel *adev,
+ int idx, bool new_mask_flag)
+{
+ int irq;
+ bool old_mask_flag, need_flush = false;
+
+ spin_lock_irq(&adev->assigned_dev_lock);
+
+ if (!adev->dev->msix_enabled ||
+ !(adev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX))
+ goto out;
+
+ old_mask_flag = test_bit(adev->guest_msix_entries[idx].entry,
+ adev->msix_mask_bitmap);
+ if (old_mask_flag == new_mask_flag)
+ goto out;
+
+ irq = adev->host_msix_entries[idx].vector;
+ BUG_ON(irq == 0);
+
+ if (new_mask_flag) {
+ set_bit(adev->guest_msix_entries[idx].entry,
+ adev->msix_mask_bitmap);
+ disable_irq_nosync(irq);
+ need_flush = true;
+ } else {
+ clear_bit(adev->guest_msix_entries[idx].entry,
+ adev->msix_mask_bitmap);
+ enable_irq(irq);
+ }
+out:
+ spin_unlock_irq(&adev->assigned_dev_lock);
+
+ if (need_flush)
+ flush_work(&adev->interrupt_work);
+}
+
static int kvm_vm_ioctl_set_msix_entry(struct kvm *kvm,
struct kvm_assigned_msix_entry *entry)
{
@@ -700,6 +749,211 @@ msix_entry_out:
return r;
}
+
+static int kvm_vm_ioctl_get_msix_entry(struct kvm *kvm,
+ struct kvm_assigned_msix_entry *entry)
+{
+ int r = 0;
+ struct kvm_assigned_dev_kernel *adev;
+
+ mutex_lock(&kvm->lock);
+
+ adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ entry->assigned_dev_id);
+
+ if (!adev) {
+ r = -EINVAL;
+ goto out;
+ }
+
+ if (entry->entry >= adev->msix_max_entries_nr) {
+ r = -ENOSPC;
+ goto out;
+ }
+
+ if (entry->flags & KVM_MSIX_FLAG_QUERY_MASK) {
+ if (test_bit(entry->entry, adev->msix_mask_bitmap))
+ entry->flags |= KVM_MSIX_FLAG_MASK;
+ else
+ entry->flags &= ~KVM_MSIX_FLAG_MASK;
+ }
+
+out:
+ mutex_unlock(&kvm->lock);
+
+ return r;
+}
+
+static bool msix_mmio_in_range(struct kvm_assigned_dev_kernel *adev,
+ gpa_t addr, int len)
+{
+ gpa_t start, end;
+
+ BUG_ON(adev->msix_mmio_base == 0);
+ start = adev->msix_mmio_base;
+ end = adev->msix_mmio_base + PCI_MSIX_ENTRY_SIZE *
+ adev->msix_max_entries_nr;
+ if (addr >= start && addr + len <= end)
+ return true;
+
+ return false;
+}
+
+static int msix_get_enabled_idx(struct kvm_assigned_dev_kernel *adev,
+ gpa_t addr, int len)
+{
+ int i, index = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
+
+ for (i = 0; i < adev->entries_nr; i++)
+ if (adev->guest_msix_entries[i].entry == index)
+ return i;
+
+ return -EINVAL;
+}
+
+static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
+ void *val)
+{
+ struct kvm_assigned_dev_kernel *adev =
+ container_of(this, struct kvm_assigned_dev_kernel,
+ msix_mmio_dev);
+ int idx, r = 0;
+ u32 entry[4];
+ struct kvm_kernel_irq_routing_entry e;
+
+ mutex_lock(&adev->kvm->lock);
+ if (!msix_mmio_in_range(adev, addr, len)) {
+ r = -EOPNOTSUPP;
+ goto out;
+ }
+ if ((addr & 0x3) || len != 4)
+ goto out;
+
+ idx = msix_get_enabled_idx(adev, addr, len);
+ if (idx < 0) {
+ idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
+ if ((addr % PCI_MSIX_ENTRY_SIZE) ==
+ PCI_MSIX_ENTRY_VECTOR_CTRL)
+ *(unsigned long *)val =
+ test_bit(idx, adev->msix_mask_bitmap) ?
+ PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
+ else
+ r = -EOPNOTSUPP;
+ goto out;
+ }
+
+ r = kvm_get_irq_routing_entry(adev->kvm,
+ adev->guest_msix_entries[idx].vector, &e);
+ if (r || e.type != KVM_IRQ_ROUTING_MSI) {
+ printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
+ "idx %d, addr 0x%llx, len %d\n", idx, addr, len);
+ r = -EOPNOTSUPP;
+ goto out;
+ }
+ entry[0] = e.msi.address_lo;
+ entry[1] = e.msi.address_hi;
+ entry[2] = e.msi.data;
+ entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
+ adev->msix_mask_bitmap);
+ memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / sizeof *entry], len);
+
+out:
+ mutex_unlock(&adev->kvm->lock);
+ return r;
+}
+
+static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
+ const void *val)
+{
+ struct kvm_assigned_dev_kernel *adev =
+ container_of(this, struct kvm_assigned_dev_kernel,
+ msix_mmio_dev);
+ int idx, r = 0;
+ unsigned long new_val = *(unsigned long *)val;
+
+ mutex_lock(&adev->kvm->lock);
+ if (!msix_mmio_in_range(adev, addr, len)) {
+ r = -EOPNOTSUPP;
+ goto out;
+ }
+ if ((addr & 0x3) || len != 4)
+ goto out;
+
+ idx = msix_get_enabled_idx(adev, addr, len);
+ if (idx < 0) {
+ idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
+ if (((addr % PCI_MSIX_ENTRY_SIZE) ==
+ PCI_MSIX_ENTRY_VECTOR_CTRL)) {
+ if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
+ goto out;
+ if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
+ set_bit(idx, adev->msix_mask_bitmap);
+ else
+ clear_bit(idx, adev->msix_mask_bitmap);
+ } else
+ /* Userspace would handle other MMIO writing */
+ r = -EOPNOTSUPP;
+ goto out;
+ }
+ if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
+ r = -EOPNOTSUPP;
+ goto out;
+ }
+ if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
+ goto out;
+ update_msix_mask(adev, idx, !!(new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT));
+out:
+ mutex_unlock(&adev->kvm->lock);
+
+ return r;
+}
+
+static const struct kvm_io_device_ops msix_mmio_ops = {
+ .read = msix_mmio_read,
+ .write = msix_mmio_write,
+};
+
+static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
+ struct kvm_assigned_msix_mmio *msix_mmio)
+{
+ int r = 0;
+ struct kvm_assigned_dev_kernel *adev;
+
+ mutex_lock(&kvm->lock);
+ adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
+ msix_mmio->assigned_dev_id);
+ if (!adev) {
+ r = -EINVAL;
+ goto out;
+ }
+ if (msix_mmio->base_addr == 0) {
+ r = -EINVAL;
+ goto out;
+ }
+ if (msix_mmio->max_entries_nr == 0 ||
+ msix_mmio->max_entries_nr > KVM_MAX_MSIX_PER_DEV) {
+ r = -EINVAL;
+ goto out;
+ }
+
+ mutex_lock(&kvm->slots_lock);
+ if (adev->msix_mmio_base == 0) {
+ kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
+ r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
+ &adev->msix_mmio_dev);
+ if (r)
+ goto out2;
+ }
+
+ adev->msix_mmio_base = msix_mmio->base_addr;
+ adev->msix_max_entries_nr = msix_mmio->max_entries_nr;
+out2:
+ mutex_unlock(&kvm->slots_lock);
+out:
+ mutex_unlock(&kvm->lock);
+
+ return r;
+}
#endif
long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
@@ -812,6 +1066,36 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
goto out;
break;
}
+ case KVM_ASSIGN_GET_MSIX_ENTRY: {
+ struct kvm_assigned_msix_entry entry;
+ r = -EFAULT;
+ if (copy_from_user(&entry, argp, sizeof entry))
+ goto out;
+ r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
+ if (r)
+ goto out;
+ r = -EFAULT;
+ if (copy_to_user(argp, &entry, sizeof entry))
+ goto out;
+ r = 0;
+ break;
+ }
+ case KVM_ASSIGN_REG_MSIX_MMIO: {
+ struct kvm_assigned_msix_mmio msix_mmio;
+
+ r = -EFAULT;
+ if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
+ goto out;
+
+ r = -EINVAL;
+ if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
+ goto out;
+
+ r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
+ if (r)
+ goto out;
+ break;
+ }
#endif
}
out:
--
1.7.0.1
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-05 2:44 ` Sheng Yang
2010-11-05 4:20 ` Sheng Yang
@ 2010-11-05 8:43 ` Michael S. Tsirkin
2010-11-05 10:53 ` Sheng Yang
1 sibling, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2010-11-05 8:43 UTC (permalink / raw)
To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > +};
> > > +
> > > +This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X
> > > +mask bit in the kernel.
> >
> > What happens on repeated calls when it's already enabled?
> > How does one disable after it's enabled?
>
> It is supposed to be called only once, though calling it again would simply update the MMIO base.
So maybe add this all to documentation.
> It
> would be disabled along with device deassignment.
So what are you going to do when guest enables and later disables MSIX?
Disable assignment completely?
> We enable it explicitly because
> not all devices have MSI-X capability.
> >
> > > +
> >
> > This is very specialized for assigned devices. I would like an
> > interface not tied to assigned devices explicitly
> > (even if the implementation at the moment only works
> > for assigned devices, I can patch that in later). E.g. vhost-net would
> > benefit from such an extension too. Maybe tie this to a special
> > GSI and this GSI can be an assigned device or not?
>
> You're talking about KVM_ASSIGN_GET_MSIX_ENTRY?
Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO
and kvm_assigned_msix_mmio really.
> We can't tie it to GSI, because we
> should also cover entries without GSI. The entry number should be fine for all
> kinds of devices using MSI-X.
I don't completely understand: entries without GSI
never need to inject an interrupt. Why do we care
about them? Let's pass such accesses to userspace.
In any case, I think MSIX GSIs are inexpensive (we never search them
all), so we could create GSIs for all entries if we wanted to.
> I am not sure what a PV one would look like.
> I use kvm_assigned_msix_entry
> just because it can be easily reused.
Not only PV, emulated devices too. OK let's try to figure out:
What happens with userspace is, we inject interrupts either through
ioctls or eventfds. Avoiding an exit to userspace on MSIX
access is beneficial in exactly the same way.
We could have an ioctl to pass in the table addresses, and mapping
table to relevant GSIs. For example, add kvm_msix_mmio with table
start/end, also specify which GSIs are covered. Maybe ask that
userspace allocate all GSIs for a table consecutively?
Or add kvm_irq_routing_msix which is same as msi but
also has the mmio addresses? Bonus points for avoiding the need
to scan all table on each write. For example, don't scan at all,
instead just set a bit in KVM, and set irq entry on the next
interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32 byte bitmap
would be enough for this, avoiding a linear scan.
When we pass in kvm_msix_mmio, kvm would start registering masked state.
> >
> > > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> > > +
> > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > +Architectures: x86
> > > +Type: vm ioctl
> > > +Parameters: struct kvm_assigned_msix_entry (in and out)
> > > +Returns: 0 on success, !0 on error
> > > +
> > > +struct kvm_assigned_msix_entry {
> > > + /* Assigned device's ID */
> > > + __u32 assigned_dev_id;
> > > + /* Ignored */
> > > + __u32 gsi;
> > > + /* The index of entry in the MSI-X table */
> > > + __u16 entry;
> > > + /* Querying flags and returning status */
> > > + __u16 flags;
> > > + /* Must be 0 */
> > > + __u16 padding[2];
> > > +};
> > > +
> > > +This ioctl would allow userspace to get the status of one specific MSI-X
> > > +entry. Currently we support mask bit status querying.
> > > +
> > >
> > > 5. The kvm_run structure
> > >
> > > Application code obtains a pointer to the kvm_run structure by
> > >
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index f3f86b2..8fd5121 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > >
> > > case KVM_CAP_DEBUGREGS:
> > > case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > >
> > > case KVM_CAP_XSAVE:
> > > + case KVM_CAP_DEVICE_MSIX_MASK:
> > > r = 1;
> > > break;
> > >
> > > case KVM_CAP_COALESCED_MMIO:
> > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > index 919ae53..eafafb1 100644
> > > --- a/include/linux/kvm.h
> > > +++ b/include/linux/kvm.h
> > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > >
> > > #endif
> > > #define KVM_CAP_PPC_GET_PVINFO 57
> > > #define KVM_CAP_PPC_IRQ_LEVEL 58
> > >
> > > +#ifdef __KVM_HAVE_MSIX
> > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > +#endif
> >
> > We seem to have these HAVE macros all over.
> > Avi, what's the idea? Let's drop them?
>
> Um? This kind of macro is used here to avoid vendor-specific strings in the header
> file.
Am I the only one that dislikes the ifdefs we have all over this code then?
The rest of Linux tries to keep ifdefs local by stubbing functionality out.
> > > + BUG_ON(adev->msix_mmio_base == 0);
> >
> > Below I see
> > adev->msix_mmio_base = msix_mmio->base_addr;
> > which comes from userspace. BUG_ON is an unfriendly way to
> > tell user about a bug in qemu.
> >
> > Anyway, I don't think we should special-case 0 gpa.
> > It's up to the user where to base the table.
> > Use a separate flag to signal that table was initialized.
>
> base_addr can't be 0 in any case. Only unimplemented BARs are hardwired to
> zero.
No, the PCI spec does not say so. Check it out: an unimplemented BAR
is hardwired to zero, but an implemented one can be set to 0 as well.
> Also if we get here with base_addr = 0, it is surely a bug.
> > > +
> > > + return -EINVAL;
> > > +}
> > > +
> > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
> > > +			  void *val)
> > > +{
> > > + struct kvm_assigned_dev_kernel *adev =
> > > + container_of(this, struct kvm_assigned_dev_kernel,
> > > + msix_mmio_dev);
> > > + int idx, r = 0;
> > > + u32 entry[4];
> > > + struct kvm_kernel_irq_routing_entry e;
> > > +
> > > + mutex_lock(&adev->kvm->lock);
> > > + if (!msix_mmio_in_range(adev, addr, len)) {
> > > + r = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > + if ((addr & 0x3) || len != 4)
> > > + goto out;
> > > +
> > > + idx = msix_get_enabled_idx(adev, addr, len);
> > > + if (idx < 0) {
> > > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > + if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > + PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > + *(unsigned long *)val =
> > > + test_bit(idx, adev->msix_mask_bitmap) ?
> > > + PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> >
> > I suspect this is wrong for big endian platforms: PCI returns
> > all values in little endian format.
>
> Let big-endian platforms take care of it later... Only x86 has
> __KVM_HAVE_MSIX now.
Well it's easier to make the code correct when you write it than later.
> >
> > > + else
> > > + r = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > +
> > > + r = kvm_get_irq_routing_entry(adev->kvm,
> > > + adev->guest_msix_entries[idx].vector, &e);
> > > + if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > + printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > > + "idx %d, addr 0x%llx, len %d\n", idx, addr, len);
> >
> > Can a malicious userspace trigger this intentionally?
>
> No.
Here's a scenario: bind an MSI-X routing entry to a device, then later
change its type to a regular interrupt. That fills the log with warnings,
doesn't it?
> >
> > > + r = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > + entry[0] = e.msi.address_lo;
> > > + entry[1] = e.msi.address_hi;
> > > + entry[2] = e.msi.data;
> > > + entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > > + adev->msix_mask_bitmap);
> >
> > I suspect this is wrong for big endian platforms: PCI returns
> > all values in little endian format.
> >
> > > + memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4
> >
> > 4 is sizeof *entry?
>
> 4 elements.
> In fact "PCI_MSIX_ENTRY_SIZE / sizeof u32".
Are you sure? This does not make sense to me. The address is in bytes; we
take the offset from the start of the entry in bytes, and to figure out which
dword we need we divide by the size of a dword. No?
> >
> > ], len);
> >
> > > +
> > > +out:
> > > + mutex_unlock(&adev->kvm->lock);
> > > + return r;
> > > +}
> > > +
> > > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
> > > +			   const void *val)
> > > +{
> > > + struct kvm_assigned_dev_kernel *adev =
> > > + container_of(this, struct kvm_assigned_dev_kernel,
> > > + msix_mmio_dev);
> > > + int idx, r = 0;
> > > + unsigned long new_val = *(unsigned long *)val;
> >
> > I suspect this is wrong for big endian platforms: PCI returns
> > all values in little endian format.
> >
> > > +
> > > + mutex_lock(&adev->kvm->lock);
> > > + if (!msix_mmio_in_range(adev, addr, len)) {
> > > + r = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > + if ((addr & 0x3) || len != 4)
> > > + goto out;
> > > +
> > > + idx = msix_get_enabled_idx(adev, addr, len);
> > > + if (idx < 0) {
> > > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > + if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > + PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> > > + if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > + goto out;
> > > + if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > + set_bit(idx, adev->msix_mask_bitmap);
> > > + else
> > > + clear_bit(idx, adev->msix_mask_bitmap);
> > > + } else
> > > + /* Userspace would handle other MMIO writing */
> > > + r = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > + if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > > + r = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > + if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > + goto out;
> > > + update_msix_mask(adev, idx, !!(new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT));
> > > +out:
> > > + mutex_unlock(&adev->kvm->lock);
> > > +
> > > + return r;
> > > +}
> > > +
> > > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > > + .read = msix_mmio_read,
> > > + .write = msix_mmio_write,
> > > +};
> > > +
> > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > + struct kvm_assigned_msix_mmio *msix_mmio)
> > > +{
> > > + int r = 0;
> > > + struct kvm_assigned_dev_kernel *adev;
> > > +
> > > + mutex_lock(&kvm->lock);
> > > + adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > + msix_mmio->assigned_dev_id);
> > > + if (!adev) {
> > > + r = -EINVAL;
> > > + goto out;
> > > + }
> > > + if (msix_mmio->base_addr == 0) {
> > > + r = -EINVAL;
> > > + goto out;
> > > + }
> > > +
> > > + mutex_lock(&kvm->slots_lock);
> > > + if (adev->msix_mmio_base == 0) {
> >
> > What does this do? Handles case we are already registered?
> > Also ! adev->msix_mmio_base
>
> We don't need to register again. Just update the address.
>
> Well, msix_mmio_base is an address. I think ==0 is more proper.
Yes, there's a test above that validates for 0 and returns EINVAL.
But using a separate flag that userspace can not control
to mean 'already registered' will make the code clearer.
>
> --
> regards
> Yang, Sheng
>
> >
> > > + kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > + r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > + &adev->msix_mmio_dev);
> > > + if (r)
> > > + goto out2;
> > > + }
> > > +
> > > + adev->msix_mmio_base = msix_mmio->base_addr;
> > > +out2:
> > > + mutex_unlock(&kvm->slots_lock);
> > > +out:
> > > + mutex_unlock(&kvm->lock);
> > > +
> > > + return r;
> > > +}
> > >
> > > #endif
> > >
> > > long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > >
> > > @@ -812,6 +1065,36 @@ long kvm_vm_ioctl_assigned_device(struct kvm *kvm,
> > > unsigned ioctl,
> > >
> > > goto out;
> > >
> > > break;
> > >
> > > }
> > >
> > > + case KVM_ASSIGN_GET_MSIX_ENTRY: {
> > > + struct kvm_assigned_msix_entry entry;
> > > + r = -EFAULT;
> > > + if (copy_from_user(&entry, argp, sizeof entry))
> > > + goto out;
> > > + r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> > > + if (r)
> > > + goto out;
> > > + r = -EFAULT;
> > > + if (copy_to_user(argp, &entry, sizeof entry))
> > > + goto out;
> > > + r = 0;
> > > + break;
> > > + }
> > > + case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > + struct kvm_assigned_msix_mmio msix_mmio;
> > > +
> > > + r = -EFAULT;
> > > + if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > + goto out;
> > > +
> > > + r = -EINVAL;
> > > + if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > + goto out;
> > > +
> > > + r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > + if (r)
> > > + goto out;
> > > + break;
> > > + }
> > >
> > > #endif
> > >
> > > }
> > >
> > > out:
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-05 7:16 ` Sheng Yang
@ 2010-11-05 8:51 ` Michael S. Tsirkin
2010-11-05 10:54 ` Sheng Yang
0 siblings, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2010-11-05 8:51 UTC (permalink / raw)
To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Fri, Nov 05, 2010 at 03:16:23PM +0800, Sheng Yang wrote:
> This patch enables per-vector masking for assigned devices using MSI-X.
>
> This patch provides two new APIs: one to specify the device's MSI-X table
> address in guest MMIO space, the other for userspace to query the mask bit
> status.
>
> All mask bit operations are kept in the kernel, in order to accelerate them.
> Userspace shouldn't access the device MMIO directly for this information;
> instead it should use the provided API.
>
> Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Is this v2? Or repost? If v2 you would want to supply a changelog.
--
MST
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-05 8:43 ` Michael S. Tsirkin
@ 2010-11-05 10:53 ` Sheng Yang
2010-11-05 12:01 ` Sheng Yang
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Sheng Yang @ 2010-11-05 10:53 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
> On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > > +};
> > > > +
> > > > +This ioctl would enable in-kernel MSI-X emulation, which would handle MSI-X
> > > > +mask bit in the kernel.
> > >
> > > What happens on repeated calls when it's already enabled?
> > > How does one disable after it's enabled?
> >
> > It is supposed to be called only once, though calling it again would
> > simply update the MMIO base.
>
> So maybe add this all to documentation.
>
> > It
> > would be disabled along with device deassignment.
>
> So what are you going to do when guest enables and later disables MSIX?
> Disable assignment completely?
This MMIO device follows PCI resource allocation, rather than IRQ assignment.
>
> > We enable it explicitly because
> > not all devices have MSI-X capability.
> >
> > > > +
> > >
> > > This is very specialized for assigned devices. I would like an
> > > interface not tied to assigned devices explicitly
> > > (even if the implementation at the moment only works
> > > for assigned devices, I can patch that in later). E.g. vhost-net would
> > > benefit from such an extension too. Maybe tie this to a special
> > > GSI and this GSI can be an assigned device or not?
> >
> > You're talking about KVM_ASSIGN_GET_MSIX_ENTRY?
>
> Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO
> and kvm_assigned_msix_mmio really.
>
> > We can't tie it to GSI, because we
> > should also cover entries without GSI. The entry number should be fine
> > for all kinds of devices using MSI-X.
>
> I don't completely understand: entries without GSI
> never need to inject an interrupt. Why do we care
> about them? Let's pass such accesses to userspace.
>
> In any case, I think MSIX GSIs are inexpensive (we never search them
> all), so we could create GSIs for all entries if we wanted to.
All mask bit accesses are handled by the kernel, in order to keep the state consistent.
>
> > I am not sure about what PV one would looks like.
> > I use kvm_assigned_msix_entry
> > just because it can be easily reused.
>
> Not only PV, emulated devices too. OK let's try to figure out:
>
> What happens with userspace is, we inject interrupts either through
> ioctls or eventfds. Avoiding an exit to userspace on MSIX
> access is beneficial in exactly the same way.
>
> We could have an ioctl to pass in the table addresses, and mapping
> table to relevant GSIs. For example, add kvm_msix_mmio with table
> start/end, also specify which GSIs are covered.
> Maybe ask that userspace allocate all GSIs for a table consecutively?
> Or add kvm_irq_routing_msix which is same as msi but
> also has the mmio addresses? Bonus points for avoiding the need
> to scan all table on each write. For example, don't scan at all,
> instead just set a bit in KVM, and set irq entry on the next
> interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32
> byte bitmap would be enough for this, avoiding a linear scan.
>
> When we pass in kvm_msix_mmio, kvm would start registering masked state.
>
> > > > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> > > > +
> > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > +Architectures: x86
> > > > +Type: vm ioctl
> > > > +Parameters: struct kvm_assigned_msix_entry (in and out)
> > > > +Returns: 0 on success, !0 on error
> > > > +
> > > > +struct kvm_assigned_msix_entry {
> > > > + /* Assigned device's ID */
> > > > + __u32 assigned_dev_id;
> > > > + /* Ignored */
> > > > + __u32 gsi;
> > > > + /* The index of entry in the MSI-X table */
> > > > + __u16 entry;
> > > > + /* Querying flags and returning status */
> > > > + __u16 flags;
> > > > + /* Must be 0 */
> > > > + __u16 padding[2];
> > > > +};
> > > > +
> > > > +This ioctl would allow userspace to get the status of one specific MSI-X
> > > > +entry. Currently we support mask bit status querying.
> > > > +
> > > >
> > > > 5. The kvm_run structure
> > > >
> > > > Application code obtains a pointer to the kvm_run structure by
> > > >
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index f3f86b2..8fd5121 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > >
> > > > case KVM_CAP_DEBUGREGS:
> > > > case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > >
> > > > case KVM_CAP_XSAVE:
> > > > + case KVM_CAP_DEVICE_MSIX_MASK:
> > > > r = 1;
> > > > break;
> > > >
> > > > case KVM_CAP_COALESCED_MMIO:
> > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > index 919ae53..eafafb1 100644
> > > > --- a/include/linux/kvm.h
> > > > +++ b/include/linux/kvm.h
> > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > >
> > > > #endif
> > > > #define KVM_CAP_PPC_GET_PVINFO 57
> > > > #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > >
> > > > +#ifdef __KVM_HAVE_MSIX
> > > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > > +#endif
> > >
> > > We seem to have these HAVE macros all over.
> > > Avi, what's the idea? Let's drop them?
> >
> > Um? This kind of macro is used here to avoid vendor-specific strings in
> > the header file.
>
> Am I the only one that dislikes the ifdefs we have all over this code then?
> The rest of Linux tries to keep ifdefs local by stubbing functionality out.
>
> > > > + BUG_ON(adev->msix_mmio_base == 0);
> > >
> > > Below I see
> > >
> > > adev->msix_mmio_base = msix_mmio->base_addr;
> > >
> > > which comes from userspace. BUG_ON is an unfriendly way to
> > > tell user about a bug in qemu.
> > >
> > > Anyway, I don't think we should special-case 0 gpa.
> > > It's up to the user where to base the table.
> > > Use a separate flag to signal that table was initialized.
> >
> > base_addr can't be 0 in any case. Only unimplemented BARs are
> > hardwired to zero.
>
> No, the PCI spec does not say so. Check it out: an unimplemented BAR
> is hardwired to zero, but an implemented one can be set to 0 as well.
OK, OK, I will add one more flag.
>
> > Also if we get here with base_addr = 0, it is surely a bug.
> >
> > > > +
> > > > + return -EINVAL;
> > > > +}
> > > > +
> > > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr, int len,
> > > > +			  void *val)
> > > > +{
> > > > + struct kvm_assigned_dev_kernel *adev =
> > > > + container_of(this, struct kvm_assigned_dev_kernel,
> > > > + msix_mmio_dev);
> > > > + int idx, r = 0;
> > > > + u32 entry[4];
> > > > + struct kvm_kernel_irq_routing_entry e;
> > > > +
> > > > + mutex_lock(&adev->kvm->lock);
> > > > + if (!msix_mmio_in_range(adev, addr, len)) {
> > > > + r = -EOPNOTSUPP;
> > > > + goto out;
> > > > + }
> > > > + if ((addr & 0x3) || len != 4)
> > > > + goto out;
> > > > +
> > > > + idx = msix_get_enabled_idx(adev, addr, len);
> > > > + if (idx < 0) {
> > > > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > + if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > + PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > > + *(unsigned long *)val =
> > > > + test_bit(idx, adev->msix_mask_bitmap) ?
> > > > + PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > >
> > > I suspect this is wrong for big endian platforms: PCI returns
> > > all values in little endian format.
> >
> > Let big-endian platforms take care of it later... Only x86 has
> > __KVM_HAVE_MSIX now.
>
> Well it's easier to make the code correct when you write it than later.
Come back to me if someone wants to implement MSI device assignment on a
big-endian machine.
> > > > + else
> > > > + r = -EOPNOTSUPP;
> > > > + goto out;
> > > > + }
> > > > +
> > > > + r = kvm_get_irq_routing_entry(adev->kvm,
> > > > + adev->guest_msix_entries[idx].vector, &e);
> > > > + if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > > + printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > > > + "idx %d, addr 0x%llx, len %d\n", idx, addr, len);
> > >
> > > Can a malicious userspace trigger this intentionally?
> >
> > No.
>
> Here's a scenario: bind an MSI-X routing entry to a device, then later
> change its type to a regular interrupt. That fills the log with warnings,
> doesn't it?
Is QEmu supposed to discard the old routing?
> > > > + r = -EOPNOTSUPP;
> > > > + goto out;
> > > > + }
> > > > + entry[0] = e.msi.address_lo;
> > > > + entry[1] = e.msi.address_hi;
> > > > + entry[2] = e.msi.data;
> > > > + entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > > > + adev->msix_mask_bitmap);
> > >
> > > I suspect this is wrong for big endian platforms: PCI returns
> > > all values in little endian format.
> > >
> > > > + memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4
> > >
> > > 4 is sizeof *entry?
> >
> > 4 elements.
> > In fact "PCI_MSIX_ENTRY_SIZE / sizeof u32".
>
> Are you sure? This does not make sense to me. The address is in bytes; we
> take the offset from the start of the entry in bytes, and to figure out
> which dword we need we divide by the size of a dword. No?
I've corrected this one.
--
regards
Yang, Sheng
>
> > > ], len);
> > >
> > > > +
> > > > +out:
> > > > + mutex_unlock(&adev->kvm->lock);
> > > > + return r;
> > > > +}
> > > > +
> > > > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr, int len,
> > > > +			   const void *val)
> > > > +{
> > > > + struct kvm_assigned_dev_kernel *adev =
> > > > + container_of(this, struct kvm_assigned_dev_kernel,
> > > > + msix_mmio_dev);
> > > > + int idx, r = 0;
> > > > + unsigned long new_val = *(unsigned long *)val;
> > >
> > > I suspect this is wrong for big endian platforms: PCI returns
> > > all values in little endian format.
> > >
> > > > +
> > > > + mutex_lock(&adev->kvm->lock);
> > > > + if (!msix_mmio_in_range(adev, addr, len)) {
> > > > + r = -EOPNOTSUPP;
> > > > + goto out;
> > > > + }
> > > > + if ((addr & 0x3) || len != 4)
> > > > + goto out;
> > > > +
> > > > + idx = msix_get_enabled_idx(adev, addr, len);
> > > > + if (idx < 0) {
> > > > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > + if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > + PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> > > > + if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > + goto out;
> > > > + if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > + set_bit(idx, adev->msix_mask_bitmap);
> > > > + else
> > > > + clear_bit(idx, adev->msix_mask_bitmap);
> > > > + } else
> > > > + /* Userspace would handle other MMIO writing */
> > > > + r = -EOPNOTSUPP;
> > > > + goto out;
> > > > + }
> > > > + if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > > > + r = -EOPNOTSUPP;
> > > > + goto out;
> > > > + }
> > > > + if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > + goto out;
> > > > + update_msix_mask(adev, idx, !!(new_val &
> > > > PCI_MSIX_ENTRY_CTRL_MASKBIT)); +out:
> > > > + mutex_unlock(&adev->kvm->lock);
> > > > +
> > > > + return r;
> > > > +}
> > > > +
> > > > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > > > + .read = msix_mmio_read,
> > > > + .write = msix_mmio_write,
> > > > +};
> > > > +
> > > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > > + struct kvm_assigned_msix_mmio *msix_mmio)
> > > > +{
> > > > + int r = 0;
> > > > + struct kvm_assigned_dev_kernel *adev;
> > > > +
> > > > + mutex_lock(&kvm->lock);
> > > > + adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > + msix_mmio->assigned_dev_id);
> > > > + if (!adev) {
> > > > + r = -EINVAL;
> > > > + goto out;
> > > > + }
> > > > + if (msix_mmio->base_addr == 0) {
> > > > + r = -EINVAL;
> > > > + goto out;
> > > > + }
> > > > +
> > > > + mutex_lock(&kvm->slots_lock);
> > > > + if (adev->msix_mmio_base == 0) {
> > >
> > > What does this do? Handles case we are already registered?
> > > Also ! adev->msix_mmio_base
> >
> > We don't need to register again. Just update the address.
> >
> > Well, msix_mmio_base is an address. I think ==0 is more proper.
>
> Yes, there's a test above that validates for 0 and returns EINVAL.
> But using a separate flag that userspace can not control
> to mean 'already registered' will make the code clearer.
>
> > --
> > regards
> > Yang, Sheng
> >
> > > > + kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > > + r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > > + &adev->msix_mmio_dev);
> > > > + if (r)
> > > > + goto out2;
> > > > + }
> > > > +
> > > > + adev->msix_mmio_base = msix_mmio->base_addr;
> > > > +out2:
> > > > + mutex_unlock(&kvm->slots_lock);
> > > > +out:
> > > > + mutex_unlock(&kvm->lock);
> > > > +
> > > > + return r;
> > > > +}
> > > >
> > > > #endif
> > > >
> > > > long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > >
> > > > @@ -812,6 +1065,36 @@ long kvm_vm_ioctl_assigned_device(struct kvm
> > > > *kvm, unsigned ioctl,
> > > >
> > > > goto out;
> > > >
> > > > break;
> > > >
> > > > }
> > > >
> > > > + case KVM_ASSIGN_GET_MSIX_ENTRY: {
> > > > + struct kvm_assigned_msix_entry entry;
> > > > + r = -EFAULT;
> > > > + if (copy_from_user(&entry, argp, sizeof entry))
> > > > + goto out;
> > > > + r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> > > > + if (r)
> > > > + goto out;
> > > > + r = -EFAULT;
> > > > + if (copy_to_user(argp, &entry, sizeof entry))
> > > > + goto out;
> > > > + r = 0;
> > > > + break;
> > > > + }
> > > > + case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > > + struct kvm_assigned_msix_mmio msix_mmio;
> > > > +
> > > > + r = -EFAULT;
> > > > + if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > > + goto out;
> > > > +
> > > > + r = -EINVAL;
> > > > + if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > > + goto out;
> > > > +
> > > > + r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > > + if (r)
> > > > + goto out;
> > > > + break;
> > > > + }
> > > >
> > > > #endif
> > > >
> > > > }
> > > >
> > > > out:
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-05 8:51 ` Michael S. Tsirkin
@ 2010-11-05 10:54 ` Sheng Yang
0 siblings, 0 replies; 29+ messages in thread
From: Sheng Yang @ 2010-11-05 10:54 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Friday 05 November 2010 16:51:33 Michael S. Tsirkin wrote:
> On Fri, Nov 05, 2010 at 03:16:23PM +0800, Sheng Yang wrote:
> > This patch enables per-vector masking for assigned devices using MSI-X.
> >
> > This patch provides two new APIs: one is for the guest to specify the device's
> > MSI-X table address in MMIO, the other is for userspace to get
> > information about the mask bit.
> >
> > All the mask bit operations are kept in the kernel, in order to accelerate.
> > Userspace shouldn't access the device MMIO directly for the information;
> > instead it should use the provided API to do so.
> >
> > Signed-off-by: Sheng Yang <sheng@linux.intel.com>
>
> Is this v2? Or repost? If v2 you would want to supply a changelog.
Just a short update after your comments. I supposed the "in-reply-to" would work?
--
regards
Yang, Sheng
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-05 10:53 ` Sheng Yang
@ 2010-11-05 12:01 ` Sheng Yang
2010-11-05 13:29 ` Michael S. Tsirkin
2010-11-05 13:27 ` Michael S. Tsirkin
2010-11-06 0:24 ` Marcelo Tosatti
2 siblings, 1 reply; 29+ messages in thread
From: Sheng Yang @ 2010-11-05 12:01 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Friday 05 November 2010 18:53:50 Sheng Yang wrote:
> On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
> > On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > > > +};
> > > > > +
> > > > > +This ioctl would enable in-kernel MSI-X emulation, which would
> > > > > handle MSI-X +mask bit in the kernel.
> > > >
> > > > What happens on repeated calls when it's already enabled?
> > > > How does one disable after it's enabled?
> > >
> > > Suppose this should only be called once. But again would update the
> > > MMIO base.
> >
> > So maybe add this all to documentation.
> >
> > > It
> > > would be disabled along with device deassignment.
> >
> > So what are you going to do when guest enables and later disables MSIX?
> > Disable assignment completely?
>
> This device goes with PCI resources allocation, rather than IRQ assignment.
>
> > > We enable it explicitly because
> > > not all devices have MSI-X capability.
> > >
> > > > > +
> > > >
> > > > This is very specialized for assigned devices. I would like an
> > > > interface not tied to assigned devices explicitly
> > > > (even if the implementation at the moment only works
> > > > for assigned devices, I can patch that in later). E.g. vhost-net
> > > > would benefit from such an extension too. Maybe tie this to a
> > > > special GSI and this GSI can be an assigned device or not?
> > >
> > > You're talking about KVM_ASSIGN_GET_MSIX_ENTRY?
> >
> > Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO
> > and kvm_assigned_msix_mmio really.
> >
> > > We can't tie it to GSI, because we
> > > should also cover entries without GSI. The entry number should be fine
> > > for all kinds of devices using MSI-X.
> >
> > I don't completely understand: entries without GSI
> > never need to inject an interrupt. Why do we care
> > about them? Let's pass such accesses to userspace.
> >
> > In any case, I think MSIX GSIs are inexpensive (we never search them
> > all), so we could create GSIs for all entries if we wanted to.
>
> All mask bit accesses are controlled by the kernel, in order to keep things consistent.
>
> > > I am not sure about what PV one would looks like.
> > > I use kvm_assigned_msix_entry
> > > just because it can be easily reused.
> >
> > Not only PV, emulated devices too. OK let's try to figure out:
> >
> > What happens with userspace is, we inject interrupts either through
> > ioctls or eventfds. Avoiding an exit to userspace on MSIX
> > access is beneficial in exactly the same way.
> >
> > We could have an ioctl to pass in the table addresses, and mapping
> > table to relevant GSIs. For example, add kvm_msix_mmio with table
> > start/end, also specify which GSIs are covered.
> > Maybe ask that userspace allocate all GSIs for a table consecutively?
> > Or add kvm_irq_routing_msix which is same as msi but
> > also has the mmio addresses? Bonus points for avoiding the need
> > to scan all table on each write. For example, don't scan at all,
> > instead just set a bit in KVM, and set irq entry on the next
> > interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32
> > byte bitmap would be enough for this, avoiding a linear scan.
> >
> > When we pass in kvm_msix_mmio, kvm would start registering masked state.
> >
> > > > > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> > > > > +
> > > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > > +Architectures: x86
> > > > > +Type: vm ioctl
> > > > > +Parameters: struct kvm_assigned_msix_entry (in and out)
> > > > > +Returns: 0 on success, !0 on error
> > > > > +
> > > > > +struct kvm_assigned_msix_entry {
> > > > > + /* Assigned device's ID */
> > > > > + __u32 assigned_dev_id;
> > > > > + /* Ignored */
> > > > > + __u32 gsi;
> > > > > + /* The index of entry in the MSI-X table */
> > > > > + __u16 entry;
> > > > > + /* Querying flags and returning status */
> > > > > + __u16 flags;
> > > > > + /* Must be 0 */
> > > > > + __u16 padding[2];
> > > > > +};
> > > > > +
> > > > > +This ioctl would allow userspace to get the status of one specific
> > > > > MSI-X +entry. Currently we support mask bit status querying.
> > > > > +
> > > > >
> > > > > 5. The kvm_run structure
> > > > >
> > > > > Application code obtains a pointer to the kvm_run structure by
> > > > >
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index f3f86b2..8fd5121 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > >
> > > > > case KVM_CAP_DEBUGREGS:
> > > > > case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > >
> > > > > case KVM_CAP_XSAVE:
> > > > > + case KVM_CAP_DEVICE_MSIX_MASK:
> > > > > r = 1;
> > > > > break;
> > > > >
> > > > > case KVM_CAP_COALESCED_MMIO:
> > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > index 919ae53..eafafb1 100644
> > > > > --- a/include/linux/kvm.h
> > > > > +++ b/include/linux/kvm.h
> > > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > > >
> > > > > #endif
> > > > > #define KVM_CAP_PPC_GET_PVINFO 57
> > > > > #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > >
> > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > > > +#endif
> > > >
> > > > We seem to have these HAVE macros all over.
> > > > Avi, what's the idea? Let's drop them?
> > >
> > > Um? This kind of macro is used here to avoid vendor-specific strings
> > > in the header file.
> >
> > Am I the only one that dislikes the ifdefs we have all over this code
> > then? Rest of linux tries to keep ifdefs local by stubbing functionality
> > out.
> >
> > > > > + BUG_ON(adev->msix_mmio_base == 0);
> > > >
> > > > Below I see
> > > >
> > > > adev->msix_mmio_base = msix_mmio->base_addr;
> > > >
> > > > which comes from userspace. BUG_ON is an unfriendly way to
> > > > tell user about a bug in qemu.
> > > >
> > > > Anyway, I don't think we should special-case 0 gpa.
> > > > It's up to the user where to base the table.
> > > > Use a separate flag to signal that table was initialized.
> > >
> > > Base_addr can't be 0 in any case. Only the unimplemented BAR are
> > > hardwired to zero.
> >
> > No, the PCI spec does not say so. Check it out: the unimplemented BAR
> > is hardwired to zero but implemented can be set to 0 as well.
>
> OK, OK, OK, I would add one more flag.
>
> > > Also if we get here with base_addr = 0, it is surely a bug.
> > >
> > > > > +
> > > > > + return -EINVAL;
> > > > > +}
> > > > > +
> > > > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr,
> > > > > int len, + void *val)
> > > > > +{
> > > > > + struct kvm_assigned_dev_kernel *adev =
> > > > > + container_of(this, struct kvm_assigned_dev_kernel,
> > > > > + msix_mmio_dev);
> > > > > + int idx, r = 0;
> > > > > + u32 entry[4];
> > > > > + struct kvm_kernel_irq_routing_entry e;
> > > > > +
> > > > > + mutex_lock(&adev->kvm->lock);
> > > > > + if (!msix_mmio_in_range(adev, addr, len)) {
> > > > > + r = -EOPNOTSUPP;
> > > > > + goto out;
> > > > > + }
> > > > > + if ((addr & 0x3) || len != 4)
> > > > > + goto out;
> > > > > +
> > > > > + idx = msix_get_enabled_idx(adev, addr, len);
> > > > > + if (idx < 0) {
> > > > > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > > + if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > > + PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > > > + *(unsigned long *)val =
> > > > > + test_bit(idx, adev->msix_mask_bitmap) ?
> > > > > + PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > > >
> > > > I suspect this is wrong for big endian platforms: PCI returns
> > > > all values in little endian format.
> > >
> > > Let big endian platforms take care of it later... Only x86 has
> > > __KVM_HAVE_MSIX now.
> >
> > Well it's easier to make the code correct when you write it than later.
>
> Get back to me if someone wants to implement MSI device assignment on a
> big-endian machine.
Sorry, I just realized it's very likely that I won't have a big endian machine to
test it even at that time...
I think it really should be done by someone who would use and test it.
--
regards
Yang, Sheng
>
> > > > > + else
> > > > > + r = -EOPNOTSUPP;
> > > > > + goto out;
> > > > > + }
> > > > > +
> > > > > + r = kvm_get_irq_routing_entry(adev->kvm,
> > > > > + adev->guest_msix_entries[idx].vector, &e);
> > > > > + if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > > > + printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > > > > + "idx %d, addr 0x%llx, len %d\n", idx, addr, len);
> > > >
> > > > Can a malicious userspace trigger this intentionally?
> > >
> > > No.
> >
> > Here's a scenario: bind an MSIX routing entry to device, then later
> > change its type to regular interrupt. Fills the log with warnings,
> > doesn't it?
>
> Is QEmu supposed to discard the old routing?
>
> > > > > + r = -EOPNOTSUPP;
> > > > > + goto out;
> > > > > + }
> > > > > + entry[0] = e.msi.address_lo;
> > > > > + entry[1] = e.msi.address_hi;
> > > > > + entry[2] = e.msi.data;
> > > > > + entry[3] = test_bit(adev->guest_msix_entries[idx].entry,
> > > > > + adev->msix_mask_bitmap);
> > > >
> > > > I suspect this is wrong for big endian platforms: PCI returns
> > > > all values in little endian format.
> > > >
> > > > > + memcpy(val, &entry[addr % PCI_MSIX_ENTRY_SIZE / 4
> > > >
> > > > 4 is sizeof *entry?
> > >
> > > 4 elements.
> > > In fact "PCI_MSIX_ENTRY_SIZE / sizeof u32".
> >
> > Are you sure? This does not make sense to me. address in bytes, we
> > take offset from start of entry in bytes, to figure which dword we need
> > we divide by size of dword. No?
>
> I've corrected this one.
>
> --
> regards
> Yang, Sheng
>
> > > > ], len);
> > > >
> > > > > +
> > > > > +out:
> > > > > + mutex_unlock(&adev->kvm->lock);
> > > > > + return r;
> > > > > +}
> > > > > +
> > > > > +static int msix_mmio_write(struct kvm_io_device *this, gpa_t addr,
> > > > > int len, + const void *val)
> > > > > +{
> > > > > + struct kvm_assigned_dev_kernel *adev =
> > > > > + container_of(this, struct kvm_assigned_dev_kernel,
> > > > > + msix_mmio_dev);
> > > > > + int idx, r = 0;
> > > > > + unsigned long new_val = *(unsigned long *)val;
> > > >
> > > > I suspect this is wrong for big endian platforms: PCI returns
> > > > all values in little endian format.
> > > >
> > > > > +
> > > > > + mutex_lock(&adev->kvm->lock);
> > > > > + if (!msix_mmio_in_range(adev, addr, len)) {
> > > > > + r = -EOPNOTSUPP;
> > > > > + goto out;
> > > > > + }
> > > > > + if ((addr & 0x3) || len != 4)
> > > > > + goto out;
> > > > > +
> > > > > + idx = msix_get_enabled_idx(adev, addr, len);
> > > > > + if (idx < 0) {
> > > > > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > > + if (((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > > + PCI_MSIX_ENTRY_VECTOR_CTRL)) {
> > > > > + if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > > + goto out;
> > > > > + if (new_val & PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > > + set_bit(idx, adev->msix_mask_bitmap);
> > > > > + else
> > > > > + clear_bit(idx, adev->msix_mask_bitmap);
> > > > > + } else
> > > > > + /* Userspace would handle other MMIO writing */
> > > > > + r = -EOPNOTSUPP;
> > > > > + goto out;
> > > > > + }
> > > > > + if (addr % PCI_MSIX_ENTRY_SIZE != PCI_MSIX_ENTRY_VECTOR_CTRL) {
> > > > > + r = -EOPNOTSUPP;
> > > > > + goto out;
> > > > > + }
> > > > > + if (new_val & ~PCI_MSIX_ENTRY_CTRL_MASKBIT)
> > > > > + goto out;
> > > > > + update_msix_mask(adev, idx, !!(new_val &
> > > > > PCI_MSIX_ENTRY_CTRL_MASKBIT)); +out:
> > > > > + mutex_unlock(&adev->kvm->lock);
> > > > > +
> > > > > + return r;
> > > > > +}
> > > > > +
> > > > > +static const struct kvm_io_device_ops msix_mmio_ops = {
> > > > > + .read = msix_mmio_read,
> > > > > + .write = msix_mmio_write,
> > > > > +};
> > > > > +
> > > > > +static int kvm_vm_ioctl_register_msix_mmio(struct kvm *kvm,
> > > > > + struct kvm_assigned_msix_mmio *msix_mmio)
> > > > > +{
> > > > > + int r = 0;
> > > > > + struct kvm_assigned_dev_kernel *adev;
> > > > > +
> > > > > + mutex_lock(&kvm->lock);
> > > > > + adev = kvm_find_assigned_dev(&kvm->arch.assigned_dev_head,
> > > > > + msix_mmio->assigned_dev_id);
> > > > > + if (!adev) {
> > > > > + r = -EINVAL;
> > > > > + goto out;
> > > > > + }
> > > > > + if (msix_mmio->base_addr == 0) {
> > > > > + r = -EINVAL;
> > > > > + goto out;
> > > > > + }
> > > > > +
> > > > > + mutex_lock(&kvm->slots_lock);
> > > > > + if (adev->msix_mmio_base == 0) {
> > > >
> > > > What does this do? Handles case we are already registered?
> > > > Also ! adev->msix_mmio_base
> > >
> > > We don't need to register again. Just update the address.
> > >
> > > Well, msix_mmio_base is an address. I think ==0 is more proper.
> >
> > Yes, there's a test above that validates for 0 and returns EINVAL.
> > But using a separate flag that userspace can not control
> > to mean 'already registered' will make the code clearer.
> >
> > > --
> > > regards
> > > Yang, Sheng
> > >
> > > > > + kvm_iodevice_init(&adev->msix_mmio_dev, &msix_mmio_ops);
> > > > > + r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS,
> > > > > + &adev->msix_mmio_dev);
> > > > > + if (r)
> > > > > + goto out2;
> > > > > + }
> > > > > +
> > > > > + adev->msix_mmio_base = msix_mmio->base_addr;
> > > > > +out2:
> > > > > + mutex_unlock(&kvm->slots_lock);
> > > > > +out:
> > > > > + mutex_unlock(&kvm->lock);
> > > > > +
> > > > > + return r;
> > > > > +}
> > > > >
> > > > > #endif
> > > > >
> > > > > long kvm_vm_ioctl_assigned_device(struct kvm *kvm, unsigned ioctl,
> > > > >
> > > > > @@ -812,6 +1065,36 @@ long kvm_vm_ioctl_assigned_device(struct kvm
> > > > > *kvm, unsigned ioctl,
> > > > >
> > > > > goto out;
> > > > >
> > > > > break;
> > > > >
> > > > > }
> > > > >
> > > > > + case KVM_ASSIGN_GET_MSIX_ENTRY: {
> > > > > + struct kvm_assigned_msix_entry entry;
> > > > > + r = -EFAULT;
> > > > > + if (copy_from_user(&entry, argp, sizeof entry))
> > > > > + goto out;
> > > > > + r = kvm_vm_ioctl_get_msix_entry(kvm, &entry);
> > > > > + if (r)
> > > > > + goto out;
> > > > > + r = -EFAULT;
> > > > > + if (copy_to_user(argp, &entry, sizeof entry))
> > > > > + goto out;
> > > > > + r = 0;
> > > > > + break;
> > > > > + }
> > > > > + case KVM_ASSIGN_REG_MSIX_MMIO: {
> > > > > + struct kvm_assigned_msix_mmio msix_mmio;
> > > > > +
> > > > > + r = -EFAULT;
> > > > > + if (copy_from_user(&msix_mmio, argp, sizeof(msix_mmio)))
> > > > > + goto out;
> > > > > +
> > > > > + r = -EINVAL;
> > > > > + if (msix_mmio.flags != 0 || msix_mmio.reserved != 0)
> > > > > + goto out;
> > > > > +
> > > > > + r = kvm_vm_ioctl_register_msix_mmio(kvm, &msix_mmio);
> > > > > + if (r)
> > > > > + goto out;
> > > > > + break;
> > > > > + }
> > > > >
> > > > > #endif
> > > > >
> > > > > }
> > > > >
> > > > > out:
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-05 10:53 ` Sheng Yang
2010-11-05 12:01 ` Sheng Yang
@ 2010-11-05 13:27 ` Michael S. Tsirkin
2010-11-08 3:18 ` Sheng Yang
2010-11-06 0:24 ` Marcelo Tosatti
2 siblings, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2010-11-05 13:27 UTC (permalink / raw)
To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote:
> On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
> > On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > > > +};
> > > > > +
> > > > > +This ioctl would enable in-kernel MSI-X emulation, which would
> > > > > handle MSI-X +mask bit in the kernel.
> > > >
> > > > What happens on repeated calls when it's already enabled?
> > > > How does one disable after it's enabled?
> > >
> > > Suppose this should only be called once. But again would update the MMIO
> > > base.
> >
> > So maybe add this all to documentation.
> >
> > > It
> > > would be disabled along with device deassignment.
> >
> > So what are you going to do when guest enables and later disables MSIX?
> > Disable assignment completely?
>
> This device goes with PCI resources allocation, rather than IRQ assignment.
I see. I guess we can live with it, but it seems tied to a specific
mode of use. It would be better to support enabling together with MSI-X too; this
requires an ability to disable.
> >
> > > We enable it explicitly because
> > > not all devices have MSI-X capability.
> > >
> > > > > +
> > > >
> > > > This is very specialized for assigned devices. I would like an
> > > > interface not tied to assigned devices explicitly
> > > > (even if the implementation at the moment only works
> > > > for assigned devices, I can patch that in later). E.g. vhost-net would
> > > > benefit from such an extension too. Maybe tie this to a special
> > > > GSI and this GSI can be an assigned device or not?
> > >
> > > You're talking about KVM_ASSIGN_GET_MSIX_ENTRY?
> >
> > Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO
> > and kvm_assigned_msix_mmio really.
> >
> > > We can't tie it to GSI, because we
> > > should also cover entries without GSI. The entry number should be fine
> > > for all kinds of devices using MSI-X.
> >
> > I don't completely understand: entries without GSI
> > never need to inject an interrupt. Why do we care
> > about them? Let's pass such accesses to userspace.
> >
> > In any case, I think MSIX GSIs are inexpensive (we never search them
> > all), so we could create GSIs for all entries if we wanted to.
>
> All mask bit accesses are controlled by the kernel, in order to keep things consistent.
Well I think passing whatever we can't handle to userspace makes
complete sense. We do this for address/data writes after all.
If we wanted to do it *all* in kernel we would handle
address/data there too. I think Avi suggested this at some point.
The main point is msix mmio BAR handling is completely unrelated
to the assigned device. Emulated devices have an msix BAR as well.
So tying it to an assigned device is suboptimal; we'll need to
grow yet another interface for these.
> >
> > > I am not sure about what PV one would looks like.
> > > I use kvm_assigned_msix_entry
> > > just because it can be easily reused.
> >
> > Not only PV, emulated devices too. OK let's try to figure out:
> >
> > What happens with userspace is, we inject interrupts either through
> > ioctls or eventfds. Avoiding an exit to userspace on MSIX
> > access is beneficial in exactly the same way.
> >
> > We could have an ioctl to pass in the table addresses, and mapping
> > table to relevant GSIs. For example, add kvm_msix_mmio with table
> > start/end, also specify which GSIs are covered.
> > Maybe ask that userspace allocate all GSIs for a table consecutively?
> > Or add kvm_irq_routing_msix which is same as msi but
> > also has the mmio addresses? Bonus points for avoiding the need
> > to scan all table on each write. For example, don't scan at all,
> > instead just set a bit in KVM, and set irq entry on the next
> > interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32
> > byte bitmap would be enough for this, avoiding a linear scan.
> >
> > When we pass in kvm_msix_mmio, kvm would start registering masked state.
> >
> > > > > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> > > > > +
> > > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > > +Architectures: x86
> > > > > +Type: vm ioctl
> > > > > +Parameters: struct kvm_assigned_msix_entry (in and out)
> > > > > +Returns: 0 on success, !0 on error
> > > > > +
> > > > > +struct kvm_assigned_msix_entry {
> > > > > + /* Assigned device's ID */
> > > > > + __u32 assigned_dev_id;
> > > > > + /* Ignored */
> > > > > + __u32 gsi;
> > > > > + /* The index of entry in the MSI-X table */
> > > > > + __u16 entry;
> > > > > + /* Querying flags and returning status */
> > > > > + __u16 flags;
> > > > > + /* Must be 0 */
> > > > > + __u16 padding[2];
> > > > > +};
> > > > > +
> > > > > +This ioctl would allow userspace to get the status of one specific
> > > > > MSI-X +entry. Currently we support mask bit status querying.
> > > > > +
> > > > >
> > > > > 5. The kvm_run structure
> > > > >
> > > > > Application code obtains a pointer to the kvm_run structure by
> > > > >
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index f3f86b2..8fd5121 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > >
> > > > > case KVM_CAP_DEBUGREGS:
> > > > > case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > >
> > > > > case KVM_CAP_XSAVE:
> > > > > + case KVM_CAP_DEVICE_MSIX_MASK:
> > > > > r = 1;
> > > > > break;
> > > > >
> > > > > case KVM_CAP_COALESCED_MMIO:
> > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > index 919ae53..eafafb1 100644
> > > > > --- a/include/linux/kvm.h
> > > > > +++ b/include/linux/kvm.h
> > > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > > >
> > > > > #endif
> > > > > #define KVM_CAP_PPC_GET_PVINFO 57
> > > > > #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > >
> > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > > > +#endif
> > > >
> > > > We seem to have these HAVE macros all over.
> > > > Avi, what's the idea? Let's drop them?
> > >
> > > Um? This kind of macro is used here to avoid vendor-specific strings in
> > > the header file.
> >
> > Am I the only one that dislikes the ifdefs we have all over this code then?
> > Rest of linux tries to keep ifdefs local by stubbing functionality out.
> >
> > > > > + BUG_ON(adev->msix_mmio_base == 0);
> > > >
> > > > Below I see
> > > >
> > > > adev->msix_mmio_base = msix_mmio->base_addr;
> > > >
> > > > which comes from userspace. BUG_ON is an unfriendly way to
> > > > tell user about a bug in qemu.
> > > >
> > > > Anyway, I don't think we should special-case 0 gpa.
> > > > It's up to the user where to base the table.
> > > > Use a separate flag to signal that table was initialized.
> > >
> > > Base_addr can't be 0 in any case. Only the unimplemented BAR are
> > > hardwired to zero.
> >
> > No, the PCI spec does not say so. Check it out: the unimplemented BAR
> > is hardwired to zero but implemented can be set to 0 as well.
>
> OK, OK, OK, I would add one more flag.
> >
> > > Also if we get here with base_addr = 0, it is surely a bug.
> > >
> > > > > +
> > > > > + return -EINVAL;
> > > > > +}
> > > > > +
> > > > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t addr,
> > > > > int len, + void *val)
> > > > > +{
> > > > > + struct kvm_assigned_dev_kernel *adev =
> > > > > + container_of(this, struct kvm_assigned_dev_kernel,
> > > > > + msix_mmio_dev);
> > > > > + int idx, r = 0;
> > > > > + u32 entry[4];
> > > > > + struct kvm_kernel_irq_routing_entry e;
> > > > > +
> > > > > + mutex_lock(&adev->kvm->lock);
> > > > > + if (!msix_mmio_in_range(adev, addr, len)) {
> > > > > + r = -EOPNOTSUPP;
> > > > > + goto out;
> > > > > + }
> > > > > + if ((addr & 0x3) || len != 4)
> > > > > + goto out;
> > > > > +
> > > > > + idx = msix_get_enabled_idx(adev, addr, len);
> > > > > + if (idx < 0) {
> > > > > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > > + if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > > + PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > > > + *(unsigned long *)val =
> > > > > + test_bit(idx, adev->msix_mask_bitmap) ?
> > > > > + PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > > >
> > > > I suspect this is wrong for big endian platforms: PCI returns
> > > > all values in little endian format.
> > >
> > > Let big endian platforms take care of it later... Only x86 has
> > > __KVM_HAVE_MSIX now.
> >
> > Well it's easier to make the code correct when you write it than later.
>
> Get back to me if someone wants to implement MSI device assignment on a big-endian
> machine.
I think it's more elegant to make the code correct the first time around.
What are we arguing about? A couple of cpu_to_le macros?
> > > > > + else
> > > > > + r = -EOPNOTSUPP;
> > > > > + goto out;
> > > > > + }
> > > > > +
> > > > > + r = kvm_get_irq_routing_entry(adev->kvm,
> > > > > + adev->guest_msix_entries[idx].vector, &e);
> > > > > + if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > > > + printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > > > > + "idx %d, addr 0x%llx, len %d\n", idx, addr, len);
> > > >
> > > > Can a malicious userspace trigger this intentionally?
> > >
> > > No.
> >
> > Here's a scenario: bind an MSIX routing entry to device, then later
> > change its type to regular interrupt. Fills the log with warnings,
> > doesn't it?
>
> Is QEmu supposed to discard the old routing?
It does not matter. We can't trust userspace to do the right thing.
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-05 12:01 ` Sheng Yang
@ 2010-11-05 13:29 ` Michael S. Tsirkin
0 siblings, 0 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2010-11-05 13:29 UTC (permalink / raw)
To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Fri, Nov 05, 2010 at 08:01:53PM +0800, Sheng Yang wrote:
> > Go back to me if someone want to implement MSI device assignment on
> > big-endian machine.
>
> Sorry, I just realized it's very likely that I won't have a big endian machine to
> test it even at that time...
>
> I think it really should be done by someone who would use and test it.
I'm not asking you do to that. Just tag endian-ness explicitly to
make it easier for whoever will do it.
--
MST
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-05 10:53 ` Sheng Yang
2010-11-05 12:01 ` Sheng Yang
2010-11-05 13:27 ` Michael S. Tsirkin
@ 2010-11-06 0:24 ` Marcelo Tosatti
2010-11-08 5:41 ` Sheng Yang
2 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2010-11-06 0:24 UTC (permalink / raw)
To: Sheng Yang; +Cc: Michael S. Tsirkin, Avi Kivity, kvm
On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote:
> On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
> > On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > > > +};
> > > > > +
> > > > > +This ioctl would enable in-kernel MSI-X emulation, which would
> > > > > handle MSI-X +mask bit in the kernel.
> > > >
> > > > What happens on repeated calls when it's already enabled?
> > > > How does one disable after it's enabled?
> > >
> > > Suppose this should only be called once. But again would update the MMIO
> > > base.
> >
> > So maybe add this all to documentation.
> >
> > > It
> > > would be disabled along with device deassignment.
> >
> > So what are you going to do when guest enables and later disables MSIX?
> > Disable assignment completely?
>
> This device goes with PCI resources allocation, rather than IRQ assignment.
What about hot plug, for example? I can't see how this can function
without unregistering regions.
* Re: [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel)
2010-11-04 6:15 [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel) Sheng Yang
` (4 preceding siblings ...)
2010-11-04 6:15 ` [PATCH 5/5] KVM: assigned dev: MSI-X mask support Sheng Yang
@ 2010-11-06 0:27 ` Marcelo Tosatti
2010-11-08 0:51 ` Sheng Yang
5 siblings, 1 reply; 29+ messages in thread
From: Marcelo Tosatti @ 2010-11-06 0:27 UTC (permalink / raw)
To: Sheng Yang; +Cc: Avi Kivity, Michael S. Tsirkin, kvm
On Thu, Nov 04, 2010 at 02:15:16PM +0800, Sheng Yang wrote:
> Here is the latest series of MSI-X mask support patches.
>
> The biggest change from the last version is that, in order to reduce
> complexity, I moved all mask bit operations into the kernel, including those
> for disabled entries. This addressed two concerns:
> 1. KVM and QEmu each owned a part of the mask bit operation.
> 2. QEmu needed to access the real hardware to get the mask bit information.
>
> So now QEmu has to use the kernel API to get mask bit information. Though this
> is slower than directly accessing the real hardware's MMIO, direct access is
> unacceptable because the host OS in fact owns the device. The host OS can
> access the device without notifying the guest (and doesn't need to do so).
> Userspace shouldn't bypass the host OS layer to operate the real hardware
> directly; otherwise it would confuse the guest.
>
> I also discarded the userspace mask operation, after showing the performance
> numbers.
>
> This version also removed the capability enabling mechanism. Because we use
> struct kvm_assigned_msix_entry with a new IOCTL, there is no compatibility
> issue.
>
> Please review. I will run more tests with the current patch. So far so good.
It would be good to know where the performance issue is with the entire
implementation of mask bit in QEMU.
http://www.mail-archive.com/kvm@vger.kernel.org/msg42652.html.
All you mentioned in the past was that "there was high CPU utilization when
running in QEMU", and you decided to start implementing it in the kernel. But
AFAICS you did not really look into where the problem was...
* Re: [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel)
2010-11-06 0:27 ` [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel) Marcelo Tosatti
@ 2010-11-08 0:51 ` Sheng Yang
0 siblings, 0 replies; 29+ messages in thread
From: Sheng Yang @ 2010-11-08 0:51 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Avi Kivity, Michael S. Tsirkin, kvm
On Saturday 06 November 2010 08:27:15 Marcelo Tosatti wrote:
> On Thu, Nov 04, 2010 at 02:15:16PM +0800, Sheng Yang wrote:
> > Here is the latest series of MSI-X mask support patches.
> >
> > The biggest change from the last version is that, in order to reduce
> > complexity, I moved all mask bit operations into the kernel, including those
> > for disabled entries. This addressed two concerns:
> > 1. KVM and QEmu each owned a part of the mask bit operation.
> > 2. QEmu needed to access the real hardware to get the mask bit information.
> >
> > So now QEmu has to use the kernel API to get mask bit information. Though
> > this is slower than directly accessing the real hardware's MMIO, direct
> > access is unacceptable because the host OS in fact owns the device. The
> > host OS can access the device without notifying the guest (and doesn't
> > need to do so). Userspace shouldn't bypass the host OS layer to operate
> > the real hardware directly; otherwise it would confuse the guest.
> >
> > I also discarded the userspace mask operation, after showing the
> > performance numbers.
> >
> > This version also removed the capability enabling mechanism. Because we
> > use struct kvm_assigned_msix_entry with a new IOCTL, there is no
> > compatibility issue.
> >
> > Please review. I will run more tests with the current patch. So far so
> > good.
>
> It would be good to know where the performance issue is with the entire
> implementation of mask bit in QEMU.
>
> http://www.mail-archive.com/kvm@vger.kernel.org/msg42652.html.
>
> All you mentioned in the past was that "there was high CPU utilization when
> running in QEMU", and you decided to start implementing it in the kernel. But
> AFAICS you did not really look into where the problem was...
We analysed the same issue in Xen and believed it was caused by exiting to QEmu
every time the guest wants to access the mask bit (and some guest kernels do
this very frequently). So we provided the patches to Xen, and they worked very
well.
The cost of exiting to userspace in KVM is much smaller than on the Xen side,
but it is still much slower than staying in the kernel. I've shown the
performance data for the kernel and userspace implementations here:
http://ns.spinics.net/lists/kvm/msg43712.html
We can easily get a 20% performance gain with the kernel implementation.
--
regards
Yang, Sheng
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-05 13:27 ` Michael S. Tsirkin
@ 2010-11-08 3:18 ` Sheng Yang
2010-11-08 6:42 ` Michael S. Tsirkin
0 siblings, 1 reply; 29+ messages in thread
From: Sheng Yang @ 2010-11-08 3:18 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Friday 05 November 2010 21:27:42 Michael S. Tsirkin wrote:
> On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote:
> > On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
> > > On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > > > > +};
> > > > > > +
> > > > > > +This ioctl would enable in-kernel MSI-X emulation, which would
> > > > > > handle MSI-X +mask bit in the kernel.
> > > > >
> > > > > What happens on repeated calls when it's already enabled?
> > > > > How does one disable after it's enabled?
> > > >
> > > > Suppose this should only be called once. But again would update the
> > > > MMIO base.
> > >
> > > So maybe add this all to documentation.
> > >
> > > > It
> > > > would be disabled along with device deassignment.
> > >
> > > So what are you going to do when guest enables and later disables MSIX?
> > > Disable assignment completely?
> >
> > This device goes with PCI resources allocation, rather than IRQ
> > assignment.
>
> I see. I guess we can live with it, but it seems tied to a specific
> mode of use. Would be better to support enabling together with msix too,
> this requires an ability to disable.
>
> > > > We enable it explicitly because
> > > > not all devices have MSI-X capability.
> > > >
> > > > > > +
> > > > >
> > > > > This is very specialized for assigned devices. I would like an
> > > > > interface not tied to assigned devices explicitly
> > > > > (even if the implementation at the moment only works
> > > > > for assigned devices, I can patch that in later). E.g. vhost-net
> > > > > would benefit from such an extension too. Maybe tie this to a
> > > > > special GSI and this GSI can be an assigned device or not?
> > > >
> > > > You're talking about KVM_ASSIGN_GET_MSIX_ENTRY?
> > >
> > > Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO
> > > and kvm_assigned_msix_mmio really.
> > >
> > > > We can't tie it to GSI, because we
> > > > should also cover entries without GSI. The entry number should be
> > > > fine for all kinds of devices using MSI-X.
> > >
> > > I don't completely understand: entries without GSI
> > > never need to inject an interrupt. Why do we care
> > > about them? Let's pass such accesses to userspace.
> > >
> > > In any case, I think MSIX GSIs are inexpensive (we never search them
> > > all), so we could create GSIs for all entries if we wanted to.
> >
> > All mask bit accesses are controlled by the kernel, in order to keep things consistent.
>
> Well I think passing whatever we can't handle to userspace makes
> complete sense. We do this for address/data writes after all.
> If we wanted to do it *all* in kernel we would handle
> address/data there too. I think Avi suggested this at some point.
I think Avi's point was that handling the mask bit half in the kernel and half
in userspace makes the API complex. I agree, so I moved all mask bit handling
into the kernel.
Data/address is another issue, because QEmu owns the routing table. We can
change that in the future, but it is not related to this patch.
>
> The main point is msix mmio BAR handling is completely unrelated
> to the assigned device. Emulated devices have an msix BAR as well.
> So tying it to an assigned devices is suboptimal, we'll need to
> grow yet another interface for these.
If you only mean the name, we can change that. But every MSI-X entry has an
entry number, regardless of whether it has a GSI number, so keeping the entry
number as the parameter makes more sense to me.
And still, the patch's only current user is the assigned device. I am glad if it
can cover other future cases along the way, but that's not the primary target of
this patch.
>
> > > > I am not sure about what PV one would looks like.
> > > > I use kvm_assigned_msix_entry
> > > > just because it can be easily reused.
> > >
> > > Not only PV, emulated devices too. OK let's try to figure out:
> > >
> > > What happens with userspace is, we inject interrupts either through
> > > ioctls or eventfds. Avoiding an exit to userspace on MSIX
> > > access is beneficial in exactly the same way.
> > >
> > > We could have an ioctl to pass in the table addresses, and mapping
> > > table to relevant GSIs. For example, add kvm_msix_mmio with table
> > > start/end, also specify which GSIs are covered.
> > > Maybe ask that userspace allocate all GSIs for a table consecutively?
> > > Or add kvm_irq_routing_msix which is same as msi but
> > > also has the mmio addresses? Bonus points for avoiding the need
> > > to scan all table on each write. For example, don't scan at all,
> > > instead just set a bit in KVM, and set irq entry on the next
> > > interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32
> > > byte bitmap would be enough for this, avoiding a linear scan.
> > >
> > > When we pass in kvm_msix_mmio, kvm would start registering masked
> > > state.
> > >
> > > > > > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> > > > > > +
> > > > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > > > +Architectures: x86
> > > > > > +Type: vm ioctl
> > > > > > +Parameters: struct kvm_assigned_msix_entry (in and out)
> > > > > > +Returns: 0 on success, !0 on error
> > > > > > +
> > > > > > +struct kvm_assigned_msix_entry {
> > > > > > + /* Assigned device's ID */
> > > > > > + __u32 assigned_dev_id;
> > > > > > + /* Ignored */
> > > > > > + __u32 gsi;
> > > > > > + /* The index of entry in the MSI-X table */
> > > > > > + __u16 entry;
> > > > > > + /* Querying flags and returning status */
> > > > > > + __u16 flags;
> > > > > > + /* Must be 0 */
> > > > > > + __u16 padding[2];
> > > > > > +};
> > > > > > +
> > > > > > +This ioctl would allow userspace to get the status of one
> > > > > > specific MSI-X +entry. Currently we support mask bit status
> > > > > > querying. +
> > > > > >
> > > > > > 5. The kvm_run structure
> > > > > >
> > > > > > Application code obtains a pointer to the kvm_run structure by
> > > > > >
> > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > index f3f86b2..8fd5121 100644
> > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > > >
> > > > > > case KVM_CAP_DEBUGREGS:
> > > > > > case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > > >
> > > > > > case KVM_CAP_XSAVE:
> > > > > > + case KVM_CAP_DEVICE_MSIX_MASK:
> > > > > > r = 1;
> > > > > > break;
> > > > > >
> > > > > > case KVM_CAP_COALESCED_MMIO:
> > > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > > index 919ae53..eafafb1 100644
> > > > > > --- a/include/linux/kvm.h
> > > > > > +++ b/include/linux/kvm.h
> > > > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > > > >
> > > > > > #endif
> > > > > > #define KVM_CAP_PPC_GET_PVINFO 57
> > > > > > #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > > >
> > > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > > > > +#endif
> > > > >
> > > > > We seem to have these HAVE macros all over.
> > > > > Avi, what's the idea? Let's drop them?
> > > >
> > > > Um? This kind of macro is used here to avoid vendor-specific strings
> > > > in the header file.
> > >
> > > Am I the only one that dislikes the ifdefs we have all over this code
> > > then? Rest of linux tries to keep ifdefs local by stubbing
> > > functionality out.
> > >
> > > > > > + BUG_ON(adev->msix_mmio_base == 0);
> > > > >
> > > > > Below I see
> > > > >
> > > > > adev->msix_mmio_base = msix_mmio->base_addr;
> > > > >
> > > > > which comes from userspace. BUG_ON is an unfriendly way to
> > > > > tell user about a bug in qemu.
> > > > >
> > > > > Anyway, I don't think we should special-case 0 gpa.
> > > > > It's up to the user where to base the table.
> > > > > Use a separate flag to signal that table was initialized.
> > > >
> > > > Base_addr can't be 0 in any case. Only the unimplemented BAR are
> > > > hardwired to zero.
> > >
> > > No, the PCI spec does not say so. Check it out: the unimplemented BAR
> > > is hardwired to zero but implemented can be set to 0 as well.
> >
> > OK, OK, OK, I would add one more flag.
> >
> > > > Also if we get here with base_addr = 0, it is surely a bug.
> > > >
> > > > > > +
> > > > > > + return -EINVAL;
> > > > > > +}
> > > > > > +
> > > > > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t
> > > > > > addr, int len, + void *val)
> > > > > > +{
> > > > > > + struct kvm_assigned_dev_kernel *adev =
> > > > > > + container_of(this, struct kvm_assigned_dev_kernel,
> > > > > > + msix_mmio_dev);
> > > > > > + int idx, r = 0;
> > > > > > + u32 entry[4];
> > > > > > + struct kvm_kernel_irq_routing_entry e;
> > > > > > +
> > > > > > + mutex_lock(&adev->kvm->lock);
> > > > > > + if (!msix_mmio_in_range(adev, addr, len)) {
> > > > > > + r = -EOPNOTSUPP;
> > > > > > + goto out;
> > > > > > + }
> > > > > > + if ((addr & 0x3) || len != 4)
> > > > > > + goto out;
> > > > > > +
> > > > > > + idx = msix_get_enabled_idx(adev, addr, len);
> > > > > > + if (idx < 0) {
> > > > > > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > > > + if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > > > + PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > > > > + *(unsigned long *)val =
> > > > > > + test_bit(idx, adev->msix_mask_bitmap) ?
> > > > > > + PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > > > >
> > > > > I suspect this is wrong for big endian platforms: PCI returns
> > > > > all values in little endian format.
> > > >
> > > > Let big endian platform take care of it later... There is only x86
> > > > has __KVM_HAVE_MSIX now.
> > >
> > > Well it's easier to make the code correct when you write it than later.
> >
> > Come back to me if someone wants to implement MSI device assignment on a
> > big-endian machine.
>
> I think it's more elegant to make the code correct the first time around.
> What are we arguing about? A couple of cpu_to_le macros?
Sorry, I am really not familiar with big-endian machines and the related macros,
and I also lack a way to test it. I will give it a try.
>
> > > > > > + else
> > > > > > + r = -EOPNOTSUPP;
> > > > > > + goto out;
> > > > > > + }
> > > > > > +
> > > > > > + r = kvm_get_irq_routing_entry(adev->kvm,
> > > > > > + adev->guest_msix_entries[idx].vector, &e);
> > > > > > + if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > > > > + printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > > > > > + "idx %d, addr 0x%llx, len %d\n", idx, addr, len);
> > > > >
> > > > > Can a malicious userspace trigger this intentionally?
> > > >
> > > > No.
> > >
> > > Here's a scenario: bind an MSIX routing entry to device, then later
> > > change its type to regular interrupt. Fills the log with warnings,
> > > doesn't it?
> >
> > Is QEmu supposed to discard the old routing?
>
> It does not matter. We can't trust userspace to do the right thing.
OK, let's eliminate this.
--
regards
Yang, Sheng
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-06 0:24 ` Marcelo Tosatti
@ 2010-11-08 5:41 ` Sheng Yang
2010-11-09 14:08 ` Marcelo Tosatti
0 siblings, 1 reply; 29+ messages in thread
From: Sheng Yang @ 2010-11-08 5:41 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Michael S. Tsirkin, Avi Kivity, kvm
On Saturday 06 November 2010 08:24:51 Marcelo Tosatti wrote:
> On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote:
> > On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
> > > On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > > > > +};
> > > > > > +
> > > > > > +This ioctl would enable in-kernel MSI-X emulation, which would
> > > > > > handle MSI-X +mask bit in the kernel.
> > > > >
> > > > > What happens on repeated calls when it's already enabled?
> > > > > How does one disable after it's enabled?
> > > >
> > > > Suppose this should only be called once. But again would update the
> > > > MMIO base.
> > >
> > > So maybe add this all to documentation.
> > >
> > > > It
> > > > would be disabled along with device deassignment.
> > >
> > > So what are you going to do when guest enables and later disables MSIX?
> > > Disable assignment completely?
> >
> > This device goes with PCI resources allocation, rather than IRQ
> > assignment.
>
> What about hot plug, for example? I can't see how this can function
> without unregistering regions.
The region is registered at PCI resource allocation time, and unregistered when
the device is deassigned.
I suppose the hot plug path would also notify the kernel to deassign the device?
--
regards
Yang, Sheng
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-08 3:18 ` Sheng Yang
@ 2010-11-08 6:42 ` Michael S. Tsirkin
0 siblings, 0 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2010-11-08 6:42 UTC (permalink / raw)
To: Sheng Yang; +Cc: Avi Kivity, Marcelo Tosatti, kvm
On Mon, Nov 08, 2010 at 11:18:34AM +0800, Sheng Yang wrote:
> On Friday 05 November 2010 21:27:42 Michael S. Tsirkin wrote:
> > On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote:
> > > On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
> > > > On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > > > > > +};
> > > > > > > +
> > > > > > > +This ioctl would enable in-kernel MSI-X emulation, which would
> > > > > > > handle MSI-X +mask bit in the kernel.
> > > > > >
> > > > > > What happens on repeated calls when it's already enabled?
> > > > > > How does one disable after it's enabled?
> > > > >
> > > > > Suppose this should only be called once. But again would update the
> > > > > MMIO base.
> > > >
> > > > So maybe add this all to documentation.
> > > >
> > > > > It
> > > > > would be disabled along with device deassignment.
> > > >
> > > > So what are you going to do when guest enables and later disables MSIX?
> > > > Disable assignment completely?
> > >
> > > This device goes with PCI resources allocation, rather than IRQ
> > > assignment.
> >
> > I see. I guess we can live with it, but it seems tied to a specific
> > mode of use. Would be better to support enabling together with msix too,
> > this requires an ability to disable.
> >
> > > > > We enable it explicitly because
> > > > > not all devices have MSI-X capability.
> > > > >
> > > > > > > +
> > > > > >
> > > > > > This is very specialized for assigned devices. I would like an
> > > > > > interface not tied to assigned devices explicitly
> > > > > > (even if the implementation at the moment only works
> > > > > > for assigned devices, I can patch that in later). E.g. vhost-net
> > > > > > would benefit from such an extension too. Maybe tie this to a
> > > > > > special GSI and this GSI can be an assigned device or not?
> > > > >
> > > > > You're talking about KVM_ASSIGN_GET_MSIX_ENTRY?
> > > >
> > > > Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO
> > > > and kvm_assigned_msix_mmio really.
> > > >
> > > > > We can't tie it to GSI, because we
> > > > > should also cover entries without GSI. The entry number should be
> > > > > fine for all kinds of devices using MSI-X.
> > > >
> > > > I don't completely understand: entries without GSI
> > > > never need to inject an interrupt. Why do we care
> > > > about them? Let's pass such accesses to userspace.
> > > >
> > > > In any case, I think MSIX GSIs are inexpensive (we never search them
> > > > all), so we could create GSIs for all entries if we wanted to.
> > >
> > > All mask bit accesses are controlled by the kernel, in order to keep things consistent.
> >
> > Well I think passing whatever we can't handle to userspace makes
> > complete sense. We do this for address/data writes after all.
> > If we wanted to do it *all* in kernel we would handle
> > address/data there too. I think Avi suggested this at some point.
>
> I think Avi's point was that handling the mask bit half in the kernel and half
> in userspace makes the API complex. I agree, so I moved all mask bit handling
> into the kernel.
>
> Data/address is another issue, because QEmu owns the routing table. We can
> change that in the future, but it is not related to this patch.
> >
> > The main point is msix mmio BAR handling is completely unrelated
> > to the assigned device. Emulated devices have an msix BAR as well.
> > So tying it to an assigned devices is suboptimal, we'll need to
> > grow yet another interface for these.
>
> If you only mean the name, we can change that.
I mean it shouldn't use the assigned device id. Use some other index:
I suggested adding a new GSI type for this, but it can be something else
if you prefer.
> But every MSI-X entry
> has an entry number, regardless of whether it has a GSI number, so keeping the
> entry number as the parameter makes more sense to me.
>
> And still, the patch's only current user is the assigned device. I am glad if
> it can cover other future cases along the way, but that's not the primary
> target of this patch.
It's a problem I think. Stop thinking about the patch for a moment
and think about kvm as a whole. We know we will want such an interface for
emulated and PV devices. What then? Add another interface and a bunch of
code to support it? Point being, since userspace interfaces need to be
supported forever, they should be generic if possible.
> >
> > > > > I am not sure about what PV one would looks like.
> > > > > I use kvm_assigned_msix_entry
> > > > > just because it can be easily reused.
> > > >
> > > > Not only PV, emulated devices too. OK let's try to figure out:
> > > >
> > > > What happens with userspace is, we inject interrupts either through
> > > > ioctls or eventfds. Avoiding an exit to userspace on MSIX
> > > > access is beneficial in exactly the same way.
> > > >
> > > > We could have an ioctl to pass in the table addresses, and mapping
> > > > table to relevant GSIs. For example, add kvm_msix_mmio with table
> > > > start/end, also specify which GSIs are covered.
> > > > Maybe ask that userspace allocate all GSIs for a table consecutively?
> > > > Or add kvm_irq_routing_msix which is same as msi but
> > > > also has the mmio addresses? Bonus points for avoiding the need
> > > > to scan all table on each write. For example, don't scan at all,
> > > > instead just set a bit in KVM, and set irq entry on the next
> > > > interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32
> > > > byte bitmap would be enough for this, avoiding a linear scan.
> > > >
> > > > When we pass in kvm_msix_mmio, kvm would start registering masked
> > > > state.
> > > >
> > > > > > > +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
> > > > > > > +
> > > > > > > +Capability: KVM_CAP_DEVICE_MSIX_MASK
> > > > > > > +Architectures: x86
> > > > > > > +Type: vm ioctl
> > > > > > > +Parameters: struct kvm_assigned_msix_entry (in and out)
> > > > > > > +Returns: 0 on success, !0 on error
> > > > > > > +
> > > > > > > +struct kvm_assigned_msix_entry {
> > > > > > > + /* Assigned device's ID */
> > > > > > > + __u32 assigned_dev_id;
> > > > > > > + /* Ignored */
> > > > > > > + __u32 gsi;
> > > > > > > + /* The index of entry in the MSI-X table */
> > > > > > > + __u16 entry;
> > > > > > > + /* Querying flags and returning status */
> > > > > > > + __u16 flags;
> > > > > > > + /* Must be 0 */
> > > > > > > + __u16 padding[2];
> > > > > > > +};
> > > > > > > +
> > > > > > > +This ioctl would allow userspace to get the status of one
> > > > > > > specific MSI-X +entry. Currently we support mask bit status
> > > > > > > querying. +
> > > > > > >
> > > > > > > 5. The kvm_run structure
> > > > > > >
> > > > > > > Application code obtains a pointer to the kvm_run structure by
> > > > > > >
> > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > > index f3f86b2..8fd5121 100644
> > > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > > @@ -1926,6 +1926,7 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > > > > >
> > > > > > > case KVM_CAP_DEBUGREGS:
> > > > > > > case KVM_CAP_X86_ROBUST_SINGLESTEP:
> > > > > > >
> > > > > > > case KVM_CAP_XSAVE:
> > > > > > > + case KVM_CAP_DEVICE_MSIX_MASK:
> > > > > > > r = 1;
> > > > > > > break;
> > > > > > >
> > > > > > > case KVM_CAP_COALESCED_MMIO:
> > > > > > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > > > > > index 919ae53..eafafb1 100644
> > > > > > > --- a/include/linux/kvm.h
> > > > > > > +++ b/include/linux/kvm.h
> > > > > > > @@ -540,6 +540,9 @@ struct kvm_ppc_pvinfo {
> > > > > > >
> > > > > > > #endif
> > > > > > > #define KVM_CAP_PPC_GET_PVINFO 57
> > > > > > > #define KVM_CAP_PPC_IRQ_LEVEL 58
> > > > > > >
> > > > > > > +#ifdef __KVM_HAVE_MSIX
> > > > > > > +#define KVM_CAP_DEVICE_MSIX_MASK 59
> > > > > > > +#endif
> > > > > >
> > > > > > We seem to have these HAVE macros all over.
> > > > > > Avi, what's the idea? Let's drop them?
> > > > >
> > > > > Um? This kind of macro is used here to avoid vendor-specific strings
> > > > > in the header file.
> > > >
> > > > Am I the only one that dislikes the ifdefs we have all over this code
> > > > then? Rest of linux tries to keep ifdefs local by stubbing
> > > > functionality out.
> > > >
> > > > > > > + BUG_ON(adev->msix_mmio_base == 0);
> > > > > >
> > > > > > Below I see
> > > > > >
> > > > > > adev->msix_mmio_base = msix_mmio->base_addr;
> > > > > >
> > > > > > which comes from userspace. BUG_ON is an unfriendly way to
> > > > > > tell user about a bug in qemu.
> > > > > >
> > > > > > Anyway, I don't think we should special-case 0 gpa.
> > > > > > It's up to the user where to base the table.
> > > > > > Use a separate flag to signal that table was initialized.
> > > > >
> > > > > Base_addr can't be 0 in any case. Only the unimplemented BAR are
> > > > > hardwired to zero.
> > > >
> > > > No, the PCI spec does not say so. Check it out: the unimplemented BAR
> > > > is hardwired to zero but implemented can be set to 0 as well.
> > >
> > > OK, OK, OK, I would add one more flag.
> > >
> > > > > Also if we get here with base_addr = 0, it is surely a bug.
> > > > >
> > > > > > > +
> > > > > > > + return -EINVAL;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int msix_mmio_read(struct kvm_io_device *this, gpa_t
> > > > > > > addr, int len, + void *val)
> > > > > > > +{
> > > > > > > + struct kvm_assigned_dev_kernel *adev =
> > > > > > > + container_of(this, struct kvm_assigned_dev_kernel,
> > > > > > > + msix_mmio_dev);
> > > > > > > + int idx, r = 0;
> > > > > > > + u32 entry[4];
> > > > > > > + struct kvm_kernel_irq_routing_entry e;
> > > > > > > +
> > > > > > > + mutex_lock(&adev->kvm->lock);
> > > > > > > + if (!msix_mmio_in_range(adev, addr, len)) {
> > > > > > > + r = -EOPNOTSUPP;
> > > > > > > + goto out;
> > > > > > > + }
> > > > > > > + if ((addr & 0x3) || len != 4)
> > > > > > > + goto out;
> > > > > > > +
> > > > > > > + idx = msix_get_enabled_idx(adev, addr, len);
> > > > > > > + if (idx < 0) {
> > > > > > > + idx = (addr - adev->msix_mmio_base) / PCI_MSIX_ENTRY_SIZE;
> > > > > > > + if ((addr % PCI_MSIX_ENTRY_SIZE) ==
> > > > > > > + PCI_MSIX_ENTRY_VECTOR_CTRL)
> > > > > > > + *(unsigned long *)val =
> > > > > > > + test_bit(idx, adev->msix_mask_bitmap) ?
> > > > > > > + PCI_MSIX_ENTRY_CTRL_MASKBIT : 0;
> > > > > >
> > > > > > I suspect this is wrong for big endian platforms: PCI returns
> > > > > > all values in little endian format.
> > > > >
> > > > > Let big endian platform take care of it later... There is only x86
> > > > > has __KVM_HAVE_MSIX now.
> > > >
> > > > Well it's easier to make the code correct when you write it than later.
> > >
> > > Come back to me if someone wants to implement MSI device assignment on a
> > > big-endian machine.
> >
> > I think it's more elegant to make the code correct the first time around.
> > What are we arguing about? A couple of cpu_to_le macros?
>
> Sorry, I am really not familiar with big-endian machines and the related
> macros, and I also lack a way to test it. I will give it a try.
Actually, there is a way to do some limited checking even on an x86 PC:
just tag the entry appropriately:
__le32 entry[4];
(Hmm, isn't there some macro instead of 4?)
Then make sure sparse doesn't complain about the code (make C=1).
> >
> > > > > > > + else
> > > > > > > + r = -EOPNOTSUPP;
> > > > > > > + goto out;
> > > > > > > + }
> > > > > > > +
> > > > > > > + r = kvm_get_irq_routing_entry(adev->kvm,
> > > > > > > + adev->guest_msix_entries[idx].vector, &e);
> > > > > > > + if (r || e.type != KVM_IRQ_ROUTING_MSI) {
> > > > > > > + printk(KERN_WARNING "KVM: Wrong MSI-X routing entry! "
> > > > > > > + "idx %d, addr 0x%llx, len %d\n", idx, addr, len);
> > > > > >
> > > > > > Can a malicious userspace trigger this intentionally?
> > > > >
> > > > > No.
> > > >
> > > > Here's a scenario: bind an MSIX routing entry to device, then later
> > > > change its type to regular interrupt. Fills the log with warnings,
> > > > doesn't it?
> > >
> > > Is QEmu supposed to discard the old routing?
> >
> > It does not matter. We can't trust userspace to do the right thing.
>
> OK, let's eliminate this.
Yes, make it a debug message or something. Also,
since it's a read, we can pass it up to userspace,
which may make it easier to debug.
> --
> regards
> Yang, Sheng
* Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support
2010-11-08 5:41 ` Sheng Yang
@ 2010-11-09 14:08 ` Marcelo Tosatti
0 siblings, 0 replies; 29+ messages in thread
From: Marcelo Tosatti @ 2010-11-09 14:08 UTC (permalink / raw)
To: Sheng Yang; +Cc: Michael S. Tsirkin, Avi Kivity, kvm
On Mon, Nov 08, 2010 at 01:41:40PM +0800, Sheng Yang wrote:
> On Saturday 06 November 2010 08:24:51 Marcelo Tosatti wrote:
> > On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote:
> > > On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
> > > > On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
> > > > > > > +};
> > > > > > > +
> > > > > > > +This ioctl would enable in-kernel MSI-X emulation, which would
> > > > > > > handle MSI-X +mask bit in the kernel.
> > > > > >
> > > > > > What happens on repeated calls when it's already enabled?
> > > > > > How does one disable after it's enabled?
> > > > >
> > > > > Suppose this should only be called once. But again would update the
> > > > > MMIO base.
> > > >
> > > > So maybe add this all to documentation.
> > > >
> > > > > It
> > > > > would be disabled along with device deassignment.
> > > >
> > > > So what are you going to do when guest enables and later disables MSIX?
> > > > Disable assignment completely?
> > >
> > > This device goes with PCI resources allocation, rather than IRQ
> > > assignment.
> >
> > What about hot plug, for example? I can't see how this can function
> > without unregistering regions.
>
> The region is registered at PCI resource allocation time, and unregistered
> when the device is deassigned.
>
> I suppose the hot plug path would also notify the kernel to deassign the
> device?
Right.
end of thread, other threads:[~2010-11-09 14:44 UTC | newest]
Thread overview: 29+ messages
2010-11-04 6:15 [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel) Sheng Yang
2010-11-04 6:15 ` [PATCH 1/5] PCI: MSI: Move MSI-X entry definition to pci_regs.h Sheng Yang
2010-11-05 0:43 ` Hidetoshi Seto
2010-11-04 6:15 ` [PATCH 2/5] PCI: Add mask bit definition for MSI-X table Sheng Yang
2010-11-04 9:50 ` Michael S. Tsirkin
2010-11-05 2:48 ` Sheng Yang
2010-11-05 0:48 ` Hidetoshi Seto
2010-11-05 2:49 ` Sheng Yang
2010-11-04 6:15 ` [PATCH 3/5] KVM: Move struct kvm_io_device to kvm_host.h Sheng Yang
2010-11-04 6:15 ` [PATCH 4/5] KVM: Add kvm_get_irq_routing_entry() func Sheng Yang
2010-11-04 6:15 ` [PATCH 5/5] KVM: assigned dev: MSI-X mask support Sheng Yang
2010-11-04 10:43 ` Michael S. Tsirkin
2010-11-05 2:44 ` Sheng Yang
2010-11-05 4:20 ` Sheng Yang
2010-11-05 7:16 ` Sheng Yang
2010-11-05 8:51 ` Michael S. Tsirkin
2010-11-05 10:54 ` Sheng Yang
2010-11-05 8:43 ` Michael S. Tsirkin
2010-11-05 10:53 ` Sheng Yang
2010-11-05 12:01 ` Sheng Yang
2010-11-05 13:29 ` Michael S. Tsirkin
2010-11-05 13:27 ` Michael S. Tsirkin
2010-11-08 3:18 ` Sheng Yang
2010-11-08 6:42 ` Michael S. Tsirkin
2010-11-06 0:24 ` Marcelo Tosatti
2010-11-08 5:41 ` Sheng Yang
2010-11-09 14:08 ` Marcelo Tosatti
2010-11-06 0:27 ` [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel) Marcelo Tosatti
2010-11-08 0:51 ` Sheng Yang