qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390
@ 2014-09-04 10:52 frank.blaschka
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 1/6] s390: cio: chsc function to register GIB frank.blaschka
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: frank.blaschka @ 2014-09-04 10:52 UTC (permalink / raw)
  To: qemu-devel, linux-s390, kvm; +Cc: aik, pbonzini, agraf

This set of patches implements PCI pass-through support for qemu/KVM on s390.
PCI support on s390 is very different from other platforms.
Major differences are:

1) all PCI operations are driven by special s390 instructions
2) all s390 PCI instructions are privileged
3) PCI config and memory spaces cannot be mmap'ed
4) no classic interrupts (INTx, MSI). The PCI hardware understands the concept
   of requesting MSI-X IRQs, but IRQs are delivered as s390 adapter interrupts.
5) For DMA access an IOMMU is always required. The s390 PCI implementation
   does not support a complete memory-to-IOMMU mapping; DMA mappings are
   created on request.
6) The OS does not get any information about the physical layout
   of the PCI bus.
7) To take advantage of System z-specific virtualization features
   we need to access the SIE control block residing in kernel KVM.
8) To enable System z-specific virtualization features we have to manipulate
   the zPCI device in the kernel.

For these reasons I decided to implement a kernel-based approach similar
to x86 device assignment. There is a new qemu device (s390-pci) representing a
pass-through device on the host. Here is a sample qemu device configuration:

-device s390-pci,host=0000:00:00.0

The device executes the KVM_ASSIGN_PCI_DEVICE ioctl to create a proxy instance
in the kernel KVM and connects this instance to the host PCI device.

kernel patches apply to linux-kvm

s390: cio: chsc function to register GIB
s390: pci: export pci functions for pass-through usage
KVM: s390: Add GISA support
KVM: s390: Add PCI pass-through support

qemu patches apply to qemu-master

s390: Add PCI bus support
s390: Add PCI pass-through device support

Feedback and discussion are highly welcome ...
Thx!

Frank

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Qemu-devel] [RFC][patch 1/6] s390: cio: chsc function to register GIB
  2014-09-04 10:52 [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 frank.blaschka
@ 2014-09-04 10:52 ` frank.blaschka
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 2/6] s390: pci: export pci functions for pass-through usage frank.blaschka
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: frank.blaschka @ 2014-09-04 10:52 UTC (permalink / raw)
  To: qemu-devel, linux-s390, kvm; +Cc: aik, pbonzini, agraf

[-- Attachment #1: 002-s390_chsc_gib-3.16.patch --]
[-- Type: text/plain, Size: 1895 bytes --]

From: Frank Blaschka <frank.blaschka@de.ibm.com>

This patch provides a new chsc function to register/unregister
a GIB (Guest Information Block).

Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
---
 arch/s390/include/asm/cio.h |    1 
 drivers/s390/cio/chsc.c     |   50 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

--- a/arch/s390/include/asm/cio.h
+++ b/arch/s390/include/asm/cio.h
@@ -311,5 +311,6 @@ extern int cio_get_iplinfo(struct cio_ip
 /* Function from drivers/s390/cio/chsc.c */
 int chsc_sstpc(void *page, unsigned int op, u16 ctrl);
 int chsc_sstpi(void *page, void *result, size_t size);
+int chsc_sgib(u32 gibo);
 
 #endif
--- a/drivers/s390/cio/chsc.c
+++ b/drivers/s390/cio/chsc.c
@@ -1188,6 +1188,56 @@ out:
 EXPORT_SYMBOL_GPL(chsc_siosl);
 
 /**
+ * chsc_sgib() - register guest information block
+ * @gibo: origin of the guest information block
+ *
+ * The GIB must be allocated in low memory.
+ *
+ * Returns 0 on success.
+ */
+int chsc_sgib(u32 gibo)
+{
+	struct {
+		struct chsc_header request;
+		u16 operation_code;
+		u16 : 16;
+		u32 : 4;
+		u32 fmt : 4;
+		u32 : 24;
+		u32 : 32;
+		u32 : 32;
+		u32 gibo;
+		u64 : 64;
+		u32 : 16;
+		u32 aix : 8;
+		u32 : 8;
+		u32 reserved[1007];
+		struct chsc_header response;
+	} __packed *scssc;
+	unsigned long flags;
+	int rc;
+
+	spin_lock_irqsave(&chsc_page_lock, flags);
+	memset(chsc_page, 0, PAGE_SIZE);
+	scssc = chsc_page;
+
+	scssc->request.length = 0x0fe0;
+	scssc->request.code = 0x0021;
+	scssc->operation_code = 1;
+	scssc->gibo = gibo;
+
+	rc = chsc(scssc);
+	if (rc)
+		rc = -EIO;
+	else
+		rc = chsc_error_from_response(scssc->response.code);
+
+	spin_unlock_irqrestore(&chsc_page_lock, flags);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(chsc_sgib);
+
+/**
  * chsc_scm_info() - store SCM information (SSI)
  * @scm_area: request and response block for SSI
  * @token: continuation token


* [Qemu-devel] [RFC][patch 2/6] s390: pci: export pci functions for pass-through usage
  2014-09-04 10:52 [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 frank.blaschka
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 1/6] s390: cio: chsc function to register GIB frank.blaschka
@ 2014-09-04 10:52 ` frank.blaschka
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 3/6] KVM: s390: Add GISA support frank.blaschka
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: frank.blaschka @ 2014-09-04 10:52 UTC (permalink / raw)
  To: qemu-devel, linux-s390, kvm; +Cc: aik, pbonzini, agraf

[-- Attachment #1: 003-s390_pci_ppt-3.16.patch --]
[-- Type: text/plain, Size: 10408 bytes --]

From: Frank Blaschka <frank.blaschka@de.ibm.com>

This patch exports a couple of zPCI functions. The new PCI
pass-through driver for KVM will use these functions to enable the
device with virtualization information and to update the device's DMA
translation table on the host. We add a new interface to purge
the translation table of a device. We also move some zPCI functions
to the pci_insn header file.

Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
---
 arch/s390/include/asm/pci.h      |    6 ++
 arch/s390/include/asm/pci_clp.h  |    3 -
 arch/s390/include/asm/pci_insn.h |   92 ++++++++++++++++++++++++++++++++++++
 arch/s390/pci/pci_clp.c          |    4 +
 arch/s390/pci/pci_dma.c          |   24 ++++++++-
 arch/s390/pci/pci_insn.c         |   97 ---------------------------------------
 6 files changed, 126 insertions(+), 100 deletions(-)

--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -140,6 +140,7 @@ int zpci_register_ioat(struct zpci_dev *
 int zpci_unregister_ioat(struct zpci_dev *, u8);
 
 /* CLP */
+u8 clp_instr(void *data);
 int clp_scan_pci_devices(void);
 int clp_rescan_pci_devices(void);
 int clp_rescan_pci_devices_simple(void);
@@ -177,6 +178,11 @@ struct zpci_dev *get_zdev_by_fid(u32);
 /* DMA */
 int zpci_dma_init(void);
 void zpci_dma_exit(void);
+int dma_update_trans(struct zpci_dev *zdev, unsigned long pa,
+		     dma_addr_t dma_addr, size_t size, int flags);
+void dma_update_cpu_trans(struct zpci_dev *zdev, void *page_addr,
+			  dma_addr_t dma_addr, int flags);
+void dma_purge_rto_entries(struct zpci_dev *zdev);
 
 /* FMB */
 int zpci_fmb_enable_device(struct zpci_dev *);
--- a/arch/s390/include/asm/pci_clp.h
+++ b/arch/s390/include/asm/pci_clp.h
@@ -148,7 +148,8 @@ struct clp_req_set_pci {
 	u16 reserved2;
 	u8 oc;				/* operation controls */
 	u8 ndas;			/* number of dma spaces */
-	u64 reserved3;
+	u32 reserved3;
+	u32 gd;				/* GISA Designation */
 } __packed;
 
 /* Set PCI function response */
--- a/arch/s390/include/asm/pci_insn.h
+++ b/arch/s390/include/asm/pci_insn.h
@@ -1,6 +1,8 @@
 #ifndef _ASM_S390_PCI_INSN_H
 #define _ASM_S390_PCI_INSN_H
 
+#include <asm/processor.h>
+
 /* Load/Store status codes */
 #define ZPCI_PCI_ST_FUNC_NOT_ENABLED		4
 #define ZPCI_PCI_ST_FUNC_IN_ERR			8
@@ -83,4 +85,94 @@ int zpci_store(u64 data, u64 req, u64 of
 int zpci_store_block(const u64 *data, u64 req, u64 offset);
 void zpci_set_irq_ctrl(u16 ctl, char *unused, u8 isc);
 
+static inline u8 __mpcifc(u64 req, struct zpci_fib *fib, u8 *status)
+{
+	u8 cc;
+
+	asm volatile (
+		"	.insn	rxy,0xe300000000d0,%[req],%[fib]\n"
+		"	ipm	%[cc]\n"
+		"	srl	%[cc],28\n"
+		: [cc] "=d" (cc), [req] "+d" (req), [fib] "+Q" (*fib)
+		: : "cc");
+	*status = req >> 24 & 0xff;
+	return cc;
+}
+
+static inline u8 __rpcit(u64 fn, u64 addr, u64 range, u8 *status)
+{
+	register u64 __addr asm("2") = addr;
+	register u64 __range asm("3") = range;
+	u8 cc;
+
+	asm volatile (
+		"	.insn	rre,0xb9d30000,%[fn],%[addr]\n"
+		"	ipm	%[cc]\n"
+		"	srl	%[cc],28\n"
+		: [cc] "=d" (cc), [fn] "+d" (fn)
+		: [addr] "d" (__addr), "d" (__range)
+		: "cc");
+	*status = fn >> 24 & 0xff;
+	return cc;
+}
+
+static inline int __pcilg(u64 *data, u64 req, u64 offset, u8 *status)
+{
+	register u64 __req asm("2") = req;
+	register u64 __offset asm("3") = offset;
+	int cc = -ENXIO;
+	u64 __data;
+
+	asm volatile (
+		"	.insn	rre,0xb9d20000,%[data],%[req]\n"
+		"0:	ipm	%[cc]\n"
+		"	srl	%[cc],28\n"
+		"1:\n"
+		EX_TABLE(0b, 1b)
+		: [cc] "+d" (cc), [data] "=d" (__data), [req] "+d" (__req)
+		:  "d" (__offset)
+		: "cc");
+	*status = __req >> 24 & 0xff;
+	if (!cc)
+		*data = __data;
+
+	return cc;
+}
+
+static inline int __pcistg(u64 data, u64 req, u64 offset, u8 *status)
+{
+	register u64 __req asm("2") = req;
+	register u64 __offset asm("3") = offset;
+	int cc = -ENXIO;
+
+	asm volatile (
+		"	.insn	rre,0xb9d00000,%[data],%[req]\n"
+		"0:	ipm	%[cc]\n"
+		"	srl	%[cc],28\n"
+		"1:\n"
+		EX_TABLE(0b, 1b)
+		: [cc] "+d" (cc), [req] "+d" (__req)
+		: "d" (__offset), [data] "d" (data)
+		: "cc");
+	*status = __req >> 24 & 0xff;
+	return cc;
+}
+
+static inline int __pcistb(const u64 *data, u64 req, u64 offset, u8 *status)
+{
+	int cc = -ENXIO;
+
+	asm volatile (
+		"	.insn	rsy,0xeb00000000d0,%[req],%[offset],%[data]\n"
+		"0:	ipm	%[cc]\n"
+		"	srl	%[cc],28\n"
+		"1:\n"
+		EX_TABLE(0b, 1b)
+		: [cc] "+d" (cc), [req] "+d" (req)
+		: [offset] "d" (offset), [data] "Q" (*data)
+		: "cc");
+	*status = req >> 24 & 0xff;
+	return cc;
+}
+
 #endif
--- a/arch/s390/pci/pci_clp.c
+++ b/arch/s390/pci/pci_clp.c
@@ -30,7 +30,7 @@ static inline void zpci_err_clp(unsigned
  * Call Logical Processor
  * Retry logic is handled by the caller.
  */
-static inline u8 clp_instr(void *data)
+u8 clp_instr(void *data)
 {
 	struct { u8 _[CLP_BLK_SIZE]; } *req = data;
 	u64 ignored;
@@ -45,6 +45,7 @@ static inline u8 clp_instr(void *data)
 		: "cc");
 	return cc;
 }
+EXPORT_SYMBOL_GPL(clp_instr);
 
 static void *clp_alloc_block(gfp_t gfp_mask)
 {
@@ -263,6 +264,7 @@ int clp_disable_fh(struct zpci_dev *zdev
 	zpci_dbg(3, "dis fid:%x, fh:%x, rc:%d\n", zdev->fid, zdev->fh, rc);
 	return rc;
 }
+EXPORT_SYMBOL_GPL(clp_disable_fh);
 
 static int clp_list_pci(struct clp_req_rsp_list_pci *rrb,
 			void (*cb)(struct clp_fh_list_entry *entry))
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -114,7 +114,7 @@ static unsigned long *dma_walk_cpu_trans
 	return &pto[px];
 }
 
-static void dma_update_cpu_trans(struct zpci_dev *zdev, void *page_addr,
+void dma_update_cpu_trans(struct zpci_dev *zdev, void *page_addr,
 				 dma_addr_t dma_addr, int flags)
 {
 	unsigned long *entry;
@@ -138,8 +138,9 @@ static void dma_update_cpu_trans(struct
 	else
 		entry_clr_protected(entry);
 }
+EXPORT_SYMBOL_GPL(dma_update_cpu_trans);
 
-static int dma_update_trans(struct zpci_dev *zdev, unsigned long pa,
+int dma_update_trans(struct zpci_dev *zdev, unsigned long pa,
 			    dma_addr_t dma_addr, size_t size, int flags)
 {
 	unsigned int nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
@@ -180,6 +181,7 @@ no_refresh:
 	spin_unlock_irqrestore(&zdev->dma_table_lock, irq_flags);
 	return rc;
 }
+EXPORT_SYMBOL_GPL(dma_update_trans);
 
 static void dma_free_seg_table(unsigned long entry)
 {
@@ -457,6 +459,7 @@ out_reg:
 out_clean:
 	return rc;
 }
+EXPORT_SYMBOL_GPL(zpci_dma_init_device);
 
 void zpci_dma_exit_device(struct zpci_dev *zdev)
 {
@@ -466,6 +469,7 @@ void zpci_dma_exit_device(struct zpci_de
 	zdev->iommu_bitmap = NULL;
 	zdev->next_bit = 0;
 }
+EXPORT_SYMBOL_GPL(zpci_dma_exit_device);
 
 static int __init dma_alloc_cpu_table_caches(void)
 {
@@ -518,6 +522,22 @@ struct dma_map_ops s390_dma_ops = {
 };
 EXPORT_SYMBOL_GPL(s390_dma_ops);
 
+void dma_purge_rto_entries(struct zpci_dev *zdev)
+{
+	unsigned long *table;
+	int rtx;
+
+	if (!zdev || !zdev->dma_table)
+		return;
+	table = zdev->dma_table;
+	for (rtx = 0; rtx < ZPCI_TABLE_ENTRIES; rtx++)
+		if (reg_entry_isvalid(table[rtx])) {
+			dma_free_seg_table(table[rtx]);
+			invalidate_table_entry(&table[rtx]);
+		}
+}
+EXPORT_SYMBOL_GPL(dma_purge_rto_entries);
+
 static int __init s390_iommu_setup(char *str)
 {
 	if (!strncmp(str, "strict", 6))
--- a/arch/s390/pci/pci_insn.c
+++ b/arch/s390/pci/pci_insn.c
@@ -8,25 +8,9 @@
 #include <linux/errno.h>
 #include <linux/delay.h>
 #include <asm/pci_insn.h>
-#include <asm/processor.h>
 
 #define ZPCI_INSN_BUSY_DELAY	1	/* 1 microsecond */
 
-/* Modify PCI Function Controls */
-static inline u8 __mpcifc(u64 req, struct zpci_fib *fib, u8 *status)
-{
-	u8 cc;
-
-	asm volatile (
-		"	.insn	rxy,0xe300000000d0,%[req],%[fib]\n"
-		"	ipm	%[cc]\n"
-		"	srl	%[cc],28\n"
-		: [cc] "=d" (cc), [req] "+d" (req), [fib] "+Q" (*fib)
-		: : "cc");
-	*status = req >> 24 & 0xff;
-	return cc;
-}
-
 int zpci_mod_fc(u64 req, struct zpci_fib *fib)
 {
 	u8 cc, status;
@@ -43,24 +27,6 @@ int zpci_mod_fc(u64 req, struct zpci_fib
 	return (cc) ? -EIO : 0;
 }
 
-/* Refresh PCI Translations */
-static inline u8 __rpcit(u64 fn, u64 addr, u64 range, u8 *status)
-{
-	register u64 __addr asm("2") = addr;
-	register u64 __range asm("3") = range;
-	u8 cc;
-
-	asm volatile (
-		"	.insn	rre,0xb9d30000,%[fn],%[addr]\n"
-		"	ipm	%[cc]\n"
-		"	srl	%[cc],28\n"
-		: [cc] "=d" (cc), [fn] "+d" (fn)
-		: [addr] "d" (__addr), "d" (__range)
-		: "cc");
-	*status = fn >> 24 & 0xff;
-	return cc;
-}
-
 int zpci_refresh_trans(u64 fn, u64 addr, u64 range)
 {
 	u8 cc, status;
@@ -84,30 +50,7 @@ void zpci_set_irq_ctrl(u16 ctl, char *un
 		"	.insn	rsy,0xeb00000000d1,%[ctl],%[isc],%[u]\n"
 		: : [ctl] "d" (ctl), [isc] "d" (isc << 27), [u] "Q" (*unused));
 }
-
-/* PCI Load */
-static inline int __pcilg(u64 *data, u64 req, u64 offset, u8 *status)
-{
-	register u64 __req asm("2") = req;
-	register u64 __offset asm("3") = offset;
-	int cc = -ENXIO;
-	u64 __data;
-
-	asm volatile (
-		"	.insn	rre,0xb9d20000,%[data],%[req]\n"
-		"0:	ipm	%[cc]\n"
-		"	srl	%[cc],28\n"
-		"1:\n"
-		EX_TABLE(0b, 1b)
-		: [cc] "+d" (cc), [data] "=d" (__data), [req] "+d" (__req)
-		:  "d" (__offset)
-		: "cc");
-	*status = __req >> 24 & 0xff;
-	if (!cc)
-		*data = __data;
-
-	return cc;
-}
+EXPORT_SYMBOL_GPL(zpci_set_irq_ctrl);
 
 int zpci_load(u64 *data, u64 req, u64 offset)
 {
@@ -127,26 +70,6 @@ int zpci_load(u64 *data, u64 req, u64 of
 }
 EXPORT_SYMBOL_GPL(zpci_load);
 
-/* PCI Store */
-static inline int __pcistg(u64 data, u64 req, u64 offset, u8 *status)
-{
-	register u64 __req asm("2") = req;
-	register u64 __offset asm("3") = offset;
-	int cc = -ENXIO;
-
-	asm volatile (
-		"	.insn	rre,0xb9d00000,%[data],%[req]\n"
-		"0:	ipm	%[cc]\n"
-		"	srl	%[cc],28\n"
-		"1:\n"
-		EX_TABLE(0b, 1b)
-		: [cc] "+d" (cc), [req] "+d" (__req)
-		: "d" (__offset), [data] "d" (data)
-		: "cc");
-	*status = __req >> 24 & 0xff;
-	return cc;
-}
-
 int zpci_store(u64 data, u64 req, u64 offset)
 {
 	u8 status;
@@ -165,24 +88,6 @@ int zpci_store(u64 data, u64 req, u64 of
 }
 EXPORT_SYMBOL_GPL(zpci_store);
 
-/* PCI Store Block */
-static inline int __pcistb(const u64 *data, u64 req, u64 offset, u8 *status)
-{
-	int cc = -ENXIO;
-
-	asm volatile (
-		"	.insn	rsy,0xeb00000000d0,%[req],%[offset],%[data]\n"
-		"0:	ipm	%[cc]\n"
-		"	srl	%[cc],28\n"
-		"1:\n"
-		EX_TABLE(0b, 1b)
-		: [cc] "+d" (cc), [req] "+d" (req)
-		: [offset] "d" (offset), [data] "Q" (*data)
-		: "cc");
-	*status = req >> 24 & 0xff;
-	return cc;
-}
-
 int zpci_store_block(const u64 *data, u64 req, u64 offset)
 {
 	u8 status;


* [Qemu-devel] [RFC][patch 3/6] KVM: s390: Add GISA support
  2014-09-04 10:52 [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 frank.blaschka
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 1/6] s390: cio: chsc function to register GIB frank.blaschka
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 2/6] s390: pci: export pci functions for pass-through usage frank.blaschka
@ 2014-09-04 10:52 ` frank.blaschka
  2014-09-04 14:19   ` Heiko Carstens
  2014-09-05  8:29   ` Alexander Graf
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 4/6] KVM: s390: Add PCI pass-through support frank.blaschka
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 21+ messages in thread
From: frank.blaschka @ 2014-09-04 10:52 UTC (permalink / raw)
  To: qemu-devel, linux-s390, kvm; +Cc: aik, pbonzini, agraf

[-- Attachment #1: 004-s390_kvm_gisa-3.16.patch --]
[-- Type: text/plain, Size: 10151 bytes --]

From: Frank Blaschka <frank.blaschka@de.ibm.com>

This patch adds GISA (Guest Interrupt State Area) support
to s390 KVM. The GISA can be used for exitless interrupt delivery.
The patch provides a set of functions for GISA-related operations,
such as accessing GISA fields or registering ISCs for alert.
Exploiters of the GISA will follow in additional patches.

Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
---
 arch/s390/include/asm/kvm_host.h |   72 ++++++++++++++++
 arch/s390/kvm/kvm-s390.c         |  167 +++++++++++++++++++++++++++++++++++++++
 arch/s390/kvm/kvm-s390.h         |   28 ++++++
 3 files changed, 265 insertions(+), 2 deletions(-)

--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -129,11 +129,12 @@ struct kvm_s390_sie_block {
 	__u8	reserved60;		/* 0x0060 */
 	__u8	ecb;			/* 0x0061 */
 	__u8    ecb2;                   /* 0x0062 */
-	__u8    reserved63[1];          /* 0x0063 */
+	__u8    ecb3;			/* 0x0063 */
 	__u32	scaol;			/* 0x0064 */
 	__u8	reserved68[4];		/* 0x0068 */
 	__u32	todpr;			/* 0x006c */
-	__u8	reserved70[32];		/* 0x0070 */
+	__u32   gd;                     /* 0x0070 */
+	__u8    reserved74[28];         /* 0x0074 */
 	psw_t	gpsw;			/* 0x0090 */
 	__u64	gg14;			/* 0x00a0 */
 	__u64	gg15;			/* 0x00a8 */
@@ -300,6 +301,70 @@ struct kvm_s390_interrupt_info {
 #define ACTION_STORE_ON_STOP		(1<<0)
 #define ACTION_STOP_ON_STOP		(1<<1)
 
+#define KVM_S390_GISA_FORMAT_0	0
+#define KVM_S390_GISA_FORMAT_1	1
+
+struct kvm_s390_gisa_f0 {
+	u32 next_alert;
+	u8 ipm;
+	u16 rsv0:14;
+	u16 g:1;
+	u16 c:1;
+	u8 iam;
+	u32 rsv1;
+	u32 count;
+} __packed;
+
+struct kvm_s390_gisa_f1 {
+	u32 next_alert;
+	u8 ipm;
+	u8 simm;
+	u8 nimm;
+	u8 iam;
+	u64 aisma;
+	u32 rsv0:6;
+	u32 g:1;
+	u32 c:1;
+	u32 rsv1:24;
+	u64 rsv2;
+	u32 count;
+} __packed;
+
+union kvm_s390_gisa {
+	struct kvm_s390_gisa_f0 f0;
+	struct kvm_s390_gisa_f1 f1;
+};
+
+struct kvm_s390_gait {
+	u32 gd;
+	u16      : 5;
+	u16 gisc : 3;
+	u16 rpu  : 8;
+	u16        : 10;
+	u16 gaisbo :  6;
+	u64 gaisba;
+} __packed;
+
+struct kvm_s390_aifte {
+	u64 faisba;
+	u64 gaita;
+	u16 simm : 8;
+	u16      : 5;
+	u16 afi  : 3;
+	u16 reserved1;
+	u16 reserved2;
+	u16 faal;
+} __packed;
+
+struct kvm_s390_gib {
+	u32 alo;
+	u32 reserved1;
+	u32      : 5;
+	u32 nisc : 3;
+	u32      : 24;
+	u8 reserved2[20];
+} __packed;
+
 struct kvm_s390_local_interrupt {
 	spinlock_t lock;
 	struct list_head list;
@@ -420,6 +485,9 @@ struct kvm_arch{
 	struct s390_io_adapter *adapters[MAX_S390_IO_ADAPTERS];
 	wait_queue_head_t ipte_wq;
 	spinlock_t start_stop_lock;
+	union kvm_s390_gisa *gisa;
+	unsigned long iam;
+	atomic_t in_sie;
 };
 
 #define KVM_HVA_ERR_BAD		(-1UL)
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -404,6 +404,16 @@ long kvm_arch_vm_ioctl(struct file *filp
 	return r;
 }
 
+static u8 kvm_s390_gisa_get_alert_mask(struct kvm *kvm)
+{
+	return (u8)ACCESS_ONCE(kvm->arch.iam);
+}
+
+static void kvm_s390_gisa_set_alert_mask(struct kvm *kvm, u8 iam)
+{
+	xchg(&kvm->arch.iam, iam);
+}
+
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
 	int rc;
@@ -461,6 +471,14 @@ int kvm_arch_init_vm(struct kvm *kvm, un
 	kvm->arch.css_support = 0;
 	kvm->arch.use_irqchip = 0;
 
+	kvm->arch.gisa = (union kvm_s390_gisa *)get_zeroed_page(
+			GFP_KERNEL | GFP_DMA);
+	if (!kvm->arch.gisa)
+		goto out_nogmap;
+	kvm_s390_gisa_set_next_alert(kvm, (u32)(unsigned long)kvm->arch.gisa);
+	kvm_s390_gisa_set_alert_mask(kvm, 0);
+	atomic_set(&kvm->arch.in_sie, 0);
+
 	spin_lock_init(&kvm->arch.start_stop_lock);
 
 	return 0;
@@ -520,6 +538,7 @@ void kvm_arch_sync_events(struct kvm *kv
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+	free_page((unsigned long)kvm->arch.gisa);
 	kvm_free_vcpus(kvm);
 	free_page((unsigned long)(kvm->arch.sca));
 	debug_unregister(kvm->arch.dbf);
@@ -656,6 +675,19 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu
 	return rc;
 }
 
+u32 kvm_s390_gisa_get_fmt(void)
+{
+	if (test_facility(70) || test_facility(72))
+		return KVM_S390_GISA_FORMAT_1;
+	else
+		return KVM_S390_GISA_FORMAT_0;
+}
+
+static u32 kvm_s390_build_gd(struct kvm *kvm)
+{
+	return (u32)(unsigned long)kvm->arch.gisa | kvm_s390_gisa_get_fmt();
+}
+
 struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm,
 				      unsigned int id)
 {
@@ -699,6 +731,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st
 	vcpu->arch.local_int.float_int = &kvm->arch.float_int;
 	vcpu->arch.local_int.wq = &vcpu->wq;
 	vcpu->arch.local_int.cpuflags = &vcpu->arch.sie_block->cpuflags;
+	vcpu->arch.sie_block->gd = kvm_s390_build_gd(kvm);
 
 	rc = kvm_vcpu_init(vcpu, kvm, id);
 	if (rc)
@@ -749,6 +782,132 @@ void exit_sie_sync(struct kvm_vcpu *vcpu
 	exit_sie(vcpu);
 }
 
+void kvm_s390_gisa_register_alert(struct kvm *kvm, u32 gisc)
+{
+	int bito = BITS_PER_BYTE * 7 + gisc;
+
+	set_bit(bito ^ (BITS_PER_LONG - 1), &kvm->arch.iam);
+}
+
+void kvm_s390_gisa_unregister_alert(struct kvm *kvm, u32 gisc)
+{
+	int bito = BITS_PER_BYTE * 7 + gisc;
+
+	clear_bit(bito ^ (BITS_PER_LONG - 1), &kvm->arch.iam);
+}
+
+u32 __kvm_s390_gisa_get_next_alert(union kvm_s390_gisa *gisa)
+{
+	return ACCESS_ONCE(gisa->f0.next_alert);
+}
+
+u32 kvm_s390_gisa_get_next_alert(struct kvm *kvm)
+{
+	return __kvm_s390_gisa_get_next_alert(
+		(union kvm_s390_gisa *)kvm->arch.gisa);
+}
+
+void __kvm_s390_gisa_set_next_alert(union kvm_s390_gisa *gisa, u32 val)
+{
+	xchg(&gisa->f0.next_alert, val);
+}
+
+void kvm_s390_gisa_set_next_alert(struct kvm *kvm, u32 val)
+{
+	__kvm_s390_gisa_set_next_alert(kvm->arch.gisa, val);
+}
+
+u8 kvm_s390_gisa_get_iam(struct kvm *kvm)
+{
+	return ACCESS_ONCE(kvm->arch.gisa->f0.iam);
+}
+
+void kvm_s390_gisa_set_iam(struct kvm *kvm, u8 iam)
+{
+	xchg(&kvm->arch.gisa->f0.iam, iam);
+}
+
+int kvm_s390_gisa_test_iam_gisc(struct kvm *kvm, u32 gisc)
+{
+	int bito = BITS_PER_BYTE * 7 + gisc;
+	unsigned long *addr = (unsigned long *)kvm->arch.gisa;
+
+	return test_bit(bito ^ (BITS_PER_LONG - 1), addr);
+}
+
+u8 kvm_s390_gisa_get_ipm(struct kvm *kvm)
+{
+	return ACCESS_ONCE(kvm->arch.gisa->f0.ipm);
+}
+
+void kvm_s390_gisa_set_ipm(struct kvm *kvm, u8 ipm)
+{
+	xchg(&kvm->arch.gisa->f0.ipm, ipm);
+}
+
+int kvm_s390_gisa_test_ipm_gisc(struct kvm *kvm, u32 gisc)
+{
+	int bito = BITS_PER_BYTE * 4 + gisc;
+	unsigned long *addr = (unsigned long *)kvm->arch.gisa;
+
+	return test_bit(bito ^ (BITS_PER_LONG - 1), addr);
+}
+
+void kvm_s390_gisa_set_ipm_gisc(struct kvm *kvm, u32 gisc)
+{
+	int bito = gisc + 32;
+	unsigned long *addr = (unsigned long *)kvm->arch.gisa;
+
+	set_bit(bito ^ (BITS_PER_LONG - 1), addr);
+}
+
+u32 kvm_s390_gisa_get_g(struct kvm *kvm)
+{
+	u32 fmt, bito;
+	unsigned long *addr;
+
+	fmt = kvm_s390_gisa_get_fmt();
+	if (fmt == KVM_S390_GISA_FORMAT_0) {
+		addr = (unsigned long *)kvm->arch.gisa;
+		bito = BITS_PER_BYTE * 6 + 6;
+	} else {
+		addr = (unsigned long *)((u8 *)kvm->arch.gisa + 16);
+		bito = 6;
+	}
+
+	return test_bit(bito ^ (BITS_PER_LONG - 1), addr);
+}
+
+u32 kvm_s390_gisa_get_c(struct kvm *kvm)
+{
+	u32 fmt, bito;
+	unsigned long *addr;
+
+	fmt = kvm_s390_gisa_get_fmt();
+	if (fmt == KVM_S390_GISA_FORMAT_0) {
+		addr = (unsigned long *)kvm->arch.gisa;
+		bito = BITS_PER_BYTE * 6 + 7;
+	} else {
+		addr = (unsigned long *)((u8 *)kvm->arch.gisa + 16);
+		bito = 7;
+	}
+
+	return test_bit(bito ^ (BITS_PER_LONG - 1), addr);
+}
+
+u32 kvm_s390_gisa_get_count(struct kvm *kvm)
+{
+	u32 fmt, cnt;
+
+	fmt = kvm_s390_gisa_get_fmt();
+	if (fmt == KVM_S390_GISA_FORMAT_0)
+		cnt = ACCESS_ONCE(kvm->arch.gisa->f0.count);
+	else
+		cnt = ACCESS_ONCE(kvm->arch.gisa->f1.count);
+
+	return cnt;
+}
+
 static void kvm_gmap_notifier(struct gmap *gmap, unsigned long address)
 {
 	int i;
@@ -1284,8 +1443,16 @@ static int __vcpu_run(struct kvm_vcpu *v
 		preempt_disable();
 		kvm_guest_enter();
 		preempt_enable();
+
+		atomic_inc(&vcpu->kvm->arch.in_sie);
+		kvm_s390_gisa_set_iam(vcpu->kvm, 0);
+
 		exit_reason = sie64a(vcpu->arch.sie_block,
 				     vcpu->run->s.regs.gprs);
+		if (atomic_dec_and_test(&vcpu->kvm->arch.in_sie))
+			kvm_s390_gisa_set_iam(vcpu->kvm,
+				kvm_s390_gisa_get_alert_mask(vcpu->kvm));
+
 		kvm_guest_exit();
 		vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
 
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -122,6 +122,17 @@ static inline u64 kvm_s390_get_base_disp
 	return (base2 ? vcpu->run->s.regs.gprs[base2] : 0) + disp2;
 }
 
+static inline u64 kvm_s390_get_base_disp_rxy(struct kvm_vcpu *vcpu)
+{
+	u32 x2 = (vcpu->arch.sie_block->ipa & 0x000f);
+	u32 base2 = vcpu->arch.sie_block->ipb >> 28;
+	u32 disp2 = ((vcpu->arch.sie_block->ipb & 0x0fff0000) >> 16) +
+		((vcpu->arch.sie_block->ipb & 0xff00) << 4);
+
+	return (base2 ? vcpu->run->s.regs.gprs[base2] : 0) +
+		(x2 ? vcpu->run->s.regs.gprs[x2] : 0) + (u64)disp2;
+}
+
 /* Set the condition code in the guest program status word */
 static inline void kvm_s390_set_psw_cc(struct kvm_vcpu *vcpu, unsigned long cc)
 {
@@ -180,6 +191,23 @@ void exit_sie(struct kvm_vcpu *vcpu);
 void exit_sie_sync(struct kvm_vcpu *vcpu);
 int kvm_s390_vcpu_setup_cmma(struct kvm_vcpu *vcpu);
 void kvm_s390_vcpu_unsetup_cmma(struct kvm_vcpu *vcpu);
+u32 kvm_s390_gisa_get_fmt(void);
+void kvm_s390_gisa_register_alert(struct kvm *kvm, u32 gisc);
+void kvm_s390_gisa_unregister_alert(struct kvm *kvm, u32 gisc);
+u32 __kvm_s390_gisa_get_next_alert(union kvm_s390_gisa *gisa);
+u32 kvm_s390_gisa_get_next_alert(struct kvm *kvm);
+void __kvm_s390_gisa_set_next_alert(union kvm_s390_gisa *gisa, u32 val);
+void kvm_s390_gisa_set_next_alert(struct kvm *kvm, u32 val);
+u8 kvm_s390_gisa_get_iam(struct kvm *kvm);
+void kvm_s390_gisa_set_iam(struct kvm *kvm, u8 value);
+int kvm_s390_gisa_test_iam_gisc(struct kvm *kvm, u32 gisc);
+u8 kvm_s390_gisa_get_ipm(struct kvm *kvm);
+void kvm_s390_gisa_set_ipm(struct kvm *kvm, u8 value);
+int kvm_s390_gisa_test_ipm_gisc(struct kvm *kvm, u32 gisc);
+void kvm_s390_gisa_set_ipm_gisc(struct kvm *kvm, u32 gisc);
+u32 kvm_s390_gisa_get_g(struct kvm *kvm);
+u32 kvm_s390_gisa_get_c(struct kvm *kvm);
+u32 kvm_s390_gisa_get_count(struct kvm *kvm);
 /* is cmma enabled */
 bool kvm_s390_cmma_enabled(struct kvm *kvm);
 int test_vfacility(unsigned long nr);


* [Qemu-devel] [RFC][patch 4/6] KVM: s390: Add PCI pass-through support
  2014-09-04 10:52 [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 frank.blaschka
                   ` (2 preceding siblings ...)
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 3/6] KVM: s390: Add GISA support frank.blaschka
@ 2014-09-04 10:52 ` frank.blaschka
  2014-09-05  8:37   ` Alexander Graf
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 5/6] s390: Add PCI bus support frank.blaschka
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: frank.blaschka @ 2014-09-04 10:52 UTC (permalink / raw)
  To: qemu-devel, linux-s390, kvm; +Cc: aik, pbonzini, agraf

[-- Attachment #1: 005-s390_kvm_ppt-3.16.patch --]
[-- Type: text/plain, Size: 59003 bytes --]

From: Frank Blaschka <frank.blaschka@de.ibm.com>

This patch implements PCI pass-through kernel support for s390.
The design approach is very similar to x86 device assignment.
User space executes the KVM_ASSIGN_PCI_DEVICE ioctl to create
a proxy instance in the kernel KVM and connect this instance to the
host PCI device. s390 PCI instructions are intercepted in the kernel
and the operations are passed directly to the assigned PCI device.
To take advantage of all System z-specific virtualization features
we need to access the SIE control block residing in KVM. We also have
to enable zPCI devices with special configuration information coming
from the SIE block.

Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
---
 arch/s390/include/asm/kvm_host.h |    1 
 arch/s390/kvm/Makefile           |    2 
 arch/s390/kvm/intercept.c        |    1 
 arch/s390/kvm/kvm-s390.c         |   33 
 arch/s390/kvm/kvm-s390.h         |   17 
 arch/s390/kvm/pci.c              | 2130 +++++++++++++++++++++++++++++++++++++++
 arch/s390/kvm/priv.c             |   21 
 7 files changed, 2202 insertions(+), 3 deletions(-)

--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -488,6 +488,7 @@ struct kvm_arch{
 	union kvm_s390_gisa *gisa;
 	unsigned long iam;
 	atomic_t in_sie;
+	struct list_head ppt_dev_list;
 };
 
 #define KVM_HVA_ERR_BAD		(-1UL)
--- a/arch/s390/kvm/Makefile
+++ b/arch/s390/kvm/Makefile
@@ -12,6 +12,6 @@ common-objs = $(KVM)/kvm_main.o $(KVM)/e
 ccflags-y := -Ivirt/kvm -Iarch/s390/kvm
 
 kvm-objs := $(common-objs) kvm-s390.o intercept.o interrupt.o priv.o sigp.o
-kvm-objs += diag.o gaccess.o guestdbg.o
+kvm-objs += diag.o gaccess.o guestdbg.o pci.o
 
 obj-$(CONFIG_KVM) += kvm.o
--- a/arch/s390/kvm/intercept.c
+++ b/arch/s390/kvm/intercept.c
@@ -34,6 +34,7 @@ static const intercept_handler_t instruc
 	[0xb6] = kvm_s390_handle_stctl,
 	[0xb7] = kvm_s390_handle_lctl,
 	[0xb9] = kvm_s390_handle_b9,
+	[0xe3] = kvm_s390_handle_e3,
 	[0xe5] = kvm_s390_handle_e5,
 	[0xeb] = kvm_s390_handle_eb,
 };
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -397,6 +397,24 @@ long kvm_arch_vm_ioctl(struct file *filp
 		r = kvm_s390_vm_has_attr(kvm, &attr);
 		break;
 	}
+	case KVM_ASSIGN_PCI_DEVICE: {
+		struct kvm_assigned_pci_dev assigned_dev;
+
+		r = -EFAULT;
+		if (copy_from_user(&assigned_dev, argp, sizeof(assigned_dev)))
+			break;
+		r = kvm_s390_ioctrl_assign_pci(kvm, &assigned_dev);
+		break;
+	}
+	case KVM_DEASSIGN_PCI_DEVICE: {
+		struct kvm_assigned_pci_dev assigned_dev;
+
+		r = -EFAULT;
+		if (copy_from_user(&assigned_dev, argp, sizeof(assigned_dev)))
+			break;
+		r = kvm_s390_ioctrl_deassign_pci(kvm, &assigned_dev);
+		break;
+	}
 	default:
 		r = -ENOTTY;
 	}
@@ -478,6 +496,7 @@ int kvm_arch_init_vm(struct kvm *kvm, un
 	kvm_s390_gisa_set_next_alert(kvm, (u32)(unsigned long)kvm->arch.gisa);
 	kvm_s390_gisa_set_alert_mask(kvm, 0);
 	atomic_set(&kvm->arch.in_sie, 0);
+	INIT_LIST_HEAD(&kvm->arch.ppt_dev_list);
 
 	spin_lock_init(&kvm->arch.start_stop_lock);
 
@@ -538,6 +557,7 @@ void kvm_arch_sync_events(struct kvm *kv
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+	s390_pci_cleanup(kvm);
 	free_page((unsigned long)kvm->arch.gisa);
 	kvm_free_vcpus(kvm);
 	free_page((unsigned long)(kvm->arch.sca));
@@ -656,7 +676,10 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu
 		vcpu->arch.sie_block->ecb |= 0x10;
 
 	vcpu->arch.sie_block->ecb2  = 8;
-	vcpu->arch.sie_block->eca   = 0xD1002000U;
+	vcpu->arch.sie_block->eca   = 0xD1202000U;
+	vcpu->arch.sie_block->ecb2 |= 0x02;
+	vcpu->arch.sie_block->ecb3 = 0x20;
+
 	if (sclp_has_siif())
 		vcpu->arch.sie_block->eca |= 1;
 	vcpu->arch.sie_block->fac   = (int) (long) vfacilities;
@@ -1920,6 +1943,12 @@ static int __init kvm_s390_init(void)
 	if (ret)
 		return ret;
 
+	ret = s390_pci_init();
+	if (ret) {
+		kvm_exit();
+		return ret;
+	}
+
 	/*
 	 * guests can ask for up to 255+1 double words, we need a full page
 	 * to hold the maximum amount of facilities. On the other hand, we
@@ -1932,7 +1961,7 @@ static int __init kvm_s390_init(void)
 	}
 	memcpy(vfacilities, S390_lowcore.stfle_fac_list, 16);
 	vfacilities[0] &= 0xff82fff3f4fc2000UL;
-	vfacilities[1] &= 0x005c000000000000UL;
+	vfacilities[1] &= 0x07dc000000000000UL;
 	return 0;
 }
 
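The mask change above widens the facility bits reported to guests: word 1 of the STFLE facility list covers facilities 64-127 in MSB-0 bit numbering, so going from 0x005c... to 0x07dc... newly exposes facilities 69-72 (presumably the PCI-related facilities, though the patch does not name them). A small sketch, assuming only the MSB-0 numbering convention, that computes which facilities a mask change exposes:

```c
#include <assert.h>
#include <stdint.h>

/* s390 STFLE facility words use MSB-0 bit numbering: bit 0 of word 1
 * is facility 64, bit 63 is facility 127. Returns the number of newly
 * exposed facilities and fills their numbers into out[]. */
static int newly_exposed(uint64_t old_mask, uint64_t new_mask,
			 int word, int *out)
{
	uint64_t diff = new_mask & ~old_mask;
	int i, n = 0;

	for (i = 0; i < 64; i++)
		if ((diff >> (63 - i)) & 1)
			out[n++] = word * 64 + i;
	return n;
}
```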
--- a/arch/s390/kvm/kvm-s390.h
+++ b/arch/s390/kvm/kvm-s390.h
@@ -167,6 +167,7 @@ int kvm_s390_mask_adapter(struct kvm *kv
 /* implemented in priv.c */
 int is_valid_psw(psw_t *psw);
 int kvm_s390_handle_b2(struct kvm_vcpu *vcpu);
+int kvm_s390_handle_e3(struct kvm_vcpu *vcpu);
 int kvm_s390_handle_e5(struct kvm_vcpu *vcpu);
 int kvm_s390_handle_01(struct kvm_vcpu *vcpu);
 int kvm_s390_handle_b9(struct kvm_vcpu *vcpu);
@@ -267,4 +268,20 @@ void kvm_s390_clear_bp_data(struct kvm_v
 void kvm_s390_prepare_debug_exit(struct kvm_vcpu *vcpu);
 void kvm_s390_handle_per_event(struct kvm_vcpu *vcpu);
 
+/* implemented in pci.c */
+int handle_clp(struct kvm_vcpu *vcpu);
+int handle_rpcit(struct kvm_vcpu *vcpu);
+int handle_sic(struct kvm_vcpu *vcpu);
+int handle_pcistb(struct kvm_vcpu *vcpu);
+int handle_mpcifc(struct kvm_vcpu *vcpu);
+int handle_pcistg(struct kvm_vcpu *vcpu);
+int handle_pcilg(struct kvm_vcpu *vcpu);
+int handle_stpcifc(struct kvm_vcpu *vcpu);
+int kvm_s390_ioctrl_assign_pci(struct kvm *kvm,
+	struct kvm_assigned_pci_dev *assigned_dev);
+int kvm_s390_ioctrl_deassign_pci(struct kvm *kvm,
+	struct kvm_assigned_pci_dev *assigned_dev);
+void s390_pci_cleanup(struct kvm *kvm);
+int s390_pci_init(void);
+void s390_pci_exit(void);
 #endif
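The new pci.c below mirrors guest adapter-interruption bit vectors with Linux's set_bit()/test_bit(), which number bits from the least significant end of each word, while the architected vectors are numbered MSB-first. The code converts between the two by XOR-ing the bit index with BITS_PER_LONG - 1 (the be_to_le constant). The conversion in isolation, assuming a 64-bit build:

```c
#include <assert.h>

#define BITS_PER_LONG 64

/* Convert an MSB-0 (architected) bit number within one 64-bit word to
 * the LSB-0 number that set_bit()/test_bit() expect. The mapping is
 * its own inverse, so the same helper converts in both directions. */
static unsigned long msb0_to_lsb0(unsigned long bit)
{
	return bit ^ (BITS_PER_LONG - 1);
}
```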
--- /dev/null
+++ b/arch/s390/kvm/pci.c
@@ -0,0 +1,2130 @@
+/*
+ * handling pci related instructions
+ *
+ * Copyright IBM Corp. 2014
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License (version 2 only)
+ * as published by the Free Software Foundation.
+ *
+ *    Author(s): Frank Blaschka <frank.blaschka@de.ibm.com>
+ *               Hong Bo Li <lihbbj@cn.ibm.com>
+ *               Yi Min Zhao <zyimin@cn.ibm.com>
+ */
+
+#define KMSG_COMPONENT "kvmpci"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include <linux/kvm.h>
+#include <linux/gfp.h>
+#include <linux/errno.h>
+#include <linux/compat.h>
+#include <linux/pci.h>
+#include <linux/delay.h>
+#include <linux/interrupt.h>
+#include <linux/mmu_context.h>
+#include <linux/delay.h>
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+#include <asm/asm-offsets.h>
+#include <asm/current.h>
+#include <asm/debug.h>
+#include <asm/ebcdic.h>
+#include <asm/sysinfo.h>
+#include <asm/pgtable.h>
+#include <asm/io.h>
+#include <asm/ptrace.h>
+#include <asm/compat.h>
+#include <asm/facility.h>
+#include <asm/cio.h>
+#include <asm/clp.h>
+#include <asm/pci_clp.h>
+#include <asm/pci_dma.h>
+#include <asm/pci_insn.h>
+#include <asm/isc.h>
+#include <asm/airq.h>
+#include <asm/cio.h>
+#include "gaccess.h"
+#include "kvm-s390.h"
+#include "trace.h"
+
+#define USER_LSPCI
+
+#define FH_ENABLED (1UL << 31)
+#define FH_VIRT 0x00ff0000
+#define PCIPT_ISC 5
+#define IO_INT_WORD_AI 0x80000000
+#define DBF_NAME_LEN 20
+#define ASSIGN_FLAG_HOSTIRQ 0x1
+
+#define PPT_AIRQ_HOST_ERROR   0x2
+#define PPT_AIRQ_HOST_FORWARD 0x1
+
+#define PPT_TRACE_NORMAL 2
+#define PPT_TRACE_DEBUG 3
+
+#define PPT_MESSAGE(level, text...) \
+	debug_sprintf_event(dmsgf, level, text)
+#define PPT_DEVICE_MESSAGE(card, level, text...) \
+	debug_sprintf_event(card->debug, level, text)
+
+/* XOR mask converting MSB-0 (architected) bit numbers to the LSB-0
+ * numbering used by set_bit()/test_bit() */
+static const unsigned long be_to_le = BITS_PER_LONG - 1;
+
+enum ppt_vm_stats {
+	PPT_VM_STAT_ALERT_IRQ,
+	PPT_VM_STAT_ALERT_H,
+	PPT_VM_STAT_GISA,
+	PPT_VM_STATS,
+};
+
+static char *ppt_vm_stats_names[PPT_VM_STATS] = {
+	"alert irq",
+	"alert irq H",
+	"gisa irq",
+};
+
+struct ppt_vm_entry {
+	struct list_head entry;
+	atomic_t refcnt;
+	struct kvm *kvm;
+	struct work_struct irq_work;
+	unsigned int stat_items[PPT_VM_STATS];
+};
+
+struct ppt_dbf_entry {
+	char dbf_name[DBF_NAME_LEN];
+	debug_info_t *dbf_info;
+	struct list_head dbf_list;
+};
+
+enum ppt_dev_stats {
+	PPT_DEV_STAT_HOST_IRQ_INJECT,
+	PPT_DEV_STAT_HOST_IRQ_GISA,
+	PPT_DEV_STAT_PCISTG,
+	PPT_DEV_STAT_PCISTB,
+	PPT_DEV_STAT_PCILG,
+	PPT_DEV_STAT_MPCIFC,
+	PPT_DEV_STAT_RPCIT,
+	PPT_DEV_STATS,
+};
+
+static char *ppt_dev_stats_names[PPT_DEV_STATS] = {
+	"host irqs inject",
+	"host irqs gisa",
+	"pcistg",
+	"pcistb",
+	"pcilg",
+	"mpcifc",
+	"rpcit",
+};
+
+struct ppt_dev {
+	struct list_head entry;
+	int enabled;
+	int configured;
+	atomic_t refcnt;
+	struct kvm *kvm;
+	struct ppt_vm_entry *ppt_vm;
+	struct zpci_dev *zdev;
+	struct pci_dev *pdev;
+	u32 dev_id;
+	int irq_on;
+	u32 hostirq;
+	struct msix_entry *entries;
+	unsigned int faisb;
+	unsigned long *aibv;
+	u32 aibvo;
+	unsigned long *aisb;
+	u32 aisbo;
+	u8 sum;
+	u16 noi;
+	u64 g_iota;
+	u64 g_fmba;
+	struct dentry *debugfs_stats;
+	unsigned int stat_items[PPT_DEV_STATS];
+	struct zpci_fmb *fmb;
+	debug_info_t *debug;
+};
+
+static void ppt_irq_worker(struct work_struct *);
+static void ppt_dereg_irqs(struct ppt_dev *, u8 *, u8 *);
+static void ppt_alert_irq_handler(struct airq_struct *);
+static u8 ppt_refresh_trans(u64 fn, u64 addr, u64 range, u8 *status);
+
+static struct airq_struct ppt_airq = {
+	.handler = ppt_alert_irq_handler,
+	.isc = PCIPT_ISC,
+};
+
+static struct kvm_s390_gait *gait;
+static struct airq_iv *faisb_iv;
+static struct kvm_s390_gib *gib;
+static debug_info_t *dmsgf;
+static struct dentry *ppt_stats_debugfs_root;
+
+static LIST_HEAD(ppt_vm_list);
+static DEFINE_SPINLOCK(ppt_vm_list_lock);
+static LIST_HEAD(ppt_dbf_list);
+static DEFINE_MUTEX(ppt_dbf_list_mutex);
+
+static int ppt_dev_stats_show(struct seq_file *sf, void *v)
+{
+	struct ppt_dev *ppt_dev = sf->private;
+	int i = 0;
+
+	if (!ppt_dev)
+		return 0;
+
+	seq_printf(sf, "PPT Device ID : 0x%x\n", ppt_dev->zdev->fid);
+	seq_puts(sf, "PPT Device Statistics Information:\n");
+
+	for (i = 0; i < PPT_DEV_STATS; ++i) {
+		seq_printf(sf, "%24s\t : %d\n",
+			ppt_dev_stats_names[i], ppt_dev->stat_items[i]);
+	}
+
+	seq_puts(sf, "\nPPT VM Statistics Information:\n");
+	for (i = 0; i < PPT_VM_STATS; ++i) {
+		seq_printf(sf, "%24s\t : %d\n", ppt_vm_stats_names[i],
+			i == PPT_VM_STAT_GISA ?
+			kvm_s390_gisa_get_count(ppt_dev->ppt_vm->kvm) :
+			ppt_dev->ppt_vm->stat_items[i]);
+	}
+
+	return 0;
+}
+
+static int ppt_dev_stats_seq_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, ppt_dev_stats_show,
+		file_inode(filp)->i_private);
+}
+
+static const struct file_operations ppt_debugfs_stats_fops = {
+	.open = ppt_dev_stats_seq_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+static void ppt_dev_debugfs_stats_init(struct ppt_dev *ppt_dev)
+{
+	char file_name[20];
+
+	if (!ppt_dev)
+		return;
+
+	sprintf(file_name, "ppt_dev_%x", ppt_dev->zdev->fid);
+	ppt_dev->debugfs_stats = debugfs_create_file(file_name,
+		S_IFREG | S_IRUGO,
+		ppt_stats_debugfs_root,
+		ppt_dev,
+		&ppt_debugfs_stats_fops);
+	memset(ppt_dev->stat_items, 0, sizeof(unsigned int) * PPT_DEV_STATS);
+
+	if (IS_ERR(ppt_dev->debugfs_stats))
+		ppt_dev->debugfs_stats = NULL;
+}
+
+static debug_info_t *ppt_get_dbf_entry(char *name)
+{
+	struct ppt_dbf_entry *entry;
+	debug_info_t *rc = NULL;
+
+	mutex_lock(&ppt_dbf_list_mutex);
+	list_for_each_entry(entry, &ppt_dbf_list, dbf_list) {
+		if (strcmp(entry->dbf_name, name) == 0) {
+			rc = entry->dbf_info;
+			break;
+		}
+	}
+	mutex_unlock(&ppt_dbf_list_mutex);
+	return rc;
+}
+
+static int ppt_add_dbf_entry(struct ppt_dev *card, char *name)
+{
+	struct ppt_dbf_entry *new_entry;
+
+	card->debug = debug_register(name, 8, 1, 128);
+	if (!card->debug) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"Cannot register ppt device debug");
+		goto err;
+	}
+	if (debug_register_view(card->debug, &debug_sprintf_view))
+		goto err_dbg;
+	debug_set_level(card->debug, PPT_TRACE_NORMAL);
+	new_entry = kzalloc(sizeof(struct ppt_dbf_entry), GFP_KERNEL);
+	if (!new_entry)
+		goto err_dbg;
+	strlcpy(new_entry->dbf_name, name, sizeof(new_entry->dbf_name));
+	new_entry->dbf_info = card->debug;
+	mutex_lock(&ppt_dbf_list_mutex);
+	list_add(&new_entry->dbf_list, &ppt_dbf_list);
+	mutex_unlock(&ppt_dbf_list_mutex);
+
+	return 0;
+
+err_dbg:
+	debug_unregister(card->debug);
+err:
+	return -ENOMEM;
+}
+
+static void ppt_clear_dbf_list(void)
+{
+	struct ppt_dbf_entry *entry, *tmp;
+
+	mutex_lock(&ppt_dbf_list_mutex);
+	list_for_each_entry_safe(entry, tmp, &ppt_dbf_list, dbf_list) {
+		list_del(&entry->dbf_list);
+		debug_unregister(entry->dbf_info);
+		kfree(entry);
+	}
+	mutex_unlock(&ppt_dbf_list_mutex);
+}
+
+static void ppt_unregister_dbf_views(void)
+{
+	debug_unregister(dmsgf);
+}
+
+static int ppt_register_dbf_views(void)
+{
+	int rc;
+
+	dmsgf = debug_register("ppt_msg", 8, 1, 128);
+
+	if (!dmsgf)
+		return -ENOMEM;
+
+	rc = debug_register_view(dmsgf, &debug_sprintf_view);
+	if (rc) {
+		debug_unregister(dmsgf);
+		return rc;
+	}
+
+	debug_set_level(dmsgf, PPT_TRACE_NORMAL);
+	return 0;
+}
+
+static struct ppt_vm_entry *ppt_register_vm(struct kvm *kvm)
+{
+	unsigned long flags;
+	struct ppt_vm_entry *tmp, *tmp2, *match = NULL;
+
+	tmp2 = kzalloc(sizeof(struct ppt_vm_entry), GFP_KERNEL);
+	if (!tmp2)
+		return ERR_PTR(-ENOMEM);
+
+	atomic_set(&tmp2->refcnt, 0);
+
+	spin_lock_irqsave(&ppt_vm_list_lock, flags);
+	list_for_each_entry(tmp, &ppt_vm_list, entry) {
+		if (tmp->kvm == kvm) {
+			match = tmp;
+			break;
+		}
+	}
+
+	if (match) {
+		kfree(tmp2);
+	} else {
+		match = tmp2;
+		match->kvm = kvm;
+		kvm_s390_gisa_register_alert(kvm, PCI_ISC);
+		INIT_WORK(&match->irq_work, ppt_irq_worker);
+		list_add(&match->entry, &ppt_vm_list);
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"register kvm 0x%lx\n", (unsigned long)kvm);
+	}
+
+	atomic_inc(&match->refcnt);
+	spin_unlock_irqrestore(&ppt_vm_list_lock, flags);
+	return match;
+}
+
+static int ppt_unregister_vm(struct ppt_vm_entry *ppt_vm)
+{
+	unsigned long flags;
+	struct ppt_vm_entry *tmp, *match = NULL;
+	int rc = 0;
+
+	spin_lock_irqsave(&ppt_vm_list_lock, flags);
+	list_for_each_entry(tmp, &ppt_vm_list, entry) {
+		if (tmp == ppt_vm) {
+			match = tmp;
+			break;
+		}
+	}
+
+	if (match) {
+		if (atomic_dec_and_test(&match->refcnt)) {
+			PPT_MESSAGE(PPT_TRACE_NORMAL,
+				"unregister kvm 0x%lx\n",
+				(unsigned long)match->kvm);
+			kvm_s390_gisa_unregister_alert(match->kvm, PCI_ISC);
+			list_del(&match->entry);
+			kfree(match);
+		}
+	} else {
+		rc = -ENODEV;
+	}
+
+	spin_unlock_irqrestore(&ppt_vm_list_lock, flags);
+	return rc;
+}
+
+static int ppt_dma_update_trans(struct zpci_dev *zdev, unsigned long pa,
+				dma_addr_t dma_addr, size_t size, int flags,
+				u8 *cc, u8 *status)
+{
+	unsigned int nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+	u8 *page_addr = (u8 *) (pa & PAGE_MASK);
+	dma_addr_t start_dma_addr = dma_addr;
+	unsigned long irq_flags;
+	int i, rc = 0;
+
+	if (!nr_pages)
+		return -EINVAL;
+
+	spin_lock_irqsave(&zdev->dma_table_lock, irq_flags);
+	if (!zdev->dma_table) {
+		dev_err(&zdev->pdev->dev, "Missing DMA table\n");
+		goto no_refresh;
+	}
+
+	for (i = 0; i < nr_pages; i++) {
+		dma_update_cpu_trans(zdev, page_addr, dma_addr, flags);
+		page_addr += PAGE_SIZE;
+		dma_addr += PAGE_SIZE;
+	}
+
+	/*
+	 * rpcit is not required to establish new translations when previously
+	 * invalid translation-table entries are validated, however it is
+	 * required when altering previously valid entries.
+	 */
+	if (!zdev->tlb_refresh &&
+	    ((flags & ZPCI_PTE_VALID_MASK) == ZPCI_PTE_VALID))
+		/*
+		 * TODO: also need to check that the old entry is indeed INVALID
+		 * and not only for one page but for the whole range...
+		 * -> now we WARN_ON in that case but with lazy unmap that
+		 * needs to be redone!
+		 */
+		goto no_refresh;
+
+	*cc = ppt_refresh_trans((u64) zdev->fh << 32, start_dma_addr,
+				nr_pages * PAGE_SIZE, status);
+	rc = (*cc) ? -EIO : 0;
+no_refresh:
+	spin_unlock_irqrestore(&zdev->dma_table_lock, irq_flags);
+	return rc;
+}
+
+static int ppt_update_trans_entry(struct ppt_dev *ppt_dev, u64 dma_addr,
+				  struct page *page, int flags, u8 *cc,
+				  u8 *status)
+{
+	int rc;
+	u64 paddr = page_to_phys(page);
+
+	rc = ppt_dma_update_trans(ppt_dev->zdev, paddr, dma_addr, PAGE_SIZE,
+				  flags, cc, status);
+	if (flags & ZPCI_PTE_INVALID)
+		put_page(page);
+	else
+		get_page(page);
+
+	if (rc)
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"dma_up rc %d paddr 0x%llx addr 0x%llx flags 0x%x\n",
+			rc, paddr, dma_addr, flags);
+
+	return rc;
+}
+
+static struct ppt_dev *ppt_alloc_dev(void)
+{
+	struct ppt_dev *dev;
+
+	dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	atomic_set(&dev->refcnt, 1);
+	dev->enabled = 0;
+	dev->configured = 1;
+	return dev;
+}
+
+static void ppt_put_dev(struct ppt_dev *ppt_dev)
+{
+	int rc;
+	u8 cc, status;
+
+	WARN_ON(atomic_read(&ppt_dev->refcnt) <= 0);
+	if (atomic_dec_and_test(&ppt_dev->refcnt)) {
+		if (ppt_dev->irq_on)
+			ppt_dereg_irqs(ppt_dev, &cc, &status);
+
+		if (ppt_dev->fmb)
+			put_page(virt_to_page(ppt_dev->fmb));
+
+		if (ppt_dev->enabled) {
+			ppt_dev->enabled = 0;
+			pci_release_regions(ppt_dev->pdev);
+			pci_disable_device(ppt_dev->pdev);
+			/* disable/enable zpci layer so all dma translations
+			 * are cleared in hw and host table
+			 */
+			rc = zpci_disable_device(ppt_dev->zdev);
+			if (rc)
+				PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+					"disable device failed rc %d\n", rc);
+
+		}
+
+		rc = ppt_unregister_vm(ppt_dev->ppt_vm);
+		if (rc)
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"unregister vm failed rc %d\n", rc);
+
+		rc = zpci_enable_device(ppt_dev->zdev);
+		if (rc)
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"enable device failed rc %d\n", rc);
+
+		pci_dev_put(ppt_dev->pdev);
+
+		debugfs_remove(ppt_dev->debugfs_stats);
+
+		PPT_DEVICE_MESSAGE(ppt_dev,
+			PPT_TRACE_NORMAL, "free dev\n");
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"free fh 0x%x\n", ppt_dev->zdev->fh);
+		kfree(ppt_dev);
+	}
+}
+
+static struct ppt_dev *ppt_get_by_devid(struct kvm *kvm, u32 dev_id)
+{
+	struct ppt_dev *tmp, *dev = NULL;
+
+	mutex_lock(&kvm->lock);
+	list_for_each_entry(tmp, &kvm->arch.ppt_dev_list, entry) {
+		if (tmp->dev_id == dev_id) {
+			dev = tmp;
+			WARN_ON(atomic_read(&dev->refcnt) <= 0);
+			atomic_inc(&dev->refcnt);
+			break;
+		}
+	}
+	mutex_unlock(&kvm->lock);
+	return dev;
+}
+
+static struct ppt_dev *ppt_get_by_fh(struct kvm *kvm, u32 fh)
+{
+	struct ppt_dev *tmp, *dev = NULL;
+
+	mutex_lock(&kvm->lock);
+	list_for_each_entry(tmp, &kvm->arch.ppt_dev_list, entry) {
+		if ((tmp->zdev->fh & ~FH_ENABLED) == (fh & ~FH_ENABLED)) {
+			dev = tmp;
+			WARN_ON(atomic_read(&dev->refcnt) <= 0);
+			atomic_inc(&dev->refcnt);
+			break;
+		}
+	}
+	mutex_unlock(&kvm->lock);
+	return dev;
+}
+
+static int ppt_clp_list_pci(struct kvm_vcpu *vcpu,
+			    struct clp_req_rsp_list_pci *rrb, u8 *cc)
+{
+	struct ppt_dev *ppt_dev;
+	struct clp_req_rsp_list_pci *tmprrb;
+	int initial_l2, g_l2, tmp_en, g_en, i, rc;
+
+	tmprrb = (struct clp_req_rsp_list_pci *)get_zeroed_page(GFP_KERNEL);
+	if (!tmprrb)
+		return -ENOMEM;
+	initial_l2 = rrb->response.hdr.len;
+	if ((initial_l2 - LIST_PCI_HDR_LEN) % sizeof(struct clp_fh_list_entry)
+	    != 0) {
+		free_page((unsigned long)tmprrb);
+		return -EINVAL;
+	}
+
+	memcpy(tmprrb, rrb, sizeof(struct clp_req_rsp_list_pci));
+	g_l2 = g_en = rc = 0;
+	do {
+		*cc = clp_instr(tmprrb);
+		if (*cc) {
+			rc = -EIO;
+			break;
+		}
+		if (tmprrb->response.hdr.rsp != CLP_RC_OK) {
+			rc = -EIO;
+			break;
+		}
+
+		tmp_en = (tmprrb->response.hdr.len - LIST_PCI_HDR_LEN) /
+			tmprrb->response.entry_size;
+		for (i = 0; i < tmp_en; i++) {
+			ppt_dev = ppt_get_by_fh(vcpu->kvm,
+				tmprrb->response.fh_list[i].fh);
+			if (ppt_dev) {
+				memcpy(&(rrb->response.fh_list[g_en]),
+				       &(tmprrb->response.fh_list[i]),
+				       tmprrb->response.entry_size);
+				g_en++;
+				ppt_put_dev(ppt_dev);
+			}
+		}
+		g_l2 = LIST_PCI_HDR_LEN + g_en * tmprrb->response.entry_size;
+		if (tmprrb->response.resume_token == 0)
+			break;
+		tmprrb->request.resume_token = tmprrb->response.resume_token;
+		tmprrb->response.hdr.len = LIST_PCI_HDR_LEN +
+			(initial_l2 - g_l2);
+	} while (g_l2 < initial_l2);
+
+	memcpy(&rrb->response, &tmprrb->response, LIST_PCI_HDR_LEN);
+	if (!rc)
+		rrb->response.hdr.len = g_l2;
+	free_page((unsigned long)tmprrb);
+	return rc;
+}
+
+static u8 ppt_clp_instr(struct clp_req_rsp_set_pci *rrb)
+{
+	u8 cc;
+	int l2, retries;
+
+	retries = 100;
+	l2 = rrb->response.hdr.len;
+	do {
+		rrb->response.hdr.len = l2;
+		cc = clp_instr(rrb);
+		if (rrb->response.hdr.rsp == CLP_RC_SETPCIFN_BUSY) {
+			retries--;
+			if (retries < 0)
+				break;
+			msleep(20);
+		}
+	} while (rrb->response.hdr.rsp == CLP_RC_SETPCIFN_BUSY);
+
+	return cc;
+}
+
+static int ppt_clp_set_pci(struct kvm_vcpu *vcpu,
+			   struct clp_req_rsp_set_pci *rrb, u8 *cc)
+{
+	struct ppt_dev *ppt_dev;
+	int rc = 0;
+
+	if ((rrb->request.fh & FH_VIRT) == FH_VIRT)
+		return -EOPNOTSUPP;
+
+	ppt_dev = ppt_get_by_fh(vcpu->kvm, rrb->request.fh);
+	if (!ppt_dev) {
+		*cc = 3;
+		rrb->response.hdr.rsp = CLP_RC_SETPCIFN_FH;
+		return -ENODEV;
+	}
+
+	if (rrb->request.oc == CLP_SET_ENABLE_PCI_FN) {
+		rrb->request.gd = vcpu->arch.sie_block->gd;
+		*cc = ppt_clp_instr(rrb);
+		if (!(*cc) && rrb->response.hdr.rsp == CLP_RC_OK) {
+			/* Success -> store handle in zdev */
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"enable ppt\n");
+			ppt_dev->zdev->fh = rrb->response.fh;
+		} else {
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"enable ppt failed cc %d resp 0x%x\n",
+				*cc, rrb->response.hdr.rsp);
+			rc = -EIO;
+			goto out;
+		}
+
+		rc = zpci_dma_init_device(ppt_dev->zdev);
+		if (rc) {
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"ppt dma init failed rc 0x%x\n", rc);
+			clp_disable_fh(ppt_dev->zdev);
+			goto out;
+		}
+
+		rc = pci_enable_device(ppt_dev->pdev);
+		if (rc) {
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"pci_enable_device failed rc 0x%x\n", rc);
+			zpci_disable_device(ppt_dev->zdev);
+			goto out;
+		}
+
+		rc = pci_request_regions(ppt_dev->pdev,
+			"s390_assigned_device");
+		if (rc) {
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"pci_request_regions failed rc 0x%x\n", rc);
+			zpci_disable_device(ppt_dev->zdev);
+			pci_disable_device(ppt_dev->pdev);
+			goto out;
+		}
+		ppt_dev->zdev->state = ZPCI_FN_STATE_ONLINE;
+		ppt_dev->enabled = 1;
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL, "enabled\n");
+	} else {
+		pci_release_regions(ppt_dev->pdev);
+		pci_disable_device(ppt_dev->pdev);
+		zpci_dma_exit_device(ppt_dev->zdev);
+		*cc = ppt_clp_instr(rrb);
+		if (!(*cc) && rrb->response.hdr.rsp == CLP_RC_OK) {
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"disable ppt\n");
+			ppt_dev->zdev->fh = rrb->response.fh;
+		} else {
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"disable ppt failed cc %d resp %x\n",
+				*cc, rrb->response.hdr.rsp);
+			rc = -EIO;
+			goto out;
+		}
+		ppt_dev->enabled = 0;
+		PPT_DEVICE_MESSAGE(ppt_dev,
+			PPT_TRACE_NORMAL, "disabled\n");
+	}
+out:
+	ppt_put_dev(ppt_dev);
+	return rc;
+}
+
+int handle_clp(struct kvm_vcpu *vcpu)
+{
+	struct clp_req_hdr *reqh;
+	struct clp_rsp_hdr *resh;
+	char *buffer;
+	__u32 req_len, res_len;
+	int cmd, res_code, rc;
+	u8 cc;
+	struct ppt_dev *ppt_dev;
+	int r2 = (vcpu->arch.sie_block->ipb & 0x000f0000) >> 16;
+	int handle_user = 0;
+
+	cmd = rc = cc = res_code = 0;
+
+	buffer = (char *)get_zeroed_page(GFP_KERNEL);
+	if (!buffer) {
+		rc = -ENOMEM;
+		goto out;
+	}
+	rc = read_guest(vcpu, vcpu->run->s.regs.gprs[r2], buffer,
+			sizeof(*reqh));
+	if (rc)
+		goto out;
+	reqh = (struct clp_req_hdr *)buffer;
+	req_len = reqh->len;
+
+	rc = read_guest(vcpu, vcpu->run->s.regs.gprs[r2], buffer,
+			req_len + sizeof(*resh));
+	if (rc)
+		goto out;
+	resh = (struct clp_rsp_hdr *)(buffer + req_len);
+	res_len = resh->len;
+
+	rc = read_guest(vcpu, vcpu->run->s.regs.gprs[r2], buffer,
+			req_len + res_len);
+	if (rc)
+		goto out;
+
+	cmd = reqh->cmd;
+	switch (cmd) {
+	case CLP_LIST_PCI: {
+		struct clp_req_rsp_list_pci *rrlistpci =
+			(struct clp_req_rsp_list_pci *)buffer;
+#ifdef USER_LSPCI
+		handle_user = 1;
+		goto out_u;
+#endif
+		rc = ppt_clp_list_pci(vcpu, rrlistpci, &cc);
+		res_code = rrlistpci->response.hdr.rsp;
+		break;
+	}
+	case CLP_SET_PCI_FN: {
+		struct clp_req_rsp_set_pci *rrsetpci =
+			(struct clp_req_rsp_set_pci *)buffer;
+		rc = ppt_clp_set_pci(vcpu, rrsetpci, &cc);
+		if (rc == -EOPNOTSUPP)
+			goto out_u;
+		res_code = rrsetpci->response.hdr.rsp;
+#ifdef USER_LSPCI
+		handle_user = 1;
+#endif
+		break;
+	}
+	case CLP_QUERY_PCI_FN: {
+		struct clp_req_rsp_query_pci *rrqpci =
+			(struct clp_req_rsp_query_pci *)buffer;
+
+		if ((rrqpci->request.fh & FH_VIRT) == FH_VIRT) {
+			rc = -EOPNOTSUPP;
+			goto out_u;
+		}
+
+		ppt_dev = ppt_get_by_fh(vcpu->kvm, rrqpci->request.fh);
+		if (!ppt_dev) {
+			cc = 3;
+			res_code = rrqpci->response.hdr.rsp =
+				CLP_RC_SETPCIFN_FH;
+			break;
+		}
+
+		cc = clp_instr(rrqpci);
+		res_code = rrqpci->response.hdr.rsp;
+		ppt_put_dev(ppt_dev);
+		break;
+	}
+	case CLP_QUERY_PCI_FNGRP: {
+		struct clp_req_rsp_query_pci_grp *rrqgrp =
+			(struct clp_req_rsp_query_pci_grp *)buffer;
+
+		cc = clp_instr(rrqgrp);
+		/*
+		 * Always report tlb refresh as required so the rpcit
+		 * intercept can keep track of guest dma mappings.
+		 */
+		rrqgrp->response.refresh = 1;
+		res_code = rrqgrp->response.hdr.rsp;
+		break;
+	}
+	default:
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"invalid clp command 0x%x\n", reqh->cmd);
+		rc = -EINVAL;
+	}
+
+	if (rc || cc == 3 || res_code != CLP_RC_OK)
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"clp failed cmd %d rc %d cc %d resp 0x%x\n",
+			cmd, rc, cc, res_code);
+
+	rc = write_guest(vcpu, vcpu->run->s.regs.gprs[r2], buffer,
+		req_len + res_len);
+out:
+	if (rc) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"handle clp failed cmd %d rc %d\n", cmd, rc);
+		cc = 3;
+		if (rc != -EOPNOTSUPP)
+			rc = 0;
+	}
+	kvm_s390_set_psw_cc(vcpu, cc);
+out_u:
+	free_page((unsigned long)buffer);
+	return handle_user ? -EOPNOTSUPP : rc;
+}
+
+static u64 ppt_guest_io_table_walk(u64 guest_iota, u64 guest_dma_address,
+				   struct kvm_vcpu *vcpu)
+{
+	u64 sto_a, pto_a, px_a;
+	u64 sto, pto, pte;
+	u32 rtx, sx, px;
+	int rc;
+
+	rtx = calc_rtx(guest_dma_address);
+	sx = calc_sx(guest_dma_address);
+	px = calc_px(guest_dma_address);
+
+	sto_a = guest_iota + rtx * sizeof(u64);
+	rc = read_guest(vcpu, sto_a, &sto, sizeof(u64));
+	if (rc)
+		return rc;
+
+	sto = (u64)get_rt_sto(sto);
+	if (!sto)
+		return -EINVAL;
+
+	pto_a = sto + sx * sizeof(u64);
+	rc = read_guest(vcpu, pto_a, &pto, sizeof(u64));
+	if (rc)
+		return rc;
+
+	pto = (u64)get_st_pto(pto);
+	if (!pto)
+		return -EINVAL;
+
+	px_a = pto + px * sizeof(u64);
+	rc = read_guest(vcpu, px_a, &pte, sizeof(u64));
+	if (rc)
+		return rc;
+
+	return pte;
+}
+
+static void ppt_set_status(struct kvm_vcpu *vcpu, int rx, u8 status)
+{
+	if (vcpu) {
+		vcpu->run->s.regs.gprs[rx] &= ~(0xFFUL << 24);
+		vcpu->run->s.regs.gprs[rx] |= ((unsigned long)status << 24);
+	}
+}
+
+static u8 ppt_mod_fc(u64 req, struct zpci_fib *fib, u8 *status)
+{
+	u8 cc;
+
+	do {
+		cc = __mpcifc(req, fib, status);
+		if (cc == 2)
+			msleep(20);
+	} while (cc == 2);
+
+	if (cc)
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"%s: error cc: %d  status: %d\n",
+			__func__, cc, *status);
+
+	return cc;
+}
+
+static u8 ppt_refresh_trans(u64 fn, u64 addr, u64 range, u8 *status)
+{
+	u8 cc;
+
+	do {
+		cc = __rpcit(fn, addr, range, status);
+		if (cc == 2)
+			udelay(1);
+	} while (cc == 2);
+
+	if (cc)
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+		"%s: error cc: %d  status: %d  dma_addr: %Lx  size: %Lx\n",
+			__func__, cc, *status, addr, range);
+	return cc;
+}
+
+static u8 ppt_load(u64 *data, u64 req, u64 offset, u8 *status)
+{
+	u8 cc;
+
+	do {
+		cc = (u8)__pcilg(data, req, offset, status);
+		if (cc == 2)
+			udelay(1);
+	} while (cc == 2);
+
+	return cc;
+}
+
+static u8 ppt_store(u64 data, u64 req, u64 offset, u8 *status)
+{
+	u8 cc;
+
+	do {
+		cc = (u8)__pcistg(data, req, offset, status);
+		if (cc == 2)
+			udelay(1);
+	} while (cc == 2);
+
+	return cc;
+}
+
+static u8 ppt_store_block(const u64 *data, u64 req, u64 offset, u8 *status)
+{
+	u8 cc;
+
+	do {
+		cc = (u8)__pcistb(data, req, offset, status);
+		if (cc == 2)
+			udelay(1);
+	} while (cc == 2);
+
+	return cc;
+}
+
+int handle_rpcit(struct kvm_vcpu *vcpu)
+{
+	u8 cc = ZPCI_PCI_LS_OK, status = 0;
+	int rc = 0;
+	int r1 = (vcpu->arch.sie_block->ipb & 0x00f00000) >> 20;
+	int r2 = (vcpu->arch.sie_block->ipb & 0x000f0000) >> 16;
+	u64 pte;
+	long uaddr;
+	int flags;
+	u32 fh = vcpu->run->s.regs.gprs[r1] >> 32;
+	struct page **page_list;
+	int i;
+	u64 dma_addr;
+	u32 nr_pages;
+	u32 nr_upages;
+	struct ppt_dev *ppt_dev;
+
+	if ((fh & FH_VIRT) == FH_VIRT)
+		return -EOPNOTSUPP;
+
+	ppt_dev = ppt_get_by_fh(vcpu->kvm, fh);
+	if (!ppt_dev) {
+		cc = ZPCI_PCI_LS_INVAL_HANDLE;
+		status = 0;
+		rc = -ENODEV;
+		goto out_nodev;
+	}
+
+	pte = ppt_guest_io_table_walk(ppt_dev->g_iota,
+		vcpu->run->s.regs.gprs[r2], vcpu);
+	if (IS_ERR_VALUE(pte)) {
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"rpcit bad pte\n");
+		rc = pte;
+		cc = 1;
+		status = 16;
+		goto out;
+	}
+
+	uaddr = gmap_translate((pte & ZPCI_PTE_ADDR_MASK), vcpu->arch.gmap);
+	if (uaddr < 0) {
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"rpcit bad mapping\n");
+		rc = uaddr;
+		cc = 1;
+		status = 16;
+		goto out;
+	}
+
+	page_list = (struct page **) __get_free_page(GFP_KERNEL);
+	if (!page_list) {
+		rc = -ENOMEM;
+		cc = 1;
+		status = 16;
+		goto out;
+	}
+	flags = pte & ZPCI_PTE_FLAG_MASK;
+	nr_pages = vcpu->run->s.regs.gprs[r2 + 1] / PAGE_SIZE;
+
+	nr_upages = get_user_pages_fast(uaddr, nr_pages, 1, page_list);
+
+	if (!nr_upages) {
+		PPT_DEVICE_MESSAGE(ppt_dev,
+			PPT_TRACE_NORMAL, "rpcit no user pages\n");
+		rc = -ENOMEM;
+		cc = 1;
+		status = 16;
+		goto no_pages;
+	}
+
+	dma_addr = vcpu->run->s.regs.gprs[r2];
+	for (i = 0; i < nr_upages; i++) {
+		rc = ppt_update_trans_entry(ppt_dev, dma_addr, page_list[i],
+			flags, &cc, &status);
+		if (rc) {
+			dma_purge_rto_entries(ppt_dev->zdev);
+			break;
+		}
+		dma_addr += PAGE_SIZE;
+	}
+
+	for (i = 0; i < nr_upages; i++)
+		put_page(page_list[i]);
+
+	ppt_dev->stat_items[PPT_DEV_STAT_RPCIT]++;
+no_pages:
+	free_page((unsigned long)page_list);
+out:
+	ppt_put_dev(ppt_dev);
+out_nodev:
+	kvm_s390_set_psw_cc(vcpu, (unsigned long)cc);
+	ppt_set_status(vcpu, r1, status);
+	if (rc)
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"rpcit failed rc %d\n", rc);
+
+	return 0;
+}
+
+int handle_sic(struct kvm_vcpu *vcpu)
+{
+	int r1 = (vcpu->arch.sie_block->ipa & 0x00f0) >> 4;
+	int r3 = vcpu->arch.sie_block->ipa & 0x000f;
+
+	/*
+	 * Since ecb bit 18 is set we should only get an intercept
+	 * for operation 2, so log it.
+	 */
+	PPT_MESSAGE(PPT_TRACE_NORMAL, "Warning: sic r1 0x%llx r3 0x%llx\n",
+		vcpu->run->s.regs.gprs[r1], vcpu->run->s.regs.gprs[r3]);
+	return 0;
+}
+
+int handle_pcistb(struct kvm_vcpu *vcpu)
+{
+	u8 cc, status;
+	int r1 = (vcpu->arch.sie_block->ipa & 0x00f0) >> 4;
+	int r3 = vcpu->arch.sie_block->ipa & 0x000f;
+	u64 gaddr = kvm_s390_get_base_disp_rsy(vcpu);
+	u32 fh = vcpu->run->s.regs.gprs[r1] >> 32;
+	struct ppt_dev *ppt_dev;
+	int rc = 0;
+	u8 len = vcpu->run->s.regs.gprs[r1] & 0xff;
+	char *buffer;
+
+	if ((fh & FH_VIRT) == FH_VIRT)
+		return -EOPNOTSUPP;
+
+	ppt_dev = ppt_get_by_fh(vcpu->kvm, fh);
+	if (!ppt_dev) {
+		cc = ZPCI_PCI_LS_INVAL_HANDLE;
+		status = 0;
+		goto out_nodev;
+	}
+
+	buffer = (char *)get_zeroed_page(GFP_KERNEL);
+	if (!buffer) {
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"pcistb get page failed\n");
+		cc = ZPCI_PCI_LS_ERR;
+		status = 40;
+		goto out_nomem;
+	}
+
+	rc = read_guest(vcpu, gaddr, buffer, len);
+	if (rc) {
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"pcistb read guest failed rc %d\n", rc);
+		cc = ZPCI_PCI_LS_ERR;
+		status = 40;
+		goto out;
+	}
+
+	cc = ppt_store_block((const u64 *)buffer,
+		vcpu->run->s.regs.gprs[r1],
+		vcpu->run->s.regs.gprs[r3],
+		&status);
+	ppt_dev->stat_items[PPT_DEV_STAT_PCISTB]++;
+	if (cc)
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"pcistb offset 0x%llx gaddr 0x%llx len %d cc %d\n",
+			vcpu->run->s.regs.gprs[r3], gaddr, len, cc);
+out:
+	free_page((unsigned long)buffer);
+out_nomem:
+	ppt_put_dev(ppt_dev);
+out_nodev:
+	kvm_s390_set_psw_cc(vcpu, (unsigned long)cc);
+	ppt_set_status(vcpu, r1, status);
+	return 0;
+}
+
+static void ppt_irq_worker(struct work_struct *work)
+{
+	int rc;
+	struct kvm_s390_interrupt s390int;
+	struct ppt_vm_entry *ppt_vm = container_of(work, struct ppt_vm_entry,
+						   irq_work);
+
+	u32 io_int_word = (PCI_ISC << 27) | IO_INT_WORD_AI;
+
+	s390int.type = KVM_S390_INT_IO(1, 0, 0, 0);
+	s390int.parm = 0;
+	s390int.parm64 = io_int_word;
+
+	rc = kvm_s390_inject_vm(ppt_vm->kvm, &s390int);
+	if (rc)
+		PPT_MESSAGE(PPT_TRACE_NORMAL, "inject vm failed\n");
+}
+
+static irqreturn_t ppt_handle_host_irq(int irq, void *ptr)
+{
+	struct ppt_dev *ppt_dev = (struct ppt_dev *)ptr;
+	int summary_set, i;
+
+	for (i = 0; i < ppt_dev->noi; ++i) {
+		if (ppt_dev->entries[i].vector == irq) {
+			set_bit(ppt_dev->entries[i].entry ^ be_to_le,
+				ppt_dev->aibv);
+			break;
+		}
+	}
+
+	summary_set = test_and_set_bit(ppt_dev->aisbo ^ be_to_le,
+		ppt_dev->aisb);
+
+	if (!summary_set) {
+		if (kvm_s390_gisa_test_iam_gisc(ppt_dev->kvm, PCI_ISC)) {
+			schedule_work(&ppt_dev->ppt_vm->irq_work);
+			ppt_dev->stat_items[PPT_DEV_STAT_HOST_IRQ_INJECT]++;
+		} else {
+			kvm_s390_gisa_set_ipm_gisc(ppt_dev->kvm, PCI_ISC);
+			ppt_dev->stat_items[PPT_DEV_STAT_HOST_IRQ_GISA]++;
+		}
+	}
+	return IRQ_HANDLED;
+}
+
+static unsigned long *ppt_getandmap_gaddr(u64 gaddr, struct kvm_vcpu *vcpu)
+{
+	long uaddr = gmap_translate(gaddr, vcpu->arch.gmap);
+	u32 offset = gaddr % PAGE_SIZE;
+	int nr_upages;
+	struct page *page;
+	unsigned long *kpp;
+
+	if (uaddr < 0) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL, "bad gaddr\n");
+		return ERR_PTR(uaddr);
+	}
+
+	nr_upages = get_user_pages(current, current->mm,
+		uaddr, 1, 1, 0, &page, NULL);
+	if (!nr_upages) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL, "gaddr no user pages\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	kpp = page_address(page);
+
+	kpp = (unsigned long *)((char *)kpp + offset);
+	return kpp;
+}
+
+static u32 ppt_gib_get_alo(void)
+{
+	return ACCESS_ONCE(gib->alo);
+}
+
+static void ppt_gib_set_alo(u32 alo)
+{
+	xchg(&gib->alo, alo);
+}
+
+static int ppt_reg_irqs_host(struct ppt_dev *ppt_dev, u8 *cc, u8 *status)
+{
+	int i, n, err;
+
+	PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL, "use host irqs\n");
+	ppt_dev->entries = kcalloc(ppt_dev->noi, sizeof(struct msix_entry),
+				   GFP_KERNEL);
+	if (!ppt_dev->entries) {
+		*cc = ZPCI_PCI_LS_ERR;
+		*status = 16;
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < ppt_dev->noi; ++i)
+		ppt_dev->entries[i].entry = i;
+
+	err = pci_enable_msix(ppt_dev->pdev, ppt_dev->entries, ppt_dev->noi);
+	if (err) {
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"mpcifc pci_enable_msix %d\n", err);
+		*cc = ZPCI_PCI_LS_ERR;
+		*status = 16;
+		kfree(ppt_dev->entries);
+		return err;
+	}
+
+	for (i = 0; i < ppt_dev->noi; ++i) {
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"mpcifc request irq entry %d vector %d\n",
+			ppt_dev->entries[i].entry,
+			ppt_dev->entries[i].vector);
+		err = request_irq(ppt_dev->entries[i].vector,
+			ppt_handle_host_irq, 0, "pci proxy", ppt_dev);
+		if (err) {
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"mpcifc request_irq %d\n", err);
+			for (n = 0; n < i; ++n)
+				free_irq(ppt_dev->entries[n].vector, ppt_dev);
+
+			pci_disable_msix(ppt_dev->pdev);
+			kfree(ppt_dev->entries);
+			*cc = ZPCI_PCI_LS_ERR;
+			*status = 16;
+			return err;
+		}
+	}
+
+	*cc = ZPCI_PCI_LS_OK;
+	*status = 0;
+	ppt_dev->irq_on = 1;
+	return 0;
+}
+
+static int ppt_reg_irqs(struct ppt_dev *ppt_dev, struct zpci_fib *fib,
+			struct kvm_vcpu *vcpu, u8 *cc, u8 *status)
+{
+	int rc = 0;
+	u64 req;
+
+	ppt_dev->noi = fib->noi;
+	ppt_dev->sum = fib->sum;
+	ppt_dev->aibv = ppt_getandmap_gaddr(fib->aibv, vcpu);
+	if (IS_ERR(ppt_dev->aibv)) {
+		*cc = ZPCI_PCI_LS_ERR;
+		*status = 16;
+		return PTR_ERR(ppt_dev->aibv);
+	}
+	ppt_dev->aibvo = fib->aibvo;
+	ppt_dev->aisb = ppt_getandmap_gaddr(fib->aisb, vcpu);
+	if (IS_ERR(ppt_dev->aisb)) {
+		*cc = ZPCI_PCI_LS_ERR;
+		*status = 16;
+		put_page(virt_to_page(ppt_dev->aibv));
+		return PTR_ERR(ppt_dev->aisb);
+	}
+	ppt_dev->aisbo = fib->aisbo;
+
+	PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+		"mpcifc reg int noi %d aibv %p aibvo 0x%x aisb %p aisbo 0x%x\n",
+		ppt_dev->noi, ppt_dev->aibv, ppt_dev->aibvo, ppt_dev->aisb,
+		ppt_dev->aisbo);
+
+	ppt_dev->faisb = airq_iv_alloc_bit(faisb_iv);
+	if (ppt_dev->faisb == -1UL) {
+		*cc = ZPCI_PCI_LS_ERR;
+		*status = 16;
+		put_page(virt_to_page(ppt_dev->aibv));
+		put_page(virt_to_page(ppt_dev->aisb));
+		return -EINVAL;
+	}
+
+	PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+		"faisb nr %d\n", ppt_dev->faisb);
+
+	if (ppt_dev->hostirq)
+		rc = ppt_reg_irqs_host(ppt_dev, cc, status);
+	else {
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"use aen irqs\n");
+		gait[ppt_dev->faisb].gd = vcpu->arch.sie_block->gd;
+		gait[ppt_dev->faisb].gisc = PCI_ISC;
+		gait[ppt_dev->faisb].gaisbo = ppt_dev->aisbo;
+		gait[ppt_dev->faisb].gaisba = (u64)ppt_dev->aisb;
+
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"setup gait gd 0x%x gait[%d] 0x%lx\n",
+			gait[ppt_dev->faisb].gd, ppt_dev->faisb,
+			(unsigned long)&gait[ppt_dev->faisb]);
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"gaisbo 0x%x gaisba 0x%llx\n",
+			gait[ppt_dev->faisb].gaisbo,
+			gait[ppt_dev->faisb].gaisba);
+
+		fib->aibv = (u64)ppt_dev->aibv;
+		fib->aisb = (unsigned long)faisb_iv->vector +
+				(ppt_dev->faisb/64)*8;
+		fib->aisbo = ppt_dev->faisb & 63;
+		fib->gd = vcpu->arch.sie_block->gd;
+		fib->isc = PCIPT_ISC;
+		req = ZPCI_CREATE_REQ(ppt_dev->zdev->fh, 0,
+			ZPCI_MOD_FC_REG_INT);
+		*cc = ppt_mod_fc(req, fib, status);
+		if (!*cc)
+			ppt_dev->irq_on = 1;
+		else
+			rc = -EIO;
+	}
+
+	if (rc) {
+		airq_iv_free_bit(faisb_iv, ppt_dev->faisb);
+		put_page(virt_to_page(ppt_dev->aibv));
+		put_page(virt_to_page(ppt_dev->aisb));
+	}
+
+	return rc;
+}
+
+static void ppt_dereg_irqs(struct ppt_dev *ppt_dev, u8 *cc, u8 *status)
+{
+	int i;
+	u64 req;
+	struct zpci_fib *fib;
+
+	PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL, "dereg_irqs\n");
+	if (ppt_dev->hostirq) {
+		for (i = 0; i < ppt_dev->noi; ++i)
+			free_irq(ppt_dev->entries[i].vector, ppt_dev);
+
+		pci_disable_msix(ppt_dev->pdev);
+		kfree(ppt_dev->entries);
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"host_irqs_inject: %d host_irqs_gisa: %d\n",
+			ppt_dev->stat_items[PPT_DEV_STAT_HOST_IRQ_INJECT],
+			ppt_dev->stat_items[PPT_DEV_STAT_HOST_IRQ_GISA]);
+		*cc = ZPCI_PCI_LS_OK;
+	} else {
+		fib = (void *) get_zeroed_page(GFP_KERNEL);
+		if (!fib) {
+			*cc = ZPCI_PCI_LS_ERR;
+			*status = 16;
+			return;
+		}
+		req = ZPCI_CREATE_REQ(ppt_dev->zdev->fh, 0,
+			ZPCI_MOD_FC_DEREG_INT);
+		*cc = ppt_mod_fc(req, fib, status);
+		if (*cc)
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"dereg irq failed cc %d\n", *cc);
+		free_page((unsigned long)fib);
+		memset(&gait[ppt_dev->faisb], 0, sizeof(struct kvm_s390_gait));
+	}
+
+	PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+		"faisb 0x%lx aisb 0x%lx aibv 0x%lx\n",
+		*faisb_iv->vector, *ppt_dev->aisb, *ppt_dev->aibv);
+
+	PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+		"gib 0x%x gisa ipm 0x%x iam 0x%x G %d C %d\n",
+		ppt_gib_get_alo(),
+		kvm_s390_gisa_get_ipm(ppt_dev->kvm),
+		kvm_s390_gisa_get_iam(ppt_dev->kvm),
+		kvm_s390_gisa_get_g(ppt_dev->kvm),
+		kvm_s390_gisa_get_c(ppt_dev->kvm));
+
+	airq_iv_free_bit(faisb_iv, ppt_dev->faisb);
+	put_page(virt_to_page(ppt_dev->aibv));
+	put_page(virt_to_page(ppt_dev->aisb));
+	ppt_dev->irq_on = 0;
+}
+
+static int ppt_fmb_enable_device(struct ppt_dev *ppt_dev, struct zpci_fib *fib,
+				 struct kvm_vcpu *vcpu, u8 *cc, u8 *status)
+{
+	u64 req;
+	unsigned long *haddr;
+
+	haddr = ppt_getandmap_gaddr(fib->fmb_addr, vcpu);
+	if (IS_ERR(haddr)) {
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"get and map guest addr failed\n");
+		*cc = ZPCI_PCI_LS_ERR;
+		*status = 16;
+		return PTR_ERR(haddr);
+	}
+
+	fib->fmb_addr = (u64)haddr;
+	req = ZPCI_CREATE_REQ(ppt_dev->zdev->fh, 0, ZPCI_MOD_FC_SET_MEASURE);
+	*cc = ppt_mod_fc(req, fib, status);
+	if (*cc) {
+		put_page(virt_to_page(haddr));
+		return -EIO;
+	}
+
+	if (ppt_dev->fmb) {
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"release old fmb: 0x%p\n", ppt_dev->fmb);
+		put_page(virt_to_page(ppt_dev->fmb));
+		ppt_dev->fmb = NULL;
+	}
+	ppt_dev->fmb = (struct zpci_fmb *)haddr;
+
+	return 0;
+}
+
+static int ppt_fmb_disable_device(struct ppt_dev *ppt_dev, struct zpci_fib *fib,
+				  u8 *cc, u8 *status)
+{
+	u64 req;
+
+	if (!ppt_dev->fmb || fib->fmb_addr)
+		return -EINVAL;
+
+	req = ZPCI_CREATE_REQ(ppt_dev->zdev->fh, 0, ZPCI_MOD_FC_SET_MEASURE);
+	*cc = ppt_mod_fc(req, fib, status);
+	if (*cc)
+		return -EIO;
+	put_page(virt_to_page(ppt_dev->fmb));
+	ppt_dev->fmb = NULL;
+	return 0;
+}
+
+static int ppt_set_measure(struct ppt_dev *ppt_dev, struct zpci_fib *fib,
+			   struct kvm_vcpu *vcpu, u8 *cc, u8 *status)
+{
+	int rc = 0;
+
+	if (fib->fmb_addr == 0)
+		rc = ppt_fmb_disable_device(ppt_dev, fib, cc, status);
+	else
+		rc = ppt_fmb_enable_device(ppt_dev, fib, vcpu, cc, status);
+	return rc;
+}
+
+static int ppt_rereg_ioat(struct ppt_dev *ppt_dev, struct zpci_fib *fib,
+			  u8 *cc, u8 *status)
+{
+	u64 req = ZPCI_CREATE_REQ(ppt_dev->zdev->fh, 0, ZPCI_MOD_FC_REREG_IOAT);
+	struct zpci_fib *nfib;
+
+	if (!fib) {
+		*cc = ZPCI_PCI_LS_ERR;
+		*status = 16;
+		return -EINVAL;
+	}
+
+	nfib = (void *) get_zeroed_page(GFP_KERNEL);
+	if (!nfib) {
+		*cc = ZPCI_PCI_LS_ERR;
+		*status = 16;
+		return -ENOMEM;
+	}
+
+	nfib->pal = fib->pal;
+	nfib->iota = (u64)ppt_dev->zdev->dma_table;
+
+	*cc = ppt_mod_fc(req, nfib, status);
+	if (!*cc) {
+		dma_purge_rto_entries(ppt_dev->zdev);
+		ppt_dev->g_iota = fib->iota & ~ZPCI_IOTA_RTTO_FLAG;
+	}
+
+	free_page((unsigned long) nfib);
+	return (*cc) ? -EIO : 0;
+}
+
+static int ppt_reset_error(struct ppt_dev *ppt_dev, struct zpci_fib *fib,
+			   u8 *cc, u8 *status)
+{
+	u64 req  = ZPCI_CREATE_REQ(ppt_dev->zdev->fh,
+		0, ZPCI_MOD_FC_RESET_ERROR);
+
+	if (!fib) {
+		*cc = 1;
+		*status = 16;
+		return -EINVAL;
+	}
+
+	*cc = ppt_mod_fc(req, fib, status);
+	return (*cc) ? -EIO : 0;
+}
+
+static int ppt_reset_block(struct ppt_dev *ppt_dev, struct zpci_fib *fib,
+			   u8 *cc, u8 *status)
+{
+	u64 req  = ZPCI_CREATE_REQ(ppt_dev->zdev->fh,
+		0, ZPCI_MOD_FC_RESET_BLOCK);
+
+	if (!fib) {
+		*cc = 1;
+		*status = 16;
+		return -EINVAL;
+	}
+
+	*cc = ppt_mod_fc(req, fib, status);
+	return (*cc) ? -EIO : 0;
+}
+
+int handle_mpcifc(struct kvm_vcpu *vcpu)
+{
+	u8 cc = ZPCI_PCI_LS_OK, status;
+	int rc = 0;
+	int r1 = (vcpu->arch.sie_block->ipa & 0x00f0) >> 4;
+	u64 fiba = kvm_s390_get_base_disp_rxy(vcpu);
+	struct zpci_fib *fib;
+	u8 oc = vcpu->run->s.regs.gprs[r1] & 0xff;
+	u32 fh = vcpu->run->s.regs.gprs[r1] >> 32;
+	struct ppt_dev *ppt_dev;
+
+	if ((fh & FH_VIRT) == FH_VIRT)
+		return -EOPNOTSUPP;
+
+	ppt_dev = ppt_get_by_fh(vcpu->kvm, fh);
+	if (!ppt_dev) {
+		cc = ZPCI_PCI_LS_INVAL_HANDLE;
+		status = 0;
+		rc = -ENODEV;
+		goto out_nodev;
+	}
+
+	fib = (void *)get_zeroed_page(GFP_KERNEL);
+	if (!fib) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"mpcifc: cannot get memory for fib\n");
+		cc = ZPCI_PCI_LS_ERR;
+		status = 16;
+		rc = -ENOMEM;
+		goto out_nomem;
+	}
+
+	rc = read_guest(vcpu, fiba, fib, sizeof(struct zpci_fib));
+	if (rc) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"mpcifc: read guest failed. fiba: 0x%llx\n", fiba);
+		cc = ZPCI_PCI_LS_ERR;
+		status = 16;
+		goto out_rguest;
+	}
+
+	switch (oc) {
+	case ZPCI_MOD_FC_REG_INT:
+		rc = ppt_reg_irqs(ppt_dev, fib, vcpu, &cc, &status);
+		break;
+	case ZPCI_MOD_FC_DEREG_INT:
+		ppt_dereg_irqs(ppt_dev, &cc, &status);
+		if (cc) {
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"mpcifc: dereg interrupt cc=%d\n", cc);
+			rc = -EIO;
+		}
+		break;
+	case ZPCI_MOD_FC_REG_IOAT:
+		if (((fib->iota >> 2) & ZPCI_IOTA_IOPTO) != ZPCI_IOTA_RTTO) {
+			rc = -EINVAL;
+			cc = ZPCI_PCI_LS_ERR;
+			status = 28;
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"mpcifc: register ioat memory format error.\n");
+			break;
+		}
+		ppt_dev->g_iota = fib->iota & ~ZPCI_IOTA_RTTO_FLAG;
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"mpcifc set iota 0x%llx\n", ppt_dev->g_iota);
+		cc = ZPCI_PCI_LS_OK;
+		break;
+	case ZPCI_MOD_FC_DEREG_IOAT:
+		zpci_stop_device(ppt_dev->zdev);
+		ppt_dev->g_iota = 0;
+		break;
+	case ZPCI_MOD_FC_REREG_IOAT:
+		if (((fib->iota >> 2) & ZPCI_IOTA_IOPTO) != ZPCI_IOTA_RTTO) {
+			rc = -EINVAL;
+			cc = ZPCI_PCI_LS_ERR;
+			status = 28;
+			PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+				"mpcifc: rereg ioat memory format error.\n");
+			break;
+		}
+		rc = ppt_rereg_ioat(ppt_dev, fib, &cc, &status);
+		break;
+	case ZPCI_MOD_FC_RESET_ERROR:
+		rc = ppt_reset_error(ppt_dev, fib, &cc, &status);
+		break;
+	case ZPCI_MOD_FC_RESET_BLOCK:
+		rc = ppt_reset_block(ppt_dev, fib, &cc, &status);
+		break;
+	case ZPCI_MOD_FC_SET_MEASURE:
+		rc = ppt_set_measure(ppt_dev, fib, vcpu, &cc, &status);
+		break;
+	default:
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"invalid mpcifc oc 0x%x\n", oc);
+		rc = -EINVAL;
+		cc = ZPCI_PCI_LS_ERR;
+		status = 16;
+	}
+	ppt_dev->stat_items[PPT_DEV_STAT_MPCIFC]++;
+out_rguest:
+	free_page((unsigned long)fib);
+out_nomem:
+	ppt_put_dev(ppt_dev);
+out_nodev:
+	kvm_s390_set_psw_cc(vcpu, (unsigned long)cc);
+	ppt_set_status(vcpu, r1, status);
+	PPT_MESSAGE(PPT_TRACE_NORMAL,
+		"mpcifc oc %d rc %d cc %d status %d\n",
+		oc, rc, cc, status);
+	return 0;
+}
+
+int handle_pcistg(struct kvm_vcpu *vcpu)
+{
+	u8 cc, status;
+	int r1 = (vcpu->arch.sie_block->ipb & 0x00f00000) >> 20;
+	int r2 = (vcpu->arch.sie_block->ipb & 0x000f0000) >> 16;
+	u32 fh = vcpu->run->s.regs.gprs[r2] >> 32;
+	struct ppt_dev *ppt_dev;
+
+	if ((fh & FH_VIRT) == FH_VIRT)
+		return -EOPNOTSUPP;
+
+	ppt_dev = ppt_get_by_fh(vcpu->kvm, fh);
+	if (!ppt_dev) {
+		cc = ZPCI_PCI_LS_INVAL_HANDLE;
+		status = 0;
+		goto out;
+	}
+
+	cc = ppt_store(vcpu->run->s.regs.gprs[r1],
+			   vcpu->run->s.regs.gprs[r2],
+			   vcpu->run->s.regs.gprs[r2 + 1],
+			   &status);
+	ppt_dev->stat_items[PPT_DEV_STAT_PCISTG]++;
+
+	if (cc)
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"pcistg req: 0x%llx off: 0x%llx data: 0x%llx cc: %d\n",
+			vcpu->run->s.regs.gprs[r2],
+			vcpu->run->s.regs.gprs[r2 + 1],
+			vcpu->run->s.regs.gprs[r1], cc);
+
+	ppt_put_dev(ppt_dev);
+out:
+	kvm_s390_set_psw_cc(vcpu, (unsigned long)cc);
+	ppt_set_status(vcpu, r2, status);
+	return 0;
+}
+
+int handle_pcilg(struct kvm_vcpu *vcpu)
+{
+	u8 cc, status;
+	int r1 = (vcpu->arch.sie_block->ipb & 0x00f00000) >> 20;
+	int r2 = (vcpu->arch.sie_block->ipb & 0x000f0000) >> 16;
+	u32 fh = vcpu->run->s.regs.gprs[r2] >> 32;
+	struct ppt_dev *ppt_dev;
+
+	if ((fh & FH_VIRT) == FH_VIRT)
+		return -EOPNOTSUPP;
+
+	ppt_dev = ppt_get_by_fh(vcpu->kvm, fh);
+	if (!ppt_dev) {
+		cc = ZPCI_PCI_LS_INVAL_HANDLE;
+		status = 0;
+		goto out;
+	}
+
+	cc = ppt_load(&vcpu->run->s.regs.gprs[r1],
+			  vcpu->run->s.regs.gprs[r2],
+			  vcpu->run->s.regs.gprs[r2 + 1],
+			  &status);
+	ppt_dev->stat_items[PPT_DEV_STAT_PCILG]++;
+
+	if (cc)
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"pcilg req: 0x%llx off: 0x%llx data: 0x%llx cc: %d\n",
+			vcpu->run->s.regs.gprs[r2],
+			vcpu->run->s.regs.gprs[r2 + 1],
+			vcpu->run->s.regs.gprs[r1], cc);
+
+	ppt_put_dev(ppt_dev);
+out:
+	kvm_s390_set_psw_cc(vcpu, (unsigned long)cc);
+	ppt_set_status(vcpu, r2, status);
+	return 0;
+}
+
+int handle_stpcifc(struct kvm_vcpu *vcpu)
+{
+	PPT_MESSAGE(PPT_TRACE_NORMAL, "stpcifc\n");
+	kvm_s390_set_psw_cc(vcpu, 1);
+	return 0;
+}
+
+static void ppt_process_gisa(struct kvm_s390_gisa_f1 *gisa)
+{
+	struct ppt_vm_entry *tmp;
+
+	spin_lock(&ppt_vm_list_lock);
+
+	list_for_each_entry(tmp, &ppt_vm_list, entry) {
+		if (&tmp->kvm->arch.gisa->f1 == gisa) {
+			schedule_work(&tmp->irq_work);
+			tmp->stat_items[PPT_VM_STAT_ALERT_IRQ]++;
+			break;
+		}
+	}
+
+	spin_unlock(&ppt_vm_list_lock);
+}
+
+static void ppt_process_faisb(void)
+{
+	struct ppt_dev *ppt_dev, *tmp_ppt_dev;
+	struct ppt_vm_entry *tmp;
+	int summary_set;
+	unsigned long si;
+
+	for (si = 0;;) {
+		si = airq_iv_scan(faisb_iv, si, airq_iv_end(faisb_iv));
+		if (si == -1UL)
+			break;
+
+		ppt_dev = NULL;
+
+		spin_lock(&ppt_vm_list_lock);
+
+		list_for_each_entry(tmp, &ppt_vm_list, entry) {
+			list_for_each_entry(tmp_ppt_dev,
+				&tmp->kvm->arch.ppt_dev_list, entry) {
+				if (tmp_ppt_dev->faisb == si) {
+					ppt_dev = tmp_ppt_dev;
+					break;
+				}
+			}
+			if (ppt_dev) {
+				tmp->stat_items[PPT_VM_STAT_ALERT_IRQ]++;
+				tmp->stat_items[PPT_VM_STAT_ALERT_H]++;
+				break;
+			}
+		}
+
+		if (ppt_dev) {
+			summary_set = test_and_set_bit(
+				ppt_dev->aisbo ^ be_to_le, ppt_dev->aisb);
+
+			if (!summary_set) {
+				if (kvm_s390_gisa_test_iam_gisc(ppt_dev->kvm,
+					PCI_ISC)) {
+					schedule_work(
+						&ppt_dev->ppt_vm->irq_work);
+				} else {
+					kvm_s390_gisa_set_ipm_gisc(
+						ppt_dev->kvm, PCI_ISC);
+				}
+			}
+
+		}
+		spin_unlock(&ppt_vm_list_lock);
+	}
+}
+
+static void ppt_walk_gib(void)
+{
+	struct kvm_s390_gisa_f1 *gisa, *next_gisa, *prev_gisa;
+
+	gisa = (struct kvm_s390_gisa_f1 *)(unsigned long)ppt_gib_get_alo();
+	prev_gisa = NULL;
+
+	/* Walk the alert list from tail to head: find the last GISA,
+	 * process it, detach it from the list and repeat. */
+	while (gisa) {
+		next_gisa = (struct kvm_s390_gisa_f1 *)
+			    (unsigned long)__kvm_s390_gisa_get_next_alert(
+			    (union kvm_s390_gisa *)gisa);
+		while (next_gisa) {
+			prev_gisa = gisa;
+			gisa = (struct kvm_s390_gisa_f1 *)(unsigned long)
+				__kvm_s390_gisa_get_next_alert(
+				(union kvm_s390_gisa *)prev_gisa);
+			next_gisa = (struct kvm_s390_gisa_f1 *)(unsigned long)
+				__kvm_s390_gisa_get_next_alert(
+				(union kvm_s390_gisa *)gisa);
+		}
+
+		ppt_process_gisa(gisa);
+		__kvm_s390_gisa_set_next_alert((union kvm_s390_gisa *)gisa,
+					       (u32)(unsigned long)gisa);
+
+		if (prev_gisa)
+			__kvm_s390_gisa_set_next_alert(
+				(union kvm_s390_gisa *)prev_gisa, 0);
+		else {
+			ppt_gib_set_alo(0);
+			break;
+		}
+	}
+}
+
+static void ppt_alert_irq_handler(struct airq_struct *airq)
+{
+	if (S390_lowcore.subchannel_nr & PPT_AIRQ_HOST_ERROR) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"error irq id 0x%x nr 0x%x parm 0x%x word 0x%x\n",
+			S390_lowcore.subchannel_id, S390_lowcore.subchannel_nr,
+			S390_lowcore.io_int_parm, S390_lowcore.io_int_word);
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"gib alo 0x%x faisb 0x%lx\n",
+			ppt_gib_get_alo(),
+			*faisb_iv->vector);
+		/* we do not turn the irq on again, so forwarding is
+		 * dead from now on */
+		return;
+	}
+
+	if (S390_lowcore.subchannel_nr & PPT_AIRQ_HOST_FORWARD) {
+		ppt_process_faisb();
+		zpci_set_irq_ctrl(1, NULL, PCIPT_ISC);
+		return;
+	}
+
+	/* handle the alert list */
+	ppt_walk_gib();
+	zpci_set_irq_ctrl(1, NULL, PCIPT_ISC);
+	ppt_walk_gib();
+}
+
+static struct ppt_dev *kvm_find_assigned_ppt(struct list_head *head,
+					     int assigned_dev_id)
+{
+	struct ppt_dev *tmp;
+
+	list_for_each_entry(tmp, head, entry) {
+		if (tmp->dev_id == assigned_dev_id)
+			return tmp;
+	}
+
+	return NULL;
+}
+
+static const char *const ppt_driver_whitelist[] = { "pci-stub" };
+
+static bool ppt_whitelisted_driver(struct device_driver *drv)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(ppt_driver_whitelist); i++) {
+		if (!strcmp(drv->name, ppt_driver_whitelist[i]))
+			return true;
+	}
+
+	return false;
+}
+
+int kvm_s390_ioctrl_assign_pci(struct kvm *kvm,
+			       struct kvm_assigned_pci_dev *assigned_dev)
+{
+	struct ppt_dev *ppt_dev = NULL;
+	struct pci_dev *dev;
+	struct ppt_vm_entry *tmp;
+	struct device_driver *drv;
+	int rc = 0;
+	char ppt_dev_name[DBF_NAME_LEN];
+	unsigned long flags;
+
+	if (kvm_s390_gisa_get_fmt() != KVM_S390_GISA_FORMAT_1) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL, "GISA format not supported\n");
+		rc = -EINVAL;
+		goto out;
+	}
+
+	spin_lock_irqsave(&ppt_vm_list_lock, flags);
+	list_for_each_entry(tmp, &ppt_vm_list, entry) {
+		ppt_dev = kvm_find_assigned_ppt(&tmp->kvm->arch.ppt_dev_list,
+			assigned_dev->assigned_dev_id);
+		if (ppt_dev)
+			break;
+	}
+	spin_unlock_irqrestore(&ppt_vm_list_lock, flags);
+	if (ppt_dev) {
+		rc = -EEXIST;
+		goto out;
+	}
+
+	dev = pci_get_domain_bus_and_slot(assigned_dev->segnr,
+		assigned_dev->busnr, assigned_dev->devfn);
+	if (!dev) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL, "assign no pci dev\n");
+		rc = -EINVAL;
+		goto out;
+	}
+
+	if (dev->hdr_type != PCI_HEADER_TYPE_NORMAL) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL, "device type not supported\n");
+		rc = -EINVAL;
+		goto out_put;
+	}
+
+	drv = ACCESS_ONCE(dev->dev.driver);
+	if (drv && !ppt_whitelisted_driver(drv)) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"device is bound to a non-whitelisted driver\n");
+		rc = -EBUSY;
+		goto out_put;
+	}
+
+	ppt_dev = ppt_alloc_dev();
+	if (IS_ERR(ppt_dev)) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL, "could not allocate memory\n");
+		rc = PTR_ERR(ppt_dev);
+		goto out_put;
+	}
+	ppt_dev->zdev = (struct zpci_dev *)dev->sysdata;
+	ppt_dev->pdev = dev;
+	ppt_dev->kvm = kvm;
+	ppt_dev->dev_id = assigned_dev->assigned_dev_id;
+
+	snprintf(ppt_dev_name, sizeof(ppt_dev_name), "ppt_dev_%x",
+		 ppt_dev->zdev->fid);
+	ppt_dev->debug = ppt_get_dbf_entry(ppt_dev_name);
+	if (!ppt_dev->debug) {
+		rc = ppt_add_dbf_entry(ppt_dev, ppt_dev_name);
+		if (rc) {
+			PPT_MESSAGE(PPT_TRACE_NORMAL, "add dbf failed\n");
+			goto out_free;
+		}
+	}
+
+	ppt_dev_debugfs_stats_init(ppt_dev);
+
+	PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL, "assigned\n");
+
+	rc = zpci_disable_device(ppt_dev->zdev);
+	if (rc) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL, "could not disable device\n");
+		goto out_free;
+	}
+
+	ppt_dev->ppt_vm = ppt_register_vm(ppt_dev->kvm);
+	if (IS_ERR(ppt_dev->ppt_vm)) {
+		PPT_MESSAGE(PPT_TRACE_NORMAL, "register vm failed\n");
+		rc = PTR_ERR(ppt_dev->ppt_vm);
+		goto out_free;
+	}
+
+	if (assigned_dev->flags & ASSIGN_FLAG_HOSTIRQ) {
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"config hostirq\n");
+		ppt_dev->hostirq = 1;
+	}
+
+	mutex_lock(&kvm->lock);
+	list_add_tail(&ppt_dev->entry, &kvm->arch.ppt_dev_list);
+	mutex_unlock(&kvm->lock);
+
+	PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+		"assign fh 0x%x\n", ppt_dev->zdev->fh);
+	return rc;
+out_free:
+	kfree(ppt_dev);
+out_put:
+	pci_dev_put(dev);
+out:
+	return rc;
+}
+
+int kvm_s390_ioctrl_deassign_pci(struct kvm *kvm,
+				 struct kvm_assigned_pci_dev *assigned_dev)
+{
+	int rc = 0;
+	struct ppt_dev *ppt_dev = ppt_get_by_devid(kvm,
+		assigned_dev->assigned_dev_id);
+
+	if (ppt_dev) {
+		PPT_DEVICE_MESSAGE(ppt_dev, PPT_TRACE_NORMAL,
+			"deassign fh 0x%x\n", ppt_dev->zdev->fh);
+		mutex_lock(&kvm->lock);
+		list_del(&ppt_dev->entry);
+		mutex_unlock(&kvm->lock);
+		/* for the ppt_get_by_devid */
+		ppt_put_dev(ppt_dev);
+		/* to free the device */
+		ppt_put_dev(ppt_dev);
+	} else {
+		PPT_MESSAGE(PPT_TRACE_NORMAL,
+			"deassign no dev with id 0x%x\n",
+			assigned_dev->assigned_dev_id);
+		rc = -ENODEV;
+	}
+	return rc;
+}
+
+void s390_pci_cleanup(struct kvm *kvm)
+{
+	struct list_head *ptr, *ptr2;
+	struct ppt_dev *ppt_dev;
+
+	list_for_each_safe(ptr, ptr2, &kvm->arch.ppt_dev_list) {
+		ppt_dev = list_entry(ptr, struct ppt_dev, entry);
+		list_del(&ppt_dev->entry);
+		ppt_put_dev(ppt_dev);
+	}
+}
+
+int s390_pci_init(void)
+{
+	int rc = 0;
+	struct kvm_s390_aifte *aifte;
+
+	if (!test_facility(2) || !test_facility(69)
+	    || !test_facility(71) || !test_facility(72))
+		return 0;
+
+	rc = ppt_register_dbf_views();
+	if (rc) {
+		pr_err("failed to register dbf views rc %d\n", rc);
+		goto out;
+	}
+
+	rc = register_adapter_interrupt(&ppt_airq);
+	if (rc) {
+		pr_err("failed to register airq isc %d rc %d\n", PCIPT_ISC, rc);
+		goto out_airq;
+	}
+
+	*ppt_airq.lsi_ptr = 1;
+
+	faisb_iv = airq_iv_create(CONFIG_PCI_NR_FUNCTIONS, AIRQ_IV_ALLOC);
+	if (!faisb_iv) {
+		rc = -ENOMEM;
+		goto out_iv;
+	}
+
+	gait = (struct kvm_s390_gait *)get_zeroed_page(
+		GFP_KERNEL | GFP_DMA);
+
+	aifte = (struct kvm_s390_aifte *)get_zeroed_page(
+		GFP_KERNEL | GFP_DMA);
+
+	gib = (struct kvm_s390_gib *)get_zeroed_page(
+		GFP_KERNEL | GFP_DMA);
+
+	if (!gait || !aifte || !gib) {
+		rc = -ENOMEM;
+		goto out_nomem;
+	}
+
+	aifte->faisba = (unsigned long)faisb_iv->vector;
+	aifte->gaita = (u64)gait;
+	aifte->afi = PCIPT_ISC;
+	aifte->faal = CONFIG_PCI_NR_FUNCTIONS;
+
+	rc = chsc_sgib((u32)(unsigned long)gib);
+	if (rc) {
+		pr_err("set gib failed rc %d\n", rc);
+		goto out_gib;
+	}
+
+	zpci_set_irq_ctrl(2, (char *)aifte, PCIPT_ISC);
+
+	zpci_set_irq_ctrl(1, NULL, PCIPT_ISC);
+	free_page((unsigned long)aifte);
+
+	PPT_MESSAGE(PPT_TRACE_NORMAL,
+		"faisba 0x%lx gait 0x%lx gib 0x%lx isc: %d\n",
+		(unsigned long)faisb_iv->vector, (unsigned long)gait,
+		(unsigned long)gib, PCIPT_ISC);
+
+	ppt_stats_debugfs_root = debugfs_create_dir("ppt", NULL);
+	if (IS_ERR_OR_NULL(ppt_stats_debugfs_root))
+		ppt_stats_debugfs_root = NULL;
+
+	return 0;
+out_gib:
+out_nomem:
+	airq_iv_release(faisb_iv);
+	free_page((unsigned long)gait);
+	free_page((unsigned long)aifte);
+	free_page((unsigned long)gib);
+out_iv:
+	unregister_adapter_interrupt(&ppt_airq);
+out_airq:
+	ppt_unregister_dbf_views();
+out:
+	return rc;
+}
+
+void s390_pci_exit(void)
+{
+	int rc;
+	struct kvm_s390_aifte *aifte;
+
+	if (!test_facility(2) || !test_facility(69)
+	    || !test_facility(71) || !test_facility(72))
+		return;
+
+	if (faisb_iv)
+		airq_iv_release(faisb_iv);
+
+	unregister_adapter_interrupt(&ppt_airq);
+
+	aifte = (struct kvm_s390_aifte *)get_zeroed_page(
+		GFP_KERNEL | GFP_DMA);
+	if (!aifte)
+		pr_err("failed to get page for aifte\n");
+	else
+		zpci_set_irq_ctrl(2, (char *)aifte, PCIPT_ISC);
+
+	rc = chsc_sgib(0);
+	if (rc)
+		pr_err("reset gib failed rc %d\n", rc);
+
+	free_page((unsigned long)aifte);
+	free_page((unsigned long)gait);
+	free_page((unsigned long)gib);
+
+	ppt_unregister_dbf_views();
+	ppt_clear_dbf_list();
+
+	debugfs_remove(ppt_stats_debugfs_root);
+}
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -753,6 +753,10 @@ static const intercept_handler_t b9_hand
 	[0x8f] = handle_ipte_interlock,
 	[0xab] = handle_essa,
 	[0xaf] = handle_pfmf,
+	[0xa0] = handle_clp,
+	[0xd0] = handle_pcistg,
+	[0xd2] = handle_pcilg,
+	[0xd3] = handle_rpcit,
 };
 
 int kvm_s390_handle_b9(struct kvm_vcpu *vcpu)
@@ -912,9 +916,26 @@ static int handle_stctg(struct kvm_vcpu
 	return 0;
 }
 
+static const intercept_handler_t e3_handlers[256] = {
+	[0xd0] = handle_mpcifc,
+	[0xd4] = handle_stpcifc,
+};
+
+int kvm_s390_handle_e3(struct kvm_vcpu *vcpu)
+{
+	intercept_handler_t handler;
+
+	handler = e3_handlers[vcpu->arch.sie_block->ipb & 0xff];
+	if (handler)
+		return handler(vcpu);
+	return -EOPNOTSUPP;
+}
+
 static const intercept_handler_t eb_handlers[256] = {
 	[0x2f] = handle_lctlg,
 	[0x25] = handle_stctg,
+	[0xd0] = handle_pcistb,
+	[0xd1] = handle_sic,
 };
 
 int kvm_s390_handle_eb(struct kvm_vcpu *vcpu)

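The b9/e3/eb tables above all follow the same pattern: a 256-entry array of function pointers indexed by the low opcode byte, with a NULL entry meaning "not handled". Here is a minimal, self-contained sketch of that dispatch scheme; the `vcpu` struct and handler names are illustrative stand-ins for the real KVM types, not the kernel API.

```c
#include <stddef.h>

/* Illustrative stand-ins for the kernel types; not the real KVM API. */
struct vcpu { unsigned int ipb; };
typedef int (*intercept_handler_t)(struct vcpu *);

static int handle_op_d0(struct vcpu *vcpu) { return 0; } /* e.g. MPCIFC */
static int handle_op_d4(struct vcpu *vcpu) { return 4; } /* e.g. STPCIFC */

/* 256-entry table indexed by the low byte of the instruction word;
 * unlisted opcodes stay NULL. */
static const intercept_handler_t handlers[256] = {
	[0xd0] = handle_op_d0,
	[0xd4] = handle_op_d4,
};

/* Mirrors the shape of kvm_s390_handle_e3(): look up the handler by
 * the low opcode byte of ipb, fall back to -EOPNOTSUPP when absent. */
int dispatch(struct vcpu *vcpu)
{
	intercept_handler_t h = handlers[vcpu->ipb & 0xff];

	return h ? h(vcpu) : -95; /* -EOPNOTSUPP */
}
```

The sparse designated-initializer table keeps lookup O(1) and makes adding a new intercepted instruction a one-line change, which is why the patch extends the existing b9/eb tables rather than adding switch statements.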
^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Qemu-devel] [RFC][patch 5/6] s390: Add PCI bus support
  2014-09-04 10:52 [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 frank.blaschka
                   ` (3 preceding siblings ...)
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 4/6] KVM: s390: Add PCI pass-through support frank.blaschka
@ 2014-09-04 10:52 ` frank.blaschka
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 6/6] s390: Add PCI pass-through device support frank.blaschka
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: frank.blaschka @ 2014-09-04 10:52 UTC (permalink / raw)
  To: qemu-devel, linux-s390, kvm; +Cc: aik, pbonzini, agraf

[-- Attachment #1: 101-qemu_bus.patch --]
[-- Type: text/plain, Size: 33123 bytes --]

From: Frank Blaschka <frank.blaschka@de.ibm.com>

This patch implements a PCI bus for s390x together with some infrastructure
to generate and handle hotplug events. It also provides device
configuration/deconfiguration via SCLP instruction interception.

Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
---
 default-configs/s390x-softmmu.mak |    1 
 hw/s390x/Makefile.objs            |    1 
 hw/s390x/css.c                    |    5 
 hw/s390x/css.h                    |    1 
 hw/s390x/s390-pci-bus.c           |  287 ++++++++++++++++++++++++++++++++++++++
 hw/s390x/s390-pci-bus.h           |  139 ++++++++++++++++++
 hw/s390x/s390-virtio-ccw.c        |    2 
 hw/s390x/sclp.c                   |   10 +
 include/hw/s390x/sclp.h           |    8 +
 target-s390x/Makefile.objs        |    2 
 target-s390x/ioinst.c             |   52 ++++++
 target-s390x/ioinst.h             |    1 
 target-s390x/kvm.c                |    5 
 target-s390x/pci_ic.c             |  230 ++++++++++++++++++++++++++++++
 target-s390x/pci_ic.h             |  214 ++++++++++++++++++++++++++++
 15 files changed, 956 insertions(+), 2 deletions(-)
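The hotplug plumbing in s390-pci-bus.c below boils down to a FIFO of SEI event containers: plug/unplug events are appended when a device changes state and drained one at a time when the guest issues CHSC. The real code uses QEMU's QTAILQ macros; the following is a hedged, stand-alone sketch of the same idea with hypothetical names, using the same return convention as `chsc_sei_nt2_get_event()` (0 = event delivered, 1 = queue empty).

```c
#include <stdlib.h>

/* Miniature of the pending_sei queue: events go in at the tail when a
 * device is (un)plugged and come out at the head on CHSC. */
struct sei_event {
	unsigned int fh, fid, pec;
	struct sei_event *next;
};

static struct sei_event *head, *tail;

/* Append one event container at the tail of the queue. */
void sei_queue(unsigned int fh, unsigned int fid, unsigned int pec)
{
	struct sei_event *e = calloc(1, sizeof(*e));

	e->fh = fh;
	e->fid = fid;
	e->pec = pec;
	if (tail)
		tail->next = e;
	else
		head = e;
	tail = e;
}

/* Pop the oldest event into *out; returns 0 on success, 1 when the
 * queue is empty (the chsc_sei_nt2_get_event() convention). */
int sei_dequeue(struct sei_event *out)
{
	struct sei_event *e = head;

	if (!e)
		return 1;
	head = e->next;
	if (!head)
		tail = NULL;
	*out = *e;
	free(e);
	return 0;
}
```

Queuing two events for one plug (reserved-to-standby, then standby-to-configured, as `s390_pcihost_hot_plug()` does) simply results in two CHSC responses in order.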

--- a/default-configs/s390x-softmmu.mak
+++ b/default-configs/s390x-softmmu.mak
@@ -1,4 +1,5 @@
 CONFIG_VIRTIO=y
+CONFIG_PCI=y
 CONFIG_SCLPCONSOLE=y
 CONFIG_S390_FLIC=y
 CONFIG_S390_FLIC_KVM=$(CONFIG_KVM)
--- a/hw/s390x/Makefile.objs
+++ b/hw/s390x/Makefile.objs
@@ -8,3 +8,4 @@ obj-y += ipl.o
 obj-y += css.o
 obj-y += s390-virtio-ccw.o
 obj-y += virtio-ccw.o
+obj-$(CONFIG_KVM) += s390-pci-bus.o
--- a/hw/s390x/css.c
+++ b/hw/s390x/css.c
@@ -1281,6 +1281,11 @@ void css_generate_chp_crws(uint8_t cssid
     /* TODO */
 }
 
+void css_generate_css_crws(uint8_t cssid)
+{
+    css_queue_crw(CRW_RSC_CSS, 0, 0, 0);
+}
+
 int css_enable_mcsse(void)
 {
     trace_css_enable_facility("mcsse");
--- a/hw/s390x/css.h
+++ b/hw/s390x/css.h
@@ -99,6 +99,7 @@ void css_queue_crw(uint8_t rsc, uint8_t
 void css_generate_sch_crws(uint8_t cssid, uint8_t ssid, uint16_t schid,
                            int hotplugged, int add);
 void css_generate_chp_crws(uint8_t cssid, uint8_t chpid);
+void css_generate_css_crws(uint8_t cssid);
 void css_adapter_interrupt(uint8_t isc);
 
 #define CSS_IO_ADAPTER_VIRTIO 1
--- /dev/null
+++ b/hw/s390x/s390-pci-bus.c
@@ -0,0 +1,287 @@
+/*
+ * s390 PCI BUS
+ *
+ * Copyright 2014 IBM Corp.
+ * Author(s): Frank Blaschka <frank.blaschka@de.ibm.com>
+ *            Hong Bo Li <lihbbj@cn.ibm.com>
+ *            Yi Min Zhao <zyimin@cn.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#include <hw/pci/pci.h>
+#include <hw/s390x/css.h>
+#include <hw/s390x/sclp.h>
+#include "qemu/error-report.h"
+#include "s390-pci-bus.h"
+
+/* #define DEBUG_S390PCI_BUS */
+#ifdef DEBUG_S390PCI_BUS
+#define DPRINTF(fmt, ...) \
+    do { fprintf(stderr, "S390pci-bus: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+static QTAILQ_HEAD(, SeiContainer) pending_sei =
+    QTAILQ_HEAD_INITIALIZER(pending_sei);
+static QTAILQ_HEAD(, S390PCIBusDevice) device_list =
+    QTAILQ_HEAD_INITIALIZER(device_list);
+
+int chsc_sei_nt2_get_event(void *res)
+{
+    ChscSeiNt2Res *nt2_res = (ChscSeiNt2Res *)res;
+    PciCcdfAvail *accdf;
+    PciCcdfErr *eccdf;
+    int rc = 1;
+    SeiContainer *sei_cont;
+
+    sei_cont = QTAILQ_FIRST(&pending_sei);
+    if (sei_cont) {
+        QTAILQ_REMOVE(&pending_sei, sei_cont, link);
+        nt2_res->nt = 2;
+        nt2_res->cc = sei_cont->cc;
+        switch (sei_cont->cc) {
+        case 1: /* error event */
+            eccdf = (PciCcdfErr *)nt2_res->ccdf;
+            eccdf->fid = cpu_to_be32(sei_cont->fid);
+            eccdf->fh = cpu_to_be32(sei_cont->fh);
+            break;
+        case 2: /* availability event */
+            accdf = (PciCcdfAvail *)nt2_res->ccdf;
+            accdf->fid = cpu_to_be32(sei_cont->fid);
+            accdf->fh = cpu_to_be32(sei_cont->fh);
+            accdf->pec = cpu_to_be16(sei_cont->pec);
+            break;
+        default:
+            abort();
+        }
+        g_free(sei_cont);
+        rc = 0;
+    }
+
+    return rc;
+}
+
+int chsc_sei_nt2_have_event(void)
+{
+    return !QTAILQ_EMPTY(&pending_sei);
+}
+
+static S390PCIBusDevice *s390_pci_find_dev_by_fid(uint32_t fid)
+{
+    S390PCIBusDevice *pbdev;
+
+    QTAILQ_FOREACH(pbdev, &device_list, next) {
+        if (pbdev->fid == fid) {
+            return pbdev;
+        }
+    }
+    return NULL;
+}
+
+void s390_pci_sclp_configure(int configure, SCCB *sccb)
+{
+    PciCfgSccb *psccb = (PciCfgSccb *)sccb;
+    S390PCIBusDevice *pbdev = s390_pci_find_dev_by_fid(be32_to_cpu(psccb->aid));
+    uint16_t rc;
+
+    if (pbdev) {
+        if ((configure == 1 && pbdev->configured == true) ||
+            (configure == 0 && pbdev->configured == false)) {
+            rc = SCLP_RC_NO_ACTION_REQUIRED;
+        } else {
+            pbdev->configured = !pbdev->configured;
+            rc = SCLP_RC_NORMAL_COMPLETION;
+        }
+    } else {
+        DPRINTF("sclp config %d no dev found\n", configure);
+        rc = SCLP_RC_ADAPTER_ID_NOT_RECOGNIZED;
+    }
+
+    psccb->header.response_code = cpu_to_be16(rc);
+    return;
+}
+
+static uint32_t s390_pci_get_pfid(PCIDevice *pdev)
+{
+    return PCI_SLOT(pdev->devfn);
+}
+
+static uint32_t s390_pci_get_pfh(PCIDevice *pdev)
+{
+    return PCI_SLOT(pdev->devfn) | FH_VIRT;
+}
+
+S390PCIBusDevice *s390_pci_find_dev_by_idx(uint32_t idx)
+{
+    S390PCIBusDevice *dev;
+    int i = 0;
+
+    QTAILQ_FOREACH(dev, &device_list, next) {
+        if (i == idx) {
+            return dev;
+        }
+        i++;
+    }
+    return NULL;
+}
+
+S390PCIBusDevice *s390_pci_find_dev_by_fh(uint32_t fh)
+{
+    S390PCIBusDevice *pbdev;
+
+    QTAILQ_FOREACH(pbdev, &device_list, next) {
+        if (pbdev->fh == fh) {
+            return pbdev;
+        }
+    }
+    return NULL;
+}
+
+static S390PCIBusDevice *s390_pci_find_dev_by_pdev(PCIDevice *pdev)
+{
+    S390PCIBusDevice *pbdev;
+
+    QTAILQ_FOREACH(pbdev, &device_list, next) {
+        if (pbdev->pdev == pdev) {
+            return pbdev;
+        }
+    }
+    return NULL;
+}
+
+static void s390_pci_generate_plug_event(uint16_t pec, uint32_t fh,
+                                         uint32_t fid)
+{
+    SeiContainer *sei_cont = g_malloc0(sizeof(SeiContainer));
+
+    sei_cont->fh = fh;
+    sei_cont->fid = fid;
+    sei_cont->cc = 2;
+    sei_cont->pec = pec;
+
+    QTAILQ_INSERT_TAIL(&pending_sei, sei_cont, link);
+    css_generate_css_crws(0);
+}
+
+static void s390_pci_set_irq(void *opaque, int irq, int level)
+{
+    /* nothing to do */
+}
+
+static int s390_pci_map_irq(PCIDevice *pci_dev, int irq_num)
+{
+    /* nothing to do */
+    return 0;
+}
+
+void s390_pci_bus_init(void)
+{
+    DeviceState *dev;
+
+    dev = qdev_create(NULL, TYPE_S390_PCI_HOST_BRIDGE);
+    qdev_init_nofail(dev);
+}
+
+static int s390_pcihost_init(SysBusDevice *dev)
+{
+    PCIBus *b;
+    BusState *bus;
+    PCIHostState *phb = PCI_HOST_BRIDGE(dev);
+
+    DPRINTF("host_init\n");
+
+    b = pci_register_bus(DEVICE(dev), NULL,
+                         s390_pci_set_irq, s390_pci_map_irq, NULL,
+                         get_system_memory(), get_system_io(), 0, 64,
+                         TYPE_PCI_BUS);
+
+    bus = BUS(b);
+    qbus_set_hotplug_handler(bus, DEVICE(dev), NULL);
+    phb->bus = b;
+
+    return 0;
+}
+
+static void s390_pcihost_hot_plug(HotplugHandler *hotplug_dev,
+                                  DeviceState *dev, Error **errp)
+{
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+    S390PCIBusDevice *pbdev;
+
+    pbdev = g_malloc0(sizeof(*pbdev));
+
+    pbdev->fid = s390_pci_get_pfid(pci_dev);
+    pbdev->pdev = pci_dev;
+    pbdev->configured = true;
+
+    pbdev->fh = s390_pci_get_pfh(pci_dev);
+    pbdev->is_virt = 1;
+
+    QTAILQ_INSERT_TAIL(&device_list, pbdev, next);
+    if (dev->hotplugged) {
+        s390_pci_generate_plug_event(HP_EVENT_RESERVED_TO_STANDBY,
+                                     pbdev->fh, pbdev->fid);
+        s390_pci_generate_plug_event(HP_EVENT_TO_CONFIGURED,
+                                     pbdev->fh, pbdev->fid);
+    }
+    return;
+}
+
+static void s390_pcihost_hot_unplug(HotplugHandler *hotplug_dev,
+                                    DeviceState *dev, Error **errp)
+{
+    PCIDevice *pci_dev = PCI_DEVICE(dev);
+    S390PCIBusDevice *pbdev;
+
+    pbdev = s390_pci_find_dev_by_pdev(pci_dev);
+    if (!pbdev) {
+        DPRINTF("Error, can't find hot-unplug device in list\n");
+        return;
+    }
+
+    if (pbdev->configured) {
+        pbdev->configured = false;
+        s390_pci_generate_plug_event(HP_EVENT_CONFIGURED_TO_STBRES,
+                                     pbdev->fh, pbdev->fid);
+    }
+
+    QTAILQ_REMOVE(&device_list, pbdev, next);
+    s390_pci_generate_plug_event(HP_EVENT_STANDBY_TO_RESERVED, 0, 0);
+    object_unparent(OBJECT(pci_dev));
+    g_free(pbdev);
+}
+
+static void s390_pcihost_class_init(ObjectClass *klass, void *data)
+{
+    SysBusDeviceClass *k = SYS_BUS_DEVICE_CLASS(klass);
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(klass);
+
+    dc->cannot_instantiate_with_device_add_yet = true;
+    k->init = s390_pcihost_init;
+    hc->plug = s390_pcihost_hot_plug;
+    hc->unplug = s390_pcihost_hot_unplug;
+}
+
+static const TypeInfo s390_pcihost_info = {
+    .name          = TYPE_S390_PCI_HOST_BRIDGE,
+    .parent        = TYPE_PCI_HOST_BRIDGE,
+    .instance_size = sizeof(S390pciState),
+    .class_init    = s390_pcihost_class_init,
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_HOTPLUG_HANDLER },
+        { }
+    }
+};
+
+static void s390_pci_register_types(void)
+{
+    type_register_static(&s390_pcihost_info);
+}
+
+type_init(s390_pci_register_types)
--- /dev/null
+++ b/hw/s390x/s390-pci-bus.h
@@ -0,0 +1,139 @@
+/*
+ * s390 PCI BUS definitions
+ *
+ * Copyright 2014 IBM Corp.
+ * Author(s): Frank Blaschka <frank.blaschka@de.ibm.com>
+ *            Hong Bo Li <lihbbj@cn.ibm.com>
+ *            Yi Min Zhao <zyimin@cn.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#ifndef HW_S390_PCI_BUS_H
+#define HW_S390_PCI_BUS_H
+
+#include <hw/pci/pci_host.h>
+#include "hw/s390x/sclp.h"
+
+#define TYPE_S390_PCI_HOST_BRIDGE "s390-pcihost"
+#define FH_VIRT 0x00ff0000
+#define ENABLE_BIT_OFFSET 31
+
+#define S390_PCI_HOST_BRIDGE(obj) \
+    OBJECT_CHECK(S390pciState, (obj), TYPE_S390_PCI_HOST_BRIDGE)
+
+#define HP_EVENT_TO_CONFIGURED        0x0301
+#define HP_EVENT_RESERVED_TO_STANDBY  0x0302
+#define HP_EVENT_CONFIGURED_TO_STBRES 0x0304
+#define HP_EVENT_STANDBY_TO_RESERVED  0x0308
+
+typedef struct SeiContainer {
+    QTAILQ_ENTRY(SeiContainer) link;
+    uint32_t fid;
+    uint32_t fh;
+    uint8_t cc;
+    uint16_t pec;
+} SeiContainer;
+
+typedef struct PciCcdfErr {
+    uint32_t reserved1;
+    uint32_t fh;
+    uint32_t fid;
+    uint32_t reserved2;
+    uint64_t faddr;
+    uint32_t reserved3;
+    uint16_t reserved4;
+    uint16_t pec;
+} QEMU_PACKED PciCcdfErr;
+
+typedef struct PciCcdfAvail {
+    uint32_t reserved1;
+    uint32_t fh;
+    uint32_t fid;
+    uint32_t reserved2;
+    uint32_t reserved3;
+    uint32_t reserved4;
+    uint32_t reserved5;
+    uint16_t reserved6;
+    uint16_t pec;
+} QEMU_PACKED PciCcdfAvail;
+
+typedef struct ChscSeiNt2Res {
+    uint16_t length;
+    uint16_t code;
+    uint16_t reserved1;
+    uint8_t reserved2;
+    uint8_t nt;
+    uint8_t flags;
+    uint8_t reserved3;
+    uint8_t reserved4;
+    uint8_t cc;
+    uint32_t reserved5[13];
+    uint8_t ccdf[4016];
+} QEMU_PACKED ChscSeiNt2Res;
+
+typedef struct PciCfgSccb {
+        SCCBHeader header;
+        uint8_t atype;
+        uint8_t reserved1;
+        uint16_t reserved2;
+        uint32_t aid;
+} QEMU_PACKED PciCfgSccb;
+
+typedef struct S390pciState {
+    PCIHostState parent_obj;
+    MemoryRegion reg_mem;
+} S390pciState;
+
+typedef struct S390PCIBusDevice {
+    PCIDevice *pdev;
+    bool is_virt;
+    bool configured;
+    uint32_t fh;
+    uint32_t fid;
+    uint64_t g_iota;
+    QTAILQ_ENTRY(S390PCIBusDevice) next;
+} S390PCIBusDevice;
+
+#ifdef CONFIG_KVM
+int chsc_sei_nt2_get_event(void *res);
+int chsc_sei_nt2_have_event(void);
+void s390_pci_sclp_configure(int configure, SCCB *sccb);
+S390PCIBusDevice *s390_pci_find_dev_by_idx(uint32_t idx);
+S390PCIBusDevice *s390_pci_find_dev_by_fh(uint32_t fh);
+void s390_pci_bus_init(void);
+#else
+static inline int chsc_sei_nt2_get_event(void *res)
+{
+    return 1;
+}
+
+static inline int chsc_sei_nt2_have_event(void)
+{
+    return 0;
+}
+
+static inline void s390_pci_sclp_configure(int configure, SCCB *sccb)
+{
+    return;
+}
+
+static inline S390PCIBusDevice *s390_pci_find_dev_by_idx(uint32_t idx)
+{
+    return NULL;
+}
+
+static inline S390PCIBusDevice *s390_pci_find_dev_by_fh(uint32_t fh)
+{
+    return NULL;
+}
+
+static inline void s390_pci_bus_init(void)
+{
+    return;
+}
+#endif
+
+#endif
--- a/hw/s390x/s390-virtio-ccw.c
+++ b/hw/s390x/s390-virtio-ccw.c
@@ -17,6 +17,7 @@
 #include "ioinst.h"
 #include "css.h"
 #include "virtio-ccw.h"
+#include "s390-pci-bus.h"
 #include "qemu/config-file.h"
 
 #define TYPE_S390_CCW_MACHINE               "s390-ccw-machine"
@@ -126,6 +127,7 @@ static void ccw_init(MachineState *machi
     s390_init_ipl_dev(machine->kernel_filename, machine->kernel_cmdline,
                       machine->initrd_filename, "s390-ccw.img");
     s390_flic_init();
+    s390_pci_bus_init();
 
     /* register hypercalls */
     virtio_ccw_register_hcalls();
--- a/hw/s390x/sclp.c
+++ b/hw/s390x/sclp.c
@@ -20,6 +20,7 @@
 #include "qemu/config-file.h"
 #include "hw/s390x/sclp.h"
 #include "hw/s390x/event-facility.h"
+#include "hw/s390x/s390-pci-bus.h"
 
 static inline SCLPEventFacility *get_event_facility(void)
 {
@@ -62,7 +63,8 @@ static void read_SCP_info(SCCB *sccb)
         read_info->entries[i].type = 0;
     }
 
-    read_info->facilities = cpu_to_be64(SCLP_HAS_CPU_INFO);
+    read_info->facilities = cpu_to_be64(SCLP_HAS_CPU_INFO |
+                                        SCLP_HAS_PCI_RECONFIG);
 
     /*
      * The storage increment size is a multiple of 1M and is a power of 2.
@@ -350,6 +352,12 @@ static void sclp_execute(SCCB *sccb, uin
     case SCLP_UNASSIGN_STORAGE:
         unassign_storage(sccb);
         break;
+    case SCLP_CMDW_CONFIGURE_PCI:
+        s390_pci_sclp_configure(1, sccb);
+        break;
+    case SCLP_CMDW_DECONFIGURE_PCI:
+        s390_pci_sclp_configure(0, sccb);
+        break;
     default:
         efc->command_handler(ef, sccb, code);
         break;
--- a/include/hw/s390x/sclp.h
+++ b/include/hw/s390x/sclp.h
@@ -45,14 +45,22 @@
 #define SCLP_CMDW_CONFIGURE_CPU                 0x00110001
 #define SCLP_CMDW_DECONFIGURE_CPU               0x00100001
 
+/* SCLP PCI codes */
+#define SCLP_HAS_PCI_RECONFIG                   0x0000000040000000ULL
+#define SCLP_CMDW_CONFIGURE_PCI                 0x001a0001
+#define SCLP_CMDW_DECONFIGURE_PCI               0x001b0001
+#define SCLP_RECONFIG_PCI_ATPYE                 2
+
 /* SCLP response codes */
 #define SCLP_RC_NORMAL_READ_COMPLETION          0x0010
 #define SCLP_RC_NORMAL_COMPLETION               0x0020
 #define SCLP_RC_SCCB_BOUNDARY_VIOLATION         0x0100
+#define SCLP_RC_NO_ACTION_REQUIRED              0x0120
 #define SCLP_RC_INVALID_SCLP_COMMAND            0x01f0
 #define SCLP_RC_CONTAINED_EQUIPMENT_CHECK       0x0340
 #define SCLP_RC_INSUFFICIENT_SCCB_LENGTH        0x0300
 #define SCLP_RC_STANDBY_READ_COMPLETION         0x0410
+#define SCLP_RC_ADAPTER_ID_NOT_RECOGNIZED       0x09f0
 #define SCLP_RC_INVALID_FUNCTION                0x40f0
 #define SCLP_RC_NO_EVENT_BUFFERS_STORED         0x60f0
 #define SCLP_RC_INVALID_SELECTION_MASK          0x70f0
--- a/target-s390x/Makefile.objs
+++ b/target-s390x/Makefile.objs
@@ -2,4 +2,4 @@ obj-y += translate.o helper.o cpu.o inte
 obj-y += int_helper.o fpu_helper.o cc_helper.o mem_helper.o misc_helper.o
 obj-y += gdbstub.o
 obj-$(CONFIG_SOFTMMU) += ioinst.o arch_dump.o
-obj-$(CONFIG_KVM) += kvm.o
+obj-$(CONFIG_KVM) += kvm.o pci_ic.o
--- a/target-s390x/ioinst.c
+++ b/target-s390x/ioinst.c
@@ -14,6 +14,7 @@
 #include "cpu.h"
 #include "ioinst.h"
 #include "trace.h"
+#include "hw/s390x/s390-pci-bus.h"
 
 int ioinst_disassemble_sch_ident(uint32_t value, int *m, int *cssid, int *ssid,
                                  int *schid)
@@ -398,6 +399,7 @@ typedef struct ChscResp {
 #define CHSC_SCPD 0x0002
 #define CHSC_SCSC 0x0010
 #define CHSC_SDA  0x0031
+#define CHSC_SEI  0x000e
 
 #define CHSC_SCPD_0_M 0x20000000
 #define CHSC_SCPD_0_C 0x10000000
@@ -566,6 +568,53 @@ out:
     res->param = 0;
 }
 
+static int chsc_sei_nt0_get_event(void *res)
+{
+    /* no events yet */
+    return 1;
+}
+
+static int chsc_sei_nt0_have_event(void)
+{
+    /* no events yet */
+    return 0;
+}
+
+#define CHSC_SEI_NT0    (1ULL << 63)
+#define CHSC_SEI_NT2    (1ULL << 61)
+static void ioinst_handle_chsc_sei(ChscReq *req, ChscResp *res)
+{
+    uint64_t selection_mask = be64_to_cpu(*(uint64_t *)&req->param1);
+    uint8_t *res_flags = (uint8_t *)res->data;
+    int have_event = 0;
+    int have_more = 0;
+
+    /* according to the architecture, nt0 cannot be masked */
+    have_event = !chsc_sei_nt0_get_event(res);
+    have_more = chsc_sei_nt0_have_event();
+
+    if (selection_mask & CHSC_SEI_NT2) {
+        if (!have_event) {
+            have_event = !chsc_sei_nt2_get_event(res);
+        }
+
+        if (!have_more) {
+            have_more = chsc_sei_nt2_have_event();
+        }
+    }
+
+    if (have_event) {
+        res->code = cpu_to_be16(0x0001);
+        if (have_more) {
+            (*res_flags) |= 0x80;
+        } else {
+            (*res_flags) &= ~0x80;
+        }
+    } else {
+        res->code = cpu_to_be16(0x0004);
+    }
+}
+
 static void ioinst_handle_chsc_unimplemented(ChscResp *res)
 {
     res->len = cpu_to_be16(CHSC_MIN_RESP_LEN);
@@ -617,6 +666,9 @@ void ioinst_handle_chsc(S390CPU *cpu, ui
     case CHSC_SDA:
         ioinst_handle_chsc_sda(req, res);
         break;
+    case CHSC_SEI:
+        ioinst_handle_chsc_sei(req, res);
+        break;
     default:
         ioinst_handle_chsc_unimplemented(res);
         break;
--- a/target-s390x/ioinst.h
+++ b/target-s390x/ioinst.h
@@ -194,6 +194,7 @@ typedef struct CRW {
 
 #define CRW_RSC_SUBCH 0x3
 #define CRW_RSC_CHP   0x4
+#define CRW_RSC_CSS   0xb
 
 /* I/O interruption code */
 typedef struct IOIntCode {
--- a/target-s390x/kvm.c
+++ b/target-s390x/kvm.c
@@ -40,6 +40,7 @@
 #include "exec/gdbstub.h"
 #include "trace.h"
 #include "qapi-event.h"
+#include "pci_ic.h"
 
 /* #define DEBUG_KVM */
 
@@ -78,6 +79,7 @@
 #define PRIV_EB_SQBS                    0x8a
 
 #define PRIV_B9_EQBS                    0x9c
+#define PRIV_B9_CLP                     0xa0
 
 #define DIAG_IPL                        0x308
 #define DIAG_KVM_HYPERCALL              0x500
@@ -813,6 +815,9 @@ static int handle_b9(S390CPU *cpu, struc
     int r = 0;
 
     switch (ipa1) {
+    case PRIV_B9_CLP:
+        r = kvm_clp_service_call(cpu, run);
+        break;
     case PRIV_B9_EQBS:
         /* just inject exception */
         r = -1;
--- /dev/null
+++ b/target-s390x/pci_ic.c
@@ -0,0 +1,230 @@
+/*
+ * s390 PCI intercepts
+ *
+ * Copyright 2014 IBM Corp.
+ * Author(s): Frank Blaschka <frank.blaschka@de.ibm.com>
+ *            Hong Bo Li <lihbbj@cn.ibm.com>
+ *            Yi Min Zhao <zyimin@cn.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+
+#include <linux/kvm.h>
+#include <asm/ptrace.h>
+#include <hw/pci/pci.h>
+#include <hw/pci/pci_host.h>
+#include <net/net.h>
+
+#include "qemu-common.h"
+#include "qemu/timer.h"
+#include "migration/qemu-file.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/kvm.h"
+#include "cpu.h"
+#include "sysemu/device_tree.h"
+#include "monitor/monitor.h"
+#include "pci_ic.h"
+
+#include "hw/hw.h"
+#include "hw/pci/pci.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci/pci_bus.h"
+#include "hw/pci/pci_host.h"
+#include "hw/s390x/s390-pci-bus.h"
+#include "exec/exec-all.h"
+
+/* #define DEBUG_S390PCI_IC */
+#ifdef DEBUG_S390PCI_IC
+#define DPRINTF(fmt, ...) \
+    do { fprintf(stderr, "s390pci_ic: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+static uint64_t resume_token;
+
+static int list_pci(ClpReqRspListPci *rrb, uint8_t *cc)
+{
+    S390PCIBusDevice *pbdev;
+    uint32_t res_code, initial_l2, g_l2, finish;
+    int rc, idx;
+
+    rc = 0;
+    if (be16_to_cpu(rrb->request.hdr.len) != 32) {
+        res_code = CLP_RC_LEN;
+        rc = -EINVAL;
+        goto out;
+    }
+
+    if ((be32_to_cpu(rrb->request.fmt) & CLP_MASK_FMT) != 0) {
+        res_code = CLP_RC_FMT;
+        rc = -EINVAL;
+        goto out;
+    }
+
+    if ((be32_to_cpu(rrb->request.fmt) & ~CLP_MASK_FMT) != 0 ||
+        rrb->request.reserved1 != 0 ||
+        rrb->request.reserved2 != 0) {
+        res_code = CLP_RC_RESNOT0;
+        rc = -EINVAL;
+        goto out;
+    }
+
+    if (be64_to_cpu(rrb->request.resume_token) == 0) {
+        resume_token = 0;
+    } else if (be64_to_cpu(rrb->request.resume_token) != resume_token) {
+        res_code = CLP_RC_LISTPCI_BADRT;
+        rc = -EINVAL;
+        goto out;
+    }
+
+    if (be16_to_cpu(rrb->response.hdr.len) < 48) {
+        res_code = CLP_RC_8K;
+        rc = -EINVAL;
+        goto out;
+    }
+
+    initial_l2 = be16_to_cpu(rrb->response.hdr.len);
+    if ((initial_l2 - LIST_PCI_HDR_LEN) % sizeof(ClpFhListEntry)
+        != 0) {
+        rc = -EINVAL;
+        goto out;
+    }
+
+    rrb->response.fmt = 0;
+    rrb->response.reserved1 = rrb->response.reserved2 = 0;
+    rrb->response.mdd = cpu_to_be32(FH_VIRT);
+    rrb->response.max_fn = cpu_to_be16(PCI_MAX_FUNCTIONS);
+    rrb->response.entry_size = sizeof(ClpFhListEntry);
+    finish = 0;
+    idx = resume_token;
+    g_l2 = LIST_PCI_HDR_LEN;
+    do {
+        pbdev = s390_pci_find_dev_by_idx(idx);
+        if (!pbdev) {
+            finish = 1;
+            break;
+        }
+        rrb->response.fh_list[idx - resume_token].device_id =
+            pci_get_word(pbdev->pdev->config + PCI_DEVICE_ID);
+        rrb->response.fh_list[idx - resume_token].vendor_id =
+            pci_get_word(pbdev->pdev->config + PCI_VENDOR_ID);
+        rrb->response.fh_list[idx - resume_token].config =
+            cpu_to_be32(0x80000000);
+        rrb->response.fh_list[idx - resume_token].fid = cpu_to_be32(pbdev->fid);
+        rrb->response.fh_list[idx - resume_token].fh = cpu_to_be32(pbdev->fh);
+
+        g_l2 += sizeof(ClpFhListEntry);
+        DPRINTF("g_l2 %d vendor id 0x%x device id 0x%x fid 0x%x fh 0x%x\n",
+            g_l2,
+            rrb->response.fh_list[idx - resume_token].vendor_id,
+            rrb->response.fh_list[idx - resume_token].device_id,
+            rrb->response.fh_list[idx - resume_token].fid,
+            rrb->response.fh_list[idx - resume_token].fh);
+        idx++;
+    } while (g_l2 < initial_l2);
+
+    if (finish == 1) {
+        resume_token = 0;
+    } else {
+        resume_token = idx;
+    }
+    rrb->response.resume_token = cpu_to_be64(resume_token);
+    rrb->response.hdr.len = cpu_to_be16(g_l2);
+    rrb->response.hdr.rsp = cpu_to_be16(CLP_RC_OK);
+out:
+    if (rc) {
+        DPRINTF("list pci failed rc 0x%x\n", rc);
+        rrb->response.hdr.rsp = cpu_to_be16(res_code);
+        *cc = 3;
+    }
+    return rc;
+}
+
+int kvm_clp_service_call(S390CPU *cpu, struct kvm_run *run)
+{
+    ClpReqHdr *reqh;
+    ClpRspHdr *resh;
+    S390PCIBusDevice *pbdev;
+    uint32_t req_len;
+    uint32_t res_len;
+    uint8_t *buffer;
+    uint8_t cc;
+    CPUS390XState *env = &cpu->env;
+    uint8_t r2 = (run->s390_sieic.ipb & 0x000f0000) >> 16;
+    int rc = 0;
+
+    buffer = g_malloc0(4096 * 2);
+    cpu_synchronize_state(CPU(cpu));
+
+    cpu_physical_memory_rw(env->regs[r2], buffer, sizeof(*reqh), 0);
+    reqh = (ClpReqHdr *)buffer;
+    req_len = be16_to_cpu(reqh->len);
+
+    cpu_physical_memory_rw(env->regs[r2], buffer, req_len + sizeof(*resh), 0);
+    resh = (ClpRspHdr *)(buffer + req_len);
+    res_len = be16_to_cpu(resh->len);
+
+    cpu_physical_memory_rw(env->regs[r2], buffer, req_len + res_len, 0);
+
+    switch (be16_to_cpu(reqh->cmd)) {
+    case CLP_LIST_PCI: {
+        ClpReqRspListPci *rrb = (ClpReqRspListPci *)buffer;
+        rc = list_pci(rrb, &cc);
+        break;
+    }
+    case CLP_SET_PCI_FN: {
+        ClpReqSetPci *reqsetpci = (ClpReqSetPci *)reqh;
+        ClpRspSetPci *ressetpci = (ClpRspSetPci *)resh;
+
+        pbdev = s390_pci_find_dev_by_fh(be32_to_cpu(reqsetpci->fh));
+        if (!pbdev) {
+            ressetpci->hdr.rsp = cpu_to_be16(CLP_RC_SETPCIFN_FH);
+            goto out;
+        }
+
+        switch (reqsetpci->oc) {
+        case CLP_SET_ENABLE_PCI_FN:
+            if (pbdev->is_virt) {
+                pbdev->fh = pbdev->fh | 1 << ENABLE_BIT_OFFSET;
+                ressetpci->fh = cpu_to_be32(pbdev->fh);
+                ressetpci->hdr.rsp = cpu_to_be16(CLP_RC_OK);
+            } else {
+                pbdev->fh = be32_to_cpu(ressetpci->fh);
+            }
+            break;
+        case CLP_SET_DISABLE_PCI_FN:
+            if (pbdev->is_virt) {
+                pbdev->fh = pbdev->fh & ~(1 << ENABLE_BIT_OFFSET);
+                ressetpci->fh = cpu_to_be32(pbdev->fh);
+                ressetpci->hdr.rsp = cpu_to_be16(CLP_RC_OK);
+            } else {
+                pbdev->fh = be32_to_cpu(ressetpci->fh);
+            }
+            break;
+        default:
+            DPRINTF("unknown set pci command\n");
+            ressetpci->hdr.rsp = cpu_to_be16(CLP_RC_SETPCIFN_FHOP);
+            break;
+        }
+        break;
+    }
+    default:
+        DPRINTF("unknown clp command\n");
+        resh->rsp = cpu_to_be16(CLP_RC_CMD);
+        break;
+    }
+
+out:
+    cpu_physical_memory_rw(env->regs[r2], buffer, req_len + res_len, 1);
+    g_free(buffer);
+    setcc(cpu, 0);
+    return rc;
+}
--- /dev/null
+++ b/target-s390x/pci_ic.h
@@ -0,0 +1,214 @@
+/*
+ * s390 PCI intercept definitions
+ *
+ * Copyright 2014 IBM Corp.
+ * Author(s): Frank Blaschka <frank.blaschka@de.ibm.com>
+ *            Hong Bo Li <lihbbj@cn.ibm.com>
+ *            Yi Min Zhao <zyimin@cn.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#ifndef PCI_IC_S390X_H
+#define PCI_IC_S390X_H
+
+/* CLP common request & response block size */
+#define CLP_BLK_SIZE 4096
+#define PCI_BAR_COUNT 6
+#define PCI_MAX_FUNCTIONS 4096
+
+typedef struct ClpReqHdr {
+    uint16_t len;
+    uint16_t cmd;
+} QEMU_PACKED ClpReqHdr;
+
+typedef struct ClpRspHdr {
+    uint16_t len;
+    uint16_t rsp;
+} QEMU_PACKED ClpRspHdr;
+
+/* CLP Response Codes */
+#define CLP_RC_OK         0x0010  /* Command request completed successfully */
+#define CLP_RC_CMD        0x0020  /* Command code not recognized */
+#define CLP_RC_PERM       0x0030  /* Command not authorized */
+#define CLP_RC_FMT        0x0040  /* Invalid command request format */
+#define CLP_RC_LEN        0x0050  /* Invalid command request length */
+#define CLP_RC_8K         0x0060  /* Command requires 8K LPCB */
+#define CLP_RC_RESNOT0    0x0070  /* Reserved field not zero */
+#define CLP_RC_NODATA     0x0080  /* No data available */
+#define CLP_RC_FC_UNKNOWN 0x0100  /* Function code not recognized */
+
+/*
+ * Call Logical Processor - Command Codes
+ */
+#define CLP_LIST_PCI            0x0002
+#define CLP_QUERY_PCI_FN        0x0003
+#define CLP_QUERY_PCI_FNGRP     0x0004
+#define CLP_SET_PCI_FN          0x0005
+
+/* PCI function handle list entry */
+typedef struct ClpFhListEntry {
+    uint16_t device_id;
+    uint16_t vendor_id;
+#define CLP_FHLIST_MASK_CONFIG 0x80000000
+    uint32_t config;
+    uint32_t fid;
+    uint32_t fh;
+} QEMU_PACKED ClpFhListEntry;
+
+#define CLP_RC_SETPCIFN_FH      0x0101 /* Invalid PCI fn handle */
+#define CLP_RC_SETPCIFN_FHOP    0x0102 /* Fn handle not valid for op */
+#define CLP_RC_SETPCIFN_DMAAS   0x0103 /* Invalid DMA addr space */
+#define CLP_RC_SETPCIFN_RES     0x0104 /* Insufficient resources */
+#define CLP_RC_SETPCIFN_ALRDY   0x0105 /* Fn already in requested state */
+#define CLP_RC_SETPCIFN_ERR     0x0106 /* Fn in permanent error state */
+#define CLP_RC_SETPCIFN_RECPND  0x0107 /* Error recovery pending */
+#define CLP_RC_SETPCIFN_BUSY    0x0108 /* Fn busy */
+#define CLP_RC_LISTPCI_BADRT    0x010a /* Resume token not recognized */
+#define CLP_RC_QUERYPCIFG_PFGID 0x010b /* Unrecognized PFGID */
+
+/* request or response block header length */
+#define LIST_PCI_HDR_LEN 32
+
+/* Number of function handles fitting in response block */
+#define CLP_FH_LIST_NR_ENTRIES \
+    ((CLP_BLK_SIZE - 2 * LIST_PCI_HDR_LEN) \
+        / sizeof(ClpFhListEntry))
+
+#define CLP_SET_ENABLE_PCI_FN  0 /* Yes, 0 enables it */
+#define CLP_SET_DISABLE_PCI_FN 1 /* Yes, 1 disables it */
+
+#define CLP_UTIL_STR_LEN 64
+
+#define CLP_MASK_FMT 0xf0000000
+
+/* List PCI functions request */
+typedef struct ClpReqListPci {
+    ClpReqHdr hdr;
+    uint32_t fmt;
+    uint64_t reserved1;
+    uint64_t resume_token;
+    uint64_t reserved2;
+} QEMU_PACKED ClpReqListPci;
+
+/* List PCI functions response */
+typedef struct ClpRspListPci {
+    ClpRspHdr hdr;
+    uint32_t fmt;
+    uint64_t reserved1;
+    uint64_t resume_token;
+    uint32_t mdd;
+    uint16_t max_fn;
+    uint8_t reserved2;
+    uint8_t entry_size;
+    ClpFhListEntry fh_list[CLP_FH_LIST_NR_ENTRIES];
+} QEMU_PACKED ClpRspListPci;
+
+/* Query PCI function request */
+typedef struct ClpReqQueryPci {
+    ClpReqHdr hdr;
+    uint32_t fmt;
+    uint64_t reserved1;
+    uint32_t fh; /* function handle */
+    uint32_t reserved2;
+    uint64_t reserved3;
+} QEMU_PACKED ClpReqQueryPci;
+
+/* Query PCI function response */
+typedef struct ClpRspQueryPci {
+    ClpRspHdr hdr;
+    uint32_t fmt;
+    uint64_t reserved1;
+    uint16_t vfn; /* virtual fn number */
+#define CLP_RSP_QPCI_MASK_UTIL  0x100
+#define CLP_RSP_QPCI_MASK_PFGID 0xff
+    uint16_t ug;
+    uint32_t fid; /* pci function id */
+    uint8_t bar_size[PCI_BAR_COUNT];
+    uint16_t pchid;
+    uint32_t bar[PCI_BAR_COUNT];
+    uint64_t reserved2;
+    uint64_t sdma; /* start dma as */
+    uint64_t edma; /* end dma as */
+    uint64_t reserved3[6];
+    uint8_t util_str[CLP_UTIL_STR_LEN]; /* utility string */
+} QEMU_PACKED ClpRspQueryPci;
+
+/* Query PCI function group request */
+typedef struct ClpReqQueryPciGrp {
+    ClpReqHdr hdr;
+    uint32_t fmt;
+    uint64_t reserved1;
+#define CLP_REQ_QPCIG_MASK_PFGID 0xff
+    uint32_t g;
+    uint32_t reserved2;
+    uint64_t reserved3;
+} QEMU_PACKED ClpReqQueryPciGrp;
+
+/* Query PCI function group response */
+typedef struct ClpRspQueryPciGrp {
+    ClpRspHdr hdr;
+    uint32_t fmt;
+    uint64_t reserved1;
+#define CLP_RSP_QPCIG_MASK_NOI 0xfff
+    uint16_t i;
+    uint8_t version;
+#define CLP_RSP_QPCIG_MASK_FRAME   0x2
+#define CLP_RSP_QPCIG_MASK_REFRESH 0x1
+    uint8_t fr;
+    uint16_t reserved2;
+    uint16_t mui;
+    uint64_t reserved3;
+    uint64_t dasm; /* dma address space mask */
+    uint64_t msia; /* MSI address */
+    uint64_t reserved4;
+    uint64_t reserved5;
+} QEMU_PACKED ClpRspQueryPciGrp;
+
+/* Set PCI function request */
+typedef struct ClpReqSetPci {
+    ClpReqHdr hdr;
+    uint32_t fmt;
+    uint64_t reserved1;
+    uint32_t fh; /* function handle */
+    uint16_t reserved2;
+    uint8_t oc; /* operation controls */
+    uint8_t ndas; /* number of dma spaces */
+    uint64_t reserved3;
+} QEMU_PACKED ClpReqSetPci;
+
+/* Set PCI function response */
+typedef struct ClpRspSetPci {
+    ClpRspHdr hdr;
+    uint32_t fmt;
+    uint64_t reserved1;
+    uint32_t fh; /* function handle */
+    uint32_t reserved3;
+    uint64_t reserved4;
+} QEMU_PACKED ClpRspSetPci;
+
+typedef struct ClpReqRspListPci {
+    ClpReqListPci request;
+    ClpRspListPci response;
+} QEMU_PACKED ClpReqRspListPci;
+
+typedef struct ClpReqRspSetPci {
+    ClpReqSetPci request;
+    ClpRspSetPci response;
+} QEMU_PACKED ClpReqRspSetPci;
+
+typedef struct ClpReqRspQueryPci {
+    ClpReqQueryPci request;
+    ClpRspQueryPci response;
+} QEMU_PACKED ClpReqRspQueryPci;
+
+typedef struct ClpReqRspQueryPciGrp {
+    ClpReqQueryPciGrp request;
+    ClpRspQueryPciGrp response;
+} QEMU_PACKED ClpReqRspQueryPciGrp;
+
+int kvm_clp_service_call(S390CPU *cpu, struct kvm_run *run);
+
+#endif

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Qemu-devel] [RFC][patch 6/6] s390: Add PCI pass-through device support
  2014-09-04 10:52 [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 frank.blaschka
                   ` (4 preceding siblings ...)
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 5/6] s390: Add PCI bus support frank.blaschka
@ 2014-09-04 10:52 ` frank.blaschka
  2014-09-04 13:16 ` [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 Alex Williamson
  2014-09-05  8:21 ` Alexander Graf
  7 siblings, 0 replies; 21+ messages in thread
From: frank.blaschka @ 2014-09-04 10:52 UTC (permalink / raw)
  To: qemu-devel, linux-s390, kvm; +Cc: aik, pbonzini, agraf

[-- Attachment #1: 102-qemu_s390pci.patch --]
[-- Type: text/plain, Size: 11395 bytes --]

From: Frank Blaschka <frank.blaschka@de.ibm.com>

This patch adds a new device class handling s390 pci pass-through device
assignment. The approach is very similar to the x86 device assignment.
The device executes the KVM_ASSIGN_PCI_DEVICE ioctl to create a proxy instance
in the kernel KVM and connect this instance to the host pci device.

Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
---
 hw/s390x/Makefile.objs  |    2 
 hw/s390x/s390-pci-bus.c |   14 +-
 hw/s390x/s390_pci.c     |  321 ++++++++++++++++++++++++++++++++++++++++++++++++
 hw/s390x/s390_pci.h     |   31 ++++
 4 files changed, 365 insertions(+), 3 deletions(-)

--- a/hw/s390x/Makefile.objs
+++ b/hw/s390x/Makefile.objs
@@ -8,4 +8,4 @@ obj-y += ipl.o
 obj-y += css.o
 obj-y += s390-virtio-ccw.o
 obj-y += virtio-ccw.o
-obj-$(CONFIG_KVM) += s390-pci-bus.o
+obj-$(CONFIG_KVM) += s390-pci-bus.o s390_pci.o
--- a/hw/s390x/s390-pci-bus.c
+++ b/hw/s390x/s390-pci-bus.c
@@ -16,6 +16,7 @@
 #include <hw/s390x/sclp.h>
 #include "qemu/error-report.h"
 #include "s390-pci-bus.h"
+#include "s390_pci.h"
 
 /* #define DEBUG_S390PCI_BUS */
 #ifdef DEBUG_S390PCI_BUS
@@ -219,8 +220,17 @@ static void s390_pcihost_hot_plug(Hotplu
     pbdev->pdev = pci_dev;
     pbdev->configured = true;
 
-    pbdev->fh = s390_pci_get_pfh(pci_dev);
-    pbdev->is_virt = 1;
+    if (!strcmp(pci_dev->name, "s390-pci")) {
+        S390PCIDevice *sdev = DO_UPCAST(S390PCIDevice, pdev, pci_dev);
+        pbdev->fh = s390_pci_get_fh(sdev->host);
+        if (!pbdev->fh) {
+            g_free(pbdev);
+            return;
+        }
+    } else {
+        pbdev->fh = s390_pci_get_pfh(pci_dev);
+        pbdev->is_virt = 1;
+    }
 
     QTAILQ_INSERT_TAIL(&device_list, pbdev, next);
     if (dev->hotplugged) {
--- /dev/null
+++ b/hw/s390x/s390_pci.c
@@ -0,0 +1,321 @@
+/*
+ * s390 PCI pass-through device assignment
+ *
+ * Copyright 2014 IBM Corp.
+ * Author(s): Frank Blaschka <frank.blaschka@de.ibm.com>
+ *            Hong Bo Li <lihbbj@cn.ibm.com>
+ *            Yi Min Zhao <zyimin@cn.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#include <hw/pci/pci.h>
+#include <hw/pci/pci_host.h>
+#include <hw/pci/pci_bus.h>
+#include <net/net.h>
+#include <hw/s390x/css.h>
+#include <hw/s390x/sclp.h>
+#include "exec/exec-all.h"
+#include "sysemu/sysemu.h"
+#include "exec/address-spaces.h"
+#include "qemu/error-report.h"
+#include "qapi/qmp/qerror.h"
+
+#include "s390_pci.h"
+#include "s390-pci-bus.h"
+
+/* #define DEBUG_S390PCI */
+#ifdef DEBUG_S390PCI
+#define DPRINTF(fmt, ...) \
+    do { fprintf(stderr, "s390pci: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+#define ASSIGN_FLAG_HOSTIRQ 0x1
+
+uint32_t s390_pci_get_fh(PCIHostDeviceAddress host)
+{
+    char fh_path[128];
+    struct stat st;
+    FILE *fd;
+    uint32_t fh;
+
+    snprintf(fh_path, sizeof(fh_path),
+        "/sys/bus/pci/devices/%04x:%02x:%02x.%x/function_handle",
+        host.domain, host.bus, host.slot, host.function);
+
+    if (stat(fh_path, &st)) {
+        error_report("get function handle faild: no host device specified");
+        return 0;
+    }
+
+    fd = fopen(fh_path, "r");
+    if (fd == NULL) {
+        error_report("%s: %s: %m", __func__, fh_path);
+        return 0;
+    }
+    if (fscanf(fd, "%x", &fh) != 1) {
+        fclose(fd);
+        return 0;
+    }
+    fclose(fd);
+    return fh;
+}
+
+uint32_t s390_pci_get_fid(PCIHostDeviceAddress host)
+{
+    char fid_path[128];
+    struct stat st;
+    FILE *fd;
+    uint32_t fid;
+
+    snprintf(fid_path, sizeof(fid_path),
+        "/sys/bus/pci/devices/%04x:%02x:%02x.%x/function_id",
+        host.domain, host.bus, host.slot, host.function);
+
+    if (stat(fid_path, &st)) {
+        error_report("get function id faild: no host device specified");
+        return -1;
+    }
+
+    fd = fopen(fid_path, "r");
+    if (fd == NULL) {
+        error_report("%s: %s: %m", __func__, fid_path);
+        return -1;
+    }
+    if (fscanf(fd, "%x", &fid) != 1) {
+        fclose(fd);
+        return -1;
+    }
+    fclose(fd);
+    return fid;
+}
+
+static int get_real_id(const char *devpath, const char *idname, uint16_t *val)
+{
+    FILE *f;
+    char name[128];
+    long id;
+
+    snprintf(name, sizeof(name), "%s%s", devpath, idname);
+    f = fopen(name, "r");
+    if (f == NULL) {
+        error_report("%s: %s: %m", __func__, name);
+        return -1;
+    }
+    if (fscanf(f, "%li\n", &id) == 1) {
+        *val = id;
+    } else {
+        fclose(f);
+        return -1;
+    }
+    fclose(f);
+
+    return 0;
+}
+
+static int get_real_vendor_id(const char *devpath, uint16_t *val)
+{
+    return get_real_id(devpath, "vendor", val);
+}
+
+static int get_real_device_id(const char *devpath, uint16_t *val)
+{
+    return get_real_id(devpath, "device", val);
+}
+
+static void assign_failed_examine(S390PCIDevice *dev)
+{
+    char name[PATH_MAX], dir[PATH_MAX], driver[PATH_MAX] = {}, *ns;
+    uint16_t vendor_id, device_id;
+    int rc;
+
+    snprintf(dir, sizeof(dir), "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/",
+            dev->host.domain, dev->host.bus, dev->host.slot,
+            dev->host.function);
+
+    snprintf(name, sizeof(name), "%sdriver", dir);
+
+    rc = readlink(name, driver, sizeof(driver));
+    if ((rc <= 0) || rc >= sizeof(driver)) {
+        goto fail;
+    }
+
+    driver[rc] = 0;
+    ns = strrchr(driver, '/');
+    if (!ns) {
+        goto fail;
+    }
+
+    ns++;
+
+    if (get_real_vendor_id(dir, &vendor_id) ||
+        get_real_device_id(dir, &device_id)) {
+        goto fail;
+    }
+
+    error_printf("*** The driver '%s' is occupying your device "
+        "%04x:%02x:%02x.%x.\n"
+        "***\n"
+        "*** You can try the following commands to free it:\n"
+        "***\n"
+        "*** $ echo \"%04x %04x\" > /sys/bus/pci/drivers/pci-stub/new_id\n"
+        "*** $ echo \"%04x:%02x:%02x.%x\" > /sys/bus/pci/drivers/%s/unbind\n"
+        "*** $ echo \"%04x:%02x:%02x.%x\" > /sys/bus/pci/drivers/"
+        "pci-stub/bind\n"
+        "*** $ echo \"%04x %04x\" > /sys/bus/pci/drivers/pci-stub/remove_id\n"
+        "***\n",
+        ns, dev->host.domain, dev->host.bus, dev->host.slot,
+        dev->host.function, vendor_id, device_id,
+        dev->host.domain, dev->host.bus, dev->host.slot, dev->host.function,
+        ns, dev->host.domain, dev->host.bus, dev->host.slot,
+        dev->host.function, vendor_id, device_id);
+
+    return;
+
+fail:
+    error_report("Couldn't find out why.");
+}
+
+static int s390_initfn(PCIDevice *pdev)
+{
+    char dir[128], name[128];
+    struct stat st;
+    int cfd;
+    int rc;
+    S390PCIDevice *vdev = DO_UPCAST(S390PCIDevice, pdev, pdev);
+    struct kvm_assigned_pci_dev dev_data = {
+        .segnr = vdev->host.domain,
+        .busnr = vdev->host.bus,
+        .devfn = PCI_DEVFN(vdev->host.slot, vdev->host.function),
+        .flags = 0,
+    };
+
+    if (!kvm_enabled()) {
+        error_report("s390pci-assign: error: requires KVM support");
+        return -1;
+    }
+
+    snprintf(dir, sizeof(dir),
+        "/sys/bus/pci/devices/%04x:%02x:%02x.%x/",
+        vdev->host.domain, vdev->host.bus, vdev->host.slot,
+        vdev->host.function);
+
+    if (stat(dir, &st)) {
+        error_report("s390pci-assign: error: no host device specified");
+        return -1;
+    }
+
+    snprintf(name, sizeof(name), "%sconfig", dir);
+
+    cfd = open(name, O_RDWR);
+    if (cfd == -1) {
+        error_report("%s: %s: %m", __func__, name);
+        return -1;
+    }
+
+    do {
+        rc = read(cfd, pdev->config, pci_config_size(pdev));
+        if (rc >= 0) {
+            break;
+        }
+    } while (errno == EINTR || errno == EAGAIN);
+
+    if (rc < 0) {
+        error_report("%s: read failed, errno = %d", __func__, errno);
+        close(cfd);
+        return -1;
+    }
+    close(cfd);
+
+    dev_data.assigned_dev_id =
+        (vdev->host.domain << 16) | (vdev->host.bus << 8) | dev_data.devfn;
+
+    DPRINTF("%04x:%02x:%02x.%x fid 0x%x fh 0x%x dev_id 0x%x\n",
+        vdev->host.domain, vdev->host.bus, vdev->host.slot,
+        vdev->host.function, s390_pci_get_fid(vdev->host),
+        s390_pci_get_fh(vdev->host), dev_data.assigned_dev_id);
+
+    if (vdev->hostirq) {
+        dev_data.flags |= ASSIGN_FLAG_HOSTIRQ;
+    }
+
+    rc = kvm_vm_ioctl(kvm_state, KVM_ASSIGN_PCI_DEVICE, &dev_data);
+    if (rc) {
+        error_report("Failed to assign device \"0x%x\" : %s",
+                     dev_data.assigned_dev_id, strerror(-rc));
+        switch (rc) {
+        case -EBUSY:
+            assign_failed_examine(vdev);
+            break;
+        default:
+            break;
+        }
+        return rc;
+    }
+
+    vdev->dev_id = dev_data.assigned_dev_id;
+    return rc;
+}
+
+static void s390_exitfn(PCIDevice *pdev)
+{
+    int rc;
+
+    S390PCIDevice *vdev = DO_UPCAST(S390PCIDevice, pdev, pdev);
+    struct kvm_assigned_pci_dev dev_data = {
+        .assigned_dev_id = vdev->dev_id,
+    };
+
+    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __func__, vdev->host.domain,
+            vdev->host.bus, vdev->host.slot, vdev->host.function);
+
+    rc = kvm_vm_ioctl(kvm_state, KVM_DEASSIGN_PCI_DEVICE, &dev_data);
+    assert(rc == 0);
+}
+
+static void s390_pci_reset(DeviceState *dev)
+{
+}
+
+static Property s390_pci_dev_properties[] = {
+    DEFINE_PROP_PCI_HOST_DEVADDR("host", S390PCIDevice, host),
+    DEFINE_PROP_UINT32("hostirq", S390PCIDevice, hostirq, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static const VMStateDescription s390_pci_vmstate = {
+    .name = "s390-pci",
+    .unmigratable = 1,
+};
+
+static void s390_pci_dev_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+
+    dc->reset = s390_pci_reset;
+    dc->props = s390_pci_dev_properties;
+    dc->vmsd = &s390_pci_vmstate;
+    dc->desc = "s390-based PCI device assignment";
+    pdc->init = s390_initfn;
+    pdc->exit = s390_exitfn;
+    pdc->is_express = 1;
+}
+
+static const TypeInfo s390_pci_dev_info = {
+    .name = "s390-pci",
+    .parent = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(S390PCIDevice),
+    .class_init = s390_pci_dev_class_init,
+};
+
+static void register_s390_pci_dev_type(void)
+{
+    type_register_static(&s390_pci_dev_info);
+}
+
+type_init(register_s390_pci_dev_type)
--- /dev/null
+++ b/hw/s390x/s390_pci.h
@@ -0,0 +1,31 @@
+/*
+ * s390 PCI pass-through device assignment definitions
+ *
+ * Copyright 2014 IBM Corp.
+ * Author(s): Frank Blaschka <frank.blaschka@de.ibm.com>
+ *            Hong Bo Li <lihbbj@cn.ibm.com>
+ *            Yi Min Zhao <zyimin@cn.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or (at
+ * your option) any later version. See the COPYING file in the top-level
+ * directory.
+ */
+
+#ifndef HW_S390_PCI_H
+#define HW_S390_PCI_H
+
+#include <hw/pci/pci.h>
+
+typedef struct S390PCIDevice {
+    PCIDevice pdev;
+    PCIHostDeviceAddress host;
+    QLIST_ENTRY(S390PCIDevice) next;
+    uint32_t dev_id;
+    uint32_t fid;
+    uint32_t hostirq;
+} S390PCIDevice;
+
+uint32_t s390_pci_get_fh(PCIHostDeviceAddress host);
+uint32_t s390_pci_get_fid(PCIHostDeviceAddress host);
+
+#endif

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390
  2014-09-04 10:52 [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 frank.blaschka
                   ` (5 preceding siblings ...)
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 6/6] s390: Add PCI pass-through device support frank.blaschka
@ 2014-09-04 13:16 ` Alex Williamson
  2014-09-05  7:46   ` Frank Blaschka
  2014-09-05  8:21 ` Alexander Graf
  7 siblings, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2014-09-04 13:16 UTC (permalink / raw)
  To: frank.blaschka; +Cc: linux-s390, kvm, aik, agraf, qemu-devel, pbonzini

On Thu, 2014-09-04 at 12:52 +0200, frank.blaschka@de.ibm.com wrote:
> This set of patches implements pci pass-through support for qemu/KVM on s390.
> PCI support on s390 is very different from other platforms.
> Major differences are:
> 
> 1) all PCI operations are driven by special s390 instructions

Generating config cycles is always arch specific.

> 2) all s390 PCI instructions are privileged

While the operations to generate config cycles on x86 are not
privileged, they must be arbitrated between accesses, so in a sense
they're privileged.

> 3) PCI config and memory spaces can not be mmap'ed

VFIO has mapping flags that allow any region to specify mmap support.

> 4) no classic interrupts (INTX, MSI). The pci hw understands the concept
>    of requesting MSIX irqs but irqs are delivered as s390 adapter irqs.

VFIO delivers interrupts as eventfds regardless of the underlying
platform mechanism.

> 5) For DMA access there is always an IOMMU required.

x86 requires the same.

>  s390 pci implementation
>    does not support a complete memory to iommu mapping, dma mappings are
>    created on request.

Sounds like POWER.

> 6) The OS does not get any informations about the physical layout
>    of the PCI bus.

If that means that every device is isolated (seems unlikely for
multifunction devices) then that makes IOMMU group support really easy.

> 7) To take advantage of system z specific virtualization features
>    we need to access the SIE control block residing in the kernel KVM

The KVM-VFIO device allows interaction between VFIO devices and KVM.

> 8) To enable system z specific virtualization features we have to manipulate
>    the zpci device in kernel.

VFIO supports different device backends, currently pci_dev and working
towards platform devices.  zpci might just be an extension to standard
pci.

> For this reasons I decided to implement a kernel based approach similar
> to x86 device assignment. There is a new qemu device (s390-pci) representing a
> pass through device on the host. Here is a sample qemu device configuration:
> 
> -device s390-pci,host=0000:00:00.0
> 
> The device executes the KVM_ASSIGN_PCI_DEVICE ioctl to create a proxy instance
> in the kernel KVM and connect this instance to the host pci device.
> 
> kernel patches apply to linux-kvm
> 
> s390: cio: chsc function to register GIB
> s390: pci: export pci functions for pass-through usage
> KVM: s390: Add GISA support
> KVM: s390: Add PCI pass-through support
> 
> qemu patches apply to qemu-master
> 
> s390: Add PCI bus support
> s390: Add PCI pass-through device support
> 
> Feedback and discussion is highly welcome ...

KVM-based device assignment needs to go away.  It's a horrible model for
devices, it offers very little protection to the kernel, assumes every
device is fully isolated and visible to the IOMMU, relies on smattering
of sysfs files to operate, etc.  x86, POWER, and ARM are all moving to
VFIO-based device assignment.  Why is s390 special enough to repeat all
the mistakes that x86 did?  Thanks,

Alex


* Re: [Qemu-devel] [RFC][patch 3/6] KVM: s390: Add GISA support
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 3/6] KVM: s390: Add GISA support frank.blaschka
@ 2014-09-04 14:19   ` Heiko Carstens
  2014-09-05  8:29   ` Alexander Graf
  1 sibling, 0 replies; 21+ messages in thread
From: Heiko Carstens @ 2014-09-04 14:19 UTC (permalink / raw)
  To: frank.blaschka; +Cc: linux-s390, kvm, aik, agraf, qemu-devel, pbonzini

On Thu, Sep 04, 2014 at 12:52:26PM +0200, frank.blaschka@de.ibm.com wrote:
> +void kvm_s390_gisa_register_alert(struct kvm *kvm, u32 gisc)
> +{
> +	int bito = BITS_PER_BYTE * 7 + gisc;
> +
> +	set_bit(bito ^ (BITS_PER_LONG - 1), &kvm->arch.iam);
> +}

Just a very minor nit: you could also use set_bit_inv() & friends.

> +static inline u64 kvm_s390_get_base_disp_rxy(struct kvm_vcpu *vcpu)
> +{
> +	u32 x2 = (vcpu->arch.sie_block->ipa & 0x000f);
> +	u32 base2 = vcpu->arch.sie_block->ipb >> 28;
> +	u32 disp2 = ((vcpu->arch.sie_block->ipb & 0x0fff0000) >> 16) +
> +		((vcpu->arch.sie_block->ipb & 0xff00) << 4);
> +
> +	return (base2 ? vcpu->run->s.regs.gprs[base2] : 0) +
> +		(x2 ? vcpu->run->s.regs.gprs[x2] : 0) + (u64)disp2;
> +}

Not very readable ;) However, for the RXY instruction format the 20-bit
displacement is usually signed, not unsigned as your code seems to
treat it.


* Re: [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390
  2014-09-04 13:16 ` [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 Alex Williamson
@ 2014-09-05  7:46   ` Frank Blaschka
  2014-09-05  8:35     ` Alexander Graf
  0 siblings, 1 reply; 21+ messages in thread
From: Frank Blaschka @ 2014-09-05  7:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linux-s390, frank.blaschka, kvm, aik, qemu-devel, agraf, pbonzini

On Thu, Sep 04, 2014 at 07:16:24AM -0600, Alex Williamson wrote:
> On Thu, 2014-09-04 at 12:52 +0200, frank.blaschka@de.ibm.com wrote:
> > This set of patches implements pci pass-through support for qemu/KVM on s390.
> > PCI support on s390 is very different from other platforms.
> > Major differences are:
> > 
> > 1) all PCI operations are driven by special s390 instructions
> 
> Generating config cycles is always arch specific.
> 
> > 2) all s390 PCI instructions are privileged
> 
> While the operations to generate config cycles on x86 are not
> privileged, they must be arbitrated between accesses, so in a sense
> they're privileged.
> 
> > 3) PCI config and memory spaces can not be mmap'ed
> 
> VFIO has mapping flags that allow any region to specify mmap support.
>

Hi Alex,

thx for your reply.

Let me elaborate a little bit more on 1 - 3. Config and memory space can not
be accessed via memory operations; you have to use special s390 instructions.
These instructions can not be executed in user space, so there is no other
way than executing them in the kernel. Yes, vfio does support a
slow path via ioctl we could use, but this seems suboptimal from a performance
point of view.
 
> > 4) no classic interrupts (INTX, MSI). The pci hw understands the concept
> >    of requesting MSIX irqs but irqs are delivered as s390 adapter irqs.
> 
> VFIO delivers interrupts as eventfds regardless of the underlying
> platform mechanism.
> 

yes that's right, but then we have to do platform specific stuff to present
the irq to the guest. I do not say this is impossible, but we would have to
add s390 specific code to vfio.

> > 5) For DMA access there is always an IOMMU required.
> 
> x86 requires the same.
> 
> >  s390 pci implementation
> >    does not support a complete memory to iommu mapping, dma mappings are
> >    created on request.
> 
> Sounds like POWER.

I don't know the details of POWER; maybe it is similar, but not the same.
We might be able to extend vfio with a new interface allowing
us to do DMA mappings on request.

> 
> > 6) The OS does not get any informations about the physical layout
> >    of the PCI bus.
> 
> If that means that every device is isolated (seems unlikely for
> multifunction devices) then that makes IOMMU group support really easy.
>

OK
 
> > 7) To take advantage of system z specific virtualization features
> >    we need to access the SIE control block residing in the kernel KVM
> 
> The KVM-VFIO device allows interaction between VFIO devices and KVM.
> 
> > 8) To enable system z specific virtualization features we have to manipulate
> >    the zpci device in kernel.
> 
> VFIO supports different device backends, currently pci_dev and working
> towards platform devices.  zpci might just be an extension to standard
> pci.
> 

7 - 8: At least this is not as straightforward as the pure kernel approach, but
I will have to dig into that in more detail once we agree on a vfio solution.

> > For this reasons I decided to implement a kernel based approach similar
> > to x86 device assignment. There is a new qemu device (s390-pci) representing a
> > pass through device on the host. Here is a sample qemu device configuration:
> > 
> > -device s390-pci,host=0000:00:00.0
> > 
> > The device executes the KVM_ASSIGN_PCI_DEVICE ioctl to create a proxy instance
> > in the kernel KVM and connect this instance to the host pci device.
> > 
> > kernel patches apply to linux-kvm
> > 
> > s390: cio: chsc function to register GIB
> > s390: pci: export pci functions for pass-through usage
> > KVM: s390: Add GISA support
> > KVM: s390: Add PCI pass-through support
> > 
> > qemu patches apply to qemu-master
> > 
> > s390: Add PCI bus support
> > s390: Add PCI pass-through device support
> > 
> > Feedback and discussion is highly welcome ...
> 
> KVM-based device assignment needs to go away.  It's a horrible model for
> devices, it offers very little protection to the kernel, assumes every
> device is fully isolated and visible to the IOMMU, relies on smattering
> of sysfs files to operate, etc.  x86, POWER, and ARM are all moving to
> VFIO-based device assignment.  Why is s390 special enough to repeat all
> the mistakes that x86 did?  Thanks,
> 

Is this your personal opinion or was this a strategic decision of the
QEMU/KVM community? Can anybody give us direction about this?

Actually I can understand your point. In the last weeks I did some development
and testing regarding the use of vfio too, but the in-kernel solution seems to
offer the best performance and the most straightforward implementation for our
platform.

Greetings,

Frank

> Alex
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


* Re: [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390
  2014-09-04 10:52 [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 frank.blaschka
                   ` (6 preceding siblings ...)
  2014-09-04 13:16 ` [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 Alex Williamson
@ 2014-09-05  8:21 ` Alexander Graf
  2014-09-05 11:39   ` Frank Blaschka
  7 siblings, 1 reply; 21+ messages in thread
From: Alexander Graf @ 2014-09-05  8:21 UTC (permalink / raw)
  To: frank.blaschka, qemu-devel, linux-s390, kvm
  Cc: aik, pbonzini, Alex Williamson



On 04.09.14 12:52, frank.blaschka@de.ibm.com wrote:
> This set of patches implements pci pass-through support for qemu/KVM on s390.
> PCI support on s390 is very different from other platforms.
> Major differences are:
> 
> 1) all PCI operations are driven by special s390 instructions
> 2) all s390 PCI instructions are privileged
> 3) PCI config and memory spaces can not be mmap'ed

That's ok, vfio abstracts config space anyway.

> 4) no classic interrupts (INTX, MSI). The pci hw understands the concept
>    of requesting MSIX irqs but irqs are delivered as s390 adapter irqs.

This is in line with other implementations. Interrupts go from

  device -> PHB -> PIC -> CPU

(some times you can have another converter device in between)

In your case, the PHB converts INTX and MSI interrupts to Adapter
interrupts to go to the floating interrupt controller. Same thing as
everyone else really.

> 5) For DMA access there is always an IOMMU required. s390 pci implementation
>    does not support a complete memory to iommu mapping, dma mappings are
>    created on request.

Sounds great :). So I suppose we should implement a guest facing IOMMU?

> 6) The OS does not get any informations about the physical layout
>    of the PCI bus.

So how does it know whether different devices are behind the same IOMMU
context? Or can we assume that every device has its own context?

> 7) To take advantage of system z specific virtualization features
>    we need to access the SIE control block residing in the kernel KVM

Please elaborate.

> 8) To enable system z specific virtualization features we have to manipulate
>    the zpci device in kernel.

Why?

> 
> For this reasons I decided to implement a kernel based approach similar
> to x86 device assignment. There is a new qemu device (s390-pci) representing a

I fail to see the rationale and I definitely don't want to see anything
even remotely similar to the legacy x86 device assignment on s390 ;).

Can't we just enhance VFIO?

Also, I think we'll get the cleanest model if we start off with an
implementation that allows us to add emulated PCI devices to an s390x
machine and only then follow on with physical ones.


Alex


* Re: [Qemu-devel] [RFC][patch 3/6] KVM: s390: Add GISA support
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 3/6] KVM: s390: Add GISA support frank.blaschka
  2014-09-04 14:19   ` Heiko Carstens
@ 2014-09-05  8:29   ` Alexander Graf
  2014-09-05 10:52     ` Frank Blaschka
  1 sibling, 1 reply; 21+ messages in thread
From: Alexander Graf @ 2014-09-05  8:29 UTC (permalink / raw)
  To: frank.blaschka, qemu-devel, linux-s390, kvm; +Cc: aik, pbonzini



On 04.09.14 12:52, frank.blaschka@de.ibm.com wrote:
> From: Frank Blaschka <frank.blaschka@de.ibm.com>
> 
> This patch adds GISA (Guest Interrupt State Area) support
> to s390 kvm. GISA can be used for exitless interrupts. The
> patch provides a set of functions for GISA related operations
> like accessing GISA fields or registering ISCs for alert.
> Exploiters of GISA will follow with additional patches.
> 
> Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>

That's a nice feature. However, please make sure that you maintain the
abstraction levels.

What should happen is that you request an irqfd from FLIC. Then you
associate that irqfd with the PCI device.

Thanks to that association, both parties can now talk to each other and
negotiate their GISA number space and make sure things are connected.

However, it should always be possible to do things without this direct
IRQ injection.

So you should be able to receive an irqfd event when an IRQ happened, so
that VFIO user space applications can also handle interrupts for example.

And the same applies for interrupt injection. We also need to be able to
inject an adapter interrupt from QEMU for emulated devices ;).


Alex


* Re: [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390
  2014-09-05  7:46   ` Frank Blaschka
@ 2014-09-05  8:35     ` Alexander Graf
  2014-09-05 11:55       ` Frank Blaschka
  0 siblings, 1 reply; 21+ messages in thread
From: Alexander Graf @ 2014-09-05  8:35 UTC (permalink / raw)
  To: Frank Blaschka, Alex Williamson
  Cc: linux-s390, frank.blaschka, kvm, aik, qemu-devel, pbonzini



On 05.09.14 09:46, Frank Blaschka wrote:
> On Thu, Sep 04, 2014 at 07:16:24AM -0600, Alex Williamson wrote:
>> On Thu, 2014-09-04 at 12:52 +0200, frank.blaschka@de.ibm.com wrote:
>>> This set of patches implements pci pass-through support for qemu/KVM on s390.
>>> PCI support on s390 is very different from other platforms.
>>> Major differences are:
>>>
>>> 1) all PCI operations are driven by special s390 instructions
>>
>> Generating config cycles is always arch specific.
>>
>>> 2) all s390 PCI instructions are privileged
>>
>> While the operations to generate config cycles on x86 are not
>> privileged, they must be arbitrated between accesses, so in a sense
>> they're privileged.
>>
>>> 3) PCI config and memory spaces can not be mmap'ed
>>
>> VFIO has mapping flags that allow any region to specify mmap support.
>>
> 
> Hi Alex,
> 
> thx for your reply.
> 
> Let me elaborate a little bit ore on 1 - 3. Config and memory space can not
> be accessed via memory operations. You have to use special s390 instructions.
> This instructions can not be executed in user space. So there is no other
> way than executing this instructions in kernel. Yes vfio does support a
> slow path via ioctrl we could use, but this seems suboptimal from performance
> point of view.

Ah, I missed the "memory spaces" part ;). I agree that it's "suboptimal"
to call into the kernel for every PCI access, but I still think that
VFIO provides the correct abstraction layer for us to use. If nothing
else, it would at least give us identical configuration to x86 and nice
debuggability on par with the other platforms.

>  
>>> 4) no classic interrupts (INTX, MSI). The pci hw understands the concept
>>>    of requesting MSIX irqs but irqs are delivered as s390 adapter irqs.
>>
>> VFIO delivers interrupts as eventfds regardless of the underlying
>> platform mechanism.
>>
> 
> yes that's right, but then we have to do platform specific stuff to present
> the irq to the guest. I do not say this is impossible but we have add s390
> specific code to vfio. 

Not at all - interrupt delivery is completely transparent to VFIO.

> 
>>> 5) For DMA access there is always an IOMMU required.
>>
>> x86 requires the same.
>>
>>>  s390 pci implementation
>>>    does not support a complete memory to iommu mapping, dma mappings are
>>>    created on request.
>>
>> Sounds like POWER.
> 
> Don't know the details from power, maybe it is similar but not the same.
> We might be able to extend vfio to have a new interface allowing
> us to do DMA mappings on request.

We already have that.

> 
>>
>>> 6) The OS does not get any informations about the physical layout
>>>    of the PCI bus.
>>
>> If that means that every device is isolated (seems unlikely for
>> multifunction devices) then that makes IOMMU group support really easy.
>>
> 
> OK
>  
>>> 7) To take advantage of system z specific virtualization features
>>>    we need to access the SIE control block residing in the kernel KVM
>>
>> The KVM-VFIO device allows interaction between VFIO devices and KVM.
>>
>>> 8) To enable system z specific virtualization features we have to manipulate
>>>    the zpci device in kernel.
>>
>> VFIO supports different device backends, currently pci_dev and working
>> towards platform devices.  zpci might just be an extension to standard
>> pci.
>>
> 
> 7 - 8 At least this is not as straightforward as the pure kernel approach, but
> I have to dig into that in more detail if we could only agree on a vfio solution.

Please do so, yes :).

> 
>>> For this reasons I decided to implement a kernel based approach similar
>>> to x86 device assignment. There is a new qemu device (s390-pci) representing a
>>> pass through device on the host. Here is a sample qemu device configuration:
>>>
>>> -device s390-pci,host=0000:00:00.0
>>>
>>> The device executes the KVM_ASSIGN_PCI_DEVICE ioctl to create a proxy instance
>>> in the kernel KVM and connect this instance to the host pci device.
>>>
>>> kernel patches apply to linux-kvm
>>>
>>> s390: cio: chsc function to register GIB
>>> s390: pci: export pci functions for pass-through usage
>>> KVM: s390: Add GISA support
>>> KVM: s390: Add PCI pass-through support
>>>
>>> qemu patches apply to qemu-master
>>>
>>> s390: Add PCI bus support
>>> s390: Add PCI pass-through device support
>>>
>>> Feedback and discussion is highly welcome ...
>>
>> KVM-based device assignment needs to go away.  It's a horrible model for
>> devices, it offers very little protection to the kernel, assumes every
>> device is fully isolated and visible to the IOMMU, relies on smattering
>> of sysfs files to operate, etc.  x86, POWER, and ARM are all moving to
>> VFIO-based device assignment.  Why is s390 special enough to repeat all
>> the mistakes that x86 did?  Thanks,
>>
> 
> Is this your personal opinion or was this a strategic decision of the
> QEMU/KVM community? Can anybody give us direction about this?
> 
> Actually I can understand your point. In the last weeks I did some development
> and testing regarding the use of vfio too. But the in kernel solutions seems to
> offer the best performance and most straighforward implementation for our
> platform.

I don't see why there should be any difference in performance between
the two approaches if done right. However, we'd get a lot of benefits.
Most notably the fact that s390 is not different from everyone else.

I think you'll see that it's pretty straightforward to do things VFIO
style once you get the hang of it :).


Alex


* Re: [Qemu-devel] [RFC][patch 4/6] KVM: s390: Add PCI pass-through support
  2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 4/6] KVM: s390: Add PCI pass-through support frank.blaschka
@ 2014-09-05  8:37   ` Alexander Graf
  0 siblings, 0 replies; 21+ messages in thread
From: Alexander Graf @ 2014-09-05  8:37 UTC (permalink / raw)
  To: frank.blaschka, qemu-devel, linux-s390, kvm; +Cc: aik, pbonzini



On 04.09.14 12:52, frank.blaschka@de.ibm.com wrote:
> From: Frank Blaschka <frank.blaschka@de.ibm.com>
> 
> This patch implements PCI pass-through kernel support for s390.
> Design approach is very similar to the x86 device assignment.
> User space executes the KVM_ASSIGN_PCI_DEVICE ioctl to create
> a proxy instance in the kernel KVM and connect this instance to the
> host pci device. s390 pci instructions are intercepted in kernel and
> operations are passed directly to the assigned pci device.
> To take advantage of all system z specific virtualization features
> we need to access the SIE control block residing in KVM. Also we have to
> enable z pci devices with special configuration information coming
> from the SIE block as well.
> 
> Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
> ---
>  arch/s390/include/asm/kvm_host.h |    1 
>  arch/s390/kvm/Makefile           |    2 
>  arch/s390/kvm/intercept.c        |    1 
>  arch/s390/kvm/kvm-s390.c         |   33 
>  arch/s390/kvm/kvm-s390.h         |   17 
>  arch/s390/kvm/pci.c              | 2130 +++++++++++++++++++++++++++++++++++++++
>  arch/s390/kvm/priv.c             |   21 
>  7 files changed, 2202 insertions(+), 3 deletions(-)


I would love to review this patch, but in its current form it's
impossible to do. I can't possibly keep > 2000 lines of code in my head.


Alex


* Re: [Qemu-devel] [RFC][patch 3/6] KVM: s390: Add GISA support
  2014-09-05  8:29   ` Alexander Graf
@ 2014-09-05 10:52     ` Frank Blaschka
  0 siblings, 0 replies; 21+ messages in thread
From: Frank Blaschka @ 2014-09-05 10:52 UTC (permalink / raw)
  To: Alexander Graf; +Cc: linux-s390, frank.blaschka, kvm, aik, qemu-devel, pbonzini

On Fri, Sep 05, 2014 at 10:29:26AM +0200, Alexander Graf wrote:
> 
> 
> On 04.09.14 12:52, frank.blaschka@de.ibm.com wrote:
> > From: Frank Blaschka <frank.blaschka@de.ibm.com>
> > 
> > This patch adds GISA (Guest Interrupt State Area) support
> > to s390 kvm. GISA can be used for exitless interrupts. The
> > patch provides a set of functions for GISA related operations
> > like accessing GISA fields or registering ISCs for alert.
> > Exploiters of GISA will follow with additional patches.
> > 
> > Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
> 
> That's a nice feature. However, please make sure that you maintain the
> abstraction levels.
> 
> What should happen is that you request an irqfd from FLIC. Then you
> associate that irqfd with the PCI device.
> 
> Thanks to that association, both parties can now talk to each other and
> negotiate their GISA number space and make sure things are connected.
> 
> However, it should always be possible to do things without this direct
> IRQ injection.
> 
> So you should be able to receive an irqfd event when an IRQ happened, so
> that VFIO user space applications can also handle interrupts for example.
> 
> And the same applies for interrupt injection. We also need to be able to
> inject an adapter interrupt from QEMU for emulated devices ;).
>

OK, assuming we are doing the vfio solution, exploiting GISA would be a
second step. Will take your feedback into account. THX!
> 
> Alex
> 


* Re: [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390
  2014-09-05  8:21 ` Alexander Graf
@ 2014-09-05 11:39   ` Frank Blaschka
  2014-09-05 23:19     ` Alexander Graf
  0 siblings, 1 reply; 21+ messages in thread
From: Frank Blaschka @ 2014-09-05 11:39 UTC (permalink / raw)
  To: Alexander Graf
  Cc: linux-s390, frank.blaschka, kvm, aik, qemu-devel, Alex Williamson,
	pbonzini

On Fri, Sep 05, 2014 at 10:21:27AM +0200, Alexander Graf wrote:
> 
> 
> On 04.09.14 12:52, frank.blaschka@de.ibm.com wrote:
> > This set of patches implements pci pass-through support for qemu/KVM on s390.
> > PCI support on s390 is very different from other platforms.
> > Major differences are:
> > 
> > 1) all PCI operations are driven by special s390 instructions
> > 2) all s390 PCI instructions are privileged
> > 3) PCI config and memory spaces can not be mmap'ed
> 
> That's ok, vfio abstracts config space anyway.
> 
> > 4) no classic interrupts (INTX, MSI). The pci hw understands the concept
> >    of requesting MSIX irqs but irqs are delivered as s390 adapter irqs.
> 
> This is in line with other implementations. Interrupts go from
> 
>   device -> PHB -> PIC -> CPU
> 
> (some times you can have another converter device in between)
> 
> In your case, the PHB converts INTX and MSI interrupts to Adapter
> interrupts to go to the floating interrupt controller. Same thing as
> everyone else really.
> 

Yes, I think this can be done, but we need s390 specific changes in vfio.

> > 5) For DMA access there is always an IOMMU required. s390 pci implementation
> >    does not support a complete memory to iommu mapping, dma mappings are
> >    created on request.
> 
> Sounds great :). So I suppose we should implement a guest facing IOMMU?
> 
> > 6) The OS does not get any informations about the physical layout
> >    of the PCI bus.
> 
> So how does it know whether different devices are behind the same IOMMU
> context? Or can we assume that every device has its own context?

Actually, yes: every device has its own IOMMU context.

> 
> > 7) To take advantage of system z specific virtualization features
> >    we need to access the SIE control block residing in the kernel KVM
> 
> Please elaborate.
> 
> > 8) To enable system z specific virtualization features we have to manipulate
> >    the zpci device in kernel.
> 
> Why?
>

We have the following s390-specific virtualization features:

1) Interpretive execution of PCI load/store instructions. If we use this function,
   PCI access does not get intercepted (no SIE exit) but is handled via microcode.
   To enable this we have to disable the zpci device and enable it again with
   information from the SIE control block. A further problem in qemu: vfio traps
   access to the MSIX table, so we have to find another way of programming MSIX
   if we do not get intercepts for memory space access.

2) Adapter event forwarding (with alerting). This is a mechanism by which the
   adapter event (irq) is forwarded directly to the guest. To set this up we also
   need to manipulate the zpci device (in kernel) with information from the SIE
   block. Exploiting the GISA is only one part of this mechanism.

Both might be possible with some more or less nice looking vfio extensions. As I
said before, we have to dig into this more. These can also be further
optimization steps later, once we have a running vfio implementation on the platform.
 
> > 
> > For these reasons I decided to implement a kernel-based approach similar
> > to x86 device assignment. There is a new qemu device (s390-pci) representing a
> 
> I fail to see the rationale and I definitely don't want to see anything
> even remotely similar to the legacy x86 device assignment on s390 ;).
> 
> Can't we just enhance VFIO?
> 

Probably yes, but we need some vfio changes (kernel and qemu)

> Also, I think we'll get the cleanest model if we start off with an
> implementation that allows us to add emulated PCI devices to an s390x
> machine and only then follow on with physical ones.
> 

I can already do this. With some more s390 intercepts a device can be detected and
the guest is able to access config/memory space. Unfortunately the s390 platform does
not support I/O bars, so none of the emulated devices will work on the platform ...

> 
> Alex
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


* Re: [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390
  2014-09-05  8:35     ` Alexander Graf
@ 2014-09-05 11:55       ` Frank Blaschka
  2014-09-05 23:03         ` Alexander Graf
  0 siblings, 1 reply; 21+ messages in thread
From: Frank Blaschka @ 2014-09-05 11:55 UTC (permalink / raw)
  To: Alexander Graf
  Cc: linux-s390, frank.blaschka, kvm, aik, qemu-devel, Alex Williamson,
	pbonzini

On Fri, Sep 05, 2014 at 10:35:59AM +0200, Alexander Graf wrote:
> 
> 
> On 05.09.14 09:46, Frank Blaschka wrote:
> > On Thu, Sep 04, 2014 at 07:16:24AM -0600, Alex Williamson wrote:
> >> On Thu, 2014-09-04 at 12:52 +0200, frank.blaschka@de.ibm.com wrote:
> >>> This set of patches implements pci pass-through support for qemu/KVM on s390.
> >>> PCI support on s390 is very different from other platforms.
> >>> Major differences are:
> >>>
> >>> 1) all PCI operations are driven by special s390 instructions
> >>
> >> Generating config cycles is always arch specific.
> >>
> >>> 2) all s390 PCI instructions are privileged
> >>
> >> While the operations to generate config cycles on x86 are not
> >> privileged, they must be arbitrated between accesses, so in a sense
> >> they're privileged.
> >>
> >>> 3) PCI config and memory spaces can not be mmap'ed
> >>
> >> VFIO has mapping flags that allow any region to specify mmap support.
> >>
> > 
> > Hi Alex,
> > 
> > thx for your reply.
> > 
> > Let me elaborate a little bit more on 1 - 3. Config and memory space can not
> > be accessed via memory operations. You have to use special s390 instructions.
> > These instructions can not be executed in user space, so there is no other
> > way than executing them in the kernel. Yes, vfio does support a
> > slow path via ioctl that we could use, but this seems suboptimal from a
> > performance point of view.
> 
> Ah, I missed the "memory spaces" part ;). I agree that it's "suboptimal"
> to call into the kernel for every PCI access, but I still think that
> VFIO provides the correct abstraction layer for us to use. If nothing
> else, it would at least give us identical configuration to x86 and nice
> debugability en par with the other platforms.
> 
> >  
> >>> 4) no classic interrupts (INTX, MSI). The pci hw understands the concept
> >>>    of requesting MSIX irqs but irqs are delivered as s390 adapter irqs.
> >>
> >> VFIO delivers interrupts as eventfds regardless of the underlying
> >> platform mechanism.
> >>
> > 
> > yes that's right, but then we have to do platform-specific stuff to present
> > the irq to the guest. I do not say this is impossible, but we have to add
> > s390-specific code to vfio.
> 
> Not at all - interrupt delivery is completely transparent to VFIO.
>

interrupt yes, but MSIX no
 
> > 
> >>> 5) For DMA access there is always an IOMMU required.
> >>
> >> x86 requires the same.
> >>
> >>>  s390 pci implementation
> >>>    does not support a complete memory to iommu mapping, dma mappings are
> >>>    created on request.
> >>
> >> Sounds like POWER.
> > 
> > Don't know the details from power, maybe it is similar but not the same.
> > We might be able to extend vfio to have a new interface allowing
> > us to do DMA mappings on request.
> 
> We already have that.
>

Great, can you give me some pointers on how to use it? Thx!
 
> > 
> >>
> >>> 6) The OS does not get any information about the physical layout
> >>>    of the PCI bus.
> >>
> >> If that means that every device is isolated (seems unlikely for
> >> multifunction devices) then that makes IOMMU group support really easy.
> >>
> > 
> > OK
> >  
> >>> 7) To take advantage of system z specific virtualization features
> >>>    we need to access the SIE control block residing in the kernel KVM
> >>
> >> The KVM-VFIO device allows interaction between VFIO devices and KVM.
> >>
> >>> 8) To enable system z specific virtualization features we have to manipulate
> >>>    the zpci device in kernel.
> >>
> >> VFIO supports different device backends, currently pci_dev and working
> >> towards platform devices.  zpci might just be an extension to standard
> >> pci.
> >>
> > 
> > 7 - 8 At least this is not as straightforward as the pure kernel approach, but
> > I have to dig into that in more detail once we agree on a vfio solution.
> 
> Please do so, yes :).
> 
> > 
> >>> For these reasons I decided to implement a kernel-based approach similar
> >>> to x86 device assignment. There is a new qemu device (s390-pci) representing a
> >>> pass through device on the host. Here is a sample qemu device configuration:
> >>>
> >>> -device s390-pci,host=0000:00:00.0
> >>>
> >>> The device executes the KVM_ASSIGN_PCI_DEVICE ioctl to create a proxy instance
> >>> in the kernel KVM and connect this instance to the host pci device.
> >>>
> >>> kernel patches apply to linux-kvm
> >>>
> >>> s390: cio: chsc function to register GIB
> >>> s390: pci: export pci functions for pass-through usage
> >>> KVM: s390: Add GISA support
> >>> KVM: s390: Add PCI pass-through support
> >>>
> >>> qemu patches apply to qemu-master
> >>>
> >>> s390: Add PCI bus support
> >>> s390: Add PCI pass-through device support
> >>>
> >>> Feedback and discussion is highly welcome ...
> >>
> >> KVM-based device assignment needs to go away.  It's a horrible model for
> >> devices, it offers very little protection to the kernel, assumes every
> >> device is fully isolated and visible to the IOMMU, relies on smattering
> >> of sysfs files to operate, etc.  x86, POWER, and ARM are all moving to
> >> VFIO-based device assignment.  Why is s390 special enough to repeat all
> >> the mistakes that x86 did?  Thanks,
> >>
> > 
> > Is this your personal opinion or was this a strategic decision of the
> > QEMU/KVM community? Can anybody give us direction about this?
> > 
> > Actually I can understand your point. In the last weeks I did some development
> > and testing regarding the use of vfio too. But the in-kernel solution seems to
> > offer the best performance and most straightforward implementation for our
> > platform.
> 
> I don't see why there should be any difference in performance between
> the two approaches if done right. However, we'd get a lot of benefits.
> Most notably the fact that s390 is not different from everyone else.
> 
> I think you'll see that it's pretty straight forward to do things VFIO
> style once you get the hang of it :).
>

Yes, I have seen this already. Will post my vfio work sometime next week.
It is not complete yet but will give you an idea of what changes we need.

Hope to get feedback from Alex and you again ...

Have a nice weekend

Frank
 
> 
> Alex
> 


* Re: [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390
  2014-09-05 11:55       ` Frank Blaschka
@ 2014-09-05 23:03         ` Alexander Graf
  0 siblings, 0 replies; 21+ messages in thread
From: Alexander Graf @ 2014-09-05 23:03 UTC (permalink / raw)
  To: Frank Blaschka
  Cc: linux-s390, frank.blaschka, kvm, aik, qemu-devel, Alex Williamson,
	pbonzini



On 05.09.14 13:55, Frank Blaschka wrote:
> On Fri, Sep 05, 2014 at 10:35:59AM +0200, Alexander Graf wrote:
>>
>>
>> On 05.09.14 09:46, Frank Blaschka wrote:
>>> On Thu, Sep 04, 2014 at 07:16:24AM -0600, Alex Williamson wrote:
>>>> On Thu, 2014-09-04 at 12:52 +0200, frank.blaschka@de.ibm.com wrote:
>>>>> This set of patches implements pci pass-through support for qemu/KVM on s390.
>>>>> PCI support on s390 is very different from other platforms.
>>>>> Major differences are:
>>>>>
>>>>> 1) all PCI operations are driven by special s390 instructions
>>>>
>>>> Generating config cycles is always arch specific.
>>>>
>>>>> 2) all s390 PCI instructions are privileged
>>>>
>>>> While the operations to generate config cycles on x86 are not
>>>> privileged, they must be arbitrated between accesses, so in a sense
>>>> they're privileged.
>>>>
>>>>> 3) PCI config and memory spaces can not be mmap'ed
>>>>
>>>> VFIO has mapping flags that allow any region to specify mmap support.
>>>>
>>>
>>> Hi Alex,
>>>
>>> thx for your reply.
>>>
>>> Let me elaborate a little bit more on 1 - 3. Config and memory space can not
>>> be accessed via memory operations. You have to use special s390 instructions.
>>> These instructions can not be executed in user space, so there is no other
>>> way than executing them in the kernel. Yes, vfio does support a
>>> slow path via ioctl that we could use, but this seems suboptimal from a
>>> performance point of view.
>>
>> Ah, I missed the "memory spaces" part ;). I agree that it's "suboptimal"
>> to call into the kernel for every PCI access, but I still think that
>> VFIO provides the correct abstraction layer for us to use. If nothing
>> else, it would at least give us identical configuration to x86 and nice
>> debugability en par with the other platforms.
>>
>>>  
>>>>> 4) no classic interrupts (INTX, MSI). The pci hw understands the concept
>>>>>    of requesting MSIX irqs but irqs are delivered as s390 adapter irqs.
>>>>
>>>> VFIO delivers interrupts as eventfds regardless of the underlying
>>>> platform mechanism.
>>>>
>>>
>>> yes that's right, but then we have to do platform-specific stuff to present
>>> the irq to the guest. I do not say this is impossible, but we have to add
>>> s390-specific code to vfio.
>>
>> Not at all - interrupt delivery is completely transparent to VFIO.
>>
> 
> interrupt yes, but MSIX no
>  
>>>
>>>>> 5) For DMA access there is always an IOMMU required.
>>>>
>>>> x86 requires the same.
>>>>
>>>>>  s390 pci implementation
>>>>>    does not support a complete memory to iommu mapping, dma mappings are
>>>>>    created on request.
>>>>
>>>> Sounds like POWER.
>>>
>>> Don't know the details from power, maybe it is similar but not the same.
>>> We might be able to extend vfio to have a new interface allowing
>>> us to do DMA mappings on request.
>>
>> We already have that.
>>
> 
> Great, can you give me some pointers on how to use it? Thx!

Sure! :)

So on POWER (sPAPR) you get a list of page entries that describe the
device -> ram mapping. Every time you want to modify any of these
entries, you need to invoke a hypercall (H_PUT_TCE).

So every time the guest wants to add a DMA window at runtime, we trap into
put_tce_emu() in hw/ppc/spapr_iommu.c. Here we call
memory_region_notify_iommu().

This call goes either to an emulated IOMMU context for emulated devices
or to the special VFIO IOMMU context for VFIO devices.

In the VFIO case, we end up in vfio_iommu_map_notify() at hw/misc/vfio.c
which calls ioctl(VFIO_IOMMU_MAP_DMA) at the end of the day. The
in-kernel implementation of the host IOMMU provider uses this map to
create the virtual DMA window map.

Basically, VFIO *only* supports "DMA mappings on request" as you call
them. Prepopulated DMA windows are just a coincidence that may or may
not happen.

I hope that makes it slightly more clear what the path looks like :). If
you have more questions on this, don't hesitate to ask.


Alex


* Re: [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390
  2014-09-05 11:39   ` Frank Blaschka
@ 2014-09-05 23:19     ` Alexander Graf
  2014-09-08  9:20       ` Paolo Bonzini
  0 siblings, 1 reply; 21+ messages in thread
From: Alexander Graf @ 2014-09-05 23:19 UTC (permalink / raw)
  To: Frank Blaschka
  Cc: linux-s390, frank.blaschka, kvm, aik, qemu-devel, Alex Williamson,
	pbonzini



On 05.09.14 13:39, Frank Blaschka wrote:
> On Fri, Sep 05, 2014 at 10:21:27AM +0200, Alexander Graf wrote:
>>
>>
>> On 04.09.14 12:52, frank.blaschka@de.ibm.com wrote:
>>> This set of patches implements pci pass-through support for qemu/KVM on s390.
>>> PCI support on s390 is very different from other platforms.
>>> Major differences are:
>>>
>>> 1) all PCI operations are driven by special s390 instructions
>>> 2) all s390 PCI instructions are privileged
>>> 3) PCI config and memory spaces can not be mmap'ed
>>
>> That's ok, vfio abstracts config space anyway.
>>
>>> 4) no classic interrupts (INTX, MSI). The pci hw understands the concept
>>>    of requesting MSIX irqs but irqs are delivered as s390 adapter irqs.
>>
>> This is in line with other implementations. Interrupts go from
>>
>>   device -> PHB -> PIC -> CPU
>>
>> (some times you can have another converter device in between)
>>
>> In your case, the PHB converts INTX and MSI interrupts to Adapter
>> interrupts to go to the floating interrupt controller. Same thing as
>> everyone else really.
>>
> 
> Yes, I think this can be done, but we need s390 specific changes in vfio.
> 
>>> 5) For DMA access there is always an IOMMU required. s390 pci implementation
>>>    does not support a complete memory to iommu mapping, dma mappings are
>>>    created on request.
>>
>> Sounds great :). So I suppose we should implement a guest facing IOMMU?
>>
>>> 6) The OS does not get any information about the physical layout
>>>    of the PCI bus.
>>
>> So how does it know whether different devices are behind the same IOMMU
>> context? Or can we assume that every device has its own context?
> 
> Actually, yes: every device has its own IOMMU context.

That greatly simplifies things. Awesome :).

> 
>>
>>> 7) To take advantage of system z specific virtualization features
>>>    we need to access the SIE control block residing in the kernel KVM
>>
>> Please elaborate.
>>
>>> 8) To enable system z specific virtualization features we have to manipulate
>>>    the zpci device in kernel.
>>
>> Why?
>>
> 
> We have the following s390-specific virtualization features:
> 
> 1) Interpretive execution of PCI load/store instructions. If we use this function,
>    PCI access does not get intercepted (no SIE exit) but is handled via microcode.
>    To enable this we have to disable the zpci device and enable it again with
>    information from the SIE control block.

Hrm. So how about you create a special vm ioctl for KVM that allows you
to attach a VFIO device fd into the KVM VM context? Then the default
would stay "accessible by mmap traps", but we could accelerate it with KVM.

>    A further problem in qemu: vfio traps access to the MSIX table, so we
>    have to find another way of programming MSIX if we do not get
>    intercepts for memory space access.

We trap access to the MSIX table because it's a shared resource. If it's
not shared for you, there's no need to trap it.

> 2) Adapter event forwarding (with alerting). This is a mechanism by which the
>    adapter event (irq) is forwarded directly to the guest. To set this up we also
>    need to manipulate the zpci device (in kernel) with information from the SIE
>    block. Exploiting the GISA is only one part of this mechanism.

How does this work when the VM is not running (because it's idle)?

Either way, we have a very similar thing on x86. It's called "posted
interrupts" there. I'm not sure everything's in place for VFIO and
posted interrupts to work properly, but whatever we do it sounds like
the interfaces and configuration flow should be identical.

> Both might be possible with some more or less nice looking vfio extensions. As I
> said before, we have to dig into this more. These can also be further
> optimization steps later, once we have a running vfio implementation on the platform.

Yup :). That's the nice part about it.

>  
>>>
>>> For these reasons I decided to implement a kernel-based approach similar
>>> to x86 device assignment. There is a new qemu device (s390-pci) representing a
>>
>> I fail to see the rationale and I definitely don't want to see anything
>> even remotely similar to the legacy x86 device assignment on s390 ;).
>>
>> Can't we just enhance VFIO?
>>
> 
> Probably yes, but we need some vfio changes (kernel and qemu)

We need changes either way ;). So let's better do the right ones.

> 
>> Also, I think we'll get the cleanest model if we start off with an
>> implementation that allows us to add emulated PCI devices to an s390x
>> machine and only then follow on with physical ones.
>>
> 
> I can already do this. With some more s390 intercepts a device can be detected and
> the guest is able to access config/memory space. Unfortunately the s390 platform does
> not support I/O bars, so none of the emulated devices will work on the platform ...

Oh? How about "nec-usb-xhci" or "intel-hda"?
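(Both of the devices named here use memory BARs only, so a configuration along these lines would be the natural test case. This is an illustrative sketch, not a tested command: it assumes the PCI bus support from this patch series is applied on top of the usual s390 machine type.)

```shell
# Illustrative only: emulated PCI devices that use MMIO BARs exclusively,
# so the missing I/O-space support on s390 would not matter.
# Assumes the s390 PCI bus patches from this series are applied.
qemu-system-s390x \
    -machine s390-ccw-virtio \
    -device nec-usb-xhci \
    -device intel-hda
```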


Alex


* Re: [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390
  2014-09-05 23:19     ` Alexander Graf
@ 2014-09-08  9:20       ` Paolo Bonzini
  2014-09-08 14:19         ` Alex Williamson
  0 siblings, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2014-09-08  9:20 UTC (permalink / raw)
  To: Alexander Graf, Frank Blaschka
  Cc: linux-s390, frank.blaschka, kvm, aik, qemu-devel, Alex Williamson

Il 06/09/2014 01:19, Alexander Graf ha scritto:
>> > 1) interpretive execution of pci load/store instruction. If we use this function
>> >    pci access does not get intercepted (no SIE exit) but is handled via microcode.
>> >    To enable this we have to disable zpci device and enable it again with information
>> >    from the SIE control block.
> Hrm. So how about you create a special vm ioctl for KVM that allows you
> to attach a VFIO device fd into the KVM VM context? Then the default
> would stay "accessible by mmap traps", but we could accelerate it with KVM.

There is already KVM_DEV_VFIO_GROUP_ADD and KVM_DEV_VFIO_GROUP_DEL.

Right now, they result in a call to kvm_arch_register_noncoherent_dma or
kvm_arch_unregister_noncoherent_dma, but you can add more hooks.

Paolo


* Re: [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390
  2014-09-08  9:20       ` Paolo Bonzini
@ 2014-09-08 14:19         ` Alex Williamson
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Williamson @ 2014-09-08 14:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: linux-s390, frank.blaschka, kvm, aik, Frank Blaschka,
	Alexander Graf, qemu-devel

On Mon, 2014-09-08 at 11:20 +0200, Paolo Bonzini wrote:
> Il 06/09/2014 01:19, Alexander Graf ha scritto:
> >> > 1) interpretive execution of pci load/store instruction. If we use this function
> >> >    pci access does not get intercepted (no SIE exit) but is handled via microcode.
> >> >    To enable this we have to disable zpci device and enable it again with information
> >> >    from the SIE control block.
> > Hrm. So how about you create a special vm ioctl for KVM that allows you
> > to attach a VFIO device fd into the KVM VM context? Then the default
> > would stay "accessible by mmap traps", but we could accelerate it with KVM.
> 
> There is already KVM_DEV_VFIO_GROUP_ADD and KVM_DEV_VFIO_GROUP_DEL.
> 
> Right now, they result in a call to kvm_arch_register_noncoherent_dma or
> kvm_arch_unregister_noncoherent_dma, but you can add more hooks.

Eric Auger is also working on a patch series to do IRQ forward control
on ARM via the kvm-vfio pseudo device, extending the interface to
register VFIO device fds.  Sounds like that may be a good path to follow
here too.  Thanks,

Alex


end of thread, other threads:[~2014-09-08 14:19 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-04 10:52 [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 frank.blaschka
2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 1/6] s390: cio: chsc function to register GIB frank.blaschka
2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 2/6] s390: pci: export pci functions for pass-through usage frank.blaschka
2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 3/6] KVM: s390: Add GISA support frank.blaschka
2014-09-04 14:19   ` Heiko Carstens
2014-09-05  8:29   ` Alexander Graf
2014-09-05 10:52     ` Frank Blaschka
2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 4/6] KVM: s390: Add PCI pass-through support frank.blaschka
2014-09-05  8:37   ` Alexander Graf
2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 5/6] s390: Add PCI bus support frank.blaschka
2014-09-04 10:52 ` [Qemu-devel] [RFC][patch 6/6] s390: Add PCI pass-through device support frank.blaschka
2014-09-04 13:16 ` [Qemu-devel] [RFC][patch 0/6] pci pass-through support for qemu/KVM on s390 Alex Williamson
2014-09-05  7:46   ` Frank Blaschka
2014-09-05  8:35     ` Alexander Graf
2014-09-05 11:55       ` Frank Blaschka
2014-09-05 23:03         ` Alexander Graf
2014-09-05  8:21 ` Alexander Graf
2014-09-05 11:39   ` Frank Blaschka
2014-09-05 23:19     ` Alexander Graf
2014-09-08  9:20       ` Paolo Bonzini
2014-09-08 14:19         ` Alex Williamson
