* [PATCH 1/2] powerpc/PCI: Use PCI_UNKNOWN for unknown power state
From: Bjorn Helgaas @ 2013-05-20 23:19 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, David S. Miller
Cc: sparclinux, Rafael J. Wysocki, linuxppc-dev, linux-pci
Previously we initialized dev->current_state to 4 (PCI_D3cold), but I think
we wanted PCI_UNKNOWN (5) here based on the comment and the fact that the
generic version of this code, pci_setup_device(), uses PCI_UNKNOWN.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
arch/powerpc/kernel/pci_of_scan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/kernel/pci_of_scan.c b/arch/powerpc/kernel/pci_of_scan.c
index 2a67e9b..d2d407d 100644
--- a/arch/powerpc/kernel/pci_of_scan.c
+++ b/arch/powerpc/kernel/pci_of_scan.c
@@ -165,7 +165,7 @@ struct pci_dev *of_create_pci_dev(struct device_node *node,
pr_debug(" class: 0x%x\n", dev->class);
pr_debug(" revision: 0x%x\n", dev->revision);
- dev->current_state = 4; /* unknown power state */
+ dev->current_state = PCI_UNKNOWN; /* unknown power state */
dev->error_state = pci_channel_io_normal;
dev->dma_mask = 0xffffffff;
^ permalink raw reply related
* [PATCH 2/2] sparc/PCI: Use PCI_UNKNOWN for unknown power state
From: Bjorn Helgaas @ 2013-05-20 23:19 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, David S. Miller
Cc: sparclinux, Rafael J. Wysocki, linuxppc-dev, linux-pci
In-Reply-To: <20130520231909.32416.81752.stgit@bhelgaas-glaptop>
Previously we initialized dev->current_state to 4 (PCI_D3cold), but I think
we wanted PCI_UNKNOWN (5) here based on the comment and the fact that the
generic version of this code, pci_setup_device(), uses PCI_UNKNOWN.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
arch/sparc/kernel/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/sparc/kernel/pci.c b/arch/sparc/kernel/pci.c
index baf4366..972892a 100644
--- a/arch/sparc/kernel/pci.c
+++ b/arch/sparc/kernel/pci.c
@@ -327,7 +327,7 @@ static struct pci_dev *of_create_pci_dev(struct pci_pbm_info *pbm,
if ((dev->class >> 8) == PCI_CLASS_STORAGE_IDE)
pci_set_master(dev);
- dev->current_state = 4; /* unknown power state */
+ dev->current_state = PCI_UNKNOWN; /* unknown power state */
dev->error_state = pci_channel_io_normal;
dev->dma_mask = 0xffffffff;
^ permalink raw reply related
* Re: [PATCH 1/2] powerpc/PCI: Use PCI_UNKNOWN for unknown power state
From: Bjorn Helgaas @ 2013-05-20 23:26 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, David S. Miller
Cc: sparclinux, Rafael J. Wysocki, linuxppc-dev,
linux-pci@vger.kernel.org
In-Reply-To: <20130520231909.32416.81752.stgit@bhelgaas-glaptop>
On Mon, May 20, 2013 at 5:19 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> Previously we initialized dev->current_state to 4 (PCI_D3cold), but I think
> we wanted PCI_UNKNOWN (5) here based on the comment and the fact that the
> generic version of this code, pci_setup_device(), uses PCI_UNKNOWN.
>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
> arch/powerpc/kernel/pci_of_scan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/pci_of_scan.c b/arch/powerpc/kernel/pci_of_scan.c
> index 2a67e9b..d2d407d 100644
> --- a/arch/powerpc/kernel/pci_of_scan.c
> +++ b/arch/powerpc/kernel/pci_of_scan.c
> @@ -165,7 +165,7 @@ struct pci_dev *of_create_pci_dev(struct device_node *node,
> pr_debug(" class: 0x%x\n", dev->class);
> pr_debug(" revision: 0x%x\n", dev->revision);
>
> - dev->current_state = 4; /* unknown power state */
> + dev->current_state = PCI_UNKNOWN; /* unknown power state */
If you're interested in the archaeology of this, I think it was first
introduced by 4267292b0 in 2005 for powerpc and subsequently copied
into sparc by a2fb23af1 in 2007.
Prior to the addition of PCI_UNKNOWN, pci_setup_device() set
current_state to 4, so I think 4267292b0 probably just copied that
code before 3fe9d19f9 added PCI_UNKNOWN and changed the initialization
in pci_setup_device().
> dev->error_state = pci_channel_io_normal;
> dev->dma_mask = 0xffffffff;
>
>
^ permalink raw reply
* Re: [PATCH v2 2/4] powerpc/cputable: advertise DSCR support on P7/P7+
From: Michael Neuling @ 2013-05-20 23:41 UTC (permalink / raw)
To: will_schmidt
Cc: Nishanth Aravamudan, Steve Munroe, Peter Bergner, Ryan Arnold,
linuxppc-dev, Michael R Meissner
In-Reply-To: <1369062258.18289.3.camel@brimstone>
Will Schmidt <will_schmidt@vnet.ibm.com> wrote:
> On Fri, 2013-05-03 at 17:48 -0700, Nishanth Aravamudan wrote:
> > Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
> >
> > diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
> > index ae9f433..a792157 100644
> > --- a/arch/powerpc/kernel/cputable.c
> > +++ b/arch/powerpc/kernel/cputable.c
> > @@ -98,6 +98,7 @@ extern void __restore_cpu_e6500(void);
> > PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | \
> > PPC_FEATURE_TRUE_LE | \
> > PPC_FEATURE_PSERIES_PERFMON_COMPAT)
> > +#define COMMON_USER2_POWER7 (PPC_FEATURE2_DSCR)
> > #define COMMON_USER_POWER8 (COMMON_USER_PPC64 | PPC_FEATURE_ARCH_2_06 |\
> > PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | \
> > PPC_FEATURE_TRUE_LE | \
> > @@ -428,6 +429,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
> > .cpu_name = "POWER7 (architected)",
> > .cpu_features = CPU_FTRS_POWER7,
> > .cpu_user_features = COMMON_USER_POWER7,
> > + .cpu_user_features2 = COMMON_USER2_POWER7,
> > .mmu_features = MMU_FTRS_POWER7,
> > .icache_bsize = 128,
> > .dcache_bsize = 128,
> > @@ -458,6 +460,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
> > .cpu_name = "POWER7 (raw)",
> > .cpu_features = CPU_FTRS_POWER7,
> > .cpu_user_features = COMMON_USER_POWER7,
> > + .cpu_user_features2 = COMMON_USER2_POWER7,
> > .mmu_features = MMU_FTRS_POWER7,
> > .icache_bsize = 128,
> > .dcache_bsize = 128,
> > @@ -475,6 +478,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
> > .cpu_name = "POWER7+ (raw)",
> > .cpu_features = CPU_FTRS_POWER7,
> > .cpu_user_features = COMMON_USER_POWER7,
> > + .cpu_user_features = COMMON_USER2_POWER7,
>
> ^ Oops here, I think. Please consider applying this on top.
> Untested, but seems obvious.
>
> Fix a typo in setting COMMON_USER2_POWER7 bits to .cpu_user_features2
> cpu specs table.
Nice catch. Thanks
Acked-by: Michael Neuling <mikey@neuling.org>
>
> Signed-off-by: Will Schmidt <will_schmidt@vnet.ibm.com>
>
>
> diff --git a/arch/powerpc/kernel/cputable.c
> b/arch/powerpc/kernel/cputable.c
> index c60bbec..51eecb5 100644
> --- a/arch/powerpc/kernel/cputable.c
> +++ b/arch/powerpc/kernel/cputable.c
> @@ -482,7 +482,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
> .cpu_name = "POWER7+ (raw)",
> .cpu_features = CPU_FTRS_POWER7,
> .cpu_user_features = COMMON_USER_POWER7,
> - .cpu_user_features = COMMON_USER2_POWER7,
> + .cpu_user_features2 = COMMON_USER2_POWER7,
> .mmu_features = MMU_FTRS_POWER7,
> .icache_bsize = 128,
> .dcache_bsize = 128,
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> > .mmu_features = MMU_FTRS_POWER7,
> > .icache_bsize = 128,
> > .dcache_bsize = 128,
> >
> > _______________________________________________
> > Linuxppc-dev mailing list
> > Linuxppc-dev@lists.ozlabs.org
> > https://lists.ozlabs.org/listinfo/linuxppc-dev
> >
>
>
^ permalink raw reply
* Re: [PATCH 1/2] powerpc/PCI: Use PCI_UNKNOWN for unknown power state
From: Benjamin Herrenschmidt @ 2013-05-20 23:42 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: linux-pci, Rafael J. Wysocki, Paul Mackerras, sparclinux,
linuxppc-dev, David S. Miller
In-Reply-To: <20130520231909.32416.81752.stgit@bhelgaas-glaptop>
On Mon, 2013-05-20 at 17:19 -0600, Bjorn Helgaas wrote:
> Previously we initialized dev->current_state to 4 (PCI_D3cold), but I think
> we wanted PCI_UNKNOWN (5) here based on the comment and the fact that the
> generic version of this code, pci_setup_device(), uses PCI_UNKNOWN.
>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
Ack. Do you send to Linus or should I ?
Thanks.
Cheers,
Ben.
> arch/powerpc/kernel/pci_of_scan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/pci_of_scan.c b/arch/powerpc/kernel/pci_of_scan.c
> index 2a67e9b..d2d407d 100644
> --- a/arch/powerpc/kernel/pci_of_scan.c
> +++ b/arch/powerpc/kernel/pci_of_scan.c
> @@ -165,7 +165,7 @@ struct pci_dev *of_create_pci_dev(struct device_node *node,
> pr_debug(" class: 0x%x\n", dev->class);
> pr_debug(" revision: 0x%x\n", dev->revision);
>
> - dev->current_state = 4; /* unknown power state */
> + dev->current_state = PCI_UNKNOWN; /* unknown power state */
> dev->error_state = pci_channel_io_normal;
> dev->dma_mask = 0xffffffff;
>
^ permalink raw reply
* Re: [PATCH 1/2] powerpc/PCI: Use PCI_UNKNOWN for unknown power state
From: Bjorn Helgaas @ 2013-05-20 23:45 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: linux-pci@vger.kernel.org, Rafael J. Wysocki, Paul Mackerras,
sparclinux, linuxppc-dev, David S. Miller
In-Reply-To: <1369093321.6387.24.camel@pasglop>
On Mon, May 20, 2013 at 5:42 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Mon, 2013-05-20 at 17:19 -0600, Bjorn Helgaas wrote:
>> Previously we initialized dev->current_state to 4 (PCI_D3cold), but I think
>> we wanted PCI_UNKNOWN (5) here based on the comment and the fact that the
>> generic version of this code, pci_setup_device(), uses PCI_UNKNOWN.
>>
>> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>> ---
>
> Ack. Do you send to Linus or should I ?
I'll add your ack and take care of it.
>> arch/powerpc/kernel/pci_of_scan.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/pci_of_scan.c b/arch/powerpc/kernel/pci_of_scan.c
>> index 2a67e9b..d2d407d 100644
>> --- a/arch/powerpc/kernel/pci_of_scan.c
>> +++ b/arch/powerpc/kernel/pci_of_scan.c
>> @@ -165,7 +165,7 @@ struct pci_dev *of_create_pci_dev(struct device_node *node,
>> pr_debug(" class: 0x%x\n", dev->class);
>> pr_debug(" revision: 0x%x\n", dev->revision);
>>
>> - dev->current_state = 4; /* unknown power state */
>> + dev->current_state = PCI_UNKNOWN; /* unknown power state */
>> dev->error_state = pci_channel_io_normal;
>> dev->dma_mask = 0xffffffff;
>>
>
>
^ permalink raw reply
* [PATCH v2] powerpc/tm: Abort transactions on emulation and alignment faults
From: Michael Neuling @ 2013-05-21 1:48 UTC (permalink / raw)
To: benh; +Cc: Linux PPC dev, matt
In-Reply-To: <23571.1369048074@ale.ozlabs.ibm.com>
If we are emulating an instruction inside an active user transaction that
touches memory, the kernel can't emulate it as it operates in transactional
suspend context. We need to abort these transactions and send them back to
userspace for the hardware to rollback.
We can service these if the user transaction is in suspend mode, since the
kernel will operate in the same suspend context.
This adds a check to all alignment faults and to specific instruction
emulations (only string instructions for now). If the user process is in an
active (non-suspended) transaction, we abort the transaction go back to
userspace allowing the HW to roll back the transaction and tell the user of the
failure. This also adds new tm abort cause codes to report the reason of the
persistent error to the user.
Crappy test case here http://neuling.org/devel/junkcode/aligntm.c
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: <stable@vger.kernel.org> # v3.9 only
---
v2:
Split persistent bit of cause into separate bit
Mark alignment and emulation as persistent.
syscall cause it not used currently.
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index a613651..20be907 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -115,12 +115,15 @@
convention, bit0 is copied to TEXASR[56] (IBM bit 7) which is set if
the failure is persistent.
*/
+#define TM_CAUSE_PERSISTENT 0x01
#define TM_CAUSE_RESCHED 0xfe
#define TM_CAUSE_TLBI 0xfc
#define TM_CAUSE_FAC_UNAV 0xfa
-#define TM_CAUSE_SYSCALL 0xf9 /* Persistent */
+#define TM_CAUSE_SYSCALL 0xf8
#define TM_CAUSE_MISC 0xf6
#define TM_CAUSE_SIGNAL 0xf4
+#define TM_CAUSE_ALIGNMENT 0xf2
+#define TM_CAUSE_EMULATE 0xf0
#if defined(CONFIG_PPC_BOOK3S_64)
#define MSR_64BIT MSR_SF
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index a7a648f..f18c79c 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -53,6 +53,7 @@
#ifdef CONFIG_PPC64
#include <asm/firmware.h>
#include <asm/processor.h>
+#include <asm/tm.h>
#endif
#include <asm/kexec.h>
#include <asm/ppc-opcode.h>
@@ -932,6 +933,28 @@ static int emulate_isel(struct pt_regs *regs, u32 instword)
return 0;
}
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+static inline bool tm_abort_check(struct pt_regs *regs, int cause)
+{
+ /* If we're emulating a load/store in an active transaction, we cannot
+ * emulate it as the kernel operates in transaction suspended context.
+ * We need to abort the transaction. This creates a persistent TM
+ * abort so tell the user what caused it with a new code.
+ */
+ if (MSR_TM_TRANSACTIONAL(regs->msr)) {
+ tm_enable();
+ tm_abort(cause);
+ return true;
+ }
+ return false;
+}
+#else
+static inline bool tm_abort_check(struct pt_regs *regs, int reason)
+{
+ return false;
+}
+#endif
+
static int emulate_instruction(struct pt_regs *regs)
{
u32 instword;
@@ -971,6 +994,9 @@ static int emulate_instruction(struct pt_regs *regs)
/* Emulate load/store string insn. */
if ((instword & PPC_INST_STRING_GEN_MASK) == PPC_INST_STRING) {
+ if (tm_abort_check(regs,
+ TM_CAUSE_EMULATE | TM_CAUSE_PERSISTENT))
+ return -EINVAL;
PPC_WARN_EMULATED(string, regs);
return emulate_string_inst(regs, instword);
}
@@ -1148,6 +1174,9 @@ void alignment_exception(struct pt_regs *regs)
if (!arch_irq_disabled_regs(regs))
local_irq_enable();
+ if (tm_abort_check(regs, TM_CAUSE_ALIGNMENT | TM_CAUSE_PERSISTENT))
+ goto bail;
+
/* we don't implement logging of alignment exceptions */
if (!(current->thread.align_ctl & PR_UNALIGN_SIGBUS))
fixed = fix_alignment(regs);
^ permalink raw reply related
* Re: [PATCH 2/2] sparc/PCI: Use PCI_UNKNOWN for unknown power state
From: David Miller @ 2013-05-21 1:51 UTC (permalink / raw)
To: bhelgaas; +Cc: rjw, paulus, linux-pci, sparclinux, linuxppc-dev
In-Reply-To: <20130520231916.32416.7318.stgit@bhelgaas-glaptop>
From: Bjorn Helgaas <bhelgaas@google.com>
Date: Mon, 20 May 2013 17:19:16 -0600
> Previously we initialized dev->current_state to 4 (PCI_D3cold), but I think
> we wanted PCI_UNKNOWN (5) here based on the comment and the fact that the
> generic version of this code, pci_setup_device(), uses PCI_UNKNOWN.
>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
^ permalink raw reply
* RE: [PATCH 2/2 V8] powerpc/85xx: Add machine check handler to fix PCIe erratum on mpc85xx
From: Jia Hongtao-B38951 @ 2013-05-21 2:07 UTC (permalink / raw)
To: Benjamin Herrenschmidt, galak@kernel.crashing.org
Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org, Li Yang-R58472,
Jia Hongtao-B38951
In-Reply-To: <412C8208B4A0464FA894C5F0C278CD5D01C5BCAA@039-SN1MPN1-002.039d.mgd.msft.net>
Hi Ben and Kumar,
I'm really appreciate if you could help me to review this patches
for these patches were pending nearly a month.
Thanks.
-Hongtao
> -----Original Message-----
> From: Jia Hongtao-B38951
> Sent: Monday, May 13, 2013 2:20 PM
> To: 'Benjamin Herrenschmidt'
> Cc: linuxppc-dev@lists.ozlabs.org; galak@kernel.crashing.org; Wood Scott-
> B07421; Li Yang-R58472; Jia Hongtao-B38951
> Subject: RE: [PATCH 2/2 V8] powerpc/85xx: Add machine check handler to
> fix PCIe erratum on mpc85xx
>=20
> Hi Ben,
>=20
> These four patches have been reviewed for a long time
> and look good to Scott Wood.
> It seems Kumar have no enough time for further review.
> Could you please help me to review them?
>=20
> http://patchwork.ozlabs.org/patch/233211/
> http://patchwork.ozlabs.org/patch/235276/
> http://patchwork.ozlabs.org/patch/240238/
> http://patchwork.ozlabs.org/patch/240239/
>=20
> Thanks.
> -Hongtao
>=20
> > -----Original Message-----
> > From: Jia Hongtao-B38951
> > Sent: Friday, May 10, 2013 12:01 PM
> > To: galak@kernel.crashing.org
> > Cc: linuxppc-dev@lists.ozlabs.org; Li Yang-R58472; Jia Hongtao-B38951
> > Subject: RE: [PATCH 2/2 V8] powerpc/85xx: Add machine check handler to
> > fix PCIe erratum on mpc85xx
> >
> > > > -----Original Message-----
> > > > From: Wood Scott-B07421
> > > > Sent: Friday, May 03, 2013 1:04 AM
> > > > To: Jia Hongtao-B38951
> > > > Cc: linuxppc-dev@lists.ozlabs.org; galak@kernel.crashing.org; Wood
> > > > Scott- B07421; segher@kernel.crashing.org; Li Yang-R58472; Jia
> > > > Hongtao-B38951
> > > > Subject: Re: [PATCH 2/2 V8] powerpc/85xx: Add machine check handler
> > > > to fix PCIe erratum on mpc85xx
> > > >
> > > > On 04/28/2013 12:20:08 AM, Jia Hongtao wrote:
> > > > > A PCIe erratum of mpc85xx may causes a core hang when a link of
> > > > > PCIe goes down. when the link goes down, Non-posted transactions
> > > > > issued via the ATMU requiring completion result in an instruction
> > stall.
> > > > > At the same time a machine-check exception is generated to the
> > > > > core to allow further processing by the handler. We implements
> the
> > > > > handler which skips the instruction caused the stall.
> > > > >
> > > > > This patch depends on patch:
> > > > > powerpc/85xx: Add platform_device declaration to fsl_pci.h
> > > > >
> > > > > Signed-off-by: Zhao Chenhui <b35336@freescale.com>
> > > > > Signed-off-by: Li Yang <leoli@freescale.com>
> > > > > Signed-off-by: Liu Shuo <soniccat.liu@gmail.com>
> > > > > Signed-off-by: Jia Hongtao <hongtao.jia@freescale.com>
> > > > > ---
> > > > > V8:
> > > > > * Add A variant load instruction emulation.
> > > >
> > > > ACK
> > > >
> > > > -Scott
> > >
> > > Thanks for the review.
> > >
> > > Hi Kumar,
> > >
> > > Could you please review these MSI and PCI hang errata patches?
> > > http://patchwork.ozlabs.org/patch/233211/
> > > http://patchwork.ozlabs.org/patch/235276/
> > > http://patchwork.ozlabs.org/patch/240238/
> > > http://patchwork.ozlabs.org/patch/240239/ (This patch)
> > >
> > > Thanks.
> > > -Hongtao
> >
> > Hi Kumar,
> >
> > I'm really appreciated if you have time to review these patches?
> >
> > Thanks.
> > -Hongtao
^ permalink raw reply
* [PATCH 0/4 v2] KVM: PPC: IOMMU in-kernel handling
From: Alexey Kardashevskiy @ 2013-05-21 3:06 UTC (permalink / raw)
To: linuxppc-dev
Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
Paul Mackerras, David Gibson
This accelerates IOMMU operations in real and virtual
mode in the host kernel for the KVM guest.
The first patch with multitce support is useful for emulated devices as is.
The other patches are designed for VFIO although this series
does not contain any VFIO related code as the connection between
VFIO and the new handlers is to be made in QEMU
via ioctl to the KVM fd.
The series was made and tested against v3.10-rc1.
Alexey Kardashevskiy (4):
KVM: PPC: Add support for multiple-TCE hcalls
powerpc: Prepare to support kernel handling of IOMMU map/unmap
KVM: PPC: Add support for IOMMU in-kernel handling
KVM: PPC: Add hugepage support for IOMMU in-kernel handling
Documentation/virtual/kvm/api.txt | 42 +++
arch/powerpc/include/asm/kvm_host.h | 7 +
arch/powerpc/include/asm/kvm_ppc.h | 40 ++-
arch/powerpc/include/asm/pgtable-ppc64.h | 4 +
arch/powerpc/include/uapi/asm/kvm.h | 7 +
arch/powerpc/kvm/book3s_64_vio.c | 398 ++++++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 471 ++++++++++++++++++++++++++++--
arch/powerpc/kvm/book3s_hv.c | 39 +++
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
arch/powerpc/kvm/book3s_pr_papr.c | 37 ++-
arch/powerpc/kvm/powerpc.c | 15 +
arch/powerpc/mm/init_64.c | 77 ++++-
include/uapi/linux/kvm.h | 5 +
13 files changed, 1120 insertions(+), 28 deletions(-)
--
1.7.10.4
^ permalink raw reply
* [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls
From: Alexey Kardashevskiy @ 2013-05-21 3:06 UTC (permalink / raw)
To: linuxppc-dev
Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
Paul Mackerras, David Gibson
In-Reply-To: <1369105607-20957-1-git-send-email-aik@ozlabs.ru>
This adds real mode handlers for the H_PUT_TCE_INDIRECT and
H_STUFF_TCE hypercalls for QEMU emulated devices such as virtio
devices or emulated PCI. These calls allow adding multiple entries
(up to 512) into the TCE table in one call which saves time on
transition to/from real mode.
This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
(copied from user and verified) before writing the whole list into
the TCE table. This cache will be utilized more in the upcoming
VFIO/IOMMU support to continue TCE list processing in the virtual
mode in the case if the real mode handler failed for some reason.
This adds a guest physical to host real address converter
and calls the existing H_PUT_TCE handler. The converting function
is going to be fully utilized by upcoming VFIO supporting patches.
This also implements the KVM_CAP_PPC_MULTITCE capability,
so in order to support the functionality of this patch, QEMU
needs to query for this capability and set the "hcall-multi-tce"
hypertas property only if the capability is present, otherwise
there will be serious performance degradation.
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
Changelog:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start
writing to TCE table if we cannot get TCEs from the user and they are
invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with failthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
---
Documentation/virtual/kvm/api.txt | 14 ++
arch/powerpc/include/asm/kvm_host.h | 2 +
arch/powerpc/include/asm/kvm_ppc.h | 16 +-
arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++
arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++----
arch/powerpc/kvm/book3s_hv.c | 39 +++++
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
arch/powerpc/kvm/powerpc.c | 3 +
include/uapi/linux/kvm.h | 1 +
10 files changed, 470 insertions(+), 32 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 5f91eda..3c7c7ea 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2780,3 +2780,17 @@ Parameters: args[0] is the XICS device fd
args[1] is the XICS CPU number (server ID) for this vcpu
This capability connects the vcpu to an in-kernel XICS device.
+
+6.8 KVM_CAP_PPC_MULTITCE
+
+Architectures: ppc
+Parameters: none
+Returns: 0 on success; -1 on error
+
+This capability enables the guest to put/remove multiple TCE entries
+per hypercall which significanly accelerates DMA operations for PPC KVM
+guests.
+
+When this capability is enabled, H_PUT_TCE_INDIRECT and H_STUFF_TCE are
+expected to occur rather than H_PUT_TCE which supports only one TCE entry
+per call.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index af326cd..85d8f26 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+
+ unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */
#endif
};
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..e852921b 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,8 +133,20 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
-extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
- unsigned long ioba, unsigned long tce);
+extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
+ struct kvm_vcpu *vcpu, unsigned long liobn);
+extern long kvmppc_emulated_validate_tce(unsigned long tce);
+extern void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+ unsigned long ioba, unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce);
+extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages);
+extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages);
extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index b2d3f3b..06b7b20 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -14,6 +14,7 @@
*
* Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
* Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
*/
#include <linux/types.h>
@@ -36,8 +37,11 @@
#include <asm/ppc-opcode.h>
#include <asm/kvm_host.h>
#include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR ((void *)~(unsigned long)0x0)
static long kvmppc_stt_npages(unsigned long window_size)
{
@@ -148,3 +152,117 @@ fail:
}
return ret;
}
+
+/* Converts guest physical address into host virtual */
+static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+ unsigned long gpa)
+{
+ unsigned long hva, gfn = gpa >> PAGE_SHIFT;
+ struct kvm_memory_slot *memslot;
+
+ memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+ if (!memslot)
+ return ERROR_ADDR;
+
+ hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+ return (void *) hva;
+}
+
+long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce)
+{
+ long ret;
+ struct kvmppc_spapr_tce_table *tt;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to userspace */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /* Emulated IO */
+ if (ioba >= tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce);
+ if (ret)
+ return ret;
+
+ kvmppc_emulated_put_tce(tt, ioba, tce);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+ unsigned long __user *tces;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to userspace */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /*
+ * The spec says that the maximum size of the list is 512 TCEs so
+ * so the whole table addressed resides in 4K page
+ */
+ if (npages > 512)
+ return H_PARAMETER;
+
+ if (tce_list & ~IOMMU_PAGE_MASK)
+ return H_PARAMETER;
+
+ tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+ if (tces == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ /* Emulated IO */
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i) {
+ if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(vcpu->arch.tce_tmp[i]);
+ if (ret)
+ return ret;
+ }
+
+ for (i = 0; i < npages; ++i)
+ kvmppc_emulated_put_tce(tt,
+ ioba + (i << IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp[i]);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to userspace */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /* Emulated IO */
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce_value);
+ if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+ kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+ return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 30c2f3b..c68d538 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -14,6 +14,7 @@
*
* Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
* Copyright 2011 David Gibson, IBM Corporation <dwg@au1.ibm.com>
+ * Copyright 2013 Alexey Kardashevskiy, IBM Corporation <aik@au1.ibm.com>
*/
#include <linux/types.h>
@@ -35,42 +36,249 @@
#include <asm/ppc-opcode.h>
#include <asm/kvm_host.h>
#include <asm/udbg.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
+#define ERROR_ADDR (~(unsigned long)0x0)
-/* WARNING: This will be called in real-mode on HV KVM and virtual
- * mode on PR KVM
+/* Finds a TCE table descriptor by LIOBN */
+struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(struct kvm_vcpu *vcpu,
+ unsigned long liobn)
+{
+ struct kvmppc_spapr_tce_table *tt;
+
+ list_for_each_entry(tt, &vcpu->kvm->arch.spapr_tce_tables, list) {
+ if (tt->liobn == liobn)
+ return tt;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(kvmppc_find_tce_table);
+
+/*
+ * Validate TCE address.
+ * At the moment only flags are validated
+ * as other check will significantly slow down
+ * or can make it even impossible to handle TCE requests
+ * in real mode.
+ */
+long kvmppc_emulated_validate_tce(unsigned long tce)
+{
+ if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+ return H_PARAMETER;
+
+ return H_SUCCESS;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_validate_tce);
+
+/*
+ * kvmppc_emulated_put_tce() handles TCE requests for devices emulated
+ * by QEMU. It puts guest TCE values into the table and expects
+ * the QEMU to convert them later in the QEMU device implementation.
+ * Wiorks in both real and virtual modes.
+ * It cannot fail so kvmppc_emulated_validate_tce must be called before it.
*/
+void kvmppc_emulated_put_tce(struct kvmppc_spapr_tce_table *tt,
+ unsigned long ioba, unsigned long tce)
+{
+ unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
+ struct page *page;
+ u64 *tbl;
+
+ /*
+ * Note on the use of page_address() in real mode,
+ *
+ * It is safe to use page_address() in real mode on ppc64 because
+ * page_address() is always defined as lowmem_page_address()
+ * which returns __va(PFN_PHYS(page_to_pfn(page))) which is arithmetial
+ * operation and does not access page struct.
+ *
+ * Theoretically page_address() could be defined different
+ * but either WANT_PAGE_VIRTUAL or HASHED_PAGE_VIRTUAL
+ * should be enabled.
+ * WANT_PAGE_VIRTUAL is never enabled on ppc32/ppc64,
+ * HASHED_PAGE_VIRTUAL could be enabled for ppc32 only and only
+ * if CONFIG_HIGHMEM is defined. As CONFIG_SPARSEMEM_VMEMMAP
+ * is not expected to be enabled on ppc32, page_address()
+ * is safe for ppc32 as well.
+ */
+#if defined(HASHED_PAGE_VIRTUAL) || defined(WANT_PAGE_VIRTUAL)
+#error TODO: fix to avoid page_address() here
+#endif
+ page = tt->pages[idx / TCES_PER_PAGE];
+ tbl = (u64 *)page_address(page);
+
+ /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
+ tbl[idx % TCES_PER_PAGE] = tce;
+}
+EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
+
+#ifdef CONFIG_KVM_BOOK3S_64_HV
+
+static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
+ unsigned long *pte_sizep)
+{
+ pte_t *ptep;
+ unsigned int shift = 0;
+ pte_t pte, tmp, ret;
+
+ ptep = find_linux_pte_or_hugepte(pgdir, hva, &shift);
+ if (!ptep)
+ return __pte(0);
+ if (shift)
+ *pte_sizep = 1ul << shift;
+ else
+ *pte_sizep = PAGE_SIZE;
+
+ if (!pte_present(*ptep))
+ return __pte(0);
+
+ /* wait until _PAGE_BUSY is clear then set it atomically */
+ __asm__ __volatile__ (
+ "1: ldarx %0,0,%3\n"
+ " andi. %1,%0,%4\n"
+ " bne- 1b\n"
+ " ori %1,%0,%4\n"
+ " stdcx. %1,0,%3\n"
+ " bne- 1b"
+ : "=&r" (pte), "=&r" (tmp), "=m" (*ptep)
+ : "r" (ptep), "i" (_PAGE_BUSY)
+ : "cc");
+
+ ret = pte;
+
+ return ret;
+}
+
+/*
+ * Converts guest physical address into host physical address.
+ * Also returns pte and page size if the page is present in page table.
+ */
+static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+ unsigned long gpa)
+{
+ struct kvm_memory_slot *memslot;
+ pte_t pte;
+ unsigned long hva, hpa, pg_size = 0, offset;
+ unsigned long gfn = gpa >> PAGE_SHIFT;
+ bool writing = gpa & TCE_PCI_WRITE;
+
+ /* Find a KVM memslot */
+ memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
+ if (!memslot)
+ return ERROR_ADDR;
+
+ /* Convert guest physical address to host virtual */
+ hva = __gfn_to_hva_memslot(memslot, gfn);
+
+ /* Find a PTE and determine the size */
+ pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
+ writing, &pg_size);
+ if (!pte)
+ return ERROR_ADDR;
+
+ /* Calculate host phys address keeping flags and offset in the page */
+ offset = gpa & (pg_size - 1);
+
+ /* pte_pfn(pte) should return an address aligned to pg_size */
+ hpa = (pte_pfn(pte) << PAGE_SHIFT) + offset;
+
+ return hpa;
+}
+
long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
- struct kvm *kvm = vcpu->kvm;
- struct kvmppc_spapr_tce_table *stt;
-
- /* udbg_printf("H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n", */
- /* liobn, ioba, tce); */
-
- list_for_each_entry(stt, &kvm->arch.spapr_tce_tables, list) {
- if (stt->liobn == liobn) {
- unsigned long idx = ioba >> SPAPR_TCE_SHIFT;
- struct page *page;
- u64 *tbl;
-
- /* udbg_printf("H_PUT_TCE: liobn 0x%lx => stt=%p window_size=0x%x\n", */
- /* liobn, stt, stt->window_size); */
- if (ioba >= stt->window_size)
- return H_PARAMETER;
-
- page = stt->pages[idx / TCES_PER_PAGE];
- tbl = (u64 *)page_address(page);
-
- /* FIXME: Need to validate the TCE itself */
- /* udbg_printf("tce @ %p\n", &tbl[idx % TCES_PER_PAGE]); */
- tbl[idx % TCES_PER_PAGE] = tce;
- return H_SUCCESS;
- }
+ long ret;
+ struct kvmppc_spapr_tce_table *tt;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to virtual space */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /* Emulated IO */
+ if (ioba >= tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce);
+ if (ret)
+ return ret;
+
+ kvmppc_emulated_put_tce(tt, ioba, tce);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_list, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+ unsigned long *tces;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to virtual space */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /*
+ * The spec says that the maximum size of the list is 512 TCEs so
+ * so the whole table addressed resides in 4K page
+ */
+ if (npages > 512)
+ return H_PARAMETER;
+
+ if (tce_list & ~IOMMU_PAGE_MASK)
+ return H_PARAMETER;
+
+ tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+ if ((unsigned long)tces == ERROR_ADDR)
+ return H_TOO_HARD;
+
+ /* Emulated IO */
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i) {
+ ret = kvmppc_emulated_validate_tce(tces[i]);
+ if (ret)
+ return ret;
}
- /* Didn't find the liobn, punt it to userspace */
- return H_TOO_HARD;
+ for (i = 0; i < npages; ++i)
+ kvmppc_emulated_put_tce(tt, ioba + (i << IOMMU_PAGE_SHIFT),
+ tces[i]);
+
+ return H_SUCCESS;
+}
+
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+ unsigned long liobn, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ struct kvmppc_spapr_tce_table *tt;
+ long i, ret;
+
+ tt = kvmppc_find_tce_table(vcpu, liobn);
+ /* Didn't find the liobn, put it to virtual space */
+ if (!tt)
+ return H_TOO_HARD;
+
+ /* Emulated IO */
+ if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
+ return H_PARAMETER;
+
+ ret = kvmppc_emulated_validate_tce(tce_value);
+ if (ret || (tce_value & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i, ioba += IOMMU_PAGE_SIZE)
+ kvmppc_emulated_put_tce(tt, ioba, tce_value);
+
+ return H_SUCCESS;
}
+#endif /* CONFIG_KVM_BOOK3S_64_HV */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9de24f8..9d70fd8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -566,6 +566,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
ret = kvmppc_xics_hcall(vcpu, req);
break;
} /* fallthrough */
+ return RESUME_HOST;
+ case H_PUT_TCE:
+ ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
+ case H_PUT_TCE_INDIRECT:
+ ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6),
+ kvmppc_get_gpr(vcpu, 7));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
+ case H_STUFF_TCE:
+ ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
+ kvmppc_get_gpr(vcpu, 5),
+ kvmppc_get_gpr(vcpu, 6),
+ kvmppc_get_gpr(vcpu, 7));
+ if (ret == H_TOO_HARD)
+ return RESUME_HOST;
+ break;
default:
return RESUME_HOST;
}
@@ -956,6 +980,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
vcpu->arch.cpu_type = KVM_CPU_3S_64;
kvmppc_sanity_check(vcpu);
+ /*
+ * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
+ * half executed, we first read TCEs from the user, check them and
+ * return error if something went wrong and only then put TCEs into
+ * the TCE table.
+ *
+ * tce_tmp is a cache for TCEs to avoid stack allocation or
+ * kmalloc as the whole TCE list can take up to 512 items 8 bytes
+ * each (4096 bytes).
+ */
+ vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
+ if (!vcpu->arch.tce_tmp)
+ goto free_vcpu;
+
return vcpu;
free_vcpu:
@@ -978,6 +1016,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
spin_unlock(&vcpu->arch.vpa_update_lock);
+ kfree(vcpu->arch.tce_tmp);
kvm_vcpu_uninit(vcpu);
kmem_cache_free(kvm_vcpu_cache, vcpu);
}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b02f91e..d35554e 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1490,6 +1490,12 @@ hcall_real_table:
.long 0 /* 0x11c */
.long 0 /* 0x120 */
.long .kvmppc_h_bulk_remove - hcall_real_table
+ .long 0 /* 0x128 */
+ .long 0 /* 0x12c */
+ .long 0 /* 0x130 */
+ .long 0 /* 0x134 */
+ .long .kvmppc_h_stuff_tce - hcall_real_table
+ .long .kvmppc_h_put_tce_indirect - hcall_real_table
hcall_real_table_end:
ignore_hdec:
diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
index b24309c..3750e3c 100644
--- a/arch/powerpc/kvm/book3s_pr_papr.c
+++ b/arch/powerpc/kvm/book3s_pr_papr.c
@@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
unsigned long tce = kvmppc_get_gpr(vcpu, 6);
long rc;
- rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
+ rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
+ if (rc == H_TOO_HARD)
+ return EMULATE_FAIL;
+ kvmppc_set_gpr(vcpu, 3, rc);
+ return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
+{
+ unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+ unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+ unsigned long tce = kvmppc_get_gpr(vcpu, 6);
+ unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+ long rc;
+
+ rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
+ tce, npages);
+ if (rc == H_TOO_HARD)
+ return EMULATE_FAIL;
+ kvmppc_set_gpr(vcpu, 3, rc);
+ return EMULATE_DONE;
+}
+
+static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
+{
+ unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
+ unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
+ unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
+ unsigned long npages = kvmppc_get_gpr(vcpu, 7);
+ long rc;
+
+ rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
if (rc == H_TOO_HARD)
return EMULATE_FAIL;
kvmppc_set_gpr(vcpu, 3, rc);
@@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
return kvmppc_h_pr_bulk_remove(vcpu);
case H_PUT_TCE:
return kvmppc_h_pr_put_tce(vcpu);
+ case H_PUT_TCE_INDIRECT:
+ return kvmppc_h_pr_put_tce_indirect(vcpu);
+ case H_STUFF_TCE:
+ return kvmppc_h_pr_stuff_tce(vcpu);
case H_CEDE:
vcpu->arch.shared->msr |= MSR_EE;
kvm_vcpu_block(vcpu);
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..8465c2a 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
r = 1;
break;
#endif
+ case KVM_CAP_SPAPR_MULTITCE:
+ r = 1;
+ break;
default:
r = 0;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a5c86fc..5a2afda 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -666,6 +666,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_IRQ_MPIC 90
#define KVM_CAP_PPC_RTAS 91
#define KVM_CAP_IRQ_XICS 92
+#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
#ifdef KVM_CAP_IRQ_ROUTING
--
1.7.10.4
^ permalink raw reply related
* [PATCH 2/4] powerpc: Prepare to support kernel handling of IOMMU map/unmap
From: Alexey Kardashevskiy @ 2013-05-21 3:06 UTC (permalink / raw)
To: linuxppc-dev
Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
Paul Mackerras, David Gibson
In-Reply-To: <1369105607-20957-1-git-send-email-aik@ozlabs.ru>
The current VFIO-on-POWER implementation supports only user mode
driven mapping, i.e. QEMU is sending requests to map/unmap pages.
However this approach is really slow, so we want to move that to KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement that as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).
To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment as things
like the sparsemem vmemmap mappings aren't accessible.
This adds an API to increment/decrement page counter as
get_user_pages API used for user mode mapping does not work
in the real mode.
CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
Changes:
2013-05-20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
---
arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++
arch/powerpc/mm/init_64.c | 77 +++++++++++++++++++++++++++++-
2 files changed, 80 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index e3d55f6f..7b46e5f 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -376,6 +376,10 @@ static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
}
#endif /* !CONFIG_HUGETLB_PAGE */
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
#endif /* __ASSEMBLY__ */
#endif /* _ASM_POWERPC_PGTABLE_PPC64_H_ */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index c2787bf..ba6cf9b 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -296,5 +296,80 @@ void vmemmap_free(unsigned long start, unsigned long end)
{
}
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fallback to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of realmode_XXXX functions can fail due to:
+ * 1) As real sparsemem blocks do not lay in RAM continously (they
+ * are in virtual address space which is not available in the real mode),
+ * the requested page struct can be split between blocks so get_page/put_page
+ * may fail.
+ * 2) When huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ * When 1) or 2) takes place, the API returns an error code to cause
+ * an exit to kernel virtual mode where the operation will be completed.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct vmemmap_backing *vmem_back;
+ struct page *page;
+ unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+ unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+ for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+ if (pg_va < vmem_back->virt_addr)
+ continue;
+ /* Check that page struct is not split between real pages */
+ if ((pg_va + sizeof(struct page)) >
+ (vmem_back->virt_addr + page_size))
+ return NULL;
+
+ page = (struct page *) (vmem_back->phys + pg_va -
+ vmem_back->virt_addr);
+ return page;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+ if (PageCompound(page))
+ return -EAGAIN;
+
+ get_page(page);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+ if (PageCompound(page))
+ return -EAGAIN;
+
+ if (!atomic_add_unless(&page->_count, -1, 1))
+ return -EAGAIN;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
--
1.7.10.4
^ permalink raw reply related
* [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling
From: Alexey Kardashevskiy @ 2013-05-21 3:06 UTC (permalink / raw)
To: linuxppc-dev
Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
Paul Mackerras, David Gibson
In-Reply-To: <1369105607-20957-1-git-send-email-aik@ozlabs.ru>
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests without passing them to QEMU, which should
save time on switching to QEMU and back.
Both real and virtual modes are supported - whenever the kernel
fails to handle TCE request, it passes it to the virtual mode.
If it the virtual mode handlers fail, then the request is passed
to the user mode, for example, to QEMU.
This adds a new KVM_CAP_SPAPR_TCE_IOMMU ioctl to asssociate
a virtual PCI bus ID (LIOBN) with an IOMMU group, which enables
in-kernel handling of IOMMU map/unmap.
Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
Changes:
2013-05-20:
* removed get_user() from real mode handlers
* kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
translated TCEs, tries realmode_get_page() on those and if it fails, it
passes control over the virtual mode handler which tries to finish
the request handling
* kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
on a page
* The only reason to pass the request to user mode now is when the user mode
did not register TCE table in the kernel, in all other cases the virtual mode
handler is expected to do the job
---
Documentation/virtual/kvm/api.txt | 28 +++++
arch/powerpc/include/asm/kvm_host.h | 3 +
arch/powerpc/include/asm/kvm_ppc.h | 2 +
arch/powerpc/include/uapi/asm/kvm.h | 7 ++
arch/powerpc/kvm/book3s_64_vio.c | 198 ++++++++++++++++++++++++++++++++++-
arch/powerpc/kvm/book3s_64_vio_hv.c | 193 +++++++++++++++++++++++++++++++++-
arch/powerpc/kvm/powerpc.c | 12 +++
include/uapi/linux/kvm.h | 4 +
8 files changed, 441 insertions(+), 6 deletions(-)
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 3c7c7ea..3c8e9fe 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,34 @@ calls by the guest for that service will be passed to userspace to be
handled.
+4.79 KVM_CREATE_SPAPR_TCE_IOMMU
+
+Capability: KVM_CAP_SPAPR_TCE_IOMMU
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_create_spapr_tce_iommu (in)
+Returns: 0 on success, -1 on error
+
+This creates a link between IOMMU group and a hardware TCE (translation
+control entry) table. This link lets the host kernel know what IOMMU
+group (i.e. TCE table) to use for the LIOBN number passed with
+H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE hypercalls.
+
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+ __u64 liobn;
+ __u32 iommu_id;
+ __u32 flags;
+};
+
+No flag is supported at the moment.
+
+When the guest issues TCE call on a liobn for which a TCE table has been
+registered, the kernel will handle it in real mode, updating the hardware
+TCE table. TCE table calls for other liobns will cause a vm exit and must
+be handled by userspace.
+
+
5. The kvm_run structure
------------------------
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 85d8f26..ac0e2fe 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -180,6 +180,7 @@ struct kvmppc_spapr_tce_table {
struct kvm *kvm;
u64 liobn;
u32 window_size;
+ struct iommu_group *grp; /* used for IOMMU groups */
struct page *pages[0];
};
@@ -611,6 +612,8 @@ struct kvm_vcpu_arch {
u64 busy_preempt;
unsigned long *tce_tmp; /* TCE cache for TCE_PUT_INDIRECT hall */
+ unsigned long tce_tmp_num; /* Number of handled TCEs in the cache */
+ unsigned long tce_reason; /* The reason of switching to the virtmode */
#endif
};
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e852921b..934e01d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -133,6 +133,8 @@ extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args);
extern struct kvmppc_spapr_tce_table *kvmppc_find_tce_table(
struct kvm_vcpu *vcpu, unsigned long liobn);
extern long kvmppc_emulated_validate_tce(unsigned long tce);
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 0fb1a6e..cf82af4 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -319,6 +319,13 @@ struct kvm_create_spapr_tce {
__u32 window_size;
};
+/* for KVM_CAP_SPAPR_TCE_IOMMU */
+struct kvm_create_spapr_tce_iommu {
+ __u64 liobn;
+ __u32 iommu_id;
+ __u32 flags;
+};
+
/* for KVM_ALLOCATE_RMA */
struct kvm_allocate_rma {
__u64 rma_size;
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 06b7b20..ffb4698 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -27,6 +27,8 @@
#include <linux/hugetlb.h>
#include <linux/list.h>
#include <linux/anon_inodes.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -56,8 +58,13 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
mutex_lock(&kvm->lock);
list_del(&stt->list);
- for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
- __free_page(stt->pages[i]);
+#ifdef CONFIG_IOMMU_API
+ if (stt->grp) {
+ iommu_group_put(stt->grp);
+ } else
+#endif
+ for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+ __free_page(stt->pages[i]);
kfree(stt);
mutex_unlock(&kvm->lock);
@@ -153,6 +160,62 @@ fail:
return ret;
}
+#ifdef CONFIG_IOMMU_API
+static const struct file_operations kvm_spapr_tce_iommu_fops = {
+ .release = kvm_spapr_tce_release,
+};
+
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args)
+{
+ struct kvmppc_spapr_tce_table *tt = NULL;
+ struct iommu_group *grp;
+ struct iommu_table *tbl;
+
+ /* Find an IOMMU table for the given ID */
+ grp = iommu_group_get_by_id(args->iommu_id);
+ if (!grp)
+ return -ENXIO;
+
+ tbl = iommu_group_get_iommudata(grp);
+ if (!tbl)
+ return -ENXIO;
+
+ /* Check this LIOBN hasn't been previously allocated */
+ list_for_each_entry(tt, &kvm->arch.spapr_tce_tables, list) {
+ if (tt->liobn == args->liobn)
+ return -EBUSY;
+ }
+
+ tt = kzalloc(sizeof(*tt), GFP_KERNEL);
+ if (!tt)
+ return -ENOMEM;
+
+ tt->liobn = args->liobn;
+ tt->kvm = kvm;
+ tt->grp = grp;
+
+ kvm_get_kvm(kvm);
+
+ mutex_lock(&kvm->lock);
+ list_add(&tt->list, &kvm->arch.spapr_tce_tables);
+
+ mutex_unlock(&kvm->lock);
+
+ pr_debug("LIOBN=%llX hooked to IOMMU %d, flags=%u\n",
+ args->liobn, args->iommu_id, args->flags);
+
+ return anon_inode_getfd("kvm-spapr-tce-iommu",
+ &kvm_spapr_tce_iommu_fops, tt, O_RDWR);
+}
+#else
+long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
+ struct kvm_create_spapr_tce_iommu *args)
+{
+ return -ENOSYS;
+}
+#endif /* CONFIG_IOMMU_API */
+
/* Converts guest physical address into host virtual */
static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
unsigned long gpa)
@@ -180,6 +243,46 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ if (vcpu->arch.tce_reason == H_HARDWARE) {
+ iommu_clear_tces_and_put_pages(tbl, entry, 1);
+ return H_HARDWARE;
+
+ } else if (!(tce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
+ if (iommu_tce_clear_param_check(tbl, ioba, 0, 1))
+ return H_PARAMETER;
+
+ ret = iommu_clear_tces_and_put_pages(tbl, entry, 1);
+ } else {
+ void *hva;
+
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+ if (hva == ERROR_ADDR)
+ return H_HARDWARE;
+
+ ret = iommu_put_tce_user_mode(tbl,
+ ioba >> IOMMU_PAGE_SHIFT,
+ (unsigned long) hva);
+ }
+ iommu_flush_tce(tbl);
+
+ if (ret)
+ return H_HARDWARE;
+
+ return H_SUCCESS;
+ }
+#endif
/* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;
@@ -220,6 +323,70 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if (tces == ERROR_ADDR)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Something bad happened, do cleanup and exit */
+ if (vcpu->arch.tce_reason == H_HARDWARE) {
+ i = vcpu->arch.tce_tmp_num;
+ goto fail_clear_tce;
+ } else if (vcpu->arch.tce_reason != H_TOO_HARD) {
+ /*
+ * We get here only in PR KVM mode, otherwise
+ * the real mode handler would have checked TCEs
+ * already and failed on guest TCE translation.
+ */
+ for (i = 0; i < npages; ++i) {
+ if (get_user(vcpu->arch.tce_tmp[i], tces + i))
+ return H_HARDWARE;
+
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT),
+ vcpu->arch.tce_tmp[i]))
+ return H_PARAMETER;
+ }
+ } /* else: The real mode handler checked TCEs already */
+
+ /* Translate TCEs */
+ for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
+ void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+ vcpu->arch.tce_tmp[i]);
+
+ if (hva == ERROR_ADDR)
+ goto fail_clear_tce;
+
+ vcpu->arch.tce_tmp[i] = (unsigned long) hva;
+ }
+
+ /* Do get_page and put TCEs for all pages */
+ for (i = 0; i < npages; ++i) {
+ if (iommu_put_tce_user_mode(tbl,
+ (ioba >> IOMMU_PAGE_SHIFT) + i,
+ vcpu->arch.tce_tmp[i])) {
+ i = npages;
+ goto fail_clear_tce;
+ }
+ }
+
+ iommu_flush_tce(tbl);
+
+ return H_SUCCESS;
+
+fail_clear_tce:
+ /* Cannot complete the translation, clean up and exit */
+ iommu_clear_tces_and_put_pages(tbl,
+ ioba >> IOMMU_PAGE_SHIFT, i);
+
+ iommu_flush_tce(tbl);
+
+ return H_HARDWARE;
+ }
+#endif
/* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
@@ -253,6 +420,33 @@ long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+ unsigned long tmp, entry = ioba >> IOMMU_PAGE_SHIFT;
+
+ vcpu->arch.tce_tmp_num = 0;
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* PR KVM? */
+ if (!vcpu->arch.tce_tmp_num &&
+ (vcpu->arch.tce_reason != H_TOO_HARD) &&
+ iommu_tce_clear_param_check(tbl, ioba,
+ tce_value, npages))
+ return H_PARAMETER;
+
+ /* Do actual cleanup */
+ tmp = vcpu->arch.tce_tmp_num;
+ if (iommu_clear_tces_and_put_pages(tbl, entry + tmp,
+ npages - tmp))
+ return H_PARAMETER;
+
+ return H_SUCCESS;
+ }
+#endif
/* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index c68d538..dc4ae32 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -26,6 +26,7 @@
#include <linux/slab.h>
#include <linux/hugetlb.h>
#include <linux/list.h>
+#include <linux/iommu.h>
#include <asm/tlbflush.h>
#include <asm/kvm_ppc.h>
@@ -118,7 +119,7 @@ EXPORT_SYMBOL_GPL(kvmppc_emulated_put_tce);
#ifdef CONFIG_KVM_BOOK3S_64_HV
static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
- unsigned long *pte_sizep)
+ unsigned long *pte_sizep, bool do_get_page)
{
pte_t *ptep;
unsigned int shift = 0;
@@ -135,6 +136,14 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
if (!pte_present(*ptep))
return __pte(0);
+ /*
+ * Put huge pages handling to the virtual mode.
+ * The only exception is for TCE list pages which we
+ * do need to call get_page() for.
+ */
+ if ((*pte_sizep > PAGE_SIZE) && do_get_page)
+ return __pte(0);
+
/* wait until _PAGE_BUSY is clear then set it atomically */
__asm__ __volatile__ (
"1: ldarx %0,0,%3\n"
@@ -148,6 +157,18 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
: "cc");
ret = pte;
+ if (do_get_page && pte_present(pte) && (!writing || pte_write(pte))) {
+ struct page *pg = NULL;
+ pg = realmode_pfn_to_page(pte_pfn(pte));
+ if (realmode_get_page(pg)) {
+ ret = __pte(0);
+ } else {
+ pte = pte_mkyoung(pte);
+ if (writing)
+ pte = pte_mkdirty(pte);
+ }
+ }
+ *ptep = pte; /* clears _PAGE_BUSY */
return ret;
}
@@ -157,7 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
* Also returns pte and page size if the page is present in page table.
*/
static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
- unsigned long gpa)
+ unsigned long gpa, bool do_get_page)
{
struct kvm_memory_slot *memslot;
pte_t pte;
@@ -175,7 +196,7 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
/* Find a PTE and determine the size */
pte = kvmppc_lookup_pte(vcpu->arch.pgdir, hva,
- writing, &pg_size);
+ writing, &pg_size, do_get_page);
if (!pte)
return ERROR_ADDR;
@@ -188,6 +209,52 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
return hpa;
}
+#ifdef CONFIG_IOMMU_API
+static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
+ struct iommu_table *tbl, unsigned long ioba,
+ unsigned long tce_value, unsigned long npages)
+{
+ long ret = 0, i;
+ unsigned long entry = ioba >> IOMMU_PAGE_SHIFT;
+
+ if (iommu_tce_clear_param_check(tbl, ioba, tce_value, npages))
+ return H_PARAMETER;
+
+ for (i = 0; i < npages; ++i) {
+ struct page *page;
+ unsigned long oldtce;
+
+ oldtce = iommu_clear_tce(tbl, entry + i);
+ if (!oldtce)
+ continue;
+
+ page = realmode_pfn_to_page(oldtce >> PAGE_SHIFT);
+ if (!page) {
+ ret = H_TOO_HARD;
+ break;
+ }
+
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(page);
+
+ if (realmode_put_page(page)) {
+ ret = H_TOO_HARD;
+ break;
+ }
+ }
+
+ if (ret == H_TOO_HARD) {
+ vcpu->arch.tce_tmp_num = i;
+ vcpu->arch.tce_reason = H_TOO_HARD;
+ }
+ /* if (ret < 0)
+ pr_err("iommu_tce: %s failed ioba=%lx, tce_value=%lx ret=%d\n",
+ __func__, ioba, tce_value, ret); */
+
+ return ret;
+}
+#endif /* CONFIG_IOMMU_API */
+
long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
unsigned long ioba, unsigned long tce)
{
@@ -199,6 +266,52 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (!tt)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ vcpu->arch.tce_reason = 0;
+
+ if (tce & (TCE_PCI_READ | TCE_PCI_WRITE)) {
+ unsigned long hpa, hva;
+
+ if (iommu_tce_put_param_check(tbl, ioba, tce))
+ return H_PARAMETER;
+
+ hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+ if (hpa == ERROR_ADDR) {
+ vcpu->arch.tce_reason = H_TOO_HARD;
+ return H_TOO_HARD;
+ }
+
+ hva = (unsigned long) __va(hpa);
+ ret = iommu_tce_build(tbl,
+ ioba >> IOMMU_PAGE_SHIFT,
+ hva, iommu_tce_direction(hva));
+ if (unlikely(ret)) {
+ struct page *pg = realmode_pfn_to_page(hpa);
+ BUG_ON(!pg);
+ if (realmode_put_page(pg)) {
+ vcpu->arch.tce_reason = H_HARDWARE;
+ return H_TOO_HARD;
+ }
+ return H_HARDWARE;
+ }
+ } else {
+ ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba, 0, 1);
+ if (ret)
+ return ret;
+ }
+
+ iommu_flush_tce(tbl);
+
+ return H_SUCCESS;
+ }
+#endif
/* Emulated IO */
if (ioba >= tt->window_size)
return H_PARAMETER;
@@ -235,10 +348,62 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if (tce_list & ~IOMMU_PAGE_MASK)
return H_PARAMETER;
- tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, tce_list);
+ vcpu->arch.tce_tmp_num = 0;
+ vcpu->arch.tce_reason = 0;
+
+ tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+ tce_list, false);
if ((unsigned long)tces == ERROR_ADDR)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ /* Check all TCEs */
+ for (i = 0; i < npages; ++i) {
+ if (iommu_tce_put_param_check(tbl, ioba +
+ (i << IOMMU_PAGE_SHIFT), tces[i]))
+ return H_PARAMETER;
+ vcpu->arch.tce_tmp[i] = tces[i];
+ }
+
+ /* Translate TCEs and go get_page */
+ for (i = 0; i < npages; ++i) {
+ unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+ vcpu->arch.tce_tmp[i], true);
+ if (hpa == ERROR_ADDR) {
+ vcpu->arch.tce_tmp_num = i;
+ vcpu->arch.tce_reason = H_TOO_HARD;
+ return H_TOO_HARD;
+ }
+ vcpu->arch.tce_tmp[i] = hpa;
+ }
+
+ /* Put TCEs to the table */
+ for (i = 0; i < npages; ++i) {
+ unsigned long hva = (unsigned long)
+ __va(vcpu->arch.tce_tmp[i]);
+
+ ret = iommu_tce_build(tbl,
+ (ioba >> IOMMU_PAGE_SHIFT) + i,
+ hva, iommu_tce_direction(hva));
+ if (ret) {
+ /* All wrong, go virtmode and do cleanup */
+ vcpu->arch.tce_reason = H_HARDWARE;
+ return H_TOO_HARD;
+ }
+ }
+
+ iommu_flush_tce(tbl);
+
+ return H_SUCCESS;
+ }
+#endif
/* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
@@ -268,6 +433,26 @@ long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
if (!tt)
return H_TOO_HARD;
+#ifdef CONFIG_IOMMU_API
+ if (tt->grp) {
+ struct iommu_table *tbl = iommu_group_get_iommudata(tt->grp);
+
+ /* Return error if the group is being destroyed */
+ if (!tbl)
+ return H_RESCINDED;
+
+ vcpu->arch.tce_reason = 0;
+
+ ret = kvmppc_clear_tce_real_mode(vcpu, tbl, ioba,
+ tce_value, npages);
+ if (ret)
+ return ret;
+
+ iommu_flush_tce(tbl);
+
+ return H_SUCCESS;
+ }
+#endif
/* Emulated IO */
if ((ioba + (npages << IOMMU_PAGE_SHIFT)) > tt->window_size)
return H_PARAMETER;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 8465c2a..da6bf61 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -396,6 +396,7 @@ int kvm_dev_ioctl_check_extension(long ext)
break;
#endif
case KVM_CAP_SPAPR_MULTITCE:
+ case KVM_CAP_SPAPR_TCE_IOMMU:
r = 1;
break;
default:
@@ -1025,6 +1026,17 @@ long kvm_arch_vm_ioctl(struct file *filp,
r = kvm_vm_ioctl_create_spapr_tce(kvm, &create_tce);
goto out;
}
+ case KVM_CREATE_SPAPR_TCE_IOMMU: {
+ struct kvm_create_spapr_tce_iommu create_tce_iommu;
+ struct kvm *kvm = filp->private_data;
+
+ r = -EFAULT;
+ if (copy_from_user(&create_tce_iommu, argp,
+ sizeof(create_tce_iommu)))
+ goto out;
+ r = kvm_vm_ioctl_create_spapr_tce_iommu(kvm, &create_tce_iommu);
+ goto out;
+ }
#endif /* CONFIG_PPC_BOOK3S_64 */
#ifdef CONFIG_KVM_BOOK3S_64_HV
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 5a2afda..450c82a 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_PPC_RTAS 91
#define KVM_CAP_IRQ_XICS 92
#define KVM_CAP_SPAPR_MULTITCE (0x110000 + 89)
+#define KVM_CAP_SPAPR_TCE_IOMMU (0x110000 + 90)
#ifdef KVM_CAP_IRQ_ROUTING
@@ -939,6 +940,9 @@ struct kvm_s390_ucas_mapping {
#define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr)
#define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr)
+/* ioctl for SPAPR TCE IOMMU */
+#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xe4, struct kvm_create_spapr_tce_iommu)
+
/*
* ioctls for vcpu fds
*/
--
1.7.10.4
^ permalink raw reply related
* [PATCH 4/4] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
From: Alexey Kardashevskiy @ 2013-05-21 3:06 UTC (permalink / raw)
To: linuxppc-dev
Cc: kvm, Alexey Kardashevskiy, Alexander Graf, kvm-ppc, linux-kernel,
Paul Mackerras, David Gibson
In-Reply-To: <1369105607-20957-1-git-send-email-aik@ozlabs.ru>
This adds special support for huge pages (16MB). The reference
counting cannot be easily done for such pages in real mode (when
MMU is off) so we added a list of huge pages. It is populated in
virtual mode and get_page is called just once per a huge page.
Real mode handlers check if the requested page is huge and in the list,
then no reference counting is done, otherwise an exit to virtual mode
happens. The list is released at KVM exit. At the moment the fastest
card available for tests uses up to 9 huge pages so walking through this
list is not very expensive. However this can change and we may want
to optimize this.
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
Changes:
* the real mode handler now searches for a huge page by gpa (used to be pte)
* the virtual mode handler prints warning if it is called twice for the same
huge page as the real mode handler is expected to fail just once - when a huge
page is not in the list yet.
* the huge page is refcounted twice - when added to the hugepage list and
when used in the virtual mode hcall handler (can be optimized but it will
make the patch less nice).
---
arch/powerpc/include/asm/kvm_host.h | 2 +
arch/powerpc/include/asm/kvm_ppc.h | 22 +++++++++
arch/powerpc/kvm/book3s_64_vio.c | 88 +++++++++++++++++++++++++++++++++--
arch/powerpc/kvm/book3s_64_vio_hv.c | 40 ++++++++++++++--
4 files changed, 146 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ac0e2fe..4fc0865 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -181,6 +181,8 @@ struct kvmppc_spapr_tce_table {
u64 liobn;
u32 window_size;
struct iommu_group *grp; /* used for IOMMU groups */
+ struct list_head hugepages; /* used for IOMMU groups */
+ spinlock_t hugepages_lock; /* used for IOMMU groups */
struct page *pages[0];
};
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 934e01d..9054df0 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -149,6 +149,28 @@ extern long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
extern long kvmppc_virtmode_h_stuff_tce(struct kvm_vcpu *vcpu,
unsigned long liobn, unsigned long ioba,
unsigned long tce_value, unsigned long npages);
+
+/*
+ * The KVM guest can be backed with 16MB pages (qemu switch
+ * -mem-path /var/lib/hugetlbfs/global/pagesize-16MB/).
+ * In this case, we cannot do page counting from the real mode
+ * as the compound pages are used - they are linked in a list
+ * with pointers as virtual addresses which are inaccessible
+ * in real mode.
+ *
+ * The code below keeps a 16MB pages list and uses page struct
+ * in real mode if it is already locked in RAM and inserted into
+ * the list or switches to the virtual mode where it can be
+ * handled in a usual manner.
+ */
+struct kvmppc_iommu_hugepage {
+ struct list_head list;
+ pte_t pte; /* Huge page PTE */
+ unsigned long gpa; /* Guest physical address */
+ struct page *page; /* page struct of the very first subpage */
+ unsigned long size; /* Huge page size (always 16MB at the moment) */
+};
+
extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index ffb4698..c34d63a 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -45,6 +45,71 @@
#define TCES_PER_PAGE (PAGE_SIZE / sizeof(u64))
#define ERROR_ADDR ((void *)~(unsigned long)0x0)
+#ifdef CONFIG_IOMMU_API
+/* Adds a new huge page descriptor to the list */
+static long kvmppc_iommu_hugepage_try_add(
+ struct kvmppc_spapr_tce_table *tt,
+ pte_t pte, unsigned long hva, unsigned long gpa,
+ unsigned long pg_size)
+{
+ long ret = 0;
+ struct kvmppc_iommu_hugepage *hp;
+ struct page *p;
+
+ spin_lock(&tt->hugepages_lock);
+ list_for_each_entry(hp, &tt->hugepages, list) {
+ if (hp->pte == pte)
+ goto unlock_exit;
+ }
+
+ hva = hva & ~(pg_size - 1);
+ ret = get_user_pages_fast(hva, 1, true/*write*/, &p);
+ if ((ret != 1) || !p) {
+ ret = -EFAULT;
+ goto unlock_exit;
+ }
+ ret = 0;
+
+ hp = kzalloc(sizeof(*hp), GFP_KERNEL);
+ if (!hp) {
+ ret = -ENOMEM;
+ goto unlock_exit;
+ }
+
+ hp->page = p;
+ hp->pte = pte;
+ hp->gpa = gpa & ~(pg_size - 1);
+ hp->size = pg_size;
+
+ list_add(&hp->list, &tt->hugepages);
+
+unlock_exit:
+ spin_unlock(&tt->hugepages_lock);
+
+ return ret;
+}
+
+static void kvmppc_iommu_hugepages_init(struct kvmppc_spapr_tce_table *tt)
+{
+ INIT_LIST_HEAD(&tt->hugepages);
+ spin_lock_init(&tt->hugepages_lock);
+}
+
+static void kvmppc_iommu_hugepages_cleanup(struct kvmppc_spapr_tce_table *tt)
+{
+ struct kvmppc_iommu_hugepage *hp, *tmp;
+
+ spin_lock(&tt->hugepages_lock);
+ list_for_each_entry_safe(hp, tmp, &tt->hugepages, list) {
+ list_del(&hp->list);
+ put_page(hp->page); /* one for iommu_put_tce_user_mode */
+ put_page(hp->page); /* one for kvmppc_iommu_hugepage_try_add */
+ kfree(hp);
+ }
+ spin_unlock(&tt->hugepages_lock);
+}
+#endif /* CONFIG_IOMMU_API */
+
static long kvmppc_stt_npages(unsigned long window_size)
{
return ALIGN((window_size >> SPAPR_TCE_SHIFT)
@@ -61,6 +126,7 @@ static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
#ifdef CONFIG_IOMMU_API
if (stt->grp) {
iommu_group_put(stt->grp);
+ kvmppc_iommu_hugepages_cleanup(stt);
} else
#endif
for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
@@ -198,6 +264,7 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
kvm_get_kvm(kvm);
mutex_lock(&kvm->lock);
+ kvmppc_iommu_hugepages_init(tt);
list_add(&tt->list, &kvm->arch.spapr_tce_tables);
mutex_unlock(&kvm->lock);
@@ -218,16 +285,31 @@ long kvm_vm_ioctl_create_spapr_tce_iommu(struct kvm *kvm,
/* Converts guest physical address into host virtual */
static void __user *kvmppc_virtmode_gpa_to_hva(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
unsigned long gpa)
{
unsigned long hva, gfn = gpa >> PAGE_SHIFT;
struct kvm_memory_slot *memslot;
+ pte_t *ptep;
+ unsigned int shift = 0;
memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
if (!memslot)
return ERROR_ADDR;
hva = __gfn_to_hva_memslot(memslot, gfn) + (gpa & ~PAGE_MASK);
+
+ ptep = find_linux_pte_or_hugepte(vcpu->arch.pgdir, hva, &shift);
+ WARN_ON(!ptep);
+ if (!ptep)
+ return ERROR_ADDR;
+
+ if (tt && (shift > PAGE_SHIFT)) {
+ if (kvmppc_iommu_hugepage_try_add(tt, *ptep,
+ hva, gpa, 1 << shift))
+ return ERROR_ADDR;
+ }
+
return (void *) hva;
}
@@ -267,7 +349,7 @@ long kvmppc_virtmode_h_put_tce(struct kvm_vcpu *vcpu,
if (iommu_tce_put_param_check(tbl, ioba, tce))
return H_PARAMETER;
- hva = kvmppc_virtmode_gpa_to_hva(vcpu, tce);
+ hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt, tce);
if (hva == ERROR_ADDR)
return H_HARDWARE;
@@ -319,7 +401,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
if (tce_list & ~IOMMU_PAGE_MASK)
return H_PARAMETER;
- tces = kvmppc_virtmode_gpa_to_hva(vcpu, tce_list);
+ tces = kvmppc_virtmode_gpa_to_hva(vcpu, NULL, tce_list);
if (tces == ERROR_ADDR)
return H_TOO_HARD;
@@ -354,7 +436,7 @@ long kvmppc_virtmode_h_put_tce_indirect(struct kvm_vcpu *vcpu,
/* Translate TCEs */
for (i = vcpu->arch.tce_tmp_num; i < npages; ++i) {
- void *hva = kvmppc_virtmode_gpa_to_hva(vcpu,
+ void *hva = kvmppc_virtmode_gpa_to_hva(vcpu, tt,
vcpu->arch.tce_tmp[i]);
if (hva == ERROR_ADDR)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index dc4ae32..6245365 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -178,6 +178,7 @@ static pte_t kvmppc_lookup_pte(pgd_t *pgdir, unsigned long hva, bool writing,
* Also returns pte and page size if the page is present in page table.
*/
static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
+ struct kvmppc_spapr_tce_table *tt,
unsigned long gpa, bool do_get_page)
{
struct kvm_memory_slot *memslot;
@@ -185,7 +186,31 @@ static unsigned long kvmppc_realmode_gpa_to_hpa(struct kvm_vcpu *vcpu,
unsigned long hva, hpa, pg_size = 0, offset;
unsigned long gfn = gpa >> PAGE_SHIFT;
bool writing = gpa & TCE_PCI_WRITE;
+ struct kvmppc_iommu_hugepage *hp;
+ /*
+ * Try to find an already used hugepage.
+ * If it is not there, the kvmppc_lookup_pte() will return zero
+ * as it won't do get_page() on a huge page in real mode
+ * and therefore the request will be passed to the virtual mode.
+ */
+ if (tt) {
+ spin_lock(&tt->hugepages_lock);
+ list_for_each_entry(hp, &tt->hugepages, list) {
+ if ((gpa < hp->gpa) || (gpa >= hp->gpa + hp->size))
+ continue;
+
+ /* Calculate host phys address keeping flags and offset in the page */
+ offset = gpa & (hp->size - 1);
+
+ /* pte_pfn(pte) should return an address aligned to pg_size */
+ hpa = (pte_pfn(hp->pte) << PAGE_SHIFT) + offset;
+ spin_unlock(&tt->hugepages_lock);
+
+ return hpa;
+ }
+ spin_unlock(&tt->hugepages_lock);
+ }
/* Find a KVM memslot */
memslot = search_memslots(kvm_memslots(vcpu->kvm), gfn);
if (!memslot)
@@ -237,6 +262,10 @@ static long kvmppc_clear_tce_real_mode(struct kvm_vcpu *vcpu,
if (oldtce & TCE_PCI_WRITE)
SetPageDirty(page);
+ /* Do not put a huge page and continue without error */
+ if (PageCompound(page))
+ continue;
+
if (realmode_put_page(page)) {
ret = H_TOO_HARD;
break;
@@ -282,7 +311,7 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (iommu_tce_put_param_check(tbl, ioba, tce))
return H_PARAMETER;
- hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tce, true);
+ hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt, tce, true);
if (hpa == ERROR_ADDR) {
vcpu->arch.tce_reason = H_TOO_HARD;
return H_TOO_HARD;
@@ -295,6 +324,11 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
if (unlikely(ret)) {
struct page *pg = realmode_pfn_to_page(hpa);
BUG_ON(!pg);
+
+ /* Do not put a huge page and return an error */
+ if (!PageCompound(pg))
+ return H_HARDWARE;
+
if (realmode_put_page(pg)) {
vcpu->arch.tce_reason = H_HARDWARE;
return H_TOO_HARD;
@@ -351,7 +385,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
vcpu->arch.tce_tmp_num = 0;
vcpu->arch.tce_reason = 0;
- tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu,
+ tces = (unsigned long *) kvmppc_realmode_gpa_to_hpa(vcpu, NULL,
tce_list, false);
if ((unsigned long)tces == ERROR_ADDR)
return H_TOO_HARD;
@@ -374,7 +408,7 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
/* Translate TCEs and go get_page */
for (i = 0; i < npages; ++i) {
- unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu,
+ unsigned long hpa = kvmppc_realmode_gpa_to_hpa(vcpu, tt,
vcpu->arch.tce_tmp[i], true);
if (hpa == ERROR_ADDR) {
vcpu->arch.tce_tmp_num = i;
--
1.7.10.4
^ permalink raw reply related
* [PATCH] powerpc: Fix TLB cleanup at boot on POWER8
From: Benjamin Herrenschmidt @ 2013-05-21 3:23 UTC (permalink / raw)
To: linuxppc-dev@lists.ozlabs.org
The TLB has 512 congruence classes (2048 entries 4 way set associative)
while P7 had 128
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
arch/powerpc/kernel/cpu_setup_power.S | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index a283b64..18b5b9c 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -135,8 +135,12 @@ __init_HFSCR:
blr
__init_TLB:
- /* Clear the TLB */
- li r6,128
+ /*
+ * Clear the TLB using the "IS 3" form of tlbiel instruction
+ * (invalidate by congruence class). P7 has 128 CCs, P8 has 512
+ * so we just always do 512
+ */
+ li r6,512
mtctr r6
li r7,0xc00 /* IS field = 0b11 */
ptesync
^ permalink raw reply related
* [PATCH] powerpc/pci: Fix bogus message at boot about empty memory resources
From: Benjamin Herrenschmidt @ 2013-05-21 3:24 UTC (permalink / raw)
To: linuxppc-dev@lists.ozlabs.org
The message is only meant to be displayed if resource 0 is empty,
but was displayed if any is.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
arch/powerpc/kernel/pci-common.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 6053f03..e9acf50 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1520,9 +1520,10 @@ static void pcibios_setup_phb_resources(struct pci_controller *hose,
for (i = 0; i < 3; ++i) {
res = &hose->mem_resources[i];
if (!res->flags) {
- printk(KERN_ERR "PCI: Memory resource 0 not set for "
- "host bridge %s (domain %d)\n",
- hose->dn->full_name, hose->global_number);
+ if (i == 0)
+ printk(KERN_ERR "PCI: Memory resource 0 not set for "
+ "host bridge %s (domain %d)\n",
+ hose->dn->full_name, hose->global_number);
continue;
}
offset = hose->mem_offset[i];
^ permalink raw reply related
* [PATCH] powerpc/powernv: Fix condition for when to invalidate the TCE cache
From: Benjamin Herrenschmidt @ 2013-05-21 3:25 UTC (permalink / raw)
To: linuxppc-dev
We use two flags, one to indicate an invalidation is needed after
creating a new entry and one to indicate an invalidation is needed
after removing an entry. However we were testing the wrong flag
in the remove case.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
arch/powerpc/platforms/powernv/pci.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 163bd74..098d357 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -367,7 +367,7 @@ static void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
while (npages--)
*(tcep++) = 0;
- if (tbl->it_type & TCE_PCI_SWINV_CREATE)
+ if (tbl->it_type & TCE_PCI_SWINV_FREE)
pnv_pci_ioda_tce_invalidate(tbl, tces, tcep - 1);
}
^ permalink raw reply related
* [PATCH 0/5 v2] VFIO PPC64: add VFIO support on POWERPC64
From: Alexey Kardashevskiy @ 2013-05-21 3:33 UTC (permalink / raw)
To: linuxppc-dev
Cc: kvm, Alexey Kardashevskiy, linux-kernel, Alex Williamson,
Paul Mackerras, David Gibson
The series adds support for VFIO on POWERPC in user space (such as QEMU).
The in-kernel real mode IOMMU support is added by another series posted
separately.
As the first and main aim of this series is the POWERNV platform support,
the "Enable on POWERNV platform" patch goes first and introduces an API
to be used by the VFIO IOMMU driver. The "Enable on pSeries platform" patch
simply registers PHBs in the IOMMU subsystem and expects the API to be present,
it enables VFIO support in fully emulated QEMU guests.
The main change is that this series was changed and tested against v3.10-rc1.
It also contains some bugfixes which are mentioned (if any) in the patch messages.
Alexey Kardashevskiy (3):
powerpc/vfio: Enable on POWERNV platform
powerpc/vfio: Implement IOMMU driver for VFIO
powerpc/vfio: Enable on pSeries platform
Documentation/vfio.txt | 63 +++++
arch/powerpc/include/asm/iommu.h | 26 ++
arch/powerpc/kernel/iommu.c | 323 +++++++++++++++++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 1 +
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 5 +-
arch/powerpc/platforms/powernv/pci.c | 2 +
arch/powerpc/platforms/pseries/iommu.c | 4 +
drivers/iommu/Kconfig | 8 +
drivers/vfio/Kconfig | 6 +
drivers/vfio/Makefile | 1 +
drivers/vfio/vfio.c | 1 +
drivers/vfio/vfio_iommu_spapr_tce.c | 377 +++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 34 +++
13 files changed, 850 insertions(+), 1 deletion(-)
create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
--
1.7.10.4
^ permalink raw reply
* [PATCH 1/3] powerpc/vfio: Enable on POWERNV platform
From: Alexey Kardashevskiy @ 2013-05-21 3:33 UTC (permalink / raw)
To: linuxppc-dev
Cc: kvm, Alexey Kardashevskiy, linux-kernel, Alex Williamson,
Paul Mackerras, David Gibson
In-Reply-To: <1369107191-28547-1-git-send-email-aik@ozlabs.ru>
This initializes IOMMU groups based on the IOMMU configuration
discovered during the PCI scan on POWERNV (POWER non virtualized)
platform. The IOMMU groups are to be used later by the VFIO driver,
which is used for PCI pass through.
It also implements an API for mapping/unmapping pages for
guest PCI drivers and providing DMA window properties.
This API is going to be used later by QEMU-VFIO to handle
h_put_tce hypercalls from the KVM guest.
The iommu_put_tce_user_mode() does only a single page mapping
as an API for adding many mappings at once is going to be
added later.
Although this driver has been tested only on the POWERNV
platform, it should work on any platform which supports
TCE tables. As h_put_tce hypercall is received by the host
kernel and processed by the QEMU (what involves calling
the host kernel again), performance is not the best -
circa 220MB/s on 10Gb ethernet network.
To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
option and configure VFIO as required.
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
Changes:
* the PCI devices are added to groups from subsys_initcall_sync (used to
be done via bus_notifier callback registered for PCI bus)
* fixed parameters checking (the very last page in the address space was
not handled correctly)
---
arch/powerpc/include/asm/iommu.h | 26 +++
arch/powerpc/kernel/iommu.c | 323 +++++++++++++++++++++++++++
arch/powerpc/platforms/powernv/pci-ioda.c | 1 +
arch/powerpc/platforms/powernv/pci-p5ioc2.c | 5 +-
arch/powerpc/platforms/powernv/pci.c | 2 +
drivers/iommu/Kconfig | 8 +
6 files changed, 364 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index cbfe678..98d1422 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -76,6 +76,9 @@ struct iommu_table {
struct iommu_pool large_pool;
struct iommu_pool pools[IOMMU_NR_POOLS];
unsigned long *it_map; /* A simple allocation bitmap for now */
+#ifdef CONFIG_IOMMU_API
+ struct iommu_group *it_group;
+#endif
};
struct scatterlist;
@@ -98,6 +101,8 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
*/
extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
int nid);
+extern void iommu_register_group(struct iommu_table *tbl,
+ int pci_domain_number, unsigned long pe_num);
extern int iommu_map_sg(struct device *dev, struct iommu_table *tbl,
struct scatterlist *sglist, int nelems,
@@ -147,5 +152,26 @@ static inline void iommu_restore(void)
}
#endif
+/* The API to support IOMMU operations for VFIO */
+extern int iommu_tce_clear_param_check(struct iommu_table *tbl,
+ unsigned long ioba, unsigned long tce_value,
+ unsigned long npages);
+extern int iommu_tce_put_param_check(struct iommu_table *tbl,
+ unsigned long ioba, unsigned long tce);
+extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
+ unsigned long hwaddr, enum dma_data_direction direction);
+extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
+ unsigned long entry);
+extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
+ unsigned long entry, unsigned long pages);
+extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
+ unsigned long entry, unsigned long tce);
+
+extern void iommu_flush_tce(struct iommu_table *tbl);
+extern int iommu_take_ownership(struct iommu_table *tbl);
+extern void iommu_release_ownership(struct iommu_table *tbl);
+
+extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
+
#endif /* __KERNEL__ */
#endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index c0d0dbd..b20ff17 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -36,6 +36,8 @@
#include <linux/hash.h>
#include <linux/fault-inject.h>
#include <linux/pci.h>
+#include <linux/iommu.h>
+#include <linux/sched.h>
#include <asm/io.h>
#include <asm/prom.h>
#include <asm/iommu.h>
@@ -44,6 +46,7 @@
#include <asm/kdump.h>
#include <asm/fadump.h>
#include <asm/vio.h>
+#include <asm/tce.h>
#define DBG(...)
@@ -724,6 +727,13 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
if (tbl->it_offset == 0)
clear_bit(0, tbl->it_map);
+#ifdef CONFIG_IOMMU_API
+ if (tbl->it_group) {
+ iommu_group_put(tbl->it_group);
+ BUG_ON(tbl->it_group);
+ }
+#endif
+
/* verify that table contains no entries */
if (!bitmap_empty(tbl->it_map, tbl->it_size))
pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
@@ -860,3 +870,316 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
free_pages((unsigned long)vaddr, get_order(size));
}
}
+
+#ifdef CONFIG_IOMMU_API
+/*
+ * SPAPR TCE API
+ */
+static void group_release(void *iommu_data)
+{
+ struct iommu_table *tbl = iommu_data;
+ tbl->it_group = NULL;
+}
+
+void iommu_register_group(struct iommu_table *tbl,
+ int pci_domain_number, unsigned long pe_num)
+{
+ struct iommu_group *grp;
+ char *name;
+
+ grp = iommu_group_alloc();
+ if (IS_ERR(grp)) {
+ pr_warn("powerpc iommu api: cannot create new group, err=%ld\n",
+ PTR_ERR(grp));
+ return;
+ }
+ tbl->it_group = grp;
+ iommu_group_set_iommudata(grp, tbl, group_release);
+ name = kasprintf(GFP_KERNEL, "domain%d-pe%lx",
+ pci_domain_number, pe_num);
+ if (!name)
+ return;
+ iommu_group_set_name(grp, name);
+ kfree(name);
+}
+
+enum dma_data_direction iommu_tce_direction(unsigned long tce)
+{
+ if ((tce & TCE_PCI_READ) && (tce & TCE_PCI_WRITE))
+ return DMA_BIDIRECTIONAL;
+ else if (tce & TCE_PCI_READ)
+ return DMA_TO_DEVICE;
+ else if (tce & TCE_PCI_WRITE)
+ return DMA_FROM_DEVICE;
+ else
+ return DMA_NONE;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_direction);
+
+void iommu_flush_tce(struct iommu_table *tbl)
+{
+ /* Flush/invalidate TLB caches if necessary */
+ if (ppc_md.tce_flush)
+ ppc_md.tce_flush(tbl);
+
+ /* Make sure updates are seen by hardware */
+ mb();
+}
+EXPORT_SYMBOL_GPL(iommu_flush_tce);
+
+int iommu_tce_clear_param_check(struct iommu_table *tbl,
+ unsigned long ioba, unsigned long tce_value,
+ unsigned long npages)
+{
+ /* ppc_md.tce_free() does not support any value but 0 */
+ if (tce_value)
+ return -EINVAL;
+
+ if (ioba & ~IOMMU_PAGE_MASK)
+ return -EINVAL;
+
+ ioba >>= IOMMU_PAGE_SHIFT;
+ if (ioba < tbl->it_offset)
+ return -EINVAL;
+
+ if ((ioba + npages) > (tbl->it_offset + tbl->it_size))
+ return -EINVAL;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check);
+
+int iommu_tce_put_param_check(struct iommu_table *tbl,
+ unsigned long ioba, unsigned long tce)
+{
+ if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ return -EINVAL;
+
+ if (tce & ~(IOMMU_PAGE_MASK | TCE_PCI_WRITE | TCE_PCI_READ))
+ return -EINVAL;
+
+ if (ioba & ~IOMMU_PAGE_MASK)
+ return -EINVAL;
+
+ ioba >>= IOMMU_PAGE_SHIFT;
+ if (ioba < tbl->it_offset)
+ return -EINVAL;
+
+ if ((ioba + 1) > (tbl->it_offset + tbl->it_size))
+ return -EINVAL;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
+
+unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
+{
+ unsigned long oldtce;
+ struct iommu_pool *pool = get_pool(tbl, entry);
+
+ spin_lock(&(pool->lock));
+
+ oldtce = ppc_md.tce_get(tbl, entry);
+ if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
+ ppc_md.tce_free(tbl, entry, 1);
+ else
+ oldtce = 0;
+
+ spin_unlock(&(pool->lock));
+
+ return oldtce;
+}
+EXPORT_SYMBOL_GPL(iommu_clear_tce);
+
+int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
+ unsigned long entry, unsigned long pages)
+{
+ unsigned long oldtce;
+ struct page *page;
+
+ for ( ; pages; --pages, ++entry) {
+ oldtce = iommu_clear_tce(tbl, entry);
+ if (!oldtce)
+ continue;
+
+ page = pfn_to_page(oldtce >> PAGE_SHIFT);
+ WARN_ON(!page);
+ if (page) {
+ if (oldtce & TCE_PCI_WRITE)
+ SetPageDirty(page);
+ put_page(page);
+ }
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
+
+/*
+ * hwaddr is a kernel virtual address here (0xc... bazillion),
+ * tce_build converts it to a physical address.
+ */
+int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
+ unsigned long hwaddr, enum dma_data_direction direction)
+{
+ int ret = -EBUSY;
+ unsigned long oldtce;
+ struct iommu_pool *pool = get_pool(tbl, entry);
+
+ spin_lock(&(pool->lock));
+
+ oldtce = ppc_md.tce_get(tbl, entry);
+ /* Add new entry if it is not busy */
+ if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
+ ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
+
+ spin_unlock(&(pool->lock));
+
+ /* if (unlikely(ret))
+ pr_err("iommu_tce: %s failed on hwaddr=%lx ioba=%lx kva=%lx ret=%d\n",
+ __func__, hwaddr, entry << IOMMU_PAGE_SHIFT,
+ hwaddr, ret); */
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_tce_build);
+
+int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
+ unsigned long tce)
+{
+ int ret;
+ struct page *page = NULL;
+ unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK & ~PAGE_MASK;
+ enum dma_data_direction direction = iommu_tce_direction(tce);
+
+ ret = get_user_pages_fast(tce & PAGE_MASK, 1,
+ direction != DMA_TO_DEVICE, &page);
+ if (unlikely(ret != 1)) {
+ /* pr_err("iommu_tce: get_user_pages_fast failed tce=%lx ioba=%lx ret=%d\n",
+ tce, entry << IOMMU_PAGE_SHIFT, ret); */
+ return -EFAULT;
+ }
+ hwaddr = (unsigned long) page_address(page) + offset;
+
+ ret = iommu_tce_build(tbl, entry, hwaddr, direction);
+ if (ret)
+ put_page(page);
+
+ if (ret < 0)
+ pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+ __func__, entry << IOMMU_PAGE_SHIFT, tce, ret);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
+
+int iommu_take_ownership(struct iommu_table *tbl)
+{
+ unsigned long sz = (tbl->it_size + 7) >> 3;
+
+ if (tbl->it_offset == 0)
+ clear_bit(0, tbl->it_map);
+
+ if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
+ pr_err("iommu_tce: it_map is not empty");
+ return -EBUSY;
+ }
+
+ memset(tbl->it_map, 0xff, sz);
+ iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(iommu_take_ownership);
+
+void iommu_release_ownership(struct iommu_table *tbl)
+{
+ unsigned long sz = (tbl->it_size + 7) >> 3;
+
+ iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+ memset(tbl->it_map, 0, sz);
+
+ /* Restore bit#0 set by iommu_init_table() */
+ if (tbl->it_offset == 0)
+ set_bit(0, tbl->it_map);
+}
+EXPORT_SYMBOL_GPL(iommu_release_ownership);
+
+static int iommu_add_device(struct device *dev)
+{
+ struct iommu_table *tbl;
+ int ret = 0;
+
+ if (WARN_ON(dev->iommu_group)) {
+ pr_warn("iommu_tce: device %s is already in iommu group %d, skipping\n",
+ dev_name(dev),
+ iommu_group_id(dev->iommu_group));
+ return -EBUSY;
+ }
+
+ tbl = get_iommu_table_base(dev);
+ if (!tbl || !tbl->it_group) {
+ pr_debug("iommu_tce: skipping device %s with no tbl\n",
+ dev_name(dev));
+ return 0;
+ }
+
+ pr_debug("iommu_tce: adding %s to iommu group %d\n",
+ dev_name(dev), iommu_group_id(tbl->it_group));
+
+ ret = iommu_group_add_device(tbl->it_group, dev);
+ if (ret < 0)
+ pr_err("iommu_tce: %s has not been added, ret=%d\n",
+ dev_name(dev), ret);
+
+ return ret;
+}
+
+static void iommu_del_device(struct device *dev)
+{
+ iommu_group_remove_device(dev);
+}
+
+static int iommu_bus_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct device *dev = data;
+
+ switch (action) {
+ case BUS_NOTIFY_ADD_DEVICE:
+ return iommu_add_device(dev);
+ case BUS_NOTIFY_DEL_DEVICE:
+ iommu_del_device(dev);
+ return 0;
+ default:
+ return 0;
+ }
+}
+
+static struct notifier_block tce_iommu_bus_nb = {
+ .notifier_call = iommu_bus_notifier,
+};
+
+static int __init tce_iommu_init(void)
+{
+ struct pci_dev *pdev = NULL;
+
+ BUILD_BUG_ON(PAGE_SIZE < IOMMU_PAGE_SIZE);
+
+ for_each_pci_dev(pdev)
+ iommu_add_device(&pdev->dev);
+
+ bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
+ return 0;
+}
+
+subsys_initcall_sync(tce_iommu_init);
+
+#else
+
+void iommu_register_group(struct iommu_table *tbl,
+ int pci_domain_number, unsigned long pe_num)
+{
+}
+
+#endif /* CONFIG_IOMMU_API */
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 1da578b..cae2555 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -605,6 +605,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
TCE_PCI_SWINV_PAIR;
}
iommu_init_table(tbl, phb->hose->node);
+ iommu_register_group(tbl, pci_domain_nr(pe->pbus), pe->pe_number);
return;
fail:
diff --git a/arch/powerpc/platforms/powernv/pci-p5ioc2.c b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
index 92b37a0..5d378f2 100644
--- a/arch/powerpc/platforms/powernv/pci-p5ioc2.c
+++ b/arch/powerpc/platforms/powernv/pci-p5ioc2.c
@@ -86,8 +86,11 @@ static void pnv_pci_init_p5ioc2_msis(struct pnv_phb *phb) { }
static void pnv_pci_p5ioc2_dma_dev_setup(struct pnv_phb *phb,
struct pci_dev *pdev)
{
- if (phb->p5ioc2.iommu_table.it_map == NULL)
+ if (phb->p5ioc2.iommu_table.it_map == NULL) {
iommu_init_table(&phb->p5ioc2.iommu_table, phb->hose->node);
+ iommu_register_group(&phb->p5ioc2.iommu_table,
+ pci_domain_nr(phb->hose->bus), phb->opal_id);
+ }
set_iommu_table_base(&pdev->dev, &phb->p5ioc2.iommu_table);
}
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 55dfca844..bd69285 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -20,6 +20,7 @@
#include <linux/irq.h>
#include <linux/io.h>
#include <linux/msi.h>
+#include <linux/iommu.h>
#include <asm/sections.h>
#include <asm/io.h>
@@ -408,6 +409,7 @@ static struct iommu_table *pnv_pci_setup_bml_iommu(struct pci_controller *hose)
pnv_pci_setup_iommu_table(tbl, __va(be64_to_cpup(basep)),
be32_to_cpup(sizep), 0);
iommu_init_table(tbl, hose->node);
+ iommu_register_group(tbl, pci_domain_nr(hose->bus), 0);
/* Deal with SW invalidated TCEs when needed (BML way) */
swinvp = of_get_property(hose->dn, "linux,tce-sw-invalidate-info",
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index c332fb9..3f3abde 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -261,4 +261,12 @@ config SHMOBILE_IOMMU_L1SIZE
default 256 if SHMOBILE_IOMMU_ADDRSIZE_64MB
default 128 if SHMOBILE_IOMMU_ADDRSIZE_32MB
+config SPAPR_TCE_IOMMU
+ bool "sPAPR TCE IOMMU Support"
+ depends on PPC_POWERNV
+ select IOMMU_API
+ help
+ Enables bits of IOMMU API required by VFIO. The iommu_ops
+ is not implemented as it is not necessary for VFIO.
+
endif # IOMMU_SUPPORT
--
1.7.10.4
^ permalink raw reply related
* [PATCH 2/3] powerpc/vfio: Implement IOMMU driver for VFIO
From: Alexey Kardashevskiy @ 2013-05-21 3:33 UTC (permalink / raw)
To: linuxppc-dev
Cc: kvm, Alexey Kardashevskiy, linux-kernel, Alex Williamson,
Paul Mackerras, David Gibson
In-Reply-To: <1369107191-28547-1-git-send-email-aik@ozlabs.ru>
VFIO implements platform independent stuff such as
a PCI driver, BAR access (via read/write on a file descriptor
or direct mapping when possible) and IRQ signaling.
The platform dependent part includes IOMMU initialization
and handling. This implements an IOMMU driver for VFIO
which does mapping/unmapping pages for the guest IO and
provides information about DMA window (required by a POWER
guest).
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
Documentation/vfio.txt | 63 ++++++
drivers/vfio/Kconfig | 6 +
drivers/vfio/Makefile | 1 +
drivers/vfio/vfio.c | 1 +
drivers/vfio/vfio_iommu_spapr_tce.c | 377 +++++++++++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 34 ++++
6 files changed, 482 insertions(+)
create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 8eda363..c55533c 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -283,6 +283,69 @@ a direct pass through for VFIO_DEVICE_* ioctls. The read/write/mmap
interfaces implement the device region access defined by the device's
own VFIO_DEVICE_GET_REGION_INFO ioctl.
+
+PPC64 sPAPR implementation note
+-------------------------------------------------------------------------------
+
+This implementation has some specifics:
+
+1) Only one IOMMU group per container is supported as an IOMMU group
+represents the minimal entity which isolation can be guaranteed for and
+groups are allocated statically, one per a Partitionable Endpoint (PE)
+(PE is often a PCI domain but not always).
+
+2) The hardware supports so called DMA windows - the PCI address range
+within which DMA transfer is allowed, any attempt to access address space
+out of the window leads to the whole PE isolation.
+
+3) PPC64 guests are paravirtualized but not fully emulated. There is an API
+to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
+currently there is no way to reduce the number of calls. In order to make things
+faster, the map/unmap handling has been implemented in real mode which provides
+an excellent performance which has limitations such as inability to do
+locked pages accounting in real time.
+
+So 3 additional ioctls have been added:
+
+ VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start
+ of the DMA window on the PCI bus.
+
+ VFIO_IOMMU_ENABLE - enables the container. The locked pages accounting
+ is done at this point. This lets user first to know what
+ the DMA window is and adjust rlimit before doing any real job.
+
+ VFIO_IOMMU_DISABLE - disables the container.
+
+
+The code flow from the example above should be slightly changed:
+
+ .....
+ /* Add the group to the container */
+ ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
+
+ /* Enable the IOMMU model we want */
+ ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU)
+
+ /* Get addition sPAPR IOMMU info */
+ vfio_iommu_spapr_tce_info spapr_iommu_info;
+ ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &spapr_iommu_info);
+
+ if (ioctl(container, VFIO_IOMMU_ENABLE))
+ /* Cannot enable container, may be low rlimit */
+
+ /* Allocate some space and setup a DMA mapping */
+ dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+
+ dma_map.size = 1024 * 1024;
+ dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
+ dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+ /* Check here is .iova/.size are within DMA window from spapr_iommu_info */
+
+ ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
+ .....
+
-------------------------------------------------------------------------------
[1] VFIO was originally an acronym for "Virtual Function I/O" in its
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 7cd5dec..b464687 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
depends on VFIO
default n
+config VFIO_IOMMU_SPAPR_TCE
+ tristate
+ depends on VFIO && SPAPR_TCE_IOMMU
+ default n
+
menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
select VFIO_IOMMU_TYPE1 if X86
+ select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
help
VFIO provides a framework for secure userspace device drivers.
See Documentation/vfio.txt for more details.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 2398d4a..72bfabc 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,3 +1,4 @@
obj-$(CONFIG_VFIO) += vfio.o
obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
+obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
obj-$(CONFIG_VFIO_PCI) += pci/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 295c48f..a819604 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1428,6 +1428,7 @@ static int __init vfio_init(void)
* drivers.
*/
request_module_nowait("vfio_iommu_type1");
+ request_module_nowait("vfio_iommu_spapr_tce");
return 0;
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
new file mode 100644
index 0000000..bdae7a0
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -0,0 +1,377 @@
+/*
+ * VFIO: IOMMU DMA mapping support for TCE on POWER
+ *
+ * Copyright (C) 2013 IBM Corp. All rights reserved.
+ * Author: Alexey Kardashevskiy <aik@ozlabs.ru>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio_iommu_type1.c:
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <alex.williamson@redhat.com>
+ */
+
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/err.h>
+#include <linux/vfio.h>
+#include <asm/iommu.h>
+#include <asm/tce.h>
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR "aik@ozlabs.ru"
+#define DRIVER_DESC "VFIO IOMMU SPAPR TCE"
+
+static void tce_iommu_detach_group(void *iommu_data,
+ struct iommu_group *iommu_group);
+
+/*
+ * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
+ *
+ * This code handles mapping and unmapping of user data buffers
+ * into DMA'ble space using the IOMMU
+ */
+
+/*
+ * The container descriptor supports only a single group per container.
+ * Required by the API as the container is not supplied with the IOMMU group
+ * at the moment of initialization.
+ */
+struct tce_container {
+ struct mutex lock;
+ struct iommu_table *tbl;
+ bool enabled;
+};
+
+static int tce_iommu_enable(struct tce_container *container)
+{
+ int ret = 0;
+ unsigned long locked, lock_limit, npages;
+ struct iommu_table *tbl = container->tbl;
+
+ if (!container->tbl)
+ return -ENXIO;
+
+ if (!current->mm)
+ return -ESRCH; /* process exited */
+
+ if (container->enabled)
+ return -EBUSY;
+
+ /*
+ * When userspace pages are mapped into the IOMMU, they are effectively
+ * locked memory, so, theoretically, we need to update the accounting
+ * of locked pages on each map and unmap. For powerpc, the map unmap
+ * paths can be very hot, though, and the accounting would kill
+ * performance, especially since it would be difficult to impossible
+ * to handle the accounting in real mode only.
+ *
+ * To address that, rather than precisely accounting every page, we
+ * instead account for a worst case on locked memory when the iommu is
+ * enabled and disabled. The worst case upper bound on locked memory
+ * is the size of the whole iommu window, which is usually relatively
+ * small (compared to total memory sizes) on POWER hardware.
+ *
+ * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
+ * that would effectively kill the guest at random points, much better
+ * enforcing the limit based on the max that the guest can map.
+ */
+ down_write(¤t->mm->mmap_sem);
+ npages = (tbl->it_size << IOMMU_PAGE_SHIFT) >> PAGE_SHIFT;
+ locked = current->mm->locked_vm + npages;
+ lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+ pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
+ rlimit(RLIMIT_MEMLOCK));
+ ret = -ENOMEM;
+ } else {
+
+ current->mm->locked_vm += npages;
+ container->enabled = true;
+ }
+ up_write(¤t->mm->mmap_sem);
+
+ return ret;
+}
+
+static void tce_iommu_disable(struct tce_container *container)
+{
+ if (!container->enabled)
+ return;
+
+ container->enabled = false;
+
+ if (!container->tbl || !current->mm)
+ return;
+
+ down_write(¤t->mm->mmap_sem);
+ current->mm->locked_vm -= (container->tbl->it_size <<
+ IOMMU_PAGE_SHIFT) >> PAGE_SHIFT;
+ up_write(¤t->mm->mmap_sem);
+}
+
+static void *tce_iommu_open(unsigned long arg)
+{
+ struct tce_container *container;
+
+ if (arg != VFIO_SPAPR_TCE_IOMMU) {
+ pr_err("tce_vfio: Wrong IOMMU type\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ container = kzalloc(sizeof(*container), GFP_KERNEL);
+ if (!container)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_init(&container->lock);
+
+ return container;
+}
+
+static void tce_iommu_release(void *iommu_data)
+{
+ struct tce_container *container = iommu_data;
+
+ WARN_ON(container->tbl && !container->tbl->it_group);
+ tce_iommu_disable(container);
+
+ if (container->tbl && container->tbl->it_group)
+ tce_iommu_detach_group(iommu_data, container->tbl->it_group);
+
+ mutex_destroy(&container->lock);
+
+ kfree(container);
+}
+
+static long tce_iommu_ioctl(void *iommu_data,
+ unsigned int cmd, unsigned long arg)
+{
+ struct tce_container *container = iommu_data;
+ unsigned long minsz;
+ long ret;
+
+ switch (cmd) {
+ case VFIO_CHECK_EXTENSION:
+ return (arg == VFIO_SPAPR_TCE_IOMMU) ? 1 : 0;
+
+ case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
+ struct vfio_iommu_spapr_tce_info info;
+ struct iommu_table *tbl = container->tbl;
+
+ if (WARN_ON(!tbl))
+ return -ENXIO;
+
+ minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
+ dma32_window_size);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz)
+ return -EINVAL;
+
+ info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT;
+ info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT;
+ info.flags = 0;
+
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ return -EFAULT;
+
+ return 0;
+ }
+ case VFIO_IOMMU_MAP_DMA: {
+ struct vfio_iommu_type1_dma_map param;
+ struct iommu_table *tbl = container->tbl;
+ unsigned long tce, i;
+
+ if (!tbl)
+ return -ENXIO;
+
+ BUG_ON(!tbl->it_group);
+
+ minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+
+ if (copy_from_user(¶m, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (param.argsz < minsz)
+ return -EINVAL;
+
+ if (param.flags & ~(VFIO_DMA_MAP_FLAG_READ |
+ VFIO_DMA_MAP_FLAG_WRITE))
+ return -EINVAL;
+
+ if ((param.size & ~IOMMU_PAGE_MASK) ||
+ (param.vaddr & ~IOMMU_PAGE_MASK))
+ return -EINVAL;
+
+ /* iova is checked by the IOMMU API */
+ tce = param.vaddr;
+ if (param.flags & VFIO_DMA_MAP_FLAG_READ)
+ tce |= TCE_PCI_READ;
+ if (param.flags & VFIO_DMA_MAP_FLAG_WRITE)
+ tce |= TCE_PCI_WRITE;
+
+ ret = iommu_tce_put_param_check(tbl, param.iova, tce);
+ if (ret)
+ return ret;
+
+ for (i = 0; i < (param.size >> IOMMU_PAGE_SHIFT); ++i) {
+ ret = iommu_put_tce_user_mode(tbl,
+ (param.iova >> IOMMU_PAGE_SHIFT) + i,
+ tce);
+ if (ret)
+ break;
+ tce += IOMMU_PAGE_SIZE;
+ }
+ if (ret)
+ iommu_clear_tces_and_put_pages(tbl,
+ param.iova >> IOMMU_PAGE_SHIFT, i);
+
+ iommu_flush_tce(tbl);
+
+ return ret;
+ }
+ case VFIO_IOMMU_UNMAP_DMA: {
+ struct vfio_iommu_type1_dma_unmap param;
+ struct iommu_table *tbl = container->tbl;
+
+ if (WARN_ON(!tbl))
+ return -ENXIO;
+
+ minsz = offsetofend(struct vfio_iommu_type1_dma_unmap,
+ size);
+
+ if (copy_from_user(¶m, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (param.argsz < minsz)
+ return -EINVAL;
+
+ /* No flag is supported now */
+ if (param.flags)
+ return -EINVAL;
+
+ if (param.size & ~IOMMU_PAGE_MASK)
+ return -EINVAL;
+
+ ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
+ param.size >> IOMMU_PAGE_SHIFT);
+ if (ret)
+ return ret;
+
+ ret = iommu_clear_tces_and_put_pages(tbl,
+ param.iova >> IOMMU_PAGE_SHIFT,
+ param.size >> IOMMU_PAGE_SHIFT);
+ iommu_flush_tce(tbl);
+
+ return ret;
+ }
+ case VFIO_IOMMU_ENABLE:
+ mutex_lock(&container->lock);
+ ret = tce_iommu_enable(container);
+ mutex_unlock(&container->lock);
+ return ret;
+
+
+ case VFIO_IOMMU_DISABLE:
+ mutex_lock(&container->lock);
+ tce_iommu_disable(container);
+ mutex_unlock(&container->lock);
+ return 0;
+ }
+
+ return -ENOTTY;
+}
+
+static int tce_iommu_attach_group(void *iommu_data,
+ struct iommu_group *iommu_group)
+{
+ int ret;
+ struct tce_container *container = iommu_data;
+ struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+
+ BUG_ON(!tbl);
+ mutex_lock(&container->lock);
+
+ /* pr_debug("tce_vfio: Attaching group #%u to iommu %p\n",
+ iommu_group_id(iommu_group), iommu_group); */
+ if (container->tbl) {
+ pr_warn("tce_vfio: Only one group per IOMMU container is allowed, existing id=%d, attaching id=%d\n",
+ iommu_group_id(container->tbl->it_group),
+ iommu_group_id(iommu_group));
+ ret = -EBUSY;
+ } else if (container->enabled) {
+ pr_err("tce_vfio: attaching group #%u to enabled container\n",
+ iommu_group_id(iommu_group));
+ ret = -EBUSY;
+ } else {
+ ret = iommu_take_ownership(tbl);
+ if (!ret)
+ container->tbl = tbl;
+ }
+
+ mutex_unlock(&container->lock);
+
+ return ret;
+}
+
+static void tce_iommu_detach_group(void *iommu_data,
+ struct iommu_group *iommu_group)
+{
+ struct tce_container *container = iommu_data;
+ struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
+
+ BUG_ON(!tbl);
+ mutex_lock(&container->lock);
+ if (tbl != container->tbl) {
+ pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
+ iommu_group_id(iommu_group),
+ iommu_group_id(tbl->it_group));
+ } else {
+ if (container->enabled) {
+ pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
+ iommu_group_id(tbl->it_group));
+ tce_iommu_disable(container);
+ }
+
+ /* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
+ iommu_group_id(iommu_group), iommu_group); */
+ container->tbl = NULL;
+ iommu_release_ownership(tbl);
+ }
+ mutex_unlock(&container->lock);
+}
+
+const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
+ .name = "iommu-vfio-powerpc",
+ .owner = THIS_MODULE,
+ .open = tce_iommu_open,
+ .release = tce_iommu_release,
+ .ioctl = tce_iommu_ioctl,
+ .attach_group = tce_iommu_attach_group,
+ .detach_group = tce_iommu_detach_group,
+};
+
+static int __init tce_iommu_init(void)
+{
+ return vfio_register_iommu_driver(&tce_iommu_driver_ops);
+}
+
+static void __exit tce_iommu_cleanup(void)
+{
+ vfio_unregister_iommu_driver(&tce_iommu_driver_ops);
+}
+
+module_init(tce_iommu_init);
+module_exit(tce_iommu_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 284ff24..87ee4f4 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -22,6 +22,7 @@
/* Extensions */
#define VFIO_TYPE1_IOMMU 1
+#define VFIO_SPAPR_TCE_IOMMU 2
/*
* The IOCTL interface is designed for extensibility by embedding the
@@ -375,4 +376,37 @@ struct vfio_iommu_type1_dma_unmap {
#define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
+/*
+ * IOCTLs to enable/disable IOMMU container usage.
+ * No parameters are supported.
+ */
+#define VFIO_IOMMU_ENABLE _IO(VFIO_TYPE, VFIO_BASE + 15)
+#define VFIO_IOMMU_DISABLE _IO(VFIO_TYPE, VFIO_BASE + 16)
+
+/* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
+
+/*
+ * The SPAPR TCE info struct provides the information about the PCI bus
+ * address ranges available for DMA, these values are programmed into
+ * the hardware so the guest has to know that information.
+ *
+ * The DMA 32 bit window start is an absolute PCI bus address.
+ * The IOVA address passed via map/unmap ioctls are absolute PCI bus
+ * addresses too so the window works as a filter rather than an offset
+ * for IOVA addresses.
+ *
+ * A flag will need to be added if other page sizes are supported,
+ * so as defined here, it is always 4k.
+ */
+struct vfio_iommu_spapr_tce_info {
+ __u32 argsz;
+ __u32 flags; /* reserved for future use */
+ __u32 dma32_window_start; /* 32 bit window start (bytes) */
+ __u32 dma32_window_size; /* 32 bit window size (bytes) */
+};
+
+#define VFIO_IOMMU_SPAPR_TCE_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/* ***************************************************************** */
+
#endif /* _UAPIVFIO_H */
--
1.7.10.4
^ permalink raw reply related
* [PATCH 3/3] powerpc/vfio: Enable on pSeries platform
From: Alexey Kardashevskiy @ 2013-05-21 3:33 UTC (permalink / raw)
To: linuxppc-dev
Cc: kvm, Alexey Kardashevskiy, linux-kernel, Alex Williamson,
Paul Mackerras, David Gibson
In-Reply-To: <1369107191-28547-1-git-send-email-aik@ozlabs.ru>
The enables VFIO on the pSeries platform, enabling user space
programs to access PCI devices directly.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/powerpc/platforms/pseries/iommu.c | 4 ++++
drivers/iommu/Kconfig | 2 +-
drivers/vfio/Kconfig | 2 +-
3 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 86ae364..23fc1dc 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -614,6 +614,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
iommu_table_setparms(pci->phb, dn, tbl);
pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
+ iommu_register_group(tbl, pci_domain_nr(bus), 0);
/* Divide the rest (1.75GB) among the children */
pci->phb->dma_window_size = 0x80000000ul;
@@ -658,6 +659,7 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus)
ppci->phb->node);
iommu_table_setparms_lpar(ppci->phb, pdn, tbl, dma_window);
ppci->iommu_table = iommu_init_table(tbl, ppci->phb->node);
+ iommu_register_group(tbl, pci_domain_nr(bus), 0);
pr_debug(" created table: %p\n", ppci->iommu_table);
}
}
@@ -684,6 +686,7 @@ static void pci_dma_dev_setup_pSeries(struct pci_dev *dev)
phb->node);
iommu_table_setparms(phb, dn, tbl);
PCI_DN(dn)->iommu_table = iommu_init_table(tbl, phb->node);
+ iommu_register_group(tbl, pci_domain_nr(phb->bus), 0);
set_iommu_table_base(&dev->dev, PCI_DN(dn)->iommu_table);
return;
}
@@ -1184,6 +1187,7 @@ static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
pci->phb->node);
iommu_table_setparms_lpar(pci->phb, pdn, tbl, dma_window);
pci->iommu_table = iommu_init_table(tbl, pci->phb->node);
+ iommu_register_group(tbl, pci_domain_nr(pci->phb->bus), 0);
pr_debug(" created table: %p\n", pci->iommu_table);
} else {
pr_debug(" found DMA window, table: %p\n", pci->iommu_table);
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 3f3abde..01730b2 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -263,7 +263,7 @@ config SHMOBILE_IOMMU_L1SIZE
config SPAPR_TCE_IOMMU
bool "sPAPR TCE IOMMU Support"
- depends on PPC_POWERNV
+ depends on PPC_POWERNV || PPC_PSERIES
select IOMMU_API
help
Enables bits of IOMMU API required by VFIO. The iommu_ops
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index b464687..26b3d9d 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -12,7 +12,7 @@ menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
select VFIO_IOMMU_TYPE1 if X86
- select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
+ select VFIO_IOMMU_SPAPR_TCE if (PPC_POWERNV || PPC_PSERIES)
help
VFIO provides a framework for secure userspace device drivers.
See Documentation/vfio.txt for more details.
--
1.7.10.4
^ permalink raw reply related
* Re: [PATCH 0/5 v2] VFIO PPC64: add VFIO support on POWERPC64
From: Alexey Kardashevskiy @ 2013-05-21 3:34 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: kvm, linux-kernel, Alex Williamson, Paul Mackerras, linuxppc-dev,
David Gibson
In-Reply-To: <1369107191-28547-1-git-send-email-aik@ozlabs.ru>
Oops, wrong subject (cut-n-paste) :)
There are 3 patches, not 5.
On 05/21/2013 01:33 PM, Alexey Kardashevskiy wrote:
> The series adds support for VFIO on POWERPC in user space (such as QEMU).
> The in-kernel real mode IOMMU support is added by another series posted
> separately.
>
> As the first and main aim of this series is the POWERNV platform support,
> the "Enable on POWERNV platform" patch goes first and introduces an API
> to be used by the VFIO IOMMU driver. The "Enable on pSeries platform" patch
> simply registers PHBs in the IOMMU subsystem and expects the API to be present,
> it enables VFIO support in fully emulated QEMU guests.
>
> The main change is that this series was changed and tested against v3.10-rc1.
> It also contains some bugfixes which are mentioned (if any) in the patch messages.
>
> Alexey Kardashevskiy (3):
> powerpc/vfio: Enable on POWERNV platform
> powerpc/vfio: Implement IOMMU driver for VFIO
> powerpc/vfio: Enable on pSeries platform
--
Alexey
^ permalink raw reply
* [PATCH] powerpc/powernv: Add an option to wipe PCI bridges on shutdown
From: Benjamin Herrenschmidt @ 2013-05-21 4:58 UTC (permalink / raw)
To: linuxppc-dev
Older kernels such as 3.6 used by Fedora 18 have a problem with
the way more recent kernels configure the PCI bridge windows due
to my crappy old resource allocation code (we now use the generic
code which is way better).
In order to be able to safely kexec into those earlier kernels
(which we may need to do in some circumstances to launch distro
installers), we need to cleanup the bridges to avoid tripping
that bug.
This adds an option to do that, which is expected to be enabled
on kernels used as kexec-based bootloaders.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
arch/powerpc/platforms/powernv/Kconfig | 5 ++++
arch/powerpc/platforms/powernv/pci-ioda.c | 42 +++++++++++++++++++++++++++++
2 files changed, 47 insertions(+)
diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index d3e840d..91bec0e 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -19,3 +19,8 @@ config PPC_POWERNV_RTAS
default y
select PPC_ICS_RTAS
select PPC_RTAS
+
+config PPC_POWERNV_PCI_CLEANUP
+ depends on KEXEC && PCI && PPC_POWERNV
+ bool "Wipe PCI bridges on shutdown for safer kexec to earlier kernels"
+ default y
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 3937aaa..194e921 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1050,6 +1050,48 @@ static u32 pnv_ioda_bdfn_to_pe(struct pnv_phb *phb, struct pci_bus *bus,
static void pnv_pci_ioda_shutdown(struct pnv_phb *phb)
{
+#ifdef CONFIG_PPC_POWERNV_PCI_CLEANUP
+ struct pci_dev *pdev = NULL;
+
+ for_each_pci_dev(pdev) {
+ u16 cmd;
+
+ /*
+ * We clear the base stuff, the main thing is bridge
+ * windows which is what hurts 3.6 old IODA PCI allocation
+ * code.
+ */
+ pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+ cmd &= ~(PCI_COMMAND_MASTER | PCI_COMMAND_MEMORY |
+ PCI_COMMAND_IO);
+ pci_write_config_word(pdev, PCI_COMMAND, cmd);
+ if (pdev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+ pr_info("Cleaning up bridge %s\n", pci_name(pdev));
+ pci_write_config_word(pdev, PCI_IO_BASE_UPPER16, 0xffff);
+ pci_write_config_byte(pdev, PCI_IO_BASE, 0xf0);
+ pci_write_config_word(pdev, PCI_IO_LIMIT_UPPER16, 0x0);
+ pci_write_config_byte(pdev, PCI_IO_LIMIT, 0x0);
+
+ pci_write_config_word(pdev, PCI_MEMORY_BASE, 0xfff0);
+ pci_write_config_word(pdev, PCI_MEMORY_LIMIT, 0);
+
+ pci_write_config_dword(pdev, PCI_PREF_BASE_UPPER32, 0xffffffff);
+ pci_write_config_word(pdev, PCI_PREF_MEMORY_BASE, 0xfff0);
+ pci_write_config_dword(pdev, PCI_PREF_LIMIT_UPPER32, 0);
+ pci_write_config_word(pdev, PCI_PREF_MEMORY_LIMIT, 0);
+
+ /* Re-enable bridge windows (now that they are closed
+ * this is safe) as some versions of Linux will fail to
+ * do it
+ */
+ pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+ cmd |= PCI_COMMAND_MASTER | PCI_COMMAND_MEMORY |
+ PCI_COMMAND_IO;
+ pci_write_config_word(pdev, PCI_COMMAND, cmd);
+ }
+ }
+#endif /* CONFIG_PPC_POWERNV_PCI_CLEANUP */
+
opal_pci_reset(phb->opal_id, OPAL_PCI_IODA_TABLE_RESET,
OPAL_ASSERT_RESET);
}
^ permalink raw reply related
* Re: [PATCH 3/3] perf, x86, lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL
From: Michael Neuling @ 2013-05-21 5:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: ak@linux.intel.com, LKML, anshuman Stephane Eranian,
Linux PPC dev, Ingo Molnar
In-Reply-To: <20130517111232.GE5162@dyad.programming.kicks-ass.net>
Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, May 16, 2013 at 05:36:11PM +0200, Stephane Eranian wrote:
> > On Thu, May 16, 2013 at 1:16 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Thu, May 16, 2013 at 08:15:17PM +1000, Michael Neuling wrote:
> > >> Peter,
> > >>
> > >> BTW PowerPC also has the ability to filter on conditional branches. Any
> > >> chance we could add something like the follow to perf also?
> > >>
> > >
> > > I don't see an immediate problem with that except that we on x86 need to
> > > implement that in the software filter. Stephane do you see any
> > > fundamental issue with that?
> > >
> > On X86, the LBR cannot filter on conditional in HW. Thus as Peter said, it would
> > have to be done in SW. I did not add that because I think those branches are
> > not necessarily useful for tools.
>
> Wouldn't it be mostly conditional branches that are the primary control flow
> and can get predicted wrong? I mean, I'm sure someone will miss-predict an
> unconditional branch but its not like we care about people with such
> afflictions do we?
>
> Anyway, since PPC people thought it worth baking into hardware, presumably they
> have a compelling use case. Mikey could you see if you can retrieve that from
> someone in the know? It might be interesting.
>
> Also, it looks like its trivial to add to x86, you seem to have already done
> all the hard work by having X86_BR_JCC.
>
> The only missing piece would be:
Peter,
Can we add your signed-off-by on this?
We are cleaning up our series for conditional branches and would like to
add this as part of the post.
Mikey
>
> --- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
> @@ -337,6 +337,10 @@ static int intel_pmu_setup_sw_lbr_filter
>
> if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
> mask |= X86_BR_IND_CALL;
> +
> + if (br_type & PERF_SAMPLE_BRANCH_CONDITIONAL)
> + mask |= X86_BR_JCC;
> +
> /*
> * stash actual user request into reg, it may
> * be used by fixup code for some CPU
>
^ permalink raw reply
* [PATCH] PowerPC: kernel: need return the related error code when failure occurs.
From: Chen Gang @ 2013-05-21 5:48 UTC (permalink / raw)
To: Arnd Bergmann, Benjamin Herrenschmidt, paulus@samba.org,
zhangyanfei, Jiri Kosina, Michael Ellerman
Cc: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org
When error occurs, need return the related error code to let upper
caller know about it.
ppc_md.nvram_size() can return the error code (e.g. core99_nvram_size()
in 'arch/powerpc/platforms/powermac/nvram.c').
And when '*ppos >= size', need return -ESPIPE (Illegal seek)
The original related patch: "f9ce299 [PATCH] powerpc: fix large nvram
access"
Signed-off-by: Chen Gang <gang.chen@asianux.com>
---
arch/powerpc/kernel/nvram_64.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/kernel/nvram_64.c b/arch/powerpc/kernel/nvram_64.c
index 48fbc2b..db2a636 100644
--- a/arch/powerpc/kernel/nvram_64.c
+++ b/arch/powerpc/kernel/nvram_64.c
@@ -88,10 +88,15 @@ static ssize_t dev_nvram_read(struct file *file, char __user *buf,
if (!ppc_md.nvram_size)
goto out;
- ret = 0;
size = ppc_md.nvram_size();
- if (*ppos >= size || size < 0)
+ if (size < 0) {
+ ret = size;
goto out;
+ }
+ if (*ppos >= size) {
+ ret = -ESPIPE;
+ goto out;
+ }
count = min_t(size_t, count, size - *ppos);
count = min(count, PAGE_SIZE);
--
1.7.7.6
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox