* Re: [PATCH 3/3] KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code
From: Paul Mackerras @ 2011-08-03 3:31 UTC (permalink / raw)
To: Alexander Graf; +Cc: linuxppc-dev, kvm-ppc
In-Reply-To: <4E380DEC.8030803@suse.de>
On Tue, Aug 02, 2011 at 04:47:08PM +0200, Alexander Graf wrote:
> > int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq)
> > {
> >- if (irq->irq == KVM_INTERRUPT_UNSET)
> >+ if (irq->irq == KVM_INTERRUPT_UNSET) {
> > kvmppc_core_dequeue_external(vcpu, irq);
> >- else
> >- kvmppc_core_queue_external(vcpu, irq);
> >+ return 0;
> >+ }
>
> Not sure I understand this part. Mind to explain?
It's a micro-optimization - we don't really need to wake up or
interrupt the vcpu thread when we're clearing the interrupt.
Unless of course I'm missing something... :)
>
> Alex
>
> >+
> >+ kvmppc_core_queue_external(vcpu, irq);
> >
> >- if (waitqueue_active(&vcpu->wq)) {
> >- wake_up_interruptible(&vcpu->wq);
> >+ if (waitqueue_active(vcpu->arch.wqp)) {
> >+ wake_up_interruptible(vcpu->arch.wqp);
> > vcpu->stat.halt_wakeup++;
> > } else if (vcpu->cpu != -1) {
> > smp_send_reschedule(vcpu->cpu);
Paul.
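A minimal sketch of the control flow in the hunk above, assuming a toy
vcpu model: struct toy_vcpu, toy_ioctl_interrupt() and the IRQ_* values
are hypothetical stand-ins, not the real KVM definitions. It only
illustrates that the UNSET path dequeues and returns without waking or
kicking the vcpu thread, while the SET path wakes a sleeping vcpu or
kicks a running one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical values; not the real KVM_INTERRUPT_* constants. */
enum { IRQ_SET = 1, IRQ_UNSET = 2 };

struct toy_vcpu {
    bool ext_pending;   /* models the queued external interrupt */
    bool sleeping;      /* models waitqueue_active(vcpu->arch.wqp) */
    int  cpu;           /* physical cpu the vcpu runs on, -1 if none */
    int  wakeups;       /* models vcpu->stat.halt_wakeup */
    int  kicks;         /* counts modelled smp_send_reschedule() calls */
};

static int toy_ioctl_interrupt(struct toy_vcpu *v, int irq)
{
    assert(v != NULL);

    if (irq == IRQ_UNSET) {
        /* Clearing: nothing new to deliver, so no wakeup and no kick. */
        v->ext_pending = false;
        return 0;
    }

    v->ext_pending = true;
    if (v->sleeping) {
        /* vcpu is blocked (e.g. after H_CEDE): wake it up. */
        v->sleeping = false;
        v->wakeups++;
    } else if (v->cpu != -1) {
        /* vcpu is running somewhere: force it to notice the interrupt. */
        v->kicks++;
    }
    return 0;
}
```

Calling toy_ioctl_interrupt() with IRQ_UNSET on a sleeping vcpu leaves
it asleep; only IRQ_SET changes the wakeup and kick counters.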
^ permalink raw reply
* Re: kvm PCI assignment & VFIO ramblings
From: David Gibson @ 2011-08-03 2:04 UTC (permalink / raw)
To: Alex Williamson
Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras,
linux-pci@vger.kernel.org, qemu-devel, aafabbri, iommu,
Anthony Liguori, linuxppc-dev, benve
In-Reply-To: <1312310121.2653.470.camel@bling.home>
On Tue, Aug 02, 2011 at 12:35:19PM -0600, Alex Williamson wrote:
> On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
> > On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> > > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > > [snip]
> > > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > > > bridge, so don't suffer the source identifier problem, but they do often
> > > > share an interrupt. But even then, we can count on most modern devices
> > > > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > > > share interrupts. In any case, yes, it's more rare but we need to know
> > > > how to handle devices behind PCI bridges. However I disagree that we
> > > > need to assign all the devices behind such a bridge to the guest.
> > > > There's a difference between removing the device from the host and
> > > > exposing the device to the guest.
> > >
> > > I think you're arguing only over details of what words to use for
> > > what, rather than anything of substance here. The point is that an
> > > entire partitionable group must be assigned to "host" (in which case
> > > kernel drivers may bind to it) or to a particular guest partition (or
> > > at least to a single UID on the host). Which of the assigned devices
> > > the partition actually uses is another matter of course, as is at
> > > exactly which level they become "de-exposed" if you don't want to use
> > > > all of them.
> >
> > Well first we need to define what a partitionable group is, whether it's
> > based on hardware requirements or user policy. And while I agree that
> > we need unique ownership of a partition, I disagree that qemu is
> > necessarily the owner of the entire partition vs individual devices.
>
> Sorry, I didn't intend to have such circular logic. "... I disagree
> that qemu is necessarily the owner of the entire partition vs granted
> access to devices within the partition". Thanks,
I still don't understand the distinction you're making. We're saying
the group is "owned" by a given user or guest in the sense that no-one
else may use anything in the group (including host drivers). At that
point none, some or all of the devices in the group may actually be
used by the guest.
You seem to be making a distinction between "owned by" and "assigned
to" and "used by" and I really don't see what it is.
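A minimal sketch of the ownership model described above, assuming a toy
representation: struct toy_group, OWNER_HOST and the toy_* helpers are
hypothetical names used only for illustration. The group has exactly
one owner, nobody else may use any device in it, and the owner may
actually use none, some or all of the devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define OWNER_HOST 0          /* hypothetical id for host kernel drivers */
#define MAX_DEVS   4

struct toy_group {
    int  owner;               /* exactly one owner for the whole group */
    bool in_use[MAX_DEVS];    /* which devices the owner actually uses */
};

/* Only the group's owner may touch any device in the group. */
static bool toy_may_use(const struct toy_group *g, int who)
{
    assert(g != NULL);
    return g->owner == who;
}

/* Returns true if the device was claimed; non-owners are refused. */
static bool toy_use_device(struct toy_group *g, int who, size_t dev)
{
    assert(g != NULL && dev < MAX_DEVS);
    if (!toy_may_use(g, who))
        return false;
    g->in_use[dev] = true;
    return true;
}

/* Number of devices in the group that are actually in use. */
static size_t toy_used_count(const struct toy_group *g)
{
    size_t n = 0;

    assert(g != NULL);
    for (size_t i = 0; i < MAX_DEVS; i++)
        n += g->in_use[i] ? 1 : 0;
    return n;
}
```

In this model a group owned by a guest can have a used count of zero,
yet a host driver is still refused access to every device in it.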
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply
* Re: kvm PCI assignment & VFIO ramblings
From: Alex Williamson @ 2011-08-03 1:02 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk
Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, David Gibson,
Avi Kivity, Anthony Liguori, linux-pci@vger.kernel.org,
linuxppc-dev
In-Reply-To: <20110802212949.GB18496@dumpdata.com>
On Tue, 2011-08-02 at 17:29 -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Aug 02, 2011 at 09:34:58AM -0600, Alex Williamson wrote:
> > On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
> > >
> > > Don't worry, it took me a while to get my head around the HW :-) SR-IOV
> > > VFs will generally not have limitations like that no, but on the other
> > > hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
> > > take a bunch of VFs and put them in the same 'domain'.
> > >
> > > I think the main deal is that VFIO/qemu sees "domains" as "guests" and
> > > tries to put all devices for a given guest into a "domain".
> >
> > Actually, that's only a recent optimization, before that each device got
> > > its own iommu domain. It's actually completely configurable on the
> > qemu command line which devices get their own iommu and which share.
> > The default optimizes the number of domains (one) and thus the number of
> > mapping callbacks since we pin the entire guest.
> >
> > > > On POWER, we have a different view of things where domains/groups are
> > > defined to be the smallest granularity we can (down to a single VF) and
> > > we give several groups to a guest (ie we avoid sharing the iommu in most
> > > cases)
> > >
> > > This is driven by the HW design but that design is itself driven by the
> > > idea that the domains/group are also error isolation groups and we don't
> > > want to take all of the IOs of a guest down if one adapter in that guest
> > > is having an error.
> > >
> > > The x86 domains are conceptually different as they are about sharing the
> > > iommu page tables with the clear long term intent of then sharing those
> > > page tables with the guest CPU own. We aren't going in that direction
> > > (at this point at least) on POWER..
> >
> > Yes and no. The x86 domains are pretty flexible and used a few
> > different ways. On the host we do dynamic DMA with a domain per device,
> > mapping only the inflight DMA ranges. In order to achieve the
> > transparent device assignment model, we have to flip that around and map
> > the entire guest. As noted, we can continue to use separate domains for
> > this, but since each maps the entire guest, it doesn't add a lot of
> > value and uses more resources and requires more mapping callbacks (and
> > x86 doesn't have the best error containment anyway). If we had a well
> > supported IOMMU model that we could adapt for pvDMA, then it would make
> > > sense to keep each device in its own domain again. Thanks,
>
> > Could you have a PV IOMMU (in the guest) that would set up those
> maps?
Yep, definitely. That's effectively what power wants to do. We could
do it on x86, but as others have noted, the map/unmap interface isn't
tuned to do this at that granularity and our target guest OS audience is
effectively reduced to Linux. Thanks,
Alex
^ permalink raw reply
* Re: kvm PCI assignment & VFIO ramblings
From: Konrad Rzeszutek Wilk @ 2011-08-02 21:29 UTC (permalink / raw)
To: Alex Williamson
Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, David Gibson,
Avi Kivity, Anthony Liguori, linux-pci@vger.kernel.org,
linuxppc-dev
In-Reply-To: <1312299299.2653.429.camel@bling.home>
On Tue, Aug 02, 2011 at 09:34:58AM -0600, Alex Williamson wrote:
> On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
> >
> > Don't worry, it took me a while to get my head around the HW :-) SR-IOV
> > VFs will generally not have limitations like that no, but on the other
> > hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
> > take a bunch of VFs and put them in the same 'domain'.
> >
> > I think the main deal is that VFIO/qemu sees "domains" as "guests" and
> > tries to put all devices for a given guest into a "domain".
>
> Actually, that's only a recent optimization, before that each device got
> > its own iommu domain. It's actually completely configurable on the
> qemu command line which devices get their own iommu and which share.
> The default optimizes the number of domains (one) and thus the number of
> mapping callbacks since we pin the entire guest.
>
> > > On POWER, we have a different view of things where domains/groups are
> > defined to be the smallest granularity we can (down to a single VF) and
> > we give several groups to a guest (ie we avoid sharing the iommu in most
> > cases)
> >
> > This is driven by the HW design but that design is itself driven by the
> > idea that the domains/group are also error isolation groups and we don't
> > want to take all of the IOs of a guest down if one adapter in that guest
> > is having an error.
> >
> > The x86 domains are conceptually different as they are about sharing the
> > iommu page tables with the clear long term intent of then sharing those
> > page tables with the guest CPU own. We aren't going in that direction
> > (at this point at least) on POWER..
>
> Yes and no. The x86 domains are pretty flexible and used a few
> different ways. On the host we do dynamic DMA with a domain per device,
> mapping only the inflight DMA ranges. In order to achieve the
> transparent device assignment model, we have to flip that around and map
> the entire guest. As noted, we can continue to use separate domains for
> this, but since each maps the entire guest, it doesn't add a lot of
> value and uses more resources and requires more mapping callbacks (and
> x86 doesn't have the best error containment anyway). If we had a well
> supported IOMMU model that we could adapt for pvDMA, then it would make
> > sense to keep each device in its own domain again. Thanks,
> Could you have a PV IOMMU (in the guest) that would set up those
maps?
^ permalink raw reply
* RE: [Cbe-oss-dev] [PATCH 09/15] ps3: Limit the number of regions per storage device
From: dm @ 2011-08-02 19:55 UTC (permalink / raw)
To: 'Duait'
Cc: 'Geoff Levand', cbe-oss-dev, 'Hector Martin',
linuxppc-dev
-----Original Message-----
From: linuxppc-dev-bounces+dm=mdtech.ru@lists.ozlabs.org <linuxppc-dev-bounces+dm=mdtech.ru@lists.ozlabs.org>
Sent: 01 August 2011 23:58
To: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Geoff Levand <geoff@infradead.org>; cbe-oss-dev@lists.ozlabs.org <cbe-oss-dev@lists.ozlabs.org>; Hector Martin <hector@marcansoft.com>; linuxppc-dev@lists.ozlabs.org <linuxppc-dev@lists.ozlabs.org>
Subject: Re: [Cbe-oss-dev] [PATCH 09/15] ps3: Limit the number of regions per storage device
On Mon, Aug 1, 2011 at 10:30 PM, Geert Uytterhoeven
<geert@linux-m68k.org> wrote:
> On Mon, Aug 1, 2011 at 22:03, Andre Heider <a.heider@gmail.com> wrote:
>> There can be only 8 regions, add a sanity check
>
> Why can there be only 8 regions?
I believe lv1 limits it to 8, though I might be mistaken here. It is
mostly a sanity check for the patches after this one.
^ permalink raw reply
* [PATCH] powerpc/64e: External Proxy interrupt support
From: Scott Wood @ 2011-08-02 19:44 UTC (permalink / raw)
To: benh; +Cc: linuxppc-dev
Adds support for External Proxy (a.k.a. CoreInt) interrupts on 64-bit
kernels. External Proxy combines interrupt delivery and
acknowledgement, so simply returning from the interrupt without EOI
or other action will not result in the interrupt being reasserted.
When an external interrupt arrives while interrupts are soft-disabled
and is therefore deferred (whether external proxy is used or not), we
set a flag in the PACA. When we re-enable interrupts, either
explicitly or as part of an exception return, we check the flag and
branch to the interrupt exception vector as if hardware had delivered
the interrupt.
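A minimal user-space sketch of the deferral and replay described above,
assuming a simplified model: struct toy_paca and the toy_* functions
are hypothetical, and the handler call is reduced to a counter. The
real code does this in the exception prolog, the exception return path
and arch_local_irq_restore().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_paca {
    bool soft_enabled;  /* models the soft irq-enable state */
    bool hard_enabled;  /* models MSR[EE] as tracked in the PACA */
    bool irq_pending;   /* models the new irq_pending flag */
    int  handled;       /* counts entries into the interrupt handler */
};

/* Hardware delivers an external interrupt (requires MSR[EE] set). */
static void toy_external_input(struct toy_paca *p)
{
    assert(p != NULL && p->hard_enabled);

    if (!p->soft_enabled) {
        /* Masked path: remember the interrupt and hard-disable. */
        p->irq_pending = true;
        p->hard_enabled = false;
        return;
    }
    p->handled++;
}

/* Simplified stand-in for restoring the soft irq-enable state. */
static void toy_irq_restore(struct toy_paca *p, bool enable)
{
    assert(p != NULL);

    p->soft_enabled = enable;
    if (!enable)
        return;

    if (p->irq_pending) {
        /* Replay as if hardware had delivered the interrupt now. */
        p->irq_pending = false;
        p->handled++;
    }
    p->hard_enabled = true;
}
```

The point of the flag is that with External Proxy the interrupt has
already been acknowledged, so nothing would reassert it if the replay
step were missing.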
Another approach I considered was to use doorbells to replay the
interrupt. There are some problems with this:
- The timing of the actual delivery of the doorbell is undefined.
This means we can't be sure in an architected way that the
doorbell will happen before interrupts are again soft-disabled, at
which point (barring interrupt-controller specific actions such as
raising CTPR) we could take a higher priority interrupt and
overwrite the saved EPR.
- Doorbells have a lower priority than true external interrupts. This
means you could have a lower priority interrupt appear to preempt
a higher priority interrupt, once the higher priority interrupt
enables EE and the doorbell comes in.
Signed-off-by: Scott Wood <scottwood@freescale.com>
---
arch/powerpc/include/asm/irq.h | 2 +
arch/powerpc/include/asm/paca.h | 4 ++
arch/powerpc/kernel/asm-offsets.c | 3 +
arch/powerpc/kernel/entry_64.S | 4 ++
arch/powerpc/kernel/exceptions-64e.S | 94 +++++++++++++++++++++++++++-----
arch/powerpc/kernel/irq.c | 11 ++++
arch/powerpc/platforms/85xx/p5020_ds.c | 5 --
7 files changed, 103 insertions(+), 20 deletions(-)
diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index c57a28e..c0a45e7 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -332,5 +332,7 @@ extern void do_IRQ(struct pt_regs *regs);
int irq_choose_cpu(const struct cpumask *mask);
+void deliver_pending_irq(void);
+
#endif /* _ASM_IRQ_H */
#endif /* __KERNEL__ */
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index c1f65f5..e5af3e3 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -134,6 +134,10 @@ struct paca_struct {
u8 hard_enabled; /* set if irqs are enabled in MSR */
u8 io_sync; /* writel() needs spin_unlock sync */
u8 irq_work_pending; /* IRQ_WORK interrupt while soft-disable */
+#ifdef CONFIG_PPC_BOOK3E
+ /* an irq is pending while soft-disabled */
+ u8 irq_pending;
+#endif
/* Stuff for accurate time accounting */
u64 user_time; /* accumulated usermode TB ticks */
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index c98144f..5082ee7 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -206,6 +206,9 @@ int main(void)
DEFINE(SVCPU_SLB, offsetof(struct kvmppc_book3s_shadow_vcpu, slb));
DEFINE(SVCPU_SLB_MAX, offsetof(struct kvmppc_book3s_shadow_vcpu, slb_max));
#endif
+#ifdef CONFIG_PPC_BOOK3E
+ DEFINE(PACA_IRQ_PENDING, offsetof(struct paca_struct, irq_pending));
+#endif
#endif /* CONFIG_PPC64 */
/* RTAS */
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index d834425..c1d8eea 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -596,6 +596,9 @@ _GLOBAL(ret_from_except_lite)
restore:
BEGIN_FW_FTR_SECTION
ld r5,SOFTE(r1)
+#ifdef CONFIG_PPC_BOOK3E
+ lbz r6,PACA_IRQ_PENDING(r13)
+#endif
FW_FTR_SECTION_ELSE
b .Liseries_check_pending_irqs
ALT_FW_FTR_SECTION_END_IFCLR(FW_FEATURE_ISERIES)
@@ -608,6 +611,7 @@ ALT_FW_FTR_SECTION_END_IFCLR(FW_FEATURE_ISERIES)
stb r4,PACAHARDIRQEN(r13)
#ifdef CONFIG_PPC_BOOK3E
+ /* consumes r3-r6 */
b .exception_return_book3e
#else
ld r4,_CTR(r1)
diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index 429983c..9886be9 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -2,6 +2,7 @@
* Boot code and exception vectors for Book3E processors
*
* Copyright (C) 2007 Ben. Herrenschmidt (benh@kernel.crashing.org), IBM Corp.
+ * Copyright 2011 Freescale Semiconductor, Inc.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
@@ -125,6 +126,10 @@
cmpwi cr0,r11,0; /* yes -> go out of line */ \
beq masked_doorbell_book3e
+#define PROLOG_ADDITION_EXTIRQ_GEN \
+ lbz r11,PACASOFTIRQEN(r13); /* are irqs soft-disabled ? */ \
+ cmpwi cr0,r11,0; /* yes -> go out of line */ \
+ beq masked_extirq_book3e
/* Core exception code for all exceptions except TLB misses.
* XXX: Needs to make SPRN_SPRG_GEN depend on exception type
@@ -325,7 +330,13 @@ interrupt_end_book3e:
b storage_fault_common
/* External Input Interrupt */
- MASKABLE_EXCEPTION(0x500, external_input, .do_IRQ, ACK_NONE)
+ START_EXCEPTION(external_input)
+ NORMAL_EXCEPTION_PROLOG(0x500, PROLOG_ADDITION_EXTIRQ)
+ EXCEPTION_COMMON(0x500, PACA_EXGEN, INTS_DISABLE_ALL)
+ CHECK_NAPPING()
+ addi r3,r1,STACK_FRAME_OVERHEAD
+ bl .do_IRQ
+ b .ret_from_except_lite
/* Alignment */
START_EXCEPTION(alignment);
@@ -557,6 +568,12 @@ kernel_dbg_exc:
* An interrupt came in while soft-disabled; clear EE in SRR1,
* clear paca->hard_enabled and return.
*/
+masked_extirq_book3e:
+ mtcr r10
+ li r10,1
+ stb r10,PACA_IRQ_PENDING(r13)
+ b masked_interrupt_book3e_common
+
masked_doorbell_book3e:
mtcr r10
/* Resend the doorbell to fire again when ints enabled */
@@ -618,20 +635,8 @@ alignment_more:
bl .alignment_exception
b .ret_from_except
-/*
- * We branch here from entry_64.S for the last stage of the exception
- * return code path. MSR:EE is expected to be off at that point
- */
-_GLOBAL(exception_return_book3e)
- b 1f
-
-/* This is the return from load_up_fpu fast path which could do with
- * less GPR restores in fact, but for now we have a single return path
- */
- .globl fast_exception_return
-fast_exception_return:
- wrteei 0
-1: mr r0,r13
+.macro exception_restore
+ mr r0,r13
ld r10,_MSR(r1)
REST_4GPRS(2, r1)
andi. r6,r10,MSR_PR
@@ -667,8 +672,67 @@ fast_exception_return:
ld r10,PACA_EXGEN+EX_R10(r13)
ld r11,PACA_EXGEN+EX_R11(r13)
mfspr r13,SPRN_SPRG_GEN_SCRATCH
+.endm
+
+/*
+ * We branch here from entry_64.S for the last stage of the exception
+ * return code path. MSR:EE is expected to be off at that point
+ * r3 = MSR for return context
+ * r4 = hard irq-enable status for return context
+ * r5 = soft irq-enable status for return context
+ * r6 = irq pending flag
+ */
+_GLOBAL(exception_return_book3e)
+ cmpwi r6,0
+ beq common_exception_return
+
+/*
+ * There's an interrupt pending. If we're returning to a context that
+ * is soft-irq-enabled, we need to deliver the interrupt now.
+ *
+ * We should never get here with soft IRQs enabled but hard IRQs disabled,
+ * but just to be sure, check that too.
+ */
+ cmpwi r5,0
+ beq common_exception_return
+ cmpwi r4,0
+ beq common_exception_return
+
+ lis r5,(MSR_CE | MSR_ME | MSR_DE)@h
+ li r4,0
+ ori r5,r5,(MSR_CE | MSR_ME | MSR_DE)@l
+ stb r4,PACA_IRQ_PENDING(r13)
+ and r5,r5,r3
+ oris r5,r5,MSR_CM@h
+ mtmsr r5
+
+ exception_restore
+ b exc_external_input_book3e
+
+/* This is the return from load_up_fpu fast path which could do with
+ * less GPR restores in fact, but for now we have a single return path
+ */
+ .globl fast_exception_return
+fast_exception_return:
+ wrteei 0
+common_exception_return:
+ exception_restore
rfi
+/* Called from arch_local_irq_restore() prior to hard-enabling interrupts */
+_GLOBAL(deliver_pending_irq)
+ mflr r3
+ mfmsr r4
+ lis r5,(MSR_CM | MSR_CE | MSR_ME | MSR_DE)@h
+ ori r5,r5,(MSR_CM | MSR_CE | MSR_ME | MSR_DE)@l
+ and r5,r5,r4
+ ori r4,r4,MSR_EE
+
+ mtspr SPRN_SRR0,r3
+ mtspr SPRN_SRR1,r4
+ mtmsr r5
+ b exc_external_input_book3e
+
/*
* Trampolines used when spotting a bad kernel stack pointer in
* the exception entry code.
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index d281fb6..44a23d0 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -184,6 +184,17 @@ notrace void arch_local_irq_restore(unsigned long en)
lv1_get_version_info(&tmp);
}
+#ifdef CONFIG_PPC_BOOK3E
+ /*
+ * If there's a pending IRQ, deliver it now. Interrupts
+ * will be hard-enabled on return.
+ */
+ if (get_paca()->irq_pending) {
+ get_paca()->irq_pending = 0;
+ deliver_pending_irq();
+ }
+#endif
+
__hard_irq_enable();
}
EXPORT_SYMBOL(arch_local_irq_restore);
diff --git a/arch/powerpc/platforms/85xx/p5020_ds.c b/arch/powerpc/platforms/85xx/p5020_ds.c
index e8cba50..87e7d29 100644
--- a/arch/powerpc/platforms/85xx/p5020_ds.c
+++ b/arch/powerpc/platforms/85xx/p5020_ds.c
@@ -76,12 +76,7 @@ define_machine(p5020_ds) {
#ifdef CONFIG_PCI
.pcibios_fixup_bus = fsl_pcibios_fixup_bus,
#endif
-/* coreint doesn't play nice with lazy EE, use legacy mpic for now */
-#ifdef CONFIG_PPC64
- .get_irq = mpic_get_irq,
-#else
.get_irq = mpic_get_coreint_irq,
-#endif
.restart = fsl_rstcr_restart,
.calibrate_decr = generic_calibrate_decr,
.progress = udbg_progress,
--
1.7.4.1
^ permalink raw reply related
* Re: kvm PCI assignment & VFIO ramblings
From: Alex Williamson @ 2011-08-02 18:35 UTC (permalink / raw)
To: David Gibson
Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
chrisw, iommu, Anthony Liguori, linux-pci@vger.kernel.org,
linuxppc-dev, benve
In-Reply-To: <1312308847.2653.467.camel@bling.home>
On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote:
> On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> > [snip]
> > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > > bridge, so don't suffer the source identifier problem, but they do often
> > > share an interrupt. But even then, we can count on most modern devices
> > > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > > share interrupts. In any case, yes, it's more rare but we need to know
> > > how to handle devices behind PCI bridges. However I disagree that we
> > > need to assign all the devices behind such a bridge to the guest.
> > > There's a difference between removing the device from the host and
> > > exposing the device to the guest.
> >
> > I think you're arguing only over details of what words to use for
> > what, rather than anything of substance here. The point is that an
> > entire partitionable group must be assigned to "host" (in which case
> > kernel drivers may bind to it) or to a particular guest partition (or
> > at least to a single UID on the host). Which of the assigned devices
> > the partition actually uses is another matter of course, as is at
> > exactly which level they become "de-exposed" if you don't want to use
> > all of them.
>
> Well first we need to define what a partitionable group is, whether it's
> based on hardware requirements or user policy. And while I agree that
> we need unique ownership of a partition, I disagree that qemu is
> necessarily the owner of the entire partition vs individual devices.
Sorry, I didn't intend to have such circular logic. "... I disagree
that qemu is necessarily the owner of the entire partition vs granted
access to devices within the partition". Thanks,
Alex
^ permalink raw reply
* Re: kvm PCI assignment & VFIO ramblings
From: Alex Williamson @ 2011-08-02 18:14 UTC (permalink / raw)
To: David Gibson
Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
chrisw, iommu, Anthony Liguori, linux-pci@vger.kernel.org,
linuxppc-dev, benve
In-Reply-To: <20110802082848.GD29719@yookeroo.fritz.box>
On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote:
> On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> [snip]
> > On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> > bridge, so don't suffer the source identifier problem, but they do often
> > share an interrupt. But even then, we can count on most modern devices
> > supporting PCI2.3, and thus the DisINTx feature, which allows us to
> > share interrupts. In any case, yes, it's more rare but we need to know
> > how to handle devices behind PCI bridges. However I disagree that we
> > need to assign all the devices behind such a bridge to the guest.
> > There's a difference between removing the device from the host and
> > exposing the device to the guest.
>
> I think you're arguing only over details of what words to use for
> what, rather than anything of substance here. The point is that an
> entire partitionable group must be assigned to "host" (in which case
> kernel drivers may bind to it) or to a particular guest partition (or
> at least to a single UID on the host). Which of the assigned devices
> the partition actually uses is another matter of course, as is at
> exactly which level they become "de-exposed" if you don't want to use
> all of them.
Well first we need to define what a partitionable group is, whether it's
based on hardware requirements or user policy. And while I agree that
we need unique ownership of a partition, I disagree that qemu is
necessarily the owner of the entire partition vs individual devices.
But feel free to dismiss it as unsubstantial.
> [snip]
> > > Maybe something like /sys/devgroups ? This probably warrants involving
> > > more kernel people into the discussion.
> >
> > I don't yet buy into passing groups to qemu since I don't buy into the
> > idea of always exposing all of those devices to qemu. Would it be
> > sufficient to expose iommu nodes in sysfs that link to the devices
> > behind them and describe properties and capabilities of the iommu
> > itself? More on this at the end.
>
> Again, I don't think you're making a distinction of any substance.
> Ben is saying the group as a whole must be set to allow partition
> access, whether or not you call that "assigning". There's no reason
> that passing a sysfs descriptor to qemu couldn't be the qemu
> developer's quick-and-dirty method of putting the devices in, while
> also allowing full assignment of the devices within the groups by
> libvirt.
Well, there is a reason for not passing a sysfs descriptor to qemu if
qemu isn't the one defining the policy about how the members of that
group are exposed. I tend to envision a userspace entity defining
policy and granting devices to qemu. Do we really want separate
developer vs production interfaces?
> [snip]
> > > Now some of this can be fixed with tweaks, and we've started doing it
> > > (we have a working pass-through using VFIO, forgot to mention that, it's
> > > just that we don't like what we had to do to get there).
> >
> > This is a result of wanting to support *unmodified* x86 guests. We
> > don't have the luxury of having a predefined pvDMA spec that all x86
> > OSes adhere to. The 32bit problem is unfortunate, but the priority use
> > case for assigning devices to guests is high performance I/O, which
> > usually entails modern, 64bit hardware. I'd like to see us get to the
> > point of having emulated IOMMU hardware on x86, which could then be
> > backed by VFIO, but for now guest pinning is the most practical and
> > useful.
>
> No-one's suggesting that this isn't a valid mode of operation. It's
> just that right now conditionally disabling it for us is fairly ugly
> because of the way the qemu code is structured.
It really shouldn't be any more than skipping the
cpu_register_phys_memory_client() and calling the map/unmap routines
elsewhere.
> [snip]
> > > - I don't like too much the fact that VFIO provides yet another
> > > different API to do what we already have at least 2 kernel APIs for, ie,
> > > BAR mapping and config space access. At least it should be better at
> > > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > > understand it wants to filter in some case (config space) and -maybe-
> > > yet another API is the right way to go but allow me to have my doubts.
> >
> > The use of PCI sysfs is actually one of my complaints about current
> > device assignment. To do assignment with an unprivileged guest we need
> > to open the PCI sysfs config file for it, then change ownership on a
> > handful of other PCI sysfs files, then there's this other pci-stub thing
> > to maintain ownership, but the kvm ioctls don't actually require it and
> > can grab onto any free device... We are duplicating some of that in
> > VFIO, but we also put the ownership of the device behind a single device
> > file. We do have the uiommu problem that we can't give an unprivileged
> > user ownership of that, but your usage model may actually make that
> > easier. More below...
>
> Hrm. I was assuming that a sysfs groups interface would provide a
> single place to set the ownership of the whole group. Whether that's
> > echoing a uid to a magic file or doing a chown on the directory or
> whatever is a matter of details.
Except one of those details is whether we manage the group in sysfs or
just expose enough information in sysfs for another userspace entity to
manage the devices. Where do we manage enforcement of hardware policy
vs userspace policy?
> [snip]
> > I spent a lot of time looking for an architecture neutral solution here,
> > but I don't think it exists. Please prove me wrong. The problem is
> > that we have to disable INTx on an assigned device after it fires (VFIO
> > does this automatically). If we don't do this, a non-responsive or
> > malicious guest could sit on the interrupt, causing it to fire
> > repeatedly as a DoS on the host. The only indication that we can rely
> > on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> > We can't just wait for device accesses because a) the device CSRs are
> > (hopefully) direct mapped and we'd have to slow map them or attempt to
> > do some kind of dirty logging to detect when they're accessed, b) what
> > constitutes an interrupt service is device specific.
> >
> > That means we need to figure out how PCI interrupt 'A' (or B...)
> > translates to a GSI (Global System Interrupt - ACPI definition, but
> > hopefully a generic concept). That GSI identifies a pin on an IOAPIC,
> > which will also see the APIC EOI. And just to spice things up, the
> > guest can change the PCI to GSI mappings via ACPI. I think the set of
> > callbacks I've added are generic (maybe I left ioapic in the name), but
> > yes they do need to be implemented for other architectures. Patches
> > appreciated from those with knowledge of the systems and/or access to
> > device specs. This is the only reason that I make QEMU VFIO only build
> > for x86.
>
> There will certainly need to be some arch hooks here, but it can be
> made less intrusively x86 specific without too much difficulty.
> e.g. Create an EOI notifier chain in qemu - the master PICs (APIC for
> x86, XICS for pSeries) for all vfio capable machines need to kick it,
> and vfio subscribes.
Am I the only one that sees ioapic_add/remove_gsi_eoi_notifier() in the
qemu/vfio patch series? Shoot me for using ioapic in the name, but it's
exactly what you ask for. It just needs to be made a common service and
implemented for power.
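A minimal sketch of what such a common EOI notifier service could look
like, assuming a toy design: toy_add_eoi_notifier(), toy_eoi() and
struct toy_intx are hypothetical and are not the signatures used by
ioapic_add/remove_gsi_eoi_notifier() in the qemu/vfio series. The
assigned device's INTx is masked when it fires and unmasked only when
the interrupt controller model reports an EOI for the matching GSI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NOTIFIERS 8

typedef void (*eoi_fn)(int gsi, void *opaque);

struct eoi_notifier {
    int    gsi;      /* interrupt line this subscriber cares about */
    eoi_fn fn;       /* callback to run on EOI */
    void  *opaque;   /* subscriber state passed back to fn */
};

static struct eoi_notifier chain[MAX_NOTIFIERS];
static size_t chain_len;

/* Subscribe to EOIs on one GSI; returns false if the chain is full. */
static bool toy_add_eoi_notifier(int gsi, eoi_fn fn, void *opaque)
{
    assert(fn != NULL);
    if (chain_len == MAX_NOTIFIERS)
        return false;
    chain[chain_len++] = (struct eoi_notifier){ gsi, fn, opaque };
    return true;
}

/* Called by the interrupt controller model (APIC, XICS, ...) on EOI. */
static void toy_eoi(int gsi)
{
    for (size_t i = 0; i < chain_len; i++)
        if (chain[i].gsi == gsi)
            chain[i].fn(gsi, chain[i].opaque);
}

struct toy_intx {
    bool masked;     /* models INTx being disabled on the device */
};

/* The interrupt fired: mask it so a stuck guest cannot storm the host. */
static void toy_intx_fire(struct toy_intx *d)
{
    assert(d != NULL);
    d->masked = true;
}

/* EOI seen for our GSI: the guest serviced it, so unmask INTx again. */
static void toy_intx_eoi(int gsi, void *opaque)
{
    (void)gsi;
    ((struct toy_intx *)opaque)->masked = false;
}
```

Keeping the chain keyed on a plain integer line number is what would
let a non-x86 controller model kick the same service.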
> [snip]
> > Rather than your "groups" idea, I've been mulling over whether we can
> > just expose the dependencies, configuration, and capabilities in sysfs
> > and build qemu commandlines to describe it. For instance, if we simply
> > start with creating iommu nodes in sysfs, we could create links under
> > each iommu directory to the devices behind them. Some kind of
> > capability file could define properties like whether it's page table
> > based or fixed iova window or the granularity of mapping the devices
> > behind it. Once we have that, we could probably make uiommu attach to
> > each of those nodes.
>
> Well, that would address our chief concern that inherently tying the
> lifetime of a domain to an fd is problematic. In fact, I don't really
> see how this differs from the groups proposal except in the details of
> how you inform qemu of the group^H^H^H^H^Hiommu domain.
One implies group policy, configuration and management in sysfs, the
other exposes the hardware dependencies in sysfs and leaves the rest for
someone else (libvirt). Thanks,
Alex
^ permalink raw reply
* Re: kvm PCI assignment & VFIO ramblings
From: Alex Williamson @ 2011-08-02 15:34 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Alexey Kardashevskiy, kvm, Paul Mackerras,
linux-pci@vger.kernel.org, David Gibson, Avi Kivity,
Anthony Liguori, linuxppc-dev
In-Reply-To: <1312289929.8793.890.camel@pasglop>
On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote:
>
> Don't worry, it took me a while to get my head around the HW :-) SR-IOV
> VFs will generally not have limitations like that no, but on the other
> hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
> take a bunch of VFs and put them in the same 'domain'.
>
> I think the main deal is that VFIO/qemu sees "domains" as "guests" and
> tries to put all devices for a given guest into a "domain".
Actually, that's only a recent optimization; before that, each device got
its own iommu domain. It's actually completely configurable on the
qemu command line which devices get their own iommu and which share.
The default optimizes the number of domains (one) and thus the number of
mapping callbacks since we pin the entire guest.
> On POWER, we have a different view of things where domains/groups are
> defined to be the smallest granularity we can (down to a single VF) and
> we give several groups to a guest (ie we avoid sharing the iommu in most
> cases)
>
> This is driven by the HW design but that design is itself driven by the
> idea that the domains/group are also error isolation groups and we don't
> want to take all of the IOs of a guest down if one adapter in that guest
> is having an error.
>
> The x86 domains are conceptually different as they are about sharing the
> iommu page tables with the clear long term intent of then sharing those
> page tables with the guest CPU's own. We aren't going in that direction
> (at this point at least) on POWER.
Yes and no. The x86 domains are pretty flexible and used a few
different ways. On the host we do dynamic DMA with a domain per device,
mapping only the inflight DMA ranges. In order to achieve the
transparent device assignment model, we have to flip that around and map
the entire guest. As noted, we can continue to use separate domains for
this, but since each maps the entire guest, it doesn't add a lot of
value and uses more resources and requires more mapping callbacks (and
x86 doesn't have the best error containment anyway). If we had a well
supported IOMMU model that we could adapt for pvDMA, then it would make
sense to keep each device in its own domain again. Thanks,
Alex
* Re: [PATCH 3/3] KVM: PPC: Implement H_CEDE hcall for book3s_hv in real-mode code
From: Alexander Graf @ 2011-08-02 14:47 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev, kvm-ppc
In-Reply-To: <20110723074246.GC17927@bloggs.ozlabs.ibm.com>
On 07/23/2011 09:42 AM, Paul Mackerras wrote:
> With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
> core), whenever a CPU goes idle, we have to pull all the other
> hardware threads in the core out of the guest, because the H_CEDE
> hcall is handled in the kernel. This is inefficient.
>
> This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
> in real mode. When a guest vcpu does an H_CEDE hcall, we now only
> exit to the kernel if all the other vcpus in the same core are also
> idle. Otherwise we mark this vcpu as napping, save state that could
> be lost in nap mode (mainly GPRs and FPRs), and execute the nap
> instruction. When the thread wakes up, because of a decrementer or
> external interrupt, we come back in at kvm_start_guest (from the
> system reset interrupt vector), find the `napping' flag set in the
> paca, and go to the resume path.
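The cede decision described above can be sketched as a small stand-alone
C model. This is only an illustration of the rule "nap unless every other
thread in the core is already idle": struct vcore_model, handle_cede() and
the enum are names invented for this sketch, and the real work is done in
assembly in book3s_hv_rmhandlers.S, which is not reproduced here.

```c
#include <stdbool.h>

#define THREADS_PER_CORE 4	/* SMT4, as in the commit message */

/* What the real-mode handler does with an H_CEDE from one thread. */
enum cede_action { CEDE_NAP, CEDE_EXIT_TO_KERNEL };

/* Hypothetical model of a vcore: which hardware threads have ceded. */
struct vcore_model {
	bool ceded[THREADS_PER_CORE];
};

/*
 * Mark `thread` as ceded.  If any other thread in the core is still
 * running guest code, this thread just naps; only when all threads
 * are idle do we leave the guest and go back to the kernel.
 */
static enum cede_action handle_cede(struct vcore_model *vc, int thread)
{
	int i;

	vc->ceded[thread] = true;
	for (i = 0; i < THREADS_PER_CORE; i++)
		if (!vc->ceded[i])
			return CEDE_NAP;
	return CEDE_EXIT_TO_KERNEL;
}
```

With an initially busy vcore, calling handle_cede() for threads 0, 1 and 2
gives CEDE_NAP each time, and the call for thread 3 gives
CEDE_EXIT_TO_KERNEL.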
>
> This has some other ramifications. First, when starting a core, we
> now start all the threads, both those that are immediately runnable and
> those that are idle. This is so that we don't have to pull all the
> threads out of the guest when an idle thread gets a decrementer interrupt
> and wants to start running. In fact the idle threads will all start
> with the H_CEDE hcall returning; being idle they will just do another
> H_CEDE immediately and go to nap mode.
>
> This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
> These functions have been restructured to make them simpler and clearer.
> We introduce a level of indirection in the wait queue that gets woken
> when external and decrementer interrupts get generated for a vcpu, so
> that we can have the 4 vcpus in a vcore using the same wait queue.
> We need this because the 4 vcpus are being handled by one thread.
>
> Secondly, when we need to exit from the guest to the kernel, we now
> have to generate an IPI for any napping threads, because an HDEC
> interrupt doesn't wake up a napping thread.
>
> Thirdly, we now need to be able to handle virtual external interrupts
> and decrementer interrupts becoming pending while a thread is napping,
> and deliver those interrupts to the guest when the thread wakes.
> This is done in kvmppc_cede_reentry, just before fast_guest_return.
>
> Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
> and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
> from kvm_arch_vcpu_runnable.
>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
> arch/powerpc/include/asm/kvm_host.h | 19 ++-
> arch/powerpc/kernel/asm-offsets.c | 6 +
> arch/powerpc/kvm/book3s_hv.c | 335 ++++++++++++++++-------------
> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 297 ++++++++++++++++++++++---
> arch/powerpc/kvm/powerpc.c | 21 +-
> 6 files changed, 483 insertions(+), 196 deletions(-)
>
>
[...]
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index a107c9b..cd0e3e5 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -39,12 +39,8 @@
>
> int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
> {
> -#ifndef CONFIG_KVM_BOOK3S_64_HV
> return !(v->arch.shared->msr & MSR_WE) ||
> !!(v->arch.pending_exceptions);
> -#else
> - return !(v->arch.ceded) || !!(v->arch.pending_exceptions);
> -#endif
> }
>
> int kvmppc_kvm_pv(struct kvm_vcpu *vcpu)
> @@ -258,6 +254,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id)
> {
> struct kvm_vcpu *vcpu;
> vcpu = kvmppc_core_vcpu_create(kvm, id);
> + vcpu->arch.wqp = &vcpu->wq;
> if (!IS_ERR(vcpu))
> kvmppc_create_vcpu_debugfs(vcpu, id);
> return vcpu;
> @@ -289,8 +286,8 @@ static void kvmppc_decrementer_func(unsigned long data)
>
> kvmppc_core_queue_dec(vcpu);
>
> - if (waitqueue_active(&vcpu->wq)) {
> - wake_up_interruptible(&vcpu->wq);
> + if (waitqueue_active(vcpu->arch.wqp)) {
> + wake_up_interruptible(vcpu->arch.wqp);
> vcpu->stat.halt_wakeup++;
> }
> }
> @@ -543,13 +540,15 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>
> int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq)
> {
> - if (irq->irq == KVM_INTERRUPT_UNSET)
> + if (irq->irq == KVM_INTERRUPT_UNSET) {
> kvmppc_core_dequeue_external(vcpu, irq);
> - else
> - kvmppc_core_queue_external(vcpu, irq);
> + return 0;
> + }
Not sure I understand this part. Mind explaining?
Alex
> +
> + kvmppc_core_queue_external(vcpu, irq);
>
> - if (waitqueue_active(&vcpu->wq)) {
> - wake_up_interruptible(&vcpu->wq);
> + if (waitqueue_active(vcpu->arch.wqp)) {
> + wake_up_interruptible(vcpu->arch.wqp);
> vcpu->stat.halt_wakeup++;
> } else if (vcpu->cpu != -1) {
> smp_send_reschedule(vcpu->cpu);
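Read as plain C rather than as a diff, the hunk above gives roughly the
following control flow. This is a stand-alone mock, not kernel code:
struct vcpu_model and its fields are invented stand-ins (noted in the
comments), IRQ_UNSET is a placeholder for KVM_INTERRUPT_UNSET rather than
its real value, and the final "return 0" is assumed since the end of the
function is not shown in the hunk.

```c
#include <stdbool.h>

#define IRQ_UNSET (-1)	/* placeholder for KVM_INTERRUPT_UNSET */

/* Hypothetical stand-in for the bits of struct kvm_vcpu used here. */
struct vcpu_model {
	bool irq_pending;	/* external interrupt queued */
	bool sleeping;		/* models waitqueue_active(vcpu->arch.wqp) */
	int wakeups;		/* models vcpu->stat.halt_wakeup */
	int resched_ipis;	/* counts smp_send_reschedule() calls */
	int cpu;		/* -1 when not running on any cpu */
};

static int ioctl_interrupt(struct vcpu_model *v, int irq)
{
	if (irq == IRQ_UNSET) {
		/* Clearing the interrupt: dequeue and return early,
		 * without waking or kicking the vcpu thread. */
		v->irq_pending = false;
		return 0;
	}

	v->irq_pending = true;

	if (v->sleeping) {
		/* models wake_up_interruptible() */
		v->sleeping = false;
		v->wakeups++;
	} else if (v->cpu != -1) {
		/* models smp_send_reschedule() */
		v->resched_ipis++;
	}
	return 0;
}
```

The only behavioural difference from the old code in this model is that
the IRQ_UNSET path no longer reaches the wakeup/reschedule step.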
* Re: kvm PCI assignment & VFIO ramblings
From: Alex Williamson @ 2011-08-02 14:39 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Alexey Kardashevskiy, kvm, Paul Mackerras,
linux-pci@vger.kernel.org, David Gibson, Avi Kivity,
Anthony Liguori, linuxppc-dev
In-Reply-To: <1312248479.8793.827.camel@pasglop>
On Tue, 2011-08-02 at 11:27 +1000, Benjamin Herrenschmidt wrote:
> It's a shared address space. With a basic configuration on p7ioc for
> example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> contain the normal PCI address there. But that 1G is divided in 128
> segments of equal size which can separately be assigned to PE#'s.
>
> So BARs are allocated by firmware or the kernel PCI code so that devices
> in different PEs don't share segments.
>
> Of course there's always the risk that a device can be hacked via a
> sideband access to BARs to move out of its allocated segment. That
> means that the guest owning that device won't be able to access it
> anymore and can potentially disturb a guest or host owning whatever is
> in that other segment.
Wait, what? I thought the MMIO segments were specifically so that if
the device BARs moved out of the segment the guest only hurts itself and
not the new segments overlapped.
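For reference, the segment arithmetic in the quoted p7ioc example can be
sketched as below. The constants come straight from the quoted text (a 1G
window at PCI addresses 3G..4G, 128 equal segments, so 8M per segment);
the helper names and the seg_to_pe[] table are hypothetical and only
model the segment-to-PE# remapping idea, not any real firmware or kernel
interface.

```c
#include <stdint.h>

/* From the quoted example: 1G of MMIO at 3G..4G, 128 equal segments. */
#define M32_BASE	0xC0000000u
#define M32_SIZE	0x40000000u
#define M32_SEGMENTS	128u
#define M32_SEG_SIZE	(M32_SIZE / M32_SEGMENTS)	/* 8M */

/* Segment index for a PCI address, or -1 if below the window. */
static int m32_segment_of(uint32_t pci_addr)
{
	if (pci_addr < M32_BASE)
		return -1;
	return (int)((pci_addr - M32_BASE) / M32_SEG_SIZE);
}

/*
 * Returns 1 if a BAR at `bar` of `size` bytes only touches segments
 * that the (hypothetical) remapping table assigns to `pe`, else 0.
 */
static int bar_within_pe(const uint8_t seg_to_pe[M32_SEGMENTS],
			 uint32_t bar, uint32_t size, uint8_t pe)
{
	uint32_t last;
	int first_seg, last_seg, s;

	if (size == 0)
		return 0;
	last = bar + (size - 1);
	if (last < bar)			/* wrapped past 4G */
		return 0;
	first_seg = m32_segment_of(bar);
	last_seg = m32_segment_of(last);
	if (first_seg < 0)
		return 0;
	for (s = first_seg; s <= last_seg; s++)
		if (seg_to_pe[s] != pe)
			return 0;
	return 1;
}
```

A BAR that is reprogrammed so that bar_within_pe() stops being true is
the "moved out of its segment" case being discussed; what the hardware
then does with the access is exactly the open question, and the sketch
does not try to answer it.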
> The only way to enforce isolation here is to ensure that PE# are
> entirely behind P2P bridges, since those would then ensure that even if
> you put crap into your BARs you won't be able to walk over a neighbour.
Ok, so the MMIO segments are really just a configuration nuance of the
platform and being behind a P2P bridge is what allows you to hand off
BARs to a guest (which needs to know the bridge window to do anything
useful with them). Is that right?
> I believe pHyp enforces that, for example, if you have a slot, all
> devices & functions behind that slot pertain to the same PE# under pHyp.
>
> That means you cannot put individual functions of a device into
> different PE# with pHyp.
>
> We plan to be a bit less restrictive here for KVM, assuming that if you
> use a device that allows such a back-channel to the BARs, then it's your
> problem to not trust such a device for virtualization. And most of the
> time, you -will- have a P2P to protect you anyways.
>
> The problem doesn't exist (or is assumed as non-existing) for SR-IOV
> since in that case, the VFs are meant to be virtualized, so pHyp assumes
> there is no such back-channel and it can trust them to be in different
> PE#.
But you still need the P2P bridge to protect MMIO segments? Or do
SR-IOV BARs need to be virtualized? I'm having trouble with the mental
model of how you can do both. Thanks,
Alex
* Re: kvm PCI assignment & VFIO ramblings
From: Avi Kivity @ 2011-08-02 13:39 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Alexey Kardashevskiy, kvm, Paul Mackerras,
linux-pci@vger.kernel.org, David Gibson, Alex Williamson,
Anthony Liguori, linuxppc-dev
In-Reply-To: <1312289929.8793.890.camel@pasglop>
On 08/02/2011 03:58 PM, Benjamin Herrenschmidt wrote:
> > >
> > > What you mean 2-level is two passes through two trees (ie 6 or 8 levels
> > > right ?).
> >
> > (16 or 25)
>
> 25 levels ? You mean 25 loads to get to a translation ? And you get any
> kind of performance out of that ? :-)
>
Aggressive partial translation caching. Even then, performance does
suffer on memory intensive workloads. The fix was transparent
hugepages; that makes the page table walks much faster since they're
fully cached, the partial translation caches become more effective, and
the tlb itself becomes more effective. On some workloads, THP on both
guest and host was faster than no-THP on bare metal.
> > >
> > > Not sure what you mean... the guest calls h-calls for every iommu page
> > > mapping/unmapping, yes. So the performance of these is critical. So yes,
> > > we'll eventually do it in kernel. We just haven't yet.
> >
> > I see. x86 traditionally doesn't do it for every request. We had some
> > proposals to do a pviommu that does map every request, but none reached
> > maturity.
>
> It's quite performance critical, you don't want to go anywhere near a
> full exit. On POWER we plan to handle that in "real mode" (ie MMU off)
> straight off the interrupt handlers, with the CPU still basically
> operating in guest context with HV permission. That is basically do the
> permission check, translation and whack the HW iommu immediately. If for
> some reason one step fails (!present PTE or something like that), we'd
> then fallback to an exit to Linux to handle it in a more "common"
> environment where we can handle page faults etc...
I guess we can hack some kind of private interface, though I'd hoped to
avoid it (and so far we succeeded - we can even get vfio to inject
interrupts into kvm from the kernel without either knowing anything
about the other).
> > > > Does the BAR value contain the segment base address? Or is that added
> > > > later?
> > >
> > > It's a shared address space. With a basic configuration on p7ioc for
> > > example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> > > contain the normal PCI address there. But that 1G is divided in 128
> > > segments of equal size which can separately be assigned to PE#'s.
> > >
> > > So BARs are allocated by firmware or the kernel PCI code so that devices
> > > in different PEs don't share segments.
> >
> > Okay, and config space virtualization ensures that the guest can't remap?
>
> Well, so it depends :-)
>
> With KVM we currently use whatever config space virtualization you do
> and so we somewhat rely on this but it's not very fool proof.
>
> I believe pHyp doesn't even bother filtering config space. As I said in
> another note, you can't trust adapters anyway. Plenty of them (video
> cards come to mind) have ways to get to their own config space via MMIO
> registers for example.
Yes, we've seen that.
> So what pHyp does is that it always create PE's (aka groups) that are
> below a bridge. With PCIe, everything mostly is below a bridge so that's
> easy, but that does mean that you always have all functions of a device
> in the same PE (and thus in the same partition). SR-IOV is an exception
> to this rule since in that case the HW is designed to be trusted.
>
> That way, being behind a bridge, the bridge windows are going to define
> what can be forwarded to the device, and thus the system is immune to
> the guest putting crap into the BARs. It can't be remapped to overlap a
> neighbouring device.
>
> Note that the bridge itself isn't visible to the guest, so yes, config
> space is -somewhat- virtualized, typically pHyp make every pass-through
> PE look like a separate PCI host bridge with the devices below it.
I think I see, yes.
--
error compiling committee.c: too many arguments to function
* Re: kvm PCI assignment & VFIO ramblings
From: Benjamin Herrenschmidt @ 2011-08-02 12:58 UTC (permalink / raw)
To: Avi Kivity
Cc: Alexey Kardashevskiy, kvm, Paul Mackerras,
linux-pci@vger.kernel.org, David Gibson, Alex Williamson,
Anthony Liguori, linuxppc-dev
In-Reply-To: <4E37BF62.2060809@redhat.com>
On Tue, 2011-08-02 at 12:12 +0300, Avi Kivity wrote:
> On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> > >
> > > I have a feeling you'll be getting the same capabilities sooner or
> > > later, or you won't be able to make use of S/R IOV VFs.
> >
> > I'm not sure what you mean. We can do SR/IOV just fine (well, with some
> > limitations due to constraints with how our MMIO segmenting works and
> > indeed some of those are being lifted in our future chipsets but
> > overall, it works).
>
> Don't those limitations include "all VFs must be assigned to the same
> guest"?
No, not at all. We put them in different PE# and because the HW is
SR-IOV we know we can trust it to the extent that it won't have nasty
hidden side effects between them. We have 64-bit windows for MMIO that
are also segmented and that we can "resize" to map over the VF BAR
region. The limitations are more about the allowed sizes, number of
segments supported, etc. for these things, which can cause us to play
interesting games with the system page size setting to find a good
match.
> PCI on x86 has function granularity, SRIOV reduces this to VF
> granularity, but I thought power has partition or group granularity
> which is much coarser?
The granularity of a "Group" really depends on what the HW is like. On
pure PCIe SR-IOV we can go down to function granularity.
In fact I currently go down to function granularity on anything pure
PCIe as well, though as I explained earlier, that's a bit chancy since
some adapters -will- allow side effects to be created, such as side band
access to config space.
pHyp doesn't allow that granularity as far as I can tell, one slot is
always fully assigned to a PE.
However, we might have resource constraints as in reaching max number of
segments or iommu regions that may force us to group a bit more coarsely
under some circumstances.
The main point is that the grouping is pre-existing, so an API designed
around the idea of: 1- create domain, 2- add random devices to it, 3-
use it, won't work for us very well :-)
Since the grouping implies the sharing of iommu's, from a VFIO point of
view it really matches well with the idea of having the domains
pre-existing.
That's why I think a good fit is to have a static representation of the
grouping, with tools to create/manipulate the groups (or
domains) for archs that allow this sort of manipulation, separately
from qemu/libvirt, avoiding those "on the fly" groups whose lifetime is
tied to an instance of a file descriptor.
> > In -theory-, one could do the grouping dynamically with some kind of API
> > for us as well. However the constraints are such that it's not
> > practical. Filtering on RID is based on number of bits to match in the
> > bus number and whether to match the dev and fn. So it's not arbitrary
> > (but works fine for SR-IOV).
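The RID filtering rule quoted above can be sketched as follows. This is a
guess at the shape of the matching, not a description of the hardware: it
assumes the usual PCI requester ID layout (bus in bits 15:8, device in
7:3, function in 2:0) and assumes the "number of bits to match" counts
high-order bus bits. struct rid_filter and rid_matches() are invented
names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical filter: match N high-order bus bits, optionally dev/fn. */
struct rid_filter {
	unsigned bus;		/* bus number to compare against */
	unsigned bus_bits;	/* high-order bus bits that must match, 0..8 */
	bool match_dev;
	unsigned dev;
	bool match_fn;
	unsigned fn;
};

static bool rid_matches(const struct rid_filter *f, uint16_t rid)
{
	unsigned bus = rid >> 8;
	unsigned dev = (rid >> 3) & 0x1f;
	unsigned fn = rid & 0x7;
	unsigned bits = f->bus_bits > 8 ? 8 : f->bus_bits;
	unsigned mask = (0xffu << (8 - bits)) & 0xffu;

	if ((bus & mask) != (f->bus & mask))
		return false;
	if (f->match_dev && dev != f->dev)
		return false;
	if (f->match_fn && fn != f->fn)
		return false;
	return true;
}
```

In this model a filter with bus_bits = 8 and both match flags set selects
exactly one function, while a filter with fewer bus bits and no dev/fn
match selects a whole power-of-two range of buses. That is the sense in
which the grouping is "not arbitrary".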
> >
> > The MMIO segmentation is a bit special too. There is a single MMIO
> > region in 32-bit space (size is configurable but that's not very
> > practical so for now we stick it to 1G) which is evenly divided into N
> > segments (where N is the number of PE# supported by the host bridge,
> > typically 128 with the current bridges).
> >
> > Each segment goes through a remapping table to select the actual PE# (so
> > large BARs use consecutive segments mapped to the same PE#).
> >
> > For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
> > regions which act as some kind of "accordions", they are evenly divided
> > into segments in different PE# and there's several of them which we can
> > "move around" and typically use to map VF BARs.
>
> So, SRIOV VFs *don't* have the group limitation? Sorry, I'm deluged by
> technical details with no ppc background to put them to, I can't say I'm
> making any sense of this.
:-)
Don't worry, it took me a while to get my head around the HW :-) SR-IOV
VFs will generally not have limitations like that no, but on the other
hand, they -will- still require 1 VF = 1 group, ie, you won't be able to
take a bunch of VFs and put them in the same 'domain'.
I think the main deal is that VFIO/qemu sees "domains" as "guests" and
tries to put all devices for a given guest into a "domain".
On POWER, we have a different view of things where domains/groups are
defined to be the smallest granularity we can (down to a single VF) and
we give several groups to a guest (ie we avoid sharing the iommu in most
cases)
This is driven by the HW design but that design is itself driven by the
idea that the domains/group are also error isolation groups and we don't
want to take all of the IOs of a guest down if one adapter in that guest
is having an error.
The x86 domains are conceptually different as they are about sharing the
iommu page tables with the clear long term intent of then sharing those
page tables with the guest CPU's own. We aren't going in that direction
(at this point at least) on POWER.
> > > > VFIO here is basically designed for one and only one thing: expose the
> > > > entire guest physical address space to the device more/less 1:1.
> > >
> > > A single level iommu cannot be exposed to guests. Well, it can be
> > > exposed as an iommu that does not provide per-device mapping.
> >
> > Well, x86 ones can't maybe but on POWER we can and must thanks to our
> > essentially paravirt model :-) Even if it wasn't and we used trapping
> > of accesses to the table, it would work because in practice, even with
> > filtering, what we end up having is a per-device (or rather per-PE#
> > table).
> >
> > > A two level iommu can be emulated and exposed to the guest. See
> > > http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.
> >
> > What you mean 2-level is two passes through two trees (ie 6 or 8 levels
> > right ?).
>
> (16 or 25)
25 levels ? You mean 25 loads to get to a translation ? And you get any
kind of performance out of that ? :-)
> > We don't have that and probably never will. But again, because
> > we have a paravirt interface to the iommu, it's less of an issue.
>
> Well, then, I guess we need an additional interface to expose that to
> the guest.
>
> > > > This means:
> > > >
> > > > - It only works with iommu's that provide complete DMA address spaces
> > > > to devices. Won't work with a single 'segmented' address space like we
> > > > have on POWER.
> > > >
> > > > - It requires the guest to be pinned. Pass-through -> no more swap
> > >
> > > Newer iommus (and devices, unfortunately) (will) support I/O page faults
> > > and then the requirement can be removed.
> >
> > No. -Some- newer devices will. Out of these, a bunch will have so many
> > bugs in it it's not usable. Some never will. It's a mess really and I
> > wouldn't design my stuff based on those premises just yet. Making it
> > possible to support it for sure, having it in mind, but not making it
> > the foundation on which the whole API is designed.
>
> The API is not designed around pinning. It's a side effect of how the
> IOMMU works. If your IOMMU only maps pages which are under active DMA,
> then it would only pin those pages.
>
> But I see what you mean, the API is designed around up-front
> specification of all guest memory.
Right :-)
> > > > - It doesn't work for POWER server anyways because of our need to
> > > > provide a paravirt iommu interface to the guest since that's how pHyp
> > > > works today and how existing OSes expect to operate.
> > >
> > > Then you need to provide that same interface, and implement it using the
> > > real iommu.
> >
> > Yes. Working on it. It's not very practical due to how VFIO interacts in
> > terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
> > almost entirely real-mode for performance reasons.
>
> The original kvm device assignment code was (and is) part of kvm
> itself. We're trying to move to vfio to allow sharing with non-kvm
> users, but it does reduce flexibility. We can have an internal vfio-kvm
> interface to update mappings in real time.
>
> > > > - Performance sucks of course, the vfio map ioctl wasn't meant for that
> > > > and has quite a bit of overhead. However we'll want to do the paravirt
> > > > call directly in the kernel eventually ...
> > >
> > > Does the guest iomap each request? Why?
> >
> > Not sure what you mean... the guest calls h-calls for every iommu page
> > mapping/unmapping, yes. So the performance of these is critical. So yes,
> > we'll eventually do it in kernel. We just haven't yet.
>
> I see. x86 traditionally doesn't do it for every request. We had some
> proposals to do a pviommu that does map every request, but none reached
> maturity.
It's quite performance critical, you don't want to go anywhere near a
full exit. On POWER we plan to handle that in "real mode" (ie MMU off)
straight off the interrupt handlers, with the CPU still basically
operating in guest context with HV permission. That is basically do the
permission check, translation and whack the HW iommu immediately. If for
some reason one step fails (!present PTE or something like that), we'd
then fallback to an exit to Linux to handle it in a more "common"
environment where we can handle page faults etc...
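The shape of that fast path can be sketched as a stand-alone C model:
permission check, translation, update of the HW table, and a distinct
return code when a step cannot be completed and a full exit is needed.
Everything here is hypothetical (struct iommu_model, put_tce_fast(), the
result codes and the table size); it models the control flow only and is
not the planned hcall implementation.

```c
#include <stdint.h>

#define TABLE_ENTRIES 8

enum hcall_result {
	H_DONE,			/* handled entirely on the fast path */
	H_DENIED,		/* permission check failed */
	H_NEEDS_FULL_EXIT	/* fall back to the slow path in Linux */
};

/* Hypothetical model of the state the fast path looks at. */
struct iommu_model {
	uint64_t hw_table[TABLE_ENTRIES];	/* models the HW iommu table */
	uint64_t guest_to_host[TABLE_ENTRIES];	/* 0 models a !present PTE */
};

static enum hcall_result put_tce_fast(struct iommu_model *m,
				      unsigned idx, unsigned guest_page)
{
	/* 1. permission check: is this entry inside the guest's window? */
	if (idx >= TABLE_ENTRIES || guest_page >= TABLE_ENTRIES)
		return H_DENIED;

	/* 2. translation: cannot take a page fault here, so punt. */
	if (m->guest_to_host[guest_page] == 0)
		return H_NEEDS_FULL_EXIT;

	/* 3. update the HW iommu entry immediately. */
	m->hw_table[idx] = m->guest_to_host[guest_page];
	return H_DONE;
}
```

The point of the three-way result is that only the H_NEEDS_FULL_EXIT case
pays for leaving guest context; the common case completes without it.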
> > > So, you have interrupt redirection? That is, MSI-x table values encode
> > > the vcpu, not pcpu?
> >
> > Not exactly. The MSI-X address is a real PCI address to an MSI port and
> > the value is a real interrupt number in the PIC.
> >
> > However, the MSI port filters by RID (using the same matching as PE#) to
> > ensure that only allowed devices can write to it, and the PIC has a
> > matching PE# information to ensure that only allowed devices can trigger
> > the interrupt.
> >
> > As for the guest knowing what values to put in there (what port address
> > and interrupt source numbers to use), this is part of the paravirt APIs.
> >
> > So the paravirt APIs handles the configuration and the HW ensures that
> > the guest cannot do anything else than what it's allowed to.
>
> Okay, this is something that x86 doesn't have. Strange that it can
> filter DMA at a fine granularity but not MSI, which is practically the
> same thing.
I wouldn't be surprised if it's actually a quite different path in HW.
There's some magic decoding based on top bits usually that decides it's
an MSI and it goes completely elsewhere from there in the bridge.
> > > Does the BAR value contain the segment base address? Or is that added
> > > later?
> >
> > It's a shared address space. With a basic configuration on p7ioc for
> > example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> > contain the normal PCI address there. But that 1G is divided in 128
> > segments of equal size which can separately be assigned to PE#'s.
> >
> > So BARs are allocated by firmware or the kernel PCI code so that devices
> > in different PEs don't share segments.
>
> Okay, and config space virtualization ensures that the guest can't remap?
Well, so it depends :-)
With KVM we currently use whatever config space virtualization you do
and so we somewhat rely on this but it's not very foolproof.
I believe pHyp doesn't even bother filtering config space. As I said in
another note, you can't trust adapters anyway. Plenty of them (video
cards come to mind) have ways to get to their own config space via MMIO
registers for example.
So what pHyp does is that it always creates PEs (aka groups) that are
below a bridge. With PCIe, everything mostly is below a bridge so that's
easy, but that does mean that you always have all functions of a device
in the same PE (and thus in the same partition). SR-IOV is an exception
to this rule since in that case the HW is designed to be trusted.
That way, being behind a bridge, the bridge windows are going to define
what can be forwarded to the device, and thus the system is immune to
the guest putting crap into the BARs. It can't be remapped to overlap a
neighbouring device.
Note that the bridge itself isn't visible to the guest, so yes, config
space is -somewhat- virtualized, typically pHyp makes every pass-through
PE look like a separate PCI host bridge with the devices below it.
Cheers,
Ben.
* Re: [PATCH 1/3] KVM: PPC: Assemble book3s{, _hv}_rmhandlers.S separately
From: Alexander Graf @ 2011-08-02 12:20 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev, kvm-ppc
In-Reply-To: <20110723074111.GA17927@bloggs.ozlabs.ibm.com>
On 07/23/2011 09:41 AM, Paul Mackerras wrote:
> This makes arch/powerpc/kvm/book3s_rmhandlers.S and
> arch/powerpc/kvm/book3s_hv_rmhandlers.S be assembled as
> separate compilation units rather than having them #included in
> arch/powerpc/kernel/exceptions-64s.S. We no longer have any
> conditional branches between the exception prologs in
> exceptions-64s.S and the KVM handlers, so there is no need to
> keep their contents close together in the vmlinux image.
>
> In their current location, they are using up part of the limited
> space between the first-level interrupt handlers and the firmware
> NMI data area at offset 0x7000, and with some kernel configurations
> this area will overflow (e.g. allyesconfig), leading to an
> "attempt to .org backwards" error when compiling exceptions-64s.S.
>
> Moving them out requires that we add some #includes that the
> book3s_{,hv_}rmhandlers.S code was previously getting implicitly
> via exceptions-64s.S.
So what if your kernel binary is bigger than the 24 bits we can jump and
the KVM code happens to be at the end? Or do we just not care here?
Alex
* Re: GPIO IRQ on P1022
From: Felix Radensky @ 2011-08-02 10:47 UTC (permalink / raw)
To: Tabi Timur-B04825; +Cc: linuxppc-dev@ozlabs.org
In-Reply-To: <4E355FB7.3030904@freescale.com>
Hi,
On 07/31/2011 04:59 PM, Tabi Timur-B04825 wrote:
> Felix Radensky wrote:
>> What happens when I load my driver is single execution of interrupt
>> handler followed by system freeze. Even if I call disable_irq() in
>> interrupt handler the system still freezes.
> I don't know anything about the GPIO layer, but I think you're going to
> need to debug this a little more. Where exactly is the freeze? Are you
> sure the interrupt handler is being called only once? Perhaps you're not
> clearing the interrupt status and your handler is being called repeatedly?
>
I'm trying to debug this problem, without much luck so far.
I've enabled hard and soft lock-up detection and various
locking debugging options in kernel configuration but nothing
shows up. I've also tried KGDB over serial line, and was able
to hit the breakpoint in mpc8xxx_gpio code, but was unable
to step through the code, gdb just freezes.
I'm getting the following after system locks up:
nfs: server 10.0.0.10 not responding, still trying
NETDEV WATCHDOG: eth0 (fsl-gianfar): transmit queue 0 timed out
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:255
Modules linked in: gsat hal
NIP: c01a5e9c LR: c01a5e9c CTR: c01576bc
REGS: dfff1e90 TRAP: 0700 Not tainted (3.0.0)
MSR: 00029000 <EE,ME,CE> CR: 24044082 XER: 20000000
TASK = c02df3b8[0] 'swapper' THREAD: c02ee000
GPR00: c01a5e9c dfff1f40 c02df3b8 00000046 00003d04 ffffffff c01546e4 00003d04
GPR08: c02e0000 c02ed710 00003d04 00000002 84044042 7dcac9f0 00000000 00000000
GPR16: 1ff8c184 1ffa95e0 c0284070 c02f9dc0 c02fabec c02fa9ec c02fa7ec c02fa5ec
GPR24: 00200200 dfff1f68 c02e0000 dfff0000 c034c138 c02e0000 df011000 00000000
NIP [c01a5e9c] dev_watchdog+0x298/0x2a8
LR [c01a5e9c] dev_watchdog+0x298/0x2a8
Call Trace:
[dfff1f40] [c01a5e9c] dev_watchdog+0x298/0x2a8 (unreliable)
[dfff1f60] [c00434e8] run_timer_softirq+0x1a4/0x1e0
[dfff1fb0] [c003cf20] __do_softirq+0x9c/0x114
[dfff1ff0] [c000c8a0] call_do_softirq+0x14/0x24
[c02efe90] [c00049ec] do_softirq+0x74/0x80
[c02efeb0] [c003d174] irq_exit+0x98/0x9c
[c02efec0] [c0009944] timer_interrupt+0xb4/0x118
[c02efed0] [c000e018] ret_from_except+0x0/0x18
--- Exception: 901 at cpu_idle+0x98/0xd8
LR = cpu_idle+0x98/0xd8
[c02eff90] [c0008238] cpu_idle+0x50/0xd8 (unreliable)
[c02effb0] [c00022e4] rest_init+0x64/0x78
[c02effc0] [c02bb86c] start_kernel+0x244/0x2c0
[c02efff0] [c00003a0] skpinv+0x2b8/0x2f4
Instruction dump:
38000001 7c0903a6 4bfffe40 7fc3f378 4bfe7045 7fe6fb78 7c651b78 3c60c02a
7fc4f378 38639ac4 4cc63182 4be921dd <0fe00000> 38000001 981c0001 4bffff9c
---[ end trace 5d45e0fe33774f9c ]---
What else can be done to find the problem?
Thanks a lot in advance.
Felix.
* Re: kvm PCI assignment & VFIO ramblings
From: David Gibson @ 2011-08-02 8:28 UTC (permalink / raw)
To: Alex Williamson
Cc: aafabbri, Alexey Kardashevskiy, kvm, Paul Mackerras, qemu-devel,
chrisw, iommu, Anthony Liguori, linux-pci@vger.kernel.org,
linuxppc-dev, benve
In-Reply-To: <1312050011.2265.185.camel@x201.home>
On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote:
> On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
[snip]
> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt. But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts. In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges. However I disagree that we
> need to assign all the devices behind such a bridge to the guest.
> There's a difference between removing the device from the host and
> exposing the device to the guest.
I think you're arguing only over details of what words to use for
what, rather than anything of substance here. The point is that an
entire partitionable group must be assigned to "host" (in which case
kernel drivers may bind to it) or to a particular guest partition (or
at least to a single UID on the host). Which of the assigned devices
the partition actually uses is another matter of course, as is at
exactly which level they become "de-exposed" if you don't want to use
all of them.
[snip]
> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
>
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu. Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself? More on this at the end.
Again, I don't think you're making a distinction of any substance.
Ben is saying the group as a whole must be set to allow partition
access, whether or not you call that "assigning". There's no reason
that passing a sysfs descriptor to qemu couldn't be the qemu
developer's quick-and-dirty method of putting the devices in, while
also allowing full assignment of the devices within the groups by
libvirt.
[snip]
> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
>
> This is a result of wanting to support *unmodified* x86 guests. We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to. The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware. I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.
No-one's suggesting that this isn't a valid mode of operation. It's
just that right now conditionally disabling it for us is fairly ugly
because of the way the qemu code is structured.
[snip]
> > The above means we need arch specific APIs. So arch specific vfio
> > ioctl's, either that or kvm ones going to vfio or something ... the
> > current structure of vfio/kvm interaction doesn't make it easy.
>
> FYI, we also have large page support for x86 VT-d, but it seems to only
> be opportunistic right now. I'll try to come back to the rest of this
> below.
Incidentally there seems to be a hugepage leak bug in the current
kernel code (which I haven't had a chance to track down yet). Our
qemu code currently has bugs (working on it..) which means it has
unbalanced maps and unmaps of the pages. When qemu quits they
should all be released, but somehow they're not.
[snip]
> > - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
>
> The use of PCI sysfs is actually one of my complaints about current
> device assignment. To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device... We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file. We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier. More below...
Hrm. I was assuming that a sysfs groups interface would provide a
single place to set the ownership of the whole group. Whether that's
echoing a uid to a magic file or doing a chown on the directory or
whatever is a matter of details.
[snip]
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists. Please prove me wrong. The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically). If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host. The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accesses b) what
> constitutes an interrupt service is device specific.
>
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept). That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI. And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI. I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures. Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs. This is the only reason that I make QEMU VFIO only build
> for x86.
There will certainly need to be some arch hooks here, but it can be
made less intrusively x86 specific without too much difficulty.
e.g. Create an EOI notifier chain in qemu - the master PICs (APIC for
x86, XICS for pSeries) for all vfio capable machines need to kick it,
and vfio subscribes.
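The notifier-chain idea above can be sketched roughly as follows. This is a minimal illustration only, not qemu code: `EoiNotifierChain` and `AssignedIntx` are hypothetical names, and the model assumes the interrupt controller emulation calls `notify()` whenever the guest signals end-of-interrupt.

```python
from typing import Callable, List

EoiHandler = Callable[[int], None]


class EoiNotifierChain:
    """Hypothetical chain that a PIC model kicks on every guest EOI."""

    def __init__(self) -> None:
        self._handlers: List[EoiHandler] = []

    def subscribe(self, handler: EoiHandler) -> None:
        self._handlers.append(handler)

    def notify(self, irq: int) -> None:
        # Called by the PIC model (APIC, XICS, ...) when the guest signals EOI.
        for handler in list(self._handlers):
            handler(irq)


class AssignedIntx:
    """Stand-in for an assigned device whose INTx is masked after it fires."""

    def __init__(self, irq: int) -> None:
        self.irq = irq
        self.masked = False

    def fire(self) -> None:
        # Mask at once so an unresponsive guest cannot cause an interrupt storm.
        self.masked = True

    def on_eoi(self, irq: int) -> None:
        # Re-enable only when the EOI is for this device's interrupt line.
        if irq == self.irq:
            self.masked = False
```

The arch-specific part shrinks to each PIC model calling `notify()`; the subscriber does not need to know which controller raised the EOI.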
[snip]
> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it. For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them. Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it. Once we have that, we could probably make uiommu attach to
> each of those nodes.
Well, that would address our chief concern that inherently tying the
lifetime of a domain to an fd is problematic. In fact, I don't really
see how this differs from the groups proposal except in the details of
how you inform qemu of the group^H^H^H^H^Hiommu domain.
[snip]
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play). Forcing the
> DMA mappings through VFIO naturally forces the call order. If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.
Ah, that's why you have the map and unmap on the vfio fd,
necessitating the ugly "pick the first vfio fd from the list" thing.
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
^ permalink raw reply
* Re: [PATCHv4 01/11] atomic: add *_dec_not_zero
From: Ralf Baechle @ 2011-08-02 9:58 UTC (permalink / raw)
To: Sven Eckelmann
Cc: linux-m32r-ja, linux-mips, linux-ia64, linux-doc, H. Peter Anvin,
Heiko Carstens, Randy Dunlap, Paul Mackerras, Helge Deller,
sparclinux, linux-arch, linux-s390, Russell King,
user-mode-linux-devel, Richard Weinberger, Hirokazu Takata, x86,
James E.J. Bottomley, Ingo Molnar, Matt Turner, Fenghua Yu,
Arnd Bergmann, Jeff Dike, Chris Metcalf, linux-m32r,
Ivan Kokshaysky, Thomas Gleixner, linux-arm-kernel,
Richard Henderson, Tony Luck, linux-parisc, linux-kernel,
Kyle McMartin, linux-alpha, Martin Schwidefsky, linux390,
linuxppc-dev, David S. Miller
In-Reply-To: <1311760070-21532-1-git-send-email-sven@narfation.org>
On Wed, Jul 27, 2011 at 11:47:40AM +0200, Sven Eckelmann wrote:
For MIPS:
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Ralf
^ permalink raw reply
* Re: kvm PCI assignment & VFIO ramblings
From: Avi Kivity @ 2011-08-02 9:12 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Alexey Kardashevskiy, kvm, Paul Mackerras,
linux-pci@vger.kernel.org, David Gibson, Alex Williamson,
Anthony Liguori, linuxppc-dev
In-Reply-To: <1312248479.8793.827.camel@pasglop>
On 08/02/2011 04:27 AM, Benjamin Herrenschmidt wrote:
> >
> > I have a feeling you'll be getting the same capabilities sooner or
> > later, or you won't be able to make use of S/R IOV VFs.
>
> I'm not sure why you mean. We can do SR/IOV just fine (well, with some
> limitations due to constraints with how our MMIO segmenting works and
> indeed some of those are being lifted in our future chipsets but
> overall, it works).
Don't those limitations include "all VFs must be assigned to the same
guest"?
PCI on x86 has function granularity, SRIOV reduces this to VF
granularity, but I thought power has partition or group granularity
which is much coarser?
> In -theory-, one could do the grouping dynamically with some kind of API
> for us as well. However the constraints are such that it's not
> practical. Filtering on RID is based on number of bits to match in the
> bus number and whether to match the dev and fn. So it's not arbitrary
> (but works fine for SR-IOV).
>
> The MMIO segmentation is a bit special too. There is a single MMIO
> region in 32-bit space (size is configurable but that's not very
> practical so for now we stick it to 1G) which is evenly divided into N
> segments (where N is the number of PE# supported by the host bridge,
> typically 128 with the current bridges).
>
> Each segment goes through a remapping table to select the actual PE# (so
> large BARs use consecutive segments mapped to the same PE#).
>
> For SR-IOV we plan to not use the M32 region. We also have 64-bit MMIO
> regions which act as some kind of "accordions", they are evenly divided
> into segments in different PE# and there's several of them which we can
> "move around" and typically use to map VF BARs.
So, SRIOV VFs *don't* have the group limitation? Sorry, I'm deluged by
technical details with no ppc background to put them to, I can't say I'm
making any sense of this.
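For readers without ppc background, the RID filtering described in the quoted text (match some number of bus-number bits, optionally match dev and fn) might look like the sketch below. The function names and the exact flag layout are assumptions made for illustration; the real bridge registers are not described in this thread.

```python
def make_rid(bus: int, dev: int, fn: int) -> int:
    """Pack bus/device/function into a 16-bit PCI requester ID."""
    return ((bus & 0xFF) << 8) | ((dev & 0x1F) << 3) | (fn & 0x7)


def rid_matches(rid: int, pe_rid: int, bus_bits: int,
                match_dev: bool, match_fn: bool) -> bool:
    """Compare rid with pe_rid on the top bus_bits bus bits, plus dev/fn if asked."""
    if not 0 <= bus_bits <= 8:
        raise ValueError("bus_bits must be between 0 and 8")
    # Keep only the most significant bus_bits bits of the 8-bit bus number.
    bus_mask = (0xFF << (8 - bus_bits)) & 0xFF
    mask = bus_mask << 8
    if match_dev:
        mask |= 0x1F << 3
    if match_fn:
        mask |= 0x7
    return (rid & mask) == (pe_rid & mask)
```

With `bus_bits=8` and both flags set the filter selects a single function, which is why the scheme is not arbitrary but still fits SR-IOV VFs.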
> > >
> > > VFIO here is basically designed for one and only one thing: expose the
> > > entire guest physical address space to the device more/less 1:1.
> >
> > A single level iommu cannot be exposed to guests. Well, it can be
> > exposed as an iommu that does not provide per-device mapping.
>
> Well, x86 ones can't maybe but on POWER we can and must thanks to our
> essentially paravirt model :-) Even if it' wasn't and we used trapping
> of accesses to the table, it would work because in practice, even with
> filtering, what we end up having is a per-device (or rather per-PE#
> table).
>
> > A two level iommu can be emulated and exposed to the guest. See
> > http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.
>
> What you mean 2-level is two passes through two trees (ie 6 or 8 levels
> right ?).
(16 or 25)
> We don't have that and probably never will. But again, because
> we have a paravirt interface to the iommu, it's less of an issue.
Well, then, I guess we need an additional interface to expose that to
the guest.
> > > This means:
> > >
> > > - It only works with iommu's that provide complete DMA address spaces
> > > to devices. Won't work with a single 'segmented' address space like we
> > > have on POWER.
> > >
> > > - It requires the guest to be pinned. Pass-through -> no more swap
> >
> > Newer iommus (and devices, unfortunately) (will) support I/O page faults
> > and then the requirement can be removed.
>
> No. -Some- newer devices will. Out of these, a bunch will have so many
> bugs in it it's not usable. Some never will. It's a mess really and I
> wouldn't design my stuff based on those premises just yet. Making it
> possible to support it for sure, having it in mind, but not making it
> the fundation on which the whole API is designed.
The API is not designed around pinning. It's a side effect of how the
IOMMU works. If your IOMMU only maps pages which are under active DMA,
then it would only pin those pages.
But I see what you mean, the API is designed around up-front
specification of all guest memory.
> > > - It doesn't work for POWER server anyways because of our need to
> > > provide a paravirt iommu interface to the guest since that's how pHyp
> > > works today and how existing OSes expect to operate.
> >
> > Then you need to provide that same interface, and implement it using the
> > real iommu.
>
> Yes. Working on it. It's not very practical due to how VFIO interacts in
> terms of APIs but solvable. Eventually, we'll make the iommu Hcalls
> almost entirely real-mode for performance reasons.
The original kvm device assignment code was (and is) part of kvm
itself. We're trying to move to vfio to allow sharing with non-kvm
users, but it does reduce flexibility. We can have an internal vfio-kvm
interface to update mappings in real time.
> > > - Performance sucks of course, the vfio map ioctl wasn't mean for that
> > > and has quite a bit of overhead. However we'll want to do the paravirt
> > > call directly in the kernel eventually ...
> >
> > Does the guest iomap each request? Why?
>
> Not sure what you mean... the guest calls h-calls for every iommu page
> mapping/unmapping, yes. So the performance of these is critical. So yes,
> we'll eventually do it in kernel. We just haven't yet.
I see. x86 traditionally doesn't do it for every request. We had some
proposals to do a pviommu that does map every request, but none reached
maturity.
> >
> > So, you have interrupt redirection? That is, MSI-x table values encode
> > the vcpu, not pcpu?
>
> Not exactly. The MSI-X address is a real PCI address to an MSI port and
> the value is a real interrupt number in the PIC.
>
> However, the MSI port filters by RID (using the same matching as PE#) to
> ensure that only allowed devices can write to it, and the PIC has a
> matching PE# information to ensure that only allowed devices can trigger
> the interrupt.
>
> As for the guest knowing what values to put in there (what port address
> and interrupt source numbers to use), this is part of the paravirt APIs.
>
> So the paravirt APIs handles the configuration and the HW ensures that
> the guest cannot do anything else than what it's allowed to.
Okay, this is something that x86 doesn't have. Strange that it can
filter DMA at a fine granularity but not MSI, which is practically the
same thing.
> >
> > Does the BAR value contain the segment base address? Or is that added
> > later?
>
> It's a shared address space. With a basic configuration on p7ioc for
> example we have MMIO going from 3G to 4G (PCI side addresses). BARs
> contain the normal PCI address there. But that 1G is divided in 128
> segments of equal size which can separately be assigned to PE#'s.
>
> So BARs are allocated by firmware or the kernel PCI code so that devices
> in different PEs don't share segments.
Okay, and config space virtualization ensures that the guest can't remap?
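The segment arithmetic in the quoted description (a 1G window from 3G to 4G, split into 128 equal segments, each owned by one PE#) can be worked through as below. The constants come from the quoted text; `segment_index` and `assign_bar` are hypothetical helpers, and the remap table is modelled as a plain list rather than any real hardware structure.

```python
M32_BASE = 0xC000_0000      # 3G, PCI-side start of the 32-bit MMIO window
M32_SIZE = 0x4000_0000      # 1G window
NUM_SEGMENTS = 128          # number of PE# supported by the bridge
SEGMENT_SIZE = M32_SIZE // NUM_SEGMENTS   # 8 MiB per segment


def segment_index(pci_addr: int) -> int:
    """Return the segment number that a PCI-side MMIO address falls into."""
    offset = pci_addr - M32_BASE
    if not 0 <= offset < M32_SIZE:
        raise ValueError("address is outside the M32 window")
    return offset // SEGMENT_SIZE


def assign_bar(remap: list, bar_base: int, bar_size: int, pe: int) -> range:
    """Give every segment touched by a BAR to one PE, refusing to share segments."""
    first = segment_index(bar_base)
    last = segment_index(bar_base + bar_size - 1)
    segments = range(first, last + 1)
    for seg in segments:
        if remap[seg] not in (None, pe):
            raise ValueError(f"segment {seg} already belongs to PE {remap[seg]}")
    for seg in segments:
        remap[seg] = pe
    return segments
```

A 16 MiB BAR therefore consumes two consecutive segments mapped to the same PE#, and a BAR from another PE cannot be placed inside either of them.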
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* Re: Fwd: MPC7410 Linux Kernel
From: Vineeth @ 2011-08-02 9:03 UTC (permalink / raw)
To: tiejun.chen; +Cc: linuxppc-dev
In-Reply-To: <4E37AB04.9070409@windriver.com>
[-- Attachment #1: Type: text/plain, Size: 1936 bytes --]
Thanks for the reply.
We were referring to kuroboxHG.dts, which uses the Sandpoint architecture and
is almost the same as ours.
1. One doubt about kuroboxHG: the ranges property in the SOC node says the EUMB
is at 0xFC00_0000, and per the MPC107 datasheet the open pic address should
be EUMB_BASE + 0x40000, but in kurobox it is given as 0x80040000.
2. We know that our UART is mapped at address 0xDB00_0100; it is
connected behind a PCI-LOCAL bridge whose base is at 0xDB00_0000.
How can I represent these things in the dts? Can the ranges property of the
PCI node describe this?
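The address arithmetic behind both questions can be checked with a short sketch. The 0x40000 offset and the base addresses are taken from this message, and the soc10x `ranges` triples are copied from the obc7410.dts posted in this thread; `openpic_address` and `translate` are hypothetical helper names, and `translate` only models the simple one-cell-address case.

```python
OPENPIC_OFFSET = 0x40000  # open pic offset from the EUMB base, per the message

# (child_base, parent_base, size) triples from the soc10x node's ranges property
SOC_RANGES = [
    (0x80000000, 0x80000000, 0x70000000),  # pci mem space
    (0xFC000000, 0xFC000000, 0x00100000),  # EUMB
    (0xFE000000, 0xFE000000, 0x00C00000),  # pci i/o space
    (0xFEC00000, 0xFEC00000, 0x00300000),  # pci cfg regs
    (0xFEF00000, 0xFEF00000, 0x00100000),  # pci iack
]


def openpic_address(eumb_base: int) -> int:
    """Open pic register block address for a given EUMB base."""
    return eumb_base + OPENPIC_OFFSET


def translate(addr: int, ranges):
    """Map a child-bus address to the parent bus, or None if no window covers it."""
    for child_base, parent_base, size in ranges:
        if child_base <= addr < child_base + size:
            return parent_base + (addr - child_base)
    return None
```

An EUMB base of 0xFC000000 gives 0xFC040000, while 0x80040000 corresponds to a base of 0x80000000, which is the value in the soc node's `reg` in the posted dts. The UART address 0xDB000100 lies inside the identity-mapped "pci mem space" window, so it is at least reachable through that range.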
On Tue, Aug 2, 2011 at 1:15 PM, tiejun.chen <tiejun.chen@windriver.com> wrote:
> Vineeth wrote:
> > Hi,
> >
> > We are trying to port linux 2.6.38 on MPC7410 based board (This is a
> > preparatory design by our customer)
> >
> > System architecture is as follows,
> >
> > MPC7410 <=> MPC107 <=> PCI_to_LOCAL(plx9052) <=> UART
>
> MPCXXX should be compatible with TSIXXX. So you can refer to mpc7448_hpc2.
>
> >
> > Previously we were using ppc architecture and we had some issues with
> > page_init() functions; which may be because of our configuration.As we didnt
> > get much support on ppc architecture we moved to powerpc.
> >
> > Now we moved to powerpc architecture. We have some doubts on writing the dts
> > file. Please find the dts file attached.
> >
> > when we checked the legacy_serial.c file, we found that
> > legacy_serial_parents not expecting a pci-local or a pci bridge as parent.
> > is our understanding correct ? should we introduce a new pci parent in that
> > structure ?
>
> So you can understand this after refer to the file,
> arch/powerpc/boot/dts/mpc7448hpc2.dts.
>
> Tiejun
>
> >
> > We are confused about writing the ranges property of PCI node.we were
> > referring booting_without_of doc but didnt get much info. Is there any file
> > which gives better idea about the ranges property ?
> > Thanks
> > Vineeth
>
>
[-- Attachment #2: Type: text/html, Size: 2551 bytes --]
^ permalink raw reply
* Re: kvm PCI assignment & VFIO ramblings
From: Avi Kivity @ 2011-08-02 8:32 UTC (permalink / raw)
To: Alex Williamson
Cc: Alexey Kardashevskiy, kvm, Paul Mackerras, David Gibson,
Anthony Liguori, linux-pci@vger.kernel.org, linuxppc-dev
In-Reply-To: <1312230476.2653.395.camel@bling.home>
On 08/01/2011 11:27 PM, Alex Williamson wrote:
> On Sun, 2011-07-31 at 17:09 +0300, Avi Kivity wrote:
> > On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> > > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > > for example. At all. If the guest configures crap into it, too bad, it
> > > can only shoot itself in the foot since the host bridge enforce
> > > validation anyways as I explained earlier. Because it's all paravirt, we
> > > don't need to "translate" the interrupt vectors& addresses, the guest
> > > will call hyercalls to configure things anyways.
> >
> > So, you have interrupt redirection? That is, MSI-x table values encode
> > the vcpu, not pcpu?
> >
> > Alex, with interrupt redirection, we can skip this as well? Perhaps
> > only if the guest enables interrupt redirection?
>
> It's not clear to me how we could skip it. With VT-d, we'd have to
> implement an emulated interrupt remapper and hope that the guest picks
> unused indexes in the host interrupt remapping table before it could do
> anything useful with direct access to the MSI-X table.
Yeah. We need the interrupt remapping hardware to indirect based on the
source of the message, not just the address and data.
> Maybe AMD IOMMU
> makes this easier? Thanks,
>
No idea.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* Re: Fwd: MPC7410 Linux Kernel
From: tiejun.chen @ 2011-08-02 7:45 UTC (permalink / raw)
To: Vineeth; +Cc: linuxppc-dev
In-Reply-To: <CAFbQSaC6BkZj+a4=-9Q2Cyj7w=4y5QYM22kaK1O=Uz5KW9iiuQ@mail.gmail.com>
Vineeth wrote:
> Hi,
>
> We are trying to port linux 2.6.38 on MPC7410 based board (This is a
> preparatory design by our customer)
>
> System architecture is as follows,
>
> MPC7410 <=> MPC107 <=> PCI_to_LOCAL(plx9052) <=> UART
MPCXXX should be compatible with TSIXXX. So you can refer to mpc7448_hpc2.
>
> Previously we were using ppc architecture and we had some issues with
> page_init() functions; which may be because of our configuration.As we didnt
> get much support on ppc architecture we moved to powerpc.
>
> Now we moved to powerpc architecture. We have some doubts on writing the dts
> file. Please find the dts file attached.
>
> when we checked the legacy_serial.c file, we found that
> legacy_serial_parents not expecting a pci-local or a pci bridge as parent.
> is our understanding correct ? should we introduce a new pci parent in that
> structure ?
So you can understand this after referring to the file
arch/powerpc/boot/dts/mpc7448hpc2.dts.
Tiejun
>
> We are confused about writing the ranges property of PCI node.we were
> referring booting_without_of doc but didnt get much info. Is there any file
> which gives better idea about the ranges property ?
>
> Thanks
> Vineeth
^ permalink raw reply
* Fwd: MPC7410 Linux Kernel
From: Vineeth @ 2011-08-02 6:49 UTC (permalink / raw)
To: linuxppc-dev
In-Reply-To: <CAFbQSaAhXUrb1peJ2E3Kkp1gd16JSn9OjneO826q_aMz+yzAqg@mail.gmail.com>
[-- Attachment #1.1: Type: text/plain, Size: 964 bytes --]
Hi,
We are trying to port Linux 2.6.38 to an MPC7410-based board (this is a
preparatory design by our customer).
The system architecture is as follows:
MPC7410 <=> MPC107 <=> PCI_to_LOCAL(plx9052) <=> UART
Previously we were using the ppc architecture and had some issues with the
page_init() functions, which may be because of our configuration. As we didn't
get much support on the ppc architecture, we moved to powerpc.
Now that we have moved to the powerpc architecture, we have some doubts about
writing the dts file. Please find the dts file attached.
When we checked the legacy_serial.c file, we found that
legacy_serial_parents does not expect a pci-local or a pci bridge as parent.
Is our understanding correct? Should we introduce a new pci parent in that
structure?
We are confused about writing the ranges property of the PCI node. We were
referring to the booting_without_of doc but didn't get much info. Is there any
file that gives a better idea of the ranges property?
Thanks
Vineeth
[-- Attachment #1.2: Type: text/html, Size: 3089 bytes --]
[-- Attachment #2: obc7410.dts --]
[-- Type: application/octet-stream, Size: 2999 bytes --]
/*
 * Device Tree Source for OBC7410-CoreEl
 *
 * Choose CONFIG_MPC7410 to build a kernel for OBC7410, or use
 * the default configuration mpc7410_defconfig.
 *
 * Based on kuroboxHG.dts
 *
 * 2011 (c) Vineeth/Sumesh
 * Copyright 2011 CoreEl technologies
 *
 * This file is licensed under
 * the terms of the GNU General Public License version 2. This program
 * is licensed "as is" without any warranty of any kind, whether express
 * or implied.
 */
/dts-v1/;
/ {
	model = "obc7410";
	compatible = "mpc7410";
	#address-cells = <1>;
	#size-cells = <1>;
	aliases {
		serial0 = &serial0;
		pci0 = &pci0;
	};
	cpus {
		#address-cells = <1>;
		#size-cells = <0>;
		PowerPC,603e { /* Really 7410 */
			device_type = "cpu";
			reg = <0x0>;
			clock-frequency = <266000000>; /* Fixed by bootloader */
			timebase-frequency = <32522240>; /* Fixed by bootloader */
			bus-frequency = <0>; /* Fixed by bootloader */
			/* Following required by dtc but not used */
			i-cache-size = <0x4000>;
			d-cache-size = <0x4000>;
		};
	};
	memory {
		device_type = "memory";
		reg = <0x0 0x4000000>;
	};
	soc10x { /* AFAICT i got it from KUROBOXHG :) */
		#address-cells = <1>;
		#size-cells = <1>;
		device_type = "soc";
		compatible = "mpc10x";
		store-gathering = <0>; /* 0 == off, !0 == on */
		reg = <0x80000000 0x100000>;
		ranges = <0x80000000 0x80000000 0x70000000 /* pci mem space */
			  0xfc000000 0xfc000000 0x100000 /* EUMB */
			  0xfe000000 0xfe000000 0xc00000 /* pci i/o space */
			  0xfec00000 0xfec00000 0x300000 /* pci cfg regs */
			  0xfef00000 0xfef00000 0x100000>; /* pci iack */
		mpic: interrupt-controller@80040000 {
			#interrupt-cells = <2>;
			#address-cells = <0>;
			device_type = "open-pic";
			compatible = "chrp,open-pic";
			interrupt-controller;
			reg = <0x80040000 0x40000>;
		};
		pci0: pci@fec00000 {
			#address-cells = <3>;
			#size-cells = <2>;
			#interrupt-cells = <1>;
			device_type = "pci";
			compatible = "mpc10x-pci";
			reg = <0xfec00000 0x400000>;
			ranges = <0x1000000 0x0 0x0 0xfe000000 0x0 0xc00000
				  0x2000000 0x0 0x80000000 0x80000000 0x0 0x70000000>;
			bus-range = <0 255>;
			clock-frequency = <133333333>;
			interrupt-parent = <&mpic>;
			interrupt-map-mask = <0xf800 0x0 0x0 0x7>;
			pci2local: pciLocal@0xdb000000 {
				#address-cells = <2>;
				#size-cells = <2>;
				#interrupt-cells = <1>;
				device-type = "pci-local";
				compatible = "simple-bus";
				reg = <0xdb000100 0x1000>;
				serial0: serial@0x100 {
					cell-index = <1>;
					device_type = "serial";
					compatible = "ns16550";
					reg = <0x100 0x1000>;
					clock-frequency = <130041000>;
					current-speed = <57600>;
					interrupts = <10 0>;
					interrupt-parent = <&mpic>;
				};
			};
		};
	};
};
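On the question of how to read the PCI node's `ranges` property: each entry in the attached file is three child (PCI) address cells, one parent address cell, and two size cells, with the address space encoded in bits 24-25 of the first cell. The sketch below decodes the two entries from the dts above under that layout; `decode_pci_ranges` is a hypothetical helper written for illustration, not part of dtc or the kernel.

```python
# Space codes from bits 24-25 of the first (phys.hi) cell of a PCI address
SPACE_NAMES = {0: "config", 1: "io", 2: "mem32", 3: "mem64"}

# Cells of the pci0 ranges property, copied from the attached dts
PCI0_RANGES = [
    0x1000000, 0x0, 0x0, 0xFE000000, 0x0, 0xC00000,
    0x2000000, 0x0, 0x80000000, 0x80000000, 0x0, 0x70000000,
]


def decode_pci_ranges(cells, parent_cells=1):
    """Split a flat cell list into entries of 3 child + parent_cells + 2 size cells."""
    entry_len = 3 + parent_cells + 2
    if len(cells) % entry_len != 0:
        raise ValueError("cell count is not a multiple of the entry length")
    entries = []
    for start in range(0, len(cells), entry_len):
        entry = cells[start:start + entry_len]
        phys_hi, pci_hi, pci_lo = entry[0:3]
        cpu_addr = 0
        for cell in entry[3:3 + parent_cells]:
            cpu_addr = (cpu_addr << 32) | cell
        size_hi, size_lo = entry[3 + parent_cells:]
        entries.append({
            "space": SPACE_NAMES[(phys_hi >> 24) & 0x3],
            "pci_addr": (pci_hi << 32) | pci_lo,
            "cpu_addr": cpu_addr,
            "size": (size_hi << 32) | size_lo,
        })
    return entries
```

Read this way, the first entry maps PCI I/O address 0 to CPU address 0xfe000000 (12 MiB) and the second maps 32-bit PCI memory at 0x80000000 one-to-one to the CPU.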
^ permalink raw reply
* Re: [PATCH v11 08/10] USB ppc4xx: Add Synopsys DWC OTG PCD interrupt function
From: Pratyush Anand @ 2011-08-02 4:46 UTC (permalink / raw)
To: tmarri; +Cc: Mark Miesfeld, greg, linux-usb, linuxppc-dev, Fushen Chen
In-Reply-To: <CABw=1MfB3_-LKhzmrZ5DHnQ_cX5s7c=0gLXPc7n+nAbo_vnd0g@mail.gmail.com>
I can see v13 as the latest revision at the following link:
http://patchwork.ozlabs.org/patch/89560/
Has there been any work after it?
Regards
Pratyush
On Tue, Aug 2, 2011 at 9:55 AM, Pratyush Anand <pratyush.linux@gmail.com> wrote:
> Is somebody working on these patches?
> If not, then is it possible to share last modification, so that I can
> start work from there.
> If this v11 was the last modification, and noone is working further,
> then just confirm it. I will start working from here.
>
> Regards
> Pratyush
>
^ permalink raw reply
* Re: [PATCH v11 08/10] USB ppc4xx: Add Synopsys DWC OTG PCD interrupt function
From: Pratyush Anand @ 2011-08-02 4:25 UTC (permalink / raw)
To: tmarri; +Cc: Mark Miesfeld, greg, linux-usb, linuxppc-dev, Fushen Chen
In-Reply-To: <1301684732-17603-1-git-send-email-tmarri@apm.com>
Is somebody working on these patches?
If not, is it possible to share the last modification, so that I can
start work from there?
If this v11 was the last modification, and no one is working further,
then please confirm it. I will start working from here.
Regards
Pratyush
^ permalink raw reply
* Re: [PATCH] drivers/misc: introduce Freescale Data Collection Manager driver
From: Mark Brown @ 2011-08-02 2:26 UTC (permalink / raw)
To: Tabi Timur-B04825
Cc: linuxppc-dev@ozlabs.org, linux-kernel@vger.kernel.org,
arnd@arndb.de
In-Reply-To: <4E375998.1020901@freescale.com>
On Tue, Aug 02, 2011 at 01:57:45AM +0000, Tabi Timur-B04825 wrote:
> Mark Brown wrote:
> > I'd expect that things like the _lowest, _highest and _average
> > attributes which a number of drivers have are what you're looking for.
> Yes, but then all I'm doing is presenting numbers that don't change to an
> interface, simply on the basis that the numbers represent sensor values.
> If I'm running a sensor application, I'm doing it to get real-time
> monitoring of the sensors in my system. The DCM on our boards is not
> capable of real-time results. So you're not actually "monitoring" the
> hardware. The data from the DCM is available only *after* you stop
> running the background process.
Right, that seems to fit reasonably well with things like averages and
extremes.
> > At the very least it seems obvious how you might extend the interface if
> > some features you need are missing. The subsystem has fairly extensive
> > documentation in Documentation/hwmon.
> I just don't see how it fits. Yes, I could do it, but then I'd end up
> with something that doesn't make any sense. I would have to use a custom
> interface to start monitoring and then another interface to stop it. Then
The most obvious thing seems to be to use the existing _reset_history
stuff to trigger a restart. If the hardware is so incapable that it
can't cope with reads while active and needs to be reset to even pause
that seems pretty rubbish; I'd have expected you could at least pause
measurement momentarily to do a read.
Perhaps integration as a PMU may make more sense? The general point
here is that this doesn't sound like it's doing something so odd that it
shouldn't even be trying to work within any sort of framework.
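The semantics being suggested (report extremes and an average, restart the collection through a reset) can be modelled in a few lines. `SampleHistory` is a hypothetical class used only to illustrate the behaviour of the `_lowest`, `_highest`, `_average` and `_reset_history` attributes named above; it is not the hwmon API.

```python
class SampleHistory:
    """Tracks lowest, highest and average of samples since the last reset."""

    def __init__(self) -> None:
        self.reset_history()

    def reset_history(self) -> None:
        # Models a write to a _reset_history attribute: start a fresh collection.
        self._count = 0
        self._total = 0
        self.lowest = None
        self.highest = None

    def add(self, value: int) -> None:
        self._count += 1
        self._total += value
        self.lowest = value if self.lowest is None else min(self.lowest, value)
        self.highest = value if self.highest is None else max(self.highest, value)

    @property
    def average(self):
        # None until at least one sample has been collected.
        return None if self._count == 0 else self._total / self._count
```

Values that only become available after the collection stops still fit this model, since readers see summary statistics rather than live samples.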
^ permalink raw reply