LinuxPPC-Dev Archive on lore.kernel.org


* Re: kvm PCI assignment & VFIO ramblings
From: David Gibson @ 2011-08-30  1:29 UTC
  To: Aaron Fabbri
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	linux-pci@vger.kernel.org, Alexander Graf, qemu-devel,
	Chris Wright, iommu, Avi Kivity, Anthony Liguori, Roedel, Joerg,
	linuxppc-dev, benve@cisco.com
In-Reply-To: <CA7D4D51.FD84%aafabbri@cisco.com>

On Fri, Aug 26, 2011 at 01:17:05PM -0700, Aaron Fabbri wrote:
[snip]
> Yes.  In essence, I'd rather not have to run any other admin processes.
> Doing things programmatically, on the fly, from each process, is the
> cleanest model right now.

The "persistent group" model doesn't necessarily prevent that.
There's no reason your program can't use the administrative interface
as well as the "use" interface, and I don't see that making the admin
interface separate and persistent makes this any harder.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: linux-next: build failure in Linus' tree
From: James Bottomley @ 2011-08-29 22:50 UTC
  To: Stephen Rothwell
  Cc: J. Bruce Fields, linux-parisc, NeilBrown, Linus, linux-kernel,
	linux-next, linuxppc-dev
In-Reply-To: <20110830083218.1819a5d73c3a33e5053e8312@canb.auug.org.au>

On Tue, 2011-08-30 at 08:32 +1000, Stephen Rothwell wrote:
> Hi Linus,
> 
> On Mon, 29 Aug 2011 10:44:51 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> >
> > After merging the fixes tree, today's linux-next build (powerpc
> > ppc64_defconfig) failed like this:
> > 
> > arch/powerpc/kernel/built-in.o: In function `.sys_call_table':
> > (.text+0xbd00): undefined reference to `.sys_nfsservctl'
> > arch/powerpc/kernel/built-in.o: In function `.sys_call_table':
> > (.text+0xbd08): undefined reference to `.compat_sys_nfsservctl'
> > 
> > Caused by commit f5b940997397 ("All Arch: remove linkage for
> > sys_nfsservctl system call") which also missed parisc.
> > 
> > I will apply this patch for today:
> 
> Will you please apply this?  (repeated for ease of inclusion)
> 
> From: Stephen Rothwell <sfr@canb.auug.org.au>
> Date: Mon, 29 Aug 2011 10:38:57 +1000
> Subject: [PATCH] remove remaining references to nfsservctl
> 
> These were missed in commit f5b940997397 "All Arch: remove linkage
> for sys_nfsservctl system call" due to them having no sys_ prefix
> (presumably).
> 
> Cc: NeilBrown <neilb@suse.de>
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-parisc@vger.kernel.org
> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

Thanks for finding this ... definitely acked by me if necessary.

James


* Re: linux-next: build failure in Linus' tree
From: Stephen Rothwell @ 2011-08-29 22:32 UTC
  To: Linus
  Cc: J. Bruce Fields, linux-parisc, NeilBrown, linux-kernel,
	linux-next, linuxppc-dev
In-Reply-To: <20110829104451.1c777e24ff72823d1e399f12@canb.auug.org.au>

Hi Linus,

On Mon, 29 Aug 2011 10:44:51 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> After merging the fixes tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
> 
> arch/powerpc/kernel/built-in.o: In function `.sys_call_table':
> (.text+0xbd00): undefined reference to `.sys_nfsservctl'
> arch/powerpc/kernel/built-in.o: In function `.sys_call_table':
> (.text+0xbd08): undefined reference to `.compat_sys_nfsservctl'
> 
> Caused by commit f5b940997397 ("All Arch: remove linkage for
> sys_nfsservctl system call") which also missed parisc.
> 
> I will apply this patch for today:

Will you please apply this?  (repeated for ease of inclusion)

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 29 Aug 2011 10:38:57 +1000
Subject: [PATCH] remove remaining references to nfsservctl

These were missed in commit f5b940997397 "All Arch: remove linkage
for sys_nfsservctl system call" due to them having no sys_ prefix
(presumably).

Cc: NeilBrown <neilb@suse.de>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-parisc@vger.kernel.org
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 arch/parisc/kernel/syscall_table.S |    2 +-
 arch/powerpc/include/asm/systbl.h  |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/parisc/kernel/syscall_table.S b/arch/parisc/kernel/syscall_table.S
index e66366f..3735abd 100644
--- a/arch/parisc/kernel/syscall_table.S
+++ b/arch/parisc/kernel/syscall_table.S
@@ -259,7 +259,7 @@
 	ENTRY_SAME(ni_syscall)		/* query_module */
 	ENTRY_SAME(poll)
 	/* structs contain pointers and an in_addr... */
-	ENTRY_COMP(nfsservctl)
+	ENTRY_SAME(ni_syscall)		/* was nfsservctl */
 	ENTRY_SAME(setresgid)		/* 170 */
 	ENTRY_SAME(getresgid)
 	ENTRY_SAME(prctl)
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index f6736b7..fa0d27a 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -171,7 +171,7 @@ SYSCALL_SPU(setresuid)
 SYSCALL_SPU(getresuid)
 SYSCALL(ni_syscall)
 SYSCALL_SPU(poll)
-COMPAT_SYS(nfsservctl)
+SYSCALL(ni_syscall)
 SYSCALL_SPU(setresgid)
 SYSCALL_SPU(getresgid)
 COMPAT_SYS_SPU(prctl)
-- 
1.7.5.4

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/
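
Background for the ni_syscall substitution: a syscall number is an index
into the table, so a retired call has to keep its slot or every later
number would shift.  A minimal userspace sketch of that idea follows.  The
handler names, numbers and table contents are made up for illustration and
are not the real powerpc or parisc tables; only the placeholder idea comes
from the patch.

```c
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(void);

/* Stand-ins for real handlers; the names and return values are hypothetical. */
static long demo_poll(void)      { return 1; }
static long demo_setresgid(void) { return 2; }

/* Placeholder for a retired or unimplemented slot, modelled on ni_syscall. */
static long demo_ni_syscall(void) { return -ENOSYS; }

/* Index == syscall number, so a retired call keeps its slot. */
static const syscall_fn demo_table[] = {
	demo_poll,        /* 0 */
	demo_ni_syscall,  /* 1: was nfsservctl */
	demo_setresgid,   /* 2 */
};

/* Look up and call the handler for a syscall number. */
static long demo_dispatch(size_t nr)
{
	if (nr >= sizeof(demo_table) / sizeof(demo_table[0]))
		return -ENOSYS;
	return demo_table[nr]();
}
```

Deleting the middle entry instead would make demo_dispatch(1) call
demo_setresgid, which is the kind of renumbering the placeholder avoids.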


* [PATCH] powerpc/85xx: clean up FPGA device tree nodes for Freescale QorIQ boards
From: Timur Tabi @ 2011-08-29 19:09 UTC
  To: kumar.gala, linuxppc-dev, devicetree-discuss

Standardize and document the FPGA nodes used on Freescale QorIQ reference
boards.  There are three kinds of FPGAs used on the boards: pixis, qixis, and
cpld.  Although there are minor differences among the boards that have one
kind of FPGA, most of the functionality is the same, so it makes sense
to create common compatibility strings.

Signed-off-by: Timur Tabi <timur@freescale.com>
---

Changes for other Freescale boards will be made in future patches.
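
Why the generic string is listed second: a node's compatible property is
an ordered list, most specific entry first, and a driver can bind on any
entry in it.  A simplified sketch of that matching, assuming plain exact
string comparison (the kernel's own matching code is more involved, and the
demo_* names are hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A node's "compatible" property: strings ordered most specific first. */
struct demo_dt_node {
	const char *const *compatible;
	size_t n_compatible;
};

/* Return true if any entry in the node's list equals the wanted string. */
static bool demo_is_compatible(const struct demo_dt_node *node, const char *want)
{
	for (size_t i = 0; i < node->n_compatible; i++)
		if (strcmp(node->compatible[i], want) == 0)
			return true;
	return false;
}

/* The P1022DS board-control node as described by this patch. */
static const char *const p1022ds_compat[] = {
	"fsl,p1022ds-pixis", "fsl,fpga-pixis",
};
static const struct demo_dt_node p1022ds_board_control = { p1022ds_compat, 2 };
```

Code that only cares about common pixis behaviour can ask for
"fsl,fpga-pixis", while board-specific code can still ask for
"fsl,p1022ds-pixis".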

 .../devicetree/bindings/powerpc/fsl/board.txt      |   30 ++++++++++++--------
 arch/powerpc/boot/dts/p1010rdb.dts                 |   10 ++----
 arch/powerpc/boot/dts/p1020rdb.dts                 |    7 ++++-
 arch/powerpc/boot/dts/p1022ds.dts                  |    2 +-
 arch/powerpc/boot/dts/p2020ds.dts                  |    5 +++
 arch/powerpc/boot/dts/p2020rdb.dts                 |    4 ++
 arch/powerpc/boot/dts/p2040rdb.dts                 |    8 ++++-
 arch/powerpc/boot/dts/p3041ds.dts                  |    4 +-
 arch/powerpc/boot/dts/p4080ds.dts                  |    8 ++++-
 arch/powerpc/boot/dts/p5020ds.dts                  |    4 +-
 10 files changed, 55 insertions(+), 27 deletions(-)

diff --git a/Documentation/devicetree/bindings/powerpc/fsl/board.txt b/Documentation/devicetree/bindings/powerpc/fsl/board.txt
index 39e9415..ba46a7a 100644
--- a/Documentation/devicetree/bindings/powerpc/fsl/board.txt
+++ b/Documentation/devicetree/bindings/powerpc/fsl/board.txt
@@ -1,3 +1,8 @@
+Freescale Reference Board Bindings
+
+This document describes device tree bindings for various devices that
+exist on some Freescale reference boards.
+
 * Board Control and Status (BCSR)
 
 Required properties:
@@ -12,25 +17,26 @@ Example:
 		reg = <f8000000 8000>;
 	};
 
-* Freescale on board FPGA
+* Freescale on-board FPGA
 
 This is the memory-mapped registers for on board FPGA.
 
 Required properities:
-- compatible : should be "fsl,fpga-pixis".
-- reg : should contain the address and the length of the FPPGA register
-  set.
+- compatible: should be a board-specific string followed by a string
+  indicating the type of FPGA.  Example:
+	"fsl,<board>-pixis", "fsl,fpga-pixis"
+- reg: should contain the address and the length of the FPGA register set.
 - interrupt-parent: should specify phandle for the interrupt controller.
-- interrupts : should specify event (wakeup) IRQ.
+- interrupts: should specify event (wakeup) IRQ.
 
-Example (MPC8610HPCD):
+Example (P1022DS):
 
-	board-control@e8000000 {
-		compatible = "fsl,fpga-pixis";
-		reg = <0xe8000000 32>;
-		interrupt-parent = <&mpic>;
-		interrupts = <8 8>;
-	};
+	 board-control@3,0 {
+		 compatible = "fsl,p1022ds-pixis", "fsl,fpga-pixis";
+		 reg = <3 0 0x30>;
+		 interrupt-parent = <&mpic>;
+		 interrupts = <8 8 0 0>;
+	 };
 
 * Freescale BCSR GPIO banks
 
diff --git a/arch/powerpc/boot/dts/p1010rdb.dts b/arch/powerpc/boot/dts/p1010rdb.dts
index 6b33b73..7769e40 100644
--- a/arch/powerpc/boot/dts/p1010rdb.dts
+++ b/arch/powerpc/boot/dts/p1010rdb.dts
@@ -116,13 +116,9 @@
 			};
 		};
 
-		cpld@3,0 {
-			#address-cells = <1>;
-			#size-cells = <1>;
-			compatible = "fsl,p1010rdb-cpld";
-			reg = <0x3 0x0 0x0000020>;
-			bank-width = <1>;
-			device-width = <1>;
+		board-control@3,0 {
+			compatible = "fsl,p1010rdb-cpld", "fsl,fpga-cpld";
+			reg = <0x3 0x0 0x20>;
 		};
 	};
 
diff --git a/arch/powerpc/boot/dts/p1020rdb.dts b/arch/powerpc/boot/dts/p1020rdb.dts
index d6a8ae4..982d3ea 100644
--- a/arch/powerpc/boot/dts/p1020rdb.dts
+++ b/arch/powerpc/boot/dts/p1020rdb.dts
@@ -34,7 +34,8 @@
 		/* NOR, NAND Flashes and Vitesse 5 port L2 switch */
 		ranges = <0x0 0x0 0x0 0xef000000 0x01000000
 			  0x1 0x0 0x0 0xffa00000 0x00040000
-			  0x2 0x0 0x0 0xffb00000 0x00020000>;
+			  0x2 0x0 0x0 0xffb00000 0x00020000
+			  0x3 0x0 0x0 0xffdf0000 0x00008000>;
 
 		nor@0,0 {
 			#address-cells = <1>;
@@ -138,6 +139,10 @@
 			reg = <0x2 0x0 0x20000>;
 		};
 
+		board-control@3,0 {
+			compatible = "fsl,p1020rdb-cpld", "fsl,fpga-cpld";
+			reg = <0x3 0x0 0x20>;
+		};
 	};
 
 	soc@ffe00000 {
diff --git a/arch/powerpc/boot/dts/p1022ds.dts b/arch/powerpc/boot/dts/p1022ds.dts
index 1be9743..97a0b87 100644
--- a/arch/powerpc/boot/dts/p1022ds.dts
+++ b/arch/powerpc/boot/dts/p1022ds.dts
@@ -150,7 +150,7 @@
 		};
 
 		board-control@3,0 {
-			compatible = "fsl,p1022ds-pixis";
+			compatible = "fsl,p1022ds-pixis", "fsl,fpga-pixis";
 			reg = <3 0 0x30>;
 			interrupt-parent = <&mpic>;
 			/*
diff --git a/arch/powerpc/boot/dts/p2020ds.dts b/arch/powerpc/boot/dts/p2020ds.dts
index dae4031..d1e52f3 100644
--- a/arch/powerpc/boot/dts/p2020ds.dts
+++ b/arch/powerpc/boot/dts/p2020ds.dts
@@ -118,6 +118,11 @@
 			};
 		};
 
+		board-control@3,0 {
+			compatible = "fsl,p2020ds-pixis", "fsl,fpga-pixis";
+			reg = <0x3 0x0 0x30>;
+		};
+
 		nand@4,0 {
 			compatible = "fsl,elbc-fcm-nand";
 			reg = <0x4 0x0 0x40000>;
diff --git a/arch/powerpc/boot/dts/p2020rdb.dts b/arch/powerpc/boot/dts/p2020rdb.dts
index 1d7a05f..1bf9b8c 100644
--- a/arch/powerpc/boot/dts/p2020rdb.dts
+++ b/arch/powerpc/boot/dts/p2020rdb.dts
@@ -138,6 +138,10 @@
 			reg = <0x2 0x0 0x20000>;
 		};
 
+		board-control@3,0 {
+			compatible = "fsl,p2020rdb-cpld", "fsl,fpga-cpld";
+			reg = <0x3 0x0 0x20>;
+		};
 	};
 
 	soc@ffe00000 {
diff --git a/arch/powerpc/boot/dts/p2040rdb.dts b/arch/powerpc/boot/dts/p2040rdb.dts
index 7d84e39..1c72d65 100644
--- a/arch/powerpc/boot/dts/p2040rdb.dts
+++ b/arch/powerpc/boot/dts/p2040rdb.dts
@@ -109,7 +109,8 @@
 
 	localbus@ffe124000 {
 		reg = <0xf 0xfe124000 0 0x1000>;
-		ranges = <0 0 0xf 0xe8000000 0x08000000>;
+		ranges = <0 0 0xf 0xe8000000 0x08000000
+			  3 0 0xf 0xffdf0000 0x00008000>;
 
 		flash@0,0 {
 			compatible = "cfi-flash";
@@ -117,6 +118,11 @@
 			bank-width = <2>;
 			device-width = <2>;
 		};
+
+		board-control@3,0 {
+			compatible = "fsl,p2040rdb-cpld", "fsl,fpga-cpld";
+			reg = <3 0 0x20>;
+		};
 	};
 
 	pci0: pcie@ffe200000 {
diff --git a/arch/powerpc/boot/dts/p3041ds.dts b/arch/powerpc/boot/dts/p3041ds.dts
index 69cae67..92937ce 100644
--- a/arch/powerpc/boot/dts/p3041ds.dts
+++ b/arch/powerpc/boot/dts/p3041ds.dts
@@ -147,8 +147,8 @@
 		};
 
 		board-control@3,0 {
-			compatible = "fsl,p3041ds-pixis";
-			reg = <3 0 0x20>;
+			compatible = "fsl,p3041ds-pixis", "fsl,fpga-pixis";
+			reg = <3 0 0x30>;
 		};
 	};
 
diff --git a/arch/powerpc/boot/dts/p4080ds.dts b/arch/powerpc/boot/dts/p4080ds.dts
index eb11098..a26cf15 100644
--- a/arch/powerpc/boot/dts/p4080ds.dts
+++ b/arch/powerpc/boot/dts/p4080ds.dts
@@ -108,7 +108,8 @@
 
 	localbus@ffe124000 {
 		reg = <0xf 0xfe124000 0 0x1000>;
-		ranges = <0 0 0xf 0xe8000000 0x08000000>;
+		ranges = <0 0 0xf 0xe8000000 0x08000000
+			  3 0 0xf 0xffdf0000 0x00008000>;
 
 		flash@0,0 {
 			compatible = "cfi-flash";
@@ -116,6 +117,11 @@
 			bank-width = <2>;
 			device-width = <2>;
 		};
+
+		board-control@3,0 {
+			compatible = "fsl,p4080ds-pixis", "fsl,fpga-pixis";
+			reg = <3 0 0x30>;
+		};
 	};
 
 	pci0: pcie@ffe200000 {
diff --git a/arch/powerpc/boot/dts/p5020ds.dts b/arch/powerpc/boot/dts/p5020ds.dts
index 8366e2f..b959986 100644
--- a/arch/powerpc/boot/dts/p5020ds.dts
+++ b/arch/powerpc/boot/dts/p5020ds.dts
@@ -147,8 +147,8 @@
 		};
 
 		board-control@3,0 {
-			compatible = "fsl,p5020ds-pixis";
-			reg = <3 0 0x20>;
+			compatible = "fsl,p5020ds-pixis", "fsl,fpga-pixis";
+			reg = <3 0 0x30>;
 		};
 	};
 
-- 
1.7.3.4
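
The new ranges entries matter because a child reg such as <0x3 0x0 0x20>
is expressed as (chip select, offset) and can only be turned into a CPU
address through a matching ranges entry on the localbus node.  A rough
sketch of that lookup, using the p1020rdb values from the diff above; the
struct layout and demo_* names are assumptions made for the example, not
kernel code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One localbus "ranges" entry: <cs child_off parent_hi parent_lo size>. */
struct demo_range {
	uint32_t cs;
	uint32_t child_off;
	uint64_t parent;	/* parent_hi:parent_lo combined */
	uint32_t size;
};

/* p1020rdb ranges after the patch (values copied from the diff). */
static const struct demo_range p1020rdb_ranges[] = {
	{ 0x0, 0x0, 0xef000000ULL, 0x01000000 },
	{ 0x1, 0x0, 0xffa00000ULL, 0x00040000 },
	{ 0x2, 0x0, 0xffb00000ULL, 0x00020000 },
	{ 0x3, 0x0, 0xffdf0000ULL, 0x00008000 },
};

/* Translate (cs, off) to a parent address; false if no entry covers it. */
static bool demo_translate(const struct demo_range *r, size_t n,
			   uint32_t cs, uint32_t off, uint64_t *out)
{
	for (size_t i = 0; i < n; i++) {
		if (r[i].cs == cs && off >= r[i].child_off &&
		    off - r[i].child_off < r[i].size) {
			*out = r[i].parent + (off - r[i].child_off);
			return true;
		}
	}
	return false;
}
```

With the fourth entry present, chip select 3 offset 0 resolves to
0xffdf0000; without it the lookup for the board-control node would fail.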


* Re: Please pull 'next' branch of 4xx tree
From: Josh Boyer @ 2011-08-29 13:05 UTC
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev
In-Reply-To: <CA+5PVA6mHk+vGsY0OibzJs27rpaKjGgaCEJVpJe4fKqrtBEmSA@mail.gmail.com>

On Wed, Aug 10, 2011 at 2:26 PM, Josh Boyer <jwboyer@gmail.com> wrote:
> Hi Ben,
>
> Finally somewhat caught up.  Now that -rc1 is out, here are some
> patches for the next merge window.
>
> josh
>
> The following changes since commit 53d1e658df6e26d62500410719aaee2b82067c03:
>
>  Merge branch 'devicetree/merge' of
> git://git.secretlab.ca/git/linux-2.6 (2011-08-04 06:37:07 -1000)
>
> are available in the git repository at:
>
>  ssh://master.kernel.org/pub/scm/linux/kernel/git/jwboyer/powerpc-4xx.git next

Ben, ping?

josh


* Re: [PATCH 1/2] arch/powerpc/platforms/cell/iommu.c: add missing of_node_put
From: Arnd Bergmann @ 2011-08-29 11:26 UTC
  To: Julia Lawall
  Cc: cbe-oss-dev, devicetree-discuss, kernel-janitors, linux-kernel,
	Paul Mackerras, linuxppc-dev
In-Reply-To: <1313943001-12884-1-git-send-email-julia@diku.dk>

On Sunday 21 August 2011, Julia Lawall wrote:
> From: Julia Lawall <julia@diku.dk>
> 
> np is initialized to the result of calling a function that calls
> of_node_get, so of_node_put should be called before the pointer is dropped.
> 
> The semantic match that finds this problem is as follows:
> (http://coccinelle.lip6.fr/)
> 
> // <smpl>
> @@
> expression e,e1,e2;
> @@
> 
> * e = \(of_find_node_by_type\|of_find_node_by_name\)(...)
>   ... when != of_node_put(e)
>       when != true e == NULL
>       when != e2 = e
>   e = e1
> // </smpl>
> 
> Signed-off-by: Julia Lawall <julia@diku.dk>
> 
Acked-by: Arnd Bergmann <arnd@arndb.de>
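
The pattern the semantic patch looks for, reduced to a toy: a lookup that
returns a node with its reference count raised, followed by the pointer
being dropped without a matching put.  The struct and toy_* functions below
are stand-ins written for this illustration; they only model the counting,
not the real of_find_node_by_* or of_node_put implementations.

```c
#include <stddef.h>

/* Toy stand-in for struct device_node; only the refcount is modelled. */
struct toy_node {
	int refcount;
};

static struct toy_node toy_tree_node = { 1 };

/* Models a lookup helper that returns the node with its refcount raised. */
static struct toy_node *toy_find_node(void)
{
	toy_tree_node.refcount++;
	return &toy_tree_node;
}

/* Models the matching release; NULL is tolerated. */
static void toy_node_put(struct toy_node *np)
{
	if (np)
		np->refcount--;
}

/* Buggy shape: the pointer is dropped without a put, leaking a reference. */
static void toy_leaky_user(void)
{
	struct toy_node *np = toy_find_node();

	(void)np;	/* ... use np, then forget it ... */
}

/* Fixed shape: every successful find is paired with a put. */
static void toy_balanced_user(void)
{
	struct toy_node *np = toy_find_node();

	/* ... use np ... */
	toy_node_put(np);
}
```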


* linux-next: build failure in Linus' tree
From: Stephen Rothwell @ 2011-08-29  0:44 UTC
  To: Linus
  Cc: J. Bruce Fields, linux-parisc, NeilBrown, linux-kernel,
	linux-next, linuxppc-dev

Hi Linus,

After merging the fixes tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

arch/powerpc/kernel/built-in.o: In function `.sys_call_table':
(.text+0xbd00): undefined reference to `.sys_nfsservctl'
arch/powerpc/kernel/built-in.o: In function `.sys_call_table':
(.text+0xbd08): undefined reference to `.compat_sys_nfsservctl'

Caused by commit f5b940997397 ("All Arch: remove linkage for
sys_nfsservctl system call") which also missed parisc.

I will apply this patch for today:

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 29 Aug 2011 10:38:57 +1000
Subject: [PATCH] remove remaining references to nfsservctl

These were missed in commit f5b940997397 "All Arch: remove linkage
for sys_nfsservctl system call" due to them having no sys_ prefix
(presumably).

Cc: NeilBrown <neilb@suse.de>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-parisc@vger.kernel.org
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 arch/parisc/kernel/syscall_table.S |    2 +-
 arch/powerpc/include/asm/systbl.h  |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/parisc/kernel/syscall_table.S b/arch/parisc/kernel/syscall_table.S
index e66366f..3735abd 100644
--- a/arch/parisc/kernel/syscall_table.S
+++ b/arch/parisc/kernel/syscall_table.S
@@ -259,7 +259,7 @@
 	ENTRY_SAME(ni_syscall)		/* query_module */
 	ENTRY_SAME(poll)
 	/* structs contain pointers and an in_addr... */
-	ENTRY_COMP(nfsservctl)
+	ENTRY_SAME(ni_syscall)		/* was nfsservctl */
 	ENTRY_SAME(setresgid)		/* 170 */
 	ENTRY_SAME(getresgid)
 	ENTRY_SAME(prctl)
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index f6736b7..fa0d27a 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -171,7 +171,7 @@ SYSCALL_SPU(setresuid)
 SYSCALL_SPU(getresuid)
 SYSCALL(ni_syscall)
 SYSCALL_SPU(poll)
-COMPAT_SYS(nfsservctl)
+SYSCALL(ni_syscall)
 SYSCALL_SPU(setresgid)
 SYSCALL_SPU(getresgid)
 COMPAT_SYS_SPU(prctl)
-- 
1.7.5.4

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


* Re: kvm PCI assignment & VFIO ramblings
From: Avi Kivity @ 2011-08-28 14:04 UTC
  To: Joerg Roedel
  Cc: Alex Williamson, Alexey Kardashevskiy, kvm@vger.kernel.org,
	Paul Mackerras, Roedel, Joerg, qemu-devel, Alexander Graf, chrisw,
	iommu, Anthony Liguori, linux-pci@vger.kernel.org, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <20110828135632.GG8978@8bytes.org>

On 08/28/2011 04:56 PM, Joerg Roedel wrote:
> On Sun, Aug 28, 2011 at 04:14:00PM +0300, Avi Kivity wrote:
> >  On 08/26/2011 12:24 PM, Roedel, Joerg wrote:
>
> >>  The biggest problem with this approach is that it has to happen in the
> >>  context of the given process. Linux can't really modify an mm which
> >>  belongs to another context in a safe way.
> >>
> >
> >  Is use_mm() insufficient?
>
> Yes, it introduces a set of race conditions when a process that already
> has an mm wants to take over another process's mm temporarily (and when
> use_mm is modified to actually provide this functionality). It is only
> safe when used from kernel-thread context.
>
> One example:
>
> 	Process A		Process B			Process C
> 	.			.				.
> 	.		<--	takes A->mm			.
> 	.			and assigns as B->mm		.
> 	.			.			-->	Wants to take
> 	.			.				B->mm, but gets
> 								A->mm now

Good catch.

>
> This can't be secured by a lock, because it introduces a potential
> A->B<-->B->A lock problem when two processes try to take each other's mm.
> It could probably be solved by a task->real_mm pointer, haven't thought
> about this yet...
>

Or a workqueue -  you get a kernel thread context with a bit of boilerplate.

-- 
error compiling committee.c: too many arguments to function


* Re: kvm PCI assignment & VFIO ramblings
From: Joerg Roedel @ 2011-08-28 13:56 UTC
  To: Avi Kivity
  Cc: Alex Williamson, Alexey Kardashevskiy, kvm@vger.kernel.org,
	Paul Mackerras, Roedel, Joerg, qemu-devel, Alexander Graf, chrisw,
	iommu, Anthony Liguori, linux-pci@vger.kernel.org, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <4E5A3F18.7050903@redhat.com>

On Sun, Aug 28, 2011 at 04:14:00PM +0300, Avi Kivity wrote:
> On 08/26/2011 12:24 PM, Roedel, Joerg wrote:

>> The biggest problem with this approach is that it has to happen in the
>> context of the given process. Linux can't really modify an mm which
>> belongs to another context in a safe way.
>>
>
> Is use_mm() insufficient?

Yes, it introduces a set of race conditions when a process that already
has an mm wants to take over another process's mm temporarily (and when
use_mm is modified to actually provide this functionality). It is only
safe when used from kernel-thread context.

One example:

	Process A		Process B			Process C
	.			.				.
	.		<--	takes A->mm			.
	.			and assigns as B->mm		.
	.			.			-->	Wants to take
	.			.				B->mm, but gets
								A->mm now

This can't be secured by a lock, because it introduces a potential
A->B<-->B->A lock problem when two processes try to take each other's mm.
It could probably be solved by a task->real_mm pointer, haven't thought
about this yet...

	Joerg
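
The A/B/C example above can be replayed with plain pointers.  The real_mm
field below is only the idea floated in this mail, not existing kernel API,
and the sketch ignores locking, reference counting and concurrency
entirely; it just shows which mm C ends up with in each case.

```c
#include <stddef.h>

struct demo_mm {
	int id;
};

struct demo_task {
	struct demo_mm *mm;		/* address space currently in use */
	struct demo_mm *real_mm;	/* hypothetical: the task's own mm */
};

/* Temporarily adopt whatever mm the victim is using right now. */
static void demo_borrow_mm(struct demo_task *self, struct demo_task *victim)
{
	self->mm = victim->mm;
}

/* Variant that always adopts the victim's own mm via real_mm. */
static void demo_borrow_real_mm(struct demo_task *self, struct demo_task *victim)
{
	self->mm = victim->real_mm;
}
```

If B has borrowed A's mm and C then calls demo_borrow_mm() on B, C ends up
with A's mm; with demo_borrow_real_mm() it gets B's own mm instead.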


* Re: kvm PCI assignment & VFIO ramblings
From: Avi Kivity @ 2011-08-28 13:14 UTC
  To: Roedel, Joerg
  Cc: Alex Williamson, Alexey Kardashevskiy, kvm@vger.kernel.org,
	Paul Mackerras, linux-pci@vger.kernel.org, qemu-devel,
	Alexander Graf, chrisw, iommu, Anthony Liguori, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <20110826092440.GO1923@amd.com>

On 08/26/2011 12:24 PM, Roedel, Joerg wrote:
> >
> >  As I see it there are two options: (a) make subsequent accesses from
> >  userspace or the guest result in either a SIGBUS that userspace must
> >  either deal with or die, or (b) replace the mapping with a dummy RO
> >  mapping containing 0xff, with any trapped writes emulated as nops.
>
> The biggest problem with this approach is that it has to happen in the
> context of the given process. Linux can't really modify an mm which
> belongs to another context in a safe way.
>

Is use_mm() insufficient?

-- 
error compiling committee.c: too many arguments to function
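
Option (b) from the quoted text, modelled in userspace: once the region is
revoked, reads return 0xff bytes and writes are dropped.  All names here
are hypothetical and the sketch says nothing about how the mapping would
actually be swapped in another process's mm, which is the open question in
this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_REGION_SIZE 16

struct demo_region {
	uint8_t backing[DEMO_REGION_SIZE];	/* stands in for the device mapping */
	bool revoked;				/* set once the device is taken away */
};

/* True if [off, off + len) lies inside the region. */
static bool demo_in_bounds(size_t off, size_t len)
{
	return off <= DEMO_REGION_SIZE && len <= DEMO_REGION_SIZE - off;
}

/* Reads from a revoked region see all-ones, like a dummy page of 0xff. */
static bool demo_read(const struct demo_region *r, size_t off, void *dst, size_t len)
{
	if (!demo_in_bounds(off, len))
		return false;
	if (r->revoked)
		memset(dst, 0xff, len);
	else
		memcpy(dst, r->backing + off, len);
	return true;
}

/* Writes to a revoked region are silently dropped (emulated as nops). */
static bool demo_write(struct demo_region *r, size_t off, const void *src, size_t len)
{
	if (!demo_in_bounds(off, len))
		return false;
	if (!r->revoked)
		memcpy(r->backing + off, src, len);
	return true;
}
```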


* [PATCH] powerpc/fsl-booke: Handle L1 D-cache parity error correctly on e500mc
From: Kumar Gala @ 2011-08-27 11:18 UTC
  To: linuxppc-dev

If the L1 D-Cache is in write shadow mode the HW will auto-recover the
error.  However, we might still log the error and cause a machine check
(if L1CSR0[CPE], cache error checking enable, is set).  We should only
treat the non-write shadow case as non-recoverable.

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
---
 arch/powerpc/include/asm/reg_booke.h |    3 +++
 arch/powerpc/kernel/traps.c          |    9 ++++++++-
 2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/reg_booke.h b/arch/powerpc/include/asm/reg_booke.h
index 2d8c920..9856452 100644
--- a/arch/powerpc/include/asm/reg_booke.h
+++ b/arch/powerpc/include/asm/reg_booke.h
@@ -551,6 +551,9 @@
 #define L1CSR1_ICFI	0x00000002	/* Instr Cache Flash Invalidate */
 #define L1CSR1_ICE	0x00000001	/* Instr Cache Enable */
 
+/* Bit definitions for L1CSR2. */
+#define L1CSR2_DCWS	0x40000000	/* Data Cache write shadow */
+
 /* Bit definitions for L2CSR0. */
 #define L2CSR0_L2E	0x80000000	/* L2 Cache Enable */
 #define L2CSR0_L2PE	0x40000000	/* L2 Cache Parity/ECC Enable */
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 1a01414..a1a40f9 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -457,7 +457,14 @@ int machine_check_e500mc(struct pt_regs *regs)
 
 	if (reason & MCSR_DCPERR_MC) {
 		printk("Data Cache Parity Error\n");
-		recoverable = 0;
+
+		/*
+		 * In write shadow mode we auto-recover from the error, but it
+		 * may still get logged and cause a machine check.  We should
+		 * only treat the non-write shadow case as non-recoverable.
+		 */
+		if (!(mfspr(SPRN_L1CSR2) & L1CSR2_DCWS))
+			recoverable = 0;
 	}
 
 	if (reason & MCSR_L2MMU_MHIT) {
-- 
1.7.3.4
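
The decision the patch makes, pulled out into a pure function so it can be
read in isolation.  L1CSR2_DCWS is the value from the patch; the MCSR bit
below is a placeholder chosen for the sketch, and the real handler also
looks at other MCSR bits that are omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MCSR_DCPERR_MC	0x20000000u	/* placeholder bit for this sketch */
#define L1CSR2_DCWS		0x40000000u	/* Data Cache write shadow (from the patch) */

/*
 * Return whether a machine check is recoverable as far as the data cache
 * parity error is concerned, given snapshots of MCSR and L1CSR2.
 */
static bool demo_dcache_parity_recoverable(uint32_t mcsr, uint32_t l1csr2)
{
	if (!(mcsr & DEMO_MCSR_DCPERR_MC))
		return true;	/* no D-cache parity error reported */

	/* Recoverable only when the cache is in write shadow mode. */
	return (l1csr2 & L1CSR2_DCWS) != 0;
}
```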


* [PATCH] powerpc/fsl_msi: clean up and document calculation of MSIIR address
From: Timur Tabi @ 2011-08-27  0:28 UTC
  To: kumar.gala, linuxppc-dev

Commit 3da34aae (powerpc/fsl: Support unique MSI addresses per PCIe Root
Complex) redefined the meanings of msi->msi_addr_hi and msi->msi_addr_lo to be
an offset rather than an address.  To help clarify the code, we make the
following changes:

1) Get rid of msi_addr_hi, which is always zero anyway.

2) Rename msi_addr_lo to ccsr_msiir_offset, to indicate that it's an offset
   relative to the beginning of CCSR.

3) Calculate 64-bit addresses using actual 64-bit math.

4) Document some of the code and assumptions we make.

Signed-off-by: Timur Tabi <timur@freescale.com>
---
 arch/powerpc/sysdev/fsl_msi.c |   26 ++++++++++++++++++--------
 arch/powerpc/sysdev/fsl_msi.h |    3 +--
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/sysdev/fsl_msi.c b/arch/powerpc/sysdev/fsl_msi.c
index 419a772..d824230 100644
--- a/arch/powerpc/sysdev/fsl_msi.c
+++ b/arch/powerpc/sysdev/fsl_msi.c
@@ -30,7 +30,7 @@ LIST_HEAD(msi_head);
 
 struct fsl_msi_feature {
 	u32 fsl_pic_ip;
-	u32 msiir_offset;
+	u32 msiir_offset; /* offset of MSIIR, relative to start of MSI regs */
 };
 
 struct fsl_msi_cascade_data {
@@ -120,16 +120,23 @@ static void fsl_teardown_msi_irqs(struct pci_dev *pdev)
 	return;
 }
 
+/*
+ * Initialize the address and data fields of an MSI message object
+ */
 static void fsl_compose_msi_msg(struct pci_dev *pdev, int hwirq,
 				struct msi_msg *msg,
-				struct fsl_msi *fsl_msi_data)
+				struct fsl_msi *msi_data)
 {
-	struct fsl_msi *msi_data = fsl_msi_data;
 	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
-	u64 base = fsl_pci_immrbar_base(hose);
 
-	msg->address_lo = msi_data->msi_addr_lo + lower_32_bits(base);
-	msg->address_hi = msi_data->msi_addr_hi + upper_32_bits(base);
+	/*
+	 * The PCI address of MSIIR is equal to the PCI base address of CCSR
+	 * plus the offset of MSIIR.
+	 */
+	u64 addr = fsl_pci_immrbar_base(hose) + msi_data->ccsr_msiir_offset;
+
+	msg->address_hi = upper_32_bits(addr);
+	msg->address_lo = lower_32_bits(addr);
 
 	msg->data = hwirq;
 
@@ -359,8 +366,11 @@ static int __devinit fsl_of_msi_probe(struct platform_device *dev)
 
 	msi->irqhost->host_data = msi;
 
-	msi->msi_addr_hi = 0x0;
-	msi->msi_addr_lo = features->msiir_offset + (res.start & 0xfffff);
+	/*
+	 * We assume that the 'reg' property of the MSI node contains an
+	 * offset that has five (or fewer) digits, hence the 0xfffff.
+	 */
+	msi->ccsr_msiir_offset = features->msiir_offset + (res.start & 0xfffff);
 
 	rc = fsl_msi_init_allocator(msi);
 	if (rc) {
diff --git a/arch/powerpc/sysdev/fsl_msi.h b/arch/powerpc/sysdev/fsl_msi.h
index 624580c..eb68c42 100644
--- a/arch/powerpc/sysdev/fsl_msi.h
+++ b/arch/powerpc/sysdev/fsl_msi.h
@@ -28,8 +28,7 @@ struct fsl_msi {
 
 	unsigned long cascade_irq;
 
-	u32 msi_addr_lo;
-	u32 msi_addr_hi;
+	u32 ccsr_msiir_offset; /* offset of MSIIR, relative to start of CCSR */
 	void __iomem *msi_regs;
 	u32 feature;
 	int msi_virqs[NR_MSI_REG];
-- 
1.7.3.4
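
Point 3 in isolation: adding the two 32-bit halves separately loses any
carry out of the low word, while a single 64-bit addition followed by a
split does not.  The helper names are local stand-ins for
upper_32_bits()/lower_32_bits(), and the base address in the note below is
contrived to force a carry; whether real CCSR bases can trigger one is not
claimed here.

```c
#include <stdint.h>

static uint32_t demo_upper_32(uint64_t v) { return (uint32_t)(v >> 32); }
static uint32_t demo_lower_32(uint64_t v) { return (uint32_t)v; }

struct demo_msi_addr {
	uint32_t hi;
	uint32_t lo;
};

/* Old shape: add the halves separately; a carry out of the low word is lost. */
static struct demo_msi_addr demo_split_add(uint64_t base, uint32_t off_lo,
					   uint32_t off_hi)
{
	struct demo_msi_addr a;

	a.lo = off_lo + demo_lower_32(base);
	a.hi = off_hi + demo_upper_32(base);
	return a;
}

/* New shape: one 64-bit addition, then split into high and low words. */
static struct demo_msi_addr demo_full_add(uint64_t base, uint32_t offset)
{
	uint64_t addr = base + offset;
	struct demo_msi_addr a;

	a.hi = demo_upper_32(addr);
	a.lo = demo_lower_32(addr);
	return a;
}
```

With base 0xffffe000 and offset 0x41140 the 64-bit sum is 0x10003f140:
demo_full_add() yields hi = 1, while demo_split_add() leaves hi at 0.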


* Re: kvm PCI assignment & VFIO ramblings
From: Chris Wright @ 2011-08-26 21:06 UTC
  To: Aaron Fabbri
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	Roedel, Joerg, Alexander Graf, qemu-devel, Chris Wright, iommu,
	Avi Kivity, Anthony Liguori, linux-pci@vger.kernel.org,
	linuxppc-dev, benve@cisco.com
In-Reply-To: <CA7D4D51.FD84%aafabbri@cisco.com>

* Aaron Fabbri (aafabbri@cisco.com) wrote:
> On 8/26/11 12:35 PM, "Chris Wright" <chrisw@sous-sol.org> wrote:
> > * Aaron Fabbri (aafabbri@cisco.com) wrote:
> >> Each process will open vfio devices on the fly, and they need to be able to
> >> share IOMMU resources.
> > 
> > How do you share IOMMU resources w/ multiple processes, are the processes
> > sharing memory?
> 
> Sorry, bad wording.  I share IOMMU domains *within* each process.

Ah, got it.  Thanks.

> E.g. If one process has 3 devices and another has 10, I can get by with two
> iommu domains (and can share buffers among devices within each process).
> 
> If I ever need to share devices across processes, the shared memory case
> might be interesting.
> 
> > 
> >> So I need the ability to dynamically bring up devices and assign them to a
> >> group.  The number of actual devices and how they map to iommu domains is
> >> not known ahead of time.  We have a single piece of silicon that can expose
> >> hundreds of pci devices.
> > 
> > This does not seem fundamentally different from the KVM use case.
> > 
> > We have 2 kinds of groupings.
> > 
> > 1) low-level system or topology grouping
> > 
> >    Some may have multiple devices in a single group
> > 
> >    * the PCIe-PCI bridge example
> >    * the POWER partitionable endpoint
> > 
> >    Many will not
> > 
> >    * singleton group, e.g. typical x86 PCIe function (majority of
> >      assigned devices)
> > 
> >    Not sure it makes sense to have these administratively defined as
> >    opposed to system defined.
> > 
> > 2) logical grouping
> > 
> >    * multiple low-level groups (singleton or otherwise) attached to same
> >      process, allowing things like single set of io page tables where
> >      applicable.
> > 
> >    These are nominally administratively defined.  In the KVM case, there
> >    is likely a privileged task (i.e. libvirtd) involved w/ making the
> >    device available to the guest and can do things like group merging.
> >    In your userspace case, perhaps it should be directly exposed.
> 
> Yes.  In essence, I'd rather not have to run any other admin processes.
> Doing things programmatically, on the fly, from each process, is the
> cleanest model right now.

I don't see an issue w/ this.  As long as it cannot add devices to the
system defined groups, it's not a privileged operation.  So we still
need the iommu domain concept exposed in some form to logically put
groups into a single iommu domain (if desired).  In fact, I believe Alex
covered this in his most recent recap:

  ...The group fd will provide interfaces for enumerating the devices
  in the group, returning a file descriptor for each device in the group
  (the "device fd"), binding groups together, and returning a file
  descriptor for iommu operations (the "iommu fd").

thanks,
-chris
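
The 3-device and 10-device example above, as a toy model: every low-level
group starts as a singleton in its own domain, and merging groups within a
process reduces the number of iommu domains in use.  This is only a
bookkeeping sketch with invented names; it does not reflect any actual
VFIO interface.

```c
#include <stddef.h>

#define DEMO_MAX_GROUPS 16

/* Each low-level group records the logical iommu domain it belongs to. */
struct demo_groups {
	int domain_of[DEMO_MAX_GROUPS];
	size_t n;
};

/* Start with every group as a singleton in its own domain. */
static void demo_init(struct demo_groups *g, size_t n)
{
	if (n > DEMO_MAX_GROUPS)
		n = DEMO_MAX_GROUPS;
	g->n = n;
	for (size_t i = 0; i < n; i++)
		g->domain_of[i] = (int)i;
}

/* Merge the domain of group b into the domain of group a. */
static void demo_merge(struct demo_groups *g, size_t a, size_t b)
{
	int from = g->domain_of[b];
	int to = g->domain_of[a];

	for (size_t i = 0; i < g->n; i++)
		if (g->domain_of[i] == from)
			g->domain_of[i] = to;
}

/* Count the distinct domains still in use. */
static size_t demo_count_domains(const struct demo_groups *g)
{
	size_t count = 0;

	for (size_t i = 0; i < g->n; i++) {
		size_t j;

		for (j = 0; j < i; j++)
			if (g->domain_of[j] == g->domain_of[i])
				break;
		if (j == i)
			count++;
	}
	return count;
}
```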


* [PATCH] powerpc/eeh: fix /proc/ppc64/eeh creation
From: Thadeu Lima de Souza Cascardo @ 2011-08-26 20:36 UTC
  To: Benjamin Herrenschmidt
  Cc: linux-kernel, Paul Mackerras, Thadeu Lima de Souza Cascardo,
	linuxppc-dev, Breno Leitao

Since commit 188917e183cf9ad0374b571006d0fc6d48a7f447, /proc/ppc64 is a
symlink to /proc/powerpc/. That means that creating /proc/ppc64/eeh will
end up with an inaccessible file that is not listed under /proc/powerpc/
and, therefore, not listed under /proc/ppc64/ either.

Creating /proc/powerpc/eeh fixes that problem and maintains the
compatibility intended with the ppc64 symlink.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/eeh.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/eeh.c b/arch/powerpc/platforms/pseries/eeh.c
index ada6e07..d42f37d 100644
--- a/arch/powerpc/platforms/pseries/eeh.c
+++ b/arch/powerpc/platforms/pseries/eeh.c
@@ -1338,7 +1338,7 @@ static const struct file_operations proc_eeh_operations = {
 static int __init eeh_init_proc(void)
 {
 	if (machine_is(pseries))
-		proc_create("ppc64/eeh", 0, NULL, &proc_eeh_operations);
+		proc_create("powerpc/eeh", 0, NULL, &proc_eeh_operations);
 	return 0;
 }
 __initcall(eeh_init_proc);
-- 
1.7.4.4


* Re: kvm PCI assignment & VFIO ramblings
From: Chris Wright @ 2011-08-26 19:35 UTC
  To: Aaron Fabbri
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	linux-pci@vger.kernel.org, Alexander Graf, qemu-devel, chrisw,
	iommu, Avi Kivity, Anthony Liguori, Roedel, Joerg, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <CA7D2B86.FD79%aafabbri@cisco.com>

* Aaron Fabbri (aafabbri@cisco.com) wrote:
> On 8/26/11 7:07 AM, "Alexander Graf" <agraf@suse.de> wrote:
> > Forget the KVM case for a moment and think of a user space device driver. I as
> > a user am not root. But I as a user when having access to /dev/vfioX want to
> > be able to access the device and manage it - and only it. The admin of that
> > box needs to set it up properly for me to be able to access it.
> > 
> > So having two steps is really the correct way to go:
> > 
> >   * create VFIO group
> >   * use VFIO group
> > 
> > because the two are done by completely different users.
> 
> This is not the case for my userspace drivers using VFIO today.
> 
> Each process will open vfio devices on the fly, and they need to be able to
> share IOMMU resources.

How do you share IOMMU resources w/ multiple processes, are the processes
sharing memory?

> So I need the ability to dynamically bring up devices and assign them to a
> group.  The number of actual devices and how they map to iommu domains is
> not known ahead of time.  We have a single piece of silicon that can expose
> hundreds of pci devices.

This does not seem fundamentally different from the KVM use case.

We have 2 kinds of groupings.

1) low-level system or topology grouping

   Some may have multiple devices in a single group

   * the PCIe-PCI bridge example
   * the POWER partitionable endpoint

   Many will not

   * singleton group, e.g. typical x86 PCIe function (majority of
     assigned devices)

   Not sure it makes sense to have these administratively defined as
   opposed to system defined.

2) logical grouping

   * multiple low-level groups (singleton or otherwise) attached to same
     process, allowing things like single set of io page tables where
     applicable.

   These are nominally administratively defined.  In the KVM case, there
   is likely a privileged task (i.e. libvirtd) involved w/ making the
   device available to the guest and can do things like group merging.
   In your userspace case, perhaps it should be directly exposed.

> In my case, the only administrative task would be to give my processes/users
> access to the vfio groups (which are initially singletons), and the
> application actually opens them and needs the ability to merge groups
> together to conserve IOMMU resources (assuming we're not going to expose
> uiommu).

I agree, we definitely need to expose _some_ way to do this.

thanks,
-chris

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Aaron Fabbri @ 2011-08-26 20:17 UTC (permalink / raw)
  To: Chris Wright
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	linux-pci@vger.kernel.org, Alexander Graf, qemu-devel, iommu,
	Avi Kivity, Anthony Liguori, Roedel, Joerg, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <20110826193559.GD13060@sequoia.sous-sol.org>




On 8/26/11 12:35 PM, "Chris Wright" <chrisw@sous-sol.org> wrote:

> * Aaron Fabbri (aafabbri@cisco.com) wrote:
>> On 8/26/11 7:07 AM, "Alexander Graf" <agraf@suse.de> wrote:
>>> Forget the KVM case for a moment and think of a user space device driver. I
>>> as
>>> a user am not root. But I as a user when having access to /dev/vfioX want to
>>> be able to access the device and manage it - and only it. The admin of that
>>> box needs to set it up properly for me to be able to access it.
>>> 
>>> So having two steps is really the correct way to go:
>>> 
>>>   * create VFIO group
>>>   * use VFIO group
>>> 
>>> because the two are done by completely different users.
>> 
>> This is not the case for my userspace drivers using VFIO today.
>> 
>> Each process will open vfio devices on the fly, and they need to be able to
>> share IOMMU resources.
> 
> How do you share IOMMU resources w/ multiple processes, are the processes
> sharing memory?

Sorry, bad wording.  I share IOMMU domains *within* each process.

E.g. If one process has 3 devices and another has 10, I can get by with two
iommu domains (and can share buffers among devices within each process).

If I ever need to share devices across processes, the shared memory case
might be interesting.

> 
>> So I need the ability to dynamically bring up devices and assign them to a
>> group.  The number of actual devices and how they map to iommu domains is
>> not known ahead of time.  We have a single piece of silicon that can expose
>> hundreds of pci devices.
> 
> This does not seem fundamentally different from the KVM use case.
> 
> We have 2 kinds of groupings.
> 
> 1) low-level system or topology grouping
> 
>    Some may have multiple devices in a single group
> 
>    * the PCIe-PCI bridge example
>    * the POWER partitionable endpoint
> 
>    Many will not
> 
>    * singleton group, e.g. typical x86 PCIe function (majority of
>      assigned devices)
> 
>    Not sure it makes sense to have these administratively defined as
>    opposed to system defined.
> 
> 2) logical grouping
> 
>    * multiple low-level groups (singleton or otherwise) attached to same
>      process, allowing things like single set of io page tables where
>      applicable.
> 
>    These are nominally administratively defined.  In the KVM case, there
>    is likely a privileged task (i.e. libvirtd) involved w/ making the
>    device available to the guest and can do things like group merging.
>    In your userspace case, perhaps it should be directly exposed.

Yes.  In essence, I'd rather not have to run any other admin processes.
Doing things programmatically, on the fly, from each process, is the
cleanest model right now.

> 
>> In my case, the only administrative task would be to give my processes/users
>> access to the vfio groups (which are initially singletons), and the
>> application actually opens them and needs the ability to merge groups
>> together to conserve IOMMU resources (assuming we're not going to expose
>> uiommu).
> 
> I agree, we definitely need to expose _some_ way to do this.
> 
> thanks,
> -chris

^ permalink raw reply

* Re: Kernel boot up
From: Scott Wood @ 2011-08-26 20:08 UTC (permalink / raw)
  To: smitha.vanga; +Cc: linuxppc-dev
In-Reply-To: <07ACDFB8ECA8EF47863A613BC01BBB22035E3D59@HYD-MKD-MBX02.wipro.com>

On 08/26/2011 01:00 AM, smitha.vanga@wipro.com wrote:
>  
> Thanks scott.
> 
> There was an issue with the file system. Now my board is up with the
> linux boot prompt .
> But ping is not working. 

You still haven't set your MAC address.  U-Boot should be fixing this up
in the device tree.

> The local loopback ping works. My phy chip
> BCM5221 is connected on port A

Your device tree describes a connection on port C.  You need to update
the mdio node's reg to point to port A's registers (0x10d00), and
fsl,mdio-pin and fsl,mdc-pin need to be set to the particular pins your
board uses.

-Scott

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Alex Williamson @ 2011-08-26 18:04 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: chrisw, Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	Roedel, Joerg, linux-pci@vger.kernel.org, qemu-devel,
	Aaron Fabbri, iommu, Avi Kivity, Anthony Liguori, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <20110825180557.GD8978@8bytes.org>

On Thu, 2011-08-25 at 20:05 +0200, Joerg Roedel wrote:
> On Thu, Aug 25, 2011 at 11:20:30AM -0600, Alex Williamson wrote:
> > On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote:
> 
> > > We need to solve this differently. ARM is starting to use the iommu-api
> > > too and this definitely does not work there. One possible solution might
> > > be to make the iommu-ops per-bus.
> > 
> > That sounds good.  Is anyone working on it?  It seems like it doesn't
> > hurt to use this in the interim, we may just be watching the wrong bus
> > and never add any sysfs group info.
> 
> I'll cook something up for RFC over the weekend.
> 
> > > Also the return type should not be long but something that fits into
> > > 32bit on all platforms. Since you use -ENODEV, probably s32 is a good
> > > choice.
> > 
> > The convenience of using seg|bus|dev|fn was too much to resist, too bad
> > it requires a full 32bits.  Maybe I'll change it to:
> >         int iommu_device_group(struct device *dev, unsigned int *group)
> 
> If we really expect segment numbers that need the full 16 bit then this
> would be the way to go. Otherwise I would prefer returning the group-id
> directly and partition the group-id space for the error values (s32 with
> negative numbers being errors).

It's unlikely to have segments using the top bit, but it would be broken
for an iommu driver to define its group numbers using pci s:b:d.f if we
don't have that bit available.  Ben/David, do PEs have an identifier of
a convenient size?  I'd guess any hardware based identifier is going to
use a full unsigned bit width.  Thanks,

Alex
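
To make the width concern above concrete, here is a minimal sketch in
plain user-space C (not kernel code).  The helper names pack_sbdf(),
group_as_s32() and group_via_pointer() are hypothetical; only the shape
"int f(dev, unsigned int *group)" comes from the thread.  It assumes the
conventional PCI layout of a 16-bit segment, 8-bit bus, 5-bit device and
3-bit function.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: pack seg:bus:dev.fn into 32 bits
 * (16-bit segment, 8-bit bus, 5-bit device, 3-bit function). */
static uint32_t pack_sbdf(uint16_t seg, uint8_t bus, uint8_t dev, uint8_t fn)
{
	return ((uint32_t)seg << 16) | ((uint32_t)bus << 8) |
	       ((uint32_t)(dev & 0x1f) << 3) | (uint32_t)(fn & 0x7);
}

/* Option 1: return the id directly as a signed 32-bit value, with
 * negative numbers reserved for errors.  Any segment >= 0x8000 sets
 * bit 31, so on the usual two's-complement targets the id is
 * indistinguishable from an error code. */
static int32_t group_as_s32(uint16_t seg, uint8_t bus, uint8_t dev, uint8_t fn)
{
	return (int32_t)pack_sbdf(seg, bus, dev, fn);
}

/* Option 2: status in the return value, id through an out-parameter.
 * All 32 bits of the id stay usable. */
static int group_via_pointer(int have_iommu, uint16_t seg, uint8_t bus,
			     uint8_t dev, uint8_t fn, unsigned int *group)
{
	if (!have_iommu)
		return -ENODEV;
	*group = pack_sbdf(seg, bus, dev, fn);
	return 0;
}
```

With option 1 the caller has to give up the top segment bit; with
option 2 the caller pays for an extra pointer argument but the driver
can use any 32-bit identifier it likes.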

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Aaron Fabbri @ 2011-08-26 17:52 UTC (permalink / raw)
  To: Alexander Graf, Roedel, Joerg
  Cc: Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, chrisw, iommu, Avi Kivity,
	Anthony Liguori, linuxppc-dev, benve@cisco.com
In-Reply-To: <571DC890-A1A3-4528-92BE-566F033FD4BF@suse.de>

On 8/26/11 7:07 AM, "Alexander Graf" <agraf@suse.de> wrote:

> 
<snip>
> 
> Forget the KVM case for a moment and think of a user space device driver. I as
> a user am not root. But I as a user when having access to /dev/vfioX want to
> be able to access the device and manage it - and only it. The admin of that
> box needs to set it up properly for me to be able to access it.
> 
> So having two steps is really the correct way to go:
> 
>   * create VFIO group
>   * use VFIO group
> 
> because the two are done by completely different users.

This is not the case for my userspace drivers using VFIO today.

Each process will open vfio devices on the fly, and they need to be able to
share IOMMU resources.

So I need the ability to dynamically bring up devices and assign them to a
group.  The number of actual devices and how they map to iommu domains is
not known ahead of time.  We have a single piece of silicon that can expose
hundreds of pci devices.

In my case, the only administrative task would be to give my processes/users
access to the vfio groups (which are initially singletons), and the
application actually opens them and needs the ability to merge groups
together to conserve IOMMU resources (assuming we're not going to expose
uiommu).

-Aaron

^ permalink raw reply

* VFIO v2 design plan
From: Alex Williamson @ 2011-08-26 17:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci@vger.kernel.org, agraf, qemu-devel, David Gibson,
	aafabbri, iommu, Avi Kivity, Anthony Liguori, Roedel, Joerg,
	linuxppc-dev, benve

I don't think too much has changed since the previous email went out,
but it seems like a good idea to post a summary in case there were
suggestions or objections that I missed.

VFIO v2 will rely on the platform iommu driver reporting grouping
information.  Again, a group is a set of devices for which the iommu
cannot differentiate transactions.  An example would be a set of devices
behind a PCI-to-PCI bridge.  All transactions appear to be from the
bridge itself rather than devices behind the bridge.  Platforms are free
to have whatever constraints they need to for what constitutes a group.

I posted a rough draft of a patch to implement that for the base iommu
driver and VT-d, adding an iommu_device_group callback on iommu ops.
The iommu base driver also populates an iommu_group sysfs file for each
device that's part of a group.  Members of the same group return the
same value via either the sysfs or iommu_device_group.  The value
returned is arbitrary, should not be assumed to be persistent across
boots, and is left to the iommu driver to generate.  There are some
implementation details around how to do this without favoring one bus
over another, but the interface should be bus/device type agnostic in
the end.

When the vfio module is loaded, character devices will be created for
each group in /dev/vfio/$GROUP.  Setting file permissions on these files
should be sufficient for providing a user with complete access to the
group.  Opening this device file provides what we'll call the "group
fd".  The group fd is restricted to only work with a single mm context.
Concurrent opens will be denied if the opening process mm does not
match.  The group fd will provide interfaces for enumerating the devices
in the group, returning a file descriptor for each device in the group
(the "device fd"), binding groups together, and returning a file
descriptor for iommu operations (the "iommu fd").

A group is "viable" when all member devices of the group are bound to
the vfio driver.  Until that point, the group fd only allows enumeration
interfaces (ie. listing of group devices).  I'm currently thinking
enumeration will be done by a simple read() on the device file returning
a list of dev_name()s.  Once the group is viable, the user may bind the
group to another group, retrieve the iommu fd, or retrieve device fds.
Internally, each of these operations will result in an iommu domain
being allocated and all of the devices attached to the domain.
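
The viability rule above can be sketched as follows.  This is only an
illustration in user-space C; the types viab_device and viab_group and
the viab_op values are hypothetical stand-ins, not anything from the
vfio code.  The sketch treats an empty group as viable, which is an
assumption the text above does not address.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the vfio code. */
struct viab_device {
	const char *name;      /* e.g. "0000:01:00.0" */
	bool bound_to_vfio;    /* true once the vfio driver owns the device */
};

struct viab_group {
	const struct viab_device *devs;
	size_t ndevs;
};

enum viab_op {
	OP_ENUMERATE,
	OP_BIND_GROUP,
	OP_GET_IOMMU_FD,
	OP_GET_DEVICE_FD
};

/* A group is viable only when every member device is bound to vfio. */
static bool group_is_viable(const struct viab_group *g)
{
	for (size_t i = 0; i < g->ndevs; i++)
		if (!g->devs[i].bound_to_vfio)
			return false;
	return true;
}

/* Enumeration is always allowed; everything else needs a viable group. */
static bool op_allowed(const struct viab_group *g, enum viab_op op)
{
	return op == OP_ENUMERATE || group_is_viable(g);
}
```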

The purpose of binding groups is to share the iommu domain.  Groups
making use of incompatible iommu domains will fail to bind.  Groups
making use of different mm's will fail to bind.  The vfio driver may
reject some binding based on domain capabilities, but final veto power
is left to the iommu driver[1].  If a user makes use of a group
independently and later wishes to bind it to another group, all the
device fds and the iommu fd must first be closed.  This prevents using a
stale iommu fd or accessing devices while the iommu is being switched.
Operations on any group fds of a merged group are performed globally on
the group (ie. enumerating the devices lists all devices in the merged
group, retrieving the iommu fd from any group fd results in the same fd,
device fds from any group can be retrieved from any group fd[2]).
Groups can be merged and unmerged dynamically.  Unmerging a group
requires the device fds for the outgoing group are closed.  The iommu fd
will remain persistent for the remaining merged group.
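
The binding checks described above might look roughly like this.  It is
a user-space sketch under stated assumptions: struct merge_group and
merge_groups() are hypothetical, "compatible" is reduced to equal
capability bits (the real decision is left to the iommu driver), and
the errno values are arbitrary choices, not a proposed ABI.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a group for the purpose of binding. */
struct merge_group {
	int mm_id;            /* stand-in for the owning mm context */
	unsigned int caps;    /* stand-in for iommu domain capabilities */
	int open_device_fds;  /* device fds currently held open */
	bool iommu_fd_open;   /* iommu fd currently held open */
	int container_id;     /* groups sharing a domain share this id */
};

/* Try to bind 'incoming' to the domain already used by 'target'. */
static int merge_groups(struct merge_group *incoming,
			const struct merge_group *target)
{
	/* Groups making use of different mm's fail to bind. */
	if (incoming->mm_id != target->mm_id)
		return -EPERM;

	/* All device fds and the iommu fd must be closed first, so no
	 * stale fd survives the switch to the shared domain. */
	if (incoming->open_device_fds > 0 || incoming->iommu_fd_open)
		return -EBUSY;

	/* Incompatible iommu domains fail to bind (simplified test). */
	if (incoming->caps != target->caps)
		return -EINVAL;

	incoming->container_id = target->container_id;
	return 0;
}
```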

If a device within a group is unbound from the vfio driver while it's in
use (iommu fd refcnt > 0 || device fd refcnt > 0), vfio will block the
release and send netlink remove requests for every opened device in the
group (or merged group).  If the device fds are not released and
subsequently the iommu fd released as well, vfio will kill the user
process after some delay.  At some point in the future we may be able to
adapt this to perform a hard removal and revoke all device access
without killing the user.

The iommu fd supports dma mapping and unmapping ioctls as well as some,
yet to be defined and possibly architecture specific, iommu description
interfaces.  At some point we may also make use of read/write/mmap on
the iommu fd as a means to set up dma.

The device fds will largely support the existing vfio interface, with
generalizations to make it non-pci specific.  We'll access mmio/pio/pci
config using segmented offset into the device fd.  Interrupts will use
the existing mechanisms (eventfds/irqfd).  We'll need to add ioctls to
describe the type of device, number, size, and type of each resource and
available interrupts.

We still have outstanding questions with how devices are exposed in
qemu, but I think that's largely a qemu-vfio problem and the vfio kernel
interface described here supports all the interesting ways that devices
can be exposed as individuals or sets.  I'm currently working on code
changes to support the above and will post as I complete useful chunks.
Thanks,

Alex

[1] Implementation note: the current iommu ops makes some of this
awkward.  We'll need to temporarily setup a domain for incoming devices
to validate the capabilities of that domain, then tear it down and try
to attach devices to the existing domain.  In particular I'm thinking of
the cache coherence capability and whether we remap existing dma
mappings to allow this to change or just reject as incompatible (I'm
leaning to the latter).

[2] Implementation note: I think a container object makes sense here
where reads/ioctls are passed from the group to the container, which
performs them across all groups making use of that container (there are
no performance critical paths through the group fd).  This also implies
the enumeration interface should report groups so we can easily see
which groups are merged.  The group fd could simply read as:
        group: 1234
        device: 0000:00:19.0
        group: 5678
        device: 0000:01:00.0
        device: 0000:01:00.1
Some might say this is screaming for xml.  Do we need to go there?  We
could also do this via the netlink interface.  Suggestions welcome.
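
For what it is worth, the plain "group:" / "device:" listing above is
easy to consume without xml.  The following is a hedged user-space
sketch of a parser for exactly that text layout; the format itself is
only a proposal in this mail, and struct enum_entry and
parse_group_listing() are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

/* One parsed device line together with the group it was listed under. */
struct enum_entry {
	unsigned int group;
	char device[32];
};

/* Parse "group: N" / "device: NAME" lines from 'text' into 'out'.
 * Returns the number of devices found, or -1 on malformed input. */
static int parse_group_listing(const char *text, struct enum_entry *out,
			       int max)
{
	unsigned int cur = 0;
	int have_group = 0, n = 0;

	while (*text) {
		char line[64], name[32];
		unsigned int g;
		size_t len = strcspn(text, "\n");

		if (len >= sizeof(line))
			return -1;
		memcpy(line, text, len);
		line[len] = '\0';
		text += len;
		if (*text == '\n')
			text++;

		if (sscanf(line, " group: %u", &g) == 1) {
			cur = g;
			have_group = 1;
		} else if (sscanf(line, " device: %31s", name) == 1) {
			/* A device line must follow some group line. */
			if (!have_group || n >= max)
				return -1;
			out[n].group = cur;
			strcpy(out[n].device, name);
			n++;
		} else if (line[strspn(line, " \t")] != '\0') {
			return -1;	/* neither blank nor recognised */
		}
	}
	return n;
}
```

Because each device entry carries its group number, a caller can see
which low-level groups have been merged without any extra interface.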

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Alexander Graf @ 2011-08-26 15:29 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: chrisw, Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	Roedel, Joerg, qemu-devel, aafabbri, iommu, Avi Kivity,
	Anthony Liguori, linux-pci@vger.kernel.org, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <20110826152404.GF8978@8bytes.org>


On 26.08.2011, at 10:24, Joerg Roedel wrote:

> On Fri, Aug 26, 2011 at 09:07:35AM -0500, Alexander Graf wrote:
>> On 26.08.2011, at 04:33, Roedel, Joerg wrote:
>>>
>>> The reason is that you mean the usability for the programmer and I mean
>>> it for the actual user of qemu :)
>>
>> No, we mean the actual user of qemu. The reason being that making a
>> device available for any user space application is an administrative
>> task.
>>
>> Forget the KVM case for a moment and think of a user space device
>> driver. I as a user am not root. But I as a user when having access to
>> /dev/vfioX want to be able to access the device and manage it - and
>> only it. The admin of that box needs to set it up properly for me to
>> be able to access it.
>
> Right, and that task is being performed by attaching the device(s) in
> question to the vfio driver. The rights-management happens on the
> /dev/vfio/$group file.

Yup :)

>
>> So having two steps is really the correct way to go:
>>
>>  * create VFIO group
>>  * use VFIO group
>>
>> because the two are done by completely different users. It's similar
>> to how tun/tap works in Linux too. Of course nothing keeps you from
>> also creating a group on the fly, but it shouldn't be the only
>> interface available. The persistent setup is definitely more useful.
>
> I see the use-case. But to make it as easy as possible for the end-user
> we can do both.
>
> So the user of (qemu again) does this:
>
> # vfio-ctl attach 00:01.0
> vfio-ctl: attached to group 8
> # vfio-ctl attach 00:02.0
> vfio-ctl: attached to group 16
> $ qemu -device vfio-pci,host=00:01.0 -device vfio,host=00:01.0 ...
>
> which should cover the usecase you prefer. Qemu still creates the
> meta-group that allow the devices to share the same page-table. But what
> should also be possible is:
>
> # qemu -device vfio-pci,host=00:01.0 -device vfio-pci,host=00:02.0
>
> In that case qemu detects that the devices are not yet bound to vfio and
> will do so and also unbinds them afterwards (essentially the developer
> use-case).

I agree. It works the same way with tun today. You can either have qemu
spawn a tun device dynamically or have a preallocated one you use. If
you run qemu as a user (which I always do), I preallocate a tun device
and attach qemu to it.

> Your interface which requires pre-binding of devices into one group by
> the administrator only makes sense if you want to force userspace to
> use certain devices (which do not belong to the same hw-group) only
> together. But I don't see a usecase for defining such constraints (yet).

Agreed. As long as the kernel backend can always figure out the
hw-groups, we're good :)


Alex

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Joerg Roedel @ 2011-08-26 15:24 UTC (permalink / raw)
  To: Alexander Graf
  Cc: chrisw, Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	Roedel, Joerg, qemu-devel, aafabbri, iommu, Avi Kivity,
	Anthony Liguori, linux-pci@vger.kernel.org, linuxppc-dev,
	benve@cisco.com
In-Reply-To: <571DC890-A1A3-4528-92BE-566F033FD4BF@suse.de>

On Fri, Aug 26, 2011 at 09:07:35AM -0500, Alexander Graf wrote:
> On 26.08.2011, at 04:33, Roedel, Joerg wrote:
> > 
> > The reason is that you mean the usability for the programmer and I mean
> > it for the actual user of qemu :)
> 
> No, we mean the actual user of qemu. The reason being that making a
> device available for any user space application is an administrative
> task.
>
> Forget the KVM case for a moment and think of a user space device
> driver. I as a user am not root. But I as a user when having access to
> /dev/vfioX want to be able to access the device and manage it - and
> only it. The admin of that box needs to set it up properly for me to
> be able to access it.

Right, and that task is being performed by attaching the device(s) in
question to the vfio driver. The rights-management happens on the
/dev/vfio/$group file.

> So having two steps is really the correct way to go:
> 
>   * create VFIO group
>   * use VFIO group
> 
> because the two are done by completely different users. It's similar
> to how tun/tap works in Linux too. Of course nothing keeps you from
> also creating a group on the fly, but it shouldn't be the only
> interface available. The persistent setup is definitely more useful.

I see the use-case. But to make it as easy as possible for the end-user
we can do both.

So the user of (qemu again) does this:

# vfio-ctl attach 00:01.0
vfio-ctl: attached to group 8
# vfio-ctl attach 00:02.0
vfio-ctl: attached to group 16
$ qemu -device vfio-pci,host=00:01.0 -device vfio,host=00:01.0 ...

which should cover the usecase you prefer. Qemu still creates the
meta-group that allow the devices to share the same page-table. But what
should also be possible is:

# qemu -device vfio-pci,host=00:01.0 -device vfio-pci,host=00:02.0

In that case qemu detects that the devices are not yet bound to vfio and
will do so and also unbinds them afterwards (essentially the developer
use-case).

Your interface which requires pre-binding of devices into one group by
the administrator only makes sense if you want to force userspace to
use certain devices (which do not belong to the same hw-group) only
together. But I don't see a usecase for defining such constraints (yet).

	Joerg

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Alexander Graf @ 2011-08-26 14:07 UTC (permalink / raw)
  To: Roedel, Joerg
  Cc: chrisw, Alexey Kardashevskiy, kvm@vger.kernel.org, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, aafabbri, iommu,
	Avi Kivity, Anthony Liguori, linuxppc-dev, benve@cisco.com
In-Reply-To: <20110826093356.GP1923@amd.com>


On 26.08.2011, at 04:33, Roedel, Joerg wrote:

> On Fri, Aug 26, 2011 at 12:20:00AM -0400, David Gibson wrote:
>> On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote:
>>> On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
>>>> On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
>>>
>>>>> I don't see a reason to make this meta-grouping static. It would harm
>>>>> flexibility on x86. I think it makes things easier on power but there
>>>>> are options on that platform to get the dynamic solution too.
>>>>
>>>> I think several people are misreading what Ben means by "static".  I
>>>> would prefer to say 'persistent', in that the meta-groups lifetime is
>>>> not tied to an fd, but they can be freely created, altered and removed
>>>> during runtime.
>>>
>>> Even if it can be altered at runtime, from a usability perspective it is
>>> certainly the best to handle these groups directly in qemu. Or are there
>>> strong reasons to do it somewhere else?
>>
>> Funny, Ben and I think usability demands it be the other way around.
>
> The reason is that you mean the usability for the programmer and I mean
> it for the actual user of qemu :)

No, we mean the actual user of qemu. The reason being that making a
device available for any user space application is an administrative
task.

Forget the KVM case for a moment and think of a user space device
driver. I as a user am not root. But I as a user when having access to
/dev/vfioX want to be able to access the device and manage it - and
only it. The admin of that box needs to set it up properly for me to
be able to access it.

So having two steps is really the correct way to go:

  * create VFIO group
  * use VFIO group

because the two are done by completely different users. It's similar
to how tun/tap works in Linux too. Of course nothing keeps you from
also creating a group on the fly, but it shouldn't be the only
interface available. The persistent setup is definitely more useful.

>
>> If the meta-groups are transient - that is lifetime tied to an fd -
>> then any program that wants to use meta-groups *must* know the
>> interfaces for creating one, whatever they are.
>>
>> But if they're persistent, the admin can use other tools to create the
>> meta-group then just hand it to a program to use, since the interfaces
>> for _using_ a meta-group are identical to those for an atomic group.
>>
>> This doesn't preclude a program from being meta-group aware, and
>> creating its own if it wants to, of course.  My guess is that qemu
>> would not want to build its own meta-groups, but libvirt probably
>> would.
>
> Doing it in libvirt makes it really hard for a plain user of qemu to
> assign more than one device to a guest. What I want is that a user just
> types
>
> 	qemu -device vfio,host=00:01.0 -device vfio,host=00:02.0 ...
>
> and it just works. Qemu creates the meta-groups and they are
> automatically destroyed when qemu exits. That the programs are not aware
> of meta-groups is not a big problem because all software using vfio
> needs still to be written :)
>
> Btw, with this concept the programmer can still decide to not use
> meta-groups and just multiplex the mappings to all open device-fds it
> uses.

What I want to see is:

  # vfio-create 00:01.0
    /dev/vfio0
  # vfio-create -a /dev/vfio0 00:02.0
    /dev/vfio0

  $ qemu -vfio dev=/dev/vfio0,id=vfio0 -device vfio,vfio=vfio0.0 -device vfio,vfio=vfio0.1


Alex

^ permalink raw reply

* Re: [PATCH 1/2] [hw-breakpoint] Use generic hw-breakpoint interfaces for new PPC ptrace flags
From: K.Prasad @ 2011-08-26  9:35 UTC (permalink / raw)
  To: linuxppc-dev, Thiago Jung Bauermann, Edjunior Barbosa Machado
In-Reply-To: <20110824035939.GB30097@yookeroo.fritz.box>

On Wed, Aug 24, 2011 at 01:59:39PM +1000, David Gibson wrote:
> On Tue, Aug 23, 2011 at 02:55:13PM +0530, K.Prasad wrote:
> > On Tue, Aug 23, 2011 at 03:08:50PM +1000, David Gibson wrote:
> > > On Fri, Aug 19, 2011 at 01:21:36PM +0530, K.Prasad wrote:
> > > > PPC_PTRACE_GETHWDBGINFO, PPC_PTRACE_SETHWDEBUG and PPC_PTRACE_DELHWDEBUG are
> > > > PowerPC specific ptrace flags that use the watchpoint register. While they are
> > > > targeted primarily towards BookE users, user-space applications such as GDB
> > > > have started using them for BookS too.
> > > > 
> > > > This patch enables the use of generic hardware breakpoint interfaces for these
> > > > new flags. The version number of the associated data structures
> > > > "ppc_hw_breakpoint" and "ppc_debug_info" is incremented to denote new semantics.
> > > 
> > > So, the structure itself doesn't seem to have been extended.  I don't
> > > understand what the semantic difference is - your patch comment needs
> > > to explain this clearly.
> > >
> > 
> > We had a request to extend the structure but thought it was dangerous to
> > do so. For instance if the user-space used version1 of the structure,
> > while kernel did a copy_to_user() pertaining to version2, then we'd run
> > into problems. Unfortunately the ptrace flags weren't designed to accept
> > a version number as input from the user through the
> > PPC_PTRACE_GETHWDBGINFO flag (which would have solved this issue).
> 
> I still don't follow you.
> 

Two things here.

One, the change of semantics warranted an increment of the version
number. The new semantics accepts PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE on
BookS, while the old version number did not. I've added a small comment
in the code to this effect.

Two, regarding changes in the "ppc_hw_breakpoint" and "ppc_debug_info"
structures - we would like to add more members to it if we can (GDB has a
pending request to add more members to it). However the problem foreseen
is that there could be a mismatch between the versions of the structure
used by the user vs kernel-space i.e. if a new version of the structure,
known to the kernel, had an extra member while the user-space still had
the old version, then it becomes dangerous because the __copy_to_user
function would overflow the buffer size in user-space.

This could have been avoided if PPC_PTRACE_GETHWDBGINFO was originally
designed to accept a version number (and provide corresponding
"struct ppc_debug_info") rather than send a populated "ppc_debug_info"
structure along with the version number.
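
As an illustration of the alternative design mentioned above (the
caller states which version it understands and receives exactly that
layout), here is a user-space sketch.  The two structs are simplified
hypothetical stand-ins, not the real "struct ppc_debug_info", and
memcpy() stands in for the copy to user space.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified layouts; the real structure has more fields. */
struct dbg_info_v1 {
	uint32_t version;
	uint32_t num_data_bps;
};

struct dbg_info_v2 {
	uint32_t version;
	uint32_t num_data_bps;
	uint32_t features;	/* the member added in the newer version */
};

/* Fill 'user_buf' with the layout the caller asked for.  Returns the
 * number of bytes written, or -1 for an unknown version.  The caller's
 * buffer is never written past the size of the version it requested. */
static int fill_debug_info(uint32_t requested_version, void *user_buf)
{
	if (requested_version == 1) {
		struct dbg_info_v1 v1 = { .version = 1, .num_data_bps = 1 };

		memcpy(user_buf, &v1, sizeof(v1));
		return (int)sizeof(v1);
	}
	if (requested_version == 2) {
		struct dbg_info_v2 v2 = {
			.version = 2, .num_data_bps = 1, .features = 0x1,
		};

		memcpy(user_buf, &v2, sizeof(v2));
		return (int)sizeof(v2);
	}
	return -1;
}
```

An old binary that asks for version 1 keeps working against a kernel
that also knows version 2, because the extra member is never copied
into its smaller buffer.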

> > I'll add a comment w.r.t change in semantics - such as the ability to
> > accept 'range' breakpoints in BookS.
> >  
> > > > Apart from the usual benefits of using generic hw-breakpoint interfaces, these
> > > > changes allow debuggers (such as GDB) to use a common set of ptrace flags for
> > > > their watchpoint needs and allow more precise breakpoint specification (length
> > > > of the variable can be specified).
> > > 
> > > What is the mechanism for implementing the range breakpoint on book3s?
> > > 
> > 
> > The hw-breakpoint interface, accepts length as an argument in BookS (any
> > value <= 8 Bytes) and would filter out extraneous interrupts arising out
> > of accesses outside the range comprising <addr, addr + len> inside
> > hw_breakpoint_handler function.
> > 
> > We put that ability to use here.
> 
> Ah, so in hardware the breakpoints are always 8 bytes long, but you
> filter out false hits on a shorter range?  Of course, the utility of
> range breakpoints is questionable when length <=8, but the start must
> be aligned on an 8-byte boundary.
> 

Yes, we ensure that through 
+	attr.bp_addr = (unsigned long)bp_info->addr & ~HW_BREAKPOINT_ALIGN;
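
A small user-space sketch of the two steps discussed here, masking the
address down to the 8-byte hardware boundary and then filtering out
hits that fall outside the requested range.  SKETCH_ALIGN_MASK, hw_base()
and hit_is_wanted() are hypothetical names; the mask value 0x7 is
assumed from the 8-byte alignment described above, and the range is
treated as half-open, [addr, addr + len), which is also an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed low-bit mask for an 8-byte aligned hardware watchpoint. */
#define SKETCH_ALIGN_MASK 0x7UL

/* Address actually programmed into the hardware: rounded down to 8 bytes. */
static uint64_t hw_base(uint64_t addr)
{
	return addr & ~(uint64_t)SKETCH_ALIGN_MASK;
}

/* The hardware fires for any access in the aligned 8-byte window.
 * Report the hit only if the faulting address lies inside the range
 * the user actually asked for. */
static bool hit_is_wanted(uint64_t addr, uint64_t len, uint64_t fault_addr)
{
	return fault_addr >= addr && fault_addr - addr < len;
}
```

For a 2-byte watch at 0x1003 the hardware window starts at 0x1000; an
access at 0x1004 is reported, while accesses at 0x1000 or 0x1005 are
treated as extraneous and dropped.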

> [snip]
> > > >  	if ((unsigned long)bp_info->addr >= TASK_SIZE)
> > > >  		return -EIO;
> > > >  
> > > > @@ -1398,15 +1400,86 @@ static long ppc_set_hwdebug(struct task_struct *child,
> > > >  		dabr |= DABR_DATA_READ;
> > > >  	if (bp_info->trigger_type & PPC_BREAKPOINT_TRIGGER_WRITE)
> > > >  		dabr |= DABR_DATA_WRITE;
> > > > +#ifdef CONFIG_HAVE_HW_BREAKPOINT
> > > > +	if (bp_info->version == 1)
> > > > +		goto version_one;
> > > 
> > > There are several legitimate uses of goto in the kernel, but this is
> > > definitely not one of them.  You're essentially using it to put the
> > > old and new versions of the same function in one block.  Nasty.
> > > 
> > 
> > Maybe it's the label that's causing bother here. It might look elegant
> > if it was called something like exit_* or error_* :-)
> > 
> > The goto here helps reduce code, is similar to the error exits we use
> > everywhere.
> 
> Rubbish, it is not an exception exit at all, it is two separate code
> paths for the different versions which would be much clearer as two
> different functions.
> 

I've re-written this part of the code to avoid a goto statement.

> > > > +	if (ptrace_get_breakpoints(child) < 0)
> > > > +		return -ESRCH;
> > > >  
> > > > -	child->thread.dabr = dabr;
> > > > +	bp = thread->ptrace_bps[0];
> > > > +	if (!bp_info->addr) {
> > > > +		if (bp) {
> > > > +			unregister_hw_breakpoint(bp);
> > > > +			thread->ptrace_bps[0] = NULL;
> > > > +		}
> > > > +		ptrace_put_breakpoints(child);
> > > > +		return 0;
> > > 
> > > Why are you making setting a 0 watchpoint remove the existing one (I
> > > think that's what this does).  I thought there was an explicit del
> > > breakpoint operation instead.
> > 
> > We had to define the semantics for what writing a 0 to DABR could mean,
> > and I think it is intuitive to consider it as deletion
> > request...couldn't think of a case where DABR with addr=0 and RW=1 would
> > be required.
> 
> When a user space program maps pages at virtual address 0, which it
> can do.
> 

Agreed. I've removed the code under the if (!bp_info->addr) branch.

> > > > +	}
> > > > +	/*
> > > > +	 * Check if the request is for 'range' breakpoints. We can
> > > > +	 * support it if range < 8 bytes.
> > > > +	 */
> > > > +	if (bp_info->addr_mode == PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE)
> > > > +		len = bp_info->addr2 - bp_info->addr;
> > > 
> > > So you compute the length here, but I don't see you ever test if it is
> > > < 8 and return an error.
> > > 
> > 
> > The hw-breakpoint interfaces would fail if the length was > 8.
> 
> Ok.
> 
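
If an explicit check in the ptrace path were preferred over relying on the
hw-breakpoint layer to reject the request, it might look like the sketch
below. The helper name watch_range_len and the MAX_WATCH_LEN constant are
hypothetical; the limit of 8 is an assumption based on the 8-byte figure in
this discussion, and whether the bound is inclusive is not settled by the
thread.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_WATCH_LEN 8	/* assumed limit, from the 8-byte figure discussed */

/*
 * Hypothetical helper: returns the range length, or -EINVAL when the
 * range is reversed or wider than MAX_WATCH_LEN.
 */
long watch_range_len(uint64_t addr, uint64_t addr2)
{
	if (addr2 < addr)
		return -EINVAL;
	if (addr2 - addr > MAX_WATCH_LEN)
		return -EINVAL;
	return (long)(addr2 - addr);
}
```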
> > > > +	else if (bp_info->addr_mode != PPC_BREAKPOINT_MODE_EXACT) {
> > > > +			ptrace_put_breakpoints(child);
> > > > +			return -EINVAL;
> > > > +		}
> > > > +	if (bp) {
> > > > +		attr = bp->attr;
> > > > +		attr.bp_addr = (unsigned long)bp_info->addr & ~HW_BREAKPOINT_ALIGN;
> > > > +		arch_bp_generic_fields(dabr &
> > > > +					(DABR_DATA_WRITE | DABR_DATA_READ),
> > > > +							&attr.bp_type);
> > > > +		attr.bp_len = len;
> > > > +		ret =  modify_user_hw_breakpoint(bp, &attr);
> > > > +		if (ret) {
> > > > +			ptrace_put_breakpoints(child);
> > > > +			return ret;
> > > > +		}
> > > > +		thread->ptrace_bps[0] = bp;
> > > > +		ptrace_put_breakpoints(child);
> > > > +		thread->dabr = dabr;
> > > > +		return 0;
> > > > +	}
> > > >  
> > > > +	/* Create a new breakpoint request if one doesn't exist already */
> > > > +	hw_breakpoint_init(&attr);
> > > > +	attr.bp_addr = (unsigned long)bp_info->addr & ~HW_BREAKPOINT_ALIGN;
> > > 
> > > You seem to be silently masking the given address, which seems
> > > completely wrong.
> > > 
> > 
> > We have two ways of looking at the input address.
> > a) Assume that the input address is not multiplexed with the read/write
> > bits and return -EINVAL (for not conforming to the 8-byte alignment
> > requirement).
> > b) Consider the input address to be encoded with the read/write
> > watchpoint type request and align the address by default. This is how
> > the code behaves presently for the !CONFIG_HAVE_HW_BREAKPOINT case.
> 
> Hrm, ok, but this needs commenting.
> 

Added a comment to this effect.
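
As a rough illustration of option (b), the sketch below splits a raw value
into an 8-byte-aligned address and the read/write bits carried in its low
bits. The macro and function names are hypothetical; the 0x7 mask is assumed
to mirror HW_BREAKPOINT_ALIGN, and the bit positions are assumed to mirror
DABR_DATA_READ and DABR_DATA_WRITE.

```c
#include <stdint.h>

#define BP_ALIGN_MASK	0x7UL	/* assumed to mirror HW_BREAKPOINT_ALIGN */
#define BP_DATA_READ	0x1UL	/* assumed to mirror DABR_DATA_READ */
#define BP_DATA_WRITE	0x2UL	/* assumed to mirror DABR_DATA_WRITE */

/* Drop the low three bits so the address is 8-byte aligned. */
uint64_t bp_aligned_addr(uint64_t raw)
{
	return raw & ~(uint64_t)BP_ALIGN_MASK;
}

/* Recover the read/write request encoded in the low bits. */
unsigned int bp_type_bits(uint64_t raw)
{
	return (unsigned int)(raw & (BP_DATA_READ | BP_DATA_WRITE));
}
```

Under these assumptions, a raw value of 0x1003 yields the address 0x1000 with
both the read and write bits set.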

I'm pasting the modified patch below. Kindly let me know your comments.

    PPC_PTRACE_GETHWDBGINFO, PPC_PTRACE_SETHWDEBUG and PPC_PTRACE_DELHWDEBUG are
    PowerPC specific ptrace flags that use the watchpoint register. While they are
    targeted primarily towards BookE users, user-space applications such as GDB
    have started using them for BookS too.
    
    This patch enables the use of generic hardware breakpoint interfaces for these
    new flags. The version number of the associated data structures
    "ppc_hw_breakpoint" and "ppc_debug_info" is incremented to denote new semantics.
    
    Apart from the usual benefits of using generic hw-breakpoint interfaces, these
    changes allow debuggers (such as GDB) to use a common set of ptrace flags for
    their watchpoint needs and allow more precise breakpoint specification (length
    of the variable can be specified).
    
    [Edjunior: Identified an issue in the patch with the sanity check for version
    numbers]

diff --git a/Documentation/powerpc/ptrace.txt b/Documentation/powerpc/ptrace.txt
index f4a5499..97301ae 100644
--- a/Documentation/powerpc/ptrace.txt
+++ b/Documentation/powerpc/ptrace.txt
@@ -127,6 +127,22 @@ Some examples of using the structure to:
   p.addr2           = (uint64_t) end_range;
   p.condition_value = 0;
 
+- set a watchpoint in server processors (BookS) using version 2
+
+  p.version         = 2;
+  p.trigger_type    = PPC_BREAKPOINT_TRIGGER_RW;
+  p.addr_mode       = PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE;
+  or
+  p.addr_mode       = PPC_BREAKPOINT_MODE_EXACT;
+
+  p.condition_mode  = PPC_BREAKPOINT_CONDITION_NONE;
+  p.addr            = (uint64_t) begin_range;
+  /* For PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE addr2 needs to be specified, where
+   * addr2 - addr <= 8 Bytes.
+   */
+  p.addr2           = (uint64_t) end_range;
+  p.condition_value = 0;
+
 3. PTRACE_DELHWDEBUG
 
 Takes an integer which identifies an existing breakpoint or watchpoint
diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 05b7dd2..f9a4548 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -1339,11 +1339,17 @@ static int set_dac_range(struct task_struct *child,
 static long ppc_set_hwdebug(struct task_struct *child,
 		     struct ppc_hw_breakpoint *bp_info)
 {
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+	int ret, len = 0;
+	struct thread_struct *thread = &(child->thread);
+	struct perf_event *bp;
+	struct perf_event_attr attr;
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
 #ifndef CONFIG_PPC_ADV_DEBUG_REGS
 	unsigned long dabr;
 #endif
 
-	if (bp_info->version != 1)
+	if ((bp_info->version != 1) && (bp_info->version != 2))
 		return -ENOTSUPP;
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 	/*
@@ -1382,13 +1388,9 @@ static long ppc_set_hwdebug(struct task_struct *child,
 	 */
 	if ((bp_info->trigger_type & PPC_BREAKPOINT_TRIGGER_RW) == 0 ||
 	    (bp_info->trigger_type & ~PPC_BREAKPOINT_TRIGGER_RW) != 0 ||
-	    bp_info->addr_mode != PPC_BREAKPOINT_MODE_EXACT ||
 	    bp_info->condition_mode != PPC_BREAKPOINT_CONDITION_NONE)
 		return -EINVAL;
 
-	if (child->thread.dabr)
-		return -ENOSPC;
-
 	if ((unsigned long)bp_info->addr >= TASK_SIZE)
 		return -EIO;
 
@@ -1399,14 +1401,84 @@ static long ppc_set_hwdebug(struct task_struct *child,
 	if (bp_info->trigger_type & PPC_BREAKPOINT_TRIGGER_WRITE)
 		dabr |= DABR_DATA_WRITE;
 
-	child->thread.dabr = dabr;
+	if (bp_info->version == 1) {
+		if (bp_info->addr_mode != PPC_BREAKPOINT_MODE_EXACT)
+			return -EINVAL;
+		if (child->thread.dabr)
+			return -ENOSPC;
+		child->thread.dabr = dabr;
+		return 1;
+	}
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+	/*
+	 * We will use version = 2, to denote the use of
+	 * PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE mode of watchpoints.
+	 */
+	if (bp_info->version != 2)
+		return -EINVAL;
+	if (ptrace_get_breakpoints(child) < 0)
+		return -ESRCH;
 
+	bp = thread->ptrace_bps[0];
+	/*
+	 * Check if the request is for 'range' breakpoints. We can
+	 * support it if range < 8 bytes.
+	 */
+	if (bp_info->addr_mode == PPC_BREAKPOINT_MODE_RANGE_INCLUSIVE)
+		len = bp_info->addr2 - bp_info->addr;
+	else if (bp_info->addr_mode != PPC_BREAKPOINT_MODE_EXACT) {
+			ptrace_put_breakpoints(child);
+			return -EINVAL;
+		}
+	if (bp) {
+		attr = bp->attr;
+		/*
+		 * Consider the input address to be encoded with the read/write
+		 * watchpoint type request and align the address by default.
+		 */
+		attr.bp_addr = (unsigned long)bp_info->addr & ~HW_BREAKPOINT_ALIGN;
+		arch_bp_generic_fields(dabr &
+					(DABR_DATA_WRITE | DABR_DATA_READ),
+							&attr.bp_type);
+		attr.bp_len = len;
+		ret =  modify_user_hw_breakpoint(bp, &attr);
+		if (ret) {
+			ptrace_put_breakpoints(child);
+			return ret;
+		}
+		thread->ptrace_bps[0] = bp;
+		ptrace_put_breakpoints(child);
+		thread->dabr = dabr;
+		return 0;
+	}
+
+	/* Create a new breakpoint request if one doesn't exist already */
+	hw_breakpoint_init(&attr);
+	attr.bp_addr = (unsigned long)bp_info->addr & ~HW_BREAKPOINT_ALIGN;
+	attr.bp_len = len;
+	arch_bp_generic_fields(dabr & (DABR_DATA_WRITE | DABR_DATA_READ),
+								&attr.bp_type);
+
+	thread->ptrace_bps[0] = bp = register_user_hw_breakpoint(&attr,
+					       ptrace_triggered, NULL, child);
+	if (IS_ERR(bp)) {
+		thread->ptrace_bps[0] = NULL;
+		ptrace_put_breakpoints(child);
+		return PTR_ERR(bp);
+	}
+
+	ptrace_put_breakpoints(child);
 	return 1;
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
 #endif /* !CONFIG_PPC_ADV_DEBUG_DVCS */
 }
 
 static long ppc_del_hwdebug(struct task_struct *child, long addr, long data)
 {
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+	struct thread_struct *thread = &(child->thread);
+	struct perf_event *bp;
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 	int rc;
 
@@ -1426,10 +1498,24 @@ static long ppc_del_hwdebug(struct task_struct *child, long addr, long data)
 #else
 	if (data != 1)
 		return -EINVAL;
+
+#ifdef CONFIG_HAVE_HW_BREAKPOINT
+	if (ptrace_get_breakpoints(child) < 0)
+		return -ESRCH;
+
+	bp = thread->ptrace_bps[0];
+	if (bp) {
+		unregister_hw_breakpoint(bp);
+		thread->ptrace_bps[0] = NULL;
+	}
+	ptrace_put_breakpoints(child);
+	return 0;
+#else /* CONFIG_HAVE_HW_BREAKPOINT */
 	if (child->thread.dabr == 0)
 		return -ENOENT;
 
 	child->thread.dabr = 0;
+#endif /* CONFIG_HAVE_HW_BREAKPOINT */
 
 	return 0;
 #endif
@@ -1536,7 +1622,8 @@ long arch_ptrace(struct task_struct *child, long request,
 	case PPC_PTRACE_GETHWDBGINFO: {
 		struct ppc_debug_info dbginfo;
 
-		dbginfo.version = 1;
+		/* We return the highest version number supported */
+		dbginfo.version = 2;
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 		dbginfo.num_instruction_bps = CONFIG_PPC_ADV_DEBUG_IACS;
 		dbginfo.num_data_bps = CONFIG_PPC_ADV_DEBUG_DACS;
@@ -1560,7 +1647,7 @@ long arch_ptrace(struct task_struct *child, long request,
 		dbginfo.data_bp_alignment = 4;
 #endif
 		dbginfo.sizeof_condition = 0;
-		dbginfo.features = 0;
+		dbginfo.features = PPC_DEBUG_FEATURE_DATA_BP_RANGE;
 #endif /* CONFIG_PPC_ADV_DEBUG_REGS */
 
 		if (!access_ok(VERIFY_WRITE, datavp,

^ permalink raw reply related

* Re: kvm PCI assignment & VFIO ramblings
From: Roedel, Joerg @ 2011-08-26  9:33 UTC (permalink / raw)
  To: aafabbri, Alexey Kardashevskiy, kvm@vger.kernel.org,
	Paul Mackerras, linux-pci@vger.kernel.org, qemu-devel, chrisw,
	iommu, Avi Kivity, Anthony Liguori, linuxppc-dev, benve@cisco.com
In-Reply-To: <20110826042000.GE2308@yookeroo.fritz.box>

On Fri, Aug 26, 2011 at 12:20:00AM -0400, David Gibson wrote:
> On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote:
> > On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote:
> > > On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote:
> > 
> > > > I don't see a reason to make this meta-grouping static. It would harm
> > > > flexibility on x86. I think it makes things easier on power but there
> > > > are options on that platform to get the dynamic solution too.
> > > 
> > > I think several people are misreading what Ben means by "static".  I
> > > would prefer to say 'persistent', in that the meta-group's lifetime is
> > > not tied to an fd, but they can be freely created, altered and removed
> > > during runtime.
> > 
> > Even if it can be altered at runtime, from a usability perspective it is
> > certainly the best to handle these groups directly in qemu. Or are there
> > strong reasons to do it somewhere else?
> 
> Funny, Ben and I think usability demands it be the other way around.

The reason is that you mean the usability for the programmer and I mean
it for the actual user of qemu :)

> If the meta-groups are transient - that is lifetime tied to an fd -
> then any program that wants to use meta-groups *must* know the
> interfaces for creating one, whatever they are.
> 
> But if they're persistent, the admin can use other tools to create the
> meta-group then just hand it to a program to use, since the interfaces
> for _using_ a meta-group are identical to those for an atomic group.
> 
> This doesn't preclude a program from being meta-group aware, and
> creating its own if it wants to, of course.  My guess is that qemu
> would not want to build its own meta-groups, but libvirt probably
> would.

Doing it in libvirt makes it really hard for a plain user of qemu to
assign more than one device to a guest. What I want is that a user just
types

	qemu -device vfio,host=00:01.0 -device vfio,host=00:02.0 ...

and it just works. Qemu creates the meta-groups and they are
automatically destroyed when qemu exits. That the programs are not aware
of meta-groups is not a big problem because all software using vfio
still needs to be written :)

Btw, with this concept the programmer can still decide to not use
meta-groups and just multiplex the mappings to all open device-fds it
uses.

	Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

^ permalink raw reply
