LinuxPPC-Dev Archive on lore.kernel.org
* [PATCH 1/1] powerpc/perf: Adjust callchain based on DWARF debug info
From: Sukadev Bhattiprolu @ 2014-05-10  2:46 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Michael Ellerman, Anton Blanchard, linux-kernel, Ulrich.Weigand,
	Maynard Johnson, linuxppc-dev

[PATCH 1/1] powerpc/perf: Adjust callchain based on DWARF debug info

When saving the callchain on Power, the kernel conservatively saves excess
entries in the callchain. A few of these entries are needed in some cases
but not others.

E.g., the value in the link register (LR) is needed only when it holds the
return address of a function. At other times it must be ignored.

If the unnecessary entries are not ignored, we end up with duplicate arcs
in the call-graphs.

Use DWARF debug information to ignore the unnecessary entries.
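
A minimal sketch of the idea, not the patch itself. It assumes the userspace
callchain is laid out as [context marker, PC, LR, caller's caller, ...] and
that some DWARF lookup has already classified where the return address lives.
The enum, pick_skip_slot() and filter_chain() are hypothetical names used only
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for what a DWARF lookup reports about a PC. */
enum ra_location {
	RA_UNKNOWN = -1,	/* lookup failed */
	RA_ON_STACK = 0,	/* return address already saved on the stack */
	RA_IN_LR_NO_FRAME = 1,	/* return address in LR, no new frame */
	RA_IN_LR_NEW_FRAME = 2	/* return address in LR, frame allocated but unused */
};

/* Return the callchain slot to ignore, or -1 to keep every entry. */
static int pick_skip_slot(enum ra_location loc)
{
	switch (loc) {
	case RA_ON_STACK:
		return 2;	/* LR does not hold a return address here */
	case RA_IN_LR_NEW_FRAME:
		return 3;	/* the slot after LR is undefined */
	default:
		return -1;	/* keep everything, also on errors */
	}
}

/* Copy src[0..nr) into dst, dropping skip_slot if set; returns entries copied. */
static size_t filter_chain(const uint64_t *src, size_t nr, int skip_slot,
			   uint64_t *dst)
{
	size_t i, out = 0;

	assert(src != NULL && dst != NULL);
	for (i = 0; i < nr; i++) {
		if ((int)i == skip_slot)
			continue;
		dst[out++] = src[i];
	}
	return out;
}
```

With a five-entry chain and the return address on the stack, filter_chain()
would drop slot 2 (the LR value) and keep the other four entries in order.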

Callgraph before the patch:

    14.67%          2234  sprintft  libc-2.18.so       [.] __random
            |
            --- __random
               |
               |--61.12%-- __random
               |          |
               |          |--97.15%-- rand
               |          |          do_my_sprintf
               |          |          main
               |          |          generic_start_main.isra.0
               |          |          __libc_start_main
               |          |          0x0
               |          |
               |           --2.85%-- do_my_sprintf
               |                     main
               |                     generic_start_main.isra.0
               |                     __libc_start_main
               |                     0x0
               |
                --38.88%-- rand
                          |
                          |--94.01%-- rand
                          |          do_my_sprintf
                          |          main
                          |          generic_start_main.isra.0
                          |          __libc_start_main
                          |          0x0
                          |
                           --5.99%-- do_my_sprintf
                                     main
                                     generic_start_main.isra.0
                                     __libc_start_main
                                     0x0

Callgraph after the patch:

    14.67%          2234  sprintft  libc-2.18.so       [.] __random
            |
            --- __random
               |
               |--95.93%-- rand
               |          do_my_sprintf
               |          main
               |          generic_start_main.isra.0
               |          __libc_start_main
               |          0x0
               |
                --4.07%-- do_my_sprintf
                          main
                          generic_start_main.isra.0
                          __libc_start_main
                          0x0

TODO:	For split-debug info objects like glibc, we can determine
	the call-frame-address only when both .eh_frame and .debug_info
	sections are available. We should be able to determine the CFA
	even without the .eh_frame section.

Thanks to Ulrich Weigand for help with DWARF debug information.

Fix suggested by Anton Blanchard.

Reported-by: Maynard Johnson <maynard@us.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 tools/perf/arch/powerpc/Makefile                |   1 +
 tools/perf/arch/powerpc/util/adjust-callchain.c | 278 ++++++++++++++++++++++++
 tools/perf/config/Makefile                      |   5 +
 tools/perf/util/callchain.h                     |  12 +
 tools/perf/util/machine.c                       |  16 +-
 5 files changed, 310 insertions(+), 2 deletions(-)
 create mode 100644 tools/perf/arch/powerpc/util/adjust-callchain.c

diff --git a/tools/perf/arch/powerpc/Makefile b/tools/perf/arch/powerpc/Makefile
index 744e629..512cc8d 100644
--- a/tools/perf/arch/powerpc/Makefile
+++ b/tools/perf/arch/powerpc/Makefile
@@ -3,3 +3,4 @@ PERF_HAVE_DWARF_REGS := 1
 LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/dwarf-regs.o
 endif
 LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/header.o
+LIB_OBJS += $(OUTPUT)arch/$(ARCH)/util/adjust-callchain.o
diff --git a/tools/perf/arch/powerpc/util/adjust-callchain.c b/tools/perf/arch/powerpc/util/adjust-callchain.c
new file mode 100644
index 0000000..31b1f95
--- /dev/null
+++ b/tools/perf/arch/powerpc/util/adjust-callchain.c
@@ -0,0 +1,278 @@
+/*
+ * Use DWARF Debug information to skip unnecessary callchain entries.
+ *
+ * Copyright (C) 2014 Sukadev Bhattiprolu, IBM Corporation.
+ * Copyright (C) 2014 Ulrich Weigand, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+#include <inttypes.h>
+#include <dwarf.h>
+#include <elfutils/libdwfl.h>
+
+#include "util/thread.h"
+#include "util/callchain.h"
+
+/*
+ * When saving the callchain on Power, the kernel conservatively saves
+ * excess entries in the callchain. A few of these entries are needed
+ * in some cases but not others. If the unnecessary entries are not
+ * ignored, we end up with duplicate arcs in the call-graphs. Use
+ * DWARF debug information to skip over any unnecessary callchain
+ * entries.
+ * 
+ * See function header for arch_adjust_callchain() below for more details.
+ *
+ * The libdwfl code in this file is based on code from elfutils
+ * (libdwfl/argp-std.c, libdwfl/tests/addrcfi.c, etc).
+ */
+static char *debuginfo_path;
+
+static const Dwfl_Callbacks offline_callbacks = {
+	.debuginfo_path = &debuginfo_path,
+	.find_debuginfo = dwfl_standard_find_debuginfo,
+	.section_address = dwfl_offline_section_address,
+};
+
+
+/*
+ * Use the DWARF expression for the Call-frame-address and determine
+ * if return address is in LR and if a new frame was allocated.
+ */
+static int check_return_reg(int ra_regno, Dwarf_Frame *frame)
+{
+	Dwarf_Op ops_mem[2];
+	Dwarf_Op dummy;
+	Dwarf_Op *ops = &dummy;
+  	size_t nops;
+	int result;
+
+	result = dwarf_frame_register(frame, ra_regno, ops_mem, &ops, &nops);
+	if (result < 0) {
+		pr_debug("dwarf_frame_register() %s\n", dwarf_errmsg(-1));
+		return -1;
+	}
+
+	/* 
+	 * Check if return address is on the stack.
+	 */
+	if (nops != 0 || ops != NULL)
+		return 0;
+
+	/*
+	 * Return address is in LR. Check if a frame was allocated
+	 * but not-yet used.
+	 */
+        result = dwarf_frame_cfa(frame, &ops, &nops);
+        if (result < 0) {
+                pr_debug("dwarf_frame_cfa() returns %d, %s\n", result,
+					dwarf_errmsg(-1));
+                return -1;
+        }
+
+	/*
+	 * If call frame address is in r1, no new frame was allocated.
+	 */
+        if (nops == 1 && ops[0].atom == DW_OP_bregx && ops[0].number == 1 &&
+					ops[0].number2 == 0)
+                return 1;      
+
+	/*
+	 * A new frame was allocated but has not yet been used.
+	 */
+        return 2;
+}
+
+/*
+ * Get the DWARF frame from the .eh_frame section.
+ */
+static Dwarf_Frame *get_eh_frame(Dwfl_Module *mod, Dwarf_Addr pc) 
+{
+        int             result;
+        Dwarf_Addr      bias;
+        Dwarf_CFI       *cfi;
+        Dwarf_Frame     *frame;
+
+        cfi = dwfl_module_eh_cfi(mod, &bias);   // CHECK
+        if (!cfi) {
+                pr_debug("%s(): no CFI - %s\n", __func__, dwfl_errmsg(-1));
+                return NULL;
+        }
+
+        result = dwarf_cfi_addrframe(cfi, pc, &frame);
+        if (result) {
+                pr_debug("%s(): %s\n", __func__, dwarf_errmsg(-1));
+                return NULL;
+        }
+
+        return frame;
+}
+
+/*
+ * Get the DWARF frame from the .debug_frame section.
+ */
+static Dwarf_Frame *get_dwarf_frame(Dwfl_Module *mod, Dwarf_Addr pc)
+{
+        Dwarf_CFI       *cfi;
+        Dwarf_Addr      bias;
+        Dwarf_Frame     *frame;
+        int             result;
+
+        cfi = dwfl_module_dwarf_cfi(mod, &bias);
+        if (!cfi) {
+                pr_debug("%s(): no CFI - %s\n", __func__, dwfl_errmsg(-1));
+                return NULL;
+        }
+
+        result = dwarf_cfi_addrframe(cfi, pc, &frame);
+        if (result) {
+                pr_debug("%s(): %s\n", __func__, dwarf_errmsg(-1));
+                return NULL;
+        }
+
+        return frame;
+}
+
+/*
+ * Return:
+ * 	0 if return address for the program counter @pc is on stack
+ * 	1 if return address is in LR and no new stack frame was allocated
+ * 	2 if return address is in LR and a new frame was allocated (but not
+ * 	  yet used)
+ * 	-1 in case of errors
+ */
+static int check_return_addr(const char *exec_file, Dwarf_Addr pc)
+{
+	Dwfl		*dwfl;
+	Dwfl_Module	*mod;
+	Dwarf_Frame	*frame;
+	int		ra_regno;
+	Dwarf_Addr 	start = pc;
+	Dwarf_Addr 	end = pc;
+	bool 		signalp;
+
+	pr_debug("Testing: %s, @0x%llx\n", exec_file, (unsigned long long)pc);
+
+	dwfl = dwfl_begin(&offline_callbacks);
+	if (!dwfl) {
+		pr_debug("dwfl_begin() failed: %s\n", dwfl_errmsg(-1));
+		return -1;
+	}
+
+	if (dwfl_report_offline(dwfl, "",  exec_file, -1) == NULL) {
+		pr_debug("dwfl_report_offline() failed %s\n", dwfl_errmsg(-1));
+		return -1;
+	}
+
+	mod = dwfl_addrmodule(dwfl, pc);
+	if (!mod) {
+		pr_debug("dwfl_addrmodule() failed, %s\n", dwfl_errmsg(-1));
+		return -1;
+	}
+
+	/*
+	 * To work with split debug info files (eg: glibc), check both
+	 * .eh_frame and .debug_frame sections of the ELF header.
+	 */
+	frame = get_eh_frame(mod, pc);
+	if (!frame) {
+		frame = get_dwarf_frame(mod, pc);
+		if (!frame)
+			return -1;
+	}
+
+	ra_regno = dwarf_frame_info (frame, &start, &end, &signalp);
+	if (ra_regno < 0) {
+		pr_debug("Return address register unavailable: %s\n",
+				dwarf_errmsg(-1));
+		return -1;
+	}
+
+	return check_return_reg(ra_regno, frame);
+}
+
+/*
+ * The callchain saved by the kernel always includes the link register (LR).
+ * 	
+ * 	0:	PERF_CONTEXT_USER
+ * 	1:	Program counter (Next instruction pointer)
+ * 	2:	LR value
+ * 	3:	Caller's caller
+ * 	4:	...
+ *
+ * The value in LR is only needed when it holds a return address. If the
+ * return address is on the stack, we should ignore the LR value.
+ *
+ * Further, when the return address is in the LR, if a new frame was just
+ * allocated but the LR was not saved into it, then the LR contains the
+ * caller, slot 4: contains the caller's caller and the contents of slot 3:
+ * (chain->ips[3]) is undefined and must be ignored.
+ *
+ * Use DWARF debug information to determine if any entries need to be skipped.
+ *
+ * Return:
+ * 	index:	of callchain entry that needs to be ignored (if any)
+ * 	-1 	if no entry needs to be ignored or in case of errors
+ *
+ * TODO:
+ * 	Rather than returning an index into the callchain and have the
+ * 	caller skip that entry, we could modify the callchain in-place
+ * 	by putting a PERF_CONTEXT_IGNORE marker in the affected entry.
+ *
+ * 	But @chain points to read-only mmap, so the caller needs to
+ * 	duplicate the callchain to modify in-place - something like:
+ *
+ * 		new_callchain = arch_duplicate_callchain()
+ * 		arch_adjust_callchain(new_callchain)
+ * 		arch_free_callchain(new_callchain)
+ *
+ * 	Since we only expect to adjust <= 1 entry for now, just return
+ * 	the index.
+ */
+int arch_adjust_callchain(struct machine *machine, struct thread *thread,
+				struct ip_callchain *chain)
+{
+	struct addr_location al;
+	struct dso *dso = NULL;
+	int rc;
+	u64 ip;
+	int skip_slot = -1;
+	
+	if (chain->nr < 3)
+		return skip_slot;
+
+	ip = chain->ips[2];
+
+	thread__find_addr_location(thread, machine, PERF_RECORD_MISC_USER,
+			MAP__FUNCTION, ip, &al);
+
+	if (al.map)
+		dso = al.map->dso;
+
+	if (!dso) {
+		pr_debug("%" PRIx64 " dso is NULL\n", ip);
+		return skip_slot;
+	}
+
+	rc = check_return_addr(dso->long_name, ip);
+
+	pr_debug("DSO %s, nr %" PRIx64 ", ip 0x%" PRIx64 " rc %d\n",
+				dso->long_name, chain->nr, ip, rc);
+
+	if (rc == 0) {
+		/* 
+		 * Return address on stack. Ignore LR value in callchain
+		 */
+		skip_slot = 2;
+	} else if (rc == 2) {
+		/* 
+		 * New frame allocated but return address still in LR.
+		 * Ignore the caller's caller entry in callchain.
+		 */
+		skip_slot = 3;
+	}
+	return skip_slot;
+}
diff --git a/tools/perf/config/Makefile b/tools/perf/config/Makefile
index 5a3c452..7e93877 100644
--- a/tools/perf/config/Makefile
+++ b/tools/perf/config/Makefile
@@ -29,11 +29,16 @@ ifeq ($(ARCH),x86)
   endif
   NO_PERF_REGS := 0
 endif
+
 ifeq ($(ARCH),arm)
   NO_PERF_REGS := 0
   LIBUNWIND_LIBS = -lunwind -lunwind-arm
 endif
 
+ifeq ($(ARCH),powerpc)
+  CFLAGS += -DHAVE_ADJUST_CALLCHAIN
+endif
+
 ifeq ($(LIBUNWIND_LIBS),)
   NO_LIBUNWIND := 1
 else
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 8ad97e9..81ecb90 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -157,4 +157,16 @@ int sample__resolve_callchain(struct perf_sample *sample, struct symbol **parent
 int hist_entry__append_callchain(struct hist_entry *he, struct perf_sample *sample);
 
 extern const char record_callchain_help[];
+
+#ifdef HAVE_ADJUST_CALLCHAIN
+extern int arch_adjust_callchain(struct machine *machine,
+			struct thread *thread, struct ip_callchain *chain);
+#else
+static inline int arch_adjust_callchain(struct machine *machine,
+			struct thread *thread, struct ip_callchain *chain)
+{
+	return -1;
+}
+#endif
+
 #endif	/* __PERF_CALLCHAIN_H */
diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index a53cd0b..dce3bf0 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1271,6 +1271,7 @@ static int machine__resolve_callchain_sample(struct machine *machine,
 	int chain_nr = min(max_stack, (int)chain->nr);
 	int i;
 	int err;
+	int skip_slot;
 
 	callchain_cursor_reset(&callchain_cursor);
 
@@ -1279,14 +1280,25 @@ static int machine__resolve_callchain_sample(struct machine *machine,
 		return 0;
 	}
 
+	/*
+	 * Based on DWARF debug information, some architectures skip
+	 * some of the callchain entries saved by the kernel.
+	 */
+	skip_slot = arch_adjust_callchain(machine, thread, chain);
+
 	for (i = 0; i < chain_nr; i++) {
 		u64 ip;
 		struct addr_location al;
 
-		if (callchain_param.order == ORDER_CALLEE)
+		if (callchain_param.order == ORDER_CALLEE) {
+			if (i == skip_slot)
+				continue;
 			ip = chain->ips[i];
-		else
+		} else {
+			if ((int)(chain->nr - i - 1) == skip_slot)
+				continue;
 			ip = chain->ips[chain->nr - i - 1];
+		}
 
 		if (ip >= PERF_CONTEXT_MAX) {
 			switch (ip) {
-- 
1.8.4.2
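
A small sketch of how a skipped slot interacts with the two traversal orders
in the loop above. This is a simplified model under stated assumptions, not
the perf code: walk_chain() and the enum names are hypothetical, and the
function records only which raw indices would be visited.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; they only mirror the callee/caller ordering idea. */
enum chain_order { ORDER_CALLEE_FIRST, ORDER_CALLER_FIRST };

/*
 * Visit at most max_visits positions of a chain with nr entries, in the
 * requested order, and record the raw index of every entry that is not
 * skipped. Returns the number of indices stored in visited[].
 */
static size_t walk_chain(size_t nr, size_t max_visits, enum chain_order order,
			 int skip_slot, size_t *visited)
{
	size_t limit = max_visits < nr ? max_visits : nr;
	size_t i, out = 0;

	assert(visited != NULL);
	for (i = 0; i < limit; i++) {
		/* Caller-first order walks the same array from the end. */
		size_t slot = (order == ORDER_CALLEE_FIRST) ? i : nr - i - 1;

		if ((int)slot == skip_slot)
			continue;	/* still consumes one loop iteration */
		visited[out++] = slot;
	}
	return out;
}
```

In this model the skipped slot still uses up one iteration, so a limit of 3 in
caller-first order over 5 entries with slot 2 skipped yields only indices 4
and 3. The loop in the patch appears to behave the same way, since it uses
continue inside a loop bounded by chain_nr.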


* [PATCH] powerpc: Fix "attempt to move .org backwards" error (again)
From: Guenter Roeck @ 2014-05-10  0:07 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Paul Mackerras, linux-kernel, Guenter Roeck

Commit 4e243b7 (powerpc: Fix "attempt to move .org backwards" error) fixes the
allyesconfig build by moving machine_check_common to a different location.
While this fixes most of the errors, both allmodconfig and allyesconfig still
fail as follows.

arch/powerpc/kernel/exceptions-64s.S:1315: Error: attempt to move .org backwards

Fix by moving machine_check_common after the offending address.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
---
This fixes the build error, but unfortunately I don't have a system to test
the resulting image.

 arch/powerpc/kernel/exceptions-64s.S | 49 ++++++++++++++++++------------------
 1 file changed, 24 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 3afd391..25398be 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1138,31 +1138,6 @@ unrecov_user_slb:
 
 #endif /* __DISABLED__ */
 
-
-	/*
-	 * Machine check is different because we use a different
-	 * save area: PACA_EXMC instead of PACA_EXGEN.
-	 */
-	.align	7
-	.globl machine_check_common
-machine_check_common:
-
-	mfspr	r10,SPRN_DAR
-	std	r10,PACA_EXGEN+EX_DAR(r13)
-	mfspr	r10,SPRN_DSISR
-	stw	r10,PACA_EXGEN+EX_DSISR(r13)
-	EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC)
-	FINISH_NAP
-	DISABLE_INTS
-	ld	r3,PACA_EXGEN+EX_DAR(r13)
-	lwz	r4,PACA_EXGEN+EX_DSISR(r13)
-	std	r3,_DAR(r1)
-	std	r4,_DSISR(r1)
-	bl	.save_nvgprs
-	addi	r3,r1,STACK_FRAME_OVERHEAD
-	bl	.machine_check_exception
-	b	.ret_from_except
-
 	.align	7
 	.globl alignment_common
 alignment_common:
@@ -1328,6 +1303,30 @@ fwnmi_data_area:
 initial_stab:
 	.space	4096
 
+	/*
+	 * Machine check is different because we use a different
+	 * save area: PACA_EXMC instead of PACA_EXGEN.
+	 */
+	.align	7
+	.globl machine_check_common
+machine_check_common:
+
+	mfspr	r10,SPRN_DAR
+	std	r10,PACA_EXGEN+EX_DAR(r13)
+	mfspr	r10,SPRN_DSISR
+	stw	r10,PACA_EXGEN+EX_DSISR(r13)
+	EXCEPTION_PROLOG_COMMON(0x200, PACA_EXMC)
+	FINISH_NAP
+	DISABLE_INTS
+	ld	r3,PACA_EXGEN+EX_DAR(r13)
+	lwz	r4,PACA_EXGEN+EX_DSISR(r13)
+	std	r3,_DAR(r1)
+	std	r4,_DSISR(r1)
+	bl	.save_nvgprs
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+	bl	.machine_check_exception
+	b	.ret_from_except
+
 #ifdef CONFIG_PPC_POWERNV
 _GLOBAL(opal_mc_secondary_handler)
 	HMT_MEDIUM_PPR_DISCARD
-- 
1.9.1


* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
From: Gabriel Paubert @ 2014-05-09 21:50 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linuxppc-dev, paulus, Anton Blanchard
In-Reply-To: <20140509134113.GP8754@linux.vnet.ibm.com>

On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote:
> On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
> > I am seeing an issue where a CPU running perf eventually hangs.
> > Traces show timer interrupts happening every 4 seconds even
> > when a userspace task is running on the CPU.
> 
> Is this by chance every 4.2 seconds?  The reason I ask is that
> Paul Clarke and I are seeing an interrupt every 4.2 seconds when
> he runs NO_HZ_FULL, and are trying to get rid of it.  ;-)

Hmmm, it's close to 2^32 nanoseconds, isn't it suspicious?

	Gabriel
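
The arithmetic behind this observation: 2^32 ns is 4,294,967,296 ns, or about
4.29 s. A short sketch follows; the helper names are illustrative only, and
whether a 32-bit nanosecond quantity is really involved is the open question
in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Number of nanoseconds after which a counter of the given width wraps. */
static uint64_t wrap_period_ns(unsigned int bits)
{
	assert(bits > 0 && bits < 64);
	return (uint64_t)1 << bits;
}

/* Convert nanoseconds to seconds. */
static double ns_to_seconds(uint64_t ns)
{
	return (double)ns / 1e9;
}
```

ns_to_seconds(wrap_period_ns(32)) lands between 4.29 and 4.30, close to the
interval reported above.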


* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
From: Paul E. McKenney @ 2014-05-09 22:08 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: linuxppc-dev, paulus, Anton Blanchard
In-Reply-To: <20140509215005.GA28239@visitor2.iram.es>

On Fri, May 09, 2014 at 11:50:05PM +0200, Gabriel Paubert wrote:
> On Fri, May 09, 2014 at 06:41:13AM -0700, Paul E. McKenney wrote:
> > On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
> > > I am seeing an issue where a CPU running perf eventually hangs.
> > > Traces show timer interrupts happening every 4 seconds even
> > > when a userspace task is running on the CPU.
> > 
> > Is this by chance every 4.2 seconds?  The reason I ask is that
> > Paul Clarke and I are seeing an interrupt every 4.2 seconds when
> > he runs NO_HZ_FULL, and are trying to get rid of it.  ;-)
> 
> Hmmm, it's close to 2^32 nanoseconds, isn't it suspicious?

Now that you mention it...  ;-)

So you are telling me that we are not succeeding in completely turning
off the decrementer interrupt?

							Thanx, Paul


* linux-next: add scottwood/linux.git
From: Scott Wood @ 2014-05-09 21:15 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: linux-next, linuxppc-dev
In-Reply-To: <1395709794.12479.411.camel@snotra.buserror.net>

On Mon, 2014-03-24 at 20:09 -0500, Scott Wood wrote:
> On Mon, 2014-03-24 at 10:33 +1100, Benjamin Herrenschmidt wrote:
> > On Mon, 2014-03-24 at 10:16 +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2014-03-19 at 23:25 -0500, Scott Wood wrote:
> > > > The following changes since commit c7e64b9ce04aa2e3fad7396d92b5cb92056d16ac:
> > > > 
> > > >   powerpc/powernv Platform dump interface (2014-03-07 16:19:10 +1100)
> > > > 
> > > > are available in the git repository at:
> > > > 
> > > >   git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux.git next
> > > > 
> > > > for you to fetch changes up to 48b16180d0d91324e5d2423c6d53d97bbe3dcc14:
> > > > 
> > > >   fsl/pci: The new pci suspend/resume implementation (2014-03-19 22:37:44 -0500)
> > > 
> > > Stephen just informed me that your tree wasn't in -next ... Kumar's
> > > still is.
> > > 
> > > Can you guys fix that up ? I somewhat rely on the FSL stuff to simmer
> > > in -next on its own.
> 
> Stephen, what's the process for adding a tree?

ping

-Scott


> 
> I suppose we should update MAINTAINERS while we're at it.
> 
> > Oh and where is my little summary to put in the merge commit ?
> > 
> > I made one up for this time around.
> 
> Oops, forgot again.  Now I've added something to the script I use to
> generate pull requests, to give me a reminder.
> 
> -Scott
> 


* Re: [v6,3/5] powerpc/book3e: support kgdb for kernel space
From: Scott Wood @ 2014-05-09 19:36 UTC (permalink / raw)
  To: Tiejun Chen; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1382520685-11609-4-git-send-email-tiejun.chen@windriver.com>

On Wed, Oct 23, 2013 at 05:31:23PM +0800, Tiejun Chen wrote:
> Currently we need to skip this for supporting KGDB.
> 
> Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
> 
> ---
> arch/powerpc/kernel/exceptions-64e.S |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
> index a55cf62..0b750c6 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -597,11 +597,13 @@ kernel_dbg_exc:
>  	rfdi
>  
>  	/* Normal debug exception */
> +1:	andi.	r14,r11,MSR_PR;		/* check for userspace again */
> +#ifndef CONFIG_KGDB
>  	/* XXX We only handle coming from userspace for now since we can't
>  	 *     quite save properly an interrupted kernel state yet
>  	 */
> -1:	andi.	r14,r11,MSR_PR;		/* check for userspace again */
>  	beq	kernel_dbg_exc;		/* if from kernel mode */
> +#endif

Now that we have support for properly saving state on special level
exceptions, that should be used here.  With the above patch, what happens
if e.g. a debug exception fires during a TLB miss, and the kgdb handler
takes its own TLB miss accessing the serial port?

-Scott


* Re: [PATCHv2] powerpc/85xx: Add OCA4080 board support
From: Scott Wood @ 2014-05-09 19:24 UTC (permalink / raw)
  To: Martijn de Gouw; +Cc: stef.van.os, Martijn de Gouw, linuxppc-dev
In-Reply-To: <1397584306-8706-1-git-send-email-martijn.de.gouw@prodrive-technologies.com>

On Tue, Apr 15, 2014 at 07:51:46PM +0200, Martijn de Gouw wrote:
> diff --git a/arch/powerpc/platforms/85xx/corenet_generic.c b/arch/powerpc/platforms/85xx/corenet_generic.c
> index fbd871e..f3685047 100644
> --- a/arch/powerpc/platforms/85xx/corenet_generic.c
> +++ b/arch/powerpc/platforms/85xx/corenet_generic.c
> @@ -55,8 +55,6 @@ void __init corenet_gen_setup_arch(void)
>  	mpc85xx_smp_init();
>  
>  	swiotlb_detect_4g();
> -
> -	pr_info("%s board from Freescale Semiconductor\n", ppc_md.name);

Valentin's patch kept this line but removed "from Freescale
Semiconductor"; I'll leave it like that when applying.

-Scott


* Re: [PATCH 1/1] booke/watchdog: refine and clean up the codes
From: Guenter Roeck @ 2014-05-09 17:44 UTC (permalink / raw)
  To: Yuantian.Tang; +Cc: scottwood, wim, linuxppc-dev, linux-watchdog
In-Reply-To: <1399514666-2572-1-git-send-email-Yuantian.Tang@freescale.com>

On Thu, May 08, 2014 at 10:04:26AM +0800, Yuantian.Tang@freescale.com wrote:
> From: Tang Yuantian <yuantian.tang@freescale.com>
> 
> Basically, this patch does the following:
> 1. Move the code for parsing boot parameters from setup-common.c
>    to the driver. In this way, a reader of the code can see directly
>    that there are boot parameters that can change the timeout.
> 2. Make the boot parameter 'booke_wdt_period' effective.
>    Currently, when the driver is loaded, the default timeout is always
>    used instead of booke_wdt_period.
> 3. Wrap up the watchdog timeout in the device struct and clean up
>    unnecessary code.
> 
> Signed-off-by: Tang Yuantian <yuantian.tang@freescale.com>
> Acked-by: Scott Wood <scottwood@freescale.com>

Reviewed-by: Guenter Roeck <linux@roeck-us.net>


* Re: [PATCH v2 1/2] powerpc/pm: add api to get suspend state which is STANDBY or MEM
From: Scott Wood @ 2014-05-09 17:09 UTC (permalink / raw)
  To: Li Yang
  Cc: Zhao Chenhui, linux-pm@vger.kernel.org, Rafael J. Wysocki,
	Dongsheng Wang, 正雄 金, linuxppc-dev
In-Reply-To: <CADRPPNRukfr92m4hSTar_N4iNYJ8fRPMnyOokzc9tCEDLg_BVw@mail.gmail.com>

On Fri, 2014-05-09 at 17:33 +0800, Li Yang wrote:
> On Wed, Apr 30, 2014 at 6:47 AM, Scott Wood <scottwood@freescale.com> wrote:
> > On Mon, 2014-04-28 at 13:53 +0800, Leo Li wrote:
> >> On Sat, Apr 26, 2014 at 5:45 AM, Scott Wood <scottwood@freescale.com> wrote:
> >> > On Thu, 2014-04-24 at 14:11 +0800, Dongsheng Wang wrote:
> >> >> From: Wang Dongsheng <dongsheng.wang@freescale.com>
> >> >>
> >> >> Add set_pm_suspend_state & pm_suspend_state functions to set/get
> >> >> suspend state. When system going to sleep or deep sleep, devices
> >> >> can get the system suspend state(STANDBY/MEM) through pm_suspend_state
> >> >> function and to handle different situations.
> >> >>
> >> >> Signed-off-by: Wang Dongsheng <dongsheng.wang@freescale.com>
> >> >> ---
> >> >> *v2*
> >> >> Move pm api from fsl platform to powerpc general framework.
> >> >
> >> > What is powerpc-specific about this?
> >>
> >> Generally I agree with you.  But I had the discussion about this topic
> >> a while ago with the PM maintainer.  His suggestion was to go with
> >> the platform way.
> >>
> >> https://lkml.org/lkml/2013/8/16/505
> >
> > If what he meant was whether you could do what this patch does, then you
> > can answer him with, "No, because it got nacked as not being platform or
> > arch specific."  Oh, and you're still using .valid as the hook to set
> > the platform state, which is awful -- I think .begin is what you want to
> > use.
> 
> I'm not saying the current patch is good for upstream.  Actually I did
> say that the patch needs to be updated for upstream purposes.

I don't follow -- this thread is an upstream submission.

> > Now, a more legitimate objection to putting it in generic code might be
> > that "standby" and "mem" are loosely defined and the knowledge of how a
> > driver should react to each is platform specific -- but your patch
> > doesn't address that.  You still have the driver itself interpret what
> > "standby" and "mem" mean.
> >
> 
> Yup, we will address it in the next batch.

Thanks.

-Scott
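
A rough, self-contained model of the ".valid versus .begin" point quoted
above. It is not kernel code: struct sleep_ops, the enum and every demo_*
function are hypothetical, and the sketch only assumes that a "valid" hook is
a pure query while a "begin" hook marks the start of a transition and is
therefore the place to record the target state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state values and ops table, for illustration only. */
enum sleep_state { SLEEP_NONE, SLEEP_STANDBY, SLEEP_MEM };

struct sleep_ops {
	bool (*valid)(enum sleep_state state);	/* query only, no side effects */
	int (*begin)(enum sleep_state state);	/* start of a transition */
	void (*end)(void);			/* transition finished */
};

static enum sleep_state current_target = SLEEP_NONE;

static bool demo_valid(enum sleep_state state)
{
	return state == SLEEP_STANDBY || state == SLEEP_MEM;
}

static int demo_begin(enum sleep_state state)
{
	if (!demo_valid(state))
		return -1;
	current_target = state;	/* record the state here, not in valid() */
	return 0;
}

static void demo_end(void)
{
	current_target = SLEEP_NONE;
}

/* What a driver would call to learn the state of the ongoing transition. */
static enum sleep_state demo_target_state(void)
{
	return current_target;
}

static const struct sleep_ops demo_ops = {
	.valid = demo_valid,
	.begin = demo_begin,
	.end = demo_end,
};

/* Drive one transition and report what a driver would observe mid-way. */
static enum sleep_state observe_transition(const struct sleep_ops *ops,
					   enum sleep_state state)
{
	enum sleep_state seen = SLEEP_NONE;

	assert(ops != NULL && ops->valid && ops->begin && ops->end);
	if (ops->valid(state) && ops->begin(state) == 0) {
		seen = demo_target_state();
		ops->end();
	}
	return seen;
}
```

Calling demo_ops.valid() any number of times leaves the recorded state
untouched in this model; only a begin/end pair changes what a driver sees.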


* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
From: Paul E. McKenney @ 2014-05-09 13:41 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: paulus, linuxppc-dev
In-Reply-To: <20140509174712.55fe72d0@kryten>

On Fri, May 09, 2014 at 05:47:12PM +1000, Anton Blanchard wrote:
> I am seeing an issue where a CPU running perf eventually hangs.
> Traces show timer interrupts happening every 4 seconds even
> when a userspace task is running on the CPU.

Is this by chance every 4.2 seconds?  The reason I ask is that
Paul Clarke and I are seeing an interrupt every 4.2 seconds when
he runs NO_HZ_FULL, and are trying to get rid of it.  ;-)

						Thanx, Paul

>                                              /proc/timer_list
> also shows pending hrtimers have not run in over an hour,
> including the scheduler.
> 
> Looking closer, decrementers_next_tb is getting set to
> 0xffffffffffffffff, and at that point we will never take
> a timer interrupt again.
> 
> In __timer_interrupt() we set decrementers_next_tb to
> 0xffffffffffffffff and rely on ->event_handler to update it:
> 
>         *next_tb = ~(u64)0;
>         if (evt->event_handler)
>                 evt->event_handler(evt);
> 
> In this case ->event_handler is hrtimer_interrupt. This will eventually
> call back through the clockevents code with the next event to be
> programmed:
> 
> static int decrementer_set_next_event(unsigned long evt,
>                                       struct clock_event_device *dev)
> {
>         /* Don't adjust the decrementer if some irq work is pending */
>         if (test_irq_work_pending())
>                 return 0;
>         __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> 
> If irq work came in between these two points, we will return
> before updating decrementers_next_tb and we never process a timer
> interrupt again.
> 
> This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
> with irq_work). Fix it by removing the early exit and relying on
> code later on in the function to force an early decrementer:
> 
>        /* We may have raced with new irq work */
>        if (test_irq_work_pending())
>                set_dec(1);
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> Cc: stable@vger.kernel.org # 3.14+
> ---
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 122a580..4f0b676 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -813,9 +888,6 @@ static void __init clocksource_init(void)
>  static int decrementer_set_next_event(unsigned long evt,
>  				      struct clock_event_device *dev)
>  {
> -	/* Don't adjust the decrementer if some irq work is pending */
> -	if (test_irq_work_pending())
> -		return 0;
>  	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
>  	set_dec(evt);
> 
> 
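
The race described in the quoted patch can be reproduced with a toy model.
This is a sketch under stated assumptions, not the kernel code: struct
cpu_model and the function names are hypothetical, next_tb stands in for
decrementers_next_tb, and dec records the last value that would be passed to
set_dec().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy per-CPU state; field names loosely follow the description above. */
struct cpu_model {
	uint64_t now_tb;	/* current timebase value */
	uint64_t next_tb;	/* models decrementers_next_tb */
	uint64_t dec;		/* last value written to the decrementer */
	bool irq_work_pending;
};

/* Old behaviour: return early when irq work is pending. */
static int set_next_event_old(struct cpu_model *cpu, uint64_t evt)
{
	if (cpu->irq_work_pending)
		return 0;	/* next_tb is left untouched */
	cpu->next_tb = cpu->now_tb + evt;
	cpu->dec = evt;
	return 0;
}

/* Proposed behaviour: always record the event, then force an early tick. */
static int set_next_event_new(struct cpu_model *cpu, uint64_t evt)
{
	cpu->next_tb = cpu->now_tb + evt;
	cpu->dec = evt;
	if (cpu->irq_work_pending)
		cpu->dec = 1;	/* we may have raced with new irq work */
	return 0;
}

/* Timer interrupt path: clear next_tb, then let the handler reprogram it. */
static void timer_interrupt(struct cpu_model *cpu,
			    int (*set_next)(struct cpu_model *, uint64_t),
			    uint64_t evt)
{
	assert(cpu != NULL && set_next != NULL);
	cpu->next_tb = UINT64_MAX;	/* *next_tb = ~(u64)0 */
	set_next(cpu, evt);
}
```

With irq work pending, the old variant leaves next_tb stuck at UINT64_MAX,
while the new variant records now_tb + evt and programs a decrementer of 1.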


* Re: [PATCH] powerpc: irq work racing with timer interrupt can result in timer interrupt hang
From: Preeti U Murthy @ 2014-05-09  9:52 UTC (permalink / raw)
  To: Anton Blanchard, benh; +Cc: paulmck, paulus, linuxppc-dev
In-Reply-To: <20140509174712.55fe72d0@kryten>

Hi Anton,

On 05/09/2014 01:17 PM, Anton Blanchard wrote:
> I am seeing an issue where a CPU running perf eventually hangs.
> Traces show timer interrupts happening every 4 seconds even
> when a userspace task is running on the CPU. /proc/timer_list
> also shows pending hrtimers have not run in over an hour,
> including the scheduler.
> 
> Looking closer, decrementers_next_tb is getting set to
> 0xffffffffffffffff, and at that point we will never take
> a timer interrupt again.
> 
> In __timer_interrupt() we set decrementers_next_tb to
> 0xffffffffffffffff and rely on ->event_handler to update it:
> 
>         *next_tb = ~(u64)0;
>         if (evt->event_handler)
>                 evt->event_handler(evt);
> 
> In this case ->event_handler is hrtimer_interrupt. This will eventually
> call back through the clockevents code with the next event to be
> programmed:
> 
> static int decrementer_set_next_event(unsigned long evt,
>                                       struct clock_event_device *dev)
> {
>         /* Don't adjust the decrementer if some irq work is pending */
>         if (test_irq_work_pending())
>                 return 0;
>         __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> 
> If irq work came in between these two points, we will return
> before updating decrementers_next_tb and we never process a timer
> interrupt again.
> 
> This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
> with irq_work). Fix it by removing the early exit and relying on
> code later on in the function to force an early decrementer:
> 
>        /* We may have raced with new irq work */
>        if (test_irq_work_pending())
>                set_dec(1);
> 

There is another scenario we are missing. It's not necessary that on a
timer interrupt the event handler will call back through
set_next_event().
If there are no pending timers then the event handler will not bother
programming the tick device and will simply return. IOW, set_next_event()
will not be called. In that case we will miss taking care of pending irq
work altogether.

__timer_interrupt() -> event_handler -> next_time = KTIME_MAX ->
__timer_interrupt().

In __timer_interrupt() we do not check for pending irq work anywhere after
the call to the event handler, and hence we miss servicing irq work in the
above scenario.

How about you also move the check:
 if (test_irq_work_pending())
   set_dec(1)

in __timer_interrupt() outside the _else_ branch? This will ensure that no
matter what, before exiting the timer interrupt handler we check for pending
irq work.
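
To make the scenario concrete, here is a rough userspace model of the
flow (not kernel code; struct cpu_model and all model_* names are made
up for this sketch). The event handler only reprograms the device when
a timer is queued, and the pending check sits on the common exit path
of the interrupt handler:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of one CPU; every name here is a stand-in, not kernel API. */
struct cpu_model {
	uint64_t now;           /* stands in for get_tb_or_rtc() */
	uint64_t next_tb;       /* stands in for decrementers_next_tb */
	uint64_t dec;           /* last value passed to the set_dec() stand-in */
	bool dec_written;       /* was the decrementer programmed at all? */
	bool irq_work_pending;  /* stands in for test_irq_work_pending() */
	bool have_timer;        /* does the event handler have a timer queued? */
};

static void model_set_dec(struct cpu_model *c, uint64_t val)
{
	c->dec = val;
	c->dec_written = true;
}

/* decrementer_set_next_event() stand-in, with the early exit removed. */
static void model_set_next_event(struct cpu_model *c, uint64_t evt)
{
	c->next_tb = c->now + evt;
	model_set_dec(c, evt);
}

/* ->event_handler stand-in: reprograms the device only if a timer is queued. */
static void model_event_handler(struct cpu_model *c)
{
	if (c->have_timer)
		model_set_next_event(c, 1000);
}

/* __timer_interrupt() stand-in with the pending check on the common exit path. */
void model_timer_interrupt(struct cpu_model *c)
{
	assert(c != NULL);
	c->dec_written = false;
	c->next_tb = ~(uint64_t)0;
	model_event_handler(c);

	/* Runs whether or not the handler called model_set_next_event(). */
	if (c->irq_work_pending)
		model_set_dec(c, 1);
}
```

With have_timer false and irq_work_pending true, the model still
programs the decrementer to 1 even though the set_next_event stand-in
never runs, which is the case described above.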

Regards
Preeti U Murthy

> Signed-off-by: Anton Blanchard <anton@samba.org>
> Cc: stable@vger.kernel.org # 3.14+
> ---
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 122a580..4f0b676 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -813,9 +888,6 @@ static void __init clocksource_init(void)
>  static int decrementer_set_next_event(unsigned long evt,
>  				      struct clock_event_device *dev)
>  {
> -	/* Don't adjust the decrementer if some irq work is pending */
> -	if (test_irq_work_pending())
> -		return 0;
>  	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
>  	set_dec(evt);

Why do we have test_irq_work_pending() later in the function
decrementer_set_next_event()?
>  
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

* Re: [PATCH v2 1/2] powerpc/pm: add api to get suspend state which is STANDBY or MEM
From: Li Yang @ 2014-05-09  9:33 UTC (permalink / raw)
  To: Scott Wood
  Cc: Zhao Chenhui, linux-pm@vger.kernel.org, Rafael J. Wysocki,
	Dongsheng Wang, 正雄 金, linuxppc-dev
In-Reply-To: <1398811632.24575.98.camel@snotra.buserror.net>

On Wed, Apr 30, 2014 at 6:47 AM, Scott Wood <scottwood@freescale.com> wrote:
> On Mon, 2014-04-28 at 13:53 +0800, Leo Li wrote:
>> On Sat, Apr 26, 2014 at 5:45 AM, Scott Wood <scottwood@freescale.com> wrote:
>> > On Thu, 2014-04-24 at 14:11 +0800, Dongsheng Wang wrote:
>> >> From: Wang Dongsheng <dongsheng.wang@freescale.com>
>> >>
>> >> Add set_pm_suspend_state & pm_suspend_state functions to set/get
>> >> suspend state. When the system is going to sleep or deep sleep, devices
>> >> can get the system suspend state (STANDBY/MEM) through the pm_suspend_state
>> >> function and handle the different situations.
>> >>
>> >> Signed-off-by: Wang Dongsheng <dongsheng.wang@freescale.com>
>> >> ---
>> >> *v2*
>> >> Move pm api from fsl platform to powerpc general framework.
>> >
>> > What is powerpc-specific about this?
>>
>> Generally I agree with you.  But I had a discussion about this topic
>> a while ago with the PM maintainer.  His suggestion was to go with the
>> platform way.
>>
>> https://lkml.org/lkml/2013/8/16/505
>
> If what he meant was whether you could do what this patch does, then you
> can answer him with, "No, because it got nacked as not being platform or
> arch specific."  Oh, and you're still using .valid as the hook to set
> the platform state, which is awful -- I think .begin is what you want to
> use.

I'm not saying the current patch is good for upstream.  Actually I did
say that the patch needs to be updated for upstream purposes.  I only
meant that we discussed having the mem/standby state passed through a generic
kernel/power interface, as you suggested internally, and got negative
feedback.

>
> If we did it in powerpc code, then what would we do on ARM?  Copy the
> code?  No.

If you are saying that this shouldn't be done in arch/powerpc, then yes.
We have decided to use the drivers/platform folder for the code shared
with ARM.  Platform power management code will be moved there.

>
> Now, a more legitimate objection to putting it in generic code might be
> that "standby" and "mem" are loosely defined and the knowledge of how a
> driver should react to each is platform specific -- but your patch
> doesn't address that.  You still have the driver itself interpret what
> "standby" and "mem" mean.
>

Yup, we will address it in next batch.

- Leo
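
For reference, the .valid versus .begin point above can be sketched
roughly as follows. This is a userspace mock, not the kernel's suspend
ops: the mock_* names and the enum are invented for illustration, and
only the split of responsibilities is taken from the thread. The valid
hook is a pure query, and the state is recorded only in the begin hook:

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-ins for the suspend states discussed in the thread. */
enum mock_suspend_state {
	MOCK_SUSPEND_ON,
	MOCK_SUSPEND_STANDBY,
	MOCK_SUSPEND_MEM
};

static enum mock_suspend_state mock_current_state = MOCK_SUSPEND_ON;

/* "valid" hook: answers a question only, it must not record anything. */
bool mock_valid(enum mock_suspend_state state)
{
	return state == MOCK_SUSPEND_STANDBY || state == MOCK_SUSPEND_MEM;
}

/* "begin" hook: the transition is really starting, so record the state. */
int mock_begin(enum mock_suspend_state state)
{
	if (!mock_valid(state))
		return -1;
	mock_current_state = state;
	return 0;
}

/* "end" hook: the transition is over, go back to the running state. */
void mock_end(void)
{
	mock_current_state = MOCK_SUSPEND_ON;
}

/* What a driver would query while it is being suspended. */
enum mock_suspend_state mock_pm_suspend_state(void)
{
	return mock_current_state;
}
```

In this shape, calling the valid hook any number of times has no side
effects, so the recorded state only ever reflects a transition that has
actually begun.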

* RE: powerpc/mpc85xx: Add BSC9132 QDS Support
From: Harninder Rai @ 2014-05-09  8:55 UTC (permalink / raw)
  To: Scott Wood
  Cc: linuxppc-dev@lists.ozlabs.org, prabhakar@freescale.com,
	Ruchika Gupta
In-Reply-To: <20140503003113.GB20757@home.buserror.net>



> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Saturday, May 03, 2014 6:01 AM
> To: Rai Harninder-B01044
> Cc: linuxppc-dev@lists.ozlabs.org; Gupta Ruchika-R66431
> Subject: Re: powerpc/mpc85xx: Add BSC9132 QDS Support
>
> On Tue, Mar 18, 2014 at 01:05:02PM +0530, harninder rai wrote:
> > +&ifc {
> > +	#address-cells = <2>;
> > +	#size-cells = <1>;
> > +	compatible = "fsl,ifc", "simple-bus";
> > +	/* FIXME: Test whether interrupts are split */
> > +	interrupts = <16 2 0 0 20 2 0 0>;
> > +};
>
> Have you done this test yet?
Checked with Prabhakar and he says that on 9132, the IFC interrupts are split
B4/T4 (and variants), C29x etc onwards are when the interrupts got merged into
single interrupt
>
> -Scott

* Re: [PATCH 1/1] booke/watchdog: refine and clean up the codes
From: Leo Li @ 2014-05-09  8:31 UTC (permalink / raw)
  To: Yuantian.Tang; +Cc: Scott Wood, wim, linuxppc-dev, linux-watchdog
In-Reply-To: <1399514666-2572-1-git-send-email-Yuantian.Tang@freescale.com>

On Thu, May 8, 2014 at 10:04 AM,  <Yuantian.Tang@freescale.com> wrote:
> From: Tang Yuantian <yuantian.tang@freescale.com>
>
> Basically, this patch does the following:
> 1. Move the code that parses boot parameters from setup-common.c
>    to the driver. In this way, a code reader can see directly that
>    there are boot parameters that can change the timeout.
> 2. Make boot parameter 'booke_wdt_period' effective.
>    Currently, when the driver is loaded, the default timeout is always
>    used instead of booke_wdt_period.
> 3. Wrap up the watchdog timeout in the device struct and clean up
>    unnecessary code.
>
> Signed-off-by: Tang Yuantian <yuantian.tang@freescale.com>
> Acked-by: Scott Wood <scottwood@freescale.com>

Reviewed-by: Li Yang <leoli@freescale.com>

* Re: [RFT PATCH -next ] [BUGFIX] kprobes: Fix "Failed to find blacklist" error on ia64 and ppc64
From: Masami Hiramatsu @ 2014-05-09  8:06 UTC (permalink / raw)
  To: ananth
  Cc: Jeremy Fitzhardinge, linux-ia64, sparse,
	Linux Kernel Mailing List, Paul Mackerras, H. Peter Anvin,
	Thomas Gleixner, linux-tip-commits, anil.s.keshavamurthy,
	Ingo Molnar, Fenghua Yu, Arnd Bergmann, Rusty Russell,
	Chris Wright, yrl.pp-manager.tt, akataria, Tony Luck, Kevin Hao,
	Linus Torvalds, rdunlap, Tony Luck, dl9pf, Andrew Morton,
	linuxppc-dev, David S. Miller
In-Reply-To: <20140508061658.GA2384@in.ibm.com>

(2014/05/08 15:16), Ananth N Mavinakayanahalli wrote:
> On Thu, May 08, 2014 at 02:40:00PM +0900, Masami Hiramatsu wrote:
>> (2014/05/08 13:47), Ananth N Mavinakayanahalli wrote:
>>> On Wed, May 07, 2014 at 08:55:51PM +0900, Masami Hiramatsu wrote:
>>>
>>> ...
>>>
>>>> +#if defined(CONFIG_PPC64) && (!defined(_CALL_ELF) || _CALL_ELF == 1)
>>>> +/*
>>>> + * On PPC64 ABIv1 the function pointer actually points to the
>>>> + * function's descriptor. The first entry in the descriptor is the
>>>> + * address of the function text.
>>>> + */
>>>> +#define constant_function_entry(fn)	(((func_descr_t *)(fn))->entry)
>>>> +#else
>>>> +#define constant_function_entry(fn)	((unsigned long)(fn))
>>>> +#endif
>>>> +
>>>>  #endif /* __ASSEMBLY__ */
>>>
>>> Hi Masami,
>>>
>>> You could just use ppc_function_entry() instead.
>>
>> No, I think ppc_function_entry() has two problems (on the latest -next kernel)
>>
>> First, it is an inline function, which cannot be evaluated at build time.
>> Since NOKPROBE_SYMBOL() is used outside of any function, like
>> EXPORT_SYMBOL(), we can only use preprocessor macros.
>> Next, on PPC64 ABI*v2*, ppc_function_entry() returns the local function entry,
>> which seems to be the global function entry + 2 insns. I'm not sure about the
>> implementation of kallsyms on PPC64 ABIv2, but I guess we need the global
>> function entry for kallsyms.
> 
> ABIv2 does away with function descriptors and Anton fixed up that
> routine to handle the change (the +2 is an artefact of that).

Hmm, do you mean that the address +2 is the actual entry point?
I'd like to know which address is the same as the address shown in /proc/kallsyms.
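
For readers following along, the difference between the two macro
variants quoted above can be modelled in plain C. This is only a
sketch: struct mock_func_descr and the mock_* helpers are invented
here, and the three-word layout is assumed for illustration. With
descriptors, the "function pointer" is the address of the descriptor
and the text address is its first word; without them the pointer value
itself is the entry address:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock descriptor: the first word holds the address of the function text. */
struct mock_func_descr {
	uintptr_t entry;
	uintptr_t toc;
	uintptr_t env;
};

/* Descriptor case: dereference the pointer to reach the text address. */
uintptr_t mock_entry_via_descriptor(const void *fn)
{
	assert(fn != NULL);
	return ((const struct mock_func_descr *)fn)->entry;
}

/* No-descriptor case: the pointer value already is the entry address. */
uintptr_t mock_entry_direct(const void *fn)
{
	return (uintptr_t)fn;
}
```

The sketch does not model the local versus global entry question on
ABIv2; it only shows why the descriptor case needs a dereference.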

>> BTW, could you test this patch on the latest -next tree on PPC64 if possible?
> 
> I'll test it, but it may take a bit.

Thanks for your help!

> 
> Ananth
> 
> 


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Research Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

* RE: powerpc/mpc85xx: Add BSC9132 QDS Support
From: Harninder Rai @ 2014-05-09  7:44 UTC (permalink / raw)
  To: Scott Wood; +Cc: linuxppc-dev@lists.ozlabs.org, Ruchika Gupta
In-Reply-To: <20140503002309.GA20757@home.buserror.net>

> > +	};
> > +
> > +	nand@1,0 {
> > +		#address-cells = <1>;
> > +		#size-cells = <1>;
> > +		compatible = "fsl,ifc-nand";
> > +		reg = <0x1 0x0 0x4000>;
> > +
> > +		partition@0 {
> > +			/* This location must not be altered  */
> > +			/* 3MB for u-boot Bootloader Image */
> > +			reg = <0x0 0x00300000>;
> > +			label = "NAND U-Boot Image";
> > +			read-only;
> > +		};
> > +
> > +		partition@300000 {
> > +			/* 1MB for DTB Image */
> > +			reg = <0x00300000 0x00100000>;
> > +			label = "NAND DTB Image";
> > +		};
> > +
> > +		partition@400000 {
> > +			/* 8MB for Linux Kernel Image */
> > +			reg = <0x00400000 0x00800000>;
> > +			label = "NAND Linux Kernel Image";
> > +		};
> > +
> > +		partition@c00000 {
> > +			/* Rest space for Root file System Image */
> > +			reg = <0x00c00000 0x07400000>;
> > +			label = "NAND RFS Image";
> > +		};
> > +	};
> > +};
>
> Please keep partition definitions out of the dts file, as has been recently
> requested of other boards.  You can use U-Boot to create the partition nodes
> based on the mtdparts variable, or you can use the Linux mtdparts command line
> option.
Ok. Will remove these in V2 of patch
>
> -Scott

* Re: [PATCH RFC v2 00/10] EEH Support for VFIO PCI devices on PowerKVM guest
From: Gavin Shan @ 2014-05-09  7:54 UTC (permalink / raw)
  To: Gavin Shan; +Cc: aik, agraf, kvm-ppc, alex.williamson, qiudayu, linuxppc-dev
In-Reply-To: <1399621782-23281-1-git-send-email-gwshan@linux.vnet.ibm.com>

On Fri, May 09, 2014 at 05:49:32PM +1000, Gavin Shan wrote:

Sorry for having missed cc'ing Alex Graf. Amending it.

>The series of patches intends to support EEH for PCI devices, which are
>passed through to PowerKVM based guest via VFIO. The implementation is
>straightforward based on the issues or problems we have to resolve to
>support EEH for PowerKVM based guest.
>
>- Emulation for EEH RTAS requests. All EEH RTAS requests go to QEMU first.
>  If QEMU can't handle it, the request will be sent to host via newly introduced
>  VFIO container IOCTL command (VFIO_EEH_INFO) and gets handled in host kernel.
>
>- The error injection infrastructure needs to support requests from the userland
>  utility "errinjct" and PowerKVM based guest. The userland utility "errinjct"
>  works well on the pSeries platform with a dedicated syscall, which invokes the
>  RTAS service to perform error injection in the kernel. From that perspective,
>  it's reasonable to extend the syscall to support the PowerNV platform so that
>  an OPAL call can be invoked in the host kernel to inject errors. The data
>  transported between userland and kernel still follows "struct rtas_args" in
>  both cases, PowerNV (OPAL) and pSeries (RTAS).
>
>The series of patches requires corresponding firmware changes from Mike Qiu to
>support error injection and QEMU changes to support EEH for guest. QEMU patchset
>will be sent separately.
>
>Change log
>==========
>v1 -> v2:
>	* EEH RTAS requests are routed to QEMU, and then possibly to the host kernel.
>	  The KVM in-kernel handling mechanism is dropped.
>	* Error injection is reimplemented based on a syscall, instead of KVM in-kernel
>	  handling. The logic for error injection token management is moved to
>	  QEMU. The error injection request is routed to QEMU and then possibly
>	  to the host kernel.
>
>Testing on P7
>=============
>
>- Emulex adapter
>
>Testing on P8
>=============
>
>- Need more testing after design is finalized.
>
>-----
>
>Gavin Shan (10):
>  drivers/vfio: Introduce CONFIG_VFIO_EEH
>  powerpc/eeh: Info to trace passed devices
>  powerpc/eeh: Search EEH device by guest address
>  powerpc/eeh: Search EEH PE by guest address
>  drivers/vfio: New IOCTL command VFIO_EEH_INFO
>  powerpc/eeh: Avoid event on passed PE
>  powerpc/powernv: Sync OPAL header file with firmware
>  powerpc: Extend syscall ppc_rtas()
>  powerpc/powernv: Implement ppc_call_opal()
>  powerpc/powernv: Error injection infrastructure
>
>arch/powerpc/include/asm/eeh.h                 |  52 +++++++++++++
>arch/powerpc/include/asm/opal.h                |  74 +++++++++++++++++-
>arch/powerpc/include/asm/rtas.h                |  10 ++-
>arch/powerpc/include/asm/syscalls.h            |   2 +-
>arch/powerpc/include/asm/systbl.h              |   2 +-
>arch/powerpc/include/uapi/asm/unistd.h         |   2 +-
>arch/powerpc/kernel/eeh.c                      |   8 ++
>arch/powerpc/kernel/eeh_pe.c                   |  80 +++++++++++++++++++
>arch/powerpc/kernel/rtas.c                     |  57 +++-----------
>arch/powerpc/kernel/syscalls.c                 |  50 ++++++++++++
>arch/powerpc/platforms/powernv/Makefile        |   3 +-
>arch/powerpc/platforms/powernv/eeh-ioda.c      |   3 +-
>arch/powerpc/platforms/powernv/eeh-vfio.c      | 584 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>arch/powerpc/platforms/powernv/errinject.c     | 222 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
>arch/powerpc/platforms/powernv/opal.c          |  93 ++++++++++++++++++++++
>drivers/vfio/Kconfig                           |   6 ++
>drivers/vfio/vfio_iommu_spapr_tce.c            |  12 +++
>include/uapi/linux/vfio.h                      |  61 +++++++++++++++
>kernel/sys_ni.c                                |   2 +-
>20 files changed, 1271 insertions(+), 53 deletions(-)
>create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c
>create mode 100644 arch/powerpc/platforms/powernv/errinject.c
>
>Thanks,
>Gavin

* [PATCH 09/10] powerpc/powernv: Implement ppc_call_opal()
From: Gavin Shan @ 2014-05-09  7:49 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, Gavin Shan
In-Reply-To: <1399621782-23281-1-git-send-email-gwshan@linux.vnet.ibm.com>

If we're running on the PowerNV platform, ppc_firmware() will be directed
to ppc_call_opal(), where we can call the OPAL API accordingly. In
ppc_call_opal(), the input arguments are parsed and the appropriate
OPAL API is called to handle the request. Each request passed to the
function is identified by a token. As we get to the function either
from a host-owned application (e.g. errinjct) or a VM, we always have
the first parameter (so-called "virtual") to differentiate the
cases.

The patch implements the above logic and a dynamic registration
mechanism for OPAL call handlers so that the handlers can be distributed.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/opal.h       |  3 +-
 arch/powerpc/platforms/powernv/opal.c | 90 ++++++++++++++++++++++++++++++++++-
 2 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index ca55d9c..7c4ffd0 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -997,7 +997,8 @@ extern void opal_lpc_init(void);
 struct opal_sg_list *opal_vmalloc_to_sg_list(void *vmalloc_addr,
 					     unsigned long vmalloc_size);
 void opal_free_sg_list(struct opal_sg_list *sg);
-
+int opal_call_handler_register(bool virt, int token,
+			       int (*fn)(struct rtas_args *));
 #endif /* __ASSEMBLY__ */
 
 #endif /* __OPAL_H */
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index ad33c2b..c84823c 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -38,6 +38,13 @@ struct opal {
 	u64 size;
 } opal;
 
+struct opal_call_handler {
+	bool virt;
+	int token;
+	int (*fn)(struct rtas_args *args);
+	struct list_head list;
+};
+
 struct mcheck_recoverable_range {
 	u64 start_addr;
 	u64 end_addr;
@@ -47,6 +54,10 @@ struct mcheck_recoverable_range {
 static struct mcheck_recoverable_range *mc_recoverable_range;
 static int mc_recoverable_range_len;
 
+/* OPAL call handler */
+static LIST_HEAD(opal_call_handler_list);
+static DEFINE_SPINLOCK(opal_call_lock);
+
 struct device_node *opal_node;
 static DEFINE_SPINLOCK(opal_write_lock);
 extern u64 opal_mc_secondary_handler[];
@@ -703,8 +714,83 @@ void opal_free_sg_list(struct opal_sg_list *sg)
 	}
 }
 
-/* Extend it later */
-int ppc_call_opal(struct rtas_args *args)
+int opal_call_handler_register(bool virt, int token,
+			       int (*fn)(struct rtas_args *))
 {
+	struct opal_call_handler *h, *handler;
+
+	if (!token || !fn) {
+		pr_warn("%s: Invalid parameters\n",
+			__func__);
+		return -EINVAL;
+	}
+
+	handler = kzalloc(sizeof(*handler), GFP_KERNEL);
+	if (!handler) {
+		pr_warn("%s: Out of memory\n",
+			__func__);
+		return -ENOMEM;
+	}
+	handler->token = token;
+	handler->virt = virt;
+	handler->fn = fn;
+	INIT_LIST_HEAD(&handler->list);
+
+	spin_lock(&opal_call_lock);
+	list_for_each_entry(h, &opal_call_handler_list, list) {
+		if (h->token == token &&
+		    h->virt  == virt) {
+			spin_unlock(&opal_call_lock);
+			pr_warn("%s: Handler existing (%s, %x)\n",
+				__func__, virt ? "T" : "F", token);
+			kfree(handler);
+			return -EEXIST;
+		}
+	}
+
+	list_add_tail(&handler->list, &opal_call_handler_list);
+	spin_unlock(&opal_call_lock);
+
 	return 0;
 }
+
+/*
+ * It's usually invoked from syscall ppc_firmware() by host
+ * owned application or VM. The information carried in the
+ * input arguments is different. So we always have the first
+ * argument to differentiate it.
+ *
+ * Also, we have to extend 32-bits address to 64-bits. So
+ * for each address sensitive field, it will require 8
+ * bytes.
+ */
+int ppc_call_opal(struct rtas_args *args)
+{
+	bool virt, found;
+	int token;
+	struct opal_call_handler *h;
+
+	/* We should have "virt" at least */
+	if (args->nargs < 1)
+		return -EINVAL;
+	virt = !!args->args[0];
+	token = args->token;
+
+	/* Do we have handler ? */
+	found = false;
+	spin_lock(&opal_call_lock);
+	list_for_each_entry(h, &opal_call_handler_list, list) {
+		if (h->token == token &&
+		    h->virt == virt) {
+			found = true;
+			break;
+		}
+	}
+	spin_unlock(&opal_call_lock);
+
+	/* Call to handler */
+	if (!found)
+		return -ERANGE;
+
+	return h->fn(args);
+}
-- 
1.8.3.2
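
As a usage illustration of the (virt, token) keyed lookup described in
this patch, here is a small userspace sketch. It is not the kernel
code: it uses a fixed-size array instead of a list, does no locking,
and struct mock_args and all mock_* names are invented for the example.
The return codes mirror the ones the patch uses where possible:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_MAX_HANDLERS 8

/* Cut-down stand-in for struct rtas_args: only the fields the lookup needs. */
struct mock_args {
	int token;
	int nargs;
	int args[4];
};

struct mock_handler {
	bool virt;
	int token;
	int (*fn)(struct mock_args *);
};

static struct mock_handler mock_table[MOCK_MAX_HANDLERS];
static size_t mock_count;

/* Register a handler; at most one handler per (virt, token) pair. */
int mock_register(bool virt, int token, int (*fn)(struct mock_args *))
{
	size_t i;

	if (!token || !fn)
		return -EINVAL;
	for (i = 0; i < mock_count; i++)
		if (mock_table[i].token == token && mock_table[i].virt == virt)
			return -EEXIST;
	if (mock_count == MOCK_MAX_HANDLERS)
		return -ENOMEM;
	mock_table[mock_count++] = (struct mock_handler){ virt, token, fn };
	return 0;
}

/* Dispatch: args[0] selects host (0) or guest (non-zero), token the call. */
int mock_dispatch(struct mock_args *args)
{
	size_t i;
	bool virt;

	assert(args != NULL);
	if (args->nargs < 1)
		return -EINVAL;
	virt = !!args->args[0];
	for (i = 0; i < mock_count; i++)
		if (mock_table[i].token == args->token &&
		    mock_table[i].virt == virt)
			return mock_table[i].fn(args);
	return -ERANGE;
}

/* Two trivial handlers so the host and guest paths can be told apart. */
int mock_host_handler(struct mock_args *args)
{
	(void)args;
	return 100;
}

int mock_guest_handler(struct mock_args *args)
{
	(void)args;
	return 200;
}
```

Registering the same token once with virt false and once with virt true
gives two independent entries, so a host request and a guest request
with the same token can reach different handlers.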

* [PATCH 10/10] powerpc/powernv: Error injection infrastructure
From: Gavin Shan @ 2014-05-09  7:49 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, Gavin Shan
In-Reply-To: <1399621782-23281-1-git-send-email-gwshan@linux.vnet.ibm.com>

The patch implements the error injection infrastructure
for the PowerNV platform. The predetermined handlers will be called
according to the type of injected error (e.g. OpalErrinjctTypeIoaBusError).
For now, we only support PCI error injection. We need to support
injecting other types of errors in the future.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/opal.h            |   6 +
 arch/powerpc/platforms/powernv/Makefile    |   2 +-
 arch/powerpc/platforms/powernv/errinject.c | 224 +++++++++++++++++++++++++++++
 3 files changed, 231 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/platforms/powernv/errinject.c

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 7c4ffd0..7bf86ba 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -794,6 +794,12 @@ typedef struct oppanel_line {
 	uint64_t 	line_len;
 } oppanel_line_t;
 
+enum OpalCallToken{
+	OPAL_CALL_TOKEN_MIN = 0,
+	OPAL_CALL_TOKEN_ERRINJCT,
+	OPAL_CALL_TOKEN_MAX
+};
+
 /* /sys/firmware/opal */
 extern struct kobject *opal_kobj;
 
diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index 2b15a03..5ae8257 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -1,7 +1,7 @@
 obj-y			+= setup.o opal-takeover.o opal-wrappers.o opal.o opal-async.o
 obj-y			+= opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y			+= rng.o opal-elog.o opal-dump.o opal-sysparam.o opal-sensor.o
-obj-y			+= opal-msglog.o
+obj-y			+= opal-msglog.o errinject.o
 
 obj-$(CONFIG_SMP)	+= smp.o
 obj-$(CONFIG_PCI)	+= pci.o pci-p5ioc2.o pci-ioda.o
diff --git a/arch/powerpc/platforms/powernv/errinject.c b/arch/powerpc/platforms/powernv/errinject.c
new file mode 100644
index 0000000..aa892d4
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/errinject.c
@@ -0,0 +1,224 @@
+/*
+ * The file intends to support error injection requests from host OS
+ * owned utility (e.g. errinjct) or VM. We need to parse the information
+ * passed from user space and call the appropriate OPAL API accordingly.
+ *
+ * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2014.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/io.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
+#include <linux/msi.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+
+#include <asm/eeh.h>
+#include <asm/eeh_event.h>
+#include <asm/io.h>
+#include <asm/iommu.h>
+#include <asm/msi_bitmap.h>
+#include <asm/opal.h>
+#include <asm/pci-bridge.h>
+#include <asm/ppc-pci.h>
+#include <asm/rtas.h>
+#include <asm/tce.h>
+#include <asm/uaccess.h>
+
+#include "powernv.h"
+#include "pci.h"
+
+static int powernv_errinjct_ioa(struct rtas_args *args)
+{
+	return -ENXIO;
+}
+
+static int powernv_errinjct_ioa64(struct rtas_args *args)
+{
+	return -ENXIO;
+}
+
+#ifdef CONFIG_VFIO_EEH
+static int powernv_errinjct_ioa_virt(struct rtas_args *args)
+{
+	uint32_t addr, mask, cfg_addr;
+	uint32_t buid_hi, buid_lo, op;
+	uint64_t buf_addr = ((uint64_t)(args->args[3])) << 32 |
+			    args->args[4];
+	void __user *buf = (void __user *)buf_addr;
+	struct eeh_vfio_pci_addr vfio_addr;
+	struct pnv_phb *phb;
+	struct eeh_pe *pe;
+	struct OpalErrinjct ej;
+
+	/* Extract parameters */
+	if (get_user(addr, (uint32_t __user *)buf) ||
+	    get_user(mask, (uint32_t __user *)(buf + 4)) ||
+	    get_user(cfg_addr, (uint32_t __user *)(buf + 8)) ||
+	    get_user(buid_hi, (uint32_t __user *)(buf + 12)) ||
+	    get_user(buid_lo, (uint32_t __user *)(buf + 16)) ||
+	    get_user(op, (uint32_t __user *)(buf + 20)))
+		return -EFAULT;
+
+	/* Check opcode */
+	if (op < OpalEjtIoaLoadMemAddr ||
+	    op > OpalEjtIoaDmaWriteMemTarget)
+		return -EINVAL;
+
+	/* Find PE */
+	vfio_addr.buid = ((((uint64_t)buid_hi) << 32) | buid_lo);
+	vfio_addr.pe_addr = cfg_addr;
+	pe = eeh_vfio_pe_get(&vfio_addr);
+	if (!pe)
+		return -ENODEV;
+	phb = pe->phb->private_data;
+
+	/* OPAL call */
+	ej.type = OpalErrinjctTypeIoaBusError;
+	ej.ioa.addr = addr;
+	ej.ioa.mask = mask;
+	ej.ioa.phb_id = phb->opal_id;
+	ej.ioa.pe = pe->addr;
+	ej.ioa.function = op;
+	if (opal_err_injct(&ej) != OPAL_SUCCESS)
+		return -EIO;
+
+	return 0;
+}
+
+static int powernv_errinjct_ioa64_virt(struct rtas_args *args)
+{
+	uint32_t addr_hi, addr_lo, mask_hi, mask_lo;
+	uint32_t cfg_addr, buid_hi, buid_lo, op;
+	uint64_t buf_addr = ((uint64_t)(args->args[3])) << 32 |
+			    args->args[4];
+	void __user *buf = (void __user *)buf_addr;
+	struct eeh_vfio_pci_addr vfio_addr;
+	struct pnv_phb *phb;
+	struct eeh_pe *pe;
+	struct OpalErrinjct ej;
+
+	/* Extract parameters */
+	if (get_user(addr_hi, (uint32_t __user *)buf) ||
+	    get_user(addr_lo, (uint32_t __user *)(buf + 4)) ||
+	    get_user(mask_hi, (uint32_t __user *)(buf + 8)) ||
+	    get_user(mask_lo, (uint32_t __user *)(buf + 12)) ||
+	    get_user(cfg_addr, (uint32_t __user *)(buf + 16)) ||
+	    get_user(buid_hi, (uint32_t __user *)(buf + 20)) ||
+	    get_user(buid_lo, (uint32_t __user *)(buf + 24)) ||
+	    get_user(op, (uint32_t __user *)(buf + 28)))
+		return -EFAULT;
+
+	/* Check opcode */
+	if (op < OpalEjtIoaLoadMemAddr ||
+	    op > OpalEjtIoaDmaWriteMemTarget)
+		return -EINVAL;
+
+	/* Find PE */
+	vfio_addr.buid = ((((uint64_t)buid_hi) << 32) | buid_lo);
+	vfio_addr.pe_addr = (cfg_addr >> 8) & 0xffff;
+	pe = eeh_vfio_pe_get(&vfio_addr);
+	if (!pe)
+		return -ENODEV;
+	phb = pe->phb->private_data;
+
+	/* OPAL call */
+	ej.type = OpalErrinjctTypeIoaBusError64;
+	ej.ioa64.addr = (((uint64_t)addr_hi) << 32) | addr_lo;
+	ej.ioa64.mask = (((uint64_t)mask_hi) << 32) | mask_lo;
+	ej.ioa64.phb_id = phb->opal_id;
+	ej.ioa64.pe = pe->addr;
+	ej.ioa64.function = op;
+	if (opal_err_injct(&ej) != OPAL_SUCCESS)
+		return -EIO;
+
+	return 0;
+}
+#endif /* CONFIG_VFIO_EEH */
+
+struct errinjct_handler {
+	bool virt;
+	int token;
+	int (*fn)(struct rtas_args *arg);
+};
+
+static struct errinjct_handler handlers[] = {
+#ifdef CONFIG_EEH
+	{ false,
+	  OpalErrinjctTypeIoaBusError,
+	  powernv_errinjct_ioa
+	},
+	{ false,
+	  OpalErrinjctTypeIoaBusError64,
+	  powernv_errinjct_ioa64
+	},
+#endif
+#ifdef CONFIG_VFIO_EEH
+	{ true,
+	  OpalErrinjctTypeIoaBusError,
+	  powernv_errinjct_ioa_virt
+	},
+	{ true,
+	  OpalErrinjctTypeIoaBusError64,
+	  powernv_errinjct_ioa64_virt
+	},
+#endif
+};
+
+static int powernv_errinjct(struct rtas_args *args)
+{
+	struct errinjct_handler *h;
+	int token, ej_token, i;
+	bool virt;
+
+	/* Sanity check */
+	if (args->nargs != 5 || args->nret != 1)
+		return -EINVAL;
+
+	token = args->token;
+	virt = !!args->args[0];
+	if (!virt || token != OPAL_CALL_TOKEN_ERRINJCT)
+		return -EINVAL;
+
+	/* Call into specific handler */
+	ej_token = args->args[1];
+	for (i = 0; i < ARRAY_SIZE(handlers); i++) {
+		h = &handlers[i];
+		if (h->virt == virt &&
+		    h->token == ej_token &&
+		    h->fn)
+			return h->fn(args);
+	}
+
+	return -ENXIO;
+}
+
+static int __init powernv_errinjct_init(void)
+{
+	int ret;
+
+	ret = opal_call_handler_register(false, OPAL_CALL_TOKEN_ERRINJCT,
+					 powernv_errinjct);
+	if (ret) {
+		pr_warn("%s: Cannot register errinjct handler\n",
+			__func__);
+		return ret;
+	}
+
+	ret = opal_call_handler_register(true, OPAL_CALL_TOKEN_ERRINJCT,
+					 powernv_errinjct);
+	if (ret) {
+		pr_warn("%s: Cannot register errinjct virtual handler\n",
+			__func__);
+		return ret;
+	}
+
+	return 0;
+}
+
+module_init(powernv_errinjct_init);
-- 
1.8.3.2
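
The buffer decoding done by the 64-bit handler in this patch can be
sketched in userspace as follows. The word order and the
(cfg_addr >> 8) & 0xffff step are taken from the patch; struct
mock_ioa64_req, join_u32() and mock_decode_ioa64() are names invented
for the sketch, and a plain array stands in for the user buffer that
the patch reads with get_user():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded form of the eight 32-bit words in the request buffer. */
struct mock_ioa64_req {
	uint64_t addr;
	uint64_t mask;
	uint64_t buid;
	uint32_t cfg_addr;
	uint32_t pe_addr;
	uint32_t op;
};

/* Join two 32-bit halves; widen before shifting so no bits are lost. */
static uint64_t join_u32(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Word order: addr_hi, addr_lo, mask_hi, mask_lo, cfg_addr, buid_hi, buid_lo, op. */
struct mock_ioa64_req mock_decode_ioa64(const uint32_t w[8])
{
	struct mock_ioa64_req r;

	assert(w != NULL);
	r.addr = join_u32(w[0], w[1]);
	r.mask = join_u32(w[2], w[3]);
	r.cfg_addr = w[4];
	r.buid = join_u32(w[5], w[6]);
	r.op = w[7];
	/* The PE address is taken from bits 8..23 of cfg_addr, as in the patch. */
	r.pe_addr = (w[4] >> 8) & 0xffff;
	return r;
}
```

The cast to uint64_t before the shift matters: shifting a 32-bit value
left by 32 without widening it first is undefined in C.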

* [PATCH 07/10] powerpc/powernv: Sync OPAL header file with firmware
From: Gavin Shan @ 2014-05-09  7:49 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, Gavin Shan
In-Reply-To: <1399621782-23281-1-git-send-email-gwshan@linux.vnet.ibm.com>

The patch synchronizes OPAL header file with firmware so that the
host kernel can make OPAL call to do error injection.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/opal.h                | 65 ++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
 2 files changed, 66 insertions(+)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 66ad7a7..ca55d9c 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -175,6 +175,7 @@ extern int opal_enter_rtas(struct rtas_args *args,
 #define OPAL_SET_PARAM				90
 #define OPAL_DUMP_RESEND			91
 #define OPAL_DUMP_INFO2				94
+#define OPAL_ERR_INJECT				96
 
 #ifndef __ASSEMBLY__
 
@@ -219,6 +220,69 @@ enum OpalPciErrorSeverity {
 	OPAL_EEH_SEV_INF	= 5
 };
 
+enum OpalErrinjctType {
+	OpalErrinjctTypeFirst			= 0,
+	OpalErrinjctTypeFatal			= 1,
+	OpalErrinjctTypeRecoverRandomEvent	= 2,
+	OpalErrinjctTypeRecoverSpecialEvent	= 3,
+	OpalErrinjctTypeCorruptedPage		= 4,
+	OpalErrinjctTypeCorruptedSlb		= 5,
+	OpalErrinjctTypeTranslatorFailure	= 6,
+	OpalErrinjctTypeIoaBusError		= 7,
+	OpalErrinjctTypeIoaBusError64		= 8,
+	OpalErrinjctTypePlatformSpecific	= 9,
+	OpalErrinjctTypeDcacheStart		= 10,
+	OpalErrinjctTypeDcacheEnd		= 11,
+	OpalErrinjctTypeIcacheStart		= 12,
+	OpalErrinjctTypeIcacheEnd		= 13,
+	OpalErrinjctTypeTlbStart		= 14,
+	OpalErrinjctTypeTlbEnd			= 15,
+	OpalErrinjctTypeUpstreamIoError		= 16,
+	OpalErrinjctTypeLast			= 17,
+
+	/* IoaBusError & IoaBusError64 */
+	OpalEjtIoaLoadMemAddr			= 0,
+	OpalEjtIoaLoadMemData			= 1,
+	OpalEjtIoaLoadIoAddr			= 2,
+	OpalEjtIoaLoadIoData			= 3,
+	OpalEjtIoaLoadConfigAddr		= 4,
+	OpalEjtIoaLoadConfigData		= 5,
+	OpalEjtIoaStoreMemAddr			= 6,
+	OpalEjtIoaStoreMemData			= 7,
+	OpalEjtIoaStoreIoAddr			= 8,
+	OpalEjtIoaStoreIoData			= 9,
+	OpalEjtIoaStoreConfigAddr		= 10,
+	OpalEjtIoaStoreConfigData		= 11,
+	OpalEjtIoaDmaReadMemAddr		= 12,
+	OpalEjtIoaDmaReadMemData		= 13,
+	OpalEjtIoaDmaReadMemMaster		= 14,
+	OpalEjtIoaDmaReadMemTarget		= 15,
+	OpalEjtIoaDmaWriteMemAddr		= 16,
+	OpalEjtIoaDmaWriteMemData		= 17,
+	OpalEjtIoaDmaWriteMemMaster		= 18,
+	OpalEjtIoaDmaWriteMemTarget		= 19,
+};
+
+struct OpalErrinjct {
+	int32_t type;
+	union {
+		struct {
+			uint32_t addr;
+			uint32_t mask;
+			uint64_t phb_id;
+			uint32_t pe;
+			uint32_t function;
+		}ioa;
+		struct {
+			uint64_t addr;
+			uint64_t mask;
+			uint64_t phb_id;
+			uint32_t pe;
+			uint32_t function;
+		}ioa64;
+	};
+};
+
 enum OpalShpcAction {
 	OPAL_SHPC_GET_LINK_STATE = 0,
 	OPAL_SHPC_GET_SLOT_STATE = 1
@@ -839,6 +903,7 @@ int64_t opal_pci_get_phb_diag_data(uint64_t phb_id, void *diag_buffer,
 				   uint64_t diag_buffer_len);
 int64_t opal_pci_get_phb_diag_data2(uint64_t phb_id, void *diag_buffer,
 				    uint64_t diag_buffer_len);
+int64_t opal_err_injct(void *data);
 int64_t opal_pci_fence_phb(uint64_t phb_id);
 int64_t opal_pci_reinit(uint64_t phb_id, uint64_t reinit_scope, uint64_t data);
 int64_t opal_pci_mask_pe_error(uint64_t phb_id, uint16_t pe_number, uint8_t error_type, uint8_t mask_action);
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index f531ffe..46265de 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -119,6 +119,7 @@ OPAL_CALL(opal_pci_next_error,			OPAL_PCI_NEXT_ERROR);
 OPAL_CALL(opal_pci_poll,			OPAL_PCI_POLL);
 OPAL_CALL(opal_pci_msi_eoi,			OPAL_PCI_MSI_EOI);
 OPAL_CALL(opal_pci_get_phb_diag_data2,		OPAL_PCI_GET_PHB_DIAG_DATA2);
+OPAL_CALL(opal_err_injct,			OPAL_ERR_INJECT);
 OPAL_CALL(opal_xscom_read,			OPAL_XSCOM_READ);
 OPAL_CALL(opal_xscom_write,			OPAL_XSCOM_WRITE);
 OPAL_CALL(opal_lpc_read,			OPAL_LPC_READ);
-- 
1.8.3.2
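
A note on the union added to struct OpalErrinjct in this patch: the
ioa and ioa64 members have different field widths and offsets, so the
member that gets filled has to match the type field. The following
userspace sketch mirrors the layout from the patch under invented names
(struct mock_errinjct, mock_fill(), MOCK_TYPE_*; the values 7 and 8
come from the enum above) and shows what happens to a 64-bit address in
each case. It relies on C11 anonymous unions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
	MOCK_TYPE_IOA   = 7,	/* OpalErrinjctTypeIoaBusError in the patch */
	MOCK_TYPE_IOA64 = 8	/* OpalErrinjctTypeIoaBusError64 in the patch */
};

/* Same member layout as struct OpalErrinjct in the patch, under a mock name. */
struct mock_errinjct {
	int32_t type;
	union {
		struct {
			uint32_t addr;
			uint32_t mask;
			uint64_t phb_id;
			uint32_t pe;
			uint32_t function;
		} ioa;
		struct {
			uint64_t addr;
			uint64_t mask;
			uint64_t phb_id;
			uint32_t pe;
			uint32_t function;
		} ioa64;
	};
};

/* Fill the union member that matches the requested type. */
void mock_fill(struct mock_errinjct *ej, int32_t type, uint64_t addr,
	       uint64_t mask, uint64_t phb_id, uint32_t pe, uint32_t function)
{
	assert(ej != NULL);
	ej->type = type;
	if (type == MOCK_TYPE_IOA64) {
		ej->ioa64.addr = addr;
		ej->ioa64.mask = mask;
		ej->ioa64.phb_id = phb_id;
		ej->ioa64.pe = pe;
		ej->ioa64.function = function;
	} else {
		/* The 32-bit variant keeps only the low 32 bits of addr and mask. */
		ej->ioa.addr = (uint32_t)addr;
		ej->ioa.mask = (uint32_t)mask;
		ej->ioa.phb_id = phb_id;
		ej->ioa.pe = pe;
		ej->ioa.function = function;
	}
}
```

Storing a 64-bit address through the ioa member keeps only its low 32
bits, so a 64-bit request should go through ioa64.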

* [PATCH 05/10] drivers/vfio: New IOCTL command VFIO_EEH_INFO
From: Gavin Shan @ 2014-05-09  7:49 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, Gavin Shan
In-Reply-To: <1399621782-23281-1-git-send-email-gwshan@linux.vnet.ibm.com>

The patch adds new IOCTL command VFIO_EEH_INFO to VFIO container
to support EEH functionality for PCI devices, which have been
passed from host to guest via VFIO.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/powernv/Makefile   |   1 +
 arch/powerpc/platforms/powernv/eeh-vfio.c | 584 ++++++++++++++++++++++++++++++
 drivers/vfio/vfio_iommu_spapr_tce.c       |  12 +
 include/uapi/linux/vfio.h                 |  61 ++++
 4 files changed, 658 insertions(+)
 create mode 100644 arch/powerpc/platforms/powernv/eeh-vfio.c

diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
index 63cebb9..2b15a03 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -6,5 +6,6 @@ obj-y			+= opal-msglog.o
 obj-$(CONFIG_SMP)	+= smp.o
 obj-$(CONFIG_PCI)	+= pci.o pci-p5ioc2.o pci-ioda.o
 obj-$(CONFIG_EEH)	+= eeh-ioda.o eeh-powernv.o
+obj-$(CONFIG_VFIO_EEH)	+= eeh-vfio.o
 obj-$(CONFIG_PPC_SCOM)	+= opal-xscom.o
 obj-$(CONFIG_MEMORY_FAILURE)	+= opal-memory-errors.o
diff --git a/arch/powerpc/platforms/powernv/eeh-vfio.c b/arch/powerpc/platforms/powernv/eeh-vfio.c
new file mode 100644
index 0000000..5766715
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/eeh-vfio.c
@@ -0,0 +1,584 @@
+/*
+  * The file intends to support EEH functionality for those PCI devices,
+  * which have been passed through from host to guest via VFIO. So this
+  * file is naturally part of VFIO implementation on PowerNV platform.
+  *
+  * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2014.
+  *
+  * This program is free software; you can redistribute it and/or modify
+  * it under the terms of the GNU General Public License as published by
+  * the Free Software Foundation; either version 2 of the License, or
+  * (at your option) any later version.
+  */
+
+#include <linux/init.h>
+#include <linux/io.h>
+#include <linux/irq.h>
+#include <linux/kernel.h>
+#include <linux/kvm_host.h>
+#include <linux/msi.h>
+#include <linux/pci.h>
+#include <linux/string.h>
+#include <linux/vfio.h>
+
+#include <asm/eeh.h>
+#include <asm/eeh_event.h>
+#include <asm/io.h>
+#include <asm/iommu.h>
+#include <asm/opal.h>
+#include <asm/msi_bitmap.h>
+#include <asm/pci-bridge.h>
+#include <asm/ppc-pci.h>
+#include <asm/tce.h>
+#include <asm/uaccess.h>
+
+#include "powernv.h"
+#include "pci.h"
+
+static int powernv_eeh_vfio_map(struct vfio_eeh_info *info)
+{
+	struct pci_bus *bus, *pe_bus;
+	struct pci_dev *pdev;
+	struct eeh_dev *edev;
+	struct eeh_pe *pe;
+	int domain, bus_no, devfn;
+
+	/* Host address */
+	domain = info->map.domain;
+	bus_no = (info->map.bdn >> 8) & 0xff;
+	devfn = info->map.bdn & 0xff;
+
+	/* Find PCI bus */
+	bus = pci_find_bus(domain, bus_no);
+	if (!bus) {
+		pr_warn("%s: PCI bus %04x:%02x not found\n",
+			__func__, domain, bus_no);
+		return -ENODEV;
+	}
+
+	/* Find PCI device */
+	pdev = pci_get_slot(bus, devfn);
+	if (!pdev) {
+		pr_warn("%s: PCI device %04x:%02x:%02x.%01x not found\n",
+			__func__, domain, bus_no,
+			PCI_SLOT(devfn), PCI_FUNC(devfn));
+		return -ENODEV;
+	}
+
+	/* No EEH device - almost impossible */
+	edev = pci_dev_to_eeh_dev(pdev);
+	if (unlikely(!edev)) {
+		pr_warn("%s: No EEH dev for PCI device %s\n",
+			__func__, pci_name(pdev));
+		pci_dev_put(pdev);
+		return -ENODEV;
+	}
+
+	/* Doesn't support PE migration between different PHBs */
+	pe = edev->pe;
+	if (!eeh_pe_passed(pe)) {
+		pe_bus = eeh_pe_bus_get(pe);
+		BUG_ON(!pe_bus);
+
+		/* PE# has format 00BBSS00 */
+		pe->gaddr.buid	  = info->map.gbuid;
+		pe->gaddr.pe_addr = pe_bus->number << 16;
+		eeh_pe_set_passed(pe, true);
+	} else if (pe->gaddr.buid != info->map.gbuid) {
+		pci_dev_put(pdev);
+		pr_warn("%s: Mismatched PHB BUID (0x%llx, 0x%llx)\n",
+			__func__, pe->gaddr.buid, info->map.gbuid);
+		return -EINVAL;
+	}
+
+	edev->gaddr.buid = info->map.gbuid;
+	edev->gaddr.bdn  = info->map.gbdn;
+	eeh_dev_set_passed(edev, true);
+
+	pr_debug("EEH: Host PCI dev %s to %llx-%02x:%02x.%01x\n",
+		 pci_name(pdev), info->map.gbuid,
+		 (info->map.gbdn >> 8) & 0xFF,
+		 PCI_SLOT(info->map.gbdn & 0xFF),
+		 PCI_FUNC(info->map.gbdn & 0xFF));
+
+	pci_dev_put(pdev);
+	return 0;
+}
+
+static int powernv_eeh_vfio_unmap(struct vfio_eeh_info *info)
+{
+	struct eeh_vfio_pci_addr addr;
+	struct pci_dev *pdev;
+	struct eeh_dev *edev, *tmp;
+	struct eeh_pe *pe;
+	bool passed;
+
+	/* Get EEH device */
+	addr.buid = info->unmap.buid;
+	addr.bdn  = info->unmap.bdn;
+	edev = eeh_vfio_dev_get(&addr);
+	if (!edev) {
+		pr_warn("%s: Cannot locate %llx:%02x:%02x.%01x\n",
+			__func__, info->unmap.buid,
+			(info->unmap.bdn >> 8) & 0xFF,
+			PCI_SLOT(info->unmap.bdn & 0xFF),
+			PCI_FUNC(info->unmap.bdn & 0xFF));
+		return -ENODEV;
+	}
+
+	/* Return EEH device */
+	memset(&edev->gaddr, 0, sizeof(edev->gaddr));
+	eeh_dev_set_passed(edev, false);
+	pdev = eeh_dev_to_pci_dev(edev);
+	pr_debug("EEH: Host PCI dev %s returned\n",
+		 pdev ? pci_name(pdev) : "NULL");
+
+	/* Return PE if no EEH device is owned by guest */
+	pe = edev->pe;
+	passed = false;
+	eeh_pe_for_each_dev(pe, edev, tmp) {
+		pdev = eeh_dev_to_pci_dev(edev);
+		if (pdev && pdev->subordinate)
+			continue;
+
+		if (eeh_dev_passed(edev)) {
+			passed = true;
+			break;
+		}
+	}
+
+	if (!passed) {
+		memset(&pe->gaddr, 0, sizeof(pe->gaddr));
+		eeh_pe_set_passed(pe, false);
+		pr_debug("EEH: PHB#%x-PE#%x returned to host\n",
+			 pe->phb->global_number, pe->addr);
+	}
+
+	return 0;
+}
+
+static int powernv_eeh_vfio_set_option(struct vfio_eeh_info *info)
+{
+	struct pnv_phb *phb;
+	struct eeh_dev *edev;
+	struct eeh_pe *pe;
+	struct eeh_vfio_pci_addr addr;
+	int opcode = info->option.option;
+	int ret = 0;
+
+	/* Check opcode */
+	if (opcode < EEH_OPT_DISABLE || opcode > EEH_OPT_THAW_DMA) {
+		pr_warn("%s: opcode %d out of range (%d, %d)\n",
+			__func__, opcode, EEH_OPT_DISABLE, EEH_OPT_THAW_DMA);
+		ret = 3;
+		goto out;
+	}
+
+	/* Option "enable" uses PCI config address */
+	if (opcode == EEH_OPT_ENABLE) {
+		addr.buid = info->option.buid;
+		addr.bdn  = (info->option.pe_addr >> 8) & 0xFFFF;
+		edev = eeh_vfio_dev_get(&addr);
+		if (!edev) {
+			pr_warn("%s: Cannot locate %llx:%02x:%02x.%01x\n",
+				__func__, addr.buid,
+				(addr.bdn >> 8) & 0xFF,
+				PCI_SLOT(addr.bdn & 0xFF),
+				PCI_FUNC(addr.bdn & 0xFF));
+			ret = 7;
+			goto out;
+		}
+		phb = edev->phb->private_data;
+	} else {
+		addr.buid    = info->option.buid;
+		addr.pe_addr = info->option.pe_addr;
+		pe = eeh_vfio_pe_get(&addr);
+		if (!pe) {
+			pr_warn("%s: Cannot find PE %llx:%x\n",
+				__func__, addr.buid, addr.pe_addr);
+			ret = 7;
+			goto out;
+		}
+		phb = pe->phb->private_data;
+	}
+
+	/* Ensure that the EEH stuff has been initialized */
+	if (!(phb->flags & PNV_PHB_FLAG_EEH)) {
+		pr_warn("%s: EEH disabled on PHB#%d\n",
+			__func__, phb->hose->global_number);
+		ret = 7;
+		goto out;
+	}
+
+	/*
+	 * The EEH functionality has been enabled on all PEs
+	 * by default. So just return success. The same situation
+	 * would be applied while we disable EEH functionality.
+	 * However, the guest isn't expected to disable that
+	 * at all.
+	 */
+	if (opcode == EEH_OPT_DISABLE ||
+	    opcode == EEH_OPT_ENABLE) {
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * Call into the IODA dependent backend in order
+	 * to enable DMA or MMIO for the indicated PE.
+	 */
+	if (phb->eeh_ops && phb->eeh_ops->set_option) {
+		if (phb->eeh_ops->set_option(pe, opcode)) {
+			pr_warn("%s: Failure from backend\n", __func__);
+			ret = 1;
+		}
+	} else {
+		pr_warn("%s: Unsupported request\n", __func__);
+		ret = 7;
+	}
+
+out:
+	return ret;
+}
+
+static int powernv_eeh_vfio_get_addr(struct vfio_eeh_info *info)
+{
+	struct pnv_phb *phb;
+	struct eeh_dev *edev;
+	struct eeh_vfio_pci_addr addr;
+	int opcode = info->addr.option;
+	int ret = 0;
+
+	/* Check opcode */
+	if (opcode != 0 && opcode != 1) {
+		pr_warn("%s: opcode %d out of range (0, 1)\n",
+			__func__, opcode);
+		ret = 3;
+		goto out;
+	}
+
+	/* Find EEH device */
+	addr.buid = info->addr.buid;
+	addr.bdn  = (info->addr.bdn >> 8) & 0xFFFF;
+	edev = eeh_vfio_dev_get(&addr);
+	if (!edev) {
+		pr_warn("%s: Cannot locate %llx:%02x:%02x.%01x\n",
+			__func__, addr.buid,
+			(addr.bdn >> 8) & 0xFF,
+			PCI_SLOT(addr.bdn & 0xFF),
+			PCI_FUNC(addr.bdn & 0xFF));
+		ret = 7;
+		goto out;
+	}
+	phb = edev->phb->private_data;
+
+	/* EEH enabled ? */
+	if (!(phb->flags & PNV_PHB_FLAG_EEH)) {
+		pr_warn("%s: EEH disabled on PHB#%d\n",
+			__func__, phb->hose->global_number);
+		ret = 3;
+		goto out;
+	}
+
+	/* EEH device passed ? */
+	if (!eeh_dev_passed(edev)) {
+		pr_warn("%s: EEH dev %llx:%02x:%02x.%01x owned by host\n",
+			__func__, addr.buid,
+			(addr.bdn >> 8) & 0xFF,
+			PCI_SLOT(addr.bdn & 0xFF),
+			PCI_FUNC(addr.bdn & 0xFF));
+		ret = 3;
+		goto out;
+	}
+
+	/*
+	 * Fill result according to opcode. We don't differentiate
+	 * PCI bus and device sensitive PE here.
+	 */
+	if (opcode == 0)
+		info->addr.ret = edev->pe->gaddr.pe_addr;
+	else
+		info->addr.ret = 1;
+out:
+	return ret;
+}
+
+static int powernv_eeh_vfio_get_state(struct vfio_eeh_info *info)
+{
+	struct pnv_phb *phb;
+	struct eeh_pe *pe;
+	struct eeh_vfio_pci_addr addr;
+	int result, ret = 0;
+
+	/* Locate the PE */
+	addr.buid    = info->state.buid;
+	addr.pe_addr = info->state.pe_addr;
+	pe = eeh_vfio_pe_get(&addr);
+	if (!pe) {
+		pr_warn("%s: Cannot locate %llx:%x\n",
+			__func__, addr.buid, addr.pe_addr);
+		ret = 3;
+		goto out;
+	}
+	phb = pe->phb->private_data;
+
+	/* EEH enabled ? */
+	if (!(phb->flags & PNV_PHB_FLAG_EEH)) {
+		pr_warn("%s: EEH disabled on PHB#%d\n",
+			__func__, phb->hose->global_number);
+		ret = 3;
+		goto out;
+	}
+
+	/* Call to the IOC dependent function */
+	if (phb->eeh_ops && phb->eeh_ops->get_state) {
+		result = phb->eeh_ops->get_state(pe);
+
+		if (!(result & EEH_STATE_RESET_ACTIVE) &&
+		     (result & EEH_STATE_DMA_ENABLED) &&
+		     (result & EEH_STATE_MMIO_ENABLED))
+			info->state.state = 0;
+		else if (result & EEH_STATE_RESET_ACTIVE)
+			info->state.state = 1;
+		else if (!(result & EEH_STATE_RESET_ACTIVE) &&
+			 !(result & EEH_STATE_DMA_ENABLED) &&
+			 !(result & EEH_STATE_MMIO_ENABLED))
+			info->state.state = 2;
+		else if (!(result & EEH_STATE_RESET_ACTIVE) &&
+			 (result & EEH_STATE_DMA_ENABLED) &&
+			 !(result & EEH_STATE_MMIO_ENABLED))
+			info->state.state = 4;
+		else
+			info->state.state = 5;
+
+		ret = 0;
+	} else {
+		pr_warn("%s: Unsupported request\n", __func__);
+		ret = 3;
+	}
+
+out:
+	return ret;
+}
+
+static int powernv_eeh_vfio_pe_reset(struct vfio_eeh_info *info)
+{
+	struct pnv_phb *phb;
+	struct eeh_pe *pe;
+	struct eeh_vfio_pci_addr addr;
+	int opcode = info->reset.option;
+	int ret = 0;
+
+	/* Check opcode */
+	if (opcode != EEH_RESET_DEACTIVATE &&
+	    opcode != EEH_RESET_HOT &&
+	    opcode != EEH_RESET_FUNDAMENTAL) {
+		pr_warn("%s: Unsupported opcode %d\n", __func__, opcode);
+		ret = 3;
+		goto out;
+	}
+
+	/* Locate the PE */
+	addr.buid    = info->reset.buid;
+	addr.pe_addr = info->reset.pe_addr;
+	pe = eeh_vfio_pe_get(&addr);
+	if (!pe) {
+		pr_warn("%s: Cannot locate %llx:%x\n",
+			__func__, addr.buid, addr.pe_addr);
+		ret = 3;
+		goto out;
+	}
+	phb = pe->phb->private_data;
+
+	/* EEH enabled ? */
+	if (!(phb->flags & PNV_PHB_FLAG_EEH)) {
+		pr_warn("%s: EEH disabled on PHB#%d\n",
+			__func__, phb->hose->global_number);
+		ret = 3;
+		goto out;
+	}
+
+	/* Call into the IODA dependent backend to do the reset */
+	if (!phb->eeh_ops ||
+	    !phb->eeh_ops->set_option ||
+	    !phb->eeh_ops->reset) {
+		pr_warn("%s: Unsupported request\n", __func__);
+		ret = 7;
+	} else {
+		/*
+		 * The frozen PE might be caused by the mechanism called
+		 * PAPR error injection, which is supposed to be one-shot
+		 * without a "sticky" bit, as stated by the spec. That is
+		 * not the case in reality, at least on P7IOC. So we have
+		 * to clear it to avoid a recursive error, which would
+		 * eventually fail the recovery.
+		 */
+		if (opcode == EEH_RESET_DEACTIVATE)
+			opal_pci_reset(phb->opal_id,
+				       OPAL_PHB_ERROR,
+				       OPAL_ASSERT_RESET);
+
+		if (phb->eeh_ops->reset(pe, opcode)) {
+			pr_warn("%s: Failure from backend\n", __func__);
+			ret = 1;
+			goto out;
+		}
+
+		/*
+		 * The PE is still in frozen state and we need to clear that.
+		 * It's good to clear the frozen state after deassert to avoid
+		 * messy IO access during reset, which might cause a recursive
+		 * frozen PE.
+		 */
+		if (opcode == EEH_RESET_DEACTIVATE) {
+			phb->eeh_ops->set_option(pe, EEH_OPT_THAW_MMIO);
+			phb->eeh_ops->set_option(pe, EEH_OPT_THAW_DMA);
+		}
+	}
+
+out:
+	return ret;
+}
+
+static int powernv_eeh_vfio_pe_config(struct vfio_eeh_info *info)
+{
+	struct pnv_phb *phb;
+	struct eeh_pe *pe;
+	struct eeh_vfio_pci_addr addr;
+	int ret = 0;
+
+	/* Locate the PE */
+	addr.buid    = info->config.buid;
+	addr.pe_addr = info->config.pe_addr;
+	pe = eeh_vfio_pe_get(&addr);
+	if (!pe) {
+		pr_warn("%s: Cannot locate %llx:%x\n",
+			__func__, addr.buid, addr.pe_addr);
+		ret = 3;
+		goto out;
+	}
+	phb = pe->phb->private_data;
+
+	/* EEH enabled ? */
+	if (!(phb->flags & PNV_PHB_FLAG_EEH)) {
+		pr_warn("%s: EEH disabled on PHB#%d\n",
+			__func__, phb->hose->global_number);
+		ret = 3;
+		goto out;
+	}
+
+	/*
+	 * Access to PCI config space on a VFIO device is limited.
+	 * Part of the config space, including the BAR registers,
+	 * is neither readable nor writable from the guest. The
+	 * guest may thus hold stale values for those registers,
+	 * so they have to be restored on the host side.
+	 */
+	eeh_pe_restore_bars(pe);
+out:
+	return ret;
+}
+
+void eeh_vfio_release(struct iommu_table *tbl)
+{
+	struct pnv_ioda_pe *pnv_pe = container_of(tbl, struct pnv_ioda_pe,
+						  tce32_table);
+	struct pnv_phb *phb = pnv_pe->phb;
+	struct eeh_pe *phb_pe, *pe;
+	struct eeh_dev dev, *edev, *tmp;
+
+	/* Find PHB PE */
+	phb_pe = eeh_phb_pe_get(phb->hose);
+	if (unlikely(!phb_pe)) {
+		pr_warn("%s: Cannot find PHB#%d PE\n",
+			__func__, phb->hose->global_number);
+		return;
+	}
+
+	/* Find PE */
+	memset(&dev, 0, sizeof(struct eeh_dev));
+	dev.phb = phb->hose;
+	dev.pe_config_addr = pnv_pe->pe_number;
+	pe = eeh_pe_get(&dev);
+	if (unlikely(!pe)) {
+		pr_warn("%s: Cannot find PE instance for PHB#%d-PE#%d\n",
+			__func__, phb->hose->global_number,
+			pnv_pe->pe_number);
+		return;
+	}
+
+	/* Release it to host */
+	if (!eeh_pe_passed(pe))
+		return;
+
+	eeh_pe_for_each_dev(pe, edev, tmp) {
+		if (!eeh_dev_passed(edev))
+			continue;
+
+		memset(&edev->gaddr, 0, sizeof(edev->gaddr));
+		eeh_dev_set_passed(edev, false);
+	}
+
+	memset(&pe->gaddr, 0, sizeof(pe->gaddr));
+	eeh_pe_set_passed(pe, false);
+}
+EXPORT_SYMBOL(eeh_vfio_release);
+
+int eeh_vfio_ioctl(unsigned long arg)
+{
+	struct vfio_eeh_info info;
+	int ret = -EINVAL;
+
+	/* Copy over user argument */
+	if (copy_from_user(&info, (void __user *)arg, sizeof(info))) {
+		pr_warn("%s: Cannot copy user argument 0x%lx\n",
+			__func__, arg);
+		return -EFAULT;
+	}
+
+	/* Sanity check */
+	if (info.argsz != sizeof(info)) {
+		pr_warn("%s: Invalid argument size (%u, %zu)\n",
+			__func__, info.argsz, sizeof(info));
+		return -EINVAL;
+	}
+
+	/* Route according to operation */
+	switch (info.op) {
+	case vfio_eeh_ops_map:
+		ret = powernv_eeh_vfio_map(&info);
+		break;
+	case vfio_eeh_ops_unmap:
+		ret = powernv_eeh_vfio_unmap(&info);
+		break;
+	case vfio_eeh_ops_set_option:
+		ret = powernv_eeh_vfio_set_option(&info);
+		break;
+	case vfio_eeh_ops_get_addr:
+		ret = powernv_eeh_vfio_get_addr(&info);
+		break;
+	case vfio_eeh_ops_get_state:
+		ret = powernv_eeh_vfio_get_state(&info);
+		break;
+	case vfio_eeh_ops_pe_reset:
+		ret = powernv_eeh_vfio_pe_reset(&info);
+		break;
+	case vfio_eeh_ops_pe_config:
+		ret = powernv_eeh_vfio_pe_config(&info);
+		break;
+	default:
+		pr_info("%s: Cannot handle op#%d (%d, %d)\n",
+			__func__, info.op, vfio_eeh_ops_min,
+			vfio_eeh_ops_max);
+	}
+
+	/* Copy data back */
+	if (!ret && copy_to_user((void __user *)arg, &info, sizeof(info))) {
+		pr_warn("%s: Cannot copy to user 0x%lx\n",
+			__func__, arg);
+		return -EFAULT;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(eeh_vfio_ioctl);
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index a84788b..c45dece 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -26,6 +26,11 @@
 #define DRIVER_AUTHOR   "aik@ozlabs.ru"
 #define DRIVER_DESC     "VFIO IOMMU SPAPR TCE"
 
+#ifdef CONFIG_VFIO_EEH
+extern void eeh_vfio_release(struct iommu_table *tbl);
+extern int eeh_vfio_ioctl(unsigned long arg);
+#endif
+
 static void tce_iommu_detach_group(void *iommu_data,
 		struct iommu_group *iommu_group);
 
@@ -283,6 +288,10 @@ static long tce_iommu_ioctl(void *iommu_data,
 		tce_iommu_disable(container);
 		mutex_unlock(&container->lock);
 		return 0;
+#ifdef CONFIG_VFIO_EEH
+	case VFIO_EEH_INFO:
+		return eeh_vfio_ioctl(arg);
+#endif
 	}
 
 	return -ENOTTY;
@@ -342,6 +351,9 @@ static void tce_iommu_detach_group(void *iommu_data,
 		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
 				iommu_group_id(iommu_group), iommu_group); */
 		container->tbl = NULL;
+#ifdef CONFIG_VFIO_EEH
+		eeh_vfio_release(tbl);
+#endif
 		iommu_release_ownership(tbl);
 	}
 	mutex_unlock(&container->lock);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index cb9023d..4e1c7f9 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -455,6 +455,67 @@ struct vfio_iommu_spapr_tce_info {
 
 #define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
 
+/*
+ * The VFIO EEH info struct provides a way to support EEH functionality
+ * for PCI devices that are passed from host to guest via VFIO.
+ */
+enum {
+	vfio_eeh_ops_min	= 0,
+	vfio_eeh_ops_map	= 0,
+	vfio_eeh_ops_unmap	= 1,
+	vfio_eeh_ops_set_option	= 2,
+	vfio_eeh_ops_get_addr	= 3,
+	vfio_eeh_ops_get_state	= 4,
+	vfio_eeh_ops_pe_reset	= 5,
+	vfio_eeh_ops_pe_config	= 6,
+	vfio_eeh_ops_max	= 6
+};
+
+struct vfio_eeh_info {
+	__u32 argsz;
+	__u32 op;
+
+	union {
+		struct vfio_eeh_map {
+			__u32 domain;
+			__u16 bdn;
+			__u64 gbuid;
+			__u16 gbdn;
+		} map;
+		struct vfio_eeh_unmap {
+			__u64 buid;
+			__u16 bdn;
+		} unmap;
+		struct vfio_eeh_set_option {
+			__u64 buid;
+			__u32 pe_addr;
+			__u32 option;
+		} option;
+		struct vfio_eeh_pe_addr {
+			__u64 buid;
+			__u32 bdn;
+			__u32 option;
+			__u32 ret;
+		} addr;
+		struct vfio_eeh_state {
+			__u64 buid;
+			__u32 pe_addr;
+			__u32 state;
+		} state;
+		struct vfio_eeh_reset {
+			__u64 buid;
+			__u32 pe_addr;
+			__u32 option;
+		} reset;
+		struct vfio_eeh_config {
+			__u64 buid;
+			__u32 pe_addr;
+		} config;
+	};
+};
+
+#define VFIO_EEH_INFO	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
 /* ***************************************************************** */
 
 #endif /* _UAPIVFIO_H */
-- 
1.8.3.2

* [PATCH 04/10] powerpc/eeh: Search EEH PE by guest address
From: Gavin Shan @ 2014-05-09  7:49 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, Gavin Shan
In-Reply-To: <1399621782-23281-1-git-send-email-gwshan@linux.vnet.ibm.com>

The patch introduces function eeh_vfio_pe_get() to search for the EEH
PE by its guest address, which is made up of the PHB BUID and the PE
configuration address. The function will be used by the backends for
EEH RTAS emulation.
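
The lookup rule can be sketched in plain C as follows. This is only a
model under stated assumptions: the kernel walks the PE tree with
eeh_pe_traverse(), whereas the sketch uses a flat array, and all
sketch_* names are hypothetical. The PE address layout (bus number in
bits 23:16, i.e. "00BBSS00" with SS left as zero) follows what
powernv_eeh_vfio_map() in this series stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the guest-visible part of struct eeh_pe */
struct sketch_pe {
	int passed;		/* non-zero once the PE is owned by a guest */
	uint64_t gbuid;		/* guest PHB BUID */
	uint32_t gpe_addr;	/* guest PE address, format 00BBSS00 */
};

/* Build a PE address from a bus number: the bus lands in bits 23:16 */
static uint32_t sketch_pe_addr(uint8_t bus)
{
	return (uint32_t)bus << 16;
}

/*
 * Return the first passed-through PE whose guest address matches both
 * the BUID and the PE address, or NULL when there is none. PEs still
 * owned by the host are skipped even if their stale address matches.
 */
static struct sketch_pe *sketch_pe_get(struct sketch_pe *pes, size_t n,
				       uint64_t buid, uint32_t pe_addr)
{
	size_t i;

	assert(pes != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!pes[i].passed)
			continue;
		if (pes[i].gbuid == buid && pes[i].gpe_addr == pe_addr)
			return &pes[i];
	}
	return NULL;
}
```

Matching on the pair rather than on the PE address alone matters
because the same bus number can appear under different PHBs.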

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h |  1 +
 arch/powerpc/kernel/eeh_pe.c   | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 8ffaf39..750e028 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -384,6 +384,7 @@ static inline void eeh_remove_device(struct pci_dev *dev) { }
 
 #ifdef CONFIG_VFIO_EEH
 struct eeh_dev *eeh_vfio_dev_get(struct eeh_vfio_pci_addr *addr);
+struct eeh_pe *eeh_vfio_pe_get(struct eeh_vfio_pci_addr *addr);
 #endif /* CONFIG_VFIO_EEH */
 
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index d09f055..8dc58ac 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -288,6 +288,44 @@ struct eeh_dev *eeh_vfio_dev_get(struct eeh_vfio_pci_addr *addr)
 
 	return NULL;
 }
+
+static void *__eeh_vfio_pe_get(void *data, void *flag)
+{
+	struct eeh_pe *pe = (struct eeh_pe *)data;
+	struct eeh_vfio_pci_addr *addr = (struct eeh_vfio_pci_addr *)flag;
+
+	if (!eeh_pe_passed(pe))
+		return NULL;
+
+	/* Comparing the address */
+	if (addr->buid    == pe->gaddr.buid &&
+	    addr->pe_addr == pe->gaddr.pe_addr)
+		return pe;
+
+	return NULL;
+}
+
+/**
+ * eeh_vfio_pe_get - Search EEH PE based on guest's address
+ * @addr: EEH PE guest address
+ *
+ * Search the EEH PE according to the guest address, which
+ * is made up of the PHB BUID and the PE configuration
+ * address.
+ */
+struct eeh_pe *eeh_vfio_pe_get(struct eeh_vfio_pci_addr *addr)
+{
+	struct eeh_pe *root;
+	struct eeh_pe *pe;
+
+	list_for_each_entry(root, &eeh_phb_pe, child) {
+		pe = eeh_pe_traverse(root, __eeh_vfio_pe_get, addr);
+		if (pe)
+			return pe;
+	}
+
+	return NULL;
+}
 #endif /* CONFIG_VFIO_EEH */
 
 /**
-- 
1.8.3.2

* [PATCH 08/10] powerpc: Extend syscall ppc_rtas()
From: Gavin Shan @ 2014-05-09  7:49 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, Gavin Shan
In-Reply-To: <1399621782-23281-1-git-send-email-gwshan@linux.vnet.ibm.com>

Currently, the syscall ppc_rtas() can be used to invoke RTAS calls
from user space. The utility "errinjct" uses it to inject various
errors into the system for testing purposes. The patch extends the
syscall to support both the pSeries and PowerNV platforms, so that
either RTAS or OPAL calls can be invoked from user space. In turn,
"errinjct" can be supported on both pSeries and PowerNV.

The original syscall handler ppc_rtas() is renamed to ppc_firmware(),
which calls ppc_call_rtas() or ppc_call_opal() depending on the
running platform. Data is passed between userland and the kernel in
"struct rtas_args"; how that data is interpreted is platform
specific.
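
For readers unfamiliar with the argument block, here is a minimal
userspace sketch of the checks the common handler applies before it
dispatches to a platform. It is written under assumptions: the
structure is a local mirror of what struct rtas_args is believed to
look like (big-endian token, nargs, nret, then 16 argument words
shared by inputs and outputs), ntohl() stands in for be32_to_cpu(),
and all sketch_* / SKETCH_* names are hypothetical.

```c
#include <arpa/inet.h>	/* ntohl()/htonl() stand in for be32 helpers */
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_ARGS		16
#define SKETCH_UNKNOWN_SERVICE	0xffffffffu	/* RTAS_UNKNOWN_SERVICE (-1) */

/* Assumed mirror of struct rtas_args; header fields are big endian */
struct sketch_rtas_args {
	uint32_t token;
	uint32_t nargs;
	uint32_t nret;
	uint32_t args[SKETCH_MAX_ARGS];
	uint32_t *rets;		/* points into args[], right after the inputs */
};

/*
 * Validate the header and prepare the return area.
 * Returns 0 on success or -EINVAL when the counts overflow the
 * argument array or the token is the "unknown service" marker.
 */
static int sketch_prepare(struct sketch_rtas_args *a)
{
	uint32_t nargs, nret;

	assert(a != NULL);
	nargs = ntohl(a->nargs);
	nret = ntohl(a->nret);

	/* Inputs and outputs share one 16-entry array */
	if (nargs > SKETCH_MAX_ARGS || nret > SKETCH_MAX_ARGS ||
	    nargs + nret > SKETCH_MAX_ARGS)
		return -EINVAL;

	if (ntohl(a->token) == SKETCH_UNKNOWN_SERVICE)
		return -EINVAL;

	/* Return values live directly behind the input arguments */
	a->rets = &a->args[nargs];
	memset(a->rets, 0, nret * sizeof(a->args[0]));
	return 0;
}
```

Checking each count separately before their sum keeps the test safe
against wrap-around when a caller passes very large values.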

Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/rtas.h        | 10 +++++-
 arch/powerpc/include/asm/syscalls.h    |  2 +-
 arch/powerpc/include/asm/systbl.h      |  2 +-
 arch/powerpc/include/uapi/asm/unistd.h |  2 +-
 arch/powerpc/kernel/rtas.c             | 57 +++++++---------------------------
 arch/powerpc/kernel/syscalls.c         | 50 +++++++++++++++++++++++++++++
 arch/powerpc/platforms/powernv/opal.c  |  7 +++++
 kernel/sys_ni.c                        |  2 +-
 8 files changed, 82 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index b390f55..3428524 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -20,7 +20,7 @@
 #define RTAS_UNKNOWN_SERVICE (-1)
 #define RTAS_INSTANTIATE_MAX (1ULL<<30) /* Don't instantiate rtas at/above this value */
 
-/* Buffer size for ppc_rtas system call. */
+/* Buffer size for ppc_firmware system call. */
 #define RTAS_RMOBUF_MAX (64 * 1024)
 
 /* RTAS return status codes */
@@ -427,9 +427,17 @@ static inline int page_is_rtas_user_buf(unsigned long pfn)
 /* Not the best place to put pSeries_coalesce_init, will be fixed when we
  * move some of the rtas suspend-me stuff to pseries */
 extern void pSeries_coalesce_init(void);
+extern int ppc_call_rtas(struct rtas_args *args);
 #else
 static inline int page_is_rtas_user_buf(unsigned long pfn) { return 0;}
 static inline void pSeries_coalesce_init(void) { }
+static inline int ppc_call_rtas(struct rtas_args *args) { return -ENXIO; }
+#endif
+
+#ifdef CONFIG_PPC_POWERNV
+extern int ppc_call_opal(struct rtas_args *args);
+#else
+static inline int ppc_call_opal(struct rtas_args *args) { return -ENXIO; }
 #endif
 
 extern int call_rtas(const char *, int, int, unsigned long *, ...);
diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h
index 23be8f1..3383e50 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -15,7 +15,7 @@ asmlinkage unsigned long sys_mmap2(unsigned long addr, size_t len,
 		unsigned long prot, unsigned long flags,
 		unsigned long fd, unsigned long pgoff);
 asmlinkage long ppc64_personality(unsigned long personality);
-asmlinkage int ppc_rtas(struct rtas_args __user *uargs);
+asmlinkage int ppc_firmware(struct rtas_args __user *uargs);
 
 #endif /* __KERNEL__ */
 #endif /* __ASM_POWERPC_SYSCALLS_H */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 3ddf702..00f8bb2 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -259,7 +259,7 @@ COMPAT_SYS_SPU(utimes)
 COMPAT_SYS_SPU(statfs64)
 COMPAT_SYS_SPU(fstatfs64)
 SYSX(sys_ni_syscall, ppc_fadvise64_64, ppc_fadvise64_64)
-PPC_SYS_SPU(rtas)
+PPC_SYS_SPU(firmware)
 OLDSYS(debug_setcontext)
 SYSCALL(ni_syscall)
 COMPAT_SYS(migrate_pages)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index 881bf2e..3aee765 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -273,7 +273,7 @@
 #ifndef __powerpc64__
 #define __NR_fadvise64_64	254
 #endif
-#define __NR_rtas		255
+#define __NR_firmware		255
 #define __NR_sys_debug_setcontext 256
 /* Number 257 is reserved for vserver */
 #define __NR_migrate_pages	258
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 8cd5ed0..5d829a72 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1017,59 +1017,32 @@ struct pseries_errorlog *get_pseries_errorlog(struct rtas_error_log *log,
 }
 
 /* We assume to be passed big endian arguments */
-asmlinkage int ppc_rtas(struct rtas_args __user *uargs)
+int ppc_call_rtas(struct rtas_args *args)
 {
-	struct rtas_args args;
 	unsigned long flags;
 	char *buff_copy, *errbuf = NULL;
-	int nargs, nret, token;
 	int rc;
 
-	if (!capable(CAP_SYS_ADMIN))
-		return -EPERM;
-
-	if (copy_from_user(&args, uargs, 3 * sizeof(u32)) != 0)
-		return -EFAULT;
-
-	nargs = be32_to_cpu(args.nargs);
-	nret  = be32_to_cpu(args.nret);
-	token = be32_to_cpu(args.token);
-
-	if (nargs > ARRAY_SIZE(args.args)
-	    || nret > ARRAY_SIZE(args.args)
-	    || nargs + nret > ARRAY_SIZE(args.args))
-		return -EINVAL;
-
-	/* Copy in args. */
-	if (copy_from_user(args.args, uargs->args,
-			   nargs * sizeof(rtas_arg_t)) != 0)
-		return -EFAULT;
-
-	if (token == RTAS_UNKNOWN_SERVICE)
-		return -EINVAL;
-
-	args.rets = &args.args[nargs];
-	memset(args.rets, 0, nret * sizeof(rtas_arg_t));
-
 	/* Need to handle ibm,suspend_me call specially */
-	if (token == ibm_suspend_me_token) {
-		rc = rtas_ibm_suspend_me(&args);
+	if (be32_to_cpu(args->token) == ibm_suspend_me_token) {
+		rc = rtas_ibm_suspend_me(args);
 		if (rc)
 			return rc;
-		goto copy_return;
+		goto out;
 	}
 
 	buff_copy = get_errorlog_buffer();
 
 	flags = lock_rtas();
-
-	rtas.args = args;
+	rtas.args = *args;
 	enter_rtas(__pa(&rtas.args));
-	args = rtas.args;
+	*args = rtas.args;
 
-	/* A -1 return code indicates that the last command couldn't
-	   be completed due to a hardware error. */
-	if (be32_to_cpu(args.rets[0]) == -1)
+	/*
+	 * A -1 return code indicates that the last command couldn't
+	 * be completed due to a hardware error.
+	 */
+	if (be32_to_cpu(args->rets[0]) == -1)
 		errbuf = __fetch_rtas_last_error(buff_copy);
 
 	unlock_rtas(flags);
@@ -1080,13 +1053,7 @@ asmlinkage int ppc_rtas(struct rtas_args __user *uargs)
 		kfree(buff_copy);
 	}
 
- copy_return:
-	/* Copy out args. */
-	if (copy_to_user(uargs->args + nargs,
-			 args.args + nargs,
-			 nret * sizeof(rtas_arg_t)) != 0)
-		return -EFAULT;
-
+out:
 	return 0;
 }
 
diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
index cd9be9a..bcb7483 100644
--- a/arch/powerpc/kernel/syscalls.c
+++ b/arch/powerpc/kernel/syscalls.c
@@ -40,6 +40,56 @@
 #include <asm/syscalls.h>
 #include <asm/time.h>
 #include <asm/unistd.h>
+#include <asm/machdep.h>
+#include <asm/rtas.h>
+
+asmlinkage int ppc_firmware(struct rtas_args __user *uargs)
+{
+	int rc, nargs, nret, token;
+	struct rtas_args args;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (copy_from_user(&args, uargs, 3 * sizeof(u32)))
+		return -EFAULT;
+	nargs = be32_to_cpu(args.nargs);
+	nret  = be32_to_cpu(args.nret);
+	token = be32_to_cpu(args.token);
+
+	/* Parameter overflow ? */
+	if (nargs > ARRAY_SIZE(args.args)
+	    || nret > ARRAY_SIZE(args.args)
+	    || nargs + nret > ARRAY_SIZE(args.args))
+		return -EINVAL;
+
+	/* Copy over all arguments */
+	if (copy_from_user(args.args, uargs->args,
+			   nargs * sizeof(rtas_arg_t)))
+		return -EFAULT;
+
+	/* Invalid token ? */
+	if (token == RTAS_UNKNOWN_SERVICE)
+		return -EINVAL;
+
+	/* Clean out return values */
+	args.rets = &args.args[nargs];
+	memset(args.rets, 0, nret * sizeof(rtas_arg_t));
+
+	/* Route to correct platform */
+	if (machine_is(pseries))
+		rc = ppc_call_rtas(&args);
+	else if (machine_is(powernv))
+		rc = ppc_call_opal(&args);
+	else
+		return -ENXIO;
+
+	/* Copy result to user space */
+	if (copy_to_user(uargs->args + nargs, args.args + nargs,
+			 nret * sizeof(rtas_arg_t)))
+		return -EFAULT;
+
+	return rc;
+}
 
 static inline unsigned long do_mmap2(unsigned long addr, size_t len,
 			unsigned long prot, unsigned long flags,
diff --git a/arch/powerpc/platforms/powernv/opal.c b/arch/powerpc/platforms/powernv/opal.c
index 360ad80c..ad33c2b 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -25,6 +25,7 @@
 #include <asm/opal.h>
 #include <asm/firmware.h>
 #include <asm/mce.h>
+#include <asm/rtas.h>
 
 #include "powernv.h"
 
@@ -701,3 +702,9 @@ void opal_free_sg_list(struct opal_sg_list *sg)
 			sg = NULL;
 	}
 }
+
+/* Extend it later */
+int ppc_call_opal(struct rtas_args *args)
+{
+	return 0;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index bc8d1b7..2c5b3fa 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -159,7 +159,7 @@ cond_syscall(sys_pciconfig_read);
 cond_syscall(sys_pciconfig_write);
 cond_syscall(sys_pciconfig_iobase);
 cond_syscall(compat_sys_s390_ipc);
-cond_syscall(ppc_rtas);
+cond_syscall(ppc_firmware);
 cond_syscall(sys_spu_run);
 cond_syscall(sys_spu_create);
 cond_syscall(sys_subpage_prot);
-- 
1.8.3.2

* [PATCH 03/10] powerpc/eeh: Search EEH device by guest address
From: Gavin Shan @ 2014-05-09  7:49 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, Gavin Shan
In-Reply-To: <1399621782-23281-1-git-send-email-gwshan@linux.vnet.ibm.com>

The patch introduces function eeh_vfio_dev_get() to search for the EEH
device by its guest address, which is made up of the PHB BUID and the
bus, slot and function numbers. The function is useful in the backends
for EEH RTAS emulation.
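
To make the address format concrete, a small standalone sketch is
given below. It assumes the guest address is a (BUID, bdn) pair in
which bdn carries the bus in bits 15:8 and devfn in bits 7:0, as the
comparison in __eeh_vfio_dev_get() suggests. The sketch_* types and
helpers are hypothetical, and a flat array replaces the PE tree walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the guest address and the EEH device */
struct sketch_gaddr {
	uint64_t buid;		/* guest PHB BUID */
	uint16_t bdn;		/* bus (15:8), slot (7:3), function (2:0) */
};

struct sketch_dev {
	int passed;		/* non-zero once handed to a guest */
	struct sketch_gaddr gaddr;
};

/* Decode helpers; slot and function follow the usual devfn split */
static uint8_t sketch_bus(uint16_t bdn)  { return (bdn >> 8) & 0xff; }
static uint8_t sketch_slot(uint16_t bdn) { return (bdn >> 3) & 0x1f; }
static uint8_t sketch_func(uint16_t bdn) { return bdn & 0x07; }

/*
 * Return the passed-through device whose guest address equals *addr,
 * or NULL. Devices still owned by the host never match.
 */
static struct sketch_dev *sketch_dev_get(struct sketch_dev *devs, size_t n,
					 const struct sketch_gaddr *addr)
{
	size_t i;

	assert(addr != NULL);
	for (i = 0; i < n; i++) {
		if (!devs[i].passed)
			continue;
		if (devs[i].gaddr.buid == addr->buid &&
		    devs[i].gaddr.bdn == addr->bdn)
			return &devs[i];
	}
	return NULL;
}
```

Skipping devices that are not passed through mirrors the
eeh_dev_passed() test in the patch, so a zeroed guest address on a
host-owned device cannot be matched by accident.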

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/eeh.h |  5 +++++
 arch/powerpc/kernel/eeh_pe.c   | 42 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index 3268692..8ffaf39 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -381,6 +381,11 @@ static inline void eeh_remove_device(struct pci_dev *dev) { }
 #define EEH_IO_ERROR_VALUE(size) (-1UL)
 #endif /* CONFIG_EEH */
 
+
+#ifdef CONFIG_VFIO_EEH
+struct eeh_dev *eeh_vfio_dev_get(struct eeh_vfio_pci_addr *addr);
+#endif /* CONFIG_VFIO_EEH */
+
 #ifdef CONFIG_PPC64
 /*
  * MMIO read/write operations with EEH support.
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c
index fbd01eb..d09f055 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -248,6 +248,48 @@ struct eeh_pe *eeh_pe_get(struct eeh_dev *edev)
 	return pe;
 }
 
+#ifdef CONFIG_VFIO_EEH
+static void *__eeh_vfio_dev_get(void *data, void *flag)
+{
+	struct eeh_pe *pe = (struct eeh_pe *)data;
+	struct eeh_vfio_pci_addr *addr = (struct eeh_vfio_pci_addr *)flag;
+	struct eeh_dev *edev, *tmp;
+
+	eeh_pe_for_each_dev(pe, edev, tmp) {
+		if (!eeh_dev_passed(edev))
+			continue;
+
+		/* Comparing the address in the guest */
+		if (addr->buid == edev->gaddr.buid &&
+		    addr->bdn  == edev->gaddr.bdn)
+			return edev;
+	}
+
+	return NULL;
+}
+
+/**
+ * eeh_vfio_dev_get - Search EEH device based on guest's address
+ * @addr: EEH device guest address
+ *
+ * Search for the EEH device by its guest address, which is made up
+ * of the PHB BUID and the PCI config address.
+ */
+struct eeh_dev *eeh_vfio_dev_get(struct eeh_vfio_pci_addr *addr)
+{
+	struct eeh_pe *root;
+	struct eeh_dev *edev;
+
+	list_for_each_entry(root, &eeh_phb_pe, child) {
+		edev = eeh_pe_traverse(root, __eeh_vfio_dev_get, addr);
+		if (edev)
+			return edev;
+	}
+
+	return NULL;
+}
+#endif /* CONFIG_VFIO_EEH */
+
 /**
  * eeh_pe_get_parent - Retrieve the parent PE
  * @edev: EEH device
-- 
1.8.3.2


* [PATCH 01/10] drivers/vfio: Introduce CONFIG_VFIO_EEH
From: Gavin Shan @ 2014-05-09  7:49 UTC (permalink / raw)
  To: linuxppc-dev, kvm-ppc; +Cc: aik, alex.williamson, qiudayu, Gavin Shan
In-Reply-To: <1399621782-23281-1-git-send-email-gwshan@linux.vnet.ibm.com>

This patch introduces CONFIG_VFIO_EEH to enable additional IOCTL
commands on tce_iommu_driver_ops. These commands support EEH
functionality for PCI devices that are passed through from host to guest.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 drivers/vfio/Kconfig | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index af7b204..4f3293b 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -8,11 +8,17 @@ config VFIO_IOMMU_SPAPR_TCE
 	depends on VFIO && SPAPR_TCE_IOMMU
 	default n
 
+config VFIO_EEH
+	tristate
+	depends on EEH && VFIO_IOMMU_SPAPR_TCE
+	default n
+
 menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
 	select VFIO_IOMMU_TYPE1 if X86
 	select VFIO_IOMMU_SPAPR_TCE if (PPC_POWERNV || PPC_PSERIES)
+	select VFIO_EEH if PPC_POWERNV
 	select ANON_INODES
 	help
 	  VFIO provides a framework for secure userspace device drivers.
-- 
1.8.3.2


