LinuxPPC-Dev Archive on lore.kernel.org

* Re: [PATCH 1/3] powerpc/io: Add __raw_writeq_be() __raw_rm_writeq_be()
From: Samuel Mendoza-Jonas @ 2018-05-18  6:39 UTC
  To: Michael Ellerman, linuxppc-dev; +Cc: alistair, paulus
In-Reply-To: <20180514125033.12000-1-mpe@ellerman.id.au>

On Mon, 2018-05-14 at 22:50 +1000, Michael Ellerman wrote:
> Add byte-swapping versions of __raw_writeq() and __raw_rm_writeq().
> 
> This allows us to avoid sparse warnings caused by passing __be64 to
> __raw_writeq(), which takes unsigned long:
> 
>   arch/powerpc/platforms/powernv/pci-ioda.c:1981:38:
>   warning: incorrect type in argument 1 (different base types)
>       expected unsigned long [unsigned] v
>       got restricted __be64 [usertype] <noident>
> 
> It's also generally preferable to use a byte-swapping accessor rather
> than doing it by hand in the code, which is more bug prone.
> 
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

For this and the following patches:

Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>

> ---
>  arch/powerpc/include/asm/io.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
> index af074923d598..e0331e754568 100644
> --- a/arch/powerpc/include/asm/io.h
> +++ b/arch/powerpc/include/asm/io.h
> @@ -367,6 +367,11 @@ static inline void __raw_writeq(unsigned long v, volatile void __iomem *addr)
>  	*(volatile unsigned long __force *)PCI_FIX_ADDR(addr) = v;
>  }
>  
> +static inline void __raw_writeq_be(unsigned long v, volatile void __iomem *addr)
> +{
> +	__raw_writeq((__force unsigned long)cpu_to_be64(v), addr);
> +}
> +
>  /*
>   * Real mode versions of the above. Those instructions are only supposed
>   * to be used in hypervisor real mode as per the architecture spec.
> @@ -395,6 +400,11 @@ static inline void __raw_rm_writeq(u64 val, volatile void __iomem *paddr)
>  		: : "r" (val), "r" (paddr) : "memory");
>  }
>  
> +static inline void __raw_rm_writeq_be(u64 val, volatile void __iomem *paddr)
> +{
> +	__raw_rm_writeq((__force u64)cpu_to_be64(val), paddr);
> +}
> +
>  static inline u8 __raw_rm_readb(volatile void __iomem *paddr)
>  {
>  	u8 ret;
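
As an illustration of why a byte-swapping accessor is less bug-prone than
swapping by hand, here is a minimal user-space sketch. It is not the kernel
code: sketch_store_be64() is a hypothetical stand-in that writes to ordinary
memory, whereas the patch's helpers wrap cpu_to_be64() around __raw_writeq()
on an MMIO address. The caller passes a CPU-order value and the helper alone
decides the byte order.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical user-space analogue of a big-endian 64-bit store.
 * The value is always laid out most significant byte first, so the
 * result is the same on little-endian and big-endian hosts.
 */
static void sketch_store_be64(uint64_t v, void *dst)
{
	unsigned char bytes[8];
	int i;

	for (i = 0; i < 8; i++)
		bytes[i] = (unsigned char)(v >> (56 - 8 * i));

	memcpy(dst, bytes, sizeof(bytes));
}
```

Because the swap lives in one place, call sites never carry a value whose
type says "CPU order" while its contents are big-endian, which is the
mismatch sparse was warning about.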


* [PATCH] cxl: Disable prefault_mode in Radix mode
From: Vaibhav Jain @ 2018-05-18  6:48 UTC
  To: linuxppc-dev, Frederic Barrat
  Cc: Vaibhav Jain, Andrew Donnellan, Christophe Lombard,
	Philippe Bergheaud, Alastair D'Silva, Vaibhav Jain, stable

From: Vaibhav Jain <vaibhav@linux.ibm.com>

On POWER8, the AFU attribute prefault_mode tried to improve storage
fault performance by prefaulting process segments. However, POWER9
radix mode has no storage segments, and prefaulting pages is too
fine-grained.

So update prefault_mode_store() to reject any value other than
CXL_PREFAULT_NONE when radix mode is enabled.

Cc: <stable@vger.kernel.org>
Fixes: f24be42aab37 ("cxl: Add psl9 specific code")
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 Documentation/ABI/testing/sysfs-class-cxl |  4 +++-
 drivers/misc/cxl/sysfs.c                  | 16 ++++++++++++----
 2 files changed, 15 insertions(+), 5 deletions(-)
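
The accepted values described above can be sketched as a small user-space
parser. This is an illustration under assumptions, not the driver code: the
enum and sketch_parse_prefault_mode() are hypothetical names, and the radix
flag is passed in as a parameter where the driver calls radix_enabled().

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical mirror of the driver's prefault modes. */
enum sketch_prefault_mode {
	SKETCH_PREFAULT_INVALID = -1,
	SKETCH_PREFAULT_NONE,
	SKETCH_PREFAULT_WED,
	SKETCH_PREFAULT_ALL,
};

/*
 * "none" is always accepted. The other two modes are accepted only
 * when radix mode is off (hashed page table MMU).
 */
static enum sketch_prefault_mode
sketch_parse_prefault_mode(const char *buf, bool radix)
{
	if (!strncmp(buf, "none", 4))
		return SKETCH_PREFAULT_NONE;

	if (radix)
		return SKETCH_PREFAULT_INVALID;

	if (!strncmp(buf, "work_element_descriptor", 23))
		return SKETCH_PREFAULT_WED;
	if (!strncmp(buf, "all", 3))
		return SKETCH_PREFAULT_ALL;

	return SKETCH_PREFAULT_INVALID;
}
```

In the patch itself an invalid result maps to -EINVAL, and the radix case
additionally logs an error through dev_err().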

diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl
index 640f65e79ef1..267920a1874b 100644
--- a/Documentation/ABI/testing/sysfs-class-cxl
+++ b/Documentation/ABI/testing/sysfs-class-cxl
@@ -69,7 +69,9 @@ Date:           September 2014
 Contact:        linuxppc-dev@lists.ozlabs.org
 Description:    read/write
                 Set the mode for prefaulting in segments into the segment table
-                when performing the START_WORK ioctl. Possible values:
+                when performing the START_WORK ioctl. Only applicable when
+                running under hashed page table mmu.
+                Possible values:
                         none: No prefaulting (default)
                         work_element_descriptor: Treat the work element
                                  descriptor as an effective address and
diff --git a/drivers/misc/cxl/sysfs.c b/drivers/misc/cxl/sysfs.c
index 4b5a4c5d3c01..629e2e156412 100644
--- a/drivers/misc/cxl/sysfs.c
+++ b/drivers/misc/cxl/sysfs.c
@@ -353,12 +353,20 @@ static ssize_t prefault_mode_store(struct device *device,
 	struct cxl_afu *afu = to_cxl_afu(device);
 	enum prefault_modes mode = -1;
 
-	if (!strncmp(buf, "work_element_descriptor", 23))
-		mode = CXL_PREFAULT_WED;
-	if (!strncmp(buf, "all", 3))
-		mode = CXL_PREFAULT_ALL;
 	if (!strncmp(buf, "none", 4))
 		mode = CXL_PREFAULT_NONE;
+	else {
+		if (!radix_enabled()) {
+
+			/* only allowed when not in radix mode */
+			if (!strncmp(buf, "work_element_descriptor", 23))
+				mode = CXL_PREFAULT_WED;
+			if (!strncmp(buf, "all", 3))
+				mode = CXL_PREFAULT_ALL;
+		} else {
+			dev_err(device, "Cannot prefault with radix enabled\n");
+		}
+	}
 
 	if (mode == -1)
 		return -EINVAL;
-- 
2.17.0


* [PATCH] powerpc/perf: Remove sched_task function defined for thread-imc
From: Anju T Sudhakar @ 2018-05-18  7:35 UTC
  To: mpe; +Cc: maddy, linuxppc-dev, anju

Call trace observed while running perf-fuzzer:

[  329.228068] CPU: 43 PID: 9088 Comm: perf_fuzzer Not tainted 4.13.0-32-generic #35~lp1746225
[  329.228070] task: c000003f776ac900 task.stack: c000003f77728000
[  329.228071] NIP: c000000000299b70 LR: c0000000002a4534 CTR: c00000000029bb80
[  329.228073] REGS: c000003f7772b760 TRAP: 0700   Not tainted  (4.13.0-32-generic)
[  329.228073] MSR: 900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
[  329.228079]   CR: 24008822  XER: 00000000
[  329.228080] CFAR: c000000000299a70 SOFTE: 0
GPR00: c0000000002a4534 c000003f7772b9e0 c000000001606200 c000003fef858908
GPR04: c000003f776ac900 0000000000000001 ffffffffffffffff 0000003fee730000
GPR08: 0000000000000000 0000000000000000 c0000000011220d8 0000000000000002
GPR12: c00000000029bb80 c000000007a3d900 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 c000003f776ad090 c000000000c71354
GPR24: c000003fef716780 0000003fee730000 c000003fe69d4200 c000003f776ad330
GPR28: c0000000011220d8 0000000000000001 c0000000014c6108 c000003fef858900
[  329.228098] NIP [c000000000299b70] perf_pmu_sched_task+0x170/0x180
[  329.228100] LR [c0000000002a4534] __perf_event_task_sched_in+0xc4/0x230
[  329.228101] Call Trace:
[  329.228102] [c000003f7772b9e0] [c0000000002a0678] perf_iterate_sb+0x158/0x2a0 (unreliable)
[  329.228105] [c000003f7772ba30] [c0000000002a4534] __perf_event_task_sched_in+0xc4/0x230
[  329.228107] [c000003f7772bab0] [c0000000001396dc] finish_task_switch+0x21c/0x310
[  329.228109] [c000003f7772bb60] [c000000000c71354] __schedule+0x304/0xb80
[  329.228111] [c000003f7772bc40] [c000000000c71c10] schedule+0x40/0xc0
[  329.228113] [c000003f7772bc60] [c0000000001033f4] do_wait+0x254/0x2e0
[  329.228115] [c000003f7772bcd0] [c000000000104ac0] kernel_wait4+0xa0/0x1a0
[  329.228117] [c000003f7772bd70] [c000000000104c24] SyS_wait4+0x64/0xc0
[  329.228121] [c000003f7772be30] [c00000000000b184] system_call+0x58/0x6c
[  329.228121] Instruction dump:
[  329.228123] 3beafea0 7faa4800 409eff18 e8010060 eb610028 ebc10040 7c0803a6 38210050
[  329.228127] eb81ffe0 eba1ffe8 ebe1fff8 4e800020 <0fe00000> 4bffffbc 60000000 60420000
[  329.228131] ---[ end trace 8c46856d314c1811 ]---
[  375.755943] hrtimer: interrupt took 31601 ns


The context-switch callbacks for thread-imc are defined in the sched_task
function. When thread-imc events are grouped with software PMU events,
perf_pmu_sched_task() hits the WARN_ON_ONCE condition, because software PMUs
are assumed not to define sched_task.

Move the thread-imc enable/disable OPAL calls from sched_task to the
event_[add/del] functions.

Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
---
 arch/powerpc/perf/imc-pmu.c | 108 +++++++++++++++++++++-----------------------
 1 file changed, 51 insertions(+), 57 deletions(-)
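
The reference counting that moves into event add/del can be sketched in
user space as follows. This is a simplified illustration, not the kernel
code: struct sketch_ref and both helpers are hypothetical, the starts/stops
fields stand in for the opal_imc_counters_start()/stop() calls, and the
mutex that the kernel holds around these updates is left out.

```c
/*
 * Hypothetical per-core reference count. The counters are started by
 * the first user and stopped by the last one.
 */
struct sketch_ref {
	int refc;   /* number of active users */
	int starts; /* how many times the "start" hook ran */
	int stops;  /* how many times the "stop" hook ran */
};

/* Called when an event is added: start the counters on first use. */
static void sketch_ref_get(struct sketch_ref *ref)
{
	if (ref->refc == 0)
		ref->starts++; /* stand-in for the OPAL start call */
	ref->refc++;
}

/* Called when an event is deleted: stop the counters on last use. */
static void sketch_ref_put(struct sketch_ref *ref)
{
	if (ref->refc == 0)
		return; /* unbalanced put, nothing to release */

	if (--ref->refc == 0)
		ref->stops++; /* stand-in for the OPAL stop call */
}
```

Unlike the patch, which clamps a negative count back to zero after the
decrement, this sketch refuses the unbalanced decrement up front; that is a
simplification chosen for the example.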

diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index d7532e7..71d9ba7 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -866,59 +866,6 @@ static int thread_imc_cpu_init(void)
 			  ppc_thread_imc_cpu_offline);
 }
 
-void thread_imc_pmu_sched_task(struct perf_event_context *ctx,
-				      bool sched_in)
-{
-	int core_id;
-	struct imc_pmu_ref *ref;
-
-	if (!is_core_imc_mem_inited(smp_processor_id()))
-		return;
-
-	core_id = smp_processor_id() / threads_per_core;
-	/*
-	 * imc pmus are enabled only when it is used.
-	 * See if this is triggered for the first time.
-	 * If yes, take the mutex lock and enable the counters.
-	 * If not, just increment the count in ref count struct.
-	 */
-	ref = &core_imc_refc[core_id];
-	if (!ref)
-		return;
-
-	if (sched_in) {
-		mutex_lock(&ref->lock);
-		if (ref->refc == 0) {
-			if (opal_imc_counters_start(OPAL_IMC_COUNTERS_CORE,
-			     get_hard_smp_processor_id(smp_processor_id()))) {
-				mutex_unlock(&ref->lock);
-				pr_err("thread-imc: Unable to start the counter\
-							for core %d\n", core_id);
-				return;
-			}
-		}
-		++ref->refc;
-		mutex_unlock(&ref->lock);
-	} else {
-		mutex_lock(&ref->lock);
-		ref->refc--;
-		if (ref->refc == 0) {
-			if (opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
-			    get_hard_smp_processor_id(smp_processor_id()))) {
-				mutex_unlock(&ref->lock);
-				pr_err("thread-imc: Unable to stop the counters\
-							for core %d\n", core_id);
-				return;
-			}
-		} else if (ref->refc < 0) {
-			ref->refc = 0;
-		}
-		mutex_unlock(&ref->lock);
-	}
-
-	return;
-}
-
 static int thread_imc_event_init(struct perf_event *event)
 {
 	u32 config = event->attr.config;
@@ -1045,22 +992,70 @@ static int imc_event_add(struct perf_event *event, int flags)
 
 static int thread_imc_event_add(struct perf_event *event, int flags)
 {
+	int core_id;
+	struct imc_pmu_ref *ref;
+
 	if (flags & PERF_EF_START)
 		imc_event_start(event, flags);
 
-	/* Enable the sched_task to start the engine */
-	perf_sched_cb_inc(event->ctx->pmu);
+	if (!is_core_imc_mem_inited(smp_processor_id()))
+		return -EINVAL;
+
+	core_id = smp_processor_id() / threads_per_core;
+	/*
+	 * imc pmus are enabled only when it is used.
+	 * See if this is triggered for the first time.
+	 * If yes, take the mutex lock and enable the counters.
+	 * If not, just increment the count in ref count struct.
+	 */
+	ref = &core_imc_refc[core_id];
+	if (!ref)
+		return -EINVAL;
+
+	mutex_lock(&ref->lock);
+	if (ref->refc == 0) {
+		if (opal_imc_counters_start(OPAL_IMC_COUNTERS_CORE,
+		    get_hard_smp_processor_id(smp_processor_id()))) {
+			mutex_unlock(&ref->lock);
+			pr_err("thread-imc: Unable to start the counter\
+				for core %d\n", core_id);
+			return -EINVAL;
+		}
+	}
+	++ref->refc;
+	mutex_unlock(&ref->lock);
 	return 0;
 }
 
 static void thread_imc_event_del(struct perf_event *event, int flags)
 {
+
+	int core_id;
+	struct imc_pmu_ref *ref;
+
 	/*
 	 * Take a snapshot and calculate the delta and update
 	 * the event counter values.
 	 */
 	imc_event_update(event);
-	perf_sched_cb_dec(event->ctx->pmu);
+
+	core_id = smp_processor_id() / threads_per_core;
+	ref = &core_imc_refc[core_id];
+
+	mutex_lock(&ref->lock);
+	ref->refc--;
+	if (ref->refc == 0) {
+		if (opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
+		    get_hard_smp_processor_id(smp_processor_id()))) {
+			mutex_unlock(&ref->lock);
+			pr_err("thread-imc: Unable to stop the counters\
+				for core %d\n", core_id);
+			return;
+		}
+	} else if (ref->refc < 0) {
+		ref->refc = 0;
+	}
+	mutex_unlock(&ref->lock);
 }
 
 /* update_pmu_ops : Populate the appropriate operations for "pmu" */
@@ -1086,7 +1081,6 @@ static int update_pmu_ops(struct imc_pmu *pmu)
 		break;
 	case IMC_DOMAIN_THREAD:
 		pmu->pmu.event_init = thread_imc_event_init;
-		pmu->pmu.sched_task = thread_imc_pmu_sched_task;
 		pmu->pmu.add = thread_imc_event_add;
 		pmu->pmu.del = thread_imc_event_del;
 		pmu->pmu.start_txn = thread_imc_pmu_start_txn;
-- 
2.7.4


* [PATCH 4.9 27/33] futex: Remove duplicated code and fix undefined behaviour
From: Greg Kroah-Hartman @ 2018-05-18  8:16 UTC
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Jiri Slaby, Thomas Gleixner,
	Russell King, Darren Hart (VMware), linux-mips, Rich Felker,
	linux-ia64, linux-sh, peterz, Benjamin Herrenschmidt,
	Max Filippov, Paul Mackerras, sparclinux, Jonas Bonn, linux-s390,
	linux-arch, Yoshinori Sato, linux-hexagon, Helge Deller,
	James E.J. Bottomley, Catalin Marinas, Matt Turner,
	linux-snps-arc, Fenghua Yu, Arnd Bergmann, linux-xtensa,
	Stefan Kristiansson, openrisc, Ivan Kokshaysky, Stafford Horne,
	linux-arm-kernel, Richard Henderson, Chris Zankel, Michal Simek,
	Tony Luck, linux-parisc, Vineet Gupta, Ralf Baechle, Richard Kuo,
	linux-alpha, Martin Schwidefsky, linuxppc-dev, David S. Miller,
	Ben Hutchings
In-Reply-To: <20180518081535.096308218@linuxfoundation.org>

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Jiri Slaby <jslaby@suse.cz>

commit 30d6e0a4190d37740e9447e4e4815f06992dd8c3 upstream.

Code is duplicated across all architectures' headers for
futex_atomic_op_inuser: namely op decoding, the access_ok check for
uaddr, and the comparison of the result.

Remove this duplication and leave to the arches only the needed
assembly, which is now in arch_futex_atomic_op_inuser.

This effectively distributes Will Deacon's arm64 fix for undefined
behaviour reported by UBSAN to all architectures. The fix was done in
commit 5f16a046f8e1 (arm64: futex: Fix undefined behaviour with
FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump.

And as suggested by Thomas, check for a negative oparg too, because it
was also reported to trigger an undefined behaviour report.

Note that s390 removed the access_ok check in d12a29703 ("s390/uaccess:
remove pointless access_ok() checks") because access_ok always returns
true there. We reintroduce it in the helper for the sake of simplicity
(it gets optimized away anyway).

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> [s390]
Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
Reviewed-by: Darren Hart (VMware) <dvhart@infradead.org>
Reviewed-by: Will Deacon <will.deacon@arm.com> [core/arm64]
Cc: linux-mips@linux-mips.org
Cc: Rich Felker <dalias@libc.org>
Cc: linux-ia64@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: peterz@infradead.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: sparclinux@vger.kernel.org
Cc: Jonas Bonn <jonas@southpole.se>
Cc: linux-s390@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: linux-hexagon@vger.kernel.org
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: linux-snps-arc@lists.infradead.org
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-xtensa@linux-xtensa.org
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: openrisc@lists.librecores.org
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Stafford Horne <shorne@gmail.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Richard Henderson <rth@twiddle.net>
Cc: Chris Zankel <chris@zankel.net>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-parisc@vger.kernel.org
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: linux-alpha@vger.kernel.org
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: "David S. Miller" <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20170824073105.3901-1-jslaby@suse.cz
Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 arch/alpha/include/asm/futex.h      |   26 +++---------------
 arch/arc/include/asm/futex.h        |   40 +++-------------------------
 arch/arm/include/asm/futex.h        |   26 ++----------------
 arch/arm64/include/asm/futex.h      |   27 ++-----------------
 arch/frv/include/asm/futex.h        |    3 +-
 arch/frv/kernel/futex.c             |   27 ++-----------------
 arch/hexagon/include/asm/futex.h    |   38 ++-------------------------
 arch/ia64/include/asm/futex.h       |   25 ++----------------
 arch/microblaze/include/asm/futex.h |   38 ++-------------------------
 arch/mips/include/asm/futex.h       |   25 ++----------------
 arch/parisc/include/asm/futex.h     |   26 ++----------------
 arch/powerpc/include/asm/futex.h    |   26 +++---------------
 arch/s390/include/asm/futex.h       |   23 +++-------------
 arch/sh/include/asm/futex.h         |   26 ++----------------
 arch/sparc/include/asm/futex_64.h   |   26 +++---------------
 arch/tile/include/asm/futex.h       |   40 +++-------------------------
 arch/x86/include/asm/futex.h        |   40 +++-------------------------
 arch/xtensa/include/asm/futex.h     |   27 +++----------------
 include/asm-generic/futex.h         |   50 ++++++------------------------------
 kernel/futex.c                      |   39 ++++++++++++++++++++++++++++
 20 files changed, 126 insertions(+), 472 deletions(-)
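
The decoding and comparison that the per-arch copies used to repeat can be
sketched in user space as below. This is an illustration under assumptions,
not the helper added to kernel/futex.c: every sketch_* name is hypothetical,
and the constants restate the futex op encoding (shift flag 8 in the op
field, comparison codes 0 to 5). The decoder sign-extends the two 12-bit
arguments without shifting a signed value and rejects a shift count outside
0..31.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_OP_OPARG_SHIFT 8u /* flag bit inside the 4-bit op field */

struct sketch_futex_op {
	unsigned int op;  /* operation, 0..7 */
	unsigned int cmp; /* comparison, 0..15 */
	int oparg;
	int cmparg;
};

/* Sign-extend the low 12 bits of v using unsigned arithmetic only. */
static int sketch_sext12(uint32_t v)
{
	v &= 0xfffu;
	return (v & 0x800u) ? (int)v - 0x1000 : (int)v;
}

/* Split encoded_op into its fields; returns 0 or -EINVAL. */
static int sketch_decode(uint32_t encoded_op, struct sketch_futex_op *out)
{
	out->op = (encoded_op >> 28) & 7u;
	out->cmp = (encoded_op >> 24) & 15u;
	out->oparg = sketch_sext12(encoded_op >> 12);
	out->cmparg = sketch_sext12(encoded_op);

	if (encoded_op & (SKETCH_OP_OPARG_SHIFT << 28)) {
		/* A negative or too-large shift count would be undefined. */
		if (out->oparg < 0 || out->oparg > 31)
			return -EINVAL;
		/* Shift as unsigned; the cast for bit 31 is compiler-defined. */
		out->oparg = (int)(1u << out->oparg);
	}
	return 0;
}

/* Compare the old futex value against cmparg; 0/1 or -ENOSYS. */
static int sketch_compare(unsigned int cmp, int oldval, int cmparg)
{
	switch (cmp) {
	case 0: return oldval == cmparg; /* EQ */
	case 1: return oldval != cmparg; /* NE */
	case 2: return oldval < cmparg;  /* LT */
	case 3: return oldval <= cmparg; /* LE */
	case 4: return oldval > cmparg;  /* GT */
	case 5: return oldval >= cmparg; /* GE */
	default: return -ENOSYS;
	}
}
```

With decoding and comparison in one place, each architecture only has to
supply the atomic read-modify-write and hand back the old value through
*oval, which is what the hunks below reduce every arch header to.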

--- a/arch/alpha/include/asm/futex.h
+++ b/arch/alpha/include/asm/futex.h
@@ -29,18 +29,10 @@
 	:	"r" (uaddr), "r"(oparg)				\
 	:	"memory")
 
-static inline int futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
+static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
+		u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret;
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
 
 	pagefault_disable();
 
@@ -66,17 +58,9 @@ static inline int futex_atomic_op_inuser
 
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
-		default: ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/arc/include/asm/futex.h
+++ b/arch/arc/include/asm/futex.h
@@ -73,20 +73,11 @@
 
 #endif
 
-static inline int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
+static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
+		u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret;
 
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int)))
-		return -EFAULT;
-
 #ifndef CONFIG_ARC_HAS_LLSC
 	preempt_disable();	/* to guarantee atomic r-m-w of futex op */
 #endif
@@ -118,30 +109,9 @@ static inline int futex_atomic_op_inuser
 	preempt_enable();
 #endif
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ:
-			ret = (oldval == cmparg);
-			break;
-		case FUTEX_OP_CMP_NE:
-			ret = (oldval != cmparg);
-			break;
-		case FUTEX_OP_CMP_LT:
-			ret = (oldval < cmparg);
-			break;
-		case FUTEX_OP_CMP_GE:
-			ret = (oldval >= cmparg);
-			break;
-		case FUTEX_OP_CMP_LE:
-			ret = (oldval <= cmparg);
-			break;
-		case FUTEX_OP_CMP_GT:
-			ret = (oldval > cmparg);
-			break;
-		default:
-			ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/arm/include/asm/futex.h
+++ b/arch/arm/include/asm/futex.h
@@ -128,20 +128,10 @@ futex_atomic_cmpxchg_inatomic(u32 *uval,
 #endif /* !SMP */
 
 static inline int
-futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
+arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret, tmp;
 
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
-
 #ifndef CONFIG_SMP
 	preempt_disable();
 #endif
@@ -172,17 +162,9 @@ futex_atomic_op_inuser (int encoded_op,
 	preempt_enable();
 #endif
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
-		default: ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/arm64/include/asm/futex.h
+++ b/arch/arm64/include/asm/futex.h
@@ -51,20 +51,9 @@
 	: "memory")
 
 static inline int
-futex_atomic_op_inuser(unsigned int encoded_op, u32 __user *_uaddr)
+arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (int)(encoded_op << 8) >> 20;
-	int cmparg = (int)(encoded_op << 20) >> 20;
 	int oldval = 0, ret, tmp;
-	u32 __user *uaddr = __uaccess_mask_ptr(_uaddr);
-
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1U << (oparg & 0x1f);
-
-	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
 
 	pagefault_disable();
 
@@ -95,17 +84,9 @@ futex_atomic_op_inuser(unsigned int enco
 
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
-		default: ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/frv/include/asm/futex.h
+++ b/arch/frv/include/asm/futex.h
@@ -7,7 +7,8 @@
 #include <asm/errno.h>
 #include <asm/uaccess.h>
 
-extern int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr);
+extern int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
+		u32 __user *uaddr);
 
 static inline int
 futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
--- a/arch/frv/kernel/futex.c
+++ b/arch/frv/kernel/futex.c
@@ -186,20 +186,10 @@ static inline int atomic_futex_op_xchg_x
 /*
  * do the futex operations
  */
-int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
+int arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret;
 
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
-
 	pagefault_disable();
 
 	switch (op) {
@@ -225,18 +215,9 @@ int futex_atomic_op_inuser(int encoded_o
 
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
-		default: ret = -ENOSYS; break;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
 
 	return ret;
 
-} /* end futex_atomic_op_inuser() */
+} /* end arch_futex_atomic_op_inuser() */
--- a/arch/hexagon/include/asm/futex.h
+++ b/arch/hexagon/include/asm/futex.h
@@ -31,18 +31,9 @@
 
 
 static inline int
-futex_atomic_op_inuser(int encoded_op, int __user *uaddr)
+arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret;
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(int)))
-		return -EFAULT;
 
 	pagefault_disable();
 
@@ -72,30 +63,9 @@ futex_atomic_op_inuser(int encoded_op, i
 
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ:
-			ret = (oldval == cmparg);
-			break;
-		case FUTEX_OP_CMP_NE:
-			ret = (oldval != cmparg);
-			break;
-		case FUTEX_OP_CMP_LT:
-			ret = (oldval < cmparg);
-			break;
-		case FUTEX_OP_CMP_GE:
-			ret = (oldval >= cmparg);
-			break;
-		case FUTEX_OP_CMP_LE:
-			ret = (oldval <= cmparg);
-			break;
-		case FUTEX_OP_CMP_GT:
-			ret = (oldval > cmparg);
-			break;
-		default:
-			ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/ia64/include/asm/futex.h
+++ b/arch/ia64/include/asm/futex.h
@@ -45,18 +45,9 @@ do {									\
 } while (0)
 
 static inline int
-futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
+arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret;
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (! access_ok (VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
 
 	pagefault_disable();
 
@@ -84,17 +75,9 @@ futex_atomic_op_inuser (int encoded_op,
 
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
-		default: ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/microblaze/include/asm/futex.h
+++ b/arch/microblaze/include/asm/futex.h
@@ -29,18 +29,9 @@
 })
 
 static inline int
-futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
+arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret;
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
 
 	pagefault_disable();
 
@@ -66,30 +57,9 @@ futex_atomic_op_inuser(int encoded_op, u
 
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ:
-			ret = (oldval == cmparg);
-			break;
-		case FUTEX_OP_CMP_NE:
-			ret = (oldval != cmparg);
-			break;
-		case FUTEX_OP_CMP_LT:
-			ret = (oldval < cmparg);
-			break;
-		case FUTEX_OP_CMP_GE:
-			ret = (oldval >= cmparg);
-			break;
-		case FUTEX_OP_CMP_LE:
-			ret = (oldval <= cmparg);
-			break;
-		case FUTEX_OP_CMP_GT:
-			ret = (oldval > cmparg);
-			break;
-		default:
-			ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/mips/include/asm/futex.h
+++ b/arch/mips/include/asm/futex.h
@@ -83,18 +83,9 @@
 }
 
 static inline int
-futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
+arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret;
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (! access_ok (VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
 
 	pagefault_disable();
 
@@ -125,17 +116,9 @@ futex_atomic_op_inuser(int encoded_op, u
 
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
-		default: ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/parisc/include/asm/futex.h
+++ b/arch/parisc/include/asm/futex.h
@@ -32,22 +32,12 @@ _futex_spin_unlock_irqrestore(u32 __user
 }
 
 static inline int
-futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
+arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *uaddr)
 {
 	unsigned long int flags;
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval, ret;
 	u32 tmp;
 
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(*uaddr)))
-		return -EFAULT;
-
 	_futex_spin_lock_irqsave(uaddr, &flags);
 	pagefault_disable();
 
@@ -85,17 +75,9 @@ out_pagefault_enable:
 	pagefault_enable();
 	_futex_spin_unlock_irqrestore(uaddr, &flags);
 
-	if (ret == 0) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
-		default: ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/powerpc/include/asm/futex.h
+++ b/arch/powerpc/include/asm/futex.h
@@ -31,18 +31,10 @@
 	: "b" (uaddr), "i" (-EFAULT), "r" (oparg) \
 	: "cr0", "memory")
 
-static inline int futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
+static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
+		u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret;
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (! access_ok (VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
 
 	pagefault_disable();
 
@@ -68,17 +60,9 @@ static inline int futex_atomic_op_inuser
 
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
-		default: ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/s390/include/asm/futex.h
+++ b/arch/s390/include/asm/futex.h
@@ -21,17 +21,12 @@
 		: "0" (-EFAULT), "d" (oparg), "a" (uaddr),		\
 		  "m" (*uaddr) : "cc");
 
-static inline int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
+static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
+		u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, newval, ret;
 
 	load_kernel_asce();
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
 
 	pagefault_disable();
 	switch (op) {
@@ -60,17 +55,9 @@ static inline int futex_atomic_op_inuser
 	}
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
-		default: ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/sh/include/asm/futex.h
+++ b/arch/sh/include/asm/futex.h
@@ -27,21 +27,12 @@ futex_atomic_cmpxchg_inatomic(u32 *uval,
 	return atomic_futex_op_cmpxchg_inatomic(uval, uaddr, oldval, newval);
 }
 
-static inline int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
+static inline int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval,
+		u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	u32 oparg = (encoded_op << 8) >> 20;
-	u32 cmparg = (encoded_op << 20) >> 20;
 	u32 oldval, newval, prev;
 	int ret;
 
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
-
 	pagefault_disable();
 
 	do {
@@ -80,17 +71,8 @@ static inline int futex_atomic_op_inuser
 
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = ((int)oldval < (int)cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = ((int)oldval >= (int)cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = ((int)oldval <= (int)cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = ((int)oldval > (int)cmparg); break;
-		default: ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
 
 	return ret;
 }
--- a/arch/sparc/include/asm/futex_64.h
+++ b/arch/sparc/include/asm/futex_64.h
@@ -29,22 +29,14 @@
 	: "r" (uaddr), "r" (oparg), "i" (-EFAULT)	\
 	: "memory")
 
-static inline int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
+static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
+		u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret, tem;
 
-	if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
-		return -EFAULT;
 	if (unlikely((((unsigned long) uaddr) & 0x3UL)))
 		return -EINVAL;
 
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
 	pagefault_disable();
 
 	switch (op) {
@@ -69,17 +61,9 @@ static inline int futex_atomic_op_inuser
 
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
-		default: ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/tile/include/asm/futex.h
+++ b/arch/tile/include/asm/futex.h
@@ -106,12 +106,9 @@
 	lock = __atomic_hashed_lock((int __force *)uaddr)
 #endif
 
-static inline int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
+static inline int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval,
+		u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int uninitialized_var(val), ret;
 
 	__futex_prolog();
@@ -119,12 +116,6 @@ static inline int futex_atomic_op_inuser
 	/* The 32-bit futex code makes this assumption, so validate it here. */
 	BUILD_BUG_ON(sizeof(atomic_t) != sizeof(int));
 
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
-
 	pagefault_disable();
 	switch (op) {
 	case FUTEX_OP_SET:
@@ -148,30 +139,9 @@ static inline int futex_atomic_op_inuser
 	}
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ:
-			ret = (val == cmparg);
-			break;
-		case FUTEX_OP_CMP_NE:
-			ret = (val != cmparg);
-			break;
-		case FUTEX_OP_CMP_LT:
-			ret = (val < cmparg);
-			break;
-		case FUTEX_OP_CMP_GE:
-			ret = (val >= cmparg);
-			break;
-		case FUTEX_OP_CMP_LE:
-			ret = (val <= cmparg);
-			break;
-		case FUTEX_OP_CMP_GT:
-			ret = (val > cmparg);
-			break;
-		default:
-			ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = val;
+
 	return ret;
 }
 
--- a/arch/x86/include/asm/futex.h
+++ b/arch/x86/include/asm/futex.h
@@ -41,20 +41,11 @@
 		       "+m" (*uaddr), "=&r" (tem)		\
 		     : "r" (oparg), "i" (-EFAULT), "1" (0))
 
-static inline int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
+static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
+		u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret, tem;
 
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
-
 	pagefault_disable();
 
 	switch (op) {
@@ -80,30 +71,9 @@ static inline int futex_atomic_op_inuser
 
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ:
-			ret = (oldval == cmparg);
-			break;
-		case FUTEX_OP_CMP_NE:
-			ret = (oldval != cmparg);
-			break;
-		case FUTEX_OP_CMP_LT:
-			ret = (oldval < cmparg);
-			break;
-		case FUTEX_OP_CMP_GE:
-			ret = (oldval >= cmparg);
-			break;
-		case FUTEX_OP_CMP_LE:
-			ret = (oldval <= cmparg);
-			break;
-		case FUTEX_OP_CMP_GT:
-			ret = (oldval > cmparg);
-			break;
-		default:
-			ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/arch/xtensa/include/asm/futex.h
+++ b/arch/xtensa/include/asm/futex.h
@@ -44,18 +44,10 @@
 	: "r" (uaddr), "I" (-EFAULT), "r" (oparg)	\
 	: "memory")
 
-static inline int futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
+static inline int arch_futex_atomic_op_inuser(int op, int oparg, int *oval,
+		u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret;
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
 
 #if !XCHAL_HAVE_S32C1I
 	return -ENOSYS;
@@ -89,19 +81,10 @@ static inline int futex_atomic_op_inuser
 
 	pagefault_enable();
 
-	if (ret)
-		return ret;
-
-	switch (cmp) {
-	case FUTEX_OP_CMP_EQ: return (oldval == cmparg);
-	case FUTEX_OP_CMP_NE: return (oldval != cmparg);
-	case FUTEX_OP_CMP_LT: return (oldval < cmparg);
-	case FUTEX_OP_CMP_GE: return (oldval >= cmparg);
-	case FUTEX_OP_CMP_LE: return (oldval <= cmparg);
-	case FUTEX_OP_CMP_GT: return (oldval > cmparg);
-	}
+	if (!ret)
+		*oval = oldval;
 
-	return -ENOSYS;
+	return ret;
 }
 
 static inline int
--- a/include/asm-generic/futex.h
+++ b/include/asm-generic/futex.h
@@ -13,7 +13,7 @@
  */
 
 /**
- * futex_atomic_op_inuser() - Atomic arithmetic operation with constant
+ * arch_futex_atomic_op_inuser() - Atomic arithmetic operation with constant
  *			  argument and comparison of the previous
  *			  futex value with another constant.
  *
@@ -25,18 +25,11 @@
  * <0 - On error
  */
 static inline int
-futex_atomic_op_inuser(int encoded_op, u32 __user *uaddr)
+arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval, ret;
 	u32 tmp;
 
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
 	preempt_disable();
 	pagefault_disable();
 
@@ -74,17 +67,9 @@ out_pagefault_enable:
 	pagefault_enable();
 	preempt_enable();
 
-	if (ret == 0) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
-		default: ret = -ENOSYS;
-		}
-	}
+	if (ret == 0)
+		*oval = oldval;
+
 	return ret;
 }
 
@@ -126,18 +111,9 @@ futex_atomic_cmpxchg_inatomic(u32 *uval,
 
 #else
 static inline int
-futex_atomic_op_inuser (int encoded_op, u32 __user *uaddr)
+arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr)
 {
-	int op = (encoded_op >> 28) & 7;
-	int cmp = (encoded_op >> 24) & 15;
-	int oparg = (encoded_op << 8) >> 20;
-	int cmparg = (encoded_op << 20) >> 20;
 	int oldval = 0, ret;
-	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28))
-		oparg = 1 << oparg;
-
-	if (! access_ok (VERIFY_WRITE, uaddr, sizeof(u32)))
-		return -EFAULT;
 
 	pagefault_disable();
 
@@ -153,17 +129,9 @@ futex_atomic_op_inuser (int encoded_op,
 
 	pagefault_enable();
 
-	if (!ret) {
-		switch (cmp) {
-		case FUTEX_OP_CMP_EQ: ret = (oldval == cmparg); break;
-		case FUTEX_OP_CMP_NE: ret = (oldval != cmparg); break;
-		case FUTEX_OP_CMP_LT: ret = (oldval < cmparg); break;
-		case FUTEX_OP_CMP_GE: ret = (oldval >= cmparg); break;
-		case FUTEX_OP_CMP_LE: ret = (oldval <= cmparg); break;
-		case FUTEX_OP_CMP_GT: ret = (oldval > cmparg); break;
-		default: ret = -ENOSYS;
-		}
-	}
+	if (!ret)
+		*oval = oldval;
+
 	return ret;
 }
 
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1458,6 +1458,45 @@ out:
 	return ret;
 }
 
+static int futex_atomic_op_inuser(unsigned int encoded_op, u32 __user *uaddr)
+{
+	unsigned int op =	  (encoded_op & 0x70000000) >> 28;
+	unsigned int cmp =	  (encoded_op & 0x0f000000) >> 24;
+	int oparg = sign_extend32((encoded_op & 0x00fff000) >> 12, 12);
+	int cmparg = sign_extend32(encoded_op & 0x00000fff, 12);
+	int oldval, ret;
+
+	if (encoded_op & (FUTEX_OP_OPARG_SHIFT << 28)) {
+		if (oparg < 0 || oparg > 31)
+			return -EINVAL;
+		oparg = 1 << oparg;
+	}
+
+	if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
+		return -EFAULT;
+
+	ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr);
+	if (ret)
+		return ret;
+
+	switch (cmp) {
+	case FUTEX_OP_CMP_EQ:
+		return oldval == cmparg;
+	case FUTEX_OP_CMP_NE:
+		return oldval != cmparg;
+	case FUTEX_OP_CMP_LT:
+		return oldval < cmparg;
+	case FUTEX_OP_CMP_GE:
+		return oldval >= cmparg;
+	case FUTEX_OP_CMP_LE:
+		return oldval <= cmparg;
+	case FUTEX_OP_CMP_GT:
+		return oldval > cmparg;
+	default:
+		return -ENOSYS;
+	}
+}
+
 /*
  * Wake up all waiters hashed on the physical page that is mapped
  * to this virtual address:

^ permalink raw reply

* Re: [PATCH 4.9 27/33] futex: Remove duplicated code and fix undefined behaviour
From: Jiri Slaby @ 2018-05-18  8:30 UTC (permalink / raw)
  To: Greg Kroah-Hartman, linux-kernel
  Cc: stable, Thomas Gleixner, Russell King, Darren Hart (VMware),
	linux-mips, Rich Felker, linux-ia64, linux-sh, peterz,
	Benjamin Herrenschmidt, Max Filippov, Paul Mackerras, sparclinux,
	Jonas Bonn, linux-s390, linux-arch, Yoshinori Sato, linux-hexagon,
	Helge Deller, James E.J. Bottomley, Catalin Marinas, Matt Turner,
	linux-snps-arc, Fenghua Yu, Arnd Bergmann, linux-xtensa,
	Stefan Kristiansson, openrisc, Ivan Kokshaysky, Stafford Horne,
	linux-arm-kernel, Richard Henderson, Chris Zankel, Michal Simek,
	Tony Luck, linux-parisc, Vineet Gupta, Ralf Baechle, Richard Kuo,
	linux-alpha, Martin Schwidefsky, linuxppc-dev, David S. Miller,
	Ben Hutchings
In-Reply-To: <20180518081536.166573281@linuxfoundation.org>

On 05/18/2018, 10:16 AM, Greg Kroah-Hartman wrote:
> 4.9-stable review patch.  If anyone has any objections, please let me know.
> 
> ------------------
> 
> From: Jiri Slaby <jslaby@suse.cz>
> 
> commit 30d6e0a4190d37740e9447e4e4815f06992dd8c3 upstream.
...
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -1458,6 +1458,45 @@ out:
>  	return ret;
>  }
>  
> +static int futex_atomic_op_inuser(unsigned int encoded_op, u32 __user *uaddr)
> +{
> +	unsigned int op =	  (encoded_op & 0x70000000) >> 28;
> +	unsigned int cmp =	  (encoded_op & 0x0f000000) >> 24;
> +	int oparg = sign_extend32((encoded_op & 0x00fff000) >> 12, 12);
> +	int cmparg = sign_extend32(encoded_op & 0x00000fff, 12);

12 is wrong here – wherever you apply this, you need also a follow-up fix:
commit d70ef22892ed6c066e51e118b225923c9b74af34
Author: Jiri Slaby <jslaby@suse.cz>
Date:   Thu Nov 30 15:35:44 2017 +0100

    futex: futex_wake_op, fix sign_extend32 sign bits
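
For reference, the kernel's sign_extend32() takes the zero-based index of the sign bit rather than the field width, so a 12-bit field needs index 11. Below is a minimal userspace sketch of that point. The *_sketch names are made up for illustration, and the helper is only modelled on the kernel one; it is not the upstream code.

```c
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's sign_extend32(): 'index' is the
 * zero-based position of the sign bit, not the width of the field.
 * Like the kernel helper, this relies on arithmetic right shift of
 * negative signed values (true for GCC and Clang).
 */
static int32_t sign_extend32_sketch(uint32_t value, int index)
{
	int shift = 31 - index;

	return (int32_t)(value << shift) >> shift;
}

/* cmparg occupies bits 0..11 of encoded_op, so its sign bit is bit 11. */
static int decode_cmparg_sketch(uint32_t encoded_op)
{
	return sign_extend32_sketch(encoded_op & 0x00000fff, 11);
}

/* oparg occupies bits 12..23; after shifting down, its sign bit is bit 11. */
static int decode_oparg_sketch(uint32_t encoded_op)
{
	return sign_extend32_sketch((encoded_op & 0x00fff000) >> 12, 11);
}
```

With index 12 the helper treats bit 12 as the sign bit. That bit is always zero after masking, so negative arguments would be decoded as large positive values.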

thanks,
-- 
js
suse labs

^ permalink raw reply

* Re: [PATCH 4.9 27/33] futex: Remove duplicated code and fix undefined behaviour
From: Greg Kroah-Hartman @ 2018-05-18  9:01 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: linux-kernel, stable, Thomas Gleixner, Russell King,
	Darren Hart (VMware), linux-mips, Rich Felker, linux-ia64,
	linux-sh, peterz, Benjamin Herrenschmidt, Max Filippov,
	Paul Mackerras, sparclinux, Jonas Bonn, linux-s390, linux-arch,
	Yoshinori Sato, linux-hexagon, Helge Deller, James E.J. Bottomley,
	Catalin Marinas, Matt Turner, linux-snps-arc, Fenghua Yu,
	Arnd Bergmann, linux-xtensa, Stefan Kristiansson, openrisc,
	Ivan Kokshaysky, Stafford Horne, linux-arm-kernel,
	Richard Henderson, Chris Zankel, Michal Simek, Tony Luck,
	linux-parisc, Vineet Gupta, Ralf Baechle, Richard Kuo,
	linux-alpha, Martin Schwidefsky, linuxppc-dev, David S. Miller,
	Ben Hutchings
In-Reply-To: <e8dc5f94-3b52-dcf0-3b5e-b442bde7d803@suse.cz>

On Fri, May 18, 2018 at 10:30:24AM +0200, Jiri Slaby wrote:
> On 05/18/2018, 10:16 AM, Greg Kroah-Hartman wrote:
> > 4.9-stable review patch.  If anyone has any objections, please let me know.
> > 
> > ------------------
> > 
> > From: Jiri Slaby <jslaby@suse.cz>
> > 
> > commit 30d6e0a4190d37740e9447e4e4815f06992dd8c3 upstream.
> ...
> > --- a/kernel/futex.c
> > +++ b/kernel/futex.c
> > @@ -1458,6 +1458,45 @@ out:
> >  	return ret;
> >  }
> >  
> > +static int futex_atomic_op_inuser(unsigned int encoded_op, u32 __user *uaddr)
> > +{
> > +	unsigned int op =	  (encoded_op & 0x70000000) >> 28;
> > +	unsigned int cmp =	  (encoded_op & 0x0f000000) >> 24;
> > +	int oparg = sign_extend32((encoded_op & 0x00fff000) >> 12, 12);
> > +	int cmparg = sign_extend32(encoded_op & 0x00000fff, 12);
> 
> 12 is wrong here – wherever you apply this, you need also a follow-up fix:
> commit d70ef22892ed6c066e51e118b225923c9b74af34
> Author: Jiri Slaby <jslaby@suse.cz>
> Date:   Thu Nov 30 15:35:44 2017 +0100
> 
>     futex: futex_wake_op, fix sign_extend32 sign bits

Thanks for letting me know, I've now queued it up to the needed trees.

greg k-h

^ permalink raw reply

* [PATCH] powerpc: fix spelling mistake: "Discharching" -> "Discharging"
From: Colin King @ 2018-05-18  9:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Trivial fix to spelling mistake in battery_charging array

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 arch/powerpc/kernel/rtas-proc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/rtas-proc.c b/arch/powerpc/kernel/rtas-proc.c
index d49063d0baa4..ed0ed0c9b7b3 100644
--- a/arch/powerpc/kernel/rtas-proc.c
+++ b/arch/powerpc/kernel/rtas-proc.c
@@ -504,7 +504,7 @@ static void ppc_rtas_process_sensor(struct seq_file *m,
 		"EPOW power off" };
 	const char * battery_cyclestate[]  = { "None", "In progress", 
 						"Requested" };
-	const char * battery_charging[]    = { "Charging", "Discharching", 
+	const char * battery_charging[]    = { "Charging", "Discharging",
 						"No current flow" };
 	const char * ibm_drconnector[]     = { "Empty", "Present", "Unusable", 
 						"Exchange" };
-- 
2.17.0

^ permalink raw reply related

* [PATCH-RESEND] cxl: Disable prefault_mode in Radix mode
From: Vaibhav Jain @ 2018-05-18  9:42 UTC (permalink / raw)
  To: linuxppc-dev, Frederic Barrat
  Cc: Vaibhav Jain, Andrew Donnellan, Christophe Lombard,
	Philippe Bergheaud, Alastair D'Silva, Vaibhav Jain, stable

From: Vaibhav Jain <vaibhav@linux.ibm.com>

Currently we see a kernel-oops reported on Power-9 while attaching a
context to an AFU, with radix-mode and sysfs attr 'prefault_mode' set
to anything other than 'none'. The backtrace of the oops is of this
form:

Unable to handle kernel paging request for data at address 0x00000080
Faulting instruction address: 0xc00800000bcf3b20
cpu 0x1: Vector: 300 (Data Access) at [c00000037f003800]
    pc: c00800000bcf3b20: cxl_load_segment+0x178/0x290 [cxl]
    lr: c00800000bcf39f0: cxl_load_segment+0x48/0x290 [cxl]
    sp: c00000037f003a80
   msr: 9000000000009033
   dar: 80
 dsisr: 40000000
  current = 0xc00000037f280000
  paca    = 0xc0000003ffffe600   softe: 3        irq_happened: 0x01
    pid   = 3529, comm = afp_no_int
<snip>
[c00000037f003af0] c00800000bcf4424 cxl_prefault+0xfc/0x248 [cxl]
[c00000037f003b50] c00800000bcf8a40 process_element_entry_psl9+0xd8/0x1a0 [cxl]
[c00000037f003b90] c00800000bcf944c cxl_attach_dedicated_process_psl9+0x44/0x130 [cxl]
[c00000037f003bd0] c00800000bcf5448 native_attach_process+0xc0/0x130 [cxl]
[c00000037f003c50] c00800000bcf16cc afu_ioctl+0x3f4/0x5e0 [cxl]
[c00000037f003d00] c00000000039d98c do_vfs_ioctl+0xdc/0x890
[c00000037f003da0] c00000000039e1a8 ksys_ioctl+0x68/0xf0
[c00000037f003df0] c00000000039e270 sys_ioctl+0x40/0xa0
[c00000037f003e30] c00000000000b320 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 0000000010053bb0

The oops occurs because on Power-8 the AFU attr 'prefault_mode' was
used to improve initial storage fault performance by prefaulting
process segments. However, on Power-9 with radix mode there are no
Storage-Segments that we can prefault, and prefaulting process pages
would be too costly and fine-grained.

Hence, since the prefaulting mechanism doesn't make sense for
radix mode, this patch updates prefault_mode_store() to not allow any
value other than CXL_PREFAULT_NONE when radix mode is enabled.
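
The resulting accept/reject rule can be sketched in userspace roughly as follows. This is an illustration only: the enum, its values, and parse_prefault_mode_sketch() are hypothetical stand-ins, and the 'radix' flag replaces the kernel's radix_enabled() check.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the driver's prefault mode values. */
enum prefault_mode_sketch {
	PREFAULT_INVALID = -1,
	PREFAULT_NONE,
	PREFAULT_WED,
	PREFAULT_ALL,
};

/*
 * "none" is always accepted. Any other mode is accepted only when the
 * MMU is not in radix mode ('radix' stands in for radix_enabled()).
 */
static enum prefault_mode_sketch
parse_prefault_mode_sketch(const char *buf, bool radix)
{
	if (!strncmp(buf, "none", 4))
		return PREFAULT_NONE;
	if (radix)
		return PREFAULT_INVALID;
	if (!strncmp(buf, "work_element_descriptor", 23))
		return PREFAULT_WED;
	if (!strncmp(buf, "all", 3))
		return PREFAULT_ALL;
	return PREFAULT_INVALID;
}
```

In the patch itself, an invalid result corresponds to returning -EINVAL from the sysfs store handler.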

Cc: <stable@vger.kernel.org>
Fixes: f24be42aab37 ("cxl: Add psl9 specific code")
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
Change-log:

Resend  ->  Updated the commit description to add more info on the
	    issue seen [Andrew]
---
 Documentation/ABI/testing/sysfs-class-cxl |  4 +++-
 drivers/misc/cxl/sysfs.c                  | 16 ++++++++++++----
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl
index 640f65e79ef1..267920a1874b 100644
--- a/Documentation/ABI/testing/sysfs-class-cxl
+++ b/Documentation/ABI/testing/sysfs-class-cxl
@@ -69,7 +69,9 @@ Date:           September 2014
 Contact:        linuxppc-dev@lists.ozlabs.org
 Description:    read/write
                 Set the mode for prefaulting in segments into the segment table
-                when performing the START_WORK ioctl. Possible values:
+                when performing the START_WORK ioctl. Only applicable when
+                running under hashed page table mmu.
+                Possible values:
                         none: No prefaulting (default)
                         work_element_descriptor: Treat the work element
                                  descriptor as an effective address and
diff --git a/drivers/misc/cxl/sysfs.c b/drivers/misc/cxl/sysfs.c
index 4b5a4c5d3c01..629e2e156412 100644
--- a/drivers/misc/cxl/sysfs.c
+++ b/drivers/misc/cxl/sysfs.c
@@ -353,12 +353,20 @@ static ssize_t prefault_mode_store(struct device *device,
 	struct cxl_afu *afu = to_cxl_afu(device);
 	enum prefault_modes mode = -1;
 
-	if (!strncmp(buf, "work_element_descriptor", 23))
-		mode = CXL_PREFAULT_WED;
-	if (!strncmp(buf, "all", 3))
-		mode = CXL_PREFAULT_ALL;
 	if (!strncmp(buf, "none", 4))
 		mode = CXL_PREFAULT_NONE;
+	else {
+		if (!radix_enabled()) {
+
+			/* only allowed when not in radix mode */
+			if (!strncmp(buf, "work_element_descriptor", 23))
+				mode = CXL_PREFAULT_WED;
+			if (!strncmp(buf, "all", 3))
+				mode = CXL_PREFAULT_ALL;
+		} else {
+			dev_err(device, "Cannot prefault with radix enabled\n");
+		}
+	}
 
 	if (mode == -1)
 		return -EINVAL;
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH v2 5/5] powerpc/lib: inline memcmp() for small constant sizes
From: Christophe Leroy @ 2018-05-18 10:35 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev, linux-kernel
In-Reply-To: <20180517135551.GT17342@gate.crashing.org>



On 05/17/2018 03:55 PM, Segher Boessenkool wrote:
> On Thu, May 17, 2018 at 12:49:58PM +0200, Christophe Leroy wrote:
>> In my 8xx configuration, I get 208 calls to memcmp()
>> Within those 208 calls, about half of them have constant sizes,
>> 46 have a size of 8, 17 have a size of 16, only a few have a
>> size over 16. Other fixed sizes are mostly 4, 6 and 10.
>>
>> This patch inlines calls to memcmp() when size
>> is constant and lower than or equal to 16
>>
>> In my 8xx configuration, this reduces the number of calls
>> to memcmp() from 208 to 123
>>
>> The following table shows the number of TB timeticks to perform
>> a constant size memcmp() before and after the patch depending on
>> the size
>>
>> 	Before	After	Improvement
>> 01:	 7577	 5682	25%
>> 02:	41668	 5682	86%
>> 03:	51137	13258	74%
>> 04:	45455	 5682	87%
>> 05:	58713	13258	77%
>> 06:	58712	13258	77%
>> 07:	68183	20834	70%
>> 08:	56819	15153	73%
>> 09:	70077	28411	60%
>> 10:	70077	28411	60%
>> 11:	79546	35986	55%
>> 12:	68182	28411	58%
>> 13:	81440	35986	55%
>> 14:	81440	39774	51%
>> 15:	94697	43562	54%
>> 16:	79546	37881	52%
> 
> Could you show results with a more recent GCC?  What version was this?

It was with the latest GCC version I have available in my environment,
which is GCC 5.4. Is that too old?

It seems that version inlines memcmp() only when the length is 1. All other
lengths call memcmp().

> 
> What is this really measuring?  I doubt it takes 7577 (or 5682) timebase
> ticks to do a 1-byte memcmp, which is just 3 instructions after all.

Well, I looked at my tests again and it seems some results are wrong. I
can't remember why; I probably did something wrong when I ran the tests.

Anyway, the principle is to call a function tstcmpX() 100000 times from
a loop, reading the timebase with mftb before and after the loop.
Then we subtract from the elapsed time the time spent calling
tstcmp0(), which is only a blr.
Therefore, we really get only the time spent in the comparison.
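
In portable C the same measurement principle could be sketched roughly like this. It is a userspace approximation, not the test code used here: clock() stands in for mftb, and all *_sketch names are invented for the example.

```c
#include <string.h>
#include <time.h>

typedef int (*cmp_fn)(const void *, const void *);

/* Baseline: does no comparison, so it captures only call and loop overhead. */
static int tstcmp0_sketch(const void *a, const void *b)
{
	(void)a;
	(void)b;
	return 0;
}

/* Constant-size comparison under test (2 bytes here). */
static int tstcmp2_sketch(const void *a, const void *b)
{
	return memcmp(a, b, 2);
}

/* Time 'iters' calls made through a function pointer, in clock() ticks. */
static long time_loop_sketch(cmp_fn fn, const void *a, const void *b, long iters)
{
	volatile int sink = 0;	/* keeps the calls from being optimised away */
	clock_t start = clock();
	long i;

	for (i = 0; i < iters; i++)
		sink += fn(a, b);
	return (long)(clock() - start);
}

/* Net cost: time with the comparison minus time with the empty baseline. */
static long net_ticks_sketch(cmp_fn fn, const void *a, const void *b, long iters)
{
	return time_loop_sketch(fn, a, b, iters) -
	       time_loop_sketch(tstcmp0_sketch, a, b, iters);
}
```

A call such as net_ticks_sketch(tstcmp2_sketch, buf1, buf2, 100000) then mirrors one row of the table. The result is noisy and can even be negative for very cheap comparisons, since both loops are timed independently.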

Here is the loop:

c06243b0:	7f ac 42 e6 	mftb    r29
c06243b4:	3f 60 00 01 	lis     r27,1
c06243b8:	63 7b 86 a0 	ori     r27,r27,34464
c06243bc:	38 a0 00 02 	li      r5,2
c06243c0:	7f c4 f3 78 	mr      r4,r30
c06243c4:	7f 83 e3 78 	mr      r3,r28
c06243c8:	4b 9e 8c 09 	bl      c000cfd0 <tstcmp2>
c06243cc:	2c 1b 00 01 	cmpwi   r27,1
c06243d0:	3b 7b ff ff 	addi    r27,r27,-1
c06243d4:	40 82 ff e8 	bne     c06243bc <setup_arch+0x294>
c06243d8:	7c ac 42 e6 	mftb    r5
c06243dc:	7c bd 28 50 	subf    r5,r29,r5
c06243e0:	7c bf 28 50 	subf    r5,r31,r5
c06243e4:	38 80 00 02 	li      r4,2
c06243e8:	7f 43 d3 78 	mr      r3,r26
c06243ec:	4b a2 e4 45 	bl      c0052830 <printk>

Before the patch:
c000cfc4 <tstcmp0>:
c000cfc4:	4e 80 00 20 	blr

c000cfc8 <tstcmp1>:
c000cfc8:	38 a0 00 01 	li      r5,1
c000cfcc:	48 00 72 08 	b       c00141d4 <__memcmp>

c000cfd0 <tstcmp2>:
c000cfd0:	38 a0 00 02 	li      r5,2
c000cfd4:	48 00 72 00 	b       c00141d4 <__memcmp>

c000cfd8 <tstcmp3>:
c000cfd8:	38 a0 00 03 	li      r5,3
c000cfdc:	48 00 71 f8 	b       c00141d4 <__memcmp>

After the patch:
c000cfc4 <tstcmp0>:
c000cfc4:	4e 80 00 20 	blr

c000cfd8 <tstcmp1>:
c000cfd8:	88 64 00 00 	lbz     r3,0(r4)
c000cfdc:	89 25 00 00 	lbz     r9,0(r5)
c000cfe0:	7c 69 18 50 	subf    r3,r9,r3
c000cfe4:	4e 80 00 20 	blr

c000cfe8 <tstcmp2>:
c000cfe8:	a0 64 00 00 	lhz     r3,0(r4)
c000cfec:	a1 25 00 00 	lhz     r9,0(r5)
c000cff0:	7c 69 18 50 	subf    r3,r9,r3
c000cff4:	4e 80 00 20 	blr

c000cff8 <tstcmp3>:
c000cff8:	a1 24 00 00 	lhz     r9,0(r4)
c000cffc:	a0 65 00 00 	lhz     r3,0(r5)
c000d000:	7c 63 48 51 	subf.   r3,r3,r9
c000d004:	4c 82 00 20 	bnelr
c000d008:	88 64 00 02 	lbz     r3,2(r4)
c000d00c:	89 25 00 02 	lbz     r9,2(r5)
c000d010:	7c 69 18 50 	subf    r3,r9,r3
c000d014:	4e 80 00 20 	blr

c000d018 <tstcmp4>:
c000d018:	80 64 00 00 	lwz     r3,0(r4)
c000d01c:	81 25 00 00 	lwz     r9,0(r5)
c000d020:	7c 69 18 50 	subf    r3,r9,r3
c000d024:	4e 80 00 20 	blr

c000d028 <tstcmp5>:
c000d028:	81 24 00 00 	lwz     r9,0(r4)
c000d02c:	80 65 00 00 	lwz     r3,0(r5)
c000d030:	7c 63 48 51 	subf.   r3,r3,r9
c000d034:	4c 82 00 20 	bnelr
c000d038:	88 64 00 04 	lbz     r3,4(r4)
c000d03c:	89 25 00 04 	lbz     r9,4(r5)
c000d040:	7c 69 18 50 	subf    r3,r9,r3
c000d044:	4e 80 00 20 	blr

c000d048 <tstcmp6>:
c000d048:	81 24 00 00 	lwz     r9,0(r4)
c000d04c:	80 65 00 00 	lwz     r3,0(r5)
c000d050:	7c 63 48 51 	subf.   r3,r3,r9
c000d054:	4c 82 00 20 	bnelr
c000d058:	a0 64 00 04 	lhz     r3,4(r4)
c000d05c:	a1 25 00 04 	lhz     r9,4(r5)
c000d060:	7c 69 18 50 	subf    r3,r9,r3
c000d064:	4e 80 00 20 	blr

c000d068 <tstcmp7>:
c000d068:	81 24 00 00 	lwz     r9,0(r4)
c000d06c:	80 65 00 00 	lwz     r3,0(r5)
c000d070:	7d 23 48 51 	subf.   r9,r3,r9
c000d074:	40 82 00 20 	bne     c000d094 <tstcmp7+0x2c>
c000d078:	a0 64 00 04 	lhz     r3,4(r4)
c000d07c:	a1 25 00 04 	lhz     r9,4(r5)
c000d080:	7d 29 18 51 	subf.   r9,r9,r3
c000d084:	40 82 00 10 	bne     c000d094 <tstcmp7+0x2c>
c000d088:	88 64 00 06 	lbz     r3,6(r4)
c000d08c:	89 25 00 06 	lbz     r9,6(r5)
c000d090:	7d 29 18 50 	subf    r9,r9,r3
c000d094:	7d 23 4b 78 	mr      r3,r9
c000d098:	4e 80 00 20 	blr

c000d09c <tstcmp8>:
c000d09c:	81 25 00 04 	lwz     r9,4(r5)
c000d0a0:	80 64 00 04 	lwz     r3,4(r4)
c000d0a4:	81 04 00 00 	lwz     r8,0(r4)
c000d0a8:	81 45 00 00 	lwz     r10,0(r5)
c000d0ac:	7c 69 18 10 	subfc   r3,r9,r3
c000d0b0:	7d 2a 41 10 	subfe   r9,r10,r8
c000d0b4:	7d 2a fe 70 	srawi   r10,r9,31
c000d0b8:	7d 48 4b 79 	or.     r8,r10,r9
c000d0bc:	4d a2 00 20 	bclr+   12,eq
c000d0c0:	7d 23 4b 78 	mr      r3,r9
c000d0c4:	4e 80 00 20 	blr

c000d0c8 <tstcmp9>:
c000d0c8:	81 25 00 04 	lwz     r9,4(r5)
c000d0cc:	80 64 00 04 	lwz     r3,4(r4)
c000d0d0:	81 04 00 00 	lwz     r8,0(r4)
c000d0d4:	81 45 00 00 	lwz     r10,0(r5)
c000d0d8:	7c 69 18 10 	subfc   r3,r9,r3
c000d0dc:	7d 2a 41 10 	subfe   r9,r10,r8
c000d0e0:	7d 2a fe 70 	srawi   r10,r9,31
c000d0e4:	7d 48 4b 79 	or.     r8,r10,r9
c000d0e8:	41 82 00 08 	beq     c000d0f0 <tstcmp9+0x28>
c000d0ec:	7d 23 4b 78 	mr      r3,r9
c000d0f0:	2f 83 00 00 	cmpwi   cr7,r3,0
c000d0f4:	4c 9e 00 20 	bnelr   cr7
c000d0f8:	88 64 00 08 	lbz     r3,8(r4)
c000d0fc:	89 25 00 08 	lbz     r9,8(r5)
c000d100:	7c 69 18 50 	subf    r3,r9,r3
c000d104:	4e 80 00 20 	blr

This shows that on PPC32 the 8-byte comparison is not optimal; I will
improve it.

We also see in tstcmp7() that GCC is a bit stupid: it should use r3 as
the result of the sub, as it does in all the previous ones, and then do
bnelr instead of bne+mr+blr.

Below are the results of the measurement redone today:

     Before After  Improvement
01  24621   5681  77%
02  24621   5681  77%
03  34091  13257  61%
04  28409   5681  80%
05  41667  13257  68%
06  41668  13257  68%
07  51138  22727  56%
08  39772  15151  62%
09  53031  28409  46%
10  53031  28409  46%
11  62501  35986  42%
12  51137  28410  44%
13  64395  35985  44%
14  68182  39774  42%
15  73865  43560  41%
16  62500  37879  39%

We also see here that 08 is not optimal: it should have given the same
results as 05 and 06. I will keep it as is for PPC64 but will rewrite it
as two 04 comparisons for PPC32.
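
As a rough illustration of what "two 04 comparisons" could look like in C (an assumption about the intent, not the code from the patch), the sketch below compares the first 4-byte word and only looks at the second if the first is equal. The loads are done byte by byte in big-endian order so the example behaves the same on any host, and the helper names are invented.

```c
#include <stdint.h>

/* Assemble a 32-bit value from 4 bytes in big-endian (memory) order. */
static uint32_t load_be32_sketch(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* 4-byte comparison: unsigned compare, returns -1, 0 or 1. */
static int cmp4_sketch(const void *a, const void *b)
{
	uint32_t x = load_be32_sketch(a);
	uint32_t y = load_be32_sketch(b);

	return (x > y) - (x < y);
}

/* 8-byte comparison built from two 4-byte comparisons. */
static int memcmp8_sketch(const void *a, const void *b)
{
	int r = cmp4_sketch(a, b);

	if (r)
		return r;
	return cmp4_sketch((const unsigned char *)a + 4,
			   (const unsigned char *)b + 4);
}
```

The sketch returns only the sign of the difference, which is all memcmp() callers may rely on, and the unsigned compare avoids the overflow that a plain 32-bit subtraction could hit.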

Christophe



> 
> 
> Segher
> 

^ permalink raw reply

* Re: [PATCH 1/3] powerpc/io: Add __raw_writeq_be() __raw_rm_writeq_be()
From: Michael Ellerman @ 2018-05-18 12:00 UTC (permalink / raw)
  To: Samuel Mendoza-Jonas, linuxppc-dev; +Cc: alistair, paulus
In-Reply-To: <865de9ab557621787a26db40c6bf55e1e54ca30d.camel@mendozajonas.com>

Samuel Mendoza-Jonas <sam@mendozajonas.com> writes:

> On Mon, 2018-05-14 at 22:50 +1000, Michael Ellerman wrote:
>> Add byte-swapping versions of __raw_writeq() and __raw_rm_writeq().
>> 
>> This allows us to avoid sparse warnings caused by passing __be64 to
>> __raw_writeq(), which takes unsigned long:
>> 
>>   arch/powerpc/platforms/powernv/pci-ioda.c:1981:38:
>>   warning: incorrect type in argument 1 (different base types)
>>       expected unsigned long [unsigned] v
>>       got restricted __be64 [usertype] <noident>
>> 
>> It's also generally preferable to use a byte-swapping accessor rather
>> than doing it by hand in the code, which is more bug prone.
>> 
>> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>
> For this and the following patches:
>
> Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>

Thanks.

cheers

^ permalink raw reply

* Re: [PATCH v2 00/10] KVM: PPC: reimplement mmio emulation with analyse_instr()
From: Paul Mackerras @ 2018-05-18 12:30 UTC (permalink / raw)
  To: wei.guo.simon; +Cc: kvm-ppc, kvm, linuxppc-dev
In-Reply-To: <1525674016-6703-1-git-send-email-wei.guo.simon@gmail.com>

On Mon, May 07, 2018 at 02:20:06PM +0800, wei.guo.simon@gmail.com wrote:
> From: Simon Guo <wei.guo.simon@gmail.com>
> 
> We already have analyse_instr() which analyzes instructions for the instruction
> type, size, additional flags, etc. What kvmppc_emulate_loadstore() did is somehow
> duplicated and it will be good to utilize analyse_instr() to reimplement the
> code. The advantage is that the code logic will be shared and more clean to be 
> maintained.

I have put patches 1-3 into my kvm-ppc-next branch.

Thanks,
Paul.

^ permalink raw reply

* Re: [PATCH 07/14] powerpc: Add support for restartable sequences
From: Michael Ellerman @ 2018-05-18 12:38 UTC (permalink / raw)
  To: Mathieu Desnoyers, Boqun Feng, Will Deacon
  Cc: Peter Zijlstra, Paul E. McKenney, Andy Lutomirski, Dave Watson,
	linux-kernel, linux-api, Paul Turner, Andrew Morton, Russell King,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Andrew Hunter,
	Andi Kleen, Chris Lameter, Ben Maurer, rostedt, Josh Triplett,
	Linus Torvalds, Catalin Marinas, Michael Kerrisk, Joel Fernandes,
	Benjamin Herrenschmidt, Paul Mackerras, linuxppc-dev
In-Reply-To: <277374719.2144.1526570889798.JavaMail.zimbra@efficios.com>

Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> ----- On May 16, 2018, at 9:19 PM, Boqun Feng boqun.feng@gmail.com wrote:
>> On Wed, May 16, 2018 at 04:13:16PM -0400, Mathieu Desnoyers wrote:
>>> ----- On May 16, 2018, at 12:18 PM, Peter Zijlstra peterz@infradead.org wrote:
>>> > On Mon, Apr 30, 2018 at 06:44:26PM -0400, Mathieu Desnoyers wrote:
>>> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>>> >> index c32a181a7cbb..ed21a777e8c6 100644
>>> >> --- a/arch/powerpc/Kconfig
>>> >> +++ b/arch/powerpc/Kconfig
>>> >> @@ -223,6 +223,7 @@ config PPC
>>> >>  	select HAVE_SYSCALL_TRACEPOINTS
>>> >>  	select HAVE_VIRT_CPU_ACCOUNTING
>>> >>  	select HAVE_IRQ_TIME_ACCOUNTING
>>> >> +	select HAVE_RSEQ
>>> >>  	select IRQ_DOMAIN
>>> >>  	select IRQ_FORCED_THREADING
>>> >>  	select MODULES_USE_ELF_RELA
>>> >> diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
>>> >> index 61db86ecd318..d3bb3aaaf5ac 100644
>>> >> --- a/arch/powerpc/kernel/signal.c
>>> >> +++ b/arch/powerpc/kernel/signal.c
>>> >> @@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk)
>>> >>  	/* Re-enable the breakpoints for the signal stack */
>>> >>  	thread_change_pc(tsk, tsk->thread.regs);
>>> >>  
>>> >> +	rseq_signal_deliver(tsk->thread.regs);
>>> >> +
>>> >>  	if (is32) {
>>> >>          	if (ksig.ka.sa.sa_flags & SA_SIGINFO)
>>> >>  			ret = handle_rt_signal32(&ksig, oldset, tsk);
>>> >> @@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
>>> >>  	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
>>> >>  		clear_thread_flag(TIF_NOTIFY_RESUME);
>>> >>  		tracehook_notify_resume(regs);
>>> >> +		rseq_handle_notify_resume(regs);
>>> >>  	}
>>> >>  
>>> >>  	user_enter();
>>> > 
>>> > Again no rseq_syscall().
>>> 
>>> Same question for PowerPC as for ARM:
>>> 
>>> Considering that rseq_syscall is implemented as follows:
>>> 
>>> +void rseq_syscall(struct pt_regs *regs)
>>> +{
>>> +       unsigned long ip = instruction_pointer(regs);
>>> +       struct task_struct *t = current;
>>> +       struct rseq_cs rseq_cs;
>>> +
>>> +       if (!t->rseq)
>>> +               return;
>>> +       if (!access_ok(VERIFY_READ, t->rseq, sizeof(*t->rseq)) ||
>>> +           rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
>>> +               force_sig(SIGSEGV, t);
>>> +}
>>> 
>>> and that x86 calls it from syscall_return_slowpath() (which AFAIU is
>>> now used in the fast-path since KPTI), I wonder where we should call
>> 
>> So we actually detect this after the syscall takes effect, right? I
>> wonder whether this could be problematic, because to some people
>> "disallowing syscalls" in rseq areas may mean that the syscall won't
>> take effect, I guess?
>> 
>>> this on PowerPC ?  I was under the impression that PowerPC return to
>>> userspace fast-path was not calling C code unless work flags were set,
>>> but I might be wrong.
>>> 
>> 
>> I think you're right. So we have to introduce a call site for rseq_syscall()
>> in the syscall path, something like:
>> 
>> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
>> index 51695608c68b..a25734a96640 100644
>> --- a/arch/powerpc/kernel/entry_64.S
>> +++ b/arch/powerpc/kernel/entry_64.S
>> @@ -222,6 +222,9 @@ system_call_exit:
>> 	mtmsrd	r11,1
>> #endif /* CONFIG_PPC_BOOK3E */
>> 
>> +	addi    r3,r1,STACK_FRAME_OVERHEAD
>> +	bl	rseq_syscall
>> +
>> 	ld	r9,TI_FLAGS(r12)
>> 	li	r11,-MAX_ERRNO
>> 	andi.	r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
>> 
>> But I think it's important for us to first decide where (before or after
>> the syscall) we do the detection.
>
> As Peter said, we don't really care whether it's on syscall entry or exit, as
> long as the process gets killed when the erroneous use is detected. I think doing
> it on syscall exit is a bit easier because we can clearly access the userspace
> TLS, which AFAIU may be less straightforward on syscall entry.

Coming in to the thread late, sorry if I'm missing the point.

> We may want to add #ifdef CONFIG_DEBUG_RSEQ / #endif around the code you
> proposed above, so it's only compiled in if CONFIG_DEBUG_RSEQ=y.

That sounds good. A function call is not free even if it returns immediately.

> On the ARM leg of the email thread, Will Deacon suggests testing whether current->rseq
> is non-NULL before calling rseq_syscall(). I wonder if this added check is justified
> at the assembly level, considering that this is just a debugging option. We already do
> that check at the very beginning of rseq_syscall().

I guess it depends if this is one of those "debugging options" that's
going to end up turned on in distro kernels?

I think in that code we'd need to check paca->current->rseq, so that
wouldn't be free either.

cheers

^ permalink raw reply
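
The check discussed in the thread above (kill the task if a syscall is issued
from inside an rseq critical section) can be sketched in plain userspace C.
This is only an illustration, not the kernel code: struct cs_range, ip_in_cs()
and syscall_is_erroneous() are hypothetical names, and the assumption that a
critical section is described by a start address plus a length is mine, not
something the quoted patch shows.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified description of one critical section. */
struct cs_range {
	uint64_t start_ip;           /* address of the first instruction */
	uint64_t post_commit_offset; /* length of the section in bytes */
};

/*
 * True when ip lies in [start_ip, start_ip + post_commit_offset).
 * The unsigned subtraction wraps around for ip < start_ip, which
 * yields a very large value and therefore fails the comparison.
 */
static bool ip_in_cs(uint64_t ip, const struct cs_range *cs)
{
	return ip - cs->start_ip < cs->post_commit_offset;
}

/*
 * A syscall is treated as erroneous only when the task has registered
 * an rseq area and the syscall instruction sits inside the section.
 */
static bool syscall_is_erroneous(bool registered, uint64_t ip,
				 const struct cs_range *cs)
{
	return registered && ip_in_cs(ip, cs);
}
```

The early "registered" test mirrors the point made in the thread: the cheap
check comes first, so tasks that never registered an rseq area pay only for
the function call itself.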

* Re: [PATCH-RESEND] cxl: Disable prefault_mode in Radix mode
From: Frederic Barrat @ 2018-05-18 12:43 UTC (permalink / raw)
  To: Vaibhav Jain, linuxppc-dev, Frederic Barrat
  Cc: Vaibhav Jain, Andrew Donnellan, Christophe Lombard,
	Philippe Bergheaud, Alastair D'Silva, stable
In-Reply-To: <20180518094223.786-1-vaibhav@linux.vnet.ibm.com>



On 18/05/2018 at 11:42, Vaibhav Jain wrote:
> From: Vaibhav Jain <vaibhav@linux.ibm.com>
> 
> Currently we see a kernel-oops reported on Power-9 while attaching a
> context to an AFU, with radix-mode and sysfs attr 'prefault_mode' set
> to anything other than 'none'. The backtrace of the oops is of this
> form:
> 
> Unable to handle kernel paging request for data at address 0x00000080
> Faulting instruction address: 0xc00800000bcf3b20
> cpu 0x1: Vector: 300 (Data Access) at [c00000037f003800]
>      pc: c00800000bcf3b20: cxl_load_segment+0x178/0x290 [cxl]
>      lr: c00800000bcf39f0: cxl_load_segment+0x48/0x290 [cxl]
>      sp: c00000037f003a80
>     msr: 9000000000009033
>     dar: 80
>   dsisr: 40000000
>    current = 0xc00000037f280000
>    paca    = 0xc0000003ffffe600   softe: 3        irq_happened: 0x01
>      pid   = 3529, comm = afp_no_int
> <snip>
> [c00000037f003af0] c00800000bcf4424 cxl_prefault+0xfc/0x248 [cxl]
> [c00000037f003b50] c00800000bcf8a40 process_element_entry_psl9+0xd8/0x1a0 [cxl]
> [c00000037f003b90] c00800000bcf944c cxl_attach_dedicated_process_psl9+0x44/0x130 [cxl]
> [c00000037f003bd0] c00800000bcf5448 native_attach_process+0xc0/0x130 [cxl]
> [c00000037f003c50] c00800000bcf16cc afu_ioctl+0x3f4/0x5e0 [cxl]
> [c00000037f003d00] c00000000039d98c do_vfs_ioctl+0xdc/0x890
> [c00000037f003da0] c00000000039e1a8 ksys_ioctl+0x68/0xf0
> [c00000037f003df0] c00000000039e270 sys_ioctl+0x40/0xa0
> [c00000037f003e30] c00000000000b320 system_call+0x58/0x6c
> --- Exception: c01 (System Call) at 0000000010053bb0
> 
> The issue arises because on Power-8 the AFU attr 'prefault_mode' was
> used to improve initial storage fault performance by prefaulting
> process segments. However, on Power-9 with radix mode we don't have
> Storage-Segments that we can prefault. Also, prefaulting process pages
> would be too costly and fine-grained.
> 
> Hence, since the prefaulting mechanism doesn't make sense in
> radix mode, this patch updates prefault_mode_store() to not allow any
> value other than CXL_PREFAULT_NONE when radix mode is enabled.
> 
> Cc: <stable@vger.kernel.org>
> Fixes: f24be42aab37 ("cxl: Add psl9 specific code")
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---

Thanks!
Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>



> Change-log:
> 
> Resend  ->  Updated the commit description to add more info on the
> 	    issue seen [Andrew]
> ---
>   Documentation/ABI/testing/sysfs-class-cxl |  4 +++-
>   drivers/misc/cxl/sysfs.c                  | 16 ++++++++++++----
>   2 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl
> index 640f65e79ef1..267920a1874b 100644
> --- a/Documentation/ABI/testing/sysfs-class-cxl
> +++ b/Documentation/ABI/testing/sysfs-class-cxl
> @@ -69,7 +69,9 @@ Date:           September 2014
>   Contact:        linuxppc-dev@lists.ozlabs.org
>   Description:    read/write
>                   Set the mode for prefaulting in segments into the segment table
> -                when performing the START_WORK ioctl. Possible values:
> +                when performing the START_WORK ioctl. Only applicable when
> +                running under hashed page table mmu.
> +                Possible values:
>                           none: No prefaulting (default)
>                           work_element_descriptor: Treat the work element
>                                    descriptor as an effective address and
> diff --git a/drivers/misc/cxl/sysfs.c b/drivers/misc/cxl/sysfs.c
> index 4b5a4c5d3c01..629e2e156412 100644
> --- a/drivers/misc/cxl/sysfs.c
> +++ b/drivers/misc/cxl/sysfs.c
> @@ -353,12 +353,20 @@ static ssize_t prefault_mode_store(struct device *device,
>   	struct cxl_afu *afu = to_cxl_afu(device);
>   	enum prefault_modes mode = -1;
> 
> -	if (!strncmp(buf, "work_element_descriptor", 23))
> -		mode = CXL_PREFAULT_WED;
> -	if (!strncmp(buf, "all", 3))
> -		mode = CXL_PREFAULT_ALL;
>   	if (!strncmp(buf, "none", 4))
>   		mode = CXL_PREFAULT_NONE;
> +	else {
> +		if (!radix_enabled()) {
> +
> +			/* only allowed when not in radix mode */
> +			if (!strncmp(buf, "work_element_descriptor", 23))
> +				mode = CXL_PREFAULT_WED;
> +			if (!strncmp(buf, "all", 3))
> +				mode = CXL_PREFAULT_ALL;
> +		} else {
> +			dev_err(device, "Cannot prefault with radix enabled\n");
> +		}
> +	}
> 
>   	if (mode == -1)
>   		return -EINVAL;
> 

^ permalink raw reply
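
The parsing rule introduced by the patch above can be sketched as a standalone
function. This is a userspace illustration under assumptions: the enum and
parse_prefault_mode() are hypothetical stand-ins for the driver's types, and
the radix state is passed in as a parameter instead of calling radix_enabled().

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the driver's prefault mode enum. */
enum prefault_mode {
	PREFAULT_INVALID = -1,
	PREFAULT_NONE,
	PREFAULT_WED,
	PREFAULT_ALL,
};

/*
 * "none" is always accepted. Every other mode is accepted only when
 * radix mode is off, matching the behaviour described in the patch.
 */
static enum prefault_mode parse_prefault_mode(const char *buf, bool radix)
{
	if (!strncmp(buf, "none", 4))
		return PREFAULT_NONE;
	if (radix)
		return PREFAULT_INVALID;
	if (!strncmp(buf, "work_element_descriptor", 23))
		return PREFAULT_WED;
	if (!strncmp(buf, "all", 3))
		return PREFAULT_ALL;
	return PREFAULT_INVALID;
}
```

A caller would map PREFAULT_INVALID to -EINVAL, as the sysfs store function
in the patch does for mode == -1.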

* [PATCH bpf v2 0/6] bpf: enhancements for multi-function programs
From: Sandipan Das @ 2018-05-18 12:50 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, linuxppc-dev, naveen.n.rao, mpe, jakub.kicinski

This patch series introduces the following:

[1] Support for bpf-to-bpf function calls in the powerpc64 JIT compiler.

[2] Provide a way for resolving function calls because of the way JITed
    images are allocated in powerpc64.

[3] Fix to get JITed instruction dumps for multi-function programs from
    the bpf system call.

v2:
 - Incorporate review comments from Jakub

Sandipan Das (6):
  bpf: support 64-bit offsets for bpf function calls
  bpf: powerpc64: add JIT support for multi-function programs
  bpf: get kernel symbol addresses via syscall
  tools: bpf: sync bpf uapi header
  tools: bpftool: resolve calls without using imm field
  bpf: fix JITed dump for multi-function programs via syscall

 arch/powerpc/net/bpf_jit_comp64.c | 79 ++++++++++++++++++++++++++++++++++-----
 include/uapi/linux/bpf.h          |  2 +
 kernel/bpf/syscall.c              | 56 ++++++++++++++++++++++++---
 kernel/bpf/verifier.c             | 22 +++++++----
 tools/bpf/bpftool/prog.c          | 29 ++++++++++++++
 tools/bpf/bpftool/xlated_dumper.c | 10 ++++-
 tools/bpf/bpftool/xlated_dumper.h |  2 +
 tools/include/uapi/linux/bpf.h    |  2 +
 8 files changed, 179 insertions(+), 23 deletions(-)

-- 
2.14.3

^ permalink raw reply

* [PATCH bpf v2 1/6] bpf: support 64-bit offsets for bpf function calls
From: Sandipan Das @ 2018-05-18 12:50 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, linuxppc-dev, naveen.n.rao, mpe, jakub.kicinski
In-Reply-To: <20180518125039.6500-1-sandipan@linux.vnet.ibm.com>

The imm field of a bpf instruction is a signed 32-bit integer.
For JIT bpf-to-bpf function calls, it stores the offset of the
start address of the callee's JITed image from __bpf_call_base.

For some architectures, such as powerpc64, this offset may be
as large as 64 bits and cannot be accommodated in the imm field
without truncation.

We resolve this by:

[1] Additionally using the auxiliary data of each function to
    keep a list of start addresses of the JITed images for all
    functions determined by the verifier.

[2] Retaining the subprog id inside the off field of the call
    instructions and using it to index into the list mentioned
    above and look up the callee's address.

To make sure that the existing JIT compilers continue to work
without requiring changes, we keep the imm field as it is.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
---
 kernel/bpf/verifier.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a9e4b1372da6..6c56cce9c4e3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5383,11 +5383,24 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 			    insn->src_reg != BPF_PSEUDO_CALL)
 				continue;
 			subprog = insn->off;
-			insn->off = 0;
 			insn->imm = (u64 (*)(u64, u64, u64, u64, u64))
 				func[subprog]->bpf_func -
 				__bpf_call_base;
 		}
+
+		/* we use the aux data to keep a list of the start addresses
+		 * of the JITed images for each function in the program
+		 *
+		 * for some architectures, such as powerpc64, the imm field
+		 * might not be large enough to hold the offset of the start
+		 * address of the callee's JITed image from __bpf_call_base
+		 *
+		 * in such cases, we can lookup the start address of a callee
+		 * by using its subprog id, available from the off field of
+		 * the call instruction, as an index for this list
+		 */
+		func[i]->aux->func = func;
+		func[i]->aux->func_cnt = env->subprog_cnt + 1;
 	}
 	for (i = 0; i < env->subprog_cnt; i++) {
 		old_bpf_func = func[i]->bpf_func;
-- 
2.14.3

^ permalink raw reply related
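
The truncation problem described in the commit message above can be shown with
a small range check. This is a sketch only: fits_in_imm() is a hypothetical
helper, not a kernel function, and it simply asks whether a 64-bit offset can
be stored in a signed 32-bit field without loss.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True when a 64-bit offset can be stored in a signed 32-bit field
 * (such as the imm field of a bpf instruction) and read back unchanged.
 */
static bool fits_in_imm(int64_t offset)
{
	return offset >= INT32_MIN && offset <= INT32_MAX;
}
```

With made-up numbers, an offset of 0x1000 fits, while an offset of
0x100000000 (one bit beyond 32 bits) does not. In the second case the
callee address has to come from somewhere other than imm.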

* [PATCH bpf v2 2/6] bpf: powerpc64: add JIT support for multi-function programs
From: Sandipan Das @ 2018-05-18 12:50 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, linuxppc-dev, naveen.n.rao, mpe, jakub.kicinski
In-Reply-To: <20180518125039.6500-1-sandipan@linux.vnet.ibm.com>

This adds support for bpf-to-bpf function calls in the powerpc64
JIT compiler. The JIT compiler converts the bpf call instructions
to native branch instructions. After a round of the usual passes,
the start addresses of the JITed images for the callee functions
are known. Finally, to fix up the branch target addresses, we need
to perform an extra pass.

Because of the address range in which JITed images are allocated
on powerpc64, the offsets of the start addresses of these images
from __bpf_call_base are as large as 64 bits. So, for a function
call, we cannot use the imm field of the instruction to determine
the callee's address. Instead, we use the alternative method of
getting it from the list of function addresses in the auxiliary
data of the caller by using the off field as an index.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
---
 arch/powerpc/net/bpf_jit_comp64.c | 79 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 69 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index 1bdb1aff0619..25939892d8f7 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -256,7 +256,7 @@ static void bpf_jit_emit_tail_call(u32 *image, struct codegen_context *ctx, u32
 /* Assemble the body code between the prologue & epilogue */
 static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
 			      struct codegen_context *ctx,
-			      u32 *addrs)
+			      u32 *addrs, bool extra_pass)
 {
 	const struct bpf_insn *insn = fp->insnsi;
 	int flen = fp->len;
@@ -712,11 +712,23 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
 			break;
 
 		/*
-		 * Call kernel helper
+		 * Call kernel helper or bpf function
 		 */
 		case BPF_JMP | BPF_CALL:
 			ctx->seen |= SEEN_FUNC;
-			func = (u8 *) __bpf_call_base + imm;
+
+			/* bpf function call */
+			if (insn[i].src_reg == BPF_PSEUDO_CALL && extra_pass)
+				if (fp->aux->func && off < fp->aux->func_cnt)
+					/* use the subprog id from the off
+					 * field to lookup the callee address
+					 */
+					func = (u8 *) fp->aux->func[off]->bpf_func;
+				else
+					return -EINVAL;
+			/* kernel helper call */
+			else
+				func = (u8 *) __bpf_call_base + imm;
 
 			bpf_jit_emit_func_call(image, ctx, (u64)func);
 
@@ -864,6 +876,14 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
 	return 0;
 }
 
+struct powerpc64_jit_data {
+	struct bpf_binary_header *header;
+	u32 *addrs;
+	u8 *image;
+	u32 proglen;
+	struct codegen_context ctx;
+};
+
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 {
 	u32 proglen;
@@ -871,6 +891,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 	u8 *image = NULL;
 	u32 *code_base;
 	u32 *addrs;
+	struct powerpc64_jit_data *jit_data;
 	struct codegen_context cgctx;
 	int pass;
 	int flen;
@@ -878,6 +899,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 	struct bpf_prog *org_fp = fp;
 	struct bpf_prog *tmp_fp;
 	bool bpf_blinded = false;
+	bool extra_pass = false;
 
 	if (!fp->jit_requested)
 		return org_fp;
@@ -891,7 +913,28 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 		fp = tmp_fp;
 	}
 
+	jit_data = fp->aux->jit_data;
+	if (!jit_data) {
+		jit_data = kzalloc(sizeof(*jit_data), GFP_KERNEL);
+		if (!jit_data) {
+			fp = org_fp;
+			goto out;
+		}
+		fp->aux->jit_data = jit_data;
+	}
+
 	flen = fp->len;
+	addrs = jit_data->addrs;
+	if (addrs) {
+		cgctx = jit_data->ctx;
+		image = jit_data->image;
+		bpf_hdr = jit_data->header;
+		proglen = jit_data->proglen;
+		alloclen = proglen + FUNCTION_DESCR_SIZE;
+		extra_pass = true;
+		goto skip_init_ctx;
+	}
+
 	addrs = kzalloc((flen+1) * sizeof(*addrs), GFP_KERNEL);
 	if (addrs == NULL) {
 		fp = org_fp;
@@ -904,10 +947,10 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 	cgctx.stack_size = round_up(fp->aux->stack_depth, 16);
 
 	/* Scouting faux-generate pass 0 */
-	if (bpf_jit_build_body(fp, 0, &cgctx, addrs)) {
+	if (bpf_jit_build_body(fp, 0, &cgctx, addrs, false)) {
 		/* We hit something illegal or unsupported. */
 		fp = org_fp;
-		goto out;
+		goto out_addrs;
 	}
 
 	/*
@@ -925,9 +968,10 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 			bpf_jit_fill_ill_insns);
 	if (!bpf_hdr) {
 		fp = org_fp;
-		goto out;
+		goto out_addrs;
 	}
 
+skip_init_ctx:
 	code_base = (u32 *)(image + FUNCTION_DESCR_SIZE);
 
 	/* Code generation passes 1-2 */
@@ -935,7 +979,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 		/* Now build the prologue, body code & epilogue for real. */
 		cgctx.idx = 0;
 		bpf_jit_build_prologue(code_base, &cgctx);
-		bpf_jit_build_body(fp, code_base, &cgctx, addrs);
+		bpf_jit_build_body(fp, code_base, &cgctx, addrs, extra_pass);
 		bpf_jit_build_epilogue(code_base, &cgctx);
 
 		if (bpf_jit_enable > 1)
@@ -956,15 +1000,30 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 	((u64 *)image)[1] = local_paca->kernel_toc;
 #endif
 
+	bpf_flush_icache(bpf_hdr, (u8 *)bpf_hdr + (bpf_hdr->pages * PAGE_SIZE));
+
+	if (!fp->is_func || extra_pass) {
+		bpf_jit_binary_lock_ro(bpf_hdr);
+	} else {
+		jit_data->addrs = addrs;
+		jit_data->ctx = cgctx;
+		jit_data->proglen = proglen;
+		jit_data->image = image;
+		jit_data->header = bpf_hdr;
+	}
+
 	fp->bpf_func = (void *)image;
 	fp->jited = 1;
 	fp->jited_len = alloclen;
 
-	bpf_flush_icache(bpf_hdr, (u8 *)bpf_hdr + (bpf_hdr->pages * PAGE_SIZE));
+	if (!fp->is_func || extra_pass) {
+out_addrs:
+		kfree(addrs);
+		kfree(jit_data);
+		fp->aux->jit_data = NULL;
+	}
 
 out:
-	kfree(addrs);
-
 	if (bpf_blinded)
 		bpf_jit_prog_release_other(fp, fp == org_fp ? tmp_fp : org_fp);
 
-- 
2.14.3

^ permalink raw reply related
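
The lookup that the extra pass performs (use the off field as an index into
the per-program function list) can be sketched outside the kernel as follows.
The struct names and resolve_callee() are hypothetical. The explicit check for
a negative index is an addition of this sketch and is not part of the patch
above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct bpf_prog and its aux data. */
struct fake_prog {
	uint64_t bpf_func;        /* start address of the JITed image */
};

struct fake_aux {
	struct fake_prog **func;  /* one entry per function */
	uint32_t func_cnt;        /* number of entries in func */
};

/*
 * Look up the callee address by subprog id. Returns 0 and stores the
 * address in *addr on success, or -EINVAL when the list is missing or
 * the index is out of range.
 */
static int resolve_callee(const struct fake_aux *aux, int16_t off,
			  uint64_t *addr)
{
	if (!aux->func || off < 0 || (uint32_t)off >= aux->func_cnt)
		return -EINVAL;

	*addr = aux->func[off]->bpf_func;
	return 0;
}
```

Because the addresses are only known after every function has been JITed
once, a lookup like this is meaningful only in the extra pass.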

* [PATCH bpf v2 3/6] bpf: get kernel symbol addresses via syscall
From: Sandipan Das @ 2018-05-18 12:50 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, linuxppc-dev, naveen.n.rao, mpe, jakub.kicinski
In-Reply-To: <20180518125039.6500-1-sandipan@linux.vnet.ibm.com>

This adds two new fields to struct bpf_prog_info. For
multi-function programs, these fields can be used to pass
a list of kernel symbol addresses for all functions in a
given program to userspace using the bpf system call
with the BPF_OBJ_GET_INFO_BY_FD command.

When bpf_jit_kallsyms is enabled, we can get the address
of the corresponding kernel symbol for a callee function
and resolve the symbol's name. The address is determined
by adding the value of the call instruction's imm field
to __bpf_call_base. This offset gets assigned to the imm
field by the verifier.

For some architectures, such as powerpc64, the imm field
is not large enough to hold this offset.

We resolve this by:

[1] Assigning the subprog id to the imm field of a call
    instruction in the verifier instead of the offset of
    the callee's symbol's address from __bpf_call_base.

[2] Determining the address of a callee's corresponding
    symbol by using the imm field as an index for the
    list of kernel symbol addresses now available from
    the program info.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
---
 include/uapi/linux/bpf.h |  2 ++
 kernel/bpf/syscall.c     | 20 ++++++++++++++++++++
 kernel/bpf/verifier.c    |  7 +------
 3 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d94d333a8225..040c9cac7303 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2188,6 +2188,8 @@ struct bpf_prog_info {
 	__u32 xlated_prog_len;
 	__aligned_u64 jited_prog_insns;
 	__aligned_u64 xlated_prog_insns;
+	__aligned_u64 jited_ksyms;
+	__u32 nr_jited_ksyms;
 	__u64 load_time;	/* ns since boottime */
 	__u32 created_by_uid;
 	__u32 nr_map_ids;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index bfcde949c7f8..54a72fafe57c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1933,6 +1933,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 	if (!capable(CAP_SYS_ADMIN)) {
 		info.jited_prog_len = 0;
 		info.xlated_prog_len = 0;
+		info.nr_jited_ksyms = 0;
 		goto done;
 	}
 
@@ -1981,6 +1982,25 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 		}
 	}
 
+	ulen = info.nr_jited_ksyms;
+	info.nr_jited_ksyms = prog->aux->func_cnt;
+	if (info.nr_jited_ksyms && ulen) {
+		u64 __user *user_jited_ksyms = u64_to_user_ptr(info.jited_ksyms);
+		ulong ksym_addr;
+		u32 i;
+
+		/* copy the address of the kernel symbol corresponding to
+		 * each function
+		 */
+		ulen = min_t(u32, info.nr_jited_ksyms, ulen);
+		for (i = 0; i < ulen; i++) {
+			ksym_addr = (ulong) prog->aux->func[i]->bpf_func;
+			ksym_addr &= PAGE_MASK;
+			if (put_user((u64) ksym_addr, &user_jited_ksyms[i]))
+				return -EFAULT;
+		}
+	}
+
 done:
 	if (copy_to_user(uinfo, &info, info_len) ||
 	    put_user(info_len, &uattr->info.info_len))
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6c56cce9c4e3..e826c396aba2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5426,17 +5426,12 @@ static int jit_subprogs(struct bpf_verifier_env *env)
 	 * later look the same as if they were interpreted only.
 	 */
 	for (i = 0, insn = prog->insnsi; i < prog->len; i++, insn++) {
-		unsigned long addr;
-
 		if (insn->code != (BPF_JMP | BPF_CALL) ||
 		    insn->src_reg != BPF_PSEUDO_CALL)
 			continue;
 		insn->off = env->insn_aux_data[i].call_imm;
 		subprog = find_subprog(env, i + insn->off + 1);
-		addr  = (unsigned long)func[subprog]->bpf_func;
-		addr &= PAGE_MASK;
-		insn->imm = (u64 (*)(u64, u64, u64, u64, u64))
-			    addr - __bpf_call_base;
+		insn->imm = subprog;
 	}
 
 	prog->jited = 1;
-- 
2.14.3

^ permalink raw reply related
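
The export step added to the syscall path above can be sketched in userspace
as a bounded copy. All names here are hypothetical, the page size of 4096 is
an assumption made for the example, and plain stores stand in for put_user().

```c
#include <stdint.h>

/* Assumed page size for the sketch; the kernel uses PAGE_MASK. */
#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/*
 * Copy at most user_len page-aligned symbol addresses into user_buf
 * and return how many entries were written.
 */
static uint32_t export_ksyms(const uint64_t *func_addrs, uint32_t func_cnt,
			     uint64_t *user_buf, uint32_t user_len)
{
	uint32_t n = func_cnt < user_len ? func_cnt : user_len;
	uint32_t i;

	for (i = 0; i < n; i++)
		user_buf[i] = func_addrs[i] & SKETCH_PAGE_MASK;

	return n;
}
```

Unlike this sketch, the patch reports the total function count back in
nr_jited_ksyms rather than the number of entries copied, so a caller can
size its buffer with a first call and fetch the addresses with a second.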

* [PATCH bpf v2 4/6] tools: bpf: sync bpf uapi header
From: Sandipan Das @ 2018-05-18 12:50 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, linuxppc-dev, naveen.n.rao, mpe, jakub.kicinski
In-Reply-To: <20180518125039.6500-1-sandipan@linux.vnet.ibm.com>

Syncing the bpf.h uapi header with tools so that struct
bpf_prog_info has the two new fields for passing on the
addresses of the kernel symbols corresponding to each
function in a JITed program.

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
---
 tools/include/uapi/linux/bpf.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d94d333a8225..040c9cac7303 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2188,6 +2188,8 @@ struct bpf_prog_info {
 	__u32 xlated_prog_len;
 	__aligned_u64 jited_prog_insns;
 	__aligned_u64 xlated_prog_insns;
+	__aligned_u64 jited_ksyms;
+	__u32 nr_jited_ksyms;
 	__u64 load_time;	/* ns since boottime */
 	__u32 created_by_uid;
 	__u32 nr_map_ids;
-- 
2.14.3

^ permalink raw reply related

* [PATCH bpf v2 5/6] tools: bpftool: resolve calls without using imm field
From: Sandipan Das @ 2018-05-18 12:50 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, linuxppc-dev, naveen.n.rao, mpe, jakub.kicinski
In-Reply-To: <20180518125039.6500-1-sandipan@linux.vnet.ibm.com>

Currently, we resolve the callee's address for a JITed function
call by using the imm field of the call instruction as an offset
from __bpf_call_base. If bpf_jit_kallsyms is enabled, we further
use this address to get the callee's kernel symbol's name.

For some architectures, such as powerpc64, the imm field is not
large enough to hold this offset. So, instead of assigning this
offset to the imm field, the verifier now assigns the subprog
id. Also, a list of kernel symbol addresses for all the JITed
functions is provided in the program info. We now use the imm
field as an index for this list to look up a callee's symbol's
address and resolve its name.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
---
v2:
 - Order variables from longest to shortest
 - Make sure that ksyms_ptr and ksyms_len are always initialized
 - Simplify code
---
 tools/bpf/bpftool/prog.c          | 29 +++++++++++++++++++++++++++++
 tools/bpf/bpftool/xlated_dumper.c | 10 +++++++++-
 tools/bpf/bpftool/xlated_dumper.h |  2 ++
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 9bdfdf2d3fbe..e2f8f8f259fc 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -421,19 +421,26 @@ static int do_show(int argc, char **argv)
 static int do_dump(int argc, char **argv)
 {
 	struct bpf_prog_info info = {};
+	unsigned long *addrs = NULL;
 	struct dump_data dd = {};
 	__u32 len = sizeof(info);
 	unsigned int buf_size;
+	unsigned int nr_addrs;
 	char *filepath = NULL;
 	bool opcodes = false;
 	bool visual = false;
 	unsigned char *buf;
 	__u32 *member_len;
 	__u64 *member_ptr;
+	__u32 *ksyms_len;
+	__u64 *ksyms_ptr;
 	ssize_t n;
 	int err;
 	int fd;
 
+	ksyms_len = &info.nr_jited_ksyms;
+	ksyms_ptr = &info.jited_ksyms;
+
 	if (is_prefix(*argv, "jited")) {
 		member_len = &info.jited_prog_len;
 		member_ptr = &info.jited_prog_insns;
@@ -496,10 +503,22 @@ static int do_dump(int argc, char **argv)
 		return -1;
 	}
 
+	nr_addrs = *ksyms_len;
+	if (nr_addrs) {
+		addrs = malloc(nr_addrs * sizeof(__u64));
+		if (!addrs) {
+			p_err("mem alloc failed");
+			close(fd);
+			goto err_free;
+		}
+	}
+
 	memset(&info, 0, sizeof(info));
 
 	*member_ptr = ptr_to_u64(buf);
 	*member_len = buf_size;
+	*ksyms_ptr = ptr_to_u64(addrs);
+	*ksyms_len = nr_addrs;
 
 	err = bpf_obj_get_info_by_fd(fd, &info, &len);
 	close(fd);
@@ -513,6 +532,11 @@ static int do_dump(int argc, char **argv)
 		goto err_free;
 	}
 
+	if (*ksyms_len > nr_addrs) {
+		p_err("too many addresses returned");
+		goto err_free;
+	}
+
 	if ((member_len == &info.jited_prog_len &&
 	     info.jited_prog_insns == 0) ||
 	    (member_len == &info.xlated_prog_len &&
@@ -558,6 +582,9 @@ static int do_dump(int argc, char **argv)
 			dump_xlated_cfg(buf, *member_len);
 	} else {
 		kernel_syms_load(&dd);
+		dd.nr_jited_ksyms = *ksyms_len;
+		dd.jited_ksyms = (__u64 *) *ksyms_ptr;
+
 		if (json_output)
 			dump_xlated_json(&dd, buf, *member_len, opcodes);
 		else
@@ -566,10 +593,12 @@ static int do_dump(int argc, char **argv)
 	}
 
 	free(buf);
+	free(addrs);
 	return 0;
 
 err_free:
 	free(buf);
+	free(addrs);
 	return -1;
 }
 
diff --git a/tools/bpf/bpftool/xlated_dumper.c b/tools/bpf/bpftool/xlated_dumper.c
index 7a3173b76c16..fb065b55db6d 100644
--- a/tools/bpf/bpftool/xlated_dumper.c
+++ b/tools/bpf/bpftool/xlated_dumper.c
@@ -174,7 +174,11 @@ static const char *print_call_pcrel(struct dump_data *dd,
 				    unsigned long address,
 				    const struct bpf_insn *insn)
 {
-	if (sym)
+	if (!dd->nr_jited_ksyms)
+		/* Do not show address for interpreted programs */
+		snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
+			"%+d", insn->off);
+	else if (sym)
 		snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
 			 "%+d#%s", insn->off, sym->name);
 	else
@@ -203,6 +207,10 @@ static const char *print_call(void *private_data,
 	unsigned long address = dd->address_call_base + insn->imm;
 	struct kernel_sym *sym;
 
+	if (insn->src_reg == BPF_PSEUDO_CALL &&
+		(__u32) insn->imm < dd->nr_jited_ksyms)
+		address = dd->jited_ksyms[insn->imm];
+
 	sym = kernel_syms_search(dd, address);
 	if (insn->src_reg == BPF_PSEUDO_CALL)
 		return print_call_pcrel(dd, sym, address, insn);
diff --git a/tools/bpf/bpftool/xlated_dumper.h b/tools/bpf/bpftool/xlated_dumper.h
index b34affa7ef2d..eafbb49c8d0b 100644
--- a/tools/bpf/bpftool/xlated_dumper.h
+++ b/tools/bpf/bpftool/xlated_dumper.h
@@ -49,6 +49,8 @@ struct dump_data {
 	unsigned long address_call_base;
 	struct kernel_sym *sym_mapping;
 	__u32 sym_count;
+	__u64 *jited_ksyms;
+	__u32 nr_jited_ksyms;
 	char scratch_buff[SYM_MAX_NAME + 8];
 };
 
-- 
2.14.3

^ permalink raw reply related
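
The address resolution that bpftool performs after this patch can be sketched
as a single function. This is an illustration under assumptions:
struct sketch_insn and resolve_call() are hypothetical, and SKETCH_PSEUDO_CALL
stands in for BPF_PSEUDO_CALL.

```c
#include <stdint.h>

#define SKETCH_PSEUDO_CALL 1	/* stands in for BPF_PSEUDO_CALL */

/* Hypothetical, reduced view of a bpf call instruction. */
struct sketch_insn {
	uint8_t src_reg;
	int32_t imm;
};

/*
 * For a bpf-to-bpf call whose imm is a valid index, take the address
 * from the symbol list. Otherwise fall back to call_base + imm.
 */
static uint64_t resolve_call(const struct sketch_insn *insn,
			     uint64_t call_base,
			     const uint64_t *ksyms, uint32_t nr_ksyms)
{
	/* A negative imm becomes a huge unsigned value and fails the test. */
	if (insn->src_reg == SKETCH_PSEUDO_CALL &&
	    (uint32_t)insn->imm < nr_ksyms)
		return ksyms[insn->imm];

	return call_base + (uint64_t)(int64_t)insn->imm;
}
```

The fallback keeps helper calls working unchanged, since for those the imm
field still holds an offset from the call base.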

* [PATCH bpf v2 6/6] bpf: fix JITed dump for multi-function programs via syscall
From: Sandipan Das @ 2018-05-18 12:50 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev, linuxppc-dev, naveen.n.rao, mpe, jakub.kicinski
In-Reply-To: <20180518125039.6500-1-sandipan@linux.vnet.ibm.com>

Currently, for multi-function programs, we cannot get the JITed
instructions using the bpf system call's BPF_OBJ_GET_INFO_BY_FD
command. Because of this, userspace tools such as bpftool fail
to identify a multi-function program as being JITed or not.

With the JIT enabled and the test program running, this can be
verified as follows:

  # cat /proc/sys/net/core/bpf_jit_enable
  1

Before applying this patch:

  # bpftool prog list
  1: kprobe  name foo  tag b811aab41a39ad3d  gpl
          loaded_at 2018-05-16T11:43:38+0530  uid 0
          xlated 216B  not jited  memlock 65536B
  ...

  # bpftool prog dump jited id 1
  no instructions returned

After applying this patch:

  # bpftool prog list
  1: kprobe  name foo  tag b811aab41a39ad3d  gpl
          loaded_at 2018-05-16T12:13:01+0530  uid 0
          xlated 216B  jited 308B  memlock 65536B
  ...

  # bpftool prog dump jited id 1
     0:   nop
     4:   nop
     8:   mflr    r0
     c:   std     r0,16(r1)
    10:   stdu    r1,-112(r1)
    14:   std     r31,104(r1)
    18:   addi    r31,r1,48
    1c:   li      r3,10
  ...

Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
---
 kernel/bpf/syscall.c | 38 ++++++++++++++++++++++++++++++++------
 1 file changed, 32 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 54a72fafe57c..2430d159078c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1896,7 +1896,7 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 	struct bpf_prog_info info = {};
 	u32 info_len = attr->info.info_len;
 	char __user *uinsns;
-	u32 ulen;
+	u32 ulen, i;
 	int err;
 
 	err = check_uarg_tail_zero(uinfo, sizeof(info), info_len);
@@ -1922,7 +1922,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 	ulen = min_t(u32, info.nr_map_ids, ulen);
 	if (ulen) {
 		u32 __user *user_map_ids = u64_to_user_ptr(info.map_ids);
-		u32 i;
 
 		for (i = 0; i < ulen; i++)
 			if (put_user(prog->aux->used_maps[i]->id,
@@ -1970,13 +1969,41 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 	 * for offload.
 	 */
 	ulen = info.jited_prog_len;
-	info.jited_prog_len = prog->jited_len;
+	if (prog->aux->func_cnt) {
+		info.jited_prog_len = 0;
+		for (i = 0; i < prog->aux->func_cnt; i++)
+			info.jited_prog_len += prog->aux->func[i]->jited_len;
+	} else {
+		info.jited_prog_len = prog->jited_len;
+	}
+
 	if (info.jited_prog_len && ulen) {
 		if (bpf_dump_raw_ok()) {
 			uinsns = u64_to_user_ptr(info.jited_prog_insns);
 			ulen = min_t(u32, info.jited_prog_len, ulen);
-			if (copy_to_user(uinsns, prog->bpf_func, ulen))
-				return -EFAULT;
+
+			/* for multi-function programs, copy the JITed
+			 * instructions for all the functions
+			 */
+			if (prog->aux->func_cnt) {
+				u32 len, free;
+				u8 *img;
+
+				free = ulen;
+				for (i = 0; i < prog->aux->func_cnt; i++) {
+					len = prog->aux->func[i]->jited_len;
+					img = (u8 *) prog->aux->func[i]->bpf_func;
+					if (len > free)
+						break;
+					if (copy_to_user(uinsns, img, len))
+						return -EFAULT;
+					uinsns += len;
+					free -= len;
+				}
+			} else {
+				if (copy_to_user(uinsns, prog->bpf_func, ulen))
+					return -EFAULT;
+			}
 		} else {
 			info.jited_prog_insns = 0;
 		}
@@ -1987,7 +2014,6 @@ static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
 	if (info.nr_jited_ksyms && ulen) {
 		u64 __user *user_jited_ksyms = u64_to_user_ptr(info.jited_ksyms);
 		ulong ksym_addr;
-		u32 i;
 
 		/* copy the address of the kernel symbol corresponding to
 		 * each function
-- 
2.14.3

^ permalink raw reply related
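
The length accounting and the copy loop in the patch above can be modelled in plain user-space C. This is only a sketch under assumptions: struct func_image, total_jited_len() and copy_images() are hypothetical names, and memcpy() stands in for copy_to_user(). It shows that whole function images are copied in order and that copying stops at the first image that no longer fits in the remaining buffer space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one JITed function image. */
struct func_image {
	const unsigned char *img;
	size_t len;
};

/* Sum of the JITed lengths of all functions (models info.jited_prog_len). */
static size_t total_jited_len(const struct func_image *f, size_t cnt)
{
	size_t total = 0;

	for (size_t i = 0; i < cnt; i++)
		total += f[i].len;
	return total;
}

/*
 * Copy whole images back to back while they still fit in 'room' bytes.
 * Returns the number of bytes copied. memcpy() replaces copy_to_user().
 */
static size_t copy_images(unsigned char *dst, size_t room,
			  const struct func_image *f, size_t cnt)
{
	size_t copied = 0;

	assert(dst != NULL || room == 0);
	for (size_t i = 0; i < cnt; i++) {
		if (f[i].len > room)
			break;		/* never copy a partial function */
		memcpy(dst + copied, f[i].img, f[i].len);
		copied += f[i].len;
		room -= f[i].len;
	}
	return copied;
}
```

In this model a buffer smaller than the total length receives only the leading functions that fit completely, which matches the "if (len > free) break;" step in the patch.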

* Re: [PATCH 2/2] powerpc/ptrace: Fix setting 512B aligned breakpoints with PTRACE_SET_DEBUGREG
From: Michael Ellerman @ 2018-05-18 12:56 UTC (permalink / raw)
  To: Michael Neuling
  Cc: linuxppc-dev, Edjunior Barbosa Machado, Pedro Franco de Carvalho,
	Ulrich Weigand, mikey
In-Reply-To: <20180517053715.24011-2-mikey@neuling.org>

Michael Neuling <mikey@neuling.org> writes:
> In this change:
>   e2a800beac powerpc/hw_brk: Fix off by one error when validating DAWR region end
>
> We fixed setting the DAWR end point to its max value via
> PPC_PTRACE_SETHWDEBUG. Unfortunately we broke PTRACE_SET_DEBUGREG when
> setting a 512 byte aligned breakpoint.
>
> PTRACE_SET_DEBUGREG currently sets the length of the breakpoint to
> zero (memset() in hw_breakpoint_init()).  This worked with
> arch_validate_hwbkpt_settings() before the above patch was applied but
> is now broken if the breakpoint is 512 byte aligned.
>
> This sets the length of the breakpoint to 8 bytes when using
> PTRACE_SET_DEBUGREG.
>
> Signed-off-by: Michael Neuling <mikey@neuling.org>
> Cc: stable@vger.kernel.org # 3.10+

If this is "fixing" e2a800beac then I think v3.11 is right for the
stable tag?

$ git describe --contains --long e2a800beaca1
v3.11-rc1~94^2~4

cheers

^ permalink raw reply
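
Why a zero length breaks exactly on 512 byte aligned addresses can be sketched as follows. This assumes the validation compares the 512 byte region of the first and the last byte of the breakpoint range, which is a reading of the commit description and not a copy of the kernel code. fits_one_512b_region() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed shape of the check: the range [addr, addr + len - 1] must not
 * cross a 512 byte boundary. With len == 0 the "last byte" becomes
 * addr - 1, which lies in the previous region when addr is 512 byte
 * aligned.
 */
static bool fits_one_512b_region(uint64_t addr, uint64_t len)
{
	assert(len <= 512);
	return (addr >> 9) == ((addr + len - 1) >> 9);
}
```

With this model, a zero-length breakpoint at 0x1000 is rejected, the same breakpoint at 0x1008 is accepted, and a length of 8 is accepted at both addresses.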

* [PATCH v2] powerpc/lib: Adjust .balign inside string functions for PPC32
From: Christophe Leroy @ 2018-05-18 13:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev

commit 87a156fb18fe1 ("Align hot loops of some string functions")
degraded the performance of string functions by adding useless
nops.

A simple benchmark on an 8xx calling 100000x a memchr() that
matches the first byte runs in 41668 TB ticks before this patch
and in 35986 TB ticks after this patch. So this gives an
improvement of approx 14%.

Another benchmark doing the same with a memchr() matching the 128th
byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
after this patch, so regardless of the number of loop iterations,
removing those useless nops improves the test by 5683 TB ticks.

Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 v2: Define IFETCH_ALIGN_SHIFT for PPC32 and use IFETCH_ALIGN_BYTES
    for the alignment

 arch/powerpc/include/asm/cache.h | 3 +++
 arch/powerpc/lib/string.S        | 7 ++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/cache.h b/arch/powerpc/include/asm/cache.h
index c1d257aa4c2d..66298461b640 100644
--- a/arch/powerpc/include/asm/cache.h
+++ b/arch/powerpc/include/asm/cache.h
@@ -9,11 +9,14 @@
 #if defined(CONFIG_PPC_8xx) || defined(CONFIG_403GCX)
 #define L1_CACHE_SHIFT		4
 #define MAX_COPY_PREFETCH	1
+#define IFETCH_ALIGN_SHIFT	2
 #elif defined(CONFIG_PPC_E500MC)
 #define L1_CACHE_SHIFT		6
 #define MAX_COPY_PREFETCH	4
+#define IFETCH_ALIGN_SHIFT	3
 #elif defined(CONFIG_PPC32)
 #define MAX_COPY_PREFETCH	4
+#define IFETCH_ALIGN_SHIFT	3	/* 603 fetches 2 insn at a time */
 #if defined(CONFIG_PPC_47x)
 #define L1_CACHE_SHIFT		7
 #else
diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
index a787776822d8..0378def28d41 100644
--- a/arch/powerpc/lib/string.S
+++ b/arch/powerpc/lib/string.S
@@ -12,6 +12,7 @@
 #include <asm/errno.h>
 #include <asm/ppc_asm.h>
 #include <asm/export.h>
+#include <asm/cache.h>
 
 	.text
 	
@@ -23,7 +24,7 @@ _GLOBAL(strncpy)
 	mtctr	r5
 	addi	r6,r3,-1
 	addi	r4,r4,-1
-	.balign 16
+	.balign IFETCH_ALIGN_BYTES
 1:	lbzu	r0,1(r4)
 	cmpwi	0,r0,0
 	stbu	r0,1(r6)
@@ -43,7 +44,7 @@ _GLOBAL(strncmp)
 	mtctr	r5
 	addi	r5,r3,-1
 	addi	r4,r4,-1
-	.balign 16
+	.balign IFETCH_ALIGN_BYTES
 1:	lbzu	r3,1(r5)
 	cmpwi	1,r3,0
 	lbzu	r0,1(r4)
@@ -77,7 +78,7 @@ _GLOBAL(memchr)
 	beq-	2f
 	mtctr	r5
 	addi	r3,r3,-1
-	.balign 16
+	.balign IFETCH_ALIGN_BYTES
 1:	lbzu	r0,1(r3)
 	cmpw	0,r0,r4
 	bdnzf	2,1b
-- 
2.13.3

^ permalink raw reply related
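
The effect of the alignment value on padding can be sketched in C. The sketch assumes 4 byte PowerPC instructions and that the alignment in bytes is 1 << IFETCH_ALIGN_SHIFT. pad_nops() is a hypothetical helper that counts the nops an assembler would have to insert before the loop label.

```c
#include <assert.h>
#include <stdint.h>

/* Alignment in bytes for a given shift (assumed to be 1 << shift). */
#define IFETCH_ALIGN_BYTES(shift)	(1u << (shift))

/*
 * Number of 4 byte nops needed to move 'addr' up to the next multiple
 * of 'align_bytes'. 'addr' must be instruction aligned and
 * 'align_bytes' must be a power of two.
 */
static unsigned int pad_nops(uint32_t addr, unsigned int align_bytes)
{
	assert(align_bytes && (align_bytes & (align_bytes - 1)) == 0);
	assert((addr & 3) == 0);

	uint32_t aligned = (addr + align_bytes - 1) & ~(uint32_t)(align_bytes - 1);

	return (aligned - addr) / 4;
}
```

With a shift of 2 the alignment equals one instruction, so no nop is ever emitted. With a shift of 3 at most one nop is emitted, compared with up to three for the previous ".balign 16".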

* Re: [PATCH 2/2] powerpc: Enable ASYM_SMT on interleaved big-core systems
From: Michael Ellerman @ 2018-05-18 13:05 UTC (permalink / raw)
  To: Gautham R Shenoy, Michael Neuling
  Cc: Gautham R. Shenoy, Benjamin Herrenschmidt,
	Vaidyanathan Srinivasan, Akshay Adiga, Shilpasri G Bhat,
	Balbir Singh, Oliver O'Halloran, Nicholas Piggin,
	linuxppc-dev, linux-kernel
In-Reply-To: <20180516050508.GB14826@in.ibm.com>

Gautham R Shenoy <ego@linux.vnet.ibm.com> writes:

> On Mon, May 14, 2018 at 01:22:07PM +1000, Michael Neuling wrote:
>> On Fri, 2018-05-11 at 16:47 +0530, Gautham R. Shenoy wrote:
>> > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>> > 
>> > Each of the SMT4 cores forming a fused-core is a more or less
>> > independent unit. Thus when multiple tasks are scheduled to run on
>> > the fused core, we get the best performance when the tasks are spread
>> > across the pair of SMT4 cores.
>> > 
>> > Since the threads in the pair of SMT4 cores of an interleaved big-core
>> > are numbered {0,2,4,6} and {1,3,5,7} respectively, enable ASYM_SMT on
>> > such interleaved big-cores that will bias the load-balancing of tasks
>> > on smaller numbered threads, which will automatically result in
>> > spreading the tasks uniformly across the associated pair of SMT4
>> > cores.
>> > 
>> > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
>> > ---
>> >  arch/powerpc/kernel/smp.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > 
>> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
>> > index 9ca7148..0153f01 100644
>> > --- a/arch/powerpc/kernel/smp.c
>> > +++ b/arch/powerpc/kernel/smp.c
>> > @@ -1082,7 +1082,7 @@ static int powerpc_smt_flags(void)
>> >  {
>> >  	int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
>> >  
>> > -	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
>> > +	if (cpu_has_feature(CPU_FTR_ASYM_SMT) || has_interleaved_big_core) {
>> 
>> Shouldn't we just set CPU_FTR_ASYM_SMT and leave this code unchanged?
>
> Yes, that would have the same effect. I refrained from doing that
> since I thought CPU_FTR_ASYM_SMT has the "lower numbered threads
> expedite thread-folding" connotation from the POWER7 generation.

The above code is the only use of the feature, so I don't think we need
to worry about any other connotations.

> If it is ok to overload CPU_FTR_ASYM_SMT, we can do what you suggest
> and have all the changes in setup-common.c

Yeah let's do that.

cheers

^ permalink raw reply
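
The reasoning in the quoted commit message can be illustrated with a small model. It assumes the interleaved numbering described there, with even threads on one SMT4 core and odd threads on the other, and it assumes the scheduler bias simply fills the lowest-numbered threads first. small_core_of() and spread() are hypothetical names.

```c
#include <assert.h>

#define THREADS_PER_BIG_CORE	8

/* Interleaved numbering: {0,2,4,6} map to core 0, {1,3,5,7} to core 1. */
static int small_core_of(int thread)
{
	assert(thread >= 0 && thread < THREADS_PER_BIG_CORE);
	return thread % 2;
}

/*
 * Place 'ntasks' tasks on the lowest-numbered threads and count how
 * many tasks land on each SMT4 core.
 */
static void spread(int ntasks, int per_core[2])
{
	assert(ntasks >= 0 && ntasks <= THREADS_PER_BIG_CORE);
	per_core[0] = per_core[1] = 0;
	for (int t = 0; t < ntasks; t++)
		per_core[small_core_of(t)]++;
}
```

In this model the two cores never differ by more than one task, which is the uniform spreading the commit message expects from ASYM_SMT.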

* Re: [PATCH 1/2] powerpc: Detect the presence of big-core with interleaved threads
From: Michael Ellerman @ 2018-05-18 13:14 UTC (permalink / raw)
  To: Gautham R Shenoy, Michael Neuling
  Cc: Gautham R. Shenoy, Benjamin Herrenschmidt,
	Vaidyanathan Srinivasan, Akshay Adiga, Shilpasri G Bhat,
	Balbir Singh, Oliver O'Halloran, Nicholas Piggin,
	linuxppc-dev, linux-kernel
In-Reply-To: <20180516043516.GA14826@in.ibm.com>

Gautham R Shenoy <ego@linux.vnet.ibm.com> writes:
...
>> > @@ -565,7 +615,16 @@ void __init smp_setup_cpu_maps(void)
>> >  	vdso_data->processorCount = num_present_cpus();
>> >  #endif /* CONFIG_PPC64 */
>> >  
>> > -        /* Initialize CPU <=> thread mapping/
>> > +	dn = of_find_node_by_type(NULL, "cpu");
>> > +	if (dn) {
>> > +		if (check_for_interleaved_big_core(dn)) {
>> > +			has_interleaved_big_core = true;
>> > +			pr_info("Detected interleaved big-cores\n");
>> 
>> Is there a runtime way to check this also?  If the dmesg buffer overflows, we
>> lose this.
>
> Where do you suggest we put this? Should it be a part of
> /proc/cpuinfo?

Hmm, it'd be nice not to pollute it with more junk.

Can you just look at the pir files in sysfs?

eg. on a normal system:

  # cd /sys/devices/system/cpu
  # grep . cpu[0-7]/pir
  cpu0/pir:20
  cpu1/pir:21
  cpu2/pir:22
  cpu3/pir:23
  cpu4/pir:24
  cpu5/pir:25
  cpu6/pir:26
  cpu7/pir:27


cheers

^ permalink raw reply

* pkeys on POWER: Default AMR, UAMOR values
From: Florian Weimer @ 2018-05-18 13:17 UTC (permalink / raw)
  To: linuxppc-dev, linux-mm, Ram Pai, Dave Hansen, Andy Lutomirski

I'm working on adding POWER pkeys support to glibc.  The coding work is 
done, but I'm faced with some test suite failures.

Unlike the default x86 configuration, on POWER, existing threads have 
full access to newly allocated keys.

Or, more precisely, in this scenario:

* Thread A launches thread B
* Thread B waits
* Thread A allocates a protection key with pkey_alloc
* Thread A applies the key to a page
* Thread A signals thread B
* Thread B starts to run and accesses the page

Then at the end, the access will be granted.

I hope it's not too late to change this to denied access.

Furthermore, I think the UAMOR value is wrong as well because it
prevents thread B from setting the AMR register at the end.  In
particular, if I do this

* … (as before)
* Thread A signals thread B
* Thread B sets the access rights for the key to PKEY_DISABLE_ACCESS
* Thread B reads the current access rights for the key

then it still gets 0 (all access permitted) because the original UAMOR 
value inherited from thread A prior to the key allocation masks out the 
access right update for the newly allocated key.

Thanks,
Florian

^ permalink raw reply
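
The UAMOR effect described above can be modelled without hardware. This is a simplified sketch under assumptions: two AMR bits per key with key 0 in the most significant bits, and a user-mode AMR write that only changes bits enabled in UAMOR. All function names are hypothetical and the code does not call the real pkey_set() or pkey_get().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 2 bits per key, key 0 at the top of the register. */
static int key_shift(int pkey)
{
	assert(pkey >= 0 && pkey < 32);
	return 64 - (pkey + 1) * 2;
}

/* User-mode AMR write: only bits enabled in UAMOR take the new value. */
static uint64_t masked_amr_write(uint64_t amr, uint64_t uamor, uint64_t wanted)
{
	return (amr & ~uamor) | (wanted & uamor);
}

/* Model of setting the two access-right bits of one key. */
static uint64_t set_key_rights(uint64_t amr, uint64_t uamor, int pkey,
			       uint64_t rights)
{
	uint64_t mask = 3ULL << key_shift(pkey);
	uint64_t wanted = (amr & ~mask) | ((rights & 3ULL) << key_shift(pkey));

	return masked_amr_write(amr, uamor, wanted);
}

/* Model of reading the two access-right bits of one key. */
static uint64_t get_key_rights(uint64_t amr, int pkey)
{
	return (amr >> key_shift(pkey)) & 3ULL;
}
```

If thread B still carries a UAMOR value from before the key was allocated, the bits for that key are clear in UAMOR, so set_key_rights() leaves the AMR unchanged and the following read returns 0, as reported above.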
