* Re: [PATCH] EDAC, mpc85xx: Prevent building as a module
From: Borislav Petkov @ 2019-05-10 14:13 UTC (permalink / raw)
To: Michael Ellerman
Cc: Johannes Thumshirn, linux-kernel, linuxppc-dev, james.morse,
mchehab, linux-edac
In-Reply-To: <87bm0avb03.fsf@concordia.ellerman.id.au>
On Fri, May 10, 2019 at 08:50:52PM +1000, Michael Ellerman wrote:
> Yeah that looks better to me. I didn't think about the case where EDAC
> core is modular.
>
> Do you want me to send a new patch?
Nah, I'll fix it up.
Thx.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
^ permalink raw reply
* [PATCH] powerpc/imc: Add documentation for IMC and trace-mode
From: Anju T Sudhakar @ 2019-05-10 14:17 UTC (permalink / raw)
To: mpe; +Cc: maddy, linuxppc-dev, anju
Documentation for IMC(In-Memory Collection Counters) infrastructure
and trace-mode of IMC.
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
---
Documentation/powerpc/imc.txt | 195 ++++++++++++++++++++++++++++++++++
1 file changed, 195 insertions(+)
create mode 100644 Documentation/powerpc/imc.txt
diff --git a/Documentation/powerpc/imc.txt b/Documentation/powerpc/imc.txt
new file mode 100644
index 000000000000..9c32e059f3be
--- /dev/null
+++ b/Documentation/powerpc/imc.txt
@@ -0,0 +1,195 @@
+ ===================================
+ IMC (In-Memory Collection Counters)
+ ===================================
+ Date created: 10 May 2019
+
+Table of Contents:
+------------------
+ - Basic overview
+ - IMC example Usage
+ - IMC Trace Mode
+ - LDBAR Register Layout
+ - TRACE_IMC_SCOM bit representation
+ - Trace IMC example usage
+ - Benefits of using IMC trace-mode
+
+
+Basic overview
+==============
+
+IMC (In-Memory collection counters) is a hardware monitoring facility
+that collects large number of hardware performance events at Nest level
+(these are on-chip but off-core), Core level and Thread level.
+
+The Nest PMU counters are handled by a Nest IMC microcode which runs
+in the OCC (On-Chip Controller) complex. The microcode collects the
+counter data and moves the nest IMC counter data to memory.
+
+The Core and Thread IMC PMU counters are handled in the core. Core
+level PMU counters give us the IMC counters' data per core and thread
+level PMU counters give us the IMC counters' data per CPU thread.
+
+OPAL obtains the IMC PMU and supported events information from the
+IMC Catalog and passes on to the kernel via the device tree. The event's
+information contains :
+ - Event name
+ - Event Offset
+ - Event description
+and, maybe :
+ - Event scale
+ - Event unit
+
+Some PMUs may have a common scale and unit values for all their
+supported events. For those cases, the scale and unit properties for
+those events must be inherited from the PMU.
+
+The event offset in the memory is where the counter data gets
+accumulated.
+
+IMC catalog is available at:
+ https://github.com/open-power/ima-catalog
+
+The kernel discovers the IMC counters information in the device tree
+at the "imc-counters" device node which has a compatible field
+"ibm,opal-in-memory-counters". From the device tree, the kernel parses
+the PMUs and their event's information and register the PMU and it
+attributes in the kernel.
+
+IMC example usage
+=================
+
+# perf list
+
+ [...]
+ nest_mcs01/PM_MCS01_64B_RD_DISP_PORT01/ [Kernel PMU event]
+ nest_mcs01/PM_MCS01_64B_RD_DISP_PORT23/ [Kernel PMU event]
+
+ [...]
+ core_imc/CPM_0THRD_NON_IDLE_PCYC/ [Kernel PMU event]
+ core_imc/CPM_1THRD_NON_IDLE_INST/ [Kernel PMU event]
+
+ [...]
+ thread_imc/CPM_0THRD_NON_IDLE_PCYC/ [Kernel PMU event]
+ thread_imc/CPM_1THRD_NON_IDLE_INST/ [Kernel PMU event]
+
+To see per chip data for nest_mcs0/PM_MCS_DOWN_128B_DATA_XFER_MC0/ :
+ # ./perf stat -e "nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/" -a --per-socket
+
+To see non-idle instructions for core 0 :
+ # ./perf stat -e "core_imc/CPM_NON_IDLE_INST/" -C 0 -I 1000
+
+To see non-idle instructions for a "make" :
+ # ./perf stat -e "thread_imc/CPM_NON_IDLE_PCYC/" make
+
+
+IMC Trace-mode
+===============
+
+POWER9 support two modes for IMC which are the Accumulation mode and
+Trace mode. In Accumulation mode, event counts are accumulated in system
+Memory. Hypervisor then reads the posted counts periodically or when
+requested. In IMC Trace mode, the 64 bit trace scom value is initialized
+with the event information. The CPMC*SEL and CPMC_LOAD in the trace scom,
+specifies the event to be monitored and the sampling duration. On each
+overflow in the CPMC*SEL, hardware snapshots the program counter along
+with event counts and writes into memory pointed by LDBAR.
+
+LDBAR is a 64 bit special purpose per thread register, it has bits to
+indicate whether hardware is configured for accumulation or trace mode.
+
+* LDBAR Register Layout:
+ 0 : Enable/Disable
+ 1 : 0 -> Accumulation Mode
+ 1 -> Trace Mode
+ 2:3 : Reserved
+ 4-6 : PB scope
+ 7 : Reserved
+ 8:50 : Counter Address
+ 51:63 : Reserved
+
+* TRACE_IMC_SCOM bit representation:
+
+ 0:1 : SAMPSEL
+ 2:33 : CPMC_LOAD
+ 34:40 : CPMC1SEL
+ 41:47 : CPMC2SEL
+ 48:50 : BUFFERSIZE
+ 51:63 : RESERVED
+
+CPMC_LOAD contains the sampling duration. SAMPSEL and CPMC*SEL determines
+the event to count. BUFFRSIZE indicates the memory range. On each overflow,
+hardware snapshots program counter along with event counts and update the
+memory and reloads the CMPC_LOAD value for the next sampling duration.
+IMC hardware does not support exceptions, so it quietly wraps around if
+memory buffer reaches the end.
+
+*Currently the event monitored for trace-mode is fixed as cycle.
+
+Trace IMC example usage
+=======================
+
+# perf list
+
+ [....]
+ trace_imc/trace_cycles/ [Kernel PMU event]
+
+To record an application/process with trace-imc event
+# perf record -e trace_imc/trace_cycles/ yes > /dev/nul
+[ perf record: Woken up 1 times to write data ]
+[ perf record: Captured and wrote 0.012 MB perf.data (21 samples) ]
+
+The perf.data generated, can be read using perf report.
+
+Benefits of using IMC trace-mode
+================================
+
+PMI interrupt handling is avoided, since IMC trace mode snapshots the
+program counter and update to the memory. And this also provide a way for
+the operating system to do instruction sampling in real time without
+PMI(Performance Monitoring Interrupts) processing overhead.
+Example:-
+
+Performance data using 'perf top' with and without trace-imc event:
+
+PMI interrupts count when `perf top` command is executed without trace-imc event.
+
+# cat /proc/interrupts (a snippet from the output)
+9944 1072 804 804 1644 804 1306
+804 804 804 804 804 804 804
+804 804 1961 1602 804 804 1258
+[-----------------------------------------------------------------]
+803 803 803 803 803 803 803
+803 803 803 803 804 804 804
+804 804 804 804 804 804 803
+803 803 803 803 803 1306 803
+803 Performance monitoring interrupts
+
+
+`perf top` with trace-imc (executed right after 'perf top' without trace-imc event):
+
+# perf top -e trace_imc/trace_cycles/
+12.50% [kernel] [k] arch_cpu_idle
+11.81% [kernel] [k] __next_timer_interrupt
+11.22% [kernel] [k] rcu_idle_enter
+10.25% [kernel] [k] find_next_bit
+ 7.91% [kernel] [k] do_idle
+ 7.69% [kernel] [k] rcu_dynticks_eqs_exit
+ 5.20% [kernel] [k] tick_nohz_idle_stop_tick
+ [-----------------------]
+
+# cat /proc/interrupts (a snippet from the output)
+
+9944 1072 804 804 1644 804 1306
+804 804 804 804 804 804 804
+804 804 1961 1602 804 804 1258
+[-----------------------------------------------------------------]
+803 803 803 803 803 803 803
+803 803 803 804 804 804 804
+804 804 804 804 804 804 803
+803 803 803 803 803 1306 803
+803 Performance monitoring interrupts
+
+The PMI interrupts count remains the same.
+
+----------------------------------------------------------
+Author(s) : Anju T Sudhakar <anju@linux.vnet.ibm.com>
--
2.17.2
^ permalink raw reply related
* Re: [PATCH 03/16] lib,treewide: add new match_string() helper/macro
From: andriy.shevchenko @ 2019-05-10 14:34 UTC (permalink / raw)
To: Ardelean, Alexandru
Cc: linux-wireless@vger.kernel.org, linux-fbdev@vger.kernel.org,
kvm@vger.kernel.org, linux-pci@vger.kernel.org,
alsa-devel@alsa-project.org, dri-devel@lists.freedesktop.org,
linux-mm@kvack.org, linux-mtd@lists.infradead.org,
linux-clk@vger.kernel.org, devel@driverdev.osuosl.org,
linux-rockchip@lists.infradead.org, linux-pm@vger.kernel.org,
intel-gfx@lists.freedesktop.org, linux-gpio@vger.kernel.org,
linux-rpi-kernel@lists.infradead.org, linux-tegra@vger.kernel.org,
cgroups@vger.kernel.org, linux-omap@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-ide@vger.kernel.org,
gregkh@linuxfoundation.org, linux-usb@vger.kernel.org,
linux-mmc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-security-module@vger.kernel.org, netdev@vger.kernel.org,
linux-integrity@vger.kernel.org, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <4df165bc4247e60aa4952fd55cb0c77e60712767.camel@analog.com>
On Fri, May 10, 2019 at 09:15:27AM +0000, Ardelean, Alexandru wrote:
> On Wed, 2019-05-08 at 16:22 +0300, Alexandru Ardelean wrote:
> > On Wed, 2019-05-08 at 15:18 +0200, Greg KH wrote:
> > > On Wed, May 08, 2019 at 04:11:28PM +0300, Andy Shevchenko wrote:
> > > > On Wed, May 08, 2019 at 02:28:29PM +0300, Alexandru Ardelean wrote:
> > > > Can you split include/linux/ change from the rest?
> > >
> > > That would break the build, why do you want it split out? This makes
> > > sense all as a single patch to me.
> > >
> >
> > Not really.
> > It would be just be the new match_string() helper/macro in a new commit.
> > And the conversions of the simple users of match_string() (the ones using
> > ARRAY_SIZE()) in another commit.
> >
>
> I should have asked in my previous reply.
> Leave this as-is or re-formulate in 2 patches ?
Depends on on what you would like to spend your time: collecting Acks for all
pieces in treewide patch or send new API first followed up by per driver /
module update in next cycle.
I also have no strong preference.
And I think it's good to add Heikki Krogerus to Cc list for both patch series,
since he is the author of sysfs variant and may have something to comment on
the rest.
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply
* Re: [PATCH] vsprintf: Do not break early boot with probing addresses
From: Petr Mladek @ 2019-05-10 14:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-arch, Sergey Senozhatsky, Heiko Carstens, linux-s390,
Rasmus Villemoes, linux-kernel, Steven Rostedt, Michal Hocko,
Sergey Senozhatsky, Stephen Rothwell, Andy Shevchenko,
linuxppc-dev, Martin Schwidefsky, Tobin C . Harding
In-Reply-To: <20190510084213.22149-1-pmladek@suse.com>
On Fri 2019-05-10 10:42:13, Petr Mladek wrote:
> The commit 3e5903eb9cff70730 ("vsprintf: Prevent crash when dereferencing
> invalid pointers") broke boot on several architectures. The common
> pattern is that probe_kernel_read() is not working during early
> boot because userspace access framework is not ready.
>
> It is a generic problem. We have to avoid any complex external
> functions in vsprintf() code, especially in the common path.
> They might break printk() easily and are hard to debug.
>
> Replace probe_kernel_read() with some simple checks for obvious
> problems.
JFYI, I have sent a pull request with this patch, see
https://lkml.kernel.org/r/20190510144718.riyy72g4cy5nkggx@pathway.suse.cz
Best Regards,
Petr
^ permalink raw reply
* [PATCH v11 0/7] ptrace: add PTRACE_GET_SYSCALL_INFO request
From: Dmitry V. Levin @ 2019-05-10 15:26 UTC (permalink / raw)
To: Andrew Morton
Cc: Oleg Nesterov, Paul Mackerras, linux-kselftest, Vincent Chen,
Shuah Khan, Helge Deller, Eugene Syromyatnikov, Elvira Khabirova,
James Hogan, James E.J. Bottomley, strace-devel, Kees Cook,
Greentime Hu, linux-kernel, Andy Lutomirski, linux-parisc,
linux-api, linux-mips, Ralf Baechle, Richard Kuo, Paul Burton,
linux-hexagon, linuxppc-dev
[Andrew, could you take this patchset into your tree, please?
Besides the patch for hexagon, all patches in this series have
Acked-by or Reviewed-by tags already.
I have been waiting and pinging the hexagon maintainer since November
without any visible effect. The last Acked-by from the hexagon maintainer
in linux.git was in October and the last Signed-off-by was in July. Since
that time not a single change affecting hexagon was able to attract
attention of the hexagon maintainer, so I don't think it's worth waiting
any longer.]
PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
details of the syscall the tracee is blocked in.
There are two reasons for a special syscall-related ptrace request.
Firstly, with the current ptrace API there are cases when ptracer cannot
retrieve necessary information about syscalls. Some examples include:
* The notorious int-0x80-from-64-bit-task issue. See [1] for details.
In short, if a 64-bit task performs a syscall through int 0x80, its tracer
has no reliable means to find out that the syscall was, in fact,
a compat syscall, and misidentifies it.
* Syscall-enter-stop and syscall-exit-stop look the same for the tracer.
Common practice is to keep track of the sequence of ptrace-stops in order
not to mix the two syscall-stops up. But it is not as simple as it looks;
for example, strace had a (just recently fixed) long-standing bug where
attaching strace to a tracee that is performing the execve system call
led to the tracer identifying the following syscall-exit-stop as
syscall-enter-stop, which messed up all the state tracking.
* Since the introduction of commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3
("ptrace: Don't allow accessing an undumpable mm"), both PTRACE_PEEKDATA
and process_vm_readv become unavailable when the process dumpable flag
is cleared. On such architectures as ia64 this results in all syscall
arguments being unavailable for the tracer.
Secondly, ptracers also have to support a lot of arch-specific code for
obtaining information about the tracee. For some architectures, this
requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
argument and return value.
PTRACE_GET_SYSCALL_INFO returns the following structure:
struct ptrace_syscall_info {
__u8 op; /* PTRACE_SYSCALL_INFO_* */
__u32 arch __attribute__((__aligned__(sizeof(__u32))));
__u64 instruction_pointer;
__u64 stack_pointer;
union {
struct {
__u64 nr;
__u64 args[6];
} entry;
struct {
__s64 rval;
__u8 is_error;
} exit;
struct {
__u64 nr;
__u64 args[6];
__u32 ret_data;
} seccomp;
};
};
The structure was chosen according to [2], except for the following
changes:
* seccomp substructure was added as a superset of entry substructure;
* the type of nr field was changed from int to __u64 because syscall
numbers are, as a practical matter, 64 bits;
* stack_pointer field was added along with instruction_pointer field
since it is readily available and can save the tracer from extra
PTRACE_GETREGS/PTRACE_GETREGSET calls;
* arch is always initialized to aid with tracing system calls
* such as execve();
* instruction_pointer and stack_pointer are always initialized
so they could be easily obtained for non-syscall stops;
* a boolean is_error field was added along with rval field, this way
the tracer can more reliably distinguish a return value
from an error value.
strace has been ported to PTRACE_GET_SYSCALL_INFO.
Starting with release 4.26, strace uses PTRACE_GET_SYSCALL_INFO API
as the preferred mechanism of obtaining syscall information.
[1] https://lore.kernel.org/lkml/CA+55aFzcSVmdDj9Lh_gdbz1OzHyEm6ZrGPBDAJnywm2LF_eVyg@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAObL_7GM0n80N7J_DFw_eQyfLyzq+sf4y2AvsCCV88Tb3AwEHA@mail.gmail.com/
---
Notes:
v11:
* Added more Acked-by.
* Rebased back to mainline as the prerequisite syscall_get_arch patchset
has already been merged via audit tree.
v10:
* Added more Acked-by.
v9:
* Rebased to linux-next again due to syscall_get_arguments() signature change.
v8:
* Moved syscall_get_arch() specific patches to a separate patchset
which is now merged into audit/next tree.
* Rebased to linux-next.
* Moved ptrace_get_syscall_info code under #ifdef CONFIG_HAVE_ARCH_TRACEHOOK,
narrowing down the set of architectures supported by this implementation
back to those 19 that enable CONFIG_HAVE_ARCH_TRACEHOOK because
I failed to get all syscall_get_*(), instruction_pointer(),
and user_stack_pointer() functions implemented on some niche
architectures. This leaves the following architectures out:
alpha, h8300, m68k, microblaze, and unicore32.
v7:
* Rebased to v5.0-rc1.
* 5 arch-specific preparatory patches out of 25 have been merged
into v5.0-rc1 via arch trees.
v6:
* Added syscall_get_arguments and syscall_set_arguments wrappers
to asm-generic/syscall.h, requested by Geert.
* Changed PTRACE_GET_SYSCALL_INFO return code: do not take trailing paddings
into account, use the end of the last field of the structure being written.
* Changed struct ptrace_syscall_info:
* remove .frame_pointer field, is is not needed and not portable;
* make .arch field explicitly aligned, remove no longer needed
padding before .arch field;
* remove trailing pads, they are no longer needed.
v5:
* Merged separate series and patches into the single series.
* Changed PTRACE_EVENTMSG_SYSCALL_{ENTRY,EXIT} values as requested by Oleg.
* Changed struct ptrace_syscall_info: generalized instruction_pointer,
stack_pointer, and frame_pointer fields by moving them from
ptrace_syscall_info.{entry,seccomp} substructures to ptrace_syscall_info
and initializing them for all stops.
* Added PTRACE_SYSCALL_INFO_NONE, set it when not in a syscall stop,
so e.g. "strace -i" could use PTRACE_SYSCALL_INFO_SECCOMP to obtain
instruction_pointer when the tracee is in a signal stop.
* Patched all remaining architectures to provide all necessary
syscall_get_* functions.
* Made available for all architectures: do not conditionalize on
CONFIG_HAVE_ARCH_TRACEHOOK since all syscall_get_* functions
are implemented on all architectures.
* Added a test for PTRACE_GET_SYSCALL_INFO to selftests/ptrace.
v4:
* Revisited PTRACE_EVENT_SECCOMP support:
do not introduce task_struct.ptrace_event, use child->last_siginfo->si_code instead.
* Implemented PTRACE_SYSCALL_INFO_SECCOMP and ptrace_syscall_info.seccomp
support along with PTRACE_SYSCALL_INFO_{ENTRY,EXIT} and
ptrace_syscall_info.{entry,exit}.
v3:
* Changed struct ptrace_syscall_info.
* Added PTRACE_EVENT_SECCOMP support by adding ptrace_event to task_struct.
* Added proper defines for ptrace_syscall_info.op values.
* Renamed PT_SYSCALL_IS_ENTERING and PT_SYSCALL_IS_EXITING to
PTRACE_EVENTMSG_SYSCALL_ENTRY and PTRACE_EVENTMSG_SYSCALL_EXIT
and moved them to uapi.
v2:
* Stopped using task->ptrace.
* Replaced entry_info.is_compat with entry_info.arch, used syscall_get_arch().
* Used addr argument of sys_ptrace to get expected size of the struct;
return full size of the struct.
Dmitry V. Levin (6):
nds32: fix asm/syscall.h # acked
hexagon: define syscall_get_error() and syscall_get_return_value() # waiting for ack since November
mips: define syscall_get_error() # acked
parisc: define syscall_get_error() # acked
powerpc: define syscall_get_error() # acked
selftests/ptrace: add a test case for PTRACE_GET_SYSCALL_INFO # acked
Elvira Khabirova (1):
ptrace: add PTRACE_GET_SYSCALL_INFO request # reviewed
arch/hexagon/include/asm/syscall.h | 14 +
arch/mips/include/asm/syscall.h | 6 +
arch/nds32/include/asm/syscall.h | 27 +-
arch/parisc/include/asm/syscall.h | 7 +
arch/powerpc/include/asm/syscall.h | 10 +
include/linux/tracehook.h | 9 +-
include/uapi/linux/ptrace.h | 35 +++
kernel/ptrace.c | 101 ++++++-
tools/testing/selftests/ptrace/.gitignore | 1 +
tools/testing/selftests/ptrace/Makefile | 2 +-
.../selftests/ptrace/get_syscall_info.c | 271 ++++++++++++++++++
11 files changed, 468 insertions(+), 15 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/get_syscall_info.c
--
ldv
^ permalink raw reply
* [PATCH v11 5/7] powerpc: define syscall_get_error()
From: Dmitry V. Levin @ 2019-05-10 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Eugene Syromyatnikov, Oleg Nesterov, Elvira Khabirova,
Paul Mackerras, Andy Lutomirski, linuxppc-dev, linux-kernel
In-Reply-To: <20190510152640.GA28529@altlinux.org>
syscall_get_error() is required to be implemented on this
architecture in addition to already implemented syscall_get_nr(),
syscall_get_arguments(), syscall_get_return_value(), and
syscall_get_arch() functions in order to extend the generic
ptrace API with PTRACE_GET_SYSCALL_INFO request.
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Elvira Khabirova <lineprinter@altlinux.org>
Cc: Eugene Syromyatnikov <esyr@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
---
Notes:
v11: added Acked-by from https://lore.kernel.org/lkml/87woj3wwmf.fsf@concordia.ellerman.id.au/
v10: unchanged
v9: unchanged
v8: unchanged
v7: unchanged
v6: unchanged
v5: initial revision
This change has been tested with
tools/testing/selftests/ptrace/get_syscall_info.c and strace,
so it's correct from PTRACE_GET_SYSCALL_INFO point of view.
This cast doubts on commit v4.3-rc1~86^2~81 that changed
syscall_set_return_value() in a way that doesn't quite match
syscall_get_error(), but syscall_set_return_value() is out
of scope of this series, so I'll just let you know my concerns.
See also https://lore.kernel.org/lkml/874lbbt3k6.fsf@concordia.ellerman.id.au/
and https://lore.kernel.org/lkml/87woj3wwmf.fsf@concordia.ellerman.id.au/
for more details on powerpc syscall_set_return_value() confusion.
arch/powerpc/include/asm/syscall.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
index a048fed0722f..bd9663137d57 100644
--- a/arch/powerpc/include/asm/syscall.h
+++ b/arch/powerpc/include/asm/syscall.h
@@ -38,6 +38,16 @@ static inline void syscall_rollback(struct task_struct *task,
regs->gpr[3] = regs->orig_gpr3;
}
+static inline long syscall_get_error(struct task_struct *task,
+ struct pt_regs *regs)
+{
+ /*
+ * If the system call failed,
+ * regs->gpr[3] contains a positive ERRORCODE.
+ */
+ return (regs->ccr & 0x10000000UL) ? -regs->gpr[3] : 0;
+}
+
static inline long syscall_get_return_value(struct task_struct *task,
struct pt_regs *regs)
{
--
ldv
^ permalink raw reply related
* Re: [PATCH] vsprintf: Do not break early boot with probing addresses
From: Steven Rostedt @ 2019-05-10 16:24 UTC (permalink / raw)
To: Petr Mladek
Cc: linux-arch, Sergey Senozhatsky, Heiko Carstens, linux-s390,
linuxppc-dev, Rasmus Villemoes, linux-kernel, Michal Hocko,
Sergey Senozhatsky, Stephen Rothwell, Andy Shevchenko,
Linus Torvalds, Martin Schwidefsky, Tobin C . Harding
In-Reply-To: <20190510084213.22149-1-pmladek@suse.com>
On Fri, 10 May 2019 10:42:13 +0200
Petr Mladek <pmladek@suse.com> wrote:
> static const char *check_pointer_msg(const void *ptr)
> {
> - char byte;
> -
> if (!ptr)
> return "(null)";
>
> - if (probe_kernel_address(ptr, byte))
> + if ((unsigned long)ptr < PAGE_SIZE || IS_ERR_VALUE(ptr))
> return "(efault)";
>
< PAGE_SIZE ?
do you mean: < TASK_SIZE ?
-- Steve
^ permalink raw reply
* [PATCH 0/8] Provide vcpu dispatch statistics
From: Naveen N. Rao @ 2019-05-10 16:26 UTC (permalink / raw)
To: Michael Ellerman
Cc: Nathan Lynch, Aneesh Kumar K.V, Mingming Cao, linuxppc-dev
This series adds a new procfs file /proc/powerpc/vcpudispatch_stats for
providing statistics around how the LPAR processors are dispatched by
the POWER Hypervisor, in a shared LPAR environment. Patch 6/8 has more
details on how the statistics are gathered.
An example output:
$ sudo cat /proc/powerpc/vcpudispatch_stats
cpu0 6839 4126 2683 30 0 6821 18 0
cpu1 2515 1274 1229 12 0 2509 6 0
cpu2 2317 1198 1109 10 0 2312 5 0
cpu3 2259 1165 1088 6 0 2256 3 0
cpu4 2205 1143 1056 6 0 2202 3 0
cpu5 2165 1121 1038 6 0 2162 3 0
cpu6 2183 1127 1050 6 0 2180 3 0
cpu7 2193 1133 1052 8 0 2187 6 0
cpu8 2165 1115 1032 18 0 2156 9 0
cpu9 2301 1252 1033 16 0 2293 8 0
cpu10 2197 1138 1041 18 0 2187 10 0
cpu11 2273 1185 1062 26 0 2260 13 0
cpu12 2186 1125 1043 18 0 2177 9 0
cpu13 2161 1115 1030 16 0 2153 8 0
cpu14 2206 1153 1033 20 0 2196 10 0
cpu15 2163 1115 1032 16 0 2155 8 0
In the output above, for vcpu0, there have been 6839 dispatches since
statistics were enabled. 4126 of those dispatches were on the same
physical cpu as the last time. 2683 were on a different core, but within
the same chip, while 30 dispatches were on a different chip compared to
its last dispatch.
Also, out of the total of 6839 dispatches, we see that there have been
6821 dispatches on the vcpu's home node, while 18 dispatches were
outside its home node, on a neighbouring chip.
Changes since RFC:
- Patches 1/8 to 5/8: no changes, except rebase to powerpc/merge
- Patch 6/8: The mutex guarding the vphn hcall has been dropped. It was
only meant to serialize hcalls issued when stats are initially
enabled. However, in reality, the various per-cpu workers will be
scheduled at slightly different times and chances of hcalls for
retrieving the same associativity information at the same time is very
less. Even in that case, there are no other side effects.
- Patch 6/8: The third column for vcpu dispatches on the same core, but
different thread has been dropped and merged with the second column.
- Patch 7/8: new patch to ensure we don't take too much time while
enabling/disabling statistics on large systems with heavy workload.
- Patch 8/8: new patch adding a document describing the fields in the
procfs file.
- Naveen
Naveen N. Rao (8):
powerpc/pseries: Use macros for referring to the DTL enable mask
powerpc/pseries: Do not save the previous DTL mask value
powerpc/pseries: Factor out DTL buffer allocation and registration
routines
powerpc/pseries: Generalize hcall_vphn()
powerpc/pseries: Introduce helpers to gatekeep DTLB usage
powerpc/pseries: Provide vcpu dispatch statistics
powerpc/pseries: Protect against hogging the cpu while setting up the
stats
powerpc/pseries: Add documentation for vcpudispatch_stats
Documentation/powerpc/vcpudispatch_stats.txt | 68 +++
arch/powerpc/include/asm/lppaca.h | 11 +
arch/powerpc/include/asm/plpar_wrappers.h | 4 +
arch/powerpc/include/asm/topology.h | 4 +
arch/powerpc/mm/book3s64/vphn.h | 8 +
arch/powerpc/mm/numa.c | 134 ++++-
arch/powerpc/platforms/pseries/dtl.c | 22 +-
arch/powerpc/platforms/pseries/lpar.c | 550 ++++++++++++++++++-
arch/powerpc/platforms/pseries/setup.c | 34 +-
9 files changed, 760 insertions(+), 75 deletions(-)
create mode 100644 Documentation/powerpc/vcpudispatch_stats.txt
--
2.21.0
^ permalink raw reply
* [PATCH 1/8] powerpc/pseries: Use macros for referring to the DTL enable mask
From: Naveen N. Rao @ 2019-05-10 16:26 UTC (permalink / raw)
To: Michael Ellerman
Cc: Nathan Lynch, Aneesh Kumar K.V, Mingming Cao, linuxppc-dev
In-Reply-To: <cover.1557502887.git.naveen.n.rao@linux.vnet.ibm.com>
Introduce macros to encode the DTL enable mask fields and use those
instead of hardcoding numbers.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/lppaca.h | 11 +++++++++++
arch/powerpc/platforms/pseries/dtl.c | 8 +-------
arch/powerpc/platforms/pseries/lpar.c | 2 +-
arch/powerpc/platforms/pseries/setup.c | 2 +-
4 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index 7c23ce8a5a4c..2c7e31187726 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -154,6 +154,17 @@ struct dtl_entry {
#define DISPATCH_LOG_BYTES 4096 /* bytes per cpu */
#define N_DISPATCH_LOG (DISPATCH_LOG_BYTES / sizeof(struct dtl_entry))
+/*
+ * Dispatch trace log event enable mask:
+ * 0x1: voluntary virtual processor waits
+ * 0x2: time-slice preempts
+ * 0x4: virtual partition memory page faults
+ */
+#define DTL_LOG_CEDE 0x1
+#define DTL_LOG_PREEMPT 0x2
+#define DTL_LOG_FAULT 0x4
+#define DTL_LOG_ALL (DTL_LOG_CEDE | DTL_LOG_PREEMPT | DTL_LOG_FAULT)
+
extern struct kmem_cache *dtl_cache;
/*
diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index ef6595153642..051ea2de1e1a 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -40,13 +40,7 @@ struct dtl {
};
static DEFINE_PER_CPU(struct dtl, cpu_dtl);
-/*
- * Dispatch trace log event mask:
- * 0x7: 0x1: voluntary virtual processor waits
- * 0x2: time-slice preempts
- * 0x4: virtual partition memory page faults
- */
-static u8 dtl_event_mask = 0x7;
+static u8 dtl_event_mask = DTL_LOG_ALL;
/*
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 1034ef1fe2b4..23f2ac6793b7 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -126,7 +126,7 @@ void vpa_init(int cpu)
pr_err("WARNING: DTL registration of cpu %d (hw %d) "
"failed with %ld\n", smp_processor_id(),
hwcpu, ret);
- lppaca_of(cpu).dtl_enable_mask = 2;
+ lppaca_of(cpu).dtl_enable_mask = DTL_LOG_PREEMPT;
}
}
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index e4f0dfd4ae33..fabaefff8399 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -316,7 +316,7 @@ static int alloc_dispatch_logs(void)
pr_err("WARNING: DTL registration of cpu %d (hw %d) failed "
"with %d\n", smp_processor_id(),
hard_smp_processor_id(), ret);
- get_paca()->lppaca_ptr->dtl_enable_mask = 2;
+ get_paca()->lppaca_ptr->dtl_enable_mask = DTL_LOG_PREEMPT;
return 0;
}
--
2.21.0
^ permalink raw reply related
* [PATCH 3/8] powerpc/pseries: Factor out DTL buffer allocation and registration routines
From: Naveen N. Rao @ 2019-05-10 16:26 UTC (permalink / raw)
To: Michael Ellerman
Cc: Nathan Lynch, Aneesh Kumar K.V, Mingming Cao, linuxppc-dev
In-Reply-To: <cover.1557502887.git.naveen.n.rao@linux.vnet.ibm.com>
Introduce new helpers for DTL buffer allocation and registration and
have the existing code use those.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/plpar_wrappers.h | 2 +
arch/powerpc/platforms/pseries/lpar.c | 66 ++++++++++++++++-------
arch/powerpc/platforms/pseries/setup.c | 34 +-----------
3 files changed, 52 insertions(+), 50 deletions(-)
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
index cff5a411e595..d08feb1bc2bd 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -88,6 +88,8 @@ static inline long register_dtl(unsigned long cpu, unsigned long vpa)
return vpa_call(H_VPA_REG_DTL, cpu, vpa);
}
+extern void register_dtl_buffer(int cpu);
+extern void alloc_dtl_buffers(void);
extern void vpa_init(int cpu);
static inline long plpar_pte_enter(unsigned long flags,
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 23f2ac6793b7..3375ca8cefb5 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -65,13 +65,58 @@ EXPORT_SYMBOL(plpar_hcall);
EXPORT_SYMBOL(plpar_hcall9);
EXPORT_SYMBOL(plpar_hcall_norets);
+void alloc_dtl_buffers(void)
+{
+ int cpu;
+ struct paca_struct *pp;
+ struct dtl_entry *dtl;
+
+ for_each_possible_cpu(cpu) {
+ pp = paca_ptrs[cpu];
+ dtl = kmem_cache_alloc(dtl_cache, GFP_KERNEL);
+ if (!dtl) {
+ pr_warn("Failed to allocate dispatch trace log for cpu %d\n",
+ cpu);
+ pr_warn("Stolen time statistics will be unreliable\n");
+ break;
+ }
+
+ pp->dtl_ridx = 0;
+ pp->dispatch_log = dtl;
+ pp->dispatch_log_end = dtl + N_DISPATCH_LOG;
+ pp->dtl_curr = dtl;
+ }
+}
+
+void register_dtl_buffer(int cpu)
+{
+ long ret;
+ struct paca_struct *pp;
+ struct dtl_entry *dtl;
+ int hwcpu = get_hard_smp_processor_id(cpu);
+
+ pp = paca_ptrs[cpu];
+ dtl = pp->dispatch_log;
+ if (dtl) {
+ pp->dtl_ridx = 0;
+ pp->dtl_curr = dtl;
+ lppaca_of(cpu).dtl_idx = 0;
+
+ /* hypervisor reads buffer length from this field */
+ dtl->enqueue_to_dispatch_time = cpu_to_be32(DISPATCH_LOG_BYTES);
+ ret = register_dtl(hwcpu, __pa(dtl));
+ if (ret)
+ pr_err("WARNING: DTL registration of cpu %d (hw %d) "
+ "failed with %ld\n", cpu, hwcpu, ret);
+ lppaca_of(cpu).dtl_enable_mask = DTL_LOG_PREEMPT;
+ }
+}
+
void vpa_init(int cpu)
{
int hwcpu = get_hard_smp_processor_id(cpu);
unsigned long addr;
long ret;
- struct paca_struct *pp;
- struct dtl_entry *dtl;
/*
* The spec says it "may be problematic" if CPU x registers the VPA of
@@ -112,22 +157,7 @@ void vpa_init(int cpu)
/*
* Register dispatch trace log, if one has been allocated.
*/
- pp = paca_ptrs[cpu];
- dtl = pp->dispatch_log;
- if (dtl) {
- pp->dtl_ridx = 0;
- pp->dtl_curr = dtl;
- lppaca_of(cpu).dtl_idx = 0;
-
- /* hypervisor reads buffer length from this field */
- dtl->enqueue_to_dispatch_time = cpu_to_be32(DISPATCH_LOG_BYTES);
- ret = register_dtl(hwcpu, __pa(dtl));
- if (ret)
- pr_err("WARNING: DTL registration of cpu %d (hw %d) "
- "failed with %ld\n", smp_processor_id(),
- hwcpu, ret);
- lppaca_of(cpu).dtl_enable_mask = DTL_LOG_PREEMPT;
- }
+ register_dtl_buffer(cpu);
}
#ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index fabaefff8399..b6995e5cc5c9 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -277,46 +277,16 @@ struct kmem_cache *dtl_cache;
*/
static int alloc_dispatch_logs(void)
{
- int cpu, ret;
- struct paca_struct *pp;
- struct dtl_entry *dtl;
-
if (!firmware_has_feature(FW_FEATURE_SPLPAR))
return 0;
if (!dtl_cache)
return 0;
- for_each_possible_cpu(cpu) {
- pp = paca_ptrs[cpu];
- dtl = kmem_cache_alloc(dtl_cache, GFP_KERNEL);
- if (!dtl) {
- pr_warn("Failed to allocate dispatch trace log for cpu %d\n",
- cpu);
- pr_warn("Stolen time statistics will be unreliable\n");
- break;
- }
-
- pp->dtl_ridx = 0;
- pp->dispatch_log = dtl;
- pp->dispatch_log_end = dtl + N_DISPATCH_LOG;
- pp->dtl_curr = dtl;
- }
+ alloc_dtl_buffers();
/* Register the DTL for the current (boot) cpu */
- dtl = get_paca()->dispatch_log;
- get_paca()->dtl_ridx = 0;
- get_paca()->dtl_curr = dtl;
- get_paca()->lppaca_ptr->dtl_idx = 0;
-
- /* hypervisor reads buffer length from this field */
- dtl->enqueue_to_dispatch_time = cpu_to_be32(DISPATCH_LOG_BYTES);
- ret = register_dtl(hard_smp_processor_id(), __pa(dtl));
- if (ret)
- pr_err("WARNING: DTL registration of cpu %d (hw %d) failed "
- "with %d\n", smp_processor_id(),
- hard_smp_processor_id(), ret);
- get_paca()->lppaca_ptr->dtl_enable_mask = DTL_LOG_PREEMPT;
+ register_dtl_buffer(smp_processor_id());
return 0;
}
--
2.21.0
^ permalink raw reply related
* [PATCH 2/8] powerpc/pseries: Do not save the previous DTL mask value
From: Naveen N. Rao @ 2019-05-10 16:26 UTC (permalink / raw)
To: Michael Ellerman
Cc: Nathan Lynch, Aneesh Kumar K.V, Mingming Cao, linuxppc-dev
In-Reply-To: <cover.1557502887.git.naveen.n.rao@linux.vnet.ibm.com>
When CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is enabled, we always initialize
DTL enable mask to DTL_LOG_PREEMPT (0x2). There are no other places
where the mask is changed. As such, when reading the DTL log buffer
through debugfs, there is no need to save and restore the previous mask
value.
We don't need to save and restore the earlier mask value if
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not enabled. So, remove the field
from the structure as well.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
arch/powerpc/platforms/pseries/dtl.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index 051ea2de1e1a..fb05804adb2f 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -55,7 +55,6 @@ struct dtl_ring {
struct dtl_entry *write_ptr;
struct dtl_entry *buf;
struct dtl_entry *buf_end;
- u8 saved_dtl_mask;
};
static DEFINE_PER_CPU(struct dtl_ring, dtl_rings);
@@ -105,7 +104,6 @@ static int dtl_start(struct dtl *dtl)
dtlr->write_ptr = dtl->buf;
/* enable event logging */
- dtlr->saved_dtl_mask = lppaca_of(dtl->cpu).dtl_enable_mask;
lppaca_of(dtl->cpu).dtl_enable_mask |= dtl_event_mask;
dtl_consumer = consume_dtle;
@@ -123,7 +121,7 @@ static void dtl_stop(struct dtl *dtl)
dtlr->buf = NULL;
/* restore dtl_enable_mask */
- lppaca_of(dtl->cpu).dtl_enable_mask = dtlr->saved_dtl_mask;
+ lppaca_of(dtl->cpu).dtl_enable_mask = DTL_LOG_PREEMPT;
if (atomic_dec_and_test(&dtl_count))
dtl_consumer = NULL;
--
2.21.0
^ permalink raw reply related
* [PATCH 4/8] powerpc/pseries: Generalize hcall_vphn()
From: Naveen N. Rao @ 2019-05-10 16:26 UTC (permalink / raw)
To: Michael Ellerman
Cc: Nathan Lynch, Aneesh Kumar K.V, Mingming Cao, linuxppc-dev
In-Reply-To: <cover.1557502887.git.naveen.n.rao@linux.vnet.ibm.com>
H_HOME_NODE_ASSOCIATIVITY hcall can take two different flags and return
different associativity information in each case. Generalize the
existing hcall_vphn() function to take flags as an argument and to
return the result. Update the only existing user to pass the proper
arguments.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
arch/powerpc/mm/book3s64/vphn.h | 8 ++++++++
arch/powerpc/mm/numa.c | 27 +++++++++++++--------------
2 files changed, 21 insertions(+), 14 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/vphn.h b/arch/powerpc/mm/book3s64/vphn.h
index f0b93c2dd578..f7ff1e0c3801 100644
--- a/arch/powerpc/mm/book3s64/vphn.h
+++ b/arch/powerpc/mm/book3s64/vphn.h
@@ -11,6 +11,14 @@
*/
#define VPHN_ASSOC_BUFSIZE (VPHN_REGISTER_COUNT*sizeof(u64)/sizeof(u16) + 1)
+/*
+ * The H_HOME_NODE_ASSOCIATIVITY hcall takes two values for flags:
+ * 1 for retrieving associativity information for a guest cpu
+ * 2 for retrieving associativity information for a host/hypervisor cpu
+ */
+#define VPHN_FLAG_VCPU 1
+#define VPHN_FLAG_PCPU 2
+
extern int vphn_unpack_associativity(const long *packed, __be32 *unpacked);
#endif
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 57e64273cb33..57f006b6214b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1087,6 +1087,17 @@ static void reset_topology_timer(void);
static int topology_timer_secs = 1;
static int topology_inited;
+static long hcall_vphn(unsigned long cpu, u64 flags, __be32 *associativity)
+{
+ long rc;
+ long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
+
+ rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, cpu);
+ vphn_unpack_associativity(retbuf, associativity);
+
+ return rc;
+}
+
/*
* Change polling interval for associativity changes.
*/
@@ -1165,25 +1176,13 @@ static int update_cpu_associativity_changes_mask(void)
* Retrieve the new associativity information for a virtual processor's
* home node.
*/
-static long hcall_vphn(unsigned long cpu, __be32 *associativity)
-{
- long rc;
- long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
- u64 flags = 1;
- int hwcpu = get_hard_smp_processor_id(cpu);
-
- rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, retbuf, flags, hwcpu);
- vphn_unpack_associativity(retbuf, associativity);
-
- return rc;
-}
-
static long vphn_get_associativity(unsigned long cpu,
__be32 *associativity)
{
long rc;
- rc = hcall_vphn(cpu, associativity);
+ rc = hcall_vphn(get_hard_smp_processor_id(cpu),
+ VPHN_FLAG_VCPU, associativity);
switch (rc) {
case H_FUNCTION:
--
2.21.0
^ permalink raw reply related
* [PATCH 5/8] powerpc/pseries: Introduce helpers to gatekeep DTLB usage
From: Naveen N. Rao @ 2019-05-10 16:26 UTC (permalink / raw)
To: Michael Ellerman
Cc: Nathan Lynch, Aneesh Kumar K.V, Mingming Cao, linuxppc-dev
In-Reply-To: <cover.1557502887.git.naveen.n.rao@linux.vnet.ibm.com>
Since we would be introducing a new user of the DTL buffer in a
subsequent patch, add helpers to gatekeep use of the DTL buffer. The
current usage of the DTL buffer from debugfs is at a per-cpu level
(corresponding to the cpu debugfs file that is opened). Subsequently, we
will have users enabling/accessing DTLB for all online cpus. These
helpers allow any number of per-cpu users, or a single global user
exclusively.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/plpar_wrappers.h | 2 ++
arch/powerpc/platforms/pseries/dtl.c | 10 ++++++-
arch/powerpc/platforms/pseries/lpar.c | 36 +++++++++++++++++++++++
3 files changed, 47 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
index d08feb1bc2bd..ab7dd454b6eb 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -88,6 +88,8 @@ static inline long register_dtl(unsigned long cpu, unsigned long vpa)
return vpa_call(H_VPA_REG_DTL, cpu, vpa);
}
+extern bool register_dtl_buffer_access(bool global);
+extern void unregister_dtl_buffer_access(bool global);
extern void register_dtl_buffer(int cpu);
extern void alloc_dtl_buffers(void);
extern void vpa_init(int cpu);
diff --git a/arch/powerpc/platforms/pseries/dtl.c b/arch/powerpc/platforms/pseries/dtl.c
index fb05804adb2f..dd28296c9903 100644
--- a/arch/powerpc/platforms/pseries/dtl.c
+++ b/arch/powerpc/platforms/pseries/dtl.c
@@ -193,11 +193,15 @@ static int dtl_enable(struct dtl *dtl)
if (dtl->buf)
return -EBUSY;
+ if (register_dtl_buffer_access(false))
+ return -EBUSY;
+
n_entries = dtl_buf_entries;
buf = kmem_cache_alloc_node(dtl_cache, GFP_KERNEL, cpu_to_node(dtl->cpu));
if (!buf) {
printk(KERN_WARNING "%s: buffer alloc failed for cpu %d\n",
__func__, dtl->cpu);
+ unregister_dtl_buffer_access(false);
return -ENOMEM;
}
@@ -214,8 +218,11 @@ static int dtl_enable(struct dtl *dtl)
}
spin_unlock(&dtl->lock);
- if (rc)
+ if (rc) {
+ unregister_dtl_buffer_access(false);
kmem_cache_free(dtl_cache, buf);
+ }
+
return rc;
}
@@ -227,6 +234,7 @@ static void dtl_disable(struct dtl *dtl)
dtl->buf = NULL;
dtl->buf_entries = 0;
spin_unlock(&dtl->lock);
+ unregister_dtl_buffer_access(false);
}
/* file interface */
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 3375ca8cefb5..6af5a2a11deb 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -65,6 +65,42 @@ EXPORT_SYMBOL(plpar_hcall);
EXPORT_SYMBOL(plpar_hcall9);
EXPORT_SYMBOL(plpar_hcall_norets);
+static DEFINE_SPINLOCK(dtl_buffer_refctr_lock);
+static unsigned int dtl_buffer_global_refctr, dtl_buffer_percpu_refctr;
+
+bool register_dtl_buffer_access(bool global)
+{
+ int rc = 0;
+
+ spin_lock(&dtl_buffer_refctr_lock);
+
+ if ((global && (dtl_buffer_global_refctr || dtl_buffer_percpu_refctr))
+ || (!global && dtl_buffer_global_refctr)) {
+ rc = -1;
+ } else {
+ if (global)
+ dtl_buffer_global_refctr++;
+ else
+ dtl_buffer_percpu_refctr++;
+ }
+
+ spin_unlock(&dtl_buffer_refctr_lock);
+
+ return rc;
+}
+
+void unregister_dtl_buffer_access(bool global)
+{
+ spin_lock(&dtl_buffer_refctr_lock);
+
+ if (global)
+ dtl_buffer_global_refctr--;
+ else
+ dtl_buffer_percpu_refctr--;
+
+ spin_unlock(&dtl_buffer_refctr_lock);
+}
+
void alloc_dtl_buffers(void)
{
int cpu;
--
2.21.0
^ permalink raw reply related
* [PATCH 6/8] powerpc/pseries: Provide vcpu dispatch statistics
From: Naveen N. Rao @ 2019-05-10 16:26 UTC (permalink / raw)
To: Michael Ellerman
Cc: Nathan Lynch, Aneesh Kumar K.V, Mingming Cao, linuxppc-dev
In-Reply-To: <cover.1557502887.git.naveen.n.rao@linux.vnet.ibm.com>
For Shared Processor LPARs, the POWER Hypervisor maintains a relatively
static mapping of the LPAR processors (vcpus) to physical processor
chips (representing the "home" node) and tries to always dispatch vcpus
on their associated physical processor chip. However, under certain
scenarios, vcpus may be dispatched on a different processor chip (away
from its home node). The actual physical processor number on which a
certain vcpu is dispatched is available to the guest in the
'processor_id' field of each DTL entry.
The guest can discover the home node of each vcpu through the
H_HOME_NODE_ASSOCIATIVITY(flags=1) hcall. The guest can also discover
the associativity of physical processors, as represented in the DTL
entry, through the H_HOME_NODE_ASSOCIATIVITY(flags=2) hcall.
These can then be compared to determine if the vcpu was dispatched on
its home node or not. If the vcpu was not dispatched on the home node,
it is possible to determine if the vcpu was dispatched in a different
chip, socket or drawer.
Introduce a procfs file /proc/powerpc/vcpudispatch_stats that can be
used to obtain these statistics. Writing '1' to this file enables
collecting the statistics, while writing '0' disables the statistics.
The statistics themselves are available by reading the procfs file. By
default, the DTLB log for each vcpu is processed 50 times a second so as
not to miss any entries. This processing frequency can be changed
through /proc/powerpc/vcpudispatch_stats_freq.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/topology.h | 4 +
arch/powerpc/mm/numa.c | 107 +++++++
arch/powerpc/platforms/pseries/lpar.c | 441 +++++++++++++++++++++++++-
3 files changed, 550 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f85e2b01c3df..7c064731a0f2 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -93,6 +93,10 @@ extern int prrn_is_enabled(void);
extern int find_and_online_cpu_nid(int cpu);
extern int timed_topology_update(int nsecs);
extern void __init shared_proc_topology_init(void);
+extern int init_cpu_associativity(void);
+extern void destroy_cpu_associativity(void);
+extern int cpu_relative_dispatch_distance(int last_disp_cpu, int cur_disp_cpu);
+extern int cpu_home_node_dispatch_distance(int disp_cpu);
#else
static inline int start_topology_update(void)
{
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 57f006b6214b..c0828d5e12e0 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1086,6 +1086,16 @@ static int prrn_enabled;
static void reset_topology_timer(void);
static int topology_timer_secs = 1;
static int topology_inited;
+static __be32 *vcpu_associativity, *pcpu_associativity;
+
+/*
+ * This represents the number of cpus in the hypervisor. Since there is no
+ * architected way to discover the number of processors in the host, we
+ * provision for dealing with NR_CPUS. This is currently 2048 by default, and
+ * is sufficient for our purposes. This will need to be tweaked if
+ * CONFIG_NR_CPUS is changed.
+ */
+#define NR_CPUS_H NR_CPUS
static long hcall_vphn(unsigned long cpu, u64 flags, __be32 *associativity)
{
@@ -1098,6 +1108,103 @@ static long hcall_vphn(unsigned long cpu, u64 flags, __be32 *associativity)
return rc;
}
+int init_cpu_associativity(void)
+{
+ vcpu_associativity = kcalloc(num_possible_cpus() / threads_per_core,
+ VPHN_ASSOC_BUFSIZE * sizeof(__be32), GFP_KERNEL);
+ pcpu_associativity = kcalloc(NR_CPUS_H / threads_per_core,
+ VPHN_ASSOC_BUFSIZE * sizeof(__be32), GFP_KERNEL);
+
+ if (!vcpu_associativity || !pcpu_associativity) {
+ pr_err("error allocating memory for associativity information\n");
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+void destroy_cpu_associativity(void)
+{
+ kfree(vcpu_associativity);
+ kfree(pcpu_associativity);
+ vcpu_associativity = pcpu_associativity = 0;
+}
+
+static __be32 *__get_cpu_associativity(int cpu, __be32 *cpu_assoc, int flag)
+{
+ __be32 *assoc;
+ int rc = 0;
+
+ assoc = &cpu_assoc[(int)(cpu / threads_per_core) * VPHN_ASSOC_BUFSIZE];
+ if (!assoc[0]) {
+ rc = hcall_vphn(cpu, flag, &assoc[0]);
+ if (rc)
+ return NULL;
+ }
+
+ return assoc;
+}
+
+static __be32 *get_pcpu_associativity(int cpu)
+{
+ return __get_cpu_associativity(cpu, pcpu_associativity, VPHN_FLAG_PCPU);
+}
+
+static __be32 *get_vcpu_associativity(int cpu)
+{
+ return __get_cpu_associativity(cpu, vcpu_associativity, VPHN_FLAG_VCPU);
+}
+
+static int calc_dispatch_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
+{
+ int i, index, dist;
+
+ for (i = 0, dist = 0; i < distance_ref_points_depth; i++) {
+ index = be32_to_cpu(distance_ref_points[i]);
+ if (cpu1_assoc[index] == cpu2_assoc[index])
+ break;
+ dist++;
+ }
+
+ return dist;
+}
+
+int cpu_relative_dispatch_distance(int last_disp_cpu, int cur_disp_cpu)
+{
+ __be32 *last_disp_cpu_assoc, *cur_disp_cpu_assoc;
+
+ if (last_disp_cpu >= NR_CPUS_H || cur_disp_cpu >= NR_CPUS_H)
+ return -EINVAL;
+
+ last_disp_cpu_assoc = get_pcpu_associativity(last_disp_cpu);
+ cur_disp_cpu_assoc = get_pcpu_associativity(cur_disp_cpu);
+
+ if (!last_disp_cpu_assoc || !cur_disp_cpu_assoc)
+ return -EIO;
+
+ return calc_dispatch_distance(last_disp_cpu_assoc, cur_disp_cpu_assoc);
+}
+
+int cpu_home_node_dispatch_distance(int disp_cpu)
+{
+ __be32 *disp_cpu_assoc, *vcpu_assoc;
+ int vcpu_id = smp_processor_id();
+
+ if (disp_cpu >= NR_CPUS_H) {
+ pr_debug_ratelimited("vcpu dispatch cpu %d > %d\n",
+ disp_cpu, NR_CPUS_H);
+ return -EINVAL;
+ }
+
+ disp_cpu_assoc = get_pcpu_associativity(disp_cpu);
+ vcpu_assoc = get_vcpu_associativity(vcpu_id);
+
+ if (!disp_cpu_assoc || !vcpu_assoc)
+ return -EIO;
+
+ return calc_dispatch_distance(disp_cpu_assoc, vcpu_assoc);
+}
+
/*
* Change polling interval for associativity changes.
*/
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 6af5a2a11deb..b808a70cc253 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -30,6 +30,10 @@
#include <linux/jump_label.h>
#include <linux/delay.h>
#include <linux/stop_machine.h>
+#include <linux/spinlock.h>
+#include <linux/cpuhotplug.h>
+#include <linux/workqueue.h>
+#include <linux/proc_fs.h>
#include <asm/processor.h>
#include <asm/mmu.h>
#include <asm/page.h>
@@ -65,8 +69,42 @@ EXPORT_SYMBOL(plpar_hcall);
EXPORT_SYMBOL(plpar_hcall9);
EXPORT_SYMBOL(plpar_hcall_norets);
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+static u8 dtl_mask = DTL_LOG_PREEMPT;
+#else
+static u8 dtl_mask;
+#endif
+
+struct dtl_worker {
+ struct delayed_work work;
+ int cpu;
+};
+
+struct vcpu_dispatch_data {
+ int last_disp_cpu;
+
+ int total_disp;
+
+ int same_cpu_disp;
+ int same_chip_disp;
+ int diff_chip_disp;
+ int far_chip_disp;
+
+ int numa_home_disp;
+ int numa_remote_disp;
+ int numa_far_disp;
+};
+
+static DEFINE_PER_CPU(struct vcpu_dispatch_data, vcpu_disp_data);
+static DEFINE_PER_CPU(u64, dtl_entry_ridx);
+static DEFINE_PER_CPU(struct dtl_worker, dtl_workers);
+static enum cpuhp_state dtl_worker_state;
+static DEFINE_MUTEX(dtl_worker_mutex);
+static unsigned int dtl_worker_refctr;
static DEFINE_SPINLOCK(dtl_buffer_refctr_lock);
static unsigned int dtl_buffer_global_refctr, dtl_buffer_percpu_refctr;
+static int vcpudispatch_stats_on __read_mostly;
+static int vcpudispatch_stats_freq = 50;
bool register_dtl_buffer_access(bool global)
{
@@ -101,6 +139,124 @@ void unregister_dtl_buffer_access(bool global)
spin_unlock(&dtl_buffer_refctr_lock);
}
+static void update_vcpu_disp_stat(int disp_cpu)
+{
+ struct vcpu_dispatch_data *disp;
+ int distance;
+
+ disp = this_cpu_ptr(&vcpu_disp_data);
+ if (disp->last_disp_cpu == -1) {
+ disp->last_disp_cpu = disp_cpu;
+ return;
+ }
+
+ disp->total_disp++;
+
+ if (disp->last_disp_cpu == disp_cpu ||
+ (cpu_first_thread_sibling(disp->last_disp_cpu) ==
+ cpu_first_thread_sibling(disp_cpu)))
+ disp->same_cpu_disp++;
+ else {
+ distance = cpu_relative_dispatch_distance(disp->last_disp_cpu,
+ disp_cpu);
+ if (distance < 0)
+ pr_debug_ratelimited("vcpudispatch_stats: cpu %d: error determining associativity\n",
+ smp_processor_id());
+ else {
+ switch (distance) {
+ case 0:
+ disp->same_chip_disp++;
+ break;
+ case 1:
+ disp->diff_chip_disp++;
+ break;
+ case 2:
+ disp->far_chip_disp++;
+ break;
+ default:
+ pr_debug_ratelimited("vcpudispatch_stats: cpu %d (%d -> %d): unexpected relative dispatch distance %d\n",
+ smp_processor_id(),
+ disp->last_disp_cpu,
+ disp_cpu,
+ distance);
+ }
+ }
+ }
+
+ distance = cpu_home_node_dispatch_distance(disp_cpu);
+ if (distance < 0)
+ pr_debug_ratelimited("vcpudispatch_stats: cpu %d: error determining associativity\n",
+ smp_processor_id());
+ else {
+ switch (distance) {
+ case 0:
+ disp->numa_home_disp++;
+ break;
+ case 1:
+ disp->numa_remote_disp++;
+ break;
+ case 2:
+ disp->numa_far_disp++;
+ break;
+ default:
+ pr_debug_ratelimited("vcpudispatch_stats: cpu %d on %d: unexpected numa dispatch distance %d\n",
+ smp_processor_id(),
+ disp_cpu,
+ distance);
+ }
+ }
+
+ disp->last_disp_cpu = disp_cpu;
+}
+
+static void process_dtl_buffer(struct work_struct *work)
+{
+ struct dtl_entry dtle;
+ u64 i = __this_cpu_read(dtl_entry_ridx);
+ struct dtl_entry *dtl = local_paca->dispatch_log + (i % N_DISPATCH_LOG);
+ struct dtl_entry *dtl_end = local_paca->dispatch_log_end;
+ struct lppaca *vpa = local_paca->lppaca_ptr;
+ struct dtl_worker *d = container_of(work, struct dtl_worker, work.work);
+
+ if (!local_paca->dispatch_log)
+ return;
+
+ /* if we have been migrated away, we cancel ourself */
+ if (d->cpu != smp_processor_id()) {
+ pr_debug("vcpudispatch_stats: cpu %d worker migrated -- canceling worker\n",
+ smp_processor_id());
+ return;
+ }
+
+ if (i == be64_to_cpu(vpa->dtl_idx))
+ goto out;
+
+ while (i < be64_to_cpu(vpa->dtl_idx)) {
+ dtle = *dtl;
+ barrier();
+ if (i + N_DISPATCH_LOG < be64_to_cpu(vpa->dtl_idx)) {
+ /* buffer has overflowed */
+ pr_debug_ratelimited("vcpudispatch_stats: cpu %d lost %lld DTL samples\n",
+ d->cpu,
+ be64_to_cpu(vpa->dtl_idx) - N_DISPATCH_LOG - i);
+ i = be64_to_cpu(vpa->dtl_idx) - N_DISPATCH_LOG;
+ dtl = local_paca->dispatch_log + (i % N_DISPATCH_LOG);
+ continue;
+ }
+ update_vcpu_disp_stat(be16_to_cpu(dtle.processor_id));
+ ++i;
+ ++dtl;
+ if (dtl == dtl_end)
+ dtl = local_paca->dispatch_log;
+ }
+
+ __this_cpu_write(dtl_entry_ridx, i);
+
+out:
+ schedule_delayed_work_on(d->cpu, to_delayed_work(work),
+ HZ / vcpudispatch_stats_freq);
+}
+
void alloc_dtl_buffers(void)
{
int cpu;
@@ -109,11 +265,15 @@ void alloc_dtl_buffers(void)
for_each_possible_cpu(cpu) {
pp = paca_ptrs[cpu];
+ if (pp->dispatch_log)
+ continue;
dtl = kmem_cache_alloc(dtl_cache, GFP_KERNEL);
if (!dtl) {
pr_warn("Failed to allocate dispatch trace log for cpu %d\n",
cpu);
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
pr_warn("Stolen time statistics will be unreliable\n");
+#endif
break;
}
@@ -124,6 +284,25 @@ void alloc_dtl_buffers(void)
}
}
+static void free_dtl_buffers(void)
+{
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+ int cpu;
+ struct paca_struct *pp;
+
+ for_each_possible_cpu(cpu) {
+ pp = paca_ptrs[cpu];
+ if (!pp->dispatch_log)
+ continue;
+ kmem_cache_free(dtl_cache, pp->dispatch_log);
+ pp->dtl_ridx = 0;
+ pp->dispatch_log = 0;
+ pp->dispatch_log_end = 0;
+ pp->dtl_curr = 0;
+ }
+#endif
+}
+
void register_dtl_buffer(int cpu)
{
long ret;
@@ -133,7 +312,7 @@ void register_dtl_buffer(int cpu)
pp = paca_ptrs[cpu];
dtl = pp->dispatch_log;
- if (dtl) {
+ if (dtl && dtl_mask) {
pp->dtl_ridx = 0;
pp->dtl_curr = dtl;
lppaca_of(cpu).dtl_idx = 0;
@@ -144,10 +323,268 @@ void register_dtl_buffer(int cpu)
if (ret)
pr_err("WARNING: DTL registration of cpu %d (hw %d) "
"failed with %ld\n", cpu, hwcpu, ret);
- lppaca_of(cpu).dtl_enable_mask = DTL_LOG_PREEMPT;
+ lppaca_of(cpu).dtl_enable_mask = dtl_mask;
+ }
+}
+
+static int dtl_worker_online(unsigned int cpu)
+{
+ struct dtl_worker *d = &per_cpu(dtl_workers, cpu);
+
+ memset(d, 0, sizeof(*d));
+ INIT_DELAYED_WORK(&d->work, process_dtl_buffer);
+ d->cpu = cpu;
+
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+ per_cpu(dtl_entry_ridx, cpu) = 0;
+ register_dtl_buffer(cpu);
+#else
+ per_cpu(dtl_entry_ridx, cpu) = be64_to_cpu(lppaca_of(cpu).dtl_idx);
+#endif
+
+ schedule_delayed_work_on(cpu, &d->work, HZ / vcpudispatch_stats_freq);
+ return 0;
+}
+
+static int dtl_worker_offline(unsigned int cpu)
+{
+ struct dtl_worker *d = &per_cpu(dtl_workers, cpu);
+
+ cancel_delayed_work_sync(&d->work);
+
+#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+ unregister_dtl(get_hard_smp_processor_id(cpu));
+#endif
+
+ return 0;
+}
+
+static void set_global_dtl_mask(u8 mask)
+{
+ int cpu;
+
+ dtl_mask = mask;
+ for_each_present_cpu(cpu)
+ lppaca_of(cpu).dtl_enable_mask = dtl_mask;
+}
+
+static void reset_global_dtl_mask(void)
+{
+ int cpu;
+
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+ dtl_mask = DTL_LOG_PREEMPT;
+#else
+ dtl_mask = 0;
+#endif
+ for_each_present_cpu(cpu)
+ lppaca_of(cpu).dtl_enable_mask = dtl_mask;
+}
+
+static int dtl_worker_enable(void)
+{
+ int rc = 0, state;
+
+ mutex_lock(&dtl_worker_mutex);
+
+ if (dtl_worker_refctr) {
+ dtl_worker_refctr++;
+ goto out;
+ }
+
+ if (register_dtl_buffer_access(1)) {
+ rc = -EBUSY;
+ goto out;
+ }
+
+ set_global_dtl_mask(DTL_LOG_ALL);
+
+ /* Setup dtl buffers and register those */
+ alloc_dtl_buffers();
+
+ state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "powerpc/dtl:online",
+ dtl_worker_online, dtl_worker_offline);
+ if (state < 0) {
+ pr_err("vcpudispatch_stats: unable to setup workqueue for DTL processing\n");
+ free_dtl_buffers();
+ reset_global_dtl_mask();
+ unregister_dtl_buffer_access(1);
+ rc = -EINVAL;
+ goto out;
+ }
+ dtl_worker_state = state;
+ dtl_worker_refctr++;
+
+out:
+ mutex_unlock(&dtl_worker_mutex);
+ return rc;
+}
+
+static void dtl_worker_disable(void)
+{
+ mutex_lock(&dtl_worker_mutex);
+ dtl_worker_refctr--;
+ if (!dtl_worker_refctr) {
+ cpuhp_remove_state(dtl_worker_state);
+ dtl_worker_state = 0;
+ free_dtl_buffers();
+ reset_global_dtl_mask();
+ unregister_dtl_buffer_access(1);
+ }
+ mutex_unlock(&dtl_worker_mutex);
+}
+
+static ssize_t vcpudispatch_stats_write(struct file *file, const char __user *p,
+ size_t count, loff_t *ppos)
+{
+ struct vcpu_dispatch_data *disp;
+ int rc, cmd, cpu;
+ char buf[16];
+
+ if (count > 15)
+ return -EINVAL;
+
+ if (copy_from_user(buf, p, count))
+ return -EFAULT;
+
+ buf[count] = 0;
+ rc = kstrtoint(buf, 0, &cmd);
+ if (rc || cmd < 0 || cmd > 1) {
+ pr_err("vcpudispatch_stats: please use 0 to disable or 1 to enable dispatch statistics\n");
+ return rc ? rc : -EINVAL;
+ }
+
+ if ((cmd == 0 && !vcpudispatch_stats_on) ||
+ (cmd == 1 && vcpudispatch_stats_on))
+ return count;
+
+ if (cmd) {
+ rc = init_cpu_associativity();
+ if (rc)
+ return rc;
+
+ for_each_possible_cpu(cpu) {
+ disp = per_cpu_ptr(&vcpu_disp_data, cpu);
+ memset(disp, 0, sizeof(*disp));
+ disp->last_disp_cpu = -1;
+ }
+
+ rc = dtl_worker_enable();
+ if (rc) {
+ destroy_cpu_associativity();
+ return rc;
+ }
+ } else {
+ dtl_worker_disable();
+ destroy_cpu_associativity();
+ }
+
+ vcpudispatch_stats_on = cmd;
+
+ return count;
+}
+
+static int vcpudispatch_stats_display(struct seq_file *p, void *v)
+{
+ int cpu;
+ struct vcpu_dispatch_data *disp;
+
+ if (!vcpudispatch_stats_on) {
+ seq_puts(p, "off\n");
+ return 0;
+ }
+
+ for_each_online_cpu(cpu) {
+ disp = per_cpu_ptr(&vcpu_disp_data, cpu);
+ seq_printf(p, "cpu%d", cpu);
+ seq_put_decimal_ull(p, " ", disp->total_disp);
+ seq_put_decimal_ull(p, " ", disp->same_cpu_disp);
+ seq_put_decimal_ull(p, " ", disp->same_chip_disp);
+ seq_put_decimal_ull(p, " ", disp->diff_chip_disp);
+ seq_put_decimal_ull(p, " ", disp->far_chip_disp);
+ seq_put_decimal_ull(p, " ", disp->numa_home_disp);
+ seq_put_decimal_ull(p, " ", disp->numa_remote_disp);
+ seq_put_decimal_ull(p, " ", disp->numa_far_disp);
+ seq_puts(p, "\n");
}
+
+ return 0;
}
+static int vcpudispatch_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, vcpudispatch_stats_display, NULL);
+}
+
+static const struct file_operations vcpudispatch_stats_proc_ops = {
+ .open = vcpudispatch_stats_open,
+ .read = seq_read,
+ .write = vcpudispatch_stats_write,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static ssize_t vcpudispatch_stats_freq_write(struct file *file,
+ const char __user *p, size_t count, loff_t *ppos)
+{
+ int rc, freq;
+ char buf[16];
+
+ if (count > 15)
+ return -EINVAL;
+
+ if (copy_from_user(buf, p, count))
+ return -EFAULT;
+
+ buf[count] = 0;
+ rc = kstrtoint(buf, 0, &freq);
+ if (rc || freq < 1 || freq > HZ) {
+ pr_err("vcpudispatch_stats_freq: please specify a frequency between 1 and %d\n",
+ HZ);
+ return rc ? rc : -EINVAL;
+ }
+
+ vcpudispatch_stats_freq = freq;
+
+ return count;
+}
+
+static int vcpudispatch_stats_freq_display(struct seq_file *p, void *v)
+{
+ seq_printf(p, "%d\n", vcpudispatch_stats_freq);
+ return 0;
+}
+
+static int vcpudispatch_stats_freq_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, vcpudispatch_stats_freq_display, NULL);
+}
+
+static const struct file_operations vcpudispatch_stats_freq_proc_ops = {
+ .open = vcpudispatch_stats_freq_open,
+ .read = seq_read,
+ .write = vcpudispatch_stats_freq_write,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init vcpudispatch_stats_procfs_init(void)
+{
+ if (!lppaca_shared_proc(get_lppaca()))
+ return 0;
+
+ if (!proc_create("powerpc/vcpudispatch_stats", 0600, NULL,
+ &vcpudispatch_stats_proc_ops))
+ pr_err("vcpudispatch_stats: error creating procfs file\n");
+ else if (!proc_create("powerpc/vcpudispatch_stats_freq", 0600, NULL,
+ &vcpudispatch_stats_freq_proc_ops))
+ pr_err("vcpudispatch_stats_freq: error creating procfs file\n");
+
+ return 0;
+}
+
+machine_device_initcall(pseries, vcpudispatch_stats_procfs_init);
+
void vpa_init(int cpu)
{
int hwcpu = get_hard_smp_processor_id(cpu);
--
2.21.0
^ permalink raw reply related
* [PATCH 7/8] powerpc/pseries: Protect against hogging the cpu while setting up the stats
From: Naveen N. Rao @ 2019-05-10 16:26 UTC (permalink / raw)
To: Michael Ellerman
Cc: Nathan Lynch, Aneesh Kumar K.V, Mingming Cao, linuxppc-dev
In-Reply-To: <cover.1557502887.git.naveen.n.rao@linux.vnet.ibm.com>
When enabling or disabling the vcpu dispatch statistics, we do a lot of
work including allocating/deallocating memory across all possible cpus
for the DTL buffer. In order to guard against hogging the cpu for too
long, track the time we're taking and yield the processor if necessary.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/plpar_wrappers.h | 2 +-
arch/powerpc/platforms/pseries/lpar.c | 29 ++++++++++++++++-------
arch/powerpc/platforms/pseries/setup.c | 2 +-
3 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
index ab7dd454b6eb..d01bf0036e4d 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -91,7 +91,7 @@ static inline long register_dtl(unsigned long cpu, unsigned long vpa)
extern bool register_dtl_buffer_access(bool global);
extern void unregister_dtl_buffer_access(bool global);
extern void register_dtl_buffer(int cpu);
-extern void alloc_dtl_buffers(void);
+extern void alloc_dtl_buffers(unsigned long *time_limit);
extern void vpa_init(int cpu);
static inline long plpar_pte_enter(unsigned long flags,
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index b808a70cc253..8bc1b950cfd0 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -257,7 +257,7 @@ static void process_dtl_buffer(struct work_struct *work)
HZ / vcpudispatch_stats_freq);
}
-void alloc_dtl_buffers(void)
+void alloc_dtl_buffers(unsigned long *time_limit)
{
int cpu;
struct paca_struct *pp;
@@ -281,10 +281,15 @@ void alloc_dtl_buffers(void)
pp->dispatch_log = dtl;
pp->dispatch_log_end = dtl + N_DISPATCH_LOG;
pp->dtl_curr = dtl;
+
+ if (time_limit && time_after(jiffies, *time_limit)) {
+ cond_resched();
+ *time_limit = jiffies + HZ;
+ }
}
}
-static void free_dtl_buffers(void)
+static void free_dtl_buffers(unsigned long *time_limit)
{
#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
int cpu;
@@ -299,6 +304,11 @@ static void free_dtl_buffers(void)
pp->dispatch_log = 0;
pp->dispatch_log_end = 0;
pp->dtl_curr = 0;
+
+ if (time_limit && time_after(jiffies, *time_limit)) {
+ cond_resched();
+ *time_limit = jiffies + HZ;
+ }
}
#endif
}
@@ -381,7 +391,7 @@ static void reset_global_dtl_mask(void)
lppaca_of(cpu).dtl_enable_mask = dtl_mask;
}
-static int dtl_worker_enable(void)
+static int dtl_worker_enable(unsigned long *time_limit)
{
int rc = 0, state;
@@ -400,13 +410,13 @@ static int dtl_worker_enable(void)
set_global_dtl_mask(DTL_LOG_ALL);
/* Setup dtl buffers and register those */
- alloc_dtl_buffers();
+ alloc_dtl_buffers(time_limit);
state = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "powerpc/dtl:online",
dtl_worker_online, dtl_worker_offline);
if (state < 0) {
pr_err("vcpudispatch_stats: unable to setup workqueue for DTL processing\n");
- free_dtl_buffers();
+ free_dtl_buffers(time_limit);
reset_global_dtl_mask();
unregister_dtl_buffer_access(1);
rc = -EINVAL;
@@ -420,14 +430,14 @@ static int dtl_worker_enable(void)
return rc;
}
-static void dtl_worker_disable(void)
+static void dtl_worker_disable(unsigned long *time_limit)
{
mutex_lock(&dtl_worker_mutex);
dtl_worker_refctr--;
if (!dtl_worker_refctr) {
cpuhp_remove_state(dtl_worker_state);
dtl_worker_state = 0;
- free_dtl_buffers();
+ free_dtl_buffers(time_limit);
reset_global_dtl_mask();
unregister_dtl_buffer_access(1);
}
@@ -437,6 +447,7 @@ static void dtl_worker_disable(void)
static ssize_t vcpudispatch_stats_write(struct file *file, const char __user *p,
size_t count, loff_t *ppos)
{
+ unsigned long time_limit = jiffies + HZ;
struct vcpu_dispatch_data *disp;
int rc, cmd, cpu;
char buf[16];
@@ -469,13 +480,13 @@ static ssize_t vcpudispatch_stats_write(struct file *file, const char __user *p,
disp->last_disp_cpu = -1;
}
- rc = dtl_worker_enable();
+ rc = dtl_worker_enable(&time_limit);
if (rc) {
destroy_cpu_associativity();
return rc;
}
} else {
- dtl_worker_disable();
+ dtl_worker_disable(&time_limit);
destroy_cpu_associativity();
}
diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index b6995e5cc5c9..c98246343752 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -283,7 +283,7 @@ static int alloc_dispatch_logs(void)
if (!dtl_cache)
return 0;
- alloc_dtl_buffers();
+ alloc_dtl_buffers(0);
/* Register the DTL for the current (boot) cpu */
register_dtl_buffer(smp_processor_id());
--
2.21.0
^ permalink raw reply related
* [PATCH 8/8] powerpc/pseries: Add documentation for vcpudispatch_stats
From: Naveen N. Rao @ 2019-05-10 16:26 UTC (permalink / raw)
To: Michael Ellerman
Cc: Nathan Lynch, Aneesh Kumar K.V, Mingming Cao, linuxppc-dev
In-Reply-To: <cover.1557502887.git.naveen.n.rao@linux.vnet.ibm.com>
Add a document describing the fields provided by
/proc/powerpc/vcpudispatch_stats.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
---
Documentation/powerpc/vcpudispatch_stats.txt | 68 ++++++++++++++++++++
1 file changed, 68 insertions(+)
create mode 100644 Documentation/powerpc/vcpudispatch_stats.txt
diff --git a/Documentation/powerpc/vcpudispatch_stats.txt b/Documentation/powerpc/vcpudispatch_stats.txt
new file mode 100644
index 000000000000..e21476bfd78c
--- /dev/null
+++ b/Documentation/powerpc/vcpudispatch_stats.txt
@@ -0,0 +1,68 @@
+VCPU Dispatch Statistics:
+=========================
+
+For Shared Processor LPARs, the POWER Hypervisor maintains a relatively
+static mapping of the LPAR processors (vcpus) to physical processor
+chips (representing the "home" node) and tries to always dispatch vcpus
+on their associated physical processor chip. However, under certain
+scenarios, vcpus may be dispatched on a different processor chip (away
+from its home node).
+
+/proc/powerpc/vcpudispatch_stats can be used to obtain statistics
+related to the vcpu dispatch behavior. Writing '1' to this file enables
+collecting the statistics, while writing '0' disables the statistics.
+By default, the DTLB log for each vcpu is processed 50 times a second so
+as not to miss any entries. This processing frequency can be changed
+through /proc/powerpc/vcpudispatch_stats_freq.
+
+The statistics themselves are available by reading the procfs file
+/proc/powerpc/vcpudispatch_stats. Each line in the output corresponds to
+a vcpu as represented by the first field, followed by 8 numbers.
+
+The first number corresponds to:
+1. total vcpu dispatches since the beginning of statistics collection
+
+The next 4 numbers represent vcpu dispatch dispersions:
+2. number of times this vcpu was dispatched on the same processor as last
+ time
+3. number of times this vcpu was dispatched on a different processor core
+ as last time, but within the same chip
+4. number of times this vcpu was dispatched on a different chip
+5. number of times this vcpu was dispatches on a different socket/drawer
+(next numa boundary)
+
+The final 3 numbers represent statistics in relation to the home node of
+the vcpu:
+6. number of times this vcpu was dispatched in its home node (chip)
+7. number of times this vcpu was dispatched in a different node
+8. number of times this vcpu was dispatched in a node further away (numa
+distance)
+
+An example output:
+ $ sudo cat /proc/powerpc/vcpudispatch_stats
+ cpu0 6839 4126 2683 30 0 6821 18 0
+ cpu1 2515 1274 1229 12 0 2509 6 0
+ cpu2 2317 1198 1109 10 0 2312 5 0
+ cpu3 2259 1165 1088 6 0 2256 3 0
+ cpu4 2205 1143 1056 6 0 2202 3 0
+ cpu5 2165 1121 1038 6 0 2162 3 0
+ cpu6 2183 1127 1050 6 0 2180 3 0
+ cpu7 2193 1133 1052 8 0 2187 6 0
+ cpu8 2165 1115 1032 18 0 2156 9 0
+ cpu9 2301 1252 1033 16 0 2293 8 0
+ cpu10 2197 1138 1041 18 0 2187 10 0
+ cpu11 2273 1185 1062 26 0 2260 13 0
+ cpu12 2186 1125 1043 18 0 2177 9 0
+ cpu13 2161 1115 1030 16 0 2153 8 0
+ cpu14 2206 1153 1033 20 0 2196 10 0
+ cpu15 2163 1115 1032 16 0 2155 8 0
+
+In the output above, for vcpu0, there have been 6839 dispatches since
+statistics were enabled. 4126 of those dispatches were on the same
+physical cpu as the last time. 2683 were on a different core, but within
+the same chip, while 30 dispatches were on a different chip compared to
+its last dispatch.
+
+Also, out of the total of 6839 dispatches, we see that there have been
+6821 dispatches on the vcpu's home node, while 18 dispatches were
+outside its home node, on a neighbouring chip.
--
2.21.0
^ permalink raw reply related
* Re: [PATCH] vsprintf: Do not break early boot with probing addresses
From: Martin Schwidefsky @ 2019-05-10 16:32 UTC (permalink / raw)
To: Steven Rostedt
Cc: Petr Mladek, linux-arch, Sergey Senozhatsky, Heiko Carstens,
linux-s390, linuxppc-dev, Rasmus Villemoes, linux-kernel,
Michal Hocko, Sergey Senozhatsky, Stephen Rothwell,
Andy Shevchenko, Linus Torvalds, Tobin C . Harding
In-Reply-To: <20190510122401.21a598f6@gandalf.local.home>
On Fri, 10 May 2019 12:24:01 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 10 May 2019 10:42:13 +0200
> Petr Mladek <pmladek@suse.com> wrote:
>
> > static const char *check_pointer_msg(const void *ptr)
> > {
> > - char byte;
> > -
> > if (!ptr)
> > return "(null)";
> >
> > - if (probe_kernel_address(ptr, byte))
> > + if ((unsigned long)ptr < PAGE_SIZE || IS_ERR_VALUE(ptr))
> > return "(efault)";
> >
>
>
> < PAGE_SIZE ?
>
> do you mean: < TASK_SIZE ?
The check with < TASK_SIZE would break on s390. The 'ptr' is
in the kernel address space, *not* in the user address space.
Remember s390 has two separate address spaces for kernel/user
the check < TASK_SIZE only makes sense with a __user pointer.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
^ permalink raw reply
* Re: [PATCH] vsprintf: Do not break early boot with probing addresses
From: Steven Rostedt @ 2019-05-10 16:40 UTC (permalink / raw)
To: Martin Schwidefsky
Cc: Petr Mladek, linux-arch, Sergey Senozhatsky, Heiko Carstens,
linux-s390, linuxppc-dev, Rasmus Villemoes, linux-kernel,
Michal Hocko, Sergey Senozhatsky, Stephen Rothwell,
Andy Shevchenko, Linus Torvalds, Tobin C . Harding
In-Reply-To: <20190510183258.1f6c4153@mschwideX1>
On Fri, 10 May 2019 18:32:58 +0200
Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
> On Fri, 10 May 2019 12:24:01 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > On Fri, 10 May 2019 10:42:13 +0200
> > Petr Mladek <pmladek@suse.com> wrote:
> >
> > > static const char *check_pointer_msg(const void *ptr)
> > > {
> > > - char byte;
> > > -
> > > if (!ptr)
> > > return "(null)";
> > >
> > > - if (probe_kernel_address(ptr, byte))
> > > + if ((unsigned long)ptr < PAGE_SIZE || IS_ERR_VALUE(ptr))
> > > return "(efault)";
> > >
> >
> >
> > < PAGE_SIZE ?
> >
> > do you mean: < TASK_SIZE ?
>
> The check with < TASK_SIZE would break on s390. The 'ptr' is
> in the kernel address space, *not* in the user address space.
> Remember s390 has two separate address spaces for kernel/user
> the check < TASK_SIZE only makes sense with a __user pointer.
>
So we allow this to read user addresses? Can't that cause a fault?
If the condition is true, we return "(efault)".
-- Steve
^ permalink raw reply
* Re: [PATCH] vsprintf: Do not break early boot with probing addresses
From: Andy Shevchenko @ 2019-05-10 16:41 UTC (permalink / raw)
To: Steven Rostedt
Cc: Petr Mladek, linux-arch, Sergey Senozhatsky, Heiko Carstens,
linux-s390, linuxppc-dev, Rasmus Villemoes, linux-kernel,
Michal Hocko, Sergey Senozhatsky, Stephen Rothwell,
Linus Torvalds, Martin Schwidefsky, Tobin C . Harding
In-Reply-To: <20190510122401.21a598f6@gandalf.local.home>
On Fri, May 10, 2019 at 12:24:01PM -0400, Steven Rostedt wrote:
> On Fri, 10 May 2019 10:42:13 +0200
> Petr Mladek <pmladek@suse.com> wrote:
>
> > static const char *check_pointer_msg(const void *ptr)
> > {
> > - char byte;
> > -
> > if (!ptr)
> > return "(null)";
> >
> > - if (probe_kernel_address(ptr, byte))
> > + if ((unsigned long)ptr < PAGE_SIZE || IS_ERR_VALUE(ptr))
> > return "(efault)";
> >
>
>
> < PAGE_SIZE ?
>
> do you mean: < TASK_SIZE ?
Original code used PAGE_SIZE. If it needs to be changed, that it might be a
separate explanation / patch.
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply
* Re: [PATCH] vsprintf: Do not break early boot with probing addresses
From: Martin Schwidefsky @ 2019-05-10 16:45 UTC (permalink / raw)
To: Steven Rostedt
Cc: Petr Mladek, linux-arch, Sergey Senozhatsky, Heiko Carstens,
linux-s390, linuxppc-dev, Rasmus Villemoes, linux-kernel,
Michal Hocko, Sergey Senozhatsky, Stephen Rothwell,
Andy Shevchenko, Linus Torvalds, Tobin C . Harding
In-Reply-To: <20190510124058.0d44b441@gandalf.local.home>
On Fri, 10 May 2019 12:40:58 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:
> On Fri, 10 May 2019 18:32:58 +0200
> Martin Schwidefsky <schwidefsky@de.ibm.com> wrote:
>
> > On Fri, 10 May 2019 12:24:01 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > > On Fri, 10 May 2019 10:42:13 +0200
> > > Petr Mladek <pmladek@suse.com> wrote:
> > >
> > > > static const char *check_pointer_msg(const void *ptr)
> > > > {
> > > > - char byte;
> > > > -
> > > > if (!ptr)
> > > > return "(null)";
> > > >
> > > > - if (probe_kernel_address(ptr, byte))
> > > > + if ((unsigned long)ptr < PAGE_SIZE || IS_ERR_VALUE(ptr))
> > > > return "(efault)";
> > > >
> > >
> > >
> > > < PAGE_SIZE ?
> > >
> > > do you mean: < TASK_SIZE ?
> >
> > The check with < TASK_SIZE would break on s390. The 'ptr' is
> > in the kernel address space, *not* in the user address space.
> > Remember s390 has two separate address spaces for kernel/user
> > the check < TASK_SIZE only makes sense with a __user pointer.
> >
>
> So we allow this to read user addresses? Can't that cause a fault?
>
> If the condition is true, we return "(efault)".
On x86 this would allow a user space access as kernel and user live
in the same address space, on s390 it would not.
h
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
^ permalink raw reply
* Re: [PATCH] vsprintf: Do not break early boot with probing addresses
From: christophe leroy @ 2019-05-10 17:35 UTC (permalink / raw)
To: Steven Rostedt, Petr Mladek
Cc: linux-arch, linux-s390, Sergey Senozhatsky, Heiko Carstens,
linuxppc-dev, Rasmus Villemoes, linux-kernel, Michal Hocko,
Sergey Senozhatsky, Stephen Rothwell, Andy Shevchenko,
Linus Torvalds, Martin Schwidefsky, Tobin C . Harding
In-Reply-To: <20190510122401.21a598f6@gandalf.local.home>
Le 10/05/2019 à 18:24, Steven Rostedt a écrit :
> On Fri, 10 May 2019 10:42:13 +0200
> Petr Mladek <pmladek@suse.com> wrote:
>
>> static const char *check_pointer_msg(const void *ptr)
>> {
>> - char byte;
>> -
>> if (!ptr)
>> return "(null)";
>>
>> - if (probe_kernel_address(ptr, byte))
>> + if ((unsigned long)ptr < PAGE_SIZE || IS_ERR_VALUE(ptr))
>> return "(efault)";
>>
>
>
> < PAGE_SIZE ?
>
> do you mean: < TASK_SIZE ?
I guess not.
Usually, < PAGE_SIZE means NULL pointer dereference (via the member of a
struct)
Christophe
---
L'absence de virus dans ce courrier électronique a été vérifiée par le logiciel antivirus Avast.
https://www.avast.com/antivirus
^ permalink raw reply
* Re: [PATCH] EDAC, mpc85xx: Prevent building as a module
From: Borislav Petkov @ 2019-05-10 18:25 UTC (permalink / raw)
To: Michael Ellerman
Cc: Johannes Thumshirn, linux-kernel, linuxppc-dev, james.morse,
mchehab, linux-edac
In-Reply-To: <20190510141320.GB29927@zn.tnic>
On Fri, May 10, 2019 at 04:13:20PM +0200, Borislav Petkov wrote:
> On Fri, May 10, 2019 at 08:50:52PM +1000, Michael Ellerman wrote:
> > Yeah that looks better to me. I didn't think about the case where EDAC
> > core is modular.
> >
> > Do you want me to send a new patch?
>
> Nah, I'll fix it up.
I've pushed it here:
https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/commit/?h=edac-fix-for-5.2
in case you wanna throw your build tests on it. My dingy cross-compiler
can't do much really.
Thx.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
^ permalink raw reply
* Re: [PATCH 0/4] Enabling secure boot on PowerNV systems
From: Claudio Carvalho @ 2019-05-10 21:31 UTC (permalink / raw)
To: Matthew Garrett
Cc: linux-efi, Ard Biesheuvel, Nayna Jain, Linux Kernel Mailing List,
Matthew Garret, linuxppc-dev, Peter Jones, Paul Mackerras,
Jeremy Kerr, linux-integrity
In-Reply-To: <CACdnJuvpUKiX5UgSOrzh+B9y68zKm+Bzu1c8KFJHd8diz=sm2Q@mail.gmail.com>
Hi Matthew,
Thanks for the feedback and sorry for the delay in responding.
On 4/10/19 2:36 PM, Matthew Garrett wrote:
> (Cc:ing Peter Jones)
>
> On Tue, Apr 9, 2019 at 3:55 PM Claudio Carvalho <cclaudio@linux.ibm.com> wrote:
>>
>> On 4/5/19 7:19 PM, Matthew Garrett wrote:
>>> Based on our experience doing this in UEFI, that's insufficient - you
>>> want to be able to block individual binaries or leaf certificates
>>> without dropping trust in an intermediate certificate entirely.
>>
>> We agree that a dbx would be useful for blacklisting particular kernels
>> signed with given certificate. However, we have been avoiding doing so for
>> the initial release of secure boot on OpenPOWER. We don't have individual
>> firmware binaries in OpenPOWER. Kernels are currently the only concern for
>> the OS secure boot certificates we're discussing here. Also, we have a very
>> limited keystore space in POWER9.
>>
>> Petitboot doesn't have standardized OS kernel verification at all right
>> now. Having the capability even without dbx seems valuable.
> I don't see the benefit in attempting to maintain compatibility with
> existing tooling unless you're going to be *completely* compatible
> with existing tooling. That means supporting dbx and dbt.
Before addressing that, I'd like to share some of the current OpenPOWER
secure boot design.
Different from UEFI, secure boot in OpenPOWER systems have two distinct
domains. Each one has its own key hierarchy and signing and signature
verification mechanisms.
In the firmware secure boot domain (work already upstream):
- Every image loaded up to skiroot is wrapped in a secure boot container.
Skiroot is a Linux zimage with petitboot (kexec bootloader) embedded in the
initramfs.
- Within the secure boot container, the payload image is protected by a
chain of signatures anchored in the root ECDSA keys, also known as hardware
keys.
- All public keys required to verify the container are stored in the
container itself, but a hash of the trusted public hardware keys is stored
in a protected SEEPROM region outside of the container. Firmware uses it to
check if the container is anchored in the trusted hardware keys. If not,
the container payload is not executed and the boot is aborted.
- The hash of the hardware keys is set by the firmware supplier, for
instance, the platform manufacturer.
In OS secure boot domain (work in progress):
- The skiroot container is verified as part of firmware secure boot.
- Skiroot uses UEFI-like secure variables (PK, KEK and db) to verify OS
kernels. Only X.509 certificates will be supported for these secure variables.
- OS kernels are signed using the Linux kernel sign-file tool, as if they
were kernel modules.
- In the skiroot kernel, if secure boot is enabled, the db certificates
will be loaded into the platform keyring and IMA-appraisal will verify the
kexec image against the platform keyring.
- The PK is set by whoever controls the platform, for instance, the
manufacturer or the end customer.
How about dbx and dbt?
The db keys will be used to verify only OS kernels via kexecs initiated by
petitboot. So we only need the dbx to revoke kernel images, either via
certs or hashes. Currently, the kernel loads certs and hashes from the dbx
to the system blacklist keyring. The revoked certs are checked during pkcs7
signature verification and loading of keys. However, there doesn't appear
to be any verification against blacklisted hashes. Should kernel images be
revoked only by keys and not hashes? We tried to find published revoked
kernel lists but couldn't find any. How is kernel image revocation handled
in practice?
Also, we didn't see the shim or kernel loading anything from dbt.
In general, how do you think the kernel ought to support blacklists?
>
>>>> The API is still a work in progress. We are planning to publish a document
>>>> describing the current API and overall design shortly.
>>> Ok. How are the attributes interpreted by the API?
>>
>> We support a subset of standard EFI variable attributes, and we only use
>> EFI variables that relate to secure boot. Our goal is not to implement
>> UEFI. However, we do seek to be compatible with user space tooling and
>> reuse as much existing infrastructure as possible. We don’t support the
>> following: EFI_VARIABLE_HARDWARE_ERROR_RECORD,
>> EFI_VARIABLE_AUTHENTICATED_WRITE_ACCESS and
>> EFI_VARIABLE_ENHANCED_AUTHENTICATED_ACCESS.
> Ok. I think that's realistically fine.
>
>>>> Perhaps the biggest departure is that the secure variables are stored in
>>>> flash memory that is not lockable. In order to protect the secure
>>>> variables, hashes of the flash regions where they're stored are written to
>>>> TPM NVRAM indices. The TPM NVRAM indices we use are write locked at
>>>> runtime. The sysadmin enqueues update commands in flash. During the next
>>>> boot, the firmware verifies and processes the commands to update the
>>>> certificate store and accompanying integrity hashes in the TPM NVRAM
>>>> indices and write locks them. Before certificates read from flash are
>>>> used, the certificate store is hashed and compared against the hashes
>>>> stored from the TPM. The one exception is the PK. We store it in a TPM
>>>> NVRAM index by itself rather than flash because updates to it must be
>>>> guaranteed to be atomic.
>>> What's the behaviour if multiple updates are enqueued? Does reading
>>> back show a mocked up updated variable or the original state?
>>
>> Our secure variable updates are only applied at boot time. If any one of
>> them fails, they all fail.
> So I do the following:
>
> 1) Boot
> 2) Extend the contents of db
> 3) Extend the contents of db again
> 4) Read back the contents of db through efivarfs
> 5) Reboot
> 6) Read back the contents of db through efivarfs
>
> Is what I see in (4) and (6) the same? Does it contain the values form
> both extensions?
In (2) and (3) the extensions are added to the update queue, which is
processed only in (5) by firmware. So, in (4) you should see the db content
without the extensions.
In (5), firmware (skiboot) will process the update queue. The extensions
will be applied only if *all* of them are valid and pass signature
verification. Only in this case should you be able to see the extensions in
(6). If any of the extensions fail, firmware will discard all of them,
clear the queue, and do the proper logging.
>>> I'm not really clear on the workflow here. Who's the administrator
>>> authority? When would they be updating the second level of keys? If
>>> there's no support for revocation, why would distributions need two
>>> levels of key in the system database rather than just distributing a
>>> single intermediate and signing their actual signing certs with that?
>>
>> In OpenPOWER systems, we enable our customers and business partners to
>> establish and manage the platform key certificate, which is the root of our
>> key hierarchy. From there, through the KEK, they can delegate authority to
>> intermediate level organizations, e.g. distros or IT departments or
>> business operations. Those intermediate level organizations then manage the
>> code signing certificates in the DB. If this answer doesn’t address your
>> question, can you please rephrase?
> Why would the intermediate level organisations not just have entries
> in db?
Because that seems to add more complexity than having three levels (PK, KEK
and db).
Typically, the intermediate level organisations (or KEK) are used to
authorize new additions to db. However, if we also have them in the db, who
would authorize the new additions to db. If that would be the intermediate
level organisation entries now in the db, it seems we would need to
implement a mechanism to determine which entries are for authorizing new
additions and which are for kernel signature verification. If that would be
the PK, we'd be burdening the PK owner to sign every new db addition if the
platform is owned by a company that has intermediate level organizations.
> The main reason we don't do it this way in UEFI is because we
> need to support dbx, and if you're not supporting dbx I'm not sure I
> see the benefit.
I'm not sure I understand your question. We would be using dbx to prevent
kernels from being loaded. How is that related to having three levels in
the key hierarchy (PK, KEK and db)?
Thanks,
Claudio
^ permalink raw reply
* Re: [PATCH kernel 1/2] powerpc/pseries/dma: Allow swiotlb
From: Thiago Jung Bauermann @ 2019-05-10 22:36 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: Alistair Popple, linuxppc-dev, David Gibson
In-Reply-To: <20190507062559.20295-2-aik@ozlabs.ru>
Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> The commit 8617a5c5bc00 ("powerpc/dma: handle iommu bypass in
> dma_iommu_ops") merged direct DMA ops into the IOMMU DMA ops allowing
> SWIOTLB as well but only for mapping; the unmapping and bouncing parts
> were left unmodified.
>
> This adds missing direct unmapping calls to .unmap_page() and .unmap_sg().
>
> This adds missing sync callbacks and directs them to the direct DMA hooks.
>
> Fixes: 8617a5c5bc00 (powerpc/dma: handle iommu bypass in dma_iommu_ops)
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Nice! Thanks for working on this. I have the patch at the end of this
email to get virtio-scsi-pci and virtio-blk-pci working in a secure
guest.
I applied your patch and reverted my patch and unfortunately the guest
hangs right after mounting the disk:
[ 0.185659] virtio-pci 0000:00:04.0: enabling device (0100 -> 0102)
[ 0.187082] virtio-pci 0000:00:04.0: ibm,query-pe-dma-windows(2026) 2000 8000000 20000000 returned 0
[ 0.187497] virtio-pci 0000:00:04.0: ibm,create-pe-dma-window(2027) 2000 8000000 20000000 10 20 returned 0 (liobn = 0x80000001 startin
g addr = 8000000 0)
[ 0.226654] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[ 0.227094] Non-volatile memory driver v1.3
[ 0.228950] brd: module loaded
[ 0.230666] loop: module loaded
[ 0.230773] ipr: IBM Power RAID SCSI Device Driver version: 2.6.4 (March 14, 2017)
[ 0.233323] scsi host0: Virtio SCSI HBA
[ 0.235439] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 2.5+ PQ: 0 ANSI: 5
[ 0.369009] random: fast init done
[ 0.370819] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 0.371320] sd 0:0:0:0: Power-on or device reset occurred
<snip>
[ 0.380378] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[ 0.381102] sd 0:0:0:0: [sda] Write Protect is off
[ 0.381195] sd 0:0:0:0: [sda] Mode Sense: 63 00 00 08
[ 0.382436] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 0.383630] sd 0:0:0:0: [sda] Optimal transfer size 0 bytes < PAGE_SIZE (65536 bytes)
[ 0.391562] sda: sda1 sda2
[ 0.398101] sd 0:0:0:0: [sda] Attached SCSI disk
[ 0.398205] md: Waiting for all devices to be available before autodetect
[ 0.398318] md: If you don't use raid, use raid=noautodetect
[ 0.398515] md: Autodetecting RAID arrays.
[ 0.398585] md: autorun ...
[ 0.398631] md: ... autorun DONE.
[ 0.403552] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null)
[ 0.403700] VFS: Mounted root (ext4 filesystem) readonly on device 8:2.
[ 0.405258] devtmpfs: mounted
[ 0.406427] Freeing unused kernel memory: 4224K
[ 0.406519] This architecture does not have kernel memory protection.
[ 0.406633] Run /sbin/init as init process
Sorry, I don't have any information on where the guest is stuck. I tried
<sysrq>+l, <sysrq>+t and <sysrq>+w but nothing out of the ordinary
showed up. Will try something else later.
--
Thiago Jung Bauermann
IBM Linux Technology Center
From 70d2fba809119ae2d35c9ca4269405bb5c28413a Mon Sep 17 00:00:00 2001
From: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Date: Thu, 24 Jan 2019 22:40:16 -0200
Subject: [PATCH 1/1] powerpc/pseries/iommu: Don't use dma_iommu_ops on secure
guests
Secure guest memory is inacessible to devices so regular DMA isn't
possible.
In that case set devices' dma_map_ops to NULL so that the generic
DMA code path will use SWIOTLB and DMA to bounce buffers.
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
---
arch/powerpc/platforms/pseries/iommu.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 36eb1ddbac69..1636306007eb 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -50,6 +50,7 @@
#include <asm/udbg.h>
#include <asm/mmzone.h>
#include <asm/plpar_wrappers.h>
+#include <asm/svm.h>
#include "pseries.h"
@@ -1335,7 +1336,10 @@ void iommu_init_early_pSeries(void)
of_reconfig_notifier_register(&iommu_reconfig_nb);
register_memory_notifier(&iommu_mem_nb);
- set_pci_dma_ops(&dma_iommu_ops);
+ if (is_secure_guest())
+ set_pci_dma_ops(NULL);
+ else
+ set_pci_dma_ops(&dma_iommu_ops);
}
static int __init disable_multitce(char *str)
^ permalink raw reply related
* Re: [PATCH kernel 2/2] powerpc/pseries/dma: Enable swiotlb
From: Thiago Jung Bauermann @ 2019-05-10 22:41 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: Alistair Popple, linuxppc-dev, David Gibson
In-Reply-To: <20190507062559.20295-3-aik@ozlabs.ru>
Hello Alexey,
Thanks!
I have similar changes in my "Secure Virtual Machine Enablement"
patches, which I am currently preparing for posting again real soon now.
This is the last version:
https://lore.kernel.org/linuxppc-dev/20180824162535.22798-1-bauerman@linux.ibm.com/
Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> So far the pseries platforms has always been using IOMMU making SWIOTLB
> unnecessary. Now we want secure guests which means devices can only
> access certain areas of guest physical memory; we are going to use
> SWIOTLB for this purpose.
>
> This allows SWIOTLB for pseries. By default there is no change in behavior.
>
> This enables SWIOTLB when the "swiotlb" kernel parameter is set to "force".
>
> With the SWIOTLB enabled, the kernel creates a directly mapped DMA window
> (using the usual DDW mechanism) and implements SWIOTLB on top of that.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> arch/powerpc/platforms/pseries/setup.c | 5 +++++
> arch/powerpc/platforms/pseries/Kconfig | 1 +
> 2 files changed, 6 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
> index e4f0dfd4ae33..30d72b587ac5 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -42,6 +42,7 @@
> #include <linux/of.h>
> #include <linux/of_pci.h>
> #include <linux/memblock.h>
> +#include <linux/swiotlb.h>
>
> #include <asm/mmu.h>
> #include <asm/processor.h>
> @@ -71,6 +72,7 @@
> #include <asm/isa-bridge.h>
> #include <asm/security_features.h>
> #include <asm/asm-const.h>
> +#include <asm/swiotlb.h>
>
> #include "pseries.h"
> #include "../../../../drivers/pci/pci.h"
> @@ -797,6 +799,9 @@ static void __init pSeries_setup_arch(void)
> }
>
> ppc_md.pcibios_root_bridge_prepare = pseries_root_bridge_prepare;
> +
> + if (swiotlb_force == SWIOTLB_FORCE)
> + ppc_swiotlb_enable = 1;
> }
Yep! I have this here, enabled when booting as a secure guest:
https://lore.kernel.org/linuxppc-dev/20180824162535.22798-6-bauerman@linux.ibm.com/
And also another patch which makes it so that if booting as a secure
guest it acts as if the swiotlb kernel parameter was set to force:
https://lore.kernel.org/linuxppc-dev/20180824162535.22798-11-bauerman@linux.ibm.com/
> static void pseries_panic(char *str)
> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
> index 9c6b3d860518..b9e8b608de01 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -23,6 +23,7 @@ config PPC_PSERIES
> select ARCH_RANDOM
> select PPC_DOORBELL
> select FORCE_SMP
> + select SWIOTLB
> default y
>
> config PPC_SPLPAR
I put this in a PPC_SVM config option:
https://lore.kernel.org/linuxppc-dev/20180824162535.22798-3-bauerman@linux.ibm.com/
--
Thiago Jung Bauermann
IBM Linux Technology Center
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox