LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH NEXT 1/4] powerpc/pasemi: Add PCI initialisation for Nemo board.
From: Michael Ellerman @ 2018-07-11 14:04 UTC (permalink / raw)
  To: Christian Zigotzky; +Cc: Darren Stevens, linuxppc-dev
In-Reply-To: <196ECBF7-5001-4B0B-9FE3-F079CE0996EE@xenosoft.de>

Christian Zigotzky <chzigotzky@xenosoft.de> writes:

> Hello Michael,
>
> Thanks a lot for your reply. OK, first I would like to add=20
>
> pr_info("NEMO SB600 IOB base %08llx\n",res.start)
>
> to the Nemo patch. Is this line correct now?

Yes I think so.

The type of start is resource_size_t, which can be 32 or 64-bits. But in
this code it should always be 64-bit.

> After that I will try to contact Darren because of your other comments.
>
> If I don=E2=80=99t reach Darren then I will try to fix the issues but I t=
hink
> I need your help for fixing them.

Sure.

cheers

^ permalink raw reply

* Re: [PATCH 1/2] powerpc: Add ppc32_allmodconfig defconfig target
From: Michael Ellerman @ 2018-07-11 13:50 UTC (permalink / raw)
  To: Mathieu Malaterre; +Cc: linuxppc-dev, Nicholas Piggin, Randy Dunlap
In-Reply-To: <CA+7wUszg2EnuExY=Wx0JWebvFmdYwRSN2nvjkvwTtmD1vS4AJQ@mail.gmail.com>

Mathieu Malaterre <malat@debian.org> writes:
> On Tue, Jul 10, 2018 at 3:47 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>> Mathieu Malaterre <malat@debian.org> writes:
>> > On Mon, Jul 9, 2018 at 4:24 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>> >> Because the allmodconfig logic just sets every symbol to M or Y, it
>> >> has the effect of always generating a 64-bit config, because
>> >> CONFIG_PPC64 becomes Y.
>> >>
>> >> So to make it easier for folks to test 32-bit code, provide a phony
>> >> defconfig target that generates a 32-bit allmodconfig.
>> >>
>> >> The 32-bit port has several mutually exclusive CPU types, we choose
>> >> the Book3S variants as that's what the help text in Kconfig says is
>> >> most common.
>> >
>> > Ok then.
>>
>> That was just me taking a stab in the dark. You suggested we should
>> mimic the Debian config, what does that use?
>
> Sorry got confused for a minute. This is the correct value (at least
> the one used in Debian).

OK cool.

>> >> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
>> >> index 2ea575cb3401..2556c2182789 100644
>> >> --- a/arch/powerpc/Makefile
>> >> +++ b/arch/powerpc/Makefile
>> >> @@ -354,6 +354,11 @@ mpc86xx_smp_defconfig:
>> >>         $(call merge_into_defconfig,mpc86xx_basic_defconfig,\
>> >>                 86xx-smp 86xx-hw fsl-emb-nonhw)
>> >>
>> >> +PHONY += ppc32_allmodconfig
>> >> +ppc32_allmodconfig:
>> >> +       $(Q)$(MAKE) KCONFIG_ALLCONFIG=$(srctree)/arch/powerpc/configs/book3s_32.config \
>> >> +               -f $(srctree)/Makefile allmodconfig
>> >> +
>> >
>> > I this a good time to update line 34 at the same time:
>> >
>> > KBUILD_DEFCONFIG := $(shell uname -m)_defconfig
>> >
>> > ?
>>
>> 34 or 36?
>>
>>   ifeq ($(CROSS_COMPILE),)
>>   KBUILD_DEFCONFIG := $(shell uname -m)_defconfig
>>   else
>>   KBUILD_DEFCONFIG := ppc64_defconfig
>>   endif
>>
>> Do you mean the else case?
>
> As far as I know uname -m on powerpc returns 'ppc' so the following
> has always failed from a ppc32be machine:

Oh yep it does.

I've never built a kernel *on* a 32-bit machine ;)

> $ make ARCH=powerpc defconfig
>
> I was simply suggesting to mimic what was done for ppc64:
>
> ifeq ($(CROSS_COMPILE),)
> KBUILD_DEFCONFIG := ppc32_defconfig
> else
> KBUILD_DEFCONFIG := ppc64_defconfig
> endif

That wouldn't work, CROSS_COMPILE isn't a ppc/ppc64 thing.

In fact setting CROSS_COMPILE doesn't actually mean you're cross
compiling, it just means you're using a different toolchain. I do that
all the time, because I want to use a specific version of GCC, not the
distro one.


What about:

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 2ea575c..e70d223 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -30,10 +30,13 @@ endif
 endif
 endif
 
-ifeq ($(CROSS_COMPILE),)
-KBUILD_DEFCONFIG := $(shell uname -m)_defconfig
+host_arch := $(shell uname -m)
+ifeq ($(host_arch),ppc)
+       KBUILD_DEFCONFIG := ppc32_defconfig
+else ifeq ($(host_arch),ppc64)
+       KBUILD_DEFCONFIG := ppc64_defconfig
 else
-KBUILD_DEFCONFIG := ppc64_defconfig
+       KBUILD_DEFCONFIG := ppc64le_defconfig
 endif
 
 ifeq ($(CONFIG_PPC64),y)


We obviously need a ppc32_defconfig to make that work.

> If I followed the discussion one would want the file `ppc32_defconfig`
> to contains pretty much the same thing as the .config generated from
> `book3s_32.config`, right ?

Can you boot the resulting kernel if you build ppc32_allmodconfig ?

Maybe we should just make that be ppc32_defconfig.

cheers

^ permalink raw reply related

* Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)
From: Oscar Salvador @ 2018-07-11 13:37 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: akpm, broonie, mhocko, sfr, linux-next, linux-fsdevel, linux-mm,
	linux-kernel, mm-commits, linuxppc-dev, bhe, pasha.tatashin,
	Aneesh Kumar K.V, Anshuman Khandual
In-Reply-To: <87lgai9bt5.fsf@concordia.ellerman.id.au>

On Wed, Jul 11, 2018 at 10:49:58PM +1000, Michael Ellerman wrote:
> akpm@linux-foundation.org writes:
> > The mm-of-the-moment snapshot 2018-07-10-16-50 has been uploaded to
> >
> >    http://www.ozlabs.org/~akpm/mmotm/
> ...
> 
> > * mm-sparse-add-a-static-variable-nr_present_sections.patch
> > * mm-sparsemem-defer-the-ms-section_mem_map-clearing.patch
> > * mm-sparsemem-defer-the-ms-section_mem_map-clearing-fix.patch
> > * mm-sparse-add-a-new-parameter-data_unit_size-for-alloc_usemap_and_memmap.patch
> > * mm-sparse-optimize-memmap-allocation-during-sparse_init.patch
> > * mm-sparse-optimize-memmap-allocation-during-sparse_init-checkpatch-fixes.patch
> 
> > * mm-sparse-remove-config_sparsemem_alloc_mem_map_together.patch
> 
> This seems to be breaking my powerpc pseries qemu boots.
> 
> The boot log with some extra debug shows eg:
> 
>   $ make pseries_le_defconfig

Could you please share the config?
I was not able to find such config in the kernel tree.
-- 
Oscar Salvador
SUSE L3

^ permalink raw reply

* Re: [v2] powerpc: mpc5200: Remove VLA usage
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Kees Cook
  Cc: Arnd Bergmann, linux-kernel, Paul Mackerras, Anatolij Gustschin,
	linuxppc-dev
In-Reply-To: <20180702155621.GA16047@beast>

On Mon, 2018-07-02 at 15:56:21 UTC, Kees Cook wrote:
> In the quest to remove all stack VLA usage from the kernel[1], this
> switches to using a stack size large enough for the saved routine and
> adds a sanity check making sure the routine doesn't overflow into the
> 0x600 exception handler.
> 
> [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
> 
> Signed-off-by: Kees Cook <keescook@chromium.org>
> Reviewed-by: Arnd Bergmann <arnd@arndb.de>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/741c5640a15a23a2ec3a0846668a82

cheers

^ permalink raw reply

* Re: powerpc: Remove memtrace mmap
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Michael Neuling; +Cc: mikey, linuxppc-dev, rashmica.g
In-Reply-To: <20180629001209.18523-1-mikey@neuling.org>

On Fri, 2018-06-29 at 00:12:09 UTC, Michael Neuling wrote:
> debugfs doesn't support mmap, so this code is never used.
> 
> Signed-off-by: Michael Neuling <mikey@neuling.org>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/7dea6f2f053599d90f7894216db0dd

cheers

^ permalink raw reply

* Re: [v2, 01/10] Revert "cxl: Add kernel API to allow a context to operate with relocate disabled"
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Frederic Barrat, alastair, andrew.donnellan, vaibhav, clombard,
	felix, linuxppc-dev
  Cc: huyn
In-Reply-To: <20180628100509.17413-2-fbarrat@linux.ibm.com>

On Thu, 2018-06-28 at 10:05:00 UTC, Frederic Barrat wrote:
> From: Alastair D'Silva <alastair@d-silva.org>
> 
> Remove abandonned capi support for the Mellanox CX4.
> The symbol 'cxl_set_translation_mode' is never called, so
> ctx->real_mode is always false.
> 
> This reverts commit 7a0d85d313c2066712e530e668bc02bb741a685c.
> 
> Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c5828150067c47a97f30e690a472e0

cheers

^ permalink raw reply

* Re: [v3,1/2] selftests/powerpc: Fix strncpy usage
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Breno Leitao, linuxppc-dev; +Cc: Breno Leitao, Anshuman Khandual
In-Reply-To: <1530019213-2347-1-git-send-email-leitao@debian.org>

On Tue, 2018-06-26 at 13:20:12 UTC, Breno Leitao wrote:
> There is a buffer overflow in dscr_inherit_test.c test. In main(), strncpy()'s
> third argument is the length of the source, not the size of the destination
> buffer, which makes strncpy() behaves like strcpy(), causing a buffer overflow
> if argv[0] is bigger than LEN_MAX (100).
> 
> This patch maps 'prog' to the argv[0] memory region, removing the static
> allocation and the LEN_MAX size restriction.
> 
> CC: Michael Ellerman <mpe@ellerman.id.au>
> CC: Segher Boessenkool <segher@kernel.crashing.org>
> CC: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> Signed-off-by: Breno Leitao <leitao@debian.org>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/09a61e894ac852fb063ee0b54fc513

cheers

^ permalink raw reply

* Re: powerpc/pci: Remove legacy debug code
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Breno Leitao, linuxppc-dev; +Cc: Breno Leitao
In-Reply-To: <1529360102-8206-1-git-send-email-leitao@debian.org>

On Mon, 2018-06-18 at 22:15:02 UTC, Breno Leitao wrote:
> Commit 59f47eff03a0 ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")
> removed the 'oirq' variable, but kept memsetting it when the DEBUG macro is
> defined.
> 
> When setting DEBUG macro for debugging purpose, the kernel fails to build since
> 'oirq' is not defined anymore.
> 
> This patch simply remove the debug block, since it does not seem to sense
> now.
> 
> Fixes: 59f47eff03a08c ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/3bfb450ee7b52d41ce2117738ee9c1

cheers

^ permalink raw reply

* Re: [v2,1/2] powerpc: Document issues with the DAWR on POWER9
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Michael Neuling; +Cc: mikey, linuxppc-dev
In-Reply-To: <20180625013456.16157-1-mikey@neuling.org>

On Mon, 2018-06-25 at 01:34:55 UTC, Michael Neuling wrote:
> Signed-off-by: Michael Neuling <mikey@neuling.org>
> Acked-by: Stewart Smith <stewart@linux.ibm.com>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c6bef2e9e50ca27987aae90147511b

cheers

^ permalink raw reply

* Re: ocxl: Fix page fault handler in case of fault on dying process
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Frederic Barrat, andrew.donnellan, alastair, linuxppc-dev
  Cc: clombard, vaibhav
In-Reply-To: <20180618121436.20479-1-fbarrat@linux.ibm.com>

On Mon, 2018-06-18 at 12:14:36 UTC, Frederic Barrat wrote:
> If a process exits without doing proper cleanup, there's a window
> where an opencapi device can try to access the memory of the dying
> process and may trigger a page fault. That's an expected scenario and
> the ocxl driver holds a reference on the mm_struct of the process
> until the opencapi device is notified of the process exiting.
> However, if mm_users is already at 0, i.e. the address space of the
> process has already been destroyed, the driver shouldn't try resolving
> the page fault, as it will fail, but it can also try accessing already
> freed data.
> 
> It is fixed by only calling the bottom half of the page fault handler
> if mm_users is greater than 0 and get a reference on mm_users instead
> of mm_count. Otherwise, we can safely return a translation fault to
> the device, as its associated memory context is being removed. The
> opencapi device will be properly cleaned up shortly after when closing
> the file descriptors.
> 
> Fixes: 5ef3166e8a32 ("ocxl: Driver code for 'generic' opencapi devices")
> Cc: stable@vger.kernel.org # v4.16+
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> Reviewed-By: Alastair D'Silva <alastair@d-silva.org>
> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/d497ebf5fb3a026c0817f8c96cde57

cheers

^ permalink raw reply

* Re: powerpc: xmon: use ktime_get_coarse_boottime64
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Arnd Bergmann, Benjamin Herrenschmidt, Paul Mackerras
  Cc: Arnd Bergmann, y2038, Vaibhav Jain, Nicholas Piggin, linux-kernel,
	linuxppc-dev
In-Reply-To: <20180618095653.1538471-1-arnd@arndb.de>

On Mon, 2018-06-18 at 09:56:24 UTC, Arnd Bergmann wrote:
> get_monotonic_boottime() is deprecated, and may not be safe to call in
> every context, as it has to read a hardware clocksource.
> 
> This changes xmon to print the time using ktime_get_coarse_boottime64()
> instead, which avoids the old timespec type and the HW access.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> Acked-by: Balbir Singh <bsingharora@gmail.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/f6bd74fa084eb9ad573ffbb236a095

cheers

^ permalink raw reply

* Re: powerpc: wii: Remove outdated comment about memory fixups
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Jonathan Neuschäfer, linuxppc-dev
  Cc: Paul Mackerras, Jonathan Neuschäfer, linux-kernel
In-Reply-To: <20180617124909.26828-1-j.neuschaefer@gmx.net>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 477 bytes --]

On Sun, 2018-06-17 at 12:49:06 UTC, =?utf-8?q?Jonathan_Neusch=C3=A4fer?= wrote:
> The workaround has been removed. What stays is just code to find the
> memory hole so the BATs can be configured properly in the function below.
> 
> Fixes: 57deb8fea01f ("powerpc/wii: Don't rely on the reserved memory hack")
> Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8272f598523d0975259080d2f9d760

cheers

^ permalink raw reply

* Re: misc: ocxl: Change return type for fault handler
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Souptick Joarder, willy, fbarrat, andrew.donnellan, arnd, gregkh
  Cc: brajeswar.linux, linuxppc-dev, linux-kernel, sabyasachi.linux
In-Reply-To: <20180611202904.GA25538@jordon-HP-15-Notebook-PC>

On Mon, 2018-06-11 at 20:29:04 UTC, Souptick Joarder wrote:
> Use new return type vm_fault_t for fault handler. For
> now, this is just documenting that the function returns
> a VM_FAULT value rather than an errno. Once all instances
> are converted, vm_fault_t will become a distinct type.
> 
> Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")
> 
> There is an existing bug when vm_insert_pfn() can return
> ENOMEM which was ignored and VM_FAULT_NOPAGE returned as
> default. The new inline vmf_insert_pfn() has removed
> this inefficiency by returning correct vm_fault_ type.
> 
> Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a545cf032d11437ed86e62f00d4991

cheers

^ permalink raw reply

* Re: [kernel] powerpc/powernv/ioda2: Reduce upper limit for DMA window size
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev; +Cc: Alexey Kardashevskiy
In-Reply-To: <20180601080616.29279-1-aik@ozlabs.ru>

On Fri, 2018-06-01 at 08:06:16 UTC, Alexey Kardashevskiy wrote:
> We use PHB in mode1 which uses bit 59 to select a correct DMA window.
> However there is mode2 which uses bits 59:55 and allows up to 32 DMA
> windows per a PE.
> 
> Even though documentation does not clearly specify that, it seems that
> the actual hardware does not support bits 59:55 even in mode1, in other
> words we can create a window as big as 1<<58 but DMA simply won't work.
> 
> This reduces the upper limit from 59 to 55 bits to let the userspace know
> about the hardware limits.
> 
> Fixes: 7aafac11e3 "powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested"
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/d3d4ffaae439981e1e441ebb125aa3

cheers

^ permalink raw reply

* Re: powerpc/eeh: Avoid misleading message "EEH: no capable adapters found"
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Mauro S. M. Rodrigues, linuxppc-dev, ruscur, benh, paulus; +Cc: kernel
In-Reply-To: <1521771052-5973-1-git-send-email-maurosr@linux.vnet.ibm.com>

On Fri, 2018-03-23 at 02:10:52 UTC, "Mauro S. M. Rodrigues" wrote:
> Due to recent refactoring in EEH in:
> commit b9fde58db7e5 ("powerpc/powernv: Rework EEH initialization on
> powernv")
> a misleading message was seen in the kernel message buffer:
> 
> [    0.108431] EEH: PowerNV platform initialized
> [    0.589979] EEH: No capable adapters found
> 
> This happened due to the removal of the initialization delay for powernv
> platform.
> 
> Even though the EEH infrastructure for the devices is eventually
> initialized and still works just fine the eeh device probe step is
> postponed in order to assure the PEs are created. Later
> pnv_eeh_post_init does the probe devices job but at that point the
> message was already shown right after eeh_init flow.
> 
> This patch introduces a new flag EEH_POSTPONED_PROBE to represent that
> temporary state and avoid the message mentioned above and showing the
> follow one instead:
> 
> [    0.107724] EEH: PowerNV platform initialized
> [    4.844825] EEH: PCI Enhanced I/O Error Handling Enabled
> 
> Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
> Acked-by: Russell Currey <ruscur@russell.cc>
> Tested-by:Venkat Rao B <vrbagal1@linux.vnet.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/ee8c446fed99ffdc29dedf7d2a8854

cheers

^ permalink raw reply

* Re: scripts: Add ppc64le support for checkstack.pl
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Breno Leitao, linux-kernel, linuxppc-dev; +Cc: Breno Leitao
In-Reply-To: <1511897374-27408-1-git-send-email-leitao@debian.org>

On Tue, 2017-11-28 at 19:29:34 UTC, Breno Leitao wrote:
> 64-bit ELF v2 ABI specification for POWER describes, on section "General
> Stack Frame Requirements", that the stack should use the following
> instructions when compiled with backchain:
> 
>   mflr r0
>   std  r0, 16(r1)
>   stdu r1, -XX(r1)
> 
> Where XX is the frame size for that function, and this is the value
> checkstack.pl will find the stack size for each function.
> 
> This patch also simplifies the entire Powerpc section, since just two
> type of instructions are used, 'stdu' for 64 bits and 'stwu' for 32 bits
> platform.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8449a4cb69ab6fbb873d653a82787a

cheers

^ permalink raw reply

* Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)
From: Baoquan He @ 2018-07-11 13:12 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: akpm, broonie, mhocko, sfr, linux-next, linux-fsdevel, linux-mm,
	linux-kernel, mm-commits, linuxppc-dev, pasha.tatashin,
	Aneesh Kumar K.V, Anshuman Khandual
In-Reply-To: <87lgai9bt5.fsf@concordia.ellerman.id.au>

Hi Michael,

On 07/11/18 at 10:49pm, Michael Ellerman wrote:
> akpm@linux-foundation.org writes:
> > The mm-of-the-moment snapshot 2018-07-10-16-50 has been uploaded to
> >
> >    http://www.ozlabs.org/~akpm/mmotm/
> ...
> 
> > * mm-sparse-add-a-static-variable-nr_present_sections.patch
> > * mm-sparsemem-defer-the-ms-section_mem_map-clearing.patch
> > * mm-sparsemem-defer-the-ms-section_mem_map-clearing-fix.patch
> > * mm-sparse-add-a-new-parameter-data_unit_size-for-alloc_usemap_and_memmap.patch
> > * mm-sparse-optimize-memmap-allocation-during-sparse_init.patch
> > * mm-sparse-optimize-memmap-allocation-during-sparse_init-checkpatch-fixes.patch
> 
> > * mm-sparse-remove-config_sparsemem_alloc_mem_map_together.patch
> 
> This seems to be breaking my powerpc pseries qemu boots.
> 
> The boot log with some extra debug shows eg:
> 
>   $ make pseries_le_defconfig
>   $ qemu-system-ppc64 -nographic -vga none -M pseries -m 2G -kernel vmlinux 
>   vmemmap_populate f000000000000000..f000000000024000, node 0
>         * f000000000000000..f000000001000000 allocated at c000000076000000
>   hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x76000000
>   hash__vmemmap_create_mapping: failed -1
> 
>   <repeated many times>
> 
> Then there's lots of other warnings about bad page states and eventually
> a NULL deref and we panic().
> 
> 
> The problem seems to be that we're calling down into
> hash__vmemmap_create_mapping() for every call to vmemmap_populate(),
> whereas previously we would only call hash__vmemmap_create_mapping()
> once because our vmemmap_populated() would return true.
> 
> There's actually a comment in sparse_init() that says:
> 
> 	 * powerpc need to call sparse_init_one_section right after each
> 	 * sparse_early_mem_map_alloc, so allocate usemap_map at first.
> 
> So changing that behaviour does seem to be the problem.
> 
> I assume that comment is talking about the fact that we use pfn_valid()
> in vmemmap_populated().
> 
> I'm not clear on how to fix it though.

Have you tried reverting that patch and building kernel to test again?
Does it work?

^ permalink raw reply

* [PATCH] [RESEND] powerpc: xmon: use ktime_get_coarse_boottime64
From: Arnd Bergmann @ 2018-07-11 12:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: y2038, Arnd Bergmann, Balbir Singh, Nicholas Piggin, Vaibhav Jain,
	Breno Leitao, Mathieu Malaterre, Yisheng Xie, linuxppc-dev,
	linux-kernel

get_monotonic_boottime() is deprecated, and may not be safe to call in
every context, as it has to read a hardware clocksource.

This changes xmon to print the time using ktime_get_coarse_boottime64()
instead, which avoids the old timespec type and the HW access.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Originally sent Jun 18, but this hasn't appeared in linux-next yet.

Resending to make sure this is still on the radar. Please apply
to the powerpc git for 4.19
---
 arch/powerpc/xmon/xmon.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 47166ad2a669..45e3d0ec1246 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -918,13 +918,13 @@ static void remove_cpu_bpts(void)
 static void
 show_uptime(void)
 {
-	struct timespec uptime;
+	struct timespec64 uptime;
 
 	if (setjmp(bus_error_jmp) == 0) {
 		catch_memory_errors = 1;
 		sync();
 
-		get_monotonic_boottime(&uptime);
+		ktime_get_coarse_boottime_ts64(&uptime);
 		printf("Uptime: %lu.%.2lu seconds\n", (unsigned long)uptime.tv_sec,
 			((unsigned long)uptime.tv_nsec / (NSEC_PER_SEC/100)));
 
-- 
2.9.0

^ permalink raw reply related

* Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)
From: Michael Ellerman @ 2018-07-11 12:49 UTC (permalink / raw)
  To: akpm, broonie, mhocko, sfr, linux-next, linux-fsdevel, linux-mm,
	linux-kernel, mm-commits, linuxppc-dev, bhe, pasha.tatashin,
	Aneesh Kumar K.V, Anshuman Khandual
In-Reply-To: <20180710235044.vjlRV%akpm@linux-foundation.org>

akpm@linux-foundation.org writes:
> The mm-of-the-moment snapshot 2018-07-10-16-50 has been uploaded to
>
>    http://www.ozlabs.org/~akpm/mmotm/
...

> * mm-sparse-add-a-static-variable-nr_present_sections.patch
> * mm-sparsemem-defer-the-ms-section_mem_map-clearing.patch
> * mm-sparsemem-defer-the-ms-section_mem_map-clearing-fix.patch
> * mm-sparse-add-a-new-parameter-data_unit_size-for-alloc_usemap_and_memmap.patch
> * mm-sparse-optimize-memmap-allocation-during-sparse_init.patch
> * mm-sparse-optimize-memmap-allocation-during-sparse_init-checkpatch-fixes.patch

> * mm-sparse-remove-config_sparsemem_alloc_mem_map_together.patch

This seems to be breaking my powerpc pseries qemu boots.

The boot log with some extra debug shows eg:

  $ make pseries_le_defconfig
  $ qemu-system-ppc64 -nographic -vga none -M pseries -m 2G -kernel vmlinux 
  ...
  vmemmap_populate f000000000000000..f000000000004000, node 0
        * f000000000000000..f000000001000000 allocated at c00000007e000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x7e000000
  vmemmap_populate f000000000000000..f000000000008000, node 0
        * f000000000000000..f000000001000000 allocated at c00000007d000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x7d000000
  vmemmap_populate f000000000000000..f00000000000c000, node 0
        * f000000000000000..f000000001000000 allocated at c00000007c000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x7c000000
  vmemmap_populate f000000000000000..f000000000010000, node 0
        * f000000000000000..f000000001000000 allocated at c00000007b000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x7b000000
  vmemmap_populate f000000000000000..f000000000014000, node 0
        * f000000000000000..f000000001000000 allocated at c00000007a000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x7a000000
  vmemmap_populate f000000000000000..f000000000018000, node 0
        * f000000000000000..f000000001000000 allocated at c000000079000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x79000000
  vmemmap_populate f000000000000000..f00000000001c000, node 0
        * f000000000000000..f000000001000000 allocated at c000000078000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x78000000
  vmemmap_populate f000000000000000..f000000000020000, node 0
        * f000000000000000..f000000001000000 allocated at c000000077000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x77000000
  vmemmap_populate f000000000000000..f000000000024000, node 0
        * f000000000000000..f000000001000000 allocated at c000000076000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x76000000
  hash__vmemmap_create_mapping: failed -1

  <repeated many times>

Then there's lots of other warnings about bad page states and eventually
a NULL deref and we panic().


The problem seems to be that we're calling down into
hash__vmemmap_create_mapping() for every call to vmemmap_populate(),
whereas previously we would only call hash__vmemmap_create_mapping()
once because our vmemmap_populated() would return true.

There's actually a comment in sparse_init() that says:

	 * powerpc need to call sparse_init_one_section right after each
	 * sparse_early_mem_map_alloc, so allocate usemap_map at first.

So changing that behaviour does seem to be the problem.

I assume that comment is talking about the fact that we use pfn_valid()
in vmemmap_populated().

I'm not clear on how to fix it though.

Any ideas?

cheers

^ permalink raw reply

* [PATCH kernel v6 0/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Alexey Kardashevskiy @ 2018-07-11 11:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Aneesh Kumar K.V,
	Alex Williamson, Michael Ellerman, Nicholas Piggin,
	Paul Mackerras, Balbir Singh

This is to improve page boundaries checking and should probably
be cc:stable. I came accross this while debugging nvlink2 passthrough
but the lack of checking might be exploited by the existing userspace.

The get_user_pages() comment says it should be "phased out" but the only
alternative seems to be get_user_pages_longterm(), should that be used
instead (this is longterm reference elevation, however it is not DAX,
whatever this implies)? get_user_pages_remote() seems unnecessarily
complicated because of @locked.

Changes:
v6:
* 2/2: read pageshift from pte

v5:
* 2/2: changed compound pages handling

v4:
* 2/2: implemented less strict but still safe max pageshift as David suggested

v3:
* enforced huge pages not to cross preregistered chunk boundaries

v2:
* 2/2: explicitly check for compound pages before calling compound_order()

This is based on sha1
1e4b044 Linus Torvalds "Linux 4.18-rc4".

Please comment. Thanks.

Alexey Kardashevskiy (2):
  vfio/spapr: Use IOMMU pageshift rather than pagesize
  KVM: PPC: Check if IOMMU page is contained in the pinned physical page

 arch/powerpc/include/asm/mmu_context.h |  4 ++--
 arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
 arch/powerpc/mm/mmu_context_iommu.c    | 39 ++++++++++++++++++++++++++++++++--
 drivers/vfio/vfio_iommu_spapr_tce.c    | 10 ++++-----
 5 files changed, 49 insertions(+), 12 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH kernel v6 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Alexey Kardashevskiy @ 2018-07-11 11:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Aneesh Kumar K.V,
	Alex Williamson, Michael Ellerman, Nicholas Piggin,
	Paul Mackerras, Balbir Singh
In-Reply-To: <20180711110044.15939-1-aik@ozlabs.ru>

A VM which has:
 - a DMA capable device passed through to it (eg. network card);
 - running a malicious kernel that ignores H_PUT_TCE failure;
 - capability of using IOMMU pages bigger that physical pages
can create an IOMMU mapping that exposes (for example) 16MB of
the host physical memory to the device when only 64K was allocated to the VM.

The remaining 16MB - 64K will be some other content of host memory, possibly
including pages of the VM, but also pages of host kernel memory, host
programs or other VMs.

The attacking VM does not control the location of the page it can map,
and is only allowed to map as many pages as it has pages of RAM.

We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
an IOMMU page is contained in the physical page so the PCI hardware won't
get access to unassigned host memory; however this check is missing in
the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
did not hit this yet as the very first time when the mapping happens
we do not have tbl::it_userspace allocated yet and fall back to
the userspace which in turn calls VFIO IOMMU driver, this fails and
the guest does not retry,

This stores the smallest preregistered page size in the preregistered
region descriptor and changes the mm_iommu_xxx API to check this against
the IOMMU page size.

This calculates maximum page size as a minimum of the natural region
alignment and compound page size. For the page shift this uses the shift
returned by find_linux_pte() which indicates how the page is mapped to
the current userspace - if the page is huge and this is not a zero, then
it is a leaf pte and the page is mapped within the range.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v6:
* replaced hugetlbfs with pageshift from find_linux_pte()

v5:
* only consider compound pages from hugetlbfs

v4:
* reimplemented max pageshift calculation

v3:
* fixed upper limit for the page size
* added checks that we don't register parts of a huge page

v2:
* explicitely check for compound pages before calling compound_order()

---
The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
advertise 16MB pages to the guest; a typical pseries guest will use 16MB
for IOMMU pages without checking the mmu pagesize and this will fail
at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256

With the change, mapping will fail in KVM and the guest will print:

mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
mlx5_core 0000:00:00.0: failed to map direct window for /pci@800000020000000/ethernet@0: -1
---
 arch/powerpc/include/asm/mmu_context.h |  4 ++--
 arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
 arch/powerpc/mm/mmu_context_iommu.c    | 39 ++++++++++++++++++++++++++++++++--
 drivers/vfio/vfio_iommu_spapr_tce.c    |  2 +-
 5 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 896efa5..79d570c 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -35,9 +35,9 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa);
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
 extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa);
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index d066e37..8c456fa 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -449,7 +449,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
 		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
 		return H_TOO_HARD;
 
-	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
+	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
 		return H_HARDWARE;
 
 	if (mm_iommu_mapped_inc(mem))
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 925fc31..5b298f5 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -279,7 +279,8 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
 	if (!mem)
 		return H_TOO_HARD;
 
-	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
+	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, tbl->it_page_shift,
+			&hpa)))
 		return H_HARDWARE;
 
 	pua = (void *) vmalloc_to_phys(pua);
@@ -469,7 +470,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
 		if (mem)
-			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
+			prereg = mm_iommu_ua_to_hpa_rm(mem, ua,
+					IOMMU_PAGE_SHIFT_4K, &tces) == 0;
 	}
 
 	if (!prereg) {
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index abb4364..0bb53a7 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -19,6 +19,7 @@
 #include <linux/hugetlb.h>
 #include <linux/swap.h>
 #include <asm/mmu_context.h>
+#include <asm/pte-walk.h>
 
 static DEFINE_MUTEX(mem_list_mutex);
 
@@ -27,6 +28,7 @@ struct mm_iommu_table_group_mem_t {
 	struct rcu_head rcu;
 	unsigned long used;
 	atomic64_t mapped;
+	unsigned int pageshift;
 	u64 ua;			/* userspace address */
 	u64 entries;		/* number of entries in hpas[] */
 	u64 *hpas;		/* vmalloc'ed */
@@ -125,6 +127,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 {
 	struct mm_iommu_table_group_mem_t *mem;
 	long i, j, ret = 0, locked_entries = 0;
+	unsigned int pageshift;
+	unsigned long flags;
 	struct page *page = NULL;
 
 	mutex_lock(&mem_list_mutex);
@@ -159,6 +163,12 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		goto unlock_exit;
 	}
 
+	/*
+	 * For a starting point for a maximum page size calculation
+	 * we use @ua and @entries natural alignment to allow IOMMU pages
+	 * smaller than huge pages but still bigger than PAGE_SIZE.
+	 */
+	mem->pageshift = __ffs(ua | (entries << PAGE_SHIFT));
 	mem->hpas = vzalloc(array_size(entries, sizeof(mem->hpas[0])));
 	if (!mem->hpas) {
 		kfree(mem);
@@ -199,6 +209,25 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 			}
 		}
 populate:
+		pageshift = PAGE_SHIFT;
+		if (PageCompound(page)) {
+			pte_t *pte;
+			struct page *head = compound_head(page);
+			unsigned int compshift = compound_order(head);
+
+			local_irq_save(flags); /* disables as well */
+			pte = find_linux_pte(mm->pgd, ua, NULL, &pageshift);
+			local_irq_restore(flags);
+			if (!pte) {
+				ret = -EFAULT;
+				goto unlock_exit;
+			}
+			/* Double check it is still the same pinned page */
+			if (pte_page(*pte) == head && pageshift == compshift)
+				pageshift = max_t(unsigned int, pageshift,
+						PAGE_SHIFT);
+		}
+		mem->pageshift = min(mem->pageshift, pageshift);
 		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
 	}
 
@@ -349,7 +378,7 @@ struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(mm_iommu_find);
 
 long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa)
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
 	u64 *va = &mem->hpas[entry];
@@ -357,6 +386,9 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 	if (entry >= mem->entries)
 		return -EFAULT;
 
+	if (pageshift > mem->pageshift)
+		return -EFAULT;
+
 	*hpa = *va | (ua & ~PAGE_MASK);
 
 	return 0;
@@ -364,7 +396,7 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
 long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa)
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
 	void *va = &mem->hpas[entry];
@@ -373,6 +405,9 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 	if (entry >= mem->entries)
 		return -EFAULT;
 
+	if (pageshift > mem->pageshift)
+		return -EFAULT;
+
 	pa = (void *) vmalloc_to_phys(va);
 	if (!pa)
 		return -EFAULT;
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 2da5f05..7cd63b0 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -467,7 +467,7 @@ static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
 	if (!mem)
 		return -EINVAL;
 
-	ret = mm_iommu_ua_to_hpa(mem, tce, phpa);
+	ret = mm_iommu_ua_to_hpa(mem, tce, shift, phpa);
 	if (ret)
 		return -EINVAL;
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH kernel v6 1/2] vfio/spapr: Use IOMMU pageshift rather than pagesize
From: Alexey Kardashevskiy @ 2018-07-11 11:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Aneesh Kumar K.V,
	Alex Williamson, Michael Ellerman, Nicholas Piggin,
	Paul Mackerras, Balbir Singh
In-Reply-To: <20180711110044.15939-1-aik@ozlabs.ru>

The size is always equal to 1 page so let's use this. Later on this will
be used for other checks which use page shifts to check the granularity
of access.

This should cause no behavioral change.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

As Alex suggested, this should go via the ppc tree which the next patch
is going to (which is ppc-kvm).
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 759a5bd..2da5f05 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -457,13 +457,13 @@ static void tce_iommu_unuse_page(struct tce_container *container,
 }
 
 static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
-		unsigned long tce, unsigned long size,
+		unsigned long tce, unsigned long shift,
 		unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
 {
 	long ret = 0;
 	struct mm_iommu_table_group_mem_t *mem;
 
-	mem = mm_iommu_lookup(container->mm, tce, size);
+	mem = mm_iommu_lookup(container->mm, tce, 1ULL << shift);
 	if (!mem)
 		return -EINVAL;
 
@@ -487,7 +487,7 @@ static void tce_iommu_unuse_page_v2(struct tce_container *container,
 	if (!pua)
 		return;
 
-	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl),
+	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, tbl->it_page_shift,
 			&hpa, &mem);
 	if (ret)
 		pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
@@ -611,7 +611,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
 				entry + i);
 
 		ret = tce_iommu_prereg_ua_to_hpa(container,
-				tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
+				tce, tbl->it_page_shift, &hpa, &mem);
 		if (ret)
 			break;
 
-- 
2.11.0

^ permalink raw reply related

* Re: [RFC PATCH] powerpc/64s: Move ISAv3.0 / POWER9 idle code to powernv C code
From: Gautham R Shenoy @ 2018-07-11 10:24 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Gautham R Shenoy, linuxppc-dev, Vaidyanathan Srinivasan,
	Akshay Adiga, huntbag
In-Reply-To: <20180711183036.02c006d3@roar.ozlabs.ibm.com>

On Wed, Jul 11, 2018 at 06:30:36PM +1000, Nicholas Piggin wrote:
> On Tue, 10 Jul 2018 16:36:34 +0530
> Gautham R Shenoy <ego@linux.vnet.ibm.com> wrote:
> 
> > Hello Nicholas,
> > 
> > 
> > On Mon, Jul 09, 2018 at 12:24:36AM +1000, Nicholas Piggin wrote:
> > > Reimplement POWER9 idle code in C, in the powernv platform code.
> > > Assembly stubs are used to save and restore the stack frame and
> > > non-volatile GPRs before going to idle, but these are small and
> > > mostly agnostic to microarchitecture implementation details.
> > >  
> > 
> > Thanks for this patch.  It is indeed hard to maneuver through the
> > current assembly code and change things without introducing new bugs.
> 
> Hey thanks for the very good review. I don't mean to disparage the
> existing asm code too much :) We were doing our best there to get
> something working, now I think it's a good time to step back and see
> what we can improve.
> 
> > > POWER7/8 code is not converted (yet), but that's not a moving
> > > target, and it doesn't make you want to claw your eyes out so
> > > much with the POWER9 code untangled from it.
> > > 
> > > The optimisation where EC=ESL=0 idle modes did not have to save
> > > GPRs or mtmsrd L=0 is restored, because it's simple to do.
> > > 
> > > Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
> > > but saves and restores them all explicitly.  
> > 
> > Right! The ->cpu_restore call in the current code is rendered
> > ineffective by "restore_additional_sprs" which is called after the
> > cpu_restore. 
> 
> Yes, I would actually like to do an initial incremental patch for this,
> and remove it from P7/8 CPU types as well. I think that's quite easy to
> do, and a useful cleanup to remove it entirely.
> 
> > > diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
> > > index e210a83eb196..b668f030d531 100644
> > > --- a/arch/powerpc/include/asm/cpuidle.h
> > > +++ b/arch/powerpc/include/asm/cpuidle.h
> > > @@ -28,6 +28,7 @@
> > >   * yet woken from the winkle state.
> > >   */
> > >  #define PNV_CORE_IDLE_LOCK_BIT			0x10000000
> > > +#define NR_PNV_CORE_IDLE_LOCK_BIT		28  
> > 
> > We can define PNV_CORE_IDLE_LOCK_BIT mask based on
> > NR_PNV_CORE_IDLE_LOCK_BIT ?
> 
> Yep good point.
> 
> > > diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> > > index 76a14702cb9c..36b5f0e18c0c 100644
> > > --- a/arch/powerpc/kernel/exceptions-64s.S
> > > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > > @@ -135,8 +135,14 @@ TRAMP_KVM(PACA_EXNMI, 0x100)
> > > 
> > >  #ifdef CONFIG_PPC_P7_NAP
> > >  EXC_COMMON_BEGIN(system_reset_idle_common)
> > > +BEGIN_FTR_SECTION
> > > +	mfspr	r3,SPRN_SRR1
> > > +	bltlr	cr3	/* no state loss, return to idle caller */  
> > 
> > This is a nice optimization. But perhaps needs a carefully checked.
> > We wakeup at SYSTEM_RESET only when we have executed stop via the
> > mayloss() call which creates an additional stack frame. When we return
> > back to the caller, we will go back with that stack frame. Perhaps we
> > need to jump to a isa3_idle_wake_gpr_noloss ?
> 
> I wasn't 100% sure of all this so yes it's something I do need to
> verify carefully. I did test a few different cases in mambo and stepped
> through things, but could easily have bugs.
> 
> I could have got something wrong, but I think mayloss does not create
> an additional stack frame, so this works okay. It's just using red zone.
> ... I think...
> 
> > > +_GLOBAL(isa3_idle_stop_mayloss)
> > > +	std	r1,PACAR1(r13)
> > > +	mflr	r4
> > > +	mfcr	r5
> > > +	/* use stack red zone rather than a new frame */
> > > +	addi	r6,r1,-INT_FRAME_SIZE
> > > +	SAVE_GPR(2, r6)
> > > +	SAVE_NVGPRS(r6)
> > > +	std	r4,_LINK(r6)
> > > +	std	r5,_CCR(r6)
> > > +	mtspr 	SPRN_PSSCR,r3
> > > +	PPC_STOP  
> > 
> > 
> > Suppose we enter into a isa3_idle_stop_may_loss (ESL=EC=1), but as
> > soon as we execute PPC_STOP, if we get interrupted, we will go back to
> > SYSTEM_RESET vector.
> > 
> > The SRR1[46:47] should be 0x1, due to which we will do a bl there,
> > thereby going back to the caller of isa3_idle_stop_mayloss. Which
> > implies that we need to perform the stack unwinding there.
> 
> Well we haven't modified r1 here, so I thought it should be okay. It
> seems to do the right thing in mambo.

Yeah, my bad. We aren't updating the r1. The code is fine in that
case.

> 
> > > +static inline void atomic_unlock_thread_idle(void)
> > > +{
> > > +	int cpu = raw_smp_processor_id();
> > > +	int first = cpu_first_thread_sibling(cpu);
> > > +	unsigned long *state = &paca_ptrs[first]->idle_state;
> > > +
> > > +	clear_bit_unlock(NR_PNV_CORE_IDLE_LOCK_BIT, state);
> > > +}
> > > +
> > > +static unsigned long power9_idle_stop(unsigned long psscr, bool  mmu_on)  
> > 
> > We are not using the mmu_on anywhere in the code. Is this for
> > restoring the IR|DR bit in the MSR ?
> 
> Hmm, it's supposed to be for KVM secondaries that have to have the MMU
> switched off around idle/wake. I haven't got all the KVM stuff right
> yet though.

Ah, right. But I was under the impression that it was a quirk only on 
POWER8 so that we set the KVM_HWTHREAD_IN_IDLE on the secondaries with
MMU off, so that in the event the primary thread in the core switches
to guest, the secondaries won't execute the guest code.

On P9, the comment in power9_offline_stop says that we don't need to
set the KVM_HWTHREAD_IN_IDLE in the real-mode.

	/*
	 * Tell KVM we're entering idle.
	 * This does not have to be done in real mode because the P9 MMU
	 * is independent per-thread. Some steppings share radix/hash mode
	 * between threads, but in that case KVM has a barrier sync in real
	 * mode before and after switching between radix and hash.
	 */

Am I missing something?

> 
> 
> Thanks,
> Nick
> 
--
Thanks and Regards
gautham.

^ permalink raw reply

* Re: [PATCH v3 1/2] powerpc: Detect the presence of big-cores via "ibm, thread-groups"
From: Gautham R Shenoy @ 2018-07-11  9:54 UTC (permalink / raw)
  To: Murilo Opsfelder Araujo
  Cc: Gautham R. Shenoy, Michael Ellerman, Benjamin Herrenschmidt,
	Michael Neuling, Vaidyanathan Srinivasan, Akshay Adiga,
	Shilpasri G Bhat, Oliver O'Halloran, Nicholas Piggin,
	linuxppc-dev, linux-kernel
In-Reply-To: <20180708160334.GA6947@kermit-br-ibm-com>

Hello Murilo,

On Sun, Jul 08, 2018 at 01:03:34PM -0300, Murilo Opsfelder Araujo wrote:
> On Fri, Jul 06, 2018 at 02:35:48PM +0530, Gautham R. Shenoy wrote:
> > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > 
> > On IBM POWER9, the device tree exposes a property array identifed by
> > "ibm,thread-groups" which will indicate which groups of threads share a
> > particular set of resources.
> > 
> > As of today we only have one form of grouping identifying the group of
> > threads in the core that share the L1 cache, translation cache and
> > instruction data flow.
> > 
> > This patch defines the helper function to parse the contents of
> > "ibm,thread-groups" and a new structure to contain the parsed output.
> > 
> > The patch also creates the sysfs file named "small_core_siblings" that
> > returns the physical ids of the threads in the core that share the L1
> > cache, translation cache and instruction data flow.
> > 
> > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > ---
> >  Documentation/ABI/testing/sysfs-devices-system-cpu |   8 ++
> >  arch/powerpc/include/asm/cputhreads.h              |  22 +++
> >  arch/powerpc/kernel/setup-common.c                 | 154 +++++++++++++++++++++
> >  arch/powerpc/kernel/sysfs.c                        |  35 +++++
> >  4 files changed, 219 insertions(+)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
> > index 9c5e7732..62f24de 100644
> > --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> > +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> > @@ -487,3 +487,11 @@ Description:	Information about CPU vulnerabilities
> >  		"Not affected"	  CPU is not affected by the vulnerability
> >  		"Vulnerable"	  CPU is affected and no mitigation in effect
> >  		"Mitigation: $M"  CPU is affected and mitigation $M is in effect
> > +
> > +What: 		/sys/devices/system/cpu/cpu[0-9]+/small_core_siblings
> > +Date:		05-Jul-2018
> > +KernelVersion:	v4.18.0
> > +Contact:	Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > +Description:	List of Physical ids of CPUs which share the the L1 cache,
> > +		translation cache and instruction data-flow with this CPU.
> 
> What about this?
> 
>     Description: List of physical CPU IDs that share a common L1 cache,
>                  translation cache and instruction data flow with this CPU.
> 
> Or perhaps just remove the extra "the".

Oops! Will remove the extra "the". Thanks for spotting it.


> 
> > +Values:		Comma separated list of decimal integers.
> > diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h
> > index d71a909..33226d7 100644
> > --- a/arch/powerpc/include/asm/cputhreads.h

[..snip..]

> > diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> > index 755dc98..f5717de 100644
> > --- a/arch/powerpc/kernel/sysfs.c
> > +++ b/arch/powerpc/kernel/sysfs.c
> > @@ -18,6 +18,7 @@
> >  #include <asm/smp.h>
> >  #include <asm/pmc.h>
> >  #include <asm/firmware.h>
> > +#include <asm/cputhreads.h>
> >  
> >  #include "cacheinfo.h"
> >  #include "setup.h"
> > @@ -1025,6 +1026,33 @@ static ssize_t show_physical_id(struct device *dev,
> >  }
> >  static DEVICE_ATTR(physical_id, 0444, show_physical_id, NULL);
> >  
> > +static ssize_t show_small_core_siblings(struct device *dev,
> > +					struct device_attribute *attr,
> > +					char *buf)
> > +{
> 
> Interesting enough, checkpatch.pl warned about this function name:
> 
>     WARNING: Consider renaming function(s) 'show_small_core_siblings' to 'small_core_siblings_show'
>     #354: FILE: arch/powerpc/kernel/sysfs.c:1053:
>     +}

Yes I kept it despite the warning, as doing otherwise would introduce
inconsistency with the way the remaining functions are named in the
file.

If Michael Ellerman is ok, I can send a file converting all these
sysfs calls to use DEVICE_ATTR_RO/RW, which will require renaming all
the other functions to XXX_show().


> 
> > +	struct cpu *cpu = container_of(dev, struct cpu, dev);
> > +	struct device_node *dn = of_get_cpu_node(cpu->dev.id, NULL);
> > +	struct thread_groups tg;
> > +	int i, j;
> > +	ssize_t ret = 0;
> > +
> > +	if (parse_thread_groups(dn, &tg))
> > +		return -ENODATA;
> > +
> > +	i = get_cpu_thread_group_start(cpu->dev.id, &tg);
> > +
> > +	if (i == -1)
> > +		return -ENODATA;
> > +
> > +	for (j = 0; j < tg.threads_per_group - 1; j++)
> > +		ret += sprintf(buf + ret, "%d,", tg.thread_list[i + j]);
> > +
> > +	ret += sprintf(buf + ret, "%d\n", tg.thread_list[i + j]);
> > +
> > +	return ret;
> > +}
> > +static DEVICE_ATTR(small_core_siblings, 0444, show_small_core_siblings, NULL);
> > +
> >  static int __init topology_init(void)
> >  {
> >  	int cpu, r;
> > @@ -1048,6 +1076,13 @@ static int __init topology_init(void)
> >  			register_cpu(c, cpu);
> >  
> >  			device_create_file(&c->dev, &dev_attr_physical_id);
> > +
> > +			if (has_big_cores) {
> > +				const struct device_attribute *attr =
> > +				       &dev_attr_small_core_siblings;
> > +
> > +			       device_create_file(&c->dev, attr);
> > +			}
> >  		}
> >  	}
> >  	r = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "powerpc/topology:online",
> > -- 
> > 1.9.4
> > 
> 
> -- 
> Murilo

^ permalink raw reply

* Re: [PATCH kernel] powerpc/ioda/npu2: Call hot reset skiboot hook when disabling NPU
From: Alexey Kardashevskiy @ 2018-07-11  9:45 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Benjamin Herrenschmidt, Russell Currey, Alistair Popple,
	Balbir Singh, Stewart Smith
In-Reply-To: <20180607070607.16037-1-aik@ozlabs.ru>

On Thu,  7 Jun 2018 17:06:07 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This brings NPU2 in a safe mode when it does not throw HMI if GPU
> coherent memory is gone.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>


Anyone, ping?


> ---
> 
> The main aim for this is nvlink2 pass through, helps a lot.
> 
> 
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 66c2804..29f798c 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -3797,6 +3797,16 @@ static void pnv_pci_release_device(struct pci_dev *pdev)
>  		pnv_ioda_release_pe(pe);
>  }
>  
> +void pnv_npu_disable_device(struct pci_dev *pdev)
> +{
> +	struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
> +	struct eeh_pe *eehpe = edev ? edev->pe : NULL;
> +
> +	if (eehpe && eeh_ops && eeh_ops->reset) {
> +		eeh_ops->reset(eehpe, EEH_RESET_HOT);
> +	}
> +}
> +
>  static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
>  {
>  	struct pnv_phb *phb = hose->private_data;
> @@ -3841,6 +3851,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
>  	.reset_secondary_bus	= pnv_pci_reset_secondary_bus,
>  	.dma_set_mask		= pnv_npu_dma_set_mask,
>  	.shutdown		= pnv_pci_ioda_shutdown,
> +	.disable_device		= pnv_npu_disable_device,
>  };
>  
>  static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
> -- 
> 2.11.0
> 



--
Alexey

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox