LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)
From: Oscar Salvador @ 2018-07-11 13:37 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: akpm, broonie, mhocko, sfr, linux-next, linux-fsdevel, linux-mm,
	linux-kernel, mm-commits, linuxppc-dev, bhe, pasha.tatashin,
	Aneesh Kumar K.V, Anshuman Khandual
In-Reply-To: <87lgai9bt5.fsf@concordia.ellerman.id.au>

On Wed, Jul 11, 2018 at 10:49:58PM +1000, Michael Ellerman wrote:
> akpm@linux-foundation.org writes:
> > The mm-of-the-moment snapshot 2018-07-10-16-50 has been uploaded to
> >
> >    http://www.ozlabs.org/~akpm/mmotm/
> ...
> 
> > * mm-sparse-add-a-static-variable-nr_present_sections.patch
> > * mm-sparsemem-defer-the-ms-section_mem_map-clearing.patch
> > * mm-sparsemem-defer-the-ms-section_mem_map-clearing-fix.patch
> > * mm-sparse-add-a-new-parameter-data_unit_size-for-alloc_usemap_and_memmap.patch
> > * mm-sparse-optimize-memmap-allocation-during-sparse_init.patch
> > * mm-sparse-optimize-memmap-allocation-during-sparse_init-checkpatch-fixes.patch
> 
> > * mm-sparse-remove-config_sparsemem_alloc_mem_map_together.patch
> 
> This seems to be breaking my powerpc pseries qemu boots.
> 
> The boot log with some extra debug shows eg:
> 
>   $ make pseries_le_defconfig

Could you please share the config?
I was not able to find such config in the kernel tree.
-- 
Oscar Salvador
SUSE L3

^ permalink raw reply

* Re: [v2] powerpc: mpc5200: Remove VLA usage
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Kees Cook
  Cc: Arnd Bergmann, linux-kernel, Paul Mackerras, Anatolij Gustschin,
	linuxppc-dev
In-Reply-To: <20180702155621.GA16047@beast>

On Mon, 2018-07-02 at 15:56:21 UTC, Kees Cook wrote:
> In the quest to remove all stack VLA usage from the kernel[1], this
> switches to using a stack size large enough for the saved routine and
> adds a sanity check making sure the routine doesn't overflow into the
> 0x600 exception handler.
> 
> [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
> 
> Signed-off-by: Kees Cook <keescook@chromium.org>
> Reviewed-by: Arnd Bergmann <arnd@arndb.de>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/741c5640a15a23a2ec3a0846668a82

cheers

^ permalink raw reply

* Re: powerpc: Remove memtrace mmap
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Michael Neuling; +Cc: mikey, linuxppc-dev, rashmica.g
In-Reply-To: <20180629001209.18523-1-mikey@neuling.org>

On Fri, 2018-06-29 at 00:12:09 UTC, Michael Neuling wrote:
> debugfs doesn't support mmap, so this code is never used.
> 
> Signed-off-by: Michael Neuling <mikey@neuling.org>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/7dea6f2f053599d90f7894216db0dd

cheers

^ permalink raw reply

* Re: [v2, 01/10] Revert "cxl: Add kernel API to allow a context to operate with relocate disabled"
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Frederic Barrat, alastair, andrew.donnellan, vaibhav, clombard,
	felix, linuxppc-dev
  Cc: huyn
In-Reply-To: <20180628100509.17413-2-fbarrat@linux.ibm.com>

On Thu, 2018-06-28 at 10:05:00 UTC, Frederic Barrat wrote:
> From: Alastair D'Silva <alastair@d-silva.org>
> 
> Remove abandonned capi support for the Mellanox CX4.
> The symbol 'cxl_set_translation_mode' is never called, so
> ctx->real_mode is always false.
> 
> This reverts commit 7a0d85d313c2066712e530e668bc02bb741a685c.
> 
> Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c5828150067c47a97f30e690a472e0

cheers

^ permalink raw reply

* Re: [v3,1/2] selftests/powerpc: Fix strncpy usage
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Breno Leitao, linuxppc-dev; +Cc: Breno Leitao, Anshuman Khandual
In-Reply-To: <1530019213-2347-1-git-send-email-leitao@debian.org>

On Tue, 2018-06-26 at 13:20:12 UTC, Breno Leitao wrote:
> There is a buffer overflow in dscr_inherit_test.c test. In main(), strncpy()'s
> third argument is the length of the source, not the size of the destination
> buffer, which makes strncpy() behaves like strcpy(), causing a buffer overflow
> if argv[0] is bigger than LEN_MAX (100).
> 
> This patch maps 'prog' to the argv[0] memory region, removing the static
> allocation and the LEN_MAX size restriction.
> 
> CC: Michael Ellerman <mpe@ellerman.id.au>
> CC: Segher Boessenkool <segher@kernel.crashing.org>
> CC: Anshuman Khandual <khandual@linux.vnet.ibm.com>
> Signed-off-by: Breno Leitao <leitao@debian.org>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/09a61e894ac852fb063ee0b54fc513

cheers

^ permalink raw reply

* Re: powerpc/pci: Remove legacy debug code
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Breno Leitao, linuxppc-dev; +Cc: Breno Leitao
In-Reply-To: <1529360102-8206-1-git-send-email-leitao@debian.org>

On Mon, 2018-06-18 at 22:15:02 UTC, Breno Leitao wrote:
> Commit 59f47eff03a0 ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")
> removed the 'oirq' variable, but kept memsetting it when the DEBUG macro is
> defined.
> 
> When setting DEBUG macro for debugging purpose, the kernel fails to build since
> 'oirq' is not defined anymore.
> 
> This patch simply remove the debug block, since it does not seem to sense
> now.
> 
> Fixes: 59f47eff03a08c ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/3bfb450ee7b52d41ce2117738ee9c1

cheers

^ permalink raw reply

* Re: [v2,1/2] powerpc: Document issues with the DAWR on POWER9
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Michael Neuling; +Cc: mikey, linuxppc-dev
In-Reply-To: <20180625013456.16157-1-mikey@neuling.org>

On Mon, 2018-06-25 at 01:34:55 UTC, Michael Neuling wrote:
> Signed-off-by: Michael Neuling <mikey@neuling.org>
> Acked-by: Stewart Smith <stewart@linux.ibm.com>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c6bef2e9e50ca27987aae90147511b

cheers

^ permalink raw reply

* Re: ocxl: Fix page fault handler in case of fault on dying process
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Frederic Barrat, andrew.donnellan, alastair, linuxppc-dev
  Cc: clombard, vaibhav
In-Reply-To: <20180618121436.20479-1-fbarrat@linux.ibm.com>

On Mon, 2018-06-18 at 12:14:36 UTC, Frederic Barrat wrote:
> If a process exits without doing proper cleanup, there's a window
> where an opencapi device can try to access the memory of the dying
> process and may trigger a page fault. That's an expected scenario and
> the ocxl driver holds a reference on the mm_struct of the process
> until the opencapi device is notified of the process exiting.
> However, if mm_users is already at 0, i.e. the address space of the
> process has already been destroyed, the driver shouldn't try resolving
> the page fault, as it will fail, but it can also try accessing already
> freed data.
> 
> It is fixed by only calling the bottom half of the page fault handler
> if mm_users is greater than 0 and get a reference on mm_users instead
> of mm_count. Otherwise, we can safely return a translation fault to
> the device, as its associated memory context is being removed. The
> opencapi device will be properly cleaned up shortly after when closing
> the file descriptors.
> 
> Fixes: 5ef3166e8a32 ("ocxl: Driver code for 'generic' opencapi devices")
> Cc: stable@vger.kernel.org # v4.16+
> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
> Reviewed-By: Alastair D'Silva <alastair@d-silva.org>
> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/d497ebf5fb3a026c0817f8c96cde57

cheers

^ permalink raw reply

* Re: powerpc: xmon: use ktime_get_coarse_boottime64
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Arnd Bergmann, Benjamin Herrenschmidt, Paul Mackerras
  Cc: Arnd Bergmann, y2038, Vaibhav Jain, Nicholas Piggin, linux-kernel,
	linuxppc-dev
In-Reply-To: <20180618095653.1538471-1-arnd@arndb.de>

On Mon, 2018-06-18 at 09:56:24 UTC, Arnd Bergmann wrote:
> get_monotonic_boottime() is deprecated, and may not be safe to call in
> every context, as it has to read a hardware clocksource.
> 
> This changes xmon to print the time using ktime_get_coarse_boottime64()
> instead, which avoids the old timespec type and the HW access.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> Acked-by: Balbir Singh <bsingharora@gmail.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/f6bd74fa084eb9ad573ffbb236a095

cheers

^ permalink raw reply

* Re: powerpc: wii: Remove outdated comment about memory fixups
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Jonathan Neuschäfer, linuxppc-dev
  Cc: Paul Mackerras, Jonathan Neuschäfer, linux-kernel
In-Reply-To: <20180617124909.26828-1-j.neuschaefer@gmx.net>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain, Size: 477 bytes --]

On Sun, 2018-06-17 at 12:49:06 UTC, =?utf-8?q?Jonathan_Neusch=C3=A4fer?= wrote:
> The workaround has been removed. What stays is just code to find the
> memory hole so the BATs can be configured properly in the function below.
> 
> Fixes: 57deb8fea01f ("powerpc/wii: Don't rely on the reserved memory hack")
> Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8272f598523d0975259080d2f9d760

cheers

^ permalink raw reply

* Re: misc: ocxl: Change return type for fault handler
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Souptick Joarder, willy, fbarrat, andrew.donnellan, arnd, gregkh
  Cc: brajeswar.linux, linuxppc-dev, linux-kernel, sabyasachi.linux
In-Reply-To: <20180611202904.GA25538@jordon-HP-15-Notebook-PC>

On Mon, 2018-06-11 at 20:29:04 UTC, Souptick Joarder wrote:
> Use new return type vm_fault_t for fault handler. For
> now, this is just documenting that the function returns
> a VM_FAULT value rather than an errno. Once all instances
> are converted, vm_fault_t will become a distinct type.
> 
> Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")
> 
> There is an existing bug when vm_insert_pfn() can return
> ENOMEM which was ignored and VM_FAULT_NOPAGE returned as
> default. The new inline vmf_insert_pfn() has removed
> this inefficiency by returning correct vm_fault_ type.
> 
> Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a545cf032d11437ed86e62f00d4991

cheers

^ permalink raw reply

* Re: [kernel] powerpc/powernv/ioda2: Reduce upper limit for DMA window size
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev; +Cc: Alexey Kardashevskiy
In-Reply-To: <20180601080616.29279-1-aik@ozlabs.ru>

On Fri, 2018-06-01 at 08:06:16 UTC, Alexey Kardashevskiy wrote:
> We use PHB in mode1 which uses bit 59 to select a correct DMA window.
> However there is mode2 which uses bits 59:55 and allows up to 32 DMA
> windows per a PE.
> 
> Even though documentation does not clearly specify that, it seems that
> the actual hardware does not support bits 59:55 even in mode1, in other
> words we can create a window as big as 1<<58 but DMA simply won't work.
> 
> This reduces the upper limit from 59 to 55 bits to let the userspace know
> about the hardware limits.
> 
> Fixes: 7aafac11e3 "powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested"
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/d3d4ffaae439981e1e441ebb125aa3

cheers

^ permalink raw reply

* Re: powerpc/eeh: Avoid misleading message "EEH: no capable adapters found"
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Mauro S. M. Rodrigues, linuxppc-dev, ruscur, benh, paulus; +Cc: kernel
In-Reply-To: <1521771052-5973-1-git-send-email-maurosr@linux.vnet.ibm.com>

On Fri, 2018-03-23 at 02:10:52 UTC, "Mauro S. M. Rodrigues" wrote:
> Due to recent refactoring in EEH in:
> commit b9fde58db7e5 ("powerpc/powernv: Rework EEH initialization on
> powernv")
> a misleading message was seen in the kernel message buffer:
> 
> [    0.108431] EEH: PowerNV platform initialized
> [    0.589979] EEH: No capable adapters found
> 
> This happened due to the removal of the initialization delay for powernv
> platform.
> 
> Even though the EEH infrastructure for the devices is eventually
> initialized and still works just fine the eeh device probe step is
> postponed in order to assure the PEs are created. Later
> pnv_eeh_post_init does the probe devices job but at that point the
> message was already shown right after eeh_init flow.
> 
> This patch introduces a new flag EEH_POSTPONED_PROBE to represent that
> temporary state and avoid the message mentioned above and showing the
> follow one instead:
> 
> [    0.107724] EEH: PowerNV platform initialized
> [    4.844825] EEH: PCI Enhanced I/O Error Handling Enabled
> 
> Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
> Acked-by: Russell Currey <ruscur@russell.cc>
> Tested-by:Venkat Rao B <vrbagal1@linux.vnet.ibm.com>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/ee8c446fed99ffdc29dedf7d2a8854

cheers

^ permalink raw reply

* Re: scripts: Add ppc64le support for checkstack.pl
From: Michael Ellerman @ 2018-07-11 13:24 UTC (permalink / raw)
  To: Breno Leitao, linux-kernel, linuxppc-dev; +Cc: Breno Leitao
In-Reply-To: <1511897374-27408-1-git-send-email-leitao@debian.org>

On Tue, 2017-11-28 at 19:29:34 UTC, Breno Leitao wrote:
> 64-bit ELF v2 ABI specification for POWER describes, on section "General
> Stack Frame Requirements", that the stack should use the following
> instructions when compiled with backchain:
> 
>   mflr r0
>   std  r0, 16(r1)
>   stdu r1, -XX(r1)
> 
> Where XX is the frame size for that function, and this is the value
> checkstack.pl will find the stack size for each function.
> 
> This patch also simplifies the entire Powerpc section, since just two
> type of instructions are used, 'stdu' for 64 bits and 'stwu' for 32 bits
> platform.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/8449a4cb69ab6fbb873d653a82787a

cheers

^ permalink raw reply

* Re: Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)
From: Baoquan He @ 2018-07-11 13:12 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: akpm, broonie, mhocko, sfr, linux-next, linux-fsdevel, linux-mm,
	linux-kernel, mm-commits, linuxppc-dev, pasha.tatashin,
	Aneesh Kumar K.V, Anshuman Khandual
In-Reply-To: <87lgai9bt5.fsf@concordia.ellerman.id.au>

Hi Michael,

On 07/11/18 at 10:49pm, Michael Ellerman wrote:
> akpm@linux-foundation.org writes:
> > The mm-of-the-moment snapshot 2018-07-10-16-50 has been uploaded to
> >
> >    http://www.ozlabs.org/~akpm/mmotm/
> ...
> 
> > * mm-sparse-add-a-static-variable-nr_present_sections.patch
> > * mm-sparsemem-defer-the-ms-section_mem_map-clearing.patch
> > * mm-sparsemem-defer-the-ms-section_mem_map-clearing-fix.patch
> > * mm-sparse-add-a-new-parameter-data_unit_size-for-alloc_usemap_and_memmap.patch
> > * mm-sparse-optimize-memmap-allocation-during-sparse_init.patch
> > * mm-sparse-optimize-memmap-allocation-during-sparse_init-checkpatch-fixes.patch
> 
> > * mm-sparse-remove-config_sparsemem_alloc_mem_map_together.patch
> 
> This seems to be breaking my powerpc pseries qemu boots.
> 
> The boot log with some extra debug shows eg:
> 
>   $ make pseries_le_defconfig
>   $ qemu-system-ppc64 -nographic -vga none -M pseries -m 2G -kernel vmlinux 
>   vmemmap_populate f000000000000000..f000000000024000, node 0
>         * f000000000000000..f000000001000000 allocated at c000000076000000
>   hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x76000000
>   hash__vmemmap_create_mapping: failed -1
> 
>   <repeated many times>
> 
> Then there's lots of other warnings about bad page states and eventually
> a NULL deref and we panic().
> 
> 
> The problem seems to be that we're calling down into
> hash__vmemmap_create_mapping() for every call to vmemmap_populate(),
> whereas previously we would only call hash__vmemmap_create_mapping()
> once because our vmemmap_populated() would return true.
> 
> There's actually a comment in sparse_init() that says:
> 
> 	 * powerpc need to call sparse_init_one_section right after each
> 	 * sparse_early_mem_map_alloc, so allocate usemap_map at first.
> 
> So changing that behaviour does seem to be the problem.
> 
> I assume that comment is talking about the fact that we use pfn_valid()
> in vmemmap_populated().
> 
> I'm not clear on how to fix it though.

Have you tried reverting that patch and building kernel to test again?
Does it work?

^ permalink raw reply

* [PATCH] [RESEND] powerpc: xmon: use ktime_get_coarse_boottime64
From: Arnd Bergmann @ 2018-07-11 12:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: y2038, Arnd Bergmann, Balbir Singh, Nicholas Piggin, Vaibhav Jain,
	Breno Leitao, Mathieu Malaterre, Yisheng Xie, linuxppc-dev,
	linux-kernel

get_monotonic_boottime() is deprecated, and may not be safe to call in
every context, as it has to read a hardware clocksource.

This changes xmon to print the time using ktime_get_coarse_boottime64()
instead, which avoids the old timespec type and the HW access.

Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
Originally sent Jun 18, but this hasn't appeared in linux-next yet.

Resending to make sure this is still on the radar. Please apply
to the powerpc git for 4.19
---
 arch/powerpc/xmon/xmon.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 47166ad2a669..45e3d0ec1246 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -918,13 +918,13 @@ static void remove_cpu_bpts(void)
 static void
 show_uptime(void)
 {
-	struct timespec uptime;
+	struct timespec64 uptime;
 
 	if (setjmp(bus_error_jmp) == 0) {
 		catch_memory_errors = 1;
 		sync();
 
-		get_monotonic_boottime(&uptime);
+		ktime_get_coarse_boottime_ts64(&uptime);
 		printf("Uptime: %lu.%.2lu seconds\n", (unsigned long)uptime.tv_sec,
 			((unsigned long)uptime.tv_nsec / (NSEC_PER_SEC/100)));
 
-- 
2.9.0

^ permalink raw reply related

* Boot failures with "mm/sparse: Remove CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER" on powerpc (was Re: mmotm 2018-07-10-16-50 uploaded)
From: Michael Ellerman @ 2018-07-11 12:49 UTC (permalink / raw)
  To: akpm, broonie, mhocko, sfr, linux-next, linux-fsdevel, linux-mm,
	linux-kernel, mm-commits, linuxppc-dev, bhe, pasha.tatashin,
	Aneesh Kumar K.V, Anshuman Khandual
In-Reply-To: <20180710235044.vjlRV%akpm@linux-foundation.org>

akpm@linux-foundation.org writes:
> The mm-of-the-moment snapshot 2018-07-10-16-50 has been uploaded to
>
>    http://www.ozlabs.org/~akpm/mmotm/
...

> * mm-sparse-add-a-static-variable-nr_present_sections.patch
> * mm-sparsemem-defer-the-ms-section_mem_map-clearing.patch
> * mm-sparsemem-defer-the-ms-section_mem_map-clearing-fix.patch
> * mm-sparse-add-a-new-parameter-data_unit_size-for-alloc_usemap_and_memmap.patch
> * mm-sparse-optimize-memmap-allocation-during-sparse_init.patch
> * mm-sparse-optimize-memmap-allocation-during-sparse_init-checkpatch-fixes.patch

> * mm-sparse-remove-config_sparsemem_alloc_mem_map_together.patch

This seems to be breaking my powerpc pseries qemu boots.

The boot log with some extra debug shows eg:

  $ make pseries_le_defconfig
  $ qemu-system-ppc64 -nographic -vga none -M pseries -m 2G -kernel vmlinux 
  ...
  vmemmap_populate f000000000000000..f000000000004000, node 0
        * f000000000000000..f000000001000000 allocated at c00000007e000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x7e000000
  vmemmap_populate f000000000000000..f000000000008000, node 0
        * f000000000000000..f000000001000000 allocated at c00000007d000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x7d000000
  vmemmap_populate f000000000000000..f00000000000c000, node 0
        * f000000000000000..f000000001000000 allocated at c00000007c000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x7c000000
  vmemmap_populate f000000000000000..f000000000010000, node 0
        * f000000000000000..f000000001000000 allocated at c00000007b000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x7b000000
  vmemmap_populate f000000000000000..f000000000014000, node 0
        * f000000000000000..f000000001000000 allocated at c00000007a000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x7a000000
  vmemmap_populate f000000000000000..f000000000018000, node 0
        * f000000000000000..f000000001000000 allocated at c000000079000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x79000000
  vmemmap_populate f000000000000000..f00000000001c000, node 0
        * f000000000000000..f000000001000000 allocated at c000000078000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x78000000
  vmemmap_populate f000000000000000..f000000000020000, node 0
        * f000000000000000..f000000001000000 allocated at c000000077000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x77000000
  vmemmap_populate f000000000000000..f000000000024000, node 0
        * f000000000000000..f000000001000000 allocated at c000000076000000
  hash__vmemmap_create_mapping: start 0xf000000000000000 size 0x1000000 phys 0x76000000
  hash__vmemmap_create_mapping: failed -1

  <repeated many times>

Then there's lots of other warnings about bad page states and eventually
a NULL deref and we panic().


The problem seems to be that we're calling down into
hash__vmemmap_create_mapping() for every call to vmemmap_populate(),
whereas previously we would only call hash__vmemmap_create_mapping()
once because our vmemmap_populated() would return true.

There's actually a comment in sparse_init() that says:

	 * powerpc need to call sparse_init_one_section right after each
	 * sparse_early_mem_map_alloc, so allocate usemap_map at first.

So changing that behaviour does seem to be the problem.

I assume that comment is talking about the fact that we use pfn_valid()
in vmemmap_populated().

I'm not clear on how to fix it though.

Any ideas?

cheers

^ permalink raw reply

* [PATCH kernel v6 0/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Alexey Kardashevskiy @ 2018-07-11 11:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Aneesh Kumar K.V,
	Alex Williamson, Michael Ellerman, Nicholas Piggin,
	Paul Mackerras, Balbir Singh

This is to improve page boundaries checking and should probably
be cc:stable. I came accross this while debugging nvlink2 passthrough
but the lack of checking might be exploited by the existing userspace.

The get_user_pages() comment says it should be "phased out" but the only
alternative seems to be get_user_pages_longterm(), should that be used
instead (this is longterm reference elevation, however it is not DAX,
whatever this implies)? get_user_pages_remote() seems unnecessarily
complicated because of @locked.

Changes:
v6:
* 2/2: read pageshift from pte

v5:
* 2/2: changed compound pages handling

v4:
* 2/2: implemented less strict but still safe max pageshift as David suggested

v3:
* enforced huge pages not to cross preregistered chunk boundaries

v2:
* 2/2: explicitly check for compound pages before calling compound_order()

This is based on sha1
1e4b044 Linus Torvalds "Linux 4.18-rc4".

Please comment. Thanks.

Alexey Kardashevskiy (2):
  vfio/spapr: Use IOMMU pageshift rather than pagesize
  KVM: PPC: Check if IOMMU page is contained in the pinned physical page

 arch/powerpc/include/asm/mmu_context.h |  4 ++--
 arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
 arch/powerpc/mm/mmu_context_iommu.c    | 39 ++++++++++++++++++++++++++++++++--
 drivers/vfio/vfio_iommu_spapr_tce.c    | 10 ++++-----
 5 files changed, 49 insertions(+), 12 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH kernel v6 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Alexey Kardashevskiy @ 2018-07-11 11:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Aneesh Kumar K.V,
	Alex Williamson, Michael Ellerman, Nicholas Piggin,
	Paul Mackerras, Balbir Singh
In-Reply-To: <20180711110044.15939-1-aik@ozlabs.ru>

A VM which has:
 - a DMA capable device passed through to it (eg. network card);
 - running a malicious kernel that ignores H_PUT_TCE failure;
 - capability of using IOMMU pages bigger that physical pages
can create an IOMMU mapping that exposes (for example) 16MB of
the host physical memory to the device when only 64K was allocated to the VM.

The remaining 16MB - 64K will be some other content of host memory, possibly
including pages of the VM, but also pages of host kernel memory, host
programs or other VMs.

The attacking VM does not control the location of the page it can map,
and is only allowed to map as many pages as it has pages of RAM.

We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
an IOMMU page is contained in the physical page so the PCI hardware won't
get access to unassigned host memory; however this check is missing in
the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
did not hit this yet as the very first time when the mapping happens
we do not have tbl::it_userspace allocated yet and fall back to
the userspace which in turn calls VFIO IOMMU driver, this fails and
the guest does not retry,

This stores the smallest preregistered page size in the preregistered
region descriptor and changes the mm_iommu_xxx API to check this against
the IOMMU page size.

This calculates maximum page size as a minimum of the natural region
alignment and compound page size. For the page shift this uses the shift
returned by find_linux_pte() which indicates how the page is mapped to
the current userspace - if the page is huge and this is not a zero, then
it is a leaf pte and the page is mapped within the range.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v6:
* replaced hugetlbfs with pageshift from find_linux_pte()

v5:
* only consider compound pages from hugetlbfs

v4:
* reimplemented max pageshift calculation

v3:
* fixed upper limit for the page size
* added checks that we don't register parts of a huge page

v2:
* explicitely check for compound pages before calling compound_order()

---
The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
advertise 16MB pages to the guest; a typical pseries guest will use 16MB
for IOMMU pages without checking the mmu pagesize and this will fail
at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256

With the change, mapping will fail in KVM and the guest will print:

mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
mlx5_core 0000:00:00.0: failed to map direct window for /pci@800000020000000/ethernet@0: -1
---
 arch/powerpc/include/asm/mmu_context.h |  4 ++--
 arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
 arch/powerpc/mm/mmu_context_iommu.c    | 39 ++++++++++++++++++++++++++++++++--
 drivers/vfio/vfio_iommu_spapr_tce.c    |  2 +-
 5 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 896efa5..79d570c 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -35,9 +35,9 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa);
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
 extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa);
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index d066e37..8c456fa 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -449,7 +449,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
 		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
 		return H_TOO_HARD;
 
-	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
+	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
 		return H_HARDWARE;
 
 	if (mm_iommu_mapped_inc(mem))
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 925fc31..5b298f5 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -279,7 +279,8 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
 	if (!mem)
 		return H_TOO_HARD;
 
-	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
+	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, tbl->it_page_shift,
+			&hpa)))
 		return H_HARDWARE;
 
 	pua = (void *) vmalloc_to_phys(pua);
@@ -469,7 +470,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
 		if (mem)
-			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
+			prereg = mm_iommu_ua_to_hpa_rm(mem, ua,
+					IOMMU_PAGE_SHIFT_4K, &tces) == 0;
 	}
 
 	if (!prereg) {
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index abb4364..0bb53a7 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -19,6 +19,7 @@
 #include <linux/hugetlb.h>
 #include <linux/swap.h>
 #include <asm/mmu_context.h>
+#include <asm/pte-walk.h>
 
 static DEFINE_MUTEX(mem_list_mutex);
 
@@ -27,6 +28,7 @@ struct mm_iommu_table_group_mem_t {
 	struct rcu_head rcu;
 	unsigned long used;
 	atomic64_t mapped;
+	unsigned int pageshift;
 	u64 ua;			/* userspace address */
 	u64 entries;		/* number of entries in hpas[] */
 	u64 *hpas;		/* vmalloc'ed */
@@ -125,6 +127,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 {
 	struct mm_iommu_table_group_mem_t *mem;
 	long i, j, ret = 0, locked_entries = 0;
+	unsigned int pageshift;
+	unsigned long flags;
 	struct page *page = NULL;
 
 	mutex_lock(&mem_list_mutex);
@@ -159,6 +163,12 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		goto unlock_exit;
 	}
 
+	/*
+	 * For a starting point for a maximum page size calculation
+	 * we use @ua and @entries natural alignment to allow IOMMU pages
+	 * smaller than huge pages but still bigger than PAGE_SIZE.
+	 */
+	mem->pageshift = __ffs(ua | (entries << PAGE_SHIFT));
 	mem->hpas = vzalloc(array_size(entries, sizeof(mem->hpas[0])));
 	if (!mem->hpas) {
 		kfree(mem);
@@ -199,6 +209,25 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 			}
 		}
 populate:
+		pageshift = PAGE_SHIFT;
+		if (PageCompound(page)) {
+			pte_t *pte;
+			struct page *head = compound_head(page);
+			unsigned int compshift = compound_order(head);
+
+			local_irq_save(flags); /* disables as well */
+			pte = find_linux_pte(mm->pgd, ua, NULL, &pageshift);
+			local_irq_restore(flags);
+			if (!pte) {
+				ret = -EFAULT;
+				goto unlock_exit;
+			}
+			/* Double check it is still the same pinned page */
+			if (pte_page(*pte) == head && pageshift == compshift)
+				pageshift = max_t(unsigned int, pageshift,
+						PAGE_SHIFT);
+		}
+		mem->pageshift = min(mem->pageshift, pageshift);
 		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
 	}
 
@@ -349,7 +378,7 @@ struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(mm_iommu_find);
 
 long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa)
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
 	u64 *va = &mem->hpas[entry];
@@ -357,6 +386,9 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 	if (entry >= mem->entries)
 		return -EFAULT;
 
+	if (pageshift > mem->pageshift)
+		return -EFAULT;
+
 	*hpa = *va | (ua & ~PAGE_MASK);
 
 	return 0;
@@ -364,7 +396,7 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
 long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa)
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
 	void *va = &mem->hpas[entry];
@@ -373,6 +405,9 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 	if (entry >= mem->entries)
 		return -EFAULT;
 
+	if (pageshift > mem->pageshift)
+		return -EFAULT;
+
 	pa = (void *) vmalloc_to_phys(va);
 	if (!pa)
 		return -EFAULT;
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 2da5f05..7cd63b0 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -467,7 +467,7 @@ static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
 	if (!mem)
 		return -EINVAL;
 
-	ret = mm_iommu_ua_to_hpa(mem, tce, phpa);
+	ret = mm_iommu_ua_to_hpa(mem, tce, shift, phpa);
 	if (ret)
 		return -EINVAL;
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH kernel v6 1/2] vfio/spapr: Use IOMMU pageshift rather than pagesize
From: Alexey Kardashevskiy @ 2018-07-11 11:00 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Aneesh Kumar K.V,
	Alex Williamson, Michael Ellerman, Nicholas Piggin,
	Paul Mackerras, Balbir Singh
In-Reply-To: <20180711110044.15939-1-aik@ozlabs.ru>

The size is always equal to 1 page so let's use this. Later on this will
be used for other checks which use page shifts to check the granularity
of access.

This should cause no behavioral change.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

As Alex suggested, this should go via the ppc tree which the next patch
is going to (which is ppc-kvm).
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 759a5bd..2da5f05 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -457,13 +457,13 @@ static void tce_iommu_unuse_page(struct tce_container *container,
 }
 
 static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
-		unsigned long tce, unsigned long size,
+		unsigned long tce, unsigned long shift,
 		unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
 {
 	long ret = 0;
 	struct mm_iommu_table_group_mem_t *mem;
 
-	mem = mm_iommu_lookup(container->mm, tce, size);
+	mem = mm_iommu_lookup(container->mm, tce, 1ULL << shift);
 	if (!mem)
 		return -EINVAL;
 
@@ -487,7 +487,7 @@ static void tce_iommu_unuse_page_v2(struct tce_container *container,
 	if (!pua)
 		return;
 
-	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl),
+	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, tbl->it_page_shift,
 			&hpa, &mem);
 	if (ret)
 		pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
@@ -611,7 +611,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
 				entry + i);
 
 		ret = tce_iommu_prereg_ua_to_hpa(container,
-				tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
+				tce, tbl->it_page_shift, &hpa, &mem);
 		if (ret)
 			break;
 
-- 
2.11.0

^ permalink raw reply related

* Re: [RFC PATCH] powerpc/64s: Move ISAv3.0 / POWER9 idle code to powernv C code
From: Gautham R Shenoy @ 2018-07-11 10:24 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Gautham R Shenoy, linuxppc-dev, Vaidyanathan Srinivasan,
	Akshay Adiga, huntbag
In-Reply-To: <20180711183036.02c006d3@roar.ozlabs.ibm.com>

On Wed, Jul 11, 2018 at 06:30:36PM +1000, Nicholas Piggin wrote:
> On Tue, 10 Jul 2018 16:36:34 +0530
> Gautham R Shenoy <ego@linux.vnet.ibm.com> wrote:
> 
> > Hello Nicholas,
> > 
> > 
> > On Mon, Jul 09, 2018 at 12:24:36AM +1000, Nicholas Piggin wrote:
> > > Reimplement POWER9 idle code in C, in the powernv platform code.
> > > Assembly stubs are used to save and restore the stack frame and
> > > non-volatile GPRs before going to idle, but these are small and
> > > mostly agnostic to microarchitecture implementation details.
> > >  
> > 
> > Thanks for this patch.  It is indeed hard to maneuver through the
> > current assembly code and change things without introducing new bugs.
> 
> Hey thanks for the very good review. I don't mean to disparage the
> existing asm code too much :) We were doing our best there to get
> something working, now I think it's a good time to step back and see
> what we can improve.
> 
> > > POWER7/8 code is not converted (yet), but that's not a moving
> > > target, and it doesn't make you want to claw your eyes out so
> > > much with the POWER9 code untangled from it.
> > > 
> > > The optimisation where EC=ESL=0 idle modes did not have to save
> > > GPRs or mtmsrd L=0 is restored, because it's simple to do.
> > > 
> > > Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
> > > but saves and restores them all explicitly.  
> > 
> > Right! The ->cpu_restore call in the current code is rendered
> > ineffective by "restore_additional_sprs" which is called after the
> > cpu_restore. 
> 
> Yes, I would actually like to do an initial incremental patch for this,
> and remove it from P7/8 CPU types as well. I think that's quite easy to
> do, and a useful cleanup to remove it entirely.
> 
> > > diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
> > > index e210a83eb196..b668f030d531 100644
> > > --- a/arch/powerpc/include/asm/cpuidle.h
> > > +++ b/arch/powerpc/include/asm/cpuidle.h
> > > @@ -28,6 +28,7 @@
> > >   * yet woken from the winkle state.
> > >   */
> > >  #define PNV_CORE_IDLE_LOCK_BIT			0x10000000
> > > +#define NR_PNV_CORE_IDLE_LOCK_BIT		28  
> > 
> > We can define PNV_CORE_IDLE_LOCK_BIT mask based on
> > NR_PNV_CORE_IDLE_LOCK_BIT ?
> 
> Yep good point.
> 
> > > diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> > > index 76a14702cb9c..36b5f0e18c0c 100644
> > > --- a/arch/powerpc/kernel/exceptions-64s.S
> > > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > > @@ -135,8 +135,14 @@ TRAMP_KVM(PACA_EXNMI, 0x100)
> > > 
> > >  #ifdef CONFIG_PPC_P7_NAP
> > >  EXC_COMMON_BEGIN(system_reset_idle_common)
> > > +BEGIN_FTR_SECTION
> > > +	mfspr	r3,SPRN_SRR1
> > > +	bltlr	cr3	/* no state loss, return to idle caller */  
> > 
> > This is a nice optimization. But perhaps needs a carefully checked.
> > We wakeup at SYSTEM_RESET only when we have executed stop via the
> > mayloss() call which creates an additional stack frame. When we return
> > back to the caller, we will go back with that stack frame. Perhaps we
> > need to jump to a isa3_idle_wake_gpr_noloss ?
> 
> I wasn't 100% sure of all this so yes it's something I do need to
> verify carefully. I did test a few different cases in mambo and stepped
> through things, but could easily have bugs.
> 
> I could have got something wrong, but I think mayloss does not create
> an additional stack frame, so this works okay. It's just using red zone.
> ... I think...
> 
> > > +_GLOBAL(isa3_idle_stop_mayloss)
> > > +	std	r1,PACAR1(r13)
> > > +	mflr	r4
> > > +	mfcr	r5
> > > +	/* use stack red zone rather than a new frame */
> > > +	addi	r6,r1,-INT_FRAME_SIZE
> > > +	SAVE_GPR(2, r6)
> > > +	SAVE_NVGPRS(r6)
> > > +	std	r4,_LINK(r6)
> > > +	std	r5,_CCR(r6)
> > > +	mtspr 	SPRN_PSSCR,r3
> > > +	PPC_STOP  
> > 
> > 
> > Suppose we enter into a isa3_idle_stop_may_loss (ESL=EC=1), but as
> > soon as we execute PPC_STOP, if we get interrupted, we will go back to
> > SYSTEM_RESET vector.
> > 
> > The SRR1[46:47] should be 0x1, due to which we will do a bl there,
> > thereby going back to the caller of isa3_idle_stop_mayloss. Which
> > implies that we need to perform the stack unwinding there.
> 
> Well we haven't modified r1 here, so I thought it should be okay. It
> seems to do the right thing in mambo.

Yeah, my bad. We aren't updating the r1. The code is fine in that
case.

> 
> > > +static inline void atomic_unlock_thread_idle(void)
> > > +{
> > > +	int cpu = raw_smp_processor_id();
> > > +	int first = cpu_first_thread_sibling(cpu);
> > > +	unsigned long *state = &paca_ptrs[first]->idle_state;
> > > +
> > > +	clear_bit_unlock(NR_PNV_CORE_IDLE_LOCK_BIT, state);
> > > +}
> > > +
> > > +static unsigned long power9_idle_stop(unsigned long psscr, bool  mmu_on)  
> > 
> > We are not using the mmu_on anywhere in the code. Is this for
> > restoring the IR|DR bit in the MSR ?
> 
> Hmm, it's supposed to be for KVM secondaries that have to have the MMU
> switched off around idle/wake. I haven't got all the KVM stuff right
> yet though.

Ah, right. But I was under the impression that it was a quirk only on 
POWER8 so that we set the KVM_HWTHREAD_IN_IDLE on the secondaries with
MMU off, so that in the event the primary thread in the core switches
to guest, the secondaries won't execute the guest code.

On P9, the comment in power9_offline_stop says that we don't need to
set the KVM_HWTHREAD_IN_IDLE in the real-mode.

	/*
	 * Tell KVM we're entering idle.
	 * This does not have to be done in real mode because the P9 MMU
	 * is independent per-thread. Some steppings share radix/hash mode
	 * between threads, but in that case KVM has a barrier sync in real
	 * mode before and after switching between radix and hash.
	 */

Am I missing something?

> 
> 
> Thanks,
> Nick
> 
--
Thanks and Regards
gautham.

^ permalink raw reply

* Re: [PATCH v3 1/2] powerpc: Detect the presence of big-cores via "ibm, thread-groups"
From: Gautham R Shenoy @ 2018-07-11  9:54 UTC (permalink / raw)
  To: Murilo Opsfelder Araujo
  Cc: Gautham R. Shenoy, Michael Ellerman, Benjamin Herrenschmidt,
	Michael Neuling, Vaidyanathan Srinivasan, Akshay Adiga,
	Shilpasri G Bhat, Oliver O'Halloran, Nicholas Piggin,
	linuxppc-dev, linux-kernel
In-Reply-To: <20180708160334.GA6947@kermit-br-ibm-com>

Hello Murilo,

On Sun, Jul 08, 2018 at 01:03:34PM -0300, Murilo Opsfelder Araujo wrote:
> On Fri, Jul 06, 2018 at 02:35:48PM +0530, Gautham R. Shenoy wrote:
> > From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
> > 
> > On IBM POWER9, the device tree exposes a property array identifed by
> > "ibm,thread-groups" which will indicate which groups of threads share a
> > particular set of resources.
> > 
> > As of today we only have one form of grouping identifying the group of
> > threads in the core that share the L1 cache, translation cache and
> > instruction data flow.
> > 
> > This patch defines the helper function to parse the contents of
> > "ibm,thread-groups" and a new structure to contain the parsed output.
> > 
> > The patch also creates the sysfs file named "small_core_siblings" that
> > returns the physical ids of the threads in the core that share the L1
> > cache, translation cache and instruction data flow.
> > 
> > Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > ---
> >  Documentation/ABI/testing/sysfs-devices-system-cpu |   8 ++
> >  arch/powerpc/include/asm/cputhreads.h              |  22 +++
> >  arch/powerpc/kernel/setup-common.c                 | 154 +++++++++++++++++++++
> >  arch/powerpc/kernel/sysfs.c                        |  35 +++++
> >  4 files changed, 219 insertions(+)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
> > index 9c5e7732..62f24de 100644
> > --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> > +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> > @@ -487,3 +487,11 @@ Description:	Information about CPU vulnerabilities
> >  		"Not affected"	  CPU is not affected by the vulnerability
> >  		"Vulnerable"	  CPU is affected and no mitigation in effect
> >  		"Mitigation: $M"  CPU is affected and mitigation $M is in effect
> > +
> > +What: 		/sys/devices/system/cpu/cpu[0-9]+/small_core_siblings
> > +Date:		05-Jul-2018
> > +KernelVersion:	v4.18.0
> > +Contact:	Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > +Description:	List of Physical ids of CPUs which share the the L1 cache,
> > +		translation cache and instruction data-flow with this CPU.
> 
> What about this?
> 
>     Description: List of physical CPU IDs that share a common L1 cache,
>                  translation cache and instruction data flow with this CPU.
> 
> Or perhaps just remove the extra "the".

Oops! Will remove the extra "the". Thanks for spotting it.


> 
> > +Values:		Comma separated list of decimal integers.
> > diff --git a/arch/powerpc/include/asm/cputhreads.h b/arch/powerpc/include/asm/cputhreads.h
> > index d71a909..33226d7 100644
> > --- a/arch/powerpc/include/asm/cputhreads.h

[..snip..]

> > diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> > index 755dc98..f5717de 100644
> > --- a/arch/powerpc/kernel/sysfs.c
> > +++ b/arch/powerpc/kernel/sysfs.c
> > @@ -18,6 +18,7 @@
> >  #include <asm/smp.h>
> >  #include <asm/pmc.h>
> >  #include <asm/firmware.h>
> > +#include <asm/cputhreads.h>
> >  
> >  #include "cacheinfo.h"
> >  #include "setup.h"
> > @@ -1025,6 +1026,33 @@ static ssize_t show_physical_id(struct device *dev,
> >  }
> >  static DEVICE_ATTR(physical_id, 0444, show_physical_id, NULL);
> >  
> > +static ssize_t show_small_core_siblings(struct device *dev,
> > +					struct device_attribute *attr,
> > +					char *buf)
> > +{
> 
> Interesting enough, checkpatch.pl warned about this function name:
> 
>     WARNING: Consider renaming function(s) 'show_small_core_siblings' to 'small_core_siblings_show'
>     #354: FILE: arch/powerpc/kernel/sysfs.c:1053:
>     +}

Yes I kept it despite the warning, as doing otherwise would introduce
inconsistency with the way the remaining functions are named in the
file.

If Michael Ellerman is ok, I can send a file converting all these
sysfs calls to use DEVICE_ATTR_RO/RW, which will require renaming all
the other functions to XXX_show().


> 
> > +	struct cpu *cpu = container_of(dev, struct cpu, dev);
> > +	struct device_node *dn = of_get_cpu_node(cpu->dev.id, NULL);
> > +	struct thread_groups tg;
> > +	int i, j;
> > +	ssize_t ret = 0;
> > +
> > +	if (parse_thread_groups(dn, &tg))
> > +		return -ENODATA;
> > +
> > +	i = get_cpu_thread_group_start(cpu->dev.id, &tg);
> > +
> > +	if (i == -1)
> > +		return -ENODATA;
> > +
> > +	for (j = 0; j < tg.threads_per_group - 1; j++)
> > +		ret += sprintf(buf + ret, "%d,", tg.thread_list[i + j]);
> > +
> > +	ret += sprintf(buf + ret, "%d\n", tg.thread_list[i + j]);
> > +
> > +	return ret;
> > +}
> > +static DEVICE_ATTR(small_core_siblings, 0444, show_small_core_siblings, NULL);
> > +
> >  static int __init topology_init(void)
> >  {
> >  	int cpu, r;
> > @@ -1048,6 +1076,13 @@ static int __init topology_init(void)
> >  			register_cpu(c, cpu);
> >  
> >  			device_create_file(&c->dev, &dev_attr_physical_id);
> > +
> > +			if (has_big_cores) {
> > +				const struct device_attribute *attr =
> > +				       &dev_attr_small_core_siblings;
> > +
> > +			       device_create_file(&c->dev, attr);
> > +			}
> >  		}
> >  	}
> >  	r = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "powerpc/topology:online",
> > -- 
> > 1.9.4
> > 
> 
> -- 
> Murilo

^ permalink raw reply

* Re: [PATCH kernel] powerpc/ioda/npu2: Call hot reset skiboot hook when disabling NPU
From: Alexey Kardashevskiy @ 2018-07-11  9:45 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Benjamin Herrenschmidt, Russell Currey, Alistair Popple,
	Balbir Singh, Stewart Smith
In-Reply-To: <20180607070607.16037-1-aik@ozlabs.ru>

On Thu,  7 Jun 2018 17:06:07 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> This brings NPU2 in a safe mode when it does not throw HMI if GPU
> coherent memory is gone.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>


Anyone, ping?


> ---
> 
> The main aim for this is nvlink2 pass through, helps a lot.
> 
> 
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 66c2804..29f798c 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -3797,6 +3797,16 @@ static void pnv_pci_release_device(struct pci_dev *pdev)
>  		pnv_ioda_release_pe(pe);
>  }
>  
> +void pnv_npu_disable_device(struct pci_dev *pdev)
> +{
> +	struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev);
> +	struct eeh_pe *eehpe = edev ? edev->pe : NULL;
> +
> +	if (eehpe && eeh_ops && eeh_ops->reset) {
> +		eeh_ops->reset(eehpe, EEH_RESET_HOT);
> +	}
> +}
> +
>  static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
>  {
>  	struct pnv_phb *phb = hose->private_data;
> @@ -3841,6 +3851,7 @@ static const struct pci_controller_ops pnv_npu_ioda_controller_ops = {
>  	.reset_secondary_bus	= pnv_pci_reset_secondary_bus,
>  	.dma_set_mask		= pnv_npu_dma_set_mask,
>  	.shutdown		= pnv_pci_ioda_shutdown,
> +	.disable_device		= pnv_npu_disable_device,
>  };
>  
>  static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
> -- 
> 2.11.0
> 



--
Alexey

^ permalink raw reply

* Re: [RFC PATCH kernel 0/5] powerpc/P9/vfio: Pass through NVIDIA Tesla V100
From: Alexey Kardashevskiy @ 2018-07-11  9:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Benjamin Herrenschmidt, linuxppc-dev, David Gibson, kvm-ppc,
	Ram Pai, kvm, Alistair Popple
In-Reply-To: <20180710163715.5c326985@t450s.home>

On Tue, 10 Jul 2018 16:37:15 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 10 Jul 2018 14:10:20 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > On Thu, 7 Jun 2018 23:03:23 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Fri, 8 Jun 2018 14:14:23 +1000
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >     
> > > > On 8/6/18 1:44 pm, Alex Williamson wrote:      
> > > > > On Fri, 8 Jun 2018 13:08:54 +1000
> > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >         
> > > > >> On 8/6/18 8:15 am, Alex Williamson wrote:        
> > > > >>> On Fri, 08 Jun 2018 07:54:02 +1000
> > > > >>> Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > > >>>           
> > > > >>>> On Thu, 2018-06-07 at 11:04 -0600, Alex Williamson wrote:          
> > > > >>>>>
> > > > >>>>> Can we back up and discuss whether the IOMMU grouping of NVLink
> > > > >>>>> connected devices makes sense?  AIUI we have a PCI view of these
> > > > >>>>> devices and from that perspective they're isolated.  That's the view of
> > > > >>>>> the device used to generate the grouping.  However, not visible to us,
> > > > >>>>> these devices are interconnected via NVLink.  What isolation properties
> > > > >>>>> does NVLink provide given that its entire purpose for existing seems to
> > > > >>>>> be to provide a high performance link for p2p between devices?            
> > > > >>>>
> > > > >>>> Not entire. On POWER chips, we also have an nvlink between the device
> > > > >>>> and the CPU which is running significantly faster than PCIe.
> > > > >>>>
> > > > >>>> But yes, there are cross-links and those should probably be accounted
> > > > >>>> for in the grouping.          
> > > > >>>
> > > > >>> Then after we fix the grouping, can we just let the host driver manage
> > > > >>> this coherent memory range and expose vGPUs to guests?  The use case of
> > > > >>> assigning all 6 GPUs to one VM seems pretty limited.  (Might need to
> > > > >>> convince NVIDIA to support more than a single vGPU per VM though)          
> > > > >>
> > > > >> These are physical GPUs, not virtual sriov-alike things they are
> > > > >> implementing as well elsewhere.        
> > > > > 
> > > > > vGPUs as implemented on M- and P-series Teslas aren't SR-IOV like
> > > > > either.  That's why we have mdev devices now to implement software
> > > > > defined devices.  I don't have first hand experience with V-series, but
> > > > > I would absolutely expect a PCIe-based Tesla V100 to support vGPU.        
> > > > 
> > > > So assuming V100 can do vGPU, you are suggesting ditching this patchset and
> > > > using mediated vGPUs instead, correct?      
> > > 
> > > If it turns out that our PCIe-only-based IOMMU grouping doesn't
> > > account for lack of isolation on the NVLink side and we correct that,
> > > limiting assignment to sets of 3 interconnected GPUs, is that still a
> > > useful feature?  OTOH, it's entirely an NVIDIA proprietary decision
> > > whether they choose to support vGPU on these GPUs or whether they can
> > > be convinced to support multiple vGPUs per VM.
> > >     
> > > > >> My current understanding is that every P9 chip in that box has some NVLink2
> > > > >> logic on it so each P9 is directly connected to 3 GPUs via PCIe and
> > > > >> 2xNVLink2, and GPUs in that big group are interconnected by NVLink2 links
> > > > >> as well.
> > > > >>
> > > > >> From small bits of information I have it seems that a GPU can perfectly
> > > > >> work alone and if the NVIDIA driver does not see these interconnects
> > > > >> (because we do not pass the rest of the big 3xGPU group to this guest), it
> > > > >> continues with a single GPU. There is an "nvidia-smi -r" big reset hammer
> > > > >> which simply refuses to work until all 3 GPUs are passed so there is some
> > > > >> distinction between passing 1 or 3 GPUs, and I am trying (as we speak) to
> > > > >> get a confirmation from NVIDIA that it is ok to pass just a single GPU.
> > > > >>
> > > > >> So we will either have 6 groups (one per GPU) or 2 groups (one per
> > > > >> interconnected group).        
> > > > > 
> > > > > I'm not gaining much confidence that we can rely on isolation between
> > > > > NVLink connected GPUs, it sounds like you're simply expecting that
> > > > > proprietary code from NVIDIA on a proprietary interconnect from NVIDIA
> > > > > is going to play nice and nobody will figure out how to do bad things
> > > > > because... obfuscation?  Thanks,        
> > > > 
> > > > Well, we already believe that a proprietary firmware of a sriov-capable
> > > > adapter like Mellanox ConnextX is not doing bad things, how is this
> > > > different in principle?      
> > > 
> > > It seems like the scope and hierarchy are different.  Here we're
> > > talking about exposing big discrete devices, which are peers of one
> > > another (and have history of being reverse engineered), to userspace
> > > drivers.  Once handed to userspace, each of those devices needs to be
> > > considered untrusted.  In the case of SR-IOV, we typically have a
> > > trusted host driver for the PF managing untrusted VFs.  We do rely on
> > > some sanity in the hardware/firmware in isolating the VFs from each
> > > other and from the PF, but we also often have source code for Linux
> > > drivers for these devices and sometimes even datasheets.  Here we have
> > > neither of those and perhaps we won't know the extent of the lack of
> > > isolation between these devices until nouveau (best case) or some
> > > exploit (worst case) exposes it.  IOMMU grouping always assumes a lack
> > > of isolation between devices unless the hardware provides some
> > > indication that isolation exists, for example ACS on PCIe.  If NVIDIA
> > > wants to expose isolation on NVLink, perhaps they need to document
> > > enough of it that the host kernel can manipulate and test for isolation,
> > > perhaps even enabling virtualization of the NVLink interconnect
> > > interface such that the host can prevent GPUs from interfering with
> > > each other.  Thanks,    
> > 
> > 
> > So far I got this from NVIDIA:
> > 
> > 1. An NVLink2 state can be controlled via MMIO registers, there is a
> > "NVLINK ISOLATION ON MULTI-TENANT SYSTEMS" spec (my copy is
> > "confidential" though) from NVIDIA with the MMIO addresses to block if
> > we want to disable certain links. In order to NVLink to work it needs to
> > be enabled on both sides so by filtering certains MMIO ranges we can
> > isolate a GPU.  
> 
> Where are these MMIO registers, on the bridge or on the endpoint device?

The endpoint GPU device.

> I'm wondering when you say block MMIO if these are ranges on the device
> that we disallow mmap to and all the overlapping PAGE_SIZE issues that
> come with that or if this should essentially be device specific
> enable_acs and acs_enabled quirks, and maybe also potentially used by
> Logan's disable acs series to allow GPUs to be linked and have grouping
> to match.

An update, I confused P100 and V100, P100 would need filtering but
ours is V100 and it has a couple of registers which we can use to
disable particular links and once disabled, the link cannot be
re-enabled till the next secondary bus reset.


> > 2. We can and should also prohibit the GPU firmware update, this is
> > done via MMIO as well. The protocol is not open but at least register
> > ranges might be in order to filter these accesses, and there is no
> > plan to change this.  
> 
> I assume this MMIO is on the endpoint and has all the PAGE_SIZE joys
> along with it.

Yes, however NVIDIA says there is no performance critical stuff with
this 64K page.

> Also, there are certainly use cases of updating
> firmware for an assigned device, we don't want to impose a policy, but
> we should figure out the right place for that policy to be specified by
> the admin.

May be but NVIDIA is talking about some "out-of-band" command to the GPU
to enable firmware update so firmware update is not really supported.


> > 3. DMA trafic over the NVLink2 link can be of 2 types: UT=1 for
> > PCI-style DMA via our usual TCE tables (one per a NVLink2 link),
> > and UT=0 for direct host memory access. UT stands for "use
> > translation" and this is a part of the NVLink2 protocol. Only UT=1 is
> > possible over the PCIe link.
> > This UT=0 trafic uses host physical addresses returned by a nest MMU (a
> > piece of NVIDIA logic on a POWER9 chip), this takes LPID (guest id),
> > mmu context id (guest userspace mm id), a virtual address and translates
> > to the host physical and that result is used for UT=0 DMA, this is
> > called "ATS" although it is not PCIe ATS afaict.
> > NVIDIA says that the hardware is designed in a way that it can only do
> > DMA UT=0 to addresses which ATS translated to, and there is no way to
> > override this behavior and this is what guarantees the isolation.  
> 
> I'm kinda lost here, maybe we can compare it to PCIe ATS where an
> endpoint requests a translation of an IOVA to physical address, the
> IOMMU returns a lookup based on PCIe requester ID, and there's an
> invalidation protocol to keep things coherent.

Yes there is. The current approach is to have an MMU notifier in
the kernel which tells an NPU (IBM piece of logic between GPU/NVlink2
and NVIDIA nest MMU) to invalidate translations and that in turn pokes
the GPU till that confirms that it invalidated tlbs and there is no
ongoing DMA.

> In the case above, who provides a guest id and mmu context id? 

We (powerpc/powernv platform) configure NPU to bind specific bus:dev:fn to
an LPID (== guest id) and MMU context id comes from the guest. The nest
MMU knows where the partition table and this table contains all the
pointers needs for the translation.


> Additional software
> somewhere?  Is the virtual address an IOVA or a process virtual
> address? 

A guest kernel or a guest userspace virtual address.

> Do we assume some sort of invalidation protocol as well?

I am little confused, is this question about the same invalidation
protocol as above or different?


> > So isolation can be achieved if I do not miss something.
> > 
> > How do we want this to be documented to proceed? I assume if I post
> > patches filtering MMIOs, this won't do it, right? If just 1..3 are
> > documented, will we take this t&c or we need a GPU API spec (which is
> > not going to happen anyway)?  
> 
> "t&c"? I think we need what we're actually interacting with to be well
> documented, but that could be _thorough_ comments in the code, enough
> to understand the theory of operation, as far as I'm concerned.  A pdf
> lost on a corporate webserver isn't necessarily an improvement over
> that, but there needs to be sufficient detail to understand what we're
> touching such that we can maintain, adapt, and improve the code over
> time.  Only item #3 above appears POWER specific, so I'd hope that #1
> is done in the PCI subsystem, #2 might be a QEMU option (maybe kernel
> vfio-pci, but I'm not sure that's necessary), and I don't know where #3
> goes.  Thanks,

Ok, understood. Thanks!


--
Alexey

^ permalink raw reply

* Re: [PATCH 1/2] mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
From: Michal Hocko @ 2018-07-11  8:54 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Marek Szyprowski, Linux Memory Management List, LKML,
	linux-arm-kernel, linuxppc-dev, iommu, Andrew Morton,
	Michal Nazarewicz, Joonsoo Kim, Vlastimil Babka,
	Christoph Hellwig, Russell King, Catalin Marinas, Will Deacon,
	Paul Mackerras, Benjamin Herrenschmidt, Chris Zankel,
	Martin Schwidefsky, Joerg Roedel, Sumit Semwal, Robin Murphy,
	Laura Abbott, linaro-mm-sig
In-Reply-To: <CAAmzW4P1m_T77DfQzDD6ysGaOF46++-0gwRaOajmo6ef=VYp=A@mail.gmail.com>

On Wed 11-07-18 16:35:28, Joonsoo Kim wrote:
> 2018-07-10 18:50 GMT+09:00 Michal Hocko <mhocko@kernel.org>:
> > On Tue 10-07-18 16:19:32, Joonsoo Kim wrote:
> >> Hello, Marek.
> >>
> >> 2018-07-09 21:19 GMT+09:00 Marek Szyprowski <m.szyprowski@samsung.com>:
> >> > cma_alloc() function doesn't really support gfp flags other than
> >> > __GFP_NOWARN, so convert gfp_mask parameter to boolean no_warn parameter.
> >>
> >> Although gfp_mask isn't used in cma_alloc() except no_warn, it can be used
> >> in alloc_contig_range(). For example, if passed gfp mask has no __GFP_FS,
> >> compaction(isolation) would work differently. Do you have considered
> >> such a case?
> >
> > Does any of cma_alloc users actually care about GFP_NO{FS,IO}?
> 
> I don't know. My guess is that cma_alloc() is used for DMA allocation so
> block device would use it, too. If fs/block subsystem initiates the
> request for the device,
> it would be possible that cma_alloc() is called with such a flag.
> Again, I don't know
> much about those subsystem so I would be wrong.

The patch converts existing users and none of them really tries to use
anything other than GFP_KERNEL [|__GFP_NOWARN] so this doesn't seem to
be the case. Should there be a new user requiring more restricted
gfp_mask we should carefuly re-evaluate and think how to support it.

Until then I would simply stick with the proposed approach because my
experience tells me that a wrong gfp mask usage is way too easy so the
simpler the api is the less likely we will see an abuse.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox