LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries
From: Aneesh Kumar K.V @ 2021-06-13 11:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linuxppc-dev, Nick Piggin, Linux-MM, Kalesh Singh, Joel Fernandes,
	Kirill A . Shutemov, Andrew Morton, Linus Torvalds
In-Reply-To: <YMXi6AZm4fPpPKrH@casper.infradead.org>

On 6/13/21 4:20 PM, Matthew Wilcox wrote:
> On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote:
>> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type
>> is arch dependent. On some architecture it is pte_t * and on the other
>> struct page *. The reason being highmem and level 4 page table can
>> be located in highmem.
> 
> That is ahistorical.  See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 --
> we have pgtable_t for the benefit of s390's crazy sub-page page table
> sizes.

That is also true with ppc64. We do sub-page page table size. I was 
trying to explain why it can't be pte_t * everywhere and why we have
it as struct page *.

> 
> Also, please stop numbering page tables upside down.  PTEs are first
> level, not fourth.
> 

POWER ISA do name it the other way. I also see some pages explaining 
levels the other way

https://www.bottomupcs.com/virtual_memory_linux.xhtml

whereas 
https://en.wikipedia.org/wiki/Intel_5-level_paging#/media/File:Page_Tables_(5_levels).svg

I am pretty sure I had commits that explained page table level as I did 
in this thread. I will switch to your suggestion in further discussions. 
May be the best solution is to attribute it with more details like level 
1 (pte_t *)?


-aneesh


^ permalink raw reply

* Re: [PATCH -next 9/9] ASoC: fsl_xcvr: check return value after calling platform_get_resource_byname()
From: Timur Tabi @ 2021-06-13 14:00 UTC (permalink / raw)
  To: Yang Yingliang
  Cc: alsa-devel mailing list, Mark Brown, PowerPC Mailing List, lkml,
	timur
In-Reply-To: <20210611093626.579176-10-yangyingliang@huawei.com>

On Fri, Jun 11, 2021 at 4:32 AM Yang Yingliang <yangyingliang@huawei.com> wrote:

>         rx_res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "rxfifo");
>         tx_res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "txfifo");
> +       if (!rx_res || !tx_res) {
> +               dev_err(dev, "Invalid resource\n");
> +               return -EINVAL;
> +       }

If platform_get_resource_byname() returns an error, it's probably
because the name cannot be found.  So I think this error message is
more accurate:

"could not find rxfifo or txfifo resource"

^ permalink raw reply

* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries
From: Linus Torvalds @ 2021-06-13 18:53 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Nick Piggin, Linux-MM, Kalesh Singh, Joel Fernandes,
	Kirill A . Shutemov, Andrew Morton, linuxppc-dev
In-Reply-To: <87wnqy9lru.fsf@linux.ibm.com>

On Sun, Jun 13, 2021 at 2:06 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type
> is arch dependent. On some architecture it is pte_t * and on the other
> struct page *. The reason being highmem and level 4 page table can
> be located in highmem.

Honestly, the same confusion is real - in a different way - about
pud_page_vaddr().

I really hate that function.

Just grep for the uses, and the definitions, to see what I mean. It's crazy.

I'm perfectly happy not having a "pud_pagetable()" function, but that
cast on pud_page_vaddr() is indicative of real problems.

One solution might be to just say "pud_page_vaddr()" must return a "pmd_t *".

I think it's what all the users actually want anyway.

              Linus

^ permalink raw reply

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
From: Nicholas Piggin @ 2021-06-14  0:45 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: linux-arch, Linus Torvalds, Randy Dunlap, linux-kernel, linux-mm,
	linuxppc-dev
In-Reply-To: <8ac1d420-b861-f586-bacf-8c3949e9b5c4@kernel.org>

Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>> when it is context switched. This can be disabled by architectures that
>> don't require this refcounting if they clean up lazy tlb mms when the
>> last refcount is dropped. Currently this is always enabled, which is
>> what existing code does, so the patch is effectively a no-op.
>> 
>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
> 
> I am in favor of this approach, but I would be a lot more comfortable
> with the resulting code if task->active_mm were at least better
> documented and possibly even guarded by ifdefs.

active_mm is fairly well documented in Documentation/active_mm.rst IMO.
I don't think anything has changed in 20 years, I don't know what more
is needed, but if you can add to documentation that would be nice. Maybe
moving a bit of that into .c and .h files?

> x86 bare metal currently does not need the core lazy mm refcounting, and
> x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
> if lazy mm refcounting were configured out, ->active_mm could become a
> dangling pointer, and this makes me extremely uncomfortable.
> 
> So I tend to think that, depending on config, the core code should
> either keep ->active_mm [1] alive or get rid of it entirely.

I don't actually know what you mean.

core code needs the concept of an "active_mm". This is the mm that your 
kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
active_mm still points to init_mm for kernel threads.

We could hide that idea behind an active_mm() function that would always 
return &init_mm if mm==NULL, but you still have the concept of an active
mm and a pointer that callers must not access after free (because some
cases will be CONFIG_LAZY_TLB=y).

> [1] I don't really think it belongs in task_struct at all.  It's not a
> property of the task.  It's the *per-cpu* mm that the core code is
> keeping alive for lazy purposes.  How about consolidating it with the
> copy in rq?

I agree it's conceptually a per-cpu property. I don't know why it was 
done this way, maybe it was just convenient and works well for mm and 
active_mm to be adjacent. Linus might have a better insight.

> I guess the short summary of my opinion is that I like making this
> configurable, but I do not like the state of the code.

I don't think I'd object to moving active_mm to rq and converting all
usages to active_mm() while we're there, it would make things a bit
more configurable. But I don't see it making core code fundamentally
less complex... if you're referring to the x86 mm switching monstrosity,
then that's understandable, but I admit I haven't spent enough time
looking at it to make a useful comment. A patch would be enlightening,
I have the leftover CONFIG_LAZY_TLB=n patch if you were thinking of 
building on that I can send it to you.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] powerpc/signal64: Copy siginfo before changing regs->nip
From: Nicholas Piggin @ 2021-06-14  0:47 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman; +Cc: dja, cmr
In-Reply-To: <20210608134605.2783677-1-mpe@ellerman.id.au>

Excerpts from Michael Ellerman's message of June 8, 2021 11:46 pm:
> In commit 96d7a4e06fab ("powerpc/signal64: Rewrite handle_rt_signal64()
> to minimise uaccess switches") the 64-bit signal code was rearranged to
> use user_write_access_begin/end().
> 
> As part of that change the call to copy_siginfo_to_user() was moved
> later in the function, so that it could be done after the
> user_write_access_end().
> 
> In particular it was moved after we modify regs->nip to point to the
> signal trampoline. That means if copy_siginfo_to_user() fails we exit
> handle_rt_signal64() with an error but with regs->nip modified, whereas
> previously we would not modify regs->nip until the copy succeeded.
> 
> Returning an error from signal delivery but with regs->nip updated
> leaves the process in a sort of half-delivered state. We do immediately
> force a SEGV in signal_setup_done(), called from do_signal(), so the
> process should never run in the half-delivered state.
> 
> However that SEGV is not delivered until we've gone around to
> do_notify_resume() again, so it's possible some tracing could observe
> the half-delivered state.
> 
> There are other cases where we fail signal delivery with regs partly
> updated, eg. the write to newsp and SA_SIGINFO, but the latter at least
> is very unlikely to fail as it reads back from the frame we just wrote
> to.
> 
> Looking at other arches they seem to be more careful about leaving regs
> unchanged until the copy operations have succeeded, and in general that
> seems like good hygenie.
> 
> So although the current behaviour is not cleary buggy, it's also not
> clearly correct. So move the call to copy_siginfo_to_user() up prior to
> the modification of regs->nip, which is closer to the old behaviour, and
> easier to reason about.

Good catch, should it still have a Fixes: tag though? Even if it's not
clearly buggy we want it to be patched.

Thanks,
Nick

> 
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
>  arch/powerpc/kernel/signal_64.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> index dca66481d0c2..f9e1f5428b9e 100644
> --- a/arch/powerpc/kernel/signal_64.c
> +++ b/arch/powerpc/kernel/signal_64.c
> @@ -902,6 +902,10 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
>  	unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), badframe_block);
>  	user_write_access_end();
>  
> +	/* Save the siginfo outside of the unsafe block. */
> +	if (copy_siginfo_to_user(&frame->info, &ksig->info))
> +		goto badframe;
> +
>  	/* Make sure signal handler doesn't get spurious FP exceptions */
>  	tsk->thread.fp_state.fpscr = 0;
>  
> @@ -915,11 +919,6 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
>  		regs->nip = (unsigned long) &frame->tramp[0];
>  	}
>  
> -
> -	/* Save the siginfo outside of the unsafe block. */
> -	if (copy_siginfo_to_user(&frame->info, &ksig->info))
> -		goto badframe;
> -
>  	/* Allocate a dummy caller frame for the signal handler. */
>  	newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
>  	err |= put_user(regs->gpr[1], (unsigned long __user *)newsp);
> -- 
> 2.25.1
> 
> 

^ permalink raw reply

* Re: [PATCH] powerpc/signal64: Copy siginfo before changing regs->nip
From: Nicholas Piggin @ 2021-06-14  1:29 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman; +Cc: dja, cmr
In-Reply-To: <1623631623.jvh0hlk56m.astroid@bobo.none>

Excerpts from Nicholas Piggin's message of June 14, 2021 10:47 am:
> Excerpts from Michael Ellerman's message of June 8, 2021 11:46 pm:
>> In commit 96d7a4e06fab ("powerpc/signal64: Rewrite handle_rt_signal64()
>> to minimise uaccess switches") the 64-bit signal code was rearranged to
>> use user_write_access_begin/end().
>> 
>> As part of that change the call to copy_siginfo_to_user() was moved
>> later in the function, so that it could be done after the
>> user_write_access_end().
>> 
>> In particular it was moved after we modify regs->nip to point to the
>> signal trampoline. That means if copy_siginfo_to_user() fails we exit
>> handle_rt_signal64() with an error but with regs->nip modified, whereas
>> previously we would not modify regs->nip until the copy succeeded.
>> 
>> Returning an error from signal delivery but with regs->nip updated
>> leaves the process in a sort of half-delivered state. We do immediately
>> force a SEGV in signal_setup_done(), called from do_signal(), so the
>> process should never run in the half-delivered state.
>> 
>> However that SEGV is not delivered until we've gone around to
>> do_notify_resume() again, so it's possible some tracing could observe
>> the half-delivered state.
>> 
>> There are other cases where we fail signal delivery with regs partly
>> updated, eg. the write to newsp and SA_SIGINFO, but the latter at least
>> is very unlikely to fail as it reads back from the frame we just wrote
>> to.
>> 
>> Looking at other arches they seem to be more careful about leaving regs
>> unchanged until the copy operations have succeeded, and in general that
>> seems like good hygenie.
>> 
>> So although the current behaviour is not cleary buggy, it's also not
>> clearly correct. So move the call to copy_siginfo_to_user() up prior to
>> the modification of regs->nip, which is closer to the old behaviour, and
>> easier to reason about.
> 
> Good catch, should it still have a Fixes: tag though? Even if it's not
> clearly buggy we want it to be patched.

Also...

>> 
>> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>> ---
>>  arch/powerpc/kernel/signal_64.c | 9 ++++-----
>>  1 file changed, 4 insertions(+), 5 deletions(-)
>> 
>> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
>> index dca66481d0c2..f9e1f5428b9e 100644
>> --- a/arch/powerpc/kernel/signal_64.c
>> +++ b/arch/powerpc/kernel/signal_64.c
>> @@ -902,6 +902,10 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
>>  	unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), badframe_block);
>>  	user_write_access_end();
>>  
>> +	/* Save the siginfo outside of the unsafe block. */
>> +	if (copy_siginfo_to_user(&frame->info, &ksig->info))
>> +		goto badframe;
>> +
>>  	/* Make sure signal handler doesn't get spurious FP exceptions */
>>  	tsk->thread.fp_state.fpscr = 0;
>>  
>> @@ -915,11 +919,6 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
>>  		regs->nip = (unsigned long) &frame->tramp[0];
>>  	}
>>  
>> -
>> -	/* Save the siginfo outside of the unsafe block. */
>> -	if (copy_siginfo_to_user(&frame->info, &ksig->info))
>> -		goto badframe;
>> -
>>  	/* Allocate a dummy caller frame for the signal handler. */
>>  	newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
>>  	err |= put_user(regs->gpr[1], (unsigned long __user *)newsp);

Does the same reasoning apply to this one and the ELF V1 function
descriptor thing? It seems like you could move all of that block
up instead. With your other SA_SIGINFO get_user patch, there would
then be no possibility of error after you start modifying regs.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH] powerpc/signal64: Don't read sigaction arguments back from user memory
From: Nicholas Piggin @ 2021-06-14  1:32 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman; +Cc: cmr
In-Reply-To: <20210610072949.3198522-1-mpe@ellerman.id.au>

Excerpts from Michael Ellerman's message of June 10, 2021 5:29 pm:
> When delivering a signal to a sigaction style handler (SA_SIGINFO), we
> pass pointers to the siginfo and ucontext via r4 and r5.
> 
> Currently we populate the values in those registers by reading the
> pointers out of the sigframe in user memory, even though the values in
> user memory were written by the kernel just prior:
> 
>   unsafe_put_user(&frame->info, &frame->pinfo, badframe_block);
>   unsafe_put_user(&frame->uc, &frame->puc, badframe_block);
>   ...
>   if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
>   	err |= get_user(regs->gpr[4], (unsigned long __user *)&frame->pinfo);
>   	err |= get_user(regs->gpr[5], (unsigned long __user *)&frame->puc);
> 
> ie. we write &frame->info into frame->pinfo, and then read frame->pinfo
> back into r4, and similarly for &frame->uc.
> 
> The code has always been like this, since linux-fullhistory commit
> d4f2d95eca2c ("Forward port of 2.4 ppc64 signal changes.").
> 
> There's no reason for us to read the values back from user memory,
> rather than just setting the value in the gpr[4/5] directly. In fact
> reading the value back from user memory opens up the possibility of
> another user thread changing the values before we read them back.
> Although any process doing that would be racing against the kernel
> delivering the signal, and would risk corrupting the stack, so that
> would be a userspace bug.
> 
> Note that this is 64-bit only code, so there's no subtlety with the size
> of pointers differing between kernel and user. Also the frame variable
> is not modified to point elsewhere during the function.
> 
> In the past reading the values back from user memory was not costly, but
> now that we have KUAP on some CPUs it is, so we'd rather avoid it for
> that reason too.
> 
> So change the code to just set the values directly, using the same
> values we have written to the sigframe previously in the function.
> 
> Note also that this matches what our 32-bit signal code does.
> 
> Using a version of will-it-scale's signal1_threads that sets SA_SIGINFO,
> this results in a ~4% increase in signals per second on a Power9, from
> 229,777 to 239,766.

Good find, nice improvement. Will make it possible to make the error
handling much nicer too I think.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

You've moved copy_siginfo_to_user right up to the user access unlock,
could save 2 more KUAP lock/unlocks if we had an unsafe_clear_user. If
we can move the other user access stuff up as well, the stack frame
put_user could use unsafe_put_user as well, saving 1 more. Another few
percent?


> 
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
>  arch/powerpc/kernel/signal_64.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> index dca66481d0c2..f58e7a98d0df 100644
> --- a/arch/powerpc/kernel/signal_64.c
> +++ b/arch/powerpc/kernel/signal_64.c
> @@ -948,8 +948,8 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
>  	regs->gpr[3] = ksig->sig;
>  	regs->result = 0;
>  	if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
> -		err |= get_user(regs->gpr[4], (unsigned long __user *)&frame->pinfo);
> -		err |= get_user(regs->gpr[5], (unsigned long __user *)&frame->puc);
> +		regs->gpr[4] = (unsigned long)&frame->info;
> +		regs->gpr[5] = (unsigned long)&frame->uc;
>  		regs->gpr[6] = (unsigned long) frame;
>  	} else {
>  		regs->gpr[4] = (unsigned long)&frame->uc.uc_mcontext;
> -- 
> 2.25.1
> 
> 

^ permalink raw reply

* Re: [PATCH v5 02/17] powerpc/vas: Move VAS API to book3s common platform
From: Nicholas Piggin @ 2021-06-14  2:10 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <ab39540c383d93a0a4dec847fe21586450decf5f.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 8:55 pm:
> 
> Using the same /dev/crypto/nx-gzip interface for both powerNV and
> pseries. So this patch creates platforms/book3s/ and moves VAS API
> to that directory. The actual functionality is not changed.
> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>

Just a minor nit with the wording of the changelog.

The pseries platform will share vas and nx code and interfaces with the 
powernv platform, so create the arch/powerpc/platforms/book3s/ directory 
and move VAS API code there. Functionality is not changed.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> ---
>  arch/powerpc/platforms/Kconfig                    |  1 +
>  arch/powerpc/platforms/Makefile                   |  1 +
>  arch/powerpc/platforms/book3s/Kconfig             | 15 +++++++++++++++
>  arch/powerpc/platforms/book3s/Makefile            |  2 ++
>  .../platforms/{powernv => book3s}/vas-api.c       |  2 +-
>  arch/powerpc/platforms/powernv/Kconfig            | 14 --------------
>  arch/powerpc/platforms/powernv/Makefile           |  2 +-
>  7 files changed, 21 insertions(+), 16 deletions(-)
>  create mode 100644 arch/powerpc/platforms/book3s/Kconfig
>  create mode 100644 arch/powerpc/platforms/book3s/Makefile
>  rename arch/powerpc/platforms/{powernv => book3s}/vas-api.c (99%)
> 
> diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
> index 7a5e8f4541e3..594544a65b02 100644
> --- a/arch/powerpc/platforms/Kconfig
> +++ b/arch/powerpc/platforms/Kconfig
> @@ -20,6 +20,7 @@ source "arch/powerpc/platforms/embedded6xx/Kconfig"
>  source "arch/powerpc/platforms/44x/Kconfig"
>  source "arch/powerpc/platforms/40x/Kconfig"
>  source "arch/powerpc/platforms/amigaone/Kconfig"
> +source "arch/powerpc/platforms/book3s/Kconfig"
>  
>  config KVM_GUEST
>  	bool "KVM Guest support"
> diff --git a/arch/powerpc/platforms/Makefile b/arch/powerpc/platforms/Makefile
> index 143d4417f6cc..0e75d7df387b 100644
> --- a/arch/powerpc/platforms/Makefile
> +++ b/arch/powerpc/platforms/Makefile
> @@ -22,3 +22,4 @@ obj-$(CONFIG_PPC_CELL)		+= cell/
>  obj-$(CONFIG_PPC_PS3)		+= ps3/
>  obj-$(CONFIG_EMBEDDED6xx)	+= embedded6xx/
>  obj-$(CONFIG_AMIGAONE)		+= amigaone/
> +obj-$(CONFIG_PPC_BOOK3S)	+= book3s/
> diff --git a/arch/powerpc/platforms/book3s/Kconfig b/arch/powerpc/platforms/book3s/Kconfig
> new file mode 100644
> index 000000000000..34c931592ef0
> --- /dev/null
> +++ b/arch/powerpc/platforms/book3s/Kconfig
> @@ -0,0 +1,15 @@
> +# SPDX-License-Identifier: GPL-2.0
> +config PPC_VAS
> +	bool "IBM Virtual Accelerator Switchboard (VAS)"
> +	depends on (PPC_POWERNV || PPC_PSERIES) && PPC_64K_PAGES
> +	default y
> +	help
> +	  This enables support for IBM Virtual Accelerator Switchboard (VAS).
> +
> +	  VAS devices are found in POWER9-based and later systems, they
> +	  provide access to accelerator coprocessors such as NX-GZIP and
> +	  NX-842. This config allows the kernel to use NX-842 accelerators,
> +	  and user-mode APIs for the NX-GZIP accelerator on POWER9 PowerNV
> +	  and POWER10 PowerVM platforms.
> +
> +	  If unsure, say "N".
> diff --git a/arch/powerpc/platforms/book3s/Makefile b/arch/powerpc/platforms/book3s/Makefile
> new file mode 100644
> index 000000000000..e790f1910f61
> --- /dev/null
> +++ b/arch/powerpc/platforms/book3s/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_PPC_VAS)	+= vas-api.o
> diff --git a/arch/powerpc/platforms/powernv/vas-api.c b/arch/powerpc/platforms/book3s/vas-api.c
> similarity index 99%
> rename from arch/powerpc/platforms/powernv/vas-api.c
> rename to arch/powerpc/platforms/book3s/vas-api.c
> index 98ed5d8c5441..cfc9d7dd65ab 100644
> --- a/arch/powerpc/platforms/powernv/vas-api.c
> +++ b/arch/powerpc/platforms/book3s/vas-api.c
> @@ -10,9 +10,9 @@
>  #include <linux/fs.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
> +#include <linux/io.h>
>  #include <asm/vas.h>
>  #include <uapi/asm/vas-api.h>
> -#include "vas.h"
>  
>  /*
>   * The driver creates the device node that can be used as follows:
> diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
> index 619b093a0657..043eefbbdd28 100644
> --- a/arch/powerpc/platforms/powernv/Kconfig
> +++ b/arch/powerpc/platforms/powernv/Kconfig
> @@ -33,20 +33,6 @@ config PPC_MEMTRACE
>  	  Enabling this option allows for runtime allocation of memory (RAM)
>  	  for hardware tracing.
>  
> -config PPC_VAS
> -	bool "IBM Virtual Accelerator Switchboard (VAS)"
> -	depends on PPC_POWERNV && PPC_64K_PAGES
> -	default y
> -	help
> -	  This enables support for IBM Virtual Accelerator Switchboard (VAS).
> -
> -	  VAS allows accelerators in co-processors like NX-GZIP and NX-842
> -	  to be accessible to kernel subsystems and user processes.
> -
> -	  VAS adapters are found in POWER9 based systems.
> -
> -	  If unsure, say N.
> -
>  config SCOM_DEBUGFS
>  	bool "Expose SCOM controllers via debugfs"
>  	depends on DEBUG_FS
> diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile
> index be2546b96816..dc7b37c23b60 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -18,7 +18,7 @@ obj-$(CONFIG_MEMORY_FAILURE)	+= opal-memory-errors.o
>  obj-$(CONFIG_OPAL_PRD)	+= opal-prd.o
>  obj-$(CONFIG_PERF_EVENTS) += opal-imc.o
>  obj-$(CONFIG_PPC_MEMTRACE)	+= memtrace.o
> -obj-$(CONFIG_PPC_VAS)	+= vas.o vas-window.o vas-debug.o vas-fault.o vas-api.o
> +obj-$(CONFIG_PPC_VAS)	+= vas.o vas-window.o vas-debug.o vas-fault.o
>  obj-$(CONFIG_OCXL_BASE)	+= ocxl.o
>  obj-$(CONFIG_SCOM_DEBUGFS) += opal-xscom.o
>  obj-$(CONFIG_PPC_SECURE_BOOT) += opal-secvar.o
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 01/17] powerpc/powernv/vas: Release reference to tgid during window close
From: Nicholas Piggin @ 2021-06-14  2:11 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <4c769385e96a9b7022a4fd22938310550ceba5e1.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 8:54 pm:
> 
> The kernel handles the NX fault by updating CSB or sending
> signal to process. In multithread applications, children can
> open VAS windows and can exit without closing them. But the
> parent can continue to send NX requests with these windows. To
> prevent pid reuse, reference will be taken on pid and tgid
> when the window is opened and release them during window close.
> 
> The current code is not releasing the tgid reference which can
> cause pid leak and this patch fixes the issue.
> 
> Fixes: db1c08a740635 ("powerpc/vas: Take reference to PID and mm for user space windows")
> Cc: stable@vger.kernel.org # 5.8+
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> Reported-by: Nicholas Piggin <npiggin@gmail.com>

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Nick

> ---
>  arch/powerpc/platforms/powernv/vas-window.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
> index 5f5fe63a3d1c..7ba0840fc3b5 100644
> --- a/arch/powerpc/platforms/powernv/vas-window.c
> +++ b/arch/powerpc/platforms/powernv/vas-window.c
> @@ -1093,9 +1093,9 @@ struct vas_window *vas_tx_win_open(int vasid, enum vas_cop_type cop,
>  		/*
>  		 * Process closes window during exit. In the case of
>  		 * multithread application, the child thread can open
> -		 * window and can exit without closing it. Expects parent
> -		 * thread to use and close the window. So do not need
> -		 * to take pid reference for parent thread.
> +		 * window and can exit without closing it. so takes tgid
> +		 * reference until window closed to make sure tgid is not
> +		 * reused.
>  		 */
>  		txwin->tgid = find_get_pid(task_tgid_vnr(current));
>  		/*
> @@ -1339,8 +1339,9 @@ int vas_win_close(struct vas_window *window)
>  	/* if send window, drop reference to matching receive window */
>  	if (window->tx_win) {
>  		if (window->user_win) {
> -			/* Drop references to pid and mm */
> +			/* Drop references to pid. tgid and mm */
>  			put_pid(window->pid);
> +			put_pid(window->tgid);
>  			if (window->mm) {
>  				mm_context_remove_vas_window(window->mm);
>  				mmdrop(window->mm);
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 04/17] powerpc/vas: Add platform specific user window operations
From: Nicholas Piggin @ 2021-06-14  2:24 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <c64fda6e9b684c175cedb3ec448cce7aaf0f4615.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 8:57 pm:
> 
> PowerNV uses registers to open/close VAS windows, and getting the
> paste address. Whereas the hypervisor calls are used on PowerVM.
> 
> This patch adds the platform specific user space window operations
> and register with the common VAS user space interface.
> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/vas.h              | 14 +++++-
>  arch/powerpc/platforms/book3s/vas-api.c     | 53 +++++++++++++--------
>  arch/powerpc/platforms/powernv/vas-window.c | 45 ++++++++++++++++-
>  3 files changed, 89 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
> index bab7891d43f5..85318d7446c7 100644
> --- a/arch/powerpc/include/asm/vas.h
> +++ b/arch/powerpc/include/asm/vas.h
> @@ -5,6 +5,7 @@
>  
>  #ifndef _ASM_POWERPC_VAS_H
>  #define _ASM_POWERPC_VAS_H
> +#include <uapi/asm/vas-api.h>
>  
>  struct vas_window;
>  
> @@ -48,6 +49,16 @@ enum vas_cop_type {
>  	VAS_COP_TYPE_MAX,
>  };
>  
> +/*
> + * User space window operations used for powernv and powerVM
> + */
> +struct vas_user_win_ops {
> +	struct vas_window * (*open_win)(struct vas_tx_win_open_attr *,
> +				enum vas_cop_type);
> +	u64 (*paste_addr)(struct vas_window *);
> +	int (*close_win)(struct vas_window *);
> +};

This looks better, but rather than pull in uapi and the user API 
structure here, could you just pass in vas_id and flags after the common 
code does the user copy and verifies the version and other details?

I think it's generally good practice to limit the data that the usre
can influence as much as possible. Sorry for not picking up on that
earlier.

If that's changed, then

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Nick

> +
>  /*
>   * Receive window attributes specified by the (in-kernel) owner of window.
>   */
> @@ -175,7 +186,8 @@ void vas_unregister_api_powernv(void);
>   * used for others in future.
>   */
>  int vas_register_coproc_api(struct module *mod, enum vas_cop_type cop_type,
> -				const char *name);
> +			    const char *name,
> +			    const struct vas_user_win_ops *vops);
>  void vas_unregister_coproc_api(void);
>  
>  #endif /* __ASM_POWERPC_VAS_H */
> diff --git a/arch/powerpc/platforms/book3s/vas-api.c b/arch/powerpc/platforms/book3s/vas-api.c
> index 72c126d87216..7cfc4b435ae8 100644
> --- a/arch/powerpc/platforms/book3s/vas-api.c
> +++ b/arch/powerpc/platforms/book3s/vas-api.c
> @@ -42,6 +42,7 @@ static struct coproc_dev {
>  	dev_t devt;
>  	struct class *class;
>  	enum vas_cop_type cop_type;
> +	const struct vas_user_win_ops *vops;
>  } coproc_device;
>  
>  struct coproc_instance {
> @@ -72,11 +73,10 @@ static int coproc_open(struct inode *inode, struct file *fp)
>  static int coproc_ioc_tx_win_open(struct file *fp, unsigned long arg)
>  {
>  	void __user *uptr = (void __user *)arg;
> -	struct vas_tx_win_attr txattr = {};
>  	struct vas_tx_win_open_attr uattr;
>  	struct coproc_instance *cp_inst;
>  	struct vas_window *txwin;
> -	int rc, vasid;
> +	int rc;
>  
>  	cp_inst = fp->private_data;
>  
> @@ -93,27 +93,20 @@ static int coproc_ioc_tx_win_open(struct file *fp, unsigned long arg)
>  	}
>  
>  	if (uattr.version != 1) {
> -		pr_err("Invalid version\n");
> +		pr_err("Invalid window open API version\n");
>  		return -EINVAL;
>  	}
>  
> -	vasid = uattr.vas_id;
> -
> -	vas_init_tx_win_attr(&txattr, cp_inst->coproc->cop_type);
> -
> -	txattr.lpid = mfspr(SPRN_LPID);
> -	txattr.pidr = mfspr(SPRN_PID);
> -	txattr.user_win = true;
> -	txattr.rsvd_txbuf_count = false;
> -	txattr.pswid = false;
> -
> -	pr_devel("Pid %d: Opening txwin, PIDR %ld\n", txattr.pidr,
> -				mfspr(SPRN_PID));
> +	if (!cp_inst->coproc->vops && !cp_inst->coproc->vops->open_win) {
> +		pr_err("VAS API is not registered\n");
> +		return -EACCES;
> +	}
>  
> -	txwin = vas_tx_win_open(vasid, cp_inst->coproc->cop_type, &txattr);
> +	txwin = cp_inst->coproc->vops->open_win(&uattr,
> +						cp_inst->coproc->cop_type);
>  	if (IS_ERR(txwin)) {
> -		pr_err("%s() vas_tx_win_open() failed, %ld\n", __func__,
> -					PTR_ERR(txwin));
> +		pr_err("%s() VAS window open failed, %ld\n", __func__,
> +				PTR_ERR(txwin));
>  		return PTR_ERR(txwin);
>  	}
>  
> @@ -125,9 +118,15 @@ static int coproc_ioc_tx_win_open(struct file *fp, unsigned long arg)
>  static int coproc_release(struct inode *inode, struct file *fp)
>  {
>  	struct coproc_instance *cp_inst = fp->private_data;
> +	int rc;
>  
>  	if (cp_inst->txwin) {
> -		vas_win_close(cp_inst->txwin);
> +		if (cp_inst->coproc->vops &&
> +			cp_inst->coproc->vops->close_win) {
> +			rc = cp_inst->coproc->vops->close_win(cp_inst->txwin);
> +			if (rc)
> +				return rc;
> +		}
>  		cp_inst->txwin = NULL;
>  	}
>  
> @@ -168,7 +167,17 @@ static int coproc_mmap(struct file *fp, struct vm_area_struct *vma)
>  		return -EINVAL;
>  	}
>  
> -	vas_win_paste_addr(txwin, &paste_addr, NULL);
> +	if (!cp_inst->coproc->vops && !cp_inst->coproc->vops->paste_addr) {
> +		pr_err("%s(): VAS API is not registered\n", __func__);
> +		return -EACCES;
> +	}
> +
> +	paste_addr = cp_inst->coproc->vops->paste_addr(txwin);
> +	if (!paste_addr) {
> +		pr_err("%s(): Window paste address failed\n", __func__);
> +		return -EINVAL;
> +	}
> +
>  	pfn = paste_addr >> PAGE_SHIFT;
>  
>  	/* flags, page_prot from cxl_mmap(), except we want cachable */
> @@ -208,7 +217,8 @@ static struct file_operations coproc_fops = {
>   * extended to other coprocessor types later.
>   */
>  int vas_register_coproc_api(struct module *mod, enum vas_cop_type cop_type,
> -				const char *name)
> +			    const char *name,
> +			    const struct vas_user_win_ops *vops)
>  {
>  	int rc = -EINVAL;
>  	dev_t devno;
> @@ -230,6 +240,7 @@ int vas_register_coproc_api(struct module *mod, enum vas_cop_type cop_type,
>  	}
>  	coproc_device.class->devnode = coproc_devnode;
>  	coproc_device.cop_type = cop_type;
> +	coproc_device.vops = vops;
>  
>  	coproc_fops.owner = mod;
>  	cdev_init(&coproc_device.cdev, &coproc_fops);
> diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
> index 41712b4b268e..5162e95c4090 100644
> --- a/arch/powerpc/platforms/powernv/vas-window.c
> +++ b/arch/powerpc/platforms/powernv/vas-window.c
> @@ -16,6 +16,7 @@
>  #include <linux/mmu_context.h>
>  #include <asm/switch_to.h>
>  #include <asm/ppc-opcode.h>
> +#include <asm/vas.h>
>  #include "vas.h"
>  #include "copy-paste.h"
>  
> @@ -1443,6 +1444,48 @@ struct vas_window *vas_pswid_to_window(struct vas_instance *vinst,
>  	return window;
>  }
>  
> +static struct vas_window *vas_user_win_open(struct vas_tx_win_open_attr *uattr,
> +				enum vas_cop_type cop_type)
> +{
> +	struct vas_tx_win_attr txattr = {};
> +
> +	vas_init_tx_win_attr(&txattr, cop_type);
> +
> +	txattr.lpid = mfspr(SPRN_LPID);
> +	txattr.pidr = mfspr(SPRN_PID);
> +	txattr.user_win = true;
> +	txattr.rsvd_txbuf_count = false;
> +	txattr.pswid = false;
> +
> +	pr_devel("Pid %d: Opening txwin, PIDR %ld\n", txattr.pidr,
> +				mfspr(SPRN_PID));
> +
> +	return vas_tx_win_open(uattr->vas_id, cop_type, &txattr);
> +}
> +
> +static u64 vas_user_win_paste_addr(struct vas_window *win)
> +{
> +	u64 paste_addr;
> +
> +	vas_win_paste_addr(win, &paste_addr, NULL);
> +
> +	return paste_addr;
> +}
> +
> +static int vas_user_win_close(struct vas_window *txwin)
> +{
> +
> +	vas_win_close(txwin);
> +
> +	return 0;
> +}
> +
> +static const struct vas_user_win_ops vops =  {
> +	.open_win	=	vas_user_win_open,
> +	.paste_addr	=	vas_user_win_paste_addr,
> +	.close_win	=	vas_user_win_close,
> +};
> +
>  /*
>   * Supporting only nx-gzip coprocessor type now, but this API code
>   * extended to other coprocessor types later.
> @@ -1451,7 +1494,7 @@ int vas_register_api_powernv(struct module *mod, enum vas_cop_type cop_type,
>  			     const char *name)
>  {
>  
> -	return vas_register_coproc_api(mod, cop_type, name);
> +	return vas_register_coproc_api(mod, cop_type, name, &vops);
>  }
>  EXPORT_SYMBOL_GPL(vas_register_api_powernv);
>  
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 05/17] powerpc/vas: Create take/drop pid and mm reference functions
From: Nicholas Piggin @ 2021-06-14  2:26 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <3d5873e775ae3c3a5dc9a62298a42d3f190f8d21.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 8:57 pm:
> 
> Take pid and mm references when each window opens and drops during
> close. This functionality is needed for powerNV and pseries. So
> this patch defines the existing code as functions in common book3s
> platform vas-api.c
> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Nick

> ---
>  arch/powerpc/include/asm/vas.h              | 40 +++++++++++++++
>  arch/powerpc/platforms/book3s/vas-api.c     | 39 +++++++++++++++
>  arch/powerpc/platforms/powernv/vas-fault.c  | 10 ++--
>  arch/powerpc/platforms/powernv/vas-window.c | 55 ++-------------------
>  arch/powerpc/platforms/powernv/vas.h        |  6 +--
>  5 files changed, 91 insertions(+), 59 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
> index 85318d7446c7..163460cff59b 100644
> --- a/arch/powerpc/include/asm/vas.h
> +++ b/arch/powerpc/include/asm/vas.h
> @@ -5,6 +5,9 @@
>  
>  #ifndef _ASM_POWERPC_VAS_H
>  #define _ASM_POWERPC_VAS_H
> +#include <linux/sched/mm.h>
> +#include <linux/mmu_context.h>
> +#include <asm/icswx.h>
>  #include <uapi/asm/vas-api.h>
>  
>  struct vas_window;
> @@ -49,6 +52,17 @@ enum vas_cop_type {
>  	VAS_COP_TYPE_MAX,
>  };
>  
> +/*
> + * User space VAS windows are opened by tasks and take references
> + * to pid and mm until windows are closed.
> + * Stores pid, mm, and tgid for each window.
> + */
> +struct vas_user_win_ref {
> +	struct pid *pid;	/* PID of owner */
> +	struct pid *tgid;	/* Thread group ID of owner */
> +	struct mm_struct *mm;	/* Linux process mm_struct */
> +};
> +
>  /*
>   * User space window operations used for powernv and powerVM
>   */
> @@ -59,6 +73,31 @@ struct vas_user_win_ops {
>  	int (*close_win)(struct vas_window *);
>  };
>  
> +static inline void put_vas_user_win_ref(struct vas_user_win_ref *ref)
> +{
> +	/* Drop references to pid, tgid, and mm */
> +	put_pid(ref->pid);
> +	put_pid(ref->tgid);
> +	if (ref->mm)
> +		mmdrop(ref->mm);
> +}
> +
> +static inline void vas_user_win_add_mm_context(struct vas_user_win_ref *ref)
> +{
> +	mm_context_add_vas_window(ref->mm);
> +	/*
> +	 * Even a process that has no foreign real address mapping can
> +	 * use an unpaired COPY instruction (to no real effect). Issue
> +	 * CP_ABORT to clear any pending COPY and prevent a covert
> +	 * channel.
> +	 *
> +	 * __switch_to() will issue CP_ABORT on future context switches
> +	 * if process / thread has any open VAS window (Use
> +	 * current->mm->context.vas_windows).
> +	 */
> +	asm volatile(PPC_CP_ABORT);
> +}
> +
>  /*
>   * Receive window attributes specified by the (in-kernel) owner of window.
>   */
> @@ -190,4 +229,5 @@ int vas_register_coproc_api(struct module *mod, enum vas_cop_type cop_type,
>  			    const struct vas_user_win_ops *vops);
>  void vas_unregister_coproc_api(void);
>  
> +int get_vas_user_win_ref(struct vas_user_win_ref *task_ref);
>  #endif /* __ASM_POWERPC_VAS_H */
> diff --git a/arch/powerpc/platforms/book3s/vas-api.c b/arch/powerpc/platforms/book3s/vas-api.c
> index 7cfc4b435ae8..1d7d3273d34b 100644
> --- a/arch/powerpc/platforms/book3s/vas-api.c
> +++ b/arch/powerpc/platforms/book3s/vas-api.c
> @@ -55,6 +55,45 @@ static char *coproc_devnode(struct device *dev, umode_t *mode)
>  	return kasprintf(GFP_KERNEL, "crypto/%s", dev_name(dev));
>  }
>  
> +/*
> + * Take reference to pid and mm
> + */
> +int get_vas_user_win_ref(struct vas_user_win_ref *task_ref)
> +{
> +	/*
> +	 * Window opened by a child thread may not be closed when
> +	 * it exits. So take reference to its pid and release it
> +	 * when the window is free by parent thread.
> +	 * Acquire a reference to the task's pid to make sure
> +	 * pid will not be re-used - needed only for multithread
> +	 * applications.
> +	 */
> +	task_ref->pid = get_task_pid(current, PIDTYPE_PID);
> +	/*
> +	 * Acquire a reference to the task's mm.
> +	 */
> +	task_ref->mm = get_task_mm(current);
> +	if (!task_ref->mm) {
> +		put_pid(task_ref->pid);
> +		pr_err("VAS: pid(%d): mm_struct is not found\n",
> +				current->pid);
> +		return -EPERM;
> +	}
> +
> +	mmgrab(task_ref->mm);
> +	mmput(task_ref->mm);
> +	/*
> +	 * Process closes window during exit. In the case of
> +	 * multithread application, the child thread can open
> +	 * window and can exit without closing it. So takes tgid
> +	 * reference until window closed to make sure tgid is not
> +	 * reused.
> +	 */
> +	task_ref->tgid = find_get_pid(task_tgid_vnr(current));
> +
> +	return 0;
> +}
> +
>  static int coproc_open(struct inode *inode, struct file *fp)
>  {
>  	struct coproc_instance *cp_inst;
> diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
> index 3d21fce254b7..ac3a71ec3bd5 100644
> --- a/arch/powerpc/platforms/powernv/vas-fault.c
> +++ b/arch/powerpc/platforms/powernv/vas-fault.c
> @@ -73,7 +73,7 @@ static void update_csb(struct vas_window *window,
>  	 * NX user space windows can not be opened for task->mm=NULL
>  	 * and faults will not be generated for kernel requests.
>  	 */
> -	if (WARN_ON_ONCE(!window->mm || !window->user_win))
> +	if (WARN_ON_ONCE(!window->task_ref.mm || !window->user_win))
>  		return;
>  
>  	csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
> @@ -92,7 +92,7 @@ static void update_csb(struct vas_window *window,
>  	csb.address = crb->stamp.nx.fault_storage_addr;
>  	csb.flags = 0;
>  
> -	pid = window->pid;
> +	pid = window->task_ref.pid;
>  	tsk = get_pid_task(pid, PIDTYPE_PID);
>  	/*
>  	 * Process closes send window after all pending NX requests are
> @@ -111,7 +111,7 @@ static void update_csb(struct vas_window *window,
>  	 * a window and exits without closing it.
>  	 */
>  	if (!tsk) {
> -		pid = window->tgid;
> +		pid = window->task_ref.tgid;
>  		tsk = get_pid_task(pid, PIDTYPE_PID);
>  		/*
>  		 * Parent thread (tgid) will be closing window when it
> @@ -127,7 +127,7 @@ static void update_csb(struct vas_window *window,
>  		return;
>  	}
>  
> -	kthread_use_mm(window->mm);
> +	kthread_use_mm(window->task_ref.mm);
>  	rc = copy_to_user(csb_addr, &csb, sizeof(csb));
>  	/*
>  	 * User space polls on csb.flags (first byte). So add barrier
> @@ -139,7 +139,7 @@ static void update_csb(struct vas_window *window,
>  		smp_mb();
>  		rc = copy_to_user(csb_addr, &csb, sizeof(u8));
>  	}
> -	kthread_unuse_mm(window->mm);
> +	kthread_unuse_mm(window->task_ref.mm);
>  	put_task_struct(tsk);
>  
>  	/* Success */
> diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
> index 5162e95c4090..4222c9bdb8fe 100644
> --- a/arch/powerpc/platforms/powernv/vas-window.c
> +++ b/arch/powerpc/platforms/powernv/vas-window.c
> @@ -1065,51 +1065,11 @@ struct vas_window *vas_tx_win_open(int vasid, enum vas_cop_type cop,
>  			rc = -ENODEV;
>  			goto free_window;
>  		}
> -
> -		/*
> -		 * Window opened by a child thread may not be closed when
> -		 * it exits. So take reference to its pid and release it
> -		 * when the window is free by parent thread.
> -		 * Acquire a reference to the task's pid to make sure
> -		 * pid will not be re-used - needed only for multithread
> -		 * applications.
> -		 */
> -		txwin->pid = get_task_pid(current, PIDTYPE_PID);
> -		/*
> -		 * Acquire a reference to the task's mm.
> -		 */
> -		txwin->mm = get_task_mm(current);
> -
> -		if (!txwin->mm) {
> -			put_pid(txwin->pid);
> -			pr_err("VAS: pid(%d): mm_struct is not found\n",
> -					current->pid);
> -			rc = -EPERM;
> +		rc = get_vas_user_win_ref(&txwin->task_ref);
> +		if (rc)
>  			goto free_window;
> -		}
>  
> -		mmgrab(txwin->mm);
> -		mmput(txwin->mm);
> -		mm_context_add_vas_window(txwin->mm);
> -		/*
> -		 * Process closes window during exit. In the case of
> -		 * multithread application, the child thread can open
> -		 * window and can exit without closing it. so takes tgid
> -		 * reference until window closed to make sure tgid is not
> -		 * reused.
> -		 */
> -		txwin->tgid = find_get_pid(task_tgid_vnr(current));
> -		/*
> -		 * Even a process that has no foreign real address mapping can
> -		 * use an unpaired COPY instruction (to no real effect). Issue
> -		 * CP_ABORT to clear any pending COPY and prevent a covert
> -		 * channel.
> -		 *
> -		 * __switch_to() will issue CP_ABORT on future context switches
> -		 * if process / thread has any open VAS window (Use
> -		 * current->mm->context.vas_windows).
> -		 */
> -		asm volatile(PPC_CP_ABORT);
> +		vas_user_win_add_mm_context(&txwin->task_ref);
>  	}
>  
>  	set_vinst_win(vinst, txwin);
> @@ -1340,13 +1300,8 @@ int vas_win_close(struct vas_window *window)
>  	/* if send window, drop reference to matching receive window */
>  	if (window->tx_win) {
>  		if (window->user_win) {
> -			/* Drop references to pid. tgid and mm */
> -			put_pid(window->pid);
> -			put_pid(window->tgid);
> -			if (window->mm) {
> -				mm_context_remove_vas_window(window->mm);
> -				mmdrop(window->mm);
> -			}
> +			put_vas_user_win_ref(&window->task_ref);
> +			mm_context_remove_vas_window(window->task_ref.mm);
>  		}
>  		put_rx_win(window->rxwin);
>  	}
> diff --git a/arch/powerpc/platforms/powernv/vas.h b/arch/powerpc/platforms/powernv/vas.h
> index c7db3190baca..f354dd5c51bd 100644
> --- a/arch/powerpc/platforms/powernv/vas.h
> +++ b/arch/powerpc/platforms/powernv/vas.h
> @@ -357,11 +357,9 @@ struct vas_window {
>  	bool user_win;		/* True if user space window */
>  	void *hvwc_map;		/* HV window context */
>  	void *uwc_map;		/* OS/User window context */
> -	struct pid *pid;	/* Linux process id of owner */
> -	struct pid *tgid;	/* Thread group ID of owner */
> -	struct mm_struct *mm;	/* Linux process mm_struct */
>  	int wcreds_max;		/* Window credits */
>  
> +	struct vas_user_win_ref task_ref;
>  	char *dbgname;
>  	struct dentry *dbgdir;
>  
> @@ -443,7 +441,7 @@ extern void vas_win_paste_addr(struct vas_window *window, u64 *addr,
>  
>  static inline int vas_window_pid(struct vas_window *window)
>  {
> -	return pid_vnr(window->pid);
> +	return pid_vnr(window->task_ref.pid);
>  }
>  
>  static inline void vas_log_write(struct vas_window *win, char *name,
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 07/17] powerpc/vas:  Define and use common vas_window struct
From: Nicholas Piggin @ 2021-06-14  2:28 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <e717c0370a9c8e7853cbe474c51b01b43566699f.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 8:58 pm:
> 
> Many elements in vas_struct are used on PowerNV and PowerVM
> platforms. vas_window is used for both TX and RX windows on
> PowerNV and for TX windows on PowerVM. So some elements are
> specific to these platforms.
> 
> So this patch defines common vas_window and platform
> specific window structs (pnv_vas_window on PowerNV). Also adds
> the corresponding changes in PowerNV vas code.
> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>

Thanks for making this change, it must have been a lot of churn. Again, 
apologies for not picking up on it earlier. I think the end result is
nicer.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Nick

> ---
>  arch/powerpc/include/asm/vas.h              |  14 +-
>  arch/powerpc/platforms/powernv/vas-debug.c  |  27 ++--
>  arch/powerpc/platforms/powernv/vas-fault.c  |  20 +--
>  arch/powerpc/platforms/powernv/vas-trace.h  |   4 +-
>  arch/powerpc/platforms/powernv/vas-window.c | 161 +++++++++++---------
>  arch/powerpc/platforms/powernv/vas.h        |  44 +++---
>  6 files changed, 144 insertions(+), 126 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
> index c1daab4cc205..1acf2b18c2d1 100644
> --- a/arch/powerpc/include/asm/vas.h
> +++ b/arch/powerpc/include/asm/vas.h
> @@ -10,8 +10,6 @@
>  #include <asm/icswx.h>
>  #include <uapi/asm/vas-api.h>
>  
> -struct vas_window;
> -
>  /*
>   * Min and max FIFO sizes are based on Version 1.05 Section 3.1.4.25
>   * (Local FIFO Size Register) of the VAS workbook.
> @@ -63,6 +61,18 @@ struct vas_user_win_ref {
>  	struct mm_struct *mm;	/* Linux process mm_struct */
>  };
>  
> +/*
> + * Common VAS window struct on PowerNV and PowerVM
> + */
> +struct vas_window {
> +	u32 winid;
> +	u32 wcreds_max;	/* Window credits */
> +	enum vas_cop_type cop;
> +	struct vas_user_win_ref task_ref;
> +	char *dbgname;
> +	struct dentry *dbgdir;
> +};
> +
>  /*
>   * User space window operations used for powernv and powerVM
>   */
> diff --git a/arch/powerpc/platforms/powernv/vas-debug.c b/arch/powerpc/platforms/powernv/vas-debug.c
> index 41fa90d2f4ab..3ce89a4b54be 100644
> --- a/arch/powerpc/platforms/powernv/vas-debug.c
> +++ b/arch/powerpc/platforms/powernv/vas-debug.c
> @@ -9,6 +9,7 @@
>  #include <linux/slab.h>
>  #include <linux/debugfs.h>
>  #include <linux/seq_file.h>
> +#include <asm/vas.h>
>  #include "vas.h"
>  
>  static struct dentry *vas_debugfs;
> @@ -28,7 +29,7 @@ static char *cop_to_str(int cop)
>  
>  static int info_show(struct seq_file *s, void *private)
>  {
> -	struct vas_window *window = s->private;
> +	struct pnv_vas_window *window = s->private;
>  
>  	mutex_lock(&vas_mutex);
>  
> @@ -36,9 +37,9 @@ static int info_show(struct seq_file *s, void *private)
>  	if (!window->hvwc_map)
>  		goto unlock;
>  
> -	seq_printf(s, "Type: %s, %s\n", cop_to_str(window->cop),
> +	seq_printf(s, "Type: %s, %s\n", cop_to_str(window->vas_win.cop),
>  					window->tx_win ? "Send" : "Receive");
> -	seq_printf(s, "Pid : %d\n", vas_window_pid(window));
> +	seq_printf(s, "Pid : %d\n", vas_window_pid(&window->vas_win));
>  
>  unlock:
>  	mutex_unlock(&vas_mutex);
> @@ -47,7 +48,7 @@ static int info_show(struct seq_file *s, void *private)
>  
>  DEFINE_SHOW_ATTRIBUTE(info);
>  
> -static inline void print_reg(struct seq_file *s, struct vas_window *win,
> +static inline void print_reg(struct seq_file *s, struct pnv_vas_window *win,
>  			char *name, u32 reg)
>  {
>  	seq_printf(s, "0x%016llx %s\n", read_hvwc_reg(win, name, reg), name);
> @@ -55,7 +56,7 @@ static inline void print_reg(struct seq_file *s, struct vas_window *win,
>  
>  static int hvwc_show(struct seq_file *s, void *private)
>  {
> -	struct vas_window *window = s->private;
> +	struct pnv_vas_window *window = s->private;
>  
>  	mutex_lock(&vas_mutex);
>  
> @@ -103,8 +104,10 @@ static int hvwc_show(struct seq_file *s, void *private)
>  
>  DEFINE_SHOW_ATTRIBUTE(hvwc);
>  
> -void vas_window_free_dbgdir(struct vas_window *window)
> +void vas_window_free_dbgdir(struct pnv_vas_window *pnv_win)
>  {
> +	struct vas_window *window =  &pnv_win->vas_win;
> +
>  	if (window->dbgdir) {
>  		debugfs_remove_recursive(window->dbgdir);
>  		kfree(window->dbgname);
> @@ -113,21 +116,21 @@ void vas_window_free_dbgdir(struct vas_window *window)
>  	}
>  }
>  
> -void vas_window_init_dbgdir(struct vas_window *window)
> +void vas_window_init_dbgdir(struct pnv_vas_window *window)
>  {
>  	struct dentry *d;
>  
>  	if (!window->vinst->dbgdir)
>  		return;
>  
> -	window->dbgname = kzalloc(16, GFP_KERNEL);
> -	if (!window->dbgname)
> +	window->vas_win.dbgname = kzalloc(16, GFP_KERNEL);
> +	if (!window->vas_win.dbgname)
>  		return;
>  
> -	snprintf(window->dbgname, 16, "w%d", window->winid);
> +	snprintf(window->vas_win.dbgname, 16, "w%d", window->vas_win.winid);
>  
> -	d = debugfs_create_dir(window->dbgname, window->vinst->dbgdir);
> -	window->dbgdir = d;
> +	d = debugfs_create_dir(window->vas_win.dbgname, window->vinst->dbgdir);
> +	window->vas_win.dbgdir = d;
>  
>  	debugfs_create_file("info", 0444, d, window, &info_fops);
>  	debugfs_create_file("hvwc", 0444, d, window, &hvwc_fops);
> diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
> index 2729ac541fb3..a7aabc18039e 100644
> --- a/arch/powerpc/platforms/powernv/vas-fault.c
> +++ b/arch/powerpc/platforms/powernv/vas-fault.c
> @@ -68,7 +68,7 @@ irqreturn_t vas_fault_thread_fn(int irq, void *data)
>  	struct vas_instance *vinst = data;
>  	struct coprocessor_request_block *crb, *entry;
>  	struct coprocessor_request_block buf;
> -	struct vas_window *window;
> +	struct pnv_vas_window *window;
>  	unsigned long flags;
>  	void *fifo;
>  
> @@ -153,7 +153,7 @@ irqreturn_t vas_fault_thread_fn(int irq, void *data)
>  			 * NX sees faults only with user space windows.
>  			 */
>  			if (window->user_win)
> -				vas_update_csb(crb, &window->task_ref);
> +				vas_update_csb(crb, &window->vas_win.task_ref);
>  			else
>  				WARN_ON_ONCE(!window->user_win);
>  
> @@ -199,6 +199,7 @@ irqreturn_t vas_fault_handler(int irq, void *dev_id)
>  int vas_setup_fault_window(struct vas_instance *vinst)
>  {
>  	struct vas_rx_win_attr attr;
> +	struct vas_window *win;
>  
>  	vinst->fault_fifo_size = VAS_FAULT_WIN_FIFO_SIZE;
>  	vinst->fault_fifo = kzalloc(vinst->fault_fifo_size, GFP_KERNEL);
> @@ -227,18 +228,17 @@ int vas_setup_fault_window(struct vas_instance *vinst)
>  	attr.lnotify_pid = mfspr(SPRN_PID);
>  	attr.lnotify_tid = mfspr(SPRN_PID);
>  
> -	vinst->fault_win = vas_rx_win_open(vinst->vas_id, VAS_COP_TYPE_FAULT,
> -					&attr);
> -
> -	if (IS_ERR(vinst->fault_win)) {
> -		pr_err("VAS: Error %ld opening FaultWin\n",
> -			PTR_ERR(vinst->fault_win));
> +	win = vas_rx_win_open(vinst->vas_id, VAS_COP_TYPE_FAULT, &attr);
> +	if (IS_ERR(win)) {
> +		pr_err("VAS: Error %ld opening FaultWin\n", PTR_ERR(win));
>  		kfree(vinst->fault_fifo);
> -		return PTR_ERR(vinst->fault_win);
> +		return PTR_ERR(win);
>  	}
>  
> +	vinst->fault_win = container_of(win, struct pnv_vas_window, vas_win);
> +
>  	pr_devel("VAS: Created FaultWin %d, LPID/PID/TID [%d/%d/%d]\n",
> -			vinst->fault_win->winid, attr.lnotify_lpid,
> +			vinst->fault_win->vas_win.winid, attr.lnotify_lpid,
>  			attr.lnotify_pid, attr.lnotify_tid);
>  
>  	return 0;
> diff --git a/arch/powerpc/platforms/powernv/vas-trace.h b/arch/powerpc/platforms/powernv/vas-trace.h
> index a449b9f0c12e..ca2e08f2ddc0 100644
> --- a/arch/powerpc/platforms/powernv/vas-trace.h
> +++ b/arch/powerpc/platforms/powernv/vas-trace.h
> @@ -80,7 +80,7 @@ TRACE_EVENT(	vas_tx_win_open,
>  TRACE_EVENT(	vas_paste_crb,
>  
>  		TP_PROTO(struct task_struct *tsk,
> -			struct vas_window *win),
> +			struct pnv_vas_window *win),
>  
>  		TP_ARGS(tsk, win),
>  
> @@ -96,7 +96,7 @@ TRACE_EVENT(	vas_paste_crb,
>  		TP_fast_assign(
>  			__entry->pid = tsk->pid;
>  			__entry->vasid = win->vinst->vas_id;
> -			__entry->winid = win->winid;
> +			__entry->winid = win->vas_win.winid;
>  			__entry->paste_kaddr = (unsigned long)win->paste_kaddr
>  		),
>  
> diff --git a/arch/powerpc/platforms/powernv/vas-window.c b/arch/powerpc/platforms/powernv/vas-window.c
> index 4222c9bdb8fe..95ee13f81bdc 100644
> --- a/arch/powerpc/platforms/powernv/vas-window.c
> +++ b/arch/powerpc/platforms/powernv/vas-window.c
> @@ -27,14 +27,14 @@
>   * Compute the paste address region for the window @window using the
>   * ->paste_base_addr and ->paste_win_id_shift we got from device tree.
>   */
> -void vas_win_paste_addr(struct vas_window *window, u64 *addr, int *len)
> +void vas_win_paste_addr(struct pnv_vas_window *window, u64 *addr, int *len)
>  {
>  	int winid;
>  	u64 base, shift;
>  
>  	base = window->vinst->paste_base_addr;
>  	shift = window->vinst->paste_win_id_shift;
> -	winid = window->winid;
> +	winid = window->vas_win.winid;
>  
>  	*addr  = base + (winid << shift);
>  	if (len)
> @@ -43,23 +43,23 @@ void vas_win_paste_addr(struct vas_window *window, u64 *addr, int *len)
>  	pr_debug("Txwin #%d: Paste addr 0x%llx\n", winid, *addr);
>  }
>  
> -static inline void get_hvwc_mmio_bar(struct vas_window *window,
> +static inline void get_hvwc_mmio_bar(struct pnv_vas_window *window,
>  			u64 *start, int *len)
>  {
>  	u64 pbaddr;
>  
>  	pbaddr = window->vinst->hvwc_bar_start;
> -	*start = pbaddr + window->winid * VAS_HVWC_SIZE;
> +	*start = pbaddr + window->vas_win.winid * VAS_HVWC_SIZE;
>  	*len = VAS_HVWC_SIZE;
>  }
>  
> -static inline void get_uwc_mmio_bar(struct vas_window *window,
> +static inline void get_uwc_mmio_bar(struct pnv_vas_window *window,
>  			u64 *start, int *len)
>  {
>  	u64 pbaddr;
>  
>  	pbaddr = window->vinst->uwc_bar_start;
> -	*start = pbaddr + window->winid * VAS_UWC_SIZE;
> +	*start = pbaddr + window->vas_win.winid * VAS_UWC_SIZE;
>  	*len = VAS_UWC_SIZE;
>  }
>  
> @@ -68,7 +68,7 @@ static inline void get_uwc_mmio_bar(struct vas_window *window,
>   * space. Unlike MMIO regions (map_mmio_region() below), paste region must
>   * be mapped cache-able and is only applicable to send windows.
>   */
> -static void *map_paste_region(struct vas_window *txwin)
> +static void *map_paste_region(struct pnv_vas_window *txwin)
>  {
>  	int len;
>  	void *map;
> @@ -76,7 +76,7 @@ static void *map_paste_region(struct vas_window *txwin)
>  	u64 start;
>  
>  	name = kasprintf(GFP_KERNEL, "window-v%d-w%d", txwin->vinst->vas_id,
> -				txwin->winid);
> +				txwin->vas_win.winid);
>  	if (!name)
>  		goto free_name;
>  
> @@ -133,7 +133,7 @@ static void unmap_region(void *addr, u64 start, int len)
>  /*
>   * Unmap the paste address region for a window.
>   */
> -static void unmap_paste_region(struct vas_window *window)
> +static void unmap_paste_region(struct pnv_vas_window *window)
>  {
>  	int len;
>  	u64 busaddr_start;
> @@ -154,7 +154,7 @@ static void unmap_paste_region(struct vas_window *window)
>   * path, just minimize the time we hold the mutex for now. We can add
>   * a per-instance mutex later if necessary.
>   */
> -static void unmap_winctx_mmio_bars(struct vas_window *window)
> +static void unmap_winctx_mmio_bars(struct pnv_vas_window *window)
>  {
>  	int len;
>  	void *uwc_map;
> @@ -187,7 +187,7 @@ static void unmap_winctx_mmio_bars(struct vas_window *window)
>   * OS/User Window Context (UWC) MMIO Base Address Region for the given window.
>   * Map these bus addresses and save the mapped kernel addresses in @window.
>   */
> -static int map_winctx_mmio_bars(struct vas_window *window)
> +static int map_winctx_mmio_bars(struct pnv_vas_window *window)
>  {
>  	int len;
>  	u64 start;
> @@ -215,7 +215,7 @@ static int map_winctx_mmio_bars(struct vas_window *window)
>   *	 registers are not sequential. And, we can only write to offsets
>   *	 with valid registers.
>   */
> -static void reset_window_regs(struct vas_window *window)
> +static void reset_window_regs(struct pnv_vas_window *window)
>  {
>  	write_hvwc_reg(window, VREG(LPID), 0ULL);
>  	write_hvwc_reg(window, VREG(PID), 0ULL);
> @@ -271,7 +271,7 @@ static void reset_window_regs(struct vas_window *window)
>   * want to add fields to vas_winctx and move the initialization to
>   * init_vas_winctx_regs().
>   */
> -static void init_xlate_regs(struct vas_window *window, bool user_win)
> +static void init_xlate_regs(struct pnv_vas_window *window, bool user_win)
>  {
>  	u64 lpcr, val;
>  
> @@ -336,7 +336,7 @@ static void init_xlate_regs(struct vas_window *window, bool user_win)
>   *
>   * TODO: Reserved (aka dedicated) send buffers are not supported yet.
>   */
> -static void init_rsvd_tx_buf_count(struct vas_window *txwin,
> +static void init_rsvd_tx_buf_count(struct pnv_vas_window *txwin,
>  				struct vas_winctx *winctx)
>  {
>  	write_hvwc_reg(txwin, VREG(TX_RSVD_BUF_COUNT), 0ULL);
> @@ -358,7 +358,7 @@ static void init_rsvd_tx_buf_count(struct vas_window *txwin,
>   *	as a one-time task? That could work for NX but what about other
>   *	receivers?  Let the receivers tell us the rx-fifo buffers for now.
>   */
> -static void init_winctx_regs(struct vas_window *window,
> +static void init_winctx_regs(struct pnv_vas_window *window,
>  			     struct vas_winctx *winctx)
>  {
>  	u64 val;
> @@ -520,10 +520,10 @@ static int vas_assign_window_id(struct ida *ida)
>  	return winid;
>  }
>  
> -static void vas_window_free(struct vas_window *window)
> +static void vas_window_free(struct pnv_vas_window *window)
>  {
> -	int winid = window->winid;
>  	struct vas_instance *vinst = window->vinst;
> +	int winid = window->vas_win.winid;
>  
>  	unmap_winctx_mmio_bars(window);
>  
> @@ -534,10 +534,10 @@ static void vas_window_free(struct vas_window *window)
>  	vas_release_window_id(&vinst->ida, winid);
>  }
>  
> -static struct vas_window *vas_window_alloc(struct vas_instance *vinst)
> +static struct pnv_vas_window *vas_window_alloc(struct vas_instance *vinst)
>  {
>  	int winid;
> -	struct vas_window *window;
> +	struct pnv_vas_window *window;
>  
>  	winid = vas_assign_window_id(&vinst->ida);
>  	if (winid < 0)
> @@ -548,7 +548,7 @@ static struct vas_window *vas_window_alloc(struct vas_instance *vinst)
>  		goto out_free;
>  
>  	window->vinst = vinst;
> -	window->winid = winid;
> +	window->vas_win.winid = winid;
>  
>  	if (map_winctx_mmio_bars(window))
>  		goto out_free;
> @@ -563,7 +563,7 @@ static struct vas_window *vas_window_alloc(struct vas_instance *vinst)
>  	return ERR_PTR(-ENOMEM);
>  }
>  
> -static void put_rx_win(struct vas_window *rxwin)
> +static void put_rx_win(struct pnv_vas_window *rxwin)
>  {
>  	/* Better not be a send window! */
>  	WARN_ON_ONCE(rxwin->tx_win);
> @@ -579,10 +579,11 @@ static void put_rx_win(struct vas_window *rxwin)
>   *
>   * NOTE: We access ->windows[] table and assume that vinst->mutex is held.
>   */
> -static struct vas_window *get_user_rxwin(struct vas_instance *vinst, u32 pswid)
> +static struct pnv_vas_window *get_user_rxwin(struct vas_instance *vinst,
> +					     u32 pswid)
>  {
>  	int vasid, winid;
> -	struct vas_window *rxwin;
> +	struct pnv_vas_window *rxwin;
>  
>  	decode_pswid(pswid, &vasid, &winid);
>  
> @@ -591,7 +592,7 @@ static struct vas_window *get_user_rxwin(struct vas_instance *vinst, u32 pswid)
>  
>  	rxwin = vinst->windows[winid];
>  
> -	if (!rxwin || rxwin->tx_win || rxwin->cop != VAS_COP_TYPE_FTW)
> +	if (!rxwin || rxwin->tx_win || rxwin->vas_win.cop != VAS_COP_TYPE_FTW)
>  		return ERR_PTR(-EINVAL);
>  
>  	return rxwin;
> @@ -603,10 +604,10 @@ static struct vas_window *get_user_rxwin(struct vas_instance *vinst, u32 pswid)
>   *
>   * See also function header of set_vinst_win().
>   */
> -static struct vas_window *get_vinst_rxwin(struct vas_instance *vinst,
> +static struct pnv_vas_window *get_vinst_rxwin(struct vas_instance *vinst,
>  			enum vas_cop_type cop, u32 pswid)
>  {
> -	struct vas_window *rxwin;
> +	struct pnv_vas_window *rxwin;
>  
>  	mutex_lock(&vinst->mutex);
>  
> @@ -639,9 +640,9 @@ static struct vas_window *get_vinst_rxwin(struct vas_instance *vinst,
>   * window, we also save the window in the ->rxwin[] table.
>   */
>  static void set_vinst_win(struct vas_instance *vinst,
> -			struct vas_window *window)
> +			struct pnv_vas_window *window)
>  {
> -	int id = window->winid;
> +	int id = window->vas_win.winid;
>  
>  	mutex_lock(&vinst->mutex);
>  
> @@ -650,8 +651,8 @@ static void set_vinst_win(struct vas_instance *vinst,
>  	 * unless its a user (FTW) window.
>  	 */
>  	if (!window->user_win && !window->tx_win) {
> -		WARN_ON_ONCE(vinst->rxwin[window->cop]);
> -		vinst->rxwin[window->cop] = window;
> +		WARN_ON_ONCE(vinst->rxwin[window->vas_win.cop]);
> +		vinst->rxwin[window->vas_win.cop] = window;
>  	}
>  
>  	WARN_ON_ONCE(vinst->windows[id] != NULL);
> @@ -664,16 +665,16 @@ static void set_vinst_win(struct vas_instance *vinst,
>   * Clear this window from the table(s) of windows for this VAS instance.
>   * See also function header of set_vinst_win().
>   */
> -static void clear_vinst_win(struct vas_window *window)
> +static void clear_vinst_win(struct pnv_vas_window *window)
>  {
> -	int id = window->winid;
> +	int id = window->vas_win.winid;
>  	struct vas_instance *vinst = window->vinst;
>  
>  	mutex_lock(&vinst->mutex);
>  
>  	if (!window->user_win && !window->tx_win) {
> -		WARN_ON_ONCE(!vinst->rxwin[window->cop]);
> -		vinst->rxwin[window->cop] = NULL;
> +		WARN_ON_ONCE(!vinst->rxwin[window->vas_win.cop]);
> +		vinst->rxwin[window->vas_win.cop] = NULL;
>  	}
>  
>  	WARN_ON_ONCE(vinst->windows[id] != window);
> @@ -682,7 +683,7 @@ static void clear_vinst_win(struct vas_window *window)
>  	mutex_unlock(&vinst->mutex);
>  }
>  
> -static void init_winctx_for_rxwin(struct vas_window *rxwin,
> +static void init_winctx_for_rxwin(struct pnv_vas_window *rxwin,
>  			struct vas_rx_win_attr *rxattr,
>  			struct vas_winctx *winctx)
>  {
> @@ -703,7 +704,7 @@ static void init_winctx_for_rxwin(struct vas_window *rxwin,
>  
>  	winctx->rx_fifo = rxattr->rx_fifo;
>  	winctx->rx_fifo_size = rxattr->rx_fifo_size;
> -	winctx->wcreds_max = rxwin->wcreds_max;
> +	winctx->wcreds_max = rxwin->vas_win.wcreds_max;
>  	winctx->pin_win = rxattr->pin_win;
>  
>  	winctx->nx_win = rxattr->nx_win;
> @@ -852,7 +853,7 @@ EXPORT_SYMBOL_GPL(vas_init_rx_win_attr);
>  struct vas_window *vas_rx_win_open(int vasid, enum vas_cop_type cop,
>  			struct vas_rx_win_attr *rxattr)
>  {
> -	struct vas_window *rxwin;
> +	struct pnv_vas_window *rxwin;
>  	struct vas_winctx winctx;
>  	struct vas_instance *vinst;
>  
> @@ -871,21 +872,21 @@ struct vas_window *vas_rx_win_open(int vasid, enum vas_cop_type cop,
>  	rxwin = vas_window_alloc(vinst);
>  	if (IS_ERR(rxwin)) {
>  		pr_devel("Unable to allocate memory for Rx window\n");
> -		return rxwin;
> +		return (struct vas_window *)rxwin;
>  	}
>  
>  	rxwin->tx_win = false;
>  	rxwin->nx_win = rxattr->nx_win;
>  	rxwin->user_win = rxattr->user_win;
> -	rxwin->cop = cop;
> -	rxwin->wcreds_max = rxattr->wcreds_max;
> +	rxwin->vas_win.cop = cop;
> +	rxwin->vas_win.wcreds_max = rxattr->wcreds_max;
>  
>  	init_winctx_for_rxwin(rxwin, rxattr, &winctx);
>  	init_winctx_regs(rxwin, &winctx);
>  
>  	set_vinst_win(vinst, rxwin);
>  
> -	return rxwin;
> +	return &rxwin->vas_win;
>  }
>  EXPORT_SYMBOL_GPL(vas_rx_win_open);
>  
> @@ -906,7 +907,7 @@ void vas_init_tx_win_attr(struct vas_tx_win_attr *txattr, enum vas_cop_type cop)
>  }
>  EXPORT_SYMBOL_GPL(vas_init_tx_win_attr);
>  
> -static void init_winctx_for_txwin(struct vas_window *txwin,
> +static void init_winctx_for_txwin(struct pnv_vas_window *txwin,
>  			struct vas_tx_win_attr *txattr,
>  			struct vas_winctx *winctx)
>  {
> @@ -927,7 +928,7 @@ static void init_winctx_for_txwin(struct vas_window *txwin,
>  	 */
>  	memset(winctx, 0, sizeof(struct vas_winctx));
>  
> -	winctx->wcreds_max = txwin->wcreds_max;
> +	winctx->wcreds_max = txwin->vas_win.wcreds_max;
>  
>  	winctx->user_win = txattr->user_win;
>  	winctx->nx_win = txwin->rxwin->nx_win;
> @@ -947,13 +948,13 @@ static void init_winctx_for_txwin(struct vas_window *txwin,
>  
>  	winctx->lpid = txattr->lpid;
>  	winctx->pidr = txattr->pidr;
> -	winctx->rx_win_id = txwin->rxwin->winid;
> +	winctx->rx_win_id = txwin->rxwin->vas_win.winid;
>  	/*
>  	 * IRQ and fault window setup is successful. Set fault window
>  	 * for the send window so that ready to handle faults.
>  	 */
>  	if (txwin->vinst->virq)
> -		winctx->fault_win_id = txwin->vinst->fault_win->winid;
> +		winctx->fault_win_id = txwin->vinst->fault_win->vas_win.winid;
>  
>  	winctx->dma_type = VAS_DMA_TYPE_INJECT;
>  	winctx->tc_mode = txattr->tc_mode;
> @@ -963,7 +964,8 @@ static void init_winctx_for_txwin(struct vas_window *txwin,
>  		winctx->irq_port = txwin->vinst->irq_port;
>  
>  	winctx->pswid = txattr->pswid ? txattr->pswid :
> -			encode_pswid(txwin->vinst->vas_id, txwin->winid);
> +			encode_pswid(txwin->vinst->vas_id,
> +			txwin->vas_win.winid);
>  }
>  
>  static bool tx_win_args_valid(enum vas_cop_type cop,
> @@ -994,8 +996,8 @@ struct vas_window *vas_tx_win_open(int vasid, enum vas_cop_type cop,
>  			struct vas_tx_win_attr *attr)
>  {
>  	int rc;
> -	struct vas_window *txwin;
> -	struct vas_window *rxwin;
> +	struct pnv_vas_window *txwin;
> +	struct pnv_vas_window *rxwin;
>  	struct vas_winctx winctx;
>  	struct vas_instance *vinst;
>  
> @@ -1021,7 +1023,7 @@ struct vas_window *vas_tx_win_open(int vasid, enum vas_cop_type cop,
>  	rxwin = get_vinst_rxwin(vinst, cop, attr->pswid);
>  	if (IS_ERR(rxwin)) {
>  		pr_devel("No RxWin for vasid %d, cop %d\n", vasid, cop);
> -		return rxwin;
> +		return (struct vas_window *)rxwin;
>  	}
>  
>  	txwin = vas_window_alloc(vinst);
> @@ -1030,12 +1032,12 @@ struct vas_window *vas_tx_win_open(int vasid, enum vas_cop_type cop,
>  		goto put_rxwin;
>  	}
>  
> -	txwin->cop = cop;
> +	txwin->vas_win.cop = cop;
>  	txwin->tx_win = 1;
>  	txwin->rxwin = rxwin;
>  	txwin->nx_win = txwin->rxwin->nx_win;
>  	txwin->user_win = attr->user_win;
> -	txwin->wcreds_max = attr->wcreds_max ?: VAS_WCREDS_DEFAULT;
> +	txwin->vas_win.wcreds_max = attr->wcreds_max ?: VAS_WCREDS_DEFAULT;
>  
>  	init_winctx_for_txwin(txwin, attr, &winctx);
>  
> @@ -1065,16 +1067,16 @@ struct vas_window *vas_tx_win_open(int vasid, enum vas_cop_type cop,
>  			rc = -ENODEV;
>  			goto free_window;
>  		}
> -		rc = get_vas_user_win_ref(&txwin->task_ref);
> +		rc = get_vas_user_win_ref(&txwin->vas_win.task_ref);
>  		if (rc)
>  			goto free_window;
>  
> -		vas_user_win_add_mm_context(&txwin->task_ref);
> +		vas_user_win_add_mm_context(&txwin->vas_win.task_ref);
>  	}
>  
>  	set_vinst_win(vinst, txwin);
>  
> -	return txwin;
> +	return &txwin->vas_win;
>  
>  free_window:
>  	vas_window_free(txwin);
> @@ -1093,12 +1095,14 @@ int vas_copy_crb(void *crb, int offset)
>  EXPORT_SYMBOL_GPL(vas_copy_crb);
>  
>  #define RMA_LSMP_REPORT_ENABLE PPC_BIT(53)
> -int vas_paste_crb(struct vas_window *txwin, int offset, bool re)
> +int vas_paste_crb(struct vas_window *vwin, int offset, bool re)
>  {
> +	struct pnv_vas_window *txwin;
>  	int rc;
>  	void *addr;
>  	uint64_t val;
>  
> +	txwin = container_of(vwin, struct pnv_vas_window, vas_win);
>  	trace_vas_paste_crb(current, txwin);
>  
>  	/*
> @@ -1128,7 +1132,7 @@ int vas_paste_crb(struct vas_window *txwin, int offset, bool re)
>  	else
>  		rc = -EINVAL;
>  
> -	pr_debug("Txwin #%d: Msg count %llu\n", txwin->winid,
> +	pr_debug("Txwin #%d: Msg count %llu\n", txwin->vas_win.winid,
>  			read_hvwc_reg(txwin, VREG(LRFIFO_PUSH)));
>  
>  	return rc;
> @@ -1148,7 +1152,7 @@ EXPORT_SYMBOL_GPL(vas_paste_crb);
>   *	user space. (NX-842 driver waits for CSB and Fast thread-wakeup
>   *	doesn't use credit checking).
>   */
> -static void poll_window_credits(struct vas_window *window)
> +static void poll_window_credits(struct pnv_vas_window *window)
>  {
>  	u64 val;
>  	int creds, mode;
> @@ -1178,7 +1182,7 @@ static void poll_window_credits(struct vas_window *window)
>  	 *       and issue CRB Kill to stop all pending requests. Need only
>  	 *       if there is a bug in NX or fault handling in kernel.
>  	 */
> -	if (creds < window->wcreds_max) {
> +	if (creds < window->vas_win.wcreds_max) {
>  		val = 0;
>  		set_current_state(TASK_UNINTERRUPTIBLE);
>  		schedule_timeout(msecs_to_jiffies(10));
> @@ -1189,7 +1193,8 @@ static void poll_window_credits(struct vas_window *window)
>  		 */
>  		if (!(count % 1000))
>  			pr_warn_ratelimited("VAS: pid %d stuck. Waiting for credits returned for Window(%d). creds %d, Retries %d\n",
> -				vas_window_pid(window), window->winid,
> +				vas_window_pid(&window->vas_win),
> +				window->vas_win.winid,
>  				creds, count);
>  
>  		goto retry;
> @@ -1201,7 +1206,7 @@ static void poll_window_credits(struct vas_window *window)
>   * short time to queue a CRB, so window should not be busy for too long.
>   * Trying 5ms intervals.
>   */
> -static void poll_window_busy_state(struct vas_window *window)
> +static void poll_window_busy_state(struct pnv_vas_window *window)
>  {
>  	int busy;
>  	u64 val;
> @@ -1221,7 +1226,8 @@ static void poll_window_busy_state(struct vas_window *window)
>  		 */
>  		if (!(count % 1000))
>  			pr_warn_ratelimited("VAS: pid %d stuck. Window (ID=%d) is in busy state. Retries %d\n",
> -				vas_window_pid(window), window->winid, count);
> +				vas_window_pid(&window->vas_win),
> +				window->vas_win.winid, count);
>  
>  		goto retry;
>  	}
> @@ -1243,7 +1249,7 @@ static void poll_window_busy_state(struct vas_window *window)
>   *	casting out becomes necessary we should consider offloading the
>   *	job to a worker thread, so the window close can proceed quickly.
>   */
> -static void poll_window_castout(struct vas_window *window)
> +static void poll_window_castout(struct pnv_vas_window *window)
>  {
>  	/* stub for now */
>  }
> @@ -1252,7 +1258,7 @@ static void poll_window_castout(struct vas_window *window)
>   * Unpin and close a window so no new requests are accepted and the
>   * hardware can evict this window from cache if necessary.
>   */
> -static void unpin_close_window(struct vas_window *window)
> +static void unpin_close_window(struct pnv_vas_window *window)
>  {
>  	u64 val;
>  
> @@ -1274,11 +1280,15 @@ static void unpin_close_window(struct vas_window *window)
>   *
>   * Besides the hardware, kernel has some bookkeeping of course.
>   */
> -int vas_win_close(struct vas_window *window)
> +int vas_win_close(struct vas_window *vwin)
>  {
> -	if (!window)
> +	struct pnv_vas_window *window;
> +
> +	if (!vwin)
>  		return 0;
>  
> +	window = container_of(vwin, struct pnv_vas_window, vas_win);
> +
>  	if (!window->tx_win && atomic_read(&window->num_txwins) != 0) {
>  		pr_devel("Attempting to close an active Rx window!\n");
>  		WARN_ON_ONCE(1);
> @@ -1300,8 +1310,8 @@ int vas_win_close(struct vas_window *window)
>  	/* if send window, drop reference to matching receive window */
>  	if (window->tx_win) {
>  		if (window->user_win) {
> -			put_vas_user_win_ref(&window->task_ref);
> -			mm_context_remove_vas_window(window->task_ref.mm);
> +			put_vas_user_win_ref(&vwin->task_ref);
> +			mm_context_remove_vas_window(vwin->task_ref.mm);
>  		}
>  		put_rx_win(window->rxwin);
>  	}
> @@ -1334,7 +1344,7 @@ EXPORT_SYMBOL_GPL(vas_win_close);
>   * - The kernel with return credit on fault window after reading entry
>   *   from fault FIFO.
>   */
> -void vas_return_credit(struct vas_window *window, bool tx)
> +void vas_return_credit(struct pnv_vas_window *window, bool tx)
>  {
>  	uint64_t val;
>  
> @@ -1348,10 +1358,10 @@ void vas_return_credit(struct vas_window *window, bool tx)
>  	}
>  }
>  
> -struct vas_window *vas_pswid_to_window(struct vas_instance *vinst,
> +struct pnv_vas_window *vas_pswid_to_window(struct vas_instance *vinst,
>  		uint32_t pswid)
>  {
> -	struct vas_window *window;
> +	struct pnv_vas_window *window;
>  	int winid;
>  
>  	if (!pswid) {
> @@ -1388,11 +1398,11 @@ struct vas_window *vas_pswid_to_window(struct vas_instance *vinst,
>  	 * by NX).
>  	 */
>  	if (!window->tx_win || !window->user_win || !window->nx_win ||
> -			window->cop == VAS_COP_TYPE_FAULT ||
> -			window->cop == VAS_COP_TYPE_FTW) {
> +			window->vas_win.cop == VAS_COP_TYPE_FAULT ||
> +			window->vas_win.cop == VAS_COP_TYPE_FTW) {
>  		pr_err("PSWID decode: id %d, tx %d, user %d, nx %d, cop %d\n",
>  			winid, window->tx_win, window->user_win,
> -			window->nx_win, window->cop);
> +			window->nx_win, window->vas_win.cop);
>  		WARN_ON(1);
>  	}
>  
> @@ -1418,10 +1428,12 @@ static struct vas_window *vas_user_win_open(struct vas_tx_win_open_attr *uattr,
>  	return vas_tx_win_open(uattr->vas_id, cop_type, &txattr);
>  }
>  
> -static u64 vas_user_win_paste_addr(struct vas_window *win)
> +static u64 vas_user_win_paste_addr(struct vas_window *txwin)
>  {
> +	struct pnv_vas_window *win;
>  	u64 paste_addr;
>  
> +	win = container_of(txwin, struct pnv_vas_window, vas_win);
>  	vas_win_paste_addr(win, &paste_addr, NULL);
>  
>  	return paste_addr;
> @@ -1429,7 +1441,6 @@ static u64 vas_user_win_paste_addr(struct vas_window *win)
>  
>  static int vas_user_win_close(struct vas_window *txwin)
>  {
> -
>  	vas_win_close(txwin);
>  
>  	return 0;
> diff --git a/arch/powerpc/platforms/powernv/vas.h b/arch/powerpc/platforms/powernv/vas.h
> index f354dd5c51bd..e7f54007dbd0 100644
> --- a/arch/powerpc/platforms/powernv/vas.h
> +++ b/arch/powerpc/platforms/powernv/vas.h
> @@ -334,11 +334,11 @@ struct vas_instance {
>  	int fifo_in_progress;	/* To wake up thread or return IRQ_HANDLED */
>  	spinlock_t fault_lock;	/* Protects fifo_in_progress update */
>  	void *fault_fifo;
> -	struct vas_window *fault_win; /* Fault window */
> +	struct pnv_vas_window *fault_win; /* Fault window */
>  
>  	struct mutex mutex;
> -	struct vas_window *rxwin[VAS_COP_TYPE_MAX];
> -	struct vas_window *windows[VAS_WINDOWS_PER_CHIP];
> +	struct pnv_vas_window *rxwin[VAS_COP_TYPE_MAX];
> +	struct pnv_vas_window *windows[VAS_WINDOWS_PER_CHIP];
>  
>  	char *name;
>  	char *dbgname;
> @@ -346,30 +346,24 @@ struct vas_instance {
>  };
>  
>  /*
> - * In-kernel state a VAS window. One per window.
> + * In-kernel state a VAS window on PowerNV. One per window.
>   */
> -struct vas_window {
> +struct pnv_vas_window {
> +	struct vas_window vas_win;
>  	/* Fields common to send and receive windows */
>  	struct vas_instance *vinst;
> -	int winid;
>  	bool tx_win;		/* True if send window */
>  	bool nx_win;		/* True if NX window */
>  	bool user_win;		/* True if user space window */
>  	void *hvwc_map;		/* HV window context */
>  	void *uwc_map;		/* OS/User window context */
> -	int wcreds_max;		/* Window credits */
> -
> -	struct vas_user_win_ref task_ref;
> -	char *dbgname;
> -	struct dentry *dbgdir;
>  
>  	/* Fields applicable only to send windows */
>  	void *paste_kaddr;
>  	char *paste_addr_name;
> -	struct vas_window *rxwin;
> +	struct pnv_vas_window *rxwin;
>  
> -	/* Feilds applicable only to receive windows */
> -	enum vas_cop_type cop;
> +	/* Fields applicable only to receive windows */
>  	atomic_t num_txwins;
>  };
>  
> @@ -428,15 +422,15 @@ extern struct mutex vas_mutex;
>  extern struct vas_instance *find_vas_instance(int vasid);
>  extern void vas_init_dbgdir(void);
>  extern void vas_instance_init_dbgdir(struct vas_instance *vinst);
> -extern void vas_window_init_dbgdir(struct vas_window *win);
> -extern void vas_window_free_dbgdir(struct vas_window *win);
> +extern void vas_window_init_dbgdir(struct pnv_vas_window *win);
> +extern void vas_window_free_dbgdir(struct pnv_vas_window *win);
>  extern int vas_setup_fault_window(struct vas_instance *vinst);
>  extern irqreturn_t vas_fault_thread_fn(int irq, void *data);
>  extern irqreturn_t vas_fault_handler(int irq, void *dev_id);
> -extern void vas_return_credit(struct vas_window *window, bool tx);
> -extern struct vas_window *vas_pswid_to_window(struct vas_instance *vinst,
> +extern void vas_return_credit(struct pnv_vas_window *window, bool tx);
> +extern struct pnv_vas_window *vas_pswid_to_window(struct vas_instance *vinst,
>  						uint32_t pswid);
> -extern void vas_win_paste_addr(struct vas_window *window, u64 *addr,
> +extern void vas_win_paste_addr(struct pnv_vas_window *window, u64 *addr,
>  					int *len);
>  
>  static inline int vas_window_pid(struct vas_window *window)
> @@ -444,16 +438,16 @@ static inline int vas_window_pid(struct vas_window *window)
>  	return pid_vnr(window->task_ref.pid);
>  }
>  
> -static inline void vas_log_write(struct vas_window *win, char *name,
> +static inline void vas_log_write(struct pnv_vas_window *win, char *name,
>  			void *regptr, u64 val)
>  {
>  	if (val)
>  		pr_debug("%swin #%d: %s reg %p, val 0x%016llx\n",
> -				win->tx_win ? "Tx" : "Rx", win->winid, name,
> -				regptr, val);
> +				win->tx_win ? "Tx" : "Rx", win->vas_win.winid,
> +				name, regptr, val);
>  }
>  
> -static inline void write_uwc_reg(struct vas_window *win, char *name,
> +static inline void write_uwc_reg(struct pnv_vas_window *win, char *name,
>  			s32 reg, u64 val)
>  {
>  	void *regptr;
> @@ -464,7 +458,7 @@ static inline void write_uwc_reg(struct vas_window *win, char *name,
>  	out_be64(regptr, val);
>  }
>  
> -static inline void write_hvwc_reg(struct vas_window *win, char *name,
> +static inline void write_hvwc_reg(struct pnv_vas_window *win, char *name,
>  			s32 reg, u64 val)
>  {
>  	void *regptr;
> @@ -475,7 +469,7 @@ static inline void write_hvwc_reg(struct vas_window *win, char *name,
>  	out_be64(regptr, val);
>  }
>  
> -static inline u64 read_hvwc_reg(struct vas_window *win,
> +static inline u64 read_hvwc_reg(struct pnv_vas_window *win,
>  			char *name __maybe_unused, s32 reg)
>  {
>  	return in_be64(win->hvwc_map+reg);
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 08/17] powerpc/pseries/vas: Define VAS/NXGZIP hcalls and structs
From: Nicholas Piggin @ 2021-06-14  2:32 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <2355e34fbffcf0a5e846bd63d9d7bef89100ddfe.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 8:59 pm:
> 
> This patch adds hcalls and other definitions. Also define structs
> that are used in VAS implementation on PowerVM.
> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>

I haven't got the specs to verify it against, but previous comments on 
naming etc are resolved or withdrawn.

Acked-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Nick


> ---
>  arch/powerpc/include/asm/hvcall.h    |   7 ++
>  arch/powerpc/include/asm/vas.h       |  30 +++++++
>  arch/powerpc/platforms/pseries/vas.h | 125 +++++++++++++++++++++++++++
>  3 files changed, 162 insertions(+)
>  create mode 100644 arch/powerpc/platforms/pseries/vas.h
> 
> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
> index e3b29eda8074..7c3418d1b5e9 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -294,6 +294,13 @@
>  #define H_RESIZE_HPT_COMMIT	0x370
>  #define H_REGISTER_PROC_TBL	0x37C
>  #define H_SIGNAL_SYS_RESET	0x380
> +#define H_ALLOCATE_VAS_WINDOW	0x388
> +#define H_MODIFY_VAS_WINDOW	0x38C
> +#define H_DEALLOCATE_VAS_WINDOW	0x390
> +#define H_QUERY_VAS_WINDOW	0x394
> +#define H_QUERY_VAS_CAPABILITIES	0x398
> +#define H_QUERY_NX_CAPABILITIES	0x39C
> +#define H_GET_NX_FAULT		0x3A0
>  #define H_INT_GET_SOURCE_INFO   0x3A8
>  #define H_INT_SET_SOURCE_CONFIG 0x3AC
>  #define H_INT_GET_SOURCE_CONFIG 0x3B0
> diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
> index 1acf2b18c2d1..eefc758d8cd4 100644
> --- a/arch/powerpc/include/asm/vas.h
> +++ b/arch/powerpc/include/asm/vas.h
> @@ -160,6 +160,7 @@ struct vas_tx_win_attr {
>  	bool rx_win_ord_mode;
>  };
>  
> +#ifdef CONFIG_PPC_POWERNV
>  /*
>   * Helper to map a chip id to VAS id.
>   * For POWER9, this is a 1:1 mapping. In the future this maybe a 1:N
> @@ -225,6 +226,35 @@ int vas_paste_crb(struct vas_window *win, int offset, bool re);
>  int vas_register_api_powernv(struct module *mod, enum vas_cop_type cop_type,
>  			     const char *name);
>  void vas_unregister_api_powernv(void);
> +#endif
> +
> +#ifdef CONFIG_PPC_PSERIES
> +
> +/* VAS Capabilities */
> +#define VAS_GZIP_QOS_FEAT	0x1
> +#define VAS_GZIP_DEF_FEAT	0x2
> +#define VAS_GZIP_QOS_FEAT_BIT	PPC_BIT(VAS_GZIP_QOS_FEAT) /* Bit 1 */
> +#define VAS_GZIP_DEF_FEAT_BIT	PPC_BIT(VAS_GZIP_DEF_FEAT) /* Bit 2 */
> +
> +/* NX Capabilities */
> +#define VAS_NX_GZIP_FEAT	0x1
> +#define VAS_NX_GZIP_FEAT_BIT	PPC_BIT(VAS_NX_GZIP_FEAT) /* Bit 1 */
> +
> +/*
> + * These structs are used to retrieve overall VAS capabilities that
> + * the hypervisor provides.
> + */
> +struct hv_vas_all_caps {
> +	__be64  descriptor;
> +	__be64  feat_type;
> +} __packed __aligned(0x1000);
> +
> +struct vas_all_caps {
> +	u64     descriptor;
> +	u64     feat_type;
> +};
> +
> +#endif
>  
>  /*
>   * Register / unregister coprocessor type to VAS API which will be exported
> diff --git a/arch/powerpc/platforms/pseries/vas.h b/arch/powerpc/platforms/pseries/vas.h
> new file mode 100644
> index 000000000000..3a798102de93
> --- /dev/null
> +++ b/arch/powerpc/platforms/pseries/vas.h
> @@ -0,0 +1,125 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright 2020-21 IBM Corp.
> + */
> +
> +#ifndef _VAS_H
> +#define _VAS_H
> +#include <asm/vas.h>
> +#include <linux/mutex.h>
> +#include <linux/stringify.h>
> +
> +/*
> + * VAS window modify flags
> + */
> +#define VAS_MOD_WIN_CLOSE	PPC_BIT(0)
> +#define VAS_MOD_WIN_JOBS_KILL	PPC_BIT(1)
> +#define VAS_MOD_WIN_DR		PPC_BIT(3)
> +#define VAS_MOD_WIN_PR		PPC_BIT(4)
> +#define VAS_MOD_WIN_SF		PPC_BIT(5)
> +#define VAS_MOD_WIN_TA		PPC_BIT(6)
> +#define VAS_MOD_WIN_FLAGS	(VAS_MOD_WIN_JOBS_KILL | VAS_MOD_WIN_DR | \
> +				VAS_MOD_WIN_PR | VAS_MOD_WIN_SF)
> +
> +#define VAS_WIN_ACTIVE		0x0
> +#define VAS_WIN_CLOSED		0x1
> +#define VAS_WIN_INACTIVE	0x2	/* Inactive due to HW failure */
> +/* Process of being modified, deallocated, or quiesced */
> +#define VAS_WIN_MOD_IN_PROCESS	0x3
> +
> +#define VAS_COPY_PASTE_USER_MODE	0x00000001
> +#define VAS_COP_OP_USER_MODE		0x00000010
> +
> +/*
> + * Co-processor feature - GZIP QoS windows or GZIP default windows
> + */
> +enum vas_cop_feat_type {
> +	VAS_GZIP_QOS_FEAT_TYPE,
> +	VAS_GZIP_DEF_FEAT_TYPE,
> +	VAS_MAX_FEAT_TYPE,
> +};
> +
> +/*
> + * Use to get feature specific capabilities from the
> + * hypervisor.
> + */
> +struct hv_vas_ct_caps {
> +	__be64	descriptor;
> +	u8	win_type;		/* Default or QoS type */
> +	u8	user_mode;
> +	__be16	max_lpar_creds;
> +	__be16	max_win_creds;
> +	union {
> +		__be16	reserved;
> +		__be16	def_lpar_creds; /* Used for default capabilities */
> +	};
> +	__be16	target_lpar_creds;
> +} __packed __aligned(0x1000);
> +
> +/*
> + * Feature specific (QoS or default) capabilities.
> + */
> +struct vas_ct_caps {
> +	u64		descriptor;
> +	u8		win_type;	/* Default or QoS type */
> +	u8		user_mode;	/* User mode copy/paste or COP HCALL */
> +	u16		max_lpar_creds;	/* Max credits available in LPAR */
> +	/* Max credits can be assigned per window */
> +	u16		max_win_creds;
> +	union {
> +		u16	reserved;	/* Used for QoS credit type */
> +		u16	def_lpar_creds; /* Used for default credit type */
> +	};
> +	/* Total LPAR available credits. Can be different from max LPAR */
> +	/* credits due to DLPAR operation */
> +	atomic_t	target_lpar_creds;
> +	atomic_t	used_lpar_creds; /* Used credits so far */
> +	u16		avail_lpar_creds; /* Remaining available credits */
> +};
> +
> +/*
> + * Feature (QoS or Default) specific to store capabilities and
> + * the list of open windows.
> + */
> +struct vas_caps {
> +	struct vas_ct_caps caps;
> +	struct list_head list;	/* List of open windows */
> +};
> +
> +/*
> + * To get window information from the hypervisor.
> + */
> +struct hv_vas_win_lpar {
> +	__be16	version;
> +	u8	win_type;
> +	u8	status;
> +	__be16	credits;	/* No of credits assigned to this window */
> +	__be16	reserved;
> +	__be32	pid;		/* LPAR Process ID */
> +	__be32	tid;		/* LPAR Thread ID */
> +	__be64	win_addr;	/* Paste address */
> +	__be32	interrupt;	/* Interrupt when NX request completes */
> +	__be32	fault;		/* Interrupt when NX sees fault */
> +	/* Associativity Domain Identifiers as returned in */
> +	/* H_HOME_NODE_ASSOCIATIVITY */
> +	__be64	domain[6];
> +	__be64	win_util;	/* Number of bytes processed */
> +} __packed __aligned(0x1000);
> +
> +struct pseries_vas_window {
> +	struct vas_window vas_win;
> +	u64 win_addr;		/* Physical paste address */
> +	u8 win_type;		/* QoS or Default window */
> +	u32 complete_irq;	/* Completion interrupt */
> +	u32 fault_irq;		/* Fault interrupt */
> +	u64 domain[6];		/* Associativity domain Ids */
> +				/* this window is allocated */
> +	u64 util;
> +
> +	/* List of windows opened which is used for LPM */
> +	struct list_head win_list;
> +	u64 flags;
> +	char *name;
> +	int fault_virq;
> +};
> +#endif /* _VAS_H */
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 09/17] powerpc/vas: Define QoS credit flag to allocate window
From: Nicholas Piggin @ 2021-06-14  2:32 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <64c8e95b25f58c5e05c98765dab2bc8eb9b1483d.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 9:00 pm:
> 
> PowerVM introduces two different type of credits: Default and Quality
> of service (QoS).
> 
> The total number of default credits available on each LPAR depends
> on CPU resources configured. But these credits can be shared or
> over-committed across LPARs in shared mode which can result in
> paste command failure (RMA_busy). To avoid NX HW contention, the
> hypervisor ntroduces QoS credit type which makes sure guaranteed
> access to NX esources. The system admins can assign QoS credits
> or each LPAR via HMC.
> 
> Default credit type is used to allocate a VAS window by default as
> on PowerVM implementation. But the process can pass
> VAS_TX_WIN_FLAG_QOS_CREDIT flag with VAS_TX_WIN_OPEN ioctl to open
> QoS type window.
> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>

Flag name looks good now. Again I don't have the spec, so

Acked-by: Nicholas Piggin <npiggin@gmail.com>

Thanks,
Nick

> ---
>  arch/powerpc/include/uapi/asm/vas-api.h | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/uapi/asm/vas-api.h b/arch/powerpc/include/uapi/asm/vas-api.h
> index ebd4b2424785..7c81301ecdba 100644
> --- a/arch/powerpc/include/uapi/asm/vas-api.h
> +++ b/arch/powerpc/include/uapi/asm/vas-api.h
> @@ -13,11 +13,15 @@
>  #define VAS_MAGIC	'v'
>  #define VAS_TX_WIN_OPEN	_IOW(VAS_MAGIC, 0x20, struct vas_tx_win_open_attr)
>  
> +/* Flags to VAS TX open window ioctl */
> +/* To allocate a window with QoS credit, otherwise use default credit */
> +#define VAS_TX_WIN_FLAG_QOS_CREDIT	0x0000000000000001
> +
>  struct vas_tx_win_open_attr {
>  	__u32	version;
>  	__s16	vas_id;	/* specific instance of vas or -1 for default */
>  	__u16	reserved1;
> -	__u64	flags;	/* Future use */
> +	__u64	flags;
>  	__u64	reserved2[6];
>  };
>  
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 11/17] powerpc/pseries/vas: Implement getting capabilities from hypervisor
From: Nicholas Piggin @ 2021-06-14  2:35 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <4ca037732838675bd8e12de702687971e0ab2b3d.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 9:01 pm:
> 
> The hypervisor provides VAS capabilities for GZIP default and QoS
> features. These capabilities gives information for the specific
> features such as total number of credits available in LPAR,
> maximum credits allowed per window, maximum credits allowed in
> LPAR, whether usermode copy/paste is supported, and etc.
> 
> This patch adds the following:
> - Retrieve all features that are provided by hypervisor using
>   H_QUERY_VAS_CAPABILITIES hcall with 0 as feature type.
> - Retrieve capabilities for the specific feature using the same
>   hcall and the feature type (1 for QoS and 2 for default type).
> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/vas.c | 124 +++++++++++++++++++++++++++
>  1 file changed, 124 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
> index fec280979d50..98109a13f1c2 100644
> --- a/arch/powerpc/platforms/pseries/vas.c
> +++ b/arch/powerpc/platforms/pseries/vas.c
> @@ -10,6 +10,7 @@
>  #include <linux/export.h>
>  #include <linux/types.h>
>  #include <linux/delay.h>
> +#include <asm/machdep.h>
>  #include <asm/hvcall.h>
>  #include <asm/plpar_wrappers.h>
>  #include <asm/vas.h>
> @@ -20,6 +21,13 @@
>  /* phyp allows one credit per window right now */
>  #define DEF_WIN_CREDS		1
>  
> +static struct vas_all_caps caps_all;
> +static bool copypaste_feat;
> +
> +static struct vas_caps vascaps[VAS_MAX_FEAT_TYPE];

If these will all be read by later patches I think that's okay, 
however:

> +
> +static DEFINE_MUTEX(vas_pseries_mutex);

This is not used at all. It should be added in the patch that uses it.

Thanks,
Nick

> +
>  static long hcall_return_busy_check(long rc)
>  {
>  	/* Check if we are stalled for some time */
> @@ -179,3 +187,119 @@ int h_query_vas_capabilities(const u64 hcall, u8 query_type, u64 result)
>  		return -EIO;
>  	}
>  }
> +
> +/*
> + * Get the specific capabilities based on the feature type.
> + * Right now supports GZIP default and GZIP QoS capabilities.
> + */
> +static int get_vas_capabilities(u8 feat, enum vas_cop_feat_type type,
> +				struct hv_vas_ct_caps *hv_caps)
> +{
> +	struct vas_ct_caps *caps;
> +	struct vas_caps *vcaps;
> +	int rc = 0;
> +
> +	vcaps = &vascaps[type];
> +	memset(vcaps, 0, sizeof(*vcaps));
> +	INIT_LIST_HEAD(&vcaps->list);
> +
> +	caps = &vcaps->caps;
> +
> +	rc = h_query_vas_capabilities(H_QUERY_VAS_CAPABILITIES, feat,
> +					  (u64)virt_to_phys(hv_caps));
> +	if (rc)
> +		return rc;
> +
> +	caps->user_mode = hv_caps->user_mode;
> +	if (!(caps->user_mode & VAS_COPY_PASTE_USER_MODE)) {
> +		pr_err("User space COPY/PASTE is not supported\n");
> +		return -ENOTSUPP;
> +	}
> +
> +	caps->descriptor = be64_to_cpu(hv_caps->descriptor);
> +	caps->win_type = hv_caps->win_type;
> +	if (caps->win_type >= VAS_MAX_FEAT_TYPE) {
> +		pr_err("Unsupported window type %u\n", caps->win_type);
> +		return -EINVAL;
> +	}
> +	caps->max_lpar_creds = be16_to_cpu(hv_caps->max_lpar_creds);
> +	caps->max_win_creds = be16_to_cpu(hv_caps->max_win_creds);
> +	atomic_set(&caps->target_lpar_creds,
> +		   be16_to_cpu(hv_caps->target_lpar_creds));
> +	if (feat == VAS_GZIP_DEF_FEAT) {
> +		caps->def_lpar_creds = be16_to_cpu(hv_caps->def_lpar_creds);
> +
> +		if (caps->max_win_creds < DEF_WIN_CREDS) {
> +			pr_err("Window creds(%u) > max allowed window creds(%u)\n",
> +			       DEF_WIN_CREDS, caps->max_win_creds);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	copypaste_feat = true;
> +
> +	return 0;
> +}
> +
> +static int __init pseries_vas_init(void)
> +{
> +	struct hv_vas_ct_caps *hv_ct_caps;
> +	struct hv_vas_all_caps *hv_caps;
> +	int rc;
> +
> +	/*
> +	 * Linux supports user space COPY/PASTE only with Radix
> +	 */
> +	if (!radix_enabled()) {
> +		pr_err("API is supported only with radix page tables\n");
> +		return -ENOTSUPP;
> +	}
> +
> +	hv_caps = kmalloc(sizeof(*hv_caps), GFP_KERNEL);
> +	if (!hv_caps)
> +		return -ENOMEM;
> +	/*
> +	 * Get VAS overall capabilities by passing 0 to feature type.
> +	 */
> +	rc = h_query_vas_capabilities(H_QUERY_VAS_CAPABILITIES, 0,
> +					  (u64)virt_to_phys(hv_caps));
> +	if (rc)
> +		goto out;
> +
> +	caps_all.descriptor = be64_to_cpu(hv_caps->descriptor);
> +	caps_all.feat_type = be64_to_cpu(hv_caps->feat_type);
> +
> +	hv_ct_caps = kmalloc(sizeof(*hv_ct_caps), GFP_KERNEL);
> +	if (!hv_ct_caps) {
> +		rc = -ENOMEM;
> +		goto out;
> +	}
> +	/*
> +	 * QOS capabilities available
> +	 */
> +	if (caps_all.feat_type & VAS_GZIP_QOS_FEAT_BIT) {
> +		rc = get_vas_capabilities(VAS_GZIP_QOS_FEAT,
> +					  VAS_GZIP_QOS_FEAT_TYPE, hv_ct_caps);
> +
> +		if (rc)
> +			goto out_ct;
> +	}
> +	/*
> +	 * Default capabilities available
> +	 */
> +	if (caps_all.feat_type & VAS_GZIP_DEF_FEAT_BIT) {
> +		rc = get_vas_capabilities(VAS_GZIP_DEF_FEAT,
> +					  VAS_GZIP_DEF_FEAT_TYPE, hv_ct_caps);
> +		if (rc)
> +			goto out_ct;
> +	}
> +
> +	pr_info("GZIP feature is available\n");
> +
> +out_ct:
> +	kfree(hv_ct_caps);
> +out:
> +	kfree(hv_caps);
> +	return rc;
> +}
> +machine_device_initcall(pseries, pseries_vas_init);
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 12/17] powerpc/pseries/vas: Integrate API with open/close windows
From: Nicholas Piggin @ 2021-06-14  2:55 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <58c2f9debeff2ff6515ea950ebdd6483c147c843.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 9:02 pm:
> 
> This patch adds VAS window allocatioa/close with the corresponding
> hcalls. Also changes to integrate with the existing user space VAS
> API and provide register/unregister functions to NX pseries driver.
> 
> The driver register function is used to create the user space
> interface (/dev/crypto/nx-gzip) and unregister to remove this entry.
> 
> The user space process opens this device node and makes an ioctl
> to allocate VAS window. The close interface is used to deallocate
> window.
> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/vas.h          |   4 +
>  arch/powerpc/platforms/pseries/Makefile |   1 +
>  arch/powerpc/platforms/pseries/vas.c    | 223 ++++++++++++++++++++++++
>  3 files changed, 228 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
> index eefc758d8cd4..9d5646d721c4 100644
> --- a/arch/powerpc/include/asm/vas.h
> +++ b/arch/powerpc/include/asm/vas.h
> @@ -254,6 +254,10 @@ struct vas_all_caps {
>  	u64     feat_type;
>  };
>  
> +int h_query_vas_capabilities(const u64 hcall, u8 query_type, u64 result);
> +int vas_register_api_pseries(struct module *mod,
> +			     enum vas_cop_type cop_type, const char *name);
> +void vas_unregister_api_pseries(void);
>  #endif
>  
>  /*
> diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile
> index c8a2b0b05ac0..4cda0ef87be0 100644
> --- a/arch/powerpc/platforms/pseries/Makefile
> +++ b/arch/powerpc/platforms/pseries/Makefile
> @@ -30,3 +30,4 @@ obj-$(CONFIG_PPC_SVM)		+= svm.o
>  obj-$(CONFIG_FA_DUMP)		+= rtas-fadump.o
>  
>  obj-$(CONFIG_SUSPEND)		+= suspend.o
> +obj-$(CONFIG_PPC_VAS)		+= vas.o
> diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
> index 98109a13f1c2..fe375f7a7029 100644
> --- a/arch/powerpc/platforms/pseries/vas.c
> +++ b/arch/powerpc/platforms/pseries/vas.c
> @@ -10,6 +10,7 @@
>  #include <linux/export.h>
>  #include <linux/types.h>
>  #include <linux/delay.h>
> +#include <linux/slab.h>
>  #include <asm/machdep.h>
>  #include <asm/hvcall.h>
>  #include <asm/plpar_wrappers.h>
> @@ -187,6 +188,228 @@ int h_query_vas_capabilities(const u64 hcall, u8 query_type, u64 result)
>  		return -EIO;
>  	}
>  }
> +EXPORT_SYMBOL_GPL(h_query_vas_capabilities);
> +
> +/*
> + * Allocate window and setup IRQ mapping.
> + */
> +static int allocate_setup_window(struct pseries_vas_window *txwin,
> +				 u64 *domain, u8 wintype)
> +{
> +	int rc;
> +
> +	rc = h_allocate_vas_window(txwin, domain, wintype, DEF_WIN_CREDS);
> +	if (rc)
> +		return rc;
> +
> +	txwin->vas_win.wcreds_max = DEF_WIN_CREDS;
> +
> +	return 0;
> +}
> +
> +static struct vas_window *vas_allocate_window(struct vas_tx_win_open_attr *uattr,
> +					      enum vas_cop_type cop_type)
> +{
> +	long domain[PLPAR_HCALL9_BUFSIZE] = {VAS_DEFAULT_DOMAIN_ID};
> +	struct vas_ct_caps *ct_caps;
> +	struct vas_caps *caps;
> +	struct pseries_vas_window *txwin;
> +	int rc;
> +
> +	txwin = kzalloc(sizeof(*txwin), GFP_KERNEL);
> +	if (!txwin)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/*
> +	 * A VAS window can have many credits which means that many
> +	 * requests can be issued simultaneously. But phyp restricts
> +	 * one credit per window.
> +	 * phyp introduces 2 different types of credits:
> +	 * Default credit type (Uses normal priority FIFO):
> +	 *	A limited number of credits are assigned to partitions
> +	 *	based on processor entitlement. But these credits may be
> +	 *	over-committed on a system depends on whether the CPUs
> +	 *	are in shared or dedicated modes - that is, more requests
> +	 *	may be issued across the system than NX can service at
> +	 *	once which can result in paste command failure (RMA_busy).
> +	 *	Then the process has to resend requests or fall-back to
> +	 *	SW compression.
> +	 * Quality of Service (QoS) credit type (Uses high priority FIFO):
> +	 *	To avoid NX HW contention, the system admins can assign
> +	 *	QoS credits for each LPAR so that this partition is
> +	 *	guaranteed access to NX resources. These credits are
> +	 *	assigned to partitions via the HMC.
> +	 *	Refer PAPR for more information.
> +	 *
> +	 * Allocate window with QoS credits if user requested. Otherwise
> +	 * default credits are used.
> +	 */
> +	if (uattr->flags & VAS_TX_WIN_FLAG_QOS_CREDIT)
> +		caps = &vascaps[VAS_GZIP_QOS_FEAT_TYPE];
> +	else
> +		caps = &vascaps[VAS_GZIP_DEF_FEAT_TYPE];
> +
> +	ct_caps = &caps->caps;
> +
> +	if (atomic_inc_return(&ct_caps->used_lpar_creds) >
> +			atomic_read(&ct_caps->target_lpar_creds)) {
> +		pr_err("Credits are not available to allocate window\n");
> +		rc = -EINVAL;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The user space is requesting to allocate a window on a VAS
> +	 * instance (or chip) where the process is executing.
> +	 * On powerVM, domain values are passed to pHyp to select chip /
> +	 * VAS instance. Useful if the process is affinity to NUMA node.
> +	 * pHyp selects VAS instance if VAS_DEFAULT_DOMAIN_ID (-1) is
> +	 * passed for domain values.
> +	 */

s/powerVM/PowerVM
s/pHyp/PowerVM

You could also just call it the hypervisor. KVM may not implement the 
hcalls now, but in future it could.

> +	if (uattr->vas_id == -1) {

Should the above comment fit under here? vas_id == -1 means userspace 
asks for any VAS but preferably a local one?

> +		/*
> +		 * To allocate VAS window, pass same domain values returned
> +		 * from this HCALL.
> +		 */

Then you could merge it with this comment and make it a bit clearer:
the h_allocate_vas_window hcall is defined to take a domain as
specified by h_home_node_associativity, so no conversions or unpacking
needs to be done.

> +		rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, domain,
> +				  VPHN_FLAG_VCPU, smp_processor_id());
> +		if (rc != H_SUCCESS) {
> +			pr_err("HCALL(%x): failed with ret(%d)\n",
> +			       H_HOME_NODE_ASSOCIATIVITY, rc);
> +			goto out;
> +		}
> +	}
> +
> +	/*
> +	 * Allocate / Deallocate window HCALLs and setup / free IRQs
> +	 * have to be protected with mutex.
> +	 * Open VAS window: Allocate window HCALL and setup IRQ
> +	 * Close VAS window: Deallocate window HCALL and free IRQ
> +	 *	The hypervisor waits until all NX requests are
> +	 *	completed before closing the window. So expects OS
> +	 *	to handle NX faults, means IRQ can be freed only
> +	 *	after the deallocate window HCALL is returned.
> +	 * So once the window is closed with deallocate HCALL before
> +	 * the IRQ is freed, it can be assigned to new allocate
> +	 * HCALL with the same fault IRQ by the hypervisor. It can
> +	 * result in setup IRQ fail for the new window since the
> +	 * same fault IRQ is not freed by the OS.
> +	 */
> +	mutex_lock(&vas_pseries_mutex);

Why? What's the mutex protecting here?

> +	rc = allocate_setup_window(txwin, (u64 *)&domain[0],
> +				   ct_caps->win_type);

If you define the types to be the same, can you avoid this casting?
allocate_setup_window specifically needs an array of 
PLPAR_HCALL9_BUFSIZE longs.

> +	mutex_unlock(&vas_pseries_mutex);
> +	if (rc)
> +		goto out;
> +
> +	/*
> +	 * Modify window and it is ready to use.
> +	 */
> +	rc = h_modify_vas_window(txwin);
> +	if (!rc)
> +		rc = get_vas_user_win_ref(&txwin->vas_win.task_ref);
> +	if (rc)
> +		goto out_free;
> +
> +	vas_user_win_add_mm_context(&txwin->vas_win.task_ref);
> +	txwin->win_type = ct_caps->win_type;
> +	mutex_lock(&vas_pseries_mutex);
> +	list_add(&txwin->win_list, &caps->list);
> +	mutex_unlock(&vas_pseries_mutex);
> +
> +	return &txwin->vas_win;
> +
> +out_free:
> +	h_deallocate_vas_window(txwin->vas_win.winid);

No mutex here in this deallocate hcall.

I suspect you don't actually need the mutex for the hcalls themselves, 
but the list manipulations. I would possibly consider putting 
used_lpar_creds under that same lock rather than making it atomic and
playing lock free games, unless you really need to.

Also... "creds". credentials? credits, right? Don't go through and 
change everything now, but not skimping on naming helps a lot with
reading code that you're not familiar with. All the vas/nx stuff
could probably do with a pass to make the names a bit easier.

(creds isn't so bad, "ct" for "coprocessor type" is pretty obscure 
though).

Thanks,
Nick

> +out:
> +	atomic_dec(&ct_caps->used_lpar_creds);
> +	kfree(txwin);
> +	return ERR_PTR(rc);
> +}
> +
> +static u64 vas_paste_address(struct vas_window *vwin)
> +{
> +	struct pseries_vas_window *win;
> +
> +	win = container_of(vwin, struct pseries_vas_window, vas_win);
> +	return win->win_addr;
> +}
> +
> +static int deallocate_free_window(struct pseries_vas_window *win)
> +{
> +	int rc = 0;
> +
> +	rc = h_deallocate_vas_window(win->vas_win.winid);
> +
> +	return rc;
> +}
> +
> +static int vas_deallocate_window(struct vas_window *vwin)
> +{
> +	struct pseries_vas_window *win;
> +	struct vas_ct_caps *caps;
> +	int rc = 0;
> +
> +	if (!vwin)
> +		return -EINVAL;
> +
> +	win = container_of(vwin, struct pseries_vas_window, vas_win);
> +
> +	/* Should not happen */
> +	if (win->win_type >= VAS_MAX_FEAT_TYPE) {
> +		pr_err("Window (%u): Invalid window type %u\n",
> +				vwin->winid, win->win_type);
> +		return -EINVAL;
> +	}
> +
> +	caps = &vascaps[win->win_type].caps;
> +	mutex_lock(&vas_pseries_mutex);
> +	rc = deallocate_free_window(win);
> +	if (rc) {
> +		mutex_unlock(&vas_pseries_mutex);
> +		return rc;
> +	}
> +
> +	list_del(&win->win_list);
> +	atomic_dec(&caps->used_lpar_creds);
> +	mutex_unlock(&vas_pseries_mutex);
> +
> +	put_vas_user_win_ref(&vwin->task_ref);
> +	mm_context_remove_vas_window(vwin->task_ref.mm);
> +
> +	kfree(win);
> +	return 0;
> +}
> +
> +static const struct vas_user_win_ops vops_pseries = {
> +	.open_win	= vas_allocate_window,	/* Open and configure window */
> +	.paste_addr	= vas_paste_address,	/* To do copy/paste */
> +	.close_win	= vas_deallocate_window, /* Close window */
> +};
> +
> +/*
> + * Supporting only nx-gzip coprocessor type now, but this API code
> + * extended to other coprocessor types later.
> + */
> +int vas_register_api_pseries(struct module *mod, enum vas_cop_type cop_type,
> +			     const char *name)
> +{
> +	int rc;
> +
> +	if (!copypaste_feat)
> +		return -ENOTSUPP;
> +
> +	rc = vas_register_coproc_api(mod, cop_type, name, &vops_pseries);
> +
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(vas_register_api_pseries);
> +
> +void vas_unregister_api_pseries(void)
> +{
> +	vas_unregister_coproc_api();
> +}
> +EXPORT_SYMBOL_GPL(vas_unregister_api_pseries);
>  
>  /*
>   * Get the specific capabilities based on the feature type.
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 13/17] powerpc/pseries/vas: Setup IRQ and fault handling
From: Nicholas Piggin @ 2021-06-14  3:07 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <8a9fe47a17d1f89cc43885d2ef8c2f09e4e90d7a.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 9:02 pm:
> 
> NX generates an interrupt when sees a fault on the user space
> buffer and the hypervisor forwards that interrupt to OS. Then
> the kernel handles the interrupt by issuing H_GET_NX_FAULT hcall
> to retrieve the fault CRB information.
> 
> This patch also adds changes to setup and free IRQ per each
> window and also handles the fault by updating the CSB.
> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> ---
>  arch/powerpc/platforms/pseries/vas.c | 108 +++++++++++++++++++++++++++
>  1 file changed, 108 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
> index fe375f7a7029..55185bdd3776 100644
> --- a/arch/powerpc/platforms/pseries/vas.c
> +++ b/arch/powerpc/platforms/pseries/vas.c
> @@ -11,6 +11,7 @@
>  #include <linux/types.h>
>  #include <linux/delay.h>
>  #include <linux/slab.h>
> +#include <linux/interrupt.h>
>  #include <asm/machdep.h>
>  #include <asm/hvcall.h>
>  #include <asm/plpar_wrappers.h>
> @@ -190,6 +191,58 @@ int h_query_vas_capabilities(const u64 hcall, u8 query_type, u64 result)
>  }
>  EXPORT_SYMBOL_GPL(h_query_vas_capabilities);
>  
> +/*
> + * hcall to get fault CRB from pHyp.
> + */
> +static int h_get_nx_fault(u32 winid, u64 buffer)
> +{
> +	long rc;
> +
> +	rc = plpar_hcall_norets(H_GET_NX_FAULT, winid, buffer);
> +
> +	switch (rc) {
> +	case H_SUCCESS:
> +		return 0;
> +	case H_PARAMETER:
> +		pr_err("HCALL(%x): Invalid window ID %u\n", H_GET_NX_FAULT,
> +		       winid);
> +		return -EINVAL;
> +	case H_PRIVILEGE:
> +		pr_err("HCALL(%x): Window(%u): Invalid fault buffer 0x%llx\n",
> +		       H_GET_NX_FAULT, winid, buffer);
> +		return -EACCES;
> +	default:
> +		pr_err("HCALL(%x): Failed with error %ld for window(%u)\n",
> +		       H_GET_NX_FAULT, rc, winid);
> +		return -EIO;

3 error messages have 3 different formats for window ID.

I agree with Michael you could just have one error message that reports 
the return value. Also "H_GET_NX_FAULT: " would be nicer than "HCALL(380): "

Check how some other hcall failures are reported, "hcall failed: 
H_CALL_NAME" seems to have a few takers.

> +	}
> +}
> +
> +/*
> + * Handle the fault interrupt.
> + * When the fault interrupt is received for each window, query pHyp to get
> + * the fault CRB on the specific fault. Then process the CRB by updating
> + * CSB or send signal if the user space CSB is invalid.
> + * Note: pHyp forwards an interrupt for each fault request. So one fault
> + *	CRB to process for each H_GET_NX_FAULT HCALL.
> + */
> +irqreturn_t pseries_vas_fault_thread_fn(int irq, void *data)
> +{
> +	struct pseries_vas_window *txwin = data;
> +	struct coprocessor_request_block crb;
> +	struct vas_user_win_ref *tsk_ref;
> +	int rc;
> +
> +	rc = h_get_nx_fault(txwin->vas_win.winid, (u64)virt_to_phys(&crb));
> +	if (!rc) {
> +		tsk_ref = &txwin->vas_win.task_ref;
> +		vas_dump_crb(&crb);
> +		vas_update_csb(&crb, tsk_ref);
> +	}
> +
> +	return IRQ_HANDLED;
> +}
> +
>  /*
>   * Allocate window and setup IRQ mapping.
>   */
> @@ -201,10 +254,51 @@ static int allocate_setup_window(struct pseries_vas_window *txwin,
>  	rc = h_allocate_vas_window(txwin, domain, wintype, DEF_WIN_CREDS);
>  	if (rc)
>  		return rc;
> +	/*
> +	 * On powerVM, pHyp setup and forwards the fault interrupt per

           The hypervisor forwards the fault interrupt per-window...

> +	 * window. So the IRQ setup and fault handling will be done for
> +	 * each open window separately.
> +	 */
> +	txwin->fault_virq = irq_create_mapping(NULL, txwin->fault_irq);
> +	if (!txwin->fault_virq) {
> +		pr_err("Failed irq mapping %d\n", txwin->fault_irq);
> +		rc = -EINVAL;
> +		goto out_win;
> +	}
> +
> +	txwin->name = kasprintf(GFP_KERNEL, "vas-win-%d",
> +				txwin->vas_win.winid);
> +	if (!txwin->name) {
> +		rc = -ENOMEM;
> +		goto out_irq;
> +	}
> +
> +	rc = request_threaded_irq(txwin->fault_virq, NULL,
> +				  pseries_vas_fault_thread_fn, IRQF_ONESHOT,
> +				  txwin->name, txwin);
> +	if (rc) {
> +		pr_err("VAS-Window[%d]: Request IRQ(%u) failed with %d\n",
> +		       txwin->vas_win.winid, txwin->fault_virq, rc);
> +		goto out_free;
> +	}
>  
>  	txwin->vas_win.wcreds_max = DEF_WIN_CREDS;
>  
>  	return 0;
> +out_free:
> +	kfree(txwin->name);
> +out_irq:
> +	irq_dispose_mapping(txwin->fault_virq);
> +out_win:
> +	h_deallocate_vas_window(txwin->vas_win.winid);
> +	return rc;
> +}
> +
> +static inline void free_irq_setup(struct pseries_vas_window *txwin)
> +{
> +	free_irq(txwin->fault_virq, txwin);
> +	irq_dispose_mapping(txwin->fault_virq);
> +	kfree(txwin->name);

Nit, but this freeing is in a different order than the error handling in 
the function does. I'd just keep it the same unless there is a reason to 
be different, in which case it could use a comment.

So long as the irq can't somehow fire and try to use txwin->name, you 
might be okay.

Otherwise I _think_ it's okay, but I don't know the irq APIs well.

Thanks,
Nick

>  }
>  
>  static struct vas_window *vas_allocate_window(struct vas_tx_win_open_attr *uattr,
> @@ -320,6 +414,11 @@ static struct vas_window *vas_allocate_window(struct vas_tx_win_open_attr *uattr
>  	return &txwin->vas_win;
>  
>  out_free:
> +	/*
> +	 * Window is not operational. Free IRQ before closing
> +	 * window so that do not have to hold mutex.
> +	 */

Why don't you have to hold the mutex in that case?

> +	free_irq_setup(txwin);
>  	h_deallocate_vas_window(txwin->vas_win.winid);
>  out:
>  	atomic_dec(&ct_caps->used_lpar_creds);
> @@ -339,7 +438,16 @@ static int deallocate_free_window(struct pseries_vas_window *win)
>  {
>  	int rc = 0;
>  
> +	/*
> +	 * Free IRQ after executing H_DEALLOCATE_VAS_WINDOW HCALL
> +	 * to close the window. pHyp waits for all requests including
> +	 * faults are processed before closing the window - Means all
> +	 * credits are returned. In the case of fault request, credit
> +	 * is returned after OS issues H_GET_NX_FAULT HCALL.
> +	 */
>  	rc = h_deallocate_vas_window(win->vas_win.winid);
> +	if (!rc)
> +		free_irq_setup(win);
>  
>  	return rc;
>  }
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 14/17] crypto/nx: Rename nx-842-pseries file name to nx-common-pseries
From: Nicholas Piggin @ 2021-06-14  3:07 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <03731a80b5af5f8cea95579215c6d2241c291b70.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 9:03 pm:
> 
> Rename nx-842-pseries.c to nx-common-pseries.c to add code for new
> GZIP compression type. The actual functionality is not changed in
> this patch.
> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Nicholas Piggin <npiggin@gmail.com>

> ---
>  drivers/crypto/nx/Makefile                                  | 2 +-
>  drivers/crypto/nx/{nx-842-pseries.c => nx-common-pseries.c} | 0
>  2 files changed, 1 insertion(+), 1 deletion(-)
>  rename drivers/crypto/nx/{nx-842-pseries.c => nx-common-pseries.c} (100%)
> 
> diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile
> index bc89a20e5d9d..d00181a26dd6 100644
> --- a/drivers/crypto/nx/Makefile
> +++ b/drivers/crypto/nx/Makefile
> @@ -14,5 +14,5 @@ nx-crypto-objs := nx.o \
>  obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES) += nx-compress-pseries.o nx-compress.o
>  obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV) += nx-compress-powernv.o nx-compress.o
>  nx-compress-objs := nx-842.o
> -nx-compress-pseries-objs := nx-842-pseries.o
> +nx-compress-pseries-objs := nx-common-pseries.o
>  nx-compress-powernv-objs := nx-common-powernv.o
> diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-common-pseries.c
> similarity index 100%
> rename from drivers/crypto/nx/nx-842-pseries.c
> rename to drivers/crypto/nx/nx-common-pseries.c
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 06/17] powerpc/vas: Move update_csb/dump_crb to common book3s platform
From: Nicholas Piggin @ 2021-06-14  3:34 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <acdf224290adc8735144302ac0698801a5e29c33.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 8:58 pm:
> 
> If a coprocessor encounters an error translating an address, the
> VAS will cause an interrupt in the host. The kernel processes
> the fault by updating CSB. This functionality is same for both
> powerNV and pseries. So this patch moves these functions to
> common vas-api.c and the actual functionality is not changed.
> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/powerpc/include/asm/vas.h             |   3 +
>  arch/powerpc/platforms/book3s/vas-api.c    | 147 +++++++++++++++++++
>  arch/powerpc/platforms/powernv/vas-fault.c | 155 ++-------------------
>  3 files changed, 159 insertions(+), 146 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
> index 163460cff59b..c1daab4cc205 100644
> --- a/arch/powerpc/include/asm/vas.h
> +++ b/arch/powerpc/include/asm/vas.h
> @@ -230,4 +230,7 @@ int vas_register_coproc_api(struct module *mod, enum vas_cop_type cop_type,
>  void vas_unregister_coproc_api(void);
>  
>  int get_vas_user_win_ref(struct vas_user_win_ref *task_ref);
> +void vas_update_csb(struct coprocessor_request_block *crb,
> +		    struct vas_user_win_ref *task_ref);
> +void vas_dump_crb(struct coprocessor_request_block *crb);
>  #endif /* __ASM_POWERPC_VAS_H */
> diff --git a/arch/powerpc/platforms/book3s/vas-api.c b/arch/powerpc/platforms/book3s/vas-api.c
> index 1d7d3273d34b..11c76c7141d2 100644
> --- a/arch/powerpc/platforms/book3s/vas-api.c
> +++ b/arch/powerpc/platforms/book3s/vas-api.c
> @@ -10,6 +10,9 @@
>  #include <linux/fs.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
> +#include <linux/kthread.h>
> +#include <linux/sched/signal.h>
> +#include <linux/mmu_context.h>
>  #include <linux/io.h>
>  #include <asm/vas.h>
>  #include <uapi/asm/vas-api.h>
> @@ -94,6 +97,150 @@ int get_vas_user_win_ref(struct vas_user_win_ref *task_ref)
>  	return 0;
>  }
>  
> +/*
> + * Update the CSB to indicate a translation error.
> + *
> + * User space will be polling on CSB after the request is issued.
> + * If NX can handle the request without any issues, it updates CSB.
> + * Whereas if NX encounters page fault, the kernel will handle the
> + * fault and update CSB with translation error.
> + *
> + * If we are unable to update the CSB means copy_to_user failed due to
> + * invalid csb_addr, send a signal to the process.
> + */
> +void vas_update_csb(struct coprocessor_request_block *crb,
> +		    struct vas_user_win_ref *task_ref)
> +{
> +	struct coprocessor_status_block csb;
> +	struct kernel_siginfo info;
> +	struct task_struct *tsk;
> +	void __user *csb_addr;
> +	struct pid *pid;
> +	int rc;
> +
> +	/*
> +	 * NX user space windows can not be opened for task->mm=NULL
> +	 * and faults will not be generated for kernel requests.
> +	 */
> +	if (WARN_ON_ONCE(!task_ref->mm))
> +		return;
> +
> +	csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
> +
> +	memset(&csb, 0, sizeof(csb));
> +	csb.cc = CSB_CC_FAULT_ADDRESS;
> +	csb.ce = CSB_CE_TERMINATION;
> +	csb.cs = 0;
> +	csb.count = 0;
> +
> +	/*
> +	 * NX operates and returns in BE format as defined CRB struct.
> +	 * So saves fault_storage_addr in BE as NX pastes in FIFO and
> +	 * expects user space to convert to CPU format.
> +	 */
> +	csb.address = crb->stamp.nx.fault_storage_addr;
> +	csb.flags = 0;
> +
> +	pid = task_ref->pid;
> +	tsk = get_pid_task(pid, PIDTYPE_PID);
> +	/*
> +	 * Process closes send window after all pending NX requests are
> +	 * completed. In multi-thread applications, a child thread can
> +	 * open a window and can exit without closing it. May be some
> +	 * requests are pending or this window can be used by other
> +	 * threads later. We should handle faults if NX encounters
> +	 * pages faults on these requests. Update CSB with translation
> +	 * error and fault address. If csb_addr passed by user space is
> +	 * invalid, send SEGV signal to pid saved in window. If the
> +	 * child thread is not running, send the signal to tgid.
> +	 * Parent thread (tgid) will close this window upon its exit.
> +	 *
> +	 * pid and mm references are taken when window is opened by
> +	 * process (pid). So tgid is used only when child thread opens
> +	 * a window and exits without closing it.
> +	 */
> +	if (!tsk) {
> +		pid = task_ref->tgid;
> +		tsk = get_pid_task(pid, PIDTYPE_PID);
> +		/*
> +		 * Parent thread (tgid) will be closing window when it
> +		 * exits. So should not get here.
> +		 */
> +		if (WARN_ON_ONCE(!tsk))
> +			return;
> +	}
> +
> +	/* Return if the task is exiting. */
> +	if (tsk->flags & PF_EXITING) {
> +		put_task_struct(tsk);
> +		return;
> +	}

Actually in the previous patch that moves the refcounting into its own 
functions, could you make a function for this stuff as well?

> +
> +	kthread_use_mm(task_ref->mm);
> +	rc = copy_to_user(csb_addr, &csb, sizeof(csb));
> +	/*
> +	 * User space polls on csb.flags (first byte). So add barrier
> +	 * then copy first byte with csb flags update.
> +	 */

So this first writes 0 to the user's csb.flags which was previously 0.

> +	if (!rc) {
> +		csb.flags = CSB_V;
> +		/* Make sure update to csb.flags is visible now */
> +		smp_mb();
> +		rc = copy_to_user(csb_addr, &csb, sizeof(u8));

And then it stores CSB_V to flags, and that must be visible before the
other stores to the user's CSB?

So that should be smp_wmb()? And memory barrier comments should not say
it makes a store visible, rather it should explain the two sets of 
memory operations it is enforcing an ordering between. And it should
give a reference to barriers in the other code that relies on this 
ordering. In this case userspace I assume is expected to enforce 
load/load ordering between finding CSB_V in csb.flags, and loading the
rest of the data there.

You could make this change at the start of the series, or if you prefer
to do more testing before relaxing smp_mb to smp_wmb then I don't mind 
if you do it afterwards and change the comment at that time.

> +	}
> +	kthread_unuse_mm(task_ref->mm);
> +	put_task_struct(tsk);
> +
> +	/* Success */
> +	if (!rc)
> +		return;
> +
> +
> +	pr_debug("Invalid CSB address 0x%p signalling pid(%d)\n",
> +			csb_addr, pid_vnr(pid));
> +
> +	clear_siginfo(&info);
> +	info.si_signo = SIGSEGV;
> +	info.si_errno = EFAULT;
> +	info.si_code = SEGV_MAPERR;
> +	info.si_addr = csb_addr;
> +	/*
> +	 * process will be polling on csb.flags after request is sent to
> +	 * NX. So generally CSB update should not fail except when an
> +	 * application passes invalid csb_addr. So an error message will
> +	 * be displayed and leave it to user space whether to ignore or
> +	 * handle this signal.
> +	 */
> +	rcu_read_lock();
> +	rc = kill_pid_info(SIGSEGV, &info, pid);
> +	rcu_read_unlock();
> +
> +	pr_devel("%s(): pid %d kill_proc_info() rc %d\n", __func__,
> +			pid_vnr(pid), rc);

And a separate function for this bit too. And comment them all in high
level terms of the thread/process/file descriptor/interrupt lifetimes,
coprocessor access to memory and raising of interrupts, etc. rather than
talking about vas windows and CSB addresses and things. The problem is
nobody who understands kernel/fork/exit/signal/pid/etc rules will have 
any clue what's going on here reading these comments.

It's not actually quite clear what your task_struct ref is doing because 
you release it before you deliver this signal, so what's it protecting
exactly.

I know it's existing code, but it could go in the previous patch that
already gets halfway there.

Thanks,
Nick

> +}
> +
> +void vas_dump_crb(struct coprocessor_request_block *crb)
> +{
> +	struct data_descriptor_entry *dde;
> +	struct nx_fault_stamp *nx;
> +
> +	dde = &crb->source;
> +	pr_devel("SrcDDE: addr 0x%llx, len %d, count %d, idx %d, flags %d\n",
> +		be64_to_cpu(dde->address), be32_to_cpu(dde->length),
> +		dde->count, dde->index, dde->flags);
> +
> +	dde = &crb->target;
> +	pr_devel("TgtDDE: addr 0x%llx, len %d, count %d, idx %d, flags %d\n",
> +		be64_to_cpu(dde->address), be32_to_cpu(dde->length),
> +		dde->count, dde->index, dde->flags);
> +
> +	nx = &crb->stamp.nx;
> +	pr_devel("NX Stamp: PSWID 0x%x, FSA 0x%llx, flags 0x%x, FS 0x%x\n",
> +		be32_to_cpu(nx->pswid),
> +		be64_to_cpu(crb->stamp.nx.fault_storage_addr),
> +		nx->flags, nx->fault_status);
> +}
> +
>  static int coproc_open(struct inode *inode, struct file *fp)
>  {
>  	struct coproc_instance *cp_inst;
> diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
> index ac3a71ec3bd5..2729ac541fb3 100644
> --- a/arch/powerpc/platforms/powernv/vas-fault.c
> +++ b/arch/powerpc/platforms/powernv/vas-fault.c
> @@ -26,150 +26,6 @@
>   */
>  #define VAS_FAULT_WIN_FIFO_SIZE	(4 << 20)
>  
> -static void dump_crb(struct coprocessor_request_block *crb)
> -{
> -	struct data_descriptor_entry *dde;
> -	struct nx_fault_stamp *nx;
> -
> -	dde = &crb->source;
> -	pr_devel("SrcDDE: addr 0x%llx, len %d, count %d, idx %d, flags %d\n",
> -		be64_to_cpu(dde->address), be32_to_cpu(dde->length),
> -		dde->count, dde->index, dde->flags);
> -
> -	dde = &crb->target;
> -	pr_devel("TgtDDE: addr 0x%llx, len %d, count %d, idx %d, flags %d\n",
> -		be64_to_cpu(dde->address), be32_to_cpu(dde->length),
> -		dde->count, dde->index, dde->flags);
> -
> -	nx = &crb->stamp.nx;
> -	pr_devel("NX Stamp: PSWID 0x%x, FSA 0x%llx, flags 0x%x, FS 0x%x\n",
> -		be32_to_cpu(nx->pswid),
> -		be64_to_cpu(crb->stamp.nx.fault_storage_addr),
> -		nx->flags, nx->fault_status);
> -}
> -
> -/*
> - * Update the CSB to indicate a translation error.
> - *
> - * User space will be polling on CSB after the request is issued.
> - * If NX can handle the request without any issues, it updates CSB.
> - * Whereas if NX encounters page fault, the kernel will handle the
> - * fault and update CSB with translation error.
> - *
> - * If we are unable to update the CSB means copy_to_user failed due to
> - * invalid csb_addr, send a signal to the process.
> - */
> -static void update_csb(struct vas_window *window,
> -			struct coprocessor_request_block *crb)
> -{
> -	struct coprocessor_status_block csb;
> -	struct kernel_siginfo info;
> -	struct task_struct *tsk;
> -	void __user *csb_addr;
> -	struct pid *pid;
> -	int rc;
> -
> -	/*
> -	 * NX user space windows can not be opened for task->mm=NULL
> -	 * and faults will not be generated for kernel requests.
> -	 */
> -	if (WARN_ON_ONCE(!window->task_ref.mm || !window->user_win))
> -		return;
> -
> -	csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
> -
> -	memset(&csb, 0, sizeof(csb));
> -	csb.cc = CSB_CC_FAULT_ADDRESS;
> -	csb.ce = CSB_CE_TERMINATION;
> -	csb.cs = 0;
> -	csb.count = 0;
> -
> -	/*
> -	 * NX operates and returns in BE format as defined CRB struct.
> -	 * So saves fault_storage_addr in BE as NX pastes in FIFO and
> -	 * expects user space to convert to CPU format.
> -	 */
> -	csb.address = crb->stamp.nx.fault_storage_addr;
> -	csb.flags = 0;
> -
> -	pid = window->task_ref.pid;
> -	tsk = get_pid_task(pid, PIDTYPE_PID);
> -	/*
> -	 * Process closes send window after all pending NX requests are
> -	 * completed. In multi-thread applications, a child thread can
> -	 * open a window and can exit without closing it. May be some
> -	 * requests are pending or this window can be used by other
> -	 * threads later. We should handle faults if NX encounters
> -	 * pages faults on these requests. Update CSB with translation
> -	 * error and fault address. If csb_addr passed by user space is
> -	 * invalid, send SEGV signal to pid saved in window. If the
> -	 * child thread is not running, send the signal to tgid.
> -	 * Parent thread (tgid) will close this window upon its exit.
> -	 *
> -	 * pid and mm references are taken when window is opened by
> -	 * process (pid). So tgid is used only when child thread opens
> -	 * a window and exits without closing it.
> -	 */
> -	if (!tsk) {
> -		pid = window->task_ref.tgid;
> -		tsk = get_pid_task(pid, PIDTYPE_PID);
> -		/*
> -		 * Parent thread (tgid) will be closing window when it
> -		 * exits. So should not get here.
> -		 */
> -		if (WARN_ON_ONCE(!tsk))
> -			return;
> -	}
> -
> -	/* Return if the task is exiting. */
> -	if (tsk->flags & PF_EXITING) {
> -		put_task_struct(tsk);
> -		return;
> -	}
> -
> -	kthread_use_mm(window->task_ref.mm);
> -	rc = copy_to_user(csb_addr, &csb, sizeof(csb));
> -	/*
> -	 * User space polls on csb.flags (first byte). So add barrier
> -	 * then copy first byte with csb flags update.
> -	 */
> -	if (!rc) {
> -		csb.flags = CSB_V;
> -		/* Make sure update to csb.flags is visible now */
> -		smp_mb();
> -		rc = copy_to_user(csb_addr, &csb, sizeof(u8));
> -	}
> -	kthread_unuse_mm(window->task_ref.mm);
> -	put_task_struct(tsk);
> -
> -	/* Success */
> -	if (!rc)
> -		return;
> -
> -	pr_debug("Invalid CSB address 0x%p signalling pid(%d)\n",
> -			csb_addr, pid_vnr(pid));
> -
> -	clear_siginfo(&info);
> -	info.si_signo = SIGSEGV;
> -	info.si_errno = EFAULT;
> -	info.si_code = SEGV_MAPERR;
> -	info.si_addr = csb_addr;
> -
> -	/*
> -	 * process will be polling on csb.flags after request is sent to
> -	 * NX. So generally CSB update should not fail except when an
> -	 * application passes invalid csb_addr. So an error message will
> -	 * be displayed and leave it to user space whether to ignore or
> -	 * handle this signal.
> -	 */
> -	rcu_read_lock();
> -	rc = kill_pid_info(SIGSEGV, &info, pid);
> -	rcu_read_unlock();
> -
> -	pr_devel("%s(): pid %d kill_proc_info() rc %d\n", __func__,
> -			pid_vnr(pid), rc);
> -}
> -
>  static void dump_fifo(struct vas_instance *vinst, void *entry)
>  {
>  	unsigned long *end = vinst->fault_fifo + vinst->fault_fifo_size;
> @@ -272,7 +128,7 @@ irqreturn_t vas_fault_thread_fn(int irq, void *data)
>  				vinst->vas_id, vinst->fault_fifo, fifo,
>  				vinst->fault_crbs);
>  
> -		dump_crb(crb);
> +		vas_dump_crb(crb);
>  		window = vas_pswid_to_window(vinst,
>  				be32_to_cpu(crb->stamp.nx.pswid));
>  
> @@ -293,7 +149,14 @@ irqreturn_t vas_fault_thread_fn(int irq, void *data)
>  
>  			WARN_ON_ONCE(1);
>  		} else {
> -			update_csb(window, crb);
> +			/*
> +			 * NX sees faults only with user space windows.
> +			 */
> +			if (window->user_win)
> +				vas_update_csb(crb, &window->task_ref);
> +			else
> +				WARN_ON_ONCE(!window->user_win);
> +
>  			/*
>  			 * Return credit for send window after processing
>  			 * fault CRB.
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 15/17] crypto/nx: Register and unregister VAS interface on PowerVM
From: Nicholas Piggin @ 2021-06-14  3:35 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <0fe93c925b9e5517f55121f51074de425feb5236.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 9:04 pm:
> 
> The user space uses /dev/crypto/nx-gzip interface to setup VAS
> windows, create paste mapping and close windows. This patch adds
> changes to create/remove this interface with VAS register/unregister
> functions on PowerVM platform.
> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>

Acked-by: Nicholas Piggin <npiggin@gmail.com>

> ---
>  drivers/crypto/nx/Kconfig             | 1 +
>  drivers/crypto/nx/nx-common-pseries.c | 9 +++++++++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/drivers/crypto/nx/Kconfig b/drivers/crypto/nx/Kconfig
> index 23e3d0160e67..2a35e0e785bd 100644
> --- a/drivers/crypto/nx/Kconfig
> +++ b/drivers/crypto/nx/Kconfig
> @@ -29,6 +29,7 @@ if CRYPTO_DEV_NX_COMPRESS
>  config CRYPTO_DEV_NX_COMPRESS_PSERIES
>  	tristate "Compression acceleration support on pSeries platform"
>  	depends on PPC_PSERIES && IBMVIO
> +	depends on PPC_VAS
>  	default y
>  	help
>  	  Support for PowerPC Nest (NX) compression acceleration. This
> diff --git a/drivers/crypto/nx/nx-common-pseries.c b/drivers/crypto/nx/nx-common-pseries.c
> index cc8dd3072b8b..9a40fca8a9e6 100644
> --- a/drivers/crypto/nx/nx-common-pseries.c
> +++ b/drivers/crypto/nx/nx-common-pseries.c
> @@ -9,6 +9,7 @@
>   */
>  
>  #include <asm/vio.h>
> +#include <asm/vas.h>
>  
>  #include "nx-842.h"
>  #include "nx_csbcpb.h" /* struct nx_csbcpb */
> @@ -1101,6 +1102,12 @@ static int __init nx842_pseries_init(void)
>  		return ret;
>  	}
>  
> +	ret = vas_register_api_pseries(THIS_MODULE, VAS_COP_TYPE_GZIP,
> +				       "nx-gzip");
> +
> +	if (ret)
> +		pr_err("NX-GZIP is not supported. Returned=%d\n", ret);
> +
>  	return 0;
>  }
>  
> @@ -1111,6 +1118,8 @@ static void __exit nx842_pseries_exit(void)
>  	struct nx842_devdata *old_devdata;
>  	unsigned long flags;
>  
> +	vas_unregister_api_pseries();
> +
>  	crypto_unregister_alg(&nx842_pseries_alg);
>  
>  	spin_lock_irqsave(&devdata_mutex, flags);
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 16/17] crypto/nx: Get NX capabilities for GZIP coprocessor type
From: Nicholas Piggin @ 2021-06-14  3:39 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <726de270eb20e0898f4391a0b0e7077697db66ab.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 9:04 pm:
> 
> The hypervisor provides different capabilities that it supports
> to define the user space NX request. These capabilities are
> recommended minimum compression / decompression lengths and the
> maximum request buffer size in bytes.
> 
> Changes to get NX overall capabilities which points to the
> specific features that the hypervisor supports. Then retrieve
> the capabilities for the specific feature (available only
> for NXGZIP).

So what does this give you which you didn't have before? Should
this go before the previous patch that enables the interface for guests,
or is there some functional-yet-degraded mode that is available without
this patch?

I would suggest even if this is the case to switch ordering of the 
patches so as to reduce the matrix of functionality that userspace sees 
when bisecting. Unless you specifically want this kind of bisectability,
in which case make that explicit in the changelog.

Thanks,
Nick

> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
> ---
>  drivers/crypto/nx/nx-common-pseries.c | 86 +++++++++++++++++++++++++++
>  1 file changed, 86 insertions(+)
> 
> diff --git a/drivers/crypto/nx/nx-common-pseries.c b/drivers/crypto/nx/nx-common-pseries.c
> index 9a40fca8a9e6..60b5049ec9f7 100644
> --- a/drivers/crypto/nx/nx-common-pseries.c
> +++ b/drivers/crypto/nx/nx-common-pseries.c
> @@ -9,6 +9,7 @@
>   */
>  
>  #include <asm/vio.h>
> +#include <asm/hvcall.h>
>  #include <asm/vas.h>
>  
>  #include "nx-842.h"
> @@ -20,6 +21,29 @@ MODULE_DESCRIPTION("842 H/W Compression driver for IBM Power processors");
>  MODULE_ALIAS_CRYPTO("842");
>  MODULE_ALIAS_CRYPTO("842-nx");
>  
> +/*
> + * Coprocessor type specific capabilities from the hypervisor.
> + */
> +struct hv_nx_ct_caps {
> +	__be64	descriptor;
> +	__be64	req_max_processed_len;	/* Max bytes in one GZIP request */
> +	__be64	min_compress_len;	/* Min compression size in bytes */
> +	__be64	min_decompress_len;	/* Min decompression size in bytes */
> +} __packed __aligned(0x1000);
> +
> +/*
> + * Coprocessor type specific capabilities.
> + */
> +struct nx_ct_caps {
> +	u64	descriptor;
> +	u64	req_max_processed_len;	/* Max bytes in one GZIP request */
> +	u64	min_compress_len;	/* Min compression in bytes */
> +	u64	min_decompress_len;	/* Min decompression in bytes */
> +};
> +
> +static u64 caps_feat;
> +static struct nx_ct_caps nx_ct_caps;
> +
>  static struct nx842_constraints nx842_pseries_constraints = {
>  	.alignment =	DDE_BUFFER_ALIGN,
>  	.multiple =	DDE_BUFFER_LAST_MULT,
> @@ -1066,6 +1090,64 @@ static void nx842_remove(struct vio_dev *viodev)
>  	kfree(old_devdata);
>  }
>  
> +/*
> + * Get NX capabilities from the hypervisor.
> + * Only NXGZIP capabilities are provided by the hypersvisor right
> + * now and these values are available to user space with sysfs.
> + */
> +static void __init nxct_get_capabilities(void)
> +{
> +	struct hv_vas_all_caps *hv_caps;
> +	struct hv_nx_ct_caps *hv_nxc;
> +	int rc;
> +
> +	hv_caps = kmalloc(sizeof(*hv_caps), GFP_KERNEL);
> +	if (!hv_caps)
> +		return;
> +	/*
> +	 * Get NX overall capabilities with feature type=0
> +	 */
> +	rc = h_query_vas_capabilities(H_QUERY_NX_CAPABILITIES, 0,
> +					  (u64)virt_to_phys(hv_caps));
> +	if (rc)
> +		goto out;
> +
> +	caps_feat = be64_to_cpu(hv_caps->feat_type);
> +	/*
> +	 * NX-GZIP feature available
> +	 */
> +	if (caps_feat & VAS_NX_GZIP_FEAT_BIT) {
> +		hv_nxc = kmalloc(sizeof(*hv_nxc), GFP_KERNEL);
> +		if (!hv_nxc)
> +			goto out;
> +		/*
> +		 * Get capabilities for NX-GZIP feature
> +		 */
> +		rc = h_query_vas_capabilities(H_QUERY_NX_CAPABILITIES,
> +						  VAS_NX_GZIP_FEAT,
> +						  (u64)virt_to_phys(hv_nxc));
> +	} else {
> +		pr_err("NX-GZIP feature is not available\n");
> +		rc = -EINVAL;
> +	}
> +
> +	if (!rc) {
> +		nx_ct_caps.descriptor = be64_to_cpu(hv_nxc->descriptor);
> +		nx_ct_caps.req_max_processed_len =
> +				be64_to_cpu(hv_nxc->req_max_processed_len);
> +		nx_ct_caps.min_compress_len =
> +				be64_to_cpu(hv_nxc->min_compress_len);
> +		nx_ct_caps.min_decompress_len =
> +				be64_to_cpu(hv_nxc->min_decompress_len);
> +	} else {
> +		caps_feat = 0;
> +	}
> +
> +	kfree(hv_nxc);
> +out:
> +	kfree(hv_caps);
> +}
> +
>  static const struct vio_device_id nx842_vio_driver_ids[] = {
>  	{"ibm,compression-v1", "ibm,compression"},
>  	{"", ""},
> @@ -1093,6 +1175,10 @@ static int __init nx842_pseries_init(void)
>  		return -ENOMEM;
>  
>  	RCU_INIT_POINTER(devdata, new_devdata);
> +	/*
> +	 * Get NX capabilities from the hypervisor.
> +	 */
> +	nxct_get_capabilities();
>  
>  	ret = vio_register_driver(&nx842_vio_driver);
>  	if (ret) {
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v5 17/17] crypto/nx: Add sysfs interface to export NX capabilities
From: Nicholas Piggin @ 2021-06-14  3:46 UTC (permalink / raw)
  To: Haren Myneni, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <8fd529d8612ea47cce69101b62e9498de9324850.camel@linux.ibm.com>

Excerpts from Haren Myneni's message of June 13, 2021 9:05 pm:
> 
> Changes to export the following NXGZIP capabilities through sysfs:
> 
> /sys/devices/vio/ibm,compression-v1/nx_gzip_caps:
> min_compress_len  /*Recommended minimum compress length in bytes*/
> min_decompress_len /*Recommended minimum decompress length in bytes*/
> req_max_processed_len /* Maximum number of bytes processed in one
> 			request */
> 
> NX will return RMA_Reject if the request buffer size is greater
> than req_max_processed_len.

Similar for the previous patch, can userspace sanely use the API without 
these capabilities? If not, reorder so the final enable comes last.

I would put those comments in the code, and make the changelog a little
higher level, including a reference or example of how userspace uses it,
which should come with any new userspace ABI.

"Export NX-GZIP capabilities to usrespace in sysfs /sys/devices/vio/...
directory. These are queried by userspace accelerator libraries to set
minimum length heuristics and maximum limits on request sizes."

Something like that. Otherwise looks okay to me.

Thanks,
Nick

> 
> Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
> ---
>  drivers/crypto/nx/nx-common-pseries.c | 43 +++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/drivers/crypto/nx/nx-common-pseries.c b/drivers/crypto/nx/nx-common-pseries.c
> index 60b5049ec9f7..db28a84a826c 100644
> --- a/drivers/crypto/nx/nx-common-pseries.c
> +++ b/drivers/crypto/nx/nx-common-pseries.c
> @@ -967,6 +967,36 @@ static struct attribute_group nx842_attribute_group = {
>  	.attrs = nx842_sysfs_entries,
>  };
>  
> +#define	nxct_caps_read(_name)						\
> +static ssize_t nxct_##_name##_show(struct device *dev,			\
> +			struct device_attribute *attr, char *buf)	\
> +{									\
> +	return sprintf(buf, "%lld\n", nx_ct_caps._name);		\
> +}
> +
> +#define NXCT_ATTR_RO(_name)						\
> +	nxct_caps_read(_name);						\
> +	static struct device_attribute dev_attr_##_name = __ATTR(_name,	\
> +						0444,			\
> +						nxct_##_name##_show,	\
> +						NULL);
> +
> +NXCT_ATTR_RO(req_max_processed_len);
> +NXCT_ATTR_RO(min_compress_len);
> +NXCT_ATTR_RO(min_decompress_len);
> +
> +static struct attribute *nxct_caps_sysfs_entries[] = {
> +	&dev_attr_req_max_processed_len.attr,
> +	&dev_attr_min_compress_len.attr,
> +	&dev_attr_min_decompress_len.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group nxct_caps_attr_group = {
> +	.name	=	"nx_gzip_caps",
> +	.attrs	=	nxct_caps_sysfs_entries,
> +};
> +
>  static struct nx842_driver nx842_pseries_driver = {
>  	.name =		KBUILD_MODNAME,
>  	.owner =	THIS_MODULE,
> @@ -1056,6 +1086,16 @@ static int nx842_probe(struct vio_dev *viodev,
>  		goto error;
>  	}
>  
> +	if (caps_feat) {
> +		if (sysfs_create_group(&viodev->dev.kobj,
> +					&nxct_caps_attr_group)) {
> +			dev_err(&viodev->dev,
> +				"Could not create sysfs NX capability entries\n");
> +			ret = -1;
> +			goto error;
> +		}
> +	}
> +
>  	return 0;
>  
>  error_unlock:
> @@ -1075,6 +1115,9 @@ static void nx842_remove(struct vio_dev *viodev)
>  	pr_info("Removing IBM Power 842 compression device\n");
>  	sysfs_remove_group(&viodev->dev.kobj, &nx842_attribute_group);
>  
> +	if (caps_feat)
> +		sysfs_remove_group(&viodev->dev.kobj, &nxct_caps_attr_group);
> +
>  	crypto_unregister_alg(&nx842_pseries_alg);
>  
>  	spin_lock_irqsave(&devdata_mutex, flags);
> -- 
> 2.18.2
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
From: Andy Lutomirski @ 2021-06-14  3:52 UTC (permalink / raw)
  To: Nicholas Piggin, Andrew Morton
  Cc: linux-arch, Linus Torvalds, Randy Dunlap, linux-kernel, linux-mm,
	linuxppc-dev
In-Reply-To: <1623629185.fxzl5xdab6.astroid@bobo.none>

On 6/13/21 5:45 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>> when it is context switched. This can be disabled by architectures that
>>> don't require this refcounting if they clean up lazy tlb mms when the
>>> last refcount is dropped. Currently this is always enabled, which is
>>> what existing code does, so the patch is effectively a no-op.
>>>
>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>
>> I am in favor of this approach, but I would be a lot more comfortable
>> with the resulting code if task->active_mm were at least better
>> documented and possibly even guarded by ifdefs.
> 
> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
> I don't think anything has changed in 20 years, I don't know what more
> is needed, but if you can add to documentation that would be nice. Maybe
> moving a bit of that into .c and .h files?
> 

Quoting from that file:

  - however, we obviously need to keep track of which address space we
    "stole" for such an anonymous user. For that, we have "tsk->active_mm",
    which shows what the currently active address space is.

This isn't even true right now on x86.  With your patch applied:

 To support all that, the "struct mm_struct" now has two counters: a
 "mm_users" counter that is how many "real address space users" there are,
 and a "mm_count" counter that is the number of "lazy" users (ie anonymous
 users) plus one if there are any real users.

isn't even true any more.


>> x86 bare metal currently does not need the core lazy mm refcounting, and
>> x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
>> if lazy mm refcounting were configured out, ->active_mm could become a
>> dangling pointer, and this makes me extremely uncomfortable.
>>
>> So I tend to think that, depending on config, the core code should
>> either keep ->active_mm [1] alive or get rid of it entirely.
> 
> I don't actually know what you mean.
> 
> core code needs the concept of an "active_mm". This is the mm that your 
> kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
> active_mm still points to init_mm for kernel threads.

Core code does *not* need this concept.  First, it's wrong on x86 since
at least 4.15.  Any core code that actually assumes that ->active_mm is
"active" for any sensible definition of the word active is wrong.
Fortunately there is no such code.

I looked through all active_mm references in core code.  We have:

kernel/sched/core.c: it's all refcounting, although it's a bit tangled
with membarrier.

kernel/kthread.c: same.  refcounting and membarrier stuff.

kernel/exit.c: exit_mm() a BUG_ON().

kernel/fork.c: initialization code and a warning.

kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.

fs/exec.c: nothing of interest

I didn't go through drivers, but I maintain my point.  active_mm is
there for refcounting.  So please don't just make it even more confusing
-- do your performance improvement, but improve the code at the same
time: get rid of active_mm, at least on architectures that opt out of
the refcounting.



> 
> We could hide that idea behind an active_mm() function that would always 
> return &init_mm if mm==NULL, but you still have the concept of an active
> mm and a pointer that callers must not access after free (because some
> cases will be CONFIG_LAZY_TLB=y).
> 
>> [1] I don't really think it belongs in task_struct at all.  It's not a
>> property of the task.  It's the *per-cpu* mm that the core code is
>> keeping alive for lazy purposes.  How about consolidating it with the
>> copy in rq?
> 
> I agree it's conceptually a per-cpu property. I don't know why it was 
> done this way, maybe it was just convenient and works well for mm and 
> active_mm to be adjacent. Linus might have a better insight.
> 
>> I guess the short summary of my opinion is that I like making this
>> configurable, but I do not like the state of the code.
> 
> I don't think I'd object to moving active_mm to rq and converting all
> usages to active_mm() while we're there, it would make things a bit
> more configurable. But I don't see it making core code fundamentally
> less complex... if you're referring to the x86 mm switching monstrosity,
> then that's understandable, but I admit I haven't spent enough time
> looking at it to make a useful comment. A patch would be enlightening,
> I have the leftover CONFIG_LAZY_TLB=n patch if you were thinking of 
> building on that I can send it to you.
> 
> Thanks,
> Nick
> 


^ permalink raw reply

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
From: Nicholas Piggin @ 2021-06-14  4:14 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: linux-arch, Linus Torvalds, Randy Dunlap, linux-kernel, linux-mm,
	linuxppc-dev
In-Reply-To: <02e16a2f-2f58-b4f2-d335-065e007bcea2@kernel.org>

Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>> when it is context switched. This can be disabled by architectures that
>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>> last refcount is dropped. Currently this is always enabled, which is
>>>> what existing code does, so the patch is effectively a no-op.
>>>>
>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>
>>> I am in favor of this approach, but I would be a lot more comfortable
>>> with the resulting code if task->active_mm were at least better
>>> documented and possibly even guarded by ifdefs.
>> 
>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>> I don't think anything has changed in 20 years, I don't know what more
>> is needed, but if you can add to documentation that would be nice. Maybe
>> moving a bit of that into .c and .h files?
>> 
> 
> Quoting from that file:
> 
>   - however, we obviously need to keep track of which address space we
>     "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>     which shows what the currently active address space is.
> 
> This isn't even true right now on x86.

From the perspective of core code, it is. x86 might do something crazy 
with it, but it has to make it appear this way to non-arch code that
uses active_mm.

Is x86's scheme documented?

> With your patch applied:
> 
>  To support all that, the "struct mm_struct" now has two counters: a
>  "mm_users" counter that is how many "real address space users" there are,
>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>  users) plus one if there are any real users.
> 
> isn't even true any more.

Well yeah but the active_mm concept hasn't changed. The refcounting 
change is hopefully reasonably documented?

> 
> 
>>> x86 bare metal currently does not need the core lazy mm refcounting, and
>>> x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
>>> if lazy mm refcounting were configured out, ->active_mm could become a
>>> dangling pointer, and this makes me extremely uncomfortable.
>>>
>>> So I tend to think that, depending on config, the core code should
>>> either keep ->active_mm [1] alive or get rid of it entirely.
>> 
>> I don't actually know what you mean.
>> 
>> core code needs the concept of an "active_mm". This is the mm that your 
>> kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
>> active_mm still points to init_mm for kernel threads.
> 
> Core code does *not* need this concept.  First, it's wrong on x86 since
> at least 4.15.  Any core code that actually assumes that ->active_mm is
> "active" for any sensible definition of the word active is wrong.
> Fortunately there is no such code.
> 
> I looked through all active_mm references in core code.  We have:
> 
> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
> with membarrier.
> 
> kernel/kthread.c: same.  refcounting and membarrier stuff.
> 
> kernel/exit.c: exit_mm() a BUG_ON().
> 
> kernel/fork.c: initialization code and a warning.
> 
> kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.
> 
> fs/exec.c: nothing of interest

I might not have been clear. Core code doesn't need active_mm if 
active_mm somehow goes away. I'm saying active_mm can't go away because
it's needed to support (most) archs that do lazy tlb mm switching.

The part I don't understand is when you say it can just go away. How? 

> I didn't go through drivers, but I maintain my point.  active_mm is
> there for refcounting.  So please don't just make it even more confusing
> -- do your performance improvement, but improve the code at the same
> time: get rid of active_mm, at least on architectures that opt out of
> the refcounting.

powerpc opts out of the refcounting and can not "get rid of active_mm".
Not even in theory.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable
From: Nicholas Piggin @ 2021-06-14  4:47 UTC (permalink / raw)
  To: Andrew Morton, Andy Lutomirski
  Cc: linux-arch, Linus Torvalds, Randy Dunlap, linux-kernel, linux-mm,
	linuxppc-dev
In-Reply-To: <1623643443.b9twp3txmw.astroid@bobo.none>

Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:
> Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
>> On 6/13/21 5:45 PM, Nicholas Piggin wrote:
>>> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>>>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>>>> when it is context switched. This can be disabled by architectures that
>>>>> don't require this refcounting if they clean up lazy tlb mms when the
>>>>> last refcount is dropped. Currently this is always enabled, which is
>>>>> what existing code does, so the patch is effectively a no-op.
>>>>>
>>>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>>>
>>>> I am in favor of this approach, but I would be a lot more comfortable
>>>> with the resulting code if task->active_mm were at least better
>>>> documented and possibly even guarded by ifdefs.
>>> 
>>> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
>>> I don't think anything has changed in 20 years, I don't know what more
>>> is needed, but if you can add to documentation that would be nice. Maybe
>>> moving a bit of that into .c and .h files?
>>> 
>> 
>> Quoting from that file:
>> 
>>   - however, we obviously need to keep track of which address space we
>>     "stole" for such an anonymous user. For that, we have "tsk->active_mm",
>>     which shows what the currently active address space is.
>> 
>> This isn't even true right now on x86.
> 
> From the perspective of core code, it is. x86 might do something crazy 
> with it, but it has to make it appear this way to non-arch code that
> uses active_mm.
> 
> Is x86's scheme documented?
> 
>> With your patch applied:
>> 
>>  To support all that, the "struct mm_struct" now has two counters: a
>>  "mm_users" counter that is how many "real address space users" there are,
>>  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
>>  users) plus one if there are any real users.
>> 
>> isn't even true any more.
> 
> Well yeah but the active_mm concept hasn't changed. The refcounting 
> change is hopefully reasonably documented?
> 
>> 
>> 
>>>> x86 bare metal currently does not need the core lazy mm refcounting, and
>>>> x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
>>>> if lazy mm refcounting were configured out, ->active_mm could become a
>>>> dangling pointer, and this makes me extremely uncomfortable.
>>>>
>>>> So I tend to think that, depending on config, the core code should
>>>> either keep ->active_mm [1] alive or get rid of it entirely.
>>> 
>>> I don't actually know what you mean.
>>> 
>>> core code needs the concept of an "active_mm". This is the mm that your 
>>> kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
>>> active_mm still points to init_mm for kernel threads.
>> 
>> Core code does *not* need this concept.  First, it's wrong on x86 since
>> at least 4.15.  Any core code that actually assumes that ->active_mm is
>> "active" for any sensible definition of the word active is wrong.
>> Fortunately there is no such code.
>> 
>> I looked through all active_mm references in core code.  We have:
>> 
>> kernel/sched/core.c: it's all refcounting, although it's a bit tangled
>> with membarrier.
>> 
>> kernel/kthread.c: same.  refcounting and membarrier stuff.
>> 
>> kernel/exit.c: exit_mm() a BUG_ON().
>> 
>> kernel/fork.c: initialization code and a warning.
>> 
>> kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.
>> 
>> fs/exec.c: nothing of interest
> 
> I might not have been clear. Core code doesn't need active_mm if 
> active_mm somehow goes away. I'm saying active_mm can't go away because
> it's needed to support (most) archs that do lazy tlb mm switching.
> 
> The part I don't understand is when you say it can just go away. How? 
> 
>> I didn't go through drivers, but I maintain my point.  active_mm is
>> there for refcounting.  So please don't just make it even more confusing
>> -- do your performance improvement, but improve the code at the same
>> time: get rid of active_mm, at least on architectures that opt out of
>> the refcounting.
> 
> powerpc opts out of the refcounting and can not "get rid of active_mm".
> Not even in theory.

That is to say, it does do a type of reference management that requires 
active_mm so you can argue it has not entirely opted out of refcounting.
But we're not just doing refcounting for the sake of refcounting! That
would make no sense.

active_mm is required because that's the mm that we have switched to 
(from core code's perspective), and it is integral to know when to 
switch to a different mm. See how active_mm is a fundamental concept
in core code? It's part of the contract between core code and the
arch mm context management calls. reference counting follows from there
but it's not the _reason_ for this code.

Pretend the reference problem does not exit (whether by refcounting or 
shootdown or garbage collection or whatever). We still can't remove 
active_mm! We need it to know how to call into arch functions like 
switch_mm.

I don't know if you just forgot that critical requirement in your above 
list, or you actually are entirely using x86's mental model for this 
code which is doing something entirely different that does not need it 
at all. If that is the case I really don't mind some cleanup or wrapper 
functions for x86 do entirely do its own thing, but if that's the case
you can't criticize core code's use of active_mm due to the current
state of x86. It's x86 that needs documentation and cleaning up.

Thanks,
Nick

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox