LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: crash after NX error
From: Stewart Smith @ 2019-06-06  2:29 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev, haren
In-Reply-To: <87zhmwmgv7.fsf@concordia.ellerman.id.au>

Michael Ellerman <mpe@ellerman.id.au> writes:
> Stewart Smith <stewart@linux.ibm.com> writes:
>> On my two socket POWER9 system (powernv) with 842 zwap set up, I
>> recently got a crash with the Ubuntu kernel (I haven't tried with
>> upstream, and this is the first time the system has died like this, so
>> I'm not sure how repeatable it is).
>>
>> [    2.891463] zswap: loaded using pool 842-nx/zbud
>> ...
>> [15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 5000000 us, giving up : 00 00 00 00 00000000
>> [16868.932913] Unable to handle kernel paging request for data at address 0x6655f67da816cdb8
>> [16868.933726] Faulting instruction address: 0xc000000000391600
>>
>>
>> cpu 0x68: Vector: 380 (Data Access Out of Range) at [c000001c9d98b9a0]
>>     pc: c000000000391600: kmem_cache_alloc+0x2e0/0x340
>>     lr: c0000000003915ec: kmem_cache_alloc+0x2cc/0x340
>>     sp: c000001c9d98bc20
>>    msr: 900000000280b033
>>    dar: 6655f67da816cdb8
>>   current = 0xc000001ad43cb400
>>   paca    = 0xc00000000fac7800   softe: 0        irq_happened: 0x01
>>     pid   = 8319, comm = make
>> Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 (Ubuntu 4.15.0-50.54-generic 4.15.18)
>>
>> 68:mon> t
>> [c000001c9d98bc20] c0000000003914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
>> [c000001c9d98bc80] c0000000003b1e14 __khugepaged_enter+0x54/0x220
>> [c000001c9d98bcc0] c00000000010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
>> [c000001c9d98bda0] c00000000010fe4c _do_fork+0xec/0x510
>> [c000001c9d98be30] c00000000000b584 ppc_clone+0x8/0xc
>> --- Exception: c00 (System Call) at 00007afe9daf87f4
>> SP (7fffca606880) is in userspace
>>
>> So, it looks like there could be a problem in the error path, plausibly
>> fixed by this patch:
>>
>> commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> Author: Haren Myneni <haren@linux.vnet.ibm.com>
>> Date:   Wed Jun 13 00:32:40 2018 -0700
>>
>>     crypto/nx: Initialize 842 high and normal RxFIFO control registers
>>     
>>     NX increments readOffset by FIFO size in receive FIFO control register
>>     when CRB is read. But the index in RxFIFO has to match with the
>>     corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
>>     may be processing incorrect CRBs and can cause CRB timeout.
>>     
>>     VAS FIFO offset is 0 when the receive window is opened during
>>     initialization. When the module is reloaded or in kexec boot, readOffset
>>     in FIFO control register may not match with VAS entry. This patch adds
>>     nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
>>     control register for both high and normal FIFOs.
>>     
>>     Signed-off-by: Haren Myneni <haren@us.ibm.com>
>>     [mpe: Fixup uninitialized variable warning]
>>     Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>>
>> $ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> v4.19-rc1~24^2~50
>>
>>
>> Which was never backported to any stable release, so probably needs to
>> be for v4.14 through v4.18.
>
> Yeah the P9 NX support went in in:
>   b0d6c9bab5e4 ("crypto/nx: Add P9 NX support for 842 compression engine")
>
> Which was: v4.14-rc1~119^2~21, so first released in v4.14.
>
>
> I'm actually less interested in that and more interested in the
> subsequent crash. The time stamps are miles apart though, did we just
> leave some corrupted memory after the NX failed and then hit it later?
> Or did we not correctly signal to the upper level APIs that the request
> failed.
>
> I think we need to do some testing with errors injected into the
> wait_for_csb() path, to ensure that failures there are not causing
> corrupting in zswap. Haren have you done any testing of error
> injection?

So, things died pretty heavily overnight (requiring e2fsck) with a *lot*
of those wait_for_csb() errors in the log.

It certainly *looks* like there's corruption around, as one of the CI
jobs that failed around that time got "internal compiler error" which is
usually a good sign that things have gone poorly somewhere.

-- 
Stewart Smith
OPAL Architect, IBM.


^ permalink raw reply

* Re: [RFC V2] mm: Generalize notify_page_fault()
From: Anshuman Khandual @ 2019-06-06  2:34 UTC (permalink / raw)
  To: Michael Ellerman, linux-kernel, linux-mm
  Cc: Mark Rutland, Michal Hocko, linux-ia64, linux-sh, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, Heiko Carstens, Paul Mackerras,
	Matthew Wilcox, sparclinux, linux-s390, Yoshinori Sato, x86,
	Russell King, Will Deacon, Ingo Molnar, Fenghua Yu,
	Stephen Rothwell, Andrey Konovalov, Andy Lutomirski,
	Thomas Gleixner, linux-arm-kernel, Tony Luck, Martin Schwidefsky,
	Andrew Morton, linuxppc-dev, David S. Miller
In-Reply-To: <87sgsomg91.fsf@concordia.ellerman.id.au>

On 06/05/2019 04:49 PM, Michael Ellerman wrote:
> Anshuman Khandual <anshuman.khandual@arm.com> writes:
>> Similar notify_page_fault() definitions are being used by architectures
>> duplicating much of the same code. This attempts to unify them into a
>> single implementation, generalize it and then move it to a common place.
>> kprobes_built_in() can detect CONFIG_KPROBES, hence notify_page_fault()
>> need not be wrapped again within CONFIG_KPROBES. Trap number argument can
>> now contain upto an 'unsigned int' accommodating all possible platforms.
> ...
>> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
>> index 58f69fa..1bc3b18 100644
>> --- a/arch/arm/mm/fault.c
>> +++ b/arch/arm/mm/fault.c
>> @@ -30,28 +30,6 @@
>>  
>>  #ifdef CONFIG_MMU
>>  
>> -#ifdef CONFIG_KPROBES
>> -static inline int notify_page_fault(struct pt_regs *regs, unsigned int fsr)
>> -{
>> -	int ret = 0;
>> -
>> -	if (!user_mode(regs)) {
>> -		/* kprobe_running() needs smp_processor_id() */
>> -		preempt_disable();
>> -		if (kprobe_running() && kprobe_fault_handler(regs, fsr))
>> -			ret = 1;
>> -		preempt_enable();
>> -	}
>> -
>> -	return ret;
>> -}
>> -#else
> 
> You've changed several of the architectures from something like above,
> where it disables preemption around the call into the below:
> 
>> +int __kprobes notify_page_fault(struct pt_regs *regs, unsigned int trap)
>> +{
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * To be potentially processing a kprobe fault and to be allowed
>> +	 * to call kprobe_running(), we have to be non-preemptible.
>> +	 */
>> +	if (kprobes_built_in() && !preemptible() && !user_mode(regs)) {
>> +		if (kprobe_running() && kprobe_fault_handler(regs, trap))
>> +			ret = 1;
>> +	}
>> +	return ret;
>> +}
> 
> Which skips everything if we're preemptible. Is that an equivalent

Right.

> change? If so can you please explain why in more detail.

It is probably not an equivalent change. The following explanation is extracted from
RFC V1 discussions (https://patchwork.kernel.org/patch/10968273/). Will explain the
rational for this behavior change in the commit message next time around.

----------------------------
a980c0ef9f6d ("x86/kprobes: Refactor kprobes_fault() like kprobe_exceptions_notify()")
b506a9d08bae ("x86: code clarification patch to Kprobes arch code")

In particular the later one (b506a9d08bae). It explains how the invoking context
in itself should be non-preemptible for the kprobes processing context irrespective
of whether kprobe_running() or perhaps smp_processor_id() is safe or not. Hence it
does not make much sense to continue when original invoking context is preemptible.
Instead just bail out earlier. This seems to be making more sense than preempt
disable-enable pair. If there are no concerns about this change from other platforms,
I will change the preemption behavior in proposed generic function next time around.
----------------------------

Do you see any concern changing preempt behavior in the x86 way ?

> 
> Also why not have it return bool?

Just that all architectures (except powerpc) had 'int' as return type. But we can
change that to 'bool'.

^ permalink raw reply

* Re: [RFC V2] mm: Generalize notify_page_fault()
From: Anshuman Khandual @ 2019-06-06  2:40 UTC (permalink / raw)
  To: Matthew Wilcox, Michael Ellerman
  Cc: Mark Rutland, Michal Hocko, linux-ia64, linux-sh, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, Will Deacon, linux-mm,
	Paul Mackerras, sparclinux, linux-s390, Yoshinori Sato, x86,
	Russell King, Ingo Molnar, Fenghua Yu, Stephen Rothwell,
	Andrey Konovalov, Andy Lutomirski, Thomas Gleixner,
	linux-arm-kernel, Tony Luck, Heiko Carstens, linux-kernel,
	Martin Schwidefsky, Andrew Morton, linuxppc-dev, David S. Miller
In-Reply-To: <20190605112328.GB2025@bombadil.infradead.org>



On 06/05/2019 04:53 PM, Matthew Wilcox wrote:
> On Wed, Jun 05, 2019 at 09:19:22PM +1000, Michael Ellerman wrote:
>> Anshuman Khandual <anshuman.khandual@arm.com> writes:
>>> Similar notify_page_fault() definitions are being used by architectures
>>> duplicating much of the same code. This attempts to unify them into a
>>> single implementation, generalize it and then move it to a common place.
>>> kprobes_built_in() can detect CONFIG_KPROBES, hence notify_page_fault()
>>> need not be wrapped again within CONFIG_KPROBES. Trap number argument can
>>> now contain upto an 'unsigned int' accommodating all possible platforms.
>> ...
>>
>> You've changed several of the architectures from something like above,
>> where it disables preemption around the call into the below:
>>
>>
>> Which skips everything if we're preemptible. Is that an equivalent
>> change? If so can you please explain why in more detail.
> 
> See the discussion in v1 of this patch, which you were cc'd on.
> 
> I agree the description here completely fails to mention why the change.
> It should mention commit a980c0ef9f6d8c.

I will update the commit message to include an explanation for this new
preempt behavior in the generic definition.

^ permalink raw reply

* Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook
From: Larry Finger @ 2019-06-06  3:06 UTC (permalink / raw)
  To: Aaro Koskinen, Christoph Hellwig, Christian Zigotzky,
	Michael Ellerman
  Cc: linuxppc-dev, linux-wireless, linux-kernel
In-Reply-To: <20190605225059.GA9953@darkstar.musicnaut.iki.fi>

On 6/5/19 5:50 PM, Aaro Koskinen wrote:
> Hi,
> 
> When upgrading from v5.0 -> v5.1 on G4 PowerBook, I noticed WLAN does
> not work anymore:
> 
> [   42.004303] b43legacy-phy0: Loading firmware version 0x127, patch level 14 (2005-04-18 02:36:27)
> [   42.184837] b43legacy-phy0 debug: Chip initialized
> [   42.184873] b43legacy-phy0 ERROR: The machine/kernel does not support the required 30-bit DMA mask
> 
> The same happens with the current mainline.
> 
> Bisected to:
> 
> 	commit 65a21b71f948406201e4f62e41f06513350ca390
> 	Author: Christoph Hellwig <hch@lst.de>
> 	Date:   Wed Feb 13 08:01:26 2019 +0100
> 
> 	    powerpc/dma: remove dma_nommu_dma_supported
> 
> 	    This function is largely identical to the generic version used
> 	    everywhere else.  Replace it with the generic version.
> 
> 	    Signed-off-by: Christoph Hellwig <hch@lst.de>
> 	    Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
> 	    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

Aaro,

First of all, you have my sympathy for the laborious bisection on a PowerBook 
G4. I have done several myself. Thank you.

I confirm your results.

The ppc code has a maximum DMA size of 31 bits, thus a 32-bit request will fail. 
Why the 30-bit fallback fails in b43legacy fails while it works in b43 is a mystery.

Although dma_nommu_dma_supported() may be "largely identical" to 
dma_direct_supported(), they obviously differ. Routine dma_nommu_dma_supported() 
returns 1 for 32-bit systems, but I do not know what dma_direct_supported() returns.

I am trying to find a patch.

Larry

^ permalink raw reply

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
From: Shawn Anastasio @ 2019-06-06  4:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Sam Bobroff, Alistair Popple, Oliver O'Halloran, David Gibson
In-Reply-To: <20190530070355.121802-1-aik@ozlabs.ru>

On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
> This is an attempt to allow DMA masks between 32..59 which are not large
> enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
> on the max order, up to 40 is usually available.
> 
> 
> This is based on v5.2-rc2.
> 
> Please comment. Thanks.

I have tested this patch set with an AMD GPU that's limited to <64bit
DMA (I believe it's 40 or 42 bit). It successfully allows the card to
operate without falling back to 32-bit DMA mode as it does without
the patches.

Relevant kernel log message:
```
[    0.311211] pci 0033:01     : [PE# 00] Enabling 64-bit DMA bypass
```

Tested-by: Shawn Anastasio <shawn@anastas.io>

^ permalink raw reply

* Re: PowerPC arch_ptrace() writes beyond thread_struct/task_struct
From: Christophe Leroy @ 2019-06-06  5:15 UTC (permalink / raw)
  To: Radu Rendec, linuxppc-dev; +Cc: Oleg Nesterov
In-Reply-To: <CAD5jUk-meCjEoqJqwriEZxfXQqWvObZUPq0MP=pSQmAbAbXMkw@mail.gmail.com>



Le 05/06/2019 à 23:45, Radu Rendec a écrit :
> Hi Everyone,
> 
> I'm seeing some weird memory corruption that I have been able to isolate
> to arch_ptrace() [arch/powerpc/kernel/ptrace.c] and PTRACE_POKEUSR. I am
> on PowerPC 32 (MPC8378), kernel 4.9.179.
> 
> It's not very easy for me to test on the latest kernel, but I guess
> little has changed since 4.9 in either the architecture specific ptrace
> code or PowerPC register data structures.
> 
> What happens is that gdb calls ptrace(PTRACE_POKEUSER) with addr=0x158.
> This goes down to arch_ptrace() [arch/powerpc/kernel/ptrace.c], inside
> `case PTRACE_POKEUSR`, on the branch that does this:
> 
>      memcpy(&child->thread.TS_FPR(fpidx), &data,
>              sizeof(long));
> 
> where:
>      index = addr >> 2 = 0x56 = 86
>      fpidx = index - PT_FPR0 = 86 - 48 = 38

In struct thread_fp_state, fpr field is u64, so I guess we should have 
the following on PPC32:

fpidx = (index - PT_FPR0) >> 1;

Christophe

>      &child->thread.TS_FPR(fpidx) = (void *)child + 1296
> 
>      offsetof(struct task_struct, thread) = 960
>      sizeof(struct thread_struct) = 336
>      sizeof(struct task_struct) = 1296
> 
> In other words, the memcpy() call writes just beyond thread_struct
> (which is also beyond task_struct, for that matter).
> 
> This should never get past the bounds checks for `index`, so perhaps
> there is a mismatch between ptrace macros and the actual register data
> structures layout.
> 
> I will continue to investigate, but I'm not familiar with the PowerPC
> registers so it will take a while before I make sense of all the data
> structures and macros. Hopefully this rings a bell to someone who is
> already familiar with those and could figure out quickly what the
> problem is.
> 
> Best regards,
> Radu Rendec
> 

^ permalink raw reply

* Re: [PATCH 12/16] mm: consolidate the get_user_pages* implementations
From: John Hubbard @ 2019-06-06  6:01 UTC (permalink / raw)
  To: Christoph Hellwig, Linus Torvalds, Paul Burton, James Hogan,
	Yoshinori Sato, Rich Felker, David S. Miller
  Cc: linux-sh, Andrey Konovalov, x86, linux-mips, Nicholas Piggin,
	linux-kernel, linux-mm, Khalid Aziz, Paul Mackerras, sparclinux,
	linuxppc-dev
In-Reply-To: <20190601074959.14036-13-hch@lst.de>

On 6/1/19 12:49 AM, Christoph Hellwig wrote:
> Always build mm/gup.c, and move the nommu versions and replace the
> separate stubs for various functions by the default ones, with the _fast
> version always falling back to the slow path because gup_fast_permitted
> always returns false now if HAVE_FAST_GUP is not set, and we use the
> nommu version of __get_user_pages while keeping all the wrappers common.
> 
> This also ensures the new put_user_pages* helpers are available for
> nommu, as those are currently missing, which would create a problem as
> soon as we actually grew users for it.
> 

Hi Christoph,

Thanks for fixing up the nommu case. And the patchset overall is a huge
relief to see, because I'd filed those arches under the "despair" category
for the gup conversions. :)

I started reviewing this one patch, and it's kind of messy figuring out 
if the code motion preserves everything because of
all the consolidation from other places, plus having to move things in
and out of the ifdef blocks.  So I figured I'd check and see if this is
going to make it past RFC status soon, and if it's going before or after
Ira's recent RFC ("RDMA/FS DAX truncate proposal").


thanks,
-- 
John Hubbard
NVIDIA

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  mm/Kconfig  |   1 +
>  mm/Makefile |   4 +-
>  mm/gup.c    | 476 +++++++++++++++++++++++++++++-----------------------
>  mm/nommu.c  |  88 ----------
>  mm/util.c   |  47 ------
>  5 files changed, 269 insertions(+), 347 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 98dffb0f2447..5c41409557da 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -133,6 +133,7 @@ config HAVE_MEMBLOCK_PHYS_MAP
>  	bool
>  
>  config HAVE_FAST_GUP
> +	depends on MMU
>  	bool
>  
>  config ARCH_KEEP_MEMBLOCK
> diff --git a/mm/Makefile b/mm/Makefile
> index ac5e5ba78874..dc0746ca1109 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -22,7 +22,7 @@ KCOV_INSTRUMENT_mmzone.o := n
>  KCOV_INSTRUMENT_vmstat.o := n
>  
>  mmu-y			:= nommu.o
> -mmu-$(CONFIG_MMU)	:= gup.o highmem.o memory.o mincore.o \
> +mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
>  			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
>  			   msync.o page_vma_mapped.o pagewalk.o \
>  			   pgtable-generic.o rmap.o vmalloc.o
> @@ -39,7 +39,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
>  			   mm_init.o mmu_context.o percpu.o slab_common.o \
>  			   compaction.o vmacache.o \
>  			   interval_tree.o list_lru.o workingset.o \
> -			   debug.o $(mmu-y)
> +			   debug.o gup.o $(mmu-y)
>  
>  # Give 'page_alloc' its own module-parameter namespace
>  page-alloc-y := page_alloc.o
> diff --git a/mm/gup.c b/mm/gup.c
> index a24f52292c7f..c8da7764de9c 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -134,6 +134,7 @@ void put_user_pages(struct page **pages, unsigned long npages)
>  }
>  EXPORT_SYMBOL(put_user_pages);
>  
> +#ifdef CONFIG_MMU
>  static struct page *no_page_table(struct vm_area_struct *vma,
>  		unsigned int flags)
>  {
> @@ -1099,86 +1100,6 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  	return pages_done;
>  }
>  
> -/*
> - * We can leverage the VM_FAULT_RETRY functionality in the page fault
> - * paths better by using either get_user_pages_locked() or
> - * get_user_pages_unlocked().
> - *
> - * get_user_pages_locked() is suitable to replace the form:
> - *
> - *      down_read(&mm->mmap_sem);
> - *      do_something()
> - *      get_user_pages(tsk, mm, ..., pages, NULL);
> - *      up_read(&mm->mmap_sem);
> - *
> - *  to:
> - *
> - *      int locked = 1;
> - *      down_read(&mm->mmap_sem);
> - *      do_something()
> - *      get_user_pages_locked(tsk, mm, ..., pages, &locked);
> - *      if (locked)
> - *          up_read(&mm->mmap_sem);
> - */
> -long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> -			   unsigned int gup_flags, struct page **pages,
> -			   int *locked)
> -{
> -	/*
> -	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
> -	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> -	 * vmas.  As there are no users of this flag in this call we simply
> -	 * disallow this option for now.
> -	 */
> -	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
> -		return -EINVAL;
> -
> -	return __get_user_pages_locked(current, current->mm, start, nr_pages,
> -				       pages, NULL, locked,
> -				       gup_flags | FOLL_TOUCH);
> -}
> -EXPORT_SYMBOL(get_user_pages_locked);
> -
> -/*
> - * get_user_pages_unlocked() is suitable to replace the form:
> - *
> - *      down_read(&mm->mmap_sem);
> - *      get_user_pages(tsk, mm, ..., pages, NULL);
> - *      up_read(&mm->mmap_sem);
> - *
> - *  with:
> - *
> - *      get_user_pages_unlocked(tsk, mm, ..., pages);
> - *
> - * It is functionally equivalent to get_user_pages_fast so
> - * get_user_pages_fast should be used instead if specific gup_flags
> - * (e.g. FOLL_FORCE) are not required.
> - */
> -long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> -			     struct page **pages, unsigned int gup_flags)
> -{
> -	struct mm_struct *mm = current->mm;
> -	int locked = 1;
> -	long ret;
> -
> -	/*
> -	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
> -	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> -	 * vmas.  As there are no users of this flag in this call we simply
> -	 * disallow this option for now.
> -	 */
> -	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
> -		return -EINVAL;
> -
> -	down_read(&mm->mmap_sem);
> -	ret = __get_user_pages_locked(current, mm, start, nr_pages, pages, NULL,
> -				      &locked, gup_flags | FOLL_TOUCH);
> -	if (locked)
> -		up_read(&mm->mmap_sem);
> -	return ret;
> -}
> -EXPORT_SYMBOL(get_user_pages_unlocked);
> -
>  /*
>   * get_user_pages_remote() - pin user pages in memory
>   * @tsk:	the task_struct to use for page fault accounting, or
> @@ -1255,6 +1176,199 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>  }
>  EXPORT_SYMBOL(get_user_pages_remote);
>  
> +/**
> + * populate_vma_page_range() -  populate a range of pages in the vma.
> + * @vma:   target vma
> + * @start: start address
> + * @end:   end address
> + * @nonblocking:
> + *
> + * This takes care of mlocking the pages too if VM_LOCKED is set.
> + *
> + * return 0 on success, negative error code on error.
> + *
> + * vma->vm_mm->mmap_sem must be held.
> + *
> + * If @nonblocking is NULL, it may be held for read or write and will
> + * be unperturbed.
> + *
> + * If @nonblocking is non-NULL, it must held for read only and may be
> + * released.  If it's released, *@nonblocking will be set to 0.
> + */
> +long populate_vma_page_range(struct vm_area_struct *vma,
> +		unsigned long start, unsigned long end, int *nonblocking)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long nr_pages = (end - start) / PAGE_SIZE;
> +	int gup_flags;
> +
> +	VM_BUG_ON(start & ~PAGE_MASK);
> +	VM_BUG_ON(end   & ~PAGE_MASK);
> +	VM_BUG_ON_VMA(start < vma->vm_start, vma);
> +	VM_BUG_ON_VMA(end   > vma->vm_end, vma);
> +	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
> +
> +	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
> +	if (vma->vm_flags & VM_LOCKONFAULT)
> +		gup_flags &= ~FOLL_POPULATE;
> +	/*
> +	 * We want to touch writable mappings with a write fault in order
> +	 * to break COW, except for shared mappings because these don't COW
> +	 * and we would not want to dirty them for nothing.
> +	 */
> +	if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
> +		gup_flags |= FOLL_WRITE;
> +
> +	/*
> +	 * We want mlock to succeed for regions that have any permissions
> +	 * other than PROT_NONE.
> +	 */
> +	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
> +		gup_flags |= FOLL_FORCE;
> +
> +	/*
> +	 * We made sure addr is within a VMA, so the following will
> +	 * not result in a stack expansion that recurses back here.
> +	 */
> +	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
> +				NULL, NULL, nonblocking);
> +}
> +
> +/*
> + * __mm_populate - populate and/or mlock pages within a range of address space.
> + *
> + * This is used to implement mlock() and the MAP_POPULATE / MAP_LOCKED mmap
> + * flags. VMAs must be already marked with the desired vm_flags, and
> + * mmap_sem must not be held.
> + */
> +int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
> +{
> +	struct mm_struct *mm = current->mm;
> +	unsigned long end, nstart, nend;
> +	struct vm_area_struct *vma = NULL;
> +	int locked = 0;
> +	long ret = 0;
> +
> +	end = start + len;
> +
> +	for (nstart = start; nstart < end; nstart = nend) {
> +		/*
> +		 * We want to fault in pages for [nstart; end) address range.
> +		 * Find first corresponding VMA.
> +		 */
> +		if (!locked) {
> +			locked = 1;
> +			down_read(&mm->mmap_sem);
> +			vma = find_vma(mm, nstart);
> +		} else if (nstart >= vma->vm_end)
> +			vma = vma->vm_next;
> +		if (!vma || vma->vm_start >= end)
> +			break;
> +		/*
> +		 * Set [nstart; nend) to intersection of desired address
> +		 * range with the first VMA. Also, skip undesirable VMA types.
> +		 */
> +		nend = min(end, vma->vm_end);
> +		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> +			continue;
> +		if (nstart < vma->vm_start)
> +			nstart = vma->vm_start;
> +		/*
> +		 * Now fault in a range of pages. populate_vma_page_range()
> +		 * double checks the vma flags, so that it won't mlock pages
> +		 * if the vma was already munlocked.
> +		 */
> +		ret = populate_vma_page_range(vma, nstart, nend, &locked);
> +		if (ret < 0) {
> +			if (ignore_errors) {
> +				ret = 0;
> +				continue;	/* continue at next VMA */
> +			}
> +			break;
> +		}
> +		nend = nstart + ret * PAGE_SIZE;
> +		ret = 0;
> +	}
> +	if (locked)
> +		up_read(&mm->mmap_sem);
> +	return ret;	/* 0 or negative error code */
> +}
> +
> +/**
> + * get_dump_page() - pin user page in memory while writing it to core dump
> + * @addr: user address
> + *
> + * Returns struct page pointer of user page pinned for dump,
> + * to be freed afterwards by put_page().
> + *
> + * Returns NULL on any kind of failure - a hole must then be inserted into
> + * the corefile, to preserve alignment with its headers; and also returns
> + * NULL wherever the ZERO_PAGE, or an anonymous pte_none, has been found -
> + * allowing a hole to be left in the corefile to save diskspace.
> + *
> + * Called without mmap_sem, but after all other threads have been killed.
> + */
> +#ifdef CONFIG_ELF_CORE
> +struct page *get_dump_page(unsigned long addr)
> +{
> +	struct vm_area_struct *vma;
> +	struct page *page;
> +
> +	if (__get_user_pages(current, current->mm, addr, 1,
> +			     FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma,
> +			     NULL) < 1)
> +		return NULL;
> +	flush_cache_page(vma, addr, page_to_pfn(page));
> +	return page;
> +}
> +#endif /* CONFIG_ELF_CORE */
> +
> +#else /* CONFIG_MMU */
> +static long __get_user_pages_locked(struct task_struct *tsk,
> +		struct mm_struct *mm, unsigned long start,
> +		unsigned long nr_pages, struct page **pages,
> +		struct vm_area_struct **vmas, int *locked,
> +		unsigned int foll_flags)
> +{
> +	struct vm_area_struct *vma;
> +	unsigned long vm_flags;
> +	int i;
> +
> +	/* calculate required read or write permissions.
> +	 * If FOLL_FORCE is set, we only require the "MAY" flags.
> +	 */
> +	vm_flags  = (foll_flags & FOLL_WRITE) ?
> +			(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
> +	vm_flags &= (foll_flags & FOLL_FORCE) ?
> +			(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		vma = find_vma(mm, start);
> +		if (!vma)
> +			goto finish_or_fault;
> +
> +		/* protect what we can, including chardevs */
> +		if ((vma->vm_flags & (VM_IO | VM_PFNMAP)) ||
> +		    !(vm_flags & vma->vm_flags))
> +			goto finish_or_fault;
> +
> +		if (pages) {
> +			pages[i] = virt_to_page(start);
> +			if (pages[i])
> +				get_page(pages[i]);
> +		}
> +		if (vmas)
> +			vmas[i] = vma;
> +		start = (start + PAGE_SIZE) & PAGE_MASK;
> +	}
> +
> +	return i;
> +
> +finish_or_fault:
> +	return i ? : -EFAULT;
> +}
> +#endif /* !CONFIG_MMU */
> +
>  #if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
>  static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages)
>  {
> @@ -1416,7 +1530,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
>  {
>  	return nr_pages;
>  }
> -#endif
> +#endif /* CONFIG_CMA */
>  
>  /*
>   * __gup_longterm_locked() is a wrapper for __get_user_pages_locked which
> @@ -1502,152 +1616,85 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
>  }
>  EXPORT_SYMBOL(get_user_pages);
>  
> -/**
> - * populate_vma_page_range() -  populate a range of pages in the vma.
> - * @vma:   target vma
> - * @start: start address
> - * @end:   end address
> - * @nonblocking:
> - *
> - * This takes care of mlocking the pages too if VM_LOCKED is set.
> +/*
> + * We can leverage the VM_FAULT_RETRY functionality in the page fault
> + * paths better by using either get_user_pages_locked() or
> + * get_user_pages_unlocked().
>   *
> - * return 0 on success, negative error code on error.
> + * get_user_pages_locked() is suitable to replace the form:
>   *
> - * vma->vm_mm->mmap_sem must be held.
> + *      down_read(&mm->mmap_sem);
> + *      do_something()
> + *      get_user_pages(tsk, mm, ..., pages, NULL);
> + *      up_read(&mm->mmap_sem);
>   *
> - * If @nonblocking is NULL, it may be held for read or write and will
> - * be unperturbed.
> + *  to:
>   *
> - * If @nonblocking is non-NULL, it must held for read only and may be
> - * released.  If it's released, *@nonblocking will be set to 0.
> + *      int locked = 1;
> + *      down_read(&mm->mmap_sem);
> + *      do_something()
> + *      get_user_pages_locked(tsk, mm, ..., pages, &locked);
> + *      if (locked)
> + *          up_read(&mm->mmap_sem);
>   */
> -long populate_vma_page_range(struct vm_area_struct *vma,
> -		unsigned long start, unsigned long end, int *nonblocking)
> +long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> +			   unsigned int gup_flags, struct page **pages,
> +			   int *locked)
>  {
> -	struct mm_struct *mm = vma->vm_mm;
> -	unsigned long nr_pages = (end - start) / PAGE_SIZE;
> -	int gup_flags;
> -
> -	VM_BUG_ON(start & ~PAGE_MASK);
> -	VM_BUG_ON(end   & ~PAGE_MASK);
> -	VM_BUG_ON_VMA(start < vma->vm_start, vma);
> -	VM_BUG_ON_VMA(end   > vma->vm_end, vma);
> -	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
> -
> -	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
> -	if (vma->vm_flags & VM_LOCKONFAULT)
> -		gup_flags &= ~FOLL_POPULATE;
> -	/*
> -	 * We want to touch writable mappings with a write fault in order
> -	 * to break COW, except for shared mappings because these don't COW
> -	 * and we would not want to dirty them for nothing.
> -	 */
> -	if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
> -		gup_flags |= FOLL_WRITE;
> -
>  	/*
> -	 * We want mlock to succeed for regions that have any permissions
> -	 * other than PROT_NONE.
> +	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
> +	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> +	 * vmas.  As there are no users of this flag in this call we simply
> +	 * disallow this option for now.
>  	 */
> -	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
> -		gup_flags |= FOLL_FORCE;
> +	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
> +		return -EINVAL;
>  
> -	/*
> -	 * We made sure addr is within a VMA, so the following will
> -	 * not result in a stack expansion that recurses back here.
> -	 */
> -	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
> -				NULL, NULL, nonblocking);
> +	return __get_user_pages_locked(current, current->mm, start, nr_pages,
> +				       pages, NULL, locked,
> +				       gup_flags | FOLL_TOUCH);
>  }
> +EXPORT_SYMBOL(get_user_pages_locked);
>  
>  /*
> - * __mm_populate - populate and/or mlock pages within a range of address space.
> + * get_user_pages_unlocked() is suitable to replace the form:
>   *
> - * This is used to implement mlock() and the MAP_POPULATE / MAP_LOCKED mmap
> - * flags. VMAs must be already marked with the desired vm_flags, and
> - * mmap_sem must not be held.
> + *      down_read(&mm->mmap_sem);
> + *      get_user_pages(tsk, mm, ..., pages, NULL);
> + *      up_read(&mm->mmap_sem);
> + *
> + *  with:
> + *
> + *      get_user_pages_unlocked(tsk, mm, ..., pages);
> + *
> + * It is functionally equivalent to get_user_pages_fast so
> + * get_user_pages_fast should be used instead if specific gup_flags
> + * (e.g. FOLL_FORCE) are not required.
>   */
> -int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
> +long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> +			     struct page **pages, unsigned int gup_flags)
>  {
>  	struct mm_struct *mm = current->mm;
> -	unsigned long end, nstart, nend;
> -	struct vm_area_struct *vma = NULL;
> -	int locked = 0;
> -	long ret = 0;
> +	int locked = 1;
> +	long ret;
>  
> -	end = start + len;
> +	/*
> +	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
> +	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> +	 * vmas.  As there are no users of this flag in this call we simply
> +	 * disallow this option for now.
> +	 */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
> +		return -EINVAL;
>  
> -	for (nstart = start; nstart < end; nstart = nend) {
> -		/*
> -		 * We want to fault in pages for [nstart; end) address range.
> -		 * Find first corresponding VMA.
> -		 */
> -		if (!locked) {
> -			locked = 1;
> -			down_read(&mm->mmap_sem);
> -			vma = find_vma(mm, nstart);
> -		} else if (nstart >= vma->vm_end)
> -			vma = vma->vm_next;
> -		if (!vma || vma->vm_start >= end)
> -			break;
> -		/*
> -		 * Set [nstart; nend) to intersection of desired address
> -		 * range with the first VMA. Also, skip undesirable VMA types.
> -		 */
> -		nend = min(end, vma->vm_end);
> -		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
> -			continue;
> -		if (nstart < vma->vm_start)
> -			nstart = vma->vm_start;
> -		/*
> -		 * Now fault in a range of pages. populate_vma_page_range()
> -		 * double checks the vma flags, so that it won't mlock pages
> -		 * if the vma was already munlocked.
> -		 */
> -		ret = populate_vma_page_range(vma, nstart, nend, &locked);
> -		if (ret < 0) {
> -			if (ignore_errors) {
> -				ret = 0;
> -				continue;	/* continue at next VMA */
> -			}
> -			break;
> -		}
> -		nend = nstart + ret * PAGE_SIZE;
> -		ret = 0;
> -	}
> +	down_read(&mm->mmap_sem);
> +	ret = __get_user_pages_locked(current, mm, start, nr_pages, pages, NULL,
> +				      &locked, gup_flags | FOLL_TOUCH);
>  	if (locked)
>  		up_read(&mm->mmap_sem);
> -	return ret;	/* 0 or negative error code */
> -}
> -
> -/**
> - * get_dump_page() - pin user page in memory while writing it to core dump
> - * @addr: user address
> - *
> - * Returns struct page pointer of user page pinned for dump,
> - * to be freed afterwards by put_page().
> - *
> - * Returns NULL on any kind of failure - a hole must then be inserted into
> - * the corefile, to preserve alignment with its headers; and also returns
> - * NULL wherever the ZERO_PAGE, or an anonymous pte_none, has been found -
> - * allowing a hole to be left in the corefile to save diskspace.
> - *
> - * Called without mmap_sem, but after all other threads have been killed.
> - */
> -#ifdef CONFIG_ELF_CORE
> -struct page *get_dump_page(unsigned long addr)
> -{
> -	struct vm_area_struct *vma;
> -	struct page *page;
> -
> -	if (__get_user_pages(current, current->mm, addr, 1,
> -			     FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma,
> -			     NULL) < 1)
> -		return NULL;
> -	flush_cache_page(vma, addr, page_to_pfn(page));
> -	return page;
> +	return ret;
>  }
> -#endif /* CONFIG_ELF_CORE */
> +EXPORT_SYMBOL(get_user_pages_unlocked);
>  
>  /*
>   * Fast GUP
> @@ -1682,7 +1729,7 @@ struct page *get_dump_page(unsigned long addr)
>   *
>   * This code is based heavily on the PowerPC implementation by Nick Piggin.
>   */
> -#ifdef CONFIG_HAVE_FAST_GUP
> +#if defined(CONFIG_MMU) && defined(CONFIG_HAVE_FAST_GUP)
>  #ifdef CONFIG_GUP_GET_PTE_LOW_HIGH
>  /*
>   * WARNING: only to be used in the get_user_pages_fast() implementation.
> @@ -2159,6 +2206,12 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>  			return;
>  	} while (pgdp++, addr = next, addr != end);
>  }
> +#else
> +static inline void gup_pgd_range(unsigned long addr, unsigned long end,
> +		unsigned int flags, struct page **pages, int *nr)
> +{
> +}
> +#endif /* CONFIG_HAVE_FAST_GUP */
>  
>  #ifndef gup_fast_permitted
>  /*
> @@ -2167,7 +2220,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   */
>  static bool gup_fast_permitted(unsigned long start, unsigned long end)
>  {
> -	return true;
> +	return IS_ENABLED(CONFIG_HAVE_FAST_GUP) ? true : false;
>  }
>  #endif
>  
> @@ -2176,6 +2229,9 @@ static bool gup_fast_permitted(unsigned long start, unsigned long end)
>   * the regular GUP.
>   * Note a difference with get_user_pages_fast: this always returns the
>   * number of pages pinned, 0 if no pages were pinned.
> + *
> + * If the architecture does not support this function, simply return with no
> + * pages pinned.
>   */
>  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			  struct page **pages)
> @@ -2213,6 +2269,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  
>  	return nr;
>  }
> +EXPORT_SYMBOL_GPL(__get_user_pages_fast);
>  
>  static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>  				   unsigned int gup_flags, struct page **pages)
> @@ -2295,5 +2352,4 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  
>  	return ret;
>  }
> -
> -#endif /* CONFIG_HAVE_GENERIC_GUP */
> +EXPORT_SYMBOL_GPL(get_user_pages_fast);
> diff --git a/mm/nommu.c b/mm/nommu.c
> index d8c02fbe03b5..07165ad2e548 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -111,94 +111,6 @@ unsigned int kobjsize(const void *objp)
>  	return PAGE_SIZE << compound_order(page);
>  }
>  
> -static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> -		      unsigned long start, unsigned long nr_pages,
> -		      unsigned int foll_flags, struct page **pages,
> -		      struct vm_area_struct **vmas, int *nonblocking)
> -{
> -	struct vm_area_struct *vma;
> -	unsigned long vm_flags;
> -	int i;
> -
> -	/* calculate required read or write permissions.
> -	 * If FOLL_FORCE is set, we only require the "MAY" flags.
> -	 */
> -	vm_flags  = (foll_flags & FOLL_WRITE) ?
> -			(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
> -	vm_flags &= (foll_flags & FOLL_FORCE) ?
> -			(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
> -
> -	for (i = 0; i < nr_pages; i++) {
> -		vma = find_vma(mm, start);
> -		if (!vma)
> -			goto finish_or_fault;
> -
> -		/* protect what we can, including chardevs */
> -		if ((vma->vm_flags & (VM_IO | VM_PFNMAP)) ||
> -		    !(vm_flags & vma->vm_flags))
> -			goto finish_or_fault;
> -
> -		if (pages) {
> -			pages[i] = virt_to_page(start);
> -			if (pages[i])
> -				get_page(pages[i]);
> -		}
> -		if (vmas)
> -			vmas[i] = vma;
> -		start = (start + PAGE_SIZE) & PAGE_MASK;
> -	}
> -
> -	return i;
> -
> -finish_or_fault:
> -	return i ? : -EFAULT;
> -}
> -
> -/*
> - * get a list of pages in an address range belonging to the specified process
> - * and indicate the VMA that covers each page
> - * - this is potentially dodgy as we may end incrementing the page count of a
> - *   slab page or a secondary page from a compound page
> - * - don't permit access to VMAs that don't support it, such as I/O mappings
> - */
> -long get_user_pages(unsigned long start, unsigned long nr_pages,
> -		    unsigned int gup_flags, struct page **pages,
> -		    struct vm_area_struct **vmas)
> -{
> -	return __get_user_pages(current, current->mm, start, nr_pages,
> -				gup_flags, pages, vmas, NULL);
> -}
> -EXPORT_SYMBOL(get_user_pages);
> -
> -long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> -			    unsigned int gup_flags, struct page **pages,
> -			    int *locked)
> -{
> -	return get_user_pages(start, nr_pages, gup_flags, pages, NULL);
> -}
> -EXPORT_SYMBOL(get_user_pages_locked);
> -
> -static long __get_user_pages_unlocked(struct task_struct *tsk,
> -			struct mm_struct *mm, unsigned long start,
> -			unsigned long nr_pages, struct page **pages,
> -			unsigned int gup_flags)
> -{
> -	long ret;
> -	down_read(&mm->mmap_sem);
> -	ret = __get_user_pages(tsk, mm, start, nr_pages, gup_flags, pages,
> -				NULL, NULL);
> -	up_read(&mm->mmap_sem);
> -	return ret;
> -}
> -
> -long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> -			     struct page **pages, unsigned int gup_flags)
> -{
> -	return __get_user_pages_unlocked(current, current->mm, start, nr_pages,
> -					 pages, gup_flags);
> -}
> -EXPORT_SYMBOL(get_user_pages_unlocked);
> -
>  /**
>   * follow_pfn - look up PFN at a user virtual address
>   * @vma: memory mapping
> diff --git a/mm/util.c b/mm/util.c
> index 91682a2090ee..d58f5b0eb669 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -300,53 +300,6 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
>  }
>  #endif
>  
> -/*
> - * Like get_user_pages_fast() except its IRQ-safe in that it won't fall
> - * back to the regular GUP.
> - * Note a difference with get_user_pages_fast: this always returns the
> - * number of pages pinned, 0 if no pages were pinned.
> - * If the architecture does not support this function, simply return with no
> - * pages pinned.
> - */
> -int __weak __get_user_pages_fast(unsigned long start,
> -				 int nr_pages, int write, struct page **pages)
> -{
> -	return 0;
> -}
> -EXPORT_SYMBOL_GPL(__get_user_pages_fast);
> -
> -/**
> - * get_user_pages_fast() - pin user pages in memory
> - * @start:	starting user address
> - * @nr_pages:	number of pages from start to pin
> - * @gup_flags:	flags modifying pin behaviour
> - * @pages:	array that receives pointers to the pages pinned.
> - *		Should be at least nr_pages long.
> - *
> - * get_user_pages_fast provides equivalent functionality to get_user_pages,
> - * operating on current and current->mm, with force=0 and vma=NULL. However
> - * unlike get_user_pages, it must be called without mmap_sem held.
> - *
> - * get_user_pages_fast may take mmap_sem and page table locks, so no
> - * assumptions can be made about lack of locking. get_user_pages_fast is to be
> - * implemented in a way that is advantageous (vs get_user_pages()) when the
> - * user memory area is already faulted in and present in ptes. However if the
> - * pages have to be faulted in, it may turn out to be slightly slower so
> - * callers need to carefully consider what to use. On many architectures,
> - * get_user_pages_fast simply falls back to get_user_pages.
> - *
> - * Return: number of pages pinned. This may be fewer than the number
> - * requested. If nr_pages is 0 or negative, returns 0. If no pages
> - * were pinned, returns -errno.
> - */
> -int __weak get_user_pages_fast(unsigned long start,
> -				int nr_pages, unsigned int gup_flags,
> -				struct page **pages)
> -{
> -	return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
> -}
> -EXPORT_SYMBOL_GPL(get_user_pages_fast);
> -
>  unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
>  	unsigned long len, unsigned long prot,
>  	unsigned long flag, unsigned long pgoff)
> 

^ permalink raw reply

* Re: [PATCH 12/16] mm: consolidate the get_user_pages* implementations
From: Christoph Hellwig @ 2019-06-06  6:20 UTC (permalink / raw)
  To: John Hubbard
  Cc: x86, Rich Felker, Yoshinori Sato, linux-sh, James Hogan,
	linuxppc-dev, Khalid Aziz, Nicholas Piggin, David S. Miller,
	linux-mm, Paul Burton, Paul Mackerras, Andrey Konovalov,
	sparclinux, linux-mips, Linus Torvalds, Christoph Hellwig,
	linux-kernel
In-Reply-To: <b0b73eae-6d79-b621-ce4e-997ccfbf4446@nvidia.com>

On Wed, Jun 05, 2019 at 11:01:17PM -0700, John Hubbard wrote:
> I started reviewing this one patch, and it's kind of messy figuring out 
> if the code motion preserves everything because of
> all the consolidation from other places, plus having to move things in
> and out of the ifdef blocks.  So I figured I'd check and see if this is
> going to make it past RFC status soon, and if it's going before or after
> Ira's recent RFC ("RDMA/FS DAX truncate proposal").

I don't like the huge moves either, but I can't really think of any
better way to do it.  Proposals welcome, though.

^ permalink raw reply

* Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook
From: Christoph Hellwig @ 2019-06-06  6:40 UTC (permalink / raw)
  To: Larry Finger
  Cc: Aaro Koskinen, linux-wireless, linux-kernel, Christian Zigotzky,
	linuxppc-dev, Christoph Hellwig
In-Reply-To: <35b3f09b-b371-e2cc-4436-120c67e2f1fb@lwfinger.net>

On Wed, Jun 05, 2019 at 10:06:18PM -0500, Larry Finger wrote:
> First of all, you have my sympathy for the laborious bisection on a 
> PowerBook G4. I have done several myself. Thank you.
>
> I confirm your results.
>
> The ppc code has a maximum DMA size of 31 bits, thus a 32-bit request will 
> fail. Why the 30-bit fallback fails in b43legacy fails while it works in 
> b43 is a mystery.
>
> Although dma_nommu_dma_supported() may be "largely identical" to 
> dma_direct_supported(), they obviously differ. Routine 
> dma_nommu_dma_supported() returns 1 for 32-bit systems, but I do not know 
> what dma_direct_supported() returns.
>
> I am trying to find a patch.

	if (IS_ENABLED(CONFIG_ZONE_DMA))
		min_mask = DMA_BIT_MASK(ARCH_ZONE_DMA_BITS);
	else
		min_mask = DMA_BIT_MASK(32);

	min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);
	return mask >= __phys_to_dma(dev, min_mask);

So the smaller or:

 (1) 32-bit
 (2) ARCH_ZONE_DMA_BITS
 (3) the actual amount of memory in the system

modolo any DMA offsets that come into play.

No offsets should exists on pmac, and ARCH_ZONE_DMA_BITS is 31 on
powerpc.  So unless the system has 1GB or less memory it will probably
return false for b43, because it can't actually guarantee reliable
allocation.  It will work fine on x86 with the smaller ZONE_DMA.

^ permalink raw reply

* Re: [PATCH v12 00/31] Speculative page faults
From: Haiyan Song @ 2019-06-06  6:51 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: jack, sergey.senozhatsky.work, peterz, Will Deacon, mhocko,
	linux-mm, paulus, Punit Agrawal, hpa, Michel Lespinasse,
	Alexei Starovoitov, Andrea Arcangeli, ak, Minchan Kim,
	aneesh.kumar, x86, Matthew Wilcox, Daniel Jordan, Ingo Molnar,
	David Rientjes, paulmck, npiggin, sj38.park, Jerome Glisse, dave,
	kemi.wang, kirill, Thomas Gleixner, zhong jiang, Ganesh Mahendran,
	Yang Shi, Mike Rapoport, linuxppc-dev, linux-kernel,
	Sergey Senozhatsky, vinayak menon, akpm, Tim Chen, haren
In-Reply-To: <20190416134522.17540-1-ldufour@linux.ibm.com>

Hi Laurent,

Regression test for v12 patch serials have been run on Intel 2s skylake platform,
some regressions were found by LKP-tools (linux kernel performance). Only tested the
cases that have been run and found regressions on v11 patch serials.

Get the patch serials from https://github.com/ldu4/linux/tree/spf-v12.
Kernel commit:
  base: a297558ad4479e0c9c5c14f3f69fe43113f72d1c (v5.1-rc4-mmotm-2019-04-09-17-51)
  head: 02c5a1f984a8061d075cfd74986ac8aa01d81064 (spf-v12)

Benchmark: will-it-scale
Download link: https://github.com/antonblanchard/will-it-scale/tree/master
Metrics: will-it-scale.per_thread_ops=threads/nr_cpu
test box: lkp-skl-2sp8(nr_cpu=72,memory=192G)
THP: enable / disable
nr_task: 100%

The following is benchmark results, tested 4 times for every case.

a). Enable THP
                                            base  %stddev   change    head   %stddev
will-it-scale.page_fault3.per_thread_ops    63216  ±3%      -16.9%    52537   ±4%
will-it-scale.page_fault2.per_thread_ops    36862           -9.8%     33256

b). Disable THP
                                            base  %stddev   change    head   %stddev
will-it-scale.page_fault3.per_thread_ops    65111           -18.6%    53023  ±2%
will-it-scale.page_fault2.per_thread_ops    38164           -12.0%    33565

Best regards,
Haiyan Song

On Tue, Apr 16, 2019 at 03:44:51PM +0200, Laurent Dufour wrote:
> This is a port on kernel 5.1 of the work done by Peter Zijlstra to handle
> page fault without holding the mm semaphore [1].
> 
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded
> process since the page fault handler will not wait for other threads memory
> layout change to be done, assuming that this change is done in another part
> of the process's memory space. This type of page fault is named speculative
> page fault. If the speculative page fault fails because a concurrency has
> been detected or because underlying PMD or PTE tables are not yet
> allocating, it is failing its processing and a regular page fault is then
> tried.
> 
> The speculative page fault (SPF) has to look for the VMA matching the fault
> address without holding the mmap_sem, this is done by protecting the MM RB
> tree with RCU and by using a reference counter on each VMA. When fetching a
> VMA under the RCU protection, the VMA's reference counter is incremented to
> ensure that the VMA will not freed in our back during the SPF
> processing. Once that processing is done the VMA's reference counter is
> decremented. To ensure that a VMA is still present when walking the RB tree
> locklessly, the VMA's reference counter is incremented when that VMA is
> linked in the RB tree. When the VMA is unlinked from the RB tree, its
> reference counter will be decremented at the end of the RCU grace period,
> ensuring it will be available during this time. This means that the VMA
> freeing could be delayed and could delay the file closing for file
> mapping. Since the SPF handler is not able to manage file mapping, file is
> closed synchronously and not during the RCU cleaning. This is safe since
> the page fault handler is aborting if a file pointer is associated to the
> VMA.
> 
> Using RCU fixes the overhead seen by Haiyan Song using the will-it-scale
> benchmark [2].
> 
> The VMA's attributes checked during the speculative page fault processing
> have to be protected against parallel changes. This is done by using a per
> VMA sequence lock. This sequence lock allows the speculative page fault
> handler to fast check for parallel changes in progress and to abort the
> speculative page fault in that case.
> 
> Once the VMA has been found, the speculative page fault handler would check
> for the VMA's attributes to verify that the page fault has to be handled
> correctly or not. Thus, the VMA is protected through a sequence lock which
> allows fast detection of concurrent VMA changes. If such a change is
> detected, the speculative page fault is aborted and a *classic* page fault
> is tried.  VMA sequence lockings are added when VMA attributes which are
> checked during the page fault are modified.
> 
> When the PTE is fetched, the VMA is checked to see if it has been changed,
> so once the page table is locked, the VMA is valid, so any other changes
> leading to touching this PTE will need to lock the page table, so no
> parallel change is possible at this time.
> 
> The locking of the PTE is done with interrupts disabled, this allows
> checking for the PMD to ensure that there is not an ongoing collapsing
> operation. Since khugepaged is firstly set the PMD to pmd_none and then is
> waiting for the other CPU to have caught the IPI interrupt, if the pmd is
> valid at the time the PTE is locked, we have the guarantee that the
> collapsing operation will have to wait on the PTE lock to move
> forward. This allows the SPF handler to map the PTE safely. If the PMD
> value is different from the one recorded at the beginning of the SPF
> operation, the classic page fault handler will be called to handle the
> operation while holding the mmap_sem. As the PTE lock is done with the
> interrupts disabled, the lock is done using spin_trylock() to avoid dead
> lock when handling a page fault while a TLB invalidate is requested by
> another CPU holding the PTE.
> 
> In pseudo code, this could be seen as:
>     speculative_page_fault()
>     {
> 	    vma = find_vma_rcu()
> 	    check vma sequence count
> 	    check vma's support
> 	    disable interrupt
> 		  check pgd,p4d,...,pte
> 		  save pmd and pte in vmf
> 		  save vma sequence counter in vmf
> 	    enable interrupt
> 	    check vma sequence count
> 	    handle_pte_fault(vma)
> 		    ..
> 		    page = alloc_page()
> 		    pte_map_lock()
> 			    disable interrupt
> 				    abort if sequence counter has changed
> 				    abort if pmd or pte has changed
> 				    pte map and lock
> 			    enable interrupt
> 		    if abort
> 		       free page
> 		       abort
> 		    ...
> 	    put_vma(vma)
>     }
>     
>     arch_fault_handler()
>     {
> 	    if (speculative_page_fault(&vma))
> 	       goto done
>     again:
> 	    lock(mmap_sem)
> 	    vma = find_vma();
> 	    handle_pte_fault(vma);
> 	    if retry
> 	       unlock(mmap_sem)
> 	       goto again;
>     done:
> 	    handle fault error
>     }
> 
> Support for THP is not done because when checking for the PMD, we can be
> confused by an in progress collapsing operation done by khugepaged. The
> issue is that pmd_none() could be true either if the PMD is not already
> populated or if the underlying PTE are in the way to be collapsed. So we
> cannot safely allocate a PMD if pmd_none() is true.
> 
> This series add a new software performance event named 'speculative-faults'
> or 'spf'. It counts the number of successful page fault event handled
> speculatively. When recording 'faults,spf' events, the faults one is
> counting the total number of page fault events while 'spf' is only counting
> the part of the faults processed speculatively.
> 
> There are some trace events introduced by this series. They allow
> identifying why the page faults were not processed speculatively. This
> doesn't take in account the faults generated by a monothreaded process
> which directly processed while holding the mmap_sem. This trace events are
> grouped in a system named 'pagefault', they are:
> 
>  - pagefault:spf_vma_changed : if the VMA has been changed in our back
>  - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
>  - pagefault:spf_vma_notsup : the VMA's type is not supported
>  - pagefault:spf_vma_access : the VMA's access right are not respected
>  - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our
>  back.
> 
> To record all the related events, the easier is to run perf with the
> following arguments :
> $ perf stat -e 'faults,spf,pagefault:*' <command>
> 
> There is also a dedicated vmstat counter showing the number of successful
> page fault handled speculatively. I can be seen this way:
> $ grep speculative_pgfault /proc/vmstat
> 
> It is possible to deactivate the speculative page fault handler by echoing
> 0 in /proc/sys/vm/speculative_page_fault.
> 
> This series builds on top of v5.1-rc4-mmotm-2019-04-09-17-51 and is
> functional on x86, PowerPC. I cross built it on arm64 but I was not able to
> test it.
> 
> This series is also available on github [4].
> 
> ---------------------
> Real Workload results
> 
> Test using a "popular in memory multithreaded database product" on 128cores
> SMT8 Power system are in progress and I will come back with performance
> mesurement as soon as possible. With the previous series we seen up to 30%
> improvements in the number of transaction processed per second, and we hope
> this will be the case with this series too.
> 
> ------------------
> Benchmarks results
> 
> Base kernel is v5.1-rc4-mmotm-2019-04-09-17-51
> SPF is BASE + this series
> 
> Kernbench:
> ----------
> Here are the results on a 48 CPUs X86 system using kernbench on a 5.0
> kernel (kernel is build 5 times):
> 
> Average	Half load -j 24
> 		 Run	(std deviation)
> 		 BASE			SPF
> Elapsed	Time	 56.52   (1.39185)      56.256  (1.15106)       0.47% 
> User	Time	 980.018 (2.94734)      984.958 (1.98518)       -0.50%
> System	Time	 130.744 (1.19148)      133.616 (0.873573)      -2.20%
> Percent	CPU	 1965.6  (49.682)       1988.4  (40.035)        -1.16%
> Context	Switches 29926.6 (272.789)      30472.4 (109.569)       -1.82%
> Sleeps		 124793  (415.87)       125003  (591.008)       -0.17%
> 						
> Average	Optimal	load -j	48
> 		 Run	(std deviation)
> 		 BASE			SPF
> Elapsed	Time	 46.354  (0.917949)     45.968 (1.42786)        0.83% 
> User	Time	 1193.42 (224.96)       1196.78 (223.28)        -0.28%
> System	Time	 143.306 (13.2726)      146.177 (13.2659)       -2.00%
> Percent	CPU	 2668.6  (743.157)      2699.9 (753.767)        -1.17%
> Context	Switches 62268.3 (34097.1)      62721.7 (33999.1)       -0.73%
> Sleeps		 132556  (8222.99)      132607 (8077.6)         -0.04%
> 
> During a run on the SPF, perf events were captured:
>  Performance counter stats for '../kernbench -M':
>        525,873,132      faults
>                242      spf
>                  0      pagefault:spf_vma_changed
>                  0      pagefault:spf_vma_noanon
>                441      pagefault:spf_vma_notsup
>                  0      pagefault:spf_vma_access
>                  0      pagefault:spf_pmd_changed
> 
> Very few speculative page faults were recorded as most of the processes
> involved are monothreaded (sounds that on this architecture some threads
> were created during the kernel build processing).
> 
> Here are the kerbench results on a 1024 CPUs Power8 VM:
> 
> 5.1.0-rc4-mm1+				5.1.0-rc4-mm1-spf-rcu+
> Average Half load -j 512 Run (std deviation):
> Elapsed Time 	 52.52   (0.906697)	52.778  (0.510069)	-0.49%
> User Time 	 3855.43 (76.378)	3890.44 (73.0466)	-0.91%
> System Time 	 1977.24 (182.316)	1974.56 (166.097)	0.14% 
> Percent CPU 	 11111.6 (540.461)	11115.2 (458.907)	-0.03%
> Context Switches 83245.6 (3061.44)	83651.8 (1202.31)	-0.49%
> Sleeps 		 613459  (23091.8)	628378  (27485.2) 	-2.43%
> 
> Average Optimal load -j 1024 Run (std deviation):
> Elapsed Time 	 52.964  (0.572346)	53.132 (0.825694)	-0.32%
> User Time 	 4058.22 (222.034)	4070.2 (201.646) 	-0.30%
> System Time 	 2672.81 (759.207)	2712.13 (797.292)	-1.47%
> Percent CPU 	 12756.7 (1786.35)	12806.5 (1858.89)	-0.39% 
> Context Switches 88818.5 (6772)		87890.6 (5567.72)	1.04% 
> Sleeps 		 618658  (20842.2)	636297 (25044) 		-2.85%
> 
> During a run on the SPF, perf events were captured:
>  Performance counter stats for '../kernbench -M':
>        149 375 832      faults
>                  1      spf
>                  0      pagefault:spf_vma_changed
>                  0      pagefault:spf_vma_noanon
>                561      pagefault:spf_vma_notsup
>                  0      pagefault:spf_vma_access
>                  0      pagefault:spf_pmd_changed
> 
> Most of the processes involved are monothreaded so SPF is not activated but
> there is no impact on the performance.
> 
> Ebizzy:
> -------
> The test is counting the number of records per second it can manage, the
> higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get
> consistent result I repeated the test 100 times and measure the average
> result. The number is the record processes per second, the higher is the best.
> 
>   		BASE		SPF		delta	
> 24 CPUs x86	5492.69		9383.07		70.83%
> 1024 CPUS P8 VM 8476.74		17144.38	102%
> 
> Here are the performance counter read during a run on a 48 CPUs x86 node:
>  Performance counter stats for './ebizzy -mTt 48':
>         11,846,569      faults
>         10,886,706      spf
>            957,702      pagefault:spf_vma_changed
>                  0      pagefault:spf_vma_noanon
>                815      pagefault:spf_vma_notsup
>                  0      pagefault:spf_vma_access
>                  0      pagefault:spf_pmd_changed
> 
> And the ones captured during a run on a 1024 CPUs Power VM:
>  Performance counter stats for './ebizzy -mTt 1024':
>          1 359 789      faults
>          1 284 910      spf
>             72 085      pagefault:spf_vma_changed
>                  0      pagefault:spf_vma_noanon
>              2 669      pagefault:spf_vma_notsup
>                  0      pagefault:spf_vma_access
>                  0      pagefault:spf_pmd_changed
> 		 
> In ebizzy's case most of the page fault were handled in a speculative way,
> leading the ebizzy performance boost.
> 
> ------------------
> Changes since v11 [3]
> - Check vm_ops.fault instead of vm_ops since now all the VMA as a vm_ops.
>  - Abort speculative page fault when doing swap readhead because VMA's
>    boundaries are not protected at this time. Doing this the first swap in
>    is doing a readhead, the next fault should be handled in a speculative
>    way as the page is present in the swap read page.
>  - Handle a race between copy_pte_range() and the wp_page_copy called by
>    the speculative page fault handler.
>  - Ported to Kernel v5.0
>  - Moved VM_FAULT_PTNOTSAME define in mm_types.h
>  - Use RCU to protect the MM RB tree instead of a rwlock.
>  - Add a toggle interface: /proc/sys/vm/speculative_page_fault
> 
> [1] https://lore.kernel.org/linux-mm/20141020215633.717315139@infradead.org/
> [2] https://lore.kernel.org/linux-mm/9FE19350E8A7EE45B64D8D63D368C8966B847F54@SHSMSX101.ccr.corp.intel.com/
> [3] https://lore.kernel.org/linux-mm/1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com/
> [4] https://github.com/ldu4/linux/tree/spf-v12
> 
> Laurent Dufour (25):
>   mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>   x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>   powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>   mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>   mm: make pte_unmap_same compatible with SPF
>   mm: introduce INIT_VMA()
>   mm: protect VMA modifications using VMA sequence count
>   mm: protect mremap() against SPF hanlder
>   mm: protect SPF handler against anon_vma changes
>   mm: cache some VMA fields in the vm_fault structure
>   mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>   mm: introduce __lru_cache_add_active_or_unevictable
>   mm: introduce __vm_normal_page()
>   mm: introduce __page_add_new_anon_rmap()
>   mm: protect against PTE changes done by dup_mmap()
>   mm: protect the RB tree with a sequence lock
>   mm: introduce vma reference counter
>   mm: Introduce find_vma_rcu()
>   mm: don't do swap readahead during speculative page fault
>   mm: adding speculative page fault failure trace events
>   perf: add a speculative page fault sw event
>   perf tools: add support for the SPF perf event
>   mm: add speculative page fault vmstats
>   powerpc/mm: add speculative page fault
>   mm: Add a speculative page fault switch in sysctl
> 
> Mahendran Ganesh (2):
>   arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>   arm64/mm: add speculative page fault
> 
> Peter Zijlstra (4):
>   mm: prepare for FAULT_FLAG_SPECULATIVE
>   mm: VMA sequence count
>   mm: provide speculative fault infrastructure
>   x86/mm: add speculative pagefault handling
> 
>  arch/arm64/Kconfig                    |   1 +
>  arch/arm64/mm/fault.c                 |  12 +
>  arch/powerpc/Kconfig                  |   1 +
>  arch/powerpc/mm/fault.c               |  16 +
>  arch/x86/Kconfig                      |   1 +
>  arch/x86/mm/fault.c                   |  14 +
>  fs/exec.c                             |   1 +
>  fs/proc/task_mmu.c                    |   5 +-
>  fs/userfaultfd.c                      |  17 +-
>  include/linux/hugetlb_inline.h        |   2 +-
>  include/linux/migrate.h               |   4 +-
>  include/linux/mm.h                    | 138 +++++-
>  include/linux/mm_types.h              |  16 +-
>  include/linux/pagemap.h               |   4 +-
>  include/linux/rmap.h                  |  12 +-
>  include/linux/swap.h                  |  10 +-
>  include/linux/vm_event_item.h         |   3 +
>  include/trace/events/pagefault.h      |  80 ++++
>  include/uapi/linux/perf_event.h       |   1 +
>  kernel/fork.c                         |  35 +-
>  kernel/sysctl.c                       |   9 +
>  mm/Kconfig                            |  22 +
>  mm/huge_memory.c                      |   6 +-
>  mm/hugetlb.c                          |   2 +
>  mm/init-mm.c                          |   3 +
>  mm/internal.h                         |  45 ++
>  mm/khugepaged.c                       |   5 +
>  mm/madvise.c                          |   6 +-
>  mm/memory.c                           | 631 ++++++++++++++++++++++----
>  mm/mempolicy.c                        |  51 ++-
>  mm/migrate.c                          |   6 +-
>  mm/mlock.c                            |  13 +-
>  mm/mmap.c                             | 249 ++++++++--
>  mm/mprotect.c                         |   4 +-
>  mm/mremap.c                           |  13 +
>  mm/nommu.c                            |   1 +
>  mm/rmap.c                             |   5 +-
>  mm/swap.c                             |   6 +-
>  mm/swap_state.c                       |  10 +-
>  mm/vmstat.c                           |   5 +-
>  tools/include/uapi/linux/perf_event.h |   1 +
>  tools/perf/util/evsel.c               |   1 +
>  tools/perf/util/parse-events.c        |   4 +
>  tools/perf/util/parse-events.l        |   1 +
>  tools/perf/util/python.c              |   1 +
>  45 files changed, 1277 insertions(+), 196 deletions(-)
>  create mode 100644 include/trace/events/pagefault.h
> 
> -- 
> 2.21.0
> 

^ permalink raw reply

* Re: [PATCH kernel v3 0/3] powerpc/ioda2: Yet another attempt to allow DMA masks between 32 and 59
From: Alistair Popple @ 2019-06-06  7:17 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, Sam Bobroff, Oliver O'Halloran,
	Shawn Anastasio, David Gibson
In-Reply-To: <deb34b5f-9472-2156-e58d-8dbcb0a38979@anastas.io>

I have been hitting EEH address errors testing this with some network
cards which map/unmap DMA addresses more frequently. For example:

PHB4 PHB#5 Diag-data (Version: 1)                                                                                                                                      
brdgCtl:    00000002                                                                                                                                                   
RootSts:    00060020 00402000 a0220008 00100107 00000800                                                                                                               
PhbSts:     0000001c00000000 0000001c00000000                                                                                                                          
Lem:        0000000100000080 0000000000000000 0000000000000080                                                                                                         
PhbErr:     0000028000000000 0000020000000000 2148000098000240 a008400000000000                                                                                        
RxeTceErr:  2000000000000000 2000000000000000 c000000000000000 0000000000000000                                                                                        
PblErr:     0000000000020000 0000000000020000 0000000000000000 0000000000000000                                                                                        
RegbErr:    0000004000000000 0000004000000000 61000c4800000000 0000000000000000                                                                                        
PE[000] A/B: 8300b03800000000 8000000000000000                                                                                                                         

Interestingly the PE[000] A/B data is the same across different cards
and drivers.

- Alistair

On Wednesday, 5 June 2019 11:11:06 PM AEST Shawn Anastasio wrote:
> On 5/30/19 2:03 AM, Alexey Kardashevskiy wrote:
> > This is an attempt to allow DMA masks between 32..59 which are not large
> > enough to use either a PHB3 bypass mode or a sketchy bypass. Depending
> > on the max order, up to 40 is usually available.
> > 
> > 
> > This is based on v5.2-rc2.
> > 
> > Please comment. Thanks.
> 
> I have tested this patch set with an AMD GPU that's limited to <64bit
> DMA (I believe it's 40 or 42 bit). It successfully allows the card to
> operate without falling back to 32-bit DMA mode as it does without
> the patches.
> 
> Relevant kernel log message:
> ```
> [    0.311211] pci 0033:01     : [PE# 00] Enabling 64-bit DMA bypass
> ```
> 
> Tested-by: Shawn Anastasio <shawn@anastas.io>



^ permalink raw reply

* [PATCH] Powerpc/Watchpoint: Restore nvgprs while returning from exception
From: Ravi Bangoria @ 2019-06-06  7:29 UTC (permalink / raw)
  To: mpe
  Cc: Ravi Bangoria, mikey, linux-kernel, npiggin, paulus, mahesh,
	linuxppc-dev

Powerpc hw triggers watchpoint before executing the instruction.
To make trigger-after-execute behavior, kernel emulates the
instruction. If the instruction is 'load something into non-
volatile register', exception handler should restore emulated
register state while returning back, otherwise there will be
register state corruption. Ex, Adding a watchpoint on a list
can corrput the list:

  # cat /proc/kallsyms | grep kthread_create_list
  c00000000121c8b8 d kthread_create_list

Add watchpoint on kthread_create_list->next:

  # perf record -e mem:0xc00000000121c8c0

Run some workload such that new kthread gets invoked. Ex, I
just logged out from console:

  list_add corruption. next->prev should be prev (c000000001214e00), \
	but was c00000000121c8b8. (next=c00000000121c8b8).
  WARNING: CPU: 59 PID: 309 at lib/list_debug.c:25 __list_add_valid+0xb4/0xc0
  CPU: 59 PID: 309 Comm: kworker/59:0 Kdump: loaded Not tainted 5.1.0-rc7+ #69
  ...
  NIP __list_add_valid+0xb4/0xc0
  LR __list_add_valid+0xb0/0xc0
  Call Trace:
  __list_add_valid+0xb0/0xc0 (unreliable)
  __kthread_create_on_node+0xe0/0x260
  kthread_create_on_node+0x34/0x50
  create_worker+0xe8/0x260
  worker_thread+0x444/0x560
  kthread+0x160/0x1a0
  ret_from_kernel_thread+0x5c/0x70

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
---
 arch/powerpc/kernel/exceptions-64s.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 9481a11..96de0d1 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1753,7 +1753,7 @@ handle_dabr_fault:
 	ld      r5,_DSISR(r1)
 	addi    r3,r1,STACK_FRAME_OVERHEAD
 	bl      do_break
-12:	b       ret_from_except_lite
+12:	b       ret_from_except
 
 
 #ifdef CONFIG_PPC_BOOK3S_64
-- 
1.8.3.1


^ permalink raw reply related

* Re: [PATCH 12/16] mm: consolidate the get_user_pages* implementations
From: John Hubbard @ 2019-06-06  7:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: x86, Rich Felker, Yoshinori Sato, linux-sh, James Hogan,
	linuxppc-dev, Khalid Aziz, Nicholas Piggin, linux-mips, linux-mm,
	Paul Burton, Paul Mackerras, Andrey Konovalov, sparclinux,
	Linus Torvalds, David S. Miller, linux-kernel
In-Reply-To: <20190606062018.GA26745@lst.de>

On 6/5/19 11:20 PM, Christoph Hellwig wrote:
> On Wed, Jun 05, 2019 at 11:01:17PM -0700, John Hubbard wrote:
>> I started reviewing this one patch, and it's kind of messy figuring out
>> if the code motion preserves everything because of
>> all the consolidation from other places, plus having to move things in
>> and out of the ifdef blocks.  So I figured I'd check and see if this is
>> going to make it past RFC status soon, and if it's going before or after
>> Ira's recent RFC ("RDMA/FS DAX truncate proposal").
> 
> I don't like the huge moves either, but I can't really think of any
> better way to do it.  Proposals welcome, though.
> 

One way would be to do it in two patches:

1) Move the code into gup.c, maybe at the bottom. Surround each function
or group of functions by whatever ifdefs they need.

2) Move code out of the bottom of gup.c, into the final location.

...but I'm not certain that will be that much better. In the spirit of
not creating gratuitous work for others, I could try it out and send
out something if it looks like it's noticeably easier to verify/review.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply

* Re: [PATCH] Powerpc/Watchpoint: Restore nvgprs while returning from exception
From: Naveen N. Rao @ 2019-06-06  8:21 UTC (permalink / raw)
  To: mpe, Ravi Bangoria
  Cc: mikey, mahesh, linux-kernel, npiggin, paulus, linuxppc-dev
In-Reply-To: <20190606072951.32116-1-ravi.bangoria@linux.ibm.com>

Ravi Bangoria wrote:
> Powerpc hw triggers watchpoint before executing the instruction.
> To make trigger-after-execute behavior, kernel emulates the
> instruction. If the instruction is 'load something into non-
> volatile register', exception handler should restore emulated
> register state while returning back, otherwise there will be
> register state corruption. Ex, Adding a watchpoint on a list
> can corrput the list:
> 
>   # cat /proc/kallsyms | grep kthread_create_list
>   c00000000121c8b8 d kthread_create_list
> 
> Add watchpoint on kthread_create_list->next:
> 
>   # perf record -e mem:0xc00000000121c8c0
> 
> Run some workload such that new kthread gets invoked. Ex, I
> just logged out from console:
> 
>   list_add corruption. next->prev should be prev (c000000001214e00), \
> 	but was c00000000121c8b8. (next=c00000000121c8b8).
>   WARNING: CPU: 59 PID: 309 at lib/list_debug.c:25 __list_add_valid+0xb4/0xc0
>   CPU: 59 PID: 309 Comm: kworker/59:0 Kdump: loaded Not tainted 5.1.0-rc7+ #69
>   ...
>   NIP __list_add_valid+0xb4/0xc0
>   LR __list_add_valid+0xb0/0xc0
>   Call Trace:
>   __list_add_valid+0xb0/0xc0 (unreliable)
>   __kthread_create_on_node+0xe0/0x260
>   kthread_create_on_node+0x34/0x50
>   create_worker+0xe8/0x260
>   worker_thread+0x444/0x560
>   kthread+0x160/0x1a0
>   ret_from_kernel_thread+0x5c/0x70
> 
> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
> ---
>  arch/powerpc/kernel/exceptions-64s.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Awesome catch - this one has had a glorious run...
Fixes: 5aae8a5370802 ("powerpc, hw_breakpoints: Implement hw_breakpoints for 64-bit server processors")

Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>


- Naveen



^ permalink raw reply

* Re: [PATCH] Powerpc/Watchpoint: Restore nvgprs while returning from exception
From: Ravi Bangoria @ 2019-06-06  8:30 UTC (permalink / raw)
  To: mpe
  Cc: Ravi Bangoria, mikey, linux-kernel, npiggin, paulus, mahesh,
	linuxppc-dev
In-Reply-To: <20190606072951.32116-1-ravi.bangoria@linux.ibm.com>



On 6/6/19 12:59 PM, Ravi Bangoria wrote:
> Powerpc hw triggers watchpoint before executing the instruction.
> To make trigger-after-execute behavior, kernel emulates the
> instruction. If the instruction is 'load something into non-
> volatile register', exception handler should restore emulated
> register state while returning back, otherwise there will be
> register state corruption. Ex, Adding a watchpoint on a list
> can corrput the list:
> 
>   # cat /proc/kallsyms | grep kthread_create_list
>   c00000000121c8b8 d kthread_create_list
> 
> Add watchpoint on kthread_create_list->next:

s/kthread_create_list->next/kthread_create_list->prev/


^ permalink raw reply

* Re: [PATCH] ocxl: do not use C++ style comments in uapi header
From: Masahiro Yamada @ 2019-06-06  9:00 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman
  Cc: Andrew Donnellan, Arnd Bergmann, Greg KH, Randy Dunlap,
	Linux Kernel Mailing List, Joe Perches, Frederic Barrat,
	Thomas Gleixner
In-Reply-To: <ab22b27e-dd07-1c83-af60-19403c98c6a2@linux.ibm.com>

Hi Michael,

On Wed, Jun 5, 2019 at 3:18 PM Andrew Donnellan <ajd@linux.ibm.com> wrote:
>
> On 4/6/19 10:12 pm, Masahiro Yamada wrote:
> > On Tue, Jun 4, 2019 at 8:54 PM Frederic Barrat <fbarrat@linux.ibm.com> wrote:
> >>
> >>
> >>
> >> Le 04/06/2019 à 13:16, Masahiro Yamada a écrit :
> >>> Linux kernel tolerates C++ style comments these days. Actually, the
> >>> SPDX License tags for .c files start with //.
> >>>
> >>> On the other hand, uapi headers are written in more strict C, where
> >>> the C++ comment style is forbidden.
> >>>
> >>> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
> >>> ---
> >>
> >> Thanks!
> >> Acked-by: Frederic Barrat <fbarrat@linux.ibm.com>
> >>
> >
> > Please hold on this patch until
> > we get consensus about the C++ comment style.
> >
> > Discussion just started here:
> > https://lore.kernel.org/patchwork/patch/1083801/
>
> If you choose to proceed with this patch:
>
> Acked-by: Andrew Donnellan <ajd@linux.ibm.com>

After some discussion,
the other one was applied to the media subsystem.

Please pick up this one with Frederic and Andrew's Ack.

Thanks.



-- 
Best Regards
Masahiro Yamada

^ permalink raw reply

* Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook
From: Aaro Koskinen @ 2019-06-06  9:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-wireless, linux-kernel, Christian Zigotzky, linuxppc-dev,
	Christoph Hellwig, Larry Finger
In-Reply-To: <dfe6451c93574b61d4bdde4a05c5f8ccf86b31a0.camel@kernel.crashing.org>

Hi,

On Thu, Jun 06, 2019 at 10:54:51AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2019-06-06 at 01:50 +0300, Aaro Koskinen wrote:
> > Hi,
> > 
> > When upgrading from v5.0 -> v5.1 on G4 PowerBook, I noticed WLAN does
> > not work anymore:
> > 
> > [   42.004303] b43legacy-phy0: Loading firmware version 0x127, patch level 14 (2005-04-18 02:36:27)
> > [   42.184837] b43legacy-phy0 debug: Chip initialized
> > [   42.184873] b43legacy-phy0 ERROR: The machine/kernel does not support the required 30-bit DMA mask
> > 
> > The same happens with the current mainline.
> 
> How much RAM do you have ?

The system has 1129 MB RAM. Booting with mem=1G makes it work.

A.

^ permalink raw reply

* Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook
From: Benjamin Herrenschmidt @ 2019-06-06 10:56 UTC (permalink / raw)
  To: Aaro Koskinen
  Cc: linux-wireless, linux-kernel, Christian Zigotzky, linuxppc-dev,
	Christoph Hellwig, Larry Finger
In-Reply-To: <20190606093149.GA11598@darkstar.musicnaut.iki.fi>

On Thu, 2019-06-06 at 12:31 +0300, Aaro Koskinen wrote:
> Hi,
> 
> On Thu, Jun 06, 2019 at 10:54:51AM +1000, Benjamin Herrenschmidt
> wrote:
> > On Thu, 2019-06-06 at 01:50 +0300, Aaro Koskinen wrote:
> > > Hi,
> > > 
> > > When upgrading from v5.0 -> v5.1 on G4 PowerBook, I noticed WLAN
> > > does
> > > not work anymore:
> > > 
> > > [   42.004303] b43legacy-phy0: Loading firmware version 0x127,
> > > patch level 14 (2005-04-18 02:36:27)
> > > [   42.184837] b43legacy-phy0 debug: Chip initialized
> > > [   42.184873] b43legacy-phy0 ERROR: The machine/kernel does not
> > > support the required 30-bit DMA mask
> > > 
> > > The same happens with the current mainline.
> > 
> > How much RAM do you have ?
> 
> The system has 1129 MB RAM. Booting with mem=1G makes it work.

Wow... that's an odd amount. One thing we could possibly do is add code
to limit the amount of RAM when we detect that device....

Cheers,
Ben.



^ permalink raw reply

* Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook
From: Benjamin Herrenschmidt @ 2019-06-06 10:57 UTC (permalink / raw)
  To: Aaro Koskinen
  Cc: linux-wireless, linux-kernel, Christian Zigotzky, linuxppc-dev,
	Christoph Hellwig, Larry Finger
In-Reply-To: <d87ac9a7faac0d5522cb496d74afc586410fed9c.camel@kernel.crashing.org>

On Thu, 2019-06-06 at 20:56 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2019-06-06 at 12:31 +0300, Aaro Koskinen wrote:
> > Hi,
> > 
> > On Thu, Jun 06, 2019 at 10:54:51AM +1000, Benjamin Herrenschmidt
> > wrote:
> > > On Thu, 2019-06-06 at 01:50 +0300, Aaro Koskinen wrote:
> > > > Hi,
> > > > 
> > > > When upgrading from v5.0 -> v5.1 on G4 PowerBook, I noticed WLAN
> > > > does
> > > > not work anymore:
> > > > 
> > > > [   42.004303] b43legacy-phy0: Loading firmware version 0x127,
> > > > patch level 14 (2005-04-18 02:36:27)
> > > > [   42.184837] b43legacy-phy0 debug: Chip initialized
> > > > [   42.184873] b43legacy-phy0 ERROR: The machine/kernel does not
> > > > support the required 30-bit DMA mask
> > > > 
> > > > The same happens with the current mainline.
> > > 
> > > How much RAM do you have ?
> > 
> > The system has 1129 MB RAM. Booting with mem=1G makes it work.
> 
> Wow... that's an odd amount. One thing we could possibly do is add code
> to limit the amount of RAM when we detect that device....

Sent too quickly... I mean that *or* force swiotlb at 30-bits on those systems based
on detecting the presence of that device in the device-tree.

Cheers,
Ben.



^ permalink raw reply

* [PATCH v1 1/5] crypto: talitos - fix ECB and CBC algs ivsize
From: Christophe Leroy @ 2019-06-06 11:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, horia.geanta
  Cc: linuxppc-dev, linux-crypto, linux-kernel
In-Reply-To: <cover.1559819372.git.christophe.leroy@c-s.fr>

commit d84cc9c9524e ("crypto: talitos - fix ECB algs ivsize")
wrongly modified CBC algs ivsize instead of ECB aggs ivsize.

This restore the CBC algs original ivsize of removes ECB's ones.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Fixes: d84cc9c9524e ("crypto: talitos - fix ECB algs ivsize")
---
 drivers/crypto/talitos.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 122ec6c85446..3b3e99f1cddb 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -2753,7 +2753,6 @@ static struct talitos_alg_template driver_algs[] = {
 			.cra_ablkcipher = {
 				.min_keysize = AES_MIN_KEY_SIZE,
 				.max_keysize = AES_MAX_KEY_SIZE,
-				.ivsize = AES_BLOCK_SIZE,
 				.setkey = ablkcipher_aes_setkey,
 			}
 		},
@@ -2770,6 +2769,7 @@ static struct talitos_alg_template driver_algs[] = {
 			.cra_ablkcipher = {
 				.min_keysize = AES_MIN_KEY_SIZE,
 				.max_keysize = AES_MAX_KEY_SIZE,
+				.ivsize = AES_BLOCK_SIZE,
 				.setkey = ablkcipher_aes_setkey,
 			}
 		},
@@ -2805,7 +2805,6 @@ static struct talitos_alg_template driver_algs[] = {
 			.cra_ablkcipher = {
 				.min_keysize = DES_KEY_SIZE,
 				.max_keysize = DES_KEY_SIZE,
-				.ivsize = DES_BLOCK_SIZE,
 				.setkey = ablkcipher_des_setkey,
 			}
 		},
@@ -2822,6 +2821,7 @@ static struct talitos_alg_template driver_algs[] = {
 			.cra_ablkcipher = {
 				.min_keysize = DES_KEY_SIZE,
 				.max_keysize = DES_KEY_SIZE,
+				.ivsize = DES_BLOCK_SIZE,
 				.setkey = ablkcipher_des_setkey,
 			}
 		},
@@ -2839,7 +2839,6 @@ static struct talitos_alg_template driver_algs[] = {
 			.cra_ablkcipher = {
 				.min_keysize = DES3_EDE_KEY_SIZE,
 				.max_keysize = DES3_EDE_KEY_SIZE,
-				.ivsize = DES3_EDE_BLOCK_SIZE,
 				.setkey = ablkcipher_des3_setkey,
 			}
 		},
@@ -2857,6 +2856,7 @@ static struct talitos_alg_template driver_algs[] = {
 			.cra_ablkcipher = {
 				.min_keysize = DES3_EDE_KEY_SIZE,
 				.max_keysize = DES3_EDE_KEY_SIZE,
+				.ivsize = DES3_EDE_BLOCK_SIZE,
 				.setkey = ablkcipher_des3_setkey,
 			}
 		},
-- 
2.13.3


^ permalink raw reply related

* [PATCH v1 0/5] Additional fixes on Talitos driver
From: Christophe Leroy @ 2019-06-06 11:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, horia.geanta
  Cc: linuxppc-dev, linux-crypto, linux-kernel

This series is the last set of fixes for the Talitos driver.

We now get a fully clean boot on both SEC1 (SEC1.2 on mpc885) and
SEC2 (SEC2.2 on mpc8321E) with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS:

[    3.385197] bus: 'platform': really_probe: probing driver talitos with device ff020000.crypto
[    3.450982] random: fast init done
[   12.252548] alg: No test for authenc(hmac(md5),cbc(aes)) (authenc-hmac-md5-cbc-aes-talitos-hsna)
[   12.262226] alg: No test for authenc(hmac(md5),cbc(des3_ede)) (authenc-hmac-md5-cbc-3des-talitos-hsna)
[   43.310737] Bug in SEC1, padding ourself
[   45.603318] random: crng init done
[   54.612333] talitos ff020000.crypto: fsl,sec1.2 algorithms registered in /proc/crypto
[   54.620232] driver: 'talitos': driver_bound: bound to device 'ff020000.crypto'

[    1.193721] bus: 'platform': really_probe: probing driver talitos with device b0030000.crypto
[    1.229197] random: fast init done
[    2.714920] alg: No test for authenc(hmac(sha224),cbc(aes)) (authenc-hmac-sha224-cbc-aes-talitos)
[    2.724312] alg: No test for authenc(hmac(sha224),cbc(aes)) (authenc-hmac-sha224-cbc-aes-talitos-hsna)
[    4.482045] alg: No test for authenc(hmac(md5),cbc(aes)) (authenc-hmac-md5-cbc-aes-talitos)
[    4.490940] alg: No test for authenc(hmac(md5),cbc(aes)) (authenc-hmac-md5-cbc-aes-talitos-hsna)
[    4.500280] alg: No test for authenc(hmac(md5),cbc(des3_ede)) (authenc-hmac-md5-cbc-3des-talitos)
[    4.509727] alg: No test for authenc(hmac(md5),cbc(des3_ede)) (authenc-hmac-md5-cbc-3des-talitos-hsna)
[    6.631781] random: crng init done
[   11.521795] talitos b0030000.crypto: fsl,sec2.2 algorithms registered in /proc/crypto
[   11.529803] driver: 'talitos': driver_bound: bound to device 'b0030000.crypto'

Christophe Leroy (5):
  crypto: talitos - fix ECB and CBC algs ivsize
  crypto: talitos - move struct talitos_edesc into talitos.h
  crypto: talitos - fix hash on SEC1.
  crypto: talitos - eliminate unneeded 'done' functions at build time
  crypto: talitos - drop icv_ool

 drivers/crypto/talitos.c | 104 ++++++++++++++++++++---------------------------
 drivers/crypto/talitos.h |  28 +++++++++++++
 2 files changed, 72 insertions(+), 60 deletions(-)

-- 
2.13.3


^ permalink raw reply

* [PATCH v1 2/5] crypto: talitos - move struct talitos_edesc into talitos.h
From: Christophe Leroy @ 2019-06-06 11:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, horia.geanta
  Cc: linuxppc-dev, linux-crypto, linux-kernel
In-Reply-To: <cover.1559819372.git.christophe.leroy@c-s.fr>

Next patch will require struct talitos_edesc to be defined
earlier in talitos.c

This patch moves it into talitos.h so that it can be used
from any place in talitos.c

Fixes: 37b5e8897eb5 ("crypto: talitos - chain in buffered data for ahash on SEC1")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 drivers/crypto/talitos.c | 30 ------------------------------
 drivers/crypto/talitos.h | 30 ++++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 3b3e99f1cddb..5b401aec6c84 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -951,36 +951,6 @@ static int aead_des3_setkey(struct crypto_aead *authenc,
 	goto out;
 }
 
-/*
- * talitos_edesc - s/w-extended descriptor
- * @src_nents: number of segments in input scatterlist
- * @dst_nents: number of segments in output scatterlist
- * @icv_ool: whether ICV is out-of-line
- * @iv_dma: dma address of iv for checking continuity and link table
- * @dma_len: length of dma mapped link_tbl space
- * @dma_link_tbl: bus physical address of link_tbl/buf
- * @desc: h/w descriptor
- * @link_tbl: input and output h/w link tables (if {src,dst}_nents > 1) (SEC2)
- * @buf: input and output buffeur (if {src,dst}_nents > 1) (SEC1)
- *
- * if decrypting (with authcheck), or either one of src_nents or dst_nents
- * is greater than 1, an integrity check value is concatenated to the end
- * of link_tbl data
- */
-struct talitos_edesc {
-	int src_nents;
-	int dst_nents;
-	bool icv_ool;
-	dma_addr_t iv_dma;
-	int dma_len;
-	dma_addr_t dma_link_tbl;
-	struct talitos_desc desc;
-	union {
-		struct talitos_ptr link_tbl[0];
-		u8 buf[0];
-	};
-};
-
 static void talitos_sg_unmap(struct device *dev,
 			     struct talitos_edesc *edesc,
 			     struct scatterlist *src,
diff --git a/drivers/crypto/talitos.h b/drivers/crypto/talitos.h
index 32ad4fc679ed..95f78c6d9206 100644
--- a/drivers/crypto/talitos.h
+++ b/drivers/crypto/talitos.h
@@ -42,6 +42,36 @@ struct talitos_desc {
 
 #define TALITOS_DESC_SIZE	(sizeof(struct talitos_desc) - sizeof(__be32))
 
+/*
+ * talitos_edesc - s/w-extended descriptor
+ * @src_nents: number of segments in input scatterlist
+ * @dst_nents: number of segments in output scatterlist
+ * @icv_ool: whether ICV is out-of-line
+ * @iv_dma: dma address of iv for checking continuity and link table
+ * @dma_len: length of dma mapped link_tbl space
+ * @dma_link_tbl: bus physical address of link_tbl/buf
+ * @desc: h/w descriptor
+ * @link_tbl: input and output h/w link tables (if {src,dst}_nents > 1) (SEC2)
+ * @buf: input and output buffeur (if {src,dst}_nents > 1) (SEC1)
+ *
+ * if decrypting (with authcheck), or either one of src_nents or dst_nents
+ * is greater than 1, an integrity check value is concatenated to the end
+ * of link_tbl data
+ */
+struct talitos_edesc {
+	int src_nents;
+	int dst_nents;
+	bool icv_ool;
+	dma_addr_t iv_dma;
+	int dma_len;
+	dma_addr_t dma_link_tbl;
+	struct talitos_desc desc;
+	union {
+		struct talitos_ptr link_tbl[0];
+		u8 buf[0];
+	};
+};
+
 /**
  * talitos_request - descriptor submission request
  * @desc: descriptor pointer (kernel virtual)
-- 
2.13.3


^ permalink raw reply related

* [PATCH v1 4/5] crypto: talitos - eliminate unneeded 'done' functions at build time
From: Christophe Leroy @ 2019-06-06 11:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, horia.geanta
  Cc: linuxppc-dev, linux-crypto, linux-kernel
In-Reply-To: <cover.1559819372.git.christophe.leroy@c-s.fr>

When building for SEC1 only, talitos2_done functions are unneeded
and should go away.

For this, use has_ftr_sec1() which will always return true when only
SEC1 support is being built, allowing GCC to drop TALITOS2 functions.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 drivers/crypto/talitos.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 4f03baef952b..b2de931de623 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -3401,7 +3401,7 @@ static int talitos_probe(struct platform_device *ofdev)
 	if (err)
 		goto err_out;

-	if (of_device_is_compatible(np, "fsl,sec1.0")) {
+	if (has_ftr_sec1(priv)) {
 		if (priv->num_channels == 1)
 			tasklet_init(&priv->done_task[0], talitos1_done_ch0,
 				     (unsigned long)dev);
-- 
2.13.3

^ permalink raw reply related

* [PATCH v1 3/5] crypto: talitos - fix hash on SEC1.
From: Christophe Leroy @ 2019-06-06 11:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, horia.geanta
  Cc: linuxppc-dev, linux-crypto, linux-kernel
In-Reply-To: <cover.1559819372.git.christophe.leroy@c-s.fr>

On SEC1, hash provides wrong result when performing hashing in several
steps with input data SG list has more than one element. This was
detected with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS:

[   44.185947] alg: hash: md5-talitos test failed (wrong result) on test vector 6, cfg="random: may_sleep use_finup src_divs=[<reimport>25.88%@+8063, <flush>24.19%@+9588, 28.63%@+16333, <reimport>4.60%@+6756, 16.70%@+16281] dst_divs=[71.61%@alignmask+16361, 14.36%@+7756, 14.3%@+"
[   44.325122] alg: hash: sha1-talitos test failed (wrong result) on test vector 3, cfg="random: inplace use_final src_divs=[<flush,nosimd>16.56%@+16378, <reimport>52.0%@+16329, 21.42%@alignmask+16380, 10.2%@alignmask+16380] iv_offset=39"
[   44.493500] alg: hash: sha224-talitos test failed (wrong result) on test vector 4, cfg="random: use_final nosimd src_divs=[<reimport>52.27%@+7401, <reimport>17.34%@+16285, <flush>17.71%@+26, 12.68%@+10644] iv_offset=43"
[   44.673262] alg: hash: sha256-talitos test failed (wrong result) on test vector 4, cfg="random: may_sleep use_finup src_divs=[<reimport>60.6%@+12790, 17.86%@+1329, <reimport>12.64%@alignmask+16300, 8.29%@+15, 0.40%@+13506, <reimport>0.51%@+16322, <reimport>0.24%@+16339] dst_divs"

This is due to two issues:
- We have an overlap between the buffer used for copying the input
data (SEC1 doesn't do scatter/gather) and the chained descriptor.
- Data copy is wrong when the previous hash left less than one
blocksize of data to hash, implying a complement of the previous
block with a few bytes from the new request.

This patch fixes it by:
- Moving the second descriptor after the buffer, as moving the buffer
after the descriptor would make it more complex for other cipher
operations (AEAD, ABLKCIPHER)
- Rebuiding a new data SG list without the bytes taken from the new
request to complete the previous one.

Fixes: 37b5e8897eb5 ("crypto: talitos - chain in buffered data for ahash on SEC1")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 drivers/crypto/talitos.c | 63 ++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 23 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 5b401aec6c84..4f03baef952b 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -336,15 +336,18 @@ static void flush_channel(struct device *dev, int ch, int error, int reset_ch)
 	tail = priv->chan[ch].tail;
 	while (priv->chan[ch].fifo[tail].desc) {
 		__be32 hdr;
+		struct talitos_edesc *edesc;
 
 		request = &priv->chan[ch].fifo[tail];
+		edesc = container_of(request->desc, struct talitos_edesc, desc);
 
 		/* descriptors with their done bits set don't get the error */
 		rmb();
 		if (!is_sec1)
 			hdr = request->desc->hdr;
 		else if (request->desc->next_desc)
-			hdr = (request->desc + 1)->hdr1;
+			hdr = ((struct talitos_desc *)
+			       (edesc->buf + edesc->dma_len))->hdr1;
 		else
 			hdr = request->desc->hdr1;
 
@@ -476,8 +479,14 @@ static u32 current_desc_hdr(struct device *dev, int ch)
 		}
 	}
 
-	if (priv->chan[ch].fifo[iter].desc->next_desc == cur_desc)
-		return (priv->chan[ch].fifo[iter].desc + 1)->hdr;
+	if (priv->chan[ch].fifo[iter].desc->next_desc == cur_desc) {
+		struct talitos_edesc *edesc;
+
+		edesc = container_of(priv->chan[ch].fifo[iter].desc,
+				     struct talitos_edesc, desc);
+		return ((struct talitos_desc *)
+			(edesc->buf + edesc->dma_len))->hdr;
+	}
 
 	return priv->chan[ch].fifo[iter].desc->hdr;
 }
@@ -1402,15 +1411,11 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 	edesc->dst_nents = dst_nents;
 	edesc->iv_dma = iv_dma;
 	edesc->dma_len = dma_len;
-	if (dma_len) {
-		void *addr = &edesc->link_tbl[0];
-
-		if (is_sec1 && !dst)
-			addr += sizeof(struct talitos_desc);
-		edesc->dma_link_tbl = dma_map_single(dev, addr,
+	if (dma_len)
+		edesc->dma_link_tbl = dma_map_single(dev, &edesc->link_tbl[0],
 						     edesc->dma_len,
 						     DMA_BIDIRECTIONAL);
-	}
+
 	return edesc;
 }
 
@@ -1722,14 +1727,16 @@ static void common_nonsnoop_hash_unmap(struct device *dev,
 	struct talitos_private *priv = dev_get_drvdata(dev);
 	bool is_sec1 = has_ftr_sec1(priv);
 	struct talitos_desc *desc = &edesc->desc;
-	struct talitos_desc *desc2 = desc + 1;
+	struct talitos_desc *desc2 = (struct talitos_desc *)
+				     (edesc->buf + edesc->dma_len);
 
 	unmap_single_talitos_ptr(dev, &edesc->desc.ptr[5], DMA_FROM_DEVICE);
 	if (desc->next_desc &&
 	    desc->ptr[5].ptr != desc2->ptr[5].ptr)
 		unmap_single_talitos_ptr(dev, &desc2->ptr[5], DMA_FROM_DEVICE);
 
-	talitos_sg_unmap(dev, edesc, req_ctx->psrc, NULL, 0, 0);
+	if (req_ctx->psrc)
+		talitos_sg_unmap(dev, edesc, req_ctx->psrc, NULL, 0, 0);
 
 	/* When using hashctx-in, must unmap it. */
 	if (from_talitos_ptr_len(&edesc->desc.ptr[1], is_sec1))
@@ -1796,7 +1803,6 @@ static void talitos_handle_buggy_hash(struct talitos_ctx *ctx,
 
 static int common_nonsnoop_hash(struct talitos_edesc *edesc,
 				struct ahash_request *areq, unsigned int length,
-				unsigned int offset,
 				void (*callback) (struct device *dev,
 						  struct talitos_desc *desc,
 						  void *context, int error))
@@ -1835,9 +1841,7 @@ static int common_nonsnoop_hash(struct talitos_edesc *edesc,
 
 	sg_count = edesc->src_nents ?: 1;
 	if (is_sec1 && sg_count > 1)
-		sg_pcopy_to_buffer(req_ctx->psrc, sg_count,
-				   edesc->buf + sizeof(struct talitos_desc),
-				   length, req_ctx->nbuf);
+		sg_copy_to_buffer(req_ctx->psrc, sg_count, edesc->buf, length);
 	else if (length)
 		sg_count = dma_map_sg(dev, req_ctx->psrc, sg_count,
 				      DMA_TO_DEVICE);
@@ -1850,7 +1854,7 @@ static int common_nonsnoop_hash(struct talitos_edesc *edesc,
 				       DMA_TO_DEVICE);
 	} else {
 		sg_count = talitos_sg_map(dev, req_ctx->psrc, length, edesc,
-					  &desc->ptr[3], sg_count, offset, 0);
+					  &desc->ptr[3], sg_count, 0, 0);
 		if (sg_count > 1)
 			sync_needed = true;
 	}
@@ -1874,7 +1878,8 @@ static int common_nonsnoop_hash(struct talitos_edesc *edesc,
 		talitos_handle_buggy_hash(ctx, edesc, &desc->ptr[3]);
 
 	if (is_sec1 && req_ctx->nbuf && length) {
-		struct talitos_desc *desc2 = desc + 1;
+		struct talitos_desc *desc2 = (struct talitos_desc *)
+					     (edesc->buf + edesc->dma_len);
 		dma_addr_t next_desc;
 
 		memset(desc2, 0, sizeof(*desc2));
@@ -1895,7 +1900,7 @@ static int common_nonsnoop_hash(struct talitos_edesc *edesc,
 						      DMA_TO_DEVICE);
 		copy_talitos_ptr(&desc2->ptr[2], &desc->ptr[2], is_sec1);
 		sg_count = talitos_sg_map(dev, req_ctx->psrc, length, edesc,
-					  &desc2->ptr[3], sg_count, offset, 0);
+					  &desc2->ptr[3], sg_count, 0, 0);
 		if (sg_count > 1)
 			sync_needed = true;
 		copy_talitos_ptr(&desc2->ptr[5], &desc->ptr[5], is_sec1);
@@ -2006,7 +2011,6 @@ static int ahash_process_req(struct ahash_request *areq, unsigned int nbytes)
 	struct device *dev = ctx->dev;
 	struct talitos_private *priv = dev_get_drvdata(dev);
 	bool is_sec1 = has_ftr_sec1(priv);
-	int offset = 0;
 	u8 *ctx_buf = req_ctx->buf[req_ctx->buf_idx];
 
 	if (!req_ctx->last && (nbytes + req_ctx->nbuf <= blocksize)) {
@@ -2046,6 +2050,9 @@ static int ahash_process_req(struct ahash_request *areq, unsigned int nbytes)
 			sg_chain(req_ctx->bufsl, 2, areq->src);
 		req_ctx->psrc = req_ctx->bufsl;
 	} else if (is_sec1 && req_ctx->nbuf && req_ctx->nbuf < blocksize) {
+		int offset;
+		struct scatterlist *sg;
+
 		if (nbytes_to_hash > blocksize)
 			offset = blocksize - req_ctx->nbuf;
 		else
@@ -2058,7 +2065,18 @@ static int ahash_process_req(struct ahash_request *areq, unsigned int nbytes)
 		sg_copy_to_buffer(areq->src, nents,
 				  ctx_buf + req_ctx->nbuf, offset);
 		req_ctx->nbuf += offset;
-		req_ctx->psrc = areq->src;
+		for (sg = areq->src; sg && offset >= sg->length;
+		     offset -= sg->length, sg = sg_next(sg))
+			;
+		if (offset) {
+			sg_init_table(req_ctx->bufsl, 2);
+			sg_set_buf(req_ctx->bufsl, sg_virt(sg) + offset,
+				   sg->length - offset);
+			sg_chain(req_ctx->bufsl, 2, sg_next(sg));
+			req_ctx->psrc = req_ctx->bufsl;
+		} else {
+			req_ctx->psrc = sg;
+		}
 	} else
 		req_ctx->psrc = areq->src;
 
@@ -2098,8 +2116,7 @@ static int ahash_process_req(struct ahash_request *areq, unsigned int nbytes)
 	if (ctx->keylen && (req_ctx->first || req_ctx->last))
 		edesc->desc.hdr |= DESC_HDR_MODE0_MDEU_HMAC;
 
-	return common_nonsnoop_hash(edesc, areq, nbytes_to_hash, offset,
-				    ahash_done);
+	return common_nonsnoop_hash(edesc, areq, nbytes_to_hash, ahash_done);
 }
 
 static int ahash_update(struct ahash_request *areq)
-- 
2.13.3


^ permalink raw reply related

* [PATCH v1 5/5] crypto: talitos - drop icv_ool
From: Christophe Leroy @ 2019-06-06 11:31 UTC (permalink / raw)
  To: Herbert Xu, David S. Miller, horia.geanta
  Cc: linuxppc-dev, linux-crypto, linux-kernel
In-Reply-To: <cover.1559819372.git.christophe.leroy@c-s.fr>

icv_ool is not used anymore, drop it.

Fixes: 9cc87bc3613b ("crypto: talitos - fix AEAD processing")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
 drivers/crypto/talitos.c | 3 ---
 drivers/crypto/talitos.h | 2 --
 2 files changed, 5 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index b2de931de623..03b7a5d28fb0 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -1278,9 +1278,6 @@ static int ipsec_esp(struct talitos_edesc *edesc, struct aead_request *areq,
 				 is_ipsec_esp && !encrypt);
 	tbl_off += ret;
 
-	/* ICV data */
-	edesc->icv_ool = !encrypt;
-
 	if (!encrypt && is_ipsec_esp) {
 		struct talitos_ptr *tbl_ptr = &edesc->link_tbl[tbl_off];
 
diff --git a/drivers/crypto/talitos.h b/drivers/crypto/talitos.h
index 95f78c6d9206..1469b956948a 100644
--- a/drivers/crypto/talitos.h
+++ b/drivers/crypto/talitos.h
@@ -46,7 +46,6 @@ struct talitos_desc {
  * talitos_edesc - s/w-extended descriptor
  * @src_nents: number of segments in input scatterlist
  * @dst_nents: number of segments in output scatterlist
- * @icv_ool: whether ICV is out-of-line
  * @iv_dma: dma address of iv for checking continuity and link table
  * @dma_len: length of dma mapped link_tbl space
  * @dma_link_tbl: bus physical address of link_tbl/buf
@@ -61,7 +60,6 @@ struct talitos_desc {
 struct talitos_edesc {
 	int src_nents;
 	int dst_nents;
-	bool icv_ool;
 	dma_addr_t iv_dma;
 	int dma_len;
 	dma_addr_t dma_link_tbl;
-- 
2.13.3


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox