LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v2] powerpc/64s: Fix pte update for kernel memory on radix
From: Naveen N. Rao @ 2021-02-04  6:41 UTC (permalink / raw)
  To: Jordan Niethe; +Cc: linuxppc-dev, npiggin, cmr
In-Reply-To: <20210203235907.961243-1-jniethe5@gmail.com>

Hi Jordan,

On 2021/02/04 10:59AM, Jordan Niethe wrote:
> When adding a pte a ptesync is needed to order the update of the pte
> with subsequent accesses otherwise a spurious fault may be raised.
> 
> radix__set_pte_at() does not do this for performance gains. For
> non-kernel memory this is not an issue as any faults of this kind are
> corrected by the page fault handler.  For kernel memory these faults are
> not handled.  The current solution is that there is a ptesync in
> flush_cache_vmap() which should be called when mapping from the vmalloc
> region.
> 
> However, map_kernel_page() does not call flush_cache_vmap(). This is
> troublesome in particular for code patching with Strict RWX on radix. In
> do_patch_instruction() the page frame that contains the instruction to
> be patched is mapped and then immediately patched. With no ordering or
> synchronization between setting up the pte and writing to the page it is
> possible for faults.
> 
> As the code patching is done using __put_user_asm_goto() the resulting
> fault is obscured - but using a normal store instead it can be seen:
> 
> [  418.498768][  T757] BUG: Unable to handle kernel data access on write at 0xc008000008f24a3c
> [  418.498790][  T757] Faulting instruction address: 0xc00000000008bd74
> [  418.498805][  T757] Oops: Kernel access of bad area, sig: 11 [#1]
> [  418.498828][  T757] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
> [  418.498843][  T757] Modules linked in: nop_module(PO+) [last unloaded: nop_module]
> [  418.498872][  T757] CPU: 4 PID: 757 Comm: sh Tainted: P           O      5.10.0-rc5-01361-ge3c1b78c8440-dirty #43
> [  418.498936][  T757] NIP:  c00000000008bd74 LR: c00000000008bd50 CTR: c000000000025810
> [  418.498979][  T757] REGS: c000000016f634a0 TRAP: 0300   Tainted: P           O       (5.10.0-rc5-01361-ge3c1b78c8440-dirty)
> [  418.499033][  T757] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44002884  XER: 00000000
> [  418.499084][  T757] CFAR: c00000000007c68c DAR: c008000008f24a3c DSISR: 42000000 IRQMASK: 1
> 
> This results in the kind of issue reported here:
> https://lore.kernel.org/linuxppc-dev/15AC5B0E-A221-4B8C-9039-FA96B8EF7C88@lca.pw/
> 

Thanks for fixing this!

- Naveen


^ permalink raw reply

* Re: [PATCH v4 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()
From: Michael Ellerman @ 2021-02-04  6:09 UTC (permalink / raw)
  To: Christopher M. Riedl, Gabriel Paubert
  Cc: David Laight, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <C8YHSKQ99VC4.M9Y0WOFVUTBQ@geist>

"Christopher M. Riedl" <cmr@codefail.de> writes:
> On Mon Feb 1, 2021 at 10:54 AM CST, Gabriel Paubert wrote:
>> On Mon, Feb 01, 2021 at 09:55:44AM -0600, Christopher M. Riedl wrote:
>> > On Thu Jan 28, 2021 at 4:38 AM CST, David Laight wrote:
>> > > From: Christopher M. Riedl
>> > > > Sent: 28 January 2021 04:04
>> > > > 
>> > > > Reuse the "safe" implementation from signal.c except for calling
>> > > > unsafe_copy_from_user() to copy into a local buffer.
>> > > > 
>> > > > Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
>> > > > ---
>> > > >  arch/powerpc/kernel/signal.h | 33 +++++++++++++++++++++++++++++++++
>> > > >  1 file changed, 33 insertions(+)
>> > > > 
>> > > > diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
>> > > > index 2559a681536e..c18402d625f1 100644
>> > > > --- a/arch/powerpc/kernel/signal.h
>> > > > +++ b/arch/powerpc/kernel/signal.h
>> > > > @@ -53,6 +53,33 @@ unsigned long copy_ckfpr_from_user(struct task_struct *task, void __user *from);
>> > > >  				&buf[i], label);\
>> > > >  } while (0)
>> > > > 
>> > > > +#define unsafe_copy_fpr_from_user(task, from, label)	do {		\
>> > > > +	struct task_struct *__t = task;					\
>> > > > +	u64 __user *__f = (u64 __user *)from;				\
>> > > > +	u64 buf[ELF_NFPREG];						\
>> > >
>> > > How big is that buffer?
>> > > Isn't is likely to be reasonably large compared to a reasonable
>> > > kernel stack frame.
>> > > Especially since this isn't even a leaf function.
>> > >
>> > 
>> > I think Christophe answered this - I don't really have an opinion either
>> > way. What would be a 'reasonable' kernel stack frame for reference?
>>
>> See include/linux/poll.h, where the limit is of the order of 800 bytes
>> and the number of entries in an on stack array is chosen at compile time
>> (different between 32 and 64 bit for example).
>>
>> The values are used in do_sys_poll, which, with almost 1000 bytes of
>> stack footprint, appears close to the top of "make checkstack". In
>> addition do_sys_poll has to call the ->poll function of every file
>> descriptor in its table, so it is not a tail function.
>>
>> This 264 bytes array looks reasonable, but please use 'make checkstack'
>> to verify that the function's total stack usage stays within reason.
>
> Neat, looks like total usage is a bit larger but still reasonable and
> less than half of 800B:
>
> 0xc000000000017e900 __unsafe_restore_sigcontext.constprop.0 [vmlinux]:352

We warn for frames larger than 2KB on 64-bit, see FRAME_WARN in
lib/Kconfig.debug.

So 264 bytes is entirely reasonable IMHO.

cheers


^ permalink raw reply

* Re: [PATCH v2] powerpc/64s: Fix pte update for kernel memory on radix
From: Jordan Niethe @ 2021-02-04  4:45 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev, cmr
In-Reply-To: <1612409379.adcakadx0b.astroid@bobo.none>

On Thu, Feb 4, 2021 at 2:31 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Excerpts from Jordan Niethe's message of February 4, 2021 9:59 am:
> > When adding a pte a ptesync is needed to order the update of the pte
> > with subsequent accesses otherwise a spurious fault may be raised.
> >
> > radix__set_pte_at() does not do this for performance gains. For
> > non-kernel memory this is not an issue as any faults of this kind are
> > corrected by the page fault handler.  For kernel memory these faults are
> > not handled.  The current solution is that there is a ptesync in
> > flush_cache_vmap() which should be called when mapping from the vmalloc
> > region.
> >
> > However, map_kernel_page() does not call flush_cache_vmap(). This is
> > troublesome in particular for code patching with Strict RWX on radix. In
> > do_patch_instruction() the page frame that contains the instruction to
> > be patched is mapped and then immediately patched. With no ordering or
> > synchronization between setting up the pte and writing to the page it is
> > possible for faults.
> >
> > As the code patching is done using __put_user_asm_goto() the resulting
> > fault is obscured - but using a normal store instead it can be seen:
> >
> > [  418.498768][  T757] BUG: Unable to handle kernel data access on write at 0xc008000008f24a3c
> > [  418.498790][  T757] Faulting instruction address: 0xc00000000008bd74
> > [  418.498805][  T757] Oops: Kernel access of bad area, sig: 11 [#1]
> > [  418.498828][  T757] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
> > [  418.498843][  T757] Modules linked in: nop_module(PO+) [last unloaded: nop_module]
> > [  418.498872][  T757] CPU: 4 PID: 757 Comm: sh Tainted: P           O      5.10.0-rc5-01361-ge3c1b78c8440-dirty #43
> > [  418.498936][  T757] NIP:  c00000000008bd74 LR: c00000000008bd50 CTR: c000000000025810
> > [  418.498979][  T757] REGS: c000000016f634a0 TRAP: 0300   Tainted: P           O       (5.10.0-rc5-01361-ge3c1b78c8440-dirty)
> > [  418.499033][  T757] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44002884  XER: 00000000
> > [  418.499084][  T757] CFAR: c00000000007c68c DAR: c008000008f24a3c DSISR: 42000000 IRQMASK: 1
> >
> > This results in the kind of issue reported here:
> > https://lore.kernel.org/linuxppc-dev/15AC5B0E-A221-4B8C-9039-FA96B8EF7C88@lca.pw/
> >
> > Chris Riedl suggested a reliable way to reproduce the issue:
> > $ mount -t debugfs none /sys/kernel/debug
> > $ (while true; do echo function > /sys/kernel/debug/tracing/current_tracer ; echo nop > /sys/kernel/debug/tracing/current_tracer ; done)&
> >
> > Turning ftrace on and off does a large amount of code patching which in
> > usually less then 5min will crash giving a trace like:
> >
> > [  146.668988][  T809] ftrace-powerpc: (____ptrval____): replaced (4b473b11) != old (60000000)
> > [  146.668995][  T809] ------------[ ftrace bug ]------------
> > [  146.669031][  T809] ftrace failed to modify
> > [  146.669039][  T809] [<c000000000bf8e5c>] napi_busy_loop+0xc/0x390
> > [  146.669045][  T809]  actual:   11:3b:47:4b
> > [  146.669070][  T809] Setting ftrace call site to call ftrace function
> > [  146.669075][  T809] ftrace record flags: 80000001
> > [  146.669081][  T809]  (1)
> > [  146.669081][  T809]  expected tramp: c00000000006c96c
> > [  146.669096][  T809] ------------[ cut here ]------------
> > [  146.669104][  T809] WARNING: CPU: 4 PID: 809 at kernel/trace/ftrace.c:2065 ftrace_bug+0x28c/0x2e8
> > [  146.669109][  T809] Modules linked in: nop_module(PO-) [last unloaded: nop_module]
> > [  146.669130][  T809] CPU: 4 PID: 809 Comm: sh Tainted: P           O      5.10.0-rc5-01360-gf878ccaf250a #1
> > [  146.669136][  T809] NIP:  c00000000024f334 LR: c00000000024f330 CTR: c0000000001a5af0
> > [  146.669142][  T809] REGS: c000000004c8b760 TRAP: 0700   Tainted: P           O       (5.10.0-rc5-01360-gf878ccaf250a)
> > [  146.669147][  T809] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28008848  XER: 20040000
> > [  146.669208][  T809] CFAR: c0000000001a9c98 IRQMASK: 0
> > [  146.669208][  T809] GPR00: c00000000024f330 c000000004c8b9f0 c000000002770600 0000000000000022
> > [  146.669208][  T809] GPR04: 00000000ffff7fff c000000004c8b6d0 0000000000000027 c0000007fe9bcdd8
> > [  146.669208][  T809] GPR08: 0000000000000023 ffffffffffffffd8 0000000000000027 c000000002613118
> > [  146.669208][  T809] GPR12: 0000000000008000 c0000007fffdca00 0000000000000000 0000000000000000
> > [  146.669208][  T809] GPR16: 0000000023ec37c5 0000000000000000 0000000000000000 0000000000000008
> > [  146.669208][  T809] GPR20: c000000004c8bc90 c0000000027a2d20 c000000004c8bcd0 c000000002612fe8
> > [  146.669208][  T809] GPR24: 0000000000000038 0000000000000030 0000000000000028 0000000000000020
> > [  146.669208][  T809] GPR28: c000000000ff1b68 c000000000bf8e5c c00000000312f700 c000000000fbb9b0
> > [  146.669384][  T809] NIP [c00000000024f334] ftrace_bug+0x28c/0x2e8
> > [  146.669391][  T809] LR [c00000000024f330] ftrace_bug+0x288/0x2e8
> > [  146.669396][  T809] Call Trace:
> > [  146.669403][  T809] [c000000004c8b9f0] [c00000000024f330] ftrace_bug+0x288/0x2e8 (unreliable)
> > [  146.669418][  T809] [c000000004c8ba80] [c000000000248778] ftrace_modify_all_code+0x168/0x210
> > [  146.669429][  T809] [c000000004c8bab0] [c00000000006c528] arch_ftrace_update_code+0x18/0x30
> > [  146.669440][  T809] [c000000004c8bad0] [c000000000248954] ftrace_run_update_code+0x44/0xc0
> > [  146.669451][  T809] [c000000004c8bb00] [c00000000024dc88] ftrace_startup+0xf8/0x1c0
> > [  146.669461][  T809] [c000000004c8bb40] [c00000000024dd9c] register_ftrace_function+0x4c/0xc0
> > [  146.669472][  T809] [c000000004c8bb70] [c00000000026e750] function_trace_init+0x80/0xb0
> > [  146.669484][  T809] [c000000004c8bba0] [c000000000266b84] tracing_set_tracer+0x2a4/0x4f0
> > [  146.669495][  T809] [c000000004c8bc70] [c000000000266ea4] tracing_set_trace_write+0xd4/0x130
> > [  146.669506][  T809] [c000000004c8bd20] [c000000000422790] vfs_write+0xf0/0x330
> > [  146.669518][  T809] [c000000004c8bd70] [c000000000422bb4] ksys_write+0x84/0x140
> > [  146.669529][  T809] [c000000004c8bdc0] [c00000000003499c] system_call_exception+0x14c/0x230
> > [  146.669540][  T809] [c000000004c8be20] [c00000000000d860] system_call_common+0xf0/0x27c
> > [  146.669549][  T809] Instruction dump:
> > [  146.669558][  T809] 48000014 3c62fe88 38631718 4bf5a941 60000000 7fc3f378 4bff877d 7c641b78
> > [  146.669598][  T809] 3c62fe88 38631730 4bf5a925 60000000 <0fe00000> 38210090 3d22fd90 39000001
> > [  146.669638][  T809] ---[ end trace 5ea7076ea28c0fbd ]---
> >
> > To fix this when updating kernel memory ptes using ptesync.
> >
> > Fixes: 37bc3e5fd764 ("powerpc/lib/code-patching: Use alternate map for patch_instruction()")
> > Fixes: f1cb8f9beba8 ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")
> > Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
>
> Good catch. I would say it just fixes the latter patch doesn't it? The
> previous one just happens to break because it's using the broken API?
Yes that is true. I will send another revision that removes that from
the 'Fixes' and also add a self test.
>
> Anyhow,
>
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>
> > ---
> > v2: Only ptesync is needed
> > ---
> >  arch/powerpc/include/asm/book3s/64/radix.h | 6 ++++--
> >  arch/powerpc/mm/book3s64/radix_pgtable.c   | 4 ++--
> >  2 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
> > index c7813dc628fc..59cab558e2f0 100644
> > --- a/arch/powerpc/include/asm/book3s/64/radix.h
> > +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> > @@ -222,8 +222,10 @@ static inline void radix__set_pte_at(struct mm_struct *mm, unsigned long addr,
> >        * from ptesync, it should probably go into update_mmu_cache, rather
> >        * than set_pte_at (which is used to set ptes unrelated to faults).
> >        *
> > -      * Spurious faults to vmalloc region are not tolerated, so there is
> > -      * a ptesync in flush_cache_vmap.
> > +      * Spurious faults from the kernel memory are not tolerated, so there
> > +      * is a ptesync in flush_cache_vmap, and __map_kernel_page() follows
> > +      * the pte update sequence from ISA Book III 6.10 Translation Table
> > +      * Update Synchronization Requirements.
> >        */
> >  }
> >
> > diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
> > index 3adcf730f478..1d5eec847b88 100644
> > --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> > +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> > @@ -108,7 +108,7 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
> >
> >  set_the_pte:
> >       set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
> > -     smp_wmb();
> > +     asm volatile("ptesync": : :"memory");
> >       return 0;
> >  }
> >
> > @@ -168,7 +168,7 @@ static int __map_kernel_page(unsigned long ea, unsigned long pa,
> >
> >  set_the_pte:
> >       set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
> > -     smp_wmb();
> > +     asm volatile("ptesync": : :"memory");
> >       return 0;
> >  }
> >
> > --
> > 2.25.1
> >
> >

^ permalink raw reply

* Re: [PATCH v2] powerpc/64s: Fix pte update for kernel memory on radix
From: Nicholas Piggin @ 2021-02-04  3:31 UTC (permalink / raw)
  To: Jordan Niethe, linuxppc-dev; +Cc: cmr
In-Reply-To: <20210203235907.961243-1-jniethe5@gmail.com>

Excerpts from Jordan Niethe's message of February 4, 2021 9:59 am:
> When adding a pte a ptesync is needed to order the update of the pte
> with subsequent accesses otherwise a spurious fault may be raised.
> 
> radix__set_pte_at() does not do this for performance gains. For
> non-kernel memory this is not an issue as any faults of this kind are
> corrected by the page fault handler.  For kernel memory these faults are
> not handled.  The current solution is that there is a ptesync in
> flush_cache_vmap() which should be called when mapping from the vmalloc
> region.
> 
> However, map_kernel_page() does not call flush_cache_vmap(). This is
> troublesome in particular for code patching with Strict RWX on radix. In
> do_patch_instruction() the page frame that contains the instruction to
> be patched is mapped and then immediately patched. With no ordering or
> synchronization between setting up the pte and writing to the page it is
> possible for faults.
> 
> As the code patching is done using __put_user_asm_goto() the resulting
> fault is obscured - but using a normal store instead it can be seen:
> 
> [  418.498768][  T757] BUG: Unable to handle kernel data access on write at 0xc008000008f24a3c
> [  418.498790][  T757] Faulting instruction address: 0xc00000000008bd74
> [  418.498805][  T757] Oops: Kernel access of bad area, sig: 11 [#1]
> [  418.498828][  T757] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
> [  418.498843][  T757] Modules linked in: nop_module(PO+) [last unloaded: nop_module]
> [  418.498872][  T757] CPU: 4 PID: 757 Comm: sh Tainted: P           O      5.10.0-rc5-01361-ge3c1b78c8440-dirty #43
> [  418.498936][  T757] NIP:  c00000000008bd74 LR: c00000000008bd50 CTR: c000000000025810
> [  418.498979][  T757] REGS: c000000016f634a0 TRAP: 0300   Tainted: P           O       (5.10.0-rc5-01361-ge3c1b78c8440-dirty)
> [  418.499033][  T757] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44002884  XER: 00000000
> [  418.499084][  T757] CFAR: c00000000007c68c DAR: c008000008f24a3c DSISR: 42000000 IRQMASK: 1
> 
> This results in the kind of issue reported here:
> https://lore.kernel.org/linuxppc-dev/15AC5B0E-A221-4B8C-9039-FA96B8EF7C88@lca.pw/
> 
> Chris Riedl suggested a reliable way to reproduce the issue:
> $ mount -t debugfs none /sys/kernel/debug
> $ (while true; do echo function > /sys/kernel/debug/tracing/current_tracer ; echo nop > /sys/kernel/debug/tracing/current_tracer ; done)&
> 
> Turning ftrace on and off does a large amount of code patching which in
> usually less then 5min will crash giving a trace like:
> 
> [  146.668988][  T809] ftrace-powerpc: (____ptrval____): replaced (4b473b11) != old (60000000)
> [  146.668995][  T809] ------------[ ftrace bug ]------------
> [  146.669031][  T809] ftrace failed to modify
> [  146.669039][  T809] [<c000000000bf8e5c>] napi_busy_loop+0xc/0x390
> [  146.669045][  T809]  actual:   11:3b:47:4b
> [  146.669070][  T809] Setting ftrace call site to call ftrace function
> [  146.669075][  T809] ftrace record flags: 80000001
> [  146.669081][  T809]  (1)
> [  146.669081][  T809]  expected tramp: c00000000006c96c
> [  146.669096][  T809] ------------[ cut here ]------------
> [  146.669104][  T809] WARNING: CPU: 4 PID: 809 at kernel/trace/ftrace.c:2065 ftrace_bug+0x28c/0x2e8
> [  146.669109][  T809] Modules linked in: nop_module(PO-) [last unloaded: nop_module]
> [  146.669130][  T809] CPU: 4 PID: 809 Comm: sh Tainted: P           O      5.10.0-rc5-01360-gf878ccaf250a #1
> [  146.669136][  T809] NIP:  c00000000024f334 LR: c00000000024f330 CTR: c0000000001a5af0
> [  146.669142][  T809] REGS: c000000004c8b760 TRAP: 0700   Tainted: P           O       (5.10.0-rc5-01360-gf878ccaf250a)
> [  146.669147][  T809] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28008848  XER: 20040000
> [  146.669208][  T809] CFAR: c0000000001a9c98 IRQMASK: 0
> [  146.669208][  T809] GPR00: c00000000024f330 c000000004c8b9f0 c000000002770600 0000000000000022
> [  146.669208][  T809] GPR04: 00000000ffff7fff c000000004c8b6d0 0000000000000027 c0000007fe9bcdd8
> [  146.669208][  T809] GPR08: 0000000000000023 ffffffffffffffd8 0000000000000027 c000000002613118
> [  146.669208][  T809] GPR12: 0000000000008000 c0000007fffdca00 0000000000000000 0000000000000000
> [  146.669208][  T809] GPR16: 0000000023ec37c5 0000000000000000 0000000000000000 0000000000000008
> [  146.669208][  T809] GPR20: c000000004c8bc90 c0000000027a2d20 c000000004c8bcd0 c000000002612fe8
> [  146.669208][  T809] GPR24: 0000000000000038 0000000000000030 0000000000000028 0000000000000020
> [  146.669208][  T809] GPR28: c000000000ff1b68 c000000000bf8e5c c00000000312f700 c000000000fbb9b0
> [  146.669384][  T809] NIP [c00000000024f334] ftrace_bug+0x28c/0x2e8
> [  146.669391][  T809] LR [c00000000024f330] ftrace_bug+0x288/0x2e8
> [  146.669396][  T809] Call Trace:
> [  146.669403][  T809] [c000000004c8b9f0] [c00000000024f330] ftrace_bug+0x288/0x2e8 (unreliable)
> [  146.669418][  T809] [c000000004c8ba80] [c000000000248778] ftrace_modify_all_code+0x168/0x210
> [  146.669429][  T809] [c000000004c8bab0] [c00000000006c528] arch_ftrace_update_code+0x18/0x30
> [  146.669440][  T809] [c000000004c8bad0] [c000000000248954] ftrace_run_update_code+0x44/0xc0
> [  146.669451][  T809] [c000000004c8bb00] [c00000000024dc88] ftrace_startup+0xf8/0x1c0
> [  146.669461][  T809] [c000000004c8bb40] [c00000000024dd9c] register_ftrace_function+0x4c/0xc0
> [  146.669472][  T809] [c000000004c8bb70] [c00000000026e750] function_trace_init+0x80/0xb0
> [  146.669484][  T809] [c000000004c8bba0] [c000000000266b84] tracing_set_tracer+0x2a4/0x4f0
> [  146.669495][  T809] [c000000004c8bc70] [c000000000266ea4] tracing_set_trace_write+0xd4/0x130
> [  146.669506][  T809] [c000000004c8bd20] [c000000000422790] vfs_write+0xf0/0x330
> [  146.669518][  T809] [c000000004c8bd70] [c000000000422bb4] ksys_write+0x84/0x140
> [  146.669529][  T809] [c000000004c8bdc0] [c00000000003499c] system_call_exception+0x14c/0x230
> [  146.669540][  T809] [c000000004c8be20] [c00000000000d860] system_call_common+0xf0/0x27c
> [  146.669549][  T809] Instruction dump:
> [  146.669558][  T809] 48000014 3c62fe88 38631718 4bf5a941 60000000 7fc3f378 4bff877d 7c641b78
> [  146.669598][  T809] 3c62fe88 38631730 4bf5a925 60000000 <0fe00000> 38210090 3d22fd90 39000001
> [  146.669638][  T809] ---[ end trace 5ea7076ea28c0fbd ]---
> 
> To fix this when updating kernel memory ptes using ptesync.
> 
> Fixes: 37bc3e5fd764 ("powerpc/lib/code-patching: Use alternate map for patch_instruction()")
> Fixes: f1cb8f9beba8 ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")
> Signed-off-by: Jordan Niethe <jniethe5@gmail.com>

Good catch. I would say it just fixes the latter patch doesn't it? The 
previous one just happens to break because it's using the broken API?

Anyhow,

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> ---
> v2: Only ptesync is needed
> ---
>  arch/powerpc/include/asm/book3s/64/radix.h | 6 ++++--
>  arch/powerpc/mm/book3s64/radix_pgtable.c   | 4 ++--
>  2 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
> index c7813dc628fc..59cab558e2f0 100644
> --- a/arch/powerpc/include/asm/book3s/64/radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> @@ -222,8 +222,10 @@ static inline void radix__set_pte_at(struct mm_struct *mm, unsigned long addr,
>  	 * from ptesync, it should probably go into update_mmu_cache, rather
>  	 * than set_pte_at (which is used to set ptes unrelated to faults).
>  	 *
> -	 * Spurious faults to vmalloc region are not tolerated, so there is
> -	 * a ptesync in flush_cache_vmap.
> +	 * Spurious faults from the kernel memory are not tolerated, so there
> +	 * is a ptesync in flush_cache_vmap, and __map_kernel_page() follows
> +	 * the pte update sequence from ISA Book III 6.10 Translation Table
> +	 * Update Synchronization Requirements.
>  	 */
>  }
>  
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index 3adcf730f478..1d5eec847b88 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -108,7 +108,7 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
>  
>  set_the_pte:
>  	set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
> -	smp_wmb();
> +	asm volatile("ptesync": : :"memory");
>  	return 0;
>  }
>  
> @@ -168,7 +168,7 @@ static int __map_kernel_page(unsigned long ea, unsigned long pa,
>  
>  set_the_pte:
>  	set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
> -	smp_wmb();
> +	asm volatile("ptesync": : :"memory");
>  	return 0;
>  }
>  
> -- 
> 2.25.1
> 
> 

^ permalink raw reply

* Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C
From: Nicholas Piggin @ 2021-02-04  3:27 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev; +Cc: Michal Suchanek
In-Reply-To: <37c2a8e1-2c4b-2e55-6753-0a804ce00cac@csgroup.eu>

Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:
> 
> 
> Le 25/02/2020 à 18:35, Nicholas Piggin a écrit :
>> Implement the bulk of interrupt return logic in C. The asm return code
>> must handle a few cases: restoring full GPRs, and emulating stack store.
>> 
> 
> 
>> +notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs, unsigned long msr)
>> +{
>> +	unsigned long *ti_flagsp = &current_thread_info()->flags;
>> +	unsigned long flags;
>> +
>> +	if (IS_ENABLED(CONFIG_PPC_BOOK3S) && unlikely(!(regs->msr & MSR_RI)))
>> +		unrecoverable_exception(regs);
>> +	BUG_ON(regs->msr & MSR_PR);
>> +	BUG_ON(!FULL_REGS(regs));
>> +
>> +	local_irq_save(flags);
>> +
>> +	if (regs->softe == IRQS_ENABLED) {
>> +		/* Returning to a kernel context with local irqs enabled. */
>> +		WARN_ON_ONCE(!(regs->msr & MSR_EE));
>> +again:
>> +		if (IS_ENABLED(CONFIG_PREEMPT)) {
>> +			/* Return to preemptible kernel context */
>> +			if (unlikely(*ti_flagsp & _TIF_NEED_RESCHED)) {
>> +				if (preempt_count() == 0)
>> +					preempt_schedule_irq();
>> +			}
>> +		}
>> +
>> +		trace_hardirqs_on();
>> +		__hard_EE_RI_disable();
>> +		if (unlikely(lazy_irq_pending())) {
>> +			__hard_RI_enable();
>> +			irq_soft_mask_set(IRQS_ALL_DISABLED);
>> +			trace_hardirqs_off();
>> +			local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
>> +			/*
>> +			 * Can't local_irq_enable in case we are in interrupt
>> +			 * context. Must replay directly.
>> +			 */
>> +			replay_soft_interrupts();
>> +			irq_soft_mask_set(flags);
>> +			/* Took an interrupt, may have more exit work to do. */
>> +			goto again;
>> +		}
>> +		local_paca->irq_happened = 0;
>> +		irq_soft_mask_set(IRQS_ENABLED);
>> +	} else {
>> +		/* Returning to a kernel context with local irqs disabled. */
>> +		trace_hardirqs_on();
>> +		__hard_EE_RI_disable();
>> +		if (regs->msr & MSR_EE)
>> +			local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS;
>> +	}
>> +
>> +
>> +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>> +	local_paca->tm_scratch = regs->msr;
>> +#endif
>> +
>> +	/*
>> +	 * We don't need to restore AMR on the way back to userspace for KUAP.
>> +	 * The value of AMR only matters while we're in the kernel.
>> +	 */
>> +	kuap_restore_amr(regs);
> 
> Is that correct to restore KUAP state here ? Shouldn't we have it at lower level in assembly ?
> 
> Isn't there a risk that someone manages to call interrupt_exit_kernel_prepare() or the end of it in 
> a way or another, and get the previous KUAP state restored by this way ?

I'm not sure if there much more risk if it's here rather than the 
instruction being in another place in the code.

There's a lot of user access around the kernel too if you want to find a 
gadget to unlock KUAP then I suppose there is a pretty large attack
surface.

> Also, it looks a bit strange to have kuap_save_amr_and_lock() done at lowest level in assembly, and 
> kuap_restore_amr() done in upper level. That looks unbalanced.

I'd like to bring the entry assembly into C.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] mm/memory.c: Remove pte_sw_mkyoung()
From: Nicholas Piggin @ 2021-02-04  3:23 UTC (permalink / raw)
  To: Andrew Morton, Christophe Leroy
  Cc: linux-arch, Thomas Bogendoerfer, Jia He, linux-kernel, Bibo Mao,
	linux-mips, linux-mm, linuxppc-dev
In-Reply-To: <20210203164630.ada46d0c84e0e9f0a474b283@linux-foundation.org>

Excerpts from Andrew Morton's message of February 4, 2021 10:46 am:
> On Wed,  3 Feb 2021 10:19:44 +0000 (UTC) Christophe Leroy <christophe.leroy@csgroup.eu> wrote:
> 
>> Commit 83d116c53058 ("mm: fix double page fault on arm64 if PTE_AF
>> is cleared") introduced arch_faults_on_old_pte() helper to identify
>> platforms that don't set page access bit in HW and require a page
>> fault to set it.
>> 
>> Commit 44bf431b47b4 ("mm/memory.c: Add memory read privilege on page
>> fault handling") added pte_sw_mkyoung() which is yet another way to
>> manage platforms that don't set page access bit in HW and require a
>> page fault to set it.
>> 
>> Remove that pte_sw_mkyoung() helper and use the already existing
>> arch_faults_on_old_pte() helper together with pte_mkyoung() instead.
> 
> This conflicts with mm/memory.c changes in linux-next.  In
> do_set_pte().  Please check my efforts:

I wanted to just get rid of it completely --

https://marc.info/?l=linux-mm&m=160860750115163&w=2

Waiting for MIPs to get that patch mentioned merged or nacked but
as yet seems to be no response from maintainers.

https://lore.kernel.org/linux-arch/20201019081257.32127-1-huangpei@loongson.cn/

Thanks,
Nick

> 
> --- a/arch/mips/include/asm/pgtable.h~mm-memoryc-remove-pte_sw_mkyoung
> +++ a/arch/mips/include/asm/pgtable.h
> @@ -406,8 +406,6 @@ static inline pte_t pte_mkyoung(pte_t pt
>  	return pte;
>  }
>  
> -#define pte_sw_mkyoung	pte_mkyoung
> -
>  #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
>  static inline int pte_huge(pte_t pte)	{ return pte_val(pte) & _PAGE_HUGE; }
>  
> --- a/include/linux/pgtable.h~mm-memoryc-remove-pte_sw_mkyoung
> +++ a/include/linux/pgtable.h
> @@ -424,22 +424,6 @@ static inline void ptep_set_wrprotect(st
>  }
>  #endif
>  
> -/*
> - * On some architectures hardware does not set page access bit when accessing
> - * memory page, it is responsibilty of software setting this bit. It brings
> - * out extra page fault penalty to track page access bit. For optimization page
> - * access bit can be set during all page fault flow on these arches.
> - * To be differentiate with macro pte_mkyoung, this macro is used on platforms
> - * where software maintains page access bit.
> - */
> -#ifndef pte_sw_mkyoung
> -static inline pte_t pte_sw_mkyoung(pte_t pte)
> -{
> -	return pte;
> -}
> -#define pte_sw_mkyoung	pte_sw_mkyoung
> -#endif
> -
>  #ifndef pte_savedwrite
>  #define pte_savedwrite pte_write
>  #endif
> --- a/mm/memory.c~mm-memoryc-remove-pte_sw_mkyoung
> +++ a/mm/memory.c
> @@ -2902,7 +2902,8 @@ static vm_fault_t wp_page_copy(struct vm
>  		}
>  		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>  		entry = mk_pte(new_page, vma->vm_page_prot);
> -		entry = pte_sw_mkyoung(entry);
> +		if (arch_faults_on_old_pte())
> +			entry = pte_mkyoung(entry);
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>  
>  		/*
> @@ -3560,7 +3561,8 @@ static vm_fault_t do_anonymous_page(stru
>  	__SetPageUptodate(page);
>  
>  	entry = mk_pte(page, vma->vm_page_prot);
> -	entry = pte_sw_mkyoung(entry);
> +	if (arch_faults_on_old_pte())
> +		entry = pte_mkyoung(entry);
>  	if (vma->vm_flags & VM_WRITE)
>  		entry = pte_mkwrite(pte_mkdirty(entry));
>  
> @@ -3745,8 +3747,8 @@ void do_set_pte(struct vm_fault *vmf, st
>  
>  	if (prefault && arch_wants_old_prefaulted_pte())
>  		entry = pte_mkold(entry);
> -	else
> -		entry = pte_sw_mkyoung(entry);
> +	else if (arch_faults_on_old_pte())
> +		entry = pte_mkyoung(entry);
>  
>  	if (write)
>  		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> _
> 
> 
> 

^ permalink raw reply

* Re: [PATCH] powerpc/perf: Record counter overflow always if SAMPLE_IP is unset
From: Michael Ellerman @ 2021-02-04  2:55 UTC (permalink / raw)
  To: Athira Rajeev; +Cc: maddy, linuxppc-dev
In-Reply-To: <1612342484-1404-1-git-send-email-atrajeev@linux.vnet.ibm.com>

Athira Rajeev <atrajeev@linux.vnet.ibm.com> writes:
> While sampling for marked events, currently we record the sample only
> if the SIAR valid bit of Sampled Instruction Event Register (SIER) is
> set. SIAR_VALID bit is used for fetching the instruction address from
> Sampled Instruction Address Register(SIAR). But there are some usecases,
> where the user is interested only in the PMU stats at each counter
> overflow and the exact IP of the overflow event is not required.
> Dropping SIAR invalid samples will fail to record some of the counter
> overflows in such cases.
>
> Example of such usecase is dumping the PMU stats (event counts)
> after some regular amount of instructions/events from the userspace
> (ex: via ptrace). Here counter overflow is indicated to userspace via
> signal handler, and captured by monitoring and enabling I/O
> signaling on the event file descriptor. In these cases, we expect to
> get sample/overflow indication after each specified sample_period.
>
> Perf event attribute will not have PERF_SAMPLE_IP set in the
> sample_type if exact IP of the overflow event is not requested. So
> while profiling if SAMPLE_IP is not set, just record the counter overflow
> irrespective of SIAR_VALID check.
>
> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
> ---
>  arch/powerpc/perf/core-book3s.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 28206b1fe172..bb4828a05e4d 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -2166,10 +2166,16 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
>  	 * address even when freeze on supervisor state (kernel) is set in
>  	 * MMCR2. Check attr.exclude_kernel and address to drop the sample in
>  	 * these cases.
> +	 *
> +	 * If address is not requested in the sample
> +	 * via PERF_SAMPLE_IP, just record that sample
> +	 * irrespective of SIAR valid check.
>  	 */
> -	if (event->attr.exclude_kernel && record)
> -		if (is_kernel_addr(mfspr(SPRN_SIAR)))
> +	if (event->attr.exclude_kernel && record) {
> +		if (is_kernel_addr(mfspr(SPRN_SIAR)) && (event->attr.sample_type & PERF_SAMPLE_IP))
>  			record = 0;
> +	} else if (!record && !(event->attr.sample_type & PERF_SAMPLE_IP))
> +		record = 1;

This seems wrong, you're assuming that record was set previously to
= siar_valid(), but it may be that record is still 0 from the
initialisation and we weren't going to record.

Don't we need something more like this?

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 9fd06010e8b6..e4e8a017d339 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2136,7 +2136,12 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
 			left += period;
 			if (left <= 0)
 				left = period;
-			record = siar_valid(regs);
+
+			if (event->attr.sample_type & PERF_SAMPLE_IP)
+				record = siar_valid(regs);
+			else
+				record = 1;
+
 			event->hw.last_period = event->hw.sample_period;
 		}
 		if (left < 0x80000000LL)
@@ -2154,9 +2159,10 @@ static void record_and_restart(struct perf_event *event, unsigned long val,
 	 * MMCR2. Check attr.exclude_kernel and address to drop the sample in
 	 * these cases.
 	 */
-	if (event->attr.exclude_kernel && record)
-		if (is_kernel_addr(mfspr(SPRN_SIAR)))
-			record = 0;
+	if (event->attr.exclude_kernel &&
+	    (event->attr.sample_type & PERF_SAMPLE_IP) &&
+	    is_kernel_addr(mfspr(SPRN_SIAR)))
+		record = 0;
 
 	/*
 	 * Finally record data if requested.



cheers

^ permalink raw reply related

* Re: [PATCH v2 1/3] powerpc: sstep: Fix load and update emulation
From: Michael Ellerman @ 2021-02-04  0:53 UTC (permalink / raw)
  To: Sandipan Das, Naveen N. Rao
  Cc: ravi.bangoria, ananth, jniethe5, paulus, linuxppc-dev, dja
In-Reply-To: <8e675de3-51df-0711-c90c-f450a1585f65@linux.ibm.com>

Sandipan Das <sandipan@linux.ibm.com> writes:
> On 03/02/21 3:19 pm, Naveen N. Rao wrote:
>> [...]
>> 
>> Wouldn't it be easier to just do the below at the end? Or, am I missing something?
>> 
>> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
>> index ede093e9623472..a2d726d2a5e9d1 100644
>> --- a/arch/powerpc/lib/sstep.c
>> +++ b/arch/powerpc/lib/sstep.c
>> @@ -2980,6 +2980,10 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
>>         }
>>  #endif /* CONFIG_VSX */
>> 
>> +       if (GETTYPE(op->type) == LOAD && (op->type & UPDATE) &&
>> +                       (ra == 0 || ra == rd))
>> +               goto unknown_opcode;
>> +
>>         return 0;
>> 
>>   logical_done:
>> 
>
> This looks good?
>
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index e96cff845ef7..a9c149bfd2f5 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -3017,6 +3017,21 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
>  
>         }
>  
> +       if (op->type & UPDATE) {
> +               if (ra == rd && GETTYPE(op->type) == LOAD)
> +                       goto unknown_opcode;
> +               else if (ra == 0)
> +                       switch(GETTYPE(op->type)) {
> +                       case LOAD:
> +                       case STORE:
> +#ifdef CONFIG_PPC_FPU
> +                       case LOAD_FP:
> +                       case STORE_FP:
> +#endif

Why make it conditional?

cheers

^ permalink raw reply

* Re: [PATCH] mm/memory.c: Remove pte_sw_mkyoung()
From: Andrew Morton @ 2021-02-04  0:46 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-arch, Thomas Bogendoerfer, Jia He, linux-kernel, Bibo Mao,
	linux-mips, linux-mm, linuxppc-dev
In-Reply-To: <f302ef92c48d1f08a0459aaee1c568ca11213814.1612345700.git.christophe.leroy@csgroup.eu>

On Wed,  3 Feb 2021 10:19:44 +0000 (UTC) Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> Commit 83d116c53058 ("mm: fix double page fault on arm64 if PTE_AF
> is cleared") introduced arch_faults_on_old_pte() helper to identify
> platforms that don't set page access bit in HW and require a page
> fault to set it.
> 
> Commit 44bf431b47b4 ("mm/memory.c: Add memory read privilege on page
> fault handling") added pte_sw_mkyoung() which is yet another way to
> manage platforms that don't set page access bit in HW and require a
> page fault to set it.
> 
> Remove that pte_sw_mkyoung() helper and use the already existing
> arch_faults_on_old_pte() helper together with pte_mkyoung() instead.

This conflicts with mm/memory.c changes in linux-next.  In
do_set_pte().  Please check my efforts:

--- a/arch/mips/include/asm/pgtable.h~mm-memoryc-remove-pte_sw_mkyoung
+++ a/arch/mips/include/asm/pgtable.h
@@ -406,8 +406,6 @@ static inline pte_t pte_mkyoung(pte_t pt
 	return pte;
 }
 
-#define pte_sw_mkyoung	pte_mkyoung
-
 #ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
 static inline int pte_huge(pte_t pte)	{ return pte_val(pte) & _PAGE_HUGE; }
 
--- a/include/linux/pgtable.h~mm-memoryc-remove-pte_sw_mkyoung
+++ a/include/linux/pgtable.h
@@ -424,22 +424,6 @@ static inline void ptep_set_wrprotect(st
 }
 #endif
 
-/*
- * On some architectures hardware does not set page access bit when accessing
- * memory page, it is responsibilty of software setting this bit. It brings
- * out extra page fault penalty to track page access bit. For optimization page
- * access bit can be set during all page fault flow on these arches.
- * To be differentiate with macro pte_mkyoung, this macro is used on platforms
- * where software maintains page access bit.
- */
-#ifndef pte_sw_mkyoung
-static inline pte_t pte_sw_mkyoung(pte_t pte)
-{
-	return pte;
-}
-#define pte_sw_mkyoung	pte_sw_mkyoung
-#endif
-
 #ifndef pte_savedwrite
 #define pte_savedwrite pte_write
 #endif
--- a/mm/memory.c~mm-memoryc-remove-pte_sw_mkyoung
+++ a/mm/memory.c
@@ -2902,7 +2902,8 @@ static vm_fault_t wp_page_copy(struct vm
 		}
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
-		entry = pte_sw_mkyoung(entry);
+		if (arch_faults_on_old_pte())
+			entry = pte_mkyoung(entry);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 
 		/*
@@ -3560,7 +3561,8 @@ static vm_fault_t do_anonymous_page(stru
 	__SetPageUptodate(page);
 
 	entry = mk_pte(page, vma->vm_page_prot);
-	entry = pte_sw_mkyoung(entry);
+	if (arch_faults_on_old_pte())
+		entry = pte_mkyoung(entry);
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
@@ -3745,8 +3747,8 @@ void do_set_pte(struct vm_fault *vmf, st
 
 	if (prefault && arch_wants_old_prefaulted_pte())
 		entry = pte_mkold(entry);
-	else
-		entry = pte_sw_mkyoung(entry);
+	else if (arch_faults_on_old_pte())
+		entry = pte_mkyoung(entry);
 
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
_


^ permalink raw reply

* [PATCH v2] powerpc/64s: Fix pte update for kernel memory on radix
From: Jordan Niethe @ 2021-02-03 23:59 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Jordan Niethe, npiggin, cmr

When adding a pte a ptesync is needed to order the update of the pte
with subsequent accesses otherwise a spurious fault may be raised.

radix__set_pte_at() does not do this for performance gains. For
non-kernel memory this is not an issue as any faults of this kind are
corrected by the page fault handler.  For kernel memory these faults are
not handled.  The current solution is that there is a ptesync in
flush_cache_vmap() which should be called when mapping from the vmalloc
region.

However, map_kernel_page() does not call flush_cache_vmap(). This is
troublesome in particular for code patching with Strict RWX on radix. In
do_patch_instruction() the page frame that contains the instruction to
be patched is mapped and then immediately patched. With no ordering or
synchronization between setting up the pte and writing to the page it is
possible for faults.

As the code patching is done using __put_user_asm_goto() the resulting
fault is obscured - but using a normal store instead it can be seen:

[  418.498768][  T757] BUG: Unable to handle kernel data access on write at 0xc008000008f24a3c
[  418.498790][  T757] Faulting instruction address: 0xc00000000008bd74
[  418.498805][  T757] Oops: Kernel access of bad area, sig: 11 [#1]
[  418.498828][  T757] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[  418.498843][  T757] Modules linked in: nop_module(PO+) [last unloaded: nop_module]
[  418.498872][  T757] CPU: 4 PID: 757 Comm: sh Tainted: P           O      5.10.0-rc5-01361-ge3c1b78c8440-dirty #43
[  418.498936][  T757] NIP:  c00000000008bd74 LR: c00000000008bd50 CTR: c000000000025810
[  418.498979][  T757] REGS: c000000016f634a0 TRAP: 0300   Tainted: P           O       (5.10.0-rc5-01361-ge3c1b78c8440-dirty)
[  418.499033][  T757] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44002884  XER: 00000000
[  418.499084][  T757] CFAR: c00000000007c68c DAR: c008000008f24a3c DSISR: 42000000 IRQMASK: 1

This results in the kind of issue reported here:
https://lore.kernel.org/linuxppc-dev/15AC5B0E-A221-4B8C-9039-FA96B8EF7C88@lca.pw/

Chris Riedl suggested a reliable way to reproduce the issue:
$ mount -t debugfs none /sys/kernel/debug
$ (while true; do echo function > /sys/kernel/debug/tracing/current_tracer ; echo nop > /sys/kernel/debug/tracing/current_tracer ; done)&

Turning ftrace on and off does a large amount of code patching which in
usually less then 5min will crash giving a trace like:

[  146.668988][  T809] ftrace-powerpc: (____ptrval____): replaced (4b473b11) != old (60000000)
[  146.668995][  T809] ------------[ ftrace bug ]------------
[  146.669031][  T809] ftrace failed to modify
[  146.669039][  T809] [<c000000000bf8e5c>] napi_busy_loop+0xc/0x390
[  146.669045][  T809]  actual:   11:3b:47:4b
[  146.669070][  T809] Setting ftrace call site to call ftrace function
[  146.669075][  T809] ftrace record flags: 80000001
[  146.669081][  T809]  (1)
[  146.669081][  T809]  expected tramp: c00000000006c96c
[  146.669096][  T809] ------------[ cut here ]------------
[  146.669104][  T809] WARNING: CPU: 4 PID: 809 at kernel/trace/ftrace.c:2065 ftrace_bug+0x28c/0x2e8
[  146.669109][  T809] Modules linked in: nop_module(PO-) [last unloaded: nop_module]
[  146.669130][  T809] CPU: 4 PID: 809 Comm: sh Tainted: P           O      5.10.0-rc5-01360-gf878ccaf250a #1
[  146.669136][  T809] NIP:  c00000000024f334 LR: c00000000024f330 CTR: c0000000001a5af0
[  146.669142][  T809] REGS: c000000004c8b760 TRAP: 0700   Tainted: P           O       (5.10.0-rc5-01360-gf878ccaf250a)
[  146.669147][  T809] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28008848  XER: 20040000
[  146.669208][  T809] CFAR: c0000000001a9c98 IRQMASK: 0
[  146.669208][  T809] GPR00: c00000000024f330 c000000004c8b9f0 c000000002770600 0000000000000022
[  146.669208][  T809] GPR04: 00000000ffff7fff c000000004c8b6d0 0000000000000027 c0000007fe9bcdd8
[  146.669208][  T809] GPR08: 0000000000000023 ffffffffffffffd8 0000000000000027 c000000002613118
[  146.669208][  T809] GPR12: 0000000000008000 c0000007fffdca00 0000000000000000 0000000000000000
[  146.669208][  T809] GPR16: 0000000023ec37c5 0000000000000000 0000000000000000 0000000000000008
[  146.669208][  T809] GPR20: c000000004c8bc90 c0000000027a2d20 c000000004c8bcd0 c000000002612fe8
[  146.669208][  T809] GPR24: 0000000000000038 0000000000000030 0000000000000028 0000000000000020
[  146.669208][  T809] GPR28: c000000000ff1b68 c000000000bf8e5c c00000000312f700 c000000000fbb9b0
[  146.669384][  T809] NIP [c00000000024f334] ftrace_bug+0x28c/0x2e8
[  146.669391][  T809] LR [c00000000024f330] ftrace_bug+0x288/0x2e8
[  146.669396][  T809] Call Trace:
[  146.669403][  T809] [c000000004c8b9f0] [c00000000024f330] ftrace_bug+0x288/0x2e8 (unreliable)
[  146.669418][  T809] [c000000004c8ba80] [c000000000248778] ftrace_modify_all_code+0x168/0x210
[  146.669429][  T809] [c000000004c8bab0] [c00000000006c528] arch_ftrace_update_code+0x18/0x30
[  146.669440][  T809] [c000000004c8bad0] [c000000000248954] ftrace_run_update_code+0x44/0xc0
[  146.669451][  T809] [c000000004c8bb00] [c00000000024dc88] ftrace_startup+0xf8/0x1c0
[  146.669461][  T809] [c000000004c8bb40] [c00000000024dd9c] register_ftrace_function+0x4c/0xc0
[  146.669472][  T809] [c000000004c8bb70] [c00000000026e750] function_trace_init+0x80/0xb0
[  146.669484][  T809] [c000000004c8bba0] [c000000000266b84] tracing_set_tracer+0x2a4/0x4f0
[  146.669495][  T809] [c000000004c8bc70] [c000000000266ea4] tracing_set_trace_write+0xd4/0x130
[  146.669506][  T809] [c000000004c8bd20] [c000000000422790] vfs_write+0xf0/0x330
[  146.669518][  T809] [c000000004c8bd70] [c000000000422bb4] ksys_write+0x84/0x140
[  146.669529][  T809] [c000000004c8bdc0] [c00000000003499c] system_call_exception+0x14c/0x230
[  146.669540][  T809] [c000000004c8be20] [c00000000000d860] system_call_common+0xf0/0x27c
[  146.669549][  T809] Instruction dump:
[  146.669558][  T809] 48000014 3c62fe88 38631718 4bf5a941 60000000 7fc3f378 4bff877d 7c641b78
[  146.669598][  T809] 3c62fe88 38631730 4bf5a925 60000000 <0fe00000> 38210090 3d22fd90 39000001
[  146.669638][  T809] ---[ end trace 5ea7076ea28c0fbd ]---

To fix this when updating kernel memory ptes using ptesync.

Fixes: 37bc3e5fd764 ("powerpc/lib/code-patching: Use alternate map for patch_instruction()")
Fixes: f1cb8f9beba8 ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
---
v2: Only ptesync is needed
---
 arch/powerpc/include/asm/book3s/64/radix.h | 6 ++++--
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 4 ++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index c7813dc628fc..59cab558e2f0 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -222,8 +222,10 @@ static inline void radix__set_pte_at(struct mm_struct *mm, unsigned long addr,
 	 * from ptesync, it should probably go into update_mmu_cache, rather
 	 * than set_pte_at (which is used to set ptes unrelated to faults).
 	 *
-	 * Spurious faults to vmalloc region are not tolerated, so there is
-	 * a ptesync in flush_cache_vmap.
+	 * Spurious faults from the kernel memory are not tolerated, so there
+	 * is a ptesync in flush_cache_vmap, and __map_kernel_page() follows
+	 * the pte update sequence from ISA Book III 6.10 Translation Table
+	 * Update Synchronization Requirements.
 	 */
 }
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 3adcf730f478..1d5eec847b88 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -108,7 +108,7 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
 
 set_the_pte:
 	set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
-	smp_wmb();
+	asm volatile("ptesync": : :"memory");
 	return 0;
 }
 
@@ -168,7 +168,7 @@ static int __map_kernel_page(unsigned long ea, unsigned long pa,
 
 set_the_pte:
 	set_pte_at(&init_mm, ea, ptep, pfn_pte(pfn, flags));
-	smp_wmb();
+	asm volatile("ptesync": : :"memory");
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH RFC v1 6/6] xen-swiotlb: enable 64-bit xen-swiotlb
From: Dongli Zhang @ 2021-02-03 23:37 UTC (permalink / raw)
  To: dri-devel, intel-gfx, iommu, linux-mips, linux-mmc, linux-pci,
	linuxppc-dev, nouveau, x86, xen-devel
  Cc: ulf.hansson, airlied, joonas.lahtinen, adrian.hunter, paulus, hpa,
	mingo, m.szyprowski, sstabellini, joe.jin, hch, peterz, mingo,
	matthew.auld, bskeggs, thomas.lendacky, konrad.wilk, jani.nikula,
	bp, rodrigo.vivi, bhelgaas, boris.ostrovsky, chris, jgross,
	tsbogend, linux-kernel, tglx, bauerman, daniel, akpm,
	robin.murphy, rppt
In-Reply-To: <20210203233709.19819-1-dongli.zhang@oracle.com>

This patch is to enable the 64-bit xen-swiotlb buffer.

For Xen PVM DMA address, the 64-bit device will be able to allocate from
64-bit swiotlb buffer.

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 drivers/xen/swiotlb-xen.c | 117 ++++++++++++++++++++++++--------------
 1 file changed, 74 insertions(+), 43 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index e18cae693cdc..c9ab07809e32 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -108,27 +108,36 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
 	unsigned long bfn = XEN_PFN_DOWN(dma_to_phys(dev, dma_addr));
 	unsigned long xen_pfn = bfn_to_local_pfn(bfn);
 	phys_addr_t paddr = (phys_addr_t)xen_pfn << XEN_PAGE_SHIFT;
+	int i;
 
 	/* If the address is outside our domain, it CAN
 	 * have the same virtual address as another address
 	 * in our domain. Therefore _only_ check address within our domain.
 	 */
-	if (pfn_valid(PFN_DOWN(paddr))) {
-		return paddr >= virt_to_phys(xen_io_tlb_start[SWIOTLB_LO]) &&
-		       paddr < virt_to_phys(xen_io_tlb_end[SWIOTLB_LO]);
-	}
+	if (!pfn_valid(PFN_DOWN(paddr)))
+		return 0;
+
+	for (i = 0; i < swiotlb_nr; i++)
+		if (paddr >= virt_to_phys(xen_io_tlb_start[i]) &&
+		    paddr < virt_to_phys(xen_io_tlb_end[i]))
+			return 1;
+
 	return 0;
 }
 
 static int
-xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
+xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs,
+		  enum swiotlb_t type)
 {
 	int i, rc;
 	int dma_bits;
 	dma_addr_t dma_handle;
 	phys_addr_t p = virt_to_phys(buf);
 
-	dma_bits = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT) + PAGE_SHIFT;
+	if (type == SWIOTLB_HI)
+		dma_bits = max_dma_bits[SWIOTLB_HI];
+	else
+		dma_bits = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT) + PAGE_SHIFT;
 
 	i = 0;
 	do {
@@ -139,7 +148,7 @@ xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
 				p + (i << IO_TLB_SHIFT),
 				get_order(slabs << IO_TLB_SHIFT),
 				dma_bits, &dma_handle);
-		} while (rc && dma_bits++ < max_dma_bits[SWIOTLB_LO]);
+		} while (rc && dma_bits++ < max_dma_bits[type]);
 		if (rc)
 			return rc;
 
@@ -147,16 +156,17 @@ xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
 	} while (i < nslabs);
 	return 0;
 }
-static unsigned long xen_set_nslabs(unsigned long nr_tbl)
+
+static unsigned long xen_set_nslabs(unsigned long nr_tbl, enum swiotlb_t type)
 {
 	if (!nr_tbl) {
-		xen_io_tlb_nslabs[SWIOTLB_LO] = (64 * 1024 * 1024 >> IO_TLB_SHIFT);
-		xen_io_tlb_nslabs[SWIOTLB_LO] = ALIGN(xen_io_tlb_nslabs[SWIOTLB_LO],
+		xen_io_tlb_nslabs[type] = (64 * 1024 * 1024 >> IO_TLB_SHIFT);
+		xen_io_tlb_nslabs[type] = ALIGN(xen_io_tlb_nslabs[type],
 						      IO_TLB_SEGSIZE);
 	} else
-		xen_io_tlb_nslabs[SWIOTLB_LO] = nr_tbl;
+		xen_io_tlb_nslabs[type] = nr_tbl;
 
-	return xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
+	return xen_io_tlb_nslabs[type] << IO_TLB_SHIFT;
 }
 
 enum xen_swiotlb_err {
@@ -180,23 +190,24 @@ static const char *xen_swiotlb_error(enum xen_swiotlb_err err)
 	}
 	return "";
 }
-int __ref xen_swiotlb_init(int verbose, bool early)
+
+static int xen_swiotlb_init_type(int verbose, bool early, enum swiotlb_t type)
 {
 	unsigned long bytes, order;
 	int rc = -ENOMEM;
 	enum xen_swiotlb_err m_ret = XEN_SWIOTLB_UNKNOWN;
 	unsigned int repeat = 3;
 
-	xen_io_tlb_nslabs[SWIOTLB_LO] = swiotlb_nr_tbl(SWIOTLB_LO);
+	xen_io_tlb_nslabs[type] = swiotlb_nr_tbl(type);
 retry:
-	bytes = xen_set_nslabs(xen_io_tlb_nslabs[SWIOTLB_LO]);
-	order = get_order(xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);
+	bytes = xen_set_nslabs(xen_io_tlb_nslabs[type], type);
+	order = get_order(xen_io_tlb_nslabs[type] << IO_TLB_SHIFT);
 
 	/*
 	 * IO TLB memory already allocated. Just use it.
 	 */
-	if (io_tlb_start[SWIOTLB_LO] != 0) {
-		xen_io_tlb_start[SWIOTLB_LO] = phys_to_virt(io_tlb_start[SWIOTLB_LO]);
+	if (io_tlb_start[type] != 0) {
+		xen_io_tlb_start[type] = phys_to_virt(io_tlb_start[type]);
 		goto end;
 	}
 
@@ -204,81 +215,95 @@ int __ref xen_swiotlb_init(int verbose, bool early)
 	 * Get IO TLB memory from any location.
 	 */
 	if (early) {
-		xen_io_tlb_start[SWIOTLB_LO] = memblock_alloc(PAGE_ALIGN(bytes),
+		xen_io_tlb_start[type] = memblock_alloc(PAGE_ALIGN(bytes),
 						  PAGE_SIZE);
-		if (!xen_io_tlb_start[SWIOTLB_LO])
+		if (!xen_io_tlb_start[type])
 			panic("%s: Failed to allocate %lu bytes align=0x%lx\n",
 			      __func__, PAGE_ALIGN(bytes), PAGE_SIZE);
 	} else {
 #define SLABS_PER_PAGE (1 << (PAGE_SHIFT - IO_TLB_SHIFT))
 #define IO_TLB_MIN_SLABS ((1<<20) >> IO_TLB_SHIFT)
 		while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
-			xen_io_tlb_start[SWIOTLB_LO] = (void *)xen_get_swiotlb_free_pages(order);
-			if (xen_io_tlb_start[SWIOTLB_LO])
+			xen_io_tlb_start[type] = (void *)xen_get_swiotlb_free_pages(order);
+			if (xen_io_tlb_start[type])
 				break;
 			order--;
 		}
 		if (order != get_order(bytes)) {
 			pr_warn("Warning: only able to allocate %ld MB for software IO TLB\n",
 				(PAGE_SIZE << order) >> 20);
-			xen_io_tlb_nslabs[SWIOTLB_LO] = SLABS_PER_PAGE << order;
-			bytes = xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
+			xen_io_tlb_nslabs[type] = SLABS_PER_PAGE << order;
+			bytes = xen_io_tlb_nslabs[type] << IO_TLB_SHIFT;
 		}
 	}
-	if (!xen_io_tlb_start[SWIOTLB_LO]) {
+	if (!xen_io_tlb_start[type]) {
 		m_ret = XEN_SWIOTLB_ENOMEM;
 		goto error;
 	}
 	/*
 	 * And replace that memory with pages under 4GB.
 	 */
-	rc = xen_swiotlb_fixup(xen_io_tlb_start[SWIOTLB_LO],
+	rc = xen_swiotlb_fixup(xen_io_tlb_start[type],
 			       bytes,
-			       xen_io_tlb_nslabs[SWIOTLB_LO]);
+			       xen_io_tlb_nslabs[type],
+			       type);
 	if (rc) {
 		if (early)
-			memblock_free(__pa(xen_io_tlb_start[SWIOTLB_LO]),
+			memblock_free(__pa(xen_io_tlb_start[type]),
 				      PAGE_ALIGN(bytes));
 		else {
-			free_pages((unsigned long)xen_io_tlb_start[SWIOTLB_LO], order);
-			xen_io_tlb_start[SWIOTLB_LO] = NULL;
+			free_pages((unsigned long)xen_io_tlb_start[type], order);
+			xen_io_tlb_start[type] = NULL;
 		}
 		m_ret = XEN_SWIOTLB_EFIXUP;
 		goto error;
 	}
 	if (early) {
-		if (swiotlb_init_with_tbl(xen_io_tlb_start[SWIOTLB_LO],
-					  xen_io_tlb_nslabs[SWIOTLB_LO],
-					  SWIOTLB_LO, verbose))
+		if (swiotlb_init_with_tbl(xen_io_tlb_start[type],
+					  xen_io_tlb_nslabs[type],
+					  type, verbose))
 			panic("Cannot allocate SWIOTLB buffer");
 		rc = 0;
 	} else
-		rc = swiotlb_late_init_with_tbl(xen_io_tlb_start[SWIOTLB_LO],
-						xen_io_tlb_nslabs[SWIOTLB_LO],
-						SWIOTLB_LO);
+		rc = swiotlb_late_init_with_tbl(xen_io_tlb_start[type],
+						xen_io_tlb_nslabs[type],
+						type);
 
 end:
-	xen_io_tlb_end[SWIOTLB_LO] = xen_io_tlb_start[SWIOTLB_LO] + bytes;
+	xen_io_tlb_end[type] = xen_io_tlb_start[type] + bytes;
 	if (!rc)
-		swiotlb_set_max_segment(PAGE_SIZE, SWIOTLB_LO);
+		swiotlb_set_max_segment(PAGE_SIZE, type);
 
 	return rc;
 error:
 	if (repeat--) {
-		xen_io_tlb_nslabs[SWIOTLB_LO] = max(1024UL, /* Min is 2MB */
-					(xen_io_tlb_nslabs[SWIOTLB_LO] >> 1));
+		xen_io_tlb_nslabs[type] = max(1024UL, /* Min is 2MB */
+					(xen_io_tlb_nslabs[type] >> 1));
 		pr_info("Lowering to %luMB\n",
-			(xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT) >> 20);
+			(xen_io_tlb_nslabs[type] << IO_TLB_SHIFT) >> 20);
 		goto retry;
 	}
 	pr_err("%s (rc:%d)\n", xen_swiotlb_error(m_ret), rc);
 	if (early)
 		panic("%s (rc:%d)", xen_swiotlb_error(m_ret), rc);
 	else
-		free_pages((unsigned long)xen_io_tlb_start[SWIOTLB_LO], order);
+		free_pages((unsigned long)xen_io_tlb_start[type], order);
 	return rc;
 }
 
+int __ref xen_swiotlb_init(int verbose, bool early)
+{
+	int i, rc;
+
+	for (i = 0; i < swiotlb_nr; i++) {
+		rc = xen_swiotlb_init_type(verbose, early, i);
+		if (rc)
+			return rc;
+	}
+
+	return 0;
+}
+
 static void *
 xen_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
 			   dma_addr_t *dma_handle, gfp_t flags,
@@ -566,7 +591,13 @@ xen_swiotlb_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
 static int
 xen_swiotlb_dma_supported(struct device *hwdev, u64 mask)
 {
-	return xen_virt_to_bus(hwdev, xen_io_tlb_end[SWIOTLB_LO] - 1) <= mask;
+	int i;
+
+	for (i = 0; i < swiotlb_nr; i++)
+		if (xen_virt_to_bus(hwdev, xen_io_tlb_end[i] - 1) <= mask)
+			return true;
+
+	return false;
 }
 
 const struct dma_map_ops xen_swiotlb_dma_ops = {
-- 
2.17.1


^ permalink raw reply related

* [PATCH RFC v1 5/6] xen-swiotlb: convert variables to arrays
From: Dongli Zhang @ 2021-02-03 23:37 UTC (permalink / raw)
  To: dri-devel, intel-gfx, iommu, linux-mips, linux-mmc, linux-pci,
	linuxppc-dev, nouveau, x86, xen-devel
  Cc: ulf.hansson, airlied, joonas.lahtinen, adrian.hunter, paulus, hpa,
	mingo, m.szyprowski, sstabellini, joe.jin, hch, peterz, mingo,
	matthew.auld, bskeggs, thomas.lendacky, konrad.wilk, jani.nikula,
	bp, rodrigo.vivi, bhelgaas, boris.ostrovsky, chris, jgross,
	tsbogend, linux-kernel, tglx, bauerman, daniel, akpm,
	robin.murphy, rppt
In-Reply-To: <20210203233709.19819-1-dongli.zhang@oracle.com>

This patch converts several xen-swiotlb related variables to arrays, in
order to maintain stat/status for different swiotlb buffers. Here are
variables involved:

- xen_io_tlb_start and xen_io_tlb_end
- xen_io_tlb_nslabs
- MAX_DMA_BITS

There is no functional change and this is to prepare to enable 64-bit
xen-swiotlb.

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 drivers/xen/swiotlb-xen.c | 75 +++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 35 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 662638093542..e18cae693cdc 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -39,15 +39,17 @@
 #include <asm/xen/page-coherent.h>
 
 #include <trace/events/swiotlb.h>
-#define MAX_DMA_BITS 32
 /*
  * Used to do a quick range check in swiotlb_tbl_unmap_single and
  * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
  * API.
  */
 
-static char *xen_io_tlb_start, *xen_io_tlb_end;
-static unsigned long xen_io_tlb_nslabs;
+static char *xen_io_tlb_start[SWIOTLB_MAX], *xen_io_tlb_end[SWIOTLB_MAX];
+static unsigned long xen_io_tlb_nslabs[SWIOTLB_MAX];
+
+static int max_dma_bits[] = {32, 64};
+
 /*
  * Quick lookup value of the bus address of the IOTLB.
  */
@@ -112,8 +114,8 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
 	 * in our domain. Therefore _only_ check address within our domain.
 	 */
 	if (pfn_valid(PFN_DOWN(paddr))) {
-		return paddr >= virt_to_phys(xen_io_tlb_start) &&
-		       paddr < virt_to_phys(xen_io_tlb_end);
+		return paddr >= virt_to_phys(xen_io_tlb_start[SWIOTLB_LO]) &&
+		       paddr < virt_to_phys(xen_io_tlb_end[SWIOTLB_LO]);
 	}
 	return 0;
 }
@@ -137,7 +139,7 @@ xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
 				p + (i << IO_TLB_SHIFT),
 				get_order(slabs << IO_TLB_SHIFT),
 				dma_bits, &dma_handle);
-		} while (rc && dma_bits++ < MAX_DMA_BITS);
+		} while (rc && dma_bits++ < max_dma_bits[SWIOTLB_LO]);
 		if (rc)
 			return rc;
 
@@ -148,12 +150,13 @@ xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
 static unsigned long xen_set_nslabs(unsigned long nr_tbl)
 {
 	if (!nr_tbl) {
-		xen_io_tlb_nslabs = (64 * 1024 * 1024 >> IO_TLB_SHIFT);
-		xen_io_tlb_nslabs = ALIGN(xen_io_tlb_nslabs, IO_TLB_SEGSIZE);
+		xen_io_tlb_nslabs[SWIOTLB_LO] = (64 * 1024 * 1024 >> IO_TLB_SHIFT);
+		xen_io_tlb_nslabs[SWIOTLB_LO] = ALIGN(xen_io_tlb_nslabs[SWIOTLB_LO],
+						      IO_TLB_SEGSIZE);
 	} else
-		xen_io_tlb_nslabs = nr_tbl;
+		xen_io_tlb_nslabs[SWIOTLB_LO] = nr_tbl;
 
-	return xen_io_tlb_nslabs << IO_TLB_SHIFT;
+	return xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
 }
 
 enum xen_swiotlb_err {
@@ -184,16 +187,16 @@ int __ref xen_swiotlb_init(int verbose, bool early)
 	enum xen_swiotlb_err m_ret = XEN_SWIOTLB_UNKNOWN;
 	unsigned int repeat = 3;
 
-	xen_io_tlb_nslabs = swiotlb_nr_tbl(SWIOTLB_LO);
+	xen_io_tlb_nslabs[SWIOTLB_LO] = swiotlb_nr_tbl(SWIOTLB_LO);
 retry:
-	bytes = xen_set_nslabs(xen_io_tlb_nslabs);
-	order = get_order(xen_io_tlb_nslabs << IO_TLB_SHIFT);
+	bytes = xen_set_nslabs(xen_io_tlb_nslabs[SWIOTLB_LO]);
+	order = get_order(xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);
 
 	/*
 	 * IO TLB memory already allocated. Just use it.
 	 */
 	if (io_tlb_start[SWIOTLB_LO] != 0) {
-		xen_io_tlb_start = phys_to_virt(io_tlb_start[SWIOTLB_LO]);
+		xen_io_tlb_start[SWIOTLB_LO] = phys_to_virt(io_tlb_start[SWIOTLB_LO]);
 		goto end;
 	}
 
@@ -201,76 +204,78 @@ int __ref xen_swiotlb_init(int verbose, bool early)
 	 * Get IO TLB memory from any location.
 	 */
 	if (early) {
-		xen_io_tlb_start = memblock_alloc(PAGE_ALIGN(bytes),
+		xen_io_tlb_start[SWIOTLB_LO] = memblock_alloc(PAGE_ALIGN(bytes),
 						  PAGE_SIZE);
-		if (!xen_io_tlb_start)
+		if (!xen_io_tlb_start[SWIOTLB_LO])
 			panic("%s: Failed to allocate %lu bytes align=0x%lx\n",
 			      __func__, PAGE_ALIGN(bytes), PAGE_SIZE);
 	} else {
 #define SLABS_PER_PAGE (1 << (PAGE_SHIFT - IO_TLB_SHIFT))
 #define IO_TLB_MIN_SLABS ((1<<20) >> IO_TLB_SHIFT)
 		while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
-			xen_io_tlb_start = (void *)xen_get_swiotlb_free_pages(order);
-			if (xen_io_tlb_start)
+			xen_io_tlb_start[SWIOTLB_LO] = (void *)xen_get_swiotlb_free_pages(order);
+			if (xen_io_tlb_start[SWIOTLB_LO])
 				break;
 			order--;
 		}
 		if (order != get_order(bytes)) {
 			pr_warn("Warning: only able to allocate %ld MB for software IO TLB\n",
 				(PAGE_SIZE << order) >> 20);
-			xen_io_tlb_nslabs = SLABS_PER_PAGE << order;
-			bytes = xen_io_tlb_nslabs << IO_TLB_SHIFT;
+			xen_io_tlb_nslabs[SWIOTLB_LO] = SLABS_PER_PAGE << order;
+			bytes = xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
 		}
 	}
-	if (!xen_io_tlb_start) {
+	if (!xen_io_tlb_start[SWIOTLB_LO]) {
 		m_ret = XEN_SWIOTLB_ENOMEM;
 		goto error;
 	}
 	/*
 	 * And replace that memory with pages under 4GB.
 	 */
-	rc = xen_swiotlb_fixup(xen_io_tlb_start,
+	rc = xen_swiotlb_fixup(xen_io_tlb_start[SWIOTLB_LO],
 			       bytes,
-			       xen_io_tlb_nslabs);
+			       xen_io_tlb_nslabs[SWIOTLB_LO]);
 	if (rc) {
 		if (early)
-			memblock_free(__pa(xen_io_tlb_start),
+			memblock_free(__pa(xen_io_tlb_start[SWIOTLB_LO]),
 				      PAGE_ALIGN(bytes));
 		else {
-			free_pages((unsigned long)xen_io_tlb_start, order);
-			xen_io_tlb_start = NULL;
+			free_pages((unsigned long)xen_io_tlb_start[SWIOTLB_LO], order);
+			xen_io_tlb_start[SWIOTLB_LO] = NULL;
 		}
 		m_ret = XEN_SWIOTLB_EFIXUP;
 		goto error;
 	}
 	if (early) {
-		if (swiotlb_init_with_tbl(xen_io_tlb_start, xen_io_tlb_nslabs,
+		if (swiotlb_init_with_tbl(xen_io_tlb_start[SWIOTLB_LO],
+					  xen_io_tlb_nslabs[SWIOTLB_LO],
 					  SWIOTLB_LO, verbose))
 			panic("Cannot allocate SWIOTLB buffer");
 		rc = 0;
 	} else
-		rc = swiotlb_late_init_with_tbl(xen_io_tlb_start,
-						xen_io_tlb_nslabs, SWIOTLB_LO);
+		rc = swiotlb_late_init_with_tbl(xen_io_tlb_start[SWIOTLB_LO],
+						xen_io_tlb_nslabs[SWIOTLB_LO],
+						SWIOTLB_LO);
 
 end:
-	xen_io_tlb_end = xen_io_tlb_start + bytes;
+	xen_io_tlb_end[SWIOTLB_LO] = xen_io_tlb_start[SWIOTLB_LO] + bytes;
 	if (!rc)
 		swiotlb_set_max_segment(PAGE_SIZE, SWIOTLB_LO);
 
 	return rc;
 error:
 	if (repeat--) {
-		xen_io_tlb_nslabs = max(1024UL, /* Min is 2MB */
-					(xen_io_tlb_nslabs >> 1));
+		xen_io_tlb_nslabs[SWIOTLB_LO] = max(1024UL, /* Min is 2MB */
+					(xen_io_tlb_nslabs[SWIOTLB_LO] >> 1));
 		pr_info("Lowering to %luMB\n",
-			(xen_io_tlb_nslabs << IO_TLB_SHIFT) >> 20);
+			(xen_io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT) >> 20);
 		goto retry;
 	}
 	pr_err("%s (rc:%d)\n", xen_swiotlb_error(m_ret), rc);
 	if (early)
 		panic("%s (rc:%d)", xen_swiotlb_error(m_ret), rc);
 	else
-		free_pages((unsigned long)xen_io_tlb_start, order);
+		free_pages((unsigned long)xen_io_tlb_start[SWIOTLB_LO], order);
 	return rc;
 }
 
@@ -561,7 +566,7 @@ xen_swiotlb_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
 static int
 xen_swiotlb_dma_supported(struct device *hwdev, u64 mask)
 {
-	return xen_virt_to_bus(hwdev, xen_io_tlb_end - 1) <= mask;
+	return xen_virt_to_bus(hwdev, xen_io_tlb_end[SWIOTLB_LO] - 1) <= mask;
 }
 
 const struct dma_map_ops xen_swiotlb_dma_ops = {
-- 
2.17.1


^ permalink raw reply related

* [PATCH RFC v1 2/6] swiotlb: convert variables to arrays
From: Dongli Zhang @ 2021-02-03 23:37 UTC (permalink / raw)
  To: dri-devel, intel-gfx, iommu, linux-mips, linux-mmc, linux-pci,
	linuxppc-dev, nouveau, x86, xen-devel
  Cc: ulf.hansson, airlied, joonas.lahtinen, adrian.hunter, paulus, hpa,
	mingo, m.szyprowski, sstabellini, joe.jin, hch, peterz, mingo,
	matthew.auld, bskeggs, thomas.lendacky, konrad.wilk, jani.nikula,
	bp, rodrigo.vivi, bhelgaas, boris.ostrovsky, chris, jgross,
	tsbogend, linux-kernel, tglx, bauerman, daniel, akpm,
	robin.murphy, rppt
In-Reply-To: <20210203233709.19819-1-dongli.zhang@oracle.com>

This patch converts several swiotlb related variables to arrays, in
order to maintain stat/status for different swiotlb buffers. Here are
variables involved:

- io_tlb_start and io_tlb_end
- io_tlb_nslabs and io_tlb_used
- io_tlb_list
- io_tlb_index
- max_segment
- io_tlb_orig_addr
- no_iotlb_memory

There is no functional change and this is to prepare to enable 64-bit
swiotlb.

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 arch/powerpc/platforms/pseries/svm.c |   6 +-
 drivers/xen/swiotlb-xen.c            |   4 +-
 include/linux/swiotlb.h              |   5 +-
 kernel/dma/swiotlb.c                 | 257 ++++++++++++++-------------
 4 files changed, 140 insertions(+), 132 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c
index 7b739cc7a8a9..9f8842d0da1f 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -55,9 +55,9 @@ void __init svm_swiotlb_init(void)
 	if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, false))
 		return;
 
-	if (io_tlb_start)
-		memblock_free_early(io_tlb_start,
-				    PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
+	if (io_tlb_start[SWIOTLB_LO])
+		memblock_free_early(io_tlb_start[SWIOTLB_LO],
+				    PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT));
 	panic("SVM: Cannot allocate SWIOTLB buffer");
 }
 
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 2b385c1b4a99..3261880ad859 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -192,8 +192,8 @@ int __ref xen_swiotlb_init(int verbose, bool early)
 	/*
 	 * IO TLB memory already allocated. Just use it.
 	 */
-	if (io_tlb_start != 0) {
-		xen_io_tlb_start = phys_to_virt(io_tlb_start);
+	if (io_tlb_start[SWIOTLB_LO] != 0) {
+		xen_io_tlb_start = phys_to_virt(io_tlb_start[SWIOTLB_LO]);
 		goto end;
 	}
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index ca125c1b1281..777046cd4d1b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -76,11 +76,12 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
 
 #ifdef CONFIG_SWIOTLB
 extern enum swiotlb_force swiotlb_force;
-extern phys_addr_t io_tlb_start, io_tlb_end;
+extern phys_addr_t io_tlb_start[], io_tlb_end[];
 
 static inline bool is_swiotlb_buffer(phys_addr_t paddr)
 {
-	return paddr >= io_tlb_start && paddr < io_tlb_end;
+	return paddr >= io_tlb_start[SWIOTLB_LO] &&
+	       paddr < io_tlb_end[SWIOTLB_LO];
 }
 
 void __init swiotlb_exit(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 7c42df6e6100..1fbb65daa2dd 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -69,38 +69,38 @@ enum swiotlb_force swiotlb_force;
  * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
  * API.
  */
-phys_addr_t io_tlb_start, io_tlb_end;
+phys_addr_t io_tlb_start[SWIOTLB_MAX], io_tlb_end[SWIOTLB_MAX];
 
 /*
  * The number of IO TLB blocks (in groups of 64) between io_tlb_start and
  * io_tlb_end.  This is command line adjustable via setup_io_tlb_npages.
  */
-static unsigned long io_tlb_nslabs;
+static unsigned long io_tlb_nslabs[SWIOTLB_MAX];
 
 /*
  * The number of used IO TLB block
  */
-static unsigned long io_tlb_used;
+static unsigned long io_tlb_used[SWIOTLB_MAX];
 
 /*
  * This is a free list describing the number of free entries available from
  * each index
  */
-static unsigned int *io_tlb_list;
-static unsigned int io_tlb_index;
+static unsigned int *io_tlb_list[SWIOTLB_MAX];
+static unsigned int io_tlb_index[SWIOTLB_MAX];
 
 /*
  * Max segment that we can provide which (if pages are contingous) will
  * not be bounced (unless SWIOTLB_FORCE is set).
  */
-static unsigned int max_segment;
+static unsigned int max_segment[SWIOTLB_MAX];
 
 /*
  * We need to save away the original address corresponding to a mapped entry
  * for the sync operations.
  */
 #define INVALID_PHYS_ADDR (~(phys_addr_t)0)
-static phys_addr_t *io_tlb_orig_addr;
+static phys_addr_t *io_tlb_orig_addr[SWIOTLB_MAX];
 
 /*
  * Protect the above data structures in the map and unmap calls
@@ -113,9 +113,9 @@ static int __init
 setup_io_tlb_npages(char *str)
 {
 	if (isdigit(*str)) {
-		io_tlb_nslabs = simple_strtoul(str, &str, 0);
+		io_tlb_nslabs[SWIOTLB_LO] = simple_strtoul(str, &str, 0);
 		/* avoid tail segment of size < IO_TLB_SEGSIZE */
-		io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
+		io_tlb_nslabs[SWIOTLB_LO] = ALIGN(io_tlb_nslabs[SWIOTLB_LO], IO_TLB_SEGSIZE);
 	}
 	if (*str == ',')
 		++str;
@@ -123,40 +123,40 @@ setup_io_tlb_npages(char *str)
 		swiotlb_force = SWIOTLB_FORCE;
 	} else if (!strcmp(str, "noforce")) {
 		swiotlb_force = SWIOTLB_NO_FORCE;
-		io_tlb_nslabs = 1;
+		io_tlb_nslabs[SWIOTLB_LO] = 1;
 	}
 
 	return 0;
 }
 early_param("swiotlb", setup_io_tlb_npages);
 
-static bool no_iotlb_memory;
+static bool no_iotlb_memory[SWIOTLB_MAX];
 
 unsigned long swiotlb_nr_tbl(void)
 {
-	return unlikely(no_iotlb_memory) ? 0 : io_tlb_nslabs;
+	return unlikely(no_iotlb_memory[SWIOTLB_LO]) ? 0 : io_tlb_nslabs[SWIOTLB_LO];
 }
 EXPORT_SYMBOL_GPL(swiotlb_nr_tbl);
 
 unsigned int swiotlb_max_segment(void)
 {
-	return unlikely(no_iotlb_memory) ? 0 : max_segment;
+	return unlikely(no_iotlb_memory[SWIOTLB_LO]) ? 0 : max_segment[SWIOTLB_LO];
 }
 EXPORT_SYMBOL_GPL(swiotlb_max_segment);
 
 void swiotlb_set_max_segment(unsigned int val)
 {
 	if (swiotlb_force == SWIOTLB_FORCE)
-		max_segment = 1;
+		max_segment[SWIOTLB_LO] = 1;
 	else
-		max_segment = rounddown(val, PAGE_SIZE);
+		max_segment[SWIOTLB_LO] = rounddown(val, PAGE_SIZE);
 }
 
 unsigned long swiotlb_size_or_default(void)
 {
 	unsigned long size;
 
-	size = io_tlb_nslabs << IO_TLB_SHIFT;
+	size = io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
 
 	return size ? size : (IO_TLB_DEFAULT_SIZE);
 }
@@ -170,10 +170,10 @@ void __init swiotlb_adjust_size(unsigned long new_size)
 	 * architectures such as those supporting memory encryption to
 	 * adjust/expand SWIOTLB size for their use.
 	 */
-	if (!io_tlb_nslabs) {
+	if (!io_tlb_nslabs[SWIOTLB_LO]) {
 		size = ALIGN(new_size, 1 << IO_TLB_SHIFT);
-		io_tlb_nslabs = size >> IO_TLB_SHIFT;
-		io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
+		io_tlb_nslabs[SWIOTLB_LO] = size >> IO_TLB_SHIFT;
+		io_tlb_nslabs[SWIOTLB_LO] = ALIGN(io_tlb_nslabs[SWIOTLB_LO], IO_TLB_SEGSIZE);
 
 		pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
 	}
@@ -181,15 +181,16 @@ void __init swiotlb_adjust_size(unsigned long new_size)
 
 void swiotlb_print_info(void)
 {
-	unsigned long bytes = io_tlb_nslabs << IO_TLB_SHIFT;
+	unsigned long bytes = io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
 
-	if (no_iotlb_memory) {
+	if (no_iotlb_memory[SWIOTLB_LO]) {
 		pr_warn("No low mem\n");
 		return;
 	}
 
-	pr_info("mapped [mem %pa-%pa] (%luMB)\n", &io_tlb_start, &io_tlb_end,
-	       bytes >> 20);
+	pr_info("mapped [mem %pa-%pa] (%luMB)\n",
+		&io_tlb_start[SWIOTLB_LO], &io_tlb_end[SWIOTLB_LO],
+		bytes >> 20);
 }
 
 /*
@@ -203,11 +204,11 @@ void __init swiotlb_update_mem_attributes(void)
 	void *vaddr;
 	unsigned long bytes;
 
-	if (no_iotlb_memory || late_alloc)
+	if (no_iotlb_memory[SWIOTLB_LO] || late_alloc)
 		return;
 
-	vaddr = phys_to_virt(io_tlb_start);
-	bytes = PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT);
+	vaddr = phys_to_virt(io_tlb_start[SWIOTLB_LO]);
+	bytes = PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);
 	set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);
 	memset(vaddr, 0, bytes);
 }
@@ -219,38 +220,38 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
 
 	bytes = nslabs << IO_TLB_SHIFT;
 
-	io_tlb_nslabs = nslabs;
-	io_tlb_start = __pa(tlb);
-	io_tlb_end = io_tlb_start + bytes;
+	io_tlb_nslabs[SWIOTLB_LO] = nslabs;
+	io_tlb_start[SWIOTLB_LO] = __pa(tlb);
+	io_tlb_end[SWIOTLB_LO] = io_tlb_start[SWIOTLB_LO] + bytes;
 
 	/*
 	 * Allocate and initialize the free list array.  This array is used
 	 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
 	 * between io_tlb_start and io_tlb_end.
 	 */
-	alloc_size = PAGE_ALIGN(io_tlb_nslabs * sizeof(int));
-	io_tlb_list = memblock_alloc(alloc_size, PAGE_SIZE);
-	if (!io_tlb_list)
+	alloc_size = PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] * sizeof(int));
+	io_tlb_list[SWIOTLB_LO] = memblock_alloc(alloc_size, PAGE_SIZE);
+	if (!io_tlb_list[SWIOTLB_LO])
 		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
 		      __func__, alloc_size, PAGE_SIZE);
 
-	alloc_size = PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t));
-	io_tlb_orig_addr = memblock_alloc(alloc_size, PAGE_SIZE);
-	if (!io_tlb_orig_addr)
+	alloc_size = PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] * sizeof(phys_addr_t));
+	io_tlb_orig_addr[SWIOTLB_LO] = memblock_alloc(alloc_size, PAGE_SIZE);
+	if (!io_tlb_orig_addr[SWIOTLB_LO])
 		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
 		      __func__, alloc_size, PAGE_SIZE);
 
-	for (i = 0; i < io_tlb_nslabs; i++) {
-		io_tlb_list[i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
-		io_tlb_orig_addr[i] = INVALID_PHYS_ADDR;
+	for (i = 0; i < io_tlb_nslabs[SWIOTLB_LO]; i++) {
+		io_tlb_list[SWIOTLB_LO][i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
+		io_tlb_orig_addr[SWIOTLB_LO][i] = INVALID_PHYS_ADDR;
 	}
-	io_tlb_index = 0;
-	no_iotlb_memory = false;
+	io_tlb_index[SWIOTLB_LO] = 0;
+	no_iotlb_memory[SWIOTLB_LO] = false;
 
 	if (verbose)
 		swiotlb_print_info();
 
-	swiotlb_set_max_segment(io_tlb_nslabs << IO_TLB_SHIFT);
+	swiotlb_set_max_segment(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);
 	return 0;
 }
 
@@ -265,25 +266,25 @@ swiotlb_init(int verbose)
 	unsigned char *vstart;
 	unsigned long bytes;
 
-	if (!io_tlb_nslabs) {
-		io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
-		io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
+	if (!io_tlb_nslabs[SWIOTLB_LO]) {
+		io_tlb_nslabs[SWIOTLB_LO] = (default_size >> IO_TLB_SHIFT);
+		io_tlb_nslabs[SWIOTLB_LO] = ALIGN(io_tlb_nslabs[SWIOTLB_LO], IO_TLB_SEGSIZE);
 	}
 
-	bytes = io_tlb_nslabs << IO_TLB_SHIFT;
+	bytes = io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
 
 	/* Get IO TLB memory from the low pages */
 	vstart = memblock_alloc_low(PAGE_ALIGN(bytes), PAGE_SIZE);
-	if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, verbose))
+	if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs[SWIOTLB_LO], verbose))
 		return;
 
-	if (io_tlb_start) {
-		memblock_free_early(io_tlb_start,
-				    PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
-		io_tlb_start = 0;
+	if (io_tlb_start[SWIOTLB_LO]) {
+		memblock_free_early(io_tlb_start[SWIOTLB_LO],
+				    PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT));
+		io_tlb_start[SWIOTLB_LO] = 0;
 	}
 	pr_warn("Cannot allocate buffer");
-	no_iotlb_memory = true;
+	no_iotlb_memory[SWIOTLB_LO] = true;
 }
 
 /*
@@ -294,22 +295,22 @@ swiotlb_init(int verbose)
 int
 swiotlb_late_init_with_default_size(size_t default_size)
 {
-	unsigned long bytes, req_nslabs = io_tlb_nslabs;
+	unsigned long bytes, req_nslabs = io_tlb_nslabs[SWIOTLB_LO];
 	unsigned char *vstart = NULL;
 	unsigned int order;
 	int rc = 0;
 
-	if (!io_tlb_nslabs) {
-		io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
-		io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
+	if (!io_tlb_nslabs[SWIOTLB_LO]) {
+		io_tlb_nslabs[SWIOTLB_LO] = (default_size >> IO_TLB_SHIFT);
+		io_tlb_nslabs[SWIOTLB_LO] = ALIGN(io_tlb_nslabs[SWIOTLB_LO], IO_TLB_SEGSIZE);
 	}
 
 	/*
 	 * Get IO TLB memory from the low pages
 	 */
-	order = get_order(io_tlb_nslabs << IO_TLB_SHIFT);
-	io_tlb_nslabs = SLABS_PER_PAGE << order;
-	bytes = io_tlb_nslabs << IO_TLB_SHIFT;
+	order = get_order(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);
+	io_tlb_nslabs[SWIOTLB_LO] = SLABS_PER_PAGE << order;
+	bytes = io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
 
 	while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
 		vstart = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
@@ -320,15 +321,15 @@ swiotlb_late_init_with_default_size(size_t default_size)
 	}
 
 	if (!vstart) {
-		io_tlb_nslabs = req_nslabs;
+		io_tlb_nslabs[SWIOTLB_LO] = req_nslabs;
 		return -ENOMEM;
 	}
 	if (order != get_order(bytes)) {
 		pr_warn("only able to allocate %ld MB\n",
 			(PAGE_SIZE << order) >> 20);
-		io_tlb_nslabs = SLABS_PER_PAGE << order;
+		io_tlb_nslabs[SWIOTLB_LO] = SLABS_PER_PAGE << order;
 	}
-	rc = swiotlb_late_init_with_tbl(vstart, io_tlb_nslabs);
+	rc = swiotlb_late_init_with_tbl(vstart, io_tlb_nslabs[SWIOTLB_LO]);
 	if (rc)
 		free_pages((unsigned long)vstart, order);
 
@@ -337,10 +338,10 @@ swiotlb_late_init_with_default_size(size_t default_size)
 
 static void swiotlb_cleanup(void)
 {
-	io_tlb_end = 0;
-	io_tlb_start = 0;
-	io_tlb_nslabs = 0;
-	max_segment = 0;
+	io_tlb_end[SWIOTLB_LO] = 0;
+	io_tlb_start[SWIOTLB_LO] = 0;
+	io_tlb_nslabs[SWIOTLB_LO] = 0;
+	max_segment[SWIOTLB_LO] = 0;
 }
 
 int
@@ -350,9 +351,9 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 
 	bytes = nslabs << IO_TLB_SHIFT;
 
-	io_tlb_nslabs = nslabs;
-	io_tlb_start = virt_to_phys(tlb);
-	io_tlb_end = io_tlb_start + bytes;
+	io_tlb_nslabs[SWIOTLB_LO] = nslabs;
+	io_tlb_start[SWIOTLB_LO] = virt_to_phys(tlb);
+	io_tlb_end[SWIOTLB_LO] = io_tlb_start[SWIOTLB_LO] + bytes;
 
 	set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
 	memset(tlb, 0, bytes);
@@ -362,37 +363,37 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 	 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
 	 * between io_tlb_start and io_tlb_end.
 	 */
-	io_tlb_list = (unsigned int *)__get_free_pages(GFP_KERNEL,
-	                              get_order(io_tlb_nslabs * sizeof(int)));
-	if (!io_tlb_list)
+	io_tlb_list[SWIOTLB_LO] = (unsigned int *)__get_free_pages(GFP_KERNEL,
+			get_order(io_tlb_nslabs[SWIOTLB_LO] * sizeof(int)));
+	if (!io_tlb_list[SWIOTLB_LO])
 		goto cleanup3;
 
-	io_tlb_orig_addr = (phys_addr_t *)
+	io_tlb_orig_addr[SWIOTLB_LO] = (phys_addr_t *)
 		__get_free_pages(GFP_KERNEL,
-				 get_order(io_tlb_nslabs *
+				 get_order(io_tlb_nslabs[SWIOTLB_LO] *
 					   sizeof(phys_addr_t)));
-	if (!io_tlb_orig_addr)
+	if (!io_tlb_orig_addr[SWIOTLB_LO])
 		goto cleanup4;
 
-	for (i = 0; i < io_tlb_nslabs; i++) {
-		io_tlb_list[i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
-		io_tlb_orig_addr[i] = INVALID_PHYS_ADDR;
+	for (i = 0; i < io_tlb_nslabs[SWIOTLB_LO]; i++) {
+		io_tlb_list[SWIOTLB_LO][i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
+		io_tlb_orig_addr[SWIOTLB_LO][i] = INVALID_PHYS_ADDR;
 	}
-	io_tlb_index = 0;
-	no_iotlb_memory = false;
+	io_tlb_index[SWIOTLB_LO] = 0;
+	no_iotlb_memory[SWIOTLB_LO] = false;
 
 	swiotlb_print_info();
 
 	late_alloc = 1;
 
-	swiotlb_set_max_segment(io_tlb_nslabs << IO_TLB_SHIFT);
+	swiotlb_set_max_segment(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);
 
 	return 0;
 
 cleanup4:
-	free_pages((unsigned long)io_tlb_list, get_order(io_tlb_nslabs *
+	free_pages((unsigned long)io_tlb_list[SWIOTLB_LO], get_order(io_tlb_nslabs[SWIOTLB_LO] *
 	                                                 sizeof(int)));
-	io_tlb_list = NULL;
+	io_tlb_list[SWIOTLB_LO] = NULL;
 cleanup3:
 	swiotlb_cleanup();
 	return -ENOMEM;
@@ -400,23 +401,23 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 
 void __init swiotlb_exit(void)
 {
-	if (!io_tlb_orig_addr)
+	if (!io_tlb_orig_addr[SWIOTLB_LO])
 		return;
 
 	if (late_alloc) {
-		free_pages((unsigned long)io_tlb_orig_addr,
-			   get_order(io_tlb_nslabs * sizeof(phys_addr_t)));
-		free_pages((unsigned long)io_tlb_list, get_order(io_tlb_nslabs *
-								 sizeof(int)));
-		free_pages((unsigned long)phys_to_virt(io_tlb_start),
-			   get_order(io_tlb_nslabs << IO_TLB_SHIFT));
+		free_pages((unsigned long)io_tlb_orig_addr[SWIOTLB_LO],
+			   get_order(io_tlb_nslabs[SWIOTLB_LO] * sizeof(phys_addr_t)));
+		free_pages((unsigned long)io_tlb_list[SWIOTLB_LO],
+			   get_order(io_tlb_nslabs[SWIOTLB_LO] * sizeof(int)));
+		free_pages((unsigned long)phys_to_virt(io_tlb_start[SWIOTLB_LO]),
+			   get_order(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT));
 	} else {
-		memblock_free_late(__pa(io_tlb_orig_addr),
-				   PAGE_ALIGN(io_tlb_nslabs * sizeof(phys_addr_t)));
-		memblock_free_late(__pa(io_tlb_list),
-				   PAGE_ALIGN(io_tlb_nslabs * sizeof(int)));
-		memblock_free_late(io_tlb_start,
-				   PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
+		memblock_free_late(__pa(io_tlb_orig_addr[SWIOTLB_LO]),
+				   PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] * sizeof(phys_addr_t)));
+		memblock_free_late(__pa(io_tlb_list[SWIOTLB_LO]),
+				   PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] * sizeof(int)));
+		memblock_free_late(io_tlb_start[SWIOTLB_LO],
+				   PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT));
 	}
 	swiotlb_cleanup();
 }
@@ -465,7 +466,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 		size_t mapping_size, size_t alloc_size,
 		enum dma_data_direction dir, unsigned long attrs)
 {
-	dma_addr_t tbl_dma_addr = phys_to_dma_unencrypted(hwdev, io_tlb_start);
+	dma_addr_t tbl_dma_addr = phys_to_dma_unencrypted(hwdev, io_tlb_start[SWIOTLB_LO]);
 	unsigned long flags;
 	phys_addr_t tlb_addr;
 	unsigned int nslots, stride, index, wrap;
@@ -475,7 +476,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 	unsigned long max_slots;
 	unsigned long tmp_io_tlb_used;
 
-	if (no_iotlb_memory)
+	if (no_iotlb_memory[SWIOTLB_LO])
 		panic("Can not allocate SWIOTLB buffer earlier and can't now provide you with the DMA bounce buffer");
 
 	if (mem_encrypt_active())
@@ -518,11 +519,11 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 	 */
 	spin_lock_irqsave(&io_tlb_lock, flags);
 
-	if (unlikely(nslots > io_tlb_nslabs - io_tlb_used))
+	if (unlikely(nslots > io_tlb_nslabs[SWIOTLB_LO] - io_tlb_used[SWIOTLB_LO]))
 		goto not_found;
 
-	index = ALIGN(io_tlb_index, stride);
-	if (index >= io_tlb_nslabs)
+	index = ALIGN(io_tlb_index[SWIOTLB_LO], stride);
+	if (index >= io_tlb_nslabs[SWIOTLB_LO])
 		index = 0;
 	wrap = index;
 
@@ -530,7 +531,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 		while (iommu_is_span_boundary(index, nslots, offset_slots,
 					      max_slots)) {
 			index += stride;
-			if (index >= io_tlb_nslabs)
+			if (index >= io_tlb_nslabs[SWIOTLB_LO])
 				index = 0;
 			if (index == wrap)
 				goto not_found;
@@ -541,39 +542,42 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 		 * contiguous buffers, we allocate the buffers from that slot
 		 * and mark the entries as '0' indicating unavailable.
 		 */
-		if (io_tlb_list[index] >= nslots) {
+		if (io_tlb_list[SWIOTLB_LO][index] >= nslots) {
 			int count = 0;
 
 			for (i = index; i < (int) (index + nslots); i++)
-				io_tlb_list[i] = 0;
-			for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) && io_tlb_list[i]; i--)
-				io_tlb_list[i] = ++count;
-			tlb_addr = io_tlb_start + (index << IO_TLB_SHIFT);
+				io_tlb_list[SWIOTLB_LO][i] = 0;
+			for (i = index - 1;
+			     (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) &&
+			     io_tlb_list[SWIOTLB_LO][i];
+			     i--)
+				io_tlb_list[SWIOTLB_LO][i] = ++count;
+			tlb_addr = io_tlb_start[SWIOTLB_LO] + (index << IO_TLB_SHIFT);
 
 			/*
 			 * Update the indices to avoid searching in the next
 			 * round.
 			 */
-			io_tlb_index = ((index + nslots) < io_tlb_nslabs
+			io_tlb_index[SWIOTLB_LO] = ((index + nslots) < io_tlb_nslabs[SWIOTLB_LO]
 					? (index + nslots) : 0);
 
 			goto found;
 		}
 		index += stride;
-		if (index >= io_tlb_nslabs)
+		if (index >= io_tlb_nslabs[SWIOTLB_LO])
 			index = 0;
 	} while (index != wrap);
 
 not_found:
-	tmp_io_tlb_used = io_tlb_used;
+	tmp_io_tlb_used = io_tlb_used[SWIOTLB_LO];
 
 	spin_unlock_irqrestore(&io_tlb_lock, flags);
 	if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit())
 		dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
-			 alloc_size, io_tlb_nslabs, tmp_io_tlb_used);
+			 alloc_size, io_tlb_nslabs[SWIOTLB_LO], tmp_io_tlb_used);
 	return (phys_addr_t)DMA_MAPPING_ERROR;
 found:
-	io_tlb_used += nslots;
+	io_tlb_used[SWIOTLB_LO] += nslots;
 	spin_unlock_irqrestore(&io_tlb_lock, flags);
 
 	/*
@@ -582,7 +586,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 	 * needed.
 	 */
 	for (i = 0; i < nslots; i++)
-		io_tlb_orig_addr[index+i] = orig_addr + (i << IO_TLB_SHIFT);
+		io_tlb_orig_addr[SWIOTLB_LO][index+i] = orig_addr + (i << IO_TLB_SHIFT);
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
 	    (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
 		swiotlb_bounce(orig_addr, tlb_addr, mapping_size, DMA_TO_DEVICE);
@@ -599,8 +603,8 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
 {
 	unsigned long flags;
 	int i, count, nslots = ALIGN(alloc_size, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
-	int index = (tlb_addr - io_tlb_start) >> IO_TLB_SHIFT;
-	phys_addr_t orig_addr = io_tlb_orig_addr[index];
+	int index = (tlb_addr - io_tlb_start[SWIOTLB_LO]) >> IO_TLB_SHIFT;
+	phys_addr_t orig_addr = io_tlb_orig_addr[SWIOTLB_LO][index];
 
 	/*
 	 * First, sync the memory before unmapping the entry
@@ -619,23 +623,26 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
 	spin_lock_irqsave(&io_tlb_lock, flags);
 	{
 		count = ((index + nslots) < ALIGN(index + 1, IO_TLB_SEGSIZE) ?
-			 io_tlb_list[index + nslots] : 0);
+			 io_tlb_list[SWIOTLB_LO][index + nslots] : 0);
 		/*
 		 * Step 1: return the slots to the free list, merging the
 		 * slots with superceeding slots
 		 */
 		for (i = index + nslots - 1; i >= index; i--) {
-			io_tlb_list[i] = ++count;
-			io_tlb_orig_addr[i] = INVALID_PHYS_ADDR;
+			io_tlb_list[SWIOTLB_LO][i] = ++count;
+			io_tlb_orig_addr[SWIOTLB_LO][i] = INVALID_PHYS_ADDR;
 		}
 		/*
 		 * Step 2: merge the returned slots with the preceding slots,
 		 * if available (non zero)
 		 */
-		for (i = index - 1; (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE -1) && io_tlb_list[i]; i--)
-			io_tlb_list[i] = ++count;
+		for (i = index - 1;
+		     (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) &&
+		     io_tlb_list[SWIOTLB_LO][i];
+		     i--)
+			io_tlb_list[SWIOTLB_LO][i] = ++count;
 
-		io_tlb_used -= nslots;
+		io_tlb_used[SWIOTLB_LO] -= nslots;
 	}
 	spin_unlock_irqrestore(&io_tlb_lock, flags);
 }
@@ -644,8 +651,8 @@ void swiotlb_tbl_sync_single(struct device *hwdev, phys_addr_t tlb_addr,
 			     size_t size, enum dma_data_direction dir,
 			     enum dma_sync_target target)
 {
-	int index = (tlb_addr - io_tlb_start) >> IO_TLB_SHIFT;
-	phys_addr_t orig_addr = io_tlb_orig_addr[index];
+	int index = (tlb_addr - io_tlb_start[SWIOTLB_LO]) >> IO_TLB_SHIFT;
+	phys_addr_t orig_addr = io_tlb_orig_addr[SWIOTLB_LO][index];
 
 	if (orig_addr == INVALID_PHYS_ADDR)
 		return;
@@ -716,7 +723,7 @@ bool is_swiotlb_active(void)
 	 * When SWIOTLB is initialized, even if io_tlb_start points to physical
 	 * address zero, io_tlb_end surely doesn't.
 	 */
-	return io_tlb_end != 0;
+	return io_tlb_end[SWIOTLB_LO] != 0;
 }
 
 #ifdef CONFIG_DEBUG_FS
@@ -726,8 +733,8 @@ static int __init swiotlb_create_debugfs(void)
 	struct dentry *root;
 
 	root = debugfs_create_dir("swiotlb", NULL);
-	debugfs_create_ulong("io_tlb_nslabs", 0400, root, &io_tlb_nslabs);
-	debugfs_create_ulong("io_tlb_used", 0400, root, &io_tlb_used);
+	debugfs_create_ulong("io_tlb_nslabs", 0400, root, &io_tlb_nslabs[SWIOTLB_LO]);
+	debugfs_create_ulong("io_tlb_used", 0400, root, &io_tlb_used[SWIOTLB_LO]);
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH RFC v1 4/6] swiotlb: enable 64-bit swiotlb
From: Dongli Zhang @ 2021-02-03 23:37 UTC (permalink / raw)
  To: dri-devel, intel-gfx, iommu, linux-mips, linux-mmc, linux-pci,
	linuxppc-dev, nouveau, x86, xen-devel
  Cc: ulf.hansson, airlied, joonas.lahtinen, adrian.hunter, paulus, hpa,
	mingo, m.szyprowski, sstabellini, joe.jin, hch, peterz, mingo,
	matthew.auld, bskeggs, thomas.lendacky, konrad.wilk, jani.nikula,
	bp, rodrigo.vivi, bhelgaas, boris.ostrovsky, chris, jgross,
	tsbogend, linux-kernel, tglx, bauerman, daniel, akpm,
	robin.murphy, rppt
In-Reply-To: <20210203233709.19819-1-dongli.zhang@oracle.com>

This patch is to enable the 64-bit swiotlb buffer.

The state of the art swiotlb pre-allocates <=32-bit memory in order to meet
the DMA mask requirement for some 32-bit legacy device. Considering most
devices nowadays support 64-bit DMA and IOMMU is available, the swiotlb is
not used for most of the times, except:

1. The xen PVM domain requires the DMA addresses to both (1) less than the
dma mask, and (2) continuous in machine address. Therefore, the 64-bit
device may still require swiotlb on PVM domain.

2. From source code the AMD SME/SEV will enable SWIOTLB_FORCE. As a result
it is always required to allocate from swiotlb buffer even the device dma
mask is 64-bit.

sme_early_init()
-> if (sev_active())
   swiotlb_force = SWIOTLB_FORCE;

Therefore, this patch introduces the 2nd swiotlb buffer for 64-bit DMA
access. For instance, the swiotlb_tbl_map_single() allocates from the 2nd
64-bit buffer if the device DMA mask is
min_not_zero(*hwdev->dma_mask, hwdev->bus_dma_limit).

The example to configure 64-bit swiotlb is "swiotlb=65536,524288,force"
or "swiotlb=,524288,force", where 524288 is the size of 64-bit buffer.

With the patch, the kernel will be able to allocate >4GB swiotlb buffer.
This patch is only for swiotlb, not including xen-swiotlb.

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 arch/mips/cavium-octeon/dma-octeon.c         |   3 +-
 arch/powerpc/kernel/dma-swiotlb.c            |   2 +-
 arch/powerpc/platforms/pseries/svm.c         |   2 +-
 arch/x86/kernel/pci-swiotlb.c                |   5 +-
 arch/x86/pci/sta2x11-fixup.c                 |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_internal.c |   4 +-
 drivers/gpu/drm/i915/i915_scatterlist.h      |   2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c        |   2 +-
 drivers/mmc/host/sdhci.c                     |   2 +-
 drivers/pci/xen-pcifront.c                   |   2 +-
 drivers/xen/swiotlb-xen.c                    |   9 +-
 include/linux/swiotlb.h                      |  28 +-
 kernel/dma/swiotlb.c                         | 339 +++++++++++--------
 13 files changed, 238 insertions(+), 164 deletions(-)

diff --git a/arch/mips/cavium-octeon/dma-octeon.c b/arch/mips/cavium-octeon/dma-octeon.c
index df70308db0e6..3480555d908a 100644
--- a/arch/mips/cavium-octeon/dma-octeon.c
+++ b/arch/mips/cavium-octeon/dma-octeon.c
@@ -245,6 +245,7 @@ void __init plat_swiotlb_setup(void)
 		panic("%s: Failed to allocate %zu bytes align=%lx\n",
 		      __func__, swiotlbsize, PAGE_SIZE);
 
-	if (swiotlb_init_with_tbl(octeon_swiotlb, swiotlb_nslabs, 1) == -ENOMEM)
+	if (swiotlb_init_with_tbl(octeon_swiotlb, swiotlb_nslabs,
+				  SWIOTLB_LO, 1) == -ENOMEM)
 		panic("Cannot allocate SWIOTLB buffer");
 }
diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c
index fc7816126a40..88113318c53f 100644
--- a/arch/powerpc/kernel/dma-swiotlb.c
+++ b/arch/powerpc/kernel/dma-swiotlb.c
@@ -20,7 +20,7 @@ void __init swiotlb_detect_4g(void)
 static int __init check_swiotlb_enabled(void)
 {
 	if (ppc_swiotlb_enable)
-		swiotlb_print_info();
+		swiotlb_print_info(SWIOTLB_LO);
 	else
 		swiotlb_exit();
 
diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c
index 9f8842d0da1f..77910e4ffad8 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -52,7 +52,7 @@ void __init svm_swiotlb_init(void)
 	bytes = io_tlb_nslabs << IO_TLB_SHIFT;
 
 	vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE);
-	if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, false))
+	if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, SWIOTLB_LO, false))
 		return;
 
 	if (io_tlb_start[SWIOTLB_LO])
diff --git a/arch/x86/kernel/pci-swiotlb.c b/arch/x86/kernel/pci-swiotlb.c
index c2cfa5e7c152..950624fd95a4 100644
--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -67,12 +67,15 @@ void __init pci_swiotlb_init(void)
 
 void __init pci_swiotlb_late_init(void)
 {
+	int i;
+
 	/* An IOMMU turned us off. */
 	if (!swiotlb)
 		swiotlb_exit();
 	else {
 		printk(KERN_INFO "PCI-DMA: "
 		       "Using software bounce buffering for IO (SWIOTLB)\n");
-		swiotlb_print_info();
+		for (i = 0; i < swiotlb_nr; i++)
+			swiotlb_print_info(i);
 	}
 }
diff --git a/arch/x86/pci/sta2x11-fixup.c b/arch/x86/pci/sta2x11-fixup.c
index 7d2525691854..c440520b2055 100644
--- a/arch/x86/pci/sta2x11-fixup.c
+++ b/arch/x86/pci/sta2x11-fixup.c
@@ -57,7 +57,7 @@ static void sta2x11_new_instance(struct pci_dev *pdev)
 		int size = STA2X11_SWIOTLB_SIZE;
 		/* First instance: register your own swiotlb area */
 		dev_info(&pdev->dev, "Using SWIOTLB (size %i)\n", size);
-		if (swiotlb_late_init_with_default_size(size))
+		if (swiotlb_late_init_with_default_size(size), SWIOTLB_LO)
 			dev_emerg(&pdev->dev, "init swiotlb failed\n");
 	}
 	list_add(&instance->list, &sta2x11_instance_list);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
index ad22f42541bd..947683f2e568 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
@@ -42,10 +42,10 @@ static int i915_gem_object_get_pages_internal(struct drm_i915_gem_object *obj)
 
 	max_order = MAX_ORDER;
 #ifdef CONFIG_SWIOTLB
-	if (swiotlb_nr_tbl()) {
+	if (swiotlb_nr_tbl(SWIOTLB_LO)) {
 		unsigned int max_segment;
 
-		max_segment = swiotlb_max_segment();
+		max_segment = swiotlb_max_segment(SWIOTLB_LO);
 		if (max_segment) {
 			max_segment = max_t(unsigned int, max_segment,
 					    PAGE_SIZE) >> PAGE_SHIFT;
diff --git a/drivers/gpu/drm/i915/i915_scatterlist.h b/drivers/gpu/drm/i915/i915_scatterlist.h
index 9cb26a224034..c63c7f6941f6 100644
--- a/drivers/gpu/drm/i915/i915_scatterlist.h
+++ b/drivers/gpu/drm/i915/i915_scatterlist.h
@@ -118,7 +118,7 @@ static inline unsigned int i915_sg_page_sizes(struct scatterlist *sg)
 
 static inline unsigned int i915_sg_segment_size(void)
 {
-	unsigned int size = swiotlb_max_segment();
+	unsigned int size = swiotlb_max_segment(SWIOTLB_LO);
 
 	if (size == 0)
 		size = UINT_MAX;
diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index a37bc3d7b38b..0919b207ac47 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -321,7 +321,7 @@ nouveau_ttm_init(struct nouveau_drm *drm)
 	}
 
 #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)
-	need_swiotlb = !!swiotlb_nr_tbl();
+	need_swiotlb = !!swiotlb_nr_tbl(SWIOTLB_LO);
 #endif
 
 	ret = ttm_bo_device_init(&drm->ttm.bdev, &nouveau_bo_driver,
diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index 646823ddd317..1f7fb912d5a9 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -4582,7 +4582,7 @@ int sdhci_setup_host(struct sdhci_host *host)
 		mmc->max_segs = SDHCI_MAX_SEGS;
 	} else if (host->flags & SDHCI_USE_SDMA) {
 		mmc->max_segs = 1;
-		if (swiotlb_max_segment()) {
+		if (swiotlb_max_segment(SWIOTLB_LO)) {
 			unsigned int max_req_size = (1 << IO_TLB_SHIFT) *
 						IO_TLB_SEGSIZE;
 			mmc->max_req_size = min(mmc->max_req_size,
diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
index c6fe0cfec0f6..9509ed9b4126 100644
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct pcifront_device *pdev)
 
 	spin_unlock(&pcifront_dev_lock);
 
-	if (!err && !swiotlb_nr_tbl()) {
+	if (!err && !swiotlb_nr_tbl(SWIOTLB_LO)) {
 		err = pci_xen_swiotlb_init_late();
 		if (err)
 			dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n");
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 3261880ad859..662638093542 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -184,7 +184,7 @@ int __ref xen_swiotlb_init(int verbose, bool early)
 	enum xen_swiotlb_err m_ret = XEN_SWIOTLB_UNKNOWN;
 	unsigned int repeat = 3;
 
-	xen_io_tlb_nslabs = swiotlb_nr_tbl();
+	xen_io_tlb_nslabs = swiotlb_nr_tbl(SWIOTLB_LO);
 retry:
 	bytes = xen_set_nslabs(xen_io_tlb_nslabs);
 	order = get_order(xen_io_tlb_nslabs << IO_TLB_SHIFT);
@@ -245,16 +245,17 @@ int __ref xen_swiotlb_init(int verbose, bool early)
 	}
 	if (early) {
 		if (swiotlb_init_with_tbl(xen_io_tlb_start, xen_io_tlb_nslabs,
-			 verbose))
+					  SWIOTLB_LO, verbose))
 			panic("Cannot allocate SWIOTLB buffer");
 		rc = 0;
 	} else
-		rc = swiotlb_late_init_with_tbl(xen_io_tlb_start, xen_io_tlb_nslabs);
+		rc = swiotlb_late_init_with_tbl(xen_io_tlb_start,
+						xen_io_tlb_nslabs, SWIOTLB_LO);
 
 end:
 	xen_io_tlb_end = xen_io_tlb_start + bytes;
 	if (!rc)
-		swiotlb_set_max_segment(PAGE_SIZE);
+		swiotlb_set_max_segment(PAGE_SIZE, SWIOTLB_LO);
 
 	return rc;
 error:
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 3d5980d77810..8ba45fddfb14 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -43,11 +43,14 @@ extern int swiotlb_nr;
 #define IO_TLB_DEFAULT_SIZE (64UL<<20)
 
 extern void swiotlb_init(int verbose);
-int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose);
-extern unsigned long swiotlb_nr_tbl(void);
+int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs,
+			  enum swiotlb_t type, int verbose);
+extern unsigned long swiotlb_nr_tbl(enum swiotlb_t type);
 unsigned long swiotlb_size_or_default(void);
-extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs);
-extern int swiotlb_late_init_with_default_size(size_t default_size);
+extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs,
+				      enum swiotlb_t type);
+extern int swiotlb_late_init_with_default_size(size_t default_size,
+					       enum swiotlb_t type);
 extern void __init swiotlb_update_mem_attributes(void);
 
 /*
@@ -83,8 +86,13 @@ extern phys_addr_t io_tlb_start[], io_tlb_end[];
 
 static inline bool is_swiotlb_buffer(phys_addr_t paddr)
 {
-	return paddr >= io_tlb_start[SWIOTLB_LO] &&
-	       paddr < io_tlb_end[SWIOTLB_LO];
+	int i;
+
+	for (i = 0; i < swiotlb_nr; i++)
+		if (paddr >= io_tlb_start[i] && paddr < io_tlb_end[i])
+			return true;
+
+	return false;
 }
 
 static inline int swiotlb_get_type(phys_addr_t paddr)
@@ -99,7 +107,7 @@ static inline int swiotlb_get_type(phys_addr_t paddr)
 }
 
 void __init swiotlb_exit(void);
-unsigned int swiotlb_max_segment(void);
+unsigned int swiotlb_max_segment(enum swiotlb_t type);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(void);
 void __init swiotlb_adjust_size(unsigned long new_size);
@@ -112,7 +120,7 @@ static inline bool is_swiotlb_buffer(phys_addr_t paddr)
 static inline void swiotlb_exit(void)
 {
 }
-static inline unsigned int swiotlb_max_segment(void)
+static inline unsigned int swiotlb_max_segment(enum swiotlb_t type)
 {
 	return 0;
 }
@@ -131,7 +139,7 @@ static inline void swiotlb_adjust_size(unsigned long new_size)
 }
 #endif /* CONFIG_SWIOTLB */
 
-extern void swiotlb_print_info(void);
-extern void swiotlb_set_max_segment(unsigned int);
+extern void swiotlb_print_info(enum swiotlb_t type);
+extern void swiotlb_set_max_segment(unsigned int val, enum swiotlb_t type);
 
 #endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c91d3d2c3936..cd28db5b016a 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -111,6 +111,11 @@ static int late_alloc;
 
 int swiotlb_nr = 1;
 
+static const char * const swiotlb_name[] = {
+	"Low Mem",
+	"High Mem"
+};
+
 static int __init
 setup_io_tlb_npages(char *str)
 {
@@ -121,11 +126,25 @@ setup_io_tlb_npages(char *str)
 	}
 	if (*str == ',')
 		++str;
+
+	if (isdigit(*str)) {
+		io_tlb_nslabs[SWIOTLB_HI] = simple_strtoul(str, &str, 0);
+		/* avoid tail segment of size < IO_TLB_SEGSIZE */
+		io_tlb_nslabs[SWIOTLB_HI] = ALIGN(io_tlb_nslabs[SWIOTLB_HI], IO_TLB_SEGSIZE);
+
+		swiotlb_nr = 2;
+	}
+
+	if (*str == ',')
+		++str;
+
 	if (!strcmp(str, "force")) {
 		swiotlb_force = SWIOTLB_FORCE;
 	} else if (!strcmp(str, "noforce")) {
 		swiotlb_force = SWIOTLB_NO_FORCE;
 		io_tlb_nslabs[SWIOTLB_LO] = 1;
+		if (swiotlb_nr > 1)
+			io_tlb_nslabs[SWIOTLB_HI] = 1;
 	}
 
 	return 0;
@@ -134,24 +153,24 @@ early_param("swiotlb", setup_io_tlb_npages);
 
 static bool no_iotlb_memory[SWIOTLB_MAX];
 
-unsigned long swiotlb_nr_tbl(void)
+unsigned long swiotlb_nr_tbl(enum swiotlb_t type)
 {
-	return unlikely(no_iotlb_memory[SWIOTLB_LO]) ? 0 : io_tlb_nslabs[SWIOTLB_LO];
+	return unlikely(no_iotlb_memory[type]) ? 0 : io_tlb_nslabs[type];
 }
 EXPORT_SYMBOL_GPL(swiotlb_nr_tbl);
 
-unsigned int swiotlb_max_segment(void)
+unsigned int swiotlb_max_segment(enum swiotlb_t type)
 {
-	return unlikely(no_iotlb_memory[SWIOTLB_LO]) ? 0 : max_segment[SWIOTLB_LO];
+	return unlikely(no_iotlb_memory[type]) ? 0 : max_segment[type];
 }
 EXPORT_SYMBOL_GPL(swiotlb_max_segment);
 
-void swiotlb_set_max_segment(unsigned int val)
+void swiotlb_set_max_segment(unsigned int val, enum swiotlb_t type)
 {
 	if (swiotlb_force == SWIOTLB_FORCE)
-		max_segment[SWIOTLB_LO] = 1;
+		max_segment[type] = 1;
 	else
-		max_segment[SWIOTLB_LO] = rounddown(val, PAGE_SIZE);
+		max_segment[type] = rounddown(val, PAGE_SIZE);
 }
 
 unsigned long swiotlb_size_or_default(void)
@@ -181,18 +200,18 @@ void __init swiotlb_adjust_size(unsigned long new_size)
 	}
 }
 
-void swiotlb_print_info(void)
+void swiotlb_print_info(enum swiotlb_t type)
 {
-	unsigned long bytes = io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
+	unsigned long bytes = io_tlb_nslabs[type] << IO_TLB_SHIFT;
 
-	if (no_iotlb_memory[SWIOTLB_LO]) {
-		pr_warn("No low mem\n");
+	if (no_iotlb_memory[type]) {
+		pr_warn("No low mem with %s\n", swiotlb_name[type]);
 		return;
 	}
 
-	pr_info("mapped [mem %pa-%pa] (%luMB)\n",
-		&io_tlb_start[SWIOTLB_LO], &io_tlb_end[SWIOTLB_LO],
-		bytes >> 20);
+	pr_info("%s mapped [mem %pa-%pa] (%luMB)\n",
+		swiotlb_name[type],
+		&io_tlb_start[type], &io_tlb_end[type], bytes >> 20);
 }
 
 /*
@@ -205,88 +224,104 @@ void __init swiotlb_update_mem_attributes(void)
 {
 	void *vaddr;
 	unsigned long bytes;
+	int i;
 
-	if (no_iotlb_memory[SWIOTLB_LO] || late_alloc)
-		return;
+	for (i = 0; i < swiotlb_nr; i++) {
+		if (no_iotlb_memory[i] || late_alloc)
+			continue;
 
-	vaddr = phys_to_virt(io_tlb_start[SWIOTLB_LO]);
-	bytes = PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);
-	set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);
-	memset(vaddr, 0, bytes);
+		vaddr = phys_to_virt(io_tlb_start[i]);
+		bytes = PAGE_ALIGN(io_tlb_nslabs[i] << IO_TLB_SHIFT);
+		set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);
+		memset(vaddr, 0, bytes);
+	}
 }
 
-int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
+int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs,
+				 enum swiotlb_t type, int verbose)
 {
 	unsigned long i, bytes;
 	size_t alloc_size;
 
 	bytes = nslabs << IO_TLB_SHIFT;
 
-	io_tlb_nslabs[SWIOTLB_LO] = nslabs;
-	io_tlb_start[SWIOTLB_LO] = __pa(tlb);
-	io_tlb_end[SWIOTLB_LO] = io_tlb_start[SWIOTLB_LO] + bytes;
+	io_tlb_nslabs[type] = nslabs;
+	io_tlb_start[type] = __pa(tlb);
+	io_tlb_end[type] = io_tlb_start[type] + bytes;
 
 	/*
 	 * Allocate and initialize the free list array.  This array is used
 	 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
 	 * between io_tlb_start and io_tlb_end.
 	 */
-	alloc_size = PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] * sizeof(int));
-	io_tlb_list[SWIOTLB_LO] = memblock_alloc(alloc_size, PAGE_SIZE);
-	if (!io_tlb_list[SWIOTLB_LO])
+	alloc_size = PAGE_ALIGN(io_tlb_nslabs[type] * sizeof(int));
+	io_tlb_list[type] = memblock_alloc(alloc_size, PAGE_SIZE);
+	if (!io_tlb_list[type])
 		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
 		      __func__, alloc_size, PAGE_SIZE);
 
-	alloc_size = PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] * sizeof(phys_addr_t));
-	io_tlb_orig_addr[SWIOTLB_LO] = memblock_alloc(alloc_size, PAGE_SIZE);
-	if (!io_tlb_orig_addr[SWIOTLB_LO])
+	alloc_size = PAGE_ALIGN(io_tlb_nslabs[type] * sizeof(phys_addr_t));
+	io_tlb_orig_addr[type] = memblock_alloc(alloc_size, PAGE_SIZE);
+	if (!io_tlb_orig_addr[type])
 		panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
 		      __func__, alloc_size, PAGE_SIZE);
 
-	for (i = 0; i < io_tlb_nslabs[SWIOTLB_LO]; i++) {
-		io_tlb_list[SWIOTLB_LO][i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
-		io_tlb_orig_addr[SWIOTLB_LO][i] = INVALID_PHYS_ADDR;
+	for (i = 0; i < io_tlb_nslabs[type]; i++) {
+		io_tlb_list[type][i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
+		io_tlb_orig_addr[type][i] = INVALID_PHYS_ADDR;
 	}
-	io_tlb_index[SWIOTLB_LO] = 0;
-	no_iotlb_memory[SWIOTLB_LO] = false;
+	io_tlb_index[type] = 0;
+	no_iotlb_memory[type] = false;
 
 	if (verbose)
-		swiotlb_print_info();
+		swiotlb_print_info(type);
 
-	swiotlb_set_max_segment(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);
+	swiotlb_set_max_segment(io_tlb_nslabs[type] << IO_TLB_SHIFT, type);
 	return 0;
 }
 
-/*
- * Statically reserve bounce buffer space and initialize bounce buffer data
- * structures for the software IO TLB used to implement the DMA API.
- */
-void  __init
-swiotlb_init(int verbose)
+static void __init
+swiotlb_init_type(enum swiotlb_t type, int verbose)
 {
 	size_t default_size = IO_TLB_DEFAULT_SIZE;
 	unsigned char *vstart;
 	unsigned long bytes;
 
-	if (!io_tlb_nslabs[SWIOTLB_LO]) {
-		io_tlb_nslabs[SWIOTLB_LO] = (default_size >> IO_TLB_SHIFT);
-		io_tlb_nslabs[SWIOTLB_LO] = ALIGN(io_tlb_nslabs[SWIOTLB_LO], IO_TLB_SEGSIZE);
+	if (!io_tlb_nslabs[type]) {
+		io_tlb_nslabs[type] = (default_size >> IO_TLB_SHIFT);
+		io_tlb_nslabs[type] = ALIGN(io_tlb_nslabs[type], IO_TLB_SEGSIZE);
 	}
 
-	bytes = io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
+	bytes = io_tlb_nslabs[type] << IO_TLB_SHIFT;
+
+	if (type == SWIOTLB_LO)
+		vstart = memblock_alloc_low(PAGE_ALIGN(bytes), PAGE_SIZE);
+	else
+		vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE);
 
-	/* Get IO TLB memory from the low pages */
-	vstart = memblock_alloc_low(PAGE_ALIGN(bytes), PAGE_SIZE);
-	if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs[SWIOTLB_LO], verbose))
+	if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs[type], type, verbose))
 		return;
 
-	if (io_tlb_start[SWIOTLB_LO]) {
-		memblock_free_early(io_tlb_start[SWIOTLB_LO],
-				    PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT));
-		io_tlb_start[SWIOTLB_LO] = 0;
+	if (io_tlb_start[type]) {
+		memblock_free_early(io_tlb_start[type],
+				    PAGE_ALIGN(io_tlb_nslabs[type] << IO_TLB_SHIFT));
+		io_tlb_start[type] = 0;
 	}
-	pr_warn("Cannot allocate buffer");
-	no_iotlb_memory[SWIOTLB_LO] = true;
+	pr_warn("Cannot allocate buffer %s\n", swiotlb_name[type]);
+	no_iotlb_memory[type] = true;
+}
+
+/*
+ * Statically reserve bounce buffer space and initialize bounce buffer data
+ * structures for the software IO TLB used to implement the DMA API.
+ */
+void  __init
+swiotlb_init(int verbose)
+{
+	int i;
+
+	for (i = 0; i < swiotlb_nr; i++)
+		swiotlb_init_type(i, verbose);
 }
 
 /*
@@ -295,67 +330,68 @@ swiotlb_init(int verbose)
  * This should be just like above, but with some error catching.
  */
 int
-swiotlb_late_init_with_default_size(size_t default_size)
+swiotlb_late_init_with_default_size(size_t default_size, enum swiotlb_t type)
 {
-	unsigned long bytes, req_nslabs = io_tlb_nslabs[SWIOTLB_LO];
+	unsigned long bytes, req_nslabs = io_tlb_nslabs[type];
 	unsigned char *vstart = NULL;
 	unsigned int order;
 	int rc = 0;
+	gfp_t gfp_mask = (type == SWIOTLB_LO) ? GFP_DMA | __GFP_NOWARN :
+						GFP_KERNEL | __GFP_NOWARN;
 
-	if (!io_tlb_nslabs[SWIOTLB_LO]) {
-		io_tlb_nslabs[SWIOTLB_LO] = (default_size >> IO_TLB_SHIFT);
-		io_tlb_nslabs[SWIOTLB_LO] = ALIGN(io_tlb_nslabs[SWIOTLB_LO], IO_TLB_SEGSIZE);
+	if (!io_tlb_nslabs[type]) {
+		io_tlb_nslabs[type] = (default_size >> IO_TLB_SHIFT);
+		io_tlb_nslabs[type] = ALIGN(io_tlb_nslabs[type], IO_TLB_SEGSIZE);
 	}
 
 	/*
 	 * Get IO TLB memory from the low pages
 	 */
-	order = get_order(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);
-	io_tlb_nslabs[SWIOTLB_LO] = SLABS_PER_PAGE << order;
-	bytes = io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT;
+	order = get_order(io_tlb_nslabs[type] << IO_TLB_SHIFT);
+	io_tlb_nslabs[type] = SLABS_PER_PAGE << order;
+	bytes = io_tlb_nslabs[type] << IO_TLB_SHIFT;
 
 	while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
-		vstart = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
-						  order);
+		vstart = (void *)__get_free_pages(gfp_mask, order);
 		if (vstart)
 			break;
 		order--;
 	}
 
 	if (!vstart) {
-		io_tlb_nslabs[SWIOTLB_LO] = req_nslabs;
+		io_tlb_nslabs[type] = req_nslabs;
 		return -ENOMEM;
 	}
 	if (order != get_order(bytes)) {
 		pr_warn("only able to allocate %ld MB\n",
 			(PAGE_SIZE << order) >> 20);
-		io_tlb_nslabs[SWIOTLB_LO] = SLABS_PER_PAGE << order;
+		io_tlb_nslabs[type] = SLABS_PER_PAGE << order;
 	}
-	rc = swiotlb_late_init_with_tbl(vstart, io_tlb_nslabs[SWIOTLB_LO]);
+	rc = swiotlb_late_init_with_tbl(vstart, io_tlb_nslabs[type], type);
 	if (rc)
 		free_pages((unsigned long)vstart, order);
 
 	return rc;
 }
 
-static void swiotlb_cleanup(void)
+static void swiotlb_cleanup(enum swiotlb_t type)
 {
-	io_tlb_end[SWIOTLB_LO] = 0;
-	io_tlb_start[SWIOTLB_LO] = 0;
-	io_tlb_nslabs[SWIOTLB_LO] = 0;
-	max_segment[SWIOTLB_LO] = 0;
+	io_tlb_end[type] = 0;
+	io_tlb_start[type] = 0;
+	io_tlb_nslabs[type] = 0;
+	max_segment[type] = 0;
 }
 
 int
-swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
+swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs, enum swiotlb_t type)
 {
 	unsigned long i, bytes;
 
 	bytes = nslabs << IO_TLB_SHIFT;
 
-	io_tlb_nslabs[SWIOTLB_LO] = nslabs;
-	io_tlb_start[SWIOTLB_LO] = virt_to_phys(tlb);
-	io_tlb_end[SWIOTLB_LO] = io_tlb_start[SWIOTLB_LO] + bytes;
+	io_tlb_nslabs[type] = nslabs;
+	io_tlb_start[type] = virt_to_phys(tlb);
+	io_tlb_end[type] = io_tlb_start[type] + bytes;
 
 	set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
 	memset(tlb, 0, bytes);
@@ -365,63 +401,67 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 	 * to find contiguous free memory regions of size up to IO_TLB_SEGSIZE
 	 * between io_tlb_start and io_tlb_end.
 	 */
-	io_tlb_list[SWIOTLB_LO] = (unsigned int *)__get_free_pages(GFP_KERNEL,
-			get_order(io_tlb_nslabs[SWIOTLB_LO] * sizeof(int)));
-	if (!io_tlb_list[SWIOTLB_LO])
+	io_tlb_list[type] = (unsigned int *)__get_free_pages(GFP_KERNEL,
+			get_order(io_tlb_nslabs[type] * sizeof(int)));
+	if (!io_tlb_list[type])
 		goto cleanup3;
 
-	io_tlb_orig_addr[SWIOTLB_LO] = (phys_addr_t *)
+	io_tlb_orig_addr[type] = (phys_addr_t *)
 		__get_free_pages(GFP_KERNEL,
-				 get_order(io_tlb_nslabs[SWIOTLB_LO] *
+				 get_order(io_tlb_nslabs[type] *
 					   sizeof(phys_addr_t)));
-	if (!io_tlb_orig_addr[SWIOTLB_LO])
+	if (!io_tlb_orig_addr[type])
 		goto cleanup4;
 
-	for (i = 0; i < io_tlb_nslabs[SWIOTLB_LO]; i++) {
-		io_tlb_list[SWIOTLB_LO][i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
-		io_tlb_orig_addr[SWIOTLB_LO][i] = INVALID_PHYS_ADDR;
+	for (i = 0; i < io_tlb_nslabs[type]; i++) {
+		io_tlb_list[type][i] = IO_TLB_SEGSIZE - OFFSET(i, IO_TLB_SEGSIZE);
+		io_tlb_orig_addr[type][i] = INVALID_PHYS_ADDR;
 	}
-	io_tlb_index[SWIOTLB_LO] = 0;
-	no_iotlb_memory[SWIOTLB_LO] = false;
+	io_tlb_index[type] = 0;
+	no_iotlb_memory[type] = false;
 
-	swiotlb_print_info();
+	swiotlb_print_info(type);
 
 	late_alloc = 1;
 
-	swiotlb_set_max_segment(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT);
+	swiotlb_set_max_segment(io_tlb_nslabs[type] << IO_TLB_SHIFT, type);
 
 	return 0;
 
 cleanup4:
-	free_pages((unsigned long)io_tlb_list[SWIOTLB_LO], get_order(io_tlb_nslabs[SWIOTLB_LO] *
+	free_pages((unsigned long)io_tlb_list[type], get_order(io_tlb_nslabs[type] *
 	                                                 sizeof(int)));
-	io_tlb_list[SWIOTLB_LO] = NULL;
+	io_tlb_list[type] = NULL;
 cleanup3:
-	swiotlb_cleanup();
+	swiotlb_cleanup(type);
 	return -ENOMEM;
 }
 
 void __init swiotlb_exit(void)
 {
-	if (!io_tlb_orig_addr[SWIOTLB_LO])
-		return;
+	int i;
 
-	if (late_alloc) {
-		free_pages((unsigned long)io_tlb_orig_addr[SWIOTLB_LO],
-			   get_order(io_tlb_nslabs[SWIOTLB_LO] * sizeof(phys_addr_t)));
-		free_pages((unsigned long)io_tlb_list[SWIOTLB_LO],
-			   get_order(io_tlb_nslabs[SWIOTLB_LO] * sizeof(int)));
-		free_pages((unsigned long)phys_to_virt(io_tlb_start[SWIOTLB_LO]),
-			   get_order(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT));
-	} else {
-		memblock_free_late(__pa(io_tlb_orig_addr[SWIOTLB_LO]),
-				   PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] * sizeof(phys_addr_t)));
-		memblock_free_late(__pa(io_tlb_list[SWIOTLB_LO]),
-				   PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] * sizeof(int)));
-		memblock_free_late(io_tlb_start[SWIOTLB_LO],
-				   PAGE_ALIGN(io_tlb_nslabs[SWIOTLB_LO] << IO_TLB_SHIFT));
+	for (i = 0; i < swiotlb_nr; i++) {
+		if (!io_tlb_orig_addr[i])
+			continue;
+
+		if (late_alloc) {
+			free_pages((unsigned long)io_tlb_orig_addr[i],
+				   get_order(io_tlb_nslabs[i] * sizeof(phys_addr_t)));
+			free_pages((unsigned long)io_tlb_list[i],
+				   get_order(io_tlb_nslabs[i] * sizeof(int)));
+			free_pages((unsigned long)phys_to_virt(io_tlb_start[i]),
+				   get_order(io_tlb_nslabs[i] << IO_TLB_SHIFT));
+		} else {
+			memblock_free_late(__pa(io_tlb_orig_addr[i]),
+					   PAGE_ALIGN(io_tlb_nslabs[i] * sizeof(phys_addr_t)));
+			memblock_free_late(__pa(io_tlb_list[i]),
+					   PAGE_ALIGN(io_tlb_nslabs[i] * sizeof(int)));
+			memblock_free_late(io_tlb_start[i],
+					   PAGE_ALIGN(io_tlb_nslabs[i] << IO_TLB_SHIFT));
+		}
+		swiotlb_cleanup(i);
 	}
-	swiotlb_cleanup();
 }
 
 /*
@@ -468,7 +508,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 		size_t mapping_size, size_t alloc_size,
 		enum dma_data_direction dir, unsigned long attrs)
 {
-	dma_addr_t tbl_dma_addr = phys_to_dma_unencrypted(hwdev, io_tlb_start[SWIOTLB_LO]);
+	dma_addr_t tbl_dma_addr;
 	unsigned long flags;
 	phys_addr_t tlb_addr;
 	unsigned int nslots, stride, index, wrap;
@@ -477,8 +517,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 	unsigned long offset_slots;
 	unsigned long max_slots;
 	unsigned long tmp_io_tlb_used;
+	unsigned long dma_mask = min_not_zero(*hwdev->dma_mask,
+					      hwdev->bus_dma_limit);
+	int type;
 
-	if (no_iotlb_memory[SWIOTLB_LO])
+	if (swiotlb_nr > 1 && dma_mask == DMA_BIT_MASK(64))
+		type = SWIOTLB_HI;
+	else
+		type = SWIOTLB_LO;
+
+	if (no_iotlb_memory[type])
 		panic("Can not allocate SWIOTLB buffer earlier and can't now provide you with the DMA bounce buffer");
 
 	if (mem_encrypt_active())
@@ -490,6 +538,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 		return (phys_addr_t)DMA_MAPPING_ERROR;
 	}
 
+	tbl_dma_addr = phys_to_dma_unencrypted(hwdev, io_tlb_start[type]);
+
 	mask = dma_get_seg_boundary(hwdev);
 
 	tbl_dma_addr &= mask;
@@ -521,11 +571,11 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 	 */
 	spin_lock_irqsave(&io_tlb_lock, flags);
 
-	if (unlikely(nslots > io_tlb_nslabs[SWIOTLB_LO] - io_tlb_used[SWIOTLB_LO]))
+	if (unlikely(nslots > io_tlb_nslabs[type] - io_tlb_used[type]))
 		goto not_found;
 
-	index = ALIGN(io_tlb_index[SWIOTLB_LO], stride);
-	if (index >= io_tlb_nslabs[SWIOTLB_LO])
+	index = ALIGN(io_tlb_index[type], stride);
+	if (index >= io_tlb_nslabs[type])
 		index = 0;
 	wrap = index;
 
@@ -533,7 +583,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 		while (iommu_is_span_boundary(index, nslots, offset_slots,
 					      max_slots)) {
 			index += stride;
-			if (index >= io_tlb_nslabs[SWIOTLB_LO])
+			if (index >= io_tlb_nslabs[type])
 				index = 0;
 			if (index == wrap)
 				goto not_found;
@@ -544,42 +594,42 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 		 * contiguous buffers, we allocate the buffers from that slot
 		 * and mark the entries as '0' indicating unavailable.
 		 */
-		if (io_tlb_list[SWIOTLB_LO][index] >= nslots) {
+		if (io_tlb_list[type][index] >= nslots) {
 			int count = 0;
 
 			for (i = index; i < (int) (index + nslots); i++)
-				io_tlb_list[SWIOTLB_LO][i] = 0;
+				io_tlb_list[type][i] = 0;
 			for (i = index - 1;
 			     (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) &&
-			     io_tlb_list[SWIOTLB_LO][i];
+			     io_tlb_list[type][i];
 			     i--)
-				io_tlb_list[SWIOTLB_LO][i] = ++count;
-			tlb_addr = io_tlb_start[SWIOTLB_LO] + (index << IO_TLB_SHIFT);
+				io_tlb_list[type][i] = ++count;
+			tlb_addr = io_tlb_start[type] + (index << IO_TLB_SHIFT);
 
 			/*
 			 * Update the indices to avoid searching in the next
 			 * round.
 			 */
-			io_tlb_index[SWIOTLB_LO] = ((index + nslots) < io_tlb_nslabs[SWIOTLB_LO]
+			io_tlb_index[type] = ((index + nslots) < io_tlb_nslabs[type]
 					? (index + nslots) : 0);
 
 			goto found;
 		}
 		index += stride;
-		if (index >= io_tlb_nslabs[SWIOTLB_LO])
+		if (index >= io_tlb_nslabs[type])
 			index = 0;
 	} while (index != wrap);
 
 not_found:
-	tmp_io_tlb_used = io_tlb_used[SWIOTLB_LO];
+	tmp_io_tlb_used = io_tlb_used[type];
 
 	spin_unlock_irqrestore(&io_tlb_lock, flags);
 	if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit())
-		dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
-			 alloc_size, io_tlb_nslabs[SWIOTLB_LO], tmp_io_tlb_used);
+		dev_warn(hwdev, "%s swiotlb buffer is full (sz: %zd bytes), total %lu (slots), used %lu (slots)\n",
+			 swiotlb_name[type], alloc_size, io_tlb_nslabs[type], tmp_io_tlb_used);
 	return (phys_addr_t)DMA_MAPPING_ERROR;
 found:
-	io_tlb_used[SWIOTLB_LO] += nslots;
+	io_tlb_used[type] += nslots;
 	spin_unlock_irqrestore(&io_tlb_lock, flags);
 
 	/*
@@ -588,7 +638,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev, phys_addr_t orig_addr,
 	 * needed.
 	 */
 	for (i = 0; i < nslots; i++)
-		io_tlb_orig_addr[SWIOTLB_LO][index+i] = orig_addr + (i << IO_TLB_SHIFT);
+		io_tlb_orig_addr[type][index+i] = orig_addr + (i << IO_TLB_SHIFT);
 	if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
 	    (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
 		swiotlb_bounce(orig_addr, tlb_addr, mapping_size, DMA_TO_DEVICE);
@@ -605,8 +655,9 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
 {
 	unsigned long flags;
 	int i, count, nslots = ALIGN(alloc_size, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
-	int index = (tlb_addr - io_tlb_start[SWIOTLB_LO]) >> IO_TLB_SHIFT;
-	phys_addr_t orig_addr = io_tlb_orig_addr[SWIOTLB_LO][index];
+	int type = swiotlb_get_type(tlb_addr);
+	int index = (tlb_addr - io_tlb_start[type]) >> IO_TLB_SHIFT;
+	phys_addr_t orig_addr = io_tlb_orig_addr[type][index];
 
 	/*
 	 * First, sync the memory before unmapping the entry
@@ -625,14 +676,14 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
 	spin_lock_irqsave(&io_tlb_lock, flags);
 	{
 		count = ((index + nslots) < ALIGN(index + 1, IO_TLB_SEGSIZE) ?
-			 io_tlb_list[SWIOTLB_LO][index + nslots] : 0);
+			 io_tlb_list[type][index + nslots] : 0);
 		/*
 		 * Step 1: return the slots to the free list, merging the
 		 * slots with superceeding slots
 		 */
 		for (i = index + nslots - 1; i >= index; i--) {
-			io_tlb_list[SWIOTLB_LO][i] = ++count;
-			io_tlb_orig_addr[SWIOTLB_LO][i] = INVALID_PHYS_ADDR;
+			io_tlb_list[type][i] = ++count;
+			io_tlb_orig_addr[type][i] = INVALID_PHYS_ADDR;
 		}
 		/*
 		 * Step 2: merge the returned slots with the preceding slots,
@@ -640,11 +691,11 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
 		 */
 		for (i = index - 1;
 		     (OFFSET(i, IO_TLB_SEGSIZE) != IO_TLB_SEGSIZE - 1) &&
-		     io_tlb_list[SWIOTLB_LO][i];
+		     io_tlb_list[type][i];
 		     i--)
-			io_tlb_list[SWIOTLB_LO][i] = ++count;
+			io_tlb_list[type][i] = ++count;
 
-		io_tlb_used[SWIOTLB_LO] -= nslots;
+		io_tlb_used[type] -= nslots;
 	}
 	spin_unlock_irqrestore(&io_tlb_lock, flags);
 }
@@ -653,8 +704,9 @@ void swiotlb_tbl_sync_single(struct device *hwdev, phys_addr_t tlb_addr,
 			     size_t size, enum dma_data_direction dir,
 			     enum dma_sync_target target)
 {
-	int index = (tlb_addr - io_tlb_start[SWIOTLB_LO]) >> IO_TLB_SHIFT;
-	phys_addr_t orig_addr = io_tlb_orig_addr[SWIOTLB_LO][index];
+	int type = swiotlb_get_type(tlb_addr);
+	int index = (tlb_addr - io_tlb_start[type]) >> IO_TLB_SHIFT;
+	phys_addr_t orig_addr = io_tlb_orig_addr[type][index];
 
 	if (orig_addr == INVALID_PHYS_ADDR)
 		return;
@@ -737,6 +789,15 @@ static int __init swiotlb_create_debugfs(void)
 	root = debugfs_create_dir("swiotlb", NULL);
 	debugfs_create_ulong("io_tlb_nslabs", 0400, root, &io_tlb_nslabs[SWIOTLB_LO]);
 	debugfs_create_ulong("io_tlb_used", 0400, root, &io_tlb_used[SWIOTLB_LO]);
+
+	if (swiotlb_nr == 1)
+		return 0;
+
+	debugfs_create_ulong("io_tlb_nslabs-highmem", 0400,
+			     root, &io_tlb_nslabs[SWIOTLB_HI]);
+	debugfs_create_ulong("io_tlb_used-highmem", 0400,
+			     root, &io_tlb_used[SWIOTLB_HI]);
+
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH RFC v1 1/6] swiotlb: define new enumerated type
From: Dongli Zhang @ 2021-02-03 23:37 UTC (permalink / raw)
  To: dri-devel, intel-gfx, iommu, linux-mips, linux-mmc, linux-pci,
	linuxppc-dev, nouveau, x86, xen-devel
  Cc: ulf.hansson, airlied, joonas.lahtinen, adrian.hunter, paulus, hpa,
	mingo, m.szyprowski, sstabellini, joe.jin, hch, peterz, mingo,
	matthew.auld, bskeggs, thomas.lendacky, konrad.wilk, jani.nikula,
	bp, rodrigo.vivi, bhelgaas, boris.ostrovsky, chris, jgross,
	tsbogend, linux-kernel, tglx, bauerman, daniel, akpm,
	robin.murphy, rppt
In-Reply-To: <20210203233709.19819-1-dongli.zhang@oracle.com>

This is just to define new enumerated type without functional change.

The 'SWIOTLB_LO' is to index legacy 32-bit swiotlb buffer, while the
'SWIOTLB_HI' is to index the 64-bit buffer.

This is to prepare to enable 64-bit swiotlb.

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 include/linux/swiotlb.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index d9c9fc9ca5d2..ca125c1b1281 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -17,6 +17,12 @@ enum swiotlb_force {
 	SWIOTLB_NO_FORCE,	/* swiotlb=noforce */
 };
 
+enum swiotlb_t {
+	SWIOTLB_LO,
+	SWIOTLB_HI,
+	SWIOTLB_MAX,
+};
+
 /*
  * Maximum allowable number of contiguous slabs to map,
  * must be a power of 2.  What is the appropriate value ?
-- 
2.17.1


^ permalink raw reply related

* [PATCH RFC v1 3/6] swiotlb: introduce swiotlb_get_type() to calculate swiotlb buffer type
From: Dongli Zhang @ 2021-02-03 23:37 UTC (permalink / raw)
  To: dri-devel, intel-gfx, iommu, linux-mips, linux-mmc, linux-pci,
	linuxppc-dev, nouveau, x86, xen-devel
  Cc: ulf.hansson, airlied, joonas.lahtinen, adrian.hunter, paulus, hpa,
	mingo, m.szyprowski, sstabellini, joe.jin, hch, peterz, mingo,
	matthew.auld, bskeggs, thomas.lendacky, konrad.wilk, jani.nikula,
	bp, rodrigo.vivi, bhelgaas, boris.ostrovsky, chris, jgross,
	tsbogend, linux-kernel, tglx, bauerman, daniel, akpm,
	robin.murphy, rppt
In-Reply-To: <20210203233709.19819-1-dongli.zhang@oracle.com>

This patch introduces swiotlb_get_type() in order to calculate which
swiotlb buffer the given DMA address is belong to.

This is to prepare to enable 64-bit swiotlb.

Cc: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 include/linux/swiotlb.h | 14 ++++++++++++++
 kernel/dma/swiotlb.c    |  2 ++
 2 files changed, 16 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 777046cd4d1b..3d5980d77810 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -3,6 +3,7 @@
 #define __LINUX_SWIOTLB_H
 
 #include <linux/dma-direction.h>
+#include <linux/errno.h>
 #include <linux/init.h>
 #include <linux/types.h>
 #include <linux/limits.h>
@@ -23,6 +24,8 @@ enum swiotlb_t {
 	SWIOTLB_MAX,
 };
 
+extern int swiotlb_nr;
+
 /*
  * Maximum allowable number of contiguous slabs to map,
  * must be a power of 2.  What is the appropriate value ?
@@ -84,6 +87,17 @@ static inline bool is_swiotlb_buffer(phys_addr_t paddr)
 	       paddr < io_tlb_end[SWIOTLB_LO];
 }
 
+static inline int swiotlb_get_type(phys_addr_t paddr)
+{
+	int i;
+
+	for (i = 0; i < swiotlb_nr; i++)
+		if (paddr >= io_tlb_start[i] && paddr < io_tlb_end[i])
+			return i;
+
+	return -ENOENT;
+}
+
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 1fbb65daa2dd..c91d3d2c3936 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -109,6 +109,8 @@ static DEFINE_SPINLOCK(io_tlb_lock);
 
 static int late_alloc;
 
+int swiotlb_nr = 1;
+
 static int __init
 setup_io_tlb_npages(char *str)
 {
-- 
2.17.1


^ permalink raw reply related

* [PATCH RFC v1 0/6] swiotlb: 64-bit DMA buffer
From: Dongli Zhang @ 2021-02-03 23:37 UTC (permalink / raw)
  To: dri-devel, intel-gfx, iommu, linux-mips, linux-mmc, linux-pci,
	linuxppc-dev, nouveau, x86, xen-devel
  Cc: ulf.hansson, airlied, joonas.lahtinen, adrian.hunter, paulus, hpa,
	mingo, m.szyprowski, sstabellini, joe.jin, hch, peterz, mingo,
	matthew.auld, bskeggs, thomas.lendacky, konrad.wilk, jani.nikula,
	bp, rodrigo.vivi, bhelgaas, boris.ostrovsky, chris, jgross,
	tsbogend, linux-kernel, tglx, bauerman, daniel, akpm,
	robin.murphy, rppt

This RFC is to introduce the 2nd swiotlb buffer for 64-bit DMA access.  The
prototype is based on v5.11-rc6.

The state of the art swiotlb pre-allocates <=32-bit memory in order to meet
the DMA mask requirement for some 32-bit legacy device. Considering most
devices nowadays support 64-bit DMA and IOMMU is available, the swiotlb is
not used for most of times, except:

1. The xen PVM domain requires the DMA addresses to both (1) <= the device
dma mask, and (2) continuous in machine address. Therefore, the 64-bit
device may still require swiotlb on PVM domain.

2. From source code the AMD SME/SEV will enable SWIOTLB_FORCE. As a result
it is always required to allocate from swiotlb buffer even the device dma
mask is 64-bit.

sme_early_init()
-> if (sev_active())
       swiotlb_force = SWIOTLB_FORCE;

Therefore, this RFC introduces the 2nd swiotlb buffer for 64-bit DMA
access.  For instance, the swiotlb_tbl_map_single() allocates from the 2nd
64-bit buffer if the device DMA mask min_not_zero(*hwdev->dma_mask,
hwdev->bus_dma_limit) is 64-bit.  With the RFC, the Xen/AMD will be able to
allocate >4GB swiotlb buffer.

With it being 64-bit, you can (not in this patch set but certainly
possible) allocate this at runtime. Meaning the size could change depending
on the device MMIO buffers, etc.

I have tested the patch set on Xen PVM dom0 boot via QEMU. The dom0 is boot
via:

qemu-system-x86_64 -smp 8 -m 20G -enable-kvm -vnc :9 \
-net nic -net user,hostfwd=tcp::5029-:22 \
-hda disk.img \
-device nvme,drive=nvme0,serial=deudbeaf1,max_ioqpairs=16 \
-drive file=test.qcow2,if=none,id=nvme0 \
-serial stdio

The "swiotlb=65536,1048576,force" is to configure 32-bit swiotlb as 128MB
and 64-bit swiotlb as 2048MB. The swiotlb is enforced.

vm# cat /proc/cmdline 
placeholder root=UUID=4e942d60-c228-4caf-b98e-f41c365d9703 ro text
swiotlb=65536,1048576,force quiet splash

[    5.119877] Booting paravirtualized kernel on Xen
... ...
[    5.190423] software IO TLB: Low Mem mapped [mem 0x0000000234e00000-0x000000023ce00000] (128MB)
[    6.276161] software IO TLB: High Mem mapped [mem 0x0000000166f33000-0x00000001e6f33000] (2048MB)

0x0000000234e00000 is mapped to 0x00000000001c0000 (32-bit machine address)
0x000000023ce00000-1 is mapped to 0x000000000ff3ffff (32-bit machine address)
0x0000000166f33000 is mapped to 0x00000004b7280000 (64-bit machine address)
0x00000001e6f33000-1 is mapped to 0x000000033a07ffff (64-bit machine address)

While running fio for emulated-NVMe, the swiotlb is allocating from 64-bit
io_tlb_used-highmem.

vm# cat /sys/kernel/debug/swiotlb/io_tlb_nslabs
65536
vm# cat /sys/kernel/debug/swiotlb/io_tlb_used
258
vm# cat /sys/kernel/debug/swiotlb/io_tlb_nslabs-highmem
1048576
vm# cat /sys/kernel/debug/swiotlb/io_tlb_used-highmem
58880

I also tested virtio-scsi (with "disable-legacy=on,iommu_platform=true") on
VM with AMD SEV enabled.

qemu-system-x86_64 -enable-kvm -machine q35 -smp 36 -m 20G \
-drive if=pflash,format=raw,unit=0,file=OVMF_CODE.pure-efi.fd,readonly \
-drive if=pflash,format=raw,unit=1,file=OVMF_VARS.fd \
-hda ol7-uefi.qcow2 -serial stdio -vnc :9 \
-net nic -net user,hostfwd=tcp::5029-:22 \
-cpu EPYC -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1 \
-machine memory-encryption=sev0 \
-device virtio-scsi-pci,id=scsi,disable-legacy=on,iommu_platform=true \
-device scsi-hd,drive=disk0 \
-drive file=test.qcow2,if=none,id=disk0,format=qcow2

The "swiotlb=65536,1048576" is to configure 32-bit swiotlb as 128MB and
64-bit swiotlb as 2048MB. We do not need to force swiotlb because AMD SEV
will set SWIOTLB_FORCE.

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.11.0-rc6swiotlb+ root=/dev/mapper/ol-root ro
crashkernel=auto rd.lvm.lv=ol/root rd.lvm.lv=ol/swap rhgb quiet
LANG=en_US.UTF-8 swiotlb=65536,1048576

[    0.729790] AMD Memory Encryption Features active: SEV
... ...
[    2.113147] software IO TLB: Low Mem mapped [mem 0x0000000073e1e000-0x000000007be1e000] (128MB)
[    2.113151] software IO TLB: High Mem mapped [mem 0x00000004e8400000-0x0000000568400000] (2048MB)

While running fio for virtio-scsi, the swiotlb is allocating from 64-bit
io_tlb_used-highmem.

vm# cat /sys/kernel/debug/swiotlb/io_tlb_nslabs
65536
vm# cat /sys/kernel/debug/swiotlb/io_tlb_used
0
vm# cat /sys/kernel/debug/swiotlb/io_tlb_nslabs-highmem
1048576
vm# cat /sys/kernel/debug/swiotlb/io_tlb_used-highmem
64647

Please let me know if there is any feedback for this idea and RFC.

Dongli Zhang (6):
   swiotlb: define new enumerated type
   swiotlb: convert variables to arrays
   swiotlb: introduce swiotlb_get_type() to calculate swiotlb buffer type
   swiotlb: enable 64-bit swiotlb
   xen-swiotlb: convert variables to arrays
   xen-swiotlb: enable 64-bit xen-swiotlb

 arch/mips/cavium-octeon/dma-octeon.c         |   3 +-
 arch/powerpc/kernel/dma-swiotlb.c            |   2 +-
 arch/powerpc/platforms/pseries/svm.c         |   8 +-
 arch/x86/kernel/pci-swiotlb.c                |   5 +-
 arch/x86/pci/sta2x11-fixup.c                 |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_internal.c |   4 +-
 drivers/gpu/drm/i915/i915_scatterlist.h      |   2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c        |   2 +-
 drivers/mmc/host/sdhci.c                     |   2 +-
 drivers/pci/xen-pcifront.c                   |   2 +-
 drivers/xen/swiotlb-xen.c                    | 123 ++++---
 include/linux/swiotlb.h                      |  49 ++-
 kernel/dma/swiotlb.c                         | 382 +++++++++++++---------
 13 files changed, 363 insertions(+), 223 deletions(-)

Thank you very much!

Dongli Zhang

^ permalink raw reply

* Re: [PATCH] powerpc/pkeys: Remove unused code
From: Ram Pai @ 2021-02-03 21:38 UTC (permalink / raw)
  To: Sandipan Das; +Cc: linuxppc-dev, aneesh.kumar
In-Reply-To: <20210202150050.75335-1-sandipan@linux.ibm.com>

On Tue, Feb 02, 2021 at 08:30:50PM +0530, Sandipan Das wrote:
> This removes arch_supports_pkeys(), arch_usable_pkeys() and
> thread_pkey_regs_*() which are remnants from the following:
> 
> commit 06bb53b33804 ("powerpc: store and restore the pkey state across context switches")
> commit 2cd4bd192ee9 ("powerpc/pkeys: Fix handling of pkey state across fork()")
> commit cf43d3b26452 ("powerpc: Enable pkey subsystem")
> 
> arch_supports_pkeys() and arch_usable_pkeys() were unused
> since their introduction while thread_pkey_regs_*() became
> unused after the introduction of the following:
> 
> commit d5fa30e6993f ("powerpc/book3s64/pkeys: Reset userspace AMR correctly on exec")
> commit 48a8ab4eeb82 ("powerpc/book3s64/pkeys: Don't update SPRN_AMR when in kernel mode")
> 
> Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/mmu_context.h | 3 ---
>  arch/powerpc/include/asm/pkeys.h       | 6 ------
>  2 files changed, 9 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index d5821834dba9..652ce85f9410 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -282,9 +282,6 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
>  }
> 
>  #define pkey_mm_init(mm)
> -#define thread_pkey_regs_save(thread)
> -#define thread_pkey_regs_restore(new_thread, old_thread)
> -#define thread_pkey_regs_init(thread)
>  #define arch_dup_pkeys(oldmm, mm)
> 
>  static inline u64 pte_to_hpte_pkey_bits(u64 pteflags, unsigned long flags)
> diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
> index a7951049e129..59a2c7dbc78f 100644
> --- a/arch/powerpc/include/asm/pkeys.h
> +++ b/arch/powerpc/include/asm/pkeys.h
> @@ -169,10 +169,4 @@ static inline bool arch_pkeys_enabled(void)
>  }
> 
>  extern void pkey_mm_init(struct mm_struct *mm);
> -extern bool arch_supports_pkeys(int cap);
> -extern unsigned int arch_usable_pkeys(void);
> -extern void thread_pkey_regs_save(struct thread_struct *thread);
> -extern void thread_pkey_regs_restore(struct thread_struct *new_thread,
> -				     struct thread_struct *old_thread);
> -extern void thread_pkey_regs_init(struct thread_struct *thread);
>  #endif /*_ASM_POWERPC_KEYS_H */

Looking at this code after a long time. Though I do not entirely understand,
the reason to change the logic for save and restore of the
AMR/IAMR register, I do understand that with those changes these three
function-prototypes became unnecessary.

Reviewed-by: Ram Pai <linuxram@us.ibm.com>


RP

^ permalink raw reply

* Re: [PATCH v2 1/3] powerpc: sstep: Fix load and update emulation
From: Segher Boessenkool @ 2021-02-03 21:17 UTC (permalink / raw)
  To: Naveen N. Rao
  Cc: ravi.bangoria, ananth, jniethe5, paulus, Sandipan Das,
	linuxppc-dev, dja
In-Reply-To: <20210203094909.GD210@DESKTOP-TDPLP67.localdomain>

On Wed, Feb 03, 2021 at 03:19:09PM +0530, Naveen N. Rao wrote:
> On 2021/02/03 12:08PM, Sandipan Das wrote:
> > The Power ISA says that the fixed-point load and update
> > instructions must neither use R0 for the base address (RA)
> > nor have the destination (RT) and the base address (RA) as
> > the same register. In these cases, the instruction is
> > invalid.

> > However, the following behaviour is observed using some
> > invalid opcodes where RA = RT.
> > 
> > An userspace program using an invalid instruction word like
> > 0xe9ce0001, i.e. "ldu r14, 0(r14)", runs and exits without
> > getting terminated abruptly. The instruction performs the
> > load operation but does not write the effective address to
> > the base address register. 
> 
> While the processor (p8 in my test) doesn't seem to be throwing an 
> exception, I don't think it is necessarily loading the value. Qemu 
> throws an exception though. It's probably best to term the behavior as 
> being undefined.

Power8 does:

  Load with Update Instructions (RA = 0)
    EA is placed into R0.
  Load with Update Instructions (RA = RT)
    EA is placed into RT. The storage operand addressed by EA is
    accessed, but the data returned by the load is discarded.

Power9 does:

  Load with Update Instructions (RA = 0)
    EA is placed into R0.
  Load with Update Instructions (RA = RT)
    The storage operand addressed by EA is accessed. The displacement
    field is added to the data returned by the load and placed into RT.

Both UMs also say

  Invalid Forms
    In general, the POWER9 core handles invalid forms of instructions in
    the manner that is most convenient for the particular case (within
    the scope of meeting the boundedly-undefined definition described in
    the Power ISA). This document specifies the behavior for these
    cases.  However, it is not recommended that software or other system
    facilities make use of the POWER9 behavior in these cases because
    such behavior might be different in another processor that
    implements the Power ISA.

(or POWER8 instead of POWER9 of course).  Always complaining about most
invalid forms seems wise, certainly if not all recent CPUs behave the
same :-)


Segher

^ permalink raw reply

* [PATCH v5 00/10] Improve signal performance on PPC64 with KUAP
From: Christopher M. Riedl @ 2021-02-03 18:43 UTC (permalink / raw)
  To: linuxppc-dev

As reported by Anton, there is a large penalty to signal handling
performance on radix systems using KUAP. The signal handling code
performs many user access operations, each of which needs to switch the
KUAP permissions bit to open and then close user access. This involves a
costly 'mtspr' operation [0].

There is existing work done on x86 and by Christophe Leroy for PPC32 to
instead open up user access in "blocks" using user_*_access_{begin,end}.
We can do the same in PPC64 to bring performance back up on KUAP-enabled
radix and now also hash MMU systems [1].

Hash MMU KUAP support along with uaccess flush has landed in linuxppc/next
since the last revision. This series also provides a large benefit on hash
with KUAP. However, in the hash implementation of KUAP the user AMR is
always restored during system_call_exception() which cannot be avoided.
Fewer user access switches naturally also result in less uaccess flushing.

The first two patches add some needed 'unsafe' versions of copy-from
functions. While these do not make use of asm-goto they still allow for
avoiding the repeated uaccess switches.

The third patch moves functions called by setup_sigcontext() into a new
prepare_setup_sigcontext() to simplify converting setup_sigcontext()
into an 'unsafe' version which assumes an open uaccess window later.

The fourth and fifths patches clean-up some of the Transactional Memory
ifdef stuff to simplify using uaccess blocks later.

The next two patches rewrite some of the signal64 helper functions to
be 'unsafe'. Finally, the last three patches update the main signal
handling functions to make use of the new 'unsafe' helpers and eliminate
some additional uaccess switching.

I used the will-it-scale signal1 benchmark to measure and compare
performance [2]. The below results are from running a minimal
kernel+initramfs QEMU/KVM guest on a POWER9 Blackbird:

	signal1_threads -t1 -s10

	|                              | hash   | radix  |
	| ---------------------------- | ------ | ------ |
	| linuxppc/next                | 117667 | 135752 |
	| linuxppc/next w/o KUAP+KUEP  | 225273 | 227567 |
	| unsafe-signal64              | 193402 | 230983 |

[0]: https://github.com/linuxppc/issues/issues/277
[1]: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=196278
[2]: https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c

v5:	* Use sizeof(buf) in copy_{vsx,fpr}_from_user() (Thanks David Laight)
	* Rebase on latest linuxppc/next

v4:	* Fix issues identified by Christophe Leroy (Thanks for review)
	* Use __get_user() directly to copy the 8B sigset_t

v3:	* Rebase on latest linuxppc/next
	* Reword confusing commit messages
	* Add missing comma in macro in signal.h which broke compiles without
	  CONFIG_ALTIVEC
	* Validate hash KUAP signal performance improvements

v2:	* Rebase on latest linuxppc/next + Christophe Leroy's PPC32
	  signal series
	* Simplify/remove TM ifdefery similar to PPC32 series and clean
	  up the uaccess begin/end calls
	* Isolate non-inline functions so they are not called when
	  uaccess window is open

Christopher M. Riedl (8):
  powerpc/uaccess: Add unsafe_copy_from_user
  powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()
  powerpc/signal64: Move non-inline functions out of setup_sigcontext()
  powerpc: Reference param in MSR_TM_ACTIVE() macro
  powerpc/signal64: Remove TM ifdefery in middle of if/else block
  powerpc/signal64: Replace setup_sigcontext() w/
    unsafe_setup_sigcontext()
  powerpc/signal64: Replace restore_sigcontext() w/
    unsafe_restore_sigcontext()
  powerpc/signal64: Use __get_user() to copy sigset_t

Daniel Axtens (2):
  powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess
    switches
  powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

 arch/powerpc/include/asm/reg.h     |   2 +-
 arch/powerpc/include/asm/uaccess.h |   3 +
 arch/powerpc/kernel/signal.h       |  30 ++++
 arch/powerpc/kernel/signal_64.c    | 251 ++++++++++++++++++-----------
 4 files changed, 193 insertions(+), 93 deletions(-)

-- 
2.26.1

^ permalink raw reply

* [PATCH v5 07/10] powerpc/signal64: Replace restore_sigcontext() w/ unsafe_restore_sigcontext()
From: Christopher M. Riedl @ 2021-02-03 18:43 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <20210203184323.20792-1-cmr@codefail.de>

Previously restore_sigcontext() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite restore_sigcontext() to assume that a userspace read access
window is open. Replace all uaccess functions with their 'unsafe'
versions which avoid the repeated uaccess switches.

Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
---
 arch/powerpc/kernel/signal_64.c | 68 ++++++++++++++++++++-------------
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 4248e4489ff1..d668f8af18fe 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -326,14 +326,14 @@ static long setup_tm_sigcontexts(struct sigcontext __user *sc,
 /*
  * Restore the sigcontext from the signal frame.
  */
-
-static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
-			      struct sigcontext __user *sc)
+#define unsafe_restore_sigcontext(tsk, set, sig, sc, e) \
+	unsafe_op_wrap(__unsafe_restore_sigcontext(tsk, set, sig, sc), e)
+static long notrace __unsafe_restore_sigcontext(struct task_struct *tsk, sigset_t *set,
+						int sig, struct sigcontext __user *sc)
 {
 #ifdef CONFIG_ALTIVEC
 	elf_vrreg_t __user *v_regs;
 #endif
-	unsigned long err = 0;
 	unsigned long save_r13 = 0;
 	unsigned long msr;
 	struct pt_regs *regs = tsk->thread.regs;
@@ -348,27 +348,28 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
 		save_r13 = regs->gpr[13];
 
 	/* copy the GPRs */
-	err |= __copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr));
-	err |= __get_user(regs->nip, &sc->gp_regs[PT_NIP]);
+	unsafe_copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr),
+			      efault_out);
+	unsafe_get_user(regs->nip, &sc->gp_regs[PT_NIP], efault_out);
 	/* get MSR separately, transfer the LE bit if doing signal return */
-	err |= __get_user(msr, &sc->gp_regs[PT_MSR]);
+	unsafe_get_user(msr, &sc->gp_regs[PT_MSR], efault_out);
 	if (sig)
 		regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
-	err |= __get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3]);
-	err |= __get_user(regs->ctr, &sc->gp_regs[PT_CTR]);
-	err |= __get_user(regs->link, &sc->gp_regs[PT_LNK]);
-	err |= __get_user(regs->xer, &sc->gp_regs[PT_XER]);
-	err |= __get_user(regs->ccr, &sc->gp_regs[PT_CCR]);
+	unsafe_get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3], efault_out);
+	unsafe_get_user(regs->ctr, &sc->gp_regs[PT_CTR], efault_out);
+	unsafe_get_user(regs->link, &sc->gp_regs[PT_LNK], efault_out);
+	unsafe_get_user(regs->xer, &sc->gp_regs[PT_XER], efault_out);
+	unsafe_get_user(regs->ccr, &sc->gp_regs[PT_CCR], efault_out);
 	/* Don't allow userspace to set SOFTE */
 	set_trap_norestart(regs);
-	err |= __get_user(regs->dar, &sc->gp_regs[PT_DAR]);
-	err |= __get_user(regs->dsisr, &sc->gp_regs[PT_DSISR]);
-	err |= __get_user(regs->result, &sc->gp_regs[PT_RESULT]);
+	unsafe_get_user(regs->dar, &sc->gp_regs[PT_DAR], efault_out);
+	unsafe_get_user(regs->dsisr, &sc->gp_regs[PT_DSISR], efault_out);
+	unsafe_get_user(regs->result, &sc->gp_regs[PT_RESULT], efault_out);
 
 	if (!sig)
 		regs->gpr[13] = save_r13;
 	if (set != NULL)
-		err |=  __get_user(set->sig[0], &sc->oldmask);
+		unsafe_get_user(set->sig[0], &sc->oldmask, efault_out);
 
 	/*
 	 * Force reload of FP/VEC.
@@ -378,29 +379,28 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
 	regs->msr &= ~(MSR_FP | MSR_FE0 | MSR_FE1 | MSR_VEC | MSR_VSX);
 
 #ifdef CONFIG_ALTIVEC
-	err |= __get_user(v_regs, &sc->v_regs);
-	if (err)
-		return err;
+	unsafe_get_user(v_regs, &sc->v_regs, efault_out);
 	if (v_regs && !access_ok(v_regs, 34 * sizeof(vector128)))
 		return -EFAULT;
 	/* Copy 33 vec registers (vr0..31 and vscr) from the stack */
 	if (v_regs != NULL && (msr & MSR_VEC) != 0) {
-		err |= __copy_from_user(&tsk->thread.vr_state, v_regs,
-					33 * sizeof(vector128));
+		unsafe_copy_from_user(&tsk->thread.vr_state, v_regs,
+				      33 * sizeof(vector128), efault_out);
 		tsk->thread.used_vr = true;
 	} else if (tsk->thread.used_vr) {
 		memset(&tsk->thread.vr_state, 0, 33 * sizeof(vector128));
 	}
 	/* Always get VRSAVE back */
 	if (v_regs != NULL)
-		err |= __get_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
+		unsafe_get_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33],
+				efault_out);
 	else
 		tsk->thread.vrsave = 0;
 	if (cpu_has_feature(CPU_FTR_ALTIVEC))
 		mtspr(SPRN_VRSAVE, tsk->thread.vrsave);
 #endif /* CONFIG_ALTIVEC */
 	/* restore floating point */
-	err |= copy_fpr_from_user(tsk, &sc->fp_regs);
+	unsafe_copy_fpr_from_user(tsk, &sc->fp_regs, efault_out);
 #ifdef CONFIG_VSX
 	/*
 	 * Get additional VSX data. Update v_regs to point after the
@@ -409,14 +409,17 @@ static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
 	 */
 	v_regs += ELF_NVRREG;
 	if ((msr & MSR_VSX) != 0) {
-		err |= copy_vsx_from_user(tsk, v_regs);
+		unsafe_copy_vsx_from_user(tsk, v_regs, efault_out);
 		tsk->thread.used_vsr = true;
 	} else {
 		for (i = 0; i < 32 ; i++)
 			tsk->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = 0;
 	}
 #endif
-	return err;
+	return 0;
+
+efault_out:
+	return -EFAULT;
 }
 
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
@@ -701,8 +704,14 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
 	if (__copy_from_user(&set, &new_ctx->uc_sigmask, sizeof(set)))
 		do_exit(SIGSEGV);
 	set_current_blocked(&set);
-	if (restore_sigcontext(current, NULL, 0, &new_ctx->uc_mcontext))
+
+	if (!user_read_access_begin(new_ctx, ctx_size))
+		return -EFAULT;
+	if (__unsafe_restore_sigcontext(current, NULL, 0, &new_ctx->uc_mcontext)) {
+		user_read_access_end();
 		do_exit(SIGSEGV);
+	}
+	user_read_access_end();
 
 	/* This returns like rt_sigreturn */
 	set_thread_flag(TIF_RESTOREALL);
@@ -807,8 +816,13 @@ SYSCALL_DEFINE0(rt_sigreturn)
 		 * causing a TM bad thing.
 		 */
 		current->thread.regs->msr &= ~MSR_TS_MASK;
-		if (restore_sigcontext(current, NULL, 1, &uc->uc_mcontext))
+		if (!user_read_access_begin(uc, sizeof(*uc)))
+			return -EFAULT;
+		if (__unsafe_restore_sigcontext(current, NULL, 1, &uc->uc_mcontext)) {
+			user_read_access_end();
 			goto badframe;
+		}
+		user_read_access_end();
 	}
 
 	if (restore_altstack(&uc->uc_stack))
-- 
2.26.1


^ permalink raw reply related

* [PATCH v5 04/10] powerpc: Reference param in MSR_TM_ACTIVE() macro
From: Christopher M. Riedl @ 2021-02-03 18:43 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <20210203184323.20792-1-cmr@codefail.de>

Unlike the other MSR_TM_* macros, MSR_TM_ACTIVE does not reference or
use its parameter unless CONFIG_PPC_TRANSACTIONAL_MEM is defined. This
causes an 'unused variable' compile warning unless the variable is also
guarded with CONFIG_PPC_TRANSACTIONAL_MEM.

Reference but do nothing with the argument in the macro to avoid a
potential compile warning.

Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
---
 arch/powerpc/include/asm/reg.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index e40a921d78f9..c5a3e856191c 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -124,7 +124,7 @@
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 #define MSR_TM_ACTIVE(x) (((x) & MSR_TS_MASK) != 0) /* Transaction active? */
 #else
-#define MSR_TM_ACTIVE(x) 0
+#define MSR_TM_ACTIVE(x) ((void)(x), 0)
 #endif
 
 #if defined(CONFIG_PPC_BOOK3S_64)
-- 
2.26.1


^ permalink raw reply related

* [PATCH v5 10/10] powerpc/signal64: Use __get_user() to copy sigset_t
From: Christopher M. Riedl @ 2021-02-03 18:43 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <20210203184323.20792-1-cmr@codefail.de>

Usually sigset_t is exactly 8B which is a "trivial" size and does not
warrant using __copy_from_user(). Use __get_user() directly in
anticipation of future work to remove the trivial size optimizations
from __copy_from_user(). Calling __get_user() also results in a small
boost to signal handling throughput here.

Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
---
 arch/powerpc/kernel/signal_64.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 817b64e1e409..42fdc4a7ff72 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -97,6 +97,14 @@ static void prepare_setup_sigcontext(struct task_struct *tsk, int ctx_has_vsx_re
 #endif /* CONFIG_VSX */
 }
 
+static inline int get_user_sigset(sigset_t *dst, const sigset_t *src)
+{
+	if (sizeof(sigset_t) <= 8)
+		return __get_user(dst->sig[0], &src->sig[0]);
+	else
+		return __copy_from_user(dst, src, sizeof(sigset_t));
+}
+
 /*
  * Set up the sigcontext for the signal frame.
  */
@@ -701,8 +709,9 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
 	 * We kill the task with a SIGSEGV in this situation.
 	 */
 
-	if (__copy_from_user(&set, &new_ctx->uc_sigmask, sizeof(set)))
+	if (get_user_sigset(&set, &new_ctx->uc_sigmask))
 		do_exit(SIGSEGV);
+
 	set_current_blocked(&set);
 
 	if (!user_read_access_begin(new_ctx, ctx_size))
@@ -740,8 +749,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	if (!access_ok(uc, sizeof(*uc)))
 		goto badframe;
 
-	if (__copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
+	if (get_user_sigset(&set, &uc->uc_sigmask))
 		goto badframe;
+
 	set_current_blocked(&set);
 
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-- 
2.26.1


^ permalink raw reply related

* [PATCH v5 03/10] powerpc/signal64: Move non-inline functions out of setup_sigcontext()
From: Christopher M. Riedl @ 2021-02-03 18:43 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <20210203184323.20792-1-cmr@codefail.de>

There are non-inline functions which get called in setup_sigcontext() to
save register state to the thread struct. Move these functions into a
separate prepare_setup_sigcontext() function so that
setup_sigcontext() can be refactored later into an "unsafe" version
which assumes an open uaccess window. Non-inline functions should be
avoided when uaccess is open.

The majority of setup_sigcontext() can be refactored to execute in an
"unsafe" context (uaccess window is opened) except for some non-inline
functions. Move these out into a separate prepare_setup_sigcontext()
function which must be called first and before opening up a uaccess
window. A follow-up commit converts setup_sigcontext() to be "unsafe".

Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
---
 arch/powerpc/kernel/signal_64.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index f9e4a1ac440f..b211a8ea4f6e 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -79,6 +79,24 @@ static elf_vrreg_t __user *sigcontext_vmx_regs(struct sigcontext __user *sc)
 }
 #endif
 
+static void prepare_setup_sigcontext(struct task_struct *tsk, int ctx_has_vsx_region)
+{
+#ifdef CONFIG_ALTIVEC
+	/* save altivec registers */
+	if (tsk->thread.used_vr)
+		flush_altivec_to_thread(tsk);
+	if (cpu_has_feature(CPU_FTR_ALTIVEC))
+		tsk->thread.vrsave = mfspr(SPRN_VRSAVE);
+#endif /* CONFIG_ALTIVEC */
+
+	flush_fp_to_thread(tsk);
+
+#ifdef CONFIG_VSX
+	if (tsk->thread.used_vsr && ctx_has_vsx_region)
+		flush_vsx_to_thread(tsk);
+#endif /* CONFIG_VSX */
+}
+
 /*
  * Set up the sigcontext for the signal frame.
  */
@@ -97,7 +115,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 	 */
 #ifdef CONFIG_ALTIVEC
 	elf_vrreg_t __user *v_regs = sigcontext_vmx_regs(sc);
-	unsigned long vrsave;
 #endif
 	struct pt_regs *regs = tsk->thread.regs;
 	unsigned long msr = regs->msr;
@@ -112,7 +129,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 
 	/* save altivec registers */
 	if (tsk->thread.used_vr) {
-		flush_altivec_to_thread(tsk);
 		/* Copy 33 vec registers (vr0..31 and vscr) to the stack */
 		err |= __copy_to_user(v_regs, &tsk->thread.vr_state,
 				      33 * sizeof(vector128));
@@ -124,17 +140,10 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 	/* We always copy to/from vrsave, it's 0 if we don't have or don't
 	 * use altivec.
 	 */
-	vrsave = 0;
-	if (cpu_has_feature(CPU_FTR_ALTIVEC)) {
-		vrsave = mfspr(SPRN_VRSAVE);
-		tsk->thread.vrsave = vrsave;
-	}
-
-	err |= __put_user(vrsave, (u32 __user *)&v_regs[33]);
+	err |= __put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
 #else /* CONFIG_ALTIVEC */
 	err |= __put_user(0, &sc->v_regs);
 #endif /* CONFIG_ALTIVEC */
-	flush_fp_to_thread(tsk);
 	/* copy fpr regs and fpscr */
 	err |= copy_fpr_to_user(&sc->fp_regs, tsk);
 
@@ -150,7 +159,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 	 * VMX data.
 	 */
 	if (tsk->thread.used_vsr && ctx_has_vsx_region) {
-		flush_vsx_to_thread(tsk);
 		v_regs += ELF_NVRREG;
 		err |= copy_vsx_to_user(v_regs, tsk);
 		/* set MSR_VSX in the MSR value in the frame to
@@ -655,6 +663,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
 		ctx_has_vsx_region = 1;
 
 	if (old_ctx != NULL) {
+		prepare_setup_sigcontext(current, ctx_has_vsx_region);
 		if (!access_ok(old_ctx, ctx_size)
 		    || setup_sigcontext(&old_ctx->uc_mcontext, current, 0, NULL, 0,
 					ctx_has_vsx_region)
@@ -842,6 +851,7 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 #endif
 	{
 		err |= __put_user(0, &frame->uc.uc_link);
+		prepare_setup_sigcontext(tsk, 1);
 		err |= setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
 					NULL, (unsigned long)ksig->ka.sa.sa_handler,
 					1);
-- 
2.26.1


^ permalink raw reply related

* [PATCH v5 09/10] powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches
From: Christopher M. Riedl @ 2021-02-03 18:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Daniel Axtens
In-Reply-To: <20210203184323.20792-1-cmr@codefail.de>

From: Daniel Axtens <dja@axtens.net>

Add uaccess blocks and use the 'unsafe' versions of functions doing user
access where possible to reduce the number of times uaccess has to be
opened/closed.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Co-developed-by: Christopher M. Riedl <cmr@codefail.de>
Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
---
 arch/powerpc/kernel/signal_64.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index a471e97589a8..817b64e1e409 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -782,9 +782,13 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	 * restore_tm_sigcontexts.
 	 */
 	regs->msr &= ~MSR_TS_MASK;
+#endif
 
-	if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
+	if (!user_read_access_begin(uc, sizeof(*uc)))
 		goto badframe;
+
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+	unsafe_get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR], badframe_block);
 #endif
 
 	if (MSR_TM_ACTIVE(msr)) {
@@ -794,10 +798,12 @@ SYSCALL_DEFINE0(rt_sigreturn)
 
 		/* Trying to start TM on non TM system */
 		if (!cpu_has_feature(CPU_FTR_TM))
-			goto badframe;
+			goto badframe_block;
+
+		unsafe_get_user(uc_transact, &uc->uc_link, badframe_block);
+
+		user_read_access_end();
 
-		if (__get_user(uc_transact, &uc->uc_link))
-			goto badframe;
 		if (restore_tm_sigcontexts(current, &uc->uc_mcontext,
 					   &uc_transact->uc_mcontext))
 			goto badframe;
@@ -816,12 +822,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
 		 * causing a TM bad thing.
 		 */
 		current->thread.regs->msr &= ~MSR_TS_MASK;
-		if (!user_read_access_begin(uc, sizeof(*uc)))
-			return -EFAULT;
-		if (__unsafe_restore_sigcontext(current, NULL, 1, &uc->uc_mcontext)) {
-			user_read_access_end();
-			goto badframe;
-		}
+		unsafe_restore_sigcontext(current, NULL, 1, &uc->uc_mcontext,
+					  badframe_block);
+
 		user_read_access_end();
 	}
 
@@ -831,6 +834,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
 	set_thread_flag(TIF_RESTOREALL);
 	return 0;
 
+badframe_block:
+	user_read_access_end();
 badframe:
 	signal_fault(current, regs, "rt_sigreturn", uc);
 
-- 
2.26.1


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox