LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
From: Alexander Gordeev @ 2020-09-08  7:46 UTC (permalink / raw)
  To: Christophe Leroy, Michael Ellerman
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Andrey Ryabinin, Gerald Schaefer, Jeff Dike,
	Arnd Bergmann, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power, Mike Rapoport
In-Reply-To: <31dfb3ed-a0cc-3024-d389-ab9bd19e881f@csgroup.eu>

On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.

Yes, and also two more sources :/
	arch/powerpc/mm/kasan/8xx.c
	arch/powerpc/mm/kasan/kasan_init_32.c

But these two are not quite obvious wrt pgd_addr_end() used
while traversing pmds. Could you please clarify a bit?


diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224..89c5053 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -15,8 +15,8 @@
 	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
 		pte_basic_t *new;
 
-		k_next = pgd_addr_end(k_cur, k_end);
-		k_next = pgd_addr_end(k_next, k_end);
+		k_next = pmd_addr_end(k_cur, k_end);
+		k_next = pmd_addr_end(k_next, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb29404..3f7d6dc6 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_
 	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
 		pte_t *new;
 
-		k_next = pgd_addr_end(k_cur, k_end);
+		k_next = pmd_addr_end(k_cur, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
@@ -196,7 +196,7 @@ void __init kasan_early_init(void)
 	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pmd_addr_end(addr, end);
 		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
 	} while (pmd++, addr = next, addr != end);
 

> Christophe

^ permalink raw reply related

* Re: [RFC PATCH 02/12] powerpc: remove arguments from interrupt handler functions
From: Nicholas Piggin @ 2020-09-08  7:48 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: linuxppc-dev
In-Reply-To: <1599478457.27656.1.camel@po17688vm.idsi0.si.c-s.fr>

Excerpts from Christophe Leroy's message of September 7, 2020 9:34 pm:
> On Mon, 2020-09-07 at 11:20 +0200, Christophe Leroy wrote:
>> 
>> Le 05/09/2020 à 19:43, Nicholas Piggin a écrit :
>> > Make interrupt handlers all just take the pt_regs * argument and load
>> > DAR/DSISR etc from that. Make those that return a value return long.
>> 
>> I like this, it will likely simplify a bit the VMAP_STACK mess.
>> 
>> Not sure it is that easy. My board is stuck after the start of init.
>> 
>> 
>> On the 8xx, on Instruction TLB Error exception, we do
>> 
>> 	andis.	r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
>> 
>> On book3s/32, on ISI exception we do:
>> 	andis.	r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
>> 
>> On 40x and bookE, on ISI exception we do:
>> 	li	r5,0			/* Pass zero as arg3 */
>> 
>> 
>> And regs->dsisr will just contain nothing
>> 
>> So it means we should at least write back r5 into regs->dsisr from there 
>> ? The performance impact should be minimal as we already write _DAR so 
>> the cache line should already be in the cache.
>> 
>> A hacky 'stw r5, _DSISR(r1)' in handle_page_fault() does the trick, 
>> allthough we don't want to do it for both ISI and DSI at the end, so 
>> you'll have to do it in every head_xxx.S
> 
> To get you series build and work, I did the following hacks:

Great, thanks for this.

> diff --git a/arch/powerpc/include/asm/interrupt.h
> b/arch/powerpc/include/asm/interrupt.h
> index acfcc7d5779b..c11045d3113a 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -93,7 +93,9 @@ static inline void interrupt_nmi_exit_prepare(struct
> pt_regs *regs, struct inter
>  {
>  	nmi_exit();
>  
> +#ifdef CONFIG_PPC64
>  	this_cpu_set_ftrace_enabled(state->ftrace_enabled);
> +#endif

This seems okay, not a hack.

>  #ifdef CONFIG_PPC_BOOK3S_64
>  	/* Check we didn't change the pending interrupt mask. */
> diff --git a/arch/powerpc/kernel/entry_32.S
> b/arch/powerpc/kernel/entry_32.S
> index f4d0af8e1136..66f7adbe1076 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -663,6 +663,7 @@ ppc_swapcontext:
>   */
>  	.globl	handle_page_fault
>  handle_page_fault:
> +	stw	r5,_DSISR(r1)
>  	addi	r3,r1,STACK_FRAME_OVERHEAD
>  #ifdef CONFIG_PPC_BOOK3S_32
>  	andis.  r0,r5,DSISR_DABRMATCH@h

Is this what you want to do for 32, or do you want to seperate
ISI and DSI sides?

Thanks,
Nick

^ permalink raw reply

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
From: Christophe Leroy @ 2020-09-08  8:16 UTC (permalink / raw)
  To: Alexander Gordeev, Michael Ellerman
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Andrey Ryabinin, Gerald Schaefer, Jeff Dike,
	Arnd Bergmann, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power, Mike Rapoport
In-Reply-To: <20200908074638.GA19099@oc3871087118.ibm.com>



Le 08/09/2020 à 09:46, Alexander Gordeev a écrit :
> On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
>> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.
> 
> Yes, and also two more sources :/
> 	arch/powerpc/mm/kasan/8xx.c
> 	arch/powerpc/mm/kasan/kasan_init_32.c
> 
> But these two are not quite obvious wrt pgd_addr_end() used
> while traversing pmds. Could you please clarify a bit?
> 
> 
> diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
> index 2784224..89c5053 100644
> --- a/arch/powerpc/mm/kasan/8xx.c
> +++ b/arch/powerpc/mm/kasan/8xx.c
> @@ -15,8 +15,8 @@
>   	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
>   		pte_basic_t *new;
>   
> -		k_next = pgd_addr_end(k_cur, k_end);
> -		k_next = pgd_addr_end(k_next, k_end);
> +		k_next = pmd_addr_end(k_cur, k_end);
> +		k_next = pmd_addr_end(k_next, k_end);

No, I don't think so.
On powerpc32 we have only two levels, so pgd and pmd are more or less 
the same.
But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h is 
a no-op, so I don't think it will work.

It is likely that this function should iterate on pgd, then you get pmd 
= pmd_offset(pud_offset(p4d_offset(pgd)));

>   		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
>   			continue;
>   
> diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c
> index fb29404..3f7d6dc6 100644
> --- a/arch/powerpc/mm/kasan/kasan_init_32.c
> +++ b/arch/powerpc/mm/kasan/kasan_init_32.c
> @@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_
>   	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
>   		pte_t *new;
>   
> -		k_next = pgd_addr_end(k_cur, k_end);
> +		k_next = pmd_addr_end(k_cur, k_end);

Same here I get, iterate on pgd then get pmd = 
pmd_offset(pud_offset(p4d_offset(pgd)));

>   		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
>   			continue;
>   
> @@ -196,7 +196,7 @@ void __init kasan_early_init(void)
>   	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
>   
>   	do {
> -		next = pgd_addr_end(addr, end);
> +		next = pmd_addr_end(addr, end);
>   		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
>   	} while (pmd++, addr = next, addr != end);
>   
> 

Christophe

^ permalink raw reply

* Re: [RFC PATCH 02/12] powerpc: remove arguments from interrupt handler functions
From: Christophe Leroy @ 2020-09-08  8:29 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev
In-Reply-To: <1599551224.3zoap14y55.astroid@bobo.none>



Le 08/09/2020 à 09:48, Nicholas Piggin a écrit :
> Excerpts from Christophe Leroy's message of September 7, 2020 9:34 pm:
>> On Mon, 2020-09-07 at 11:20 +0200, Christophe Leroy wrote:
>>>
>>> Le 05/09/2020 à 19:43, Nicholas Piggin a écrit :
>>>> Make interrupt handlers all just take the pt_regs * argument and load
>>>> DAR/DSISR etc from that. Make those that return a value return long.
>>>
>>> I like this, it will likely simplify a bit the VMAP_STACK mess.
>>>
>>> Not sure it is that easy. My board is stuck after the start of init.
>>>
>>>
>>> On the 8xx, on Instruction TLB Error exception, we do
>>>
>>> 	andis.	r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
>>>
>>> On book3s/32, on ISI exception we do:
>>> 	andis.	r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 bits */
>>>
>>> On 40x and bookE, on ISI exception we do:
>>> 	li	r5,0			/* Pass zero as arg3 */
>>>
>>>
>>> And regs->dsisr will just contain nothing
>>>
>>> So it means we should at least write back r5 into regs->dsisr from there
>>> ? The performance impact should be minimal as we already write _DAR so
>>> the cache line should already be in the cache.
>>>
>>> A hacky 'stw r5, _DSISR(r1)' in handle_page_fault() does the trick,
>>> allthough we don't want to do it for both ISI and DSI at the end, so
>>> you'll have to do it in every head_xxx.S
>>
>> To get you series build and work, I did the following hacks:
> 
> Great, thanks for this.
> 
>> diff --git a/arch/powerpc/include/asm/interrupt.h
>> b/arch/powerpc/include/asm/interrupt.h
>> index acfcc7d5779b..c11045d3113a 100644
>> --- a/arch/powerpc/include/asm/interrupt.h
>> +++ b/arch/powerpc/include/asm/interrupt.h
>> @@ -93,7 +93,9 @@ static inline void interrupt_nmi_exit_prepare(struct
>> pt_regs *regs, struct inter
>>   {
>>   	nmi_exit();
>>   
>> +#ifdef CONFIG_PPC64
>>   	this_cpu_set_ftrace_enabled(state->ftrace_enabled);
>> +#endif
> 
> This seems okay, not a hack.
> 
>>   #ifdef CONFIG_PPC_BOOK3S_64
>>   	/* Check we didn't change the pending interrupt mask. */
>> diff --git a/arch/powerpc/kernel/entry_32.S
>> b/arch/powerpc/kernel/entry_32.S
>> index f4d0af8e1136..66f7adbe1076 100644
>> --- a/arch/powerpc/kernel/entry_32.S
>> +++ b/arch/powerpc/kernel/entry_32.S
>> @@ -663,6 +663,7 @@ ppc_swapcontext:
>>    */
>>   	.globl	handle_page_fault
>>   handle_page_fault:
>> +	stw	r5,_DSISR(r1)
>>   	addi	r3,r1,STACK_FRAME_OVERHEAD
>>   #ifdef CONFIG_PPC_BOOK3S_32
>>   	andis.  r0,r5,DSISR_DABRMATCH@h
> 
> Is this what you want to do for 32, or do you want to seperate
> ISI and DSI sides?
> 

No I think we want to separate ISI and DSI sides.

And I think the specific filtering done in ISI could be done in 
do_page_fault() in C. Ok, it would make a special handling for is_exec 
but there are already several places where the behaviour differs based 
on is_exec.
The only thing we need to keep at ASM level is the DABR stuff for 
calling do_break() in handle_page_fault(), because it is used to decide 
whether full regs are saved or not. But it could be a test done earlier 
in the prolog and the result being kept in one of the CR bits.

That way we can avoid reloading dsisr and dar from the stack, especially 
for VMAP stack as they are saved quite early in the prolog.

Or we can take them out of the thread struct and save them in the stack 
a little latter in the prolog, but then we have to keep the RI bit a bit 
cleared longer.

While we are playing with do_page_fault(), did you have a look at the 
series I sent 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/7baae4086cbb9ffb08c933b065ff7d29dbc03dd6.1596734104.git.christophe.leroy@csgroup.eu/

Christophe

^ permalink raw reply

* Re: [PATCH v1 1/5] powerpc/mm: sanity_check_fault() should work for all, not only BOOK3S
From: Nicholas Piggin @ 2020-09-08  8:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Christophe Leroy, Michael Ellerman,
	Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <7baae4086cbb9ffb08c933b065ff7d29dbc03dd6.1596734104.git.christophe.leroy@csgroup.eu>

Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> The verification and message introduced by commit 374f3f5979f9
> ("powerpc/mm/hash: Handle user access of kernel address gracefully")
> applies to all platforms, it should not be limited to BOOK3S.
> 
> Make the BOOK3S version of sanity_check_fault() the one for all,
> and bail out earlier if not BOOK3S.
> 
> Fixes: 374f3f5979f9 ("powerpc/mm/hash: Handle user access of kernel address gracefully")
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  arch/powerpc/mm/fault.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 925a7231abb3..2efa34d7e644 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -303,7 +303,6 @@ static inline void cmo_account_page_fault(void)
>  static inline void cmo_account_page_fault(void) { }
>  #endif /* CONFIG_PPC_SMLPAR */
>  
> -#ifdef CONFIG_PPC_BOOK3S
>  static void sanity_check_fault(bool is_write, bool is_user,
>  			       unsigned long error_code, unsigned long address)
>  {
> @@ -320,6 +319,9 @@ static void sanity_check_fault(bool is_write, bool is_user,
>  		return;
>  	}
>  
> +	if (!IS_ENABLED(CONFIG_PPC_BOOK3S))
> +		return;

Seems okay. Why is address == -1 special though? I guess it's because 
it may not be an exploit kernel reference but a buggy pointer underflow? 
In that case -1 doesn't seem like it would catch very much. Would it be 
better to test for high bit set for example ((long)address < 0) ?

Anyway for your patch

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> +
>  	/*
>  	 * For hash translation mode, we should never get a
>  	 * PROTFAULT. Any update to pte to reduce access will result in us
> @@ -354,10 +356,6 @@ static void sanity_check_fault(bool is_write, bool is_user,
>  
>  	WARN_ON_ONCE(error_code & DSISR_PROTFAULT);
>  }
> -#else
> -static void sanity_check_fault(bool is_write, bool is_user,
> -			       unsigned long error_code, unsigned long address) { }
> -#endif /* CONFIG_PPC_BOOK3S */
>  
>  /*
>   * Define the correct "is_write" bit in error_code based
> -- 
> 2.25.0
> 
> 

^ permalink raw reply

* Re: [PATCH v1 2/5] powerpc/fault: Unnest definition of page_fault_is_write() and page_fault_is_bad()
From: Nicholas Piggin @ 2020-09-08  8:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Christophe Leroy, Michael Ellerman,
	Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <4cd127f8988b7b5d3a9b24b67dbad81fef3aee7f.1596734104.git.christophe.leroy@csgroup.eu>

Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> To make it more readable, separate page_fault_is_write() and page_fault_is_bad()
> to avoir several levels of #ifdefs

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  arch/powerpc/mm/fault.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 2efa34d7e644..9ef9ee244f72 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -363,17 +363,19 @@ static void sanity_check_fault(bool is_write, bool is_user,
>   */
>  #if (defined(CONFIG_4xx) || defined(CONFIG_BOOKE))
>  #define page_fault_is_write(__err)	((__err) & ESR_DST)
> -#define page_fault_is_bad(__err)	(0)
>  #else
>  #define page_fault_is_write(__err)	((__err) & DSISR_ISSTORE)
> -#if defined(CONFIG_PPC_8xx)
> +#endif
> +
> +#if defined(CONFIG_4xx) || defined(CONFIG_BOOKE)
> +#define page_fault_is_bad(__err)	(0)
> +#elif defined(CONFIG_PPC_8xx)
>  #define page_fault_is_bad(__err)	((__err) & DSISR_NOEXEC_OR_G)
>  #elif defined(CONFIG_PPC64)
>  #define page_fault_is_bad(__err)	((__err) & DSISR_BAD_FAULT_64S)
>  #else
>  #define page_fault_is_bad(__err)	((__err) & DSISR_BAD_FAULT_32S)
>  #endif
> -#endif
>  
>  /*
>   * For 600- and 800-family processors, the error_code parameter is DSISR
> -- 
> 2.25.0
> 
> 

^ permalink raw reply

* Re: [PATCH v1 3/5] powerpc/fault: Reorder tests in bad_kernel_fault()
From: Nicholas Piggin @ 2020-09-08  8:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Christophe Leroy, Michael Ellerman,
	Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <ef827b73770a7b155079393f8d8430e10a99ec94.1596734104.git.christophe.leroy@csgroup.eu>

Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> Check address earlier to simplify the following test.

Good logic reduction.

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  arch/powerpc/mm/fault.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 9ef9ee244f72..525e0c2b5406 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -210,17 +210,17 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
>  		return true;
>  	}
>  
> -	if (!is_exec && address < TASK_SIZE && (error_code & DSISR_PROTFAULT) &&
> +	// Kernel fault on kernel address is bad
> +	if (address >= TASK_SIZE)
> +		return true;
> +
> +	if (!is_exec && (error_code & DSISR_PROTFAULT) &&
>  	    !search_exception_tables(regs->nip)) {
>  		pr_crit_ratelimited("Kernel attempted to access user page (%lx) - exploit attempt? (uid: %d)\n",
>  				    address,
>  				    from_kuid(&init_user_ns, current_uid()));
>  	}
>  
> -	// Kernel fault on kernel address is bad
> -	if (address >= TASK_SIZE)
> -		return true;
> -
>  	// Fault on user outside of certain regions (eg. copy_tofrom_user()) is bad
>  	if (!search_exception_tables(regs->nip))
>  		return true;
> -- 
> 2.25.0
> 
> 

^ permalink raw reply

* Re: [RFC PATCH 02/12] powerpc: remove arguments from interrupt handler functions
From: Christophe Leroy @ 2020-09-08  8:50 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev
In-Reply-To: <a4eac9b0-da50-de8a-439e-173da7c20252@csgroup.eu>



Le 08/09/2020 à 10:29, Christophe Leroy a écrit :
> 
> 
> Le 08/09/2020 à 09:48, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of September 7, 2020 9:34 pm:
>>> On Mon, 2020-09-07 at 11:20 +0200, Christophe Leroy wrote:
>>>>
>>>> Le 05/09/2020 à 19:43, Nicholas Piggin a écrit :
>>>>> Make interrupt handlers all just take the pt_regs * argument and load
>>>>> DAR/DSISR etc from that. Make those that return a value return long.
>>>>
>>>> I like this, it will likely simplify a bit the VMAP_STACK mess.
>>>>
>>>> Not sure it is that easy. My board is stuck after the start of init.
>>>>
>>>>
>>>> On the 8xx, on Instruction TLB Error exception, we do
>>>>
>>>>     andis.    r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 
>>>> bits */
>>>>
>>>> On book3s/32, on ISI exception we do:
>>>>     andis.    r5,r9,DSISR_SRR1_MATCH_32S@h /* Filter relevant SRR1 
>>>> bits */
>>>>
>>>> On 40x and bookE, on ISI exception we do:
>>>>     li    r5,0            /* Pass zero as arg3 */
>>>>
>>>>
>>>> And regs->dsisr will just contain nothing
>>>>
>>>> So it means we should at least write back r5 into regs->dsisr from 
>>>> there
>>>> ? The performance impact should be minimal as we already write _DAR so
>>>> the cache line should already be in the cache.
>>>>
>>>> A hacky 'stw r5, _DSISR(r1)' in handle_page_fault() does the trick,
>>>> allthough we don't want to do it for both ISI and DSI at the end, so
>>>> you'll have to do it in every head_xxx.S
>>>
>>> To get you series build and work, I did the following hacks:
>>
>> Great, thanks for this.
>>
>>> diff --git a/arch/powerpc/include/asm/interrupt.h
>>> b/arch/powerpc/include/asm/interrupt.h
>>> index acfcc7d5779b..c11045d3113a 100644
>>> --- a/arch/powerpc/include/asm/interrupt.h
>>> +++ b/arch/powerpc/include/asm/interrupt.h
>>> @@ -93,7 +93,9 @@ static inline void interrupt_nmi_exit_prepare(struct
>>> pt_regs *regs, struct inter
>>>   {
>>>       nmi_exit();
>>> +#ifdef CONFIG_PPC64
>>>       this_cpu_set_ftrace_enabled(state->ftrace_enabled);
>>> +#endif
>>
>> This seems okay, not a hack.
>>
>>>   #ifdef CONFIG_PPC_BOOK3S_64
>>>       /* Check we didn't change the pending interrupt mask. */
>>> diff --git a/arch/powerpc/kernel/entry_32.S
>>> b/arch/powerpc/kernel/entry_32.S
>>> index f4d0af8e1136..66f7adbe1076 100644
>>> --- a/arch/powerpc/kernel/entry_32.S
>>> +++ b/arch/powerpc/kernel/entry_32.S
>>> @@ -663,6 +663,7 @@ ppc_swapcontext:
>>>    */
>>>       .globl    handle_page_fault
>>>   handle_page_fault:
>>> +    stw    r5,_DSISR(r1)
>>>       addi    r3,r1,STACK_FRAME_OVERHEAD
>>>   #ifdef CONFIG_PPC_BOOK3S_32
>>>       andis.  r0,r5,DSISR_DABRMATCH@h
>>
>> Is this what you want to do for 32, or do you want to seperate
>> ISI and DSI sides?
>>
> 
> No I think we want to separate ISI and DSI sides.
> 
> And I think the specific filtering done in ISI could be done in 
> do_page_fault() in C. Ok, it would make a special handling for is_exec 
> but there are already several places where the behaviour differs based 
> on is_exec.
> The only thing we need to keep at ASM level is the DABR stuff for 
> calling do_break() in handle_page_fault(), because it is used to decide 
> whether full regs are saved or not. But it could be a test done earlier 
> in the prolog and the result being kept in one of the CR bits.

Looking at it once more, I'm wondering whether we really need a full regs.

Before commit 
https://github.com/linuxppc/linux/commit/d300627c6a53693fb01479b59b0cdd293761b1fa#diff-f9658f412252f3bb3093e0a95b37f3ac 
do_break() was called from do_page_fault() without a full regs set.

Christophe

^ permalink raw reply

* Re: [PATCH v1 4/5] powerpc/fault: Avoid heavy search_exception_tables() verification
From: Nicholas Piggin @ 2020-09-08  8:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Christophe Leroy, Michael Ellerman,
	Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <b07bac7a882c69deb9e6c8f234a68b3022f29072.1596734105.git.christophe.leroy@csgroup.eu>

Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> search_exception_tables() is an heavy operation, we have to avoid it.
> When KUAP is selected, we'll know the fault has been blocked by KUAP.
> Otherwise, it behaves just as if the address was already in the TLBs
> and no fault was generated.
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>

Sorry I missed reviewing this. Yes, we discussed this and decided
that it's not effective I think (and KUAP solves it properly).

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

> ---
>  arch/powerpc/mm/fault.c | 20 +++++---------------
>  1 file changed, 5 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 525e0c2b5406..edde169ba3a6 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -214,24 +214,14 @@ static bool bad_kernel_fault(struct pt_regs *regs, unsigned long error_code,
>  	if (address >= TASK_SIZE)
>  		return true;
>  
> -	if (!is_exec && (error_code & DSISR_PROTFAULT) &&
> -	    !search_exception_tables(regs->nip)) {
> +	// Read/write fault blocked by KUAP is bad, it can never succeed.
> +	if (bad_kuap_fault(regs, address, is_write)) {
>  		pr_crit_ratelimited("Kernel attempted to access user page (%lx) - exploit attempt? (uid: %d)\n",
> -				    address,
> -				    from_kuid(&init_user_ns, current_uid()));
> -	}
> -
> -	// Fault on user outside of certain regions (eg. copy_tofrom_user()) is bad
> -	if (!search_exception_tables(regs->nip))
> -		return true;
> -
> -	// Read/write fault in a valid region (the exception table search passed
> -	// above), but blocked by KUAP is bad, it can never succeed.
> -	if (bad_kuap_fault(regs, address, is_write))
> +				    address, from_kuid(&init_user_ns, current_uid()));
>  		return true;
> +	}
>  
> -	// What's left? Kernel fault on user in well defined regions (extable
> -	// matched), and allowed by KUAP in the faulting context.
> +	// What's left? Kernel fault on user and allowed by KUAP in the faulting context.
>  	return false;
>  }
>  
> -- 
> 2.25.0
> 
> 

^ permalink raw reply

* Re: [PATCH v1 1/5] powerpc/mm: sanity_check_fault() should work for all, not only BOOK3S
From: Christophe Leroy @ 2020-09-08  8:56 UTC (permalink / raw)
  To: Nicholas Piggin, Benjamin Herrenschmidt, Michael Ellerman,
	Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1599554359.m174sr2fhg.astroid@bobo.none>



Le 08/09/2020 à 10:43, Nicholas Piggin a écrit :
> Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
>> The verification and message introduced by commit 374f3f5979f9
>> ("powerpc/mm/hash: Handle user access of kernel address gracefully")
>> applies to all platforms, it should not be limited to BOOK3S.
>>
>> Make the BOOK3S version of sanity_check_fault() the one for all,
>> and bail out earlier if not BOOK3S.
>>
>> Fixes: 374f3f5979f9 ("powerpc/mm/hash: Handle user access of kernel address gracefully")
>> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
>> ---
>>   arch/powerpc/mm/fault.c | 8 +++-----
>>   1 file changed, 3 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>> index 925a7231abb3..2efa34d7e644 100644
>> --- a/arch/powerpc/mm/fault.c
>> +++ b/arch/powerpc/mm/fault.c
>> @@ -303,7 +303,6 @@ static inline void cmo_account_page_fault(void)
>>   static inline void cmo_account_page_fault(void) { }
>>   #endif /* CONFIG_PPC_SMLPAR */
>>   
>> -#ifdef CONFIG_PPC_BOOK3S
>>   static void sanity_check_fault(bool is_write, bool is_user,
>>   			       unsigned long error_code, unsigned long address)
>>   {
>> @@ -320,6 +319,9 @@ static void sanity_check_fault(bool is_write, bool is_user,
>>   		return;
>>   	}
>>   
>> +	if (!IS_ENABLED(CONFIG_PPC_BOOK3S))
>> +		return;
> 
> Seems okay. Why is address == -1 special though? I guess it's because
> it may not be an exploit kernel reference but a buggy pointer underflow?
> In that case -1 doesn't seem like it would catch very much. Would it be
> better to test for high bit set for example ((long)address < 0) ?

See 
https://github.com/linuxppc/linux/commit/0f9aee0cb9da7db7d96f63cfa2dc5e4f1bffeb87#diff-f9658f412252f3bb3093e0a95b37f3ac

-1 is what mmap() returns on error, if the app uses that as a pointer 
that's a programming error not an exploit.

Euh .. If you test (long)address < 0, then the entire kernel falls into 
that range as usually it goes from 0xc0000000 to 0xffffffff

But we could skip the top page entirely, anyway it is never mapped.

> 
> Anyway for your patch
> 
> Reviewed-by: Nicholas Piggin <npiggin@gmail.com>

Thanks
Christophe


^ permalink raw reply

* Re: [PATCH v1 5/5] powerpc/fault: Perform exception fixup in do_page_fault()
From: Nicholas Piggin @ 2020-09-08  9:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Christophe Leroy, Michael Ellerman,
	Paul Mackerras
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <5748c8f5cf0a9b3686169e2c7709107e6aaec408.1596734105.git.christophe.leroy@csgroup.eu>

Excerpts from Christophe Leroy's message of August 7, 2020 3:15 am:
> Exception fixup doesn't require the heady full regs saving,
                                      heavy

> do it from do_page_fault() directly.
> 
> For that, split bad_page_fault() in two parts.
> 
> As bad_page_fault() can also be called from other places than
> handle_page_fault(), it will still perform exception fixup and
> fallback on __bad_page_fault().
> 
> handle_page_fault() directly calls __bad_page_fault() as the
> exception fixup will now be done by do_page_fault()

Looks good. We can probably get rid of bad_page_fault completely after 
this too.

Hmm, the alignment exception might(?) hit user copies if the user points 
it to CI memory. Then you could race and the memory gets unmapped. In 
that case the exception table check might be better to be explicit there
with comments.

The first call in do_hash_fault is not required (copy user will never
be in nmi context). The second one and the one in slb_fault could be
made explicit too. Anyway for now this is fine.

Thanks,
Nick

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>


> 
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  arch/powerpc/kernel/entry_32.S       |  2 +-
>  arch/powerpc/kernel/exceptions-64e.S |  2 +-
>  arch/powerpc/kernel/exceptions-64s.S |  2 +-
>  arch/powerpc/mm/fault.c              | 33 ++++++++++++++++++++--------
>  4 files changed, 27 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
> index f4d0af8e1136..c198786591f9 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -678,7 +678,7 @@ handle_page_fault:
>  	mr	r5,r3
>  	addi	r3,r1,STACK_FRAME_OVERHEAD
>  	lwz	r4,_DAR(r1)
> -	bl	bad_page_fault
> +	bl	__bad_page_fault
>  	b	ret_from_except_full
>  
>  #ifdef CONFIG_PPC_BOOK3S_32
> diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
> index d9ed79415100..dd9161ea5da8 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -1024,7 +1024,7 @@ storage_fault_common:
>  	mr	r5,r3
>  	addi	r3,r1,STACK_FRAME_OVERHEAD
>  	ld	r4,_DAR(r1)
> -	bl	bad_page_fault
> +	bl	__bad_page_fault
>  	b	ret_from_except
>  
>  /*
> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
> index f7d748b88705..2cb3bcfb896d 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -3254,7 +3254,7 @@ handle_page_fault:
>  	mr	r5,r3
>  	addi	r3,r1,STACK_FRAME_OVERHEAD
>  	ld	r4,_DAR(r1)
> -	bl	bad_page_fault
> +	bl	__bad_page_fault
>  	b	interrupt_return
>  
>  /* We have a data breakpoint exception - handle it */
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index edde169ba3a6..bd6e397eb84a 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -542,10 +542,20 @@ NOKPROBE_SYMBOL(__do_page_fault);
>  int do_page_fault(struct pt_regs *regs, unsigned long address,
>  		  unsigned long error_code)
>  {
> +	const struct exception_table_entry *entry;
>  	enum ctx_state prev_state = exception_enter();
>  	int rc = __do_page_fault(regs, address, error_code);
>  	exception_exit(prev_state);
> -	return rc;
> +	if (likely(!rc))
> +		return 0;
> +
> +	entry = search_exception_tables(regs->nip);
> +	if (unlikely(!entry))
> +		return rc;
> +
> +	instruction_pointer_set(regs, extable_fixup(entry));
> +
> +	return 0;
>  }
>  NOKPROBE_SYMBOL(do_page_fault);
>  
> @@ -554,17 +564,10 @@ NOKPROBE_SYMBOL(do_page_fault);
>   * It is called from the DSI and ISI handlers in head.S and from some
>   * of the procedures in traps.c.
>   */
> -void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
> +void __bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
>  {
> -	const struct exception_table_entry *entry;
>  	int is_write = page_fault_is_write(regs->dsisr);
>  
> -	/* Are we prepared to handle this fault?  */
> -	if ((entry = search_exception_tables(regs->nip)) != NULL) {
> -		regs->nip = extable_fixup(entry);
> -		return;
> -	}
> -
>  	/* kernel has accessed a bad area */
>  
>  	switch (TRAP(regs)) {
> @@ -598,3 +601,15 @@ void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
>  
>  	die("Kernel access of bad area", regs, sig);
>  }
> +
> +void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
> +{
> +	const struct exception_table_entry *entry;
> +
> +	/* Are we prepared to handle this fault?  */
> +	entry = search_exception_tables(instruction_pointer(regs));
> +	if (entry)
> +		instruction_pointer_set(regs, extable_fixup(entry));
> +	else
> +		__bad_page_fault(regs, address, sig);
> +}
> -- 
> 2.25.0
> 
> 

^ permalink raw reply

* Re: [PATCH kernel] powerpc/dma: Fix dma_map_ops::get_required_mask
From: Cédric Le Goater @ 2020-09-08 11:45 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
  Cc: Oliver O'Halloran, Christoph Hellwig
In-Reply-To: <20200908015106.79661-1-aik@ozlabs.ru>

On 9/8/20 3:51 AM, Alexey Kardashevskiy wrote:
> There are 2 problems with it:
> 1. "<" vs expected "<<"
> 2. the shift number is an IOMMU page number mask, not an address mask
> as the IOMMU page shift is missing.
> 
> This did not hit us before f1565c24b596 ("powerpc: use the generic
> dma_ops_bypass mode") because we had there additional code to handle
> bypass mask so this chunk (almost?) never executed. However there
> were reports that aacraid does not work with "iommu=nobypass".
> After f1565c24b596, aacraid (and probably others which call
> dma_get_required_mask() before setting the mask) was unable to
> enable 64bit DMA and fall back to using IOMMU which was known not to work,
> one of the problems is double free of an IOMMU page.
> 
> This fixes DMA for aacraid, both with and without "iommu=nobypass"
> in the kernel command line. Verified with "stress-ng -d 4".
> 
> Fixes: f1565c24b596 ("powerpc: use the generic dma_ops_bypass mode")
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>


The boston system looks solid with this patch.

Tested-by: Cédric Le Goater <clg@kaod.org>

Thanks a lot ! 

C. 


> ---
> 
> The original code came Jun 24 2011:
> 6a5c7be5e484 ("powerpc: Override dma_get_required_mask by platform hook and ops")
> 
> 
> What is dma_get_required_mask() for anyway? What "requires" what here?
> 
> Even though it works for now (due to huge - >4GB - default DMA window),
> I am still not convinced we do not want this chunk here
> (this is what f1565c24b596 removed):
> 
> if (dev_is_pci(dev)) {
>         u64 bypass_mask = dma_direct_get_required_mask(dev);
> 
>         if (dma_iommu_bypass_supported(dev, bypass_mask))
>                 return bypass_mask;
> }
> ---
>  arch/powerpc/kernel/dma-iommu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
> index 569fecd7b5b2..9053fc9d20c7 100644
> --- a/arch/powerpc/kernel/dma-iommu.c
> +++ b/arch/powerpc/kernel/dma-iommu.c
> @@ -120,7 +120,8 @@ u64 dma_iommu_get_required_mask(struct device *dev)
>  	if (!tbl)
>  		return 0;
>  
> -	mask = 1ULL < (fls_long(tbl->it_offset + tbl->it_size) - 1);
> +	mask = 1ULL << (fls_long(tbl->it_offset + tbl->it_size) +
> +			tbl->it_page_shift - 1);
>  	mask += mask - 1;
>  
>  	return mask;
> 


^ permalink raw reply

* Re: [PATCH kernel] powerpc/dma: Fix dma_map_ops::get_required_mask
From: Alexey Kardashevskiy @ 2020-09-08 12:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Oliver O'Halloran, linuxppc-dev, Cédric Le Goater
In-Reply-To: <20200908054416.GA13585@lst.de>



On 08/09/2020 15:44, Christoph Hellwig wrote:
> On Tue, Sep 08, 2020 at 11:51:06AM +1000, Alexey Kardashevskiy wrote:
>> What is dma_get_required_mask() for anyway? What "requires" what here?
> 
> Yes, it is a really odd API.  It comes from classic old PCI where
> 64-bit addressing required an additional bus cycle, and various devices
> had different addressing schemes, with the smaller addresses beeing
> more efficient.  So this allows the driver to request the "required"
> addressing mode to address all memory.  "preferred" might be a better
> name as we'll bounce buffer if it isn't met.  I also don't really see
> why a driver would ever want to use it for a modern PCIe device.


a-ha, this makes more sense, thanks. Then I guess we need to revert that 
one bit from yours f1565c24b596, do not we?


-- 
Alexey

^ permalink raw reply

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
From: Christian Borntraeger @ 2020-09-08 12:09 UTC (permalink / raw)
  To: Christophe Leroy, Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Richard Weinberger, linux-x86, Russell King, Ingo Molnar,
	Andrey Ryabinin, Jeff Dike, Arnd Bergmann, Heiko Carstens,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, Linus Torvalds, LKML, Andrew Morton, linux-power,
	Mike Rapoport
In-Reply-To: <82fbe8f9-f199-5fc2-4168-eb43ad0b0346@csgroup.eu>



On 08.09.20 07:06, Christophe Leroy wrote:
> 
> 
> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
>> From: Alexander Gordeev <agordeev@linux.ibm.com>
>>
>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
>> dynamic page table folding.
>>
>> The question "What would it require for the generic code to work for s390"
>> has already been discussed here
>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
>> and ended with a promising approach here
>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
>> which in the end unfortunately didn't quite work completely.
>>
>> We tried to mimic static level folding by changing pgd_offset to always
>> calculate top level page table offset, and do nothing in folded pXd_offset.
>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
>> not reflect this dynamic behaviour, and still act like static 5-level
>> page tables.
>>
> 
> [...]
> 
>>
>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
>> additional pXd entry value parameter, that can be used on s390
>> to determine the correct page table level and return corresponding
>> end / boundary. With that, the pointer iteration will always
>> happen in gup_pgd_range for s390. No change for other architectures
>> introduced.
> 
> Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment.
> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
> 
> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()

given that this fixes a data corruption issue, wouldnt it be the best to go forward
with this patch ASAP and then handle the other patches on top with all the time that
we need?
> 
> 
>>
>> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
>> Cc: <stable@vger.kernel.org> # 5.2+
>> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
>> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
>> Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
>> ---
>>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++++++++++++
>>   include/linux/pgtable.h         | 16 +++++++++++++
>>   mm/gup.c                        |  8 +++----
>>   3 files changed, 62 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
>> index 7eb01a5459cd..027206e4959d 100644
>> --- a/arch/s390/include/asm/pgtable.h
>> +++ b/arch/s390/include/asm/pgtable.h
>> @@ -512,6 +512,48 @@ static inline bool mm_pmd_folded(struct mm_struct *mm)
>>   }
>>   #define mm_pmd_folded(mm) mm_pmd_folded(mm)
>>   +/*
>> + * With dynamic page table levels on s390, the static pXd_addr_end() functions
>> + * will not return corresponding dynamic boundaries. This is no problem as long
>> + * as only pXd pointers are passed down during page table walk, because
>> + * pXd_offset() will simply return the given pointer for folded levels, and the
>> + * pointer iteration over a range simply happens at the correct page table
>> + * level.
>> + * It is however a problem with gup_fast, or other places walking the page
>> + * tables w/o locks using READ_ONCE(), and passing down the pXd values instead
>> + * of pointers. In this case, the pointer given to pXd_offset() is a pointer to
>> + * a stack variable, which cannot be used for pointer iteration at the correct
>> + * level. Instead, the iteration then has to happen by going up to pgd level
>> + * again. To allow this, provide pXd_addr_end_folded() functions with an
>> + * additional pXd value parameter, which can be used on s390 to determine the
>> + * folding level and return the corresponding boundary.
>> + */
>> +static inline unsigned long rste_addr_end_folded(unsigned long rste, unsigned long addr, unsigned long end)
> 
> What does 'rste' stands for ?
> 
> Isn't this line a bit long ?

this is region/segment table entry according to the architecture. 
On our platform we do have the pagetables with a different format that
next levels (segment table -> 1MB granularity, region 3rd table -> 2 GB
granularity, region 2nd table -> 4TB granularity, region 1st table -> 8 PB
granularity. ST,R3,R2,R1 have the same format and are thus often called
crste (combined region and segment table entry).

^ permalink raw reply

* Re: [PATCH kernel] powerpc/dma: Fix dma_map_ops::get_required_mask
From: Christoph Hellwig @ 2020-09-08 12:19 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Oliver O'Halloran, linuxppc-dev, Christoph Hellwig,
	Cédric Le Goater
In-Reply-To: <94353228-2262-cfa1-7177-7eed2288ca63@ozlabs.ru>

On Tue, Sep 08, 2020 at 10:06:56PM +1000, Alexey Kardashevskiy wrote:
> On 08/09/2020 15:44, Christoph Hellwig wrote:
>> On Tue, Sep 08, 2020 at 11:51:06AM +1000, Alexey Kardashevskiy wrote:
>>> What is dma_get_required_mask() for anyway? What "requires" what here?
>>
>> Yes, it is a really odd API.  It comes from classic old PCI where
>> 64-bit addressing required an additional bus cycle, and various devices
>> had different addressing schemes, with the smaller addresses beeing
>> more efficient.  So this allows the driver to request the "required"
>> addressing mode to address all memory.  "preferred" might be a better
>> name as we'll bounce buffer if it isn't met.  I also don't really see
>> why a driver would ever want to use it for a modern PCIe device.
>
>
> a-ha, this makes more sense, thanks. Then I guess we need to revert that 
> one bit from yours f1565c24b596, do not we?

Why?  The was the original intent of the API, but now we also use
internally to check the addressing capabilities.

^ permalink raw reply

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
From: Christophe Leroy @ 2020-09-08 12:40 UTC (permalink / raw)
  To: Christian Borntraeger, Gerald Schaefer, Jason Gunthorpe,
	John Hubbard
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Richard Weinberger, linux-x86, Russell King, Ingo Molnar,
	Andrey Ryabinin, Jeff Dike, Arnd Bergmann, Heiko Carstens,
	linux-um, Borislav Petkov, Andy Lutomirski, Thomas Gleixner,
	linux-arm, Linus Torvalds, LKML, Andrew Morton, linux-power,
	Mike Rapoport
In-Reply-To: <70a3dcb5-5ed1-6efa-6158-d0573d6927da@de.ibm.com>



Le 08/09/2020 à 14:09, Christian Borntraeger a écrit :
> 
> 
> On 08.09.20 07:06, Christophe Leroy wrote:
>>
>>
>> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
>>> From: Alexander Gordeev <agordeev@linux.ibm.com>
>>>
>>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
>>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
>>> dynamic page table folding.
>>>
>>> The question "What would it require for the generic code to work for s390"
>>> has already been discussed here
>>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
>>> and ended with a promising approach here
>>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
>>> which in the end unfortunately didn't quite work completely.
>>>
>>> We tried to mimic static level folding by changing pgd_offset to always
>>> calculate top level page table offset, and do nothing in folded pXd_offset.
>>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
>>> not reflect this dynamic behaviour, and still act like static 5-level
>>> page tables.
>>>
>>
>> [...]
>>
>>>
>>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
>>> additional pXd entry value parameter, that can be used on s390
>>> to determine the correct page table level and return corresponding
>>> end / boundary. With that, the pointer iteration will always
>>> happen in gup_pgd_range for s390. No change for other architectures
>>> introduced.
>>
>> Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment.
>> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
>>
>> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()
> 
> given that this fixes a data corruption issue, wouldnt it be the best to go forward
> with this patch ASAP and then handle the other patches on top with all the time that
> we need?

I have no strong opinion on this, but I feel rather tricky to have to 
change generic part of GUP to use a new fonction then revert that change 
in the following patch, just because you want the first patch in stable 
and not the second one.

Regardless, I was wondering, why do we need a reference to the pXd at 
all when calling pXd_addr_end() ?

Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
passed addr ?

Christophe

^ permalink raw reply

* [PATCH] powerpc/64: Make VDSO32 track COMPAT on 64-bit
From: Michael Ellerman @ 2020-09-08 12:58 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: christophe.leroy, msuchanek

When we added the VDSO32 kconfig symbol, which controls building of
the 32-bit VDSO, we made it depend on CPU_BIG_ENDIAN (for 64-bit).

That was because back then COMPAT was always enabled for 64-bit, so
depending on it would have left the 32-bit VDSO always enabled, which
we didn't want.

But since then we have made COMPAT selectable, and off by default for
ppc64le, so VDSO32 should really depend on that.

For most people this makes no difference, none of the defconfigs
change, it's only if someone is building ppc64le with COMPAT=y, they
will now also get VDSO32. If they've enabled COMPAT in order to run
32-bit binaries they presumably also want the 32-bit VDSO.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/platforms/Kconfig.cputype | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 87737ec86d39..a80ad0ef436e 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -490,13 +490,12 @@ endmenu
 
 config VDSO32
 	def_bool y
-	depends on PPC32 || CPU_BIG_ENDIAN
+	depends on PPC32 || COMPAT
 	help
 	  This symbol controls whether we build the 32-bit VDSO. We obviously
 	  want to do that if we're building a 32-bit kernel. If we're building
-	  a 64-bit kernel then we only want a 32-bit VDSO if we're building for
-	  big endian. That is because the only little endian configuration we
-	  support is ppc64le which is 64-bit only.
+	  a 64-bit kernel then we only want a 32-bit VDSO if we're also enabling
+	  COMPAT.
 
 choice
 	prompt "Endianness selection"
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH] powerpc/64: Make VDSO32 track COMPAT on 64-bit
From: Christophe Leroy @ 2020-09-08 13:11 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: msuchanek
In-Reply-To: <20200908125850.407939-1-mpe@ellerman.id.au>



Le 08/09/2020 à 14:58, Michael Ellerman a écrit :
> When we added the VDSO32 kconfig symbol, which controls building of
> the 32-bit VDSO, we made it depend on CPU_BIG_ENDIAN (for 64-bit).
> 
> That was because back then COMPAT was always enabled for 64-bit, so
> depending on it would have left the 32-bit VDSO always enabled, which
> we didn't want.
> 
> But since then we have made COMPAT selectable, and off by default for
> ppc64le, so VDSO32 should really depend on that.
> 
> For most people this makes no difference, none of the defconfigs
> change, it's only if someone is building ppc64le with COMPAT=y, they
> will now also get VDSO32. If they've enabled COMPAT in order to run
> 32-bit binaries they presumably also want the 32-bit VDSO.
> 
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>


Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>

Michael, please note that christophe.leroy@c-s.fr is a deprecated 
address that will one day not work anymore. Please use the new one 
whenever possible.

Christophe


> ---
>   arch/powerpc/platforms/Kconfig.cputype | 7 +++----
>   1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> index 87737ec86d39..a80ad0ef436e 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -490,13 +490,12 @@ endmenu
>   
>   config VDSO32
>   	def_bool y
> -	depends on PPC32 || CPU_BIG_ENDIAN
> +	depends on PPC32 || COMPAT
>   	help
>   	  This symbol controls whether we build the 32-bit VDSO. We obviously
>   	  want to do that if we're building a 32-bit kernel. If we're building
> -	  a 64-bit kernel then we only want a 32-bit VDSO if we're building for
> -	  big endian. That is because the only little endian configuration we
> -	  support is ppc64le which is 64-bit only.
> +	  a 64-bit kernel then we only want a 32-bit VDSO if we're also enabling
> +	  COMPAT.
>   
>   choice
>   	prompt "Endianness selection"
> 

^ permalink raw reply

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
From: Jason Gunthorpe @ 2020-09-08 13:26 UTC (permalink / raw)
  To: Gerald Schaefer
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	John Hubbard, Jeff Dike, linux-um, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, linux-arm, linux-power, LKML,
	Andrew Morton, Linus Torvalds, Mike Rapoport
In-Reply-To: <20200907180058.64880-3-gerald.schaefer@linux.ibm.com>

On Mon, Sep 07, 2020 at 08:00:57PM +0200, Gerald Schaefer wrote:
> From: Alexander Gordeev <agordeev@linux.ibm.com>
> 
> Unlike all other page-table abstractions pXd_addr_end() do not take
> into account a particular table entry in which context the functions
> are called. On architectures with dynamic page-tables folding that
> might lead to lack of necessary information that is difficult to
> obtain other than from the table entry itself. That already led to
> a subtle memory corruption issue on s390.
> 
> By letting pXd_addr_end() functions know about the page-table entry
> we allow archs not only make extra checks, but also optimizations.
> 
> As result of this change the pXd_addr_end_folded() functions used
> in gup_fast traversal code become unnecessary and get replaced with
> universal pXd_addr_end() variants.
> 
> The arch-specific updates not only add dereferencing of page-table
> entry pointers, but also small changes to the code flow to make those
> dereferences possible, at least for x86 and powerpc. Also for arm64,
> but in way that should not have any impact.
> 
> So, even though the dereferenced page-table entries are not used on
> archs other than s390, and are optimized out by the compiler, there
> is a small change in kernel size and this is what bloat-o-meter reports:

This looks pretty clean and straightfoward, only
__munlock_pagevec_fill() had any real increased complexity.

Thanks,
Jason

^ permalink raw reply

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
From: Gerald Schaefer @ 2020-09-08 13:38 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Alexander Gordeev, Claudio Imbrenda,
	Will Deacon, linux-arch, linux-s390, Vasily Gorbik,
	Christian Borntraeger, Richard Weinberger, linux-x86,
	Russell King, Jason Gunthorpe, Ingo Molnar, Andrey Ryabinin,
	Jeff Dike, Arnd Bergmann, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power, Mike Rapoport
In-Reply-To: <96b80926-cf5b-1afa-9b7a-949a2188e61f@csgroup.eu>

On Tue, 8 Sep 2020 14:40:10 +0200
Christophe Leroy <christophe.leroy@csgroup.eu> wrote:

> 
> 
> Le 08/09/2020 à 14:09, Christian Borntraeger a écrit :
> > 
> > 
> > On 08.09.20 07:06, Christophe Leroy wrote:
> >>
> >>
> >> Le 07/09/2020 à 20:00, Gerald Schaefer a écrit :
> >>> From: Alexander Gordeev <agordeev@linux.ibm.com>
> >>>
> >>> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> >>> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> >>> dynamic page table folding.
> >>>
> >>> The question "What would it require for the generic code to work for s390"
> >>> has already been discussed here
> >>> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> >>> and ended with a promising approach here
> >>> https://lkml.kernel.org/r/20190419153307.4f2911b5@mschwideX1
> >>> which in the end unfortunately didn't quite work completely.
> >>>
> >>> We tried to mimic static level folding by changing pgd_offset to always
> >>> calculate top level page table offset, and do nothing in folded pXd_offset.
> >>> What has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end do
> >>> not reflect this dynamic behaviour, and still act like static 5-level
> >>> page tables.
> >>>
> >>
> >> [...]
> >>
> >>>
> >>> Fix this by introducing new pXd_addr_end_folded helpers, which take an
> >>> additional pXd entry value parameter, that can be used on s390
> >>> to determine the correct page table level and return corresponding
> >>> end / boundary. With that, the pointer iteration will always
> >>> happen in gup_pgd_range for s390. No change for other architectures
> >>> introduced.
> >>
> >> Not sure pXd_addr_end_folded() is the best understandable name, allthough I don't have any alternative suggestion at the moment.
> >> Maybe could be something like pXd_addr_end_fixup() as it will disappear in the next patch, or pXd_addr_end_gup() ?
> >>
> >> Also, if it happens to be acceptable to get patch 2 in stable, I think you should switch patch 1 and patch 2 to avoid the step through pXd_addr_end_folded()
> > 
> > given that this fixes a data corruption issue, wouldnt it be the best to go forward
> > with this patch ASAP and then handle the other patches on top with all the time that
> > we need?
> 
> I have no strong opinion on this, but I feel rather tricky to have to 
> change generic part of GUP to use a new fonction then revert that change 
> in the following patch, just because you want the first patch in stable 
> and not the second one.
> 
> Regardless, I was wondering, why do we need a reference to the pXd at 
> all when calling pXd_addr_end() ?
> 
> Couldn't S390 retrieve the pXd by using the pXd_offset() dance with the 
> passed addr ?

Apart from performance impact when re-doing that what has already been
done by the caller, I think we would also break the READ_ONCE semantics.
After all, the pXd_offset() would also require some pXd pointer input,
which we don't have. So we would need to start over again from mm->pgd.

Also, it seems to be more in line with other primitives that take
a pXd value or pointer.

^ permalink raw reply

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
From: Alexander Gordeev @ 2020-09-08 14:15 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Arnd Bergmann, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Andrey Ryabinin, Gerald Schaefer, Jeff Dike,
	Vasily Gorbik, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power, Mike Rapoport
In-Reply-To: <5d4f5546-afd0-0b8f-664d-700ae346b9ec@csgroup.eu>

On Tue, Sep 08, 2020 at 10:16:49AM +0200, Christophe Leroy wrote:
> >Yes, and also two more sources :/
> >	arch/powerpc/mm/kasan/8xx.c
> >	arch/powerpc/mm/kasan/kasan_init_32.c
> >
> >But these two are not quite obvious wrt pgd_addr_end() used
> >while traversing pmds. Could you please clarify a bit?
> >
> >
> >diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
> >index 2784224..89c5053 100644
> >--- a/arch/powerpc/mm/kasan/8xx.c
> >+++ b/arch/powerpc/mm/kasan/8xx.c
> >@@ -15,8 +15,8 @@
> >  	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
> >  		pte_basic_t *new;
> >-		k_next = pgd_addr_end(k_cur, k_end);
> >-		k_next = pgd_addr_end(k_next, k_end);
> >+		k_next = pmd_addr_end(k_cur, k_end);
> >+		k_next = pmd_addr_end(k_next, k_end);
> 
> No, I don't think so.
> On powerpc32 we have only two levels, so pgd and pmd are more or
> less the same.
> But pmd_addr_end() as defined in include/asm-generic/pgtable-nopmd.h
> is a no-op, so I don't think it will work.
> 
> It is likely that this function should iterate on pgd, then you get
> pmd = pmd_offset(pud_offset(p4d_offset(pgd)));

It looks like the code iterates over single pmd table while using
pgd_addr_end() only to skip all the middle levels and bail out
from the loop.

I would be wary for switching from pmds to pgds, since we are
trying to minimize impact (especially functional) and the
rework does not seem that obvious.

Assuming pmd and pgd are the same would actually such approach
work for now?

diff --git a/arch/powerpc/mm/kasan/8xx.c b/arch/powerpc/mm/kasan/8xx.c
index 2784224..94466cc 100644
--- a/arch/powerpc/mm/kasan/8xx.c
+++ b/arch/powerpc/mm/kasan/8xx.c
@@ -15,8 +15,8 @@
 	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd += 2, block += SZ_8M) {
 		pte_basic_t *new;
 
-		k_next = pgd_addr_end(k_cur, k_end);
-		k_next = pgd_addr_end(k_next, k_end);
+		k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end);
+		k_next = pgd_addr_end(__pgd(pmd_val(*(pmd + 1))), k_next, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c
index fb29404..c0bcd64 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -38,7 +38,7 @@ int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_
 	for (k_cur = k_start; k_cur != k_end; k_cur = k_next, pmd++) {
 		pte_t *new;
 
-		k_next = pgd_addr_end(k_cur, k_end);
+		k_next = pgd_addr_end(__pgd(pmd_val(*pmd)), k_cur, k_end);
 		if ((void *)pmd_page_vaddr(*pmd) != kasan_early_shadow_pte)
 			continue;
 
@@ -196,7 +196,7 @@ void __init kasan_early_init(void)
 	kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL);
 
 	do {
-		next = pgd_addr_end(addr, end);
+		next = pgd_addr_end(__pgd(pmd_val(*pmd)), addr, end);
 		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
 	} while (pmd++, addr = next, addr != end);
 

Alternatively we could pass invalid pgd to keep the code structure
intact, but that of course is less nice.

Thanks!

> Christophe

^ permalink raw reply related

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
From: Alexander Gordeev @ 2020-09-08 14:25 UTC (permalink / raw)
  To: Christophe Leroy, Michael Ellerman
  Cc: Peter Zijlstra, Catalin Marinas, Dave Hansen, linux-mm,
	Paul Mackerras, linux-sparc, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Christian Borntraeger,
	Richard Weinberger, linux-x86, Russell King, Jason Gunthorpe,
	Ingo Molnar, Andrey Ryabinin, Gerald Schaefer, Jeff Dike,
	Arnd Bergmann, John Hubbard, Heiko Carstens, linux-um,
	Borislav Petkov, Andy Lutomirski, Thomas Gleixner, linux-arm,
	Linus Torvalds, LKML, Andrew Morton, linux-power, Mike Rapoport
In-Reply-To: <31dfb3ed-a0cc-3024-d389-ab9bd19e881f@csgroup.eu>

On Tue, Sep 08, 2020 at 07:14:38AM +0200, Christophe Leroy wrote:
[...]
> You forgot arch/powerpc/mm/book3s64/subpage_prot.c it seems.

If this one would be okay?

diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index 60c6ea16..3690d22 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -88,6 +88,7 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
 static void subpage_prot_clear(unsigned long addr, unsigned long len)
 {
 	struct mm_struct *mm = current->mm;
+	pmd_t *pmd = pmd_off(mm, addr);
 	struct subpage_prot_table *spt;
 	u32 **spm, *spp;
 	unsigned long i;
@@ -103,8 +104,8 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len)
 	limit = addr + len;
 	if (limit > spt->maxaddr)
 		limit = spt->maxaddr;
-	for (; addr < limit; addr = next) {
-		next = pmd_addr_end(addr, limit);
+	for (; addr < limit; addr = next, pmd++) {
+		next = pmd_addr_end(*pmd, addr, limit);
 		if (addr < 0x100000000UL) {
 			spm = spt->low_prot;
 		} else {
@@ -191,6 +192,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
 		unsigned long, len, u32 __user *, map)
 {
 	struct mm_struct *mm = current->mm;
+	pmd_t *pmd = pmd_off(mm, addr);
 	struct subpage_prot_table *spt;
 	u32 **spm, *spp;
 	unsigned long i;
@@ -236,8 +238,8 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
 	}
 
 	subpage_mark_vma_nohuge(mm, addr, len);
-	for (limit = addr + len; addr < limit; addr = next) {
-		next = pmd_addr_end(addr, limit);
+	for (limit = addr + len; addr < limit; addr = next, pmd++) {
+		next = pmd_addr_end(*pmd, addr, limit);
 		err = -ENOMEM;
 		if (addr < 0x100000000UL) {
 			spm = spt->low_prot;

Thanks!

> Christophe

^ permalink raw reply related

* Re: [RFC PATCH v2 1/3] mm/gup: fix gup_fast with dynamic page table folding
From: Dave Hansen @ 2020-09-08 14:30 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport
In-Reply-To: <20200907180058.64880-2-gerald.schaefer@linux.ibm.com>

On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> Commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast
> code") introduced a subtle but severe bug on s390 with gup_fast, due to
> dynamic page table folding.

Would it be fair to say that the "fake" page table entries s390
allocates on the stack are what's causing the trouble here?  That might
be a nice thing to open up with here.  "Dynamic page table folding"
really means nothing to me.

> @@ -2521,7 +2521,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  	do {
>  		pmd_t pmd = READ_ONCE(*pmdp);
>  
> -		next = pmd_addr_end(addr, end);
> +		next = pmd_addr_end_folded(pmd, addr, end);
>  		if (!pmd_present(pmd))
>  			return 0;

It looks like you fix this up later, but this would be a problem if left
this way.  There's no documentation for whether I use
pmd_addr_end_folded() or pmd_addr_end() when writing a page table walker.


^ permalink raw reply

* Re: [RFC PATCH v2 2/3] mm: make pXd_addr_end() functions page-table entry aware
From: Dave Hansen @ 2020-09-08 14:33 UTC (permalink / raw)
  To: Gerald Schaefer, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Vasily Gorbik, Richard Weinberger,
	linux-x86, Russell King, Christian Borntraeger, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, linux-power, LKML, Andrew Morton,
	Linus Torvalds, Mike Rapoport
In-Reply-To: <20200907180058.64880-3-gerald.schaefer@linux.ibm.com>

On 9/7/20 11:00 AM, Gerald Schaefer wrote:
> x86:
> add/remove: 0/0 grow/shrink: 2/0 up/down: 10/0 (10)
> Function                                     old     new   delta
> vmemmap_populate                             587     592      +5
> munlock_vma_pages_range                      556     561      +5
> Total: Before=15534694, After=15534704, chg +0.00%
...
>  arch/x86/mm/init_64.c                    | 15 ++++-----
>  arch/x86/mm/kasan_init_64.c              | 16 +++++-----

I didn't do a super thorough review on this, but it generally looks OK
and the benefits of sharing more code between arches certainly outweigh
a few bytes of binary growth.  For the x86 bits at least, feel free to
add my ack.

^ permalink raw reply

* Re: [PATCH v4 00/13] mm/debug_vm_pgtable fixes
From: Gerald Schaefer @ 2020-09-08 15:39 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-s390@vger.kernel.org, Aneesh Kumar K.V, linux-mm,
	Vineet Gupta, akpm, linux-snps-arc@lists.infradead.org,
	linuxppc-dev, linux-riscv, Gerald Schaefer
In-Reply-To: <20200904180115.07ee5f00@thinkpad>

On Fri, 4 Sep 2020 18:01:15 +0200
Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote:

[...]
> 
> BTW2, a quick test with this change (so far) made the issues on s390
> go away:
> 
> @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void)
>         spin_unlock(ptl);
> 
>  #ifndef CONFIG_PPC_BOOK3S_64
> -       hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot);
> +       hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, prot);
>  #endif
> 
>         spin_lock(&mm->page_table_lock);
> 
> That would more match the "pte_t pointer" usage for hugetlb code,
> i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned,
> but I think the root cause is the pte_t pointer.
> 
> Not entirely sure though if that would really be the correct fix.
> I somehow lost whatever little track I had about what these tests
> really want to check, and if that would still be valid with that
> change.

Uh oh, wasn't aware that this (or some predecessor) already went
upstream, and broke our debug kernel today.

I found out now what goes (horribly) wrong on s390, see below for
more details. In short, using hugetlb primitives with ptep pointers
that do _not_ point to a pmd or pud entry will not work on s390.
It also seems to make no sense to verify / test such a thing in general,
as it would also be a severe bug if any kernel code would do that.
After all, with hugepages, there are no pte tables, only pmd etc.
tables.

My change above would fix the issue for s390, but I can still not
completely judge if that would not break other things for your
tests. In general, for normal kernel code, much of what you do would
be very broken, but I guess your tests are doing such "special" things
because they can. E.g. because they operate on some "sandbox" mm
and page tables, and you also do not need properly populated page
tables for some exit / free cleanup, you just throw them away
explicitly with pXd_free at the end. So it might just be "the right
thing" to pass a casted pmd pointer to hugetlb_advanced_tests(),
to simulate and test (proper) usage of the hugetlb primitives.
I also see no other way to make this work for s390, than using a
proper pmd/pud pointer. If not possible, please add us to the
#ifndef.

So, for all those interested, here is what goes wrong on s390.
huge_ptep_get_and_clear() uses the "idte" instruction for the
clearing (and TLB invalidation) part. That instruction expects
a "region or segment table" origin, which is a pmd/pud/p4d/pgd,
but not a pte table. Even worse, when we calculate the table
origin from the given ptep (which *should* not point to a pte),
due to different table sizes for pte / pXd tables, we end up
at some place before the given pte table.

The "idte" instruction also gets the virtual address, and does
corresponding index addition to the given table origin. Depending
on the pmd_index we now end up either within the pte table again,
in which case we see a panic because idte complains about seeing
a pte value. If we are unlucky, then we end up outside the pte
table, and depending on the content of that memory location, idte
might succeed, effectively corrupting that memory.

That explains why we only see the panic sometimes, depending on
random vaddr, other symptoms other times, and probably completely
silent memory corruption for the rest...

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox