LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH] powerpc64/idle: Fix SP offsets when saving GPRs
From: Nicholas Piggin @ 2021-01-30 13:44 UTC (permalink / raw)
  To: Christopher M. Riedl, linuxppc-dev, Michael Ellerman
In-Reply-To: <87o8h6d5jg.fsf@mpe.ellerman.id.au>

Excerpts from Michael Ellerman's message of January 30, 2021 9:32 pm:
> "Christopher M. Riedl" <cmr@codefail.de> writes:
>> The idle entry/exit code saves/restores GPRs in the stack "red zone"
>> (Protected Zone according to PowerPC64 ELF ABI v2). However, the offset
>> used for the first GPR is incorrect and overwrites the back chain - the
>> Protected Zone actually starts below the current SP. In practice this is
>> probably not an issue, but it's still incorrect so fix it.
> 
> Nice catch.
> 
> Corrupting the back chain means you can't backtrace from there, which
> could be confusing for debugging one day.

Yeah, we seem to have got away without noticing because the CPU will 
wake up and return out of here before it tries to unwind the stack,
but if you tried to walk it by hand if the CPU got stuck in idle or 
something, then we'd get confused.

> It does make me wonder why we don't just create a stack frame and use
> the normal macros? It would use a bit more stack space, but we shouldn't
> be short of stack space when going idle.
> 
> Nick, was there a particular reason for using the red zone?

I don't recall a particular reason, I think a normal stack frame is 
probably a good idea.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING
From: Nicholas Piggin @ 2021-01-30 13:54 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Christophe Leroy, Michael Ellerman, Zorro Lang
  Cc: Jens Axboe, linuxppc-dev
In-Reply-To: <87r1m2d5z4.fsf@mpe.ellerman.id.au>

Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm:
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>> +Aneesh
>>
>> Le 29/01/2021 à 07:52, Zorro Lang a écrit :
> ..
>>> [   96.200296] ------------[ cut here ]------------
>>> [   96.200304] Bug: Read fault blocked by KUAP!
>>> [   96.200309] WARNING: CPU: 3 PID: 1876 at arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310
>>
>>> [   96.200734] NIP [c000000000849424] fault_in_pages_readable+0x104/0x350
>>> [   96.200741] LR [c00000000084952c] fault_in_pages_readable+0x20c/0x350
>>> [   96.200747] --- interrupt: 300
>>
>>
>> Problem happens in a section where userspace access is supposed to be granted, so the patch you 
>> proposed is definitely not the right fix.
>>
>> c000000000849408:	2c 01 00 4c 	isync
>> c00000000084940c:	a6 03 3d 7d 	mtspr   29,r9  <== granting userspace access permission
>> c000000000849410:	2c 01 00 4c 	isync
>> c000000000849414:	00 00 36 e9 	ld      r9,0(r22)
>> c000000000849418:	20 00 29 81 	lwz     r9,32(r9)
>> c00000000084941c:	00 02 29 71 	andi.   r9,r9,512
>> c000000000849420:	78 d3 5e 7f 	mr      r30,r26
>> ==> c000000000849424:	00 00 bf 8b 	lbz     r29,0(r31)  <== accessing userspace
>> c000000000849428:	10 00 82 41 	beq     c000000000849438 <fault_in_pages_readable+0x118>
>> c00000000084942c:	2c 01 00 4c 	isync
>> c000000000849430:	a6 03 bd 7e 	mtspr   29,r21  <== clearing userspace access permission
>> c000000000849434:	2c 01 00 4c 	isync
>>
>> My first guess is that the problem is linked to the following function, see the comment
>>
>> /*
>>   * For kernel thread that doesn't have thread.regs return
>>   * default AMR/IAMR values.
>>   */
>> static inline u64 current_thread_amr(void)
>> {
>> 	if (current->thread.regs)
>> 		return current->thread.regs->amr;
>> 	return AMR_KUAP_BLOCKED;
>> }
>>
>> Above function was introduced by commit 48a8ab4eeb82 ("powerpc/book3s64/pkeys: Don't update SPRN_AMR 
>> when in kernel mode")
> 
> Yeah that's a bit of a curly one.
> 
> At some point io_uring did kthread_use_mm(), which is supposed to mean
> the kthread can operate on behalf of the original process that submitted
> the IO.
> 
> But because KUAP is implemented using memory protection keys, it depends
> on the value of the AMR register, which is not part of the mm, it's in
> thread.regs->amr.
> 
> And what's worse by the time we're in kthread_use_mm() we no longer have
> access to the thread.regs->amr of the original process that submitted
> the IO.
> 
> We also can't simply move the AMR into the mm, precisely because it's
> per thread, not per mm.
> 
> So TBH I don't know how we're going to fix this.
> 
> I guess we could return AMR=unblocked for kernel threads, but that's
> arguably a bug because it allows a process to circumvent memory keys by
> asking the kernel to do the access.

We shouldn't need to inherit AMR should we? We only need it to be locked 
for kernel threads until it's explicitly unlocked -- nothing mm specific 
there. I think current_thread_amr could return 0 for kernel threads? Or
I would even avoid using that function for allow_user_access and open
code the kthread case and remove it from current_thread_amr().

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v2] tpm: ibmvtpm: fix error return code in tpm_ibmvtpm_probe()
From: Jarkko Sakkinen @ 2021-01-30 21:14 UTC (permalink / raw)
  To: Stefan Berger, Stefan Berger
  Cc: Wang Hai, linux-kernel, Hulk Robot, paulus, linux-integrity,
	linuxppc-dev
In-Reply-To: <262c3095-d4c0-f124-c8bc-e99685cbe572@linux.ibm.com>

On Fri, 2021-01-29 at 13:57 -0500, Stefan Berger wrote:
> On 1/29/21 12:35 PM, Jarkko Sakkinen wrote:
> > On Mon, Jan 25, 2021 at 08:47:53PM -0500, Stefan Berger wrote:
> > > From: Stefan Berger <stefanb@linux.ibm.com>
> > > 
> > > Return error code -ETIMEDOUT rather than '0' when waiting for the
> > > rtce_buf to be set has timed out.
> > > 
> > > Fixes: d8d74ea3c002 ("tpm: ibmvtpm: Wait for buffer to be set before proceeding")
> > > Reported-by: Hulk Robot <hulkci@huawei.com>
> > > Signed-off-by: Wang Hai <wanghai38@huawei.com>
> > > Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
> > > ---
> > Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
> > 
> > Thanks! Should I add
> > 
> > Cc: stable@vger.kernel.org to this?
> 
> Yes, that would be good! Thank you!


OK, it's applied.


>     Stefan

/Jarkko

^ permalink raw reply

* [PATCH 1/2] powerpc/Kconfig: Fix unmet direct dependency on NET
From: Florian Fainelli @ 2021-01-30 22:55 UTC (permalink / raw)
  To: linux-kernel, arndb
  Cc: Christophe Leroy, Florian Fainelli, kbuild-all, kernel test robot,
	Arnd Bergmann, Michal Simek, Paul Mackerras,
	open list:LINUX FOR POWERPC EMBEDDED PPC4XX

The kbuild test robot was able to generate a configuration where
ETHERNET and NETDEVICES was selected by the Akenobo platform but not
NET, which resulted in various build failures and these Kconfig
warnings:

WARNING: unmet direct dependencies detected for NETDEVICES
  Depends on [n]: NET [=n]
  Selected by [y]:
  - AKEBONO [=y] && PPC_47x [=y]

WARNING: unmet direct dependencies detected for ETHERNET
  Depends on [n]: NETDEVICES [=y] && NET [=n]
  Selected by [y]:
  - AKEBONO [=y] && PPC_47x [=y]

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
 arch/powerpc/platforms/44x/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/44x/Kconfig b/arch/powerpc/platforms/44x/Kconfig
index 78ac6d67a935..68bd647c878f 100644
--- a/arch/powerpc/platforms/44x/Kconfig
+++ b/arch/powerpc/platforms/44x/Kconfig
@@ -206,6 +206,7 @@ config AKEBONO
 	select PPC4xx_HSTA_MSI
 	select I2C
 	select I2C_IBM_IIC
+	select NET
 	select NETDEVICES
 	select ETHERNET
 	select NET_VENDOR_IBM
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH] powerpc: fix AKEBONO build failures
From: Randy Dunlap @ 2021-01-30 23:21 UTC (permalink / raw)
  To: Michael Ellerman, Yury Norov, linuxppc-dev,
	Linux Kernel Mailing List
  Cc: Paul Mackerras
In-Reply-To: <875z3prcwg.fsf@mpe.ellerman.id.au>

On 1/21/21 5:14 PM, Michael Ellerman wrote:
> Randy Dunlap <rdunlap@infradead.org> writes:
>> On 1/20/21 1:29 PM, Yury Norov wrote:
>>> Hi all,
>>>
>>> I found the power pc build broken on today's
>>> linux-next (647060f3b592).
>>
>> Darn, I was building linux-5.11-rc4.
>>
>> I'll try linux-next after I send this.
>>
>> ---
>> From: Randy Dunlap <rdunlap@infradead.org>
>>
>> Fulfill AKEBONO Kconfig requirements.
>>
>> Fixes these Kconfig warnings (and more) and fixes the subsequent
>> build errors:
>>
>> WARNING: unmet direct dependencies detected for NETDEVICES
>>    Depends on [n]: NET [=n]
>>    Selected by [y]:
>>    - AKEBONO [=y] && PPC_47x [=y]
>>
>> WARNING: unmet direct dependencies detected for MMC_SDHCI
>>    Depends on [n]: MMC [=n] && HAS_DMA [=y]
>>    Selected by [y]:
>>    - AKEBONO [=y] && PPC_47x [=y]
>>
>> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Cc: Paul Mackerras <paulus@samba.org>
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: Yury Norov <yury.norov@gmail.com>
>> ---
>>   arch/powerpc/platforms/44x/Kconfig |    2 ++
>>   1 file changed, 2 insertions(+)
>>
>> --- lnx-511-rc4.orig/arch/powerpc/platforms/44x/Kconfig
>> +++ lnx-511-rc4/arch/powerpc/platforms/44x/Kconfig
>> @@ -206,6 +206,7 @@ config AKEBONO
>>   	select PPC4xx_HSTA_MSI
>>   	select I2C
>>   	select I2C_IBM_IIC
>> +	select NET
>>   	select NETDEVICES
>>   	select ETHERNET
>>   	select NET_VENDOR_IBM
> 
> I think the problem here is too much use of select, for things that
> should instead be in the defconfig.
> 
> The patch below results in the same result for make
> 44x/akebono_defconfig. Does it fix the original issue?

Hi Michael,
Sorry for the delay.

Changing the akebono_defconfig doesn't cause the missing symbols
to be set -- the defconfig is not being used here.

I guess that if you have users who set CONFIG_AKEBONO and expect
it to build cleanly, you will need something like my patch or the
patch that Florian just posted.

Changing the akebono_defconfig also would not help 'make randconfig'
builds to build cleanly if they had happened to enable AKEBONO.


> We don't need to add ETHERNET or NET_VENDOR_IBM to the defconfig because
> they're both default y.
> 
> cheers


^ permalink raw reply

* [PATCH] cxl: Simplify bool conversion
From: Yang Li @ 2021-01-29  8:25 UTC (permalink / raw)
  To: fbarrat; +Cc: ajd, arnd, gregkh, linux-kernel, Yang Li, linuxppc-dev

Fix the following coccicheck warning:
./drivers/misc/cxl/sysfs.c:181:48-53: WARNING: conversion to bool not
needed here

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
---
 drivers/misc/cxl/sysfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/cxl/sysfs.c b/drivers/misc/cxl/sysfs.c
index d97a243..c173a5e 100644
--- a/drivers/misc/cxl/sysfs.c
+++ b/drivers/misc/cxl/sysfs.c
@@ -178,7 +178,7 @@ static ssize_t perst_reloads_same_image_store(struct device *device,
 	if ((rc != 1) || !(val == 1 || val == 0))
 		return -EINVAL;
 
-	adapter->perst_same_image = (val == 1 ? true : false);
+	adapter->perst_same_image = (val == 1);
 	return count;
 }
 
-- 
1.8.3.1


^ permalink raw reply related

* [RFC 00/20] TLB batching consolidation and enhancements
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, linux-s390, x86, Yu Zhao, Will Deacon,
	Peter Zijlstra, Mel Gorman, Dave Hansen, linux-csky, Nadav Amit,
	Nick Piggin, Andy Lutomirski, Andrew Morton, linuxppc-dev,
	Thomas Gleixner

From: Nadav Amit <namit@vmware.com>

There are currently (at least?) 5 different TLB batching schemes in the
kernel:

1. Using mmu_gather (e.g., zap_page_range()).

2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
   ongoing deferred TLB flush and flushing the entire range eventually
   (e.g., change_protection_range()).

3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).

4. Batching per-table flushes (move_ptes()).

5. By setting a flag on that a deferred TLB flush operation takes place,
   flushing when (try_to_unmap_one() on x86).

It seems that (1)-(4) can be consolidated. In addition, it seems that
(5) is racy. It also seems there can be many redundant TLB flushes, and
potentially TLB-shootdown storms, for instance during batched
reclamation (using try_to_unmap_one()) if at the same time mmu_gather
defers TLB flushes.

More aggressive TLB batching may be possible, but this patch-set does
not add such batching. The proposed changes would enable such batching
in a later time.

Admittedly, I do not understand how things are not broken today, which
frightens me to make further batching before getting things in order.
For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
for each page-table (but not in greater granularity). Can't
ClearPageDirty() be called before the flush, causing writes after
ClearPageDirty() and before the flush to be lost?

This patch-set therefore performs the following changes:

1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
   instead of {inc|dec}_tlb_flush_pending().

2. Avoid TLB flushes if PTE permission is not demoted.

3. Cleans up mmu_gather to be less arch-dependant.

4. Uses mm's generations to track in finer granularity, either per-VMA
   or per page-table, whether a pending mmu_gather operation is
   outstanding. This should allow to avoid some TLB flushes when KSM or
   memory reclamation takes place while another operation such as
   munmap() or mprotect() is running.

5. Changes try_to_unmap_one() flushing scheme, as the current seems
   broken to track in a bitmap which CPUs have outstanding TLB flushes
   instead of having a flag.

Further optimizations are possible, such as changing move_ptes() to use
mmu_gather.

The patches were very very lightly tested. I am looking forward for your
feedback regarding the overall approaches, and whether to split them
into multiple patch-sets.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-csky@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: x86@kernel.org
Cc: Yu Zhao <yuzhao@google.com>

Nadav Amit (20):
  mm/tlb: fix fullmm semantics
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush on permission promotion
  mm/mapping_dirty_helpers: use mmu_gather
  mm/tlb: move BATCHED_UNMAP_TLB_FLUSH to tlb.h
  fs/task_mmu: use mmu_gather interface of clear-soft-dirty
  mm: move x86 tlb_gen to generic code
  mm: store completed TLB generation
  mm: create pte/pmd_tlb_flush_pending()
  mm: add pte_to_page()
  mm/tlb: remove arch-specific tlb_start/end_vma()
  mm/tlb: save the VMA that is flushed during tlb_start_vma()
  mm/tlb: introduce tlb_start_ptes() and tlb_end_ptes()
  mm: move inc/dec_tlb_flush_pending() to mmu_gather.c
  mm: detect deferred TLB flushes in vma granularity
  mm/tlb: per-page table generation tracking
  mm/tlb: updated completed deferred TLB flush conditionally
  mm: make mm_cpumask() volatile
  lib/cpumask: introduce cpumask_atomic_or()
  mm/rmap: avoid potential races

 arch/arm/include/asm/bitops.h         |   4 +-
 arch/arm/include/asm/pgtable.h        |   4 +-
 arch/arm64/include/asm/pgtable.h      |   4 +-
 arch/csky/Kconfig                     |   1 +
 arch/csky/include/asm/tlb.h           |  12 --
 arch/powerpc/Kconfig                  |   1 +
 arch/powerpc/include/asm/tlb.h        |   2 -
 arch/s390/Kconfig                     |   1 +
 arch/s390/include/asm/tlb.h           |   3 -
 arch/sparc/Kconfig                    |   1 +
 arch/sparc/include/asm/pgtable_64.h   |   9 +-
 arch/sparc/include/asm/tlb_64.h       |   2 -
 arch/sparc/mm/init_64.c               |   2 +-
 arch/x86/Kconfig                      |   3 +
 arch/x86/hyperv/mmu.c                 |   2 +-
 arch/x86/include/asm/mmu.h            |  10 -
 arch/x86/include/asm/mmu_context.h    |   1 -
 arch/x86/include/asm/paravirt_types.h |   2 +-
 arch/x86/include/asm/pgtable.h        |  24 +--
 arch/x86/include/asm/tlb.h            |  21 +-
 arch/x86/include/asm/tlbbatch.h       |  15 --
 arch/x86/include/asm/tlbflush.h       |  61 ++++--
 arch/x86/mm/tlb.c                     |  52 +++--
 arch/x86/xen/mmu_pv.c                 |   2 +-
 drivers/firmware/efi/efi.c            |   1 +
 fs/proc/task_mmu.c                    |  29 ++-
 include/asm-generic/bitops/find.h     |   8 +-
 include/asm-generic/tlb.h             | 291 +++++++++++++++++++++-----
 include/linux/bitmap.h                |  21 +-
 include/linux/cpumask.h               |  40 ++--
 include/linux/huge_mm.h               |   3 +-
 include/linux/mm.h                    |  29 ++-
 include/linux/mm_types.h              | 166 ++++++++++-----
 include/linux/mm_types_task.h         |  13 --
 include/linux/pgtable.h               |   2 +-
 include/linux/smp.h                   |   6 +-
 init/Kconfig                          |  21 ++
 kernel/fork.c                         |   2 +
 kernel/smp.c                          |   8 +-
 lib/bitmap.c                          |  33 ++-
 lib/cpumask.c                         |   8 +-
 lib/find_bit.c                        |  10 +-
 mm/huge_memory.c                      |   6 +-
 mm/init-mm.c                          |   1 +
 mm/internal.h                         |  16 --
 mm/ksm.c                              |   2 +-
 mm/madvise.c                          |   6 +-
 mm/mapping_dirty_helpers.c            |  52 +++--
 mm/memory.c                           |   2 +
 mm/mmap.c                             |   1 +
 mm/mmu_gather.c                       |  59 +++++-
 mm/mprotect.c                         |  55 ++---
 mm/mremap.c                           |   2 +-
 mm/pgtable-generic.c                  |   2 +-
 mm/rmap.c                             |  42 ++--
 mm/vmscan.c                           |   1 +
 56 files changed, 803 insertions(+), 374 deletions(-)
 delete mode 100644 arch/x86/include/asm/tlbbatch.h

-- 
2.25.1

^ permalink raw reply

* [RFC 11/20] mm/tlb: remove arch-specific tlb_start/end_vma()
From: Nadav Amit @ 2021-01-31  0:11 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrea Arcangeli, linux-s390, x86, Yu Zhao, Peter Zijlstra,
	linuxppc-dev, Dave Hansen, Nick Piggin, Nadav Amit, linux-csky,
	Andy Lutomirski, Andrew Morton, Will Deacon, Thomas Gleixner
In-Reply-To: <20210131001132.3368247-1-namit@vmware.com>

From: Nadav Amit <namit@vmware.com>

Architecture-specific tlb_start_vma() and tlb_end_vma() seem
unnecessary. They are currently used for:

1. Avoid per-VMA TLB flushes. This can be determined by introducing
   a new config option.

2. Avoid saving information on the vma that is being flushed. Saving
   this information, even for architectures that do not need it, is
   cheap and we will need it for per-VMA deferred TLB flushing.

3. Avoid calling flush_cache_range().

Remove the architecture specific tlb_start_vma() and tlb_end_vma() in
the following manner, corresponding to the previous requirements:

1. Introduce a new config option -
   ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING - to allow architectures to
   define whether they want aggressive TLB flush batching (instead of
   flushing mappings of each VMA separately).

2. Save information on the vma regardless of architecture. Saving this
   information should have negligible overhead, and they will be
   needed for fine granularity TLB flushes.

3. flush_cache_range() is anyhow not defined for the architectures that
   implement tlb_start/end_vma().

No functional change intended.

Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: linux-csky@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: x86@kernel.org
---
 arch/csky/Kconfig               |  1 +
 arch/csky/include/asm/tlb.h     | 12 ------------
 arch/powerpc/Kconfig            |  1 +
 arch/powerpc/include/asm/tlb.h  |  2 --
 arch/s390/Kconfig               |  1 +
 arch/s390/include/asm/tlb.h     |  3 ---
 arch/sparc/Kconfig              |  1 +
 arch/sparc/include/asm/tlb_64.h |  2 --
 arch/x86/Kconfig                |  1 +
 arch/x86/include/asm/tlb.h      |  3 ---
 include/asm-generic/tlb.h       | 15 +++++----------
 init/Kconfig                    |  8 ++++++++
 12 files changed, 18 insertions(+), 32 deletions(-)

diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 89dd2fcf38fa..924ff5721240 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -8,6 +8,7 @@ config CSKY
 	select ARCH_HAS_SYNC_DMA_FOR_DEVICE
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_QUEUED_RWLOCKS if NR_CPUS>2
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	select ARCH_WANT_FRAME_POINTERS if !CPU_CK610
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select COMMON_CLK
diff --git a/arch/csky/include/asm/tlb.h b/arch/csky/include/asm/tlb.h
index fdff9b8d70c8..8130a5f09a6b 100644
--- a/arch/csky/include/asm/tlb.h
+++ b/arch/csky/include/asm/tlb.h
@@ -6,18 +6,6 @@
 
 #include <asm/cacheflush.h>
 
-#define tlb_start_vma(tlb, vma) \
-	do { \
-		if (!(tlb)->fullmm) \
-			flush_cache_range(vma, (vma)->vm_start, (vma)->vm_end); \
-	}  while (0)
-
-#define tlb_end_vma(tlb, vma) \
-	do { \
-		if (!(tlb)->fullmm) \
-			flush_tlb_range(vma, (vma)->vm_start, (vma)->vm_end); \
-	}  while (0)
-
 #define tlb_flush(tlb) flush_tlb_mm((tlb)->mm)
 
 #include <asm-generic/tlb.h>
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 107bb4319e0e..d9761b6f192a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -151,6 +151,7 @@ config PPC
 	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select ARCH_WANT_LD_ORPHAN_WARN
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 160422a439aa..880b7daf904e 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -19,8 +19,6 @@
 
 #include <linux/pagemap.h>
 
-#define tlb_start_vma(tlb, vma)	do { } while (0)
-#define tlb_end_vma(tlb, vma)	do { } while (0)
 #define __tlb_remove_tlb_entry	__tlb_remove_tlb_entry
 
 #define tlb_flush tlb_flush
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index c72874f09741..5b3dc5ca9873 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -113,6 +113,7 @@ config S390
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	select ARCH_WANT_DEFAULT_BPF_JIT
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select BUILDTIME_TABLE_SORT
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index 954fa8ca6cbd..03f31d59f97c 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -27,9 +27,6 @@ static inline void tlb_flush(struct mmu_gather *tlb);
 static inline bool __tlb_remove_page_size(struct mmu_gather *tlb,
 					  struct page *page, int page_size);
 
-#define tlb_start_vma(tlb, vma)			do { } while (0)
-#define tlb_end_vma(tlb, vma)			do { } while (0)
-
 #define tlb_flush tlb_flush
 #define pte_free_tlb pte_free_tlb
 #define pmd_free_tlb pmd_free_tlb
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index c9c34dc52b7d..fb46e1b6f177 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -51,6 +51,7 @@ config SPARC
 	select NEED_DMA_MAP_STATE
 	select NEED_SG_DMA_LENGTH
 	select SET_FS
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 
 config SPARC32
 	def_bool !64BIT
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index 779a5a0f0608..3037187482db 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -22,8 +22,6 @@ void smp_flush_tlb_mm(struct mm_struct *mm);
 void __flush_tlb_pending(unsigned long, unsigned long, unsigned long *);
 void flush_tlb_pending(void);
 
-#define tlb_start_vma(tlb, vma) do { } while (0)
-#define tlb_end_vma(tlb, vma)	do { } while (0)
 #define tlb_flush(tlb)	flush_tlb_pending()
 
 /*
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6bd4d626a6b3..d56b0f5cb00c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -101,6 +101,7 @@ config X86
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_USE_SYM_ANNOTATIONS
+	select ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
 	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 	select ARCH_WANT_DEFAULT_BPF_JIT	if X86_64
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 1bfe979bb9bc..580636cdc257 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -2,9 +2,6 @@
 #ifndef _ASM_X86_TLB_H
 #define _ASM_X86_TLB_H
 
-#define tlb_start_vma(tlb, vma) do { } while (0)
-#define tlb_end_vma(tlb, vma) do { } while (0)
-
 #define tlb_flush tlb_flush
 static inline void tlb_flush(struct mmu_gather *tlb);
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 427bfcc6cdec..b97136b7010b 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -334,8 +334,8 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 
 #ifdef CONFIG_MMU_GATHER_NO_RANGE
 
-#if defined(tlb_flush) || defined(tlb_start_vma) || defined(tlb_end_vma)
-#error MMU_GATHER_NO_RANGE relies on default tlb_flush(), tlb_start_vma() and tlb_end_vma()
+#if defined(tlb_flush)
+#error MMU_GATHER_NO_RANGE relies on default tlb_flush()
 #endif
 
 /*
@@ -362,10 +362,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
 
 #ifndef tlb_flush
 
-#if defined(tlb_start_vma) || defined(tlb_end_vma)
-#error Default tlb_flush() relies on default tlb_start_vma() and tlb_end_vma()
-#endif
-
 /*
  * When an architecture does not provide its own tlb_flush() implementation
  * but does have a reasonably efficient flush_vma_range() implementation
@@ -486,7 +482,6 @@ static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
  * case where we're doing a full MM flush.  When we're doing a munmap,
  * the vmas are adjusted to only cover the region to be torn down.
  */
-#ifndef tlb_start_vma
 static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
 	if (tlb->fullmm)
@@ -495,14 +490,15 @@ static inline void tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *
 	tlb_update_vma_flags(tlb, vma);
 	flush_cache_range(vma, vma->vm_start, vma->vm_end);
 }
-#endif
 
-#ifndef tlb_end_vma
 static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
 {
 	if (tlb->fullmm)
 		return;
 
+	if (IS_ENABLED(CONFIG_ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING))
+		return;
+
 	/*
 	 * Do a TLB flush and reset the range at VMA boundaries; this avoids
 	 * the ranges growing with the unused space between consecutive VMAs,
@@ -511,7 +507,6 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
 	 */
 	tlb_flush_mmu_tlbonly(tlb);
 }
-#endif
 
 #ifdef CONFIG_ARCH_HAS_TLB_GENERATIONS
 
diff --git a/init/Kconfig b/init/Kconfig
index 3d11a0f7c8cc..14a599a48738 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -849,6 +849,14 @@ config ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 config ARCH_HAS_TLB_GENERATIONS
 	bool
 
+#
+# For architectures that prefer to batch TLB flushes aggressively, i.e.,
+# not to flush after changing or removing each VMA. The architecture must
+# provide its own tlb_flush() function.
+config ARCH_WANT_AGGRESSIVE_TLB_FLUSH_BATCHING
+	bool
+	depends on !CONFIG_MMU_GATHER_NO_GATHER
+
 config CC_HAS_INT128
 	def_bool !$(cc-option,$(m64-flag) -D__SIZEOF_INT128__=0) && 64BIT
 
-- 
2.25.1


^ permalink raw reply related

* Re: [RFC 00/20] TLB batching consolidation and enhancements
From: Andy Lutomirski @ 2021-01-31  0:39 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrea Arcangeli, linux-s390, X86 ML, Yu Zhao, Will Deacon,
	Peter Zijlstra, Mel Gorman, Dave Hansen, LKML, linux-csky,
	Linux-MM, Nadav Amit, Nick Piggin, Andy Lutomirski, Andrew Morton,
	linuxppc-dev, Thomas Gleixner
In-Reply-To: <20210131001132.3368247-1-namit@vmware.com>

On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> From: Nadav Amit <namit@vmware.com>
>
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
>
> 1. Using mmu_gather (e.g., zap_page_range()).
>
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>    ongoing deferred TLB flush and flushing the entire range eventually
>    (e.g., change_protection_range()).
>
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>
> 4. Batching per-table flushes (move_ptes()).
>
> 5. By setting a flag on that a deferred TLB flush operation takes place,
>    flushing when (try_to_unmap_one() on x86).

Are you referring to the arch_tlbbatch_add_mm/flush mechanism?

^ permalink raw reply

* Re: [RFC 00/20] TLB batching consolidation and enhancements
From: Nadav Amit @ 2021-01-31  1:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrea Arcangeli, linux-s390, X86 ML, Yu Zhao, Will Deacon,
	Peter Zijlstra, Mel Gorman, Dave Hansen, LKML,
	linux-csky@vger.kernel.org, Linux-MM, Nick Piggin, Andrew Morton,
	linuxppc-dev, Thomas Gleixner
In-Reply-To: <CALCETrUUe_iijSoTYMjibqxUveaYrG3sVTWawbi3HWCjx_ySYg@mail.gmail.com>

> On Jan 30, 2021, at 4:39 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit <nadav.amit@gmail.com> wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>> 
>> 1. Using mmu_gather (e.g., zap_page_range()).
>> 
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>   ongoing deferred TLB flush and flushing the entire range eventually
>>   (e.g., change_protection_range()).
>> 
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>> 
>> 4. Batching per-table flushes (move_ptes()).
>> 
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>   flushing when (try_to_unmap_one() on x86).
> 
> Are you referring to the arch_tlbbatch_add_mm/flush mechanism?

Yes.

^ permalink raw reply

* Re: [RFC 00/20] TLB batching consolidation and enhancements
From: Nicholas Piggin @ 2021-01-31  3:30 UTC (permalink / raw)
  To: linux-kernel, linux-mm, Nadav Amit
  Cc: Andrea Arcangeli, linux-s390, x86, Yu Zhao, Will Deacon,
	Peter Zijlstra, Mel Gorman, Dave Hansen, linux-csky, Nadav Amit,
	Andy Lutomirski, Andrew Morton, linuxppc-dev, Thomas Gleixner
In-Reply-To: <20210131001132.3368247-1-namit@vmware.com>

Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
> From: Nadav Amit <namit@vmware.com>
> 
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
> 
> 1. Using mmu_gather (e.g., zap_page_range()).
> 
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>    ongoing deferred TLB flush and flushing the entire range eventually
>    (e.g., change_protection_range()).
> 
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
> 
> 4. Batching per-table flushes (move_ptes()).
> 
> 5. By setting a flag on that a deferred TLB flush operation takes place,
>    flushing when (try_to_unmap_one() on x86).
> 
> It seems that (1)-(4) can be consolidated. In addition, it seems that
> (5) is racy. It also seems there can be many redundant TLB flushes, and
> potentially TLB-shootdown storms, for instance during batched
> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
> defers TLB flushes.
> 
> More aggressive TLB batching may be possible, but this patch-set does
> not add such batching. The proposed changes would enable such batching
> in a later time.
> 
> Admittedly, I do not understand how things are not broken today, which
> frightens me to make further batching before getting things in order.
> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
> for each page-table (but not in greater granularity). Can't
> ClearPageDirty() be called before the flush, causing writes after
> ClearPageDirty() and before the flush to be lost?

Because it's holding the page table lock which stops page_mkclean from 
cleaning the page. Or am I misunderstanding the question?

I'll go through the patches a bit more closely when they all come 
through. Sparc and powerpc of course need the arch lazy mode to get 
per-page/pte information for operations that are not freeing pages, 
which is what mmu gather is designed for.

I wouldn't mind using a similar API so it's less of a black box when 
reading generic code, but it might not quite fit the mmu gather API
exactly (most of these paths don't want a full mmu_gather on stack).

> 
> This patch-set therefore performs the following changes:
> 
> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>    instead of {inc|dec}_tlb_flush_pending().
> 
> 2. Avoid TLB flushes if PTE permission is not demoted.
> 
> 3. Cleans up mmu_gather to be less arch-dependant.
> 
> 4. Uses mm's generations to track in finer granularity, either per-VMA
>    or per page-table, whether a pending mmu_gather operation is
>    outstanding. This should allow to avoid some TLB flushes when KSM or
>    memory reclamation takes place while another operation such as
>    munmap() or mprotect() is running.
> 
> 5. Changes try_to_unmap_one() flushing scheme, as the current seems
>    broken to track in a bitmap which CPUs have outstanding TLB flushes
>    instead of having a flag.

Putting fixes first, and cleanups and independent patches (like #2) next
would help with getting stuff merged and backported.

Thanks,
Nick

^ permalink raw reply

* Re: [RFC 00/20] TLB batching consolidation and enhancements
From: Nadav Amit @ 2021-01-31  7:57 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Andrea Arcangeli, linux-s390, X86 ML, Yu Zhao, Will Deacon,
	Peter Zijlstra, Mel Gorman, Dave Hansen, LKML,
	linux-csky@vger.kernel.org, Linux-MM, Andy Lutomirski,
	Andrew Morton, linuxppc-dev, Thomas Gleixner
In-Reply-To: <1612063149.2awdsvvmhj.astroid@bobo.none>

> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> There are currently (at least?) 5 different TLB batching schemes in the
>> kernel:
>> 
>> 1. Using mmu_gather (e.g., zap_page_range()).
>> 
>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>   ongoing deferred TLB flush and flushing the entire range eventually
>>   (e.g., change_protection_range()).
>> 
>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>> 
>> 4. Batching per-table flushes (move_ptes()).
>> 
>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>   flushing when (try_to_unmap_one() on x86).
>> 
>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>> potentially TLB-shootdown storms, for instance during batched
>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>> defers TLB flushes.
>> 
>> More aggressive TLB batching may be possible, but this patch-set does
>> not add such batching. The proposed changes would enable such batching
>> in a later time.
>> 
>> Admittedly, I do not understand how things are not broken today, which
>> frightens me to make further batching before getting things in order.
>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
>> for each page-table (but not in greater granularity). Can't
>> ClearPageDirty() be called before the flush, causing writes after
>> ClearPageDirty() and before the flush to be lost?
> 
> Because it's holding the page table lock which stops page_mkclean from 
> cleaning the page. Or am I misunderstanding the question?

Thanks. I understood this part. Looking again at the code, I now understand
my confusion: I forgot that the reverse mapping is removed after the PTE is
zapped.

Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(),
by performing set_page_dirty() for the batched pages when needed in
tlb_finish_mmu() [ i.e., by marking for each batched page whether
set_page_dirty() should be issued for that page while collecting them ].

> I'll go through the patches a bit more closely when they all come 
> through. Sparc and powerpc of course need the arch lazy mode to get 
> per-page/pte information for operations that are not freeing pages, 
> which is what mmu gather is designed for.

IIUC you mean any PTE change requires a TLB flush. Even setting up a new PTE
where no previous PTE was set, right?

> I wouldn't mind using a similar API so it's less of a black box when 
> reading generic code, but it might not quite fit the mmu gather API
> exactly (most of these paths don't want a full mmu_gather on stack).

I see your point. It may be possible to create two mmu_gather structs: a
small one that only holds the flush information and another that also holds
the pages. 

>> This patch-set therefore performs the following changes:
>> 
>> 1. Change mprotect, task_mmu and mapping_dirty_helpers to use mmu_gather
>>   instead of {inc|dec}_tlb_flush_pending().
>> 
>> 2. Avoid TLB flushes if PTE permission is not demoted.
>> 
>> 3. Cleans up mmu_gather to be less arch-dependant.
>> 
>> 4. Uses mm's generations to track in finer granularity, either per-VMA
>>   or per page-table, whether a pending mmu_gather operation is
>>   outstanding. This should allow to avoid some TLB flushes when KSM or
>>   memory reclamation takes place while another operation such as
>>   munmap() or mprotect() is running.
>> 
>> 5. Changes try_to_unmap_one() flushing scheme, as the current seems
>>   broken to track in a bitmap which CPUs have outstanding TLB flushes
>>   instead of having a flag.
> 
> Putting fixes first, and cleanups and independent patches (like #2) next
> would help with getting stuff merged and backported.

I tried to do it mostly this way. There are some theoretical races which
I did not manage (or try hard enough) to create, so I did not include in
the “fixes” section. I will restructure the patch-set according to the
feedback.

Thanks,
Nadav

^ permalink raw reply

* Re: [RFC 00/20] TLB batching consolidation and enhancements
From: Nadav Amit @ 2021-01-31  8:14 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrea Arcangeli, linux-s390, X86 ML, Yu Zhao, Will Deacon,
	Peter Zijlstra, Mel Gorman, Dave Hansen, LKML, Nicholas Piggin,
	Linux-MM, linux-csky@vger.kernel.org, Andy Lutomirski,
	Andrew Morton, linuxppc-dev, Thomas Gleixner
In-Reply-To: <A1589669-34AE-4E15-8358-79BAD7C72520@vmware.com>

> On Jan 30, 2021, at 11:57 PM, Nadav Amit <namit@vmware.com> wrote:
> 
>> On Jan 30, 2021, at 7:30 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>> 
>> Excerpts from Nadav Amit's message of January 31, 2021 10:11 am:
>>> From: Nadav Amit <namit@vmware.com>
>>> 
>>> There are currently (at least?) 5 different TLB batching schemes in the
>>> kernel:
>>> 
>>> 1. Using mmu_gather (e.g., zap_page_range()).
>>> 
>>> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>>>  ongoing deferred TLB flush and flushing the entire range eventually
>>>  (e.g., change_protection_range()).
>>> 
>>> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>>> 
>>> 4. Batching per-table flushes (move_ptes()).
>>> 
>>> 5. By setting a flag on that a deferred TLB flush operation takes place,
>>>  flushing when (try_to_unmap_one() on x86).
>>> 
>>> It seems that (1)-(4) can be consolidated. In addition, it seems that
>>> (5) is racy. It also seems there can be many redundant TLB flushes, and
>>> potentially TLB-shootdown storms, for instance during batched
>>> reclamation (using try_to_unmap_one()) if at the same time mmu_gather
>>> defers TLB flushes.
>>> 
>>> More aggressive TLB batching may be possible, but this patch-set does
>>> not add such batching. The proposed changes would enable such batching
>>> in a later time.
>>> 
>>> Admittedly, I do not understand how things are not broken today, which
>>> frightens me to make further batching before getting things in order.
>>> For instance, why is ok for zap_pte_range() to batch dirty-PTE flushes
>>> for each page-table (but not in greater granularity). Can't
>>> ClearPageDirty() be called before the flush, causing writes after
>>> ClearPageDirty() and before the flush to be lost?
>> 
>> Because it's holding the page table lock which stops page_mkclean from 
>> cleaning the page. Or am I misunderstanding the question?
> 
> Thanks. I understood this part. Looking again at the code, I now understand
> my confusion: I forgot that the reverse mapping is removed after the PTE is
> zapped.
> 
> Makes me wonder whether it is ok to defer the TLB flush to tlb_finish_mmu(),
> by performing set_page_dirty() for the batched pages when needed in
> tlb_finish_mmu() [ i.e., by marking for each batched page whether
> set_page_dirty() should be issued for that page while collecting them ].

Correcting myself (I hope): no we cannot do so, since the buffers might be
remove from the page at that point.


^ permalink raw reply

* Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.11-6 tag
From: Linus Torvalds @ 2021-01-31 19:38 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, Linux Kernel Mailing List, Nick Piggin
In-Reply-To: <87lfc9cnuw.fsf@mpe.ellerman.id.au>

On Sun, Jan 31, 2021 at 4:06 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Please pull another powerpc fix for 5.11:

Manual pr-tracker-bot says thanks,

           Linus

^ permalink raw reply

* Re: [PATCH v7 19/42] powerpc/perf: move perf irq/nmi handling details into traps.c
From: Athira Rajeev @ 2021-01-31 12:30 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev
In-Reply-To: <20210130130852.2952424-20-npiggin@gmail.com>

[-- Attachment #1: Type: text/html, Size: 13382 bytes --]

^ permalink raw reply

* Re: [PATCH] powerpc: fix AKEBONO build failures
From: Michael Ellerman @ 2021-02-01  1:24 UTC (permalink / raw)
  To: Randy Dunlap, Yury Norov, linuxppc-dev, Linux Kernel Mailing List
  Cc: Paul Mackerras
In-Reply-To: <4572579a-7208-628d-cbe2-b70a74a84ae7@infradead.org>

Randy Dunlap <rdunlap@infradead.org> writes:
> On 1/21/21 5:14 PM, Michael Ellerman wrote:
>> Randy Dunlap <rdunlap@infradead.org> writes:
>>> On 1/20/21 1:29 PM, Yury Norov wrote:
>>>> Hi all,
>>>>
>>>> I found the power pc build broken on today's
>>>> linux-next (647060f3b592).
>>>
>>> Darn, I was building linux-5.11-rc4.
>>>
>>> I'll try linux-next after I send this.
>>>
>>> ---
>>> From: Randy Dunlap <rdunlap@infradead.org>
>>>
>>> Fulfill AKEBONO Kconfig requirements.
>>>
>>> Fixes these Kconfig warnings (and more) and fixes the subsequent
>>> build errors:
>>>
>>> WARNING: unmet direct dependencies detected for NETDEVICES
>>>    Depends on [n]: NET [=n]
>>>    Selected by [y]:
>>>    - AKEBONO [=y] && PPC_47x [=y]
>>>
>>> WARNING: unmet direct dependencies detected for MMC_SDHCI
>>>    Depends on [n]: MMC [=n] && HAS_DMA [=y]
>>>    Selected by [y]:
>>>    - AKEBONO [=y] && PPC_47x [=y]
>>>
>>> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
>>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>>> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>> Cc: Paul Mackerras <paulus@samba.org>
>>> Cc: linuxppc-dev@lists.ozlabs.org
>>> Cc: Yury Norov <yury.norov@gmail.com>
>>> ---
>>>   arch/powerpc/platforms/44x/Kconfig |    2 ++
>>>   1 file changed, 2 insertions(+)
>>>
>>> --- lnx-511-rc4.orig/arch/powerpc/platforms/44x/Kconfig
>>> +++ lnx-511-rc4/arch/powerpc/platforms/44x/Kconfig
>>> @@ -206,6 +206,7 @@ config AKEBONO
>>>   	select PPC4xx_HSTA_MSI
>>>   	select I2C
>>>   	select I2C_IBM_IIC
>>> +	select NET
>>>   	select NETDEVICES
>>>   	select ETHERNET
>>>   	select NET_VENDOR_IBM
>> 
>> I think the problem here is too much use of select, for things that
>> should instead be in the defconfig.
>> 
>> The patch below results in the same result for make
>> 44x/akebono_defconfig. Does it fix the original issue?
>
> Hi Michael,
> Sorry for the delay.
>
> Changing the akebono_defconfig doesn't cause the missing symbols
> to be set -- the defconfig is not being used here.

Yep, but that's OK. None of those selected symbols are hard dependencies
of AKEBONO, they're just things you probably want in your kernel to
actually boot on an akebono board.

> I guess that if you have users who set CONFIG_AKEBONO and expect
> it to build cleanly, you will need something like my patch or the
> patch that Florian just posted.

It will build cleanly, it just won't necessarily boot on a real board.
Users who enable AKEBONO manually need to know what they're doing, or
they should just use the defconfig.

> Changing the akebono_defconfig also would not help 'make randconfig'
> builds to build cleanly if they had happened to enable AKEBONO.

Changing the defconfig doesn't help randconfig, but dropping the selects
does.

Anyway I'll send a proper version of my patch, which I'm pretty
confident will fix all the issues.

cheers

^ permalink raw reply

* [PATCH] powerpc/akebono: Fix unmet dependency errors
From: Michael Ellerman @ 2021-02-01  1:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: f.fainelli, rdunlap, yury.norov

The AKEBONO config has various selects under it, including some with
user-selectable dependencies, which means those dependencies can be
disabled. This leads to warnings from Kconfig.

This can be seen with eg:

  $ make allnoconfig
  $ ./scripts/config --file build~/.config -k -e CONFIG_44x -k -e CONFIG_PPC_47x -e CONFIG_AKEBONO
  $ make olddefconfig

  WARNING: unmet direct dependencies detected for ATA
    Depends on [n]: HAS_IOMEM [=y] && BLOCK [=n]
    Selected by [y]:
    - AKEBONO [=y] && PPC_47x [=y]

  WARNING: unmet direct dependencies detected for NETDEVICES
    Depends on [n]: NET [=n]
    Selected by [y]:
    - AKEBONO [=y] && PPC_47x [=y]

  WARNING: unmet direct dependencies detected for ETHERNET
    Depends on [n]: NETDEVICES [=y] && NET [=n]
    Selected by [y]:
    - AKEBONO [=y] && PPC_47x [=y]

  WARNING: unmet direct dependencies detected for MMC_SDHCI
    Depends on [n]: MMC [=n] && HAS_DMA [=y]
    Selected by [y]:
    - AKEBONO [=y] && PPC_47x [=y]

  WARNING: unmet direct dependencies detected for MMC_SDHCI_PLTFM
    Depends on [n]: MMC [=n] && MMC_SDHCI [=y]
    Selected by [y]:
    - AKEBONO [=y] && PPC_47x [=y]

The problem is that AKEBONO is using select to enable things that are
not true dependencies, but rather things you probably want enabled in
an AKEBONO kernel. That is what a defconfig is for.

So drop those selects and instead move those symbols into the
defconfig. This fixes all the kconfig warnings, and the result of make
44x/akebono_defconfig is the same before and after the patch.

Reported-by: Yury Norov <yury.norov@gmail.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
 arch/powerpc/configs/44x/akebono_defconfig | 5 +++++
 arch/powerpc/platforms/44x/Kconfig         | 7 -------
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/configs/44x/akebono_defconfig b/arch/powerpc/configs/44x/akebono_defconfig
index 3894ba8f8ffc..ae9b7beefdd6 100644
--- a/arch/powerpc/configs/44x/akebono_defconfig
+++ b/arch/powerpc/configs/44x/akebono_defconfig
@@ -21,6 +21,7 @@ CONFIG_IRQ_ALL_CPUS=y
 # CONFIG_COMPACTION is not set
 # CONFIG_SUSPEND is not set
 CONFIG_NET=y
+CONFIG_NETDEVICES=y
 CONFIG_PACKET=y
 CONFIG_UNIX=y
 CONFIG_INET=y
@@ -41,7 +42,9 @@ CONFIG_BLK_DEV_RAM_SIZE=35000
 # CONFIG_SCSI_PROC_FS is not set
 CONFIG_BLK_DEV_SD=y
 # CONFIG_SCSI_LOWLEVEL is not set
+CONFIG_ATA=y
 # CONFIG_SATA_PMP is not set
+CONFIG_SATA_AHCI_PLATFORM=y
 # CONFIG_ATA_SFF is not set
 # CONFIG_NET_VENDOR_3COM is not set
 # CONFIG_NET_VENDOR_ADAPTEC is not set
@@ -98,6 +101,8 @@ CONFIG_USB_OHCI_HCD=y
 # CONFIG_USB_OHCI_HCD_PCI is not set
 CONFIG_USB_STORAGE=y
 CONFIG_MMC=y
+CONFIG_MMC_SDHCI=y
+CONFIG_MMC_SDHCI_PLTFM=y
 CONFIG_RTC_CLASS=y
 CONFIG_RTC_DRV_M41T80=y
 CONFIG_EXT2_FS=y
diff --git a/arch/powerpc/platforms/44x/Kconfig b/arch/powerpc/platforms/44x/Kconfig
index 78ac6d67a935..7d41e9264510 100644
--- a/arch/powerpc/platforms/44x/Kconfig
+++ b/arch/powerpc/platforms/44x/Kconfig
@@ -206,17 +206,10 @@ config AKEBONO
 	select PPC4xx_HSTA_MSI
 	select I2C
 	select I2C_IBM_IIC
-	select NETDEVICES
-	select ETHERNET
-	select NET_VENDOR_IBM
 	select IBM_EMAC_EMAC4 if IBM_EMAC
 	select USB if USB_SUPPORT
 	select USB_OHCI_HCD_PLATFORM if USB_OHCI_HCD
 	select USB_EHCI_HCD_PLATFORM if USB_EHCI_HCD
-	select MMC_SDHCI
-	select MMC_SDHCI_PLTFM
-	select ATA
-	select SATA_AHCI_PLATFORM
 	help
 	  This option enables support for the IBM Akebono (476gtr) evaluation board
 
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH] cxl: Simplify bool conversion
From: Andrew Donnellan @ 2021-02-01  1:44 UTC (permalink / raw)
  To: Yang Li, fbarrat; +Cc: gregkh, linuxppc-dev, linux-kernel, arnd
In-Reply-To: <1611908705-98507-1-git-send-email-yang.lee@linux.alibaba.com>

On 29/1/21 7:25 pm, Yang Li wrote:
> Fix the following coccicheck warning:
> ./drivers/misc/cxl/sysfs.c:181:48-53: WARNING: conversion to bool not
> needed here
> 
> Reported-by: Abaci Robot <abaci@linux.alibaba.com>
> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>

Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>

Thanks!

> ---
>   drivers/misc/cxl/sysfs.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/misc/cxl/sysfs.c b/drivers/misc/cxl/sysfs.c
> index d97a243..c173a5e 100644
> --- a/drivers/misc/cxl/sysfs.c
> +++ b/drivers/misc/cxl/sysfs.c
> @@ -178,7 +178,7 @@ static ssize_t perst_reloads_same_image_store(struct device *device,
>   	if ((rc != 1) || !(val == 1 || val == 0))
>   		return -EINVAL;
>   
> -	adapter->perst_same_image = (val == 1 ? true : false);
> +	adapter->perst_same_image = (val == 1);
>   	return count;
>   }
>   
> 

-- 
Andrew Donnellan              OzLabs, ADL Canberra
ajd@linux.ibm.com             IBM Australia Limited

^ permalink raw reply

* Re: [PATCH] powerpc/akebono: Fix unmet dependency errors
From: Florian Fainelli @ 2021-02-01  2:17 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: rdunlap, yury.norov
In-Reply-To: <20210201012503.940145-1-mpe@ellerman.id.au>



On 1/31/2021 5:25 PM, Michael Ellerman wrote:
> The AKEBONO config has various selects under it, including some with
> user-selectable dependencies, which means those dependencies can be
> disabled. This leads to warnings from Kconfig.
> 
> This can be seen with eg:
> 
>   $ make allnoconfig
>   $ ./scripts/config --file build~/.config -k -e CONFIG_44x -k -e CONFIG_PPC_47x -e CONFIG_AKEBONO
>   $ make olddefconfig
> 
>   WARNING: unmet direct dependencies detected for ATA
>     Depends on [n]: HAS_IOMEM [=y] && BLOCK [=n]
>     Selected by [y]:
>     - AKEBONO [=y] && PPC_47x [=y]
> 
>   WARNING: unmet direct dependencies detected for NETDEVICES
>     Depends on [n]: NET [=n]
>     Selected by [y]:
>     - AKEBONO [=y] && PPC_47x [=y]
> 
>   WARNING: unmet direct dependencies detected for ETHERNET
>     Depends on [n]: NETDEVICES [=y] && NET [=n]
>     Selected by [y]:
>     - AKEBONO [=y] && PPC_47x [=y]
> 
>   WARNING: unmet direct dependencies detected for MMC_SDHCI
>     Depends on [n]: MMC [=n] && HAS_DMA [=y]
>     Selected by [y]:
>     - AKEBONO [=y] && PPC_47x [=y]
> 
>   WARNING: unmet direct dependencies detected for MMC_SDHCI_PLTFM
>     Depends on [n]: MMC [=n] && MMC_SDHCI [=y]
>     Selected by [y]:
>     - AKEBONO [=y] && PPC_47x [=y]
> 
> The problem is that AKEBONO is using select to enable things that are
> not true dependencies, but rather things you probably want enabled in
> an AKEBONO kernel. That is what a defconfig is for.
> 
> So drop those selects and instead move those symbols into the
> defconfig. This fixes all the kconfig warnings, and the result of make
> 44x/akebono_defconfig is the same before and after the patch.
> 
> Reported-by: Yury Norov <yury.norov@gmail.com>
> Reported-by: Randy Dunlap <rdunlap@infradead.org>
> Reported-by: Florian Fainelli <f.fainelli@gmail.com>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

Acked-by: Florian Fainelli <f.fainelli@gmail.com>

Much better than my patch, thanks!
-- 
Florian

^ permalink raw reply

* Re: [PATCH] powerpc/akebono: Fix unmet dependency errors
From: Randy Dunlap @ 2021-02-01  4:11 UTC (permalink / raw)
  To: Michael Ellerman, linuxppc-dev; +Cc: f.fainelli, yury.norov
In-Reply-To: <20210201012503.940145-1-mpe@ellerman.id.au>

On 1/31/21 5:25 PM, Michael Ellerman wrote:
> The AKEBONO config has various selects under it, including some with
> user-selectable dependencies, which means those dependencies can be
> disabled. This leads to warnings from Kconfig.
> 
> 
> The problem is that AKEBONO is using select to enable things that are
> not true dependencies, but rather things you probably want enabled in
> an AKEBONO kernel. That is what a defconfig is for.
> 
> So drop those selects and instead move those symbols into the
> defconfig. This fixes all the kconfig warnings, and the result of make
> 44x/akebono_defconfig is the same before and after the patch.
> 
> Reported-by: Yury Norov <yury.norov@gmail.com>
> Reported-by: Randy Dunlap <rdunlap@infradead.org>
> Reported-by: Florian Fainelli <f.fainelli@gmail.com>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
>  arch/powerpc/configs/44x/akebono_defconfig | 5 +++++
>  arch/powerpc/platforms/44x/Kconfig         | 7 -------
>  2 files changed, 5 insertions(+), 7 deletions(-)

Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>

Thanks.

-- 
~Randy

^ permalink raw reply

* [PATCH] powerpc/603: Fix protection of user pages mapped with PROT_NONE
From: Christophe Leroy @ 2021-02-01  6:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Christoph Plattner
  Cc: linuxppc-dev, linux-kernel

On book3s/32, page protection is defined by the PP bits in the PTE
which provide the following protection depending on the access
keys defined in the matching segment register:
- PP 00 means RW with key 0 and N/A with key 1.
- PP 01 means RW with key 0 and RO with key 1.
- PP 10 means RW with both key 0 and key 1.
- PP 11 means RO with both key 0 and key 1.

Since the implementation of kernel userspace access protection,
PP bits have been set as follows:
- PP00 for pages without _PAGE_USER
- PP01 for pages with _PAGE_USER and _PAGE_RW
- PP11 for pages with _PAGE_USER and without _PAGE_RW

For kernelspace segments, kernel accesses are performed with key 0
and user accesses are performed with key 1. As PP00 is used for
non _PAGE_USER pages, user can't access kernel pages not flagged
_PAGE_USER while kernel can.

For userspace segments, both kernel and user accesses are performed
with key 0, therefore pages not flagged _PAGE_USER are still
accessible to the user.

This shouldn't be an issue, because userspace is expected to be
accessible to the user. But unlike most other architectures, powerpc
implements PROT_NONE protection by removing _PAGE_USER flag instead of
flagging the page as not valid. This means that pages in userspace
that are not flagged _PAGE_USER shall remain inaccessible.

To get the expected behaviour, just mimic other architectures in the
TLB miss handler by checking _PAGE_USER permission on userspace
accesses as if it was the _PAGE_PRESENT bit.

Note that this problem only is only for 603 cores. The 604+ have
an hash table, and hash_page() function already implement the
verification of _PAGE_USER permission on userspace pages.

Reported-by: Christoph Plattner <christoph.plattner@thalesgroup.com>
Fixes: f342adca3afc ("powerpc/32s: Prepare Kernel Userspace Access Protection")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/head_book3s_32.S | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/head_book3s_32.S b/arch/powerpc/kernel/head_book3s_32.S
index 858fbc8b19f3..0004e8a6a58e 100644
--- a/arch/powerpc/kernel/head_book3s_32.S
+++ b/arch/powerpc/kernel/head_book3s_32.S
@@ -453,11 +453,12 @@ InstructionTLBMiss:
 	cmplw	0,r1,r3
 #endif
 	mfspr	r2, SPRN_SDR1
-	li	r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
+	li	r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC | _PAGE_USER
 	rlwinm	r2, r2, 28, 0xfffff000
 #ifdef CONFIG_MODULES
 	bgt-	112f
 	lis	r2, (swapper_pg_dir - PAGE_OFFSET)@ha	/* if kernel address, use */
+	li	r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
 	addi	r2, r2, (swapper_pg_dir - PAGE_OFFSET)@l	/* kernel page table */
 #endif
 112:	rlwimi	r2,r3,12,20,29		/* insert top 10 bits of address */
@@ -516,10 +517,11 @@ DataLoadTLBMiss:
 	lis	r1, TASK_SIZE@h		/* check if kernel address */
 	cmplw	0,r1,r3
 	mfspr	r2, SPRN_SDR1
-	li	r1, _PAGE_PRESENT | _PAGE_ACCESSED
+	li	r1, _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_USER
 	rlwinm	r2, r2, 28, 0xfffff000
 	bgt-	112f
 	lis	r2, (swapper_pg_dir - PAGE_OFFSET)@ha	/* if kernel address, use */
+	li	r1, _PAGE_PRESENT | _PAGE_ACCESSED
 	addi	r2, r2, (swapper_pg_dir - PAGE_OFFSET)@l	/* kernel page table */
 112:	rlwimi	r2,r3,12,20,29		/* insert top 10 bits of address */
 	lwz	r2,0(r2)		/* get pmd entry */
@@ -593,10 +595,11 @@ DataStoreTLBMiss:
 	lis	r1, TASK_SIZE@h		/* check if kernel address */
 	cmplw	0,r1,r3
 	mfspr	r2, SPRN_SDR1
-	li	r1, _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT | _PAGE_ACCESSED
+	li	r1, _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_USER
 	rlwinm	r2, r2, 28, 0xfffff000
 	bgt-	112f
 	lis	r2, (swapper_pg_dir - PAGE_OFFSET)@ha	/* if kernel address, use */
+	li	r1, _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT | _PAGE_ACCESSED
 	addi	r2, r2, (swapper_pg_dir - PAGE_OFFSET)@l	/* kernel page table */
 112:	rlwimi	r2,r3,12,20,29		/* insert top 10 bits of address */
 	lwz	r2,0(r2)		/* get pmd entry */
-- 
2.25.0


^ permalink raw reply related

* [PATCH] ASoC: fsl_xcvr: remove unneeded semicolon
From: Yang Li @ 2021-02-01  8:08 UTC (permalink / raw)
  To: timur
  Cc: alsa-devel, linuxppc-dev, Xiubo.Lee, festevam, tiwai, lgirdwood,
	perex, nicoleotsuka, broonie, Yang Li, p.zabel, shengjiu.wang,
	linux-kernel

Eliminate the following coccicheck warning:
./sound/soc/fsl/fsl_xcvr.c:739:2-3: Unneeded semicolon

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
---
 sound/soc/fsl/fsl_xcvr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/fsl/fsl_xcvr.c b/sound/soc/fsl/fsl_xcvr.c
index 3d58c88..65b388a 100644
--- a/sound/soc/fsl/fsl_xcvr.c
+++ b/sound/soc/fsl/fsl_xcvr.c
@@ -736,7 +736,7 @@ static int fsl_xcvr_load_firmware(struct fsl_xcvr *xcvr)
 			/* clean current page, including data memory */
 			memset_io(xcvr->ram_addr, 0, size);
 		}
-	};
+	}
 
 err_firmware:
 	release_firmware(fw);
-- 
1.8.3.1


^ permalink raw reply related

* Re: FSL P5040: KVM HV doesn't work with the RC5 of kernel 5.11
From: Christian Zigotzky @ 2021-02-01  8:29 UTC (permalink / raw)
  To: linuxppc-dev, tglx, kvm-ppc@vger.kernel.org
  Cc: Darren Stevens, mad skateman, R.T.Dickinson
In-Reply-To: <b0ba0690-507f-7660-79e2-5268cc6684bf@xenosoft.de>

Hello,

I compiled the RC6 of kernel 5.11 today and KVM HV works again. 
Therefore I don't need the patch below anymore.

Many thanks for solving the issue,
Christian


On 27 January 2021 at 05:07pm, Christian Zigotzky wrote:
> Hello,
>
> I compiled the RC5 of kernel 5.11 today. Unfortunately KVM HV doesn't 
> work anymore on my FSL P5040 board [1]. I tested it with QEMU 5.0.0 
> today [2]. The virtual e5500 QEMU machine works with the "RC4 with KVM 
> HV" and with the "RC5 without KVM HV". The complete system freezes if 
> I use KVM HV with the RC5.
>
> I have bisected and 785025820a6a565185ce9d47fdd8d23dbf91dee8 
> (powerpc/mm/highmem: use __set_pte_at() for kmap_local()) [3] is the 
> first bad commit.
>
> I was able to revert this bad commit and after a new compiling, KVM HV 
> works again.  I created a patch for reverting the commit. [4] Please 
> find attached the kernel config. I use one uImage for the virtual 
> machine and for the P5040 board.
>
> Please check the first bad commit.
>
> Thanks,
> Christian
>
>
> [1] http://wiki.amiga.org/index.php?title=X5000
> [2] qemu-system-ppc64 -M ppce500 -cpu e5500 -enable-kvm -m 1024 
> -kernel uImage-5.11 -drive 
> format=raw,file=MintPPC32-X5000.img,index=0,if=virtio -netdev 
> user,id=mynet0 -device e1000,netdev=mynet0 -append "rw root=/dev/vda" 
> -device virtio-vga -device virtio-mouse-pci -device 
> virtio-keyboard-pci -device pci-ohci,id=newusb -device 
> usb-audio,bus=newusb.0 -smp 4
> [3] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.11-rc5&id=785025820a6a565185ce9d47fdd8d23dbf91dee8
> [4]
> diff -rupN a/arch/powerpc/include/asm/highmem.h 
> b/arch/powerpc/include/asm/highmem.h
> --- a/arch/powerpc/include/asm/highmem.h        2021-01-27 
> 16:12:40.382164118 +0100
> +++ b/arch/powerpc/include/asm/highmem.h        2021-01-27 
> 16:10:54.055249957 +0100
> @@ -58,8 +58,6 @@ extern pte_t *pkmap_page_table;
>
>  #define flush_cache_kmaps()    flush_cache_all()
>
> -#define arch_kmap_local_set_pte(mm, vaddr, ptep, ptev) \
> -       __set_pte_at(mm, vaddr, ptep, ptev, 1)
>  #define arch_kmap_local_post_map(vaddr, pteval)        \
>         local_flush_tlb_page(NULL, vaddr)
>  #define arch_kmap_local_post_unmap(vaddr)      \


^ permalink raw reply

* RE: [PATCH] powerpc/603: Fix protection of user pages mapped with PROT_NONE
From: PLATTNER Christoph @ 2021-02-01 10:22 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman
  Cc: KOENIG Werner, HAMETNER Reinhard, linux-kernel@vger.kernel.org,
	REITHER Robert - Contractor, PLATTNER Christoph,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <4a0c6e3bb8f0c162457bf54d9bc6fd8d7b55129f.1612160907.git.christophe.leroy@csgroup.eu>

Hello to all, and thank you very much for first and second fast response.

I do not have a long history on PowerPC MMU environment, I hacked into this topic
for about 3 months for analyzing that problem- so, sorry, if I am wrong in some points ...

What I learn so far from this MPC5121e (variant of e300c4 core):
- It uses book3s32 hash-code, but it DOES NOT provide KEY hash method, so always the 
   branch  "if (! Hash) ...." is taken, so, I assume that "key 0" and "key 1" setups are not
   used on this CPU (not supporting MMU_FTR_HPTE_TABLE)
- The PP bits are NOT checked by the CPU in HW, even if set to 00, the CPU does not react.
   As far I have understood, the TLB miss routines are responsible for checking permissions.
   The TLB miss routines check the Linux PTE styled entries and generates the PP bits
   for the TLB entry. The PowerPC PP bits are never check elsewhere on that CPU models ...
- The PTE entries in Linux are fully "void" in sense of this CPU type, as this CPU does not
   read any PTEs from RAM (no HW support in contrast to x86 or ARM or later ppc...).

In summary - as far as I understand it now - we have to handle the PTE bits differently
(Linux style) for PROT_NONE permissions - OR - we have to expand the permission 
checking like my proposed experimental patch. (PROT_NONE is not NUMA related only,
but may not used very often ...).

Another related point:
According e300 RM (manual) the ACCESSED bit in the PTE shall be set on TLB miss, as
it is an indication, that page is used. In 4.4 kernel this write back of the _PAGE_ACCESSED 
bit was performed after successful permission check:

        bne-    DataAddressInvalid      /* return if access not permitted */
        ori     r0,r0,_PAGE_ACCESSED    /* set _PAGE_ACCESSED in pte */
        /*
         * NOTE! We are assuming this is not an SMP system, otherwise
         * we would need to update the pte atomically with lwarx/stwcx.
         */
        stw     r0,0(r2)                /* update PTE (accessed bit) */
        /* Convert linux-style PTE to low word of PPC-style PTE */

Bit is set (ori ...) and written back (stw ...) to Linux PTE. May be, this is not needed, as the
PTE is never seen by the PPC chip. But I do not understand, WHY the PAGE_ACCCESSED 
is used for permission check in the late 5.4 kernel (not used in 4.4 kernel):

	cmplw	0,r1,r3
 	mfspr	r2, SPRN_SDR1
	li	r1, _PAGE_PRESENT | _PAGE_ACCESSED
	rlwinm	r2, r2, 28, 0xfffff000
 	bgt-	112f

What is the reason or relevance for checking this here ?
Was not checked in 4.4, bit or-ed afterwards, as it is accessed now.
Do you know the reason of change on this point ?

Another remark to Core manual relevant for this:
There is the reference manual for e300 core available (e300 RM). It includes
many remarks in range of Memory Management section, that many features
are optional or variable for dedicated implementations. On the other hand, 
the MPC5121e reference manual refers to the e300 core RM, but DOES NOT 
information, which of the optional points are there or nor. According my
analysis, MPC5121e does not include any of the optional features.

Thanks a lot for first reactions
Christoph

-----Original Message-----
From: Christophe Leroy <christophe.leroy@csgroup.eu> 
Sent: Montag, 1. Februar 2021 07:30
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>; Paul Mackerras <paulus@samba.org>; Michael Ellerman <mpe@ellerman.id.au>; PLATTNER Christoph <christoph.plattner@thalesgroup.com>
Cc: linux-kernel@vger.kernel.org; linuxppc-dev@lists.ozlabs.org
Subject: [PATCH] powerpc/603: Fix protection of user pages mapped with PROT_NONE

On book3s/32, page protection is defined by the PP bits in the PTE which provide the following protection depending on the access keys defined in the matching segment register:
- PP 00 means RW with key 0 and N/A with key 1.
- PP 01 means RW with key 0 and RO with key 1.
- PP 10 means RW with both key 0 and key 1.
- PP 11 means RO with both key 0 and key 1.

Since the implementation of kernel userspace access protection, PP bits have been set as follows:
- PP00 for pages without _PAGE_USER
- PP01 for pages with _PAGE_USER and _PAGE_RW
- PP11 for pages with _PAGE_USER and without _PAGE_RW

For kernelspace segments, kernel accesses are performed with key 0 and user accesses are performed with key 1. As PP00 is used for non _PAGE_USER pages, user can't access kernel pages not flagged _PAGE_USER while kernel can.

For userspace segments, both kernel and user accesses are performed with key 0, therefore pages not flagged _PAGE_USER are still accessible to the user.

This shouldn't be an issue, because userspace is expected to be accessible to the user. But unlike most other architectures, powerpc implements PROT_NONE protection by removing _PAGE_USER flag instead of flagging the page as not valid. This means that pages in userspace that are not flagged _PAGE_USER shall remain inaccessible.

To get the expected behaviour, just mimic other architectures in the TLB miss handler by checking _PAGE_USER permission on userspace accesses as if it was the _PAGE_PRESENT bit.

Note that this problem only is only for 603 cores. The 604+ have an hash table, and hash_page() function already implement the verification of _PAGE_USER permission on userspace pages.

Reported-by: Christoph Plattner <christoph.plattner@thalesgroup.com>
Fixes: f342adca3afc ("powerpc/32s: Prepare Kernel Userspace Access Protection")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/head_book3s_32.S | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/head_book3s_32.S b/arch/powerpc/kernel/head_book3s_32.S
index 858fbc8b19f3..0004e8a6a58e 100644
--- a/arch/powerpc/kernel/head_book3s_32.S
+++ b/arch/powerpc/kernel/head_book3s_32.S
@@ -453,11 +453,12 @@ InstructionTLBMiss:
 	cmplw	0,r1,r3
 #endif
 	mfspr	r2, SPRN_SDR1
-	li	r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
+	li	r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC | _PAGE_USER
 	rlwinm	r2, r2, 28, 0xfffff000
 #ifdef CONFIG_MODULES
 	bgt-	112f
 	lis	r2, (swapper_pg_dir - PAGE_OFFSET)@ha	/* if kernel address, use */
+	li	r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
 	addi	r2, r2, (swapper_pg_dir - PAGE_OFFSET)@l	/* kernel page table */
 #endif
 112:	rlwimi	r2,r3,12,20,29		/* insert top 10 bits of address */
@@ -516,10 +517,11 @@ DataLoadTLBMiss:
 	lis	r1, TASK_SIZE@h		/* check if kernel address */
 	cmplw	0,r1,r3
 	mfspr	r2, SPRN_SDR1
-	li	r1, _PAGE_PRESENT | _PAGE_ACCESSED
+	li	r1, _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_USER
 	rlwinm	r2, r2, 28, 0xfffff000
 	bgt-	112f
 	lis	r2, (swapper_pg_dir - PAGE_OFFSET)@ha	/* if kernel address, use */
+	li	r1, _PAGE_PRESENT | _PAGE_ACCESSED
 	addi	r2, r2, (swapper_pg_dir - PAGE_OFFSET)@l	/* kernel page table */
 112:	rlwimi	r2,r3,12,20,29		/* insert top 10 bits of address */
 	lwz	r2,0(r2)		/* get pmd entry */
@@ -593,10 +595,11 @@ DataStoreTLBMiss:
 	lis	r1, TASK_SIZE@h		/* check if kernel address */
 	cmplw	0,r1,r3
 	mfspr	r2, SPRN_SDR1
-	li	r1, _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT | _PAGE_ACCESSED
+	li	r1, _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_USER
 	rlwinm	r2, r2, 28, 0xfffff000
 	bgt-	112f
 	lis	r2, (swapper_pg_dir - PAGE_OFFSET)@ha	/* if kernel address, use */
+	li	r1, _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT | _PAGE_ACCESSED
 	addi	r2, r2, (swapper_pg_dir - PAGE_OFFSET)@l	/* kernel page table */
 112:	rlwimi	r2,r3,12,20,29		/* insert top 10 bits of address */
 	lwz	r2,0(r2)		/* get pmd entry */
--
2.25.0

^ permalink raw reply related

* Re: [PATCH] powerpc/603: Fix protection of user pages mapped with PROT_NONE
From: Christophe Leroy @ 2021-02-01 11:39 UTC (permalink / raw)
  To: PLATTNER Christoph, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman
  Cc: REITHER Robert - Contractor, HAMETNER Reinhard, KOENIG Werner,
	linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org
In-Reply-To: <1b194840-d4e6-4660-94d9-6bac623442cf@THSDC1IRIMBX13P.iris.infra.thales>



Le 01/02/2021 à 11:22, PLATTNER Christoph a écrit :
> Hello to all, and thank you very much for first and second fast response.
> 
> I do not have a long history on PowerPC MMU environment, I hacked into this topic
> for about 3 months for analyzing that problem- so, sorry, if I am wrong in some points ...

Yes you are wrong on some points, sorry, see below.


> 
> What I learn so far from this MPC5121e (variant of e300c4 core):
> - It uses book3s32 hash-code, but it DOES NOT provide KEY hash method, so always the
>     branch  "if (! Hash) ...." is taken, so, I assume that "key 0" and "key 1" setups are not
>     used on this CPU (not supporting MMU_FTR_HPTE_TABLE)

hash method is not used, this is SW TLB loading that is used, but still, all the PP and Ks/Kp keys 
defined in the segment register are used, see e300 core reference manual §6.4.2 Page Memory Protection

> - The PP bits are NOT checked by the CPU in HW, even if set to 00, the CPU does not react.
>     As far I have understood, the TLB miss routines are responsible for checking permissions.
>     The TLB miss routines check the Linux PTE styled entries and generates the PP bits
>     for the TLB entry. The PowerPC PP bits are never check elsewhere on that CPU models ...

PP bits ARE checked hoppefully. If it was not the case, then the TLB miss routines would install a 
TLB on a read, then the user could do a write without any verification being done ?

Refer to e300 Core reference Manual, §6.1.4 Memory Protection Facilities

As I explained in the patch, the problem is not that the HW doesn't check the permission. It is that 
user accessed been done with key 0 as programmed in the segment registers, PP 00 means RW access.

> - The PTE entries in Linux are fully "void" in sense of this CPU type, as this CPU does not
>     read any PTEs from RAM (no HW support in contrast to x86 or ARM or later ppc...).

No, the PTE are read by the TLB miss exception handlers and writen into TLB entries.

> 
> In summary - as far as I understand it now - we have to handle the PTE bits differently
> (Linux style) for PROT_NONE permissions - OR - we have to expand the permission
> checking like my proposed experimental patch. (PROT_NONE is not NUMA related only,
> but may not used very often ...).

Yes, expanding the permission checking is the easiest solution, hence the patch I sent out based on 
your proposal.

> 
> Another related point:
> According e300 RM (manual) the ACCESSED bit in the PTE shall be set on TLB miss, as
> it is an indication, that page is used. In 4.4 kernel this write back of the _PAGE_ACCESSED
> bit was performed after successful permission check:
> 
>          bne-    DataAddressInvalid      /* return if access not permitted */
>          ori     r0,r0,_PAGE_ACCESSED    /* set _PAGE_ACCESSED in pte */
>          /*
>           * NOTE! We are assuming this is not an SMP system, otherwise
>           * we would need to update the pte atomically with lwarx/stwcx.
>           */
>          stw     r0,0(r2)                /* update PTE (accessed bit) */
>          /* Convert linux-style PTE to low word of PPC-style PTE */
> 
> Bit is set (ori ...) and written back (stw ...) to Linux PTE. May be, this is not needed, as the
> PTE is never seen by the PPC chip. But I do not understand, WHY the PAGE_ACCCESSED
> is used for permission check in the late 5.4 kernel (not used in 4.4 kernel):
> 
> 	cmplw	0,r1,r3
>   	mfspr	r2, SPRN_SDR1
> 	li	r1, _PAGE_PRESENT | _PAGE_ACCESSED
> 	rlwinm	r2, r2, 28, 0xfffff000
>   	bgt-	112f
> 
> What is the reason or relevance for checking this here ?
> Was not checked in 4.4, bit or-ed afterwards, as it is accessed now.
> Do you know the reason of change on this point ?

PAGE_ACCESSED is important for memory management, linux kernel need it.

But instead of spending time at every miss to perform a write which will be a no-op in 99% of cases, 
we prefer bailing out to the page_fault logic when the accessed bit is not set. Then the page_fault 
logic will set the bit.
This also allowed to simplify the handling in __set_pte()_at function by avoiding races in the 
update of PTEs.

> 
> Another remark to Core manual relevant for this:
> There is the reference manual for e300 core available (e300 RM). It includes
> many remarks in range of Memory Management section, that many features
> are optional or variable for dedicated implementations. On the other hand,
> the MPC5121e reference manual refers to the e300 core RM, but DOES NOT
> information, which of the optional points are there or nor. According my
> analysis, MPC5121e does not include any of the optional features.
> 

Not sure what you mean. As far as I understand, that chapter tells you that some functionnalities 
are optional for the powerpc architectecture, and provided (or not) by the e300 core. The MPC5121 
supports all the things that are defined by e300 core.


> 
> Thanks a lot for first reactions

You are welcome, don't hesitate if you have additional questions.

Christophe

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox