LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] powerpc/64s: accumulate_stolen_time remove irq mask workaround
From: Michael Ellerman @ 2021-06-25  6:21 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <20210623022924.704645-1-npiggin@gmail.com>

On Wed, 23 Jun 2021 12:29:24 +1000, Nicholas Piggin wrote:
> The caller has been moved to C after irq soft-mask state has been
> reconciled, and Linux irqs have been marked as disabled, so this
> does not have to play games with irq internals.

Applied to powerpc/next.

[1/1] powerpc/64s: accumulate_stolen_time remove irq mask workaround
      https://git.kernel.org/powerpc/c/0cdff98b395e5ab71b650c3df154217b1348e9b5

cheers

^ permalink raw reply

* Re: [PATCH] powerpc: Make PPC_IRQ_SOFT_MASK_DEBUG depend on PPC64
From: Michael Ellerman @ 2021-06-25  6:21 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <20210623032909.826010-1-npiggin@gmail.com>

On Wed, 23 Jun 2021 13:29:09 +1000, Nicholas Piggin wrote:
> 32-bit platforms don't have irq soft masking.

Applied to powerpc/next.

[1/1] powerpc: Make PPC_IRQ_SOFT_MASK_DEBUG depend on PPC64
      https://git.kernel.org/powerpc/c/f5f48e8cb93f4acd77411df0327b61066985bea8

cheers

^ permalink raw reply

* Re: [PATCH] powerpc: offline CPU in stop_this_cpu
From: Michael Ellerman @ 2021-06-25  6:21 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <20210623041245.865134-1-npiggin@gmail.com>

On Wed, 23 Jun 2021 14:12:45 +1000, Nicholas Piggin wrote:
> printk_safe_flush_on_panic() has special lock breaking code for the
> case where we panic()ed with the console lock held. It relies on
> panic IPI causing other CPUs to mark themselves offline.
> 
> Do as most other architectures do.
> 
> This effectively reverts commit de6e5d38417e ("powerpc: smp_send_stop do
> not offline stopped CPUs"), unfortunately it may result in some false
> positive warnings, but the alternative is more situations where we can
> crash without getting messages out.

Applied to powerpc/next.

[1/1] powerpc: offline CPU in stop_this_cpu
      https://git.kernel.org/powerpc/c/bab26238bbd44d5a4687c0a64fd2c7f2755ea937

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/pseries: Enable hardlockup watchdog for PowerVM partitions
From: Michael Ellerman @ 2021-06-25  6:21 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <20210503111822.758423-1-npiggin@gmail.com>

On Mon, 3 May 2021 21:18:22 +1000, Nicholas Piggin wrote:
> PowerVM will not arbitrarily oversubscribe or stop guests, page out the
> guest kernel text to a NFS volume connected by carrier pigeon to abacus
> based storage, etc., as a KVM host might. So PowerVM guests are not
> likely to be killed by the hard lockup watchdog in normal operation,
> even with shared processor LPARs which still get a minimum allotment of
> CPU time.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/pseries: Enable hardlockup watchdog for PowerVM partitions
      https://git.kernel.org/powerpc/c/633c8e9800f3884a26b2af59be8ce27696ad6ebf

cheers

^ permalink raw reply

* Re: [PATCH v2] powerpc/pseries: Enable hardlockup watchdog for PowerVM partitions
From: Michael Ellerman @ 2021-06-25  6:21 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <20210623021528.702241-1-npiggin@gmail.com>

On Wed, 23 Jun 2021 12:15:28 +1000, Nicholas Piggin wrote:
> PowerVM will not arbitrarily oversubscribe or stop guests, page out the
> guest kernel text to a NFS volume connected by carrier pigeon to abacus
> based storage, etc., as a KVM host might. So PowerVM guests are not
> likely to be killed by the hard lockup watchdog in normal operation,
> even with shared processor LPARs which still get a minimum allotment of
> CPU time.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/pseries: Enable hardlockup watchdog for PowerVM partitions
      https://git.kernel.org/powerpc/c/633c8e9800f3884a26b2af59be8ce27696ad6ebf

cheers

^ permalink raw reply

* Re: [PATCH v4 00/17] powerpc/64: fast interrupt exits
From: Michael Ellerman @ 2021-06-25  6:21 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <20210617155116.2167984-1-npiggin@gmail.com>

On Fri, 18 Jun 2021 01:50:59 +1000, Nicholas Piggin wrote:
> This series attempts to improve the speed of interrupts and system calls
> in three major ways.
> 
> Firstly, the SRR/HSRR registers do not need to be reloaded if they were
> clobbered for the duration of the interrupt and the return NIP and MSR
> did not changed. 64e does not implement this part, but it could quite
> easily.
> 
> [...]

Patches 1-4, and 6-17 applied to powerpc/next.

[01/17] powerpc/interrupt: Fix CONFIG ifdef typo
        https://git.kernel.org/powerpc/c/9a3ed7adcabce24a85fbe05f54e762b18756ec22
[02/17] powerpc: remove interrupt exit helpers unused argument
        https://git.kernel.org/powerpc/c/bf9155f1970c4dbf9ec6b87d3688433bd494a4e1
[03/17] powerpc/64s: introduce different functions to return from SRR vs HSRR interrupts
        https://git.kernel.org/powerpc/c/1df7d5e4baeac74d14c1bee18b2dff9302b3efbc
[04/17] powerpc/64s: avoid reloading (H)SRR registers if they are still valid
        https://git.kernel.org/powerpc/c/59dc5bfca0cb6a29db1a50847684eb5c19f8f400
[06/17] powerpc/64: move interrupt return asm to interrupt_64.S
        https://git.kernel.org/powerpc/c/e754f4d13e3919aafa485657599907aa63b9a40c
[07/17] powerpc/64s: system call avoid setting MSR[RI] until we set MSR[EE]
        https://git.kernel.org/powerpc/c/dd152f70bdc1b91445b10c65ac874b90c93fb3b5
[08/17] powerpc/64s: save one more register in the masked interrupt handler
        https://git.kernel.org/powerpc/c/63e40806eea984f770c992120bbfd71b589ea580
[09/17] powerpc/64: allow alternate return locations for soft-masked interrupts
        https://git.kernel.org/powerpc/c/f23699c93becd746295aaa506537882a46a62219
[10/17] powerpc/64: interrupt soft-enable race fix
        https://git.kernel.org/powerpc/c/862fa563524b9f92d7e89fe332732bd3421772db
[11/17] powerpc/64: treat low kernel text as irqs soft-masked
        https://git.kernel.org/powerpc/c/9d1988ca87dd90ecf80a0601c7fd13071fbb1a83
[12/17] powerpc/64: use interrupt restart table to speed up return from interrupt
        https://git.kernel.org/powerpc/c/13799748b957bc5659f97c036224b0f4b42172e2
[13/17] powerpc/interrupt: Rename and lightly change syscall_exit_prepare_main()
        https://git.kernel.org/powerpc/c/f84aa284947f325c5697d35b92abd2047224f24b
[14/17] powerpc/interrupt: Refactor interrupt_exit_user_prepare()
        https://git.kernel.org/powerpc/c/a214ee8802adb864d175ea6ca4176223bcc11d2b
[15/17] powerpc/interrupt: Interchange prep_irq_for_{kernel_enabled/user}_exit()
        https://git.kernel.org/powerpc/c/99f98f849cf13e5fac532979ccdb77dff07665db
[16/17] powerpc/interrupt: Refactor prep_irq_for_{user/kernel_enabled}_exit()
        https://git.kernel.org/powerpc/c/61eece2d1707796fd45225ea3d20e9289251311c
[17/17] powerpc/interrupt: Remove prep_irq_for_user_exit()
        https://git.kernel.org/powerpc/c/ae58b1c645895c28ca155843db6788d57ea99e11

cheers

^ permalink raw reply

* Re: PowerPC guest getting "BUG: scheduling while atomic" on linux-next-20210623 during secondary CPUs bringup
From: Peter Zijlstra @ 2021-06-25  7:28 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: LKML, Valentin Schneider, linux-next, Bharata B Rao, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20210625054608.fmwt7lxuhp7inkjx@linux.vnet.ibm.com>

On Fri, Jun 25, 2021 at 11:16:08AM +0530, Srikar Dronamraju wrote:
> * Bharata B Rao <bharata@linux.ibm.com> [2021-06-24 21:25:09]:
> 
> > A PowerPC KVM guest gets the following BUG message when booting
> > linux-next-20210623:
> > 
> > smp: Bringing up secondary CPUs ...
> > BUG: scheduling while atomic: swapper/1/0/0x00000000

'funny', your preempt_count is actually too low. The check here is for
preempt_count() == DISABLE_OFFSET (aka. 1 when PREEMPT=y), but you have
0.

> > no locks held by swapper/1/0.
> > Modules linked in:
> > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.0-rc7-next-20210623
> > Call Trace:
> > [c00000000ae5bc20] [c000000000badc64] dump_stack_lvl+0x98/0xe0 (unreliable)
> > [c00000000ae5bc60] [c000000000210200] __schedule_bug+0xb0/0xe0
> > [c00000000ae5bcd0] [c000000001609e28] __schedule+0x1788/0x1c70
> > [c00000000ae5be20] [c00000000160a8cc] schedule_idle+0x3c/0x70
> > [c00000000ae5be50] [c00000000022984c] do_idle+0x2bc/0x420
> > [c00000000ae5bf00] [c000000000229d88] cpu_startup_entry+0x38/0x40
> > [c00000000ae5bf30] [c0000000000666c0] start_secondary+0x290/0x2a0
> > [c00000000ae5bf90] [c00000000000be54] start_secondary_prolog+0x10/0x14
> > 
> > <The above repeats for all the secondary CPUs>
> > 
> > smp: Brought up 2 nodes, 16 CPUs
> > numa: Node 0 CPUs: 0-7
> > numa: Node 1 CPUs: 8-15
> > 
> > This seems to have started from next-20210521 and isn't seen on
> > next-20210511.
> > 
> 
> Bharata,
> 
> I think the regression is due to Commit f1a0a376ca0c ("sched/core:
> Initialize the idle task with preemption disabled")

So that extra preempt_disable() that got removed would've incremented it
to 1 and then things would've been fine.

Except.. Valentin changed things such that preempt_count() should've
been inittialized to 1, instead of 0, but for some raisin that didn't
stick.. what gives.

So we have init_idle(p) -> init_idle_preempt_count(p) ->
task_thread_info(p)->preempt_count = PREEMPT_DISABLED;

But somehow, by the time you're running start_secondary(), that's gotten
to be 0 again. Does DEBUG_PREEMPT give more clues?

^ permalink raw reply

* [PATCH v2 0/5] Remove uses of struct page from x86 and arm64 MMU
From: David Stevens @ 2021-06-25  7:36 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini, Nick Piggin
  Cc: Wanpeng Li, kvm, dri-devel, linux-mips, Will Deacon, kvmarm,
	Alexandru Elisei, Joerg Roedel, David Stevens, Zhi Wang,
	Suzuki K Poulose, intel-gfx, kvm-ppc, Zhenyu Wang, intel-gvt-dev,
	linux-arm-kernel, Jim Mattson, Sean Christopherson, linux-kernel,
	James Morse, Vitaly Kuznetsov, linuxppc-dev

KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
follow_pte in gfn_to_pfn. However, the resolved pfns may not have
assoicated struct pages, so they should not be passed to pfn_to_page.
This series removes such calls from the x86 and arm64 secondary MMU. To
do this, this series introduces gfn_to_pfn_page functions that parallel
the gfn_to_pfn functions. These functions take an extra out parameter
which  contains the page if the hva was resovled by gup. This allows the
caller to call put_page only when necessated by gup.

The gfn_to_pfn functions are depreciated. For now the functions remain
with identical behavior to before, but callers should be migrated to the
new gfn_to_pfn_page functions. I added new functions instead of simply
adding another parameter to the existing functions to make it easier to
track down users of the deprecated functions.

I have migrated the x86 and arm64 secondary MMUs to the new
gfn_to_pfn_page functions.

This addresses CVE-2021-22543 on x86 and arm64.

v1 -> v2:
 - Introduce new gfn_to_pfn_page functions instead of modifying the
   behavior of existing gfn_to_pfn functions, to make the change less
   invasive.
 - Drop changes to mmu_audit.c
 - Include Nicholas Piggin's patch to avoid corrupting refcount in the
   follow_pte case, and use it in depreciated gfn_to_pfn functions.
 - Rebase on kvm/next

David Stevens (4):
  KVM: mmu: introduce new gfn_to_pfn_page functions
  KVM: x86/mmu: use gfn_to_pfn_page
  KVM: arm64/mmu: use gfn_to_pfn_page
  KVM: mmu: remove over-aggressive warnings

Nicholas Piggin (1):
  KVM: do not allow mapping valid but non-refcounted pages

 arch/arm64/kvm/mmu.c            |  26 +++--
 arch/x86/kvm/mmu/mmu.c          |  50 ++++-----
 arch/x86/kvm/mmu/mmu_internal.h |   3 +-
 arch/x86/kvm/mmu/paging_tmpl.h  |  23 +++--
 arch/x86/kvm/mmu/tdp_mmu.c      |   6 +-
 arch/x86/kvm/mmu/tdp_mmu.h      |   4 +-
 arch/x86/kvm/x86.c              |   6 +-
 include/linux/kvm_host.h        |  17 +++
 virt/kvm/kvm_main.c             | 177 +++++++++++++++++++++++++-------
 9 files changed, 222 insertions(+), 90 deletions(-)

-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply

* [PATCH v2 1/5] KVM: do not allow mapping valid but non-refcounted pages
From: David Stevens @ 2021-06-25  7:36 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini, Nick Piggin
  Cc: kvm-ppc, intel-gvt-dev, Wanpeng Li, kvm, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Joerg Roedel, linuxppc-dev,
	dri-devel, linux-kernel, Zhenyu Wang, linux-mips, James Morse,
	linux-arm-kernel, intel-gfx, Vitaly Kuznetsov, Alexandru Elisei,
	kvmarm, Zhi Wang, Jim Mattson
In-Reply-To: <20210625073616.2184426-1-stevensd@google.com>

From: Nicholas Piggin <npiggin@gmail.com>

It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
of APIs, which take a reference to the page, which takes it from 0 to 1.
When the reference is dropped, this will free the page incorrectly.

Fix this by only taking a reference on the page if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 virt/kvm/kvm_main.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3dcc2abbfc60..f7445c3bcd90 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2175,6 +2175,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
 	return true;
 }
 
+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+	if (kvm_is_reserved_pfn(pfn))
+		return 1;
+	return get_page_unless_zero(pfn_to_page(pfn));
+}
+
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       unsigned long addr, bool *async,
 			       bool write_fault, bool *writable,
@@ -2224,13 +2231,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 	 * Whoever called remap_pfn_range is also going to call e.g.
 	 * unmap_mapping_range before the underlying pages are freed,
 	 * causing a call to our MMU notifier.
+	 *
+	 * Certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages, but be allocated without refcounting e.g.,
+	 * tail pages of non-compound higher order allocations, which
+	 * would then underflow the refcount when the caller does the
+	 * required put_page. Don't allow those pages here.
 	 */ 
-	kvm_get_pfn(pfn);
+	if (!kvm_try_get_pfn(pfn))
+		r = -EFAULT;
 
 out:
 	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
-	return 0;
+
+	return r;
 }
 
 /*
-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply related

* [PATCH v2 2/5] KVM: mmu: introduce new gfn_to_pfn_page functions
From: David Stevens @ 2021-06-25  7:36 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini, Nick Piggin
  Cc: Wanpeng Li, kvm, dri-devel, linux-mips, Will Deacon, kvmarm,
	Alexandru Elisei, Joerg Roedel, Zhi Wang, Suzuki K Poulose,
	intel-gfx, kvm-ppc, Zhenyu Wang, intel-gvt-dev, linux-arm-kernel,
	Jim Mattson, Sean Christopherson, linux-kernel, James Morse,
	David Stevens, Vitaly Kuznetsov, linuxppc-dev
In-Reply-To: <20210625073616.2184426-1-stevensd@google.com>

From: David Stevens <stevensd@chromium.org>

Introduce new gfn_to_pfn_page functions that parallel existing
gfn_to_pfn functions. The new functions are identical except they take
an additional out parameter that is used to return the struct page if
the hva was resolved by gup. This allows callers to differentiate the
gup and follow_pte cases, which in turn allows callers to only touch the
page refcount when necessitated by gup.

The old gfn_to_pfn functions are depreciated, and all callers should be
migrated to the new gfn_to_pfn_page functions. In the interim, the
gfn_to_pfn functions are reimplemented as wrappers of the corresponding
gfn_to_pfn_page functions. The wrappers take a reference to the pfn's
page that had previously been taken in hva_to_pfn_remapped.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 include/linux/kvm_host.h |  17 ++++
 virt/kvm/kvm_main.c      | 186 ++++++++++++++++++++++++++++-----------
 2 files changed, 152 insertions(+), 51 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ae7735b490b4..2f828edd7278 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -820,6 +820,19 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool *async, bool write_fault,
 			       bool *writable, hva_t *hva);
 
+kvm_pfn_t gfn_to_pfn_page(struct kvm *kvm, gfn_t gfn, struct page **page);
+kvm_pfn_t gfn_to_pfn_page_prot(struct kvm *kvm, gfn_t gfn,
+			       bool write_fault, bool *writable,
+			       struct page **page);
+kvm_pfn_t gfn_to_pfn_page_memslot(struct kvm_memory_slot *slot,
+				  gfn_t gfn, struct page **page);
+kvm_pfn_t gfn_to_pfn_page_memslot_atomic(struct kvm_memory_slot *slot,
+					 gfn_t gfn, struct page **page);
+kvm_pfn_t __gfn_to_pfn_page_memslot(struct kvm_memory_slot *slot,
+				    gfn_t gfn, bool atomic, bool *async,
+				    bool write_fault, bool *writable,
+				    hva_t *hva, struct page **page);
+
 void kvm_release_pfn_clean(kvm_pfn_t pfn);
 void kvm_release_pfn_dirty(kvm_pfn_t pfn);
 void kvm_set_pfn_dirty(kvm_pfn_t pfn);
@@ -901,6 +914,10 @@ struct kvm_memslots *kvm_vcpu_memslots(struct kvm_vcpu *vcpu);
 struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn);
 kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn);
 kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn);
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_page_atomic(struct kvm_vcpu *vcpu, gfn_t gfn,
+					  struct page **page);
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+				   struct page **page);
 int kvm_vcpu_map(struct kvm_vcpu *vcpu, gpa_t gpa, struct kvm_host_map *map);
 int kvm_map_gfn(struct kvm_vcpu *vcpu, gfn_t gfn, struct kvm_host_map *map,
 		struct gfn_to_pfn_cache *cache, bool atomic);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f7445c3bcd90..1de8702845ac 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2102,9 +2102,9 @@ static inline int check_user_page_hwpoison(unsigned long addr)
  * only part that runs if we can in atomic context.
  */
 static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
-			    bool *writable, kvm_pfn_t *pfn)
+			    bool *writable, kvm_pfn_t *pfn,
+			    struct page **page)
 {
-	struct page *page[1];
 
 	/*
 	 * Fast pin a writable pfn only if it is a write fault request
@@ -2115,7 +2115,7 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
 		return false;
 
 	if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
-		*pfn = page_to_pfn(page[0]);
+		*pfn = page_to_pfn(*page);
 
 		if (writable)
 			*writable = true;
@@ -2130,10 +2130,9 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
  * 1 indicates success, -errno is returned if error is detected.
  */
 static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
-			   bool *writable, kvm_pfn_t *pfn)
+			   bool *writable, kvm_pfn_t *pfn, struct page **page)
 {
 	unsigned int flags = FOLL_HWPOISON;
-	struct page *page;
 	int npages = 0;
 
 	might_sleep();
@@ -2146,7 +2145,7 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 	if (async)
 		flags |= FOLL_NOWAIT;
 
-	npages = get_user_pages_unlocked(addr, 1, &page, flags);
+	npages = get_user_pages_unlocked(addr, 1, page, flags);
 	if (npages != 1)
 		return npages;
 
@@ -2156,11 +2155,11 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
 
 		if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
 			*writable = true;
-			put_page(page);
-			page = wpage;
+			put_page(*page);
+			*page = wpage;
 		}
 	}
-	*pfn = page_to_pfn(page);
+	*pfn = page_to_pfn(*page);
 	return npages;
 }
 
@@ -2175,13 +2174,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
 	return true;
 }
 
-static int kvm_try_get_pfn(kvm_pfn_t pfn)
-{
-	if (kvm_is_reserved_pfn(pfn))
-		return 1;
-	return get_page_unless_zero(pfn_to_page(pfn));
-}
-
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 			       unsigned long addr, bool *async,
 			       bool write_fault, bool *writable,
@@ -2221,26 +2213,6 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 		*writable = pte_write(*ptep);
 	pfn = pte_pfn(*ptep);
 
-	/*
-	 * Get a reference here because callers of *hva_to_pfn* and
-	 * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
-	 * returned pfn.  This is only needed if the VMA has VM_MIXEDMAP
-	 * set, but the kvm_get_pfn/kvm_release_pfn_clean pair will
-	 * simply do nothing for reserved pfns.
-	 *
-	 * Whoever called remap_pfn_range is also going to call e.g.
-	 * unmap_mapping_range before the underlying pages are freed,
-	 * causing a call to our MMU notifier.
-	 *
-	 * Certain IO or PFNMAP mappings can be backed with valid
-	 * struct pages, but be allocated without refcounting e.g.,
-	 * tail pages of non-compound higher order allocations, which
-	 * would then underflow the refcount when the caller does the
-	 * required put_page. Don't allow those pages here.
-	 */ 
-	if (!kvm_try_get_pfn(pfn))
-		r = -EFAULT;
-
 out:
 	pte_unmap_unlock(ptep, ptl);
 	*p_pfn = pfn;
@@ -2262,8 +2234,9 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
  * 2): @write_fault = false && @writable, @writable will tell the caller
  *     whether the mapping is writable.
  */
-static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
-			bool write_fault, bool *writable)
+static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic,
+			    bool *async, bool write_fault, bool *writable,
+			    struct page **page)
 {
 	struct vm_area_struct *vma;
 	kvm_pfn_t pfn = 0;
@@ -2272,13 +2245,14 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 	/* we can do it either atomically or asynchronously, not both */
 	BUG_ON(atomic && async);
 
-	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
+	if (hva_to_pfn_fast(addr, write_fault, writable, &pfn, page))
 		return pfn;
 
 	if (atomic)
 		return KVM_PFN_ERR_FAULT;
 
-	npages = hva_to_pfn_slow(addr, async, write_fault, writable, &pfn);
+	npages = hva_to_pfn_slow(addr, async, write_fault, writable,
+				 &pfn, page);
 	if (npages == 1)
 		return pfn;
 
@@ -2310,12 +2284,14 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 	return pfn;
 }
 
-kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
-			       bool atomic, bool *async, bool write_fault,
-			       bool *writable, hva_t *hva)
+kvm_pfn_t __gfn_to_pfn_page_memslot(struct kvm_memory_slot *slot,
+				    gfn_t gfn, bool atomic, bool *async,
+				    bool write_fault, bool *writable,
+				    hva_t *hva, struct page **page)
 {
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
+	*page = NULL;
 	if (hva)
 		*hva = addr;
 
@@ -2338,45 +2314,153 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
 	}
 
 	return hva_to_pfn(addr, atomic, async, write_fault,
-			  writable);
+			  writable, page);
+}
+EXPORT_SYMBOL_GPL(__gfn_to_pfn_page_memslot);
+
+kvm_pfn_t gfn_to_pfn_page_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
+			       bool *writable, struct page **page)
+{
+	return __gfn_to_pfn_page_memslot(gfn_to_memslot(kvm, gfn), gfn, false,
+					 NULL, write_fault, writable, NULL,
+					 page);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_page_prot);
+
+kvm_pfn_t gfn_to_pfn_page_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+				  struct page **page)
+{
+	return __gfn_to_pfn_page_memslot(slot, gfn, false, NULL, true,
+					 NULL, NULL, page);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_page_memslot);
+
+kvm_pfn_t gfn_to_pfn_page_memslot_atomic(struct kvm_memory_slot *slot,
+					 gfn_t gfn, struct page **page)
+{
+	return __gfn_to_pfn_page_memslot(slot, gfn, true, NULL, true, NULL,
+					 NULL, page);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_page_memslot_atomic);
+
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_page_atomic(struct kvm_vcpu *vcpu, gfn_t gfn,
+					  struct page **page)
+{
+	return gfn_to_pfn_page_memslot_atomic(
+			kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn, page);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_page_atomic);
+
+kvm_pfn_t gfn_to_pfn_page(struct kvm *kvm, gfn_t gfn, struct page **page)
+{
+	return gfn_to_pfn_page_memslot(gfn_to_memslot(kvm, gfn), gfn, page);
+}
+EXPORT_SYMBOL_GPL(gfn_to_pfn_page);
+
+kvm_pfn_t kvm_vcpu_gfn_to_pfn_page(struct kvm_vcpu *vcpu, gfn_t gfn,
+				   struct page **page)
+{
+	return gfn_to_pfn_page_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn),
+				       gfn, page);
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_page);
+
+static kvm_pfn_t ensure_pfn_ref(struct page *page, kvm_pfn_t pfn)
+{
+	if (page || is_error_pfn(pfn) || kvm_is_reserved_pfn(pfn))
+		return pfn;
+
+	/*
+	 * Certain IO or PFNMAP mappings can be backed with valid
+	 * struct pages, but be allocated without refcounting e.g.,
+	 * tail pages of non-compound higher order allocations, which
+	 * would then underflow the refcount when the caller does the
+	 * required put_page. Don't allow those pages here.
+	 */
+	if (get_page_unless_zero(pfn_to_page(pfn)))
+		return pfn;
+
+	return KVM_PFN_ERR_FAULT;
+}
+
+kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
+			       bool atomic, bool *async, bool write_fault,
+			       bool *writable, hva_t *hva)
+{
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = __gfn_to_pfn_page_memslot(slot, gfn, atomic, async,
+					write_fault, writable, hva, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
 
 kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
-	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
-				    write_fault, writable, NULL);
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = gfn_to_pfn_page_prot(kvm, gfn, write_fault, writable, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
 kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = gfn_to_pfn_page_memslot(slot, gfn, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
 kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL, NULL);
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = gfn_to_pfn_page_memslot_atomic(slot, gfn, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
 kvm_pfn_t kvm_vcpu_gfn_to_pfn_atomic(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	return gfn_to_pfn_memslot_atomic(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = kvm_vcpu_gfn_to_pfn_page_atomic(vcpu, gfn, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn_atomic);
 
 kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
 {
-	return gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn);
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = gfn_to_pfn_page(kvm, gfn, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn);
 
 kvm_pfn_t kvm_vcpu_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	return gfn_to_pfn_memslot(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn);
+	struct page *page;
+	kvm_pfn_t pfn;
+
+	pfn = kvm_vcpu_gfn_to_pfn_page(vcpu, gfn, &page);
+
+	return ensure_pfn_ref(page, pfn);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_pfn);
 
-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply related

* [PATCH v2 3/5] KVM: x86/mmu: use gfn_to_pfn_page
From: David Stevens @ 2021-06-25  7:36 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini, Nick Piggin
  Cc: Wanpeng Li, kvm, dri-devel, linux-mips, Will Deacon, kvmarm,
	Alexandru Elisei, Joerg Roedel, Zhi Wang, Suzuki K Poulose,
	intel-gfx, kvm-ppc, Zhenyu Wang, intel-gvt-dev, linux-arm-kernel,
	Jim Mattson, Sean Christopherson, linux-kernel, James Morse,
	David Stevens, Vitaly Kuznetsov, linuxppc-dev
In-Reply-To: <20210625073616.2184426-1-stevensd@google.com>

From: David Stevens <stevensd@chromium.org>

Covert usages of the deprecated gfn_to_pfn functions to the new
gfn_to_pfn_page functions.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/x86/kvm/mmu/mmu.c          | 43 ++++++++++++++++++++-------------
 arch/x86/kvm/mmu/mmu_internal.h |  3 ++-
 arch/x86/kvm/mmu/paging_tmpl.h  | 23 +++++++++++-------
 arch/x86/kvm/mmu/tdp_mmu.c      |  6 ++---
 arch/x86/kvm/mmu/tdp_mmu.h      |  4 +--
 arch/x86/kvm/x86.c              |  6 +++--
 6 files changed, 51 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 00732757cc60..dd5cb6e33591 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2702,8 +2702,9 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	return ret;
 }
 
-static kvm_pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
-				     bool no_dirty_log)
+static kvm_pfn_t pte_prefetch_gfn_to_pfn_page(struct kvm_vcpu *vcpu,
+					      gfn_t gfn, bool no_dirty_log,
+					      struct page **page)
 {
 	struct kvm_memory_slot *slot;
 
@@ -2711,7 +2712,7 @@ static kvm_pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
 	if (!slot)
 		return KVM_PFN_ERR_FAULT;
 
-	return gfn_to_pfn_memslot_atomic(slot, gfn);
+	return gfn_to_pfn_page_memslot_atomic(slot, gfn, page);
 }
 
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
@@ -2840,7 +2841,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 
 int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 			    int max_level, kvm_pfn_t *pfnp,
-			    bool huge_page_disallowed, int *req_level)
+			    struct page *page, bool huge_page_disallowed,
+			    int *req_level)
 {
 	struct kvm_memory_slot *slot;
 	kvm_pfn_t pfn = *pfnp;
@@ -2852,6 +2854,9 @@ int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 	if (unlikely(max_level == PG_LEVEL_4K))
 		return PG_LEVEL_4K;
 
+	if (!page)
+		return PG_LEVEL_4K;
+
 	if (is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn))
 		return PG_LEVEL_4K;
 
@@ -2906,7 +2911,8 @@ void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
 }
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
-			int map_writable, int max_level, kvm_pfn_t pfn,
+			int map_writable, int max_level,
+			kvm_pfn_t pfn, struct page *page,
 			bool prefault, bool is_tdp)
 {
 	bool nx_huge_page_workaround_enabled = is_nx_huge_page_enabled();
@@ -2919,7 +2925,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	gfn_t base_gfn = gfn;
 
-	level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
+	level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn, page,
 					huge_page_disallowed, &req_level);
 
 	trace_kvm_mmu_spte_requested(gpa, level, pfn);
@@ -3768,8 +3774,9 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-			 gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
-			 bool write, bool *writable)
+			 gpa_t cr2_or_gpa, kvm_pfn_t *pfn,
+			 hva_t *hva, bool write, bool *writable,
+			 struct page **page)
 {
 	struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
 	bool async;
@@ -3790,8 +3797,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 	}
 
 	async = false;
-	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async,
-				    write, writable, hva);
+	*pfn = __gfn_to_pfn_page_memslot(slot, gfn, false, &async,
+					 write, writable, hva, page);
 	if (!async)
 		return false; /* *pfn has correct page already */
 
@@ -3805,8 +3812,8 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 			return true;
 	}
 
-	*pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL,
-				    write, writable, hva);
+	*pfn = __gfn_to_pfn_page_memslot(slot, gfn, false, NULL,
+					 write, writable, hva, page);
 	return false;
 }
 
@@ -3820,6 +3827,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	gfn_t gfn = gpa >> PAGE_SHIFT;
 	unsigned long mmu_seq;
 	kvm_pfn_t pfn;
+	struct page *page;
 	hva_t hva;
 	int r;
 
@@ -3840,7 +3848,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	smp_rmb();
 
 	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, &hva,
-			 write, &map_writable))
+			 write, &map_writable, &page))
 		return RET_PF_RETRY;
 
 	if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
@@ -3861,17 +3869,18 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 
 	if (is_tdp_mmu_fault)
 		r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
-				    pfn, prefault);
+				    pfn, page, prefault);
 	else
-		r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
-				 prefault, is_tdp);
+		r = __direct_map(vcpu, gpa, error_code, map_writable, max_level,
+				 pfn, page, prefault, is_tdp);
 
 out_unlock:
 	if (is_tdp_mmu_fault)
 		read_unlock(&vcpu->kvm->mmu_lock);
 	else
 		write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(pfn);
+	if (page)
+		put_page(page);
 	return r;
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 35567293c1fd..cc02fe22b450 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -159,7 +159,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			      kvm_pfn_t pfn, int max_level);
 int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
 			    int max_level, kvm_pfn_t *pfnp,
-			    bool huge_page_disallowed, int *req_level);
+			    struct page *page, bool huge_page_disallowed,
+			    int *req_level);
 void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
 				kvm_pfn_t *pfnp, int *goal_levelp);
 
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 490a028ddabe..f1ebb1ee7f0f 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -564,6 +564,7 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	unsigned pte_access;
 	gfn_t gfn;
 	kvm_pfn_t pfn;
+	struct page *page;
 
 	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
@@ -573,8 +574,8 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	gfn = gpte_to_gfn(gpte);
 	pte_access = sp->role.access & FNAME(gpte_access)(gpte);
 	FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
-	pfn = pte_prefetch_gfn_to_pfn(vcpu, gfn,
-			no_dirty_log && (pte_access & ACC_WRITE_MASK));
+	pfn = pte_prefetch_gfn_to_pfn_page(vcpu, gfn,
+			no_dirty_log && (pte_access & ACC_WRITE_MASK), &page);
 	if (is_error_pfn(pfn))
 		return false;
 
@@ -585,7 +586,8 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	mmu_set_spte(vcpu, spte, pte_access, false, PG_LEVEL_4K, gfn, pfn,
 		     true, true);
 
-	kvm_release_pfn_clean(pfn);
+	if (page)
+		put_page(page);
 	return true;
 }
 
@@ -665,8 +667,8 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
  */
 static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 			 struct guest_walker *gw, u32 error_code,
-			 int max_level, kvm_pfn_t pfn, bool map_writable,
-			 bool prefault)
+			 int max_level, kvm_pfn_t pfn, struct page *page,
+			 bool map_writable, bool prefault)
 {
 	bool nx_huge_page_workaround_enabled = is_nx_huge_page_enabled();
 	bool write_fault = error_code & PFERR_WRITE_MASK;
@@ -723,7 +725,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gpa_t addr,
 	}
 
 	level = kvm_mmu_hugepage_adjust(vcpu, gw->gfn, max_level, &pfn,
-					huge_page_disallowed, &req_level);
+					page, huge_page_disallowed,
+					&req_level);
 
 	trace_kvm_mmu_spte_requested(addr, gw->level, pfn);
 
@@ -830,6 +833,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	struct guest_walker walker;
 	int r;
 	kvm_pfn_t pfn;
+	struct page *page;
 	hva_t hva;
 	unsigned long mmu_seq;
 	bool map_writable, is_self_change_mapping;
@@ -882,7 +886,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	smp_rmb();
 
 	if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
-			 write_fault, &map_writable))
+			 write_fault, &map_writable, &page))
 		return RET_PF_RETRY;
 
 	if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
@@ -916,13 +920,14 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	r = make_mmu_pages_available(vcpu);
 	if (r)
 		goto out_unlock;
-	r = FNAME(fetch)(vcpu, addr, &walker, error_code, max_level, pfn,
+	r = FNAME(fetch)(vcpu, addr, &walker, error_code, max_level, pfn, page,
 			 map_writable, prefault);
 	kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 
 out_unlock:
 	write_unlock(&vcpu->kvm->mmu_lock);
-	kvm_release_pfn_clean(pfn);
+	if (page)
+		put_page(page);
 	return r;
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index caac4ddb46df..10572af6fe91 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -966,8 +966,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, int write,
  * page tables and SPTEs to translate the faulting guest physical address.
  */
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
-		    int map_writable, int max_level, kvm_pfn_t pfn,
-		    bool prefault)
+		    int map_writable, int max_level,
+		    kvm_pfn_t pfn, struct page *page, bool prefault)
 {
 	bool nx_huge_page_workaround_enabled = is_nx_huge_page_enabled();
 	bool write = error_code & PFERR_WRITE_MASK;
@@ -983,7 +983,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	int level;
 	int req_level;
 
-	level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
+	level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn, page,
 					huge_page_disallowed, &req_level);
 
 	trace_kvm_mmu_spte_requested(gpa, level, pfn);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 1cae4485b3bc..3afaf73adfe7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -52,8 +52,8 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
 
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
-		    int map_writable, int max_level, kvm_pfn_t pfn,
-		    bool prefault);
+		    int map_writable, int max_level,
+		    kvm_pfn_t pfn, struct page *page, bool prefault);
 
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 				 bool flush);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 17468d983fbd..1a21b6702de5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7511,6 +7511,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 {
 	gpa_t gpa = cr2_or_gpa;
 	kvm_pfn_t pfn;
+	struct page *page;
 
 	if (!(emulation_type & EMULTYPE_ALLOW_RETRY_PF))
 		return false;
@@ -7540,7 +7541,7 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	 * retry instruction -> write #PF -> emulation fail -> retry
 	 * instruction -> ...
 	 */
-	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
+	pfn = gfn_to_pfn_page(vcpu->kvm, gpa_to_gfn(gpa), &page);
 
 	/*
 	 * If the instruction failed on the error pfn, it can not be fixed,
@@ -7549,7 +7550,8 @@ static bool reexecute_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	if (is_error_noslot_pfn(pfn))
 		return false;
 
-	kvm_release_pfn_clean(pfn);
+	if (page)
+		put_page(page);
 
 	/* The instructions are well-emulated on direct mmu. */
 	if (vcpu->arch.mmu->direct_map) {
-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply related

* [PATCH v2 4/5] KVM: arm64/mmu: use gfn_to_pfn_page
From: David Stevens @ 2021-06-25  7:36 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini, Nick Piggin
  Cc: Wanpeng Li, kvm, dri-devel, linux-mips, Will Deacon, kvmarm,
	Alexandru Elisei, Joerg Roedel, Zhi Wang, Suzuki K Poulose,
	intel-gfx, kvm-ppc, Zhenyu Wang, intel-gvt-dev, linux-arm-kernel,
	Jim Mattson, Sean Christopherson, linux-kernel, James Morse,
	David Stevens, Vitaly Kuznetsov, linuxppc-dev
In-Reply-To: <20210625073616.2184426-1-stevensd@google.com>

From: David Stevens <stevensd@chromium.org>

Covert usages of the deprecated gfn_to_pfn functions to the new
gfn_to_pfn_page functions.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/arm64/kvm/mmu.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c10207fed2f3..c29da690ed74 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -780,7 +780,7 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
 static unsigned long
 transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
 			    unsigned long hva, kvm_pfn_t *pfnp,
-			    phys_addr_t *ipap)
+			    struct page **page, phys_addr_t *ipap)
 {
 	kvm_pfn_t pfn = *pfnp;
 
@@ -789,7 +789,7 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
 	 * sure that the HVA and IPA are sufficiently aligned and that the
 	 * block map is contained within the memslot.
 	 */
-	if (kvm_is_transparent_hugepage(pfn) &&
+	if (*page && kvm_is_transparent_hugepage(pfn) &&
 	    fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
 		/*
 		 * The address we faulted on is backed by a transparent huge
@@ -810,10 +810,11 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
 		 * page accordingly.
 		 */
 		*ipap &= PMD_MASK;
-		kvm_release_pfn_clean(pfn);
+		put_page(*page);
 		pfn &= ~(PTRS_PER_PMD - 1);
-		kvm_get_pfn(pfn);
 		*pfnp = pfn;
+		*page = pfn_to_page(pfn);
+		get_page(*page);
 
 		return PMD_SIZE;
 	}
@@ -837,6 +838,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	short vma_shift;
 	gfn_t gfn;
 	kvm_pfn_t pfn;
+	struct page *page;
 	bool logging_active = memslot_is_logging(memslot);
 	unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
 	unsigned long vma_pagesize, fault_granule;
@@ -933,8 +935,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 */
 	smp_rmb();
 
-	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
-				   write_fault, &writable, NULL);
+	pfn = __gfn_to_pfn_page_memslot(memslot, gfn, false, NULL,
+					write_fault, &writable, NULL, &page);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
@@ -967,7 +969,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 */
 	if (vma_pagesize == PAGE_SIZE && !force_pte)
 		vma_pagesize = transparent_hugepage_adjust(memslot, hva,
-							   &pfn, &fault_ipa);
+							   &pfn, &page,
+							   &fault_ipa);
 	if (writable)
 		prot |= KVM_PGTABLE_PROT_W;
 
@@ -999,14 +1002,17 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	/* Mark the page dirty only if the fault is handled successfully */
 	if (writable && !ret) {
-		kvm_set_pfn_dirty(pfn);
+		if (page)
+			kvm_set_pfn_dirty(pfn);
 		mark_page_dirty_in_slot(kvm, memslot, gfn);
 	}
 
 out_unlock:
 	spin_unlock(&kvm->mmu_lock);
-	kvm_set_pfn_accessed(pfn);
-	kvm_release_pfn_clean(pfn);
+	if (page) {
+		kvm_set_pfn_accessed(pfn);
+		put_page(page);
+	}
 	return ret != -EAGAIN ? ret : 0;
 }
 
-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply related

* [PATCH v2 5/5] KVM: mmu: remove over-aggressive warnings
From: David Stevens @ 2021-06-25  7:36 UTC (permalink / raw)
  To: Marc Zyngier, Huacai Chen, Aleksandar Markovic, Paul Mackerras,
	Paolo Bonzini, Nick Piggin
  Cc: Wanpeng Li, kvm, dri-devel, linux-mips, Will Deacon, kvmarm,
	Alexandru Elisei, Joerg Roedel, Zhi Wang, Suzuki K Poulose,
	intel-gfx, kvm-ppc, Zhenyu Wang, intel-gvt-dev, linux-arm-kernel,
	Jim Mattson, Sean Christopherson, linux-kernel, James Morse,
	David Stevens, Vitaly Kuznetsov, linuxppc-dev
In-Reply-To: <20210625073616.2184426-1-stevensd@google.com>

From: David Stevens <stevensd@chromium.org>

Remove two warnings that require ref counts for pages to be non-zero, as
mapped pfns from follow_pfn may not have an initialized ref count.

Signed-off-by: David Stevens <stevensd@chromium.org>
---
 arch/x86/kvm/mmu/mmu.c | 7 -------
 virt/kvm/kvm_main.c    | 2 +-
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index dd5cb6e33591..0c47245594c6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -607,13 +607,6 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 
 	pfn = spte_to_pfn(old_spte);
 
-	/*
-	 * KVM does not hold the refcount of the page used by
-	 * kvm mmu, before reclaiming the page, we should
-	 * unmap it from mmu first.
-	 */
-	WARN_ON(!kvm_is_reserved_pfn(pfn) && !page_count(pfn_to_page(pfn)));
-
 	if (is_accessed_spte(old_spte))
 		kvm_set_pfn_accessed(pfn);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1de8702845ac..ce7126bab4b0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -168,7 +168,7 @@ bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
 	 * the device has been pinned, e.g. by get_user_pages().  WARN if the
 	 * page_count() is zero to help detect bad usage of this helper.
 	 */
-	if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
+	if (!pfn_valid(pfn) || !page_count(pfn_to_page(pfn)))
 		return false;
 
 	return is_zone_device_page(pfn_to_page(pfn));
-- 
2.32.0.93.g670b81a890-goog


^ permalink raw reply related

* Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU
From: Christian Borntraeger @ 2021-06-25  7:44 UTC (permalink / raw)
  To: Nicholas Piggin, Aleksandar Markovic, Huacai Chen, Marc Zyngier,
	Paul Mackerras, Paolo Bonzini, David Stevens, Zhenyu Wang,
	Zhi Wang
  Cc: Wanpeng Li, kvm, David Stevens, Alexandru Elisei, intel-gfx,
	linuxppc-dev, linux-kernel, dri-devel, kvmarm, Will Deacon,
	Suzuki K Poulose, James Morse, kvm-ppc, Sean Christopherson,
	Vitaly Kuznetsov, linux-mips, intel-gvt-dev, Joerg Roedel,
	linux-arm-kernel, Jim Mattson
In-Reply-To: <1624539354.6zggpdrdbw.astroid@bobo.none>



On 24.06.21 14:57, Nicholas Piggin wrote:
> Excerpts from Paolo Bonzini's message of June 24, 2021 10:41 pm:
>> On 24/06/21 13:42, Nicholas Piggin wrote:
>>> Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
>>>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>>>>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
>>>>> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
>>>>> assoicated struct pages, so they should not be passed to pfn_to_page.
>>>>> This series removes such calls from the x86 and arm64 secondary MMU. To
>>>>> do this, this series modifies gfn_to_pfn to return a struct page in
>>>>> addition to a pfn, if the hva was resolved by gup. This allows the
>>>>> caller to call put_page only when necessated by gup.
>>>>>
>>>>> This series provides a helper function that unwraps the new return type
>>>>> of gfn_to_pfn to provide behavior identical to the old behavior. As I
>>>>> have no hardware to test powerpc/mips changes, the function is used
>>>>> there for minimally invasive changes. Additionally, as gfn_to_page and
>>>>> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
>>>>> easily changed over to only use pfns.
>>>>>
>>>>> This addresses CVE-2021-22543 on x86 and arm64.
>>>>
>>>> Does this fix the problem? (untested I don't have a POC setup at hand,
>>>> but at least in concept)
>>>
>>> This one actually compiles at least. Unfortunately I don't have much
>>> time in the near future to test, and I only just found out about this
>>> CVE a few hours ago.
>>
>> And it also works (the reproducer gets an infinite stream of userspace
>> exits and especially does not crash).  We can still go for David's
>> solution later since MMU notifiers are able to deal with this pages, but
>> it's a very nice patch for stable kernels.
> 
> Oh nice, thanks for testing. How's this?
> 
> Thanks,
> Nick
> 
> ---
> 
> KVM: Fix page ref underflow for regions with valid but non-refcounted pages
> 
> It's possible to create a region which maps valid but non-refcounted
> pages (e.g., tail pages of non-compound higher order allocations). These
> host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
> of APIs, which take a reference to the page, which takes it from 0 to 1.
> When the reference is dropped, this will free the page incorrectly.
> 
> Fix this by only taking a reference on the page if it was non-zero,
> which indicates it is participating in normal refcounting (and can be
> released with put_page).
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>   virt/kvm/kvm_main.c | 19 +++++++++++++++++--
>   1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 6a6bc7af0e28..46fb042837d2 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
>   	return true;
>   }
>   
> +static int kvm_try_get_pfn(kvm_pfn_t pfn)
> +{
> +	if (kvm_is_reserved_pfn(pfn))
> +		return 1;
> +	return get_page_unless_zero(pfn_to_page(pfn));
> +}
> +
>   static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   			       unsigned long addr, bool *async,
>   			       bool write_fault, bool *writable,
> @@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   	 * Whoever called remap_pfn_range is also going to call e.g.
>   	 * unmap_mapping_range before the underlying pages are freed,
>   	 * causing a call to our MMU notifier.
> +	 *
> +	 * Certain IO or PFNMAP mappings can be backed with valid
> +	 * struct pages, but be allocated without refcounting e.g.,
> +	 * tail pages of non-compound higher order allocations, which
> +	 * would then underflow the refcount when the caller does the
> +	 * required put_page. Don't allow those pages here.
>   	 */
> -	kvm_get_pfn(pfn);
> +	if (!kvm_try_get_pfn(pfn))
> +		r = -EFAULT;
>   

Right. That should also take care of s390 (pin_guest_page in vsie.c
which calls gfn_to_page).
FWIW, the current API is really hard to follow as it does not tell
which functions take a reference and which dont.

Anyway, this patch (with cc stable?)

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>

>   out:
>   	pte_unmap_unlock(ptep, ptl);
>   	*p_pfn = pfn;
> -	return 0;
> +
> +	return r;
>   }
>   
>   /*
> 

^ permalink raw reply

* Re: [PATCH v2 1/5] KVM: do not allow mapping valid but non-refcounted pages
From: Christian Borntraeger @ 2021-06-25  7:58 UTC (permalink / raw)
  To: David Stevens, Marc Zyngier, Huacai Chen, Aleksandar Markovic,
	Paul Mackerras, Paolo Bonzini, Nick Piggin
  Cc: kvm-ppc, intel-gvt-dev, Wanpeng Li, kvm, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Joerg Roedel, linuxppc-dev,
	dri-devel, linux-kernel, Zhenyu Wang, linux-mips, James Morse,
	linux-arm-kernel, intel-gfx, Vitaly Kuznetsov, Alexandru Elisei,
	kvmarm, Zhi Wang, Jim Mattson
In-Reply-To: <20210625073616.2184426-2-stevensd@google.com>



On 25.06.21 09:36, David Stevens wrote:
> From: Nicholas Piggin <npiggin@gmail.com>
> 
> It's possible to create a region which maps valid but non-refcounted
> pages (e.g., tail pages of non-compound higher order allocations). These
> host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
> of APIs, which take a reference to the page, which takes it from 0 to 1.
> When the reference is dropped, this will free the page incorrectly.
> 
> Fix this by only taking a reference on the page if it was non-zero,
> which indicates it is participating in normal refcounting (and can be
> released with put_page).
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

I guess this would be the small fix for stable? Do we want to add that cc?

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
> ---
>   virt/kvm/kvm_main.c | 19 +++++++++++++++++--
>   1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3dcc2abbfc60..f7445c3bcd90 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2175,6 +2175,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
>   	return true;
>   }
> 
> +static int kvm_try_get_pfn(kvm_pfn_t pfn)
> +{
> +	if (kvm_is_reserved_pfn(pfn))
> +		return 1;
> +	return get_page_unless_zero(pfn_to_page(pfn));
> +}
> +
>   static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   			       unsigned long addr, bool *async,
>   			       bool write_fault, bool *writable,
> @@ -2224,13 +2231,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>   	 * Whoever called remap_pfn_range is also going to call e.g.
>   	 * unmap_mapping_range before the underlying pages are freed,
>   	 * causing a call to our MMU notifier.
> +	 *
> +	 * Certain IO or PFNMAP mappings can be backed with valid
> +	 * struct pages, but be allocated without refcounting e.g.,
> +	 * tail pages of non-compound higher order allocations, which
> +	 * would then underflow the refcount when the caller does the
> +	 * required put_page. Don't allow those pages here.
>   	 */
> -	kvm_get_pfn(pfn);
> +	if (!kvm_try_get_pfn(pfn))
> +		r = -EFAULT;
> 
>   out:
>   	pte_unmap_unlock(ptep, ptl);
>   	*p_pfn = pfn;
> -	return 0;
> +
> +	return r;
>   }
> 
>   /*
> 

^ permalink raw reply

* Re: [PATCH v2 1/5] KVM: do not allow mapping valid but non-refcounted pages
From: Paolo Bonzini @ 2021-06-25  8:07 UTC (permalink / raw)
  To: Christian Borntraeger, David Stevens, Marc Zyngier, Huacai Chen,
	Aleksandar Markovic, Paul Mackerras, Nick Piggin
  Cc: kvm-ppc, intel-gvt-dev, Wanpeng Li, kvm, Will Deacon,
	Suzuki K Poulose, Sean Christopherson, Joerg Roedel, linuxppc-dev,
	dri-devel, linux-kernel, Zhenyu Wang, linux-mips, James Morse,
	linux-arm-kernel, intel-gfx, Vitaly Kuznetsov, Alexandru Elisei,
	kvmarm, Zhi Wang, Jim Mattson
In-Reply-To: <183b71c1-6bb0-8d05-e2ce-e452253259a8@de.ibm.com>

On 25/06/21 09:58, Christian Borntraeger wrote:
> 
> 
> On 25.06.21 09:36, David Stevens wrote:
>> From: Nicholas Piggin <npiggin@gmail.com>
>>
>> It's possible to create a region which maps valid but non-refcounted
>> pages (e.g., tail pages of non-compound higher order allocations). These
>> host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
>> of APIs, which take a reference to the page, which takes it from 0 to 1.
>> When the reference is dropped, this will free the page incorrectly.
>>
>> Fix this by only taking a reference on the page if it was non-zero,
>> which indicates it is participating in normal refcounting (and can be
>> released with put_page).
>>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> 
> I guess this would be the small fix for stable? Do we want to add that cc?
> 
> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>

Yes, this one is going to Linus today.  The rest is for 5.15.

Paolo

>> ---
>>   virt/kvm/kvm_main.c | 19 +++++++++++++++++--
>>   1 file changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 3dcc2abbfc60..f7445c3bcd90 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -2175,6 +2175,13 @@ static bool vma_is_valid(struct vm_area_struct 
>> *vma, bool write_fault)
>>       return true;
>>   }
>>
>> +static int kvm_try_get_pfn(kvm_pfn_t pfn)
>> +{
>> +    if (kvm_is_reserved_pfn(pfn))
>> +        return 1;
>> +    return get_page_unless_zero(pfn_to_page(pfn));
>> +}
>> +
>>   static int hva_to_pfn_remapped(struct vm_area_struct *vma,
>>                      unsigned long addr, bool *async,
>>                      bool write_fault, bool *writable,
>> @@ -2224,13 +2231,21 @@ static int hva_to_pfn_remapped(struct 
>> vm_area_struct *vma,
>>        * Whoever called remap_pfn_range is also going to call e.g.
>>        * unmap_mapping_range before the underlying pages are freed,
>>        * causing a call to our MMU notifier.
>> +     *
>> +     * Certain IO or PFNMAP mappings can be backed with valid
>> +     * struct pages, but be allocated without refcounting e.g.,
>> +     * tail pages of non-compound higher order allocations, which
>> +     * would then underflow the refcount when the caller does the
>> +     * required put_page. Don't allow those pages here.
>>        */
>> -    kvm_get_pfn(pfn);
>> +    if (!kvm_try_get_pfn(pfn))
>> +        r = -EFAULT;
>>
>>   out:
>>       pte_unmap_unlock(ptep, ptl);
>>       *p_pfn = pfn;
>> -    return 0;
>> +
>> +    return r;
>>   }
>>
>>   /*
>>
> 


^ permalink raw reply

* Re: PowerPC guest getting "BUG: scheduling while atomic" on linux-next-20210623 during secondary CPUs bringup
From: Bharata B Rao @ 2021-06-25  8:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, LKML, Valentin Schneider, linux-next,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <YNWFiZii+MINhUC3@hirez.programming.kicks-ass.net>

On Fri, Jun 25, 2021 at 09:28:09AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 25, 2021 at 11:16:08AM +0530, Srikar Dronamraju wrote:
> > * Bharata B Rao <bharata@linux.ibm.com> [2021-06-24 21:25:09]:
> > 
> > > A PowerPC KVM guest gets the following BUG message when booting
> > > linux-next-20210623:
> > > 
> > > smp: Bringing up secondary CPUs ...
> > > BUG: scheduling while atomic: swapper/1/0/0x00000000
> 
> 'funny', your preempt_count is actually too low. The check here is for
> preempt_count() == DISABLE_OFFSET (aka. 1 when PREEMPT=y), but you have
> 0.
> 
> > > no locks held by swapper/1/0.
> > > Modules linked in:
> > > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.0-rc7-next-20210623
> > > Call Trace:
> > > [c00000000ae5bc20] [c000000000badc64] dump_stack_lvl+0x98/0xe0 (unreliable)
> > > [c00000000ae5bc60] [c000000000210200] __schedule_bug+0xb0/0xe0
> > > [c00000000ae5bcd0] [c000000001609e28] __schedule+0x1788/0x1c70
> > > [c00000000ae5be20] [c00000000160a8cc] schedule_idle+0x3c/0x70
> > > [c00000000ae5be50] [c00000000022984c] do_idle+0x2bc/0x420
> > > [c00000000ae5bf00] [c000000000229d88] cpu_startup_entry+0x38/0x40
> > > [c00000000ae5bf30] [c0000000000666c0] start_secondary+0x290/0x2a0
> > > [c00000000ae5bf90] [c00000000000be54] start_secondary_prolog+0x10/0x14
> > > 
> > > <The above repeats for all the secondary CPUs>
> > > 
> > > smp: Brought up 2 nodes, 16 CPUs
> > > numa: Node 0 CPUs: 0-7
> > > numa: Node 1 CPUs: 8-15
> > > 
> > > This seems to have started from next-20210521 and isn't seen on
> > > next-20210511.
> > > 
> > 
> > Bharata,
> > 
> > I think the regression is due to Commit f1a0a376ca0c ("sched/core:
> > Initialize the idle task with preemption disabled")
> 
> So that extra preempt_disable() that got removed would've incremented it
> to 1 and then things would've been fine.
> 
> Except.. Valentin changed things such that preempt_count() should've
> been inittialized to 1, instead of 0, but for some raisin that didn't
> stick.. what gives.
> 
> So we have init_idle(p) -> init_idle_preempt_count(p) ->
> task_thread_info(p)->preempt_count = PREEMPT_DISABLED;
> 
> But somehow, by the time you're running start_secondary(), that's gotten
> to be 0 again. Does DEBUG_PREEMPT give more clues?

PREEMPTION is off here.

Regards,
Bharata.

^ permalink raw reply

* Re: PowerPC guest getting "BUG: scheduling while atomic" on linux-next-20210623 during secondary CPUs bringup
From: Valentin Schneider @ 2021-06-25  9:02 UTC (permalink / raw)
  To: Peter Zijlstra, Srikar Dronamraju
  Cc: Ingo Molnar, linux-next, linuxppc-dev, LKML, Bharata B Rao
In-Reply-To: <YNWFiZii+MINhUC3@hirez.programming.kicks-ass.net>

On 25/06/21 09:28, Peter Zijlstra wrote:
> On Fri, Jun 25, 2021 at 11:16:08AM +0530, Srikar Dronamraju wrote:
>> Bharata,
>>
>> I think the regression is due to Commit f1a0a376ca0c ("sched/core:
>> Initialize the idle task with preemption disabled")
>
> So that extra preempt_disable() that got removed would've incremented it
> to 1 and then things would've been fine.
>
> Except.. Valentin changed things such that preempt_count() should've
> been inittialized to 1, instead of 0, but for some raisin that didn't
> stick.. what gives.
>
> So we have init_idle(p) -> init_idle_preempt_count(p) ->
> task_thread_info(p)->preempt_count = PREEMPT_DISABLED;
>
> But somehow, by the time you're running start_secondary(), that's gotten
> to be 0 again. Does DEBUG_PREEMPT give more clues?

Given the preempt_count isn't reset between hotplugs anymore, you might be
able to find the culprit with a hotplug cycle and ftrace with
trace_prempt_off and trace_preempt_on events (requires PREEMPT_TRACER
IIRC).

It's doable at boot time too, but that will mean sifting through many more
events than you'd like...

^ permalink raw reply

* Re: PowerPC guest getting "BUG: scheduling while atomic" on linux-next-20210623 during secondary CPUs bringup
From: Peter Zijlstra @ 2021-06-25 10:16 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Srikar Dronamraju, LKML, Valentin Schneider, linux-next,
	linuxppc-dev, Ingo Molnar
In-Reply-To: <YNWZfKK+KBQSUdG5@in.ibm.com>

On Fri, Jun 25, 2021 at 02:23:16PM +0530, Bharata B Rao wrote:
> On Fri, Jun 25, 2021 at 09:28:09AM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 25, 2021 at 11:16:08AM +0530, Srikar Dronamraju wrote:
> > > * Bharata B Rao <bharata@linux.ibm.com> [2021-06-24 21:25:09]:
> > > 
> > > > A PowerPC KVM guest gets the following BUG message when booting
> > > > linux-next-20210623:
> > > > 
> > > > smp: Bringing up secondary CPUs ...
> > > > BUG: scheduling while atomic: swapper/1/0/0x00000000
> > 
> > 'funny', your preempt_count is actually too low. The check here is for
> > preempt_count() == DISABLE_OFFSET (aka. 1 when PREEMPT=y), but you have
> > 0.
> > 
> > > > no locks held by swapper/1/0.
> > > > Modules linked in:
> > > > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.0-rc7-next-20210623
> > > > Call Trace:
> > > > [c00000000ae5bc20] [c000000000badc64] dump_stack_lvl+0x98/0xe0 (unreliable)
> > > > [c00000000ae5bc60] [c000000000210200] __schedule_bug+0xb0/0xe0
> > > > [c00000000ae5bcd0] [c000000001609e28] __schedule+0x1788/0x1c70
> > > > [c00000000ae5be20] [c00000000160a8cc] schedule_idle+0x3c/0x70
> > > > [c00000000ae5be50] [c00000000022984c] do_idle+0x2bc/0x420
> > > > [c00000000ae5bf00] [c000000000229d88] cpu_startup_entry+0x38/0x40
> > > > [c00000000ae5bf30] [c0000000000666c0] start_secondary+0x290/0x2a0
> > > > [c00000000ae5bf90] [c00000000000be54] start_secondary_prolog+0x10/0x14
> > > > 
> > > > <The above repeats for all the secondary CPUs>
> > > > 
> > > > smp: Brought up 2 nodes, 16 CPUs
> > > > numa: Node 0 CPUs: 0-7
> > > > numa: Node 1 CPUs: 8-15
> > > > 
> > > > This seems to have started from next-20210521 and isn't seen on
> > > > next-20210511.
> > > > 
> > > 
> > > Bharata,
> > > 
> > > I think the regression is due to Commit f1a0a376ca0c ("sched/core:
> > > Initialize the idle task with preemption disabled")
> > 
> > So that extra preempt_disable() that got removed would've incremented it
> > to 1 and then things would've been fine.
> > 
> > Except.. Valentin changed things such that preempt_count() should've
> > been inittialized to 1, instead of 0, but for some raisin that didn't
> > stick.. what gives.
> > 
> > So we have init_idle(p) -> init_idle_preempt_count(p) ->
> > task_thread_info(p)->preempt_count = PREEMPT_DISABLED;
> > 
> > But somehow, by the time you're running start_secondary(), that's gotten
> > to be 0 again. Does DEBUG_PREEMPT give more clues?
> 
> PREEMPTION is off here.

You mean: CONFIG_PREEMPTION=n, what about CONFIG_PREEMPT_COUNT?

Because if both are =n, then I don't see how that warning could trigger.
in_atomic_preempt_off() would then result in prempt_count() == 0, and
per the print above, it *is* 0.

^ permalink raw reply

* [powerpc:next] BUILD SUCCESS 0e8554b5d7801b0aebc6c348a0a9f7706aa17b3b
From: kernel test robot @ 2021-06-25 10:19 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
branch HEAD: 0e8554b5d7801b0aebc6c348a0a9f7706aa17b3b  powerpc/papr_scm: Properly handle UUID types and API

elapsed time: 1121m

configs tested: 167
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm                                 defconfig
arm64                            allyesconfig
arm64                               defconfig
arm                              allyesconfig
arm                              allmodconfig
mips                           ip28_defconfig
powerpc                 mpc832x_mds_defconfig
arm                      pxa255-idp_defconfig
sh                              ul2_defconfig
arm                        neponset_defconfig
sh                           se7721_defconfig
arm                             rpc_defconfig
arm                        magician_defconfig
mips                      maltasmvp_defconfig
powerpc                      ppc44x_defconfig
alpha                            alldefconfig
m68k                          amiga_defconfig
powerpc                     ep8248e_defconfig
mips                        omega2p_defconfig
arm                         s5pv210_defconfig
arm                         hackkit_defconfig
csky                             alldefconfig
arc                           tb10x_defconfig
sh                          rsk7201_defconfig
powerpc                 mpc837x_rdb_defconfig
mips                  maltasmvp_eva_defconfig
arm                        spear6xx_defconfig
sh                             espt_defconfig
mips                        qi_lb60_defconfig
h8300                            alldefconfig
powerpc                     akebono_defconfig
ia64                             allmodconfig
xtensa                generic_kc705_defconfig
openrisc                    or1ksim_defconfig
mips                         rt305x_defconfig
sh                           se7206_defconfig
nios2                            alldefconfig
powerpc                 mpc8540_ads_defconfig
mips                malta_qemu_32r6_defconfig
m68k                        m5407c3_defconfig
m68k                            mac_defconfig
xtensa                           alldefconfig
mips                        bcm63xx_defconfig
arc                          axs103_defconfig
sh                        edosk7760_defconfig
powerpc                      katmai_defconfig
powerpc                 mpc834x_mds_defconfig
arc                        nsimosci_defconfig
xtensa                    xip_kc705_defconfig
sh                           se7751_defconfig
mips                       capcella_defconfig
arm                      footbridge_defconfig
powerpc                 mpc8315_rdb_defconfig
powerpc                 xes_mpc85xx_defconfig
sh                        sh7763rdp_defconfig
powerpc                     tqm8555_defconfig
xtensa                  audio_kc705_defconfig
mips                      maltaaprp_defconfig
h8300                            allyesconfig
arm                        mvebu_v5_defconfig
arm                          exynos_defconfig
xtensa                              defconfig
arm                        clps711x_defconfig
sh                           se7343_defconfig
m68k                        m5272c3_defconfig
arm                           sunxi_defconfig
sh                         ap325rxa_defconfig
m68k                       m5249evb_defconfig
arm                  colibri_pxa270_defconfig
sh                           sh2007_defconfig
powerpc               mpc834x_itxgp_defconfig
powerpc                 mpc8313_rdb_defconfig
arm                   milbeaut_m10v_defconfig
arm                           tegra_defconfig
s390                          debug_defconfig
powerpc                       ebony_defconfig
powerpc                       holly_defconfig
x86_64                            allnoconfig
ia64                                defconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                                defconfig
m68k                             allyesconfig
nios2                               defconfig
arc                              allyesconfig
nds32                             allnoconfig
nds32                               defconfig
nios2                            allyesconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
arc                                 defconfig
sh                               allmodconfig
parisc                              defconfig
s390                             allyesconfig
s390                             allmodconfig
parisc                           allyesconfig
s390                                defconfig
i386                             allyesconfig
sparc                            allyesconfig
sparc                               defconfig
i386                                defconfig
mips                             allyesconfig
mips                             allmodconfig
powerpc                          allyesconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
x86_64               randconfig-a002-20210625
x86_64               randconfig-a001-20210625
x86_64               randconfig-a005-20210625
x86_64               randconfig-a003-20210625
x86_64               randconfig-a004-20210625
x86_64               randconfig-a006-20210625
i386                 randconfig-a002-20210625
i386                 randconfig-a001-20210625
i386                 randconfig-a003-20210625
i386                 randconfig-a006-20210625
i386                 randconfig-a005-20210625
i386                 randconfig-a004-20210625
i386                 randconfig-a001-20210622
i386                 randconfig-a002-20210622
i386                 randconfig-a003-20210622
i386                 randconfig-a006-20210622
i386                 randconfig-a005-20210622
i386                 randconfig-a004-20210622
x86_64               randconfig-a012-20210622
x86_64               randconfig-a016-20210622
x86_64               randconfig-a015-20210622
x86_64               randconfig-a014-20210622
x86_64               randconfig-a013-20210622
x86_64               randconfig-a011-20210622
i386                 randconfig-a011-20210622
i386                 randconfig-a014-20210622
i386                 randconfig-a013-20210622
i386                 randconfig-a015-20210622
i386                 randconfig-a012-20210622
i386                 randconfig-a016-20210622
i386                 randconfig-a011-20210625
i386                 randconfig-a014-20210625
i386                 randconfig-a013-20210625
i386                 randconfig-a015-20210625
i386                 randconfig-a012-20210625
i386                 randconfig-a016-20210625
riscv                    nommu_k210_defconfig
riscv                            allyesconfig
riscv                    nommu_virt_defconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                          rv32_defconfig
riscv                            allmodconfig
x86_64                    rhel-8.3-kselftests
um                           x86_64_defconfig
um                             i386_defconfig
um                            kunit_defconfig
x86_64                           allyesconfig
x86_64                              defconfig
x86_64                               rhel-8.3
x86_64                      rhel-8.3-kbuiltin
x86_64                                  kexec

clang tested configs:
x86_64               randconfig-b001-20210622
x86_64               randconfig-a002-20210622
x86_64               randconfig-a001-20210622
x86_64               randconfig-a005-20210622
x86_64               randconfig-a003-20210622
x86_64               randconfig-a004-20210622
x86_64               randconfig-a006-20210622

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* [powerpc:next-test] BUILD SUCCESS 06664e6c0f810035deb4b1d135d7a8f70795512f
From: kernel test robot @ 2021-06-25 10:19 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next-test
branch HEAD: 06664e6c0f810035deb4b1d135d7a8f70795512f  powerpc: mark local variables around longjmp as volatile

elapsed time: 1120m

configs tested: 162
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm                                 defconfig
arm64                            allyesconfig
arm64                               defconfig
arm                              allyesconfig
arm                              allmodconfig
mips                           ip28_defconfig
powerpc                 mpc832x_mds_defconfig
arm                      pxa255-idp_defconfig
sh                              ul2_defconfig
arm                        neponset_defconfig
sh                           se7721_defconfig
alpha                            alldefconfig
m68k                          amiga_defconfig
powerpc                     ep8248e_defconfig
mips                        omega2p_defconfig
arm                         s5pv210_defconfig
arm                         hackkit_defconfig
csky                             alldefconfig
arc                           tb10x_defconfig
sh                          rsk7201_defconfig
powerpc                 mpc837x_rdb_defconfig
mips                  maltasmvp_eva_defconfig
arm                        spear6xx_defconfig
sh                             espt_defconfig
mips                        qi_lb60_defconfig
h8300                            alldefconfig
powerpc                     akebono_defconfig
ia64                             allmodconfig
xtensa                generic_kc705_defconfig
openrisc                    or1ksim_defconfig
mips                         rt305x_defconfig
sh                           se7206_defconfig
nios2                            alldefconfig
powerpc                 mpc8540_ads_defconfig
mips                malta_qemu_32r6_defconfig
m68k                        m5407c3_defconfig
m68k                            mac_defconfig
xtensa                           alldefconfig
mips                        bcm63xx_defconfig
arc                          axs103_defconfig
sh                        edosk7760_defconfig
powerpc                      katmai_defconfig
powerpc                 mpc834x_mds_defconfig
arc                        nsimosci_defconfig
xtensa                    xip_kc705_defconfig
sh                           se7751_defconfig
mips                       capcella_defconfig
arm                      footbridge_defconfig
powerpc                 mpc8315_rdb_defconfig
powerpc                 xes_mpc85xx_defconfig
sh                        sh7763rdp_defconfig
powerpc                     tqm8555_defconfig
xtensa                  audio_kc705_defconfig
mips                      maltaaprp_defconfig
h8300                            allyesconfig
arm                        mvebu_v5_defconfig
arm                          exynos_defconfig
xtensa                              defconfig
mips                    maltaup_xpa_defconfig
arm                       multi_v4t_defconfig
um                                  defconfig
sh                         ap325rxa_defconfig
m68k                       m5249evb_defconfig
arm                  colibri_pxa270_defconfig
sh                           sh2007_defconfig
powerpc               mpc834x_itxgp_defconfig
powerpc                 mpc8313_rdb_defconfig
arm                   milbeaut_m10v_defconfig
arm                           tegra_defconfig
s390                          debug_defconfig
powerpc                       ebony_defconfig
powerpc                       holly_defconfig
x86_64                            allnoconfig
ia64                                defconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                                defconfig
m68k                             allyesconfig
nios2                               defconfig
arc                              allyesconfig
nds32                             allnoconfig
nds32                               defconfig
nios2                            allyesconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
arc                                 defconfig
sh                               allmodconfig
parisc                              defconfig
s390                             allyesconfig
s390                             allmodconfig
parisc                           allyesconfig
s390                                defconfig
i386                             allyesconfig
sparc                            allyesconfig
sparc                               defconfig
i386                                defconfig
mips                             allyesconfig
mips                             allmodconfig
powerpc                          allyesconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
x86_64               randconfig-a002-20210625
x86_64               randconfig-a001-20210625
x86_64               randconfig-a005-20210625
x86_64               randconfig-a003-20210625
x86_64               randconfig-a004-20210625
x86_64               randconfig-a006-20210625
i386                 randconfig-a002-20210625
i386                 randconfig-a001-20210625
i386                 randconfig-a003-20210625
i386                 randconfig-a006-20210625
i386                 randconfig-a005-20210625
i386                 randconfig-a004-20210625
i386                 randconfig-a001-20210622
i386                 randconfig-a002-20210622
i386                 randconfig-a003-20210622
i386                 randconfig-a006-20210622
i386                 randconfig-a005-20210622
i386                 randconfig-a004-20210622
x86_64               randconfig-a012-20210622
x86_64               randconfig-a016-20210622
x86_64               randconfig-a015-20210622
x86_64               randconfig-a014-20210622
x86_64               randconfig-a013-20210622
x86_64               randconfig-a011-20210622
i386                 randconfig-a011-20210622
i386                 randconfig-a014-20210622
i386                 randconfig-a013-20210622
i386                 randconfig-a015-20210622
i386                 randconfig-a012-20210622
i386                 randconfig-a016-20210622
i386                 randconfig-a011-20210625
i386                 randconfig-a014-20210625
i386                 randconfig-a013-20210625
i386                 randconfig-a015-20210625
i386                 randconfig-a012-20210625
i386                 randconfig-a016-20210625
riscv                    nommu_k210_defconfig
riscv                            allyesconfig
riscv                    nommu_virt_defconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                          rv32_defconfig
riscv                            allmodconfig
x86_64                    rhel-8.3-kselftests
um                           x86_64_defconfig
um                             i386_defconfig
um                            kunit_defconfig
x86_64                           allyesconfig
x86_64                              defconfig
x86_64                               rhel-8.3
x86_64                      rhel-8.3-kbuiltin
x86_64                                  kexec

clang tested configs:
x86_64               randconfig-b001-20210622
x86_64               randconfig-a002-20210622
x86_64               randconfig-a001-20210622
x86_64               randconfig-a005-20210622
x86_64               randconfig-a003-20210622
x86_64               randconfig-a004-20210622
x86_64               randconfig-a006-20210622

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* Re: [PATCH] powerpc: mark local variables around longjmp as volatile
From: Christophe Leroy @ 2021-06-25 10:35 UTC (permalink / raw)
  To: Arnd Bergmann, Michael Ellerman
  Cc: Ravi Bangoria, Arnd Bergmann, Linux Doc Mailing List,
	linux-kernel, Nicholas Piggin, Paul Mackerras, linuxppc-dev
In-Reply-To: <20210429080708.1520360-1-arnd@kernel.org>



Le 29/04/2021 à 10:06, Arnd Bergmann a écrit :
> From: Arnd Bergmann <arnd@arndb.de>
> 
> gcc-11 points out that modifying local variables next to a
> longjmp/setjmp may cause undefined behavior:
> 
> arch/powerpc/kexec/crash.c: In function 'crash_kexec_prepare_cpus.constprop':
> arch/powerpc/kexec/crash.c:108:22: error: variable 'ncpus' might be clobbered by 'longjmp' or 'vfork' [-Werror=clobbere
> d]
> arch/powerpc/kexec/crash.c:109:13: error: variable 'tries' might be clobbered by 'longjmp' or 'vfork' [-Werror=clobbere
> d]
> arch/powerpc/xmon/xmon.c: In function 'xmon_print_symbol':
> arch/powerpc/xmon/xmon.c:3625:21: error: variable 'name' might be clobbered by 'longjmp' or 'vfork' [-Werror=clobbered]
> arch/powerpc/xmon/xmon.c: In function 'stop_spus':
> arch/powerpc/xmon/xmon.c:4057:13: error: variable 'i' might be clobbered by 'longjmp' or 'vfork' [-Werror=clobbered]
> arch/powerpc/xmon/xmon.c: In function 'restart_spus':
> arch/powerpc/xmon/xmon.c:4098:13: error: variable 'i' might be clobbered by 'longjmp' or 'vfork' [-Werror=clobbered]
> arch/powerpc/xmon/xmon.c: In function 'dump_opal_msglog':
> arch/powerpc/xmon/xmon.c:3008:16: error: variable 'pos' might be clobbered by 'longjmp' or 'vfork' [-Werror=clobbered]
> arch/powerpc/xmon/xmon.c: In function 'show_pte':
> arch/powerpc/xmon/xmon.c:3207:29: error: variable 'tsk' might be clobbered by 'longjmp' or 'vfork' [-Werror=clobbered]
> arch/powerpc/xmon/xmon.c: In function 'show_tasks':
> arch/powerpc/xmon/xmon.c:3302:29: error: variable 'tsk' might be clobbered by 'longjmp' or 'vfork' [-Werror=clobbered]
> arch/powerpc/xmon/xmon.c: In function 'xmon_core':
> arch/powerpc/xmon/xmon.c:494:13: error: variable 'cmd' might be clobbered by 'longjmp' or 'vfork' [-Werror=clobbered]
> arch/powerpc/xmon/xmon.c:860:21: error: variable 'bp' might be clobbered by 'longjmp' or 'vfork' [-Werror=clobbered]
> arch/powerpc/xmon/xmon.c:860:21: error: variable 'bp' might be clobbered by 'longjmp' or 'vfork' [-Werror=clobbered]
> arch/powerpc/xmon/xmon.c:492:48: error: argument 'fromipi' might be clobbered by 'longjmp' or 'vfork' [-Werror=clobbered]
> 
> According to the documentation, marking these as 'volatile' is
> sufficient to avoid the problem, and it shuts up the warning.


I think this change deserves some comment in the code, and maybe also an update of 
https://www.kernel.org/doc/html/latest/process/volatile-considered-harmful.html

Otherwise, there's a risk that one day or another, someone removes those 'volatile' markings.

Christophe


> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>   arch/powerpc/kexec/crash.c |  4 ++--
>   arch/powerpc/xmon/xmon.c   | 22 +++++++++++-----------
>   2 files changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
> index 0196d0c211ac..10f997e6bb95 100644
> --- a/arch/powerpc/kexec/crash.c
> +++ b/arch/powerpc/kexec/crash.c
> @@ -105,8 +105,8 @@ void crash_ipi_callback(struct pt_regs *regs)
>   static void crash_kexec_prepare_cpus(int cpu)
>   {
>   	unsigned int msecs;
> -	unsigned int ncpus = num_online_cpus() - 1;/* Excluding the panic cpu */
> -	int tries = 0;
> +	volatile unsigned int ncpus = num_online_cpus() - 1;/* Excluding the panic cpu */
> +	volatile int tries = 0;
>   	int (*old_handler)(struct pt_regs *regs);
>   
>   	printk(KERN_EMERG "Sending IPI to other CPUs\n");
> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> index c8173e92f19d..ce0eacf77645 100644
> --- a/arch/powerpc/xmon/xmon.c
> +++ b/arch/powerpc/xmon/xmon.c
> @@ -489,10 +489,10 @@ static void xmon_touch_watchdogs(void)
>   	touch_nmi_watchdog();
>   }
>   
> -static int xmon_core(struct pt_regs *regs, int fromipi)
> +static int xmon_core(struct pt_regs *regs, volatile int fromipi)
>   {
> -	int cmd = 0;
> -	struct bpt *bp;
> +	volatile int cmd = 0;
> +	struct bpt *volatile bp;
>   	long recurse_jmp[JMP_BUF_LEN];
>   	bool locked_down;
>   	unsigned long offset;
> @@ -857,7 +857,7 @@ static inline void force_enable_xmon(void)
>   static struct bpt *at_breakpoint(unsigned long pc)
>   {
>   	int i;
> -	struct bpt *bp;
> +	struct bpt *volatile bp;
>   
>   	bp = bpts;
>   	for (i = 0; i < NBPTS; ++i, ++bp)
> @@ -3005,7 +3005,7 @@ static void dump_opal_msglog(void)
>   {
>   	unsigned char buf[128];
>   	ssize_t res;
> -	loff_t pos = 0;
> +	volatile loff_t pos = 0;
>   
>   	if (!firmware_has_feature(FW_FEATURE_OPAL)) {
>   		printf("Machine is not running OPAL firmware.\n");
> @@ -3160,7 +3160,7 @@ memzcan(void)
>   		printf("%.8lx\n", a - mskip);
>   }
>   
> -static void show_task(struct task_struct *tsk)
> +static void show_task(struct task_struct *volatile tsk)
>   {
>   	char state;
>   
> @@ -3204,7 +3204,7 @@ static void format_pte(void *ptep, unsigned long pte)
>   static void show_pte(unsigned long addr)
>   {
>   	unsigned long tskv = 0;
> -	struct task_struct *tsk = NULL;
> +	struct task_struct *volatile tsk = NULL;
>   	struct mm_struct *mm;
>   	pgd_t *pgdp;
>   	p4d_t *p4dp;
> @@ -3299,7 +3299,7 @@ static void show_pte(unsigned long addr)
>   static void show_tasks(void)
>   {
>   	unsigned long tskv;
> -	struct task_struct *tsk = NULL;
> +	struct task_struct *volatile tsk = NULL;
>   
>   	printf("     task_struct     ->thread.ksp    ->thread.regs    PID   PPID S  P CMD\n");
>   
> @@ -3622,7 +3622,7 @@ static void xmon_print_symbol(unsigned long address, const char *mid,
>   			      const char *after)
>   {
>   	char *modname;
> -	const char *name = NULL;
> +	const char *volatile name = NULL;
>   	unsigned long offset, size;
>   
>   	printf(REG, address);
> @@ -4054,7 +4054,7 @@ void xmon_register_spus(struct list_head *list)
>   static void stop_spus(void)
>   {
>   	struct spu *spu;
> -	int i;
> +	volatile int i;
>   	u64 tmp;
>   
>   	for (i = 0; i < XMON_NUM_SPUS; i++) {
> @@ -4095,7 +4095,7 @@ static void stop_spus(void)
>   static void restart_spus(void)
>   {
>   	struct spu *spu;
> -	int i;
> +	volatile int i;
>   
>   	for (i = 0; i < XMON_NUM_SPUS; i++) {
>   		if (!spu_info[i].spu)
> 

^ permalink raw reply

* [PATCH 1/2] powerpc/signal: Fix handling of SA_RESTORER sigaction flag
From: Christophe Leroy @ 2021-06-25 10:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

powerpc advertises support of SA_RESTORER sigaction flag.

Make it the truth.

Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/signal_32.c | 8 ++++++--
 arch/powerpc/kernel/signal_64.c | 4 +++-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 0608581967f0..cf3da1386595 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -769,7 +769,9 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 	}
 
 	/* Save user registers on the stack */
-	if (tsk->mm->context.vdso) {
+	if (ksig->ka.sa.sa_flags & SA_RESTORER) {
+		tramp = (unsigned long)ksig->ka.sa.sa_restorer;
+	} else if (tsk->mm->context.vdso) {
 		tramp = VDSO32_SYMBOL(tsk->mm->context.vdso, sigtramp_rt32);
 	} else {
 		tramp = (unsigned long)mctx->mc_pad;
@@ -865,7 +867,9 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 	else
 		unsafe_save_user_regs(regs, mctx, tm_mctx, 1, failed);
 
-	if (tsk->mm->context.vdso) {
+	if (ksig->ka.sa.sa_flags & SA_RESTORER) {
+		tramp = (unsigned long)ksig->ka.sa.sa_restorer;
+	} else if (tsk->mm->context.vdso) {
 		tramp = VDSO32_SYMBOL(tsk->mm->context.vdso, sigtramp32);
 	} else {
 		tramp = (unsigned long)mctx->mc_pad;
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 1831bba0582e..fb31a334aca6 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -910,7 +910,9 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 	tsk->thread.fp_state.fpscr = 0;
 
 	/* Set up to return from userspace. */
-	if (tsk->mm->context.vdso) {
+	if (ksig->ka.sa.sa_flags & SA_RESTORER) {
+		regs_set_return_ip(regs, (unsigned long)ksig->ka.sa.sa_restorer);
+	} else if (tsk->mm->context.vdso) {
 		regs_set_return_ip(regs, VDSO64_SYMBOL(tsk->mm->context.vdso, sigtramp_rt64));
 	} else {
 		err |= setup_trampoline(__NR_rt_sigreturn, &frame->tramp[0]);
-- 
2.25.0


^ permalink raw reply related

* [PATCH RFC 2/2] powerpc/signal: Retire signal trampoline on stack
From: Christophe Leroy @ 2021-06-25 10:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <afe50d1db63a10fde9547ea08fe1fa68b0638aba.1624618157.git.christophe.leroy@csgroup.eu>

The signal trampoline is either:
- As specified by the caller via SA_RESTORER
- In VDSO if VDSO is properly mapped
- Fallback on user stack

However, nowadays user stack is mapped non executable by default
so the fallback will generate an exec fault.

All other architectures having VDSO except x86 and sh don't install
any fallback trampoline on stack.

Simplify the code by doing the same, remove signal trampoline
on stack. If VDSO is not mapped, a NULL like address will be set
and the user app will gently fault.

If a user application explicitly want's to unmap VDSO, it can
still provide an SA_RESTORER.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/signal_32.c  | 26 +++++++++-------------
 arch/powerpc/kernel/signal_64.c  | 38 ++++----------------------------
 arch/powerpc/perf/callchain_32.c |  5 -----
 arch/powerpc/perf/callchain_64.c |  2 --
 4 files changed, 14 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index cf3da1386595..366d07cb42da 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -769,16 +769,13 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 	}
 
 	/* Save user registers on the stack */
-	if (ksig->ka.sa.sa_flags & SA_RESTORER) {
+	if (ksig->ka.sa.sa_flags & SA_RESTORER)
 		tramp = (unsigned long)ksig->ka.sa.sa_restorer;
-	} else if (tsk->mm->context.vdso) {
+	else if (tsk->mm->context.vdso)
 		tramp = VDSO32_SYMBOL(tsk->mm->context.vdso, sigtramp_rt32);
-	} else {
-		tramp = (unsigned long)mctx->mc_pad;
-		unsafe_put_user(PPC_RAW_LI(_R0, __NR_rt_sigreturn), &mctx->mc_pad[0], failed);
-		unsafe_put_user(PPC_RAW_SC(), &mctx->mc_pad[1], failed);
-		asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
-	}
+	else
+		tramp = 0;
+
 	unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
 
 	user_access_end();
@@ -867,16 +864,13 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 	else
 		unsafe_save_user_regs(regs, mctx, tm_mctx, 1, failed);
 
-	if (ksig->ka.sa.sa_flags & SA_RESTORER) {
+	if (ksig->ka.sa.sa_flags & SA_RESTORER)
 		tramp = (unsigned long)ksig->ka.sa.sa_restorer;
-	} else if (tsk->mm->context.vdso) {
+	else if (tsk->mm->context.vdso)
 		tramp = VDSO32_SYMBOL(tsk->mm->context.vdso, sigtramp32);
-	} else {
-		tramp = (unsigned long)mctx->mc_pad;
-		unsafe_put_user(PPC_RAW_LI(_R0, __NR_sigreturn), &mctx->mc_pad[0], failed);
-		unsafe_put_user(PPC_RAW_SC(), &mctx->mc_pad[1], failed);
-		asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
-	}
+	else
+		tramp = 0;
+
 	user_access_end();
 
 	regs->link = tramp;
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index fb31a334aca6..39522ebd1137 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -610,32 +610,6 @@ static long restore_tm_sigcontexts(struct task_struct *tsk, struct sigcontext __
 }
 #endif
 
-/*
- * Setup the trampoline code on the stack
- */
-static long setup_trampoline(unsigned int syscall, unsigned int __user *tramp)
-{
-	int i;
-	long err = 0;
-
-	/* Call the handler and pop the dummy stackframe*/
-	err |= __put_user(PPC_RAW_BCTRL(), &tramp[0]);
-	err |= __put_user(PPC_RAW_ADDI(_R1, _R1, __SIGNAL_FRAMESIZE), &tramp[1]);
-
-	err |= __put_user(PPC_RAW_LI(_R0, syscall), &tramp[2]);
-	err |= __put_user(PPC_RAW_SC(), &tramp[3]);
-
-	/* Minimal traceback info */
-	for (i=TRAMP_TRACEBACK; i < TRAMP_SIZE ;i++)
-		err |= __put_user(0, &tramp[i]);
-
-	if (!err)
-		flush_icache_range((unsigned long) &tramp[0],
-			   (unsigned long) &tramp[TRAMP_SIZE]);
-
-	return err;
-}
-
 /*
  * Userspace code may pass a ucontext which doesn't include VSX added
  * at the end.  We need to check for this case.
@@ -910,16 +884,12 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 	tsk->thread.fp_state.fpscr = 0;
 
 	/* Set up to return from userspace. */
-	if (ksig->ka.sa.sa_flags & SA_RESTORER) {
+	if (ksig->ka.sa.sa_flags & SA_RESTORER)
 		regs_set_return_ip(regs, (unsigned long)ksig->ka.sa.sa_restorer);
-	} else if (tsk->mm->context.vdso) {
+	else if (tsk->mm->context.vdso)
 		regs_set_return_ip(regs, VDSO64_SYMBOL(tsk->mm->context.vdso, sigtramp_rt64));
-	} else {
-		err |= setup_trampoline(__NR_rt_sigreturn, &frame->tramp[0]);
-		if (err)
-			goto badframe;
-		regs_set_return_ip(regs, (unsigned long) &frame->tramp[0]);
-	}
+	else
+		regs->nip = 0;
 
 	/* Allocate a dummy caller frame for the signal handler. */
 	newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
diff --git a/arch/powerpc/perf/callchain_32.c b/arch/powerpc/perf/callchain_32.c
index b83c47b7947f..8ab4663cef50 100644
--- a/arch/powerpc/perf/callchain_32.c
+++ b/arch/powerpc/perf/callchain_32.c
@@ -57,8 +57,6 @@ struct rt_signal_frame_32 {
 
 static int is_sigreturn_32_address(unsigned int nip, unsigned int fp)
 {
-	if (nip == fp + offsetof(struct signal_frame_32, mctx.mc_pad))
-		return 1;
 	if (current->mm->context.vdso &&
 	    nip == VDSO32_SYMBOL(current->mm->context.vdso, sigtramp32))
 		return 1;
@@ -67,9 +65,6 @@ static int is_sigreturn_32_address(unsigned int nip, unsigned int fp)
 
 static int is_rt_sigreturn_32_address(unsigned int nip, unsigned int fp)
 {
-	if (nip == fp + offsetof(struct rt_signal_frame_32,
-				 uc.uc_mcontext.mc_pad))
-		return 1;
 	if (current->mm->context.vdso &&
 	    nip == VDSO32_SYMBOL(current->mm->context.vdso, sigtramp_rt32))
 		return 1;
diff --git a/arch/powerpc/perf/callchain_64.c b/arch/powerpc/perf/callchain_64.c
index 8d0df4226328..8c7078ff67c2 100644
--- a/arch/powerpc/perf/callchain_64.c
+++ b/arch/powerpc/perf/callchain_64.c
@@ -66,8 +66,6 @@ struct signal_frame_64 {
 
 static int is_sigreturn_64_address(unsigned long nip, unsigned long fp)
 {
-	if (nip == fp + offsetof(struct signal_frame_64, tramp))
-		return 1;
 	if (current->mm->context.vdso &&
 	    nip == VDSO64_SYMBOL(current->mm->context.vdso, sigtramp_rt64))
 		return 1;
-- 
2.25.0


^ permalink raw reply related

* [PATCH] powerpc/syscalls: Simplify do_mmap2()
From: Christophe Leroy @ 2021-06-25 10:58 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel

When shift is nul, operations remain valid so no test needed.

And 'ret' is unnecessary.

And use IS_ALIGNED() to check alignment, that's more clear.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/syscalls.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
index bf4ae0f0e36c..825931e400df 100644
--- a/arch/powerpc/kernel/syscalls.c
+++ b/arch/powerpc/kernel/syscalls.c
@@ -41,20 +41,13 @@ static inline long do_mmap2(unsigned long addr, size_t len,
 			unsigned long prot, unsigned long flags,
 			unsigned long fd, unsigned long off, int shift)
 {
-	long ret = -EINVAL;
-
 	if (!arch_validate_prot(prot, addr))
-		goto out;
+		return -EINVAL;
 
-	if (shift) {
-		if (off & ((1 << shift) - 1))
-			goto out;
-		off >>= shift;
-	}
+	if (!IS_ALIGNED(off, 1 << shift))
+		return -EINVAL;
 
-	ret = ksys_mmap_pgoff(addr, len, prot, flags, fd, off);
-out:
-	return ret;
+	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> shift);
 }
 
 SYSCALL_DEFINE6(mmap2, unsigned long, addr, size_t, len,
-- 
2.25.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox