* [PATCH 0/4] KVM: PPC: Book3S HV: Bug fixes for 3.13
From: Paul Mackerras @ 2013-11-16 6:46 UTC
To: Alexander Graf; +Cc: kvm, kvm-ppc
This series fixes some nasty bugs that have recently been found in HV
KVM on PPC. The series is against the 'queue' branch of the KVM tree.
I would like to see all of these go into 3.13.
The first patch fixes a regression that occurred when the transparent
huge page (THP) code for ppc was merged, in 3.11. When we have a
memslot whose userspace address is not 16MB aligned, and we find a
16MB page in it due to THP when mapping a normal page to the guest, we
compute its physical address incorrectly.
The second and third patches fix host hangs that have been observed in
testing with large guests under load. The fourth patch fixes another
locking problem that was found by lockdep when we were looking for the
reason for the hangs.
Paul.
arch/powerpc/kvm/book3s_64_mmu_hv.c | 12 +++++++++---
arch/powerpc/kvm/book3s_hv.c | 24 ++++++++++++++----------
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 4 ++--
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 16 ++++++++++------
4 files changed, 35 insertions(+), 21 deletions(-)
* [PATCH 1/4] KVM: PPC: Book3S HV: Fix physical address calculations
From: Paul Mackerras @ 2013-11-16 6:46 UTC
To: Alexander Graf; +Cc: kvm, kvm-ppc
This fixes a bug in kvmppc_do_h_enter() where the physical address
for a page can be calculated incorrectly if transparent huge pages
(THP) are active. Until THP came along, it was true that if we
encountered a large (16M) page in kvmppc_do_h_enter(), then the
associated memslot must be 16M aligned for both its guest physical
address and the userspace address, and the physical address
calculations in kvmppc_do_h_enter() assumed that. With THP, that
is no longer true.
In the case where we are using MMU notifiers and the page size that
we get from the Linux page tables is larger than the page being mapped
by the guest, we need to fill in some low-order bits of the physical
address. Without THP, these bits would be the same in the guest
physical address (gpa) and the host virtual address (hva). With THP,
they can be different, and we need to use the bits from hva rather
than gpa.
In the case where we are not using MMU notifiers, the host physical
address we get from the memslot->arch.slot_phys[] array already
includes the low-order bits down to the PAGE_SIZE level, even if
we are using large pages. Thus we can simplify the calculation in
this case to just add in the remaining bits in the case where
PAGE_SIZE is 64k and the guest is mapping a 4k page.
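As a standalone sketch of the composition just described (illustrative only; the helper names and the 64k PAGE_SIZE are assumptions here, not the kernel's definitions):

#include <stdint.h>

#define PAGE_SHIFT	16			/* assumed: 64k host pages */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))

/*
 * MMU-notifier case: pfn comes from the Linux PTE at its own page size
 * (pte_size), so the offset within that page has to come from the host
 * virtual address, not the guest physical address.
 */
static uint64_t pa_from_pte(uint64_t pfn, uint64_t hva, uint64_t pte_size)
{
	return (pfn << PAGE_SHIFT) | (hva & (pte_size - 1));
}

/*
 * No-MMU-notifier case: slot_phys[] already carries the bits down to
 * PAGE_SIZE, so only the sub-PAGE_SIZE bits of gpa need to be added
 * (e.g. a 4k guest page inside a 64k host page).
 */
static uint64_t pa_from_slot_phys(uint64_t slot_pa, uint64_t gpa)
{
	return (slot_pa & PAGE_MASK) | (gpa & ~PAGE_MASK);
}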
The same bug exists in kvmppc_book3s_hv_page_fault(). The basic fix
is to use psize (the page size from the HPTE) rather than pte_size
(the page size from the Linux PTE) when updating the HPTE low word
in r. That means that pfn needs to be computed to PAGE_SIZE
granularity even if the Linux PTE is a huge page PTE. That can be
arranged simply by doing the page_to_pfn() before setting page to
the head of the compound page. If psize is less than PAGE_SIZE,
then we need to make sure we only update the bits from PAGE_SIZE
upwards, in order not to lose any sub-page offset bits in r.
On the other hand, if psize is greater than PAGE_SIZE, we need to
make sure we don't bring in non-zero low order bits in pfn, hence
we mask (pfn << PAGE_SHIFT) with ~(psize - 1).
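Pulled out into a standalone sketch, with an assumed 64k PAGE_SIZE and HPTE_R_PP0 as in the ppc64 HPTE layout, the corrected update of r looks like this (illustrative, not the kernel source):

#include <stdint.h>

#define PAGE_SHIFT	16
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define HPTE_R_PP0	0x8000000000000000UL

/* r: current HPTE low word; pfn: PAGE_SIZE-granular frame number;
 * psize: page size implied by the guest's HPTE. */
static uint64_t set_hpte_pfn(uint64_t r, uint64_t pfn, uint64_t psize)
{
	if (psize < PAGE_SIZE)
		psize = PAGE_SIZE;	/* keep r's sub-page offset bits intact */
	return (r & ~(HPTE_R_PP0 - psize)) |		/* clear old real address */
	       ((pfn << PAGE_SHIFT) & ~(psize - 1));	/* no stray low-order bits */
}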
Cc: stable@vger.kernel.org # v3.11+
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/powerpc/kvm/book3s_64_mmu_hv.c | 12 +++++++++---
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 4 ++--
2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index f3ff587..47bbeaf 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -665,6 +665,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
return -EFAULT;
} else {
page = pages[0];
+ pfn = page_to_pfn(page);
if (PageHuge(page)) {
page = compound_head(page);
pte_size <<= compound_order(page);
@@ -689,7 +690,6 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
}
rcu_read_unlock_sched();
}
- pfn = page_to_pfn(page);
}
ret = -EFAULT;
@@ -707,8 +707,14 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
r = (r & ~(HPTE_R_W|HPTE_R_I|HPTE_R_G)) | HPTE_R_M;
}
- /* Set the HPTE to point to pfn */
- r = (r & ~(HPTE_R_PP0 - pte_size)) | (pfn << PAGE_SHIFT);
+ /*
+ * Set the HPTE to point to pfn.
+ * Since the pfn is at PAGE_SIZE granularity, make sure we
+ * don't mask out lower-order bits if psize < PAGE_SIZE.
+ */
+ if (psize < PAGE_SIZE)
+ psize = PAGE_SIZE;
+ r = (r & ~(HPTE_R_PP0 - psize)) | ((pfn << PAGE_SHIFT) & ~(psize - 1));
if (hpte_is_writable(r) && !write_ok)
r = hpte_make_readonly(r);
ret = RESUME_GUEST;
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 9c51544..fddbf98 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -225,6 +225,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
is_io = pa & (HPTE_R_I | HPTE_R_W);
pte_size = PAGE_SIZE << (pa & KVMPPC_PAGE_ORDER_MASK);
pa &= PAGE_MASK;
+ pa |= gpa & ~PAGE_MASK;
} else {
/* Translate to host virtual address */
hva = __gfn_to_hva_memslot(memslot, gfn);
@@ -238,13 +239,12 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags,
ptel = hpte_make_readonly(ptel);
is_io = hpte_cache_bits(pte_val(pte));
pa = pte_pfn(pte) << PAGE_SHIFT;
+ pa |= hva & (pte_size - 1);
}
}
if (pte_size < psize)
return H_PARAMETER;
- if (pa && pte_size > psize)
- pa |= gpa & (pte_size - 1);
ptel &= ~(HPTE_R_PP0 - psize);
ptel |= pa;
--
1.8.4.2
* [PATCH 2/4] KVM: PPC: Book3S HV: Refine barriers in guest entry/exit
From: Paul Mackerras @ 2013-11-16 6:46 UTC
To: Alexander Graf; +Cc: kvm, kvm-ppc
Some users have reported instances of the host hanging with secondary
threads of a core waiting for the primary thread to exit the guest,
and the primary thread stuck in nap mode. This prompted a review of
the memory barriers in the guest entry/exit code, and this is the
result. Most of these changes are the suggestions of Dean Burdick
<deanburdick@us.ibm.com>.
The barriers between updating napping_threads and reading the
entry_exit_count on the one hand, and updating entry_exit_count and
reading napping_threads on the other, need to be isync not lwsync,
since we need to ensure that either the napping_threads update or the
entry_exit_count update gets seen. It is not sufficient to order the
load vs. lwarx, as lwsync does; we need to order the load vs. the
stwcx., so we need isync.
In addition, we need a full sync before sending IPIs to wake other
threads from nap, to ensure that the write to the entry_exit_count is
visible before the IPI occurs.
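As an illustration only (the real code is the assembly patched below), the handshake is the classic store-one-flag-then-load-the-other pattern, which needs full store-to-load ordering; in a rough C11 analogue, acquire/release operations (roughly lwsync) are not enough and sequentially consistent ones (roughly the sync/isync added here) are required:

#include <stdatomic.h>

static atomic_int napping_threads;
static atomic_int entry_exit_count;

/* secondary thread about to nap: publish first, then check for exit */
int prepare_to_nap(void)
{
	atomic_fetch_add(&napping_threads, 1);		/* the stwcx. loop */
	return atomic_load(&entry_exit_count);		/* nonzero: don't nap */
}

/* thread exiting the guest: publish the exit, then see who is napping */
int note_exit(void)
{
	atomic_fetch_add(&entry_exit_count, 0x100);
	return atomic_load(&napping_threads);		/* decide whom to IPI */
}

With the default seq_cst ordering above, at least one side is guaranteed to observe the other's update; weaken the stores/loads to release/acquire and both loads may read stale zeroes, which corresponds to the hang described.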
Cc: stable@vger.kernel.org # v3.2+
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index bc8de75..bde28da 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -153,7 +153,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
13: b machine_check_fwnmi
-
/*
* We come in here when wakened from nap mode on a secondary hw thread.
* Relocation is off and most register values are lost.
@@ -224,6 +223,11 @@ kvm_start_guest:
/* Clear our vcpu pointer so we don't come back in early */
li r0, 0
std r0, HSTATE_KVM_VCPU(r13)
+ /*
+ * Make sure we clear HSTATE_KVM_VCPU(r13) before incrementing
+ * the nap_count, because once the increment to nap_count is
+ * visible we could be given another vcpu.
+ */
lwsync
/* Clear any pending IPI - we're an offline thread */
ld r5, HSTATE_XICS_PHYS(r13)
@@ -241,7 +245,6 @@ kvm_start_guest:
/* increment the nap count and then go to nap mode */
ld r4, HSTATE_KVM_VCORE(r13)
addi r4, r4, VCORE_NAP_COUNT
- lwsync /* make previous updates visible */
51: lwarx r3, 0, r4
addi r3, r3, 1
stwcx. r3, 0, r4
@@ -990,14 +993,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
*/
/* Increment the threads-exiting-guest count in the 0xff00
bits of vcore->entry_exit_count */
- lwsync
ld r5,HSTATE_KVM_VCORE(r13)
addi r6,r5,VCORE_ENTRY_EXIT
41: lwarx r3,0,r6
addi r0,r3,0x100
stwcx. r0,0,r6
bne 41b
- lwsync
+ isync /* order stwcx. vs. reading napping_threads */
/*
* At this point we have an interrupt that we have to pass
@@ -1030,6 +1032,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
sld r0,r0,r4
andc. r3,r3,r0 /* no sense IPI'ing ourselves */
beq 43f
+ /* Order entry/exit update vs. IPIs */
+ sync
mulli r4,r4,PACA_SIZE /* get paca for thread 0 */
subf r6,r4,r13
42: andi. r0,r3,1
@@ -1638,10 +1642,10 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_206)
bge kvm_cede_exit
stwcx. r4,0,r6
bne 31b
+ /* order napping_threads update vs testing entry_exit_count */
+ isync
li r0,1
stb r0,HSTATE_NAPPING(r13)
- /* order napping_threads update vs testing entry_exit_count */
- lwsync
mr r4,r3
lwz r7,VCORE_ENTRY_EXIT(r5)
cmpwi r7,0x100
--
1.8.4.2
* [PATCH 3/4] KVM: PPC: Book3S HV: Make tbacct_lock irq-safe
From: Paul Mackerras @ 2013-11-16 6:46 UTC
To: Alexander Graf; +Cc: kvm, kvm-ppc
Lockdep reported that there is a potential for deadlock because
vcpu->arch.tbacct_lock is not irq-safe, and is sometimes taken inside
the rq_lock (run-queue lock) in the scheduler, which is taken within
interrupts. The lockdep splat looks like:
===========================
[ INFO: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected ]
3.12.0-rc5-kvm+ #8 Not tainted
------------------------------------------------------
qemu-system-ppc/4803 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
(&(&vcpu->arch.tbacct_lock)->rlock){+.+...}, at: [<c0000000000947ac>] .kvmppc_core_vcpu_put_hv+0x2c/0xa0
and this task is already holding:
(&rq->lock){-.-.-.}, at: [<c000000000ac16c0>] .__schedule+0x180/0xaa0
which would create a new lock dependency:
(&rq->lock){-.-.-.} -> (&(&vcpu->arch.tbacct_lock)->rlock){+.+...}
but this new dependency connects a HARDIRQ-irq-safe lock:
(&rq->lock){-.-.-.}
... which became HARDIRQ-irq-safe at:
[<c00000000013797c>] .lock_acquire+0xbc/0x190
[<c000000000ac3c74>] ._raw_spin_lock+0x34/0x60
[<c0000000000f8564>] .scheduler_tick+0x54/0x180
[<c0000000000c2610>] .update_process_times+0x70/0xa0
[<c00000000012cdfc>] .tick_periodic+0x3c/0xe0
[<c00000000012cec8>] .tick_handle_periodic+0x28/0xb0
[<c00000000001ef40>] .timer_interrupt+0x120/0x2e0
[<c000000000002868>] decrementer_common+0x168/0x180
[<c0000000001c7ca4>] .get_page_from_freelist+0x924/0xc10
[<c0000000001c8e00>] .__alloc_pages_nodemask+0x200/0xba0
[<c0000000001c9eb8>] .alloc_pages_exact_nid+0x68/0x110
[<c000000000f4c3ec>] .page_cgroup_init+0x1e0/0x270
[<c000000000f24480>] .start_kernel+0x3e0/0x4e4
[<c000000000009d30>] .start_here_common+0x20/0x70
to a HARDIRQ-irq-unsafe lock:
(&(&vcpu->arch.tbacct_lock)->rlock){+.+...}
... which became HARDIRQ-irq-unsafe at:
... [<c00000000013797c>] .lock_acquire+0xbc/0x190
[<c000000000ac3c74>] ._raw_spin_lock+0x34/0x60
[<c0000000000946ac>] .kvmppc_core_vcpu_load_hv+0x2c/0x100
[<c00000000008394c>] .kvmppc_core_vcpu_load+0x2c/0x40
[<c000000000081000>] .kvm_arch_vcpu_load+0x10/0x30
[<c00000000007afd4>] .vcpu_load+0x64/0xd0
[<c00000000007b0f8>] .kvm_vcpu_ioctl+0x68/0x730
[<c00000000025530c>] .do_vfs_ioctl+0x4dc/0x7a0
[<c000000000255694>] .SyS_ioctl+0xc4/0xe0
[<c000000000009ee4>] syscall_exit+0x0/0x98
Some users have reported this deadlock occurring in practice, though
the reports have been primarily on 3.10.x-based kernels.
This fixes the problem by making tbacct_lock irq-safe.
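For readers less familiar with the lockdep rule involved, a minimal sketch (not the actual accounting code) of why the lock must block interrupts while held:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(tbacct_lock);

/*
 * tbacct_lock is also taken while the scheduler holds rq->lock (the
 * __schedule -> kvmppc_core_vcpu_put_hv path in the splat above), and
 * rq->lock is taken from the timer interrupt.  Any lock ever nested
 * inside a hardirq-safe lock must itself be hardirq-safe, otherwise an
 * interrupt arriving while it is held can complete a deadlock cycle.
 */
static void account_stolen_time(void)
{
	unsigned long flags;

	spin_lock_irqsave(&tbacct_lock, flags);		/* was spin_lock() */
	/* ... update stolen_tb / busy_stolen ... */
	spin_unlock_irqrestore(&tbacct_lock, flags);
}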
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/powerpc/kvm/book3s_hv.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 072287f..31d9cfb 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -131,8 +131,9 @@ static void kvmppc_fast_vcpu_kick_hv(struct kvm_vcpu *vcpu)
static void kvmppc_core_vcpu_load_hv(struct kvm_vcpu *vcpu, int cpu)
{
struct kvmppc_vcore *vc = vcpu->arch.vcore;
+ unsigned long flags;
- spin_lock(&vcpu->arch.tbacct_lock);
+ spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
if (vc->runner == vcpu && vc->vcore_state != VCORE_INACTIVE &&
vc->preempt_tb != TB_NIL) {
vc->stolen_tb += mftb() - vc->preempt_tb;
@@ -143,19 +144,20 @@ static void kvmppc_core_vcpu_load_hv(struct kvm_vcpu *vcpu, int cpu)
vcpu->arch.busy_stolen += mftb() - vcpu->arch.busy_preempt;
vcpu->arch.busy_preempt = TB_NIL;
}
- spin_unlock(&vcpu->arch.tbacct_lock);
+ spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
}
static void kvmppc_core_vcpu_put_hv(struct kvm_vcpu *vcpu)
{
struct kvmppc_vcore *vc = vcpu->arch.vcore;
+ unsigned long flags;
- spin_lock(&vcpu->arch.tbacct_lock);
+ spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
if (vc->runner == vcpu && vc->vcore_state != VCORE_INACTIVE)
vc->preempt_tb = mftb();
if (vcpu->arch.state == KVMPPC_VCPU_BUSY_IN_HOST)
vcpu->arch.busy_preempt = mftb();
- spin_unlock(&vcpu->arch.tbacct_lock);
+ spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
}
static void kvmppc_set_msr_hv(struct kvm_vcpu *vcpu, u64 msr)
@@ -486,11 +488,11 @@ static u64 vcore_stolen_time(struct kvmppc_vcore *vc, u64 now)
*/
if (vc->vcore_state != VCORE_INACTIVE &&
vc->runner->arch.run_task != current) {
- spin_lock(&vc->runner->arch.tbacct_lock);
+ spin_lock_irq(&vc->runner->arch.tbacct_lock);
p = vc->stolen_tb;
if (vc->preempt_tb != TB_NIL)
p += now - vc->preempt_tb;
- spin_unlock(&vc->runner->arch.tbacct_lock);
+ spin_unlock_irq(&vc->runner->arch.tbacct_lock);
} else {
p = vc->stolen_tb;
}
@@ -512,10 +514,10 @@ static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
core_stolen = vcore_stolen_time(vc, now);
stolen = core_stolen - vcpu->arch.stolen_logged;
vcpu->arch.stolen_logged = core_stolen;
- spin_lock(&vcpu->arch.tbacct_lock);
+ spin_lock_irq(&vcpu->arch.tbacct_lock);
stolen += vcpu->arch.busy_stolen;
vcpu->arch.busy_stolen = 0;
- spin_unlock(&vcpu->arch.tbacct_lock);
+ spin_unlock_irq(&vcpu->arch.tbacct_lock);
if (!dt || !vpa)
return;
memset(dt, 0, sizeof(struct dtl_entry));
@@ -1115,13 +1117,13 @@ static void kvmppc_remove_runnable(struct kvmppc_vcore *vc,
if (vcpu->arch.state != KVMPPC_VCPU_RUNNABLE)
return;
- spin_lock(&vcpu->arch.tbacct_lock);
+ spin_lock_irq(&vcpu->arch.tbacct_lock);
now = mftb();
vcpu->arch.busy_stolen += vcore_stolen_time(vc, now) -
vcpu->arch.stolen_logged;
vcpu->arch.busy_preempt = now;
vcpu->arch.state = KVMPPC_VCPU_BUSY_IN_HOST;
- spin_unlock(&vcpu->arch.tbacct_lock);
+ spin_unlock_irq(&vcpu->arch.tbacct_lock);
--vc->n_runnable;
list_del(&vcpu->arch.run_list);
}
--
1.8.4.2
* [PATCH 4/4] KVM: PPC: Book3S HV: Take SRCU read lock around kvm_read_guest() call
From: Paul Mackerras @ 2013-11-16 6:46 UTC
To: Alexander Graf; +Cc: kvm, kvm-ppc
Running a kernel with CONFIG_PROVE_RCU=y yields the following diagnostic:
===============================
[ INFO: suspicious RCU usage. ]
3.12.0-rc5-kvm+ #9 Not tainted
-------------------------------
/home/paulus/kernel/kvm-test/include/linux/kvm_host.h:473 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
1 lock held by qemu-system-ppc/4831:
#0: (&vcpu->mutex){+.+.+.}, at: [<c00000000007af98>] .vcpu_load+0x28/0xd0
stack backtrace:
CPU: 28 PID: 4831 Comm: qemu-system-ppc Not tainted 3.12.0-rc5-kvm+ #9
Call Trace:
[c000000be462b2a0] [c00000000001644c] .show_stack+0x7c/0x1f0 (unreliable)
[c000000be462b370] [c000000000ad57c0] .dump_stack+0x88/0xb4
[c000000be462b3f0] [c0000000001315e8] .lockdep_rcu_suspicious+0x138/0x180
[c000000be462b480] [c00000000007862c] .gfn_to_memslot+0x13c/0x170
[c000000be462b510] [c00000000007d384] .gfn_to_hva_prot+0x24/0x90
[c000000be462b5a0] [c00000000007d420] .kvm_read_guest_page+0x30/0xd0
[c000000be462b630] [c00000000007d528] .kvm_read_guest+0x68/0x110
[c000000be462b6e0] [c000000000084594] .kvmppc_rtas_hcall+0x34/0x180
[c000000be462b7d0] [c000000000097934] .kvmppc_pseries_do_hcall+0x74/0x830
[c000000be462b880] [c0000000000990e8] .kvmppc_vcpu_run_hv+0xff8/0x15a0
[c000000be462b9e0] [c0000000000839cc] .kvmppc_vcpu_run+0x2c/0x40
[c000000be462ba50] [c0000000000810b4] .kvm_arch_vcpu_ioctl_run+0x54/0x1b0
[c000000be462bae0] [c00000000007b508] .kvm_vcpu_ioctl+0x478/0x730
[c000000be462bca0] [c00000000025532c] .do_vfs_ioctl+0x4dc/0x7a0
[c000000be462bd80] [c0000000002556b4] .SyS_ioctl+0xc4/0xe0
[c000000be462be30] [c000000000009ee4] syscall_exit+0x0/0x98
To fix this, we take the SRCU read lock around the kvmppc_rtas_hcall()
call.
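A compressed sketch of the rule being applied (an illustrative wrapper, not the full kvmppc_pseries_do_hcall()): anything that resolves guest memory through the memslots, as kvm_read_guest() does, must run inside an SRCU read-side critical section on kvm->srcu:

#include <linux/kvm_host.h>

static int do_rtas_hcall(struct kvm_vcpu *vcpu)
{
	int idx, rc;

	idx = srcu_read_lock(&vcpu->kvm->srcu);
	rc = kvmppc_rtas_hcall(vcpu);	/* ends up in kvm_read_guest() */
	srcu_read_unlock(&vcpu->kvm->srcu, idx);
	return rc;
}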
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/powerpc/kvm/book3s_hv.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 31d9cfb..b51d5db 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -591,7 +591,9 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
if (list_empty(&vcpu->kvm->arch.rtas_tokens))
return RESUME_HOST;
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
rc = kvmppc_rtas_hcall(vcpu);
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
if (rc == -ENOENT)
return RESUME_HOST;
--
1.8.4.2
* Re: [PATCH 0/4] KVM: PPC: Book3S HV: Bug fixes for 3.13
From: Alexander Graf @ 2013-11-18 21:42 UTC
To: Paul Mackerras; +Cc: kvm@vger.kernel.org mailing list, kvm-ppc
On 16.11.2013, at 01:46, Paul Mackerras <paulus@samba.org> wrote:
> This series fixes some nasty bugs that have recently been found in HV
> KVM on PPC. The series is against the 'queue' branch of the KVM tree.
> I would like to see all of these go into 3.13.
>
> The first patch fixes a regression that occurred when the transparent
> huge page (THP) code for ppc was merged, in 3.11. When we have a
> memslot whose userspace address is not 16MB aligned, and we find a
> 16MB page in it due to THP when mapping a normal page to the guest, we
> compute its physical address incorrectly.
>
> The second and third patches fix host hangs that have been observed in
> testing with large guests under load. The fourth patch fixes another
> locking problem that was found by lockdep when we were looking for the
> reason for the hangs.
Thanks, applied all to for-3.13.
Alex
* Re: [PATCH 1/4] KVM: PPC: Book3S HV: Fix physical address calculations
From: Alexey Kardashevskiy @ 2013-12-06 10:52 UTC
To: Paul Mackerras, Alexander Graf; +Cc: kvm, kvm-ppc
On 11/16/2013 05:46 PM, Paul Mackerras wrote:
> This fixes a bug in kvmppc_do_h_enter() where the physical address
> for a page can be calculated incorrectly if transparent huge pages
> (THP) are active. Until THP came along, it was true that if we
> encountered a large (16M) page in kvmppc_do_h_enter(), then the
> associated memslot must be 16M aligned for both its guest physical
> address and the userspace address, and the physical address
> calculations in kvmppc_do_h_enter() assumed that. With THP, that
> is no longer true.
>
> In the case where we are using MMU notifiers and the page size that
> we get from the Linux page tables is larger than the page being mapped
> by the guest, we need to fill in some low-order bits of the physical
> address. Without THP, these bits would be the same in the guest
> physical address (gpa) and the host virtual address (hva). With THP,
> they can be different, and we need to use the bits from hva rather
> than gpa.
>
> In the case where we are not using MMU notifiers, the host physical
> address we get from the memslot->arch.slot_phys[] array already
> includes the low-order bits down to the PAGE_SIZE level, even if
> we are using large pages. Thus we can simplify the calculation in
> this case to just add in the remaining bits in the case where
> PAGE_SIZE is 64k and the guest is mapping a 4k page.
>
> The same bug exists in kvmppc_book3s_hv_page_fault(). The basic fix
> is to use psize (the page size from the HPTE) rather than pte_size
> (the page size from the Linux PTE) when updating the HPTE low word
> in r. That means that pfn needs to be computed to PAGE_SIZE
> granularity even if the Linux PTE is a huge page PTE. That can be
> arranged simply by doing the page_to_pfn() before setting page to
> the head of the compound page. If psize is less than PAGE_SIZE,
> then we need to make sure we only update the bits from PAGE_SIZE
> upwards, in order not to lose any sub-page offset bits in r.
> On the other hand, if psize is greater than PAGE_SIZE, we need to
> make sure we don't bring in non-zero low order bits in pfn, hence
> we mask (pfn << PAGE_SHIFT) with ~(psize - 1).
This patch breaks VFIO with an Intel E1000E on my p5ioc2 machine (vpl2): the
guest tries to allocate MSIs via ibm,change-msi and pauses for a while
(> 10 minutes), then continues, but the ethernet adapter does not work anyway.
Disabling MSI in QEMU does not help though. I'll debug on Monday, but quick
ideas are welcome.
Host kernel config huge pages bits:
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
Huge pages are not enabled in QEMU.
--
Alexey
* Re: [PATCH 1/4] KVM: PPC: Book3S HV: Fix physical address calculations
From: Paul Mackerras @ 2013-12-10 5:38 UTC
To: Alexey Kardashevskiy; +Cc: Alexander Graf, kvm, kvm-ppc
On Fri, Dec 06, 2013 at 09:52:31PM +1100, Alexey Kardashevskiy wrote:
> This patch breaks VFIO with Intel E1000E on my p5ioc2 machine (vpl2) - the
> guest tries allocating MSI ibm,change_msi and pauses for a while (>
> 10minutes), then continues but the ethernet adapter does not work anyway.
> Disabling MSI in QEMU does not help though. I'll debug on Monday but quick
> ideas are welcome.
Looks like I'm losing some low-order address bits in kvmppc_do_h_enter
when the guest creates a 4k HPTE.
Alex, have you merged this patch in an immutable branch yet? Do you
want an incremental patch on top, or a revised version of this patch?
Regards,
Paul.