* [PATCH] KVM: PPC: Book3S HV: Fix handling of interrupted VCPUs
@ 2015-09-18 6:57 ` Paul Mackerras
0 siblings, 0 replies; 10+ messages in thread
From: Paul Mackerras @ 2015-09-18 6:57 UTC (permalink / raw)
To: kvm, kvm-ppc; +Cc: David Gibson, Thomas Huth
This fixes a bug which results in stale vcore pointers being left in
the per-cpu preempted vcore lists when a VM is destroyed. The result
of the stale vcore pointers is usually either a crash or a lockup
inside collect_piggybacks() when another VM is run. A typical
lockup message looks like:
[ 472.161074] NMI watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [qemu-system-ppc:7039]
[ 472.161204] Modules linked in: kvm_hv kvm_pr kvm xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ses enclosure shpchp rtc_opal i2c_opal powernv_rng binfmt_misc dm_service_time scsi_dh_alua radeon i2c_algo_bit drm_kms_helper ttm drm tg3 ptp pps_core cxgb3 ipr i2c_core mdio dm_multipath [last unloaded: kvm_hv]
[ 472.162111] CPU: 24 PID: 7039 Comm: qemu-system-ppc Not tainted 4.2.0-kvm+ #49
[ 472.162187] task: c000001e38512750 ti: c000001e41bfc000 task.ti: c000001e41bfc000
[ 472.162262] NIP: c00000000096b094 LR: c00000000096b08c CTR: c000000000111130
[ 472.162337] REGS: c000001e41bff520 TRAP: 0901 Not tainted (4.2.0-kvm+)
[ 472.162399] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24848844 XER: 00000000
[ 472.162588] CFAR: c00000000096b0ac SOFTE: 1
GPR00: c000000000111170 c000001e41bff7a0 c00000000127df00 0000000000000001
GPR04: 0000000000000003 0000000000000001 0000000000000000 0000000000874821
GPR08: c000001e41bff8e0 0000000000000001 0000000000000000 d00000000efde740
GPR12: c000000000111130 c00000000fdae400
[ 472.163053] NIP [c00000000096b094] _raw_spin_lock_irqsave+0xa4/0x130
[ 472.163117] LR [c00000000096b08c] _raw_spin_lock_irqsave+0x9c/0x130
[ 472.163179] Call Trace:
[ 472.163206] [c000001e41bff7a0] [c000001e41bff7f0] 0xc000001e41bff7f0 (unreliable)
[ 472.163295] [c000001e41bff7e0] [c000000000111170] __wake_up+0x40/0x90
[ 472.163375] [c000001e41bff830] [d00000000efd6fc0] kvmppc_run_core+0x1240/0x1950 [kvm_hv]
[ 472.163465] [c000001e41bffa30] [d00000000efd8510] kvmppc_vcpu_run_hv+0x5a0/0xd90 [kvm_hv]
[ 472.163559] [c000001e41bffb70] [d00000000e9318a4] kvmppc_vcpu_run+0x44/0x60 [kvm]
[ 472.163653] [c000001e41bffba0] [d00000000e92e674] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
[ 472.163745] [c000001e41bffbe0] [d00000000e9263a8] kvm_vcpu_ioctl+0x538/0x7b0 [kvm]
[ 472.163834] [c000001e41bffd40] [c0000000002d0f50] do_vfs_ioctl+0x480/0x7c0
[ 472.163910] [c000001e41bffde0] [c0000000002d1364] SyS_ioctl+0xd4/0xf0
[ 472.163986] [c000001e41bffe30] [c000000000009260] system_call+0x38/0xd0
[ 472.164060] Instruction dump:
[ 472.164098] ebc1fff0 ebe1fff8 7c0803a6 4e800020 60000000 60000000 60420000 8bad02e2
[ 472.164224] 7fc3f378 4b6a57c1 60000000 7c210b78 <e92d0000> 89290009 792affe3 40820070
The bug is that kvmppc_run_vcpu does not correctly handle the case
where a vcpu task receives a signal while its guest vcpu is executing
in the guest as a result of being piggy-backed onto the execution of
another vcore. In that case we need to wait for the vcpu to finish
executing inside the guest, and then remove this vcore from the
preempted vcores list. That way, we avoid leaving this vcpu's vcore
on the preempted vcores list when the vcpu gets interrupted.
Fixes: ec2571650826
Reported-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/powerpc/kvm/book3s_hv.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9754e68..2280497 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2692,9 +2692,13 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
while (vcpu->arch.state == KVMPPC_VCPU_RUNNABLE &&
(vc->vcore_state == VCORE_RUNNING ||
- vc->vcore_state == VCORE_EXITING))
+ vc->vcore_state == VCORE_EXITING ||
+ vc->vcore_state == VCORE_PIGGYBACK))
kvmppc_wait_for_exec(vc, vcpu, TASK_UNINTERRUPTIBLE);
+ if (vc->vcore_state == VCORE_PREEMPT && vc->runner == NULL)
+ kvmppc_vcore_end_preempt(vc);
+
if (vcpu->arch.state == KVMPPC_VCPU_RUNNABLE) {
kvmppc_remove_runnable(vc, vcpu);
vcpu->stat.signal_exits++;
--
2.5.1
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH] KVM: PPC: Book3S HV: Fix handling of interrupted VCPUs
2015-09-18 6:57 ` Paul Mackerras
@ 2015-09-18 8:15 ` Thomas Huth
-1 siblings, 0 replies; 10+ messages in thread
From: Thomas Huth @ 2015-09-18 8:15 UTC (permalink / raw)
To: Paul Mackerras, kvm, kvm-ppc; +Cc: David Gibson
On 18/09/15 08:57, Paul Mackerras wrote:
> This fixes a bug which results in stale vcore pointers being left in
> the per-cpu preempted vcore lists when a VM is destroyed. The result
> of the stale vcore pointers is usually either a crash or a lockup
> inside collect_piggybacks() when another VM is run. A typical
> lockup message looks like:
>
> [ 472.161074] NMI watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [qemu-system-ppc:7039]
> [ 472.161204] Modules linked in: kvm_hv kvm_pr kvm xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ses enclosure shpchp rtc_opal i2c_opal powernv_rng binfmt_misc dm_service_time scsi_dh_alua radeon i2c_algo_bit drm_kms_helper ttm drm tg3 ptp pps_core cxgb3 ipr i2c_core mdio dm_multipath [last unloaded: kvm_hv]
> [ 472.162111] CPU: 24 PID: 7039 Comm: qemu-system-ppc Not tainted 4.2.0-kvm+ #49
> [ 472.162187] task: c000001e38512750 ti: c000001e41bfc000 task.ti: c000001e41bfc000
> [ 472.162262] NIP: c00000000096b094 LR: c00000000096b08c CTR: c000000000111130
> [ 472.162337] REGS: c000001e41bff520 TRAP: 0901 Not tainted (4.2.0-kvm+)
> [ 472.162399] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24848844 XER: 00000000
> [ 472.162588] CFAR: c00000000096b0ac SOFTE: 1
> GPR00: c000000000111170 c000001e41bff7a0 c00000000127df00 0000000000000001
> GPR04: 0000000000000003 0000000000000001 0000000000000000 0000000000874821
> GPR08: c000001e41bff8e0 0000000000000001 0000000000000000 d00000000efde740
> GPR12: c000000000111130 c00000000fdae400
> [ 472.163053] NIP [c00000000096b094] _raw_spin_lock_irqsave+0xa4/0x130
> [ 472.163117] LR [c00000000096b08c] _raw_spin_lock_irqsave+0x9c/0x130
> [ 472.163179] Call Trace:
> [ 472.163206] [c000001e41bff7a0] [c000001e41bff7f0] 0xc000001e41bff7f0 (unreliable)
> [ 472.163295] [c000001e41bff7e0] [c000000000111170] __wake_up+0x40/0x90
> [ 472.163375] [c000001e41bff830] [d00000000efd6fc0] kvmppc_run_core+0x1240/0x1950 [kvm_hv]
> [ 472.163465] [c000001e41bffa30] [d00000000efd8510] kvmppc_vcpu_run_hv+0x5a0/0xd90 [kvm_hv]
> [ 472.163559] [c000001e41bffb70] [d00000000e9318a4] kvmppc_vcpu_run+0x44/0x60 [kvm]
> [ 472.163653] [c000001e41bffba0] [d00000000e92e674] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
> [ 472.163745] [c000001e41bffbe0] [d00000000e9263a8] kvm_vcpu_ioctl+0x538/0x7b0 [kvm]
> [ 472.163834] [c000001e41bffd40] [c0000000002d0f50] do_vfs_ioctl+0x480/0x7c0
> [ 472.163910] [c000001e41bffde0] [c0000000002d1364] SyS_ioctl+0xd4/0xf0
> [ 472.163986] [c000001e41bffe30] [c000000000009260] system_call+0x38/0xd0
> [ 472.164060] Instruction dump:
> [ 472.164098] ebc1fff0 ebe1fff8 7c0803a6 4e800020 60000000 60000000 60420000 8bad02e2
> [ 472.164224] 7fc3f378 4b6a57c1 60000000 7c210b78 <e92d0000> 89290009 792affe3 40820070
>
> The bug is that kvmppc_run_vcpu does not correctly handle the case
> where a vcpu task receives a signal while its guest vcpu is executing
> in the guest as a result of being piggy-backed onto the execution of
> another vcore. In that case we need to wait for the vcpu to finish
> executing inside the guest, and then remove this vcore from the
> preempted vcores list. That way, we avoid leaving this vcpu's vcore
> on the preempted vcores list when the vcpu gets interrupted.
>
> Fixes: ec2571650826
> Reported-by: Thomas Huth <thuth@redhat.com>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> arch/powerpc/kvm/book3s_hv.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 9754e68..2280497 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -2692,9 +2692,13 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
>
> while (vcpu->arch.state = KVMPPC_VCPU_RUNNABLE &&
> (vc->vcore_state = VCORE_RUNNING ||
> - vc->vcore_state = VCORE_EXITING))
> + vc->vcore_state = VCORE_EXITING ||
> + vc->vcore_state = VCORE_PIGGYBACK))
> kvmppc_wait_for_exec(vc, vcpu, TASK_UNINTERRUPTIBLE);
>
> + if (vc->vcore_state = VCORE_PREEMPT && vc->runner = NULL)
> + kvmppc_vcore_end_preempt(vc);
> +
> if (vcpu->arch.state = KVMPPC_VCPU_RUNNABLE) {
> kvmppc_remove_runnable(vc, vcpu);
> vcpu->stat.signal_exits++;
Great, this fixes the crash for me! Thanks!
Tested-by: Thomas Huth <thuth@redhat.com>
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] KVM: PPC: Book3S HV: Fix handling of interrupted VCPUs
@ 2015-09-18 8:15 ` Thomas Huth
0 siblings, 0 replies; 10+ messages in thread
From: Thomas Huth @ 2015-09-18 8:15 UTC (permalink / raw)
To: Paul Mackerras, kvm, kvm-ppc; +Cc: David Gibson
On 18/09/15 08:57, Paul Mackerras wrote:
> This fixes a bug which results in stale vcore pointers being left in
> the per-cpu preempted vcore lists when a VM is destroyed. The result
> of the stale vcore pointers is usually either a crash or a lockup
> inside collect_piggybacks() when another VM is run. A typical
> lockup message looks like:
>
> [ 472.161074] NMI watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [qemu-system-ppc:7039]
> [ 472.161204] Modules linked in: kvm_hv kvm_pr kvm xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ses enclosure shpchp rtc_opal i2c_opal powernv_rng binfmt_misc dm_service_time scsi_dh_alua radeon i2c_algo_bit drm_kms_helper ttm drm tg3 ptp pps_core cxgb3 ipr i2c_core mdio dm_multipath [last unloaded: kvm_hv]
> [ 472.162111] CPU: 24 PID: 7039 Comm: qemu-system-ppc Not tainted 4.2.0-kvm+ #49
> [ 472.162187] task: c000001e38512750 ti: c000001e41bfc000 task.ti: c000001e41bfc000
> [ 472.162262] NIP: c00000000096b094 LR: c00000000096b08c CTR: c000000000111130
> [ 472.162337] REGS: c000001e41bff520 TRAP: 0901 Not tainted (4.2.0-kvm+)
> [ 472.162399] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24848844 XER: 00000000
> [ 472.162588] CFAR: c00000000096b0ac SOFTE: 1
> GPR00: c000000000111170 c000001e41bff7a0 c00000000127df00 0000000000000001
> GPR04: 0000000000000003 0000000000000001 0000000000000000 0000000000874821
> GPR08: c000001e41bff8e0 0000000000000001 0000000000000000 d00000000efde740
> GPR12: c000000000111130 c00000000fdae400
> [ 472.163053] NIP [c00000000096b094] _raw_spin_lock_irqsave+0xa4/0x130
> [ 472.163117] LR [c00000000096b08c] _raw_spin_lock_irqsave+0x9c/0x130
> [ 472.163179] Call Trace:
> [ 472.163206] [c000001e41bff7a0] [c000001e41bff7f0] 0xc000001e41bff7f0 (unreliable)
> [ 472.163295] [c000001e41bff7e0] [c000000000111170] __wake_up+0x40/0x90
> [ 472.163375] [c000001e41bff830] [d00000000efd6fc0] kvmppc_run_core+0x1240/0x1950 [kvm_hv]
> [ 472.163465] [c000001e41bffa30] [d00000000efd8510] kvmppc_vcpu_run_hv+0x5a0/0xd90 [kvm_hv]
> [ 472.163559] [c000001e41bffb70] [d00000000e9318a4] kvmppc_vcpu_run+0x44/0x60 [kvm]
> [ 472.163653] [c000001e41bffba0] [d00000000e92e674] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
> [ 472.163745] [c000001e41bffbe0] [d00000000e9263a8] kvm_vcpu_ioctl+0x538/0x7b0 [kvm]
> [ 472.163834] [c000001e41bffd40] [c0000000002d0f50] do_vfs_ioctl+0x480/0x7c0
> [ 472.163910] [c000001e41bffde0] [c0000000002d1364] SyS_ioctl+0xd4/0xf0
> [ 472.163986] [c000001e41bffe30] [c000000000009260] system_call+0x38/0xd0
> [ 472.164060] Instruction dump:
> [ 472.164098] ebc1fff0 ebe1fff8 7c0803a6 4e800020 60000000 60000000 60420000 8bad02e2
> [ 472.164224] 7fc3f378 4b6a57c1 60000000 7c210b78 <e92d0000> 89290009 792affe3 40820070
>
> The bug is that kvmppc_run_vcpu does not correctly handle the case
> where a vcpu task receives a signal while its guest vcpu is executing
> in the guest as a result of being piggy-backed onto the execution of
> another vcore. In that case we need to wait for the vcpu to finish
> executing inside the guest, and then remove this vcore from the
> preempted vcores list. That way, we avoid leaving this vcpu's vcore
> on the preempted vcores list when the vcpu gets interrupted.
>
> Fixes: ec2571650826
> Reported-by: Thomas Huth <thuth@redhat.com>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> arch/powerpc/kvm/book3s_hv.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 9754e68..2280497 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -2692,9 +2692,13 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
>
> while (vcpu->arch.state == KVMPPC_VCPU_RUNNABLE &&
> (vc->vcore_state == VCORE_RUNNING ||
> - vc->vcore_state == VCORE_EXITING))
> + vc->vcore_state == VCORE_EXITING ||
> + vc->vcore_state == VCORE_PIGGYBACK))
> kvmppc_wait_for_exec(vc, vcpu, TASK_UNINTERRUPTIBLE);
>
> + if (vc->vcore_state == VCORE_PREEMPT && vc->runner == NULL)
> + kvmppc_vcore_end_preempt(vc);
> +
> if (vcpu->arch.state == KVMPPC_VCPU_RUNNABLE) {
> kvmppc_remove_runnable(vc, vcpu);
> vcpu->stat.signal_exits++;
Great, this fixes the crash for me! Thanks!
Tested-by: Thomas Huth <thuth@redhat.com>
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH] KVM: PPC: Book3S HV: Fix handling of large pages in radix page fault handler
2015-09-18 6:57 ` Paul Mackerras
(?)
(?)
@ 2018-02-23 10:26 ` Paul Mackerras
-1 siblings, 0 replies; 10+ messages in thread
From: Paul Mackerras @ 2018-02-23 10:26 UTC (permalink / raw)
To: kvm-ppc
This fixes several bugs in the radix page fault handler relating to
the way large pages in the memory backing the guest were handled.
First, the check for large pages only checked for explicit huge pages
and missed transparent huge pages. Then the check that the addresses
(host virtual vs. guest physical) had appropriate alignment was
wrong, meaning that the code never put a large page in the partition
scoped radix tree; it was always demoted to a small page.
Fixing this exposed bugs in kvmppc_create_pte(). We were never
invalidating a 2MB PTE, which meant that if a page was initially
faulted in without write permission and the guest then attempted
to store to it, we would never update the PTE to have write permission.
If we find a valid 2MB PTE in the PMD, we need to clear it and
do a TLB invalidation before installing either the new 2MB PTE or
a pointer to a page table page.
This also corrects an assumption that get_user_pages_fast would set
the _PAGE_DIRTY bit if we are writing, which is not true. Instead we
mark the page dirty explicitly with set_page_dirty_lock(). This
also means we don't need the dirty bit set on the host PTE when
providing write access on a read fault.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
---
arch/powerpc/kvm/book3s_64_mmu_radix.c | 70 +++++++++++++++++++++-------------
1 file changed, 44 insertions(+), 26 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 0c85481..5798c2c 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -195,6 +195,12 @@ static void kvmppc_pte_free(pte_t *ptep)
kmem_cache_free(kvm_pte_cache, ptep);
}
+/* Like pmd_huge() and pmd_large(), but works regardless of config options */
+static inline int pmd_is_leaf(pmd_t pmd)
+{
+ return !!(pmd_val(pmd) & _PAGE_PTE);
+}
+
static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
unsigned int level, unsigned long mmu_seq)
{
@@ -219,7 +225,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
else
new_pmd = pmd_alloc_one(kvm->mm, gpa);
- if (level = 0 && !(pmd && pmd_present(*pmd)))
+ if (level = 0 && !(pmd && pmd_present(*pmd) && !pmd_is_leaf(*pmd)))
new_ptep = kvmppc_pte_alloc();
/* Check if we might have been invalidated; let the guest retry if so */
@@ -244,12 +250,31 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
new_pmd = NULL;
}
pmd = pmd_offset(pud, gpa);
- if (pmd_large(*pmd)) {
- /* Someone else has instantiated a large page here; retry */
- ret = -EAGAIN;
- goto out_unlock;
- }
- if (level = 1 && !pmd_none(*pmd)) {
+ if (pmd_is_leaf(*pmd)) {
+ unsigned long lgpa = gpa & PMD_MASK;
+
+ /*
+ * If we raced with another CPU which has just put
+ * a 2MB pte in after we saw a pte page, try again.
+ */
+ if (level = 0 && !new_ptep) {
+ ret = -EAGAIN;
+ goto out_unlock;
+ }
+ /* Valid 2MB page here already, remove it */
+ old = kvmppc_radix_update_pte(kvm, pmdp_ptep(pmd),
+ _PAGE_PRESENT, 0, lgpa, PMD_SHIFT);
+ kvmppc_radix_tlbie_page(kvm, lgpa, PMD_SHIFT);
+ if (old & _PAGE_DIRTY) {
+ unsigned long gfn = lgpa >> PAGE_SHIFT;
+ struct kvm_memory_slot *memslot;
+ memslot = gfn_to_memslot(kvm, gfn);
+ if (memslot && memslot->dirty_bitmap)
+ kvmppc_update_dirty_map(memslot,
+ gfn, PMD_SIZE);
+ }
+ pmd_clear(pmd);
+ } else if (level = 1 && !pmd_none(*pmd)) {
/*
* There's a page table page here, but we wanted
* to install a large page. Tell the caller and let
@@ -412,28 +437,24 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
} else {
page = pages[0];
pfn = page_to_pfn(page);
- if (PageHuge(page)) {
- page = compound_head(page);
- pte_size <<= compound_order(page);
+ if (PageCompound(page)) {
+ pte_size <<= compound_order(compound_head(page));
/* See if we can insert a 2MB large-page PTE here */
if (pte_size >= PMD_SIZE &&
- (gpa & PMD_MASK & PAGE_MASK) =
- (hva & PMD_MASK & PAGE_MASK)) {
+ (gpa & (PMD_SIZE - PAGE_SIZE)) =
+ (hva & (PMD_SIZE - PAGE_SIZE))) {
level = 1;
pfn &= ~((PMD_SIZE >> PAGE_SHIFT) - 1);
}
}
/* See if we can provide write access */
if (writing) {
- /*
- * We assume gup_fast has set dirty on the host PTE.
- */
pgflags |= _PAGE_WRITE;
} else {
local_irq_save(flags);
ptep = find_current_mm_pte(current->mm->pgd,
hva, NULL, NULL);
- if (ptep && pte_write(*ptep) && pte_dirty(*ptep))
+ if (ptep && pte_write(*ptep))
pgflags |= _PAGE_WRITE;
local_irq_restore(flags);
}
@@ -459,18 +480,15 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
pte = pfn_pte(pfn, __pgprot(pgflags));
ret = kvmppc_create_pte(kvm, pte, gpa, level, mmu_seq);
}
- if (ret = 0 || ret = -EAGAIN)
- ret = RESUME_GUEST;
if (page) {
- /*
- * We drop pages[0] here, not page because page might
- * have been set to the head page of a compound, but
- * we have to drop the reference on the correct tail
- * page to match the get inside gup()
- */
- put_page(pages[0]);
+ if (!ret && (pgflags & _PAGE_WRITE))
+ set_page_dirty_lock(page);
+ put_page(page);
}
+
+ if (ret = 0 || ret = -EAGAIN)
+ ret = RESUME_GUEST;
return ret;
}
@@ -644,7 +662,7 @@ void kvmppc_free_radix(struct kvm *kvm)
continue;
pmd = pmd_offset(pud, 0);
for (im = 0; im < PTRS_PER_PMD; ++im, ++pmd) {
- if (pmd_huge(*pmd)) {
+ if (pmd_is_leaf(*pmd)) {
pmd_clear(pmd);
continue;
}
--
2.7.4
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH] KVM: PPC: Book3S HV: Fix handling of large pages in radix page fault handler
2015-09-18 6:57 ` Paul Mackerras
` (2 preceding siblings ...)
(?)
@ 2018-02-26 7:16 ` Aneesh Kumar K.V
-1 siblings, 0 replies; 10+ messages in thread
From: Aneesh Kumar K.V @ 2018-02-26 7:16 UTC (permalink / raw)
To: kvm-ppc
Paul Mackerras <paulus@ozlabs.org> writes:
> This fixes several bugs in the radix page fault handler relating to
> the way large pages in the memory backing the guest were handled.
> First, the check for large pages only checked for explicit huge pages
> and missed transparent huge pages. Then the check that the addresses
> (host virtual vs. guest physical) had appropriate alignment was
> wrong, meaning that the code never put a large page in the partition
> scoped radix tree; it was always demoted to a small page.
>
> Fixing this exposed bugs in kvmppc_create_pte(). We were never
> invalidating a 2MB PTE, which meant that if a page was initially
> faulted in without write permission and the guest then attempted
> to store to it, we would never update the PTE to have write permission.
> If we find a valid 2MB PTE in the PMD, we need to clear it and
> do a TLB invalidation before installing either the new 2MB PTE or
> a pointer to a page table page.
>
> This also corrects an assumption that get_user_pages_fast would set
> the _PAGE_DIRTY bit if we are writing, which is not true. Instead we
> mark the page dirty explicitly with set_page_dirty_lock(). This
> also means we don't need the dirty bit set on the host PTE when
> providing write access on a read fault.
>
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
> ---
> arch/powerpc/kvm/book3s_64_mmu_radix.c | 70 +++++++++++++++++++++-------------
> 1 file changed, 44 insertions(+), 26 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 0c85481..5798c2c 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -195,6 +195,12 @@ static void kvmppc_pte_free(pte_t *ptep)
> kmem_cache_free(kvm_pte_cache, ptep);
> }
>
> +/* Like pmd_huge() and pmd_large(), but works regardless of config options */
> +static inline int pmd_is_leaf(pmd_t pmd)
> +{
> + return !!(pmd_val(pmd) & _PAGE_PTE);
> +}
> +
> static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
> unsigned int level, unsigned long mmu_seq)
> {
> @@ -219,7 +225,7 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
> else
> new_pmd = pmd_alloc_one(kvm->mm, gpa);
>
> - if (level = 0 && !(pmd && pmd_present(*pmd)))
> + if (level = 0 && !(pmd && pmd_present(*pmd) && !pmd_is_leaf(*pmd)))
> new_ptep = kvmppc_pte_alloc();
>
> /* Check if we might have been invalidated; let the guest retry if so */
> @@ -244,12 +250,31 @@ static int kvmppc_create_pte(struct kvm *kvm, pte_t pte, unsigned long gpa,
> new_pmd = NULL;
> }
> pmd = pmd_offset(pud, gpa);
> - if (pmd_large(*pmd)) {
> - /* Someone else has instantiated a large page here; retry */
> - ret = -EAGAIN;
> - goto out_unlock;
> - }
> - if (level = 1 && !pmd_none(*pmd)) {
> + if (pmd_is_leaf(*pmd)) {
> + unsigned long lgpa = gpa & PMD_MASK;
> +
> + /*
> + * If we raced with another CPU which has just put
> + * a 2MB pte in after we saw a pte page, try again.
> + */
> + if (level = 0 && !new_ptep) {
> + ret = -EAGAIN;
> + goto out_unlock;
> + }
> + /* Valid 2MB page here already, remove it */
> + old = kvmppc_radix_update_pte(kvm, pmdp_ptep(pmd),
> + _PAGE_PRESENT, 0, lgpa, PMD_SHIFT);
Can we do s/_PAGE_PRESENT/~0UL/ and that way avoid pmd_clear(pmd) below?
> + kvmppc_radix_tlbie_page(kvm, lgpa, PMD_SHIFT);
> + if (old & _PAGE_DIRTY) {
> + unsigned long gfn = lgpa >> PAGE_SHIFT;
> + struct kvm_memory_slot *memslot;
> + memslot = gfn_to_memslot(kvm, gfn);
> + if (memslot && memslot->dirty_bitmap)
> + kvmppc_update_dirty_map(memslot,
> + gfn, PMD_SIZE);
> + }
> + pmd_clear(pmd);
> + } else if (level = 1 && !pmd_none(*pmd)) {
> /*
> * There's a page table page here, but we wanted
> * to install a large page. Tell the caller and let
> @@ -412,28 +437,24 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
> } else {
> page = pages[0];
> pfn = page_to_pfn(page);
> - if (PageHuge(page)) {
> - page = compound_head(page);
> - pte_size <<= compound_order(page);
> + if (PageCompound(page)) {
> + pte_size <<= compound_order(compound_head(page));
> /* See if we can insert a 2MB large-page PTE here */
> if (pte_size >= PMD_SIZE &&
> - (gpa & PMD_MASK & PAGE_MASK) =
> - (hva & PMD_MASK & PAGE_MASK)) {
> + (gpa & (PMD_SIZE - PAGE_SIZE)) =
> + (hva & (PMD_SIZE - PAGE_SIZE))) {
> level = 1;
> pfn &= ~((PMD_SIZE >> PAGE_SHIFT) - 1);
> }
> }
> /* See if we can provide write access */
> if (writing) {
> - /*
> - * We assume gup_fast has set dirty on the host PTE.
> - */
> pgflags |= _PAGE_WRITE;
> } else {
> local_irq_save(flags);
> ptep = find_current_mm_pte(current->mm->pgd,
> hva, NULL, NULL);
> - if (ptep && pte_write(*ptep) && pte_dirty(*ptep))
> + if (ptep && pte_write(*ptep))
> pgflags |= _PAGE_WRITE;
> local_irq_restore(flags);
> }
> @@ -459,18 +480,15 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
> pte = pfn_pte(pfn, __pgprot(pgflags));
> ret = kvmppc_create_pte(kvm, pte, gpa, level, mmu_seq);
> }
> - if (ret = 0 || ret = -EAGAIN)
> - ret = RESUME_GUEST;
>
> if (page) {
> - /*
> - * We drop pages[0] here, not page because page might
> - * have been set to the head page of a compound, but
> - * we have to drop the reference on the correct tail
> - * page to match the get inside gup()
> - */
> - put_page(pages[0]);
> + if (!ret && (pgflags & _PAGE_WRITE))
> + set_page_dirty_lock(page);
> + put_page(page);
> }
> +
> + if (ret = 0 || ret = -EAGAIN)
> + ret = RESUME_GUEST;
> return ret;
> }
>
> @@ -644,7 +662,7 @@ void kvmppc_free_radix(struct kvm *kvm)
> continue;
> pmd = pmd_offset(pud, 0);
> for (im = 0; im < PTRS_PER_PMD; ++im, ++pmd) {
> - if (pmd_huge(*pmd)) {
> + if (pmd_is_leaf(*pmd)) {
> pmd_clear(pmd);
> continue;
> }
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 10+ messages in thread* [PATCH] KVM: PPC: Book3S HV: fix handling for interrupted H_ENTER_NESTED
@ 2018-11-09 3:27 ` Michael Roth
0 siblings, 0 replies; 10+ messages in thread
From: Michael Roth @ 2018-11-09 3:27 UTC (permalink / raw)
To: kvm; +Cc: linuxppc-dev, Suraj Jitindar Singh, kvm-ppc, David Gibson
While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a
pending signal in the L0 QEMU process can generate the following
sequence:
ret0 = kvmppc_pseries_do_hcall()
ret1 = kvmhv_enter_nested_guest()
ret2 = kvmhv_run_single_vcpu()
if (ret2 = -EINTR)
return H_INTERRUPT
if (ret1 = H_INTERRUPT)
kvmppc_set_gpr(vcpu, 3, 0)
return -EINTR
/* skipped: */
kvmppc_set_gpr(vcpu, 3, ret)
vcpu->arch.hcall_needed = 0
return RESUME_GUEST
which causes an exit to L0 userspace with ret0 = -EINTR.
The intention seems to be to set the hcall return value to 0 (via
VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED
once we resume executing the VCPU. However, because we don't set
vcpu->arch.hcall_needed = 0, we do the following once userspace
resumes execution via kvm_arch_vcpu_ioctl_run():
...
} else if (vcpu->arch.hcall_needed) {
int i
kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
for (i = 0; i < 9; ++i)
kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
vcpu->arch.hcall_needed = 0;
since vcpu->arch.hcall_needed = 1 indicates that userspace should
have handled the hcall and stored the return value in
run->papr_hcall.ret. Since that's not the case here, we can get an
unexpected value in VCPU r3, which can result in
kvmhv_p9_guest_entry() reporting an unexpected trap value when it
returns from H_ENTER_NESTED, causing the following register dump to
console via subsequent call to kvmppc_handle_exit_hv() in L1:
[ 350.612854] vcpu 00000000f9564cf8 (0):
[ 350.612915] pc = c00000000013eb98 msr = 8000000000009033 trap = 1
[ 350.613020] r 0 = c0000000004b9044 r16 = 0000000000000000
[ 350.613075] r 1 = c00000007cffba30 r17 = 0000000000000000
[ 350.613120] r 2 = c00000000178c100 r18 = 00007fffc24f3b50
[ 350.613166] r 3 = c00000007ef52480 r19 = 00007fffc24fff58
[ 350.613212] r 4 = 0000000000000000 r20 = 00000a1e96ece9d0
[ 350.613253] r 5 = 70616d00746f6f72 r21 = 00000a1ea117c9b0
[ 350.613295] r 6 = 0000000000000020 r22 = 00000a1ea1184360
[ 350.613338] r 7 = c0000000783be440 r23 = 0000000000000003
[ 350.613380] r 8 = fffffffffffffffc r24 = 00000a1e96e9e124
[ 350.613423] r 9 = c00000007ef52490 r25 = 00000000000007ff
[ 350.613469] r10 = 0000000000000004 r26 = c00000007eb2f7a0
[ 350.613513] r11 = b0616d0009eccdb2 r27 = c00000007cffbb10
[ 350.613556] r12 = c0000000004b9000 r28 = c00000007d83a2c0
[ 350.613597] r13 = c000000001b00000 r29 = c0000000783cdf68
[ 350.613639] r14 = 0000000000000000 r30 = 0000000000000000
[ 350.613681] r15 = 0000000000000000 r31 = c00000007cffbbf0
[ 350.613723] ctr = c0000000004b9000 lr = c0000000004b9044
[ 350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033
[ 350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000
[ 350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000
[ 350.613911] cr = 88002848 xer = 0000000020040000 dsisr = 42000000
[ 350.613962] dar = 0000772f95390000
[ 350.614031] fault dar = c000000244b278c0 dsisr = 00000000
[ 350.614073] SLB (0 entries):
[ 350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000 last_inst = ffffffff
[ 350.614252] trap=0x1 | pc=0xc00000000013eb98 | msr=0x8000000000009033
followed by L1's QEMU reporting the following before stopping execution
of the nested guest:
KVM: unknown exit, hardware reason 1
NIP c00000000013eb98 LR c0000000004b9044 CTR c0000000004b9000 XER 0000000020040000 CPU#0
MSR 8000000000009033 HID0 0000000000000000 HF 8000000000000000 iidx 3 didx 3
TB 00000000 00000000 DECR 00000000
GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100 c00000007ef52480
GPR04 0000000000000000 70616d00746f6f72 0000000000000020 c0000000783be440
GPR08 fffffffffffffffc c00000007ef52490 0000000000000004 b0616d0009eccdb2
GPR12 c0000000004b9000 c000000001b00000 0000000000000000 0000000000000000
GPR16 0000000000000000 0000000000000000 00007fffc24f3b50 00007fffc24fff58
GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360 0000000000000003
GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0 c00000007cffbb10
GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000 c00000007cffbbf0
CR 88002848 [ L L - - E L G L ] RES ffffffffffffffff
SRR0 0000772f954dd48c SRR1 800000000280f033 PVR 00000000004e1202 VRSAVE 0000000000000000
SPRG0 0000000000000000 SPRG1 c000000001b00000 SPRG2 0000772f9565a280 SPRG3 0000000000000000
SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000
HSRR0 0000000000000000 HSRR1 0000000000000000
CFAR 0000000000000000
LPCR 0000000003d40413
PTCR 0000000000000000 DAR 0000772f95390000 DSISR 0000000042000000
Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion
of H_ENTER_NESTED before we exit to L0 userspace.
Cc: linuxppc-dev@ozlabs.org
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
---
arch/powerpc/kvm/book3s_hv.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d65b961661fb..a56f8413758a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -983,6 +983,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
ret = kvmhv_enter_nested_guest(vcpu);
if (ret = H_INTERRUPT) {
kvmppc_set_gpr(vcpu, 3, 0);
+ vcpu->arch.hcall_needed = 0;
return -EINTR;
}
break;
--
2.17.1
^ permalink raw reply related [flat|nested] 10+ messages in thread* [PATCH] KVM: PPC: Book3S HV: fix handling for interrupted H_ENTER_NESTED
@ 2018-11-09 3:27 ` Michael Roth
0 siblings, 0 replies; 10+ messages in thread
From: Michael Roth @ 2018-11-09 3:27 UTC (permalink / raw)
To: kvm; +Cc: linuxppc-dev, Suraj Jitindar Singh, kvm-ppc, David Gibson
While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a
pending signal in the L0 QEMU process can generate the following
sequence:
ret0 = kvmppc_pseries_do_hcall()
ret1 = kvmhv_enter_nested_guest()
ret2 = kvmhv_run_single_vcpu()
if (ret2 == -EINTR)
return H_INTERRUPT
if (ret1 == H_INTERRUPT)
kvmppc_set_gpr(vcpu, 3, 0)
return -EINTR
/* skipped: */
kvmppc_set_gpr(vcpu, 3, ret)
vcpu->arch.hcall_needed = 0
return RESUME_GUEST
which causes an exit to L0 userspace with ret0 == -EINTR.
The intention seems to be to set the hcall return value to 0 (via
VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED
once we resume executing the VCPU. However, because we don't set
vcpu->arch.hcall_needed = 0, we do the following once userspace
resumes execution via kvm_arch_vcpu_ioctl_run():
...
} else if (vcpu->arch.hcall_needed) {
int i
kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
for (i = 0; i < 9; ++i)
kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
vcpu->arch.hcall_needed = 0;
since vcpu->arch.hcall_needed == 1 indicates that userspace should
have handled the hcall and stored the return value in
run->papr_hcall.ret. Since that's not the case here, we can get an
unexpected value in VCPU r3, which can result in
kvmhv_p9_guest_entry() reporting an unexpected trap value when it
returns from H_ENTER_NESTED, causing the following register dump to
console via subsequent call to kvmppc_handle_exit_hv() in L1:
[ 350.612854] vcpu 00000000f9564cf8 (0):
[ 350.612915] pc = c00000000013eb98 msr = 8000000000009033 trap = 1
[ 350.613020] r 0 = c0000000004b9044 r16 = 0000000000000000
[ 350.613075] r 1 = c00000007cffba30 r17 = 0000000000000000
[ 350.613120] r 2 = c00000000178c100 r18 = 00007fffc24f3b50
[ 350.613166] r 3 = c00000007ef52480 r19 = 00007fffc24fff58
[ 350.613212] r 4 = 0000000000000000 r20 = 00000a1e96ece9d0
[ 350.613253] r 5 = 70616d00746f6f72 r21 = 00000a1ea117c9b0
[ 350.613295] r 6 = 0000000000000020 r22 = 00000a1ea1184360
[ 350.613338] r 7 = c0000000783be440 r23 = 0000000000000003
[ 350.613380] r 8 = fffffffffffffffc r24 = 00000a1e96e9e124
[ 350.613423] r 9 = c00000007ef52490 r25 = 00000000000007ff
[ 350.613469] r10 = 0000000000000004 r26 = c00000007eb2f7a0
[ 350.613513] r11 = b0616d0009eccdb2 r27 = c00000007cffbb10
[ 350.613556] r12 = c0000000004b9000 r28 = c00000007d83a2c0
[ 350.613597] r13 = c000000001b00000 r29 = c0000000783cdf68
[ 350.613639] r14 = 0000000000000000 r30 = 0000000000000000
[ 350.613681] r15 = 0000000000000000 r31 = c00000007cffbbf0
[ 350.613723] ctr = c0000000004b9000 lr = c0000000004b9044
[ 350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033
[ 350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000
[ 350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000
[ 350.613911] cr = 88002848 xer = 0000000020040000 dsisr = 42000000
[ 350.613962] dar = 0000772f95390000
[ 350.614031] fault dar = c000000244b278c0 dsisr = 00000000
[ 350.614073] SLB (0 entries):
[ 350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000 last_inst = ffffffff
[ 350.614252] trap=0x1 | pc=0xc00000000013eb98 | msr=0x8000000000009033
followed by L1's QEMU reporting the following before stopping execution
of the nested guest:
KVM: unknown exit, hardware reason 1
NIP c00000000013eb98 LR c0000000004b9044 CTR c0000000004b9000 XER 0000000020040000 CPU#0
MSR 8000000000009033 HID0 0000000000000000 HF 8000000000000000 iidx 3 didx 3
TB 00000000 00000000 DECR 00000000
GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100 c00000007ef52480
GPR04 0000000000000000 70616d00746f6f72 0000000000000020 c0000000783be440
GPR08 fffffffffffffffc c00000007ef52490 0000000000000004 b0616d0009eccdb2
GPR12 c0000000004b9000 c000000001b00000 0000000000000000 0000000000000000
GPR16 0000000000000000 0000000000000000 00007fffc24f3b50 00007fffc24fff58
GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360 0000000000000003
GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0 c00000007cffbb10
GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000 c00000007cffbbf0
CR 88002848 [ L L - - E L G L ] RES ffffffffffffffff
SRR0 0000772f954dd48c SRR1 800000000280f033 PVR 00000000004e1202 VRSAVE 0000000000000000
SPRG0 0000000000000000 SPRG1 c000000001b00000 SPRG2 0000772f9565a280 SPRG3 0000000000000000
SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6 0000000000000000 SPRG7 0000000000000000
HSRR0 0000000000000000 HSRR1 0000000000000000
CFAR 0000000000000000
LPCR 0000000003d40413
PTCR 0000000000000000 DAR 0000772f95390000 DSISR 0000000042000000
Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion
of H_ENTER_NESTED before we exit to L0 userspace.
Cc: linuxppc-dev@ozlabs.org
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
---
arch/powerpc/kvm/book3s_hv.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d65b961661fb..a56f8413758a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -983,6 +983,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
ret = kvmhv_enter_nested_guest(vcpu);
if (ret == H_INTERRUPT) {
kvmppc_set_gpr(vcpu, 3, 0);
+ vcpu->arch.hcall_needed = 0;
return -EINTR;
}
break;
--
2.17.1
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH] KVM: PPC: Book3S HV: fix handling for interrupted H_ENTER_NESTED
2018-11-09 3:27 ` Michael Roth
@ 2018-11-14 3:09 ` Suraj Jitindar Singh
-1 siblings, 0 replies; 10+ messages in thread
From: Suraj Jitindar Singh @ 2018-11-14 3:09 UTC (permalink / raw)
To: Michael Roth, kvm; +Cc: linuxppc-dev, kvm-ppc, David Gibson
On Thu, 2018-11-08 at 21:27 -0600, Michael Roth wrote:
> While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a
> pending signal in the L0 QEMU process can generate the following
> sequence:
>
> ret0 = kvmppc_pseries_do_hcall()
> ret1 = kvmhv_enter_nested_guest()
> ret2 = kvmhv_run_single_vcpu()
> if (ret2 = -EINTR)
> return H_INTERRUPT
> if (ret1 = H_INTERRUPT)
> kvmppc_set_gpr(vcpu, 3, 0)
> return -EINTR
> /* skipped: */
> kvmppc_set_gpr(vcpu, 3, ret)
> vcpu->arch.hcall_needed = 0
> return RESUME_GUEST
>
> which causes an exit to L0 userspace with ret0 = -EINTR.
>
> The intention seems to be to set the hcall return value to 0 (via
> VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED
> once we resume executing the VCPU. However, because we don't set
> vcpu->arch.hcall_needed = 0, we do the following once userspace
> resumes execution via kvm_arch_vcpu_ioctl_run():
>
> ...
> } else if (vcpu->arch.hcall_needed) {
> int i
>
> kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
> for (i = 0; i < 9; ++i)
> kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
> vcpu->arch.hcall_needed = 0;
>
> since vcpu->arch.hcall_needed = 1 indicates that userspace should
> have handled the hcall and stored the return value in
> run->papr_hcall.ret. Since that's not the case here, we can get an
> unexpected value in VCPU r3, which can result in
> kvmhv_p9_guest_entry() reporting an unexpected trap value when it
> returns from H_ENTER_NESTED, causing the following register dump to
> console via subsequent call to kvmppc_handle_exit_hv() in L1:
>
> [ 350.612854] vcpu 00000000f9564cf8 (0):
> [ 350.612915] pc = c00000000013eb98 msr = 8000000000009033 trap
> = 1
> [ 350.613020] r 0 = c0000000004b9044 r16 = 0000000000000000
> [ 350.613075] r 1 = c00000007cffba30 r17 = 0000000000000000
> [ 350.613120] r 2 = c00000000178c100 r18 = 00007fffc24f3b50
> [ 350.613166] r 3 = c00000007ef52480 r19 = 00007fffc24fff58
> [ 350.613212] r 4 = 0000000000000000 r20 = 00000a1e96ece9d0
> [ 350.613253] r 5 = 70616d00746f6f72 r21 = 00000a1ea117c9b0
> [ 350.613295] r 6 = 0000000000000020 r22 = 00000a1ea1184360
> [ 350.613338] r 7 = c0000000783be440 r23 = 0000000000000003
> [ 350.613380] r 8 = fffffffffffffffc r24 = 00000a1e96e9e124
> [ 350.613423] r 9 = c00000007ef52490 r25 = 00000000000007ff
> [ 350.613469] r10 = 0000000000000004 r26 = c00000007eb2f7a0
> [ 350.613513] r11 = b0616d0009eccdb2 r27 = c00000007cffbb10
> [ 350.613556] r12 = c0000000004b9000 r28 = c00000007d83a2c0
> [ 350.613597] r13 = c000000001b00000 r29 = c0000000783cdf68
> [ 350.613639] r14 = 0000000000000000 r30 = 0000000000000000
> [ 350.613681] r15 = 0000000000000000 r31 = c00000007cffbbf0
> [ 350.613723] ctr = c0000000004b9000 lr = c0000000004b9044
> [ 350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033
> [ 350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000
> [ 350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000
> [ 350.613911] cr = 88002848 xer = 0000000020040000 dsisr > 42000000
> [ 350.613962] dar = 0000772f95390000
> [ 350.614031] fault dar = c000000244b278c0 dsisr = 00000000
> [ 350.614073] SLB (0 entries):
> [ 350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000
> last_inst = ffffffff
> [ 350.614252] trap=0x1 | pc=0xc00000000013eb98 |
> msr=0x8000000000009033
>
> followed by L1's QEMU reporting the following before stopping
> execution
> of the nested guest:
>
> KVM: unknown exit, hardware reason 1
> NIP c00000000013eb98 LR c0000000004b9044 CTR c0000000004b9000 XER
> 0000000020040000 CPU#0
> MSR 8000000000009033 HID0 0000000000000000 HF 8000000000000000
> iidx 3 didx 3
> TB 00000000 00000000 DECR 00000000
> GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100
> c00000007ef52480
> GPR04 0000000000000000 70616d00746f6f72 0000000000000020
> c0000000783be440
> GPR08 fffffffffffffffc c00000007ef52490 0000000000000004
> b0616d0009eccdb2
> GPR12 c0000000004b9000 c000000001b00000 0000000000000000
> 0000000000000000
> GPR16 0000000000000000 0000000000000000 00007fffc24f3b50
> 00007fffc24fff58
> GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360
> 0000000000000003
> GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0
> c00000007cffbb10
> GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000
> c00000007cffbbf0
> CR 88002848 [ L L - - E L G L ] RES
> ffffffffffffffff
> SRR0 0000772f954dd48c SRR1 800000000280f033 PVR
> 00000000004e1202 VRSAVE 0000000000000000
> SPRG0 0000000000000000 SPRG1 c000000001b00000 SPRG2
> 0000772f9565a280 SPRG3 0000000000000000
> SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6
> 0000000000000000 SPRG7 0000000000000000
> HSRR0 0000000000000000 HSRR1 0000000000000000
> CFAR 0000000000000000
> LPCR 0000000003d40413
> PTCR 0000000000000000 DAR 0000772f95390000 DSISR
> 0000000042000000
>
> Fix this by setting vcpu->arch.hcall_needed = 0 to indicate
> completion
> of H_ENTER_NESTED before we exit to L0 userspace.
Nice Catch :)
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
>
> Cc: linuxppc-dev@ozlabs.org
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Paul Mackerras <paulus@ozlabs.org>
> Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
> ---
> arch/powerpc/kvm/book3s_hv.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c
> b/arch/powerpc/kvm/book3s_hv.c
> index d65b961661fb..a56f8413758a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -983,6 +983,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu
> *vcpu)
> ret = kvmhv_enter_nested_guest(vcpu);
> if (ret = H_INTERRUPT) {
> kvmppc_set_gpr(vcpu, 3, 0);
> + vcpu->arch.hcall_needed = 0;
> return -EINTR;
> }
> break;
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] KVM: PPC: Book3S HV: fix handling for interrupted H_ENTER_NESTED
@ 2018-11-14 3:09 ` Suraj Jitindar Singh
0 siblings, 0 replies; 10+ messages in thread
From: Suraj Jitindar Singh @ 2018-11-14 3:09 UTC (permalink / raw)
To: Michael Roth, kvm; +Cc: linuxppc-dev, kvm-ppc, David Gibson
On Thu, 2018-11-08 at 21:27 -0600, Michael Roth wrote:
> While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a
> pending signal in the L0 QEMU process can generate the following
> sequence:
>
> ret0 = kvmppc_pseries_do_hcall()
> ret1 = kvmhv_enter_nested_guest()
> ret2 = kvmhv_run_single_vcpu()
> if (ret2 == -EINTR)
> return H_INTERRUPT
> if (ret1 == H_INTERRUPT)
> kvmppc_set_gpr(vcpu, 3, 0)
> return -EINTR
> /* skipped: */
> kvmppc_set_gpr(vcpu, 3, ret)
> vcpu->arch.hcall_needed = 0
> return RESUME_GUEST
>
> which causes an exit to L0 userspace with ret0 == -EINTR.
>
> The intention seems to be to set the hcall return value to 0 (via
> VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED
> once we resume executing the VCPU. However, because we don't set
> vcpu->arch.hcall_needed = 0, we do the following once userspace
> resumes execution via kvm_arch_vcpu_ioctl_run():
>
> ...
> } else if (vcpu->arch.hcall_needed) {
> int i
>
> kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
> for (i = 0; i < 9; ++i)
> kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
> vcpu->arch.hcall_needed = 0;
>
> since vcpu->arch.hcall_needed == 1 indicates that userspace should
> have handled the hcall and stored the return value in
> run->papr_hcall.ret. Since that's not the case here, we can get an
> unexpected value in VCPU r3, which can result in
> kvmhv_p9_guest_entry() reporting an unexpected trap value when it
> returns from H_ENTER_NESTED, causing the following register dump to
> console via subsequent call to kvmppc_handle_exit_hv() in L1:
>
> [ 350.612854] vcpu 00000000f9564cf8 (0):
> [ 350.612915] pc = c00000000013eb98 msr = 8000000000009033 trap
> = 1
> [ 350.613020] r 0 = c0000000004b9044 r16 = 0000000000000000
> [ 350.613075] r 1 = c00000007cffba30 r17 = 0000000000000000
> [ 350.613120] r 2 = c00000000178c100 r18 = 00007fffc24f3b50
> [ 350.613166] r 3 = c00000007ef52480 r19 = 00007fffc24fff58
> [ 350.613212] r 4 = 0000000000000000 r20 = 00000a1e96ece9d0
> [ 350.613253] r 5 = 70616d00746f6f72 r21 = 00000a1ea117c9b0
> [ 350.613295] r 6 = 0000000000000020 r22 = 00000a1ea1184360
> [ 350.613338] r 7 = c0000000783be440 r23 = 0000000000000003
> [ 350.613380] r 8 = fffffffffffffffc r24 = 00000a1e96e9e124
> [ 350.613423] r 9 = c00000007ef52490 r25 = 00000000000007ff
> [ 350.613469] r10 = 0000000000000004 r26 = c00000007eb2f7a0
> [ 350.613513] r11 = b0616d0009eccdb2 r27 = c00000007cffbb10
> [ 350.613556] r12 = c0000000004b9000 r28 = c00000007d83a2c0
> [ 350.613597] r13 = c000000001b00000 r29 = c0000000783cdf68
> [ 350.613639] r14 = 0000000000000000 r30 = 0000000000000000
> [ 350.613681] r15 = 0000000000000000 r31 = c00000007cffbbf0
> [ 350.613723] ctr = c0000000004b9000 lr = c0000000004b9044
> [ 350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033
> [ 350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000
> [ 350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000
> [ 350.613911] cr = 88002848 xer = 0000000020040000 dsisr =
> 42000000
> [ 350.613962] dar = 0000772f95390000
> [ 350.614031] fault dar = c000000244b278c0 dsisr = 00000000
> [ 350.614073] SLB (0 entries):
> [ 350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000
> last_inst = ffffffff
> [ 350.614252] trap=0x1 | pc=0xc00000000013eb98 |
> msr=0x8000000000009033
>
> followed by L1's QEMU reporting the following before stopping
> execution
> of the nested guest:
>
> KVM: unknown exit, hardware reason 1
> NIP c00000000013eb98 LR c0000000004b9044 CTR c0000000004b9000 XER
> 0000000020040000 CPU#0
> MSR 8000000000009033 HID0 0000000000000000 HF 8000000000000000
> iidx 3 didx 3
> TB 00000000 00000000 DECR 00000000
> GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100
> c00000007ef52480
> GPR04 0000000000000000 70616d00746f6f72 0000000000000020
> c0000000783be440
> GPR08 fffffffffffffffc c00000007ef52490 0000000000000004
> b0616d0009eccdb2
> GPR12 c0000000004b9000 c000000001b00000 0000000000000000
> 0000000000000000
> GPR16 0000000000000000 0000000000000000 00007fffc24f3b50
> 00007fffc24fff58
> GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360
> 0000000000000003
> GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0
> c00000007cffbb10
> GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000
> c00000007cffbbf0
> CR 88002848 [ L L - - E L G L ] RES
> ffffffffffffffff
> SRR0 0000772f954dd48c SRR1 800000000280f033 PVR
> 00000000004e1202 VRSAVE 0000000000000000
> SPRG0 0000000000000000 SPRG1 c000000001b00000 SPRG2
> 0000772f9565a280 SPRG3 0000000000000000
> SPRG4 0000000000000000 SPRG5 0000000000000000 SPRG6
> 0000000000000000 SPRG7 0000000000000000
> HSRR0 0000000000000000 HSRR1 0000000000000000
> CFAR 0000000000000000
> LPCR 0000000003d40413
> PTCR 0000000000000000 DAR 0000772f95390000 DSISR
> 0000000042000000
>
> Fix this by setting vcpu->arch.hcall_needed = 0 to indicate
> completion
> of H_ENTER_NESTED before we exit to L0 userspace.
Nice Catch :)
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
>
> Cc: linuxppc-dev@ozlabs.org
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Paul Mackerras <paulus@ozlabs.org>
> Cc: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
> ---
> arch/powerpc/kvm/book3s_hv.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c
> b/arch/powerpc/kvm/book3s_hv.c
> index d65b961661fb..a56f8413758a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -983,6 +983,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu
> *vcpu)
> ret = kvmhv_enter_nested_guest(vcpu);
> if (ret == H_INTERRUPT) {
> kvmppc_set_gpr(vcpu, 3, 0);
> + vcpu->arch.hcall_needed = 0;
> return -EINTR;
> }
> break;
^ permalink raw reply [flat|nested] 10+ messages in thread