* Intermittent guest kernel crashes with v4.5-rc6.
From: Shanker Donthineni @ 2016-03-02 13:56 UTC
To: Marc Zyngier, kvmarm
For some reason, the v4.5-rc6 kernel is not stable for guest machines on
Qualcomm server platforms. We are getting IABT translation faults while
booting the guest kernel. The problem disappears with the following code
snippet (inserting a "dsb ish" instruction just before switching to the
EL1 guest). I am using the v4.5-rc6 kernel for both host and guest machines.
Please let me know if you have any thoughts or ideas for tracing this
problem.
--- a/arch/arm64/kvm/hyp/entry.S
+++ b/arch/arm64/kvm/hyp/entry.S
@@ -88,6 +88,7 @@ ENTRY(__guest_enter)
        ldp     x0, x1, [sp], #16

        // Do not touch any register after this!
+       dsb ish
        eret
 ENDPROC(__guest_enter)
I am using the QEMU command below to launch the guest machine:
qemu-system-aarch64 -machine type=virt,accel=kvm,gic-version=3 \
-cpu "host" -smp cpus=1,maxcpus=1 -m 256M -serial stdio \
-kernel /boot/Image -initrd /boot/rootfs.cpio.gz \
-append 'earlycon=earlycon=pl011,0x09000000 \
console=ttyAMA0,115200 root=/dev/ram'
Guest machine crash log messages:
[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Boot CPU: AArch64 Processor [510f2811]
[ 0.000000] Bad mode in Synchronous Abort handler detected, code 0x8600000f -- IABT (current EL)
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.5.rc6+
[ 0.000000] task: ffffffc000d52200 ti: ffffffc000d44000 task.ti: ffffffc000d44000
[ 0.000000] PC is at early_init_dt_scan_root+0x28/0x94
[ 0.000000] LR is at of_scan_flat_dt+0x9c/0xd0
[ 0.000000] pc : [<ffffffc000cb32e8>] lr : [<ffffffc000cb3248>] pstate: 800003c5
[ 0.000000] sp : ffffffc000d47e80
[ 0.000000] x29: ffffffc000d47e80 x28: 0000000000000000
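For reference, the syndrome code in the log above can be unpacked field by field. The sketch below is a minimal Python illustration, not something taken from the kernel sources: it assumes the ESR_ELx layout from the Arm architecture (EC in bits [31:26], IL in bit 25, fault status code in bits [5:0]) and only covers abort syndromes. The names `decode_esr`, `EC_NAMES` and `FAULT_KINDS` are hypothetical and chosen for this example.

```python
# Minimal ESR_ELx decoder for abort syndromes (illustrative names, not kernel code).
EC_NAMES = {
    0x20: "instruction abort from a lower exception level",
    0x21: "instruction abort without a change in exception level",
    0x24: "data abort from a lower exception level",
    0x25: "data abort without a change in exception level",
}

# The upper four bits of the 6-bit fault status code select the fault type;
# the lower two bits give the translation table level.
FAULT_KINDS = {
    0b0000: "address size fault",
    0b0001: "translation fault",
    0b0010: "access flag fault",
    0b0011: "permission fault",
}


def decode_esr(esr):
    """Split a 32-bit ESR_ELx value into EC, IL and fault status fields."""
    ec = (esr >> 26) & 0x3F   # exception class, bits [31:26]
    il = (esr >> 25) & 0x1    # instruction length bit, bit 25
    fsc = esr & 0x3F          # fault status code, bits [5:0]
    kind = FAULT_KINDS.get(fsc >> 2)
    if kind is not None:
        status = f"{kind}, level {fsc & 0x3}"
    else:
        status = f"other (0x{fsc:02x})"
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "other"),
        "il": il,
        "fsc": fsc,
        "status": status,
    }


if __name__ == "__main__":
    # Value taken from the "Bad mode" line in the crash log.
    print(decode_esr(0x8600000F))
```

Under those assumptions, 0x8600000f decodes to EC 0x21 (an instruction abort taken without a change in exception level, which matches "IABT (current EL)") with fault status 0x0f. That status value is listed in the encoding as a level 3 permission fault rather than a translation fault, which may be worth keeping in mind when comparing against the page table attributes.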
--
Shanker Donthineni
Qualcomm Technologies, Inc. on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
* Re: Intermittent guest kernel crashes with v4.5-rc6.
From: Marc Zyngier @ 2016-03-02 14:16 UTC
To: Shanker Donthineni, kvmarm
If you're getting a prefetch abort, it would be interesting to find out
what instruction is there, whether the page is mapped at stage-2 or not,
what are the stage-2 permissions... Basically, a full description of the
memory state.
Also, does it work if you do a "dsb ishst" instead?
Thanks,
M.
--
Jazz is not dead. It just smells funny...
* Re: Intermittent guest kernel crashes with v4.5-rc6.
From: Marc Zyngier @ 2016-03-02 14:48 UTC
To: Shanker Donthineni, kvmarm
On 02/03/16 13:56, Shanker Donthineni wrote:
>
> For some reason, the v4.5-rc6 kernel is not stable for guest machines on
> Qualcomm server platforms. We are getting IABT translation faults while
> booting the guest kernel. The problem disappears with the following code
> snippet (inserting a "dsb ish" instruction just before switching to the
> EL1 guest). I am using the v4.5-rc6 kernel for both host and guest machines.
>
> Please let me know if you have any thoughts or ideas for tracing this
> problem.
Another thing you can try is to find out how far up you can move that dsb,
up to the point where it starts crashing again.
Also, how reproducible is it? Every time? Randomly?
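The answer to the reproducibility question also determines how long a candidate fix has to be tested before a run of clean boots means anything. A rough Python sketch follows, assuming that boots are independent and that an unfixed system crashes with a fixed probability per boot. The 10% figure used in the example is made up for illustration and is not a measurement from this thread, and `clean_boots_needed` is a hypothetical helper.

```python
import math


def clean_boots_needed(crash_rate, alpha=0.05):
    """Consecutive clean boots needed before a streak is unlikely to be luck.

    crash_rate: assumed per-boot crash probability without the fix (0 < p < 1).
    alpha: accepted chance of seeing such a streak although nothing changed.
    """
    if not 0.0 < crash_rate < 1.0:
        raise ValueError("crash_rate must be strictly between 0 and 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    # P(n clean boots | still broken) = (1 - p)^n; solve (1 - p)^n <= alpha.
    return math.ceil(math.log(alpha) / math.log(1.0 - crash_rate))


if __name__ == "__main__":
    # Hypothetical example: one boot in ten crashes without the barrier.
    print(clean_boots_needed(0.10))
```

The rarer the crash, the longer the streak has to be, so a barrier that "fixes" a rare failure after only a handful of boots is weak evidence on its own.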
Thanks,
M.
* Re: Intermittent guest kernel crashes with v4.5-rc6.
From: Shanker Donthineni @ 2016-03-02 14:59 UTC
To: Marc Zyngier, kvmarm
Hi Marc,
Thanks for your quick reply.
On 03/02/2016 08:16 AM, Marc Zyngier wrote:
> If you're getting a prefetch abort, it would be interesting to find out
> what instruction is there, whether the page is mapped at stage-2 or not,
> what are the stage-2 permissions... Basically, a full description of the
> memory state.
>
> Also, does it work if you do a "dsb ishst" instead?
>
> Thanks,
>
> M.
Most of the time it faults at ldr/str instructions. I have verified the
stage-1 page and the corresponding stage-2 page attributes (SH, AP, PERM),
PA, etc. after the IABT, and everything matches perfectly. I am very
confident that the stage-1/stage-2 MMU page tables are correct.
Using "dsb ishst" instead also fixes the problem.
One more interesting observation: if I retry the instruction fetch that
caused the IABT, the second fetch succeeds and I don't see the IABT. I used
the experimental code below to test.
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -346,6 +346,7 @@ el1_sync:
        b.eq    el1_undef
        cmp     x24, #ESR_ELx_EC_BREAKPT_CUR    // debug exception in EL1
        b.ge    el1_dbg
+       kernel_exit 1
        b       el1_inv
 el1_da:
* Re: Intermittent guest kernel crashes with v4.5-rc6.
From: Marc Zyngier @ 2016-03-02 15:09 UTC
To: Shanker Donthineni, kvmarm
On 02/03/16 14:59, Shanker Donthineni wrote:
> One more interesting observation: if I retry the instruction fetch that
> caused the IABT, the second fetch succeeds and I don't see the IABT. I used
> the experimental code below to test.
>
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -346,6 +346,7 @@ el1_sync:
> b.eq el1_undef
> cmp x24, #ESR_ELx_EC_BREAKPT_CUR // debug exception in EL1
> b.ge el1_dbg
> + kernel_exit 1
> b el1_inv
> el1_da:
>
>
OK, that's pretty scary, especially considering that we don't have a DSB
on that path. Do you ever see it exploding at EL0?
Thanks,
M.
* Re: Intermittent guest kernel crashes with v4.5-rc6.
From: Shanker Donthineni @ 2016-03-02 15:48 UTC
To: Marc Zyngier, kvmarm
On 03/02/2016 09:09 AM, Marc Zyngier wrote:
> OK, that's pretty scary, especially considering that we don't have a DSB
> on that path. Do you ever see it exploding at EL0?
>
> Thanks,
>
> M.
We haven't started running heavy workloads in VMs. So far we
have noticed this random behavior only during guest kernel
boot (at EL1).
We didn't see this problem on the 4.3 kernel. Do you think it is
related to TLB conflicts?
* Re: Intermittent guest kernel crashes with v4.5-rc6.
From: Marc Zyngier @ 2016-03-02 17:35 UTC
To: Shanker Donthineni, kvmarm
On 02/03/16 15:48, Shanker Donthineni wrote:
> We haven't started running heavy workloads in VMs. So far we
> have noticed this random behavior only during guest kernel
> boot (at EL1).
>
> We didn't see this problem on the 4.3 kernel. Do you think it is
> related to TLB conflicts?
I cannot imagine why a DSB would solve a TLB conflict. But the fact that
you didn't see it crashing on 4.3 is a good indication that something
else is at play.
In 4.5, we've rewritten a large part of KVM in C, which has changed the
ordering of the various accesses a lot. It could be that a latent
problem is now exposed more widely.
Can you try moving this DSB around and find out what is the earliest
point where it solves this problem? Some sort of bisection?
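The bisection idea can be sketched as a binary search over candidate barrier positions along the guest entry path. This is a minimal Python illustration under stated assumptions: `earliest_working` and the `works` callback are hypothetical, positions are assumed to be ordered from earliest to latest, and the search assumes that once the barrier helps at one position it also helps at every later one. Because the crash is intermittent, `works` would in practice have to wrap many guest boots per position rather than a single one.

```python
def earliest_working(positions, works):
    """Return the earliest position for which works(position) is true.

    positions: candidate barrier locations, ordered earliest to latest.
    works: callable returning True if the guest boots reliably with the
           barrier placed at the given position (hypothetical test hook).
    Returns None if the barrier does not help even at the last position.
    """
    if not positions or not works(positions[-1]):
        return None
    lo, hi = 0, len(positions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if works(positions[mid]):
            hi = mid        # still effective here: try an earlier position
        else:
            lo = mid + 1    # crash is back: the barrier must come later
    return positions[lo]


if __name__ == "__main__":
    # Toy stand-in for real boot tests: pretend positions 5..7 are stable.
    print(earliest_working(list(range(8)), lambda pos: pos >= 5))
```

If the monotonic assumption does not hold, the result is only one boundary among possibly several, so it is worth re-testing the positions on either side of the reported one.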
Thanks,
M.
* Re: Intermittent guest kernel crashes with v4.5-rc6.
From: Shanker Donthineni @ 2016-03-03 13:25 UTC
To: Marc Zyngier, kvmarm
On 03/02/2016 11:35 AM, Marc Zyngier wrote:
> Can you try moving this DSB around and find out what is the earliest
> point where it solves this problem? Some sort of bisection?
The furthest up I can move the 'dsb ishst' is to the beginning of
__guest_enter(), but not outside of this function.
I don't understand why it fails with the code below; the branch
instruction seems to be causing problems.
        /* Jump in the fire! */
+       dsb(ishst);
        exit_code = __guest_enter(vcpu, host_ctxt);
        /* And we're baaack! */
* Re: Intermittent guest kernel crashes with v4.5-rc6.
From: Marc Zyngier @ 2016-03-03 14:03 UTC
To: Shanker Donthineni, kvmarm
On 03/03/16 13:25, Shanker Donthineni wrote:
> The furthest up I can move the 'dsb ishst' is to the beginning of
> __guest_enter(), but not outside of this function.
>
> I don't understand why it fails with the code below; the branch
> instruction seems to be causing problems.
>
> /* Jump in the fire! */
> + dsb(ishst);
> exit_code = __guest_enter(vcpu, host_ctxt);
> /* And we're baaack! */
That's very worrying. I can't see how the branch can have an influence
on the DSB (nor why the DSB has an influence on the rest of the
execution, btw).
What if you replace the DSB with an ISB? Do you observe a similar
behaviour (works if the barrier is in __guest_enter, but not if it is
outside)?
Another thing worth looking at is what happened just before we decided
to get back into the guest. Or to put it differently, what was the
reason to exit in the first place. Was it a Stage-2 fault by any chance?
Thanks,
M.
* Re: Intermittent guest kernel crashes with v4.5-rc6.
From: Shanker Donthineni @ 2016-03-03 14:26 UTC
To: Marc Zyngier, kvmarm, shankerd
On 03/03/2016 08:03 AM, Marc Zyngier wrote:
> That's very worrying. I can't see how the branch can have an influence
> on the DSB (nor why the DSB has an influence on the rest of the
> execution, btw).
>
> What if you replace the DSB with an ISB? Do you observe a similar
> behaviour (works if the barrier is in __guest_enter, but not if it is
> outside)?
I have already tried an isb, without success. In another experiment,
I flushed the stage-2 TLBs before calling __guest_enter(), and that
fixed the problem.
> Another thing worth looking at is what happened just before we decided
> to get back into the guest. Or to put it differently, what was the
> reason to exit in the first place. Was it a Stage-2 fault by any chance?
I will collect as much debug data as possible and send you the
results. I went through your refactored KVM C code and did not
find anything suspicious. I suspect that Qualcomm CPUs may have
very aggressive prefetch logic that is causing the problem.
* Re: Intermittent guest kernel crashes with v4.5-rc6.
From: Marc Zyngier @ 2016-03-03 14:38 UTC
To: Shanker Donthineni, kvmarm
On 03/03/16 14:26, Shanker Donthineni wrote:
> I have already tried an isb, without success. In another experiment,
> I flushed the stage-2 TLBs before calling __guest_enter(), and that
> fixed the problem.
I suspected something like that. But it is such a massive hammer that it
will hide any sort of subtle bug (HW *and* SW).
>
>> Another thing worth looking at is what happened just before we decided
>> to get back into the guest. Or to put it differently, what was the
>> reason to exit in the first place. Was it a Stage-2 fault by any chance?
>
> I will collect as much debug data as possible and send you the
> results. I went through your refactored KVM C code and did not
> find anything suspicious. I suspect that Qualcomm CPUs may have
> very aggressive prefetch logic that is causing the problem.
OK. Please keep me posted about your findings. Also, maybe involving some
HW people would be a good idea (running something in an emulator, for
example...).
Thanks,
M.
* Re: Intermittent guest kernel crashes with v4.5-rc6.
From: Christopher Covington @ 2016-04-18 15:56 UTC
To: Shanker Donthineni, Marc Zyngier, kvmarm
On 03/07/2016 10:36 PM, Shanker Donthineni wrote:
> On 03/03/2016 08:38 AM, Marc Zyngier wrote:
>> OK. Please keep me posted about your findings. Also, maybe involving some
>> HW people would be a good idea (running something in an emulator, for
>> example...).
This has been confirmed to be a hardware defect with a firmware workaround.
Regards,
Christopher Covington
--
Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
* Re: Intermittent guest kernel crashes with v4.5-rc6.
From: Marc Zyngier @ 2016-04-18 16:00 UTC
To: Christopher Covington, Shanker Donthineni, kvmarm
On 18/04/16 16:56, Christopher Covington wrote:
>>> OK. Please keep me posted about your findings. Also, maybe involving some
>>> HW people would be a good idea (running something in an emulator, for
>>> example...).
>
> This has been confirmed to be a hardware defect with a firmware workaround.
Good to know. Hopefully performance is not too much impacted by the
workaround.
Thanks,
M.