* [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot
From: zhanghailiang @ 2015-07-06  7:54 UTC
To: kvm; +Cc: peter.huangpeng, qemu-devel@nongnu.org

Hi,

Recently we encountered a problem in our project: 2 CPUs in a VM are not
brought up normally after reboot.

Our host is using KVM kmod 3.6 and QEMU 2.1.
The guest is a SLES 11 sp3 VM configured with 8 vcpus, and the cpu model is
configured as 'host-passthrough'.

After the VM was started for the first time, everything seemed to be OK.
Then the VM panicked and rebooted.
After the reboot, only 6 cpus were brought up in the VM; cpu1 and cpu7 are not online.

This is the only message we can get from the VM.
VM dmesg shows:
[    0.069867] Booting Node   0, Processors  #1
[    5.060042] CPU1: Stuck ??
[    5.060499]  #2
[    5.088322] kvm-clock: cpu 2, msr 6:3fc90901, secondary cpu clock
[    5.088335] KVM setup async PF for cpu 2
[    5.092967] NMI watchdog enabled, takes one hw-pmu counter.
[    5.094405]  #3
[    5.108324] kvm-clock: cpu 3, msr 6:3fcd0901, secondary cpu clock
[    5.108333] KVM setup async PF for cpu 3
[    5.113553] NMI watchdog enabled, takes one hw-pmu counter.
[    5.114970]  #4
[    5.128325] kvm-clock: cpu 4, msr 6:3fd10901, secondary cpu clock
[    5.128336] KVM setup async PF for cpu 4
[    5.134576] NMI watchdog enabled, takes one hw-pmu counter.
[    5.135998]  #5
[    5.152324] kvm-clock: cpu 5, msr 6:3fd50901, secondary cpu clock
[    5.152334] KVM setup async PF for cpu 5
[    5.154764] NMI watchdog enabled, takes one hw-pmu counter.
[    5.156467]  #6
[    5.172327] kvm-clock: cpu 6, msr 6:3fd90901, secondary cpu clock
[    5.172341] KVM setup async PF for cpu 6
[    5.180738] NMI watchdog enabled, takes one hw-pmu counter.
[    5.182173]  #7 Ok.
[   10.170815] CPU7: Stuck ??
[   10.171648] Brought up 6 CPUs
[   10.172394] Total of 6 processors activated (28799.97 BogoMIPS).

From the host, we found that the QEMU vcpu1 and vcpu7 threads were not
consuming any cpu (they should be in idle state).
All of the VCPUs' stacks in the host look like below:

[<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
[<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
[<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
[<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
[<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
[<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
[<ffffffff81468092>] system_call_fastpath+0x16/0x1b
[<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
[<ffffffffffffffff>] 0xffffffffffffffff

We looked into the kernel code that could lead to the above 'Stuck' warning,
and found that the only possibility is that the emulation of the 'cpuid'
instruction in kvm/qemu has something wrong.
But since we can't reproduce this problem, we are not quite sure.
Is there any possibility that the cpuid emulation in kvm/qemu has some bug?

Has anyone come across this problem before? Or any ideas?

Thanks,
zhanghailiang
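A side note on the dmesg timestamps in the report above: each 'Stuck' line follows the previous log line by roughly 5 seconds, which would be consistent with the BSP giving up on each AP after a fixed timeout. The following is only a minimal sketch to make that arithmetic explicit; the excerpt is copied from the report, and the helper names (parse, stall_before_stuck) are hypothetical.

```python
import re

# Excerpt copied from the guest dmesg in the report above (not the full log).
DMESG = """\
[    0.069867] Booting Node   0, Processors  #1
[    5.060042] CPU1: Stuck ??
[    5.060499]  #2
[    5.182173]  #7 Ok.
[   10.170815] CPU7: Stuck ??
[   10.171648] Brought up 6 CPUs
"""

LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s*(.*)")
STUCK_RE = re.compile(r"CPU(\d+): Stuck")


def parse(dmesg):
    """Return a list of (timestamp_seconds, message) tuples."""
    entries = []
    for line in dmesg.splitlines():
        m = LINE_RE.match(line)
        if m:
            entries.append((float(m.group(1)), m.group(2)))
    return entries


def stall_before_stuck(entries):
    """Map each stuck CPU number to the gap (s) since the previous log line."""
    gaps = {}
    for (prev_ts, _), (ts, msg) in zip(entries, entries[1:]):
        m = STUCK_RE.match(msg)
        if m:
            gaps[int(m.group(1))] = round(ts - prev_ts, 3)
    return gaps


if __name__ == "__main__":
    for cpu, gap in stall_before_stuck(parse(DMESG)).items():
        print(f"CPU{cpu}: {gap} s before 'Stuck' was printed")
```

The gap only shows how long the guest waited before reporting each AP; it says nothing about why the AP did not come up.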
* Re: [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot
From: Paolo Bonzini @ 2015-07-06  8:45 UTC
To: zhanghailiang, kvm; +Cc: peter.huangpeng, qemu-devel@nongnu.org

On 06/07/2015 09:54, zhanghailiang wrote:
>
> From the host, we found that the QEMU vcpu1 and vcpu7 threads were not
> consuming any cpu (they should be in idle state).
> All of the VCPUs' stacks in the host look like below:
>
> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b
> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> We looked into the kernel code that could lead to the above 'Stuck'
> warning,
> and found that the only possibility is that the emulation of the 'cpuid'
> instruction in kvm/qemu has something wrong.
> But since we can't reproduce this problem, we are not quite sure.
> Is there any possibility that the cpuid emulation in kvm/qemu has some bug?

Can you explain the relationship to the cpuid emulation?  What do the
traces say about vcpus 1 and 7?

Paolo
* Re: [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot
From: zhanghailiang @ 2015-07-06  9:59 UTC
To: Paolo Bonzini, kvm; +Cc: peter.huangpeng, qemu-devel@nongnu.org

On 2015/7/6 16:45, Paolo Bonzini wrote:
>
>
> On 06/07/2015 09:54, zhanghailiang wrote:
>>
>> From the host, we found that the QEMU vcpu1 and vcpu7 threads were not
>> consuming any cpu (they should be in idle state).
>> All of the VCPUs' stacks in the host look like below:
>>
>> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
>> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
>> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
>> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
>> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
>> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
>> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b
>> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> We looked into the kernel code that could lead to the above 'Stuck'
>> warning,
>> and found that the only possibility is that the emulation of the 'cpuid'
>> instruction in kvm/qemu has something wrong.
>> But since we can't reproduce this problem, we are not quite sure.
>> Is there any possibility that the cpuid emulation in kvm/qemu has some bug?
>
> Can you explain the relationship to the cpuid emulation?  What do the
> traces say about vcpus 1 and 7?

OK, we searched the VM's kernel code for the 'Stuck' message, and it is located in
do_boot_cpu(). It runs in BSP context; the call path is:
BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu() -> wakeup_secondary_cpu_via_init() to trigger the APs.
It will wait 5s for the APs to start up; if some AP does not start up normally, it will print 'CPU%d Stuck' or 'CPU%d: Not responding'.

If it prints 'Stuck', it means the AP has received the SIPI interrupt and has begun to execute the code at
'ENTRY(trampoline_data)' (trampoline_64.S), but got stuck somewhere before smp_callin() (smpboot.c).
The following is the startup process of the BSP and the APs.

BSP:
start_kernel()
  ->smp_init()
     ->smp_boot_cpus()
       ->do_boot_cpu()
           ->start_ip = trampoline_address(); // set the address that the AP will start executing at
           ->wakeup_secondary_cpu_via_init(); // kick the secondary CPU
           ->for (timeout = 0; timeout < 50000; timeout++)
               if (cpumask_test_cpu(cpu, cpu_callin_mask)) break; // check whether the AP has started up

APs:
ENTRY(trampoline_data) (trampoline_64.S)
  ->ENTRY(secondary_startup_64) (head_64.S)
     ->start_secondary() (smpboot.c)
        ->cpu_init();
        ->smp_callin();
            ->cpumask_set_cpu(cpuid, cpu_callin_mask); // Note: if the AP gets here, the BSP will not print the error message.

From the above call path, we can be sure that the AP got stuck between trampoline_data and the cpumask_set_cpu() in
smp_callin(). We looked through this code path carefully, and only found a 'hlt' instruction that could block the process.
It is located in trampoline_data():

ENTRY(trampoline_data)
         ...

	call	verify_cpu		# Verify the cpu supports long mode
	testl   %eax, %eax		# Check for return code
	jnz	no_longmode

         ...

no_longmode:
	hlt
	jmp no_longmode

For verify_cpu(), the only sensitive instruction we can find that could cause a VM exit from non-root mode is 'cpuid'.
This is why we suspect that the cpuid emulation in KVM/QEMU is wrong, leading to the failure in verify_cpu.

From the messages in the VM, we know that something is wrong with vcpu1 and vcpu7.
[    5.060042] CPU1: Stuck ??
[   10.170815] CPU7: Stuck ??
[   10.171648] Brought up 6 CPUs

Besides, the following is the cpu information obtained from the host.
80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh qemu-monitor-command instance-0000000
* CPU #0: pc=0x00007f64160c683d thread_id=68570
   CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573
   CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575
   CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576
   CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577
   CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578
   CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583
   CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584

Oh, I also forgot to mention in the above message that we have bound each vCPU to a different physical CPU in
the host.

Thanks,
zhanghailiang
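The BSP-side wait described in the message above can be modelled in a few lines. This is only an illustrative sketch, not kernel code: the loop bound of 50000 is taken from the quoted do_boot_cpu() loop, while the 100 us per-iteration delay is an assumption derived from the stated 5 s total (5 s / 50000). The function name wait_for_callin and its parameters are hypothetical, and the rule for choosing between 'Stuck' and 'Not responding' follows the description in the message rather than the guest kernel source.

```python
# Illustrative model of the BSP polling loop; not kernel code.
POLL_ITERATIONS = 50000   # loop bound quoted from do_boot_cpu() above
ASSUMED_DELAY_US = 100    # assumption: 5 s total wait / 50000 iterations


def wait_for_callin(callin_at_us, reached_trampoline):
    """Model the BSP waiting for one AP.

    callin_at_us: simulated time (us) at which the AP sets its bit in
                  cpu_callin_mask, or None if it never gets that far.
    reached_trampoline: True if the AP started executing trampoline code.
    Returns a (status, waited_us) tuple.
    """
    for i in range(POLL_ITERATIONS):
        now_us = i * ASSUMED_DELAY_US
        if callin_at_us is not None and callin_at_us <= now_us:
            return ("online", now_us)
    waited_us = POLL_ITERATIONS * ASSUMED_DELAY_US
    # Per the description above: 'Stuck' means the AP began running the
    # trampoline but never reached smp_callin().
    if reached_trampoline:
        return ("Stuck", waited_us)
    return ("Not responding", waited_us)


if __name__ == "__main__":
    print(wait_for_callin(20_000, True))   # AP calls in after 20 ms
    print(wait_for_callin(None, True))     # AP halts before smp_callin()
    print(wait_for_callin(None, False))    # AP never starts
```

Under these assumptions an AP that halts anywhere before smp_callin() costs the BSP the full 5 s, which matches the spacing of the two 'Stuck' lines in the guest dmesg.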
* Re: [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot
From: Paolo Bonzini @ 2015-07-06 11:03 UTC
To: zhanghailiang, kvm; +Cc: peter.huangpeng, qemu-devel@nongnu.org

On 06/07/2015 11:59, zhanghailiang wrote:
>
>
> Besides, the following is the cpu information obtained from the host.
> 80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh
> qemu-monitor-command instance-0000000
> * CPU #0: pc=0x00007f64160c683d thread_id=68570
>   CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573
>   CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575
>   CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576
>   CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577
>   CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578
>   CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583
>   CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584
>
> Oh, I also forgot to mention in the above message that we have bound
> each vCPU to a different physical CPU in
> the host.

Can you capture a trace on the host (trace-cmd record -e kvm) and send
it privately?  Please note which CPUs get stuck, since I guess it's not
always 1 and 7.

Paolo
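For noting which vCPUs are stuck on a given run, the monitor output quoted above can be grouped by program counter. This is a minimal sketch that assumes the output format is exactly as pasted in the thread; the function name halted_by_pc is hypothetical. In the pasted data the two stuck vCPUs (1 and 7) are halted at one pc and the other halted vCPUs at another; the sketch does not attempt to explain why.

```python
import re
from collections import defaultdict

# Monitor output copied from the message quoted above.
INFO_CPUS = """\
* CPU #0: pc=0x00007f64160c683d thread_id=68570
  CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573
  CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575
  CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576
  CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577
  CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578
  CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583
  CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584
"""

CPU_RE = re.compile(
    r"CPU #(\d+): pc=(0x[0-9a-f]+)( \(halted\))? thread_id=(\d+)"
)


def halted_by_pc(text):
    """Group the indices of halted vCPUs by the pc they are halted at."""
    groups = defaultdict(list)
    for m in CPU_RE.finditer(text):
        cpu, pc, halted, _thread_id = m.groups()
        if halted:
            groups[pc].append(int(cpu))
    return dict(groups)


if __name__ == "__main__":
    for pc, cpus in halted_by_pc(INFO_CPUS).items():
        print(pc, cpus)
```

Running the same grouping after each reproduction would show whether the outlier set is always vCPUs 1 and 7 or changes from run to run.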
* Re: [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot
From: Igor Mammedov @ 2015-07-07 11:23 UTC
To: zhanghailiang; +Cc: Paolo Bonzini, peter.huangpeng, kvm, qemu-devel@nongnu.org

On Mon, 6 Jul 2015 17:59:10 +0800
zhanghailiang <zhang.zhanghailiang@huawei.com> wrote:

> On 2015/7/6 16:45, Paolo Bonzini wrote:
> >
> >
> > On 06/07/2015 09:54, zhanghailiang wrote:
> >>
> >> From the host, we found that the QEMU vcpu1 and vcpu7 threads were not
> >> consuming any cpu (they should be in idle state).
> >> All of the VCPUs' stacks in the host look like below:
> >>
> >> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
> >> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
> >> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
> >> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
> >> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
> >> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
> >> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b
> >> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
> >> [<ffffffffffffffff>] 0xffffffffffffffff
> >>
> >> We looked into the kernel code that could lead to the above 'Stuck'
> >> warning,
In current upstream there isn't any printk(...Stuck...) left, since that code path
has been reworked.
I've often seen this on an over-committed host during guest CPU up/down torture tests.
Could you update the guest kernel to upstream and see if the issue reproduces?

> >> and found that the only possibility is that the emulation of the 'cpuid'
> >> instruction in kvm/qemu has something wrong.
> >> But since we can't reproduce this problem, we are not quite sure.
> >> Is there any possibility that the cpuid emulation in kvm/qemu has some bug?
> >
> > Can you explain the relationship to the cpuid emulation?  What do the
> > traces say about vcpus 1 and 7?
* Re: [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot
From: zhanghailiang @ 2015-07-07 11:43 UTC
To: Igor Mammedov; +Cc: Paolo Bonzini, peter.huangpeng, kvm, qemu-devel@nongnu.org

On 2015/7/7 19:23, Igor Mammedov wrote:
> On Mon, 6 Jul 2015 17:59:10 +0800
> zhanghailiang <zhang.zhanghailiang@huawei.com> wrote:
>
>> On 2015/7/6 16:45, Paolo Bonzini wrote:
>>>
>>>
>>> On 06/07/2015 09:54, zhanghailiang wrote:
>>>>
>>>> From the host, we found that the QEMU vcpu1 and vcpu7 threads were not
>>>> consuming any cpu (they should be in idle state).
>>>> All of the VCPUs' stacks in the host look like below:
>>>>
>>>> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
>>>> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
>>>> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
>>>> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
>>>> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
>>>> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
>>>> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b
>>>> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>
>>>> We looked into the kernel code that could lead to the above 'Stuck'
>>>> warning,
> In current upstream there isn't any printk(...Stuck...) left, since that code path
> has been reworked.
> I've often seen this on an over-committed host during guest CPU up/down torture tests.
> Could you update the guest kernel to upstream and see if the issue reproduces?
>

Hmm, unfortunately, it is very hard to reproduce, and we are still trying to reproduce it.

For your test case, is it a kernel bug?
Or has any related patch that could solve your test problem been merged
upstream?

Thanks,
zhanghailiang
* Re: [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot
From: Igor Mammedov @ 2015-07-07 12:21 UTC
To: zhanghailiang; +Cc: Paolo Bonzini, peter.huangpeng, kvm, qemu-devel@nongnu.org

On Tue, 7 Jul 2015 19:43:35 +0800
zhanghailiang <zhang.zhanghailiang@huawei.com> wrote:

> On 2015/7/7 19:23, Igor Mammedov wrote:
> > On Mon, 6 Jul 2015 17:59:10 +0800
> > zhanghailiang <zhang.zhanghailiang@huawei.com> wrote:
> >
> >> On 2015/7/6 16:45, Paolo Bonzini wrote:
> >>>
> >>>
> >>> On 06/07/2015 09:54, zhanghailiang wrote:
> >>>>
> >>>> From the host, we found that the QEMU vcpu1 and vcpu7 threads were not
> >>>> consuming any cpu (they should be in idle state).
> >>>> All of the VCPUs' stacks in the host look like below:
> >>>>
> >>>> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
> >>>> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
> >>>> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
> >>>> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
> >>>> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
> >>>> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
> >>>> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b
> >>>> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
> >>>> [<ffffffffffffffff>] 0xffffffffffffffff
> >>>>
> >>>> We looked into the kernel code that could lead to the above 'Stuck'
> >>>> warning,
> > In current upstream there isn't any printk(...Stuck...) left, since that code path
> > has been reworked.
> > I've often seen this on an over-committed host during guest CPU up/down torture tests.
> > Could you update the guest kernel to upstream and see if the issue reproduces?
> >
>
> Hmm, unfortunately, it is very hard to reproduce, and we are still trying to reproduce it.
>
> For your test case, is it a kernel bug?
> Or has any related patch that could solve your test problem been merged
> upstream?
I don't remember all the prerequisite patches, but you should be able to find
  http://marc.info/?l=linux-kernel&m=140326703108009&w=2
"x86/smpboot: Initialize secondary CPU only if master CPU will wait for it"
and then look for dependencies.
* Re: [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot
From: zhanghailiang @ 2015-07-07 12:39 UTC
To: Igor Mammedov; +Cc: Paolo Bonzini, peter.huangpeng, kvm, qemu-devel@nongnu.org

On 2015/7/7 20:21, Igor Mammedov wrote:
> On Tue, 7 Jul 2015 19:43:35 +0800
> zhanghailiang <zhang.zhanghailiang@huawei.com> wrote:
>
>> On 2015/7/7 19:23, Igor Mammedov wrote:
>>> On Mon, 6 Jul 2015 17:59:10 +0800
>>> zhanghailiang <zhang.zhanghailiang@huawei.com> wrote:
>>>
>>>> On 2015/7/6 16:45, Paolo Bonzini wrote:
>>>>>
>>>>>
>>>>> On 06/07/2015 09:54, zhanghailiang wrote:
>>>>>>
>>>>>> From the host, we found that the QEMU vcpu1 and vcpu7 threads were not
>>>>>> consuming any cpu (they should be in idle state).
>>>>>> All of the VCPUs' stacks in the host look like below:
>>>>>>
>>>>>> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
>>>>>> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
>>>>>> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
>>>>>> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
>>>>>> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
>>>>>> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
>>>>>> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b
>>>>>> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
>>>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>>>
>>>>>> We looked into the kernel code that could lead to the above 'Stuck'
>>>>>> warning,
>>> In current upstream there isn't any printk(...Stuck...) left, since that code path
>>> has been reworked.
>>> I've often seen this on an over-committed host during guest CPU up/down torture tests.
>>> Could you update the guest kernel to upstream and see if the issue reproduces?
>>>
>>
>> Hmm, unfortunately, it is very hard to reproduce, and we are still trying to reproduce it.
>>
>> For your test case, is it a kernel bug?
>> Or has any related patch that could solve your test problem been merged
>> upstream?
> I don't remember all the prerequisite patches, but you should be able to find
>    http://marc.info/?l=linux-kernel&m=140326703108009&w=2
> "x86/smpboot: Initialize secondary CPU only if master CPU will wait for it"
> and then look for dependencies.
>

Er, we have investigated this patch, and it is not related to our problem. :)

Thanks.
Thread overview: 8+ messages
2015-07-06  7:54 [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot zhanghailiang
2015-07-06  8:45 ` Paolo Bonzini
2015-07-06  9:59   ` zhanghailiang
2015-07-06 11:03     ` Paolo Bonzini
2015-07-07 11:23     ` Igor Mammedov
2015-07-07 11:43       ` zhanghailiang
2015-07-07 12:21         ` Igor Mammedov
2015-07-07 12:39           ` zhanghailiang