From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 7 Jul 2015 14:21:32 +0200
From: Igor Mammedov
Message-ID: <20150707142132.33d7d9d6@nial.brq.redhat.com>
In-Reply-To: <559BBB67.4000503@huawei.com>
References: <559A342C.6020207@huawei.com> <559A4010.30808@redhat.com> <559A516E.1070000@huawei.com> <20150707132344.04476183@nial.brq.redhat.com> <559BBB67.4000503@huawei.com>
Subject: Re: [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in SLES11 sp3 VM after reboot
To: zhanghailiang
Cc: Paolo Bonzini, peter.huangpeng@huawei.com, kvm@vger.kernel.org, "qemu-devel@nongnu.org"

On Tue, 7 Jul 2015 19:43:35 +0800
zhanghailiang wrote:

> On 2015/7/7 19:23, Igor Mammedov wrote:
> > On Mon, 6 Jul 2015 17:59:10 +0800
> > zhanghailiang wrote:
> >
> >> On 2015/7/6 16:45, Paolo Bonzini wrote:
> >>>
> >>> On 06/07/2015 09:54, zhanghailiang wrote:
> >>>>
> >>>> From the host, we found that the QEMU vcpu1 and vcpu7 threads were not
> >>>> consuming any CPU (they should be in the idle state).
> >>>> All of the vCPUs' stacks on the host look like this:
> >>>>
> >>>> [] kvm_vcpu_block+0x65/0xa0 [kvm]
> >>>> [] __vcpu_run+0xd1/0x260 [kvm]
> >>>> [] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
> >>>> [] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
> >>>> [] do_vfs_ioctl+0x8b/0x3b0
> >>>> [] sys_ioctl+0xa1/0xb0
> >>>> [] system_call_fastpath+0x16/0x1b
> >>>> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
> >>>> [] 0xffffffffffffffff
> >>>>
> >>>> We looked into the kernel code paths that could lead to the 'Stuck'
> >>>> warning above,
> > in the current upstream kernel there isn't any printk(...Stuck...) left, since that code path
> > has been reworked.
> > I've often seen this on an over-committed host during guest CPU up/down torture tests.
> > Could you update the guest kernel to upstream and see if the issue reproduces?
>
> Hmm, unfortunately, it is very hard to reproduce, and we are still trying to reproduce it.
>
> For your test case, was it a kernel bug?
> Or has a related patch that could fix your test problem been merged into
> upstream?

I don't remember all the prerequisite patches, but you should be able to find
http://marc.info/?l=linux-kernel&m=140326703108009&w=2
"x86/smpboot: Initialize secondary CPU only if master CPU will wait for it"
and then look for its dependencies.

>
> Thanks,
> zhanghailiang
>
> >>>> and found that the only possibility is that the emulation of the 'cpuid' instruction in
> >>>> KVM/QEMU has something wrong.
> >>>> But since we can't reproduce this problem, we are not quite sure.
> >>>> Is it possible that the cpuid emulation in KVM/QEMU has some bug?
> >>>
> >>> Can you explain the relationship to the cpuid emulation? What do the
> >>> traces say about vcpus 1 and 7?
> >>
> >> OK, we searched the VM's kernel code for the 'Stuck' message, and it is located in
> >> do_boot_cpu(). It runs in the BSP context; the call sequence is:
> >> the BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu() -> wakeup_secondary_via_INIT() to trigger the APs.
> >> It will wait 5s for the APs to start up; if some AP does not start up normally, it will print 'CPU%d Stuck' or 'CPU%d: Not responding'.
> >>
> >> If it prints 'Stuck', it means the AP has received the SIPI interrupt and begun to execute the code
> >> at 'ENTRY(trampoline_data)' (trampoline_64.S), but got stuck somewhere before smp_callin() (smpboot.c).
> >> The following is the startup process of the BSP and the APs.
> >> BSP:
> >> start_kernel()
> >>   ->smp_init()
> >>     ->smp_boot_cpus()
> >>       ->do_boot_cpu()
> >>         ->start_ip = trampoline_address();   // set the address the AP will execute from
> >>         ->wakeup_secondary_cpu_via_init();   // kick the secondary CPU
> >>         ->for (timeout = 0; timeout < 50000; timeout++)
> >>               if (cpumask_test_cpu(cpu, cpu_callin_mask)) break; // check whether the AP started up
> >>
> >> APs:
> >> ENTRY(trampoline_data)            (trampoline_64.S)
> >>   ->ENTRY(secondary_startup_64)   (head_64.S)
> >>     ->start_secondary()           (smpboot.c)
> >>       ->cpu_init();
> >>       ->smp_callin();
> >>         ->cpumask_set_cpu(cpuid, cpu_callin_mask); // note: if the AP gets here, the BSP will not print the error message
> >>
> >> From the above call sequence, we can be sure that the AP got stuck between trampoline_data and the cpumask_set_cpu() in
> >> smp_callin(). We looked through these code paths carefully, and the only thing we found that could block the process is a 'hlt' instruction.
> >> It is located in trampoline_data():
> >>
> >> ENTRY(trampoline_data)
> >>     ...
> >>
> >>     call verify_cpu      # Verify the cpu supports long mode
> >>     testl %eax, %eax     # Check for return code
> >>     jnz no_longmode
> >>
> >>     ...
> >>
> >> no_longmode:
> >>     hlt
> >>     jmp no_longmode
> >>
> >> In verify_cpu(),
> >> the only sensitive instruction we can find that could cause a VM exit from non-root mode is 'cpuid'.
> >> This is why we suspect that wrong cpuid emulation in KVM/QEMU led to the failure in verify_cpu.
> >>
> >> From the messages in the VM, we know something is wrong with vcpu1 and vcpu7:
> >> [    5.060042] CPU1: Stuck ??
> >> [   10.170815] CPU7: Stuck ??
> >> [   10.171648] Brought up 6 CPUs
> >>
> >> Besides, the following is the CPU information obtained from the host:
> >> 80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh qemu-monitor-command instance-0000000
> >> * CPU #0: pc=0x00007f64160c683d thread_id=68570
> >>   CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573
> >>   CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575
> >>   CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576
> >>   CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577
> >>   CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578
> >>   CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583
> >>   CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584
> >>
> >> Oh, I also forgot to mention above that we have bound each vCPU to a different physical CPU on the
> >> host.
> >>
> >> Thanks,
> >> zhanghailiang