From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:41574)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <imammedo@redhat.com>) id 1ZCQye-0004oe-HF
	for qemu-devel@nongnu.org; Tue, 07 Jul 2015 07:23:54 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <imammedo@redhat.com>) id 1ZCQyb-0005GY-66
	for qemu-devel@nongnu.org; Tue, 07 Jul 2015 07:23:52 -0400
Received: from mx1.redhat.com ([209.132.183.28]:43216)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <imammedo@redhat.com>) id 1ZCQya-0005G8-UK
	for qemu-devel@nongnu.org; Tue, 07 Jul 2015 07:23:49 -0400
Date: Tue, 7 Jul 2015 13:23:44 +0200
From: Igor Mammedov <imammedo@redhat.com>
Message-ID: <20150707132344.04476183@nial.brq.redhat.com>
In-Reply-To: <559A516E.1070000@huawei.com>
References: <559A342C.6020207@huawei.com> <559A4010.30808@redhat.com>
	<559A516E.1070000@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [BUG/RFC] Two cpus are not brought up normally in
 SLES11 sp3 VM after reboot
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: zhanghailiang <zhang.zhanghailiang@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>, peter.huangpeng@huawei.com, kvm@vger.kernel.org, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>

On Mon, 6 Jul 2015 17:59:10 +0800
zhanghailiang <zhang.zhanghailiang@huawei.com> wrote:

> On 2015/7/6 16:45, Paolo Bonzini wrote:
> >
> >
> > On 06/07/2015 09:54, zhanghailiang wrote:
> >>
> >> From the host, we found that the QEMU vcpu1 and vcpu7 threads were not
> >> consuming any CPU (they should be in the idle state).
> >> The stacks of all the vCPUs on the host look like this:
> >>
> >> [<ffffffffa07089b5>] kvm_vcpu_block+0x65/0xa0 [kvm]
> >> [<ffffffffa071c7c1>] __vcpu_run+0xd1/0x260 [kvm]
> >> [<ffffffffa071d508>] kvm_arch_vcpu_ioctl_run+0x68/0x1a0 [kvm]
> >> [<ffffffffa0709cee>] kvm_vcpu_ioctl+0x38e/0x580 [kvm]
> >> [<ffffffff8116be8b>] do_vfs_ioctl+0x8b/0x3b0
> >> [<ffffffff8116c251>] sys_ioctl+0xa1/0xb0
> >> [<ffffffff81468092>] system_call_fastpath+0x16/0x1b
> >> [<00002ab9fe1f99a7>] 0x2ab9fe1f99a7
> >> [<ffffffffffffffff>] 0xffffffffffffffff
> >>
> >> We looked into the kernel code that could lead to the above 'Stuck'
> >> warning,
In current upstream there isn't any printk(...Stuck...) left, since that code
path has been reworked.
I've often seen this on an over-committed host during guest CPU up/down
torture tests.
Could you update the guest kernel to upstream and see if the issue reproduces?

> >> and found that the only possibility is that the emulation of the 'cpuid'
> >> instruction in kvm/qemu has something wrong.
> >> But since we can’t reproduce this problem, we are not quite sure.
> >> Is it possible that the cpuid emulation in kvm/qemu has some bug?
> >
> > Can you explain the relationship to the cpuid emulation?  What do the
> > traces say about vcpus 1 and 7?
>
> OK, we searched the VM's kernel code for the 'Stuck' message, and it is located in
> do_boot_cpu(). It runs in BSP context; the call chain is:
> BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu() -> wakeup_secondary_cpu_via_init() to trigger the APs.
> It waits 5s for the APs to start up; if some AP does not start up normally, it prints 'CPU%d Stuck' or 'CPU%d: Not responding'.
>
> If it prints 'Stuck', it means the AP has received the SIPI interrupt and has begun to execute the code at
> 'ENTRY(trampoline_data)' (trampoline_64.S), but got stuck somewhere before smp_callin() (smpboot.c).
> The following is the startup process of the BSP and the APs.
> BSP:
> start_kernel()
>    ->smp_init()
>       ->smp_boot_cpus()
>         ->do_boot_cpu()
>             ->start_ip = trampoline_address(); // set the address that the AP will start executing at
>             ->wakeup_secondary_cpu_via_init(); // kick the secondary CPU
>             ->for (timeout = 0; timeout < 50000; timeout++)
>                 if (cpumask_test_cpu(cpu, cpu_callin_mask)) break; // check whether the AP has started up
>
> APs:
> ENTRY(trampoline_data) (trampoline_64.S)
>        ->ENTRY(secondary_startup_64) (head_64.S)
>           ->start_secondary() (smpboot.c)
>              ->cpu_init();
>              ->smp_callin();
>                  ->cpumask_set_cpu(cpuid, cpu_callin_mask); // Note: if the AP gets here, the BSP will not print the error message.
>
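To make the handshake described above concrete, here is a minimal single-threaded sketch in C. It is only a model under stated assumptions, not the kernel's code: the names model_smp_callin(), model_do_boot_cpu(), callin_at_poll and trampoline_started are hypothetical, and how the kernel actually detects that the trampoline ran is not shown in this thread, so it is modelled as a plain boolean. The 50000-iteration bound is the one quoted above; spread over the 5s wait, that works out to roughly 100us per poll.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS      8       /* the listing in this report shows CPU #0..#7 */
#define CALLIN_POLLS 50000   /* loop bound quoted in the call chain above */

enum boot_result { BOOT_OK, BOOT_STUCK, BOOT_NOT_RESPONDING };

/* AP side: hypothetical stand-in for the cpumask_set_cpu() call in smp_callin(). */
static void model_smp_callin(uint32_t *callin_mask, unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    *callin_mask |= UINT32_C(1) << cpu;
}

/*
 * BSP side: poll the callin bit of 'cpu' for at most CALLIN_POLLS iterations.
 *
 * callin_at_poll     - iteration at which the modelled AP reaches
 *                      smp_callin(); a negative value means "never".
 * trampoline_started - assumption: whether the AP began executing the
 *                      trampoline at all.
 */
static enum boot_result model_do_boot_cpu(uint32_t *callin_mask,
                                          unsigned int cpu,
                                          long callin_at_poll,
                                          bool trampoline_started)
{
    assert(cpu < NR_CPUS);

    for (long timeout = 0; timeout < CALLIN_POLLS; timeout++) {
        if (timeout == callin_at_poll)
            model_smp_callin(callin_mask, cpu);   /* AP checks in */
        if (*callin_mask & (UINT32_C(1) << cpu))
            return BOOT_OK;                       /* no error message */
        /* the real loop also delays between polls; omitted in this model */
    }

    /* Timed out: the message depends on whether the AP ever started. */
    return trampoline_started ? BOOT_STUCK : BOOT_NOT_RESPONDING;
}
```

In this model, an AP that started the trampoline but never checks in (callin_at_poll negative, trampoline_started true) yields BOOT_STUCK, which corresponds to the 'Stuck' case reported for CPU1 and CPU7.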
> From the above call chain, we can be sure that the AP got stuck between
> trampoline_data and the cpumask_set_cpu() in
> smp_callin(). We looked through this code path carefully, and only found a 'hlt' instruction that could block the process.
> It is located in trampoline_data():
>
> ENTRY(trampoline_data)
>          ...
>
> 	call	verify_cpu		# Verify the cpu supports long mode
> 	testl   %eax, %eax		# Check for return code
> 	jnz	no_longmode
>
>          ...
>
> no_longmode:
> 	hlt
> 	jmp no_longmode
>
> In verify_cpu(),
> the only sensitive instruction we can find that could cause a VM exit from non-root mode is 'cpuid'.
> This is why we suspect that wrong cpuid emulation in KVM/QEMU led to the failure in verify_cpu.
>
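For reference, the long-mode part of such a check can be sketched in C as a pure function over CPUID results. This is a hedged illustration, not the verify_cpu code itself, which is not quoted in this thread and checks more than long mode. The struct and function names are hypothetical. The one architectural fact relied on is that CPUID leaf 0x80000001 reports long mode in EDX bit 29, and that leaf 0x80000000 returns the highest supported extended leaf in EAX.

```c
#include <stdint.h>

#define CPUID_EXT_BASE     UINT32_C(0x80000000)  /* EAX = highest extended leaf */
#define CPUID_EXT_FEATURES UINT32_C(0x80000001)  /* extended feature flags */
#define CPUID_EDX_LM       (UINT32_C(1) << 29)   /* long mode available */

/* Hypothetical container for the four registers one CPUID call returns. */
struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/*
 * Returns 0 if the given CPUID results report long mode and non-zero
 * otherwise, following the "testl %eax, %eax; jnz no_longmode" convention
 * quoted above.
 *
 * ext_base     - result of CPUID with EAX = CPUID_EXT_BASE
 * ext_features - result of CPUID with EAX = CPUID_EXT_FEATURES
 */
static int model_verify_long_mode(struct cpuid_regs ext_base,
                                  struct cpuid_regs ext_features)
{
    /* The extended feature leaf must exist before it can be trusted. */
    if (ext_base.eax < CPUID_EXT_FEATURES)
        return 1;

    /* EDX bit 29 of that leaf is the long-mode flag. */
    if (!(ext_features.edx & CPUID_EDX_LM))
        return 1;

    return 0;
}
```

If an emulated CPUID ever returned a too-small maximum extended leaf or a cleared LM bit, a check of this shape would take the no_longmode branch and the AP would sit in the hlt loop.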
> From the messages in the VM, we know something is wrong with vcpu1 and vcpu7.
> [    5.060042] CPU1: Stuck ??
> [   10.170815] CPU7: Stuck ??
> [   10.171648] Brought up 6 CPUs
>
> Besides, the following is the CPU information obtained from the host.
> 80FF72F5-FF6D-E411-A8C8-000000821800:/home/fsp/hrg # virsh qemu-monitor-command instance-0000000
> * CPU #0: pc=0x00007f64160c683d thread_id=68570
>    CPU #1: pc=0xffffffff810301f1 (halted) thread_id=68573
>    CPU #2: pc=0xffffffff810301e2 (halted) thread_id=68575
>    CPU #3: pc=0xffffffff810301e2 (halted) thread_id=68576
>    CPU #4: pc=0xffffffff810301e2 (halted) thread_id=68577
>    CPU #5: pc=0xffffffff810301e2 (halted) thread_id=68578
>    CPU #6: pc=0xffffffff810301e2 (halted) thread_id=68583
>    CPU #7: pc=0xffffffff810301f1 (halted) thread_id=68584
>
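In the listing above, CPU #1 and CPU #7 are halted at a pc ending in 01f1, while the other halted CPUs are at a pc ending in 01e2. When comparing many such dumps it may help to extract the fields mechanically. A small sketch in C, assuming plain-text lines of the form "CPU #N: pc=0x... (halted) thread_id=..."; the struct and function names are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Fields of interest from one monitor line. */
struct cpu_line {
    int index;                 /* N in "CPU #N" */
    unsigned long long pc;     /* program counter */
    bool halted;               /* true if the line carries "(halted)" */
};

/*
 * Parses one line of the CPU listing into 'out'.
 * Returns true on success, false if the line does not match the format.
 * A leading "* " marker or leading blanks are skipped.
 */
static bool parse_cpu_line(const char *line, struct cpu_line *out)
{
    const char *p = strstr(line, "CPU #");

    if (p == NULL)
        return false;

    /* %llx accepts an optional 0x prefix, like strtoull() with base 16. */
    if (sscanf(p, "CPU #%d: pc=%llx", &out->index, &out->pc) != 2)
        return false;

    out->halted = strstr(p, "(halted)") != NULL;
    return true;
}
```

Feeding each line of the listing through parse_cpu_line() and grouping the halted CPUs by pc would separate CPU #1 and CPU #7 from the rest.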
> Oh, I also forgot to mention in the above message that we have bound each vCPU to a different physical CPU on
> the host.
>
> Thanks,
> zhanghailiang