From: Andre Przywara
Subject: Re: Hypervisor crash(!) on xl cpupool-numa-split
Date: Mon, 21 Feb 2011 15:45:03 +0100
Message-ID: <4D627A6F.5070105@amd.com>
In-Reply-To: <4D62666C.6010608@ts.fujitsu.com>
To: Juergen Gross
Cc: George Dunlap, xen-devel@lists.xensource.com, "Diestelhorst, Stephan"
List-Id: xen-devel@lists.xenproject.org

Juergen Gross wrote:
> On 02/21/11 11:00, Andre Przywara wrote:
>> George Dunlap wrote:
>>> Andre (and Juergen), can you try again with the attached patch?
>> I applied this patch on top of 22931 and it did _not_ work.
>> The crash occurred almost immediately after I started my script, so the
>> same behaviour as without the patch.
>
> Did you try my patch addressing races in the scheduler when moving cpus
> between cpupools?
Sorry, I tried yours first, but it didn't apply cleanly on my particular
tree (sched_jg_fix ;-). So I tested George's first.

> I've attached it again. For me it works quite well, while George's patch
> seems not to be enough (machine hanging after some tests with cpupools).
OK, it now applied after a rebase. And yes, I didn't see a crash!
At least not until the script stopped while a lot of these messages appeared:
(XEN) do_IRQ: 0.89 No irq handler for vector (irq -1)
That is what I reported before and is most probably totally unrelated to
this issue. So I consider this fix working!
I will try to match my recent theories and debug results with your patch
to see whether this fits.

> OTOH I can't reproduce an error as fast as you even without any patch :-)
>
>> (attached my script for reference, though it will most likely only make
>> sense on bigger NUMA machines)
>
> Yeah, on my 2-node system I need several hundred tries to get an error.
> But it seems to be more effective than George's script.
I consider the large over-provisioning the reason. With Dom0's 48 VCPUs
eventually squashed together onto 6 pCPUs, my script triggered the crash
by the second run at the latest. With your patch it made 24 iterations
before the other bug kicked in.

Thanks very much!
Andre.

>
> Juergen
>
>> Regards,
>> Andre.
>>
>>> What the patch basically does is try to make "cpu_disable_scheduler()"
>>> do what it seems to say it does. :-) Namely, the various
>>> scheduler-related interrupts (both the per-cpu ticks and the master
>>> tick) are part of the scheduler, so disable them before doing anything,
>>> and don't enable them until the cpu is really ready to go again.
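If I read that right, the intended ordering is roughly the following. This
is only an illustrative sketch in plain C, not the actual patch; the
structure and helper names are invented for the example, and the precise
list of changes follows below:

/* Sketch only (not Xen code, names made up): every scheduler timer of a
 * CPU is stopped before its per-CPU scheduler data is touched, and is
 * restarted only after the receiving pool's scheduler has its idle vcpu
 * in place, so no tick can fire while the data is inconsistent. */
#include <stdbool.h>
#include <stdio.h>

struct cpu_ticks {
    int cpu;
    bool percpu_tick;   /* per-CPU accounting tick           */
    bool master_tick;   /* pool-wide master ticker (credit1) */
};

static void ticks_suspend(struct cpu_ticks *t)
{
    t->percpu_tick = false;
    t->master_tick = false;
    printf("cpu%d: all tickers stopped\n", t->cpu);
}

static void ticks_resume(struct cpu_ticks *t)
{
    t->percpu_tick = true;
    t->master_tick = true;
    printf("cpu%d: tickers running again\n", t->cpu);
}

/* Move a CPU from one pool to another with the timers quiescent. */
static void move_cpu_between_pools(struct cpu_ticks *t)
{
    ticks_suspend(t);   /* 1. nothing may tick from here on          */
    /* ... free the old pool's per-CPU data, allocate the new        */
    /*     pool's, insert the idle vcpu for this CPU ...             */
    ticks_resume(t);    /* 2. scheduler state is consistent again    */
}

int main(void)
{
    struct cpu_ticks t = { .cpu = 4, .percpu_tick = true, .master_tick = true };
    move_cpu_between_pools(&t);
    return 0;
}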
>>>
>>> To be precise:
>>> * cpu_disable_scheduler() disables ticks
>>> * scheduler_cpu_switch() only enables ticks if adding a cpu to a pool,
>>>   and does it after inserting the idle vcpu
>>> * Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
>>>   stop tickers
>>>   + Call tick_{resume,suspend} in cpu_{up,down}, respectively
>>> * Modify credit1's tick_{suspend,resume} to handle the master ticker
>>>   as well.
>>>
>>> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
>>> on one pcpu), I can perform thousands of operations successfully.
>>>
>>> (NB this is not ready for application yet, I just wanted to check to
>>> see if it fixes Andre's problem)
>>>
>>> -George
>>>
>>> On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross wrote:
>>>> Okay, I have some more data.
>>>>
>>>> I activated cpupool_dprintk() and included checks in sched_credit.c to
>>>> test for weight inconsistencies. To reduce race possibilities I've added
>>>> my patch to execute cpu assigning/unassigning always in a tasklet on the
>>>> cpu to be moved.
>>>>
>>>> Here is the result:
>>>>
>>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>>>> (XEN) cpupool_assign_cpu(pool=0,cpu=1)
>>>> (XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
>>>> (XEN) cpupool_assign_cpu(cpu=1) ret 0
>>>> (XEN) cpupool_assign_cpu(pool=1,cpu=4)
>>>> (XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
>>>> (XEN) cpupool_assign_cpu(cpu=4) ret 0
>>>> (XEN) cpu 4, weight 0, prv ffff831002ad5e40, dom 0:
>>>> (XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
>>>> (XEN) Xen BUG at sched_credit.c:570
>>>> (XEN) ----[ Xen-4.1.0-rc5-pre  x86_64  debug=y  Tainted: C ]----
>>>> (XEN) CPU:    4
>>>> (XEN) RIP:    e008:[] csched_tick+0x186/0x37f
>>>> (XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor
>>>> (XEN) rax: 0000000000000000   rbx: ffff830839d3ec30   rcx: 0000000000000000
>>>> (XEN) rdx: ffff830839dcff18   rsi: 000000000000000a   rdi: ffff82c4802542e8
>>>> (XEN) rbp: ffff830839dcfe38   rsp: ffff830839dcfde8   r8:  0000000000000004
>>>> (XEN) r9:  ffff82c480213520   r10: 00000000fffffffc   r11: 0000000000000001
>>>> (XEN) r12: 0000000000000004   r13: ffff830839d3ec40   r14: ffff831002ad5e40
>>>> (XEN) r15: ffff830839d66f90   cr0: 000000008005003b   cr4: 00000000000026f0
>>>> (XEN) cr3: 0000001020a98000   cr2: 00007fc5e9b79d98
>>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>>>> (XEN) Xen stack trace from rsp=ffff830839dcfde8:
>>>> (XEN)  ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246 ffff830839d6c000
>>>> (XEN)  0000000000000000 ffff830839dd1100 0000000000000004 ffff82c480119651
>>>> (XEN)  ffff831002b28018 ffff831002b28010 ffff830839dcfe68 ffff82c480126204
>>>> (XEN)  0000000000000002 ffff83083ffa3bb8 ffff830839dd1100 000000cae439ea7e
>>>> (XEN)  ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20 ffff830839dd1100
>>>> (XEN)  ffff831002b28010 0000000000000004 0000000000000004 ffff82c4802b0880
>>>> (XEN)  ffff830839dcff18 ffffffffffffffff ffff830839dcfef8 ffff82c480123647
>>>> (XEN)  ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98 00007fc5e9fa5b20
>>>> (XEN)  0000000000000002 00007fff46826f20 ffff830839dcff08 ffff82c4801236c2
>>>> (XEN)  00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20 0000000000000002
>>>> (XEN)  00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260 00007fff46826f50
>>>> (XEN)  0000000000000246 0000000000000032 0000000000000000 00000000ffffffff
>>>> (XEN)  0000000000000009 00007fc5e9d9de1a 0000000000000003 0000000000004848
>>>> (XEN)  00007fc5e9b7a000 0000010000000000 ffffffff800073f0 000000000000e033
>>>> (XEN)  0000000000000246 ffff880f97b51fc8 000000000000e02b 0000000000000000
>>>> (XEN)  0000000000000000 0000000000000000 0000000000000000 0000000000000004
>>>> (XEN)  ffff830077eee000 00000043b9afd180 0000000000000000
>>>> (XEN) Xen call trace:
>>>> (XEN)    [] csched_tick+0x186/0x37f
>>>> (XEN)    [] execute_timer+0x4e/0x6c
>>>> (XEN)    [] timer_softirq_action+0xf6/0x239
>>>> (XEN)    [] __do_softirq+0x88/0x99
>>>> (XEN)    [] do_softirq+0x6a/0x7a
>>>> (XEN)
>>>> (XEN)
>>>> (XEN) ****************************************
>>>> (XEN) Panic on CPU 4:
>>>> (XEN) Xen BUG at sched_credit.c:570
>>>> (XEN) ****************************************
>>>>
>>>> As you can see, a Dom0 vcpu is becoming active on a pool 1 cpu. The
>>>> BUG_ON triggered in csched_acct() is a logical result of this.
>>>>
>>>> How this can happen I don't know yet.
>>>> Anyone any idea? I'll keep searching...
>>>>
>>>> Juergen
>>>>
>>>> On 02/15/11 08:22, Juergen Gross wrote:
>>>>> On 02/14/11 18:57, George Dunlap wrote:
>>>>>> The good news is, I've managed to reproduce this on my local test
>>>>>> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
>>>>>> attached script. It's time to go home now, but I should be able to
>>>>>> dig something up tomorrow.
>>>>>>
>>>>>> To use the script:
>>>>>> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
>>>>>> * You can modify elements by adding "arg=val" as arguments.
>>>>>> * Arguments are:
>>>>>>   + dryrun={true,false} Do the work, but don't actually execute any xl
>>>>>>     arguments. Default false.
>>>>>>   + left: Number of commands to execute. Default 10.
>>>>>>   + maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
>>>>>>     8 cpus).
>>>>>>   + verbose={true,false} Print what you're doing. Default is true.
>>>>>>
>>>>>> The script sometimes attempts to remove the last cpu from cpupool0; in
>>>>>> this case, libxl will print an error. If the script gets an error
>>>>>> under that condition, it will ignore it; under any other condition, it
>>>>>> will print diagnostic information.
>>>>>>
>>>>>> What finally crashed it for me was this command:
>>>>>> # ./cpupool-test.sh verbose=false left=1000
>>>>> Nice!
>>>>> With your script I finally managed to get the error, too. On my box
>>>>> (2 sockets with 6 cores each) I had to use
>>>>>
>>>>> ./cpupool-test.sh verbose=false left=10000 maxcpus=11
>>>>>
>>>>> to trigger it.
>>>>> Looking for more data now...
>>>>>
>>>>> Juergen
>>>>>
>>>>>> -George
>>>>>>
>>>>>> On Fri, Feb 11, 2011 at 7:39 AM, Andre Przywara wrote:
>>>>>>> Juergen Gross wrote:
>>>>>>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>>>>>> Andre Przywara wrote:
>>>>>>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>>>>>> Andre, George,
>>>>>>>>>>>>
>>>>>>>>>>>> What seems to be interesting: I think the problem did always
>>>>>>>>>>>> occur when a new cpupool was created and the first cpu was
>>>>>>>>>>>> moved to it.
>>>>>>>>>>>>
>>>>>>>>>>>> I think my previous assumption regarding the master_ticker
>>>>>>>>>>>> was not too bad.
>>>>>>>>>>>> I think somehow the master_ticker of the new cpupool is
>>>>>>>>>>>> becoming active before the scheduler is really initialized
>>>>>>>>>>>> properly.
>>>>>>>>>>>> This could happen if enough time is spent between alloc_pdata
>>>>>>>>>>>> for the cpu to be moved and the critical section in
>>>>>>>>>>>> schedule_cpu_switch().
>>>>>>>>>>>>
>>>>>>>>>>>> The solution should be to activate the timers only if the
>>>>>>>>>>>> scheduler is ready for them.
>>>>>>>>>>>>
>>>>>>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>>>>>>> suspend_ticker as well? I still see potential problems for
>>>>>>>>>>>> entering deep C-States. I think I'll prepare a patch which will
>>>>>>>>>>>> keep the master_ticker active for the C-State case and migrate
>>>>>>>>>>>> it for the schedule_cpu_switch() case.
>>>>>>>>>>> Okay, here is a patch for this. It ran on my 4-core machine
>>>>>>>>>>> without any problems.
>>>>>>>>>>> Andre, could you give it a try?
>>>>>>>>>> Did, but unfortunately it crashed as always. Tried twice and made
>>>>>>>>>> sure I booted the right kernel. Sorry.
>>>>>>>>>> The idea with the race between the timer and the state changing
>>>>>>>>>> sounded very appealing; actually that was suspicious to me from
>>>>>>>>>> the beginning.
>>>>>>>>>>
>>>>>>>>>> I will add some code to dump the state of all cpupools to the
>>>>>>>>>> BUG_ON to see in which situation we are when the bug triggers.
>>>>>>>>> OK, here is a first try of this. The patch iterates over all CPU
>>>>>>>>> pools and outputs some data if the BUG_ON
>>>>>>>>> ((sdom->weight * sdom->active_vcpu_count) > weight_left) condition
>>>>>>>>> triggers:
>>>>>>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
>>>>>>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>>>>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>>>>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>>>>>>> ....
>>>>>>>>> The masks look proper (6 cores per node), the bug triggers when the
>>>>>>>>> first CPU is about to be(?) inserted.
>>>>>>>> Sure? I'm missing the cpu with mask 2000.
>>>>>>>> I'll try to reproduce the problem on a larger machine here (24
>>>>>>>> cores, 4 numa nodes).
>>>>>>>> Andre, can you give me your xen boot parameters? Which xen changeset
>>>>>>>> are you running, and do you have any additional patches in use?
>>>>>>> The grub lines:
>>>>>>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
>>>>>>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>>>>>>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>>>>>>
>>>>>>> All of my experiments use c/s 22858 as a base.
>>>>>>> If you use an AMD Magny-Cours box for your experiments (socket C32 or
>>>>>>> G34), you should add the following patch (removing the line):
>>>>>>> --- a/xen/arch/x86/traps.c
>>>>>>> +++ b/xen/arch/x86/traps.c
>>>>>>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>>>>>>          __clear_bit(X86_FEATURE_SKINIT % 32, &c);
>>>>>>>          __clear_bit(X86_FEATURE_WDT % 32, &c);
>>>>>>>          __clear_bit(X86_FEATURE_LWP % 32, &c);
>>>>>>> -        __clear_bit(X86_FEATURE_NODEID_MSR % 32, &c);
>>>>>>>          __clear_bit(X86_FEATURE_TOPOEXT % 32, &c);
>>>>>>>          break;
>>>>>>>      case 5: /* MONITOR/MWAIT */
>>>>>>>
>>>>>>> This is not necessary (in fact it reverts my patch c/s 22815), but it
>>>>>>> raises the probability to trigger the bug, probably because it
>>>>>>> increases the pressure on the Dom0 scheduler.
>>>>>>> If you cannot trigger it with Dom0, try to create a guest with many
>>>>>>> VCPUs and squeeze it into a small CPU-pool.
>>>>>>>
>>>>>>> Good luck ;-)
>>>>>>> Andre.
>>>>>>>
>>>>>>> --
>>>>>>> Andre Przywara
>>>>>>> AMD-OSRC (Dresden)
>>>>>>> Tel: x29712
>
> --
> Juergen Gross                 Principal Developer Operating Systems
> TSP ES&S SWE OS6              Telephone: +49 (0) 89 3222 2967
> Fujitsu Technology Solutions  e-mail: juergen.gross@ts.fujitsu.com
> Domagkstr. 28                 Internet: ts.fujitsu.com
> D-80807 Muenchen              Company details: ts.fujitsu.com/imprint.html

--
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712