From: "André Przywara" <andre.przywara@amd.com>
To: Juergen Gross <juergen.gross@ts.fujitsu.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	"Diestelhorst, Stephan" <Stephan.Diestelhorst@amd.com>
Subject: Re: Hypervisor crash(!) on xl cpupool-numa-split
Date: Thu, 17 Feb 2011 01:05:27 +0100	[thread overview]
Message-ID: <4D5C6647.2020201@amd.com> (raw)
In-Reply-To: <4D5BDAF8.50800@ts.fujitsu.com>

On 02/16/11 15:11, Juergen Gross wrote:
> On 02/16/11 14:54, George Dunlap wrote:
>> Andre (and Juergen), can you try again with the attached patch?
George, Juergen, thanks for all your work on this!
I will try the patch as soon as I am back in the office this afternoon.

Regards,
Andre.

>>
>> What the patch basically does is try to make "cpu_disable_scheduler()"
>> do what it seems to say it does. :-)  Namely, the various
>> scheduler-related interrupts (both the per-cpu ticks and the master
>> tick) are part of the scheduler, so disable them before doing anything,
>> and don't enable them until the cpu is really ready to go again.
>>
>> To be precise:
>> * cpu_disable_scheduler() disables ticks
>> * schedule_cpu_switch() only enables ticks when adding a cpu to a pool,
>> and does so after inserting the idle vcpu
>> * Modify semantics such that {alloc,free}_pdata() don't actually start
>> or stop tickers
>>    + Call tick_{resume,suspend} in cpu_{up,down}, respectively
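
(If I read the description correctly, the resulting structure would be
roughly the sketch below. This is only my reading, with the Xen types
stubbed out so it stands on its own; the helper names follow George's
list above, not necessarily the final patch.)

/* Minimal stand-ins; the real types live in xen/sched.h. */
struct scheduler { void (*tick_suspend)(unsigned int cpu);
                   void (*tick_resume)(unsigned int cpu); };
struct cpupool   { struct scheduler *sched; };

void cpu_disable_scheduler(unsigned int cpu, struct scheduler *ops)
{
    /* Ticks are part of the scheduler: stop them before doing anything
     * else (the per-cpu tick and, for credit1, the master tick too). */
    ops->tick_suspend(cpu);
    /* ... migrate vcpus away, take the cpu out of the pool ... */
}

void schedule_cpu_switch(unsigned int cpu, struct cpupool *c)
{
    /* {alloc,free}_pdata() no longer start or stop any tickers. */
    /* ... switch scheduler ops, insert the idle vcpu ... */
    if ( c != NULL )  /* only when adding the cpu to a pool */
        c->sched->tick_resume(cpu);
}
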
>
> I tried this before :-)
> It didn't work for Andre, but maybe there were some bits missing.
>
>> * Modify credit1's tick_{suspend,resume} to handle the master ticker as well.
>>
>> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
>> on one pcpu), I can perform thousands of operations successfully.
>
> Nice. I'll try later. At the moment I'm testing another patch (attached
> for review, if you like). I think I've identified two possible races.
>
>
> Juergen
>
>>
>> (NB this is not ready for application yet; I just wanted to check to
>> see if it fixes Andre's problem)
>>
>>    -George
>>
>> On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross
>> <juergen.gross@ts.fujitsu.com>   wrote:
>>> Okay, I have some more data.
>>>
>>> I activated cpupool_dprintk() and included checks in sched_credit.c to
>>> test for weight inconsistencies. To reduce race possibilities I've added
>>> my patch that always executes cpu assigning/unassigning in a tasklet on
>>> the cpu to be moved.
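
(For reference, running the move in a tasklet on the target cpu would look
roughly like the sketch below. It uses Xen's generic tasklet API, but the
wrapper and the completion handling are my assumptions, not necessarily
Juergen's actual patch.)

#include <xen/tasklet.h>

/* Do the (un)assign work on the cpu being moved, so that cpu cannot be
 * in the middle of scheduling while its pool membership changes. */
static struct tasklet cpu_move_tasklet;

static void cpu_move_fn(unsigned long _cpu)
{
    unsigned int cpu = (unsigned int)_cpu;
    /* ... perform the cpupool assign/unassign for 'cpu' here ... */
}

static void move_cpu_via_tasklet(unsigned int cpu)
{
    tasklet_init(&cpu_move_tasklet, cpu_move_fn, cpu);
    tasklet_schedule_on_cpu(&cpu_move_tasklet, cpu);
    /* The caller must wait for completion (e.g. a flag set at the end
     * of cpu_move_fn) before reporting success to the toolstack. */
}
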
>>>
>>> Here is the result:
>>>
>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>>> (XEN) cpupool_assign_cpu(pool=0,cpu=1)
>>> (XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
>>> (XEN) cpupool_assign_cpu(cpu=1) ret 0
>>> (XEN) cpupool_assign_cpu(pool=1,cpu=4)
>>> (XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
>>> (XEN) cpupool_assign_cpu(cpu=4) ret 0
>>> (XEN) cpu 4, weight 0,prv ffff831002ad5e40, dom 0:
>>> (XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
>>> (XEN) Xen BUG at sched_credit.c:570
>>> (XEN) ----[ Xen-4.1.0-rc5-pre  x86_64  debug=y  Tainted:    C ]----
>>> (XEN) CPU:    4
>>> (XEN) RIP:    e008:[<ffff82c4801197d7>] csched_tick+0x186/0x37f
>>> (XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor
>>> (XEN) rax: 0000000000000000   rbx: ffff830839d3ec30   rcx: 0000000000000000
>>> (XEN) rdx: ffff830839dcff18   rsi: 000000000000000a   rdi: ffff82c4802542e8
>>> (XEN) rbp: ffff830839dcfe38   rsp: ffff830839dcfde8   r8:  0000000000000004
>>> (XEN) r9:  ffff82c480213520   r10: 00000000fffffffc   r11: 0000000000000001
>>> (XEN) r12: 0000000000000004   r13: ffff830839d3ec40   r14: ffff831002ad5e40
>>> (XEN) r15: ffff830839d66f90   cr0: 000000008005003b   cr4: 00000000000026f0
>>> (XEN) cr3: 0000001020a98000   cr2: 00007fc5e9b79d98
>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>>> (XEN) Xen stack trace from rsp=ffff830839dcfde8:
>>> (XEN)    ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246 ffff830839d6c000
>>> (XEN)    0000000000000000 ffff830839dd1100 0000000000000004 ffff82c480119651
>>> (XEN)    ffff831002b28018 ffff831002b28010 ffff830839dcfe68 ffff82c480126204
>>> (XEN)    0000000000000002 ffff83083ffa3bb8 ffff830839dd1100 000000cae439ea7e
>>> (XEN)    ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20 ffff830839dd1100
>>> (XEN)    ffff831002b28010 0000000000000004 0000000000000004 ffff82c4802b0880
>>> (XEN)    ffff830839dcff18 ffffffffffffffff ffff830839dcfef8 ffff82c480123647
>>> (XEN)    ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98 00007fc5e9fa5b20
>>> (XEN)    0000000000000002 00007fff46826f20 ffff830839dcff08 ffff82c4801236c2
>>> (XEN)    00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20 0000000000000002
>>> (XEN)    00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260 00007fff46826f50
>>> (XEN)    0000000000000246 0000000000000032 0000000000000000 00000000ffffffff
>>> (XEN)    0000000000000009 00007fc5e9d9de1a 0000000000000003 0000000000004848
>>> (XEN)    00007fc5e9b7a000 0000010000000000 ffffffff800073f0 000000000000e033
>>> (XEN)    0000000000000246 ffff880f97b51fc8 000000000000e02b 0000000000000000
>>> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000004
>>> (XEN)    ffff830077eee000 00000043b9afd180 0000000000000000
>>> (XEN) Xen call trace:
>>> (XEN)    [<ffff82c4801197d7>] csched_tick+0x186/0x37f
>>> (XEN)    [<ffff82c480126204>] execute_timer+0x4e/0x6c
>>> (XEN)    [<ffff82c480126539>] timer_softirq_action+0xf6/0x239
>>> (XEN)    [<ffff82c480123647>] __do_softirq+0x88/0x99
>>> (XEN)    [<ffff82c4801236c2>] do_softirq+0x6a/0x7a
>>> (XEN)
>>> (XEN)
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 4:
>>> (XEN) Xen BUG at sched_credit.c:570
>>> (XEN) ****************************************
>>>
>>> As you can see, a Dom0 vcpu is becoming active on a pool 1 cpu. The
>>> BUG_ON triggered in csched_acct() is a logical consequence of this.
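
(For readers following along: the BUG_ON is essentially an accounting
invariant in csched_acct(). Simplified, with names as in sched_credit.c
and the credit redistribution omitted, it is roughly:)

    /* weight_left starts as the sum of weight * active_vcpu_count over
     * all active domains of this pool.  A vcpu from a foreign pool was
     * never added to that sum, so the subtraction underflows and the
     * BUG_ON fires. */
    weight_left = prv->weight;
    list_for_each_entry( sdom, &prv->active_sdom, active_sdom_elem )
    {
        BUG_ON( (sdom->weight * sdom->active_vcpu_count) > weight_left );
        weight_left -= sdom->weight * sdom->active_vcpu_count;
        /* ... hand out credit proportional to the weight ... */
    }
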
>>>
>>> How this can happen I don't know yet.
>>> Does anyone have an idea? I'll keep searching...
>>>
>>>
>>> Juergen
>>>
>>> On 02/15/11 08:22, Juergen Gross wrote:
>>>>
>>>> On 02/14/11 18:57, George Dunlap wrote:
>>>>>
>>>>> The good news is, I've managed to reproduce this on my local test
>>>>> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
>>>>> attached script. It's time to go home now, but I should be able to
>>>>> dig something up tomorrow.
>>>>>
>>>>> To use the script:
>>>>> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
>>>>> * You can modify settings by adding "arg=val" as arguments.
>>>>> * Arguments are:
>>>>> + dryrun={true,false}: Do the work, but don't actually execute any xl
>>>>> commands. Default false.
>>>>> + left: Number of commands to execute. Default 10.
>>>>> + maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
>>>>> 8 cpus).
>>>>> + verbose={true,false}: Print what you're doing. Default is true.
>>>>>
>>>>> The script sometimes attempts to remove the last cpu from cpupool0; in
>>>>> this case, libxl will print an error. If the script gets an error
>>>>> under that condition, it will ignore it; under any other condition, it
>>>>> will print diagnostic information.
>>>>>
>>>>> What finally crashed it for me was this command:
>>>>> # ./cpupool-test.sh verbose=false left=1000
>>>>
>>>> Nice!
>>>> With your script I finally managed to get the error, too. On my box
>>>> (2 sockets with 6 cores each) I had to use
>>>>
>>>> ./cpupool-test.sh verbose=false left=10000 maxcpus=11
>>>>
>>>> to trigger it.
>>>> Looking for more data now...
>>>>
>>>>
>>>> Juergen
>>>>
>>>>>
>>>>> -George
>>>>>
>>>>> On Fri, Feb 11, 2011 at 7:39 AM, Andre Przywara
>>>>> <andre.przywara@amd.com> wrote:
>>>>>>
>>>>>> Juergen Gross wrote:
>>>>>>>
>>>>>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>>>>>
>>>>>>>> Andre Przywara wrote:
>>>>>>>>>
>>>>>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>>>>>
>>>>>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>>>>>
>>>>>>>>>>> Andre, George,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> What seems interesting: I think the problem always occurred when a
>>>>>>>>>>> new cpupool was created and the first cpu was moved to it.
>>>>>>>>>>>
>>>>>>>>>>> I think my previous assumption regarding the master_ticker was not
>>>>>>>>>>> too bad. Somehow the master_ticker of the new cpupool is becoming
>>>>>>>>>>> active before the scheduler is really initialized properly. This
>>>>>>>>>>> could happen if enough time passes between alloc_pdata for the cpu
>>>>>>>>>>> to be moved and the critical section in schedule_cpu_switch().
>>>>>>>>>>>
>>>>>>>>>>> The solution should be to activate the timers only if the
>>>>>>>>>>> scheduler is ready for them.
>>>>>>>>>>>
>>>>>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>>>>>> suspend_ticker as well? I still see potential problems for entering
>>>>>>>>>>> deep C-states. I think I'll prepare a patch which will keep the
>>>>>>>>>>> master_ticker active for the C-state case and migrate it for the
>>>>>>>>>>> schedule_cpu_switch() case.
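
(If I understand that plan correctly, the distinction would be roughly the
sketch below. stop_timer()/migrate_timer() are Xen's generic timer calls;
the flag and the other helper names here are hypothetical.)

static void csched_tick_suspend(unsigned int cpu, bool_t leaving_pool)
{
    /* The per-cpu tick always stops together with its cpu. */
    stop_timer(&per_cpu(tick_timer, cpu));

    if ( leaving_pool && cpu == master_cpu )
        /* schedule_cpu_switch() case: the cpu leaves the pool for good,
         * so move the master ticker to a cpu staying in the pool. */
        migrate_timer(&master_ticker, pick_remaining_pool_cpu(cpu));

    /* Deep C-state case: the master ticker stays where it is and keeps
     * running, so global credit accounting continues. */
}
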
>>>>>>>>>>
>>>>>>>>>> Okay, here is a patch for this. It ran on my 4-core machine without
>>>>>>>>>> any problems.
>>>>>>>>>> Andre, could you give it a try?
>>>>>>>>>
>>>>>>>>> Did, but unfortunately it crashed as always. Tried twice and made
>>>>>>>>> sure I booted the right kernel. Sorry.
>>>>>>>>> The idea of a race between the timer and the state changing sounded
>>>>>>>>> very appealing; actually, that was suspicious to me from the
>>>>>>>>> beginning.
>>>>>>>>>
>>>>>>>>> I will add some code to dump the state of all cpupools at the BUG_ON,
>>>>>>>>> to see which situation we are in when the bug triggers.
>>>>>>>>
>>>>>>>> OK, here is a first try of this. The patch iterates over all CPU
>>>>>>>> pools and outputs some data if the BUG_ON condition
>>>>>>>> ((sdom->weight * sdom->active_vcpu_count) > weight_left) triggers:
>>>>>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
>>>>>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>>>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>>>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>>>>>> ....
>>>>>>>> The masks look proper (6 cores per node); the bug triggers when the
>>>>>>>> first CPU is about to be(?) inserted.
>>>>>>>
>>>>>>> Sure? I'm missing the cpu with mask 2000 (i.e. cpu 13).
>>>>>>> I'll try to reproduce the problem on a larger machine here (24 cores,
>>>>>>> 4 numa nodes).
>>>>>>> Andre, can you give me your xen boot parameters? Which xen changeset
>>>>>>> are you running, and do you have any additional patches in use?
>>>>>>
>>>>>> The grub lines:
>>>>>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
>>>>>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>>>>>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>>>>>
>>>>>> All of my experiments use c/s 22858 as a base.
>>>>>> If you use an AMD Magny-Cours box for your experiments (socket C32 or
>>>>>> G34), you should add the following patch (removing the line):
>>>>>> --- a/xen/arch/x86/traps.c
>>>>>> +++ b/xen/arch/x86/traps.c
>>>>>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>>>>>          __clear_bit(X86_FEATURE_SKINIT % 32, &c);
>>>>>>          __clear_bit(X86_FEATURE_WDT % 32, &c);
>>>>>>          __clear_bit(X86_FEATURE_LWP % 32, &c);
>>>>>> -        __clear_bit(X86_FEATURE_NODEID_MSR % 32, &c);
>>>>>>          __clear_bit(X86_FEATURE_TOPOEXT % 32, &c);
>>>>>>          break;
>>>>>>      case 5: /* MONITOR/MWAIT */
>>>>>>
>>>>>> This is not necessary (in fact it reverts my patch c/s 22815), but it
>>>>>> raises the probability of triggering the bug, probably because it
>>>>>> increases the pressure on the Dom0 scheduler. If you cannot trigger it
>>>>>> with Dom0, try to create a guest with many VCPUs and squeeze it into a
>>>>>> small CPU-pool.
>>>>>>
>>>>>> Good luck ;-)
>>>>>> Andre.
>>>>>>
>>>>>> --
>>>>>> Andre Przywara
>>>>>> AMD-OSRC (Dresden)
>>>>>> Tel: x29712
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Juergen Gross                 Principal Developer Operating Systems
>>> TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
>>> Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
>>> Domagkstr. 28                           Internet: ts.fujitsu.com
>>> D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
>>>
>
>
> --
> Juergen Gross                 Principal Developer Operating Systems
> TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
> Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
> Domagkstr. 28                           Internet: ts.fujitsu.com
> D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html


-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany

Thread overview: 53+ messages
2011-01-27 23:18 Hypervisor crash(!) on xl cpupool-numa-split Andre Przywara
2011-01-28  6:47 ` Juergen Gross
2011-01-28 11:07   ` Andre Przywara
2011-01-28 11:44     ` Juergen Gross
2011-01-28 13:14       ` Andre Przywara
2011-01-31  7:04         ` Juergen Gross
2011-01-31 14:59           ` Andre Przywara
2011-01-31 15:28             ` George Dunlap
2011-02-01 16:32               ` Andre Przywara
2011-02-02  6:27                 ` Juergen Gross
2011-02-02  8:49                   ` Juergen Gross
2011-02-02 10:05                     ` Juergen Gross
2011-02-02 10:59                       ` Andre Przywara
2011-02-02 14:39                 ` Stephan Diestelhorst
2011-02-02 15:14                   ` Juergen Gross
2011-02-02 16:01                     ` Stephan Diestelhorst
2011-02-03  5:57                       ` Juergen Gross
2011-02-03  9:18                         ` Juergen Gross
2011-02-04 14:09                           ` Andre Przywara
2011-02-07 12:38                             ` Andre Przywara
2011-02-07 13:32                               ` Juergen Gross
2011-02-07 15:55                                 ` George Dunlap
2011-02-08  5:43                                   ` Juergen Gross
2011-02-08 12:08                                     ` George Dunlap
2011-02-08 12:14                                       ` George Dunlap
2011-02-08 16:33                                         ` Andre Przywara
2011-02-09 12:27                                           ` George Dunlap
2011-02-09 12:27                                             ` George Dunlap
2011-02-09 13:04                                               ` Juergen Gross
2011-02-09 13:39                                                 ` Andre Przywara
2011-02-09 13:51                                               ` Andre Przywara
2011-02-09 14:21                                                 ` Juergen Gross
2011-02-10  6:42                                                   ` Juergen Gross
2011-02-10  9:25                                                     ` Andre Przywara
2011-02-10 14:18                                                       ` Andre Przywara
2011-02-11  6:17                                                         ` Juergen Gross
2011-02-11  7:39                                                           ` Andre Przywara
2011-02-14 17:57                                                             ` George Dunlap
2011-02-15  7:22                                                               ` Juergen Gross
2011-02-16  9:47                                                                 ` Juergen Gross
2011-02-16 13:54                                                                   ` George Dunlap
     [not found]                                                                     ` <4D6237C6.1050206@amd.com>
2011-02-16 14:11                                                                     ` Juergen Gross
2011-02-16 14:28                                                                       ` Juergen Gross
2011-02-17  0:05                                                                       ` André Przywara [this message]
2011-02-17  7:05                                                                     ` Juergen Gross
2011-02-17  9:11                                                                       ` Juergen Gross
2011-02-21 10:00                                                                     ` Andre Przywara
2011-02-21 13:19                                                                       ` Juergen Gross
2011-02-21 14:45                                                                         ` Andre Przywara
2011-02-21 14:50                                                                           ` Juergen Gross
2011-02-08 12:23                                       ` Juergen Gross
2011-01-28 11:13   ` George Dunlap
2011-01-28 13:05     ` Andre Przywara
