* Strange kernel BUG() on PV DomU boot
@ 2012-06-22 12:21 Joanna Rutkowska
2012-06-22 12:26 ` Joanna Rutkowska
0 siblings, 1 reply; 11+ messages in thread
From: Joanna Rutkowska @ 2012-06-22 12:21 UTC (permalink / raw)
To: xen-devel@lists.xensource.com; +Cc: Marek Marczykowski
Hello,
From time to time (every several weeks or even less) I run into a
strange DomU kernel BUG() that manifests itself with the following
message (see the end of this mail). The Dom0 and VM kernels are 3.2.7
pvops, and the Xen hypervisor is 4.1.2, both with only minor (and, I
think, irrelevant) modifications for Qubes.
The bug is very hard to reproduce, but once this BUG() starts being
signaled, it consistently prevents me from starting any new VMs in the
system (I've tried over a dozen times now, and every time the VM boot
fails).
The following lines in the VM kernel are responsible for signaling the
BUG():
    if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt))
            BUG();
...yet there is nothing in xl dmesg that would provide more info on why
this hypercall fails. Ah, that's because there are no printk()s in the
hypercall code:
    case VCPUOP_initialise:
        if ( v->vcpu_info == &dummy_vcpu_info )
            return -EINVAL;

        if ( (ctxt = xmalloc(struct vcpu_guest_context)) == NULL )
            return -ENOMEM;

        if ( copy_from_guest(ctxt, arg, 1) )
        {
            xfree(ctxt);
            return -EFAULT;
        }

        domain_lock(d);
        rc = -EEXIST;
        if ( !v->is_initialised )
            rc = boot_vcpu(d, vcpuid, ctxt);
        domain_unlock(d);

        xfree(ctxt);
        break;
So, looking at the above, it seems it might be failing because
xmalloc() fails; however, Xen seems to have enough memory, as reported
by xl info:
total_memory : 8074
free_memory : 66
free_cpus : 0
Any ideas what might be the cause?
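(For reference, a purely illustrative sketch -- this is not code that
exists in Xen 4.1.2 -- of how the -ENOMEM path of the handler quoted
above could be made visible in xl dmesg:)

    case VCPUOP_initialise:
        if ( v->vcpu_info == &dummy_vcpu_info )
            return -EINVAL;

        if ( (ctxt = xmalloc(struct vcpu_guest_context)) == NULL )
        {
            /* Illustrative only: surface the allocation failure in the
             * hypervisor log, so a failing guest boot can be explained. */
            printk(XENLOG_G_WARNING
                   "d%d: VCPUOP_initialise: xmalloc(vcpu_guest_context) failed\n",
                   d->domain_id);
            return -ENOMEM;
        }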
FWIW, the actual oops message is below.
Thanks,
joanna.
[ 0.004356] ------------[ cut here ]------------
[ 0.004361] kernel BUG at
/home/user/qubes-src/kernel/kernel-3.2.7/linux-3.2.7/arch/x86/xen/smp.c:322!
[ 0.004366] invalid opcode: 0000 [#1] SMP
[ 0.004370] CPU 0
[ 0.004372] Modules linked in:
[ 0.004376]
[ 0.004379] Pid: 1, comm: swapper/0 Not tainted
3.2.7-5.pvops.qubes.x86_64 #1
[ 0.004385] RIP: e030:[<ffffffff8143a229>] [<ffffffff8143a229>]
cpu_initialize_context+0x263/0x280
[ 0.004396] RSP: e02b:ffff880018063e10 EFLAGS: 00010282
[ 0.004399] RAX: fffffffffffffff4 RBX: ffff8800180c0000 RCX:
0000000000000000
[ 0.004404] RDX: ffff8800180c0000 RSI: 0000000000000001 RDI:
0000000000000000
[ 0.004408] RBP: ffff880018063e50 R08: 00003ffffffff000 R09:
ffff880000000000
[ 0.004412] R10: ffff8800180c0000 R11: 0000000000002000 R12:
0000000000000001
[ 0.004417] R13: ffff880018f82d30 R14: ffff88001806e0c0 R15:
00000000000a98ed
[ 0.004429] FS: 0000000000000000(0000) GS:ffff880018f5c000(0000)
knlGS:0000000000000000
[ 0.004436] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.004441] CR2: 0000000000000000 CR3: 0000000001805000 CR4:
0000000000002660
[ 0.004447] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 0.004452] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 0.004459] Process swapper/0 (pid: 1, threadinfo ffff880018062000,
task ffff880018060040)
[ 0.004465] Stack:
[ 0.004469] ffff88001806e0c0 0000000000018f7b ffffffff81866c80
0000000000000001
[ 0.004479] ffff88001806e0c0 0000000000000001 ffffffff81866c80
0000000000000001
[ 0.004490] ffff880018063e80 ffffffff8143a2e1 ffff880018063e70
0000000000000000
[ 0.004500] Call Trace:
[ 0.004507] [<ffffffff8143a2e1>] xen_cpu_up+0x9b/0x115
[ 0.004513] [<ffffffff81440ad8>] _cpu_up+0x9c/0x10e
[ 0.004520] [<ffffffff81440bbf>] cpu_up+0x75/0x85
[ 0.004527] [<ffffffff818998f1>] smp_init+0x46/0x9e
[ 0.004533] [<ffffffff8188263c>] kernel_init+0x89/0x142
[ 0.004541] [<ffffffff814518b4>] kernel_thread_helper+0x4/0x10
[ 0.004549] [<ffffffff8144f973>] ? int_ret_from_sys_call+0x7/0x1b
[ 0.004558] [<ffffffff81447d7c>] ? retint_restore_args+0x5/0x6
[ 0.004565] [<ffffffff814518b0>] ? gs_change+0x13/0x13
[ 0.004570] Code: 74 0d 48 ba ff ff ff ff ff ff ff 3f 48 21 d0 48 c1
e0 0c 31 ff 49 63 f4 48 89 83 90 13 00 00 48 89 da e8 db 70 bc ff 85 c0
74 04 <0f> 0b eb fe 48 89 df e8 db f6 ce ff 31 c0 48 83 c4 18 5b 41 5c
[ 0.004653] RIP [<ffffffff8143a229>] cpu_initialize_context+0x263/0x280
[ 0.004661] RSP <ffff880018063e10>
[ 0.004672] ---[ end trace 4eaa2a86a8e2da22 ]---
[ 0.004686] Kernel panic - not syncing: Attempted to kill init!
[ 0.004692] Pid: 1, comm: swapper/0 Tainted: G D
3.2.7-5.pvops.qubes.x86_64 #1
[ 0.004698] Call Trace:
[ 0.004704] [<ffffffff81444c4a>] panic+0x8c/0x1a2
[ 0.004712] [<ffffffff81059814>] ? enqueue_entity+0x74/0x2f0
[ 0.004719] [<ffffffff8106113d>] forget_original_parent+0x34d/0x360
[ 0.004728] [<ffffffff8100a05f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 0.004735] [<ffffffff814478b1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
[ 0.004743] [<ffffffff8104acb3>] ? sched_move_task+0x93/0x150
[ 0.004750] [<ffffffff81061162>] exit_notify+0x12/0x190
[ 0.004756] [<ffffffff81062a3d>] do_exit+0x1ed/0x3e0
[ 0.004763] [<ffffffff814489e6>] oops_end+0xa6/0xf0
[ 0.004770] [<ffffffff81016476>] die+0x56/0x90
[ 0.004776] [<ffffffff81448584>] do_trap+0xc4/0x170
[ 0.004783] [<ffffffff81014440>] do_invalid_op+0x90/0xb0
[ 0.004790] [<ffffffff8143a229>] ? cpu_initialize_context+0x263/0x280
[ 0.004799] [<ffffffff81128ce4>] ? cache_grow.clone.0+0x2b4/0x3b0
[ 0.004805] [<ffffffff8100a05f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[ 0.004812] [<ffffffff810052f1>] ? pte_mfn_to_pfn+0x71/0xf0
[ 0.004820] [<ffffffff8145172b>] invalid_op+0x1b/0x20
[ 0.004827] [<ffffffff8143a229>] ? cpu_initialize_context+0x263/0x280
[ 0.004834] [<ffffffff8143a2e1>] xen_cpu_up+0x9b/0x115
[ 0.004840] [<ffffffff81440ad8>] _cpu_up+0x9c/0x10e
[ 0.004846] [<ffffffff81440bbf>] cpu_up+0x75/0x85
[ 0.004852] [<ffffffff818998f1>] smp_init+0x46/0x9e
[ 0.004858] [<ffffffff8188263c>] kernel_init+0x89/0x142
[ 0.004864] [<ffffffff814518b4>] kernel_thread_helper+0x4/0x10
[ 0.004871] [<ffffffff8144f973>] ? int_ret_from_sys_call+0x7/0x1b
[ 0.004878] [<ffffffff81447d7c>] ? retint_restore_args+0x5/0x6
[ 0.004885] [<ffffffff814518b0>] ? gs_change+0x13/0x13
* Re: Strange kernel BUG() on PV DomU boot
2012-06-22 12:21 Strange kernel BUG() on PV DomU boot Joanna Rutkowska
@ 2012-06-22 12:26 ` Joanna Rutkowska
2012-06-22 12:38 ` Jan Beulich
0 siblings, 1 reply; 11+ messages in thread
From: Joanna Rutkowska @ 2012-06-22 12:26 UTC (permalink / raw)
To: xen-devel@lists.xensource.com; +Cc: Marek Marczykowski
On 06/22/12 14:21, Joanna Rutkowska wrote:
> Hello,
>
> From time to time (every several weeks or even less) I run into a
> strange DomU kernel BUG() that manifests itself with the following
> message (see the end of the message). The Dom0 and VM kernels are 3.2.7
> pvops, and the Xen hypervisor is 4.1.2 both with only some minor,
> irrelevant (I think) modifications for Qubes.
>
> The bug is very hard to reproduce, but once this BUG() starts being
> signaled, it consistently prevents me from starting any new VMs in the
> system (e.g. tried over a dozen of times now, and every time the VM boot
> fails).
>
> The following lines in the VM kernel are responsible for signaling the
> BUG():
>
> if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt))
> BUG();
>
> ...yet, there is nothing in the xl dmesg that would provide more info
> why this hypercall fails. Ah, that's because there are not printk's in
> the hypercall code:
>
> case VCPUOP_initialise:
> if ( v->vcpu_info == &dummy_vcpu_info )
> return -EINVAL;
>
> if ( (ctxt = xmalloc(struct vcpu_guest_context)) == NULL )
> return -ENOMEM;
>
> if ( copy_from_guest(ctxt, arg, 1) )
> {
> xfree(ctxt);
> return -EFAULT;
> }
>
> domain_lock(d);
> rc = -EEXIST;
> if ( !v->is_initialised )
> rc = boot_vcpu(d, vcpuid, ctxt);
> domain_unlock(d);
>
> xfree(ctxt);
> break;
>
> So, looking at the above it seems like it might be failing because of
> xmalloc() fails, however Xen seems to have enough memory as reported by
> xl info:
>
> total_memory : 8074
> free_memory : 66
> free_cpus : 0
>
> Any ideas what might be the cause?
>
> FWIW, below the actual oops message.
>
Ok, it seems this was an out-of-memory condition indeed, because once
I did:

    xl mem-set 0 1800m

and then quickly started a VM, it booted fine...
Is there any proposal for how to handle out-of-memory conditions in Xen
(like this one, as well as e.g. the SWIOTLB problem) in a more
user-friendly way?
Any recommendations regarding the preferred minimum Xen free memory, as
reported by xl info, that should be preserved in order to ensure Xen
runs smoothly?
joanna.
* Re: Strange kernel BUG() on PV DomU boot
2012-06-22 12:26 ` Joanna Rutkowska
@ 2012-06-22 12:38 ` Jan Beulich
2012-06-22 12:53 ` Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot) Joanna Rutkowska
0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2012-06-22 12:38 UTC (permalink / raw)
To: Joanna Rutkowska; +Cc: Marek Marczykowski, xen-devel
>>> On 22.06.12 at 14:26, Joanna Rutkowska <joanna@invisiblethingslab.com> wrote:
> On 06/22/12 14:21, Joanna Rutkowska wrote:
>> Hello,
>>
>> From time to time (every several weeks or even less) I run into a
>> strange DomU kernel BUG() that manifests itself with the following
>> message (see the end of the message). The Dom0 and VM kernels are 3.2.7
>> pvops, and the Xen hypervisor is 4.1.2 both with only some minor,
>> irrelevant (I think) modifications for Qubes.
>>
>> The bug is very hard to reproduce, but once this BUG() starts being
>> signaled, it consistently prevents me from starting any new VMs in the
>> system (e.g. tried over a dozen of times now, and every time the VM boot
>> fails).
>>
>> The following lines in the VM kernel are responsible for signaling the
>> BUG():
>>
>> if (HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt))
>> BUG();
>>
>> ...yet, there is nothing in the xl dmesg that would provide more info
>> why this hypercall fails. Ah, that's because there are not printk's in
>> the hypercall code:
>>
>> case VCPUOP_initialise:
>> if ( v->vcpu_info == &dummy_vcpu_info )
>> return -EINVAL;
>>
>> if ( (ctxt = xmalloc(struct vcpu_guest_context)) == NULL )
>> return -ENOMEM;
>>
>> if ( copy_from_guest(ctxt, arg, 1) )
>> {
>> xfree(ctxt);
>> return -EFAULT;
>> }
>>
>> domain_lock(d);
>> rc = -EEXIST;
>> if ( !v->is_initialised )
>> rc = boot_vcpu(d, vcpuid, ctxt);
>> domain_unlock(d);
>>
>> xfree(ctxt);
>> break;
>>
>> So, looking at the above it seems like it might be failing because of
>> xmalloc() fails, however Xen seems to have enough memory as reported by
>> xl info:
>>
>> total_memory : 8074
>> free_memory : 66
>> free_cpus : 0
>>
>> Any ideas what might be the cause?
>>
>> FWIW, below the actual oops message.
>>
>
> Ok, it seems like this was an out-of-memory condition indeed, because
> once I did:
>
> xl mem-set 0 1800m
>
> and then quickly started a VM, it booted fine...
Had you looked at the error value in %rax, you would also
have seen that it's -ENOMEM. I suppose the problem here is
that a multi-page allocation was needed, yet only single
pages were available.
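(For reference, the RAX value in the oops quoted earlier decodes as
follows:)

    0xfffffffffffffff4 = -(0x10000000000000000 - 0xfffffffffffffff4)
                       = -0xc = -12 = -ENOMEM    (ENOMEM is 12 on Linux)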
> Is there any proposal of how to handle out of memory conditions in Xen
> (like this one, as well as e.g. SWIOTLB problem) in a more user friendly
> way?
In 4.2, I hope we managed to remove all runtime allocations larger
than a page, so the particular situation here should not arise
anymore.
As to being more user-friendly - what do you have in mind? An error is
an error (and converting it into a meaningful, user-visible message is
the responsibility of the entity receiving the error). In the case at
hand, printing an error message wouldn't meaningfully increase
user-friendliness, imo.
> Any recommendations regarding the preferred minimum Xen free memory, as
> reported by xl info, that should be preserved in order to assure Xen
> runs smoothly?
In pre-4.2 Xen, there's not much you can do when memory gets
fragmented (otherwise you'd have to keep more than half the
memory in the box in the hypervisor). With multi-page runtime
allocations gone, you should be fine leaving just a minimal amount
to the hypervisor.
Jan
* Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot)
2012-06-22 12:38 ` Jan Beulich
@ 2012-06-22 12:53 ` Joanna Rutkowska
2012-06-22 13:02 ` Jan Beulich
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Joanna Rutkowska @ 2012-06-22 12:53 UTC (permalink / raw)
To: Jan Beulich; +Cc: Marek Marczykowski, xen-devel
On 06/22/12 14:38, Jan Beulich wrote:
>> Ok, it seems like this was an out-of-memory condition indeed, because
>> > once I did:
>> >
>> > xl mem-set 0 1800m
>> >
>> > and then quickly started a VM, it booted fine...
> Had you looked at the error value in %rax, you would also
> have seen that it's -ENOMEM. I suppose the problem here is
> that a multi-page allocation was needed, yet only single
> pages were available.
>
Ah, right, good point.
>> > Is there any proposal of how to handle out of memory conditions in Xen
>> > (like this one, as well as e.g. SWIOTLB problem) in a more user friendly
>> > way?
> In 4.2, I hope we managed to remove all runtime allocations
> larger than a page, so the particular situation here should not arise
> anymore.
>
> As to more user-friendly - what do you think of? An error is an
> error (and converting this to a meaningful, user visible message
> is the responsibility of the entity receiving the error). In the
> case at hand, printing an error message wouldn't meaningfully
> increase user-friendliness imo.
>
How would you suggest letting the user (in an interactive desktop
system, such as Qubes) know why his or her VM doesn't start? Certainly,
some savvy user might just analyze the guest's dmesg log, but that's
really not a user-friendly solution. And yet out-of-memory errors are
something that might happen quite often and are not really "exceptions"
or "errors" in the same sense as, e.g., traditional BUG() conditions
that suggest something really bad happened. The problem here is that
this failure occurs after the domain has been built and is already
running, so xl start is not a good place to return the error. The same
goes for the SWIOTLB out-of-memory errors, which again just prevent the
domain from starting. Any other ideas on how to handle such situations
more gracefully?
>> > Any recommendations regarding the preferred minimum Xen free memory, as
>> > reported by xl info, that should be preserved in order to assure Xen
>> > runs smoothly?
> In pre-4.2 Xen, there's not much you can do when memory gets
> fragmented (otherwise you'd have to keep more than half the
> memory in the box in the hypervisor). With multi-page runtime
> allocations gone, you should be fine leaving just a minimal amount
> to the hypervisor.
Right, but 4.2 will not be released for weeks or months :/ And we
would like to release Qubes 1.0 within weeks...
joanna.
* Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot)
2012-06-22 12:53 ` Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot) Joanna Rutkowska
@ 2012-06-22 13:02 ` Jan Beulich
2012-06-22 13:11 ` Handling of out of memory conditions Joanna Rutkowska
2012-06-22 14:46 ` Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot) George Dunlap
2012-06-25 15:39 ` Konrad Rzeszutek Wilk
2 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2012-06-22 13:02 UTC (permalink / raw)
To: Joanna Rutkowska; +Cc: Konrad Rzeszutek Wilk, Marek Marczykowski, xen-devel
>>> On 22.06.12 at 14:53, Joanna Rutkowska <joanna@invisiblethingslab.com> wrote:
> On 06/22/12 14:38, Jan Beulich wrote:
>>> > Is there any proposal of how to handle out of memory conditions in Xen
>>> > (like this one, as well as e.g. SWIOTLB problem) in a more user friendly
>>> > way?
>> In 4.2, I hope we managed to remove all runtime allocations
>> larger than a page, so the particular situation here should not arise
>> anymore.
>>
>> As to more user-friendly - what do you think of? An error is an
>> error (and converting this to a meaningful, user visible message
>> is the responsibility of the entity receiving the error). In the
>> case at hand, printing an error message wouldn't meaningfully
>> increase user-friendliness imo.
>
> How would you suggest to let the user (in an interactive desktop system,
> such as Qubes) know why his or her VM doesn't start? Certainly, some
> savvy user might just analyze the guest's dmesg log but that's really
> not a user friendly solution. And yet the out of memory errors are
> something that might happen quite often and are not really "exception"
> or "errors" in the same sense as e.g. traditional BUG() conditions that
> suggest something really bad happened. The problem here is that this bug
> occurs after the domain has been built, and is now running, so xl start
> is not a good place to return the error. Same with SWIOTLB out of memory
> errors, that again just prevent the domain from starting. Any other
> ideas how to handle such situations more gracefully?
In the case at hand, failing CPU bringup rather than invoking
BUG() would likely be possible. Then the guest would come up
single-CPU. (That's a more general theme though: Many BUG()
instances really don't need to be as harsh.)
SWIOTLB allocation is a different thing - if the guest really needs it,
yet fails to set it up, the most it could do is defer the crash until
the first I/O that needs to make use of it. Which likely doesn't buy
the user much.
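(To illustrate the first point, a rough sketch -- based on the 3.2
arch/x86/xen/smp.c code quoted at the start of the thread, not an
actual patch -- of failing the bringup instead of BUG()ing. Assuming
xen_cpu_up() propagates a non-zero return from
cpu_initialize_context(), as it appears to in 3.2, the guest would then
simply come up with fewer CPUs:)

    /* In cpu_initialize_context(), with an `int rc;` declared at the
     * top of the function, instead of BUG()ing on failure: */
    rc = HYPERVISOR_vcpu_op(VCPUOP_initialise, cpu, ctxt);
    if (rc) {
        /* e.g. -ENOMEM from the hypervisor; log it and leave this CPU
         * offline instead of killing the whole guest. */
        pr_err("xen: VCPUOP_initialise for CPU %u failed: %d\n", cpu, rc);
        kfree(ctxt);
        return rc;
    }

    kfree(ctxt);
    return 0;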
>>> > Any recommendations regarding the preferred minimum Xen free memory, as
>>> > reported by xl info, that should be preserved in order to assure Xen
>>> > runs smoothly?
>> In pre-4.2 Xen, there's not much you can do when memory gets
>> fragmented (otherwise you'd have to keep more than half the
>> memory in the box in the hypervisor). With multi-page runtime
>> allocations gone, you should be fine leaving just a minimal amount
>> to the hypervisor.
>
> Right, but 4.2 will not be released until weeks or months :/ And we
> would like to release Qubes 1.0 within weeks...
If you're running your own hypervisor, you could go and backport all
those changes. If you're running some vendor's, you could ask them to
(but if you asked us, we'd likely [try to] refuse).
Jan
* Re: Handling of out of memory conditions
2012-06-22 13:02 ` Jan Beulich
@ 2012-06-22 13:11 ` Joanna Rutkowska
2012-06-22 13:21 ` Jan Beulich
0 siblings, 1 reply; 11+ messages in thread
From: Joanna Rutkowska @ 2012-06-22 13:11 UTC (permalink / raw)
To: Jan Beulich; +Cc: Konrad Rzeszutek Wilk, Marek Marczykowski, xen-devel
On 06/22/12 15:02, Jan Beulich wrote:
>>>> On 22.06.12 at 14:53, Joanna Rutkowska <joanna@invisiblethingslab.com> wrote:
>> > On 06/22/12 14:38, Jan Beulich wrote:
>>>>> >>> > Is there any proposal of how to handle out of memory conditions in Xen
>>>>> >>> > (like this one, as well as e.g. SWIOTLB problem) in a more user friendly
>>>>> >>> > way?
>>> >> In 4.2, I hope we managed to remove all runtime allocations
>>> >> larger than a page, so the particular situation here should not arise
>>> >> anymore.
>>> >>
>>> >> As to more user-friendly - what do you think of? An error is an
>>> >> error (and converting this to a meaningful, user visible message
>>> >> is the responsibility of the entity receiving the error). In the
>>> >> case at hand, printing an error message wouldn't meaningfully
>>> >> increase user-friendliness imo.
>> >
>> > How would you suggest to let the user (in an interactive desktop system,
>> > such as Qubes) know why his or her VM doesn't start? Certainly, some
>> > savvy user might just analyze the guest's dmesg log but that's really
>> > not a user friendly solution. And yet the out of memory errors are
>> > something that might happen quite often and are not really "exception"
>> > or "errors" in the same sense as e.g. traditional BUG() conditions that
>> > suggest something really bad happened. The problem here is that this bug
>> > occurs after the domain has been built, and is now running, so xl start
>> > is not a good place to return the error. Same with SWIOTLB out of memory
>> > errors, that again just prevent the domain from starting. Any other
>> > ideas how to handle such situations more gracefully?
> In the case at hand, failing CPU bringup rather than invoking
> BUG() would likely be possible. Then the guest would come up
> single-CPU. (That's a more general theme though: Many BUG()
> instances really don't need to be as harsh.)
>
> SWIOTLB allocation is a different thing - if the guest really needs
> it, yet fails to set it up, the most it could do is to defer the crash
> until the first I/O needs to make use of it. Which likely doesn't buy
> much to the user.
>
How about having the guest kernel (e.g. every time it is about to
BUG()) write the cause of Xen-related runtime errors (such as out of
memory conditions) to some predefined xenstore key? That would allow
the management tools/whatever other software to retrieve it easily and
display a meaningful message to the user. Hm... access to xenstore
might not be easy at the early VM boot stage -- so perhaps writing it
into some predefined shared page that could be easily read by the
toolstack?
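(A minimal sketch of what the guest side could look like -- purely
hypothetical: the helper and the "error/boot-reason" key are made up
here, and this assumes xenstore is already usable at the point of
failure, which, as noted, may not hold early during boot:)

    /* Hypothetical guest-side helper (not an existing API): record the
     * reason for a fatal Xen-related boot error under the domain's own
     * xenstore path, so that the toolstack can read and present it. */
    #include <linux/kernel.h>
    #include <xen/xenbus.h>

    static void xen_report_boot_error(const char *what, int err)
    {
            /* Relative paths are resolved against /local/domain/<domid>/,
             * as with the existing xenbus "error/" nodes. */
            if (xenbus_printf(XBT_NIL, "error", "boot-reason",
                              "%s: %d", what, err))
                    pr_warn("xen: failed to record boot error in xenstore\n");
    }

    /* e.g. called just before the BUG() in cpu_initialize_context():
     *     xen_report_boot_error("VCPUOP_initialise", rc);
     */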
joanna.
* Re: Handling of out of memory conditions
2012-06-22 13:11 ` Handling of out of memory conditions Joanna Rutkowska
@ 2012-06-22 13:21 ` Jan Beulich
2012-06-22 13:24 ` Joanna Rutkowska
0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2012-06-22 13:21 UTC (permalink / raw)
To: Joanna Rutkowska; +Cc: Konrad Rzeszutek Wilk, Marek Marczykowski, xen-devel
>>> On 22.06.12 at 15:11, Joanna Rutkowska <joanna@invisiblethingslab.com> wrote:
> How about having the guest kernel (e.g. every time it is about to BUG())
> write the cause of the Xen-related runtime errors (such as out of memory
> conditions) to some predefined xenstore key, which would allow the
> management tools/whatever other software to retrieve that easily and
> display a meaningful message to the user? Hm... access to the xenstore
> might not be easy at the early VM boot stage -- so perhaps writing it
> into some predefined shared page, that could be easily read by the
> toolstack?
That's the console shared page, isn't it?
Jan
* Re: Handling of out of memory conditions
2012-06-22 13:21 ` Jan Beulich
@ 2012-06-22 13:24 ` Joanna Rutkowska
0 siblings, 0 replies; 11+ messages in thread
From: Joanna Rutkowska @ 2012-06-22 13:24 UTC (permalink / raw)
To: Jan Beulich; +Cc: Konrad Rzeszutek Wilk, Marek Marczykowski, xen-devel
On 06/22/12 15:21, Jan Beulich wrote:
>>>> On 22.06.12 at 15:11, Joanna Rutkowska <joanna@invisiblethingslab.com> wrote:
>> > How about having the guest kernel (e.g. every time it is about to BUG())
>> > write the cause of the Xen-related runtime errors (such as out of memory
>> > conditions) to some predefined xenstore key, which would allow the
>> > management tools/whatever other software to retrieve that easily and
>> > display a meaningful message to the user? Hm... access to the xenstore
>> > might not be easy at the early VM boot stage -- so perhaps writing it
>> > into some predefined shared page, that could be easily read by the
>> > toolstack?
> That's the console shared page, isn't it?
Yeah, but parsing and interpreting the console output is problematic --
e.g. how should an automatic tool know, from the oops message I quoted
in my first message, that the reason the VM didn't start was just an
out-of-memory condition? I'm thinking of some simple form of
Xen-related runtime error reporting (this would mostly be out of
memory), something easily parseable by scripts. Again, the goal is to
display a simple error message to the user, explaining why his or her
VM doesn't start.
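(A purely illustrative toolstack-side counterpart to the guest-side
sketch above -- this assumes the libxenstore xs_open()/xs_read()
interface and the made-up "error/boot-reason" key; exact header and
function names have varied across Xen versions:)

    /* Read the hypothetical error key for a given domid and turn it
     * into a user-visible message. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <xenstore.h>

    int main(int argc, char **argv)
    {
            struct xs_handle *xs;
            char path[64];
            unsigned int len;
            char *reason;

            if (argc < 2)
                    return 1;
            xs = xs_open(0);
            if (!xs)
                    return 1;
            snprintf(path, sizeof(path),
                     "/local/domain/%s/error/boot-reason", argv[1]);
            reason = xs_read(xs, XBT_NULL, path, &len); /* NULL if absent */
            if (reason) {
                    printf("VM %s failed to start: %s\n", argv[1], reason);
                    free(reason);
            }
            xs_close(xs);
            return 0;
    }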
joanna.
* Re: Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot)
2012-06-22 12:53 ` Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot) Joanna Rutkowska
2012-06-22 13:02 ` Jan Beulich
@ 2012-06-22 14:46 ` George Dunlap
2012-06-22 15:22 ` George Dunlap
2012-06-25 15:39 ` Konrad Rzeszutek Wilk
2 siblings, 1 reply; 11+ messages in thread
From: George Dunlap @ 2012-06-22 14:46 UTC (permalink / raw)
To: Joanna Rutkowska, Jonathan Ludlam
Cc: Marek Marczykowski, Jan Beulich, xen-devel
On Fri, Jun 22, 2012 at 1:53 PM, Joanna Rutkowska
<joanna@invisiblethingslab.com> wrote:
>>> > Any recommendations regarding the preferred minimum Xen free memory, as
>>> > reported by xl info, that should be preserved in order to assure Xen
>>> > runs smoothly?
>> In pre-4.2 Xen, there's not much you can do when memory gets
>> fragmented (otherwise you'd have to keep more than half the
>> memory in the box in the hypervisor). With multi-page runtime
>> allocations gone, you should be fine leaving just a minimal amount
>> to the hypervisor.
>
> Right, but 4.2 will not be released until weeks or months :/ And we
> would like to release Qubes 1.0 within weeks...
FWIW, XenServer has had to tackle this same issue. As I understand it,
what they ended up doing was to build a model of how much memory they
thought each VM was going to use (based on things like guest memory,
video RAM size, number of vcpus, and so on). They then manually track
how much memory they think each running VM is using, and use that to
figure out whether they would be able to start a new VM or not.
Obviously that's a big pain to set up and keep current, but it actually
seems to have worked tolerably well for them.
The XenServer toolstack is open source, so it shouldn't be a problem
to share that formula. If you think that would work for you, I think
Jonathan Ludlam is probably the guy to help out with that.
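(A purely illustrative sketch of the kind of bookkeeping George
describes -- the constants are invented for illustration and are *not*
the XenServer/xapi formula:)

    /* Rough per-VM host-memory footprint estimate; all numbers are
     * made up for illustration. */
    struct vm_cfg {
            unsigned long maxmem_kib;   /* guest memory               */
            unsigned long videoram_kib; /* emulated video RAM (HVM)   */
            unsigned int  vcpus;
    };

    static unsigned long vm_footprint_kib(const struct vm_cfg *vm)
    {
            unsigned long overhead = 4 * 1024;    /* fixed per-VM guess */

            overhead += vm->vcpus * 256;          /* per-vcpu guess     */
            overhead += vm->maxmem_kib / 256;     /* ~0.4% for p2m etc. */
            return vm->maxmem_kib + vm->videoram_kib + overhead;
    }

    /* A new VM is then allowed to start only if the sum of
     * vm_footprint_kib() over all running VMs, plus the new VM's
     * footprint and a safety margin, fits within the host's memory. */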
-George
* Re: Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot)
2012-06-22 14:46 ` Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot) George Dunlap
@ 2012-06-22 15:22 ` George Dunlap
0 siblings, 0 replies; 11+ messages in thread
From: George Dunlap @ 2012-06-22 15:22 UTC (permalink / raw)
To: Joanna Rutkowska, Jonathan Ludlam
Cc: Marek Marczykowski, Jan Beulich, xen-devel
On Fri, Jun 22, 2012 at 3:46 PM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
>> Right, but 4.2 will not be released until weeks or months :/ And we
>> would like to release Qubes 1.0 within weeks...
>
> FWIW, XenServer has had to tackle this same issue. As I understand
> it, what they ended up doing was to have a model of how much memory
> they thought each VM was going to use (based on things like guest
> memory, video ram size, number of vcpus, and so on). They then
> manually try to track how much memory they think each running VM was
> using, and used that to figure out whether they would be able to start
> a new VM or not. Obviously that's a big pain to set up and keep
> current, but it actually seems to have worked tolerably well for them.
And I realize this is basically:
Patient: "Doc it hurts when I do this" [demonstrates]
Doctor: "Then don't do that."
It would be great to offer something better, but as Jan said, that's
unfortunately the best that 4.1 has to offer at the moment.
-George
* Re: Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot)
2012-06-22 12:53 ` Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot) Joanna Rutkowska
2012-06-22 13:02 ` Jan Beulich
2012-06-22 14:46 ` Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot) George Dunlap
@ 2012-06-25 15:39 ` Konrad Rzeszutek Wilk
2 siblings, 0 replies; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-06-25 15:39 UTC (permalink / raw)
To: Joanna Rutkowska; +Cc: Marek Marczykowski, Jan Beulich, xen-devel
On Fri, Jun 22, 2012 at 02:53:18PM +0200, Joanna Rutkowska wrote:
> On 06/22/12 14:38, Jan Beulich wrote:
> >> Ok, it seems like this was an out-of-memory condition indeed, because
> >> > once I did:
> >> >
> >> > xl mem-set 0 1800m
> >> >
> >> > and then quickly started a VM, it booted fine...
> > Had you looked at the error value in %rax, you would also
> > have seen that it's -ENOMEM. I suppose the problem here is
> > that a multi-page allocation was needed, yet only single
> > pages were available.
> >
>
> Ah, right, good point.
>
> >> > Is there any proposal of how to handle out of memory conditions in Xen
> >> > (like this one, as well as e.g. SWIOTLB problem) in a more user friendly
> >> > way?
> > In 4.2, I hope we managed to remove all runtime allocations
> > larger than a page, so the particular situation here should arise
> > anymore.
> >
> > As to more user-friendly - what do you think of? An error is an
> > error (and converting this to a meaningful, user visible message
> > is the responsibility of the entity receiving the error). In the
> > case at hand, printing an error message wouldn't meaningfully
> > increase user-friendliness imo.
> >
>
> How would you suggest to let the user (in an interactive desktop system,
> such as Qubes) know why his or her VM doesn't start? Certainly, some
> savvy user might just analyze the guest's dmesg log but that's really
> not a user friendly solution. And yet the out of memory errors are
> something that might happen quite often and are not really "exception"
> or "errors" in the same sense as e.g. traditional BUG() conditions that
> suggest something really bad happened. The problem here is that this bug
> occurs after the domain has been built, and is now running, so xl start
> is not a good place to return the error. Same with SWIOTLB out of memory
> errors, that again just prevent the domain from starting. Any other
> ideas how to handle such situations more gracefully?
Right now, SWIOTLB retries with smaller sizes.
Thread overview: 11+ messages
2012-06-22 12:21 Strange kernel BUG() on PV DomU boot Joanna Rutkowska
2012-06-22 12:26 ` Joanna Rutkowska
2012-06-22 12:38 ` Jan Beulich
2012-06-22 12:53 ` Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot) Joanna Rutkowska
2012-06-22 13:02 ` Jan Beulich
2012-06-22 13:11 ` Handling of out of memory conditions Joanna Rutkowska
2012-06-22 13:21 ` Jan Beulich
2012-06-22 13:24 ` Joanna Rutkowska
2012-06-22 14:46 ` Handling of out of memory conditions (was: Re: Strange kernel BUG() on PV DomU boot) George Dunlap
2012-06-22 15:22 ` George Dunlap
2012-06-25 15:39 ` Konrad Rzeszutek Wilk