* Unplugging a dom0 vcpu and domain destruction
@ 2009-02-17 17:30 George Dunlap
2009-02-17 17:39 ` Keir Fraser
0 siblings, 1 reply; 10+ messages in thread
From: George Dunlap @ 2009-02-17 17:30 UTC (permalink / raw)
To: xen-devel@lists.xensource.com
In the course of developing the new scheduler, I noticed something
rather strange.
If I bring dom0's second cpu offline (echo "0" >
/sys/devices/system/cpu/cpu1/online), and then create and destroy a
number of domains, xen/common/domain.c:domain_destroy() is not called
(nor vcpu_destroy, and the scheduler domain destruction
functionality). If I bring the cpu back online (echo 1 > ...), the
domains are destroyed almost immediately.
domain_destroy() is only called from put_domain(), so presumably
reference counts are being held somewhere and are not released while
the second cpu is offline.
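To make the reasoning concrete, here is a minimal, self-contained sketch of
the reference-counting pattern being described. It is a toy model, not Xen's
code: struct toy_domain and the toy_* functions are hypothetical stand-ins
for struct domain, get_domain(), put_domain() and domain_destroy().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for Xen's struct domain; only the fields needed here. */
struct toy_domain {
    int refcnt;      /* outstanding references */
    bool destroyed;  /* set once the final teardown has run */
};

/* Stand-in for domain_destroy(): runs only when the last reference goes. */
static void toy_domain_destroy(struct toy_domain *d)
{
    d->destroyed = true;
    printf("domain destroyed\n");
}

/* Take one more reference on a live domain. */
static void toy_get_domain(struct toy_domain *d)
{
    assert(!d->destroyed);
    d->refcnt++;
}

/* Drop one reference; the final drop triggers the teardown. */
static void toy_put_domain(struct toy_domain *d)
{
    assert(d->refcnt > 0);
    if (--d->refcnt == 0)
        toy_domain_destroy(d);
}
```

In this model, a single reference that is never dropped is enough to keep
toy_domain_destroy() from ever running, which matches the symptom above.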
I've duplicated this using the standard credit scheduler on
xen-unstable tip. I'm using a Debian dom0 filesystem, and a
linux-2.6.18-xen0 build from a month ago.
My box has 2 cores, so dom0 has only 2 cpus; disabling the second
causes it to switch to UP primitives.
I'm looking into it, but I thought it might ring some bells with someone...
-George
* Re: Unplugging a dom0 vcpu and domain destruction
2009-02-17 17:30 Unplugging a dom0 vcpu and domain destruction George Dunlap
@ 2009-02-17 17:39 ` Keir Fraser
2009-02-20 17:13 ` George Dunlap
0 siblings, 1 reply; 10+ messages in thread
From: Keir Fraser @ 2009-02-17 17:39 UTC (permalink / raw)
To: George Dunlap, xen-devel@lists.xensource.com
On 17/02/2009 17:30, "George Dunlap" <dunlapg@umich.edu> wrote:
> domain_destroy() is only called from put_domain(), so presumably
> there's somehow reference counts held somewhere which aren't released
> when the second cpu is offline.
>
> I've duplicated this using the standard credit scheduler on
> xen-unstable tip. I'm using a Debian dom0 filesystem, and a
> linux-2.6.18-xen0 build from a month ago.
If the domain never runs there will be very few domain refcnt updates. You
should be able to track it down pretty easily by logging every caller.
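One possible shape for "logging every caller", as a standalone sketch: wrap
the get/put helpers in macros that record the call site. The names here
(struct traced_domain, traced_get, traced_put, trace_events) are invented for
illustration and are not Xen identifiers; inside Xen the output would
presumably go through printk rather than fprintf.

```c
#include <assert.h>
#include <stdio.h>

/* Toy domain: an id for the log line plus the reference count. */
struct traced_domain {
    int id;
    int refcnt;
};

/* Number of get/put events logged so far. */
static int trace_events;

/* Print one line per refcount change, including the call site. */
static void trace_ref(const char *op, const struct traced_domain *d,
                      const char *file, int line)
{
    trace_events++;
    fprintf(stderr, "%s d%d refcnt=%d at %s:%d\n",
            op, d->id, d->refcnt, file, line);
}

static void traced_get_at(struct traced_domain *d, const char *file, int line)
{
    d->refcnt++;
    trace_ref("get", d, file, line);
}

static void traced_put_at(struct traced_domain *d, const char *file, int line)
{
    assert(d->refcnt > 0);
    d->refcnt--;
    trace_ref("put", d, file, line);
}

/* Callers use these macros so that the logged location is theirs. */
#define traced_get(d) traced_get_at((d), __FILE__, __LINE__)
#define traced_put(d) traced_put_at((d), __FILE__, __LINE__)
```

Pairing up the "get" and "put" lines in the log for a given domain id then
shows which call site took the reference that was never released.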
-- Keir
* Re: Unplugging a dom0 vcpu and domain destruction
2009-02-17 17:39 ` Keir Fraser
@ 2009-02-20 17:13 ` George Dunlap
2009-02-20 18:15 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 10+ messages in thread
From: George Dunlap @ 2009-02-20 17:13 UTC (permalink / raw)
To: Keir Fraser, Jeremy Fitzhardinge; +Cc: xen-devel@lists.xensource.com
OK, I finally popped off all the interrupts on my stack and got back to this.
The put_domain() that finally destroys the domains (after plugging
back in the cpu) is in page_alloc.c:931, in free_domheap_pages().
Here's the callstack from xen:
(XEN) [<ffff828c80112cd6>] free_domheap_pages+0x3a9/0x427
(XEN) [<ffff828c8014f0e3>] put_page+0x4b/0x52
(XEN) [<ffff828c80150236>] put_page_from_l1e+0x137/0x1ae
(XEN) [<ffff828c80155ed0>] ptwr_emulated_update+0x555/0x57c
(XEN) [<ffff828c80155fa3>] ptwr_emulated_cmpxchg+0xac/0xb5
(XEN) [<ffff828c80176511>] x86_emulate+0xf876/0xfb5d
(XEN) [<ffff828c8014f523>] ptwr_do_page_fault+0x15c/0x190
(XEN) [<ffff828c80164d8c>] do_page_fault+0x3b8/0x571
So the thing that finally destroys the domain is unmapping its last
outstanding domheap page from dom0's pagetables. It was unmapped from
vcpu 1 (which had just come back online), from
linux/mm/memory.c:unmap_vmas().
I confirmed, using the 'q' debug key, that the "zombie domain" still
had two outstanding pages:
(XEN) General information for domain 2:
(XEN) refcnt=1 dying=2 nr_pages=2 xenheap_pages=0 dirty_cpus={} max_pages=8192
(XEN) handle=a7c2bcb8-e647-992f-9e15-7313072a36bf vm_assist=00000008
(XEN) Rangesets belonging to domain 2:
(XEN) Interrupts { }
(XEN) I/O Memory { }
(XEN) I/O Ports { }
(XEN) Memory pages belonging to domain 2:
(XEN) DomPage 000000000003d64f: caf=00000001, taf=e800000000000001
(XEN) DomPage 000000000003d64e: caf=00000001, taf=e800000000000001
(XEN) VCPU information and callbacks for domain 2:
(XEN) VCPU0: CPU0 [has=F] flags=1 poll=0 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={0-31}
(XEN) 100 Hz periodic timer (period 10 ms)
(XEN) Notifying guest (virq 1, port 0, stat 0/-1/0)
I'm not sure if this is relevant, but it looks like dom0's vcpu 1 had
a pending interrupt while it was offline:
(XEN) VCPU1: CPU0 [has=F] flags=2 poll=0 upcall_pend = 01, upcall_mask = 01 dirty_cpus={} cpu_affinity={0-31}
(XEN) 100 Hz periodic timer (period 10 ms)
(XEN) Notifying guest (virq 1, port 0, stat 0/-1/-1)
So it appears that while vcpu 1 is offline, dom0 never removes its
mappings of the domU's pages; they are only removed once vcpu 1 comes
back online.
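The dump above (refcnt=1, nr_pages=2, each page with caf=00000001) fits a
model in which the domain's remaining pages collectively pin one domain
reference, and each page is in turn pinned by a mapping. The sketch below is
a simplified, assumed model of that chain, not the actual put_page() /
free_domheap_pages() code; every model_* name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy domain: the pages it still owns jointly hold one reference. */
struct model_domain {
    int refcnt;
    int nr_pages;
    bool destroyed;
};

/* Toy page: count tracks references such as a dom0 pagetable mapping. */
struct model_page {
    int count;
    struct model_domain *owner;
};

/* Drop a domain reference; the last one marks the domain destroyed. */
static void model_put_domain(struct model_domain *d)
{
    assert(d->refcnt > 0);
    if (--d->refcnt == 0)
        d->destroyed = true;
}

/* Return a page to the heap; freeing the last page drops the domain ref. */
static void model_free_page(struct model_page *pg)
{
    struct model_domain *d = pg->owner;

    pg->owner = NULL;
    assert(d->nr_pages > 0);
    if (--d->nr_pages == 0)
        model_put_domain(d);
}

/* Drop a page reference, e.g. when a mapping of the page is removed. */
static void model_put_page(struct model_page *pg)
{
    assert(pg->count > 0);
    if (--pg->count == 0)
        model_free_page(pg);
}
```

Starting from refcnt=1 and two pages with count=1 each, nothing is destroyed
until model_put_page() has been called on both pages, which mirrors the
observation that the zombie domain only goes away once the last mapping is
removed.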
I don't know enough about the unmapping process... Jeremy, do you know
anything about the process for unmapping domU memory from dom0 when
the domU is being destroyed in the linux-2.6.18-xen.hg tree? More
specifically, if I take dom0's vcpu 1 offline (via the /sys
interface), why doesn't the unmapping happen until I bring vcpu 1
back online?
-George
On Tue, Feb 17, 2009 at 5:39 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 17/02/2009 17:30, "George Dunlap" <dunlapg@umich.edu> wrote:
>
>> domain_destroy() is only called from put_domain(), so presumably
>> there's somehow reference counts held somewhere which aren't released
>> when the second cpu is offline.
>>
>> I've duplicated this using the standard credit scheduler on
>> xen-unstable tip. I'm using a Debian dom0 filesystem, and a
>> linux-2.6.18-xen0 build from a month ago.
>
> If the domain never runs there will be very few domain refcnt updates. You
> should be able to track it down pretty easily by logging every caller.
>
> -- Keir
* Re: Unplugging a dom0 vcpu and domain destruction
2009-02-20 17:13 ` George Dunlap
@ 2009-02-20 18:15 ` Jeremy Fitzhardinge
2009-02-20 18:59 ` George Dunlap
0 siblings, 1 reply; 10+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-20 18:15 UTC (permalink / raw)
To: George Dunlap; +Cc: xen-devel@lists.xensource.com, Keir Fraser
George Dunlap wrote:
> OK, I finally popped off all the interrupts on my stack and got back to this.
>
> The put_domain() that finally destroys the domains (after plugging
> back in the cpu) is in page_alloc.c:931, in free_domheap_pages().
>
> Here's the callstack from xen:
>
> (XEN) [<ffff828c80112cd6>] free_domheap_pages+0x3a9/0x427
> (XEN) [<ffff828c8014f0e3>] put_page+0x4b/0x52
> (XEN) [<ffff828c80150236>] put_page_from_l1e+0x137/0x1ae
> (XEN) [<ffff828c80155ed0>] ptwr_emulated_update+0x555/0x57c
> (XEN) [<ffff828c80155fa3>] ptwr_emulated_cmpxchg+0xac/0xb5
> (XEN) [<ffff828c80176511>] x86_emulate+0xf876/0xfb5d
> (XEN) [<ffff828c8014f523>] ptwr_do_page_fault+0x15c/0x190
> (XEN) [<ffff828c80164d8c>] do_page_fault+0x3b8/0x571
>
> So the thing that finally destroys the domain is unmapping its last
> outstanding domheap page from dom0's pagetables. It was unmapped from
> vcpu 1 (which had just come back online), from
> linux/mm/memory.c:unmap_vmas().
>
> I confirmed that there were two outstanding unmapped pages of the
> "zombie domain" using the 'q' debug key:
> (XEN) General information for domain 2:
> (XEN) refcnt=1 dying=2 nr_pages=2 xenheap_pages=0 dirty_cpus={}
> max_pages=8192
> (XEN) handle=a7c2bcb8-e647-992f-9e15-7313072a36bf vm_assist=00000008
> (XEN) Rangesets belonging to domain 2:
> (XEN) Interrupts { }
> (XEN) I/O Memory { }
> (XEN) I/O Ports { }
> (XEN) Memory pages belonging to domain 2:
> (XEN) DomPage 000000000003d64f: caf=00000001, taf=e800000000000001
> (XEN) DomPage 000000000003d64e: caf=00000001, taf=e800000000000001
> (XEN) VCPU information and callbacks for domain 2:
> (XEN) VCPU0: CPU0 [has=F] flags=1 poll=0 upcall_pend = 00,
> upcall_mask = 00 dirty_cpus={} cpu_affinity={0-31}
> (XEN) 100 Hz periodic timer (period 10 ms)
> (XEN) Notifying guest (virq 1, port 0, stat 0/-1/0)
>
> I'm not sure if this is relevant, but looks that while dom0's vcpu 1
> was offline, it had a pending interrupt:
>
> (XEN) VCPU1: CPU0 [has=F] flags=2 poll=0 upcall_pend = 01,
> upcall_mask = 01 dirty_cpus={} cpu_affinity={0-31}
> (XEN) 100 Hz periodic timer (period 10 ms)
> (XEN) Notifying guest (virq 1, port 0, stat 0/-1/-1)
>
> So it appears that when vcpu 1 is offline, it never successfully
> removes mappings for the domU until vcpu 1 comes back online.
>
> I don't know enough about the unmapping process... Jeremy, do you know
> anything about the process for unmapping domU memory from dom0 when
> the domU is being destroyed in the linux-2.6.18-xen.hg tree? More
> specifically, why if I take dom0's vcpu 1 offline (via the /sys
> interface), why the unmapping doesn't happen until I bring vcpu 1
> online?
>
Is it that the offline cpu still has a cr3 reference to a pagetable, and
that's not being given up? Or gdt?
In the pvops kernels we also keep a reference to the vcpu info
structure, since we place it in the kernel's memory rather than keeping
it in the shared info structure. For a while that had bugs that left
zombie domains lying around, but I don't think anyone backported that
stuff to 2.6.18.
J
* Re: Unplugging a dom0 vcpu and domain destruction
2009-02-20 18:15 ` Jeremy Fitzhardinge
@ 2009-02-20 18:59 ` George Dunlap
2009-02-20 20:02 ` Jeremy Fitzhardinge
2009-02-20 21:17 ` Keir Fraser
0 siblings, 2 replies; 10+ messages in thread
From: George Dunlap @ 2009-02-20 18:59 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: xen-devel@lists.xensource.com, Keir Fraser
On Fri, Feb 20, 2009 at 6:15 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Is it that the offline cpu still has a cr3 reference to a pagetable, and
> that's not being given up? Or gdt?
But that would be a reference in Xen, would it not -- from the d0v1's
vcpu struct in Xen?
The last reference to the domain pages went away when d0v1 wrote to
one of its own l1e's. So one of dom0's l1's still contained a
reference to those pages. The curious bit is why they weren't
unmapped immediately if d0v1 wasn't online, and why they were unmapped
immediately when d0v1 came back online.
> In the pvops kernels we also keep a reference to the vcpu info structure,
> since we place it the kernel's memory rather than keeping it in the shared
> info structure. For a while that had bugs that left zombie domains lying
> around, but I don't think anyone backported that stuff to 2.6.18.
Hmm, I'll take a look tomorrow and see if I can work out what those
two pages that were being kept are.
-George
* Re: Unplugging a dom0 vcpu and domain destruction
2009-02-20 18:59 ` George Dunlap
@ 2009-02-20 20:02 ` Jeremy Fitzhardinge
2009-02-20 21:17 ` Keir Fraser
1 sibling, 0 replies; 10+ messages in thread
From: Jeremy Fitzhardinge @ 2009-02-20 20:02 UTC (permalink / raw)
To: George Dunlap; +Cc: xen-devel@lists.xensource.com, Keir Fraser
George Dunlap wrote:
> The last reference to the domain pages went away when d0v1 wrote to
> one of its own l1e's. So one of dom0's l1's still contained a
> reference to those pages. The curious bit is why they weren't
> unmapped immediately if d0v1 wasn't online, and why they were unmapped
> immediately when d0v1 came back online.
>
Yes, that doesn't make much sense to me. Pagetable mappings aren't
vcpu-dependent.
J
* Re: Unplugging a dom0 vcpu and domain destruction
2009-02-20 18:59 ` George Dunlap
2009-02-20 20:02 ` Jeremy Fitzhardinge
@ 2009-02-20 21:17 ` Keir Fraser
2009-02-24 9:07 ` George Dunlap
1 sibling, 1 reply; 10+ messages in thread
From: Keir Fraser @ 2009-02-20 21:17 UTC (permalink / raw)
To: George Dunlap, Jeremy Fitzhardinge; +Cc: xen-devel@lists.xensource.com
On 20/02/2009 18:59, "George Dunlap" <dunlapg@umich.edu> wrote:
>> In the pvops kernels we also keep a reference to the vcpu info structure,
>> since we place it the kernel's memory rather than keeping it in the shared
>> info structure. For a while that had bugs that left zombie domains lying
>> around, but I don't think anyone backported that stuff to 2.6.18.
>
> Hmm, I'll take a look tomorrow and see if I can work out what those
> two pages that were being kept are.
Jeremy's hunch might be worth following up -- that the offline vcpu holds
onto an mm, which doesn't get dropped until the vcpu comes back (at which
point unmap_vmas() would happen). It seems likely it'll be something silly
like that.
-- Keir
* Re: Unplugging a dom0 vcpu and domain destruction
2009-02-20 21:17 ` Keir Fraser
@ 2009-02-24 9:07 ` George Dunlap
2009-02-24 12:08 ` Keir Fraser
0 siblings, 1 reply; 10+ messages in thread
From: George Dunlap @ 2009-02-24 9:07 UTC (permalink / raw)
To: Keir Fraser; +Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com
On Fri, Feb 20, 2009 at 9:17 PM, Keir Fraser <keir.fraser@eu.citrix.com> wrote:
> On 20/02/2009 18:59, "George Dunlap" <dunlapg@umich.edu> wrote:
>
>>> In the pvops kernels we also keep a reference to the vcpu info structure,
>>> since we place it the kernel's memory rather than keeping it in the shared
>>> info structure. For a while that had bugs that left zombie domains lying
>>> around, but I don't think anyone backported that stuff to 2.6.18.
>>
>> Hmm, I'll take a look tomorrow and see if I can work out what those
>> two pages that were being kept are.
>
> Jeremy's hunch might be worth following up -- that the offline vcpu holds
> onto an mm, which doesn't get dropped until the vcpu comes back (at which
> point unmap_vmas() would happen). It seems likely it'll be something silly
> like that.
The problem with that theory is how the offline vcpu got "hold" of the
mm in the first place. The sequence of events to reproduce is:
* offline cpu1
* create domain
* destroy domain
+ zombie domain
* online cpu1
+ domain finally destroyed
Is there a good way to trigger a Linux stack dump from within Xen?
Even crashing dom0 would be OK if it will get a good stack dump. :-)
If I could see how we got to the final unmap_vmas(), I might be able to
track things down more easily...
Thanks,
-George
* Re: Unplugging a dom0 vcpu and domain destruction
2009-02-24 9:07 ` George Dunlap
@ 2009-02-24 12:08 ` Keir Fraser
2009-02-24 17:26 ` George Dunlap
0 siblings, 1 reply; 10+ messages in thread
From: Keir Fraser @ 2009-02-24 12:08 UTC (permalink / raw)
To: George Dunlap; +Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com
On 24/02/2009 01:07, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
> Is there a good way to trigger a Linux stack dump from within Xen?
> Even crashing dom0 would be OK if it will get a good stack dump. :-)
> If I could see how we got to the final unmap_vma(), I might be able to
> track things down easier...
do_guest_trap(TRAP_gp_fault, regs, 0) ?
-- Keir
* Re: Unplugging a dom0 vcpu and domain destruction
2009-02-24 12:08 ` Keir Fraser
@ 2009-02-24 17:26 ` George Dunlap
0 siblings, 0 replies; 10+ messages in thread
From: George Dunlap @ 2009-02-24 17:26 UTC (permalink / raw)
To: Jeremy Fitzhardinge, xen-devel@lists.xensource.com
Gah... a real, proper clean-up and build from scratch and the problem
disappears. Who knows what it was, but it's more a bug with
dependencies / build system than a bug in the code proper. I'm done
chasing it, anyway.
-George
2009/2/24 Keir Fraser <keir.fraser@eu.citrix.com>:
> On 24/02/2009 01:07, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
>
>> Is there a good way to trigger a Linux stack dump from within Xen?
>> Even crashing dom0 would be OK if it will get a good stack dump. :-)
>> If I could see how we got to the final unmap_vma(), I might be able to
>> track things down easier...
>
> do_guest_trap(TRAP_gp_fault, regs, 0) ?
>
> -- Keir