Re: null domains after xl destroy

xen-devel.lists.xenproject.org archive mirror

From: "Roger Pau Monné" <roger.pau@citrix.com>
To: Juergen Gross <jgross@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>,
	xen-devel@lists.xen.org,
	Dietmar Hahn <dietmar.hahn@ts.fujitsu.com>,
	glenn@rimuhosting.com
Subject: Re: null domains after xl destroy
Date: Wed, 19 Apr 2017 08:16:24 +0100
Message-ID: <20170419071624.6enfeemielfqhqw2@dhcp-3-128.uk.xensource.com>
In-Reply-To: <034c9f96-1bfe-6793-68a7-9b070676971a@suse.com>

On Wed, Apr 19, 2017 at 06:39:41AM +0200, Juergen Gross wrote:
> On 19/04/17 03:02, Glenn Enright wrote:
> > On 18/04/17 20:36, Juergen Gross wrote:
> >> On 12/04/17 00:45, Glenn Enright wrote:
> >>> On 12/04/17 10:23, Andrew Cooper wrote:
> >>>> On 11/04/2017 23:13, Glenn Enright wrote:
> >>>>> On 11/04/17 21:49, Dietmar Hahn wrote:
> >>>>>> Am Dienstag, 11. April 2017, 20:03:14 schrieb Glenn Enright:
> >>>>>>> On 11/04/17 17:59, Juergen Gross wrote:
> >>>>>>>> On 11/04/17 07:25, Glenn Enright wrote:
> >>>>>>>>> Hi all
> >>>>>>>>>
> >>>>>>>>> We are seeing an odd issue with domU domains after xl destroy:
> >>>>>>>>> under recent 4.9 kernels a (null) domain is left behind.
> >>>>>>>>
> >>>>>>>> I guess this is the dom0 kernel version?
> >>>>>>>>
> >>>>>>>>> This has occurred on a variety of hardware, with no obvious
> >>>>>>>>> commonality.
> >>>>>>>>>
> >>>>>>>>> 4.4.55 does not show this behavior.
> >>>>>>>>>
> >>>>>>>>> On my test machine I have the following packages installed under
> >>>>>>>>> centos6, from https://xen.crc.id.au/
> >>>>>>>>>
> >>>>>>>>> ~]# rpm -qa | grep xen
> >>>>>>>>> xen47-licenses-4.7.2-4.el6.x86_64
> >>>>>>>>> xen47-4.7.2-4.el6.x86_64
> >>>>>>>>> kernel-xen-4.9.21-1.el6xen.x86_64
> >>>>>>>>> xen47-ocaml-4.7.2-4.el6.x86_64
> >>>>>>>>> xen47-libs-4.7.2-4.el6.x86_64
> >>>>>>>>> xen47-libcacard-4.7.2-4.el6.x86_64
> >>>>>>>>> xen47-hypervisor-4.7.2-4.el6.x86_64
> >>>>>>>>> xen47-runtime-4.7.2-4.el6.x86_64
> >>>>>>>>> kernel-xen-firmware-4.9.21-1.el6xen.x86_64
> >>>>>>>>>
> >>>>>>>>> I've also replicated the issue with 4.9.17 and 4.9.20
> >>>>>>>>>
> >>>>>>>>> To replicate, on a cleanly booted dom0 with one pv VM, I run the
> >>>>>>>>> following on the VM
> >>>>>>>>>
> >>>>>>>>> {
> >>>>>>>>> while true; do
> >>>>>>>>>  dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
> >>>>>>>>> done
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> Then on the dom0 I do this sequence to reliably get a null domain.
> >>>>>>>>> This occurs with oxenstored and xenstored both.
> >>>>>>>>>
> >>>>>>>>> {
> >>>>>>>>> xl sync 1
> >>>>>>>>> xl destroy 1
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> xl list then renders something like ...
> >>>>>>>>>
> >>>>>>>>> (null)                                       1     4     4     --p--d       9.8     0
> >>>>>>>>
> >>>>>>>> Something is referencing the domain, e.g. some of its memory pages
> >>>>>>>> are still mapped by dom0.
> >>>>>>
> >>>>>> You can try
> >>>>>> # xl debug-keys q
> >>>>>> and further
> >>>>>> # xl dmesg
> >>>>>> to see the output of the previous command. The 'q' dumps domain
> >>>>>> (and guest debug) info.
> >>>>>> # xl debug-keys h
> >>>>>> prints all possible parameters for more information.
> >>>>>>
> >>>>>> Dietmar.
> >>>>>>
> >>>>>
> >>>>> I've done this as requested, below is the output.
> >>>>>
> >>>>> <snip>
> >>>>> (XEN) Memory pages belonging to domain 1:
> >>>>> (XEN)     DomPage 0000000000071c00: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c01: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c02: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c03: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c04: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c05: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c06: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c07: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c08: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c09: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c0a: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c0b: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c0c: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c0d: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c0e: caf=00000001, taf=7400000000000001
> >>>>> (XEN)     DomPage 0000000000071c0f: caf=00000001, taf=7400000000000001
> >>>>
> >>>> There are 16 pages still referenced from somewhere.
> >>
> >> Just a wild guess: could you please try the attached kernel patch? This
> >> might give us some more diagnostic data...
> >>
> >>
> >> Juergen
> >>
> > 
> > Thanks Juergen. I applied that, to our 4.9.23 dom0 kernel, which still
> > shows the issue. When replicating the leak I now see this trace (via
> > dmesg). Hopefully that is useful.
> > 
> > Please note, I'm going to be offline next week, but am keen to keep on
> > with this; it may just be a while before I follow up.
> > 
> > Regards, Glenn
> > http://rimuhosting.com
> > 
> > 
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 19 at drivers/block/xen-blkback/xenbus.c:508
> > xen_blkbk_remove+0x138/0x140
> > Modules linked in: xen_pciback xen_netback xen_gntalloc xen_gntdev
> > xen_evtchn xenfs xen_privcmd xt_CT ipt_REJECT nf_reject_ipv4
> > ebtable_filter ebtables xt_hashlimit xt_recent xt_state iptable_security
> > iptable_raw igle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
> > nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables bridge stp llc
> > ipv6 crc_ccitt ppdev parport_pc parport serio_raw sg i2c_i801 i2c_smbus
> > i2c_core e1000e ptp p000_edac edac_core raid1 sd_mod ahci libahci floppy
> > dm_mirror dm_region_hash dm_log dm_mod
> > CPU: 0 PID: 19 Comm: xenwatch Not tainted 4.9.23-1.el6xen.x86_64 #1
> > Hardware name: Supermicro PDSML/PDSML+, BIOS 6.00 08/27/2007
> >  ffffc90040cfbba8 ffffffff8136b61f 0000000000000013 0000000000000000
> >  0000000000000000 0000000000000000 ffffc90040cfbbf8 ffffffff8108007d
> >  ffffea0001373fe0 000001fc33394434 ffff880000000001 ffff88004d93fac0
> > Call Trace:
> >  [<ffffffff8136b61f>] dump_stack+0x67/0x98
> >  [<ffffffff8108007d>] __warn+0xfd/0x120
> >  [<ffffffff810800bd>] warn_slowpath_null+0x1d/0x20
> >  [<ffffffff814ebde8>] xen_blkbk_remove+0x138/0x140
> >  [<ffffffff814497f7>] xenbus_dev_remove+0x47/0xa0
> >  [<ffffffff814bcfd4>] __device_release_driver+0xb4/0x160
> >  [<ffffffff814bd0ad>] device_release_driver+0x2d/0x40
> >  [<ffffffff814bbfd4>] bus_remove_device+0x124/0x190
> >  [<ffffffff814b93a2>] device_del+0x112/0x210
> >  [<ffffffff81448113>] ? xenbus_read+0x53/0x70
> >  [<ffffffff814b94c2>] device_unregister+0x22/0x60
> >  [<ffffffff814ed7cd>] frontend_changed+0xad/0x4c0
> >  [<ffffffff810a974e>] ? schedule_tail+0x1e/0xc0
> >  [<ffffffff81449b57>] xenbus_otherend_changed+0xc7/0x140
> >  [<ffffffff816f1436>] ? _raw_spin_unlock_irqrestore+0x16/0x20
> >  [<ffffffff810a974e>] ? schedule_tail+0x1e/0xc0
> >  [<ffffffff81449fe0>] frontend_changed+0x10/0x20
> >  [<ffffffff814477fc>] xenwatch_thread+0x9c/0x140
> >  [<ffffffff810bffa0>] ? woken_wake_function+0x20/0x20
> >  [<ffffffff816ed93a>] ? schedule+0x3a/0xa0
> >  [<ffffffff816f1436>] ? _raw_spin_unlock_irqrestore+0x16/0x20
> >  [<ffffffff810c0c5d>] ? complete+0x4d/0x60
> >  [<ffffffff81447760>] ? split+0xf0/0xf0
> >  [<ffffffff810a051d>] kthread+0xcd/0xf0
> >  [<ffffffff810a974e>] ? schedule_tail+0x1e/0xc0
> >  [<ffffffff810a0450>] ? __kthread_init_worker+0x40/0x40
> >  [<ffffffff810a0450>] ? __kthread_init_worker+0x40/0x40
> >  [<ffffffff816f1b45>] ret_from_fork+0x25/0x30
> > ---[ end trace ee097287c9865a62 ]---
> 
> Konrad, Roger,
> 
> this was triggered by a debug patch in xen_blkbk_remove():
> 
> 	if (be->blkif)
> -		xen_blkif_disconnect(be->blkif);
> +		WARN_ON(xen_blkif_disconnect(be->blkif));
> 
> So I guess we need something like xen_blk_drain_io() in case of calls to
> xen_blkif_disconnect() which are not allowed to fail (either at the call
> sites of xen_blkif_disconnect() or in this function depending on a new
> boolean parameter indicating it should wait for outstanding I/Os).
> 
> I can try a patch, but I'd appreciate if you could confirm this wouldn't
> add further problems...

Hello,

Thanks for debugging this. The easiest solution seems to be to replace the
ring->inflight atomic_read check in xen_blkif_disconnect with a call to
xen_blk_drain_io, and to make xen_blkif_disconnect return void (to
prevent further issues like this one).
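
To make the difference concrete, here is a minimal userspace sketch in C. It
is not the blkback code: struct ring, complete_one, disconnect_check and
disconnect_drain are hypothetical names invented for this illustration, and
the -EBUSY return value is an assumption about what the current check
returns. The sketch only models an in-flight counter and a flag standing in
for the guest pages still mapped by the backend.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for a blkback ring (hypothetical, not the kernel struct). */
struct ring {
    int inflight; /* requests submitted but not yet completed */
    int mapped;   /* 1 while guest pages are still mapped by the backend */
};

/* Hypothetical helper: pretend the backing device completes one request. */
void complete_one(struct ring *r)
{
    if (r->inflight > 0)
        r->inflight--;
}

/*
 * Check-and-fail variant: gives up while I/O is pending.
 * If the caller ignores the return value, the pages stay mapped.
 * (-EBUSY is an assumed error code for this sketch.)
 */
int disconnect_check(struct ring *r)
{
    if (r->inflight > 0)
        return -EBUSY;
    r->mapped = 0;
    return 0;
}

/*
 * Drain variant: waits until nothing is in flight, so it cannot fail
 * and can return void. Real code would sleep until completions arrive
 * instead of driving them itself as this toy loop does.
 */
void disconnect_drain(struct ring *r)
{
    while (r->inflight > 0)
        complete_one(r);
    assert(r->inflight == 0);
    r->mapped = 0;
}
```

With three requests in flight, disconnect_check returns an error and leaves
mapped set, which is the kind of leftover reference reported in this thread,
whereas disconnect_drain always ends with mapped cleared.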

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
