From: Ian Campbell <ian.campbell@citrix.com>
To: xen-devel@lists.xensource.com, Wei Liu <wei.liu2@citrix.com>
Cc: Julien Grall <julien.grall@citrix.com>,
	ian.jackson@eu.citrix.com, Tim Deegan <tim@xen.org>,
	David Vrabel <david.vrabel@citrix.com>,
	Stefano Stabellini <Stefano.Stabellini@eu.citrix.com>
Subject: Re: [xen-unstable test] 56759: regressions - FAIL
Date: Wed, 27 May 2015 17:04:37 +0100
Message-ID: <1432742677.14664.270.camel@citrix.com>
In-Reply-To: <1432646989.14664.112.camel@citrix.com>

On Tue, 2015-05-26 at 14:29 +0100, Ian Campbell wrote:
> On Wed, 2015-05-20 at 10:56 +0100, Ian Campbell wrote:
> > On Wed, 2015-05-20 at 09:34 +0000, osstest service user wrote:
> > > flight 56759 xen-unstable real [real]
> > > http://logs.test-lab.xenproject.org/osstest/logs/56759/
> > > 
> > > Regressions :-(
> > > 
> > > Tests which did not succeed and are blocking,
> > > including tests which could not be run:
> > >  test-armhf-armhf-xl-multivcpu 17 leak-check/check         fail REGR. vs. 56375
> > 
> > I'm pretty hard pressed to explain this from the set of commits
> > currently under test, but it has happened a few times now (e.g. 56700
> > 56576) so it does seem to be real.
> > 
> > http://logs.test-lab.xenproject.org/osstest/results/bisect.xen-unstable.test-armhf-armhf-xl-multivcpu.leak-check--check.html
> > is working on it and is currently considering the set of changes from:
> > ianc@cosworth:xen.git$ git log --oneline 9ab42~1...45fcc4
> > 45fcc45 use ticket locks for spin locks
> > e13013d libxc/restore: add checkpointed flag to the restore context
> > ce44b40 libxc/restore: introduce setup() and cleanup() on restore
> > c5c5a04 libxc/restore: split read/handle qemu info
> > 9ab42c9 libxc/restore: introduce process_record()
> > 
> > where e13013d is current master which was pushed in by flight 56375.
> > 
> > I think it unlikely the libxl stuff is responsible, given we don't
> > migrate on ARM, which would seem to point to the ticket locks...
> 
> I've now managed to reproduce using the arndale on my desk.

... and now I've confirmed that reverting the spin lock change makes
the issue go away.

> I'm just starting to dig in to the issue.
> 
> So far the only thing I've concluded is that the message comes from
> netback trying to read the script node for inclusion in the hotplug
> invocation's environment.
> 
> I wonder if perhaps the spinlock change has just exposed a pre-existing
> race?

I'm still confirming, but AFAICT libxl does the right thing and writes
state=Closing and waits for it to hit state=Closed before tearing down
the backend directory. AFAICS it is not timing out while waiting.

Looking at the netback side though, it seems like netback_remove is
switching to state=Closed _before_ it calls kobject_uevent(...,
KOBJ_OFFLINE), and it is this which generates the call to netback_uevent,
which tries and fails to read the script node and produces the error
message.

Since switching to state=Closed is what prompts libxl to go and delete
the xenstore backend dir, it seems possible that netback_uevent might
not run until the script key in xenstore is already gone, prompting it
to write the error nodes. Is there anything else which might guard
against that possibility?
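
To make the suspected ordering concrete, here is a toy model of the
two interleavings. It is only a sketch under stated assumptions: a
plain dict stands in for xenstore, the backend path and script value
are made up for illustration, and the function names are hypothetical
labels for the steps described above, not the real netback or libxl
code.

```python
# Toy model only: a dict stands in for xenstore. The path, the script
# value and all function names are hypothetical, chosen for illustration.

BACKEND = "backend/vif/1/0"  # illustrative backend directory


def new_store():
    """Backend directory as it might look once the toolstack wrote Closing."""
    return {
        f"{BACKEND}/script": "/etc/xen/scripts/vif-bridge",  # illustrative
        f"{BACKEND}/state": "Closing",
    }


def backend_set_closed(store, log):
    """Backend side: switch the device to state=Closed."""
    store[f"{BACKEND}/state"] = "Closed"
    log.append("backend: state=Closed")


def toolstack_teardown(store, log):
    """Toolstack side: delete the backend dir, but only once state=Closed."""
    if store.get(f"{BACKEND}/state") != "Closed":
        log.append("toolstack: still waiting")
        return
    for key in [k for k in store if k.startswith(BACKEND + "/")]:
        del store[key]
    log.append("toolstack: removed backend dir")


def backend_uevent(store, log):
    """Backend side: the offline uevent handler reads the script node."""
    script = store.get(f"{BACKEND}/script")
    if script is None:
        log.append("uevent: ERROR reading script")
    else:
        log.append(f"uevent: script={script}")


def run(order):
    """Apply the steps in the given order to a fresh store."""
    store, log = new_store(), []
    for step in order:
        step(store, log)
    return log


# Race "won": the uevent handler runs before the toolstack reacts.
won = run([backend_set_closed, backend_uevent, toolstack_teardown])

# Race "lost": the toolstack sees Closed and deletes the dir first.
lost = run([backend_set_closed, toolstack_teardown, backend_uevent])

for name, log in (("won", won), ("lost", lost)):
    print(name, "->", "; ".join(log))
```

In this model nothing orders the uevent handler against the teardown
once state=Closed has been written, so which of the two logs you get
depends purely on scheduling.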

Handwaving a bit (ok, a lot), it's possible that the change of spinlocks
has turned a commonly won race into a commonly lost one, at least
under these circumstances.

My theory is that this is exacerbated on arndale for two reasons. The
CPU is relatively slow (even compared to cubietruck, which has the same
core but faster DRAM etc). It is also only dual core, while the failing
test case involves a 4 vcpu guest (which is a bit dumb but not invalid),
which loads things even more.

I'm still slightly concerned that perhaps the new spinlock stuff has
some sort of bad behaviour either on arndale specifically or more
generally for ARM systems which has pushed this particular case over the
edge.

Ian.
