From: Ian Campbell <ian.campbell@citrix.com>
To: Wei Liu <wei.liu2@citrix.com>
Cc: xen-devel@lists.xensource.com,
Ian Jackson <Ian.Jackson@eu.citrix.com>,
osstest service owner <osstest-admin@xenproject.org>
Subject: Re: [linux-4.1 test] 63030: regressions - FAIL
Date: Thu, 22 Oct 2015 12:12:37 +0100
Message-ID: <1445512357.9563.279.camel@citrix.com>
In-Reply-To: <20151022110327.GK5060@zion.uk.xensource.com>
On Thu, 2015-10-22 at 12:03 +0100, Wei Liu wrote:
> On Thu, Oct 22, 2015 at 11:39:39AM +0100, Ian Campbell wrote:
> > On Thu, 2015-10-22 at 11:28 +0100, Wei Liu wrote:
> > > On Thu, Oct 22, 2015 at 10:50:54AM +0100, Ian Campbell wrote:
> > > > On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote:
> > > > > On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> > > > > > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > > > > > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030:
> > > > > > > regressions - FAIL"):
> > > > > > > > From mere code inspection and document of lwip 1.3.0 I
> > > > > > > > think mini-os does send gratuitous ARP.
> > > > > > >
> > > > > > > The guest is using the PVHVM drivers at this point, with the
> > > > > > > backend directly in dom0, so it is the guest's gratuitous arp
> > > > > > > which is needed, I think.
> > > > > >
> > > > > > It would be worth investigating whether mini-os's gratuitous
> > > > > > ARP might also be occurring and confusing things, e.g. by coming
> > > > > > after and therefore taking precedence over the one coming from
> > > > > > the guest.
> > > > > >
> > > > >
> > > > > Several observations:
> > > > >
> > > > > 1. The guest doesn't always send gratuitous arp -- but this might
> > > > > not be the cause of this failure. Guest works fine when using
> > > > > qemu-trad only.
> > > >
> > > > As in it always sends the arp when using qemu-trad, or that it is
> > > > fine irrespective of not always sending it?
> > > >
> > >
> > > Whether or not stubdom is in use, the guest behaves the same -- it
> > > doesn't always send gratuitous arp.
> > >
> > > When using qemu-trad alone, it's always fine when it doesn't send
> > > gratuitous arp because either there is a cache entry in dom0 that
> > > already has the guest mac address or the guest responds instantly to
> > > the dom0 arp request.
> >
> > Where has this cache entry come from? Any preexisting ARP cache would
> > be associated with vifX.0 and would go away when that device was
> > destroyed and replaced with vif(X+1).0.
> >
>
> No, vif-bridge script has two runes for off-lining a vif:
>
>     brctl delif $bridge $vif
>     ifconfig $vif down
>
> Neither of these causes cache entry to be flushed.

$vif disappearing when netback finally deletes the device will though. Or
it should/used to.

Maybe this is happening after the new guest has started and confusing
things somewhere?

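One way to test that theory is to time how long the old vif lingers in dom0 after the new guest has started. A rough sketch, assuming a Linux dom0 with sysfs mounted; the function names, the `SYSFS_NET` override and the device name `vif3.0` are made up for illustration and are not part of vif-bridge:

```shell
# Hypothetical helpers, not part of vif-bridge or any other Xen script.
# SYSFS_NET may be overridden for testing; it defaults to the real sysfs path.

# Succeeds while the named network device still exists in dom0.
vif_exists() {
    [ -e "${SYSFS_NET:-/sys/class/net}/$1" ]
}

# Poll once a second until the device is gone, for at most $2 seconds
# (default 60). Prints how long it lingered, or that it never went away.
wait_vif_gone() {
    vif=$1
    limit=${2:-60}
    waited=0
    while vif_exists "$vif"; do
        if [ "$waited" -ge "$limit" ]; then
            echo "$vif still present after ${waited}s"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "$vif gone after ${waited}s"
}
```

Running `wait_vif_gone vif3.0` in dom0 as soon as xl reports "Migration successful." would show whether the old device outlives the start of the new guest.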
> > Also this only works for localhost migration. If the domain actually
> > moved to another host then the ARP is required in order for the
> > physical switch to learn the new location.
> >
> > Thus it seems to me that not always sending the gratuitous ARP is the
> > most important thing to get to the bottom of here.
> >
>
> That's another issue, but this would cause other error (no route to
> host) instead of timeout. The failure exhibits timeout error -- let's do
> one thing at a time.

The presence of an ARP cache entry in dom0 pointing to the old VIF would
also cause a timeout issue, I think, since the guest is no longer connected
to that vif.

This stale ARP cache entry should be the first thing to investigate, before
either the lack of a grat ARP or the slowness of the guest, since its
presence will confuse the results in both those other cases.

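To see whether dom0 still holds such an entry, its neighbour table can be inspected straight after the migration. A minimal sketch, assuming iproute2's usual `ip neigh show` line format (address first, then `dev` and the device name, an optional `lladdr` and MAC, and the state last); `neigh_state` and the addresses are hypothetical:

```shell
# Hypothetical helper: read `ip neigh show` output on stdin and print
# "DEVICE MAC STATE" for the given IP address ("-" if no MAC is recorded).
neigh_state() {
    awk -v ip="$1" '
        $1 == ip {
            dev = "-"; mac = "-"
            for (i = 2; i < NF; i++) {
                if ($i == "dev")    dev = $(i + 1)
                if ($i == "lladdr") mac = $(i + 1)
            }
            print dev, mac, $NF
        }'
}

# Usage in dom0 (the guest address is an example only):
#   ip neigh show | neigh_state 10.0.0.5
```

An entry for the guest that is already present at that point, before the guest has sent anything, would predate the migration. `ip neigh flush to 10.0.0.5` should clear it before a retry, and `brctl showmacs $bridge` lists which bridge port each MAC was learned on, which is the part that ties an address to a particular vif.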
> > > So it comes down to this: the responsiveness of the guest is the key.
> > >
> > [...]
> > > > > 3. When using stubdom, guest is a lot less responsive. See two
> > > > > experiments and analysis below.
> > > >
> > > > Less responsive in use or only while migrating, or to ssh after
> > > > migration, or to something else?
> > > >
> > >
> > > For every activity after migration for a period of time, including
> > > both arp request / reply and ssh connection.
> > >
> > > > > Scenario 1:
> > > > > xl shows "Migration successful."
> > > > > ...30s...
> > > > > xenbr0 receives gratuitous arp
> > > > > ...1s...
> > > > > ssh date command comes back
> > > > >
> > > > > Scenario 2:
> > > > > xenbr0 receives gratuitous arp
> > > > > ...1s...
> > > > > xl shows "Migration successful."
> > > > > ssh date command comes back
> > > > >
> > > > > When stubdom was not present I never saw scenario 1.
> >
> > So in that case you only saw Scenario 2 which includes a "receives
> > gratuitous ARP". But above you state that even with non-stub case
> > sometimes the gratuitous ARP is not sent. Is this a 3rd case which
> > isn't mentioned here?
> >
>
> Scenario 3:
> xl shows "Migration successful."
> dom0 sends arp request because arp cache entry not available
> guest takes a long time to respond when using stubdom or responds
> instantly when not using stubdom
>
> Scenario 4:
> xl shows "Migration successful."
> (arp cache entry still available)
> guest takes a long time to respond to ssh when using stubdom or
> responds instantly when not using stubdom
>
> > > > It would be worth looking at the possibility of a delay between
> > > > "Migration successful" and the target domain actually running. A
> > > > 30s delay between the guest restarting and it sending the ARP would
> > > > be pretty strange IMHO
> > > >
> > >
> > > The guest is in a weird state.
> > >
> > > xl list shows the stubdom is in "b" state while guest has no state at
> > > all, heh.
> >
> > Has it actually been started/unpaused then?
> >
>
> Yes, of course -- otherwise the state would have been "p". And I
> observed the transition from "p" to "weird state".

If weird state is "-----" then I think that is normal; it means "runnable
but not running", IIRC.

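For reference, the state column of `xl list` is a row of single-letter flags (r running, b blocked, p paused, s shutdown, c crashed, d dying), with a dash wherever a flag is not set. A small sketch of decoding it; `decode_xl_state` is a made-up helper, and its reading of an all-dash state follows the recollection above and has not been checked against the xl source:

```shell
# Hypothetical helper: turn an `xl list` state string such as "-b----"
# into words. Several flags may be set at once, so each is tested in turn.
decode_xl_state() {
    out=""
    case $1 in *r*) out="$out running" ;; esac
    case $1 in *b*) out="$out blocked" ;; esac
    case $1 in *p*) out="$out paused" ;; esac
    case $1 in *s*) out="$out shutdown" ;; esac
    case $1 in *c*) out="$out crashed" ;; esac
    case $1 in *d*) out="$out dying" ;; esac
    # No flag set: assumed to mean the domain is waiting to be scheduled.
    [ -n "$out" ] || out=" runnable but not running"
    echo "${out# }"
}
```

On that reading a stubdom shown as `-b----` is blocked waiting for events, and a guest shown with all dashes is merely waiting for a CPU, not paused.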
Ian.