[linux-4.1 test] 63030: regressions

* [linux-4.1 test] 63030: regressions - FAIL
@ 2015-10-18 17:52 osstest service owner
  2015-10-19 13:51 ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: osstest service owner @ 2015-10-18 17:52 UTC (permalink / raw)
  To: xen-devel, osstest-admin

flight 63030 linux-4.1 real [real]
http://logs.test-lab.xenproject.org/osstest/logs/63030/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 15 guest-localmigrate.2 fail REGR. vs. 62318

Tests which are failing intermittently (not blocking):
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 13 guest-localmigrate fail in 63013 pass in 63030
 test-armhf-armhf-xl-credit2   6 xen-boot                    fail pass in 63013

Regressions which are regarded as allowable (not blocking):
 test-armhf-armhf-xl-rtds     11 guest-start                  fail   like 62256
 test-amd64-amd64-libvirt-pair 21 guest-migrate/src_host/dst_host fail like 62256
 test-amd64-i386-libvirt-pair 21 guest-migrate/src_host/dst_host fail like 62256
 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm 13 guest-localmigrate fail like 62318
 test-amd64-amd64-xl-qemut-win7-amd64 17 guest-stop             fail like 62318
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop             fail like 62318

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl-rtds 13 saverestore-support-check fail in 63013 never pass
 test-armhf-armhf-xl-rtds     12 migrate-support-check fail in 63013 never pass
 test-armhf-armhf-xl-rtds 16 guest-start/debian.repeat fail in 63013 never pass
 test-armhf-armhf-xl-credit2 13 saverestore-support-check fail in 63013 never pass
 test-armhf-armhf-xl-credit2  12 migrate-support-check fail in 63013 never pass
 test-armhf-armhf-xl-vhd       9 debian-di-install            fail   never pass
 test-armhf-armhf-libvirt-qcow2  9 debian-di-install            fail never pass
 test-amd64-amd64-xl-pvh-intel 14 guest-saverestore            fail  never pass
 test-amd64-amd64-xl-pvh-amd  11 guest-start                  fail   never pass
 test-armhf-armhf-libvirt-raw  9 debian-di-install            fail   never pass
 test-armhf-armhf-libvirt-xsm 12 migrate-support-check        fail   never pass
 test-armhf-armhf-libvirt-xsm 14 guest-saverestore            fail   never pass
 test-amd64-amd64-libvirt-xsm 12 migrate-support-check        fail   never pass
 test-armhf-armhf-libvirt     14 guest-saverestore            fail   never pass
 test-armhf-armhf-libvirt     12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-arndale  12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-arndale  13 saverestore-support-check    fail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 10 migrate-support-check fail never pass
 test-amd64-i386-libvirt-xsm  12 migrate-support-check        fail   never pass
 test-amd64-amd64-libvirt     12 migrate-support-check        fail   never pass
 test-amd64-i386-libvirt      12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-cubietruck 12 migrate-support-check        fail never pass
 test-armhf-armhf-xl-cubietruck 13 saverestore-support-check    fail never pass
 test-armhf-armhf-xl-xsm      13 saverestore-support-check    fail   never pass
 test-armhf-armhf-xl-xsm      12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-multivcpu 13 saverestore-support-check    fail  never pass
 test-armhf-armhf-xl-multivcpu 12 migrate-support-check        fail  never pass
 test-amd64-amd64-libvirt-vhd 11 migrate-support-check        fail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 10 migrate-support-check fail never pass
 test-armhf-armhf-xl          12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl          13 saverestore-support-check    fail   never pass
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop              fail never pass
 test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop              fail never pass

version targeted for testing:
 linux                27f1b7fed9c305ef46f8708f1bdde9cdb5f166bd
baseline version:
 linux                36311a9ec4904c080bbdfcefc0f3d609ed508224

Last test of basis    62318  2015-09-24 00:30:22 Z   24 days
Failing since         62540  2015-09-29 17:44:52 Z   19 days   17 attempts
Testing same since    62659  2015-10-04 12:21:24 Z   14 days   15 attempts

------------------------------------------------------------
People who touched revisions under test:
  "Eric W. Biederman" <ebiederm@xmission.com>
  Aaron Brown <aaron.f.brown@intel.com>
  Adam Lee <adam.lee@canonical.com>
  Adrien Schildknecht <adrien+dev@schischi.me>
  Alex Deucher <alexander.deucher@amd.com>
  Alexander Drozdov <al.drozdov@gmail.com>
  Alexander Duyck <alexander.h.duyck@redhat.com>
  Alexandre Belloni <alexandre.belloni@free-electrons.com>
  Alexei Starovoitov <ast@plumgrid.com>
  Alexey Brodkin <abrodkin@synopsys.com>
  Alexey Brodkin <Alexey.Brodkin@synopsys.com>
  Andrew Morton <akpm@linux-foundation.org>
  Andrew W Elble <aweits@rit.edu>
  Andy Whitcroft <apw@canonical.com>
  Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
  Angga <Hermin.Anggawijaya@alliedtelesis.co.nz>
  Anna Schumaker <Anna.Schumaker@netapp.com>
  Ard Biesheuvel <ard.biesheuvel@linaro.org>
  Ariel Nahum <arieln@mellanox.com>
  Atsushi Nemoto <nemoto@toshiba-tops.co.jp>
  Bart Van Assche <bart.vanassche@sandisk.com>
  Benjamin Coddington <bcodding@redhat.com>
  Benjamin Herrenschmidt <benh@kernel.crashing.org>
  Benoit Parrot <bparrot@ti.com>
  Bob Copeland <me@bobcopeland.com>
  Bob Liu <bob.liu@oracle.com>
  Brenden Blanco <bblanco@plumgrid.com>
  Brian Starkey <brian.starkey@arm.com>
  Carol L Soto <clsoto@linux.vnet.ibm.com>
  Catalin Marinas <catalin.marinas@arm.com>
  Chris Mason <clm@fb.com>
  Christian Borntraeger <borntraeger@de.ibm.com>
  Christoph Hellwig <hch@lst.de>
  Christophe Ricard <christophe-h.ricard@st.com>
  Christophe Ricard <christophe.ricard@gmail.com>
  Cong Wang <cwang@twopensource.com>
  Cong Wang <xiyou.wangcong@gmail.com>
  Dan Carpenter <dan.carpenter@oracle.com>
  Daniel Axtens <dja@axtens.net>
  Daniel Borkmann <daniel@iogearbox.net>
  Darren Hart <dvhart@linux.intel.com>
  David Ahern <dsa@cumulusnetworks.com>
  David Dueck <davidcdueck@googlemail.com>
  David Härdeman <david@hardeman.nu>
  David Rientjes <rientjes@google.com>
  David S. Miller <davem@davemloft.net>
  Ding Tianhong <dingtianhong@huawei.com>
  dingtianhong <dingtianhong@huawei.com>
  Dmitry Torokhov <dmitry.torokhov@gmail.com>
  Doug Ledford <dledford@redhat.com>
  Edward Hyunkoo Jee <edjee@google.com>
  Emil Medve <Emilian.Medve@Freescale.com>
  Eric Dumazet <edumazet@google.com>
  Eric Sandeen <sandeen@redhat.com>
  Eric W. Biederman <ebiederm@xmission.com>
  Eryu Guan <guaneryu@gmail.com>
  Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
  Filipe Manana <fdmanana@suse.com>
  Florian Fainelli <f.fainelli@gmail.com>
  Florian Westphal <fw@strlen.de>
  Fugang Duan <B38611@freescale.com>
  Gavin Shan <gwshan@linux.vnet.ibm.com>
  Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
  Haggai Eran <haggaie@mellanox.com>
  Hans de Goede <hdegoede@redhat.com>
  Hans Verkuil <hans.verkuil@cisco.com>
  Heiko Stuebner <heiko@sntech.de>
  Heiko Stübner <heiko@sntech.de>
  Helge Deller <deller@gmx.de>
  Herbert Xu <herbert@gondor.apana.org.au>
  Hermin Anggawijaya <hermin.anggawijaya@alliedtelesis.co.nz>
  Hin-Tak Leung <htl10@users.sourceforge.net>
  huaibin Wang <huaibin.wang@6wind.com>
  Ian Munsie <imunsie@au1.ibm.com>
  Ido Schimmel <idosch@mellanox.com>
  Ivan Vecera <ivecera@redhat.com>
  J. Bruce Fields <bfields@redhat.com>
  Jack Morgenstein <jackm@dev.mellanox.co.il>
  Jaewon Kim <jaewon31.kim@samsung.com>
  Jamal Hadi Salim <jhs@mojatatu.com>
  Jan Kara <jack@suse.com>
  Jann Horn <jann@thejh.net>
  Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
  Jean Delvare <jdelvare@suse.de>
  Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  Jeff Layton <jeff.layton@primarydata.com>
  Jeff Layton <jlayton@poochiereds.net>
  Jeff Vander Stoep <jeffv@google.com>
  Jeffery Miller <jmiller@neverware.com>
  Jens Axboe <axboe@fb.com>
  Jesse Gross <jesse@nicira.com>
  Jesse Jones <jjones@cococorp.com>
  Jialing Fu <jlfu@marvell.com>
  Jiri Pirko <jiri@resnulli.us>
  Jisheng Zhang <jszhang@marvell.com>
  Joerg Roedel <jroedel@suse.de>
  Johannes Berg <johannes.berg@intel.com>
  John David Anglin <dave.anglin>
  John David Anglin <dave.anglin@bell.net>
  John Fastabend <john.r.fastabend@intel.com>
  Joonyoung Shim <jy0922.shim@samsung.com>
  Julian Anastasov <ja@ssi.bg>
  Kalle Valo <kvalo@codeaurora.org>
  Kees Cook <keescook@chromium.org>
  Ken-ichirou MATSUZAWA <chamaken@gmail.com>
  Kinglong Mee <kinglongmee@gmail.com>
  Krzysztof Kozlowski <k.kozlowski@samsung.com>
  Kyle Evans <kvans32@gmail.com>
  Lad, Prabhakar <prabhakar.csengg@gmail.com>
  Larry Finger <Larry.Finger@lwfinger.net>
  Lars Westerhoff <lars.westerhoff@newtec.eu>
  Laurent Pinchart <laurent.pinchart@ideasonboard.com>
  Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com>
  Leonidas S. Barbosa <leosilva@linux.vnet.ibm.com>
  Linus Lüssing <linus.luessing@c0d3.blue>
  Linus Torvalds <torvalds@linux-foundation.org>
  Linus Walleij <linus.walleij@linaro.org>
  Ludovic Desroches <ludovic.desroches@atmel.com>
  Luis Henriques <luis.henriques@canonical.com>
  Madalin Bucur <Madalin.Bucur@freescale.com>
  Marc Zyngier <marc.zyngier@arm.com>
  Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
  Markos Chandras <markos.chandras@imgtec.com>
  Matan Barak <matanb@mellanox.com>
  Matthew Rosato <mjrosato@linux.vnet.ibm.com>
  Mauro Carvalho Chehab <mchehab@osg.samsung.com>
  Mel Gorman <mgorman@suse.de>
  Michael Ellerman <mpe@ellerman.id.au>
  Michael S. Tsirkin <mst@redhat.com>
  Michal Hocko <mhocko@suse.com>
  Mike Marciniszyn <mike.marciniszyn@intel.com>
  Minchan Kim <minchan@kernel.org>
  Minfei Huang <mnfhuang@gmail.com>
  Ming Lei <ming.lei@canonical.com>
  Mitja Spes <mitja@lxnav.com>
  NeilBrown <neilb@suse.com>
  Nicolas Dichtel <nicolas.dichtel@6wind.com>
  Nicolas Ferre <nicolas.ferre@atmel.com>
  Nicolas Iooss <nicolas.iooss_linux@m4x.org>
  Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
  Nikolay Aleksandrov <razor@blackwall.org>
  Niranjan Sivakumar <ns253@cornell.edu>
  Noa Osherovich <noaos@mellanox.com>
  Oleg Nesterov <oleg@redhat.com>
  Oliver Hartkopp <socketcan@hartkopp.net>
  Oliver Neukum <oneukum@suse.com>
  Or Gerlitz <ogerlitz@mellanox.com>
  Paul Moore <paul@paul-moore.com>
  Pavel Fedin <p.fedin@samsung.com>
  Peng Tao <tao.peng@primarydata.com>
  Peter Guo <peter.guo@bayhubtech.com>
  Phil Sutter <phil@nwl.cc>
  Pratyush Anand <panand@redhat.com>
  Pravin B Shelar <pshelar@nicira.com>
  Ralf Baechle <ralf@linux-mips.org>
  Rasesh Mody <rasesh.mody@qlogic.com>
  Richard Laing <richard.laing@alliedtelesis.co.nz>
  Rob Herring <robh@kernel.org>
  Roopa Prabhu <roopa@cumulusnetworks.com>
  Russell King <rmk+kernel@arm.linux.org.uk>
  Sagi Grimberg <sagig@mellanox.com>
  Sakari Ailus <sakari.ailus@iki.fi>
  Samuel Ortiz <sameo@linux.intel.com>
  Scott Feldman <sfeldma@gmail.com>
  Sergei Antonov <saproj@gmail.com>
  Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
  Shachar Raindel <raindel@mellanox.com>
  Shani Michaeli <shanim@mellanox.com>
  Shawn Lin <shawn.lin@rock-chips.com>
  Shota Suzuki <suzuki_shota_t3@lab.ntt.co.jp>
  Stas Sergeev <stsp@list.ru>
  Stas Sergeev <stsp@users.sourceforge.net>
  Stephen Smalley <sds@tycho.nsa.gov>
  Steve French <smfrench@gmail.com>
  Steven Rostedt <rostedt@goodmis.org>
  Stuart Yoder <stuart.yoder@freescale.com>
  Sudip Mukherjee <sudip@vectorindia.org>
  Takashi Iwai <tiwai@suse.de>
  Theodore Ts'o <tytso@mit.edu>
  Thierry Reding <treding@nvidia.com>
  Thierry Strudel <tstrudel@google.com>
  Thomas Gleixner <tglx@linutronix.de>
  Thomas Graf <tgraf@suug.ch>
  Thomas Huth <thuth@redhat.com>
  Tilman Schmidt <tilman@imap.cc>
  Timo Teräs <timo.teras@iki.fi>
  Tobias Powalowski <tobias.powalowski@googlemail.com>
  Tony Luck <tony.luck@intel.com>
  Trond Myklebust <trond.myklebust@primarydata.com>
  Tyler Hicks <tyhicks@canonical.com>
  Ulf Hansson <ulf.hansson@linaro.org>
  Varun Sethi <Varun.Sethi@freescale.com>
  Vlad Yasevich <vyasevich@gmail.com>
  Vlad Zolotarov <vladz@cloudius-systems.com>
  Vlastimil Babka <vbabka@suse.cz>
  WANG Cong <xiyou.wangcong@gmail.com>
  Will Deacon <will.deacon@arm.com>
  Wilson Kok <wkok@cumulusnetworks.com>
  Woodrow Shen <woodrow.shen@canonical.com>
  Yao-Wen Mao <yaowen@google.com>
  Ying Xue <ying.xue@windriver.com>
  Yinghai Lu <yinghai@kernel.org>
  Yishai Hadas <yishaih@mellanox.com>
  Yuchung Cheng <ycheng@google.com>

jobs:
 build-amd64-xsm                                              pass
 build-armhf-xsm                                              pass
 build-i386-xsm                                               pass
 build-amd64                                                  pass
 build-armhf                                                  pass
 build-i386                                                   pass
 build-amd64-libvirt                                          pass
 build-armhf-libvirt                                          pass
 build-i386-libvirt                                           pass
 build-amd64-pvops                                            pass
 build-armhf-pvops                                            pass
 build-i386-pvops                                             pass
 build-amd64-rumpuserxen                                      pass
 build-i386-rumpuserxen                                       pass
 test-amd64-amd64-xl                                          pass
 test-armhf-armhf-xl                                          pass
 test-amd64-i386-xl                                           pass
 test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm                pass
 test-amd64-i386-xl-qemut-debianhvm-amd64-xsm                 pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm           pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm            pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm                pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm                 pass
 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm        fail
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm         fail
 test-amd64-amd64-libvirt-xsm                                 pass
 test-armhf-armhf-libvirt-xsm                                 fail
 test-amd64-i386-libvirt-xsm                                  pass
 test-amd64-amd64-xl-xsm                                      pass
 test-armhf-armhf-xl-xsm                                      pass
 test-amd64-i386-xl-xsm                                       pass
 test-amd64-amd64-xl-pvh-amd                                  fail
 test-amd64-i386-qemut-rhel6hvm-amd                           pass
 test-amd64-i386-qemuu-rhel6hvm-amd                           pass
 test-amd64-amd64-xl-qemut-debianhvm-amd64                    pass
 test-amd64-i386-xl-qemut-debianhvm-amd64                     pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64                    pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64                     pass
 test-amd64-i386-freebsd10-amd64                              pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64                         pass
 test-amd64-i386-xl-qemuu-ovmf-amd64                          pass
 test-amd64-amd64-rumpuserxen-amd64                           pass
 test-amd64-amd64-xl-qemut-win7-amd64                         fail
 test-amd64-i386-xl-qemut-win7-amd64                          fail
 test-amd64-amd64-xl-qemuu-win7-amd64                         fail
 test-amd64-i386-xl-qemuu-win7-amd64                          fail
 test-armhf-armhf-xl-arndale                                  pass
 test-amd64-amd64-xl-credit2                                  pass
 test-armhf-armhf-xl-credit2                                  fail
 test-armhf-armhf-xl-cubietruck                               pass
 test-amd64-i386-freebsd10-i386                               pass
 test-amd64-i386-rumpuserxen-i386                             pass
 test-amd64-amd64-xl-pvh-intel                                fail
 test-amd64-i386-qemut-rhel6hvm-intel                         pass
 test-amd64-i386-qemuu-rhel6hvm-intel                         pass
 test-amd64-amd64-libvirt                                     pass
 test-armhf-armhf-libvirt                                     fail
 test-amd64-i386-libvirt                                      pass
 test-amd64-amd64-xl-multivcpu                                pass
 test-armhf-armhf-xl-multivcpu                                pass
 test-amd64-amd64-pair                                        pass
 test-amd64-i386-pair                                         pass
 test-amd64-amd64-libvirt-pair                                fail
 test-amd64-i386-libvirt-pair                                 fail
 test-amd64-amd64-amd64-pvgrub                                pass
 test-amd64-amd64-i386-pvgrub                                 pass
 test-amd64-amd64-pygrub                                      pass
 test-armhf-armhf-libvirt-qcow2                               fail
 test-amd64-amd64-xl-qcow2                                    pass
 test-armhf-armhf-libvirt-raw                                 fail
 test-amd64-i386-xl-raw                                       pass
 test-amd64-amd64-xl-rtds                                     pass
 test-armhf-armhf-xl-rtds                                     fail
 test-amd64-i386-xl-qemut-winxpsp3-vcpus1                     pass
 test-amd64-i386-xl-qemuu-winxpsp3-vcpus1                     pass
 test-amd64-amd64-libvirt-vhd                                 pass
 test-armhf-armhf-xl-vhd                                      fail
 test-amd64-amd64-xl-qemut-winxpsp3                           pass
 test-amd64-i386-xl-qemut-winxpsp3                            pass
 test-amd64-amd64-xl-qemuu-winxpsp3                           pass
 test-amd64-i386-xl-qemuu-winxpsp3                            pass


------------------------------------------------------------
sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
    http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
    http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
    http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
    http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.

(No revision log; it would be 5606 lines long.)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-18 17:52 [linux-4.1 test] 63030: regressions - FAIL osstest service owner
@ 2015-10-19 13:51 ` Wei Liu
  2015-10-20 14:39   ` Ian Jackson
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Liu @ 2015-10-19 13:51 UTC (permalink / raw)
  To: osstest service owner; +Cc: xen-devel, wei.liu2

On Sun, Oct 18, 2015 at 05:52:32PM +0000, osstest service owner wrote:
> flight 63030 linux-4.1 real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/63030/
> 
> Regressions :-(
> 
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 15 guest-localmigrate.2 fail REGR. vs. 62318
> 

Unfortunately there isn't much useful information in the various log files.
I think we need to wait for Ian's patch [0] to land in production in
order to get more insight into what's going on.

Wei.

[0]: [PATCH OSSTEST] stubdom: Arrange for guest serial to go to a host logfile

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-19 13:51 ` Wei Liu
@ 2015-10-20 14:39   ` Ian Jackson
  2015-10-20 15:24     ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Jackson @ 2015-10-20 14:39 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, osstest service owner

Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> On Sun, Oct 18, 2015 at 05:52:32PM +0000, osstest service owner wrote:
...
> > Tests which did not succeed and are blocking,
> > including tests which could not be run:
> >  test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 15 guest-localmigrate.2 fail REGR. vs. 62318
> > 
> 
> Unfortunately there isn't much useful information in various log files.
> I think we need to wait for Ian's patch [0] to land in production in
> order to get more insight on what's going on.
...
> [0]: [PATCH OSSTEST] stubdom: Arrange for guest serial to go to a host logfile

That osstest patch was in service in this flight.  The guest kernel
messages are in

   http://logs.test-lab.xenproject.org/osstest/logs/63030/test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm/merlot0---var-log-xen-qemu-dm-debianhvm.guest.osstest.log.3.gz

et al, mixed in with the minios and stub qemu output.

I don't immediately see an explanation for the problem but (as we
discovered with BSD) the success of this test depends on the
gratuitous arp.

Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-20 14:39   ` Ian Jackson
@ 2015-10-20 15:24     ` Wei Liu
  2015-10-20 15:34       ` Ian Jackson
  2015-10-21  9:04       ` Ian Campbell
  0 siblings, 2 replies; 22+ messages in thread
From: Wei Liu @ 2015-10-20 15:24 UTC (permalink / raw)
  To: Ian Jackson; +Cc: xen-devel, Wei Liu, osstest service owner

On Tue, Oct 20, 2015 at 03:39:26PM +0100, Ian Jackson wrote:
> Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> > On Sun, Oct 18, 2015 at 05:52:32PM +0000, osstest service owner wrote:
> ...
> > > Tests which did not succeed and are blocking,
> > > including tests which could not be run:
> > >  test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 15 guest-localmigrate.2 fail REGR. vs. 62318
> > > 
> > 
> > Unfortunately there isn't much useful information in various log files.
> > I think we need to wait for Ian's patch [0] to land in production in
> > order to get more insight on what's going on.
> ...
> > [0]: [PATCH OSSTEST] stubdom: Arrange for guest serial to go to a host logfile
> 
> That osstest patch was in service in this flight.  The guest kernel
> messages are in
> 
>    http://logs.test-lab.xenproject.org/osstest/logs/63030/test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm/merlot0---var-log-xen-qemu-dm-debianhvm.guest.osstest.log.3.gz
> 
> et al, mixed in with the minios and stub qemu output.
> 

Oops. I didn't have the latest OSSTest tree.

> I don't immediately see an explanation for the problem but (as we
> discovered with BSD) the success of this test depends on the
> gratuitous arp.
> 

From mere code inspection and the lwip 1.3.0 documentation, I think mini-os
does send a gratuitous ARP.

The call graph is like

call_main
  start_networking
    init_netfront   <- netfront changes to connected state
    netif_set_up    <- sends gratuitous ARP [0]
  app_main...

And according to the FreeBSD changeset, the bug with gratuitous ARP was
that the packet was sent before netfront had changed to the connected state,
so it doesn't look like mini-os has the same problem as FreeBSD did.

But this is only code inspection, so I'm not very confident that
everything does what it says it does.

Wei.

[0] http://lwip.wikia.com/wiki/Writing_a_device_driver
Gratuitous ARP
A "gratuitous ARP" can be generated by a call etharp_query(our_netif,
its_ip_addr, NULL) (see RFC 3220, Section 4.6). Starting in version
1.3.0, the gratuitous ARP is generated by netif_set_up() and should not
be done in the driver or application code.
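
For readers unfamiliar with the term: a gratuitous ARP is an ARP packet
whose sender and target protocol addresses are both the sender's own IP,
sent to the Ethernet broadcast address so that switches and neighbours
refresh their tables (for example after a migration). The following is a
minimal sketch of what such a frame looks like on the wire. It is not
mini-os, lwip or osstest code; build_gratuitous_arp is a hypothetical
helper, and the MAC and IP address are made-up example values.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6


def build_gratuitous_arp(mac: bytes, ip: str) -> bytes:
    """Return an Ethernet frame carrying a gratuitous ARP request for ip."""
    if len(mac) != 6:
        raise ValueError("MAC address must be 6 bytes")
    ip_bytes = socket.inet_aton(ip)
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    eth_header = BROADCAST_MAC + mac + struct.pack("!H", 0x0806)
    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # opcode: request
        mac,          # sender hardware address
        ip_bytes,     # sender protocol address
        b"\x00" * 6,  # target hardware address: left as zeros
        ip_bytes,     # target protocol address: the sender's own IP
    )
    return eth_header + arp_payload


# Example values only (assumptions, not taken from the test logs).
frame = build_gratuitous_arp(bytes.fromhex("00163e000001"), "192.0.2.10")
```

The defining property is that the sender and target IP fields are
identical; the sketch uses the request opcode, though a reply-style
announcement carrying the same addresses is also seen in practice.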



> Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-20 15:24     ` Wei Liu
@ 2015-10-20 15:34       ` Ian Jackson
  2015-10-21 16:47         ` Ian Campbell
  2015-10-21  9:04       ` Ian Campbell
  1 sibling, 1 reply; 22+ messages in thread
From: Ian Jackson @ 2015-10-20 15:34 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, osstest service owner

Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> From mere code inspection and document of lwip 1.3.0 I think mini-os
> does send gratuitous ARP.

The guest is using the PVHVM drivers at this point, with the backend
directly in dom0, so it is the guest's gratuitous arp which is needed,
I think.

Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-20 15:24     ` Wei Liu
  2015-10-20 15:34       ` Ian Jackson
@ 2015-10-21  9:04       ` Ian Campbell
  2015-10-21  9:24         ` Wei Liu
  1 sibling, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-10-21  9:04 UTC (permalink / raw)
  To: Wei Liu, Ian Jackson; +Cc: xen-devel, osstest service owner

On Tue, 2015-10-20 at 16:24 +0100, Wei Liu wrote:
> But this is only code inspection,  so I'm not very confident whether
> everything does what it says it does.

Right. I think this one probably needs someone to set up a system in a
similar configuration and play with it.

xen-netfront.c calls netdev_notify_peers (née netif_notify_peers). I seem
to vaguely recall that setting some sysctl (arp_notify?) used to be required
for that to actually do anything, but I think that has been fixed, i.e.
NETDEV_NOTIFY_PEERS is unconditional in inetdev_event() and only
NETDEV_CHANGEADDR is gated.
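
The gating described above can be written down as a toy model. This is
only a sketch of the behaviour as stated in this message, not the kernel
source: the function name sends_gratuitous_arp and the boolean arp_notify
parameter are illustrative assumptions, and only the two events named
here are modelled.

```python
from enum import Enum


class NetdevEvent(Enum):
    """The two notifier events discussed in this message."""
    NETDEV_NOTIFY_PEERS = "NETDEV_NOTIFY_PEERS"
    NETDEV_CHANGEADDR = "NETDEV_CHANGEADDR"


def sends_gratuitous_arp(event: NetdevEvent, arp_notify: bool) -> bool:
    """Toy model: does this event lead to a gratuitous ARP being sent?"""
    if event is NetdevEvent.NETDEV_NOTIFY_PEERS:
        # Unconditional: the sysctl setting is not consulted.
        return True
    # NETDEV_CHANGEADDR is the gated case: it depends on the sysctl.
    return arp_notify


# A migrated guest raising NETDEV_NOTIFY_PEERS announces itself either way.
for setting in (False, True):
    print(setting, sends_gratuitous_arp(NetdevEvent.NETDEV_NOTIFY_PEERS, setting))
```

If this model matches the guest kernel, the sysctl setting should not
matter for the notification raised by xen-netfront.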

Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21  9:04       ` Ian Campbell
@ 2015-10-21  9:24         ` Wei Liu
  2015-10-21  9:44           ` Ian Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Liu @ 2015-10-21  9:24 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, xen-devel, Wei Liu, osstest service owner

On Wed, Oct 21, 2015 at 10:04:14AM +0100, Ian Campbell wrote:
> On Tue, 2015-10-20 at 16:24 +0100, Wei Liu wrote:
> > But this is only code inspection,  so I'm not very confident whether
> > everything does what it says it does.
> 
> Right,. I think this one probably needs someone to setup a system in a
> similar configuration and play with it.
> 

Is there an easy way to do that? Say, give me some runes so that I can
lock a machine in the Cambridge instance and run the failing test case.

> xen-netfront.c calls netdev_notify_peers (née netif_notify_peers). I seem
> to vaguely recall that setting some sysctl (arp_notify?) can be required to
> allow that to actually do anything but I think that has been fixed i.e.
> NETDEV_NOTIFY_PEERS is unconditional in inetdev_event() and only
> NETDEV_CHANGEADDR is gated.
> 

Found your patch posted in 2011.

https://patchwork.ozlabs.org/patch/82813/

I think you're right, and that behaviour exists in Wheezy's 3.2
kernel.

Wei.

> Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21  9:24         ` Wei Liu
@ 2015-10-21  9:44           ` Ian Campbell
  2015-10-21 10:04             ` Ian Campbell
  2015-10-21 10:35             ` Wei Liu
  0 siblings, 2 replies; 22+ messages in thread
From: Ian Campbell @ 2015-10-21  9:44 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Ian Jackson, osstest service owner

On Wed, 2015-10-21 at 10:24 +0100, Wei Liu wrote:
> On Wed, Oct 21, 2015 at 10:04:14AM +0100, Ian Campbell wrote:
> > On Tue, 2015-10-20 at 16:24 +0100, Wei Liu wrote:
> > > But this is only code inspection,  so I'm not very confident whether
> > > everything does what it says it does.
> > 
> > Right,. I think this one probably needs someone to setup a system in a
> > similar configuration and play with it.
> > 
> 
> Is there an easy way to do that? Say, give me some runes so that I can
> lock a machine in Cambridge instance, run the failing test case.

I could [0], but why can't you just set things up on your existing test
hosts, either using standalone mode or by just installing the guest by
hand?

That's what I would do (probably the latter) in the first instance. It's
very likely IME that you are going to need to poke at this interactively
while debugging and to run repeated migrations etc to trigger the issue.
IMHO trying to use osstest for such manual debugging is just going to get
in the way.

> > xen-netfront.c calls netdev_notify_peers (née netif_notify_peers). I
> > seem
> > to vaguely recall that setting some sysctl (arp_notify?) can be
> > required to
> > allow that to actually do anything but I think that has been fixed i.e.
> > NETDEV_NOTIFY_PEERS is unconditional in inetdev_event() and only
> > NETDEV_CHANGEADDR is gated.
> > 
> 
> Found your patch posted in 2011.
> 
> https://patchwork.ozlabs.org/patch/82813/
> 
> I think you're right and the said behaviour exists in Wheezy's 3.2
> kernel.

This ended up as d11327ad6695db8117c78d70611e71102ceec2ac and:

$ git describe --contains d11327ad6695db8117c78d70611e71102ceec2ac
v2.6.38-rc6~20^2~10

...suggests this was in mainline long before 3.2.
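
For anyone wanting to script the same check, here is a small sketch that
takes the describe output above, strips the ancestry suffix, and compares
the base tag against 3.2. The helper names base_tag and kernel_version
are hypothetical, and the parsing assumes tags of the form vX.Y.Z
optionally followed by -rcN.

```python
import re


def base_tag(describe_output: str) -> str:
    """Strip the ~N / ^N ancestry suffix from `git describe --contains` output."""
    return re.split(r"[~^]", describe_output.strip(), maxsplit=1)[0]


def kernel_version(tag: str) -> tuple:
    """Turn a tag such as 'v2.6.38-rc6' into (2, 6, 38), ignoring any -rc suffix."""
    match = re.match(r"v(\d+(?:\.\d+)*)", tag)
    if match is None:
        raise ValueError(f"unrecognised kernel tag: {tag!r}")
    return tuple(int(part) for part in match.group(1).split("."))


tag = base_tag("v2.6.38-rc6~20^2~10")
# Tuples compare element by element, so (2, 6, 38) sorts before (3, 2).
print(tag, kernel_version(tag) < (3, 2))
```

Ignoring the -rc suffix is good enough here because the comparison is
decided by the major version alone.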

Ian.

[0] ./mg-allocate -U <timespan> <machine-name>
Where timespan is digits followed by d (for days), h (for hours) etc.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21  9:44           ` Ian Campbell
@ 2015-10-21 10:04             ` Ian Campbell
  2015-10-21 10:35             ` Wei Liu
  1 sibling, 0 replies; 22+ messages in thread
From: Ian Campbell @ 2015-10-21 10:04 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Ian Jackson, osstest service owner

On Wed, 2015-10-21 at 10:44 +0100, Ian Campbell wrote:

> > Found your patch posted in 2011.
> > 
> > https://patchwork.ozlabs.org/patch/82813/
> > 
> > I think you're right and the said behaviour exists in Wheezy's 3.2
> > kernel.
> 
> This ended up as d11327ad6695db8117c78d70611e71102ceec2ac and:
> 
> $ git describe --contains d11327ad6695db8117c78d70611e71102ceec2ac
> v2.6.38-rc6~20^2~10
> 
> ...suggests this was in mainline long before 3.2.

And I have now confirmed that the kernel in the debian-7.2.0 iso we are
using in testing today has this change in it.

Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21  9:44           ` Ian Campbell
  2015-10-21 10:04             ` Ian Campbell
@ 2015-10-21 10:35             ` Wei Liu
  2015-10-21 10:48               ` Ian Campbell
  1 sibling, 1 reply; 22+ messages in thread
From: Wei Liu @ 2015-10-21 10:35 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, xen-devel, Wei Liu, osstest service owner

On Wed, Oct 21, 2015 at 10:44:48AM +0100, Ian Campbell wrote:
> On Wed, 2015-10-21 at 10:24 +0100, Wei Liu wrote:
> > On Wed, Oct 21, 2015 at 10:04:14AM +0100, Ian Campbell wrote:
> > > On Tue, 2015-10-20 at 16:24 +0100, Wei Liu wrote:
> > > > But this is only code inspection,  so I'm not very confident whether
> > > > everything does what it says it does.
> > > 
> > > Right,. I think this one probably needs someone to setup a system in a
> > > similar configuration and play with it.
> > > 
> > 
> > Is there an easy way to do that? Say, give me some runes so that I can
> > lock a machine in Cambridge instance, run the failing test case.
> 
> I could[0] but, why can't you just set things up on your existing test
> hosts, either using standalone mode or by just installing the guest by
> hand?
> 
> That's what I would do (probably the latter) in the first instance. It's
> very likely IME that you are going to need to poke at this interactively
> while debugging and to run repeated migrations etc to trigger the issue.
> IMHO trying to use osstest for such manual debugging is just going to get
> in the way.
> 

I could do all of this manually, but not without considerable effort:
allocate a new test box (all my test boxes are in use at the moment),
set up standalone mode, use standalone mode to install the test box, grab
various tarballs from the osstest website if I don't want to build them
again, put them in a suitable location, use the standalone scripts to
fiddle with the standalone mode database, manually install a guest, and
so on. On top of that, the bug we're hunting might not be reproducible
on the new test box due to different hardware and external environment
(as we've already witnessed in the production osstest system), which
would leave me wondering whether I should repeat all of this (well, part
of it) again or just give up.

This looks like an endless list of tedious tasks, and it could go wrong
in many places along the way. If I can get OSSTest to lock a box and run
up to the point where it reproduces the issue, that would be a great help.

Furthermore, I can write down all the runes I use so that other people
can do the same to reproduce bugs discovered in osstest. That would
certainly help lower the barrier for people who want to help triaging
bugs.

Wei.

> > > xen-netfront.c calls netdev_notify_peers (née netif_notify_peers). I
> > > seem
> > > to vaguely recall that setting some sysctl (arp_notify?) can be
> > > required to
> > > allow that to actually do anything but I think that has been fixed i.e.
> > > NETDEV_NOTIFY_PEERS is unconditional in inetdev_event() and only
> > > NETDEV_CHANGEADDR is gated.
> > > 
> > 
> > Found your patch posted in 2011.
> > 
> > https://patchwork.ozlabs.org/patch/82813/
> > 
> > I think you're right and the said behaviour exists in Wheezy's 3.2
> > kernel.
> 
> This ended up as d11327ad6695db8117c78d70611e71102ceec2ac and:
> 
> $ git describe --contains d11327ad6695db8117c78d70611e71102ceec2ac
> v2.6.38-rc6~20^2~10
> 
> ...suggests this was in mainline long before 3.2.
> 
> Ian.
> 
> [0] ./mg-allocate -U <timespan> <machine-name>
> Where timespan is digits followed by d (for days), h (for hours) etc.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21 10:35             ` Wei Liu
@ 2015-10-21 10:48               ` Ian Campbell
  2015-10-21 11:07                 ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-10-21 10:48 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Ian Jackson, osstest service owner

On Wed, 2015-10-21 at 11:35 +0100, Wei Liu wrote:
> On Wed, Oct 21, 2015 at 10:44:48AM +0100, Ian Campbell wrote:
> > On Wed, 2015-10-21 at 10:24 +0100, Wei Liu wrote:
> > > On Wed, Oct 21, 2015 at 10:04:14AM +0100, Ian Campbell wrote:
> > > > On Tue, 2015-10-20 at 16:24 +0100, Wei Liu wrote:
> > > > > But this is only code inspection,  so I'm not very confident whether
> > > > > everything does what it says it does.
> > > > 
> > > > Right,. I think this one probably needs someone to setup a system in a
> > > > similar configuration and play with it.
> > > > 
> > > 
> > > Is there an easy way to do that? Say, give me some runes so that I can
> > > lock a machine in Cambridge instance, run the failing test case.
> > 
> > I could[0] but, why can't you just set things up on your existing test
> > hosts, either using standalone mode or by just installing the guest by
> > hand?
> > 
> > That's what I would do (probably the latter) in the first instance. It's
> > very likely IME that you are going to need to poke at this interactively
> > while debugging and to run repeated migrations etc to trigger the issue.
> > IMHO trying to use osstest for such manual debugging is just going to get
> > in the way.
> > 
> 
> I could do all these manually, but not without paying much attention:
> allocating a new test box (all my test boxes are in use at the moment),
> run standalone mode, use standalone mode to install the test box, grab
> various tarballs from osstest website if I don't want to build them
> again, put them in suitable location and use standalone script to fiddle
> with standalone mode database, manually install a guest etc etc,  let
> alone the bug we're hunting might not be reproducible on the new test
> box due to different hardware and external environment (as we've already
> witnessed in production osstest system), then I'm left in dilemma
> wondering whether I should repeat all these things (well, part of) again
> or just give up.
> 
> This looks like a list of endless tedious tasks and it could go wrong
> many places in between. If I can get OSSTest to lock a box and run up to
> the point that it reproduces the issue that would be of great help.

This seems to me to be making a mountain out of a molehill; installing a
Xen host should be bread and butter for most of us.

However, since you insist, I recently added some explanation to the README
of how to make an ad hoc job, including cloning a previous flight and
forcing it to run on a given machine (useful if you think it might be
machine specific).

There is no mechanical way to then lock a host on failure. What I usually
do is run the mg-allocate command I mentioned in my previous mail after the
test case has already started. Since mg-allocate has a higher priority than
regular jobs, but with -U waits for the current job to finish, you are
basically guaranteed that your mg-allocate will get the host next.
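
As a rough illustration of the mg-allocate step, the sketch below only
builds the command line from footnote [0] and checks the timespan format
described there. The helper name `mg_allocate_cmd` and the host name are
hypothetical, and it accepts only the `d` and `h` units that the footnote
names explicitly:

```shell
# mg_allocate_cmd TIMESPAN MACHINE
# Prints the mg-allocate invocation from footnote [0] after checking that
# TIMESPAN is digits followed by a unit letter. Only "d" (days) and
# "h" (hours) are confirmed by the thread, so other units are rejected here.
mg_allocate_cmd() {
    if ! printf '%s\n' "$1" | grep -Eq '^[0-9]+[dh]$'; then
        echo "bad timespan: $1" >&2
        return 1
    fi
    printf './mg-allocate -U %s %s\n' "$1" "$2"
}

# Example (hypothetical host name), run once the test job has started:
#   mg_allocate_cmd 1d some-test-host
```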

> Furthermore, I can write down all the runes I use so that other people
> can do the same to reproduce bugs discovered in osstest. That would
> certainly help lower the barrier for people who want to help triaging
> bugs.

This sort of thing is of no help with triage. It might be useful for
debugging and reproducing an issue, but triage does not involve doing such
things, it is the step before.

I'm being pedantic here because I don't think it is helpful to overstate
what triage involves, since that will put people off doing useful triage
activities.

Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21 10:48               ` Ian Campbell
@ 2015-10-21 11:07                 ` Wei Liu
  0 siblings, 0 replies; 22+ messages in thread
From: Wei Liu @ 2015-10-21 11:07 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, xen-devel, Wei Liu, osstest service owner

On Wed, Oct 21, 2015 at 11:48:24AM +0100, Ian Campbell wrote:
> On Wed, 2015-10-21 at 11:35 +0100, Wei Liu wrote:
> > On Wed, Oct 21, 2015 at 10:44:48AM +0100, Ian Campbell wrote:
> > > On Wed, 2015-10-21 at 10:24 +0100, Wei Liu wrote:
> > > > On Wed, Oct 21, 2015 at 10:04:14AM +0100, Ian Campbell wrote:
> > > > > On Tue, 2015-10-20 at 16:24 +0100, Wei Liu wrote:
> > > > > > But this is only code inspection,  so I'm not very confident whether
> > > > > > everything does what it says it does.
> > > > > 
> > > > > Right,. I think this one probably needs someone to setup a system in a
> > > > > similar configuration and play with it.
> > > > > 
> > > > 
> > > > Is there an easy way to do that? Say, give me some runes so that I can
> > > > lock a machine in Cambridge instance, run the failing test case.
> > > 
> > > I could[0] but, why can't you just set things up on your existing test
> > > hosts, either using standalone mode or by just installing the guest by
> > > hand?
> > > 
> > > That's what I would do (probably the latter) in the first instance. It's
> > > very likely IME that you are going to need to poke at this interactively
> > > while debugging and to run repeated migrations etc to trigger the issue.
> > > IMHO trying to use osstest for such manual debugging is just going to get
> > > in the way.
> > > 
> > 
> > I could do all these manually, but not without paying much attention:
> > allocating a new test box (all my test boxes are in use at the moment),
> > run standalone mode, use standalone mode to install the test box, grab
> > various tarballs from osstest website if I don't want to build them
> > again, put them in suitable location and use standalone script to fiddle
> > with standalone mode database, manually install a guest etc etc,  let
> > alone the bug we're hunting might not be reproducible on the new test
> > box due to different hardware and external environment (as we've already
> > witnessed in production osstest system), then I'm left in dilemma
> > wondering whether I should repeat all these things (well, part of) again
> > or just give up.
> > 
> > This looks like a list of endless tedious tasks and it could go wrong
> > many places in between. If I can get OSSTest to lock a box and run up to
> > the point that it reproduces the issue that would be of great help.
> 
> This seems to me to be making a mountain out of a mole hill, installing a
> Xen host should be bread and butter for most of us.
> 
> However, since you insist, I recently added some explanation in README of
> how to make an adhoc job including cloning a previous flight and forcing it
> to run on a given machine (useful if you think it might be machine
> specific).
> 
> There is no mechanical way to then lock a host on failure. What I usually
> do is run the mg-allocate run I mentioned in my previous mail after the
> test case has already started. Since mg-allocate has a higher priority than
> regular jobs, but with -U waits for the current job to finish, you are
> basically guaranteed that your mg-allocate will get the host next.
> 

Thanks. I will have a look at the osstest README to determine which way
to proceed.

> > Furthermore, I can write down all the runes I use so that other people
> > can do the same to reproduce bugs discovered in osstest. That would
> > certainly help lower the barrier for people who want to help triaging
> > bugs.
> 
> This sort of thing is of no help with triage. It might be useful for
> debugging and reproducing an issue, but triage does not involve doing such
> things, it is the step before.
> 
> I'm being pedantic here because I don't think it is helpful to overstate
> what triage involves, since that will put people off doing useful triage
> activities.
> 

Right, I actually meant bug fixing.

Wei.

> Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-20 15:34       ` Ian Jackson
@ 2015-10-21 16:47         ` Ian Campbell
  2015-10-21 17:34           ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-10-21 16:47 UTC (permalink / raw)
  To: Ian Jackson, Wei Liu; +Cc: xen-devel, osstest service owner

On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions 
> - FAIL"):
> > From mere code inspection and document of lwip 1.3.0 I think mini
> -os
> > does send gratuitous ARP.
> 
> The guest is using the PVHVM drivers at this point, with the backend
> directly in dom0, so it is the guest's gratuitous arp which is needed,
> I think.

It would be worth investigating whether mini-os's gratuitous ARP might
also be occurring and confusing things, e.g. by coming after and
therefore taking precedence over the one coming from the guest.

Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21 16:47         ` Ian Campbell
@ 2015-10-21 17:34           ` Wei Liu
  2015-10-22  9:50             ` Ian Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Liu @ 2015-10-21 17:34 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Wei Liu, xen-devel, Ian Jackson, osstest service owner

On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions 
> > - FAIL"):
> > > From mere code inspection and document of lwip 1.3.0 I think mini
> > -os
> > > does send gratuitous ARP.
> > 
> > The guest is using the PVHVM drivers at this point, with the backend
> > directly in dom0, so it is the guest's gratuitous arp which is needed,
> > I think.
> 
> It would be worth investigating whether mini-os's gratuitous ARP might
> also be occurring and confusing things, e.g. by coming after and
> therefore taking precedence over the one coming from the guest.
> 

Several observations:

1. The guest doesn't always send gratuitous arp -- but this might not be
   the cause of this failure. Guest works fine when using qemu-trad
   only.
2. Guest only sends one gratuitous arp at most.
3. When using stubdom, guest is a lot less responsive. See two
   experiments and analysis below.

I statically added an arp entry for the guest interface because the arp
entry sometimes gets deleted. Note that this does not cover up the root
cause of the failure, because the arp entry is normally deleted only
after a few migration iterations, whereas the failures on merlot* mostly
happen on the first iteration. Also, when the arp entry is not
available, the ssh error should be "No route to host", not "timed out".

Furthermore, when the arp entry is not available, dom0 naturally sends an
arp request to the guest. When stubdom is not in use, the guest responded
instantly; when stubdom is in use, the guest was a lot less responsive.
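
The thread does not say which command was used to add the static entry.
One common way is iproute2's `ip neigh replace ... nud permanent`. The
sketch below only prints that command; the helper name `pin_arp_cmd` and
the MAC address in the example are placeholders, not values from this
thread:

```shell
# pin_arp_cmd IP MAC DEV
# Prints the iproute2 command that pins a permanent neighbour (ARP) entry
# for IP on DEV. Printing instead of executing keeps this a dry run, since
# actually changing the neighbour table requires root.
pin_arp_cmd() {
    printf 'ip neigh replace %s lladdr %s dev %s nud permanent\n' "$1" "$2" "$3"
}

# Example (placeholder MAC), to be run as root in dom0:
#   eval "$(pin_arp_cmd 10.80.239.39 00:16:3e:00:00:01 xenbr0)"
```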

I use a script to repeat migration and ssh.

  i=1
  while true; do
      echo "#### iteration $i"
      ssh localhost xl migrate wheezy-hvm localhost
      st=$?   # save the status now; the [ ] test below would overwrite $?
      if [ $st != 0 ]; then
          echo "migration failed $st"
          exit 1
      fi
      # timeout(1) caps the whole ssh attempt at 40s
      timeout 40 ssh -o BatchMode=yes -o ConnectTimeout=100 -o ServerAliveInterval=100 root@10.80.239.39 date
      st=$?
      if [ $st != 0 ]; then
          echo "failed $st"
          exit 1
      fi
      i=$((i+1))
  done

At the same time I run:
  tcpdump -i xenbr0 arp and host $GUEST_IP

When stubdom is present:

Scenario 1:
  xl shows "Migration successful."
  ...30s...
  xenbr0 receives gratuitous arp
  ...1s...
  ssh date command comes back

Scenario 2:
  xenbr0 receives gratuitous arp
  ...1s...
  xl shows "Migration successful."
  ssh date command comes back

When stubdom was not present I never saw scenario 1.

Note that my machine is relatively old (>6 years). It would never pass
the test in osstest because the timeout in osstest is 10s.

The slowness in osstest seems to be host specific because all failures
of the guest migrate test happened on merlot*. It's not only linux-4.1
that is failing; other branches fail the same test step on merlot*, too.
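
To compare a local machine against the 10s limit, it can help to measure
how long the guest takes to answer after migration rather than only
passing or failing. This is a minimal sketch; the helper name
`secs_until_ok` is hypothetical and it retries once per second, which is
an arbitrary choice:

```shell
# secs_until_ok DEADLINE CMD [ARGS...]
# Retries CMD once per second until it succeeds or DEADLINE seconds have
# passed. Prints the elapsed whole seconds on success; returns 1 (printing
# nothing) if the deadline is reached first.
secs_until_ok() {
    deadline=$1
    shift
    start=$(date +%s)
    while :; do
        if "$@" >/dev/null 2>&1; then
            echo $(( $(date +%s) - start ))
            return 0
        fi
        now=$(date +%s)
        [ $(( now - start )) -ge "$deadline" ] && return 1
        sleep 1
    done
}

# Example, right after "xl migrate" returns:
#   secs_until_ok 40 ssh -o BatchMode=yes root@10.80.239.39 date
```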

Wei.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21 17:34           ` Wei Liu
@ 2015-10-22  9:50             ` Ian Campbell
  2015-10-22 10:28               ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-10-22  9:50 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Ian Jackson, osstest service owner

On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote:
> On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions 
> > > - FAIL"):
> > > > From mere code inspection and document of lwip 1.3.0 I think mini
> > > -os
> > > > does send gratuitous ARP.
> > > 
> > > The guest is using the PVHVM drivers at this point, with the backend
> > > directly in dom0, so it is the guest's gratuitous arp which is
> > > needed,
> > > I think.
> > 
> > It would be worth investigating whether mini-os's gratuitous ARP might
> > also be occurring and confusing things, e.g. by coming after and
> > therefore taking precedence over the one coming from the guest.
> > 
> 
> Several observations:
> 
> 1. The guest doesn't always send gratuitous arp -- but this might not be
>    the cause of this failure. Guest works fine when using qemu-trad
>    only.

As in it always sends the arp when using qemu-trad, or that it is fine
irrespective of not always sending it?

> 2. Guest only sends one gratuitous arp at most.

This is as expected, but does the stubdom also send one?

> 3. When using stubdom, guest is a lot less responsive. See two
>    experiments and analysis below.

Less responsive in use or only while migrating, or to ssh after migration,
or to something else?

> Scenario 1:
>   xl shows "Migration successful."
>   ...30s...
>   xenbr0 receives gratuitous arp
>   ...1s...
>   ssh date command comes back
> 
> Scenario 2:
>   xenbr0 receives gratuitous arp
>   ...1s...
>   xl shows "Migration successful."
>   ssh date command comes back
> 
> When stubdom was not present I never saw scenario 1.

It would be worth looking at the possibility of a delay between "Migration
successful" and the target domain actually running. A 30s delay between the
guest restarting and it sending the ARP would be pretty strange IMHO.

> Note that my machine is relative old (>6 years). It would never pass
> the test in osstest because in osstest the timeout is 10s.
> 
> The slowness in osstest seems to be host specific because all failures
> in guest migrate test failed on merlot*. It's not only linux-4.1 is
> failing, other branches fail the same test step on merlot*, too.

This could be a factor in common with the other qemu timeout on merlot which
led to 9acfbe14d726.

It might be worth prodding AMD over that issue again.

Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-22  9:50             ` Ian Campbell
@ 2015-10-22 10:28               ` Wei Liu
  2015-10-22 10:39                 ` Ian Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Liu @ 2015-10-22 10:28 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, xen-devel, Wei Liu, osstest service owner

On Thu, Oct 22, 2015 at 10:50:54AM +0100, Ian Campbell wrote:
> On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote:
> > On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> > > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions 
> > > > - FAIL"):
> > > > > From mere code inspection and document of lwip 1.3.0 I think mini
> > > > -os
> > > > > does send gratuitous ARP.
> > > > 
> > > > The guest is using the PVHVM drivers at this point, with the backend
> > > > directly in dom0, so it is the guest's gratuitous arp which is
> > > > needed,
> > > > I think.
> > > 
> > > It would be worth investigating whether mini-os's gratuitous ARP might
> > > also be occurring and confusing things, e.g. by coming after and
> > > therefore taking precedence over the one coming from the guest.
> > > 
> > 
> > Several observations:
> > 
> > 1. The guest doesn't always send gratuitous arp -- but this might not be
> >    the cause of this failure. Guest works fine when using qemu-trad
> >    only.
> 
> As in it always sends the arp when using qemu-trad, or that it is fine
> irrespective of not always sending it?
> 

Whether or not stubdom is in use, the guest behaves the same -- it
doesn't always send gratuitous arp.

When using qemu-trad alone, it's always fine when it doesn't send a
gratuitous arp, because either the arp cache in dom0 already has the
guest mac address or the guest responds instantly to the dom0 arp request.

So the responsiveness of the guest is the key.

> > 2. Guest only sends one gratuitous arp at most.
> 
> This is as expected, but does the stubdom also send one?
> 

There is at most one gratuitous arp request per migration; I think it's
from the guest, not the stubdom. Identifying the exact interface the arp
packet comes from requires a bit of gymnastics with tcpdump that I didn't
manage to do yesterday.
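
One possible approach is to capture on each vif separately and print the
link-level header, so the source MAC of each arp packet is visible. The
sketch below only prints the tcpdump commands; the helper name
`arp_capture_cmds` and the vif names in the example are assumptions:

```shell
# arp_capture_cmds GUEST_IP IFACE...
# Prints one tcpdump command per interface. -e prints the Ethernet header
# (including the source MAC) and -n disables name resolution.
arp_capture_cmds() {
    ip=$1
    shift
    for ifc in "$@"; do
        printf 'tcpdump -e -n -i %s arp and host %s\n' "$ifc" "$ip"
    done
}

# Example (hypothetical vif names for the guest and its stubdom):
#   arp_capture_cmds 10.80.239.39 vif5.0 vif6.0
```

Running each printed command in its own terminal during a migration
should show on which vif the gratuitous arp appears.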

> > 3. When using stubdom, guest is a lot less responsive. See two
> >    experiments and analysis below.
> 
> Less responsive in use or only while migrating, or to ssh after migration,
> or to something else?
> 

For a period of time after migration, it is less responsive to every
activity, including both arp request / reply and ssh connections.

> > Scenario 1:
> >   xl shows "Migration successful."
> >   ...30s...
> >   xenbr0 receives gratuitous arp
> >   ...1s...
> >   ssh date command comes back
> > 
> > Scenario 2:
> >   xenbr0 receives gratuitous arp
> >   ...1s...
> >   xl shows "Migration successful."
> >   ssh date command comes back
> > 
> > When stubdom was not present I never saw scenario 1.
> 
> It would be worth looking at the possibility of a delay between "Migration
> successful" and the target domain actually running. A 30s delay between the
> guest restarting and it sending the ARP would be pretty strange IMHO
> 

The guest is in a weird state.

xl list shows the stubdom is in "b" state while guest has no state at
all, heh.
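
For reference, the State column of `xl list` is a string of flag
characters (r, b, p, s, c, d) with dashes for unset flags. A small
decoder, assuming the flag meanings documented in the xl manual; the
helper name `decode_xl_state` is hypothetical:

```shell
# decode_xl_state STATE
# Decodes the State column of "xl list" (e.g. "-b----") into flag names.
# Prints "none" when no flag is set, i.e. the field is all dashes.
decode_xl_state() {
    names=
    for pair in r:running b:blocked p:paused s:shutdown c:crashed d:dying; do
        flag=${pair%%:*}
        case "$1" in
            *"$flag"*) names="${names:+$names }${pair#*:}" ;;
        esac
    done
    echo "${names:-none}"
}

# Example:
#   decode_xl_state "-b----"   # blocked
#   decode_xl_state "------"   # none
```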

Wei.

> > Note that my machine is relative old (>6 years). It would never pass
> > the test in osstest because in osstest the timeout is 10s.
> > 
> > The slowness in osstest seems to be host specific because all failures
> > in guest migrate test failed on merlot*. It's not only linux-4.1 is
> > failing, other branches fail the same test step on merlot*, too.
> 
> This could be a factor in common with the other qmu timeout on merlot which
> led to 9acfbe14d726.
> 
> It might be worth prodding AMD over that issue again.
> 
> Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-22 10:28               ` Wei Liu
@ 2015-10-22 10:39                 ` Ian Campbell
  2015-10-22 11:03                   ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-10-22 10:39 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Ian Jackson, osstest service owner

On Thu, 2015-10-22 at 11:28 +0100, Wei Liu wrote:
> On Thu, Oct 22, 2015 at 10:50:54AM +0100, Ian Campbell wrote:
> > On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote:
> > > On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> > > > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > > > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030:
> > > > > regressions 
> > > > > - FAIL"):
> > > > > > From mere code inspection and document of lwip 1.3.0 I think
> > > > > > mini
> > > > > -os
> > > > > > does send gratuitous ARP.
> > > > > 
> > > > > The guest is using the PVHVM drivers at this point, with the
> > > > > backend
> > > > > directly in dom0, so it is the guest's gratuitous arp which is
> > > > > needed,
> > > > > I think.
> > > > 
> > > > It would be worth investigating whether mini-os's gratuitous ARP
> > > > might
> > > > also be occurring and confusing things, e.g. by coming after and
> > > > therefore taking precedence over the one coming from the guest.
> > > > 
> > > 
> > > Several observations:
> > > 
> > > 1. The guest doesn't always send gratuitous arp -- but this might not
> > > be
> > >    the cause of this failure. Guest works fine when using qemu-trad
> > >    only.
> > 
> > As in it always sends the arp when using qemu-trad, or that it is fine
> > irrespective of not always sending it?
> > 
> 
> Whether or not stubdom is in use, the guest behaves the same -- it
> doesn't always send gratuitous arp.
> 
> When using qemu-trad alone, it's always fine when it doesn't send
> gratuitous arp because either there is cache in dom0 that already has
> guest mac address or the guest responses instantly to dom0 arp request.

Where has this cache entry come from? Any preexisting ARP cache would be
associated with vifX.0 and would go away when that device was destroyed and
replaced with vif(X+1).0.

Also this only works for localhost migration. If the domain actually moved
to another host then the ARP is required in order for the physical switch
to learn the new location.

Thus it seems to me that not always sending the gratuitous ARP is the most
important thing to get to the bottom of here.

> So it comes down to the responsiveness of guest is the key.
> 
[...]
> > > 3. When using stubdom, guest is a lot less responsive. See two
> > >    experiments and analysis below.
> > 
> > Less responsive in use or only while migrating, or to ssh after
> > migration,
> > or to something else?
> > 
> 
> For every activity after migration for a period of time, including both
> arp request / reply and ssh connection.
> 
> > > Scenario 1:
> > >   xl shows "Migration successful."
> > >   ...30s...
> > >   xenbr0 receives gratuitous arp
> > >   ...1s...
> > >   ssh date command comes back
> > > 
> > > Scenario 2:
> > >   xenbr0 receives gratuitous arp
> > >   ...1s...
> > >   xl shows "Migration successful."
> > >   ssh date command comes back
> > > 
> > > When stubdom was not present I never saw scenario 1.

So in that case you only saw Scenario 2 which includes a "receives
gratuitous ARP". But above you state that even in the non-stubdom case the
gratuitous ARP is sometimes not sent. Is this a 3rd case which isn't
mentioned here?

> > It would be worth looking at the possibility of a delay between
> > "Migration
> > successful" and the target domain actually running. A 30s delay between
> > the
> > guest restarting and it sending the ARP would be pretty strange IMHO
> > 
> 
> The guest is in a weird state.
> 
> xl list shows the stubdom is in "b" state while guest has no state at
> all, heh.

Has it actually been started/unpaused then?

Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-22 10:39                 ` Ian Campbell
@ 2015-10-22 11:03                   ` Wei Liu
  2015-10-22 11:12                     ` Ian Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Liu @ 2015-10-22 11:03 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, xen-devel, Wei Liu, osstest service owner

On Thu, Oct 22, 2015 at 11:39:39AM +0100, Ian Campbell wrote:
> On Thu, 2015-10-22 at 11:28 +0100, Wei Liu wrote:
> > On Thu, Oct 22, 2015 at 10:50:54AM +0100, Ian Campbell wrote:
> > > On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote:
> > > > On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> > > > > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > > > > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030:
> > > > > > regressions 
> > > > > > - FAIL"):
> > > > > > > From mere code inspection and document of lwip 1.3.0 I think
> > > > > > > mini
> > > > > > -os
> > > > > > > does send gratuitous ARP.
> > > > > > 
> > > > > > The guest is using the PVHVM drivers at this point, with the
> > > > > > backend
> > > > > > directly in dom0, so it is the guest's gratuitous arp which is
> > > > > > needed,
> > > > > > I think.
> > > > > 
> > > > > It would be worth investigating whether mini-os's gratuitous ARP
> > > > > might
> > > > > also be occurring and confusing things, e.g. by coming after and
> > > > > therefore taking precedence over the one coming from the guest.
> > > > > 
> > > > 
> > > > Several observations:
> > > > 
> > > > 1. The guest doesn't always send gratuitous arp -- but this might not
> > > > be
> > > >    the cause of this failure. Guest works fine when using qemu-trad
> > > >    only.
> > > 
> > > As in it always sends the arp when using qemu-trad, or that it is fine
> > > irrespective of not always sending it?
> > > 
> > 
> > Whether or not stubdom is in use, the guest behaves the same -- it
> > doesn't always send gratuitous arp.
> > 
> > When using qemu-trad alone, it's always fine when it doesn't send
> > gratuitous arp because either there is cache in dom0 that already has
> > guest mac address or the guest responses instantly to dom0 arp request.
> 
> Where has this cache entry come from? Any preexisting ARP cache would be
> associated with vifX.0 and would go away when that device was destroyed and
> replace with vif(X+1).0.
> 

No, the vif-bridge script has two runes for off-lining a vif:
  brctl delif $bridge $vif
  ifconfig $vif down

Neither of these causes the cache entry to be flushed.
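
Whether the entry survives can be checked directly from the neighbour
table after a migration. This is a minimal sketch that parses
`ip neigh show` output from stdin; the helper name `has_neigh_entry` is
hypothetical, and it treats an entry as usable only if it carries an
`lladdr`:

```shell
# has_neigh_entry IP
# Reads "ip neigh show" output on stdin and succeeds if there is an entry
# for IP that has a link-layer address (FAILED/INCOMPLETE entries have none).
has_neigh_entry() {
    awk -v ip="$1" '$1 == ip && /lladdr/ { found = 1 } END { exit !found }'
}

# Example, in dom0 after the old vif has been taken down:
#   ip neigh show dev xenbr0 | has_neigh_entry 10.80.239.39 && echo cached
```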

> Also this only work for localhost migration. If the domain actually moved
> to another host then the ARP is required in order for the physical switch
> to learn the new location.
> 
> Thus it seems to me that not always sending the gratuitous ARP is the most
> important thing to get to the bottom of here.
> 

That's another issue, but it would cause a different error (no route to
host) instead of a timeout. The failure exhibits a timeout error, so
let's do one thing at a time.

> > So it comes down to the responsiveness of guest is the key.
> > 
> [...]
> > > > 3. When using stubdom, guest is a lot less responsive. See two
> > > >    experiments and analysis below.
> > > 
> > > Less responsive in use or only while migrating, or to ssh after
> > > migration,
> > > or to something else?
> > > 
> > 
> > For every activity after migration for a period of time, including both
> > arp request / reply and ssh connection.
> > 
> > > > Scenario 1:
> > > >   xl shows "Migration successful."
> > > >   ...30s...
> > > >   xenbr0 receives gratuitous arp
> > > >   ...1s...
> > > >   ssh date command comes back
> > > > 
> > > > Scenario 2:
> > > >   xenbr0 receives gratuitous arp
> > > >   ...1s...
> > > >   xl shows "Migration successful."
> > > >   ssh date command comes back
> > > > 
> > > > When stubdom was not present I never saw scenario 1.
> 
> So in that case you only saw Scenario 2 which includes a "receives
> gratuitous ARP". But above you state that even with non-stub case sometimes
> the grauitous ARP is not sent. Is this a 3rd case which isn't mentioned
> here?
> 

Scenario 3:
  xl shows "Migration successful."
  dom0 sends arp request because arp cache entry not available
  guest takes a long time to respond when using stubdom or responds
    instantly when not using stubdom

Scenario 4:
  xl shows "Migration successful."
  (arp cache entry still available)
  guest takes a long time to respond to ssh when using stubdom or
    responds instantly when not using stubdom

> > > It would be worth looking at the possibility of a delay between
> > > "Migration
> > > successful" and the target domain actually running. A 30s delay between
> > > the
> > > guest restarting and it sending the ARP would be pretty strange IMHO
> > > 
> > 
> > The guest is in a weird state.
> > 
> > xl list shows the stubdom is in "b" state while guest has no state at
> > all, heh.
> 
> Has it actually been started/unpaused then?
> 

Yes, of course -- otherwise the state would have been "p". And I
observed the transition from "p" to "weird state".

Wei.

> Ian.

* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-22 11:03                   ` Wei Liu
@ 2015-10-22 11:12                     ` Ian Campbell
  2015-10-22 14:41                       ` Ian Jackson
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-10-22 11:12 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Ian Jackson, osstest service owner

On Thu, 2015-10-22 at 12:03 +0100, Wei Liu wrote:
> On Thu, Oct 22, 2015 at 11:39:39AM +0100, Ian Campbell wrote:
> > On Thu, 2015-10-22 at 11:28 +0100, Wei Liu wrote:
> > > On Thu, Oct 22, 2015 at 10:50:54AM +0100, Ian Campbell wrote:
> > > > On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote:
> > > > > On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> > > > > > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > > > > > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030:
> > > > > > > regressions 
> > > > > > > - FAIL"):
> > > > > > > > From mere code inspection and document of lwip 1.3.0 I think
> > > > > > > > mini-os does send gratuitous ARP.
> > > > > > > 
> > > > > > > The guest is using the PVHVM drivers at this point, with the
> > > > > > > backend
> > > > > > > directly in dom0, so it is the guest's gratuitous arp which
> > > > > > > is
> > > > > > > needed,
> > > > > > > I think.
> > > > > > 
> > > > > > It would be worth investigating whether mini-os's gratuitous
> > > > > > ARP
> > > > > > might
> > > > > > also be occurring and confusing things, e.g. by coming after
> > > > > > and
> > > > > > therefore taking precedence over the one coming from the guest.
> > > > > > 
> > > > > 
> > > > > Several observations:
> > > > > 
> > > > > 1. The guest doesn't always send gratuitous arp -- but this might
> > > > >    not be the cause of this failure. Guest works fine when using
> > > > >    qemu-trad only.
> > > > 
> > > > As in it always sends the arp when using qemu-trad, or that it is
> > > > fine
> > > > irrespective of not always sending it?
> > > > 
> > > 
> > > Whether or not stubdom is in use, the guest behaves the same -- it
> > > doesn't always send gratuitous arp.
> > > 
> > > When using qemu-trad alone, it's always fine when it doesn't send
> > > gratuitous arp because either there is cache in dom0 that already has
> > > guest mac address or the guest responds instantly to dom0 arp
> > > request.
> > 
> > Where has this cache entry come from? Any preexisting ARP cache would
> > be
> > associated with vifX.0 and would go away when that device was destroyed
> > and
> > replaced with vif(X+1).0.
> > 
> 
> No, vif-bridge script has two runes for off-lining a vif
>   brctl delif $bridge $vif
>   ifconfig $vif down
> 
> Neither of these causes cache entry to be flushed.

$vif disappearing when netback finally deletes the device will though. Or
it should/used to.

Maybe this is happening after the new guest has started and confusing
things somewhere?

> > Also this only works for localhost migration. If the domain actually
> > moved
> > to another host then the ARP is required in order for the physical
> > switch
> > to learn the new location.
> > 
> > Thus it seems to me that not always sending the gratuitous ARP is the
> > most
> > important thing to get to the bottom of here.
> > 
> 
> That's another issue, but this would cause a different error (no route to
> host) instead of timeout. The failure exhibits timeout error -- let's do
> one thing at a time.

The presence of an ARP cache entry in dom0 pointing to the old VIF would
also cause a timeout issue, I think, since the guest is no longer connected
to that vif.

This stale ARP cache entry should be the first thing to investigate, before
either the lack of a grat ARP or the slowness of the guest, since its
presence will confuse the results in both those other cases.

> > > So it comes down to the responsiveness of guest is the key.
> > > 
> > [...]
> > > > > 3. When using stubdom, guest is a lot less responsive. See two
> > > > >    experiments and analysis below.
> > > > 
> > > > Less responsive in use or only while migrating, or to ssh after
> > > > migration,
> > > > or to something else?
> > > > 
> > > 
> > > For every activity after migration for a period of time, including
> > > both
> > > arp request / reply and ssh connection.
> > > 
> > > > > Scenario 1:
> > > > >   xl shows "Migration successful."
> > > > >   ...30s...
> > > > >   xenbr0 receives gratuitous arp
> > > > >   ...1s...
> > > > >   ssh date command comes back
> > > > > 
> > > > > Scenario 2:
> > > > >   xenbr0 receives gratuitous arp
> > > > >   ...1s...
> > > > >   xl shows "Migration successful."
> > > > >   ssh date command comes back
> > > > > 
> > > > > When stubdom was not present I never saw scenario 1.
> > 
> > So in that case you only saw Scenario 2 which includes a "receives
> > gratuitous ARP". But above you state that even with non-stub case
> > sometimes
> > the gratuitous ARP is not sent. Is this a 3rd case which isn't mentioned
> > here?
> > 
> 
> Scenario 3:
>   xl shows "Migration successful."
>   dom0 sends arp request because arp cache entry not available
>   guest takes a long time to respond when using stubdom or responds
>     instantly when not using stubdom
> 
> Scenario 4:
>   xl shows "Migration successful."
>   (arp cache entry still available)
>   guest takes a long time to respond to ssh when using stubdom or
>     responds instantly when not using stubdom
> 
> > > > It would be worth looking at the possibility of a delay between
> > > > "Migration
> > > > successful" and the target domain actually running. A 30s delay
> > > > between
> > > > the
> > > > guest restarting and it sending the ARP would be pretty strange
> > > > IMHO
> > > > 
> > > 
> > > The guest is in a weird state.
> > > 
> > > xl list shows the stubdom is in "b" state while guest has no state at
> > > all, heh.
> > 
> > Has it actually been started/unpaused then?
> > 
> 
> Yes, of course -- otherwise the state would have been "p". And I
> observed the transition from "p" to "weird state".

If weird state is "-----" then I think that is normal, it is "runnable but
not running" IIRC.
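
For anyone reading along, here is a small sketch of how the State column
of "xl list" can be decoded. It assumes the single-letter flags documented
in the xl manual (r, b, p, s, c, d); decode_xl_state is a hypothetical
helper for illustration, not part of any Xen tool.

```python
# Flag letters as listed in the xl manual for the State column of "xl list".
XL_STATE_FLAGS = {
    "r": "running",
    "b": "blocked",
    "p": "paused",
    "s": "shutdown",
    "c": "crashed",
    "d": "dying",
}


def decode_xl_state(state):
    """Return the names of the flags set in an xl State string.

    Dashes (unset positions) are skipped, so an all-dashes string
    yields an empty list.
    """
    return [XL_STATE_FLAGS[ch] for ch in state if ch in XL_STATE_FLAGS]


stubdom_state = decode_xl_state("-b----")  # the stubdom: blocked
paused_state = decode_xl_state("--p---")   # a domain that is still paused
guest_state = decode_xl_state("------")    # the "weird state": no flag set
```

An empty list corresponds to the all-dashes case, i.e. none of the flags
is set, which is different from the paused ("p") case.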

Ian.


* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-22 11:12                     ` Ian Campbell
@ 2015-10-22 14:41                       ` Ian Jackson
  2015-10-22 14:56                         ` Ian Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Jackson @ 2015-10-22 14:41 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Wei Liu, osstest service owner

Ian Campbell writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> On Thu, 2015-10-22 at 12:03 +0100, Wei Liu wrote:
> > No, vif-bridge script has two runes for off-lining a vif
> >   brctl delif $bridge $vif
> >   ifconfig $vif down
> > 
> > Neither of these causes cache entry to be flushed.
> 
> $vif disappearing when netback finally deletes the device will though. Or
> it should/used to.
> 
> Maybe this is happening after the new guest has started and confusing
> things somewhere?

There is confusion here.  Someone used the phrase `arp cache'.  But
there are actually two relevant runtime tables of MAC addresses:

 * Each host has a neighbour database mapping IPv4 addresses to MAC
   addresses.  This is used when trying to pass on an IPv4 datagram to
   a host on the same ethernet (same broadcast domain).  This database
   is normally referred to as an `ARP cache'.  Addresses are added to
   the table by both ARP requests and responses, and also in many
   implementations entries are refreshed by ordinary traffic.

   In the test colo, the osstest VM is on the same (bridged) ethernet
   as the test box, so the relevant arp cache is the one in the
   osstest controller's kernel: the osstest controller wants to send
   an ssh SYN to the guest, and needs to construct an ethernet frame
   with the guest's MAC address.  This is done using the osstest
   controller's ARP cache entry.

   The osstest controller's ARP cache is unaffected by the migration.
   ARP cache entries do time out but only after a number of minutes,
   and the guest will have been spoken to recently by the controller.
   I have no reason to think that lack of an entry for the guest's
   IPv4 address in the osstest controller's ARP cache is relevant.

 * Each bridge has a forwarding database mapping MAC addresses to its
   outbound links.  This is normally referred to as the bridge
   (switch) `learning', and the table as the `MAC address table'.  MAC
   addresses are learned when the switch sees incoming frames.  When the
   bridge receives a frame for a destination MAC address for which it
   has no entry, it forwards the frame out of all its ports.  Special
   considerations apply to broadcast and multicast MAC addresses.
   None of this involves IPv4 or IPv6 addresses.

   In the test colo in the migration test case, there are up to four
   relevant bridges:

      * The source test box's dom0's software bridge.
        This has (logically speaking) three or four `ports':
          - the test box's physical network interface
          - the dom0 itself
          - the vif corresponding to the outbound guest
          - in a single-host test, the vif corresponding to
            the inbound guest

      * The physical switch connecting the test boxes and the VM host
        (newcastle.test-lab.xenproject.org).  This has two or three
        relevant physical ports, for the two or three relevant
        physical machines.  (In fact there are VLANs involved but this
        is not relevant.)

      * The software bridge in newcastle.  This has two relevant
        ports:
          - newcastle's physical interface
          - the vif serving the osstest VM

      * In a two-machine test, rather than a localhost test, the
        destination test box's dom0's software bridge, which parallels
        the source test box's.

   When the guest stops running on the source (with its vif torn
   down), and starts running on the destination:

     (a) The source test box software bridge should lose its MAC
        address table entry for the guest, because the corresponding
        port (the vif) is removed.  However I am not sure whether this
        actually happens immediately in Linux.

        It may be that instead the MAC address table entry for the
        guest remains present but points to the dead vif.  In this
        case incoming frames from the wire for the guest will be
        dropped.

     (b) The destination test box (if different) will come up without a
        MAC address entry for the guest.  If a frame for the guest's
        MAC address arrives at the physical interface, it will be
        forwarded to all of the other interfaces enslaved to the
        bridge: ie, to the dom0 (which will ignore it because it has
        the wrong destination MAC address) and to the newly-created
        guest.

     (c) In a two-host test, the physical switch connecting the two
        test boxes will retain the wrong learnt switch port.  It will
        forward frames for the guest (only) to the source test box,
        rather than the destination test box, where they will be
        discarded.

   It is (a) and (c) that the gratuitous ARP is supposed to fix.

   The guest is supposed to send, when its interface comes up after
   migration, a single broadcast gratuitous ARP response containing
   its own IPv4 and MAC addresses.

   The IPv4 address in this message is irrelevant.

   The purpose is to update the MAC address tables in all the switches
   in the network.  Each switch which receives the gratuitous ARP
   updates its MAC address table to map the guest's MAC address to the
   port on which the gratuitous ARP was received.

   If this happens, then frames from everywhere on the ethernet, to
   the guest, will be properly delivered.  If it doesn't then there
   may be lost packets and/or low-level timeouts of various kinds.
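
To make the gratuitous ARP described above concrete, here is a minimal
Python sketch that assembles such a frame by hand. The MAC and IPv4
values are made up for illustration, and setting the ARP target hardware
address to broadcast is an assumption: implementations differ on that
field. It only builds the bytes; it does not send anything.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806
ARP_HTYPE_ETHERNET = 1
ARP_PTYPE_IPV4 = 0x0800
ARP_OP_REPLY = 2


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to its 6 raw bytes."""
    return bytes(int(octet, 16) for octet in mac.split(":"))


def build_gratuitous_arp_reply(mac, ipv4):
    """Build a broadcast gratuitous ARP reply for the given MAC and IPv4."""
    sha = mac_to_bytes(mac)          # sender hardware address
    spa = socket.inet_aton(ipv4)     # sender protocol address
    # Ethernet header: destination, source, ethertype.
    eth_header = BROADCAST_MAC + sha + struct.pack("!H", ETHERTYPE_ARP)
    # ARP payload: fixed header, then sender and target address pairs.
    # In a gratuitous ARP the target IPv4 equals the sender IPv4.
    arp = struct.pack("!HHBBH", ARP_HTYPE_ETHERNET, ARP_PTYPE_IPV4,
                      6, 4, ARP_OP_REPLY)
    arp += sha + spa + BROADCAST_MAC + spa
    return eth_header + arp


# Made-up example values (Xen OUI prefix, documentation IPv4 range).
frame = build_gratuitous_arp_reply("00:16:3e:12:34:56", "192.0.2.10")
```

The only part a switch learns from is the Ethernet source address (bytes
6 to 11 of the frame); the ARP payload, including the IPv4 address, plays
no role in MAC learning.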

Ian.


* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-22 14:41                       ` Ian Jackson
@ 2015-10-22 14:56                         ` Ian Campbell
  2015-10-22 15:18                           ` Ian Jackson
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-10-22 14:56 UTC (permalink / raw)
  To: Ian Jackson; +Cc: xen-devel, Wei Liu, osstest service owner

On Thu, 2015-10-22 at 15:41 +0100, Ian Jackson wrote:
> Ian Campbell writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions
> - FAIL"):
> > On Thu, 2015-10-22 at 12:03 +0100, Wei Liu wrote:
> > > No, vif-bridge script has two runes for off-lining a vif
> > >   brctl delif $bridge $vif
> > >   ifconfig $vif down
> > > 
> > > Neither of these causes cache entry to be flushed.
> > 
> > $vif disappearing when netback finally deletes the device will though.
> > Or
> > it should/used to.
> > 
> > Maybe this is happening after the new guest has started and confusing
> > things somewhere?
> 
> 
> There is confusion here.  Someone used the phrase `arp cache'.  But
> there are actually two relevant runtime tables of MAC addresses:
> 
>  * Each host has a neighbour database mapping IPv4 addresses to MAC
>    addresses.  This is used when trying to pass on an IPv4 datagram to
>    a host on the same ethernet (same broadcast domain).  This database
>    is normally referred to as an `ARP cache'.  Addresses are added to
>    the table by both ARP requests and responses, and also in many
>    implementations entries are refreshed by ordinary traffic.
> 
>    In the test colo, the osstest VM is on the same (bridged) ethernet
>    as the test box, so the relevant arp cache is the one in the
>    osstest controller's kernel: the osstest controller wants to send
>    an ssh SYN to the guest, and needs to construct an ethernet frame
>    with the guest's MAC address.  This is done using the osstest
>    controller's ARP cache entry.
> 
>    The osstest controller's ARP cache is unaffected by the migration.
>    ARP cache entries do time out but only after a number of minutes,
>    and the guest will have been spoken to recently by the controller.
>    I have no reason to think that lack of an entry for the guest's
>    IPv4 address in the osstest controller's ARP cache is relevant.

I was talking about this kind of ARP cache, but the one in the (single,
since it is localhost migrate) dom0. That's because I had misread Wei's
earlier script as sshing to the guest from dom0, not from his workstation
(the "controller" in his scenario).

Sorry for the confusion.

FWIW I believe the source dom0's ARP entry will be dropped when the VIF
device is destroyed.

>  * Each bridge has a forwarding database mapping MAC addresses to its
>    outbound links.  This is normally referred to as the bridge
>    (switch) `learning', and the table as the `MAC address table'.  MAC
>    addresses are learned when the switch sees incoming frames.  When the
>    bridge receives a frame for a destination MAC address for which it
>    has no entry, it forwards the frame out of all its ports.  Special
>    considerations apply to broadcast and multicast MAC addresses.
>    None of this involves IPv4 or IPv6 addresses.
> 
>    In the test colo in the migration test case, there are up to four
>    relevant bridges:
> 
>       * The source test box's dom0's software bridge.
>         This has (logically speaking) three or four `ports':
>           - the test box's physical network interface
>           - the dom0 itself
>           - the vif corresponding to the outbound guest
>           - in a single-host test, the vif corresponding to
>             the inbound guest
> 
>       * The physical switch connecting the test boxes and the VM host
>         (newcastle.test-lab.xenproject.org).  This has two or three
>         relevant physical ports, for the two or three relevant
>         physical machines.  (In fact there are VLANs involved but this
>         is not relevant.)
> 
>       * The software bridge in newcastle.  This has two relevant
>         ports:
>           - newcastle's physical interface
>           - the vif serving the osstest VM
>         
>       * In a two-machine test, rather than a localhost test, the
>         destination test box's dom0's software bridge, which parallels
>         the source test box's.
> 
>    When the guest stops running on the source (with its vif torn
>    down), and starts running on the destination:
> 
>      (a) The source test box software bridge should lose its MAC
>         address table entry for the guest, because the corresponding
>         port (the vif) is removed.  However I am not sure whether this
>         actually happens immediately in Linux.

For Linux bridging I believe it happens at the latest when the vif device
is deleted, or possibly when it is removed from the bridge (i.e. earlier).

IOW I do not believe that Linux bridge remembers old ports.

openvswitch might, I don't recall, but I don't think that is in the picture
here.

>         It may be that instead the MAC address table entry for the
>         guest remains present but points to the dead vif.  In this
>         case incoming frames from the wire for the guest will be
>         dropped.
> 
>      (b) The destination test box (if different) will come up without a
>         MAC address entry for the guest.

Given the above I think this is true even if it is the same as the source
box, since the entry will have been forgotten by the "source" box when the
original VIF disappeared.

>   If a frame for the guest's
>         MAC address arrives at the physical interface, it will be
>         forwarded to all of the other interfaces enslaved to the
>         bridge: ie, to the dom0 (which will ignore it because it has
>         the wrong destination MAC address) and to the newly-created
>         guest.
> 
>      (c) In a two-host test, the physical switch connecting the two
>         test boxes will retain the wrong learnt switch port.  It will
>         forward frames for the guest (only) to the source test box,
>         rather than the destination test box, where they will be
>         discarded.
> 
>    It is (a) and (c) that the gratuitous ARP is supposed to fix.
> 
>    The guest is supposed to send, when its interface comes up after
>    migration, a single broadcast gratuitous ARP response containing
>    its own IPv4 and MAC addresses.
> 
>    The IPv4 address in this message is irrelevant.
> 
>    The purpose is to update the MAC address tables in all the switches
>    in the network.  Each switch which receives the gratuitous ARP
>    updates its MAC address table to map the guest's MAC address to the
>    port on which the gratuitous ARP was received.
> 
>    If this happens, then frames from everywhere on the ethernet, to
>    the guest, will be properly delivered.  If it doesn't then there
>    may be lost packets and/or low-level timeouts of various kinds.
> 
> 
> Ian.


* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-22 14:56                         ` Ian Campbell
@ 2015-10-22 15:18                           ` Ian Jackson
  0 siblings, 0 replies; 22+ messages in thread
From: Ian Jackson @ 2015-10-22 15:18 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Wei Liu, osstest service owner

Ian Campbell writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> FWIW I believe the source dom0's ARP entry will be dropped when the VIF
> device is destroyed.
...
> For Linux bridging I believe it happens at the latest when the vif device
> is deleted, or possibly when it is removed from the bridge (i.e. earlier).

In that case, when there is only one physical host, the gratuitous ARP
should not matter, since if a switch sees a frame destined for a MAC
address that isn't in its forwarding table, it must forward it to
every port.
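
That argument can be sketched with a toy learning bridge in Python. This
is an illustration only, not how Linux bridging is implemented: the class,
port names and MAC addresses are hypothetical, and it assumes (as believed
above) that removing a port flushes the entries learned on it.

```python
class LearningBridge:
    """Toy model of a MAC-learning bridge (hypothetical, for illustration)."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}  # forwarding database: MAC address -> port

    def add_port(self, port):
        self.ports.add(port)

    def remove_port(self, port):
        # Assumption: entries learned on a removed port are flushed with it.
        self.ports.discard(port)
        self.fdb = {mac: p for mac, p in self.fdb.items() if p != port}

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the source, then return the ports the frame goes out of."""
        self.fdb[src_mac] = in_port
        out_port = self.fdb.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress one.
            return sorted(self.ports - {in_port})
        return [] if out_port == in_port else [out_port]


GUEST = "00:16:3e:12:34:56"       # made-up guest MAC
CONTROLLER = "00:16:3e:aa:bb:cc"  # made-up controller MAC

bridge = LearningBridge(["eth0", "dom0", "vif1.0"])
bridge.receive("vif1.0", GUEST, CONTROLLER)         # guest talks: MAC learned
before = bridge.receive("eth0", CONTROLLER, GUEST)  # goes to vif1.0 only

# Localhost migration: old vif goes away, new vif appears, and the guest
# has not sent a gratuitous ARP (or anything else) yet.
bridge.remove_port("vif1.0")
bridge.add_port("vif2.0")
after = bridge.receive("eth0", CONTROLLER, GUEST)   # flooded to dom0, vif2.0
```

In this model the frame still reaches the new vif by flooding. If the
stale entry were instead kept and still pointed at the removed port, the
frame would be sent nowhere useful, which is the case the gratuitous ARP
is meant to repair.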

Ian.


Thread overview: 22+ messages
2015-10-18 17:52 [linux-4.1 test] 63030: regressions - FAIL osstest service owner
2015-10-19 13:51 ` Wei Liu
2015-10-20 14:39   ` Ian Jackson
2015-10-20 15:24     ` Wei Liu
2015-10-20 15:34       ` Ian Jackson
2015-10-21 16:47         ` Ian Campbell
2015-10-21 17:34           ` Wei Liu
2015-10-22  9:50             ` Ian Campbell
2015-10-22 10:28               ` Wei Liu
2015-10-22 10:39                 ` Ian Campbell
2015-10-22 11:03                   ` Wei Liu
2015-10-22 11:12                     ` Ian Campbell
2015-10-22 14:41                       ` Ian Jackson
2015-10-22 14:56                         ` Ian Campbell
2015-10-22 15:18                           ` Ian Jackson
2015-10-21  9:04       ` Ian Campbell
2015-10-21  9:24         ` Wei Liu
2015-10-21  9:44           ` Ian Campbell
2015-10-21 10:04             ` Ian Campbell
2015-10-21 10:35             ` Wei Liu
2015-10-21 10:48               ` Ian Campbell
2015-10-21 11:07                 ` Wei Liu
