* [linux-4.1 test] 63030: regressions - FAIL
@ 2015-10-18 17:52 osstest service owner
  2015-10-19 13:51 ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: osstest service owner @ 2015-10-18 17:52 UTC (permalink / raw)
  To: xen-devel, osstest-admin
flight 63030 linux-4.1 real [real]
http://logs.test-lab.xenproject.org/osstest/logs/63030/
Regressions :-(
Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 15 guest-localmigrate.2 fail REGR. vs. 62318
Tests which are failing intermittently (not blocking):
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 13 guest-localmigrate fail in 63013 pass in 63030
 test-armhf-armhf-xl-credit2   6 xen-boot                    fail pass in 63013
Regressions which are regarded as allowable (not blocking):
 test-armhf-armhf-xl-rtds     11 guest-start                  fail   like 62256
 test-amd64-amd64-libvirt-pair 21 guest-migrate/src_host/dst_host fail like 62256
 test-amd64-i386-libvirt-pair 21 guest-migrate/src_host/dst_host fail like 62256
 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm 13 guest-localmigrate fail like 62318
 test-amd64-amd64-xl-qemut-win7-amd64 17 guest-stop             fail like 62318
 test-amd64-amd64-xl-qemuu-win7-amd64 17 guest-stop             fail like 62318
Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl-rtds 13 saverestore-support-check fail in 63013 never pass
 test-armhf-armhf-xl-rtds     12 migrate-support-check fail in 63013 never pass
 test-armhf-armhf-xl-rtds 16 guest-start/debian.repeat fail in 63013 never pass
 test-armhf-armhf-xl-credit2 13 saverestore-support-check fail in 63013 never pass
 test-armhf-armhf-xl-credit2  12 migrate-support-check fail in 63013 never pass
 test-armhf-armhf-xl-vhd       9 debian-di-install            fail   never pass
 test-armhf-armhf-libvirt-qcow2  9 debian-di-install            fail never pass
 test-amd64-amd64-xl-pvh-intel 14 guest-saverestore            fail  never pass
 test-amd64-amd64-xl-pvh-amd  11 guest-start                  fail   never pass
 test-armhf-armhf-libvirt-raw  9 debian-di-install            fail   never pass
 test-armhf-armhf-libvirt-xsm 12 migrate-support-check        fail   never pass
 test-armhf-armhf-libvirt-xsm 14 guest-saverestore            fail   never pass
 test-amd64-amd64-libvirt-xsm 12 migrate-support-check        fail   never pass
 test-armhf-armhf-libvirt     14 guest-saverestore            fail   never pass
 test-armhf-armhf-libvirt     12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-arndale  12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-arndale  13 saverestore-support-check    fail   never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 10 migrate-support-check fail never pass
 test-amd64-i386-libvirt-xsm  12 migrate-support-check        fail   never pass
 test-amd64-amd64-libvirt     12 migrate-support-check        fail   never pass
 test-amd64-i386-libvirt      12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-cubietruck 12 migrate-support-check        fail never pass
 test-armhf-armhf-xl-cubietruck 13 saverestore-support-check    fail never pass
 test-armhf-armhf-xl-xsm      13 saverestore-support-check    fail   never pass
 test-armhf-armhf-xl-xsm      12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl-multivcpu 13 saverestore-support-check    fail  never pass
 test-armhf-armhf-xl-multivcpu 12 migrate-support-check        fail  never pass
 test-amd64-amd64-libvirt-vhd 11 migrate-support-check        fail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 10 migrate-support-check fail never pass
 test-armhf-armhf-xl          12 migrate-support-check        fail   never pass
 test-armhf-armhf-xl          13 saverestore-support-check    fail   never pass
 test-amd64-i386-xl-qemuu-win7-amd64 17 guest-stop              fail never pass
 test-amd64-i386-xl-qemut-win7-amd64 17 guest-stop              fail never pass
version targeted for testing:
 linux                27f1b7fed9c305ef46f8708f1bdde9cdb5f166bd
baseline version:
 linux                36311a9ec4904c080bbdfcefc0f3d609ed508224
Last test of basis    62318  2015-09-24 00:30:22 Z   24 days
Failing since         62540  2015-09-29 17:44:52 Z   19 days   17 attempts
Testing same since    62659  2015-10-04 12:21:24 Z   14 days   15 attempts
------------------------------------------------------------
People who touched revisions under test:
  "Eric W. Biederman" <ebiederm@xmission.com>
  Aaron Brown <aaron.f.brown@intel.com>
  Adam Lee <adam.lee@canonical.com>
  Adrien Schildknecht <adrien+dev@schischi.me>
  Alex Deucher <alexander.deucher@amd.com>
  Alexander Drozdov <al.drozdov@gmail.com>
  Alexander Duyck <alexander.h.duyck@redhat.com>
  Alexandre Belloni <alexandre.belloni@free-electrons.com>
  Alexei Starovoitov <ast@plumgrid.com>
  Alexey Brodkin <abrodkin@synopsys.com>
  Alexey Brodkin <Alexey.Brodkin@synopsys.com>
  Andrew Morton <akpm@linux-foundation.org>
  Andrew W Elble <aweits@rit.edu>
  Andy Whitcroft <apw@canonical.com>
  Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
  Angga <Hermin.Anggawijaya@alliedtelesis.co.nz>
  Anna Schumaker <Anna.Schumaker@netapp.com>
  Ard Biesheuvel <ard.biesheuvel@linaro.org>
  Ariel Nahum <arieln@mellanox.com>
  Atsushi Nemoto <nemoto@toshiba-tops.co.jp>
  Bart Van Assche <bart.vanassche@sandisk.com>
  Benjamin Coddington <bcodding@redhat.com>
  Benjamin Herrenschmidt <benh@kernel.crashing.org>
  Benoit Parrot <bparrot@ti.com>
  Bob Copeland <me@bobcopeland.com>
  Bob Liu <bob.liu@oracle.com>
  Brenden Blanco <bblanco@plumgrid.com>
  Brian Starkey <brian.starkey@arm.com>
  Carol L Soto <clsoto@linux.vnet.ibm.com>
  Catalin Marinas <catalin.marinas@arm.com>
  Chris Mason <clm@fb.com>
  Christian Borntraeger <borntraeger@de.ibm.com>
  Christoph Hellwig <hch@lst.de>
  Christophe Ricard <christophe-h.ricard@st.com>
  Christophe Ricard <christophe.ricard@gmail.com>
  Cong Wang <cwang@twopensource.com>
  Cong Wang <xiyou.wangcong@gmail.com>
  Dan Carpenter <dan.carpenter@oracle.com>
  Daniel Axtens <dja@axtens.net>
  Daniel Borkmann <daniel@iogearbox.net>
  Darren Hart <dvhart@linux.intel.com>
  David Ahern <dsa@cumulusnetworks.com>
  David Dueck <davidcdueck@googlemail.com>
  David Härdeman <david@hardeman.nu>
  David Rientjes <rientjes@google.com>
  David S. Miller <davem@davemloft.net>
  Ding Tianhong <dingtianhong@huawei.com>
  dingtianhong <dingtianhong@huawei.com>
  Dmitry Torokhov <dmitry.torokhov@gmail.com>
  Doug Ledford <dledford@redhat.com>
  Edward Hyunkoo Jee <edjee@google.com>
  Emil Medve <Emilian.Medve@Freescale.com>
  Eric Dumazet <edumazet@google.com>
  Eric Sandeen <sandeen@redhat.com>
  Eric W. Biederman <ebiederm@xmission.com>
  Eryu Guan <guaneryu@gmail.com>
  Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
  Filipe Manana <fdmanana@suse.com>
  Florian Fainelli <f.fainelli@gmail.com>
  Florian Westphal <fw@strlen.de>
  Fugang Duan <B38611@freescale.com>
  Gavin Shan <gwshan@linux.vnet.ibm.com>
  Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
  Haggai Eran <haggaie@mellanox.com>
  Hans de Goede <hdegoede@redhat.com>
  Hans Verkuil <hans.verkuil@cisco.com>
  Heiko Stuebner <heiko@sntech.de>
  Heiko Stübner <heiko@sntech.de>
  Helge Deller <deller@gmx.de>
  Herbert Xu <herbert@gondor.apana.org.au>
  Hermin Anggawijaya <hermin.anggawijaya@alliedtelesis.co.nz>
  Hin-Tak Leung <htl10@users.sourceforge.net>
  huaibin Wang <huaibin.wang@6wind.com>
  Ian Munsie <imunsie@au1.ibm.com>
  Ido Schimmel <idosch@mellanox.com>
  Ivan Vecera <ivecera@redhat.com>
  J. Bruce Fields <bfields@redhat.com>
  Jack Morgenstein <jackm@dev.mellanox.co.il>
  Jaewon Kim <jaewon31.kim@samsung.com>
  Jamal Hadi Salim <jhs@mojatatu.com>
  Jan Kara <jack@suse.com>
  Jann Horn <jann@thejh.net>
  Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
  Jean Delvare <jdelvare@suse.de>
  Jeff Kirsher <jeffrey.t.kirsher@intel.com>
  Jeff Layton <jeff.layton@primarydata.com>
  Jeff Layton <jlayton@poochiereds.net>
  Jeff Vander Stoep <jeffv@google.com>
  Jeffery Miller <jmiller@neverware.com>
  Jens Axboe <axboe@fb.com>
  Jesse Gross <jesse@nicira.com>
  Jesse Jones <jjones@cococorp.com>
  Jialing Fu <jlfu@marvell.com>
  Jiri Pirko <jiri@resnulli.us>
  Jisheng Zhang <jszhang@marvell.com>
  Joerg Roedel <jroedel@suse.de>
  Johannes Berg <johannes.berg@intel.com>
  John David Anglin <dave.anglin>
  John David Anglin <dave.anglin@bell.net>
  John Fastabend <john.r.fastabend@intel.com>
  Joonyoung Shim <jy0922.shim@samsung.com>
  Julian Anastasov <ja@ssi.bg>
  Kalle Valo <kvalo@codeaurora.org>
  Kees Cook <keescook@chromium.org>
  Ken-ichirou MATSUZAWA <chamaken@gmail.com>
  Kinglong Mee <kinglongmee@gmail.com>
  Krzysztof Kozlowski <k.kozlowski@samsung.com>
  Kyle Evans <kvans32@gmail.com>
  Lad, Prabhakar <prabhakar.csengg@gmail.com>
  Larry Finger <Larry.Finger@lwfinger.net>
  Lars Westerhoff <lars.westerhoff@newtec.eu>
  Laurent Pinchart <laurent.pinchart@ideasonboard.com>
  Leonidas Da Silva Barbosa <leosilva@linux.vnet.ibm.com>
  Leonidas S. Barbosa <leosilva@linux.vnet.ibm.com>
  Linus Lüssing <linus.luessing@c0d3.blue>
  Linus Torvalds <torvalds@linux-foundation.org>
  Linus Walleij <linus.walleij@linaro.org>
  Ludovic Desroches <ludovic.desroches@atmel.com>
  Luis Henriques <luis.henriques@canonical.com>
  Madalin Bucur <Madalin.Bucur@freescale.com>
  Marc Zyngier <marc.zyngier@arm.com>
  Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
  Markos Chandras <markos.chandras@imgtec.com>
  Matan Barak <matanb@mellanox.com>
  Matthew Rosato <mjrosato@linux.vnet.ibm.com>
  Mauro Carvalho Chehab <mchehab@osg.samsung.com>
  Mel Gorman <mgorman@suse.de>
  Michael Ellerman <mpe@ellerman.id.au>
  Michael S. Tsirkin <mst@redhat.com>
  Michal Hocko <mhocko@suse.com>
  Mike Marciniszyn <mike.marciniszyn@intel.com>
  Minchan Kim <minchan@kernel.org>
  Minfei Huang <mnfhuang@gmail.com>
  Ming Lei <ming.lei@canonical.com>
  Mitja Spes <mitja@lxnav.com>
  NeilBrown <neilb@suse.com>
  Nicolas Dichtel <nicolas.dichtel@6wind.com>
  Nicolas Ferre <nicolas.ferre@atmel.com>
  Nicolas Iooss <nicolas.iooss_linux@m4x.org>
  Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
  Nikolay Aleksandrov <razor@blackwall.org>
  Niranjan Sivakumar <ns253@cornell.edu>
  Noa Osherovich <noaos@mellanox.com>
  Oleg Nesterov <oleg@redhat.com>
  Oliver Hartkopp <socketcan@hartkopp.net>
  Oliver Neukum <oneukum@suse.com>
  Or Gerlitz <ogerlitz@mellanox.com>
  Paul Moore <paul@paul-moore.com>
  Pavel Fedin <p.fedin@samsung.com>
  Peng Tao <tao.peng@primarydata.com>
  Peter Guo <peter.guo@bayhubtech.com>
  Phil Sutter <phil@nwl.cc>
  Pratyush Anand <panand@redhat.com>
  Pravin B Shelar <pshelar@nicira.com>
  Ralf Baechle <ralf@linux-mips.org>
  Rasesh Mody <rasesh.mody@qlogic.com>
  Richard Laing <richard.laing@alliedtelesis.co.nz>
  Rob Herring <robh@kernel.org>
  Roopa Prabhu <roopa@cumulusnetworks.com>
  Russell King <rmk+kernel@arm.linux.org.uk>
  Sagi Grimberg <sagig@mellanox.com>
  Sakari Ailus <sakari.ailus@iki.fi>
  Samuel Ortiz <sameo@linux.intel.com>
  Scott Feldman <sfeldma@gmail.com>
  Sergei Antonov <saproj@gmail.com>
  Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
  Shachar Raindel <raindel@mellanox.com>
  Shani Michaeli <shanim@mellanox.com>
  Shawn Lin <shawn.lin@rock-chips.com>
  Shota Suzuki <suzuki_shota_t3@lab.ntt.co.jp>
  Stas Sergeev <stsp@list.ru>
  Stas Sergeev <stsp@users.sourceforge.net>
  Stephen Smalley <sds@tycho.nsa.gov>
  Steve French <smfrench@gmail.com>
  Steven Rostedt <rostedt@goodmis.org>
  Stuart Yoder <stuart.yoder@freescale.com>
  Sudip Mukherjee <sudip@vectorindia.org>
  Takashi Iwai <tiwai@suse.de>
  Theodore Ts'o <tytso@mit.edu>
  Thierry Reding <treding@nvidia.com>
  Thierry Strudel <tstrudel@google.com>
  Thomas Gleixner <tglx@linutronix.de>
  Thomas Graf <tgraf@suug.ch>
  Thomas Huth <thuth@redhat.com>
  Tilman Schmidt <tilman@imap.cc>
  Timo Teräs <timo.teras@iki.fi>
  Tobias Powalowski <tobias.powalowski@googlemail.com>
  Tony Luck <tony.luck@intel.com>
  Trond Myklebust <trond.myklebust@primarydata.com>
  Tyler Hicks <tyhicks@canonical.com>
  Ulf Hansson <ulf.hansson@linaro.org>
  Varun Sethi <Varun.Sethi@freescale.com>
  Vlad Yasevich <vyasevich@gmail.com>
  Vlad Zolotarov <vladz@cloudius-systems.com>
  Vlastimil Babka <vbabka@suse.cz>
  WANG Cong <xiyou.wangcong@gmail.com>
  Will Deacon <will.deacon@arm.com>
  Wilson Kok <wkok@cumulusnetworks.com>
  Woodrow Shen <woodrow.shen@canonical.com>
  Yao-Wen Mao <yaowen@google.com>
  Ying Xue <ying.xue@windriver.com>
  Yinghai Lu <yinghai@kernel.org>
  Yishai Hadas <yishaih@mellanox.com>
  Yuchung Cheng <ycheng@google.com>
jobs:
 build-amd64-xsm                                              pass
 build-armhf-xsm                                              pass
 build-i386-xsm                                               pass
 build-amd64                                                  pass
 build-armhf                                                  pass
 build-i386                                                   pass
 build-amd64-libvirt                                          pass
 build-armhf-libvirt                                          pass
 build-i386-libvirt                                           pass
 build-amd64-pvops                                            pass
 build-armhf-pvops                                            pass
 build-i386-pvops                                             pass
 build-amd64-rumpuserxen                                      pass
 build-i386-rumpuserxen                                       pass
 test-amd64-amd64-xl                                          pass
 test-armhf-armhf-xl                                          pass
 test-amd64-i386-xl                                           pass
 test-amd64-amd64-xl-qemut-debianhvm-amd64-xsm                pass
 test-amd64-i386-xl-qemut-debianhvm-amd64-xsm                 pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm           pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm            pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64-xsm                pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm                 pass
 test-amd64-amd64-xl-qemut-stubdom-debianhvm-amd64-xsm        fail
 test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm         fail
 test-amd64-amd64-libvirt-xsm                                 pass
 test-armhf-armhf-libvirt-xsm                                 fail
 test-amd64-i386-libvirt-xsm                                  pass
 test-amd64-amd64-xl-xsm                                      pass
 test-armhf-armhf-xl-xsm                                      pass
 test-amd64-i386-xl-xsm                                       pass
 test-amd64-amd64-xl-pvh-amd                                  fail
 test-amd64-i386-qemut-rhel6hvm-amd                           pass
 test-amd64-i386-qemuu-rhel6hvm-amd                           pass
 test-amd64-amd64-xl-qemut-debianhvm-amd64                    pass
 test-amd64-i386-xl-qemut-debianhvm-amd64                     pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64                    pass
 test-amd64-i386-xl-qemuu-debianhvm-amd64                     pass
 test-amd64-i386-freebsd10-amd64                              pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64                         pass
 test-amd64-i386-xl-qemuu-ovmf-amd64                          pass
 test-amd64-amd64-rumpuserxen-amd64                           pass
 test-amd64-amd64-xl-qemut-win7-amd64                         fail
 test-amd64-i386-xl-qemut-win7-amd64                          fail
 test-amd64-amd64-xl-qemuu-win7-amd64                         fail
 test-amd64-i386-xl-qemuu-win7-amd64                          fail
 test-armhf-armhf-xl-arndale                                  pass
 test-amd64-amd64-xl-credit2                                  pass
 test-armhf-armhf-xl-credit2                                  fail
 test-armhf-armhf-xl-cubietruck                               pass
 test-amd64-i386-freebsd10-i386                               pass
 test-amd64-i386-rumpuserxen-i386                             pass
 test-amd64-amd64-xl-pvh-intel                                fail
 test-amd64-i386-qemut-rhel6hvm-intel                         pass
 test-amd64-i386-qemuu-rhel6hvm-intel                         pass
 test-amd64-amd64-libvirt                                     pass
 test-armhf-armhf-libvirt                                     fail
 test-amd64-i386-libvirt                                      pass
 test-amd64-amd64-xl-multivcpu                                pass
 test-armhf-armhf-xl-multivcpu                                pass
 test-amd64-amd64-pair                                        pass
 test-amd64-i386-pair                                         pass
 test-amd64-amd64-libvirt-pair                                fail
 test-amd64-i386-libvirt-pair                                 fail
 test-amd64-amd64-amd64-pvgrub                                pass
 test-amd64-amd64-i386-pvgrub                                 pass
 test-amd64-amd64-pygrub                                      pass
 test-armhf-armhf-libvirt-qcow2                               fail
 test-amd64-amd64-xl-qcow2                                    pass
 test-armhf-armhf-libvirt-raw                                 fail
 test-amd64-i386-xl-raw                                       pass
 test-amd64-amd64-xl-rtds                                     pass
 test-armhf-armhf-xl-rtds                                     fail
 test-amd64-i386-xl-qemut-winxpsp3-vcpus1                     pass
 test-amd64-i386-xl-qemuu-winxpsp3-vcpus1                     pass
 test-amd64-amd64-libvirt-vhd                                 pass
 test-armhf-armhf-xl-vhd                                      fail
 test-amd64-amd64-xl-qemut-winxpsp3                           pass
 test-amd64-i386-xl-qemut-winxpsp3                            pass
 test-amd64-amd64-xl-qemuu-winxpsp3                           pass
 test-amd64-i386-xl-qemuu-winxpsp3                            pass
------------------------------------------------------------
sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images
Logs, config files, etc. are available at
    http://logs.test-lab.xenproject.org/osstest/logs
Explanation of these reports, and of osstest in general, is at
    http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
    http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master
Test harness code can be found at
    http://xenbits.xen.org/gitweb?p=osstest.git;a=summary
Not pushing.
(No revision log; it would be 5606 lines long.)
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-18 17:52 [linux-4.1 test] 63030: regressions - FAIL osstest service owner
@ 2015-10-19 13:51 ` Wei Liu
  2015-10-20 14:39   ` Ian Jackson
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Liu @ 2015-10-19 13:51 UTC (permalink / raw)
  To: osstest service owner; +Cc: xen-devel, wei.liu2

On Sun, Oct 18, 2015 at 05:52:32PM +0000, osstest service owner wrote:
> flight 63030 linux-4.1 real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/63030/
>
> Regressions :-(
>
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 15 guest-localmigrate.2 fail REGR. vs. 62318
>

Unfortunately there isn't much useful information in various log files.
I think we need to wait for Ian's patch [0] to land in production in
order to get more insight on what's going on.

Wei.

[0]: [PATCH OSSTEST] stubdom: Arrange for guest serial to go to a host logfile
^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-19 13:51 ` Wei Liu
@ 2015-10-20 14:39   ` Ian Jackson
  2015-10-20 15:24     ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Jackson @ 2015-10-20 14:39 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, osstest service owner

Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> On Sun, Oct 18, 2015 at 05:52:32PM +0000, osstest service owner wrote:
...
> > Tests which did not succeed and are blocking,
> > including tests which could not be run:
> >  test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 15 guest-localmigrate.2 fail REGR. vs. 62318
> >
>
> Unfortunately there isn't much useful information in various log files.
> I think we need to wait for Ian's patch [0] to land in production in
> order to get more insight on what's going on.
...
> [0]: [PATCH OSSTEST] stubdom: Arrange for guest serial to go to a host logfile

That osstest patch was in service in this flight.  The guest kernel
messages are in

  http://logs.test-lab.xenproject.org/osstest/logs/63030/test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm/merlot0---var-log-xen-qemu-dm-debianhvm.guest.osstest.log.3.gz

et al, mixed in with the minios and stub qemu output.

I don't immediately see an explanation for the problem but (as we
discovered with BSD) the success of this test depends on the
gratuitous arp.

Ian.
^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-20 14:39   ` Ian Jackson
@ 2015-10-20 15:24     ` Wei Liu
  2015-10-20 15:34       ` Ian Jackson
  2015-10-21  9:04       ` Ian Campbell
  0 siblings, 2 replies; 22+ messages in thread
From: Wei Liu @ 2015-10-20 15:24 UTC (permalink / raw)
  To: Ian Jackson; +Cc: xen-devel, Wei Liu, osstest service owner

On Tue, Oct 20, 2015 at 03:39:26PM +0100, Ian Jackson wrote:
> Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> > On Sun, Oct 18, 2015 at 05:52:32PM +0000, osstest service owner wrote:
> ...
> > > Tests which did not succeed and are blocking,
> > > including tests which could not be run:
> > >  test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm 15 guest-localmigrate.2 fail REGR. vs. 62318
> > >
> >
> > Unfortunately there isn't much useful information in various log files.
> > I think we need to wait for Ian's patch [0] to land in production in
> > order to get more insight on what's going on.
> ...
> > [0]: [PATCH OSSTEST] stubdom: Arrange for guest serial to go to a host logfile
>
> That osstest patch was in service in this flight.  The guest kernel
> messages are in
>
>   http://logs.test-lab.xenproject.org/osstest/logs/63030/test-amd64-i386-xl-qemut-stubdom-debianhvm-amd64-xsm/merlot0---var-log-xen-qemu-dm-debianhvm.guest.osstest.log.3.gz
>
> et al, mixed in with the minios and stub qemu output.
>

Oops. I didn't have the latest OSSTest tree.

> I don't immediately see an explanation for the problem but (as we
> discovered with BSD) the success of this test depends on the
> gratuitous arp.
>

From mere code inspection and document of lwip 1.3.0 I think mini-os
does send gratuitous ARP. The call graph is like

  call_main
    start_networking
      init_netfront   <- netfront changes to connected state
      netif_set_up    <- sends gratuitous ARP [0]
    app_main...

And according to FreeBSD changeset, the bug about gratuitous ARP was
that the packet was sent before netfront was changed to connected
state, so it doesn't look like mini-os has the same problem as FreeBSD
did.

But this is only code inspection, so I'm not very confident whether
everything does what it says it does.

Wei.

[0] http://lwip.wikia.com/wiki/Writing_a_device_driver

  Gratuitous ARP

  A "gratuitous ARP" can be generated by a call
  etharp_query(our_netif, its_ip_addr, NULL) (see RFC 3220, Section
  4.6). Starting in version 1.3.0, the gratuitous ARP is generated by
  netif_set_up() and should not be done in the driver or application
  code.

> Ian.
^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-20 15:24     ` Wei Liu
@ 2015-10-20 15:34       ` Ian Jackson
  2015-10-21 16:47         ` Ian Campbell
  2015-10-21  9:04       ` Ian Campbell
  1 sibling, 1 reply; 22+ messages in thread
From: Ian Jackson @ 2015-10-20 15:34 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, osstest service owner

Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> From mere code inspection and document of lwip 1.3.0 I think mini-os
> does send gratuitous ARP.

The guest is using the PVHVM drivers at this point, with the backend
directly in dom0, so it is the guest's gratuitous arp which is needed,
I think.

Ian.
^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-20 15:34       ` Ian Jackson
@ 2015-10-21 16:47         ` Ian Campbell
  2015-10-21 17:34           ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-10-21 16:47 UTC (permalink / raw)
  To: Ian Jackson, Wei Liu; +Cc: xen-devel, osstest service owner

On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> > From mere code inspection and document of lwip 1.3.0 I think mini-os
> > does send gratuitous ARP.
>
> The guest is using the PVHVM drivers at this point, with the backend
> directly in dom0, so it is the guest's gratuitous arp which is needed,
> I think.

It would be worth investigating whether mini-os's gratuitous ARP might
also be occurring and confusing things, e.g. by coming after and
therefore taking precedence over the one coming from the guest.

Ian.
^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21 16:47         ` Ian Campbell
@ 2015-10-21 17:34           ` Wei Liu
  2015-10-22  9:50             ` Ian Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Liu @ 2015-10-21 17:34 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Wei Liu, xen-devel, Ian Jackson, osstest service owner

On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> > > From mere code inspection and document of lwip 1.3.0 I think mini-os
> > > does send gratuitous ARP.
> >
> > The guest is using the PVHVM drivers at this point, with the backend
> > directly in dom0, so it is the guest's gratuitous arp which is needed,
> > I think.
>
> It would be worth investigating whether mini-os's gratuitous ARP might
> also be occurring and confusing things, e.g. by coming after and
> therefore taking precedence over the one coming from the guest.
>

Several observations:

1. The guest doesn't always send gratuitous arp -- but this might not be
   the cause of this failure. Guest works fine when using qemu-trad
   only.
2. Guest only sends one gratuitous arp at most.
3. When using stubdom, guest is a lot less responsive. See two
   experiments and analysis below.

I statically added an arp entry for the guest interface because the arp
entry sometimes gets deleted. Note that this is not covering up the
root cause of the failure, because the arp entry is normally deleted
only after a few migration iterations, while the failures on merlot*
mostly occur on the first iteration. And when the arp entry is not
available, the error from ssh should be "No route to host", not
"timed out".

Furthermore, when the arp entry is not available, dom0 naturally sends
an arp request to the guest. When stubdom is not in use the guest
responded instantly; when stubdom is in use the guest was a lot less
responsive.

I use a script to repeat migration and ssh.

  i=1
  while true; do
      echo "#### iteration $i"
      ssh localhost xl migrate wheezy-hvm localhost
      st=$?   # save the status; $? is overwritten by the test below
      if [ $st != 0 ]; then echo "migration failed $st"; exit 1; fi
      timeout 40 ssh -o BatchMode=yes -o ConnectTimeout=100 \
          -o ServerAliveInterval=100 root@10.80.239.39 date
      st=$?
      if [ $st != 0 ]; then echo "failed $st"; exit 1; fi
      i=$((i+1))
  done

At the same time:

  tcpdump -i xenbr0 arp and host $GUEST_IP

When stubdom is present:

Scenario 1:
  xl shows "Migration successful."
  ...30s...
  xenbr0 receives gratuitous arp
  ...1s...
  ssh date command comes back

Scenario 2:
  xenbr0 receives gratuitous arp
  ...1s...
  xl shows "Migration successful."
  ssh date command comes back

When stubdom was not present I never saw scenario 1.

Note that my machine is relatively old (>6 years). It would never pass
the test in osstest because in osstest the timeout is 10s.

The slowness in osstest seems to be host specific because all failures
in guest migrate test were on merlot*. It's not only linux-4.1 that is
failing, other branches fail the same test step on merlot*, too.

Wei.
^ permalink raw reply	[flat|nested] 22+ messages in thread
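The gap between "Migration successful." and the first ssh answer in the two scenarios can also be measured directly instead of being read off tcpdump timestamps. A minimal sketch, assuming a POSIX shell: `wait_until` is a hypothetical helper (not part of xl or osstest) that retries a command once a second and prints how many whole seconds passed before it succeeded or the limit was reached.

```shell
#!/bin/sh
# wait_until LIMIT CMD...  (hypothetical helper, for illustration only)
# Retry CMD about once a second until it succeeds or LIMIT seconds have
# passed. Prints the elapsed whole seconds; returns 0 on success and 1
# on timeout.
wait_until() {
    limit=$1; shift
    start=$(date +%s)
    while :; do
        if "$@" >/dev/null 2>&1; then
            echo $(( $(date +%s) - start ))
            return 0
        fi
        now=$(date +%s)
        if [ $(( now - start )) -ge "$limit" ]; then
            echo $(( now - start ))
            return 1
        fi
        sleep 1
    done
}

# Intended use, right after "xl migrate" returns (10 matches the osstest
# timeout mentioned in the thread; the address is the guest from the
# script above):
#   wait_until 10 ssh -o BatchMode=yes -o ConnectTimeout=2 root@10.80.239.39 date
```

A printed value near 0 would correspond to scenario 2, and a timeout (exit status 1) to scenario 1. The short `ConnectTimeout` is a choice made here so that each retry fails quickly; it is not taken from osstest.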
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21 17:34           ` Wei Liu
@ 2015-10-22  9:50             ` Ian Campbell
  2015-10-22 10:28               ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-10-22  9:50 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Ian Jackson, osstest service owner

On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote:
> On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote:
> > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote:
> > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> > > > From mere code inspection and document of lwip 1.3.0 I think mini-os
> > > > does send gratuitous ARP.
> > >
> > > The guest is using the PVHVM drivers at this point, with the backend
> > > directly in dom0, so it is the guest's gratuitous arp which is needed,
> > > I think.
> >
> > It would be worth investigating whether mini-os's gratuitous ARP might
> > also be occurring and confusing things, e.g. by coming after and
> > therefore taking precedence over the one coming from the guest.
> >
>
> Several observations:
>
> 1. The guest doesn't always send gratuitous arp -- but this might not be
>    the cause of this failure. Guest works fine when using qemu-trad
>    only.

As in it always sends the arp when using qemu-trad, or that it is fine
irrespective of not always sending it?

> 2. Guest only sends one gratuitous arp at most.

This is as expected, but does the stubdom also send one?

> 3. When using stubdom, guest is a lot less responsive. See two
>    experiments and analysis below.

Less responsive in use or only while migrating, or to ssh after migration,
or to something else?

> Scenario 1:
>   xl shows "Migration successful."
>   ...30s...
>   xenbr0 receives gratuitous arp
>   ...1s...
>   ssh date command comes back
>
> Scenario 2:
>   xenbr0 receives gratuitous arp
>   ...1s...
>   xl shows "Migration successful."
>   ssh date command comes back
>
> When stubdom was not present I never saw scenario 1.

It would be worth looking at the possibility of a delay between "Migration
successful" and the target domain actually running. A 30s delay between the
guest restarting and it sending the ARP would be pretty strange IMHO

> Note that my machine is relatively old (>6 years). It would never pass
> the test in osstest because in osstest the timeout is 10s.
>
> The slowness in osstest seems to be host specific because all failures
> in guest migrate test were on merlot*. It's not only linux-4.1 that is
> failing, other branches fail the same test step on merlot*, too.

This could be a factor in common with the other qemu timeout on merlot which
led to 9acfbe14d726.

It might be worth prodding AMD over that issue again.

Ian.
^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL 2015-10-22 9:50 ` Ian Campbell @ 2015-10-22 10:28 ` Wei Liu 2015-10-22 10:39 ` Ian Campbell 0 siblings, 1 reply; 22+ messages in thread From: Wei Liu @ 2015-10-22 10:28 UTC (permalink / raw) To: Ian Campbell; +Cc: Ian Jackson, xen-devel, Wei Liu, osstest service owner On Thu, Oct 22, 2015 at 10:50:54AM +0100, Ian Campbell wrote: > On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote: > > On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote: > > > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote: > > > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions > > > > - FAIL"): > > > > > From mere code inspection and document of lwip 1.3.0 I think mini > > > > -os > > > > > does send gratuitous ARP. > > > > > > > > The guest is using the PVHVM drivers at this point, with the backend > > > > directly in dom0, so it is the guest's gratuitous arp which is > > > > needed, > > > > I think. > > > > > > It would be worth investigating whether mini-os's gratuitous ARP might > > > also be occurring and confusing things, e.g. by coming after and > > > therefore taking precedence over the one coming from the guest. > > > > > > > Several observations: > > > > 1. The guest doesn't always send gratuitous arp -- but this might not be > > the cause of this failure. Guest works fine when using qemu-trad > > only. > > As in it always sends the arp when using qemu-trad, or that it is fine > irrespective of not always sending it? > Whether or not stubdom is in use, the guest behaves the same -- it doesn't always send gratuitous arp. When using qemu-trad alone, it's always fine when it doesn't send gratuitous arp because either there is cache in dom0 that already has guest mac address or the guest responses instantly to dom0 arp request. So it comes down to the responsiveness of guest is the key. > > 2. Guest only sends one gratuitous arp at most. > > This is as expected, but does the stubdom also send one? 
> There is at most one gratuitous arp request per migration, I think it's from guest, not stubdom. To identify the exact interface the arp packet comes from requires a bit of gymnastics with tcpdump that I haven't managed to do yesterday. > > 3. When using stubdom, guest is a lot less responsive. See two > > experiments and analysis below. > > Less responsive in use or only while migrating, or to ssh after migration, > or to something else? > For every activity after migration for a period of time, including both arp request / reply and ssh connection. > > Scenario 1: > > xl shows "Migration successful." > > ...30s... > > xenbr0 receives gratuitous arp > > ...1s... > > ssh date command comes back > > > > Scenario 2: > > xenbr0 receives gratuitous arp > > ...1s... > > xl shows "Migration successful." > > ssh date command comes back > > > > When stubdom was not present I never saw scenario 1. > > It would be worth looking at the possibility of a delay between "Migration > successful" and the target domain actually running. A 30s delay between the > guest restarting and it sending the ARP would be pretty strange IMHO > The guest is in a weird state. xl list shows the stubdom is in "b" state while guest has no state at all, heh. Wei. > > Note that my machine is relative old (>6 years). It would never pass > > the test in osstest because in osstest the timeout is 10s. > > > > The slowness in osstest seems to be host specific because all failures > > in guest migrate test failed on merlot*. It's not only linux-4.1 is > > failing, other branches fail the same test step on merlot*, too. > > This could be a factor in common with the other qmu timeout on merlot which > led to 9acfbe14d726. > > It might be worth prodding AMD over that issue again. > > Ian. ^ permalink raw reply [flat|nested] 22+ messages in thread
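One way to pick the gratuitous ARP out of a capture is to look at the ARP body itself: in a gratuitous ARP the sender and target IPv4 addresses are the same. The following is only a minimal sketch of that check, assuming untagged Ethernet frames are already available as raw bytes (for example, extracted from a tcpdump capture by some other means). The function names, the MAC address and the IP address are hypothetical and chosen purely for illustration.

```python
import struct

ETHERTYPE_ARP = 0x0806
ARP_FMT = "!HHBBH6s4s6s4s"  # 28-byte ARP body for Ethernet + IPv4


def build_arp_frame(dst_mac, src_mac, oper, sha, spa, tha, tpa):
    """Build an untagged Ethernet frame carrying an ARP packet (test helper)."""
    eth = struct.pack("!6s6sH", dst_mac, src_mac, ETHERTYPE_ARP)
    arp = struct.pack(ARP_FMT, 1, 0x0800, 6, 4, oper, sha, spa, tha, tpa)
    return eth + arp


def parse_arp_frame(frame):
    """Return the ARP fields of a frame, or None if it is not Ethernet/IPv4 ARP."""
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP body
        return None
    dst_mac, src_mac, ethertype = struct.unpack("!6s6sH", frame[:14])
    if ethertype != ETHERTYPE_ARP:
        return None
    htype, ptype, hlen, plen, oper, sha, spa, tha, tpa = struct.unpack(
        ARP_FMT, frame[14:42])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return None
    return {"dst_mac": dst_mac, "src_mac": src_mac, "oper": oper,
            "sha": sha, "spa": spa, "tha": tha, "tpa": tpa}


def is_gratuitous(arp):
    """A gratuitous ARP announces the sender's own address: spa == tpa."""
    return arp["spa"] == arp["tpa"]


# Hypothetical guest addresses, for illustration only.
GUEST_MAC = bytes.fromhex("00163e000001")
GUEST_IP = bytes([192, 0, 2, 10])
BROADCAST = b"\xff" * 6

garp = build_arp_frame(BROADCAST, GUEST_MAC, 2,
                       GUEST_MAC, GUEST_IP, BROADCAST, GUEST_IP)
print(is_gratuitous(parse_arp_frame(garp)))
```

The source MAC in the Ethernet header (`src_mac`) is what would distinguish a packet sent by the guest from one sent by the stubdom, provided the two use different MAC addresses; that is an assumption about the setup, not something established in this thread.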
* Re: [linux-4.1 test] 63030: regressions - FAIL 2015-10-22 10:28 ` Wei Liu @ 2015-10-22 10:39 ` Ian Campbell 2015-10-22 11:03 ` Wei Liu 0 siblings, 1 reply; 22+ messages in thread From: Ian Campbell @ 2015-10-22 10:39 UTC (permalink / raw) To: Wei Liu; +Cc: xen-devel, Ian Jackson, osstest service owner On Thu, 2015-10-22 at 11:28 +0100, Wei Liu wrote: > On Thu, Oct 22, 2015 at 10:50:54AM +0100, Ian Campbell wrote: > > On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote: > > > On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote: > > > > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote: > > > > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: > > > > > regressions > > > > > - FAIL"): > > > > > > From mere code inspection and document of lwip 1.3.0 I think > > > > > > mini > > > > > -os > > > > > > does send gratuitous ARP. > > > > > > > > > > The guest is using the PVHVM drivers at this point, with the > > > > > backend > > > > > directly in dom0, so it is the guest's gratuitous arp which is > > > > > needed, > > > > > I think. > > > > > > > > It would be worth investigating whether mini-os's gratuitous ARP > > > > might > > > > also be occurring and confusing things, e.g. by coming after and > > > > therefore taking precedence over the one coming from the guest. > > > > > > > > > > Several observations: > > > > > > 1. The guest doesn't always send gratuitous arp -- but this might not > > > be > > > the cause of this failure. Guest works fine when using qemu-trad > > > only. > > > > As in it always sends the arp when using qemu-trad, or that it is fine > > irrespective of not always sending it? > > > > Whether or not stubdom is in use, the guest behaves the same -- it > doesn't always send gratuitous arp. > > When using qemu-trad alone, it's always fine when it doesn't send > gratuitous arp because either there is cache in dom0 that already has > guest mac address or the guest responses instantly to dom0 arp request. 
Where has this cache entry come from? Any preexisting ARP cache would be
associated with vifX.0 and would go away when that device was destroyed and
replaced with vif(X+1).0.

Also this only works for localhost migration. If the domain actually moved
to another host then the ARP is required in order for the physical switch
to learn the new location.

Thus it seems to me that not always sending the gratuitous ARP is the most
important thing to get to the bottom of here.

> So it comes down to the responsiveness of guest is the key.
> 
[...]
> > > 3. When using stubdom, guest is a lot less responsive. See two
> > >    experiments and analysis below.
> > 
> > Less responsive in use or only while migrating, or to ssh after
> > migration,
> > or to something else?
> > 
> 
> For every activity after migration for a period of time, including both
> arp request / reply and ssh connection.
> > 
> > > Scenario 1:
> > >   xl shows "Migration successful."
> > >   ...30s...
> > >   xenbr0 receives gratuitous arp
> > >   ...1s...
> > >   ssh date command comes back
> > > 
> > > Scenario 2:
> > >   xenbr0 receives gratuitous arp
> > >   ...1s...
> > >   xl shows "Migration successful."
> > >   ssh date command comes back
> > > 
> > > When stubdom was not present I never saw scenario 1.

So in that case you only saw Scenario 2 which includes a "receives
gratuitous ARP". But above you state that even in the non-stub case the
gratuitous ARP is sometimes not sent. Is this a 3rd case which isn't
mentioned here?

> > It would be worth looking at the possibility of a delay between
> > "Migration
> > successful" and the target domain actually running. A 30s delay between
> > the
> > guest restarting and it sending the ARP would be pretty strange IMHO
> > 
> 
> The guest is in a weird state.
> 
> xl list shows the stubdom is in "b" state while guest has no state at
> all, heh.

Has it actually been started/unpaused then?

Ian.

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL 2015-10-22 10:39 ` Ian Campbell @ 2015-10-22 11:03 ` Wei Liu 2015-10-22 11:12 ` Ian Campbell 0 siblings, 1 reply; 22+ messages in thread From: Wei Liu @ 2015-10-22 11:03 UTC (permalink / raw) To: Ian Campbell; +Cc: Ian Jackson, xen-devel, Wei Liu, osstest service owner On Thu, Oct 22, 2015 at 11:39:39AM +0100, Ian Campbell wrote: > On Thu, 2015-10-22 at 11:28 +0100, Wei Liu wrote: > > On Thu, Oct 22, 2015 at 10:50:54AM +0100, Ian Campbell wrote: > > > On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote: > > > > On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote: > > > > > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote: > > > > > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: > > > > > > regressions > > > > > > - FAIL"): > > > > > > > From mere code inspection and document of lwip 1.3.0 I think > > > > > > > mini > > > > > > -os > > > > > > > does send gratuitous ARP. > > > > > > > > > > > > The guest is using the PVHVM drivers at this point, with the > > > > > > backend > > > > > > directly in dom0, so it is the guest's gratuitous arp which is > > > > > > needed, > > > > > > I think. > > > > > > > > > > It would be worth investigating whether mini-os's gratuitous ARP > > > > > might > > > > > also be occurring and confusing things, e.g. by coming after and > > > > > therefore taking precedence over the one coming from the guest. > > > > > > > > > > > > > Several observations: > > > > > > > > 1. The guest doesn't always send gratuitous arp -- but this might not > > > > be > > > > the cause of this failure. Guest works fine when using qemu-trad > > > > only. > > > > > > As in it always sends the arp when using qemu-trad, or that it is fine > > > irrespective of not always sending it? > > > > > > > Whether or not stubdom is in use, the guest behaves the same -- it > > doesn't always send gratuitous arp. 
> > 
> > When using qemu-trad alone, it's always fine when it doesn't send
> > gratuitous arp because either there is cache in dom0 that already has
> > guest mac address or the guest responses instantly to dom0 arp request.
> 
> Where has this cache entry come from? Any preexisting ARP cache would be
> associated with vifX.0 and would go away when that device was destroyed and
> replace with vif(X+1).0.
> 

No, the vif-bridge script has two runes for off-lining a vif:

    brctl delif $bridge $vif
    ifconfig $vif down

Neither of these causes the cache entry to be flushed.

> Also this only work for localhost migration. If the domain actually moved
> to another host then the ARP is required in order for the physical switch
> to learn the new location.
> 
> Thus it seems to me that not always sending the gratuitous ARP is the most
> important thing to get to the bottom of here.
> 

That's another issue, but it would cause a different error (no route to
host) instead of a timeout. The failure exhibits a timeout error -- let's do
one thing at a time.

> > So it comes down to the responsiveness of guest is the key.
> > 
> [...]
> > > > 3. When using stubdom, guest is a lot less responsive. See two
> > > >    experiments and analysis below.
> > > 
> > > Less responsive in use or only while migrating, or to ssh after
> > > migration,
> > > or to something else?
> > > 
> > 
> > For every activity after migration for a period of time, including both
> > arp request / reply and ssh connection.
> > 
> > > > Scenario 1:
> > > >   xl shows "Migration successful."
> > > >   ...30s...
> > > >   xenbr0 receives gratuitous arp
> > > >   ...1s...
> > > >   ssh date command comes back
> > > > 
> > > > Scenario 2:
> > > >   xenbr0 receives gratuitous arp
> > > >   ...1s...
> > > >   xl shows "Migration successful."
> > > >   ssh date command comes back
> > > > 
> > > > When stubdom was not present I never saw scenario 1.
> 
> So in that case you only saw Scenario 2 which includes a "receives
> gratuitous ARP". But above you state that even with non-stub case sometimes
> the grauitous ARP is not sent. Is this a 3rd case which isn't mentioned
> here?
> 

Scenario 3:
  xl shows "Migration successful."
  dom0 sends arp request because arp cache entry not available
  guest takes a long time to respond when using stubdom or responds
  instantly when not using stubdom

Scenario 4:
  xl shows "Migration successful."
  (arp cache entry still available)
  guest takes a long time to respond to ssh when using stubdom or
  responds instantly when not using stubdom

> > > It would be worth looking at the possibility of a delay between
> > > "Migration
> > > successful" and the target domain actually running. A 30s delay between
> > > the
> > > guest restarting and it sending the ARP would be pretty strange IMHO
> > > 
> > 
> > The guest is in a weird state.
> > 
> > xl list shows the stubdom is in "b" state while guest has no state at
> > all, heh.
> 
> Has it actually been started/unpaused then?
> 

Yes, of course -- otherwise the state would have been "p". And I
observed the transition from "p" to "weird state".

Wei.

> Ian.

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL 2015-10-22 11:03 ` Wei Liu @ 2015-10-22 11:12 ` Ian Campbell 2015-10-22 14:41 ` Ian Jackson 0 siblings, 1 reply; 22+ messages in thread From: Ian Campbell @ 2015-10-22 11:12 UTC (permalink / raw) To: Wei Liu; +Cc: xen-devel, Ian Jackson, osstest service owner On Thu, 2015-10-22 at 12:03 +0100, Wei Liu wrote: > On Thu, Oct 22, 2015 at 11:39:39AM +0100, Ian Campbell wrote: > > On Thu, 2015-10-22 at 11:28 +0100, Wei Liu wrote: > > > On Thu, Oct 22, 2015 at 10:50:54AM +0100, Ian Campbell wrote: > > > > On Wed, 2015-10-21 at 18:34 +0100, Wei Liu wrote: > > > > > On Wed, Oct 21, 2015 at 05:47:06PM +0100, Ian Campbell wrote: > > > > > > On Tue, 2015-10-20 at 16:34 +0100, Ian Jackson wrote: > > > > > > > Wei Liu writes ("Re: [Xen-devel] [linux-4.1 test] 63030: > > > > > > > regressions > > > > > > > - FAIL"): > > > > > > > > From mere code inspection and document of lwip 1.3.0 I > > > > > > > > think > > > > > > > > mini > > > > > > > -os > > > > > > > > does send gratuitous ARP. > > > > > > > > > > > > > > The guest is using the PVHVM drivers at this point, with the > > > > > > > backend > > > > > > > directly in dom0, so it is the guest's gratuitous arp which > > > > > > > is > > > > > > > needed, > > > > > > > I think. > > > > > > > > > > > > It would be worth investigating whether mini-os's gratuitous > > > > > > ARP > > > > > > might > > > > > > also be occurring and confusing things, e.g. by coming after > > > > > > and > > > > > > therefore taking precedence over the one coming from the guest. > > > > > > > > > > > > > > > > Several observations: > > > > > > > > > > 1. The guest doesn't always send gratuitous arp -- but this might > > > > > not > > > > > be > > > > > the cause of this failure. Guest works fine when using qemu > > > > > -trad > > > > > only. > > > > > > > > As in it always sends the arp when using qemu-trad, or that it is > > > > fine > > > > irrespective of not always sending it? 
> > > > > > > > > > Whether or not stubdom is in use, the guest behaves the same -- it > > > doesn't always send gratuitous arp. > > > > > > When using qemu-trad alone, it's always fine when it doesn't send > > > gratuitous arp because either there is cache in dom0 that already has > > > guest mac address or the guest responses instantly to dom0 arp > > > request. > > > > Where has this cache entry come from? Any preexisting ARP cache would > > be > > associated with vifX.0 and would go away when that device was destroyed > > and > > replace with vif(X+1).0. > > > > No, vif-bridge script has two runes for off-lining a vif > brctl delif $bridge $vif > ifconfig $vif down > > Neither of these causes cache entry to be flushed. $vif disappearing when netback finally deletes the device will though. Or it should/used to. Maybe this is happening after the new guest has started and confusing things somewhere? > > Also this only work for localhost migration. If the domain actually > > moved > > to another host then the ARP is required in order for the physical > > switch > > to learn the new location. > > > > Thus it seems to me that not always sending the gratuitous ARP is the > > most > > important thing to get to the bottom of here. > > > > That's another issue, but this would cause other error (no route to > host) instead of timeout. The failure exhibits timeout error -- let's do > one thing at a time. The presence of an ARP cache entry in dom0 pointing to the old VIF would also cause a timeout issue, I think, since the guest is no longer connected to that vif. This stale ARP cache entry should be the first thing to investigate, before either the lack of a grat ARP or the slowness of the guest, since its presence will confuse the results in both those other cases. > > > So it comes down to the responsiveness of guest is the key. > > > > > [...] > > > > > 3. When using stubdom, guest is a lot less responsive. See two > > > > > experiments and analysis below. 
> > > > > > > > Less responsive in use or only while migrating, or to ssh after > > > > migration, > > > > or to something else? > > > > > > > > > > For every activity after migration for a period of time, including > > > both > > > arp request / reply and ssh connection. > > > > > > > > Scenario 1: > > > > > xl shows "Migration successful." > > > > > ...30s... > > > > > xenbr0 receives gratuitous arp > > > > > ...1s... > > > > > ssh date command comes back > > > > > > > > > > Scenario 2: > > > > > xenbr0 receives gratuitous arp > > > > > ...1s... > > > > > xl shows "Migration successful." > > > > > ssh date command comes back > > > > > > > > > > When stubdom was not present I never saw scenario 1. > > > > So in that case you only saw Scenario 2 which includes a "receives > > gratuitous ARP". But above you state that even with non-stub case > > sometimes > > the grauitous ARP is not sent. Is this a 3rd case which isn't mentioned > > here? > > > > Scenario 3: > xl shows "Migration successful." > dom0 sends arp request because arp cache entry not available > guest takes a long time to respond when using stubdom or responds > instantly when not using stubdom > > Scenario 4: > xl shows "Migration successful." > (arp cache entry still available) > guest takes a long time to respond to ssh when using stubdom or > responds instantly when not using stubdom > > > > > It would be worth looking at the possibility of a delay between > > > > "Migration > > > > successful" and the target domain actually running. A 30s delay > > > > between > > > > the > > > > guest restarting and it sending the ARP would be pretty strange > > > > IMHO > > > > > > > > > > The guest is in a weird state. > > > > > > xl list shows the stubdom is in "b" state while guest has no state at > > > all, heh. > > > > Has it actually been started/unpaused then? > > > > Yes, of course -- otherwise the state would have been "p". And I > observed the transition from "p" to "weird state". 
If weird state is "-----" then I think that is normal, it is "runnable but
not running" IIRC.

Ian.

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-22 11:12 ` Ian Campbell
@ 2015-10-22 14:41 ` Ian Jackson
  2015-10-22 14:56 ` Ian Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Jackson @ 2015-10-22 14:41 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Wei Liu, osstest service owner

Ian Campbell writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> On Thu, 2015-10-22 at 12:03 +0100, Wei Liu wrote:
> > No, vif-bridge script has two runes for off-lining a vif
> >   brctl delif $bridge $vif
> >   ifconfig $vif down
> > 
> > Neither of these causes cache entry to be flushed.
> 
> $vif disappearing when netback finally deletes the device will though. Or
> it should/used to.
> 
> Maybe this is happening after the new guest has started and confusing
> things somewhere?

There is confusion here.  Someone used the phrase `arp cache'.  But
there are actually two relevant runtime tables of MAC addresses:

 * Each host has a neighbour database mapping IPv4 addresses to MAC
   addresses.  This is used when trying to pass on an IPv4 datagram to
   a host on the same ethernet (same broadcast domain).  This database
   is normally referred to as an `ARP cache'.  Addresses are added to
   the table by both ARP requests and responses, and also in many
   implementations entries are refreshed by ordinary traffic.

   In the test colo, the osstest VM is on the same (bridged) ethernet
   as the test box, so the relevant ARP cache is the one in the
   osstest controller's kernel: the osstest controller wants to send
   an ssh SYN to the guest, and needs to construct an ethernet frame
   with the guest's MAC address.  This is done using the osstest
   controller's ARP cache entry.

   The osstest controller's ARP cache is unaffected by the migration.
   ARP cache entries do time out but only after a number of minutes,
   and the guest will have been spoken to recently by the controller.
   I have no reason to think that lack of an entry for the guest's
   IPv4 address in the osstest controller's ARP cache is relevant.

 * Each bridge has a forwarding database mapping MAC addresses to its
   outbound links.  This is normally referred to as the bridge
   (switch) `learning', and the table as the `MAC address table'.  MAC
   addresses are learned when the switch sees incoming frames.  When
   the bridge receives a frame for a destination MAC address for which
   it has no entry, it forwards the frame out of all its ports.
   Special considerations apply to broadcast and multicast MAC
   addresses.  None of this involves IPv4 or IPv6 addresses.

   In the test colo in the migration test case, there are up to four
   relevant bridges:

     * The source test box's dom0's software bridge.
       This has (logically speaking) three or four `ports':
        - the test box's physical network interface
        - the dom0 itself
        - the vif corresponding to the outbound guest
        - in a single-host test, the vif corresponding to
          the inbound guest

     * The physical switch connecting the test boxes and the VM host
       (newcastle.test-lab.xenproject.org).  This has two or three
       relevant physical ports, for the two or three relevant
       physical machines.  (In fact there are VLANs involved but this
       is not relevant.)

     * The software bridge in newcastle.  This has two relevant
       ports:
        - newcastle's physical interface
        - the vif serving the osstest VM

     * In a two-machine test, rather than a localhost test, the
       destination test box's dom0's software bridge, which parallels
       the source test box's.

   When the guest stops running on the source (with its vif torn
   down), and starts running on the destination:

    (a) The source test box software bridge should lose its MAC
        address table entry for the guest, because the corresponding
        port (the vif) is removed.  However I am not sure whether this
        actually happens immediately in Linux.

        It may be that instead the MAC address table entry for the
        guest remains present but points to the dead vif.  In this
        case incoming frames from the wire for the guest will be
        dropped.

    (b) The destination test box (if different) will come up without a
        MAC address entry for the guest.  If a frame for the guest's
        MAC address arrives at the physical interface, it will be
        forwarded to all of the other interfaces enslaved to the
        bridge: ie, to the dom0 (which will ignore it because it has
        the wrong destination MAC address) and to the newly-created
        guest.

    (c) In a two-host test, the physical switch connecting the two
        test boxes will retain the wrong learnt switch port.  It will
        forward frames for the guest (only) to the source test box,
        rather than the destination test box, where they will be
        discarded.

   It is (a) and (c) that the gratuitous ARP is supposed to fix.

   The guest is supposed to send, when its interface comes up after
   migration, a single broadcast gratuitous ARP response containing
   its own IPv4 and MAC addresses.

   The IPv4 address in this message is irrelevant.

   The purpose is to update the MAC address tables in all the switches
   in the network.  Each switch which receives the gratuitous ARP
   updates its MAC address table to map the guest's MAC address to the
   port on which the gratuitous ARP was received.

   If this happens, then frames from everywhere on the ethernet, to
   the guest, will be properly delivered.  If it doesn't then there
   may be lost packets and/or low-level timeouts of various kinds.

Ian.

^ permalink raw reply [flat|nested] 22+ messages in thread
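The bridge behaviour described in the message above can be illustrated with a toy model. This is only a sketch of a generic learning bridge, not of the Linux bridge code: the class, the `flush_on_remove` switch, the port names and the MAC strings are all hypothetical. It models a localhost migration under both possibilities raised in case (a): the entry for the guest is flushed when the old vif goes away, or it is left behind pointing at the dead vif.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"
GUEST = "00:16:3e:00:00:01"   # hypothetical guest MAC
CTRL = "00:16:3e:00:00:02"    # hypothetical controller MAC


class LearningBridge:
    """Toy learning bridge: a MAC address table mapping MAC -> port."""

    def __init__(self, ports, flush_on_remove):
        self.ports = set(ports)
        self.fdb = {}
        self.flush_on_remove = flush_on_remove

    def add_port(self, port):
        self.ports.add(port)

    def remove_port(self, port):
        self.ports.discard(port)
        if self.flush_on_remove:
            # Forget every MAC that was learned on the removed port.
            self.fdb = {m: p for m, p in self.fdb.items() if p != port}

    def receive(self, in_port, src, dst):
        """Learn src, then return the sorted list of ports the frame leaves by."""
        self.fdb[src] = in_port
        if dst == BROADCAST or dst not in self.fdb:
            return sorted(self.ports - {in_port})   # flood
        out = self.fdb[dst]
        if out not in self.ports or out == in_port:
            return []                               # stale entry: frame dropped
        return [out]


def localhost_migration(flush_on_remove):
    br = LearningBridge(["eth0", "dom0", "vif1.0"], flush_on_remove)
    br.receive("vif1.0", GUEST, CTRL)              # guest talks; bridge learns it
    before = br.receive("eth0", CTRL, GUEST)       # controller -> guest
    br.add_port("vif2.0")                          # incoming guest's vif
    br.remove_port("vif1.0")                       # outgoing guest's vif torn down
    after = br.receive("eth0", CTRL, GUEST)
    br.receive("vif2.0", GUEST, BROADCAST)         # gratuitous ARP from the guest
    fixed = br.receive("eth0", CTRL, GUEST)
    return before, after, fixed


print(localhost_migration(flush_on_remove=True))
print(localhost_migration(flush_on_remove=False))
```

In the flushing variant the frame sent after migration is flooded and still reaches the new vif without any gratuitous ARP; in the stale variant it is dropped until the guest transmits something from its new port.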
* Re: [linux-4.1 test] 63030: regressions - FAIL 2015-10-22 14:41 ` Ian Jackson @ 2015-10-22 14:56 ` Ian Campbell 2015-10-22 15:18 ` Ian Jackson 0 siblings, 1 reply; 22+ messages in thread From: Ian Campbell @ 2015-10-22 14:56 UTC (permalink / raw) To: Ian Jackson; +Cc: xen-devel, Wei Liu, osstest service owner On Thu, 2015-10-22 at 15:41 +0100, Ian Jackson wrote: > Ian Campbell writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions > - FAIL"): > > On Thu, 2015-10-22 at 12:03 +0100, Wei Liu wrote: > > > No, vif-bridge script has two runes for off-lining a vif > > > brctl delif $bridge $vif > > > ifconfig $vif down > > > > > > Neither of these causes cache entry to be flushed. > > > > $vif disappearing when netback finally deletes the device will though. > > Or > > it should/used to. > > > > Maybe this is happening after the new guest has started and confusing > > things somewhere? > > > There is confusion here. Someone used the phrase `arp cache'. But > there are actually two relevant runtime of MAC addresses: > > * Each host has a neighbour database mapping IPv4 addresses to MAC > addresses. This is used when trying to pass on an IPv4 datagram to > a host on the same ethernet (same broadcast domain). This database > is normally referred to as an `ARP cache'. Addresses are added to > the table by both ARP requests and responses, and also in many > implementations entries are refreshed by ordinary traffic. > > In the test colo, the osstest VM is on the same (bridged) ethernet > as the test box so, the relevant arp cache is the one in the > osstest controller's kernel: the osstest controller wants to send > an ssh SYN to the guest, and needs to construct an ethernet frame > with the guest's MAC address. This is done using the osstest > controller's ARP cache entry. > > The osstest controller's ARP cache is unaffected by the migration. 
> ARP cache entries do time out but only after a number of minutes, > and the guest will have been spoken to recently by the controller. > I have no reason to think that lack of an entry for the guest's > IPv4 address in the osstest controller's ARP cache is relevant. I was talking about this kind of ARP cache, but the one in the (single, since it is localhost migrate) dom0. That's because I had misread Wei's earlier script as sshing to the guest from dom0, not from his workstation (the "controller" in his scenario). Sorry for the confusion. FWIW I believe the source dom0's ARP entry will be dropped when the VIF device is destroyed. > * Each bridge has a forwarding database mapping MAC addresses to its > outbound links. This is normally referred to as the bridge > (switch) `learning', and the table as the `MAC address table'. MAC > addresses are learned when switch sees incoming frames. When the > bridge receives a frame for a destination MAC address for which it > has no entry, it forwards the frame out of all its ports. Special > considerations apply to broadcast and multicast MAC addresses. > None of this involves IPv4 or IPv6 addresses. > > In the test colo in the migration test case, there are up to four > relevant bridges: > > * The source test box's dom0's software bridge. > This has (logically speaking) three `ports': > - the test box's physical network interface > - the dom0 itself > - the vif corresponding to the outbound guest > - in a single-host test, the vif corresponding to > the inbound guest > > * The physical switch connecting the test boxes and the VM host > (newcastle.test-lab.xenproject.org). This has two or three > relevant physical ports, for the two or three relevant > physical machines. (In fact there are VLANs involved but this > is not relevant.) > > * The software bridge in newcastle. 
This has two relevant > ports: > - newcastle's physical interface > - the vif serving the osstest VM > > * In a two-machine test, rather than a localhost test, the > destination test box's dom0's software bridge, which parallels > the source test box's. > > When the guest stops running on the source (with its vif torn > down), and starts running on the destination: > > (a) The source test box software bridge should lose its MAC > address table entry for the guest, because the corresponding > port (the vif) is removed. However I am not sure whether this > actually happens immediately in Linux. For Linux bridging I believe it happens at the latest when the vif device is deleted, or possibly when it is removed from the bridge (i.e. earlier). IOW I do not believe that Linux bridge remembers old ports. openvswitch might, I don't recall, but I don't think that is in the picture here. > It may be that instead the MAC address table entry for the > guest remains present but points to the dead vif. In this > case incoming frames from the wire, the for the guest will be > dropped. > > (b) The destination test box (if different) will come up without a > MAC address entry for the guest. Given the above I think even if it is the same as the source box, since it will have been forgotten by the "source" box when the original VIF disappeared. > If a frame for the guest's > MAC address arrives at the physical interface, it will be > forwarded to all of the other interfaces enslaved to the > bridge: ie, to the dom0 (which will ignore it because it has > the wrong destination MAC address) and to the newly-created > guest. > > (c) In a two-host test, the physical switch connecting the two > test boxes will retain the wrong learnt switch port. It will > forward frames for the guest (only) to the source test box, > rather than the destination test box, where they will be > discarded. > > It is (a) and (c) that the gratuitous ARP is supposed to fix. 
> > The guest is supposed to send, when its interface comes up after > migration, a single broadcast gratuitous ARP response containing > its own IPv4 and MAC addresses. > > The IPv4 address in this message is irrelevant. > > The purpose is to update the MAC address tables in all the switches > in the network. Each switch which receives the gratuitous ARP > updates its MAC address table to map the guest's MAC address to the > port on which the gratuitous ARP was recevied. > > If this happens, then frames from everywhere on the ethernet, to > the guest, will be properly delivered. If it doesn't then there > may be lost packets and/or low-level timeouts of various kinds. > > > Ian. ^ permalink raw reply [flat|nested] 22+ messages in thread
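The other table in this exchange, the per-host `ARP cache', is keyed by IPv4 address rather than by port, and its entries expire by age rather than by device removal on the host that holds them. A minimal sketch of such a table follows; the class name, the 300-second lifetime and the addresses are assumptions for illustration and are not taken from any real kernel's defaults.

```python
class NeighbourCache:
    """Toy IPv4 -> MAC neighbour table with a fixed entry lifetime."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # ip -> (mac, expiry time)

    def learn(self, ip, mac, now):
        # In this model any ARP request/response or ordinary traffic
        # from the neighbour refreshes the entry.
        self.entries[ip] = (mac, now + self.ttl)

    def lookup(self, ip, now):
        entry = self.entries.get(ip)
        if entry is None or now >= entry[1]:
            self.entries.pop(ip, None)
            return None  # the host would have to send an ARP request
        return entry[0]


# Hypothetical values: the controller spoke to the guest at t=0 and the
# migration finishes at t=40.  The guest keeps its IP and MAC across the
# migration, so the controller's entry is still usable afterwards.
cache = NeighbourCache(ttl_seconds=300)
cache.learn("192.0.2.10", "00:16:3e:00:00:01", now=0)
print(cache.lookup("192.0.2.10", now=40))
print(cache.lookup("192.0.2.10", now=400))
```

Nothing in this table records which bridge port the guest sits behind, which is why a valid entry here says nothing about whether the bridges along the path will deliver the frame.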
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-22 14:56 ` Ian Campbell
@ 2015-10-22 15:18 ` Ian Jackson
  0 siblings, 0 replies; 22+ messages in thread
From: Ian Jackson @ 2015-10-22 15:18 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Wei Liu, osstest service owner

Ian Campbell writes ("Re: [Xen-devel] [linux-4.1 test] 63030: regressions - FAIL"):
> FWIW I believe the source dom0's ARP entry will be dropped when the VIF
> device is destroyed.
...
> For Linux bridging I believe it happens at the latest when the vif device
> is deleted, or possibly when it is removed from the bridge (i.e. earlier).

In that case, when there is only one physical host, the gratuitous ARP
should not matter, since if a switch sees a frame destined for a MAC
address that isn't in its forwarding table, it must forward it to
every port.

Ian.

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-20 15:24 ` Wei Liu
  2015-10-20 15:34 ` Ian Jackson
@ 2015-10-21 9:04 ` Ian Campbell
  2015-10-21 9:24 ` Wei Liu
  1 sibling, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-10-21 9:04 UTC (permalink / raw)
  To: Wei Liu, Ian Jackson; +Cc: xen-devel, osstest service owner

On Tue, 2015-10-20 at 16:24 +0100, Wei Liu wrote:
> But this is only code inspection, so I'm not very confident whether
> everything does what it says it does.

Right. I think this one probably needs someone to set up a system in a
similar configuration and play with it.

xen-netfront.c calls netdev_notify_peers (née netif_notify_peers). I seem
to vaguely recall that setting some sysctl (arp_notify?) can be required to
allow that to actually do anything, but I think that has been fixed, i.e.
NETDEV_NOTIFY_PEERS is unconditional in inetdev_event() and only
NETDEV_CHANGEADDR is gated.

Ian.

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21 9:04 ` Ian Campbell
@ 2015-10-21 9:24 ` Wei Liu
  2015-10-21 9:44 ` Ian Campbell
  0 siblings, 1 reply; 22+ messages in thread
From: Wei Liu @ 2015-10-21 9:24 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, xen-devel, Wei Liu, osstest service owner

On Wed, Oct 21, 2015 at 10:04:14AM +0100, Ian Campbell wrote:
> On Tue, 2015-10-20 at 16:24 +0100, Wei Liu wrote:
> > But this is only code inspection, so I'm not very confident whether
> > everything does what it says it does.
> 
> Right,. I think this one probably needs someone to setup a system in a
> similar configuration and play with it.
> 

Is there an easy way to do that? Say, give me some runes so that I can
lock a machine in the Cambridge instance and run the failing test case.

> xen-netfront.c calls netdev_notify_peers (née netif_notify_peers). I seem
> to vaguely recall that setting some sysctl (arp_notify?) can be required to
> allow that to actually do anything but I think that has been fixed i.e.
> NETDEV_NOTIFY_PEERS is unconditional in inetdev_event() and only
> NETDEV_CHANGEADDR is gated.
> 

Found your patch posted in 2011.

https://patchwork.ozlabs.org/patch/82813/

I think you're right and the said behaviour exists in Wheezy's 3.2
kernel.

Wei.

> Ian.

^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21  9:24 ` Wei Liu
@ 2015-10-21  9:44 ` Ian Campbell
  2015-10-21 10:04 ` Ian Campbell
  2015-10-21 10:35 ` Wei Liu
  0 siblings, 2 replies; 22+ messages in thread
From: Ian Campbell @ 2015-10-21 9:44 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Ian Jackson, osstest service owner

On Wed, 2015-10-21 at 10:24 +0100, Wei Liu wrote:
> On Wed, Oct 21, 2015 at 10:04:14AM +0100, Ian Campbell wrote:
> > On Tue, 2015-10-20 at 16:24 +0100, Wei Liu wrote:
> > > But this is only code inspection, so I'm not very confident whether
> > > everything does what it says it does.
> >
> > Right,. I think this one probably needs someone to setup a system in a
> > similar configuration and play with it.
> >
>
> Is there an easy way to do that? Say, give me some runes so that I can
> lock a machine in Cambridge instance, run the failing test case.

I could[0] but, why can't you just set things up on your existing test
hosts, either using standalone mode or by just installing the guest by
hand?

That's what I would do (probably the latter) in the first instance. It's
very likely IME that you are going to need to poke at this interactively
while debugging and to run repeated migrations etc to trigger the issue.
IMHO trying to use osstest for such manual debugging is just going to get
in the way.

> > xen-netfront.c calls netdev_notify_peers (née netif_notify_peers). I seem
> > to vaguely recall that setting some sysctl (arp_notify?) can be required to
> > allow that to actually do anything but I think that has been fixed i.e.
> > NETDEV_NOTIFY_PEERS is unconditional in inetdev_event() and only
> > NETDEV_CHANGEADDR is gated.
> >
>
> Found your patch posted in 2011.
>
> https://patchwork.ozlabs.org/patch/82813/
>
> I think you're right and the said behaviour exists in Wheezy's 3.2
> kernel.

This ended up as d11327ad6695db8117c78d70611e71102ceec2ac and:

$ git describe --contains d11327ad6695db8117c78d70611e71102ceec2ac
v2.6.38-rc6~20^2~10

...suggests this was in mainline long before 3.2.

Ian.

[0] ./mg-allocate -U <timespan> <machine-name>
    Where timespan is digits followed by d (for days), h (for hours) etc.
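The inference from the `git describe --contains` output can be spelled out. The output has the form `<tag>~N^M...`, so the leading tag is one that already contains the commit. The sketch below strips the suffix and compares that tag against a release by version number. The helper names (base_tag, version_key, contained_by_release) are hypothetical, and ordering tags by version number is an assumption that only approximates ancestry for mainline tags; the authoritative check is git itself, for instance `git merge-base --is-ancestor <commit> v3.2` in a kernel tree.

```python
import re


def base_tag(describe_output: str) -> str:
    """Return the tag part of `git describe --contains` output.

    For "v2.6.38-rc6~20^2~10" this is "v2.6.38-rc6": everything before the
    first "~" or "^" ancestry operator.
    """
    return re.split(r"[~^]", describe_output.strip(), maxsplit=1)[0]


def version_key(tag: str) -> tuple:
    """Map a mainline tag such as "v2.6.38-rc6" or "v3.2" to a sortable key.

    A release candidate sorts before the final release of the same version.
    """
    match = re.fullmatch(r"v(\d+(?:\.\d+)*)(?:-rc(\d+))?", tag)
    if match is None:
        raise ValueError(f"unrecognised kernel tag: {tag!r}")
    numbers = tuple(int(part) for part in match.group(1).split("."))
    rc_rank = int(match.group(2)) if match.group(2) else float("inf")
    return (numbers, rc_rank)


def contained_by_release(describe_output: str, release: str) -> bool:
    """True if the tag that contains the commit is no newer than `release`."""
    return version_key(base_tag(describe_output)) <= version_key(release)


if __name__ == "__main__":
    out = "v2.6.38-rc6~20^2~10"
    print(base_tag(out), contained_by_release(out, "v3.2"))
```

Applied to the output quoted above, the tag is v2.6.38-rc6, which sorts before v3.2, matching the conclusion that the change predates Wheezy's 3.2 kernel.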
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21  9:44 ` Ian Campbell
@ 2015-10-21 10:04 ` Ian Campbell
  2015-10-21 10:35 ` Wei Liu
  1 sibling, 0 replies; 22+ messages in thread
From: Ian Campbell @ 2015-10-21 10:04 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Ian Jackson, osstest service owner

On Wed, 2015-10-21 at 10:44 +0100, Ian Campbell wrote:
> > Found your patch posted in 2011.
> >
> > https://patchwork.ozlabs.org/patch/82813/
> >
> > I think you're right and the said behaviour exists in Wheezy's 3.2
> > kernel.
>
> This ended up as d11327ad6695db8117c78d70611e71102ceec2ac and:
>
> $ git describe --contains d11327ad6695db8117c78d70611e71102ceec2ac
> v2.6.38-rc6~20^2~10
>
> ...suggests this was in mainline long before 3.2.

And I have now confirmed that the kernel in the debian-7.2.0 iso we are
using in testing today has this change in it.

Ian.
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21  9:44 ` Ian Campbell
  2015-10-21 10:04 ` Ian Campbell
@ 2015-10-21 10:35 ` Wei Liu
  2015-10-21 10:48 ` Ian Campbell
  1 sibling, 1 reply; 22+ messages in thread
From: Wei Liu @ 2015-10-21 10:35 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, xen-devel, Wei Liu, osstest service owner

On Wed, Oct 21, 2015 at 10:44:48AM +0100, Ian Campbell wrote:
> On Wed, 2015-10-21 at 10:24 +0100, Wei Liu wrote:
> > On Wed, Oct 21, 2015 at 10:04:14AM +0100, Ian Campbell wrote:
> > > On Tue, 2015-10-20 at 16:24 +0100, Wei Liu wrote:
> > > > But this is only code inspection, so I'm not very confident whether
> > > > everything does what it says it does.
> > >
> > > Right,. I think this one probably needs someone to setup a system in a
> > > similar configuration and play with it.
> > >
> >
> > Is there an easy way to do that? Say, give me some runes so that I can
> > lock a machine in Cambridge instance, run the failing test case.
>
> I could[0] but, why can't you just set things up on your existing test
> hosts, either using standalone mode or by just installing the guest by
> hand?
>
> That's what I would do (probably the latter) in the first instance. It's
> very likely IME that you are going to need to poke at this interactively
> while debugging and to run repeated migrations etc to trigger the issue.
> IMHO trying to use osstest for such manual debugging is just going to get
> in the way.
>

I could do all these manually, but not without paying much attention:
allocating a new test box (all my test boxes are in use at the moment),
run standalone mode, use standalone mode to install the test box, grab
various tarballs from osstest website if I don't want to build them
again, put them in suitable location and use standalone script to fiddle
with standalone mode database, manually install a guest etc etc, let
alone the bug we're hunting might not be reproducible on the new test
box due to different hardware and external environment (as we've already
witnessed in production osstest system), then I'm left in dilemma
wondering whether I should repeat all these things (well, part of) again
or just give up.

This looks like a list of endless tedious tasks and it could go wrong
many places in between. If I can get OSSTest to lock a box and run up to
the point that it reproduces the issue that would be of great help.

Furthermore, I can write down all the runes I use so that other people
can do the same to reproduce bugs discovered in osstest. That would
certainly help lower the barrier for people who want to help triaging
bugs.

Wei.

> > > xen-netfront.c calls netdev_notify_peers (née netif_notify_peers). I seem
> > > to vaguely recall that setting some sysctl (arp_notify?) can be required to
> > > allow that to actually do anything but I think that has been fixed i.e.
> > > NETDEV_NOTIFY_PEERS is unconditional in inetdev_event() and only
> > > NETDEV_CHANGEADDR is gated.
> > >
> >
> > Found your patch posted in 2011.
> >
> > https://patchwork.ozlabs.org/patch/82813/
> >
> > I think you're right and the said behaviour exists in Wheezy's 3.2
> > kernel.
>
> This ended up as d11327ad6695db8117c78d70611e71102ceec2ac and:
>
> $ git describe --contains d11327ad6695db8117c78d70611e71102ceec2ac
> v2.6.38-rc6~20^2~10
>
> ...suggests this was in mainline long before 3.2.
>
> Ian.
>
> [0] ./mg-allocate -U <timespan> <machine-name>
>     Where timespan is digits followed by d (for days), h (for hours) etc.
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21 10:35 ` Wei Liu
@ 2015-10-21 10:48 ` Ian Campbell
  2015-10-21 11:07 ` Wei Liu
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Campbell @ 2015-10-21 10:48 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Ian Jackson, osstest service owner

On Wed, 2015-10-21 at 11:35 +0100, Wei Liu wrote:
> On Wed, Oct 21, 2015 at 10:44:48AM +0100, Ian Campbell wrote:
> > On Wed, 2015-10-21 at 10:24 +0100, Wei Liu wrote:
> > > On Wed, Oct 21, 2015 at 10:04:14AM +0100, Ian Campbell wrote:
> > > > On Tue, 2015-10-20 at 16:24 +0100, Wei Liu wrote:
> > > > > But this is only code inspection, so I'm not very confident whether
> > > > > everything does what it says it does.
> > > >
> > > > Right,. I think this one probably needs someone to setup a system in a
> > > > similar configuration and play with it.
> > > >
> > >
> > > Is there an easy way to do that? Say, give me some runes so that I can
> > > lock a machine in Cambridge instance, run the failing test case.
> >
> > I could[0] but, why can't you just set things up on your existing test
> > hosts, either using standalone mode or by just installing the guest by
> > hand?
> >
> > That's what I would do (probably the latter) in the first instance. It's
> > very likely IME that you are going to need to poke at this interactively
> > while debugging and to run repeated migrations etc to trigger the issue.
> > IMHO trying to use osstest for such manual debugging is just going to get
> > in the way.
> >
>
> I could do all these manually, but not without paying much attention:
> allocating a new test box (all my test boxes are in use at the moment),
> run standalone mode, use standalone mode to install the test box, grab
> various tarballs from osstest website if I don't want to build them
> again, put them in suitable location and use standalone script to fiddle
> with standalone mode database, manually install a guest etc etc, let
> alone the bug we're hunting might not be reproducible on the new test
> box due to different hardware and external environment (as we've already
> witnessed in production osstest system), then I'm left in dilemma
> wondering whether I should repeat all these things (well, part of) again
> or just give up.
>
> This looks like a list of endless tedious tasks and it could go wrong
> many places in between. If I can get OSSTest to lock a box and run up to
> the point that it reproduces the issue that would be of great help.

This seems to me to be making a mountain out of a mole hill, installing a
Xen host should be bread and butter for most of us.

However, since you insist, I recently added some explanation in README of
how to make an adhoc job including cloning a previous flight and forcing it
to run on a given machine (useful if you think it might be machine
specific).

There is no mechanical way to then lock a host on failure. What I usually
do is run the mg-allocate run I mentioned in my previous mail after the
test case has already started. Since mg-allocate has a higher priority than
regular jobs, but with -U waits for the current job to finish, you are
basically guaranteed that your mg-allocate will get the host next.

> Furthermore, I can write down all the runes I use so that other people
> can do the same to reproduce bugs discovered in osstest. That would
> certainly help lower the barrier for people who want to help triaging
> bugs.

This sort of thing is of no help with triage. It might be useful for
debugging and reproducing an issue, but triage does not involve doing such
things, it is the step before.

I'm being pedantic here because I don't think it is helpful to overstate
what triage involves, since that will put people off doing useful triage
activities.

Ian.
* Re: [linux-4.1 test] 63030: regressions - FAIL
  2015-10-21 10:48 ` Ian Campbell
@ 2015-10-21 11:07 ` Wei Liu
  0 siblings, 0 replies; 22+ messages in thread
From: Wei Liu @ 2015-10-21 11:07 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Ian Jackson, xen-devel, Wei Liu, osstest service owner

On Wed, Oct 21, 2015 at 11:48:24AM +0100, Ian Campbell wrote:
> On Wed, 2015-10-21 at 11:35 +0100, Wei Liu wrote:
> > On Wed, Oct 21, 2015 at 10:44:48AM +0100, Ian Campbell wrote:
> > > On Wed, 2015-10-21 at 10:24 +0100, Wei Liu wrote:
> > > > On Wed, Oct 21, 2015 at 10:04:14AM +0100, Ian Campbell wrote:
> > > > > On Tue, 2015-10-20 at 16:24 +0100, Wei Liu wrote:
> > > > > > But this is only code inspection, so I'm not very confident whether
> > > > > > everything does what it says it does.
> > > > >
> > > > > Right,. I think this one probably needs someone to setup a system in a
> > > > > similar configuration and play with it.
> > > > >
> > > >
> > > > Is there an easy way to do that? Say, give me some runes so that I can
> > > > lock a machine in Cambridge instance, run the failing test case.
> > >
> > > I could[0] but, why can't you just set things up on your existing test
> > > hosts, either using standalone mode or by just installing the guest by
> > > hand?
> > >
> > > That's what I would do (probably the latter) in the first instance. It's
> > > very likely IME that you are going to need to poke at this interactively
> > > while debugging and to run repeated migrations etc to trigger the issue.
> > > IMHO trying to use osstest for such manual debugging is just going to get
> > > in the way.
> > >
> >
> > I could do all these manually, but not without paying much attention:
> > allocating a new test box (all my test boxes are in use at the moment),
> > run standalone mode, use standalone mode to install the test box, grab
> > various tarballs from osstest website if I don't want to build them
> > again, put them in suitable location and use standalone script to fiddle
> > with standalone mode database, manually install a guest etc etc, let
> > alone the bug we're hunting might not be reproducible on the new test
> > box due to different hardware and external environment (as we've already
> > witnessed in production osstest system), then I'm left in dilemma
> > wondering whether I should repeat all these things (well, part of) again
> > or just give up.
> >
> > This looks like a list of endless tedious tasks and it could go wrong
> > many places in between. If I can get OSSTest to lock a box and run up to
> > the point that it reproduces the issue that would be of great help.
>
> This seems to me to be making a mountain out of a mole hill, installing a
> Xen host should be bread and butter for most of us.
>
> However, since you insist, I recently added some explanation in README of
> how to make an adhoc job including cloning a previous flight and forcing it
> to run on a given machine (useful if you think it might be machine
> specific).
>
> There is no mechanical way to then lock a host on failure. What I usually
> do is run the mg-allocate run I mentioned in my previous mail after the
> test case has already started. Since mg-allocate has a higher priority than
> regular jobs, but with -U waits for the current job to finish, you are
> basically guaranteed that your mg-allocate will get the host next.
>

Thanks. I will have a look at Osstest README to determine which way to
proceed is better.

> > Furthermore, I can write down all the runes I use so that other people
> > can do the same to reproduce bugs discovered in osstest. That would
> > certainly help lower the barrier for people who want to help triaging
> > bugs.
>
> This sort of thing is of no help with triage. It might be useful for
> debugging and reproducing an issue, but triage does not involve doing such
> things, it is the step before.
>
> I'm being pedantic here because I don't think it is helpful to overstate
> what triage involves, since that will put people off doing useful triage
> activities.
>

Right, I actually meant bug fixing.

Wei.

> Ian.