From mboxrd@z Thu Jan 1 00:00:00 1970
From: Atom2
Subject: Re: HVM domains crash after upgrade from XEN 4.5.1 to 4.5.2
Date: Mon, 16 Nov 2015 02:05:35 +0100
Message-ID: <56492BDF.5030208@web2web.at>
References: <5643E68C.8090406@web2web.at> <564499B002000078000B43EE@prv-mh.provo.novell.com> <56448D9B.4090007@citrix.com> <5644A248.1060505@web2web.at> <5644C1CD.3020202@citrix.com> <56451A2B.9090706@web2web.at> <56459E5F02000078000B4944@prv-mh.provo.novell.com> <5645B6BC.6030603@citrix.com> <56467D44.5040205@web2web.at> <56479A6B.6080102@citrix.com> <5647CE57.50209@web2web.at> <5648E727.6080204@cardoe.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------080709050007020108060608"
Return-path:
In-Reply-To: <5648E727.6080204@cardoe.com>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Doug Goldstein , xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

This is a multi-part message in MIME format.
--------------080709050007020108060608
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit

On 15.11.15 at 21:12, Doug Goldstein wrote:
> On 11/14/15 6:14 PM, Atom2 wrote:
>> On 14.11.15 at 21:32, Andrew Cooper wrote:
>>> On 14/11/2015 00:16, Atom2 wrote:
>>>> On 13.11.15 at 11:09, Andrew Cooper wrote:
>>>>> On 13/11/15 07:25, Jan Beulich wrote:
>>>>>>>>> On 13.11.15 at 00:00, wrote:
>>>>>>> On 12.11.15 at 17:43, Andrew Cooper wrote:
>>>>>>>> On 12/11/15 14:29, Atom2 wrote:
>>>>>>>>> Hi Andrew,
>>>>>>>>> thanks for your reply. Answers are inline further down.
>>>>>>>>>
>>>>>>>>> On 12.11.15 at 14:01, Andrew Cooper wrote:
>>>>>>>>>> On 12/11/15 12:52, Jan Beulich wrote:
>>>>>>>>>>>>>> On 12.11.15 at 02:08, wrote:
>>>>>>>>>>>> After the upgrade HVM domUs appear to no longer work - regardless
>>>>>>>>>>>> of the
>>>>>>>>>>>> dom0 kernel (tested with both 3.18.9 and 4.1.7 as the dom0 kernel); PV
>>>>>>>>>>>> domUs, however, work just fine as before on both dom0 kernels.
>>>>>>>>>>>>
>>>>>>>>>>>> xl dmesg shows the following information after the first crashed HVM
>>>>>>>>>>>> domU which is started as part of the machine booting up:
>>>>>>>>>>>> [...]
>>>>>>>>>>>> (XEN) Failed vm entry (exit reason 0x80000021) caused by invalid guest
>>>>>>>>>>>> state (0).
>>>>>>>>>>>> (XEN) ************* VMCS Area **************
>>>>>>>>>>>> (XEN) *** Guest State ***
>>>>>>>>>>>> (XEN) CR0: actual=0x0000000000000039, shadow=0x0000000000000011,
>>>>>>>>>>>> gh_mask=ffffffffffffffff
>>>>>>>>>>>> (XEN) CR4: actual=0x0000000000002050, shadow=0x0000000000000000,
>>>>>>>>>>>> gh_mask=ffffffffffffffff
>>>>>>>>>>>> (XEN) CR3: actual=0x0000000000800000, target_count=0
>>>>>>>>>>>> (XEN) target0=0000000000000000, target1=0000000000000000
>>>>>>>>>>>> (XEN) target2=0000000000000000, target3=0000000000000000
>>>>>>>>>>>> (XEN) RSP = 0x0000000000006fdc (0x0000000000006fdc) RIP =
>>>>>>>>>>>> 0x0000000100000000 (0x0000000100000000)
>>>>>>>>>>> Other than RIP looking odd for a guest still in non-paged protected
>>>>>>>>>>> mode I can't seem to spot anything wrong with guest state.
>>>>>>>>>> odd? That will be the source of the failure.
>>>>>>>>>>
>>>>>>>>>> Out of long mode, the upper 32bit of %rip should all be zero, and it
>>>>>>>>>> should not be possible to set any of them.
>>>>>>>>>>
>>>>>>>>>> I suspect that the guest has exited for emulation, and there has been a
>>>>>>>>>> bad update to %rip. The alternative (which I hope is not the case) is
>>>>>>>>>> that there is a hardware erratum which allows the guest to accidentally
>>>>>>>>>> get itself into this condition.
>>>>>>>>>>
>>>>>>>>>> Are you able to rerun with a debug build of the hypervisor?
>> [big snip]
>>>>>>>>> Now _without_ the debug USE flag, but with debug information in
>>>>>>>>> the binary (I used splitdebug), all is back to where the problem
>>>>>>>>> started off (i.e. the system boots without issues until such
>>>>>>>>> time it starts a HVM domU which then crashes; PV domUs are
>>>>>>>>> working). I have attached the latest "xl dmesg" output with the
>>>>>>>>> timing information included.
>>>>>>>>>
>>>> I hope any of this makes sense to you.
>>>>
>>>> Again many thanks and best regards
>>>>
>>> Right - it would appear that the USE flag is definitely not what you
>>> wanted, and causes bad compilation for Xen. The do_IRQ disassembly
>>> you sent is the result of disassembling a whole block of zeroes.
>>> Sorry for leading you on a goose chase - the double faults will be the
>>> product of bad compilation, rather than anything to do with your
>>> specific problem.
>> Hi Andrew,
>> there's absolutely no need to apologize as it is me who asked for help
>> and you who generously stepped in and provided it. I really do
>> appreciate your help and it is for me, as the one seeking help, to
>> provide all the information you deem necessary and you ask for.
>>> However, the final log you sent (dmesg) is using a debug Xen, which is
>>> what I was attempting to get you to do originally.
>> Next time I know better how to arrive at a debug XEN. It's all about
>> learning.
>>> We still observe that the VM ends up in 32bit non-paged mode but with
>>> an RIP with bit 32 set, which is an invalid state to be in. However,
>>> there was nothing particularly interesting in the extra log information.
>>>
>>> Please can you rerun with "hvm_debug=0xc3f", which will cause far more
>>> logging to occur to the console while the HVM guest is running. That
>>> might show some hints.
>> I haven't done that yet - but please see my next paragraph. If you are
>> still interested in this, for whatever reason, I am clearly more than
>> happy to rerun with your suggested option and provide that information
>> as well.
>>> Also, the fact that this occurs just after starting SeaBIOS is
>>> interesting. As you have switched versions of Xen, you have also
>>> switched hvmloader, which contains the SeaBIOS binary embedded in it.
>>> Would you be able to compile both 4.5.1 and 4.5.2 and switch the
>>> hvmloader binaries in use. It would be very interesting to see
>>> whether the failure is caused by the hvmloader binary or the
>>> hypervisor. (With `xl`, you can use
>>> firmware_override="/full/path/to/firmware" to override the default
>>> hvmloader).
>> Your analysis was absolutely spot on. After re-thinking this for a
>> moment, I thought going down that route first would make a lot of sense
>> as PV guests still do work and one of the differences to HVM domUs is
>> that the former do _not_ require SeaBIOS. Looking at my log files of
>> installed packages confirmed an upgrade from SeaBIOS 1.7.5 to 1.8.2 in
>> the relevant timeframe which obviously had not made it to the hvmloader
>> of xen-4.5.1 as I did not re-compile xen after the upgrade of SeaBIOS.
>>
>> So I re-compiled xen-4.5.1 (obviously now using the installed SeaBIOS
>> 1.8.2) and the same error as with xen-4.5.2 popped up - and that seemed
>> to strongly indicate that there indeed might be an issue with SeaBIOS as
>> this probably was the only variable that had changed from the original
>> install of xen-4.5.1.
>>
>> My next step was to downgrade SeaBIOS to 1.7.5 and to re-compile
>> xen-4.5.1. Voila, the system was again up and running. While still
>> having SeaBIOS 1.7.5 installed, I also re-compiled xen-4.5.2 and ... you
>> probably guessed it ... the problem was gone: The system boots up with
>> no issues and everything is fine again.
>>
>> So in a nutshell: There seems to be a problem with SeaBIOS 1.8.2
>> preventing HVM domains from successfully starting up. I don't know what
>> this is triggered from, if this is specific to my hardware or whether
>> something else in my environment is to blame.
>>
>> In any case, I am again more than happy to provide data / run a few
>> tests should you wish to get to the bottom of this.
>>
>> I do owe you a beer (or any other drink) should you ever be at my
>> location (i.e. Vienna, Austria).
>>
>> Many thanks again for your analysis and your first class support. Xen
>> and their people absolutely rock!
>>
>> Atom2
> I'm a little late to the thread but can you send me (you can do it
> off-list if you'd like) the USE flags you used for xen, xen-tools and
> seabios? Also emerge --info. You can kill two birds with one stone by
> using emerge --info xen.
Hi Doug,
here you go:

USE flags:
app-emulation/xen-4.5.2-r1::gentoo USE="-custom-cflags -debug -efi -flask -xsm"
app-emulation/xen-tools-4.5.2::gentoo USE="hvm pam pygrub python qemu screen system-seabios -api -custom-cflags -debug -doc -flask (-ocaml) -ovmf -static-libs -system-qemu" PYTHON_TARGETS="python2_7"
sys-firmware/seabios-1.7.5::gentoo USE="binary"

emerge --info: Please see the attached file
> I'm not too familiar with the xen ebuilds but I was pretty sure that
> xen-tools is what builds hvmloader and it downloads a copy of SeaBIOS
> and builds it so that it remains consistent. But obviously your
> experience shows otherwise.
You are right, it's xen-tools that builds hvmloader. If I remember
correctly, the "system-seabios" USE flag (for xen-tools) specifies
whether sys-firmware/seabios is used and the latter downloads SeaBIOS in
its binary form provided its "binary" USE flag is set.
At least that's my understanding.
> I'm looking at some ideas to improve SeaBIOS packaging on Gentoo and
> your info would be helpful.
Great. Whatever makes Gentoo and Xen stronger will be awesome. What
immediately springs to mind is to create a separate hvmloader package
and slot that (that's just an idea and probably not fully thought
through, but as far as I understood Andrew, it would then be possible to
specify the specific firmware version [i.e. hvmloader] to use on xl's
command line by using firmware_override="full/path/to/firmware").

I also found out that an upgrade to sys-firmware/seabios obviously does
not trigger an automatic re-emerge of xen-tools and thus hvmloader.
Shouldn't this also happen automatically as xen-tools depends on seabios?

Thanks and best regards Atom2

P.S. If you prefer to take this off-list, just reply to my mail address.

--------------080709050007020108060608
Content-Type: text/plain; charset=UTF-8; name="info"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="info"

Portage 2.2.20.1 (python 2.7.10-final-0, hardened/linux/amd64, gcc-4.9.3, glibc-2.21-r1, 4.1.7-hardened-r1 x86_64)
=================================================================
System uname: Linux-4.1.7-hardened-r1-x86_64-Intel-R-_Xeon-R-_CPU_E31260L_@_2.40GHz-with-gentoo-2.2
KiB Mem:     4032716 total,   3678784 free
KiB Swap:   16777148 total,  16777148 free
Timestamp of repository gentoo: Sun, 15 Nov 2015 00:45:01 +0000
sh bash 4.3_p39
ld GNU ld (Gentoo 2.25.1 p1.1) 2.25.1
app-shells/bash:          4.3_p39::gentoo
dev-lang/perl:            5.20.2::gentoo
dev-lang/python:          2.7.10::gentoo, 3.4.3::gentoo
dev-util/cmake:           3.3.1-r1::gentoo
dev-util/pkgconfig:       0.28-r2::gentoo
sys-apps/baselayout:      2.2::gentoo
sys-apps/openrc:          0.17::gentoo
sys-apps/sandbox:         2.6-r1::gentoo
sys-devel/autoconf:       2.69::gentoo
sys-devel/automake:       1.13.4::gentoo, 1.14.1::gentoo, 1.15::gentoo
sys-devel/binutils:       2.25.1-r1::gentoo
sys-devel/gcc:            4.8.5::gentoo, 4.9.3::gentoo
sys-devel/gcc-config:     1.7.3::gentoo
sys-devel/libtool:        2.4.6::gentoo
sys-devel/make:           4.1-r1::gentoo
sys-kernel/linux-headers: 3.18::gentoo (virtual/os-headers)
sys-libs/glibc:           2.21-r1::gentoo
Repositories:

gentoo
    location: /usr/portage
    sync-type: rsync
    sync-uri: rsync://rsync.europe.gentoo.org/gentoo-portage/
    priority: -1000

x-portage
    location: /usr/local/portage
    masters: gentoo
    priority: 0

ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=native -O2 -pipe -fomit-frame-pointer"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-march=native -O2 -pipe -fomit-frame-pointer"
DISTDIR="/usr/portage/distfiles"
EMERGE_DEFAULT_OPTS="--quiet-build=y --buildpkg-exclude sys-kernel/hardened-sources"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs buildpkg config-protect-if-modified distlocks ebuild-locks fixlafiles merge-sync news parallel-fetch preserve-libs protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"
FFLAGS="-O2 -pipe"
GENTOO_MIRRORS="http://gd.tuwien.ac.at/opsys/linux/gentoo/ ftp://gd.tuwien.ac.at/opsys/linux/gentoo/"
LANG="en_US.UTF-8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
MAKEOPTS="-j9"
PKGDIR="/usr/portage/packages"
PORTAGE_COMPRESS=""
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_EXTRA_OPTS="--quiet --progress"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
USE="acl aes amd64 avx bash-completion berkdb bzip2 cli cracklib crypt cxx gdbm hardened iconv justify lm_sensors mmx mmxext modules multilib ncurses nls nptl openmp pam pax_kernel pcre pie popcnt readline seccomp session sse sse2 sse3 sse4.1 sse4_1 ssl ssp ssse3 tcpd unicode urandom vim-syntax xattr xtpax zlib" ABI_X86="64" CPU_FLAGS_X86="aes avx mmx mmxext popcnt sse sse2 sse3 sse4_1 sse4.1 ssse3" ELIBC="glibc" KERNEL="linux" LINGUAS="en" PHP_TARGETS="php5-5" PYTHON_SINGLE_TARGET="python2_7" PYTHON_TARGETS="python2_7" RUBY_TARGETS="ruby20" USERLAND="GNU" VIDEO_CARDS="intel i965" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
USE_PYTHON="2.7"
Unset:  CC, CPPFLAGS, CTARGET, CXX, INSTALL_MASK, LC_ALL, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS_FLAGS

--------------080709050007020108060608--