From: Harald Braumann <harry@unheit.net>
To: qemu-devel@nongnu.org
Subject: [Qemu-devel] vfio-pci freezes host
Date: Sat, 9 Nov 2013 02:33:59 +0100
Message-ID: <20131109013359.GA18124@nn.nn>

[-- Attachment #1: Type: text/plain, Size: 8360 bytes --]

(please CC as I'm not subscribed)

Hi,

I'm passing through a GPU using vfio-pci. This regularly freezes the
host completely. I'm hoping the attached files give some clue as to
what the problem might be.

Specs:
Chipset: AMD 990FX
Kernel: 3.12.0
QEMU: latest as of today (commit 964668b03d26f0b5baa5e5aff0c966f4fcb76e9e)
GPU:
06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Juniper XT [Radeon HD 5770]
06:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Juniper HDMI Audio [Radeon HD 5700 Series]

QEMU command line:
/home/harry/dev/kvm-gpu-passthrough/qemu/x86_64-softmmu/qemu-system-x86_64 \
-runas spielzeug \
-monitor unix:monitor,server,nowait \
-L /home/harry/dev/kvm-gpu-passthrough/qemu/pc-bios \
-drive file=spielzeug_tmp.qcow2,if=virtio,cache=none,media=disk \
-boot order=c \
-smp 4 \
-cpu host  \
-m 4096M  \
-net nic,model=virtio,macaddr=52:54:00:12:34:57  \
-net tap,ifname=tap0,script=no,downscript=no \
-localtime  \
-enable-kvm  \
-M q35  \
-vga none \
-nographic \
-device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1,romfile=radeon-hd-5770.rom  \
-device vfio-pci,host=0000:06:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on \
-device vfio-pci,host=0000:06:00.1,bus=root.1,addr=00.1 \
-usbdevice tablet

QEMU starts up and after a few seconds the host freezes completely.
Sometimes I can still get some dmesg output or a kernel panic. In
those cases it is always some other PCI device that reports an error.

Example:
[  179.998189] ------------[ cut here ]------------
[  179.998211] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0xd9/0x13f()
[  179.998228] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[  179.998229] Modules linked in: tun vfio_pci vfio_iommu_type1 vfio vboxpci(O) vboxnetadp(O) binfmt_misc vboxnetflt(O) vboxdrv(O) deflate ctr twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic camellia_aesni_avx_x86_64 camellia_x86_64 serpent_avx_x86_64 serpent_sse2_x86_64 xts serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic cast_common des_generic cbc cmac xcbc rmd160 sha512_ssse3 sha512_generic sha256_ssse3 sha256_generic crypto_null af_key xfrm_algo bridge stp llc iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables ext2 it87 hwmon_vid fuse joydev hid_generic radeon snd_hda_codec_hdmi usbhid snd_hda_codec_realtek hid snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss ttm snd_mixer_oss drm_kms_helper kvm_amd kvm snd_pcm drm snd_page_alloc snd_seq_dummy snd_seq_midi snd_seq_oss snd_seq_midi_event snd_rawmidi sp5100_tco mxm_wmi agpgart snd_seq i2c_piix4 i2c_algo_bit i2c_core fam15h_power microcode pcspkr evdev snd_seq_device wmi k10temp snd_timer button snd processor soundcore edac_core ohci_pci thermal_sys ohci_hcd ext4 crc16 jbd2 mbcache dm_crypt dm_mod md_mod pci_stub sg sr_mod cdrom sd_mod crc_t10dif crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd firewire_ohci firewire_core crc_itu_t r8169 mii ehci_pci ehci_hcd xhci_hcd usbcore usb_common ahci libahci libata scsi_mod
[  179.998304] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G           O 3.12.0-hb #1
[  179.998306] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1208 04/18/2012
[  179.998313]  0000000000000000 ffffffff81390b45 ffff88024ecc3e30 ffffffff81036e55
[  179.998316]  ffffffff812efdbe ffff880241240000 ffff88024ecc3e80 ffffffff812efce5
[  179.998318]  ffff880241240348 ffffffff81036eb1 ffffffff81526cee 0000000000000030
[  179.998324] Call Trace:
[  179.998326]  <IRQ>  [<ffffffff81390b45>] ? dump_stack+0x41/0x51
[  179.998333]  [<ffffffff81036e55>] ? warn_slowpath_common+0x74/0x89
[  179.998336]  [<ffffffff812efdbe>] ? dev_watchdog+0xd9/0x13f
[  179.998338]  [<ffffffff812efce5>] ? dev_deactivate_queue+0x54/0x54
[  179.998340]  [<ffffffff81036eb1>] ? warn_slowpath_fmt+0x47/0x49
[  179.998341]  [<ffffffff812ef9e8>] ? netif_tx_lock+0x47/0x72
[  179.998345]  [<ffffffff812efdbe>] ? dev_watchdog+0xd9/0x13f
[  179.998347]  [<ffffffff8103fd35>] ? call_timer_fn+0x2d/0xdc
[  179.998350]  [<ffffffff81040677>] ? run_timer_softirq+0x18c/0x1b0
[  179.998351]  [<ffffffff812efce5>] ? dev_deactivate_queue+0x54/0x54
[  179.998353]  [<ffffffff8103a68a>] ? __do_softirq+0xc3/0x1df
[  179.998355]  [<ffffffff81396cdc>] ? call_softirq+0x1c/0x30
[  179.998357]  [<ffffffff8100422a>] ? do_softirq+0x2a/0x64
[  179.998359]  [<ffffffff8103a866>] ? irq_exit+0x3a/0x7a
[  179.998361]  [<ffffffff81024111>] ? smp_apic_timer_interrupt+0x2c/0x37
[  179.998363]  [<ffffffff8139620a>] ? apic_timer_interrupt+0x6a/0x70
[  179.998365]  <EOI>  [<ffffffff81077257>] ? clockevents_program_event+0x98/0xb4
[  179.998368]  [<ffffffff812af2f7>] ? cpuidle_enter_state+0x4d/0x9e
[  179.998376]  [<ffffffff812af421>] ? cpuidle_idle_call+0xd9/0x12e
[  179.998379]  [<ffffffff81009e72>] ? arch_cpu_idle+0x5/0x14
[  179.998382]  [<ffffffff8106c212>] ? cpu_startup_entry+0x102/0x152
[  179.998385]  [<ffffffff81022d2d>] ? start_secondary+0x1d9/0x1dd
[  179.998387] ---[ end trace 206ceb71b6aa3a0a ]---
[  180.023699] r8169 0000:09:00.0 eth0: link up
[  230.425196] kvm: zapping shadow pages for mmio generation wraparound
[  240.028560] SysRq : Emergency Sync
[  240.632371] Emergency Sync complete
[  244.008964] br0: port 2(tap0) entered disabled state

Another example:

[  165.586276] usb 9-3: USB disconnect, device number 2
[  165.596353] r8169 0000:09:00.0 eth0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
[  165.597471] r8169 0000:09:00.0 eth0: link up
[  165.622236] r8169 0000:09:00.0 eth0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
[  165.627619] r8169 0000:09:00.0 eth0: link down
[  165.627765] br0: port 1(eth0) entered disabled state
[  165.712495] ohci-pci 0000:00:13.0: leak ed ffff880243a010a0 (#81) state 0 (has tds)
[  165.712498] ohci-pci 0000:00:13.0: leak ed ffff880243a01050 (#82) state 0 (has tds)
[  166.205984] irq 20: nobody cared (try booting with the "irqpoll" option)
[  166.205988] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G           O 3.12.0-hb #1
[  166.205989] Hardware name: To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX, BIOS 1208 04/18/2012
[  166.205991]  0000000000000000 ffffffff81390b45 ffff880244c90d00 ffffffff8106e18c
[  166.205993]  ffff880244c90d00 0000000000000000 ffff880244c90d00 ffffffff8106e4ed
[  166.205995]  0000000000000000 0000000000000014 ffff880244c90d00 0000000000000000
[  166.205997] Call Trace:
[  166.205998]  <IRQ>  [<ffffffff81390b45>] ? dump_stack+0x41/0x51
[  166.206005]  [<ffffffff8106e18c>] ? __report_bad_irq+0x2c/0xb4
[  166.206008]  [<ffffffff8106e4ed>] ? note_interrupt+0x136/0x1b3
[  166.206010]  [<ffffffff8106c9af>] ? handle_irq_event_percpu+0x105/0x16c
[  166.206012]  [<ffffffff8106ca41>] ? handle_irq_event+0x2b/0x46
[  166.206014]  [<ffffffff8106ece9>] ? handle_fasteoi_irq+0x71/0xa1
[  166.206016]  [<ffffffff810041f8>] ? handle_irq+0x15/0x1d
[  166.206018]  [<ffffffff81003e8e>] ? do_IRQ+0x40/0x95
[  166.206020]  [<ffffffff81394e2a>] ? common_interrupt+0x6a/0x6a
[  166.206021]  <EOI>  [<ffffffff812af2f7>] ? cpuidle_enter_state+0x4d/0x9e
[  166.206040]  [<ffffffff812af421>] ? cpuidle_idle_call+0xd9/0x12e
[  166.206042]  [<ffffffff81009e72>] ? arch_cpu_idle+0x5/0x14
[  166.206044]  [<ffffffff8106c212>] ? cpu_startup_entry+0x102/0x152
[  166.206047]  [<ffffffff81022d2d>] ? start_secondary+0x1d9/0x1dd
[  166.206048] handlers:
[  166.206059] [<ffffffffa0093fa6>] usb_hcd_irq [usbcore]
[  166.206060] Disabling IRQ #20
[  304.730601] br0: port 2(tap0) entered disabled state

Quite often the SATA controller reports an error (see ahci-error.jpg).

Another symptom was a flood of "[R600] flush TLB failed" messages in
dmesg for some time, after which the host froze.
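
One way to see at a glance which devices complain before a freeze is to
tally the kernel log by PCI address. The following is only a minimal
Python sketch under stated assumptions: the function name
count_pci_devices and the regular expression are made up for
illustration, and the sample lines are copied from the second log
excerpt above.

```python
import re
from collections import Counter

# Matches a full PCI address such as 0000:09:00.0 (domain:bus:device.function).
PCI_ADDR = re.compile(r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b")


def count_pci_devices(log_text):
    """Count how many log lines mention each PCI address (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        # Use a set so a line naming the same device twice is counted once.
        for addr in set(PCI_ADDR.findall(line)):
            counts[addr] += 1
    return counts


# Sample lines taken from the dmesg excerpt in this report.
SAMPLE = """\
[  165.596353] r8169 0000:09:00.0 eth0: rtl_chipcmd_cond == 1 (loop: 100, delay: 100).
[  165.597471] r8169 0000:09:00.0 eth0: link up
[  165.627619] r8169 0000:09:00.0 eth0: link down
[  165.712495] ohci-pci 0000:00:13.0: leak ed ffff880243a010a0 (#81) state 0 (has tds)
[  166.205984] irq 20: nobody cared (try booting with the "irqpoll" option)
"""

if __name__ == "__main__":
    # Print the devices, most frequently mentioned first.
    for addr, n in count_pci_devices(SAMPLE).most_common():
        print(addr, n)
```

Run against a full dmesg capture instead of SAMPLE, this would list
the devices that log messages most often around the time of the
freeze. Lines without a PCI address, such as the "nobody cared"
message, are simply not counted.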

Attached is a tgz with the following files:
- ahci-error.jpg
- dmesg
- interrupts: copy of /proc/interrupts
- pci-dump: produced with lspci -vvvxxx
- qemu-config.log: config.log from QEMU source
- vfio.log: output of QEMU with vfio debugging enabled
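
Since one of the traces above ends in "irq 20: nobody cared", it may
help to check the interrupts file for IRQ lines with more than one
handler. The following is a rough Python sketch, not a definitive
parser: the function name shared_irqs is hypothetical, the layout
(per-CPU counters, one interrupt-chip column, then a comma-separated
handler list) is an assumption about this kernel's /proc/interrupts
format, and the sample data is invented, not taken from the attached
file.

```python
def shared_irqs(interrupts_text):
    """Return {irq_number: [handler names]} for IRQ lines with several handlers."""
    shared = {}
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        # Skip the CPU header row and named rows such as NMI or LOC.
        if not sep or not label.isdigit():
            continue
        fields = rest.split()
        # Drop the leading per-CPU counters, which are plain integers.
        while fields and fields[0].isdigit():
            fields.pop(0)
        # Assumption: one column for the interrupt chip/type, then the handlers.
        handler_text = " ".join(fields[1:])
        handlers = [h.strip() for h in handler_text.split(",") if h.strip()]
        if len(handlers) > 1:
            shared[int(label)] = handlers
    return shared


# Invented sample in the assumed layout; not from the attached interrupts file.
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1
  16:        120          0   IO-APIC-fasteoi   ohci_hcd:usb3, ohci_hcd:usb4
  17:       4711          3   IO-APIC-fasteoi   ehci_hcd:usb2
 NMI:          0          0   Non-maskable interrupts
"""

if __name__ == "__main__":
    for irq, handlers in sorted(shared_irqs(SAMPLE_INTERRUPTS).items()):
        print(irq, handlers)
```

To use it on a live system, the text would be read from
/proc/interrupts (or from the attached copy) and passed to
shared_irqs in place of the sample.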

Cheers,
harry

[-- Attachment #2: vfio-freeze-dumps.tgz --]
[-- Type: application/x-gtar-compressed, Size: 279518 bytes --]
