From mboxrd@z Thu Jan 1 00:00:00 1970
From: Roland Dreier
Subject: Hang (due to HW?) in qi_submit_sync()
Date: Mon, 5 Jan 2015 16:57:20 -0800
Message-ID: <1420505840-30096-1-git-send-email-roland@kernel.org>
To: iommu@lists.linux-foundation.org
Cc: Jiang Liu
List-Id: iommu@lists.linux-foundation.org

Hi,

We're running kernel 3.10.59 (a pretty recent long-term kernel) on a
2-socket Xeon E5 v3 (Haswell) system.  We're using vfio to access some
PCI devices from userspace, and occasionally when we kill a process, we
see the system hang in qi_submit_sync().

Based on a very old patch from Intel, we added code to the dmar driver:

int qi_submit_sync(struct qi_desc *desc, struct intel_iommu *iommu)
{
	/* ... */

	/*
	 * update the HW tail register indicating the presence of
	 * new descriptors.
	 */
	writel(qi->free_head << DMAR_IQ_SHIFT, iommu->reg + DMAR_IQT_REG);

	start_time = get_cycles();
	while (qi->desc_status[wait_index] != QI_DONE) {
		/*
		 * We will leave the interrupts disabled, to prevent interrupt
		 * context to queue another cmd while a cmd is already submitted
		 * and waiting for completion on this cpu. This is to avoid
		 * a deadlock where the interrupt context can wait indefinitely
		 * for free slots in the queue.
		 */
		rc = qi_check_fault(iommu, index);
		if (rc)
			break;

		raw_spin_unlock(&qi->q_lock);

		// We added this -->
		if (get_cycles() - start_time > DMAR_OPERATION_TIMEOUT) {
			printk(KERN_EMERG "desc_status[%d] = %d.\n",
			       wait_index, qi->desc_status[wait_index]);
			/* line 888: */
			BUG();
		}
		// <-- to here

		cpu_relax();
		raw_spin_lock(&qi->q_lock);
	}

and indeed when the system hangs, we see for example desc_status[69] = 1.

------------[ cut here ]------------
kernel BUG at drivers/iommu/dmar.c:888!
CPU: 8 PID: 12211 Comm: foed Tainted: P O 3.10.59+ #201412290537+4e4984e.platinum
task: ffff88275ac643e0 ti: ffff8825d329a000 task.ti: ffff8825d329a000
RIP: 0010:[] [] qi_submit_sync+0x3f7/0x490
RSP: 0018:ffff8825d329ba10  EFLAGS: 00010092
RAX: 0000000000000014 RBX: 0000000000000044 RCX: ffff881fffb0ec00
RDX: 0000000000000000 RSI: ffff881fffb0d048 RDI: 0000000000000046
RBP: ffff8825d329ba78 R08: ffffffffffffffff R09: 000000000001a4a1
R10: 0000000000000051 R11: 00000000000000e4 R12: 00007068faa64fc8
R13: ffff881fff40c780 R14: 0000000000000114 R15: ffff883ffec01a00
FS:  00007f3c86ffb700(0000) GS:ffff881fffb00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f996d3f1ba0 CR3: 00000026222f0000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff8825d329ba88 0000000000000450 0000000000000440 ffff881ff3215000
 00000044d329bb18 0000000000000086 0000000000000044 ffff882500000045
 ffff881ff12b1600 0000000000000000 0000000000000246 ffff881ff278e858
Call Trace:
 [] free_irte+0xc5/0x100
 [] free_remapped_irq+0x44/0x60
 [] destroy_irq+0x33/0xd0
 [] native_teardown_msi_irq+0xe/0x10
 [] default_teardown_msi_irqs+0x60/0x80
 [] free_msi_irqs+0x99/0x150
 [] pci_disable_msix+0x3d/0x60
 [] vfio_msi_disable+0xc8/0xe0 [vfio_pci]
 [] vfio_pci_set_msi_trigger+0x2a6/0x2d0 [vfio_pci]
 [] vfio_pci_set_irqs_ioctl+0x8c/0xa0 [vfio_pci]
 [] vfio_pci_release+0x70/0x150 [vfio_pci]
 []
vfio_device_fops_release+0x1c/0x40 [vfio]
 [] __fput+0xdb/0x220
 [] ____fput+0xe/0x10
 [] task_work_run+0xbc/0xe0
 [] do_exit+0x3ce/0xe50
 [] do_group_exit+0x3f/0xa0
 [] get_signal_to_deliver+0x1a9/0x5b0
 [] do_signal+0x48/0x5e0

As far as I can understand the driver, this is a "shouldn't happen,
your hardware is broken" occurrence.  However, I haven't been able to
find any relevant-looking sightings for our CPU.

Does anyone from Intel (or elsewhere) have any suggestions on how to
chase this further?

Thanks!
  Roland