Date: Mon, 5 Jan 2015 17:49:17 +0100
From: Joerg Roedel
To: Raimonds Cicans
Cc: linux-kernel@vger.kernel.org
Subject: Re: Question about: AMD-Vi: Event logged [IO_PAGE_FAULT ...
Message-ID: <20150105164916.GB13975@8bytes.org>
References: <54AAACE5.2020909@apollo.lv>
In-Reply-To: <54AAACE5.2020909@apollo.lv>

Hello Raimonds,

On Mon, Jan 05, 2015 at 05:25:25PM +0200, Raimonds Cicans wrote:
> After kernel upgrade (3.13 => 3.17) I started to receive following
> string in my logs:
> AMD-Vi: Event logged [IO_PAGE_FAULT device=08:00.0 domain=0x001c
> address=0x0000000001355000 flags=0x0000]
>
> I would like to deeper understand this problem, so it
> would be nice if some body can fix my assumptions and
> answer my questions.
>
>
> Assumptions:
>
> 1) This message is generated by AMD IOMMU subsystem
> because PCIe device 08:00.0 tried to access memory
> region which was not mapped to any real memory
> (lspci show that this device is DVB-S2 receiver card
> TBS 6981)
>
> 2) Because flags are 0 and because in general receivers
> write to memory not read from memory it is memory
> write operation

Almost right, but flags are 0 for this fault which means it was a read
operation. The operation was to a page marked as non-present. This
caused the fault.
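
For illustration, here is a minimal Python sketch of how the flags
value can be pulled out of the event entry and read. The helper names
and the sample word are made up, and the RW and PR bit positions are an
assumption based on the IO_PAGE_FAULT layout as I remember it, so
please check them against the AMD IOMMU specification before relying
on them.

```python
# Sketch only: names are hypothetical, RW/PR bit positions are assumed.
FLAGS_SHIFT = 16    # flags start at bit 16 of the second 32-bit value
FLAGS_MASK = 0xFFF  # 12 bits wide, i.e. bits 16-27
FLAG_PR = 0x010     # assumed: set when the page was marked present
FLAG_RW = 0x020     # assumed: set for a write, clear for a read


def extract_flags(second_word):
    """Return the flags field from the second 32-bit value of the event."""
    return (second_word >> FLAGS_SHIFT) & FLAGS_MASK


def describe_flags(flags):
    """Translate the two bits discussed here into readable words."""
    access = "write" if flags & FLAG_RW else "read"
    page = "present" if flags & FLAG_PR else "non-present"
    return access, page


if __name__ == "__main__":
    sample_word = 0x2000001C  # made-up value with an all-zero flags field
    flags = extract_flags(sample_word)
    print("flags=0x%04x" % flags, describe_flags(flags))
```

With flags=0x0000, as in the log line above, both bits are clear, which
is how I get to "read from a non-present page".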

> 3) Possible causes:
> a) memory region was never mapped
> b) device accessed memory region before it was mapped
> c) device accessed memory region after it was unmapped

I'd vote for option c). The address reported in the fault is a device
virtual address. The value looks like it was handed out from the
DMA-address allocator in the AMD IOMMU driver, which means the address
was once mapped for the device.

>
> 3) Suspects:
> a) kernel's DMA subsystem: very unlikely
> b) kernel's IOMMU subsystem: very unlikely
> c) AMD IOMMU driver: unlikely? - i had problems with AMD IOMMU
> itself in kernels 3.14 - 3.17 (AMD-Vi: Completion-Wait loop
> timed out)
> So maybe this problem not fully fixed?

IO_PAGE_FAULTs are almost always a bug in the device driver for the
peripheral (or a bug in the firmware, but that is unlikely here). But
the "Completion-Wait loop timed out" message is also worrying. It
usually indicates broken firmware or broken hardware.

> d) Receiver's driver: likely

Yes, my guess is that the driver for the receiver device calls
dma_unmap_$foo on a memory region it still uses for DMA. But the call
lets the AMD IOMMU driver unmap the region and DMA fails with the
message you see.

> Questions:
> 1) What 'domain=0x001c' mean?

This is just an internal handle and means the domain-id. It is reported
in the fault structure by the hardware and indicates whether the device
has been attached to a DMA domain at all.

> 2) Where I can find definition of possible flags?

In the AMD IOMMU specification, look for the IO_PAGE_FAULT reporting
structure. The flags reported in the kernel message are bits 16-27 of
the second 32-bit value.

> 3) What kind of address is written in message?
> - physical?
> - virtual?
> - address from devices point of view?

It is a device virtual address, the address the device tried to access
but which was not mapped.

HTH,

	Joerg