From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753323AbbAEQtU (ORCPT <rfc822;w@1wt.eu>);
	Mon, 5 Jan 2015 11:49:20 -0500
Received: from 8bytes.org ([81.169.241.247]:60200 "EHLO theia.8bytes.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751684AbbAEQtT (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Mon, 5 Jan 2015 11:49:19 -0500
Date: Mon, 5 Jan 2015 17:49:17 +0100
From: Joerg Roedel <joro@8bytes.org>
To: Raimonds Cicans <ray@apollo.lv>
Cc: linux-kernel@vger.kernel.org
Subject: Re: Question about: AMD-Vi: Event logged [IO_PAGE_FAULT ...
Message-ID: <20150105164916.GB13975@8bytes.org>
References: <54AAACE5.2020909@apollo.lv>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <54AAACE5.2020909@apollo.lv>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hello Raimonds,

On Mon, Jan 05, 2015 at 05:25:25PM +0200, Raimonds Cicans wrote:
> After kernel upgrade (3.13 => 3.17) I started to receive following
> string in my logs:
> AMD-Vi: Event logged [IO_PAGE_FAULT device=08:00.0 domain=0x001c
> address=0x0000000001355000 flags=0x0000]
> 
> I would like to deeper understand this problem, so it
> would be nice if some body can fix my assumptions and
> answer my questions.
> 
> 
> Assumptions:
> 
> 1) This message is generated by AMD IOMMU subsystem
>      because PCIe device 08:00.0 tried to access memory
>      region which was not mapped to any real memory
>      (lspci show that this device is DVB-S2 receiver card
>       TBS 6981)
> 
> 2) Because flags are 0 and because in general receivers
>     write to memory not read from memory it is memory
>     write operation

Almost right, but the flags are 0 for this fault, which means it was a
read operation. The access went to a page marked as non-present, and
that is what caused the fault.
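
As a rough illustration of how flags=0x0000 reads as "read access to a
non-present page", here is a minimal Python sketch. The bit positions
(PR at 0x010, RW at 0x020 within the printed flags value) are an
assumption based on the IO_PAGE_FAULT entry layout in the AMD IOMMU
specification and should be checked against the spec revision for your
hardware. The helper name describe_flags is hypothetical.

```python
# Assumed bit positions inside the 12-bit flags value printed by the kernel.
# Verify them against the AMD IOMMU specification before relying on them.
FLAG_PR = 0x010  # set: page was marked present; clear: non-present
FLAG_RW = 0x020  # set: write access; clear: read access


def describe_flags(flags):
    """Return (access type, page state) for an IO_PAGE_FAULT flags value."""
    access = "write" if flags & FLAG_RW else "read"
    page = "present" if flags & FLAG_PR else "non-present"
    return access, page


# The flags value from the message in this thread.
print(describe_flags(0x0000))
```

With both bits clear, the sketch reports a read access to a non-present
page, which matches the reading of the message given above.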

> 3) Possible causes:
>     a) memory region was never mapped
>     b) device accessed memory region before it was mapped
>     c) device accessed memory region after it was unmapped

I'd vote for option c). The address reported in the fault is a device
virtual address. The value looks like it was handed out from the
DMA-address allocator in the AMD IOMMU driver, which means the address
was once mapped for the device.

> 
> 3) Suspects:
>      a) kernel's DMA subsystem: very unlikely
>      b) kernel's IOMMU subsystem: very unlikely
>      c) AMD IOMMU driver: unlikely? - i had problems with AMD IOMMU
>          itself in kernels 3.14 - 3.17 (AMD-Vi: Completion-Wait loop
>          timed out)
>          So maybe this problem not fully fixed?

IO_PAGE_FAULTs are almost always a bug in the device driver for the
peripheral (or a bug in the firmware, but that is unlikely here).

But the "Completion-Wait loop timed out" message is also worrying. It
usually indicates broken firmware or broken hardware.

>      d) Receiver's driver: likely

Yes, my guess is that the driver for the receiver device calls
dma_unmap_$foo on a memory region it still uses for DMA. That call
makes the AMD IOMMU driver unmap the region, so the next DMA access to
it fails with the message you see.
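
The suspected sequence (map, DMA works, unmap too early, DMA faults) can
be sketched with a toy model. This is not the AMD IOMMU driver or the
kernel DMA API; the class ToyIommu, its methods, and the physical
address used are all hypothetical and exist only to show the ordering
problem.

```python
PAGE_SIZE = 4096


class ToyIommu:
    """Toy page table: device virtual page -> physical page (illustration only)."""

    def __init__(self):
        self.page_table = {}
        # Start handing out addresses at the one seen in the log message.
        self.next_iova = 0x1355000

    def map_page(self, phys_addr):
        """Hand out a device virtual address and record the mapping."""
        iova = self.next_iova
        self.next_iova += PAGE_SIZE
        self.page_table[iova] = phys_addr
        return iova

    def unmap_page(self, iova):
        """Drop the mapping; the address is no longer valid for the device."""
        del self.page_table[iova]

    def device_read(self, iova):
        """Simulate a DMA read issued by the device."""
        page = iova & ~(PAGE_SIZE - 1)
        if page not in self.page_table:
            return "IO_PAGE_FAULT address=0x%016x flags=0x0000" % iova
        return "ok"


iommu = ToyIommu()
buf = iommu.map_page(0x7F000000)   # driver maps a buffer (made-up physical address)
first = iommu.device_read(buf)     # device reads while the mapping exists
iommu.unmap_page(buf)              # driver unmaps although the device is still active
second = iommu.device_read(buf)    # the same read now hits an unmapped page
print(first)
print(second)
```

The fix in a real driver would be to stop the device's DMA before
unmapping, or to unmap only after the transfer has completed.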

> Questions:
> 1) What 'domain=0x001c' mean?

This is just an internal handle: the domain-id. The hardware reports it
in the fault structure, and it identifies the DMA domain (the set of
page tables) the device is attached to.

> 2) Where I can find definition of possible flags?

In the AMD IOMMU specification, look for the IO_PAGE_FAULT reporting
structure. The flags reported in the kernel message are bits 16-27 of
the second 32-bit value.
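
A minimal Python sketch of pulling the fields out of one event log
entry, treated as four 32-bit values. Only the flags position (bits
16-27 of the second value) comes from the text above. The positions of
the device id, domain id, event type and address are assumptions based
on the event log layout in the AMD IOMMU specification, and the sample
entry is reconstructed from the message in this thread rather than taken
from a real event log. The function names are hypothetical.

```python
def decode_event(event):
    """Split one event log entry (four 32-bit values) into its fields."""
    return {
        "devid": event[0] & 0xFFFF,           # assumed: bits 0-15 of value 0
        "domid": event[1] & 0xFFFF,           # assumed: bits 0-15 of value 1
        "flags": (event[1] >> 16) & 0xFFF,    # bits 16-27 of the second value
        "type": (event[1] >> 28) & 0xF,       # assumed: bits 28-31 of value 1
        "address": (event[3] << 32) | event[2],  # assumed: values 2 and 3
    }


def format_fault(fields):
    """Render the fields in the style of the kernel log message."""
    devid = fields["devid"]
    bus, slot, func = devid >> 8, (devid >> 3) & 0x1F, devid & 0x7
    return "device=%02x:%02x.%x domain=0x%04x address=0x%016x flags=0x%04x" % (
        bus, slot, func, fields["domid"], fields["address"], fields["flags"])


# Reconstructed sample entry, not real log data: device 08:00.0,
# domain 0x001c, flags 0, and an assumed event type code of 0x2.
sample = [0x00000800, 0x2000001C, 0x01355000, 0x00000000]
fields = decode_event(sample)
print(format_fault(fields))
```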

> 3) What kind of address is written in message?
>      - physical?
>      - virtual?
>      - address from devices point of view?

It is a device virtual address: the address the device tried to access,
which was not mapped at that time.


HTH,

	Joerg