From: Don Dutile <ddutile-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: "Duran, Leo" <leo.duran-5C7GfCeVMHo@public.gmane.org>
Cc: "iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org"
	<iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>,
	"linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: RFC: IOMMU/AMD: Error Handling
Date: Mon, 29 Apr 2013 17:42:24 -0400
Message-ID: <517EE940.8010005@redhat.com>
In-Reply-To: <BA42942F2D0DED45AFB0A6216D1E951D44CBE1F9-Vo+W8YXarrgxlywnonMhLEEOCMrvLtNR@public.gmane.org>

On 04/29/2013 04:34 PM, Duran, Leo wrote:
> I'm wondering if resetting the IOMMU at init-time (once) would clear any BIOS induced noise.
> Leo
>
Well, it depends on what you mean by 'reset'...
(a) Setting the IOMMU up for OS use is effectively a reset, but it doesn't quiesce
    a device that is still doing DMA reads of a (BIOS-setup) queue, and that is
    when the noisy messages begin.
(b) Disabling the IOMMU just lets the DMA go through unchecked, which is
    potentially bad for writes.

A similar issue is being reported & worked on for kdump, where devices are still
doing DMA while the system is trying to 'reset' into the kexec'd kernel and
take a crash dump.

Solution: stop devices from doing DMA... but some you _want_ enabled throughout,
           like keyboard & mouse via the USB controller, so you get to pick the OS
           from grub...  not so for kexec...

So, again, for isolation faults: let the hw do its job (isolate), and
throttle/silence the fault messages using a per-device, time-duration heuristic,
so the system can get through boot-up to the point where enough of the OS is
init'd (drivers started) to stop the temporary noise.
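
A rough sketch of such a per-device, time-duration filter, written as plain
userspace C so it can be tried standalone. Every name and threshold here
(fault_throttle, fault_should_log, STORM_THRESHOLD, the 1-second window) is
an assumption made up for illustration and is not taken from the AMD IOMMU
driver; only the 5-minute mute period comes from this thread. In-kernel code
would more likely build on the existing rate-limit helpers in
<linux/ratelimit.h> than open-code this.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical thresholds, for illustration only. */
#define STORM_WINDOW_SEC  1    /* window over which faults are counted */
#define STORM_THRESHOLD   10   /* faults logged per window before muting */
#define MUTE_SEC          300  /* stay silent for 5 minutes after a storm */

/* One instance per device; zero-initialize before first use. */
struct fault_throttle {
    uint16_t devid;           /* PCI requester ID of the device */
    uint64_t window_start;    /* start of the current counting window (s) */
    unsigned int count;       /* faults seen in the current window */
    uint64_t muted_until;     /* no logging before this time; 0 = not muted */
    unsigned int suppressed;  /* faults swallowed while muted */
};

/* Returns true if this fault should be logged; 'now' is in seconds. */
static bool fault_should_log(struct fault_throttle *t, uint64_t now)
{
    if (now < t->muted_until) {
        t->suppressed++;
        return false;
    }
    if (t->muted_until) {
        /* Mute period is over: summarize what was dropped, start fresh. */
        printf("dev %04x: %u fault messages suppressed\n",
               (unsigned int)t->devid, t->suppressed);
        t->muted_until = 0;
        t->suppressed = 0;
        t->window_start = now;
        t->count = 0;
    }
    if (now - t->window_start >= STORM_WINDOW_SEC) {
        t->window_start = now;
        t->count = 0;
    }
    if (++t->count > STORM_THRESHOLD) {
        /* Storm detected: the hw keeps isolating, we just stop logging. */
        t->muted_until = now + MUTE_SEC;
        t->suppressed = 1;
        return false;
    }
    return true;
}
```

The filter only decides whether to print; it never touches the device or the
DTE, so isolation stays entirely with the hardware.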

>> -----Original Message-----
>> From: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org [mailto:iommu-
>> bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org] On Behalf Of Don Dutile
>> Sent: Monday, April 29, 2013 3:10 PM
>> To: Suthikulpanit, Suravee
>> Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org; linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> Subject: Re: RFC: IOMMU/AMD: Error Handling
>>
>> On 04/29/2013 03:45 PM, Suravee Suthikulanit wrote:
>>> Joerg,
>>>
>>> We are in the process of implementing AMD IOMMU error handling, and I
>> would like some comments from you and the community.
>>>
>>> Currently, the AMD IOMMU driver only reports events from the event log
>> in the dmesg, and does not try to handle them in case of errors. AMD
>> IOMMU errors can be categorized as device-specific errors and IOMMU
>> errors.
>>>
>>> 1. For IOMMU errors such as:
>>> - DEV_TAB_HARDWARE_ERROR
>>> - PAGE_TAB_ERROR
>>> - COMMAND_HARDWARE_ERROR
>>> If the error is detected during IOMMU initialization, we could disable
>> the IOMMU and proceed. If the error occurs after the IOMMU is initialized, we
>> won't be able to recover from it, and might need to panic.
>>>
>>> 2. For device-specific errors such as:
>>> - ILLEGAL_DEV_TABLE_ENTRY
>>> - IO_PAGE_FAULT
>>> - INVALID_DEVICE_REQUEST
>>> We think the AMD IOMMU driver should try to isolate the device. This
>> involves blocking device transactions at the IOMMU DTE and trying to disable the
>> device (e.g. calling the remove(struct pci_dev *pdev) interface generally
>> provided by device drivers). This could prevent the device from continuing
>> to fail and putting system stability at risk.
>>>
>> Disabling the device is not an option.
>> We've seen mis-configured ACPI tables generate storms of invalid DTE
>> messages after IOMMU setup but before they are cleared up when the OS
>> driver is started & resets the device. The original storm is from BIOS use of
>> the IOMMU with a device.
>> I'd recommend creating a filter that prevents further logging from a device
>> for 5 mins at a time if a storm of DTE-related errors is seen.
>> By definition, the DMA is blocked from corrupting/changing memory, so
>> isolation has been established; keeping the failure log from consuming the
>> system is the needed fix.
>>
>>> 3. In the case of a posted memory write transaction, the device driver might not be
>> aware that the transaction has failed and been blocked at the IOMMU. If there is no
>> HW IOMMU, I believe this is handled by the PCI error handling code. If the
>> IOMMU hardware reports such a case, could this potentially leverage the
>> Linux IOMMU fault handling interface, iommu_set_fault_handler() and
>> report_iommu_fault(), to communicate to the device driver or PCI driver?
>>>
>> Wondering if you could use an AER-like callback mechanism so a driver can be
>> invoked when an IOMMU error occurs, so the device driver can quiesce or reset
>> the device if it deems the error transient.
>>
>>
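
To make the AER-like callback idea a bit more concrete, here is a small
userspace C sketch of the shape it could take: a driver registers a
per-device callback, and the IOMMU event path calls it after the transaction
has already been blocked. All names (register_fault_cb, dispatch_fault,
struct iommu_fault_info, enum fault_result) are hypothetical and exist only
for this illustration. This is not an existing kernel interface, and it does
not match the signatures of iommu_set_fault_handler() or
report_iommu_fault().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; not an existing kernel interface. */
enum fault_result {
    FAULT_IGNORED,    /* no handler registered, or no action taken */
    FAULT_RECOVERED,  /* driver quiesced or reset the device */
    FAULT_FATAL,      /* driver gave up on the device */
};

struct iommu_fault_info {
    uint16_t devid;   /* requester ID of the faulting device */
    uint64_t address; /* faulting I/O virtual address */
    int is_write;     /* nonzero if a write was blocked */
};

typedef enum fault_result (*fault_cb_t)(const struct iommu_fault_info *info,
                                        void *drvdata);

#define MAX_HANDLERS 16

static struct {
    uint16_t devid;
    fault_cb_t cb;
    void *drvdata;
} handlers[MAX_HANDLERS];
static size_t nr_handlers;

/* Called by a device driver, e.g. at probe time. 0 on success, -1 if full. */
static int register_fault_cb(uint16_t devid, fault_cb_t cb, void *drvdata)
{
    if (nr_handlers == MAX_HANDLERS)
        return -1;
    handlers[nr_handlers].devid = devid;
    handlers[nr_handlers].cb = cb;
    handlers[nr_handlers].drvdata = drvdata;
    nr_handlers++;
    return 0;
}

/* Called from the IOMMU event path once the fault has been blocked. */
static enum fault_result dispatch_fault(const struct iommu_fault_info *info)
{
    for (size_t i = 0; i < nr_handlers; i++)
        if (handlers[i].devid == info->devid)
            return handlers[i].cb(info, handlers[i].drvdata);
    return FAULT_IGNORED;
}

/* Example driver callback: count the fault and pretend to reset the device. */
static enum fault_result example_driver_cb(const struct iommu_fault_info *info,
                                           void *drvdata)
{
    unsigned int *resets = drvdata;

    (void)info; /* a real driver would inspect the address and direction */
    (*resets)++;
    return FAULT_RECOVERED;
}
```

Letting the callback return a result, as the PCI error handlers do, leaves the
decision about whether a fault is transient with the device driver rather
than the IOMMU driver.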
>>> Any feedback or comments are appreciated.
>>>
>>> Thank you,
>>> Suravee
