From: Takao Indoh <indou.takao@jp.fujitsu.com>
To: khalid@gonehiking.org
Cc: martin.wilck@ts.fujitsu.com, linux-pci@vger.kernel.org,
x86@kernel.org, kexec@lists.infradead.org,
linux-kernel@vger.kernel.org, hbabu@us.ibm.com,
andi@firstfloor.org, ddutile@redhat.com,
ishii.hironobu@jp.fujitsu.com, hpa@zytor.com,
bhelgaas@google.com, tglx@linutronix.de, mingo@redhat.com,
vgoyal@redhat.com
Subject: Re: [PATCH v3 1/2] x86, pci: Reset PCIe devices at boot time
Date: Fri, 12 Oct 2012 20:28:38 +0900 [thread overview]
Message-ID: <5077FEE6.7060000@jp.fujitsu.com> (raw)
In-Reply-To: <1349976523.7065.22.camel@rhapsody>
(2012/10/12 2:28), Khalid Aziz wrote:
> On Thu, 2012-10-11 at 15:16 +0900, Takao Indoh wrote:
>> (2012/10/11 5:08), Khalid Aziz wrote:
> .....
>>>> +static void __init do_reset(u8 bus, u8 slot, u8 func)
>>>> +{
>>>> + u16 ctrl;
>>>> +
>>>> + printk(KERN_INFO "pci 0000:%02x:%02x.%d reset\n", bus, slot, func);
>>>> +
>>>> + /* Assert Secondary Bus Reset */
>>>> + ctrl = read_pci_config_16(bus, slot, func, PCI_BRIDGE_CONTROL);
>>>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>>>> + write_pci_config_16(bus, slot, func, PCI_BRIDGE_CONTROL, ctrl);
>>>> +
>>>> + pci_udelay(5000);
>>>> +
>>>> + /* De-assert Secondary Bus Reset */
>>>> + ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>>>> + write_pci_config_16(bus, slot, func, PCI_BRIDGE_CONTROL, ctrl);
>>>> +
>>>> + pci_udelay(500000);
>>>
>>> This is 0.5 second. This will add up quickly on larger servers with
>>> multiple busses. Is 0.5 second required by the spec?
>>> aer_do_secondary_bus_reset() holds PCI_BRIDGE_CTL_BUS_RESET for 2 ms and
>>> then waits another 200 ms after de-asserting it. Still long, but less
>>> than half of the delay in above code..
>>
>> According to PCIe spec, they should be more than 1ms and 100ms. The reason
>> why they're 2ms and 200ms in aer_do_secondary_bus_reset() is that it's a
>> kind of safety margin, I think.
>>
>> At first these delay were 2ms and 200ms as well as
>> aer_do_secondary_bus_reset(), but I found a problem that on certain
>> machine it was not enough. I think this problem occurs because
>> native_io_delay() is not precise. Therefore I extended them to have more
>> margin.
>>
>> If this delay should be shortened, I have two solutions.
>>
>> 1)
>> Call early_reset_pcie_devices() after calibrate_delay() so that we can
>> use precise mdelay. Personally I don't want to do this because I think
>> DMA should be stopped asap.
>>
>> 2)
>> Make it tuneable. For exmple, improve "reset_devices"
>> reset_devices=reset_delay=500
>> or something like that. As default, 200ms is enough as far as I tested.
>> But if it is not enough, we can adjust delay time using this.
>>
>> Any other idea?
>>
>
> I don't like the idea of asking end user to determine how many msec of
> delay they would need on their system for a reset. If we have to go that
> route, I would rather have a default of 200 msec and then add a kernel
> option like "reset_devices=reset_delay=long" which changes the delay to
> 500 msec.
>
> Here is another idea. The code you currently have does:
>
> for (each device) {
> save config registers
> reset
> wait for 500 ms
> restore config registers
> }
>
> You need to wait for 500 ms because you can not access config registers
> until reset is complete. So how about this - why not just save config
> registers, reset each device and then wait only once for 500 ms before
> restoring config registers on all devices. Here is what this will look
> like:
>
> for (each device) {
> save config registers
> reset
> }
> wait 500 ms
> for (each device) {
> restore config registers
> }
>
> This is obviously more complex and requires you to allocate more space
> for saving state, but it has the benefits of minimizing boot up delay
> and making the delay constant irrespective of number of switches/bridges
> on the system.
>
> Does this sound reasonable?
Yep, that looks good idea. I'll update my patch with this idea.
Thanks,
Takao Indoh
>
>
>>>
>>> We have been seeing problems with kexec/kdump kernel for quite some time
>>> that are related to I/O devices not being quiesced before kexec. I had
>>> added code to clear Bus Master bit to help stop runaway DMAs which
>>> helped many cases, but obviously not all. If resetting downstream ports
>>> helps stop runaway I/O from PCIe devices, I am all for this approach.
>>> This patch still doesn't do anything for old PCI devices though.
>>
>> This patch does not reset PCI devices because of VGA problem. As I wrote
>> in patch description, the monitor blacks out when VGA controller is
>> reset. So when VGA device is found we need to skip it. In the case of
>> PCIe, we can decide whether we do bus reset or not on each device, but
>> we cannot in the case of PCI.
>
> OK. My concern is there are plenty of older servers out there that
> potentially have the problem with kexec/kdump failing often and we are
> not solving the problem for them. If we can only solve it for PCIe for
> now, I am ok with starting there.
>
> --
> Khalid
>
>
>
>
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
WARNING: multiple messages have this Message-ID (diff)
From: Takao Indoh <indou.takao@jp.fujitsu.com>
To: khalid@gonehiking.org
Cc: martin.wilck@ts.fujitsu.com, linux-pci@vger.kernel.org,
x86@kernel.org, kexec@lists.infradead.org,
linux-kernel@vger.kernel.org, hbabu@us.ibm.com,
andi@firstfloor.org, ddutile@redhat.com,
ishii.hironobu@jp.fujitsu.com, hpa@zytor.com,
bhelgaas@google.com, tglx@linutronix.de, mingo@redhat.com,
vgoyal@redhat.com
Subject: Re: [PATCH v3 1/2] x86, pci: Reset PCIe devices at boot time
Date: Fri, 12 Oct 2012 20:28:38 +0900 [thread overview]
Message-ID: <5077FEE6.7060000@jp.fujitsu.com> (raw)
In-Reply-To: <1349976523.7065.22.camel@rhapsody>
(2012/10/12 2:28), Khalid Aziz wrote:
> On Thu, 2012-10-11 at 15:16 +0900, Takao Indoh wrote:
>> (2012/10/11 5:08), Khalid Aziz wrote:
> .....
>>>> +static void __init do_reset(u8 bus, u8 slot, u8 func)
>>>> +{
>>>> + u16 ctrl;
>>>> +
>>>> + printk(KERN_INFO "pci 0000:%02x:%02x.%d reset\n", bus, slot, func);
>>>> +
>>>> + /* Assert Secondary Bus Reset */
>>>> + ctrl = read_pci_config_16(bus, slot, func, PCI_BRIDGE_CONTROL);
>>>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>>>> + write_pci_config_16(bus, slot, func, PCI_BRIDGE_CONTROL, ctrl);
>>>> +
>>>> + pci_udelay(5000);
>>>> +
>>>> + /* De-assert Secondary Bus Reset */
>>>> + ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>>>> + write_pci_config_16(bus, slot, func, PCI_BRIDGE_CONTROL, ctrl);
>>>> +
>>>> + pci_udelay(500000);
>>>
>>> This is 0.5 second. This will add up quickly on larger servers with
>>> multiple busses. Is 0.5 second required by the spec?
>>> aer_do_secondary_bus_reset() holds PCI_BRIDGE_CTL_BUS_RESET for 2 ms and
>>> then waits another 200 ms after de-asserting it. Still long, but less
>>> than half of the delay in above code..
>>
>> According to PCIe spec, they should be more than 1ms and 100ms. The reason
>> why they're 2ms and 200ms in aer_do_secondary_bus_reset() is that it's a
>> kind of safety margin, I think.
>>
>> At first these delay were 2ms and 200ms as well as
>> aer_do_secondary_bus_reset(), but I found a problem that on certain
>> machine it was not enough. I think this problem occurs because
>> native_io_delay() is not precise. Therefore I extended them to have more
>> margin.
>>
>> If this delay should be shortened, I have two solutions.
>>
>> 1)
>> Call early_reset_pcie_devices() after calibrate_delay() so that we can
>> use precise mdelay. Personally I don't want to do this because I think
>> DMA should be stopped asap.
>>
>> 2)
>> Make it tuneable. For exmple, improve "reset_devices"
>> reset_devices=reset_delay=500
>> or something like that. As default, 200ms is enough as far as I tested.
>> But if it is not enough, we can adjust delay time using this.
>>
>> Any other idea?
>>
>
> I don't like the idea of asking end user to determine how many msec of
> delay they would need on their system for a reset. If we have to go that
> route, I would rather have a default of 200 msec and then add a kernel
> option like "reset_devices=reset_delay=long" which changes the delay to
> 500 msec.
>
> Here is another idea. The code you currently have does:
>
> for (each device) {
> save config registers
> reset
> wait for 500 ms
> restore config registers
> }
>
> You need to wait for 500 ms because you can not access config registers
> until reset is complete. So how about this - why not just save config
> registers, reset each device and then wait only once for 500 ms before
> restoring config registers on all devices. Here is what this will look
> like:
>
> for (each device) {
> save config registers
> reset
> }
> wait 500 ms
> for (each device) {
> restore config registers
> }
>
> This is obviously more complex and requires you to allocate more space
> for saving state, but it has the benefits of minimizing boot up delay
> and making the delay constant irrespective of number of switches/bridges
> on the system.
>
> Does this sound reasonable?
Yep, that looks good idea. I'll update my patch with this idea.
Thanks,
Takao Indoh
>
>
>>>
>>> We have been seeing problems with kexec/kdump kernel for quite some time
>>> that are related to I/O devices not being quiesced before kexec. I had
>>> added code to clear Bus Master bit to help stop runaway DMAs which
>>> helped many cases, but obviously not all. If resetting downstream ports
>>> helps stop runaway I/O from PCIe devices, I am all for this approach.
>>> This patch still doesn't do anything for old PCI devices though.
>>
>> This patch does not reset PCI devices because of VGA problem. As I wrote
>> in patch description, the monitor blacks out when VGA controller is
>> reset. So when VGA device is found we need to skip it. In the case of
>> PCIe, we can decide whether we do bus reset or not on each device, but
>> we cannot in the case of PCI.
>
> OK. My concern is there are plenty of older servers out there that
> potentially have the problem with kexec/kdump failing often and we are
> not solving the problem for them. If we can only solve it for PCIe for
> now, I am ok with starting there.
>
> --
> Khalid
>
>
>
>
next prev parent reply other threads:[~2012-10-12 11:33 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-10-10 7:50 [PATCH v3 0/2] Reset PCIe devices to address DMA problem on kdump with iommu Takao Indoh
2012-10-10 7:50 ` Takao Indoh
2012-10-10 7:51 ` [PATCH v3 1/2] x86, pci: Reset PCIe devices at boot time Takao Indoh
2012-10-10 7:51 ` Takao Indoh
2012-10-10 20:08 ` Khalid Aziz
2012-10-10 20:08 ` Khalid Aziz
2012-10-11 6:16 ` Takao Indoh
2012-10-11 6:16 ` Takao Indoh
2012-10-11 17:28 ` Khalid Aziz
2012-10-11 17:28 ` Khalid Aziz
2012-10-12 11:28 ` Takao Indoh [this message]
2012-10-12 11:28 ` Takao Indoh
2012-10-10 7:51 ` [PATCH v3 2/2] x86, pci: Enable PCI INTx when MSI is disabled Takao Indoh
2012-10-10 7:51 ` Takao Indoh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5077FEE6.7060000@jp.fujitsu.com \
--to=indou.takao@jp.fujitsu.com \
--cc=andi@firstfloor.org \
--cc=bhelgaas@google.com \
--cc=ddutile@redhat.com \
--cc=hbabu@us.ibm.com \
--cc=hpa@zytor.com \
--cc=ishii.hironobu@jp.fujitsu.com \
--cc=kexec@lists.infradead.org \
--cc=khalid@gonehiking.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=martin.wilck@ts.fujitsu.com \
--cc=mingo@redhat.com \
--cc=tglx@linutronix.de \
--cc=vgoyal@redhat.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.