[RFC PATCH] PCI: pciehp: Add module parameter to enable debouncing of HP link events


* [RFC PATCH] PCI: pciehp: Add module parameter to enable debouncing of HP link events
@ 2018-01-30  8:41 Stefan Roese
  2018-01-30 10:28 ` Mika Westerberg
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Roese @ 2018-01-30  8:41 UTC (permalink / raw)
  To: linux-pci; +Cc: Mika Westerberg, Bjorn Helgaas

Hotplugging of some PCIe devices on our platform sometimes leads to a
bounce of link-up and link-down events, resulting in problems in the
corresponding PCI drivers.

Here is an example of such a hotplug event bounce for an AHCI PCIe card:
...
pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
pci 0000:02:00.0: reg 0x10: [io  0x8000-0x8007]
...
ata3: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910100 irq 100
ata4: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910180 irq 100
ata5: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910200 irq 100
ata6: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910280 irq 100
pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
ahci 0000:02:00.0: PME# disabled
ata3: SATA link down (SStatus 0 SControl 300)
ata5: SATA link down (SStatus 0 SControl 300)
ata4: SATA link down (SStatus 0 SControl 300)
WARNING: CPU: 2 PID: 1162 at drivers/ata/libata-core.c:6620 ata_host_detach+0x125/0x130
ata6: SATA link down (SStatus 0 SControl 300)
Modules linked in:
CPU: 2 PID: 1162 Comm: kworker/u8:5 Not tainted 4.15.0+ #26
Hardware name: congatec conga-qeval20-qa3-e3845/conga-qeval20-qa3-e3845, BIOS 2018.01-00033-g0125f37185-dirty 01/18/2018
Workqueue: pciehp-1 pciehp_power_thread
...

This patch adds the 'pciehp_debounce_time' module parameter, which
can be used to drop all events for the specified time (in milliseconds)
after a link-up event has occurred. In my tests, a value of ~100ms is
enough to debounce all the link-up / link-down events.

If this parameter is not set (the default), the current behavior is
unchanged and all events are handled.

Signed-off-by: Stefan Roese <sr@denx.de>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/hotplug/pciehp.h      |  3 +++
 drivers/pci/hotplug/pciehp_core.c |  4 ++++
 drivers/pci/hotplug/pciehp_ctrl.c | 38 ++++++++++++++++++++++++++++++++++----
 3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
index 06109d40c4ac..a9ff87150e82 100644
--- a/drivers/pci/hotplug/pciehp.h
+++ b/drivers/pci/hotplug/pciehp.h
@@ -43,6 +43,7 @@
 extern bool pciehp_poll_mode;
 extern int pciehp_poll_time;
 extern bool pciehp_debug;
+extern int pciehp_debounce_time;
 
 #define dbg(format, arg...)						\
 do {									\
@@ -78,6 +79,8 @@ struct slot {
 	struct mutex lock;
 	struct mutex hotplug_lock;
 	struct workqueue_struct *wq;
+	unsigned long linkup_start;	/* jiffies */
+	int linkup_debounce_active;	/* linkup-debounce is active */
 };
 
 struct event_info {
diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index 35d84845d5af..5a97f2550cba 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -45,6 +45,7 @@ bool pciehp_debug;
 bool pciehp_poll_mode;
 int pciehp_poll_time;
 static bool pciehp_force;
+int pciehp_debounce_time;
 
 /*
  * not really modular, but the easiest way to keep compat with existing
@@ -54,10 +55,13 @@ module_param(pciehp_debug, bool, 0644);
 module_param(pciehp_poll_mode, bool, 0644);
 module_param(pciehp_poll_time, int, 0644);
 module_param(pciehp_force, bool, 0644);
+module_param(pciehp_debounce_time, int, 0644);
 MODULE_PARM_DESC(pciehp_debug, "Debugging mode enabled or not");
 MODULE_PARM_DESC(pciehp_poll_mode, "Using polling mechanism for hot-plug events or not");
 MODULE_PARM_DESC(pciehp_poll_time, "Polling mechanism frequency, in seconds");
 MODULE_PARM_DESC(pciehp_force, "Force pciehp, even if OSHP is missing");
+MODULE_PARM_DESC(pciehp_debounce_time,
+		 "PCIe hotplug debounce time in milliseconds");
 
 #define PCIE_MODULE_NAME "pciehp"
 
diff --git a/drivers/pci/hotplug/pciehp_ctrl.c b/drivers/pci/hotplug/pciehp_ctrl.c
index 83f3d4af3677..03d966c21c41 100644
--- a/drivers/pci/hotplug/pciehp_ctrl.c
+++ b/drivers/pci/hotplug/pciehp_ctrl.c
@@ -40,6 +40,7 @@ static void interrupt_event_handler(struct work_struct *work);
 void pciehp_queue_interrupt_event(struct slot *p_slot, u32 event_type)
 {
 	struct event_info *info;
+	bool drop_event = false;
 
 	info = kmalloc(sizeof(*info), GFP_ATOMIC);
 	if (!info) {
@@ -47,10 +48,39 @@ void pciehp_queue_interrupt_event(struct slot *p_slot, u32 event_type)
 		return;
 	}
 
-	INIT_WORK(&info->work, interrupt_event_handler);
-	info->event_type = event_type;
-	info->p_slot = p_slot;
-	queue_work(p_slot->wq, &info->work);
+	/* Clear linkup-debounce flag if time exceeds linkup-debounce timeout */
+	if (time_after(jiffies, p_slot->linkup_start +
+		       msecs_to_jiffies(pciehp_debounce_time)))
+		p_slot->linkup_debounce_active = 0;
+
+	/* Check if this event starts a new linkup-debounce period */
+	if (pciehp_debounce_time && (event_type == INT_LINK_UP) &&
+	    !p_slot->linkup_debounce_active) {
+		p_slot->linkup_start = jiffies;
+		p_slot->linkup_debounce_active = 1;
+		ctrl_info(p_slot->ctrl,
+			  "Slot(%s): Linkup-debounce active for %dms\n",
+			  slot_name(p_slot), pciehp_debounce_time);
+	} else {
+		/*
+		 * Drop this event if it occurs inside the debounce period
+		 * after the linkup event
+		 */
+		if (p_slot->linkup_debounce_active)
+			drop_event = true;
+	}
+
+	if (drop_event) {
+		ctrl_info(p_slot->ctrl,
+			  "Slot(%s): Event %x dropped (dt=%dms)!\n",
+			  slot_name(p_slot), event_type,
+			  jiffies_to_msecs(jiffies - p_slot->linkup_start));
+	} else {
+		INIT_WORK(&info->work, interrupt_event_handler);
+		info->event_type = event_type;
+		info->p_slot = p_slot;
+		queue_work(p_slot->wq, &info->work);
+	}
 }
 
 /* The following routines constitute the bulk of the
-- 
2.16.1
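
The windowing logic of the patch can be modelled in user space. The following is only a minimal sketch under stated assumptions: time is passed in as plain milliseconds instead of jiffies, and `enum hp_event`, `struct debounce_state` and `debounce_accept()` are hypothetical names that do not exist in pciehp. It mirrors the accept/drop decision only; a kernel version would additionally have to free the already-allocated event structure on the drop path.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the pciehp event types (not kernel names). */
enum hp_event { HP_LINK_UP, HP_LINK_DOWN, HP_PRESENCE_ON, HP_PRESENCE_OFF };

struct debounce_state {
	unsigned long linkup_start_ms;	/* time of the Link Up that opened the window */
	bool active;			/* true while the debounce window is open */
};

/*
 * Return true if the event should be queued, false if it should be dropped.
 * debounce_ms == 0 disables debouncing, so every event is accepted.
 */
static bool debounce_accept(struct debounce_state *st, enum hp_event ev,
			    unsigned long now_ms, unsigned long debounce_ms)
{
	/* Close the window once the debounce time has elapsed. */
	if (st->active && now_ms - st->linkup_start_ms > debounce_ms)
		st->active = false;

	/* A Link Up outside any window opens a new window and is accepted. */
	if (debounce_ms && ev == HP_LINK_UP && !st->active) {
		st->linkup_start_ms = now_ms;
		st->active = true;
		return true;
	}

	/* Every other event is dropped while the window is open. */
	return !st->active;
}
```

With a 100 ms window, a Link Up followed 30 ms later by a Link Down results in only the first Link Up being accepted; events arriving after the window has expired pass through again.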


* Re: [RFC PATCH] PCI: pciehp: Add module parameter to enable debouncing of HP link events
  2018-01-30  8:41 [RFC PATCH] PCI: pciehp: Add module parameter to enable debouncing of HP link events Stefan Roese
@ 2018-01-30 10:28 ` Mika Westerberg
  2018-02-02 13:38   ` Stefan Roese
  0 siblings, 1 reply; 9+ messages in thread
From: Mika Westerberg @ 2018-01-30 10:28 UTC (permalink / raw)
  To: Stefan Roese; +Cc: linux-pci, Bjorn Helgaas

On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
> Hotplugging of some PCIe devices on our platform sometimes leads to a
> bounce of link-up and link-down events, resulting in problems in the
> corresponding PCI drivers.
> 
> Here an example of such a hotplug event bounce for a AHCI PCIe card:
> ...
> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up

It would be good to find out why this happens in the first place.
Perhaps there is some environmental interference or something causing
this?

> pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
> pci 0000:02:00.0: reg 0x10: [io  0x8000-0x8007]
> ...
> ata3: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910100 irq 100
> ata4: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910180 irq 100
> ata5: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910200 irq 100
> ata6: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910280 irq 100
> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> ahci 0000:02:00.0: PME# disabled
> ata3: SATA link down (SStatus 0 SControl 300)
> ata5: SATA link down (SStatus 0 SControl 300)
> ata4: SATA link down (SStatus 0 SControl 300)
> WARNING: CPU: 2 PID: 1162 at drivers/ata/libata-core.c:6620 ata_host_detach+0x125/0x130

I think the AHCI driver should be fixed to cope with this.

> ata6: SATA link down (SStatus 0 SControl 300)
> Modules linked in:
> CPU: 2 PID: 1162 Comm: kworker/u8:5 Not tainted 4.15.0+ #26
> Hardware name: congatec conga-qeval20-qa3-e3845/conga-qeval20-qa3-e3845, BIOS 2018.01-00033-g0125f37185-dirty 01/18/2018
> Workqueue: pciehp-1 pciehp_power_thread
> ...
> 
> This patch now adds the 'pciehp_debounce_time' module parameter, which
> can be used to drop all events for the specified time (in milliseconds)
> after a link-up event occurred. A value of ~100ms works fine in my tests
> to debounce all the link-up / link-down events in my tests.

This sounds a bit "hackish". I would rather make sure we can handle
situations like this properly without passing additional parameters.


* Re: [RFC PATCH] PCI: pciehp: Add module parameter to enable debouncing of HP link events
  2018-01-30 10:28 ` Mika Westerberg
@ 2018-02-02 13:38   ` Stefan Roese
  2018-02-02 13:47     ` Lukas Wunner
  2018-02-02 13:56     ` Mika Westerberg
  0 siblings, 2 replies; 9+ messages in thread
From: Stefan Roese @ 2018-02-02 13:38 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: linux-pci, Bjorn Helgaas

Hi Mika,

sorry for the late reply.

On 30.01.2018 11:28, Mika Westerberg wrote:
> On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
>> Hotplugging of some PCIe devices on our platform sometimes leads to a
>> bounce of link-up and link-down events, resulting in problems in the
>> corresponding PCI drivers.
>>
>> Here an example of such a hotplug event bounce for a AHCI PCIe card:
>> ...
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> 
> It would be good to find out why this happens in the first place.
> Perhaps there is some environmental interference or something causing
> this?

I'm seeing these link bounces in the following environments:

a) Using a BayTrail SoC and hotplugging a standard Desktop PCIe SATA /
   AHCI Controller (Marvell chip)
b) Hotplugging (booting via SPI) an Altera / Intel FPGA which is connected
   via PCIe to a PCIe switch

In both cases, this link bouncing happens infrequently, roughly once in
every 5 to 10 tries.

Out of curiosity, has nobody else ever experienced such "link bouncing"
with PCIe cards / devices getting hot-plugged?

>> pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
>> pci 0000:02:00.0: reg 0x10: [io  0x8000-0x8007]
>> ...
>> ata3: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910100 irq 100
>> ata4: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910180 irq 100
>> ata5: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910200 irq 100
>> ata6: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910280 irq 100
>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
>> ahci 0000:02:00.0: PME# disabled
>> ata3: SATA link down (SStatus 0 SControl 300)
>> ata5: SATA link down (SStatus 0 SControl 300)
>> ata4: SATA link down (SStatus 0 SControl 300)
>> WARNING: CPU: 2 PID: 1162 at drivers/ata/libata-core.c:6620 ata_host_detach+0x125/0x130
> 
> I think the AHCI driver should be fixed to cope with this.

Yes, this can be discussed. But the root cause should still be fixed,
IMHO, either in our environment (HW issue?) or by adding this debouncing
feature.
 
>> ata6: SATA link down (SStatus 0 SControl 300)
>> Modules linked in:
>> CPU: 2 PID: 1162 Comm: kworker/u8:5 Not tainted 4.15.0+ #26
>> Hardware name: congatec conga-qeval20-qa3-e3845/conga-qeval20-qa3-e3845, BIOS 2018.01-00033-g0125f37185-dirty 01/18/2018
>> Workqueue: pciehp-1 pciehp_power_thread
>> ...
>>
>> This patch now adds the 'pciehp_debounce_time' module parameter, which
>> can be used to drop all events for the specified time (in milliseconds)
>> after a link-up event occurred. A value of ~100ms works fine in my tests
>> to debounce all the link-up / link-down events in my tests.
> 
> This sounds a bit "hackish". I would rather make sure we can handle
> situations like this properly without passing additional parameters.

I'm open to other / better ideas on how to solve this situation that
we are seeing on our systems.

Thanks,
Stefan


* Re: [RFC PATCH] PCI: pciehp: Add module parameter to enable debouncing of HP link events
  2018-02-02 13:38   ` Stefan Roese
@ 2018-02-02 13:47     ` Lukas Wunner
  2018-02-02 14:44       ` Stefan Roese
  2018-02-02 13:56     ` Mika Westerberg
  1 sibling, 1 reply; 9+ messages in thread
From: Lukas Wunner @ 2018-02-02 13:47 UTC (permalink / raw)
  To: Stefan Roese; +Cc: Mika Westerberg, linux-pci, Bjorn Helgaas

On Fri, Feb 02, 2018 at 02:38:34PM +0100, Stefan Roese wrote:
> > On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
> >> Hotplugging of some PCIe devices on our platform sometimes leads to a
> >> bounce of link-up and link-down events, resulting in problems in the
> >> corresponding PCI drivers.
> >>
> >> Here an example of such a hotplug event bounce for a AHCI PCIe card:
> >> ...
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> 
> I'm open for other / better ideas on how to solve this situation, we
> are seeing on our systems.

If a Link Up event is received and there is already a Link Up / Link Down
pair in the queue, the Link Down event can be dequeued and the newly
received Link Up event need not be queued.

Same if a Link Down event is received and there is already a Link Down /
Link Up pair in the queue.
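
The coalescing rule described above could be sketched as follows. This is a minimal user-space illustration, not pciehp code: it assumes pending events sit in a simple array that can be inspected and trimmed, which is not how the driver's workqueue behaves, and `struct event_queue` and `queue_link_event()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

enum link_event { LINK_UP, LINK_DOWN };

#define QUEUE_MAX 8

/* Hypothetical pending-event queue; the newest event is at ev[len - 1]. */
struct event_queue {
	enum link_event ev[QUEUE_MAX];
	size_t len;
};

/*
 * Queue a link event, coalescing bounces: if the two most recent pending
 * events are (X, opposite of X) and the new event is X again, remove the
 * pending opposite event and do not queue the new one.
 * Returns false only if the queue is full.
 */
static bool queue_link_event(struct event_queue *q, enum link_event e)
{
	if (q->len >= 2 &&
	    q->ev[q->len - 2] == e &&
	    q->ev[q->len - 1] != e) {
		q->len--;	/* dequeue the opposite event */
		return true;	/* the new event is absorbed */
	}

	if (q->len == QUEUE_MAX)
		return false;

	q->ev[q->len++] = e;
	return true;
}
```

Under these assumptions, the sequence Link Up, Link Down, Link Up leaves a single pending Link Up, so the slot is brought up only once.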

Thanks,

Lukas


* Re: [RFC PATCH] PCI: pciehp: Add module parameter to enable debouncing of HP link events
  2018-02-02 13:47     ` Lukas Wunner
@ 2018-02-02 14:44       ` Stefan Roese
  2018-02-02 19:20         ` Bjorn Helgaas
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Roese @ 2018-02-02 14:44 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Mika Westerberg, linux-pci, Bjorn Helgaas

On 02.02.2018 14:47, Lukas Wunner wrote:
> On Fri, Feb 02, 2018 at 02:38:34PM +0100, Stefan Roese wrote:
>>> On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
>>>> Hotplugging of some PCIe devices on our platform sometimes leads to a
>>>> bounce of link-up and link-down events, resulting in problems in the
>>>> corresponding PCI drivers.
>>>>
>>>> Here an example of such a hotplug event bounce for a AHCI PCIe card:
>>>> ...
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
>>
>> I'm open for other / better ideas on how to solve this situation, we
>> are seeing on our systems.
> 
> If a Link Up event is received and there is already a Link Up / Link Down
> pair in the queue, the Link Down event can be dequeued and the newly
> received Link Up event need not be queued.
> 
> Same if a Link Down event is received and there is already a Link Down /
> Link Up pair in the queue.

Makes sense. But I'm more often seeing this sequence here while
hot-plugging the PCIe card:

[   41.260667] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
[   41.260731] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
[   41.290650] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
[   41.295837] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
[   41.320664] pciehp 0000:00:1c.1:pcie004: Slot(1): Card not present
[   41.330042] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
[   41.330110] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
[   41.375950] pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
...

So a link-down follows the link-up directly (~30ms here). Sometimes
a double link-up is also seen, but this sequence is the more frequent
one in my test cases.
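
The ~30ms figure can be checked directly from the log timestamps (41.290650 - 41.260731 = 0.029919 s). A small helper for this kind of measurement might look like the sketch below; `dmesg_timestamp_us()` is a hypothetical name, and the sketch assumes the usual "[seconds.microseconds]" prefix with six fractional digits.

```c
#include <stdio.h>

/*
 * Parse a dmesg-style "[   41.260731]" prefix into microseconds.
 * Assumes six fractional digits. Returns 0 on success, -1 if the line
 * does not start with such a timestamp.
 */
static int dmesg_timestamp_us(const char *line, unsigned long long *us)
{
	unsigned long long sec, frac;

	if (sscanf(line, "[%llu.%6llu]", &sec, &frac) != 2)
		return -1;

	*us = sec * 1000000ULL + frac;
	return 0;
}
```

Subtracting the parsed values for the first Link Up and the following Link Down gives 29919 microseconds, well inside the ~100ms window mentioned in the patch description.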

Thanks,
Stefan


* Re: [RFC PATCH] PCI: pciehp: Add module parameter to enable debouncing of HP link events
  2018-02-02 14:44       ` Stefan Roese
@ 2018-02-02 19:20         ` Bjorn Helgaas
  0 siblings, 0 replies; 9+ messages in thread
From: Bjorn Helgaas @ 2018-02-02 19:20 UTC (permalink / raw)
  To: Stefan Roese; +Cc: Lukas Wunner, Mika Westerberg, linux-pci, Bjorn Helgaas

On Fri, Feb 02, 2018 at 03:44:21PM +0100, Stefan Roese wrote:
> On 02.02.2018 14:47, Lukas Wunner wrote:
> >On Fri, Feb 02, 2018 at 02:38:34PM +0100, Stefan Roese wrote:
> >>>On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
> >>>>Hotplugging of some PCIe devices on our platform sometimes leads to a
> >>>>bounce of link-up and link-down events, resulting in problems in the
> >>>>corresponding PCI drivers.
> >>>>
> >>>>Here an example of such a hotplug event bounce for a AHCI PCIe card:
> >>>>...
> >>>>pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >>>>pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> >>>>pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> >>>>pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> >>>>pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >>>>pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> >>
> >>I'm open for other / better ideas on how to solve this situation, we
> >>are seeing on our systems.

This is definitely a real problem that should be fixed somehow.

But I don't like the idea of a new module parameter because it's not
very user-friendly.  It would be very difficult for a user to identify
the problem, discover the parameter, and figure out what debounce time
to use.

> >If a Link Up event is received and there is already a Link Up / Link Down
> >pair in the queue, the Link Down event can be dequeued and the newly
> >received Link Up event need not be queued.
> >
> >Same if a Link Down event is received and there is already a Link Down /
> >Link Up pair in the queue.
> 
> Makes sense. But I'm more often seeing this sequence here while
> hot-plugging the PCIe card:
> 
> [   41.260667] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [   41.260731] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> [   41.290650] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> [   41.295837] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [   41.320664] pciehp 0000:00:1c.1:pcie004: Slot(1): Card not present
> [   41.330042] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [   41.330110] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> [   41.375950] pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
> ...
> 
> So a link-down is following the link-up directly (~30ms here). Sometimes
> a double link-up is also seen. But this one is more frequent in my test
> cases.

Unfortunately I don't have any easy ideas to offer.  I do think the
pciehp interrupt handling is baroque and I suspect that if we could
simplify and rationalize it, some of these issues would take care of
themselves.

Bjorn


* Re: [RFC PATCH] PCI: pciehp: Add module parameter to enable debouncing of HP link events
  2018-02-02 13:38   ` Stefan Roese
  2018-02-02 13:47     ` Lukas Wunner
@ 2018-02-02 13:56     ` Mika Westerberg
  2018-02-02 14:50       ` Stefan Roese
  1 sibling, 1 reply; 9+ messages in thread
From: Mika Westerberg @ 2018-02-02 13:56 UTC (permalink / raw)
  To: Stefan Roese; +Cc: linux-pci, Bjorn Helgaas

On Fri, Feb 02, 2018 at 02:38:34PM +0100, Stefan Roese wrote:
> Hi Mika,
> 
> sorry for the late reply.
> 
> On 30.01.2018 11:28, Mika Westerberg wrote:
> > On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
> >> Hotplugging of some PCIe devices on our platform sometimes leads to a
> >> bounce of link-up and link-down events, resulting in problems in the
> >> corresponding PCI drivers.
> >>
> >> Here an example of such a hotplug event bounce for a AHCI PCIe card:
> >> ...
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> > 
> > It would be good to find out why this happens in the first place.
> > Perhaps there is some environmental interference or something causing
> > this?
> 
> I'm seeing these link bounces in the following environments:
> 
> a) Using a BayTrail SoC and hotplugging a standard Desktop PCIe SATA /
>    AHCI Controller (Marvell chip)
> b) Hotplugging (booting via SPI) an Altera / Intel FPGA which is connected
>    via PCIe to a PCIe switch
> 
> In both cases, this link bouncing happens infrequently, approx. once out
> of 5 - 10 tries.
> 
> Out of curiosity, has nobody else ever experienced such "link bouncing"
> with PCIe cards / devices getting hot-plugged?

I've seen it with some Thunderbolt devices from time to time.

I think it is entirely possible in the real world that the link goes
down briefly, for example because of some external interference, so we
should make sure we can handle that properly.

> >> pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
> >> pci 0000:02:00.0: reg 0x10: [io  0x8000-0x8007]
> >> ...
> >> ata3: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910100 irq 100
> >> ata4: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910180 irq 100
> >> ata5: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910200 irq 100
> >> ata6: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910280 irq 100
> >> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
> >> ahci 0000:02:00.0: PME# disabled
> >> ata3: SATA link down (SStatus 0 SControl 300)
> >> ata5: SATA link down (SStatus 0 SControl 300)
> >> ata4: SATA link down (SStatus 0 SControl 300)
> >> WARNING: CPU: 2 PID: 1162 at drivers/ata/libata-core.c:6620 ata_host_detach+0x125/0x130
> > 
> > I think the AHCI driver should be fixed to cope with this.
> 
> Yes, this can be discussed. But still the root-cause should be fixed,
> IMHO. Either in our environment (HW issue?) or by adding this de-bouncing
> feature.
>
> >> ata6: SATA link down (SStatus 0 SControl 300)
> >> Modules linked in:
> >> CPU: 2 PID: 1162 Comm: kworker/u8:5 Not tainted 4.15.0+ #26
> >> Hardware name: congatec conga-qeval20-qa3-e3845/conga-qeval20-qa3-e3845, BIOS 2018.01-00033-g0125f37185-dirty 01/18/2018
> >> Workqueue: pciehp-1 pciehp_power_thread
> >> ...
> >>
> >> This patch now adds the 'pciehp_debounce_time' module parameter, which
> >> can be used to drop all events for the specified time (in milliseconds)
> >> after a link-up event occurred. A value of ~100ms works fine in my tests
> >> to debounce all the link-up / link-down events in my tests.
> > 
> > This sounds a bit "hackish". I would rather make sure we can handle
> > situations like this properly without passing additional parameters.
> 
> I'm open for other / better ideas on how to solve this situation, we
> are seeing on our systems.

Well, I would start by fixing drivers that can't cope with surprise link
down (e.g. a disappearing PCI device, or suddenly reading 0xffffffff from
a register).
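
As an illustration of what "coping" with an all-ones read can mean, here is a minimal user-space sketch of a status poll that bails out when the device appears to be gone. The names `poll_status()` and `enum poll_result` are hypothetical, and the array of samples stands in for successive register reads; a real driver would read the hardware instead. Note that all ones can be a legitimate value for some registers, so this check only suits registers where that pattern is not valid.

```c
#include <stddef.h>
#include <stdint.h>

enum poll_result { POLL_OK, POLL_TIMEOUT, POLL_DEVICE_GONE };

/*
 * Poll for any bit in ready_mask. "samples" models n successive reads
 * of a status register (an assumption made to keep the sketch testable).
 * A read of 0xffffffff is treated as a surprise removal and ends the
 * poll immediately instead of being mistaken for "ready".
 */
static enum poll_result poll_status(const uint32_t *samples, size_t n,
				    uint32_t ready_mask)
{
	for (size_t i = 0; i < n; i++) {
		if (samples[i] == UINT32_MAX)
			return POLL_DEVICE_GONE;
		if (samples[i] & ready_mask)
			return POLL_OK;
	}
	return POLL_TIMEOUT;
}
```

The removal check comes before the ready check on purpose: an all-ones value has every bit set and would otherwise always look like "ready".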

BTW, have you checked whether presence detect actually toggles similarly,
or is it only triggered when the link is fully up? Currently we
prioritize link up/down over presence detect, but it may be that
we should do the opposite.


* Re: [RFC PATCH] PCI: pciehp: Add module parameter to enable debouncing of HP link events
  2018-02-02 13:56     ` Mika Westerberg
@ 2018-02-02 14:50       ` Stefan Roese
  2018-02-02 15:11         ` Mika Westerberg
  0 siblings, 1 reply; 9+ messages in thread
From: Stefan Roese @ 2018-02-02 14:50 UTC (permalink / raw)
  To: Mika Westerberg; +Cc: linux-pci, Bjorn Helgaas

On 02.02.2018 14:56, Mika Westerberg wrote:
> On Fri, Feb 02, 2018 at 02:38:34PM +0100, Stefan Roese wrote:
>> Hi Mika,
>>
>> sorry for the late reply.
>>
>> On 30.01.2018 11:28, Mika Westerberg wrote:
>>> On Tue, Jan 30, 2018 at 09:41:21AM +0100, Stefan Roese wrote:
>>>> Hotplugging of some PCIe devices on our platform sometimes leads to a
>>>> bounce of link-up and link-down events, resulting in problems in the
>>>> corresponding PCI drivers.
>>>>
>>>> Here an example of such a hotplug event bounce for a AHCI PCIe card:
>>>> ...
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
>>>
>>> It would be good to find out why this happens in the first place.
>>> Perhaps there is some environmental interference or something causing
>>> this?
>>
>> I'm seeing these link bounces in the following environments:
>>
>> a) Using a BayTrail SoC and hotplugging a standard Desktop PCIe SATA /
>>     AHCI Controller (Marvell chip)
>> b) Hotplugging (booting via SPI) an Altera / Intel FPGA which is connected
>>     via PCIe to a PCIe switch
>>
>> In both cases, this link bouncing happens infrequently, approx. once out
>> of 5 - 10 tries.
>>
>> Out of curiosity, has nobody else ever experienced such "link bouncing"
>> with PCIe cards / devices getting hot-plugged?
> 
> I've seen it with some Thunderbolt devices from time to time.
> 
> I think it is entirely possible in real world that the link goes down
> briefly for example because of some external interference so we should
> make sure we can handle that properly.
> 
>>>> pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
>>>> pci 0000:02:00.0: reg 0x10: [io  0x8000-0x8007]
>>>> ...
>>>> ata3: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910100 irq 100
>>>> ata4: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910180 irq 100
>>>> ata5: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910200 irq 100
>>>> ata6: SATA max UDMA/133 abar m2048@0x80910000 port 0x80910280 irq 100
>>>> pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up event ignored; already powering on
>>>> ahci 0000:02:00.0: PME# disabled
>>>> ata3: SATA link down (SStatus 0 SControl 300)
>>>> ata5: SATA link down (SStatus 0 SControl 300)
>>>> ata4: SATA link down (SStatus 0 SControl 300)
>>>> WARNING: CPU: 2 PID: 1162 at drivers/ata/libata-core.c:6620 ata_host_detach+0x125/0x130
>>>
>>> I think the AHCI driver should be fixed to cope with this.
>>
>> Yes, this can be discussed. But still the root-cause should be fixed,
>> IMHO. Either in our environment (HW issue?) or by adding this de-bouncing
>> feature.
>>
>>>> ata6: SATA link down (SStatus 0 SControl 300)
>>>> Modules linked in:
>>>> CPU: 2 PID: 1162 Comm: kworker/u8:5 Not tainted 4.15.0+ #26
>>>> Hardware name: congatec conga-qeval20-qa3-e3845/conga-qeval20-qa3-e3845, BIOS 2018.01-00033-g0125f37185-dirty 01/18/2018
>>>> Workqueue: pciehp-1 pciehp_power_thread
>>>> ...
>>>>
>>>> This patch now adds the 'pciehp_debounce_time' module parameter, which
>>>> can be used to drop all events for the specified time (in milliseconds)
>>>> after a link-up event occurred. A value of ~100ms works fine in my tests
>>>> to debounce all the link-up / link-down events in my tests.
>>>
>>> This sounds a bit "hackish". I would rather make sure we can handle
>>> situations like this properly without passing additional parameters.
>>
>> I'm open for other / better ideas on how to solve this situation, we
>> are seeing on our systems.
> 
> Well, I would start by fixing drivers that can't cope with surprise link
> down (e.g disapearing PCI device, or suddenly reading 0xffffffff from
> register).

I've already sent a patch regarding a libata problem while unplugging
an AHCI controller:

https://www.spinics.net/lists/linux-ide/msg55038.html

> BTW, have you checked whether presence detect actually toggles similarly
> or is it only triggered when the link is fully up? Since currently we
> prioritize link up/down higher than presence detect but it may be that
> we should do the opposite.

As seen in the log sent in my previous mail, presence detect also
toggles. Here again:

[   41.260667] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
[   41.260731] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
[   41.290650] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
[   41.295837] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
[   41.320664] pciehp 0000:00:1c.1:pcie004: Slot(1): Card not present
[   41.330042] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
[   41.330110] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
[   41.375950] pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
...

I'm wondering, though, why we are seeing "Card present" three times
and "Card not present" only once. I would expect to see two
"Card present" messages here. This does not seem to be balanced
correctly.

Thanks,
Stefan


* Re: [RFC PATCH] PCI: pciehp: Add module parameter to enable debouncing of HP link events
  2018-02-02 14:50       ` Stefan Roese
@ 2018-02-02 15:11         ` Mika Westerberg
  0 siblings, 0 replies; 9+ messages in thread
From: Mika Westerberg @ 2018-02-02 15:11 UTC (permalink / raw)
  To: Stefan Roese; +Cc: linux-pci, Bjorn Helgaas

On Fri, Feb 02, 2018 at 03:50:55PM +0100, Stefan Roese wrote:
> I've already sent a patch regarding a libata problem while unplugging
> an AHCI controller:
> 
> https://www.spinics.net/lists/linux-ide/msg55038.html

Great :)

> > BTW, have you checked whether presence detect actually toggles similarly
> > or is it only triggered when the link is fully up? Since currently we
> > prioritize link up/down higher than presence detect but it may be that
> > we should do the opposite.
> 
> As seen in the log sent in my previous mail, presence detect also
> toggles. Here again:
> 
> [   41.260667] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [   41.260731] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> [   41.290650] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Down
> [   41.295837] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [   41.320664] pciehp 0000:00:1c.1:pcie004: Slot(1): Card not present
> [   41.330042] pciehp 0000:00:1c.1:pcie004: Slot(1): Card present
> [   41.330110] pciehp 0000:00:1c.1:pcie004: Slot(1): Link Up
> [   41.375950] pci 0000:02:00.0: [1b4b:9215] type 00 class 0x010601
> ...

Indeed, it seems to follow link status changes closely. So changing the
"priority" here would not help.
