From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755303AbcAIB1m (ORCPT <rfc822;w@1wt.eu>);
	Fri, 8 Jan 2016 20:27:42 -0500
Received: from bh-25.webhostbox.net ([208.91.199.152]:39469 "EHLO
	bh-25.webhostbox.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751019AbcAIB1k (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Fri, 8 Jan 2016 20:27:40 -0500
Subject: Re: [RFC PATCH 1/2] PCI: pciehp: Merge hotplug work requests
To: Bjorn Helgaas <helgaas@kernel.org>
References: <1446522496-21628-1-git-send-email-linux@roeck-us.net>
 <20160108161846.GA31005@localhost> <568FE426.5000707@roeck-us.net>
 <20160108174612.GA5354@localhost>
Cc: Bjorn Helgaas <bhelgaas@google.com>, Yinghai Lu <yinghai@kernel.org>,
        linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
From: Guenter Roeck <linux@roeck-us.net>
Message-ID: <56906209.9040401@roeck-us.net>
Date: Fri, 8 Jan 2016 17:27:37 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
 Thunderbird/38.4.0
MIME-Version: 1.0
In-Reply-To: <20160108174612.GA5354@localhost>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Authenticated_sender: linux@roeck-us.net
X-OutGoing-Spam-Status: No, score=-1.0
X-AntiAbuse: This header was added to track abuse, please include it with any abuse report
X-AntiAbuse: Primary Hostname - bh-25.webhostbox.net
X-AntiAbuse: Original Domain - vger.kernel.org
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - roeck-us.net
X-Get-Message-Sender-Via: bh-25.webhostbox.net: authenticated_id: linux@roeck-us.net
X-Authenticated-Sender: bh-25.webhostbox.net: linux@roeck-us.net
X-Source: 
X-Source-Args: 
X-Source-Dir: 
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Bjorn,

On 01/08/2016 09:46 AM, Bjorn Helgaas wrote:
> On Fri, Jan 08, 2016 at 08:30:30AM -0800, Guenter Roeck wrote:
>> On 01/08/2016 08:18 AM, Bjorn Helgaas wrote:
>>> Hi Guenter,
>>>
>>> Sorry for the delay in getting to this.
>>>
>>> On Mon, Nov 02, 2015 at 07:48:15PM -0800, Guenter Roeck wrote:
>>>> Some oddball devices may experience a PCIe link flap after power-on.
>>>> This may result in the following sequence of events.
>>>>
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24: Card present on Slot(0)
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24: slot(0): Link Up event
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24:
>>>> 	Link Up event ignored on slot(0): already powering on
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24: slot(0): Link Down event
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24:
>>>> 	Link Down event queued on slot(0): currently getting powered on
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24: slot(0): Link Up event
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24:
>>>> 	Link Up event queued on slot(0): currently getting powered off
>>>>
>>>> This causes the driver for affected devices to be instantiated, removed,
>>>> and re-instantiated.
>>>>
>>>> An even worse problem is a device which resets itself continuously.
>>>> This can result in the following endless sequence of messages.
>>>>
>>>> pciehp 0000:02:0a.0:pcie24: Card present on Slot(10)
>>>> pciehp 0000:02:0a.0:pcie24: Card not present on Slot(10)
>>>> pciehp 0000:02:0a.0:pcie24: Card present on Slot(10)
>>>> pciehp 0000:02:0a.0:pcie24: Card not present on Slot(10)
>>>> pciehp 0000:02:0a.0:pcie24: Card present on Slot(10)
>>>> pciehp 0000:02:0a.0:pcie24: Card not present on Slot(10)
>>>>
>>>> The problem in the both cases is that all events are enqueued as hotplug
>>>> work requests and executed in sequence, which can overwhelm the system
>>>> and even result in "hung task" tracebacks in pciehp_power_thread().
>>>>
>>>> This exposes an underlying limitation of the hotplug state machine: It
>>>> executes all received requests, even though only the most recent request
>>>> really needs to be handled. As a result, by the time a request is handled,
>>>> it may well be obsolete and have been superseded by many other enable /
>>>> disable requests which have been enqueued in the meantime.
>>>>
>>>> To solve the problem, fold hotplug work requests into a single request.
>>>> Store the request as well as the work data structure in 'struct slot',
>>>> thus eliminating the need to allocate memory for each request.
>>>> Handle a sequence of requests by setting a 'disable' flag when needed,
>>>> indicating that a link needs to be disabled prior to re-enabling it.
>>>>
>>>> With this change, requests and request sequences are handled as follows.
>>>>
>>>> enable (single request):		enable link
>>>> disable (single request):		disable link
>>>> ... disable-enable-disable...disable:	disable link
>>>> ... disable-enable-disable...enable:	disable link, then enable it
>>>
>>> I think this is a really good idea, but I would like to understand
>>> what the critical points are and how they affect the state machine.
>>>
>>> I think you're basically accounting for the fact that some hotplug
>>> controller commands are not completed instantaneously, and we might
>>> receive more interrupts before the first command is completed.  I
>>> suspect that your patch only makes a difference on controllers that
>>> support Command Completed events, right?
>>
>> No, not really. problem is that state change interrupts are not handled
>> immediately but enqueued. By the time an event is handled by the workqueue,
>> it is long since obsolete and superseded by other events.
>
> Ah.  So the important interval is the one between pcie_isr(), where we
> enqueue work, and the worker threads that are awakened to actually do
> the work.  Then the idea is that only the most recent state is
> important -- if we have several presence detect changes, e.g.,
> present, absent, present, absent, before the worker thread starts
> processing them, it should only look at the most recent state.  That
> seems like the right thing to me, and I think the removal of kmalloc
> from the ISR path is an important consequence of doing that.
>
Yes, exactly.

> Blue sky thinking:
>
>    - Do all interrupt-related CSR reads in pcie_isr() (as opposed to
>      doing some in pcie_isr() and others in the worker threads).  I
>      think this is important for consistency.  I think it's a mistake
>      to read and clear the status register in pcie_isr(), then go back
>      and read other CSRs later in the worker threads.
>
>    - Have pcie_isr() update a single set of "current state" CSR values.
>      This means we don't need any allocation in pcie_isr().  Some
>      values, e.g., Power Fault Detected, might be OR-d in.  Others,
>      e.g., Presence Detect, might overwrite the previous value.
>
>    - Collapse the several worker threads into a single one that
>      pcie_isr() would awaken (as opposed to having pcie_isr() decide
>      whether this is a button press, presence detect change, link
>      event, etc.)
>
>    - Have the single worker thread figure out what happened and how to
>      advance the state machine, based on the CSR values read by the
>      most recent pcie_isr() invocation.
>
> This would be a lot of changes, but I think it has the potential to
> centralize the state machine management and simplify things
> significantly.
>
That all makes a lot of sense, and I would love to spend some time on it.
Unfortunately, I don't think I will be able to spend much time on this
anytime soon.

Any chance to accept at least the first of my two patches ? Or, in other
words, do you see an immediate problem with it that I could address in,
say, the next week or two ?

Thanks,
Guenter