Re: [PATCH] pci/pciehp: bail on bogus pcie reads from removed devices

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jarod Wilson <jarod@redhat.com>
To: Bjorn Helgaas <bhelgaas@google.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>
Subject: Re: [PATCH] pci/pciehp: bail on bogus pcie reads from removed devices
Date: Tue, 4 Aug 2015 16:22:44 -0400	[thread overview]
Message-ID: <55C11F14.3080803@redhat.com> (raw)
In-Reply-To: <CAErSpo4JoBxmwwscckH+hrSYFbqKgX4KAiRGS2ZDQxxZ_So63Q@mail.gmail.com>

On 8/4/2015 3:27 PM, Bjorn Helgaas wrote:
> On Tue, Aug 4, 2015 at 1:46 PM, Jarod Wilson <jarod@redhat.com> wrote:
>> On 8/4/2015 1:51 PM, Bjorn Helgaas wrote:
...
>>>>> Can you try the version I posted, with the additional tests in
>>>>> pcie_poll_cmd() and pcie_do_write_cmd()?  We should try to read from
>>>>> the device there, even before we free the IRQ, so we might see several
>>>>> messages.  Maybe there's a way we can be smarter about bailing out
>>>>> there.
>>>>
>>>>
>>>> The above was with your additions munged in with the older patch, I
>>>> actually do see "pcie_do_write_cmd: no response from device" just
>>>> two lines ahead of each "Device has gone away" message from
>>>> pcie_isr().
>>>>
>>>> pciehp 0000:06:00.0:pcie24: pcie_do_write_cmd: no response from device
>>>> pciehp 0000:06:00.0:pcie24: pcie_disable_notification: SLOTCTRL d8
>>>> write cmd 0
>>>> pciehp 0000:06:00.0:pcie24: Device has gone away <- from pcie_isr()
>>>
>>>
>>> Oh, sorry!  I should have noticed that.  I just wanted to make sure I
>>> didn't cause a flood of extra messages.
>>>
>>> I think I'll merge this version (with all three checks).  We still have a
>>> slot lifetime issue, but that's a separate problem.
>>
>>
>> Sounds good to me, thanks much for your help on this.
>>
>> Do we really still have a slot lifetime issue though? It looks to be the
>> path from pciehp_release_ctrl that leads to free_irq and __free_irq calling
>> pcie_isr one last time, and there's a ctrl_info("Latch %s on Slot(%s)", open
>> ? ..., slot_name(slot)); in pcie_isr *if* we aren't bailing when we read all
>> 1's from PCI_EXP_SLTSTA. I think when we bail early, we should never see the
>> subsequent attempt to read the freed slot.
>
> It's possible that we avoid referencing the freed data, but I don't
> have warm fuzzies because it's hard to prove that by analyzing the
> source code.  It's hard to even know what to look for -- there's no
> clue in the code that says "don't reference slot->hotplug_slot after
> this point."  And it feels like a poor design to hang on to that
> pointer after the slot has been freed.
>
> IIRC, your initial report mentioned possible memory corruption, and I
> don't even have a theory about where that happened.  The
> slot->hotplug_slot references I saw were all reads where we printed
> junk but shouldn't have actually corrupted anything.

Looking at the output I was seeing, it looks like one of the ~0 reads is 
interpreted as a switch interrupt received, data link layer state 
change, etc., followed by "Enabling domain:bus:device=0000:0x:00" from 
pciehp_power_thread. Subsequently, we're calling pciehp_enable_slot, 
which calls board_added, and in the output I've got, its tripping over 
board_added's call to pciehp_check_link_status ("Failed to check link 
status"), which means going to err_exit and calling set_slot_off.

Next up, set_slot_off is calling pciehp_power_off_slot, which does a 
pcie_write_cmd(). Is it possible that write might lead to memory corruption?

> Anyway, I don't know what to do about it, and I don't have time right
> now to poke at it, but it does worry me a bit.

Definitely a bit worrying, still trying to comprehend it all here myself...

-- 
Jarod Wilson
jarod@redhat.com

next prev parent reply	other threads:[~2015-08-04 20:22 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-07-21 16:25 [PATCH] pci/pciehp: bail on bogus pcie reads from removed devices Jarod Wilson
2015-08-03  4:14 ` Bjorn Helgaas
     [not found]   ` <CAKfmpSdrdjyiQ-WWBdXFuPuTyo0WkTTsTX5ByHBt7haZeF0w=g@mail.gmail.com>
2015-08-04 14:10     ` Jarod Wilson
2015-08-04 16:56     ` Bjorn Helgaas
2015-08-04 17:21       ` Jarod Wilson
2015-08-04 17:51         ` Bjorn Helgaas
2015-08-04 18:46           ` Jarod Wilson
2015-08-04 19:27             ` Bjorn Helgaas
2015-08-04 20:22               ` Jarod Wilson [this message]
2015-08-04 21:11                 ` Bjorn Helgaas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55C11F14.3080803@redhat.com \
    --to=jarod@redhat.com \
    --cc=bhelgaas@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=rafael.j.wysocki@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.