From: Gu Zheng <guz.fnst@cn.fujitsu.com>
To: Myron Stowe <myron.stowe@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	Joe Lawrence <Joe.Lawrence@stratus.com>,
	linux-pci@vger.kernel.org, Matthew Garrett <mjg59@srcf.ucam.org>,
	Myron Stowe <mstowe@redhat.com>,
	David Bulkow <david.bulkow@stratus.com>,
	Yinghai Lu <yinghai@kernel.org>
Subject: Re: [PATCH 1/2] PCI: ASPM exit link state code could skip devices
Date: Wed, 27 Feb 2013 14:42:17 +0800
Message-ID: <512DAAC9.8030400@cn.fujitsu.com>
In-Reply-To: <CAL-B5D0xN6bs3oXugS9FOS5bQiFRJQ9766u5wsXw4UQdUmwfyw@mail.gmail.com>

On 02/27/2013 12:03 AM, Myron Stowe wrote:

> On Sun, Feb 24, 2013 at 10:59 PM, Gu Zheng <guz.fnst@cn.fujitsu.com> wrote:
>> On 02/24/2013 08:20 AM, Bjorn Helgaas wrote:
>>
>>> [+cc Yinghai]
>>>
...snip...
>>> Please create a bugzilla for this issue.
>>>
>>
>> Here: https://bugzilla.kernel.org/show_bug.cgi?id=54411
>>
>>> I think this is a general object lifetime issue that really has
>>> nothing to do with ASPM except that ASPM happens to be the victim.
>>>
>>> You're doing this:
>>>
>>>     echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1 > /sys/bus/pci/devices/0000\:1a\:01.0/remove
>>>
>>> The 1a:01.0 device is downstream from the 10:00.0 bridge.  The sysfs
>>> interface remove_store() uses device_schedule_callback() to schedule
>>> the remove for later.  I think what's happening is that we schedule
>>> remove_callback() for both devices before 10:00.0 has been removed,
>>> like this:
>>>
>>>     # echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove
>>>     remove_store  # for 10:00.0
>>>       device_schedule_callback(10:00.0, remove_callback)
>>>         sysfs_schedule_callback
>>>           kobject_get
>>>           queue_work
>>>     # echo -n 1 >  /sys/bus/pci/devices/0000\:1a\:01.0/remove
>>>     remove_store  # for 1a:01.0
>>>       device_schedule_callback(1a:01.0, remove_callback)
>>>         sysfs_schedule_callback
>>>           kobject_get
>>>           queue_work
>>>
>>> Note that we acquire a reference on each pci_dev before queuing the work item.
>>>
>>> Later, we run the callbacks, starting with 10:00.0.  This calls
>>> remove_callback() to perform the remove:
>>>
>>>     remove_callback(10:00.0)
>>>       mutex_lock(&pci_remove_rescan_mutex)
>>>       pci_stop_and_remove_bus_device(pdev)
>>>       mutex_unlock(&pci_remove_rescan_mutex)
>>>
>>> This will stop and remove the subtree below 10:00.0, but it does not
>>> actually free the pci_dev for 1a:01.0 because we increased its ref
>>> count in sysfs_schedule_callback.  So after completing
>>> remove_callback(10:00.0), we run the second callback for 1a:01.0.
>>>
>>> The remove for 1a:01.0 calls pcie_aspm_exit_link_state() from
>>> pci_stop_dev().  This is where we blow up because, according to your
>>> debugging, pdev->bus->self is no longer valid.
>>>
>>> The PCI core did this removal wrong.  If we have a valid pci_dev
>>> pointer, as we do in pcie_aspm_exit_link_state(), the whole object
>>> ought to be valid.  But the PCI core deallocated the struct pci_bus
>>> for bus 0000:1a too soon.
>>
>> Your analysis is perfect, and it solves my doubt. Thanks very much!:)
> 
> Gu:
> 
> I can't tell from your response whether you are stating that you agree
> with Bjorn's analysis -or- that you applied Yinghai's patch above
> (dated Feb 23) and found that the testing scenario was now successful.
>  Could you please specifically state which and if it was just the
> former would you be able to test Yinghai's implementation of Bjorn's
> analysis?

Hi Myron,
    Sorry for the confusion.
    I was only agreeing with Bjorn's analysis. I have also tested Yinghai's
patch on kernel 3.8, but it does not seem to work. For more information,
please refer to the bugzilla entry:
https://bugzilla.kernel.org/show_bug.cgi?id=54411
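
To make the lifetime problem in Bjorn's analysis concrete, here is a minimal
userspace sketch in C. It is only a model under stated assumptions: the
struct and function names (struct bus, struct dev, bus_get, dev_put,
bus_alive_at_second_remove, and so on) are hypothetical and are not kernel
APIs, and this is not Yinghai's patch. It only shows why a queued callback
that pins the device does not also keep the device's bus alive, unless the
device itself holds a reference on the bus.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct bus {
	int refs;
	int freed;	/* set instead of calling free() so the demo can inspect it */
};

struct dev {
	int refs;
	struct bus *bus;
	int pins_bus;	/* 1 if the device took a reference on its bus */
};

static void bus_get(struct bus *b)
{
	b->refs++;
}

static void bus_put(struct bus *b)
{
	assert(b->refs > 0);
	if (--b->refs == 0)
		b->freed = 1;
}

static void dev_init(struct dev *d, struct bus *b, int pin_bus)
{
	d->refs = 1;		/* reference held by the (modelled) core */
	d->bus = b;
	d->pins_bus = pin_bus;
	if (pin_bus)
		bus_get(b);	/* the proposed idea: a device pins its bus */
}

static void dev_get(struct dev *d)
{
	d->refs++;
}

static void dev_put(struct dev *d)
{
	assert(d->refs > 0);
	if (--d->refs == 0 && d->pins_bus)
		bus_put(d->bus);	/* release the bus only with the last device ref */
}

/*
 * Models the two queued removals. Returns 1 if the child's bus is still
 * alive when the second (child) removal would run, 0 otherwise.
 */
static int bus_alive_at_second_remove(int pin_bus)
{
	struct bus b = { .refs = 1, .freed = 0 };	/* ref held by the parent bridge */
	struct dev child;
	int alive;

	dev_init(&child, &b, pin_bus);

	/* sysfs write for the child: the queued callback pins the device. */
	dev_get(&child);

	/* First callback removes the bridge subtree: the core drops its
	 * device reference and the bridge drops its bus reference. */
	dev_put(&child);
	bus_put(&b);

	/* Second callback would now dereference child.bus. */
	alive = !b.freed;

	/* Second callback finishes and drops the queued reference. */
	dev_put(&child);
	return alive;
}
```

In this model, bus_alive_at_second_remove(0) returns 0, which corresponds to
the stale pdev->bus seen in the crash, while bus_alive_at_second_remove(1)
returns 1 because the bus is released only when the last device reference
goes away.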

Thanks,
Gu

> 
> Thanks,
>  Myron
>>
>>>
>>> My guess is that when we build a pci_dev, we need to increase the ref
>>> count on the pci_bus where that pci_dev lives.  That way we can keep
>>> around all the buses and bridges leading from the root to the device
>>> in question.
>>>
>>> Bjorn
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>




