All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andreas Hartmann <andihartmann@freenet.de>
To: Alex Williamson <alex.williamson@redhat.com>,
	Andreas Hartmann <andihartmann@freenet.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/4] PCI: quirk Atheros AR93xx to avoid bus reset
Date: Mon, 12 Jan 2015 20:15:16 +0100	[thread overview]
Message-ID: <54B41D44.5080606@maya.org> (raw)
In-Reply-To: <1421081344.6130.54.camel@redhat.com>

Hello Alex!

Alex Williamson wrote:
> On Mon, 2015-01-12 at 16:20 +0100, Andreas Hartmann wrote:
>> Alex Williamson wrote:
>>> On Thu, 2015-01-08 at 09:07 -0700, Bjorn Helgaas wrote:
>>>> On Fri, Nov 21, 2014 at 11:24:27AM -0700, Alex Williamson wrote:
>>>>> Reports against the TL-WDN4800 card indicate that PCI bus reset of
>>>>> this Atheros device cause system lock-ups and resets.  I've also
>>>>> been able to confirm this behavior on multiple systems.  The device
>>>>> never returns from reset and attempts to access config space of the
>>>>> device after reset result in hangs.  Blacklist bus reset for the
>>>>> device to avoid this issue.
>>>>>
>>>>> Reported-by: Andreas Hartmann <andihartmann@freenet.de>
>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>> Tested-by: Andreas Hartmann <andihartmann@freenet.de>
>>>>
>>>> If I understand correctly, these two (patches 3 & 4) fix a v3.14 regression
>>>> caused by 425c1b223dac ("PCI: Add Virtual Channel to save/restore support").
>>>>
>>>> If so, these should go to for-linus for v3.19.  What about patches 1 & 2?
>>>> Do they fix a regression?  Is there a pointer to a bugzilla or problem
>>>> report about that issue?
>>>>
>>>> I don't understand the connection between 425c1b223dac and
>>>> PCI_DEV_FLAGS_NO_BUS_RESET, because 425c1b223dac doesn't seem to do any
>>>> resets.  Is that the wrong commit, or can you outline the connection for
>>>> me?
>>>
>>> TBH, I don't have a lot of faith in associating this to 425c1b223dac,
>>> I'm not sure how Andreas' bisect landed there. 
>>
>> Because removing this patch made it working again :-)
>>
>> And too:
>> http://thread.gmane.org/gmane.linux.kernel.pci/35170/focus=35984
>>
>> Kernel 2.10. and 2.12. and 2.13. did work fine for me. 2.14 is the first
>> kernel, which hangs the machine at startup of the VM. The userland
>> (qemu) didn't change in between.
> 
> s/2\./3\./

Thanks :-) It seems I don't like the number 3 :-)

> Ok, so what about VC save/restore (425c1b223dac) is the problem then?
> When we tried to determine that, you found that if we continue from the
> top of the save loop, everything works (ie. no VC state saved), but if
> you continue after the variable declaration of the same loop (ie. still
> no VC state saved), it breaks:
> 
> http://www.spinics.net/lists/linux-pci/msg36166.html
> 
> So, please forgive me if I don't have a whole lot of faith that
> 425c1b223dac is involved.

It's hard for me, too. Really. It's kind of mystique.

> We also both independently determined that this particular device never
> recovers from a PCI bus reset, even when done from userspace with setpci
> and absolutely no save/restore wrappers.

Yes.

>  Config space on the device is
> never accessible after the reset.

Yes.

>  Therefore, how could any sort of bus
> reset with save/restore ever work for this device?

I can't say. What I definitely can say, is that I never had problems
with running VMs w/ qemu until 3.14 came up. Do you think I'm lying? I
used 3.10. and 3.12. for long time w/o (known!) problems (3.12 only on
first start of VM). Otherwise I would have been here long time before :-))).

>> Therefore: from my point of view, it is a regression, because things
>> have been working < 2.14.
>>
>> Besides that: It is undoubted, that there is a problem with resetting
>> this card. But the difference between >= 3.14 and < 3.14 is, that < 3.14
>> has been working nevertheless. The patch
>> 425c1b223dac456d00a61fd6b451b6d1cf00d065 obviously changed something
>> which I can't say and I don't know off. Therefore, the quirk-patch is
>> definitely required, because things work completely fine again w/ this
>> patch.
>>
>> "Working" means for me here: I was able to start (and use) the VM w/o
>> crashing the machine and this isn't possible w/ unpatched 2.14+ any
>> more. Yes, w/ 2.12, I wasn't able to restart the VM (it then crashed the
>> machine), but w/ 2.10 even this was possible.
> 
> What?!  So v3.12 still had a machine crash when assigning this device.

Yes. If you *re*start the VM (long time, I didn't knew that fact at all
- I just discovered it during testing while analyzing the problem :-)).
The first start (after reboot) was not a problem. This was the usual use
case here :-)).

Believe me, I'm really convinced that this card does have a problem with
resets. I'm just wondering why it had worked for me until 3.13. That's all.

> The vfio hot reset interface was added in v3.12, so v3.10 didn't have
> any way to do a reset other than what pci_reset_function() decided to
> do.  That all seems to associate the machine crash to the ability to do
> a bus reset on the device.  I'm not sure why the behavior changed
> between v3.14 and v3.12 (maybe the try-reset addition), but there's some
> sort of pre-existing issue before we even got to 425c1b223dac.

Most probably.

> I'm perfectly happy tagging this for stable,

Thanks!! I'm really very comfortable with your patch and your support!
Really! Thanks a lot! It's just odd for me, why it partly worked (first
start of VM worked) w/ 3.12 and 3.13 and 3.14 suddenly no more at all.

You have been accidentally the sufferer - most probably it could have
hit any other change, too. Sorry for that :-(. Therefore: kudos for
anyway fixing the problem. This is really not a matter of course at all!

> but it seems like a
> hardware bug exposed by allowing userspace the ability to select a bus
> reset.  Whether or not that's a kernel regression isn't exactly clear to
> me ("new functionality exposes broken hardware, news at 11").  Thanks,
> 
> Alex


Kind regards,
Andreas

>>> IME, this device cannot,
>>> and has never been able to handle a bus reset.  A simple setpci
>>> experiment on the commandline can confirm this.  What I think happened
>>> is that with the PCI bus reset infrastructure we added, we switched QEMU
>>> to prefer PCI bus resets over things like PM D3hot->D0 resets.  So it's
>>> just more prolific use of bus resets by userspace.
>>>
>>> There's also no regression in 1 & 2, PM reset has never done anything
>>> useful on those devices.  Thanks,
>>>
>>> Alex
>>>
>>>>> ---
>>>>>
>>>>>  drivers/pci/quirks.c |   14 ++++++++++++++
>>>>>  1 file changed, 14 insertions(+)
>>>>>
>>>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>>>>> index 561e10d..ebbd5b4 100644
>>>>> --- a/drivers/pci/quirks.c
>>>>> +++ b/drivers/pci/quirks.c
>>>>> @@ -3029,6 +3029,20 @@ static void quirk_no_pm_reset(struct pci_dev *dev)
>>>>>  DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_ATI, PCI_ANY_ID,
>>>>>  			       PCI_CLASS_DISPLAY_VGA, 8, quirk_no_pm_reset);
>>>>>  
>>>>> +static void quirk_no_bus_reset(struct pci_dev *dev)
>>>>> +{
>>>>> +	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Atheros AR93xx chips do not behave after a bus reset.  The device will
>>>>> + * throw a Link Down error on AER capable system and regardless of AER,
>>>>> + * config space of the device is never accessible again and typically
>>>>> + * causes the system to hang or reset when access is attempted.
>>>>> + * http://www.spinics.net/lists/linux-pci/msg34797.html
>>>>> + */
>>>>> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATHEROS, 0x0030, quirk_no_bus_reset);
>>>>> +
>>>>>  #ifdef CONFIG_ACPI
>>>>>  /*
>>>>>   * Apple: Shutdown Cactus Ridge Thunderbolt controller.
>>>>>
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> 
> 
> 


  reply	other threads:[~2015-01-12 19:18 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-11-21 18:24 [PATCH 0/4] PCI: Reset exclusions Alex Williamson
2014-11-21 18:24 ` [PATCH 1/4] PCI: Allow device quirks to exclude D3->D0 PM reset Alex Williamson
2014-11-21 18:24 ` [PATCH 2/4] PCI: quirk AMD/ATI VGA cards to avoid " Alex Williamson
2014-11-21 19:00   ` Deucher, Alexander
2014-11-21 19:00     ` Deucher, Alexander
2014-11-21 18:24 ` [PATCH 3/4] PCI: Allow device quirks to exclude bus reset Alex Williamson
2014-11-21 18:24 ` [PATCH 4/4] PCI: quirk Atheros AR93xx to avoid " Alex Williamson
2014-12-26  7:56   ` Andreas Hartmann
2015-01-08 16:07   ` Bjorn Helgaas
2015-01-08 19:30     ` Alex Williamson
2015-01-08 23:10       ` Bjorn Helgaas
2015-01-12 15:20       ` Andreas Hartmann
2015-01-12 16:49         ` Alex Williamson
2015-01-12 19:15           ` Andreas Hartmann [this message]
2015-01-13  0:37             ` Bjorn Helgaas
2015-01-16  0:28 ` [PATCH 0/4] PCI: Reset exclusions Bjorn Helgaas
2015-01-16 16:15   ` Bjorn Helgaas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54B41D44.5080606@maya.org \
    --to=andihartmann@freenet.de \
    --cc=alex.williamson@redhat.com \
    --cc=bhelgaas@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.