linux-ide.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] ahci: don't ignore result code of ahci_reset_controller()
@ 2017-10-02 18:31 Ard Biesheuvel
  2017-10-02 18:54 ` Tejun Heo
  2017-10-02 19:21 ` Tejun Heo
  0 siblings, 2 replies; 4+ messages in thread
From: Ard Biesheuvel @ 2017-10-02 18:31 UTC (permalink / raw)
  To: linux-ide, tj
  Cc: graeme.gregory, leif.lindholm, daniel.thompson, Ard Biesheuvel

ahci_pci_reset_controller() calls ahci_reset_controller(), which may
fail, but ignores the result code and always returns success. This
may result in failures like below

  ahci 0000:02:00.0: version 3.0
  ahci 0000:02:00.0: enabling device (0000 -> 0003)
  ahci 0000:02:00.0: SSS flag set, parallel bus scan disabled
  ahci 0000:02:00.0: controller reset failed (0xffffffff)
  ahci 0000:02:00.0: failed to stop engine (-5)
    ... repeated many times ...
  ahci 0000:02:00.0: failed to stop engine (-5)
  Unable to handle kernel paging request at virtual address ffff0000093f9018
    ...
  PC is at ahci_stop_engine+0x5c/0xd8 [libahci]
  LR is at ahci_deinit_port.constprop.12+0x1c/0xc0 [libahci]
    ...
  [<ffff000000a17014>] ahci_stop_engine+0x5c/0xd8 [libahci]
  [<ffff000000a196b4>] ahci_deinit_port.constprop.12+0x1c/0xc0 [libahci]
  [<ffff000000a197d8>] ahci_init_controller+0x80/0x168 [libahci]
  [<ffff000000a260f8>] ahci_pci_init_controller+0x60/0x68 [ahci]
  [<ffff000000a26f94>] ahci_init_one+0x75c/0xd88 [ahci]
  [<ffff000008430324>] local_pci_probe+0x3c/0xb8
  [<ffff000008431728>] pci_device_probe+0x138/0x170
  [<ffff000008585e54>] driver_probe_device+0x2dc/0x458
  [<ffff0000085860e4>] __driver_attach+0x114/0x118
  [<ffff000008583ca8>] bus_for_each_dev+0x60/0xa0
  [<ffff000008585638>] driver_attach+0x20/0x28
  [<ffff0000085850b0>] bus_add_driver+0x1f0/0x2a8
  [<ffff000008586ae0>] driver_register+0x60/0xf8
  [<ffff00000842f9b4>] __pci_register_driver+0x3c/0x48
  [<ffff000000a3001c>] ahci_pci_driver_init+0x1c/0x1000 [ahci]
  [<ffff000008083918>] do_one_initcall+0x38/0x120

where an obvious hardware level failure results in an unnecessary 15 second
delay and a subsequent crash.

So record the result code of ahci_reset_controller() and relay it, rather
than ignoring it.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 drivers/ata/ahci.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 5a5fd0b404eb..649e799df9c1 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -621,8 +621,11 @@ static void ahci_pci_save_initial_config(struct pci_dev *pdev,
 static int ahci_pci_reset_controller(struct ata_host *host)
 {
 	struct pci_dev *pdev = to_pci_dev(host->dev);
+	int rc;
 
-	ahci_reset_controller(host);
+	rc = ahci_reset_controller(host);
+	if (rc)
+		return rc;
 
 	if (pdev->vendor == PCI_VENDOR_ID_INTEL) {
 		struct ahci_host_priv *hpriv = host->private_data;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] ahci: don't ignore result code of ahci_reset_controller()
  2017-10-02 18:31 [PATCH] ahci: don't ignore result code of ahci_reset_controller() Ard Biesheuvel
@ 2017-10-02 18:54 ` Tejun Heo
  2017-10-02 18:59   ` Ard Biesheuvel
  2017-10-02 19:21 ` Tejun Heo
  1 sibling, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2017-10-02 18:54 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: linux-ide, graeme.gregory, leif.lindholm, daniel.thompson

Hello, Ard.

On Mon, Oct 02, 2017 at 07:31:24PM +0100, Ard Biesheuvel wrote:
> ahci_pci_reset_controller() calls ahci_reset_controller(), which may
> fail, but ignores the result code and always returns success. This
> may result in failures like below
> 
>   ahci 0000:02:00.0: version 3.0
>   ahci 0000:02:00.0: enabling device (0000 -> 0003)
>   ahci 0000:02:00.0: SSS flag set, parallel bus scan disabled
>   ahci 0000:02:00.0: controller reset failed (0xffffffff)
>   ahci 0000:02:00.0: failed to stop engine (-5)
>     ... repeated many times ...
>   ahci 0000:02:00.0: failed to stop engine (-5)
>   Unable to handle kernel paging request at virtual address ffff0000093f9018
>     ...
>   PC is at ahci_stop_engine+0x5c/0xd8 [libahci]
>   LR is at ahci_deinit_port.constprop.12+0x1c/0xc0 [libahci]
>     ...
>   [<ffff000000a17014>] ahci_stop_engine+0x5c/0xd8 [libahci]
>   [<ffff000000a196b4>] ahci_deinit_port.constprop.12+0x1c/0xc0 [libahci]
>   [<ffff000000a197d8>] ahci_init_controller+0x80/0x168 [libahci]
>   [<ffff000000a260f8>] ahci_pci_init_controller+0x60/0x68 [ahci]
>   [<ffff000000a26f94>] ahci_init_one+0x75c/0xd88 [ahci]
>   [<ffff000008430324>] local_pci_probe+0x3c/0xb8
>   [<ffff000008431728>] pci_device_probe+0x138/0x170
>   [<ffff000008585e54>] driver_probe_device+0x2dc/0x458
>   [<ffff0000085860e4>] __driver_attach+0x114/0x118
>   [<ffff000008583ca8>] bus_for_each_dev+0x60/0xa0
>   [<ffff000008585638>] driver_attach+0x20/0x28
>   [<ffff0000085850b0>] bus_add_driver+0x1f0/0x2a8
>   [<ffff000008586ae0>] driver_register+0x60/0xf8
>   [<ffff00000842f9b4>] __pci_register_driver+0x3c/0x48
>   [<ffff000000a3001c>] ahci_pci_driver_init+0x1c/0x1000 [ahci]
>   [<ffff000008083918>] do_one_initcall+0x38/0x120
> 
> where an obvious hardware level failure results in an unnecessary 15 second
> delay and a subsequent crash.

I'm not sure the retries are necessarily bad and am hesitant to change
that part; however, we definitely wanna fix the crash.  How does
forwarding the error make the crash go away?  That sounds like we
aren't clearing something we should have cleared while offlining the
controller.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] ahci: don't ignore result code of ahci_reset_controller()
  2017-10-02 18:54 ` Tejun Heo
@ 2017-10-02 18:59   ` Ard Biesheuvel
  0 siblings, 0 replies; 4+ messages in thread
From: Ard Biesheuvel @ 2017-10-02 18:59 UTC (permalink / raw)
  To: Tejun Heo; +Cc: IDE-ML, Graeme Gregory, Leif Lindholm, Daniel Thompson

On 2 October 2017 at 19:54, Tejun Heo <tj@kernel.org> wrote:
> Hello, Ard.
>
> On Mon, Oct 02, 2017 at 07:31:24PM +0100, Ard Biesheuvel wrote:
>> ahci_pci_reset_controller() calls ahci_reset_controller(), which may
>> fail, but ignores the result code and always returns success. This
>> may result in failures like below
>>
>>   ahci 0000:02:00.0: version 3.0
>>   ahci 0000:02:00.0: enabling device (0000 -> 0003)
>>   ahci 0000:02:00.0: SSS flag set, parallel bus scan disabled
>>   ahci 0000:02:00.0: controller reset failed (0xffffffff)
>>   ahci 0000:02:00.0: failed to stop engine (-5)
>>     ... repeated many times ...
>>   ahci 0000:02:00.0: failed to stop engine (-5)
>>   Unable to handle kernel paging request at virtual address ffff0000093f9018
>>     ...
>>   PC is at ahci_stop_engine+0x5c/0xd8 [libahci]
>>   LR is at ahci_deinit_port.constprop.12+0x1c/0xc0 [libahci]
>>     ...
>>   [<ffff000000a17014>] ahci_stop_engine+0x5c/0xd8 [libahci]
>>   [<ffff000000a196b4>] ahci_deinit_port.constprop.12+0x1c/0xc0 [libahci]
>>   [<ffff000000a197d8>] ahci_init_controller+0x80/0x168 [libahci]
>>   [<ffff000000a260f8>] ahci_pci_init_controller+0x60/0x68 [ahci]
>>   [<ffff000000a26f94>] ahci_init_one+0x75c/0xd88 [ahci]
>>   [<ffff000008430324>] local_pci_probe+0x3c/0xb8
>>   [<ffff000008431728>] pci_device_probe+0x138/0x170
>>   [<ffff000008585e54>] driver_probe_device+0x2dc/0x458
>>   [<ffff0000085860e4>] __driver_attach+0x114/0x118
>>   [<ffff000008583ca8>] bus_for_each_dev+0x60/0xa0
>>   [<ffff000008585638>] driver_attach+0x20/0x28
>>   [<ffff0000085850b0>] bus_add_driver+0x1f0/0x2a8
>>   [<ffff000008586ae0>] driver_register+0x60/0xf8
>>   [<ffff00000842f9b4>] __pci_register_driver+0x3c/0x48
>>   [<ffff000000a3001c>] ahci_pci_driver_init+0x1c/0x1000 [ahci]
>>   [<ffff000008083918>] do_one_initcall+0x38/0x120
>>
>> where an obvious hardware level failure results in an unnecessary 15 second
>> delay and a subsequent crash.
>
> I'm not sure the retries are necessarily bad and am hesitant to change
> that part; however, we definitely wanna fix the crash.

It is not retrying anything. It is repeatedly trying to stop the
engine as part of the 'start engine' sequence, with which it proceeds
because the code in ahci_init_one() does not realize the reset has
failed.

> How does
> forwarding the error make the crash go away?  That sounds like we
> aren't clearing something we should have cleared while offlining the
> controller.
>

ahci_init_one() bails out if ahci_pci_reset_controller() returns a
failure. But as you can see, that code never does return a failure,
because it ignores the result code of ahci_reset_controller(). So the
'if (rc) return rc;' that follows in ahci_init_one() is essentially
dead code.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] ahci: don't ignore result code of ahci_reset_controller()
  2017-10-02 18:31 [PATCH] ahci: don't ignore result code of ahci_reset_controller() Ard Biesheuvel
  2017-10-02 18:54 ` Tejun Heo
@ 2017-10-02 19:21 ` Tejun Heo
  1 sibling, 0 replies; 4+ messages in thread
From: Tejun Heo @ 2017-10-02 19:21 UTC (permalink / raw)
  To: Ard Biesheuvel; +Cc: linux-ide, graeme.gregory, leif.lindholm, daniel.thompson

On Mon, Oct 02, 2017 at 07:31:24PM +0100, Ard Biesheuvel wrote:
> ahci_pci_reset_controller() calls ahci_reset_controller(), which may
> fail, but ignores the result code and always returns success. This
> may result in failures like below

Applied to libata/for-4.14-fixes.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-10-02 19:21 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-10-02 18:31 [PATCH] ahci: don't ignore result code of ahci_reset_controller() Ard Biesheuvel
2017-10-02 18:54 ` Tejun Heo
2017-10-02 18:59   ` Ard Biesheuvel
2017-10-02 19:21 ` Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).