linux-ide.vger.kernel.org archive mirror
* ata_check_status_mmio exception kernel panic
@ 2009-04-01  3:06 Sagar Borikar
  2009-04-01  3:08 ` Sagar Borikar
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Sagar Borikar @ 2009-04-01  3:06 UTC (permalink / raw)
  To: linux-ide, Jeff Garzik, Tejun Heo

Hi,

We are seeing random kernel panics on drive removal while I/O is in
progress on a RAID array. The panic is intermittent; it does not happen
on every removal. This is a MIPS-based production system running kernel
2.6.18. Unfortunately we can't upgrade the kernel, as the units are
already in the field.

Here is the log we captured:

Data bus error, epc == 80377358, ra == 80377384
Oops[#1]:
Cpu 0
$ 0   : 00000000 804d0024 c001e0c7 0001000b
$ 4   : 811a829c 811a8d5c 00000260 804d358c
$ 8   : 90008000 1000001f 00000000 852c4000
$12   : 87a2bb80 00006764 00000000 00000000
$16   : 811a8d5c 811a829c 811a829c 00000001
$20   : 80513d98 00000000 00000000 00000000
$24   : 00000000 2b0c2ba0
$28   : 80512000 80513be0 00000000 80377384
Hi    : 00000000
Lo    : 00000000
epc   : 80377358 ata_check_status_mmio+0x4/0x10     Not tainted
ra    : 80377384 ata_check_status+0x20/0x3c
Status: 90008003    KERNEL EXL IE
Cause : 0000201c
PrId  : 000034c1
Modules linked in: aes
Process swapper (pid: 0, threadinfo=80512000, task=80514fc8)
Stack : 00000000 803ff4bc 00000000 00000000 80377240 80364204 8703c6a8 805bccc8
        8011d1a8 80434840 811a829c 811a8ccc 8037731c 871bf660 00000001 805bccc8
        8703c6a8 00000000 8037068c 8538db80 805345d8 80513cd8 00000001 805ce368
        811a8348 00000001 80378e54 00000001 82560238 82560238 8011dfd8 00000001
        00010000 811a829c 811a829c c001e000 80378f60 871bf660 871bf660 00000000
        ...
Call Trace:
[<80377358>] ata_check_status_mmio+0x4/0x10
[<80377384>] ata_check_status+0x20/0x3c
[<80377240>] ata_tf_read_mmio+0x1c/0xd8
[<8037731c>] ata_tf_read+0x20/0x3c
[<8037068c>] ata_qc_complete+0xb4/0x128
[<80378e54>] ata_port_abort+0xc4/0x100
[<80378f60>] ata_port_freeze+0x54/0x78
[<8037b7b8>] sil_host_intr+0x208/0x220
[<8037b8a4>] sil_interrupt+0xd4/0x108
[<80145b58>] handle_IRQ_event+0x60/0xc8
[<80145c78>] __do_IRQ+0xb8/0x140
[<80104624>] do_IRQ+0x1c/0x34
[<80100ef8>] pmc_sequoia_pci_isr+0x3c/0x98
[<80100fc8>] do_extended_irq+0x74/0x80
[<80101070>] plat_irq_dispatch+0x9c/0xac
[<80102eb0>] ret_from_irq+0x0/0x10


Code: 03e00008  304200ff  8c820054 <90420000> 03e00008  304200ff  27bdffe8  afbf0010  8c82000c
Kernel panic - not syncing: Fatal exception in interrupt

We don't see this exception every time a drive is pulled out. A first
look at the log suggests the exception occurs because the bus address
being accessed is not valid.
Has anyone observed similar issues before?
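
For reference, the Cause value and the faulting instruction word in the
oops can be decoded by hand. The sketch below is a small user-space C
helper, not kernel code; the function and struct names are made up for
illustration. It relies only on the MIPS32 layout (ExcCode in Cause bits
6..2, standard I-type instruction fields). With the values from the log,
Cause 0000201c gives ExcCode 7 (bus error on a data load or store) and
the marked word 90420000 is a byte load (opcode 0x24, lbu) through
register $2, which the dump shows as c001e0c7.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* MIPS32 Cause register: ExcCode lives in bits 6..2. */
static unsigned cause_exccode(uint32_t cause)
{
	return (cause >> 2) & 0x1f;
}

/* Fields of a MIPS I-type instruction (loads, stores, immediates). */
struct mips_itype {
	unsigned opcode;	/* bits 31..26 */
	unsigned rs;		/* bits 25..21, base register for loads */
	unsigned rt;		/* bits 20..16, destination for loads */
	int16_t imm;		/* bits 15..0, signed offset */
};

static struct mips_itype decode_itype(uint32_t insn)
{
	struct mips_itype d;

	d.opcode = insn >> 26;
	d.rs = (insn >> 21) & 0x1f;
	d.rt = (insn >> 16) & 0x1f;
	d.imm = (int16_t)(insn & 0xffff);
	return d;
}

/* Print a one-line summary for a Cause value and an instruction word. */
static void describe_fault(uint32_t cause, uint32_t insn)
{
	struct mips_itype d = decode_itype(insn);

	assert(d.rs < 32 && d.rt < 32);
	printf("ExcCode=%u opcode=0x%02x rt=$%u base=$%u offset=%d\n",
	       cause_exccode(cause), d.opcode, d.rt, d.rs, d.imm);
}
```

Calling describe_fault(0x0000201c, 0x90420000) prints the decoded fields
for the values in this log. The preceding word, 8c820054, decodes the
same way as a word load (opcode 0x23) at offset 0x54 from $4.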

Thanks
Sagar

* Re: ata_check_status_mmio exception kernel panic
  2009-04-01  3:06 ata_check_status_mmio exception kernel panic Sagar Borikar
@ 2009-04-01  3:08 ` Sagar Borikar
  2009-04-01  3:37 ` Tejun Heo
  2009-04-03 17:56 ` Dustin Harrison
  2 siblings, 0 replies; 8+ messages in thread
From: Sagar Borikar @ 2009-04-01  3:08 UTC (permalink / raw)
  To: linux-ide, Jeff Garzik, Tejun Heo

I forgot to mention that this happens only with the SIL3114 SATA controller.

Sagar


* Re: ata_check_status_mmio exception kernel panic
  2009-04-01  3:06 ata_check_status_mmio exception kernel panic Sagar Borikar
  2009-04-01  3:08 ` Sagar Borikar
@ 2009-04-01  3:37 ` Tejun Heo
  2009-04-03 17:56 ` Dustin Harrison
  2 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2009-04-01  3:37 UTC (permalink / raw)
  To: Sagar Borikar; +Cc: linux-ide, Jeff Garzik

Hello,

Sagar Borikar wrote:
> We are facing random kernel panics on drive removal when IO is
> happening to RAID. Note that kernel panic is random and not every time
> it happens. This is mips based production system and kernel is 2.6.18.
> Unfortunately we can't upgrade the kernel as its on field.

2.6.18 is way too old, and the mentioned function was removed a long
time ago in favor of iomem support.  Can you please give a more recent
kernel a shot?

Thanks.

-- 
tejun

* Re: ata_check_status_mmio exception kernel panic
  2009-04-01  3:06 ata_check_status_mmio exception kernel panic Sagar Borikar
  2009-04-01  3:08 ` Sagar Borikar
  2009-04-01  3:37 ` Tejun Heo
@ 2009-04-03 17:56 ` Dustin Harrison
  2009-04-04  4:58   ` Tejun Heo
  2009-04-04 16:00   ` Alan Cox
  2 siblings, 2 replies; 8+ messages in thread
From: Dustin Harrison @ 2009-04-03 17:56 UTC (permalink / raw)
  To: Sagar Borikar; +Cc: linux-ide, Jeff Garzik, Tejun Heo

Sagar Borikar wrote:
> Hi,
>
> We are facing random kernel panics on drive removal when IO is
> happening to RAID. Note that kernel panic is random and not every time
> it happens. This is mips based production system and kernel is 2.6.18.
> Unfortunately we can't upgrade the kernel as its on field.
>
> Here is the log that we have got,
>
> Data bus error, epc == 80377358, ra == 80377384
> Oops[#1]:
> Cpu 0
> $ 0   : 00000000 804d0024 c001e0c7 0001000b
> $ 4   : 811a829c 811a8d5c 00000260 804d358c
> $ 8   : 90008000 1000001f 00000000 852c4000
> $12   : 87a2bb80 00006764 00000000 00000000
> $16   : 811a8d5c 811a829c 811a829c 00000001
> $20   : 80513d98 00000000 00000000 00000000
> $24   : 00000000 2b0c2ba0
> $28   : 80512000 80513be0 00000000 80377384
> Hi    : 00000000
> Lo    : 00000000
> epc   : 80377358 ata_check_status_mmio+0x4/0x10     Not tainted
> ra    : 80377384 ata_check_status+0x20/0x3c
> Status: 90008003    KERNEL EXL IE
> Cause : 0000201c
> PrId  : 000034c1
> Modules linked in: aes
> Process swapper (pid: 0, threadinfo=80512000, task=80514fc8)
> Stack : 00000000 803ff4bc 00000000 00000000 80377240 80364204 8703c6a8 805bccc8
>         8011d1a8 80434840 811a829c 811a8ccc 8037731c 871bf660 00000001 805bccc8
>         8703c6a8 00000000 8037068c 8538db80 805345d8 80513cd8 00000001 805ce368
>         811a8348 00000001 80378e54 00000001 82560238 82560238 8011dfd8 00000001
>         00010000 811a829c 811a829c c001e000 80378f60 871bf660 871bf660 00000000
>         ...
> Call Trace:
> [<80377358>] ata_check_status_mmio+0x4/0x10
> [<80377384>] ata_check_status+0x20/0x3c
> [<80377240>] ata_tf_read_mmio+0x1c/0xd8
> [<8037731c>] ata_tf_read+0x20/0x3c
> [<8037068c>] ata_qc_complete+0xb4/0x128
> [<80378e54>] ata_port_abort+0xc4/0x100
> [<80378f60>] ata_port_freeze+0x54/0x78
> [<8037b7b8>] sil_host_intr+0x208/0x220
> [<8037b8a4>] sil_interrupt+0xd4/0x108
>   
Hi Sagar,

I also run a MIPS platform and have seen this problem in 2.6.22.  It
stems from the Sil3512 (and possibly others) not allowing read access to
the taskfile registers while a DMA transfer is active: in my case, while
DMA_ENABLE is set the Sil3512 disallows reads of the taskfile registers.
So any event that triggers a port freeze during an interrupt while DMA
is active causes a bus error when the ata_check_status call fires.  I
cannot reproduce this on x86; I assume it handles the taskfile read
error differently.
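
The behaviour described above can be modelled in a few lines of
user-space C. This is only a sketch under stated assumptions, not driver
code: struct fake_port, FAKE_DMA_ENABLE and both functions are
hypothetical names, and the 0xff returned for a refused read is simply a
placeholder value for the model. The point it illustrates is that the
status read is only safe once the DMA enable bit has been cleared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_DMA_ENABLE 0x01	/* hypothetical stand-in for the DMA enable bit */

struct fake_port {
	uint8_t bmdma_cmd;	/* bus-master command register */
	uint8_t tf_status;	/* taskfile status register */
	unsigned bus_errors;	/* reads the modelled controller refused */
};

/* Model of a taskfile status read: refused while DMA is enabled. */
static uint8_t fake_check_status(struct fake_port *p)
{
	assert(p != NULL);
	if (p->bmdma_cmd & FAKE_DMA_ENABLE) {
		p->bus_errors++;	/* the case that panics on the MIPS boards */
		return 0xff;		/* placeholder value for a failed read */
	}
	return p->tf_status;
}

/* Model of the workaround: clear the DMA enable bit before any taskfile read. */
static void fake_stop_dma(struct fake_port *p)
{
	assert(p != NULL);
	p->bmdma_cmd &= (uint8_t)~FAKE_DMA_ENABLE;
}
```

In this model, calling fake_check_status() on a port with the bit set
increments bus_errors, while calling fake_stop_dma() first lets the same
read return tf_status.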

As a workaround I have used this patch on sata_sil.c to cover up the 
problem and stop the kernel panics.  But I don't think this is the best 
approach.

--- drivers/ata/sata_sil.c.orig 2009-04-01 18:15:55.000000000 -0700
+++ drivers/ata/sata_sil.c      2009-04-03 10:51:56.000000000 -0700
@@ -454,6 +454,23 @@
  err_hsm:
        qc->err_mask |= AC_ERR_HSM;
  freeze:
+
+       /* Before we do a port freeze we need to ensure DMA_ENABLE is off.
+        * This is because the controller will not give us access to the taskfile
+        * registers while a DMA is in progress and ata_qc_complete is the first
+        * function executed in ata_port_freeze. ata_port_freeze will attempt to
+        * access the tf registers and give us a host bus error kernel panic.
+        *
+        * This code is repeated from ata_bmdma_stop because we may not have a
+        * valid qc to pass to ata_bmdma_stop.
+        */
+       iowrite8(ioread8(ap->ioaddr.bmdma_addr) & ~SIL_DMA_ENABLE, ap->ioaddr.bmdma_addr);
+
+       /* According to ata_bmdma_stop, an HDMA transition requires one PIO cycle.
+        * But we can't read a taskfile register.
+        */
+       ioread8(ap->ioaddr.bmdma_addr);
+
        ata_port_freeze(ap);
 }

* Re: ata_check_status_mmio exception kernel panic
  2009-04-03 17:56 ` Dustin Harrison
@ 2009-04-04  4:58   ` Tejun Heo
  2009-04-07  2:59     ` Jeff Garzik
  2009-04-04 16:00   ` Alan Cox
  1 sibling, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2009-04-04  4:58 UTC (permalink / raw)
  To: Dustin Harrison; +Cc: Sagar Borikar, linux-ide, Jeff Garzik

Hello,

Dustin Harrison wrote:
> I also run a MIPS platform and have seen this problem in 2.6.22.  It
> stems from the Sil3512 (and possibly others) not allowing read
> access to the taskfile registers while a DMA transfer is active.
> What happens for me is that when DMA_ENABLE is true the Sil3512 (in
> my case) will disallow reads to the taskfile registers.  So any
> event that triggers a port freeze during an interrupt while DMA is
> active causes a bus error to be thrown when the ata_check_status
> call fires.

Thanks a lot for diagnosing the problem.

> I cannot reproduce this on x86.  I assume it handles the taskfile
> read error differently.

I think we just get 0xff on IO read errors.

> As a workaround I have used this patch on sata_sil.c to cover up the
> problem and stop the kernel panics.  But I don't think this is the
> best approach.
>
> --- drivers/ata/sata_sil.c.orig 2009-04-01 18:15:55.000000000 -0700
> +++ drivers/ata/sata_sil.c      2009-04-03 10:51:56.000000000 -0700
> @@ -454,6 +454,23 @@
>  err_hsm:
>        qc->err_mask |= AC_ERR_HSM;
>  freeze:
> +
> +       /* Before we do a port freeze we need to ensure DMA_ENABLE is off.
> +        * This is because the controller will not give us access to the taskfile
> +        * registers while a DMA is in progress and ata_qc_complete is the first
> +        * function executed in ata_port_freeze. ata_port_freeze will attempt to
> +        * access the tf registers and give us a host bus error kernel panic.
> +        *
> +        * This code is repeated from ata_bmdma_stop because we may not have a
> +        * valid qc to pass to ata_bmdma_stop.
> +        */
> +       iowrite8(ioread8(ap->ioaddr.bmdma_addr) & ~SIL_DMA_ENABLE, ap->ioaddr.bmdma_addr);
> +
> +       /* According to ata_bmdma_stop, an HDMA transition requires one PIO cycle.
> +        * But we can't read a taskfile register.
> +        */
> +       ioread8(ap->ioaddr.bmdma_addr);
> +
>        ata_port_freeze(ap);

Can you please move the logic to sil_freeze() and see whether it
works?  The freeze handler is supposed to put the controller into idle
(or at least not-crazy) state, so things like this fit there.

Thanks.

-- 
tejun


* Re: ata_check_status_mmio exception kernel panic
  2009-04-03 17:56 ` Dustin Harrison
  2009-04-04  4:58   ` Tejun Heo
@ 2009-04-04 16:00   ` Alan Cox
  2009-04-07  3:02     ` Jeff Garzik
  1 sibling, 1 reply; 8+ messages in thread
From: Alan Cox @ 2009-04-04 16:00 UTC (permalink / raw)
  To: Dustin Harrison; +Cc: Sagar Borikar, linux-ide, Jeff Garzik, Tejun Heo

> port freeze during an interrupt while DMA is active causes a bus error 
> to be thrown when the ata_check_status call fires.  I cannot reproduce 
> this on x86.  I assume it handles the taskfile read error differently.

I hit this problem with the HPT343 (which locks the bus if you do this).
For other devices where you get a PCI abort it is usually the case that
x86 platforms ignore them while on many other platforms they produce
exceptions.

In general I think we need to be sure all the core code paths stop DMA
before poking in the taskfile (other than the basic IRQ check which we
can't avoid)

Alan

* Re: ata_check_status_mmio exception kernel panic
  2009-04-04  4:58   ` Tejun Heo
@ 2009-04-07  2:59     ` Jeff Garzik
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Garzik @ 2009-04-07  2:59 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Dustin Harrison, Sagar Borikar, linux-ide

Tejun Heo wrote:
> Can you please move the logic to sil_freeze() and see whether it
> works?  The freeze handler is supposed to put the controller into idle
> (or at least not-crazy) state, so things like this fit there.

ata_qc_complete() is called before ->freeze(), and ata_qc_complete() 
needs taskfile access to fill the result TF.

I thought about ->pre_freeze(), but now I wonder if we shouldn't just 
call __ata_port_freeze() before ata_port_abort(), in ata_port_freeze()?
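
The two orderings can be compared with a toy user-space model in C. This
is a sketch only, not libata code: struct model_port and all four
functions are hypothetical stand-ins, and the model assumes that the
driver's ->freeze() hook also stops DMA, which this thread suggests is
not yet true for every controller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_port {
	bool dma_running;	/* controller still has DMA enabled */
	bool frozen;		/* hardware freeze hook has run */
	unsigned bad_tf_reads;	/* taskfile reads made while DMA was running */
};

/* Stand-in for the ->freeze() hook, assumed here to also stop DMA. */
static void model_hw_freeze(struct model_port *p)
{
	p->dma_running = false;
	p->frozen = true;
}

/* Stand-in for ata_port_abort(): completing a command reads the result TF. */
static void model_abort(struct model_port *p)
{
	if (p->dma_running)
		p->bad_tf_reads++;
}

/* Ordering described in this thread: abort first, then hardware freeze. */
static void freeze_abort_first(struct model_port *p)
{
	assert(p != NULL);
	model_abort(p);
	model_hw_freeze(p);
}

/* Ordering being proposed: hardware freeze first, then abort. */
static void freeze_hw_first(struct model_port *p)
{
	assert(p != NULL);
	model_hw_freeze(p);
	model_abort(p);
}
```

Starting from a port with dma_running set, freeze_abort_first() records
one taskfile read while DMA is still running, and freeze_hw_first()
records none. The benefit depends entirely on the freeze hook actually
disabling DMA.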

	Jeff




* Re: ata_check_status_mmio exception kernel panic
  2009-04-04 16:00   ` Alan Cox
@ 2009-04-07  3:02     ` Jeff Garzik
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Garzik @ 2009-04-07  3:02 UTC (permalink / raw)
  To: Alan Cox; +Cc: Dustin Harrison, Sagar Borikar, linux-ide, Tejun Heo

Alan Cox wrote:
>> port freeze during an interrupt while DMA is active causes a bus error 
>> to be thrown when the ata_check_status call fires.  I cannot reproduce 
>> this on x86.  I assume it handles the taskfile read error differently.
> 
> I hit this problem with the HPT343 (which locks the bus if you do this).
> For other devices where you get a PCI abort it is usually the case that
> x86 platforms ignore them while on many other platforms they produce
> exceptions.
> 
> In general I think we need to be sure all the core code paths stop DMA
> before poking in the taskfile (other than the basic IRQ check which we
> can't avoid)

I really think we are currently getting the ordering wrong, by calling 
ata_qc_complete() before we call the ->freeze() hook.

You can see sata_promise had problems with this, which led us to add DMA 
disabling in pdc_freeze()

But I think more generally, we seem to be missing a DMA-disable upon 
freeze, across a wide variety of controllers.

	Jeff



