System reboots after insertion and removal of disks in 2.6.18 kernel

All of lore.kernel.org
 help / color / mirror / Atom feed

* System reboots after insertion and removal of disks in 2.6.18 kernel
@ 2008-03-27 17:46 Sagar Borikar
  0 siblings, 0 replies; 4+ messages in thread
From: Sagar Borikar @ 2008-03-27 17:46 UTC (permalink / raw)
  To: linux-scsi@vger.kernel.org

Hello folks,

I am currently working on NAS which has sil 3114 SATA controller.There is some strange scenario reported by product validation team.When I insert the drive and remove immediately without settling down,
the system gets reset after roughly 30 seconds. Tried to capture the log from drivers but couldn't get any of the stack dump or kernel panic in due course. I am using 2.6.18 kernel and sata_sil is enabled.
Rest functionality works pretty fine. But only when I do insert and remove without time gap, the system resets. Strange thing is when I insert the disk and remove it back immediately the interrupt line asserted is only for insert and not for removal.  Sata controller is not getting interrupt for second immediate drive removal. Hence I guess that the interrupt line should be level triggered. Now based on the logs captured, I can say that in this typical case, sata controller first gets the request to handle drive insertion. It waits for some time to check the status to ensure that it is proper request and after that it again reads the line. It finds that drive is removed till that time. But actually SATA controller is not detecting the remove instance as it is not reflected in GPIO transiti
 on as well. So I get messages like COMRESET Failed and hard reset failed. This doesn't happen if I insert back the drive immediately. The system immediately recovers. 

Attaching the log

After only inserting drive:
( No issues here: )

?[root@NAS00180001310e ~]# Drive 2 inserted
ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata2: hard resetting port
ata2: port is slow to respond, please be patient
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2.00: ATA-7, max UDMA/133, 490234752 sectors: LBA48 NCQ (depth
0/32)
ata2.00: ata2: dev 0 multi count 0
ata2.00: configured for UDMA/100
ata2: EH complete
 Vendor: ATA       Model: WDC WD2500YS-01S  Rev: 20.0
 Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdd: 490234752 512-byte hdwr sectors (251000 MB)
sdd: Write Protect is off
SCSI device sdd: drive cache: write back
SCSI device sdd: 490234752 512-byte hdwr sectors (251000 MB)
sdd: Write Protect is off
SCSI device sdd: drive cache: write back
 sdd:
sd 1:0:0:0: Attached scsi disk sdd
sd 1:0:0:0: Attached scsi generic sg3 type 0

After insertion and immediate removal: Reboot:

?[root@NAS00180001310e ~]# Drive 2 inserted
ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata2: hard resetting port
ata2: port is slow to respond, please be patient
ata2: port failed to respond (30 secs)  ---------------------> At this state, actually the drive is removed. But not detected. 
ata2: COMRESET failed (device not ready)
ata2: hardreset failed, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 310)
ata2: EH complete

?
PMON2000 MIPS Initializing. Standby...
ERRORPC=bfc00004 CONFIG=0042e4bb STATUS=00400000
CPU PRID 000034c1, MaskID 00001320
Initializing caches...done (CONFIG=0042e4bb)
Switching to runtime address map...done
Setting up SDRAM controller: sdram config 0x80010000
master clock 100 Mhz, MulFundBIU 0x02, DivXSDRAM 0x02
sdram freq 0x09ef21aa hz, sdram period: 0x06 nsec
dimm0: density 256Mbit, width 16, single-sided, unbuffered, size
0x08000000
 supported CAS latency: 2.5 2, using 2.5 cycles, byte18=0x0c
 RAS to CAS delay (tRCD) 0x12 nsec, byte29=0x

Any pointers are really appreciated.

Thanks in advance,
Sagar

^ permalink raw reply	[flat|nested] 4+ messages in thread

* System reboots after insertion and removal of disks in 2.6.18 kernel
@ 2008-03-29  2:48 Sagar Borikar
  2008-03-29 13:50 ` Tejun Heo
  0 siblings, 1 reply; 4+ messages in thread
From: Sagar Borikar @ 2008-03-29  2:48 UTC (permalink / raw)
  To: linux-ide

Hello folks,
I am currently working on NAS which has sil 3114 SATA controller.There
is some strange scenario reported by product validation team.When I
insert the drive and remove immediately without settling down,
the system gets reset after roughly 30 seconds. Tried to capture the
log from drivers but couldn't get any of the stack dump or kernel
panic in due course. I am using 2.6.18 kernel and sata_sil is enabled.
Rest functionality works pretty fine. But only when I do insert and
remove without time gap, the system resets. Strange thing is when I
insert the disk and remove it back immediately the interrupt line
asserted is only for insert and not for removal.  But if I insert
another disk, then this interrupt is recognised properly. Sata
controller is not getting interrupt for second immediate drive
removal.Now based on the logs captured, I can say that in this typical
case, sata controller first gets the request to handle drive
insertion. It waits for some time to check the status to ensure that
it is proper request and after that it again reads the line. It finds
that drive is removed till that time. But actually SATA controller is
not detecting the remove instance as it is not reflected in GPIO
transition as well. So I get messages like COMRESET Failed and hard
reset failed. This doesn't happen if I insert back the drive
immediately. The system immediately recovers.
Attaching the log
After only inserting drive:
( No issues here: )
?[root@NAS00180001310e ~]# Drive 2 inserted
ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata2: hard resetting port
ata2: port is slow to respond, please be patient
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2.00: ATA-7, max UDMA/133, 490234752 sectors: LBA48 NCQ (depth
0/32)
ata2.00: ata2: dev 0 multi count 0
ata2.00: configured for UDMA/100
ata2: EH complete
 Vendor: ATA       Model: WDC WD2500YS-01S  Rev: 20.0
 Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdd: 490234752 512-byte hdwr sectors (251000 MB)
sdd: Write Protect is off
SCSI device sdd: drive cache: write back
SCSI device sdd: 490234752 512-byte hdwr sectors (251000 MB)
sdd: Write Protect is off
SCSI device sdd: drive cache: write back
 sdd:
sd 1:0:0:0: Attached scsi disk sdd
sd 1:0:0:0: Attached scsi generic sg3 type 0
After insertion and immediate removal: Reboot:

?[root@NAS00180001310e ~]# Drive 2 inserted
ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata2: hard resetting port
ata2: port is slow to respond, please be patient
ata2: port failed to respond (30 secs)  ---------------------> At this
state, actually the drive is removed. But not detected.
ata2: COMRESET failed (device not ready)
ata2: hardreset failed, retrying in 5 secs
ata2: hard resetting port
ata2: SATA link down (SStatus 0 SControl 310)
ata2: EH complete
?
PMON2000 MIPS Initializing. Standby...
ERRORPC=bfc00004 CONFIG=0042e4bb STATUS=00400000
CPU PRID 000034c1, MaskID 00001320
Initializing caches...done (CONFIG=0042e4bb)
Switching to runtime address map...done
Setting up SDRAM controller: sdram config 0x80010000
master clock 100 Mhz, MulFundBIU 0x02, DivXSDRAM 0x02
sdram freq 0x09ef21aa hz, sdram period: 0x06 nsec
dimm0: density 256Mbit, width 16, single-sided, unbuffered, size
0x08000000
 supported CAS latency: 2.5 2, using 2.5 cycles, byte18=0x0c
 RAS to CAS delay (tRCD) 0x12 nsec, byte29=0x

Any pointers are really appreciated.
Thanks in advance,
Sagar

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: System reboots after insertion and removal of disks in 2.6.18 kernel
  2008-03-29  2:48 System reboots after insertion and removal of disks in 2.6.18 kernel Sagar Borikar
@ 2008-03-29 13:50 ` Tejun Heo
  2008-03-29 17:47   ` Sagar Borikar
  0 siblings, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2008-03-29 13:50 UTC (permalink / raw)
  To: Sagar Borikar; +Cc: linux-ide

Hello,

Sagar Borikar wrote:
> I am currently working on NAS which has sil 3114 SATA controller.There
> is some strange scenario reported by product validation team.When I
> insert the drive and remove immediately without settling down,
> the system gets reset after roughly 30 seconds. Tried to capture the
> log from drivers but couldn't get any of the stack dump or kernel
> panic in due course. I am using 2.6.18 kernel and sata_sil is enabled.
> Rest functionality works pretty fine. But only when I do insert and
> remove without time gap, the system resets. Strange thing is when I
> insert the disk and remove it back immediately the interrupt line
> asserted is only for insert and not for removal.  But if I insert
> another disk, then this interrupt is recognised properly. Sata
> controller is not getting interrupt for second immediate drive
> removal.Now based on the logs captured, I can say that in this typical
> case, sata controller first gets the request to handle drive
> insertion. It waits for some time to check the status to ensure that
> it is proper request and after that it again reads the line. It finds
> that drive is removed till that time. But actually SATA controller is
> not detecting the remove instance as it is not reflected in GPIO
> transition as well. So I get messages like COMRESET Failed and hard
> reset failed. This doesn't happen if I insert back the drive
> immediately. The system immediately recovers.

Okay, that was one long paragraph.  :-)

The behavior itself (sans triggering machine reset) is intended.  libata 
EH doesn't rely on the edge events (PHY status changed).  It relies on 
level state (PHY readiness) and as long as at least one PHY event is 
triggered after link status has changed, it doesn't care what polarity 
those events are or how many of them are.  That was the design decision 
made for robustness.

> ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
> ata2: hard resetting port
> ata2: port is slow to respond, please be patient
> ata2: port failed to respond (30 secs)  ---------------------> At this
> state, actually the drive is removed. But not detected.
> ata2: COMRESET failed (device not ready)
> ata2: hardreset failed, retrying in 5 secs
> ata2: hard resetting port
> ata2: SATA link down (SStatus 0 SControl 310)
> ata2: EH complete

This is a quite old kernel, right?  Recent ones take much shorter to 
detect the condition.

> PMON2000 MIPS Initializing. Standby...
> ERRORPC=bfc00004 CONFIG=0042e4bb STATUS=00400000
> CPU PRID 000034c1, MaskID 00001320
> Initializing caches...done (CONFIG=0042e4bb)
> Switching to runtime address map...done
> Setting up SDRAM controller: sdram config 0x80010000
> master clock 100 Mhz, MulFundBIU 0x02, DivXSDRAM 0x02
> sdram freq 0x09ef21aa hz, sdram period: 0x06 nsec
> dimm0: density 256Mbit, width 16, single-sided, unbuffered, size
> 0x08000000
>  supported CAS latency: 2.5 2, using 2.5 cycles, byte18=0x0c
>  RAS to CAS delay (tRCD) 0x12 nsec, byte29=0x

Okay, and the machine got reboot.  It's weird that the reset happens 
*after* EH is complete.  After EH complete is printed, libata won't 
touch the hardware.  I'm sorry but I don't have any clue why the machine 
is getting rebooted.  Does the machine reset on oops?

-- 
tejun

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: System reboots after insertion and removal of disks in 2.6.18 kernel
  2008-03-29 13:50 ` Tejun Heo
@ 2008-03-29 17:47   ` Sagar Borikar
  0 siblings, 0 replies; 4+ messages in thread
From: Sagar Borikar @ 2008-03-29 17:47 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-ide

[-- Attachment #1: Type: text/plain, Size: 3220 bytes --]

Thanks for your prompt reply Tejun.

On Sat, Mar 29, 2008 at 7:20 PM, Tejun Heo <htejun@gmail.com> wrote:
> Hello,
>
> Sagar Borikar wrote:

> > transition as well. So I get messages like COMRESET Failed and hard
> > reset failed. This doesn't happen if I insert back the drive
> > immediately. The system immediately recovers.
>
> Okay, that was one long paragraph.  :-)
>
> The behavior itself (sans triggering machine reset) is intended.  libata
> EH doesn't rely on the edge events (PHY status changed).  It relies on
> level state (PHY readiness) and as long as at least one PHY event is
> triggered after link status has changed, it doesn't care what polarity
> those events are or how many of them are.  That was the design decision
> made for robustness.

I understand. But the issue is, if I insert another drive, that event
gets detected. Only remove
event is not getting detected. So I was wondering if somehow I am able
to make the remove events detected, I can go ahead.Also digging
further in the code.
As expected in insert_remove action, the dev->class becomes
ATA_UNKNOWN and hence the ata_eh_revalidate_and_attach function
doesn't execute the following if condition
"action & ATA_EH_REVALIDATE && ata_dev_ready"
I am attaching two logs with remove and insert_remove files which
indicates the flow of the sequence in these two paths. IF you could
browse through that, would be great.

> > ata2: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
> > ata2: hard resetting port
> > ata2: port is slow to respond, please be patient
> > ata2: port failed to respond (30 secs)  ---------------------> At this
> > state, actually the drive is removed. But not detected.
> > ata2: COMRESET failed (device not ready)
> > ata2: hardreset failed, retrying in 5 secs
> > ata2: hard resetting port
> > ata2: SATA link down (SStatus 0 SControl 310)
> > ata2: EH complete
>
> This is a quite old kernel, right?  Recent ones take much shorter to
> detect the condition.

That's right. It is 2.6.18

> > PMON2000 MIPS Initializing. Standby...
> > ERRORPC=bfc00004 CONFIG=0042e4bb STATUS=00400000
> > CPU PRID 000034c1, MaskID 00001320
> > Initializing caches...done (CONFIG=0042e4bb)
> > Switching to runtime address map...done
> > Setting up SDRAM controller: sdram config 0x80010000
> > master clock 100 Mhz, MulFundBIU 0x02, DivXSDRAM 0x02
> > sdram freq 0x09ef21aa hz, sdram period: 0x06 nsec
> > dimm0: density 256Mbit, width 16, single-sided, unbuffered, size
> > 0x08000000
> >  supported CAS latency: 2.5 2, using 2.5 cycles, byte18=0x0c
> >  RAS to CAS delay (tRCD) 0x12 nsec, byte29=0x
>
> Okay, and the machine got reboot.  It's weird that the reset happens
> *after* EH is complete.  After EH complete is printed, libata won't
> touch the hardware.  I'm sorry but I don't have any clue why the machine
> is getting rebooted.  Does the machine reset on oops?

Also it happens after say 1 to 1.5 minutes. If I insert drive within
this duration, reset doesn't happen. Also if I insert in any other
slot, reset doesn't happen. Only after immediate removal of the disk,
the reset happens.
Is there any way by which I can make the insertion event edge triggered?

Thanks in advance
Sagar
> --
> tejun
>

[-- Attachment #2: remove.txt --]
[-- Type: text/plain, Size: 2407 bytes --]

[root@NAS00180001310e ~]# sil_host_intr:1
sil_host_intr:4 ata_port_freeze
sil_freeze start
sil_freeze end
ata1 port frozen
ata_bmdma_error_handler: start
ata_do_eh : ata_eh_autopsy
ata_eh_autopsy : start
ata_do_eh : ata_eh_report
ata1: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0x2 frozen
ata_do_eh : ata_eh_recover
ata_eh_detach_dev : start
ata_eh_detach_dev : ata_eh_prep_resume
ata_eh_detach_dev : ata_eh_skip_recovery
sil_freeze start
sil_freeze end
ata1 port frozen
ata_eh_detach_dev : ata_eh_reset
ata_std_prereset: start
ata1: hard resetting port
ata_port_offline : port is offline
sata_print_link_status:start
ata1: SATA link down (SStatus 0 SControl 310)
sata_print_link_status:end
ata_eh_detach_dev : ata_eh_thaw_port
ata_eh_detach_dev : ata_eh_revalidate_and_attach
ata_port_offline : port is offline
ata1: failed to recover some devices, retrying in 5 secs
ata_eh_detach_dev : ata_eh_prep_resume
ata_eh_detach_dev : ata_eh_skip_recovery
sil_freeze start
sil_freeze end
ata1 port frozen
ata_eh_detach_dev : ata_eh_reset
ata_std_prereset: start
ata1: hard resetting port
lcdout: error response: ng
ata_port_offline : port is offline
sata_print_link_status:start
ata1: SATA link down (SStatus 0 SControl 310)
sata_print_link_status:end
ata_eh_detach_dev : ata_eh_thaw_port
ata_eh_detach_dev : ata_eh_revalidate_and_attach
ata_port_offline : port is offline
ata1: failed to recover some devices, retrying in 5 secs
ata_eh_detach_dev : ata_eh_prep_resume
ata_eh_detach_dev : ata_eh_skip_recovery
sil_freeze start
sil_freeze end
ata1 port frozen
ata_eh_detach_dev : ata_eh_reset
ata_std_prereset: start
ata1: hard resetting port
ata_port_offline : port is offline
sata_print_link_status:start
ata1: SATA link down (SStatus 0 SControl 310)
sata_print_link_status:end
ata_eh_detach_dev : ata_eh_thaw_port
ata_eh_detach_dev : ata_eh_revalidate_and_attach
ata_port_offline : port is offline
ata1.00: disabled
ata_port_offline : port is offline
ata_eh_detach_dev : start
ata_eh_detach_dev : start
ata_eh_detach_dev : ata_eh_prep_resume
ata_eh_detach_dev : ata_eh_skip_recovery
ata_eh_detach_dev : ata_eh_revalidate_and_attach
ata_eh_detach_dev : ata_eh_resume
ata_eh_detach_dev : ata_eh_suspend
ata_do_eh : ata_eh_finish
ata_eh_finish : start
ata_bmdma_error_handler: end
ata1: EH complete
ata1.00: detaching (SCSI 0:0:0:0)

[-- Attachment #3: insert_remove.txt --]
[-- Type: text/plain, Size: 999 bytes --]

[root@NAS00180001310e ~]# Drive 1 inserted
sil_host_intr:1
sil_host_intr:4 ata_port_freeze
sil_freeze start
sil_freeze end
ata1 port frozen
ata_bmdma_error_handler: start
ata_do_eh : ata_eh_autopsy
ata_eh_autopsy : start
ata_do_eh : ata_eh_report
ata1: exception Emask 0x10 SAct 0x0 SErr 0x50000 action 0x2 frozen
ata_do_eh : ata_eh_recover
ata_eh_detach_dev : start
ata_eh_detach_dev : ata_eh_prep_resume
ata_eh_detach_dev : ata_eh_skip_recovery
sil_freeze start
sil_freeze end
ata1 port frozen
ata_eh_detach_dev : ata_eh_reset
ata_std_prereset: start
ata1: hard resetting port
ata_port_offline : port is offline
sata_print_link_status:start
ata1: SATA link down (SStatus 0 SControl 310)
sata_print_link_status:end
ata_eh_detach_dev : ata_eh_thaw_port
ata_eh_detach_dev : ata_eh_revalidate_and_attach
ata_eh_detach_dev : ata_eh_resume
ata_eh_detach_dev : ata_eh_suspend
ata_do_eh : ata_eh_finish
ata_eh_finish : start
ata_bmdma_error_handler: end
ata1: EH complete

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2008-03-29 17:47 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-03-29  2:48 System reboots after insertion and removal of disks in 2.6.18 kernel Sagar Borikar
2008-03-29 13:50 ` Tejun Heo
2008-03-29 17:47   ` Sagar Borikar
  -- strict thread matches above, loose matches on Subject: below --
2008-03-27 17:46 Sagar Borikar

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.