sata_sil24 AMD64 crash/lockup

linux-ide.vger.kernel.org archive mirror

* sata_sil24 AMD64 crash/lockup
@ 2005-10-04 22:14 linux
  2005-10-05  4:02 ` linux
  2005-10-06  7:04 ` Tejun Heo
  0 siblings, 2 replies; 8+ messages in thread
From: linux @ 2005-10-04 22:14 UTC (permalink / raw)
  To: linux-ide; +Cc: linux

I'm trying to bring up a new AMD64 (uniprocessor) storage server with
3x Sil3132 PCIe SATA controllers running 6x Seagate 7200.8 drives.

Kernel 2.6.13.2 + 2.6.13-rc7-libata1.patch.bz2 + PPSkit-light

I'm having some problems with intermittent (every few days) crashes
which lock up the drives.  The machine is not yet in service, so
activity is pretty light, but I've been running zcav on a 6-way RAID-0
partition to keep it busy.  (350 MB/sec sustained is fun.)

I have twice seen an assert fail that I didn't manage to write down.
(The first time, I thought it was something I had done, and the second,
someone rebooted it before I got a chance.)

Both times, the keyboard was still operating and I could scroll back.
This last time, it was locked up hard and I could only get what was
on the screen.  Omitting the leading ffffffff from the kernel addresses,
and modulo any transcription errors, what I saw was:

802c1d50 do_unblank_screen+272
8012117e do_page_fault+1838
80133a9c call_console_drivers+76
801348c9 vprintk+601
80147739 autoremove_wake_function+9
8012ff23 wake_up_common+67
8010f22d error_exit+0
80340b3a ata_gen_fixed_sense+138
8034128a ata_scsi_qc_complete+106
8033ceaa ata_qc_complete+362
80342c15 sil24_interrupt+325
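
For anyone who wants to work with this trace mechanically, here is a minimal sketch that restores the omitted ffffffff prefix and splits each line into address, symbol, and offset. It assumes the offsets are decimal (as the transcription suggests) and that each address belongs to the named symbol at that offset; the names TRACE, LINE_RE, and parse_trace are illustrative only.

```python
import re

# Trace lines as transcribed above (leading ffffffff omitted).
TRACE = """\
802c1d50 do_unblank_screen+272
8012117e do_page_fault+1838
80133a9c call_console_drivers+76
801348c9 vprintk+601
80147739 autoremove_wake_function+9
8012ff23 wake_up_common+67
8010f22d error_exit+0
80340b3a ata_gen_fixed_sense+138
8034128a ata_scsi_qc_complete+106
8033ceaa ata_qc_complete+362
80342c15 sil24_interrupt+325
"""

# One frame per line: 8 hex digits, a space, then symbol+offset.
LINE_RE = re.compile(r"^([0-9a-f]{8}) (\w+)\+(\d+)$")


def parse_trace(text):
    """Return a list of (full_address, symbol, offset) tuples."""
    frames = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a frame line
        short_addr, symbol, offset = m.groups()
        # Restore the prefix that was dropped during transcription.
        full_addr = int("ffffffff" + short_addr, 16)
        # ASSUMPTION: offsets are decimal, as they appear in the report.
        frames.append((full_addr, symbol, int(offset)))
    return frames


frames = parse_trace(TRACE)
for addr, symbol, offset in frames:
    # Subtracting the offset gives the presumed start of the symbol.
    print(f"{addr:016x} {symbol}+{offset} (symbol start {addr - offset:016x})")
```

The symbol start addresses can then be compared against the System.map of the running kernel to catch transcription errors.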

I know there are more recent kernels and sata_sil24 code, but that's
the most current pair I could figure out how to fit together.

Trying the instructions at http://kernel.org/git/ gives me:

$ cg-clone http://www.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git
defaulting to local storage area
17:54:10 URL:http://www.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git/refs/heads/master [41/41] -> "refs/heads/origin" [1]
progress: 2 objects, 926 bytes
error: File ca442d313d86dc67e0a2e5d584b465bd382cbf5c (http://www.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git/objects/ca/442d313d86dc67e0a2e5d584b465bd382cbf5c) corrupt

Cannot obtain needed blob ca442d313d86dc67e0a2e5d584b465bd382cbf5c
while processing commit 0000000000000000000000000000000000000000.
cg-pull: objects pull failed
cg-clone: pull failed

In the meantime, I'll keep working to reproduce the problem.

Thanks for any hints!


* Re: sata_sil24 AMD64 crash/lockup
  2005-10-04 22:14 sata_sil24 AMD64 crash/lockup linux
@ 2005-10-05  4:02 ` linux
  2005-10-06  7:04 ` Tejun Heo
  1 sibling, 0 replies; 8+ messages in thread
From: linux @ 2005-10-05  4:02 UTC (permalink / raw)
  To: linux-ide; +Cc: linux

> I know there are more recent kernels and sata_sil24 code, but that's
> the most current pair I could figure out how to fit together.
> 
> Trying the instructions at http://kernel.org/git/ gives me:
...
> cg-clone: pull failed

Ah, solved.  I was using a truly ancient version of cogito.  I'm still
having a little bit of trouble (why does .git/branches contain only the
one file "origin"?), but reality is bearing a closer resemblance to the
various FAQs and HOWTOs now.


* Re: sata_sil24 AMD64 crash/lockup
  2005-10-04 22:14 sata_sil24 AMD64 crash/lockup linux
  2005-10-05  4:02 ` linux
@ 2005-10-06  7:04 ` Tejun Heo
  2005-10-06 16:48   ` Tejun Heo
  2005-10-06 19:10   ` linux
  1 sibling, 2 replies; 8+ messages in thread
From: Tejun Heo @ 2005-10-06  7:04 UTC (permalink / raw)
  To: linux; +Cc: linux-ide


  Hello, there.

linux@horizon.com wrote:
> I'm trying to bring up a new AMD64 (uniprocessor) storage server with
> 3x Sil3132 PCIe SATA controllers running 6x Seagate 7200.8 drives.
> 
> Kernel 2.6.13.2 + 2.6.13-rc7-libata1.patch.bz2 + PPSkit-light
> 
> I'm having some problems with intermittent (every few days) crashes
> which lock up the drives.  The machine is not yet in service, so
> activity is pretty light, but I've been running zcav on a 6-way RAID-0
> partition to keep it busy.  (350 MB/sec sustained is fun.)
> 
> I have twice seen an assert fail that I didn't manage to write down.
> (The first time, I thought it was something I had done, and the second,
> someone rebooted it before I got a chance.)
> 
> Both times, the keyboard was still operating and I could scroll back.
> This last time, it was locked up hard and I could only get what was
> on the screen.  Omitting the leading ffffffff from the kernel addresses,
> and modulo any transcription errors, what I saw was:
> 
> 802c1d50 do_unblank_screen+272
> 8012117e do_page_fault+1838
> 80133a9c call_console_drivers+76
> 801348c9 vprintk+601
> 80147739 autoremove_wake_function+9
> 8012ff23 wake_up_common+67
> 8010f22d error_exit+0
> 80340b3a ata_gen_fixed_sense+138
> 8034128a ata_scsi_qc_complete+106
> 8033ceaa ata_qc_complete+362
> 80342c15 sil24_interrupt+325
> 

  Oh.. this is because of the missing tf_read callback, which is called from
ata_gen_fixed_sense().  I'll try to figure out what to do about it a bit
later.  I gotta leave for English lessons now.  I'm already a bit late.

  However, your kernel hitting that path means that some drives are
actually generating errors.  Let's see about that after we get the tf_read
thing fixed.

  Damn. I'm really late. :-)

> I know there are more recent kernels and sata_sil24 code, but that's
> the most current pair I could figure out how to fit together.
> 
> Trying the instructions at http://kernel.org/git/ gives me:
> 
> $ cg-clone http://www.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git
> defaulting to local storage area
> 17:54:10 URL:http://www.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git/refs/heads/master [41/41] -> "refs/heads/origin" [1]
> progress: 2 objects, 926 bytes
> error: File ca442d313d86dc67e0a2e5d584b465bd382cbf5c (http://www.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git/objects/ca/442d313d86dc67e0a2e5d584b465bd382cbf5c) corrupt
> 
> Cannot obtain needed blob ca442d313d86dc67e0a2e5d584b465bd382cbf5c
> while processing commit 0000000000000000000000000000000000000000.
> cg-pull: objects pull failed
> cg-clone: pull failed
> 
> 
> In the meantime, I'll keep working to reproduce the problem.
> 
> Thanks for any hints!


-- 
tejun


* Re: sata_sil24 AMD64 crash/lockup
  2005-10-06  7:04 ` Tejun Heo
@ 2005-10-06 16:48   ` Tejun Heo
  2005-10-10 17:40     ` linux
  2005-10-06 19:10   ` linux
  1 sibling, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2005-10-06 16:48 UTC (permalink / raw)
  To: linux; +Cc: linux-ide

 Hi,

 I posted the patch to implement sil24 ->tf_read callback (and cc'd to
you) just now.  It should be applied on top of three patches I posted
the other day.

http://marc.theaimsgroup.com/?l=linux-ide&m=112856662723993&w=2
http://marc.theaimsgroup.com/?l=linux-ide&m=112856662716013&w=2
http://marc.theaimsgroup.com/?l=linux-ide&m=112856662708456&w=2

 After applying all four patches (above three plus the one I cc'd to
you), when you hit a SATA error, you'll get a proper report.  And we
should be able to debug that from there.

 Thanks.

-- 
tejun


* Re: sata_sil24 AMD64 crash/lockup
  2005-10-06  7:04 ` Tejun Heo
  2005-10-06 16:48   ` Tejun Heo
@ 2005-10-06 19:10   ` linux
  1 sibling, 0 replies; 8+ messages in thread
From: linux @ 2005-10-06 19:10 UTC (permalink / raw)
  To: htejun; +Cc: linux, linux-ide

>  Damn. I'm really late. :-)

Well, thank you for taking the trouble to reply!

>  However, your kernel hitting that path means that some drives are
> actually generating errors.  Let's see about that after we get the tf_read
> thing fixed.

Given that I'm reading about 2e+14 bits/day, and it takes a couple of
days to reproduce the error, the occasional glitch is not too surprising.
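
The 2e+14 figure can be checked with a short calculation. This is a sketch only: it assumes the 350 MB/sec sustained rate quoted at the start of the thread holds around the clock and that "MB" means 10^6 bytes; the function name bits_per_day is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def bits_per_day(mb_per_sec, bytes_per_mb=10**6):
    """Bits transferred in one day at a constant rate in MB/sec."""
    # ASSUMPTION: the rate is sustained for the whole day.
    return mb_per_sec * bytes_per_mb * 8 * SECONDS_PER_DAY


daily = bits_per_day(350)
print(f"{daily:.2e} bits/day")
print(f"{2 * daily:.2e} bits over two days")
```

At that rate the total read before a failure shows up is on the order of several times 10^14 bits, which is why an occasional drive-level glitch would not be surprising.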

Thanks for the patches; I'm trying to see if I can get them by git
right now, but they appear to not be on kernel.org yet, so I guess
I'll have to try by hand.

Results in a bit!


* Re: sata_sil24 AMD64 crash/lockup
  2005-10-06 16:48   ` Tejun Heo
@ 2005-10-10 17:40     ` linux
  0 siblings, 0 replies; 8+ messages in thread
From: linux @ 2005-10-10 17:40 UTC (permalink / raw)
  To: htejun, linux; +Cc: linux-ide

>  After applying all four patches (above three plus the one I cc'd to
> you), when you hit a SATA error, you'll get a proper report.  And we
> should be able to debug that from there.

Just an update - I've been running continuous disk activity for the last
several days, and no error reports yet.

So sata_sil24 is looking pretty good.  (I'm looking forward to NCQ!)


* Re: sata_sil24 AMD64 crash/lockup
       [not found] <434F9F92.7060805@gmail.com>
@ 2005-10-26 19:18 ` linux
  2005-11-02 18:58   ` linux
  0 siblings, 1 reply; 8+ messages in thread
From: linux @ 2005-10-26 19:18 UTC (permalink / raw)
  To: htejun; +Cc: linux, linux-ide

It finally puked!  Unfortunately, it did NOT survive and while it's less of
a hard lock-up, the root file system is unhappy and I can't log in and copy
the log files, so this is copied by hand.

(Which is a bit odd, because I have the root file system on a RAID-10
such that no controller is critical.  I.e. the mirror pairs are sdb/sdc,
sdd/sde, and sdf/sda.  If the drive just returned -EIO, then the md
layer could drop the drive and use the other mirror.)
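
The "no controller is critical" property can be checked with a small sketch. The mirror pairs are taken from the paragraph above; the drive-to-controller mapping is an assumption (two ports per Sil3132, drives attached in order, so sda/sdb, sdc/sdd, and sde/sdf share a controller), and the names MIRROR_PAIRS, CONTROLLER_OF, and pairs_lost are illustrative.

```python
# Mirror pairs as described above.
MIRROR_PAIRS = [("sdb", "sdc"), ("sdd", "sde"), ("sdf", "sda")]

# ASSUMPTION: each controller has two ports and the drives are attached
# in order, so consecutive drive letters share a controller.
CONTROLLER_OF = {
    "sda": 0, "sdb": 0,
    "sdc": 1, "sdd": 1,
    "sde": 2, "sdf": 2,
}


def pairs_lost(failed_controller):
    """Return the mirror pairs that lose both halves if a controller dies."""
    return [
        pair for pair in MIRROR_PAIRS
        if all(CONTROLLER_OF[drive] == failed_controller for drive in pair)
    ]


for ctrl in sorted(set(CONTROLLER_OF.values())):
    lost = pairs_lost(ctrl)
    verdict = "array survives" if not lost else f"lost pairs: {lost}"
    print(f"controller {ctrl} fails: {verdict}")
```

Under that assumed mapping every mirror pair spans two controllers, so losing any single controller should leave one half of each pair available, provided the failed drives return errors instead of hanging.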

I have an entire scrollback buffer full of a mixture of ata3 and ata4
error messages.

The two errors are almost identical except for the "Info fld" value,
whose value is a consistent function of the controller number.

ata3: command timeout
ata3: status=0x50 { DriveReady SeekComplete }
sdc: Current: sense key: No Sense
    Additional sense: No additional sense information
Info fld=0x78af38
sata_sil24 ata3: resetting controller

ata4: command timeout
ata4: status=0x50 { DriveReady SeekComplete }
sdd: Current: sense key: No Sense
    Additional sense: No additional sense information
Info fld=0xeb9712
sata_sil24 ata4: resetting controller
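
For reference, the status byte in these messages can be decoded by hand. The sketch below uses the standard ATA status register bit positions; the label strings are chosen to match the ones in the log above, and the names STATUS_BITS and decode_status are illustrative, not taken from the driver.

```python
# ATA status register bits, most significant first.
# Labels are modeled on the strings seen in the log messages.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for mask, name in STATUS_BITS if status & mask]


print(decode_status(0x50))
```

Note that 0x50 has neither the Error nor the Busy bit set, which is consistent with the "No Sense" sense key: the reported status describes an idle, ready drive even though the command timed out.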

If it matters, the order of the errors I can see in the
scrollback buffer (but there are clearly more off the top) is
343434343444444433343434344444443.

I will hold the machine in the current state for a couple of hours
in case there's any more debug information that can be retrieved
from the console.

Anyway, thanks a lot for any clues.  I'd really like to push this
machine into production, but it's a bit difficult right now.

Interestingly, the machine was not running any stress tests at the time
of the problem, but pcbackup may have been doing its thing, and it was
recently power-cycled (taken down as non-essential for Monday's hurricane).


* Re: sata_sil24 AMD64 crash/lockup
  2005-10-26 19:18 ` linux
@ 2005-11-02 18:58   ` linux
  0 siblings, 0 replies; 8+ messages in thread
From: linux @ 2005-11-02 18:58 UTC (permalink / raw)
  To: htejun; +Cc: linux-ide, linux

An additional data point: it doesn't appear to have failed "fast", given
that it left the XFS file system in a confused state.
See http://marc.theaimsgroup.com/?l=linux-xfs&m=113081816804458

So while it did lock up and refuse to proceed further, at least one
write got reordered, misdirected, or miswritten as part of the locking-up
procedure.

(Actually, theoretically it could have been a read which returned bad
metadata and led to an incorrect following write, but I would expect
a 150G file system on a 2G machine to have all such core journaling
metadata cached, so there should be no such read.)

As before, thanks for any suggestions or patches.

I'm now attempting to repair the XFS file system and move to a
post-2.6.14 libata kernel.

