sata_sil24 corruption details

linux-ide.vger.kernel.org archive mirror

* sata_sil24 corruption details
@ 2005-11-07  9:59 linux
  2005-11-07 16:15 ` Greg Freemyer
  2005-11-10  7:17 ` linux
  0 siblings, 2 replies; 23+ messages in thread
From: linux @ 2005-11-07  9:59 UTC (permalink / raw)
  To: linux-ide; +Cc: linux

I just compared the two halves of my RAID-1 mirrors and found something
very interesting...

sector 95958 of the two halves looks like:

 0000000: 9db4 87cf 4e2f cba7 c727 1feb 5f08 b7cf  ....N/...'.._...
 0000010: 9f7f 0d18 c4c1 b3b4 bffd 3579 6cfa d13d  ..........5yl..=
 0000020: d2c7 10eb 61ab 7dd7 d070 eb16 cb91 81bf  ....a.}..p......
 0000030: 839f 8067 f724 b4eb bf5f e2ff 8077 472f  ...g.$..._...wG/
 0000040: fcf7 cbb8 ab0e 3837 2359 8dfb 5225 9b4c  ......87#Y..R%.L
 0000050: ea7d c6d6 7df8 3f53 3ce3 4e33 98ee 3eff  .}..}.?S<.N3..>.
 0000060: 52b3 057e 9324 f71b 0d96 279a d9f5 654d  R..~.$....'...eM
 0000070: af9d 2bc7 e6eb 5585 b97d f187 f131 a364  ..+...U..}...1.d
 0000080: aef9 a464 cdcf 3b0b 5e83 35df a67e 683c  ...d..;.^.5..~h<
 0000090: 03e0 0a57 49bc e5fa 3501 8d2f becb 5ebd  ...WI...5../..^.
 00000a0: ccad fc7c 2756 d861 5548 ee39 41ff 1e13  ...|'V.aUH.9A...
 00000b0: 0693 a3ca 103c 0d25 918d 62e1 d1a7 8c22  .....<.%..b...."
 00000c0: a126 af84 5e6f c0f3 9567 8967 89a9 d7c2  .&..^o...g.g....
 00000d0: 90a0 68ce 0cde 1ec0 1652 3064 348e d7b0  ..h......R0d4...
 00000e0: cf0a f014 2a90 9143 6a62 b29a 3578 3ec0  ....*..Cjb..5x>.
 00000f0: fcf0 9a18 1bbd 208b 1468 9072 cc95 2ea8  ...... ..h.r....
 0000100: 9f02 573c f339 0348 dbc4 52b0 1f93 ffa4  ..W<.9.H..R.....
 0000110: 3bf8 6478 525a c509 ea41 0c8d 3c7c 7610  ;.dxRZ...A..<|v.
 0000120: 1ad6 02a3 769f 5b64 b066 aae9 f47a d463  ....v.[d.f...z.c
 0000130: 7839 1172 9622 5b54 975f f450 98a4 c733  x9.r."[T._.P...3
 0000140: b959 339e f47a d786 f0bd 4c7e 74f6 8f7b  .Y3..z....L~t..{
 0000150: 5d70 fc7b aa06 146c cea1 fbac ff33 d73f  ]p.{...l.....3.?
 0000160: 40cc f31f 30f1 5957 bffe 3b93 fbc1 ac68  @...0.YW..;....h
 0000170: 90fe 94bf 6770 ded7 17bf c77e 4be8 15af  ....gp.....~K...
-0000180: 4a2b 371e 8a1c baf5 7ab0 7998 84cb bfae  J+7.....z.y.....
+0000180: 6dd2 09ec b42b 0638 996e e914 7a7c d353  m....+.8.n..z|.S
 0000190: 0f5e e234 e488 997d 5564 a630 e7ad c3db  .^.4...}Ud.0....
 00001a0: b1f0 3a4b 4958 a9ac 7632 4edd 5d8d 60c3  ..:KIX..v2N.].`.
 00001b0: 6877 cf3c 26fb 50d2 fe3a 67f2 b69d a7be  hw.<&.P..:g.....
 00001c0: ee8b 39e9 a52d b3ee 8970 77e3 2b2b be13  ..9..-...pw.++..
 00001d0: 6abf 66eb 6b81 2319 185b 404a 8bef cee9  j.f.k.#..[@J....
 00001e0: 7efd 556e 93fc 5360 054d e436 d5f7 4774  ~.Un..S`.M.6..Gt
 00001f0: f5a3 a63a eb3c 6156 6eaf e23f eece 6450  ...:.<aVn..?..dP

Then sector 129547:
 0000000: 494e 41e8 0101 0002 0000 0065 0000 0069  INA........e...i
 0000010: 0000 0002 0000 0000 0000 0000 0000 00a1  ................
 0000020: 4358 32f1 361f 2dfc 4358 32f1 361f 2dfc  CX2.6.-.CX2.6.-.
 0000030: 4358 32f1 361f 2dfc 0000 0000 0000 0006  CX2.6.-.........
 0000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 0000050: 0000 0002 0000 0000 0000 0000 0000 0006  ................
 0000060: ffff ffff 0000 04a5 042a 0700 3066 5769  .........*..0fWi
 0000070: 6e4d 696c 02f7 1438 0900 4866 4461 7461  nMil...8..HfData
 0000080: 6261 7365 1203 3827 0600 6061 7474 7269  base..8'..`attri
 0000090: 6200 2a48 2200 0000 0001 0000 0000 003c  b.*H"..........<
 00000a0: c800 0080 0000 0000 0002 0000 0000 003c  ...............<
 00000b0: 0000 0000 0001 cc3f 0002 8000 0000 0039  .......?.......9
 00000c0: b000 0040 0000 0000 0003 0000 0000 0039  ...@...........9
 00000d0: a800 0040 0000 0000 0003 8000 0000 0039  ...@...........9
 00000e0: 9800 0040 0000 0000 0004 0000 0000 0039  ...@...........9
 00000f0: 9000 0040 ffff ffff 1801 0000 0000 0000  ...@............
 0000100: 494e 81a4 0102 0001 0000 0000 0000 0000  IN..............
-0000110: 0000 0001 0000 0000 0000 0000 0000 055a  ...............Z
-0000120: 435f 2276 096e bf0e 4345 8a5f 34e9 60ae  C_"v.n..CE._4.`.
+0000110: 0000 0001 0000 0000 0000 0000 0000 0557  ...............W
+0000120: 435e e888 066b 4474 4345 8a5f 34e9 60ae  C^...kDtCE._4.`.
 0000130: 4345 8a5f 34e9 60ae 0000 0000 0000 01f2  CE._4.`.........
 0000140: 0000 0000 0000 0001 0000 0000 0000 0001  ................
 0000150: 0000 0002 0000 0000 0000 0000 0000 0001  ................
 0000160: ffff ffff 0000 0000 0000 0000 0000 0017  ................
 0000170: caa0 0001 0000 0000 0000 0000 0000 0000  ................
 0000180: 7070 5100 0000 0000 ef1d 2dab aa2a 0000  ppQ.......-..*..
 0000190: 0051 1c00 0000 0000 1100 0000 0000 0000  .Q..............
 00001a0: 4895 5100 0000 0000 0000 0000 0000 0000  H.Q.............
 00001b0: 0051 2c80 ffff ffff 0600 0000 0000 0000  .Q,.............
 00001c0: cf8d 0200 0000 0000 0200 0000 0100 0000  ................
 00001d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00001e0: 7070 5100 0000 0000 031c 2dab aa2a 0000  ppQ.......-..*..
 00001f0: 80f8 0200 0000 0000 0100 0100 0000 0000  ................

And sector 195094:
 0000000: 494e 41e8 0102 0019 0000 0065 0000 0069  INA........e...i
 0000010: 0000 0019 0000 0000 0000 0000 0000 0002  ................
 0000020: 4355 94a0 0530 68f2 4355 94af 0434 3c90  CU...0h.CU...4<.
 0000030: 4355 94af 0434 3c90 0000 0000 0000 1000  CU...4<.........
 0000040: 0000 0000 0000 0001 0000 0000 0000 0001  ................
 0000050: 0000 0002 0000 0000 0000 0000 0000 0000  ................
 0000060: ffff ffff 0000 0000 0000 0000 0000 0023  ...............#
 0000070: b6e0 0001 5302 3100 0d09 0048 6652 4543  ....S.1....HfREC
 0000080: 5943 4c45 441a 17b4 3e0d 0060 664d 7920  YCLED...>..`fMy 
 0000090: 446f 6375 6d65 6e74 731e e758 000e 0078  Documents..X...x
 00000a0: 6650 726f 6772 616d 2046 696c 6573 0c1f  fProgram Files..
 00000b0: 9825 0000 0000 0000 0000 0000 0000 0000  .%..............
 00000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00000f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 0000100: 494e 41e8 0101 0002 0000 0065 0000 0069  INA........e...i
-0000110: 0000 0002 0000 0000 0000 0000 0000 0010  ................
-0000120: 435f 1c9a 0be9 b322 4355 9502 1f35 8dad  C_....."CU...5..
+0000110: 0000 0002 0000 0000 0000 0000 0000 000e  ................
+0000120: 435d 942d 069a 8c21 4355 9502 1f35 8dad  C].-...!CU...5..
 0000130: 4355 9502 1f35 8dad 0000 0000 0000 0062  CU...5.........b
 0000140: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 0000150: 0000 0002 0000 0000 0000 0000 0000 0001  ................
 0000160: ffff ffff 0500 141f 303a 0c00 3066 436f  ........0:..0fCo
 0000170: 6e6e 4f72 6967 2e30 3100 8554 090e 0048  nnOrig.01..T...H
 0000180: 6645 7843 746c 725f 4944 432e 3031 0085  fExCtlr_IDC.01..
 0000190: 5c05 0f00 6866 4672 616d 6545 785f 4944  \...hfFrameEx_ID
 00001a0: 432e 3031 0085 5c06 0a00 8866 646c 3266  C.01..\....fdl2f
 00001b0: 732e 6c6f 6700 855c 0706 00a0 6174 7472  s.log..\....attr
 00001c0: 6962 0085 5c08 0000 0000 0000 0000 0000  ib..\...........
 00001d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00001e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00001f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................

These short bursts of inconsistencies are alarming.  The number of error bits
doesn't look like RAM errors, and in any case, the machine has ECC
memory, not overclocked, and was burned in with memtest86+ and prime95
for multiple days to make sure RAM was reliable.
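
One way to locate short bursts like these is to compare the two mirror halves sector by sector and report only the byte ranges that differ. What follows is a minimal sketch, assuming both halves can be opened as binary streams; the function names are made up for illustration and do not come from any existing tool.

```python
SECTOR_SIZE = 512  # bytes per sector, matching the 0x200-byte dumps above


def differing_runs(a, b):
    """Return half-open (start, end) byte ranges where a and b disagree."""
    runs = []
    start = None
    length = max(len(a), len(b))
    for i in range(length):
        same = i < len(a) and i < len(b) and a[i] == b[i]
        if not same and start is None:
            start = i              # a differing run begins here
        elif same and start is not None:
            runs.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        runs.append((start, length))
    return runs


def diff_sectors(stream_a, stream_b, sector_size=SECTOR_SIZE):
    """Yield (sector_number, runs) for every sector that differs."""
    sector = 0
    while True:
        a = stream_a.read(sector_size)
        b = stream_b.read(sector_size)
        if not a and not b:
            break                  # both streams exhausted
        if a != b:
            yield sector, differing_runs(a, b)
        sector += 1
```

The streams would typically come from opening the two member partitions read-only in binary mode. A 16-byte burst at offset 0x180 of a sector would be reported as the single range (0x180, 0x190).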

FWIW, those sector numbers are relative to the partition, which itself starts on
raw disk sector 3903795, so the device-level sector numbers are:
3903795 + 95958 = 3999753 = 0x3D0809
3903795 + 129547 = 4033342 = 0x3D8B3E
3903795 + 195094 = 4098889 = 0x3E8B49
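
The same conversion can be written as a short Python sketch. The partition start and the three sector numbers are the ones quoted above; the helper name is only illustrative.

```python
PARTITION_START = 3903795  # raw-disk sector where the partition begins


def to_device_sector(partition_sector, partition_start=PARTITION_START):
    """Convert a partition-relative sector number to a raw-disk sector."""
    return partition_start + partition_sector


for rel in (95958, 129547, 195094):
    absolute = to_device_sector(rel)
    print(f"{PARTITION_START} + {rel} = {absolute} = 0x{absolute:X}")
```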

Hardware is single-core AMD64 processor, nForce chipset, generic motherboard,
2x1 GB ECC SDRAM, 3x Sil3132 SATA controllers, 6x 400 GB 7200.8 drives.


I finished "badblocks -b 4096 -c 65536 -s -v -w -t random" run on 350 G of one drive
without seeing problems, and am working on the other 5.
(In parallel, just to stress the driver.)

Does anyone have any recommended diagnostics for seeing whether a drive reliably
remembers data you give to it?  Ones that particularly abuse the disk driver?
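
For illustration, the write-then-verify idea behind "badblocks -w -t random" can be sketched in Python. This is not how badblocks is implemented; it is a simplified stand-in with made-up function names, and it should only ever be pointed at a scratch file or a device whose contents are disposable.

```python
import os
import random

BLOCK_SIZE = 4096  # bytes per test block, as in "badblocks -b 4096"


def pattern_block(seed, block_number, block_size=BLOCK_SIZE):
    """Return the deterministic pseudo-random contents of one block."""
    rng = random.Random(f"{seed}:{block_number}")
    return rng.getrandbits(block_size * 8).to_bytes(block_size, "little")


def write_pattern(path, n_blocks, seed):
    """Overwrite the first n_blocks blocks of path with the pattern."""
    with open(path, "wb") as f:
        for n in range(n_blocks):
            f.write(pattern_block(seed, n))
        f.flush()
        os.fsync(f.fileno())  # push the data out of the page cache


def verify_pattern(path, n_blocks, seed):
    """Return the block numbers whose contents no longer match."""
    bad = []
    with open(path, "rb") as f:
        for n in range(n_blocks):
            if f.read(BLOCK_SIZE) != pattern_block(seed, n):
                bad.append(n)
    return bad
```

One caveat: a read-back immediately after the write is likely to be served from the page cache rather than from the disk, so a meaningful test needs the cache dropped or bypassed between the two phases, or far more data than fits in RAM.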

Thanks!


* RE: sata_sil24 corruption details
@ 2005-11-07 16:05 SMALL, Timothy
  0 siblings, 0 replies; 23+ messages in thread
From: SMALL, Timothy @ 2005-11-07 16:05 UTC (permalink / raw)
  To: 'linux@horizon.com'; +Cc: linux-ide

Sounds good.  If you want extra confidence, you could apply this as well:

http://bluesmoke.sourceforge.net

It will flag ECC, and also some PCI errors...  Try
http://prdownloads.sourceforge.net/bluesmoke/bluesmoke-devel-20051027.tar.gz?download
but don't include the NMI code (it's less invasive).

Cheers,

Tim.

> doesn't look like RAM errors, and in any case, the machine has ECC
> memory, not overclocked, and was burned in with memtest86+ and prime95
> for multiple days to make sure RAM was reliable.

> Hardware is single-core AMD64 processor,



* Re: sata_sil24 corruption details
  2005-11-07  9:59 sata_sil24 corruption details linux
@ 2005-11-07 16:15 ` Greg Freemyer
  2005-11-10  7:17 ` linux
  1 sibling, 0 replies; 23+ messages in thread
From: Greg Freemyer @ 2005-11-07 16:15 UTC (permalink / raw)
  To: linux-ide; +Cc: linux

On 7 Nov 2005 04:59:56 -0500, linux@horizon.com <linux@horizon.com> wrote:
> I just compared the two halves of my RAID-1 mirrors and found something
> very interesting...
>
> sector 95958 of the two halves looks like:
>
>  0000000: 9db4 87cf 4e2f cba7 c727 1feb 5f08 b7cf  ....N/...'.._...
<snip>
>  0000150: 5d70 fc7b aa06 146c cea1 fbac ff33 d73f  ]p.{...l.....3.?
>  0000160: 40cc f31f 30f1 5957 bffe 3b93 fbc1 ac68  @...0.YW..;....h
>  0000170: 90fe 94bf 6770 ded7 17bf c77e 4be8 15af  ....gp.....~K...
> -0000180: 4a2b 371e 8a1c baf5 7ab0 7998 84cb bfae  J+7.....z.y.....
> +0000180: 6dd2 09ec b42b 0638 996e e914 7a7c d353  m....+.8.n..z|.S
>  0000190: 0f5e e234 e488 997d 5564 a630 e7ad c3db  .^.4...}Ud.0....
...
>
> Then sector 129547:
>  0000000: 494e 41e8 0101 0002 0000 0065 0000 0069  INA........e...i
<snip>
>  00000f0: 9000 0040 ffff ffff 1801 0000 0000 0000  ...@............
>  0000100: 494e 81a4 0102 0001 0000 0000 0000 0000  IN..............
> -0000110: 0000 0001 0000 0000 0000 0000 0000 055a  ...............Z
> -0000120: 435f 2276 096e bf0e 4345 8a5f 34e9 60ae  C_"v.n..CE._4.`.
> +0000110: 0000 0001 0000 0000 0000 0000 0000 0557  ...............W
> +0000120: 435e e888 066b 4474 4345 8a5f 34e9 60ae  C^...kDtCE._4.`.
>  0000130: 4345 8a5f 34e9 60ae 0000 0000 0000 01f2  CE._4.`.........
<snip>
>
> And sector 195094:
>  0000000: 494e 41e8 0102 0019 0000 0065 0000 0069  INA........e...i
<snip>
>  0000100: 494e 41e8 0101 0002 0000 0065 0000 0069  INA........e...i
> -0000110: 0000 0002 0000 0000 0000 0000 0000 0010  ................
> -0000120: 435f 1c9a 0be9 b322 4355 9502 1f35 8dad  C_....."CU...5..
> +0000110: 0000 0002 0000 0000 0000 0000 0000 000e  ................
> +0000120: 435d 942d 069a 8c21 4355 9502 1f35 8dad  C].-...!CU...5..
>  0000130: 4355 9502 1f35 8dad 0000 0000 0000 0062  CU...5.........b
<snip>
>
> These short bursts of inconsistencies are alarming.  The number of error bits
> doesn't look like RAM errors, and in any case, the machine has ECC
> memory, not overclocked, and was burned in with memtest86+ and prime95
> for multiple days to make sure RAM was reliable.
>
> FWIW, those sector numbers are relative to the partition, which itself starts on
> raw disk sector 3903795, so the device-level sector numbers are:
> 3903795 + 95958 = 3999753 = 0x3D0809
> 3903795 + 129547 = 4033342 = 0x3D8B3E
> 3903795 + 195094 = 4098889 = 0x3E8B49
>
> Hardware is single-core AMD64 processor, nForce chipset, generic motherboard,
> 2x1 GB ECC SDRAM, 3x Sil3132 SATA controllers, 6x 400 GB 7200.8 drives.
>
>
> I finished "badblocks -b 4096 -c 65536 -s -v -w -t random" run on 350 G of one drive
> without seeing problems, and am working on the other 5.
> (In parallel, just to stress the driver.)
>
> Does anyone have any recommended diagnostics for seeing whether a drive reliably
> remembers data you give to it?  Ones that particularly abuse the disk driver?
>
> Thanks!
> -
Interesting.  I'm particularly surprised the data corruption is so
localized.  I had assumed it would be an entire sector.  Maybe that
will give someone an idea of what is causing the corruption.

FYI: We last tested a Sil 3112 for corruption issues with a SuSE
2.6.11 kernel.  We found that in copying 2 GB files from a SIG PCI
connected PATA drive to a 3112 connected SATA drive we would get
occasional corruption.  Generally we found between 1 and 3 of each
hundred files would get corrupted.  We detected corruption by running
an MD5 on the PATA and SATA copies of the files.

Based on seeing other postings of data corruption, I just assumed it
was a driver issue.  On the same machine we have run numerous PATA to
PATA copies like above with no consistency issues.  Both PATA drives
would have been set to master and connected to the SIG PCI card.

Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century


* Re: sata_sil24 corruption details
  2005-11-07  9:59 sata_sil24 corruption details linux
  2005-11-07 16:15 ` Greg Freemyer
@ 2005-11-10  7:17 ` linux
  2005-11-10  9:01   ` Tejun Heo
  2005-11-10 20:27   ` Edward Falk
  1 sibling, 2 replies; 23+ messages in thread
From: linux @ 2005-11-10  7:17 UTC (permalink / raw)
  To: linux-ide; +Cc: linux

Three days ago, I wrote:
> I finished "badblocks -b 4096 -c 65536 -s -v -w -t random" run on 350
> G of one drive without seeing problems, and am working on the other 5.
> (In parallel, just to stress the driver.)

My parallel -p1 badblocks runs (I shrunk the chunk size to -c 16384)
finished on 3 of the 5 drives, but after 69 hours and I don't know how
many passes, it's still running on one pair of drives.  Interestingly,
the pair (sdc4 & sdd4) is connected to a single controller.

Thus, it might not be a multiple-controller issue (I don't know how
many other people have 3 Sil3132s in a system), but perhaps an issue
with simultaneous activity on the 2 ports of a single controller.

Is there anything else I could do to help debug this problem?  Any additional
debugging I can enable?

It would take me a while to clean the backups off the system and move
it outside the firewall to allow remote access if someone wants access
to that particular hardware, but it's just an expensive bit bucket at
the moment, so ask if it would help...

Thanks!


* Re: sata_sil24 corruption details
  2005-11-10  7:17 ` linux
@ 2005-11-10  9:01   ` Tejun Heo
  2005-11-10 14:15     ` Greg Freemyer
  2005-11-10 20:27   ` Edward Falk
  1 sibling, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2005-11-10  9:01 UTC (permalink / raw)
  To: linux; +Cc: linux-ide

linux@horizon.com wrote:
> Three days ago, I wrote:
> 
>>I finished "badblocks -b 4096 -c 65536 -s -v -w -t random" run on 350
>>G of one drive without seeing problems, and am working on the other 5.
>>(In parallel, just to stress the driver.)
> 
> 
> My parallel -p1 badblocks runs (I shrunk the chunk size to -c 16384)
> finished on 3 of the 5 drives, but after 69 hours and I don't know how
> many passes, it's still running on one pair of drives.  Interestingly,
> the pair (sdc4 & sdd4) is connected to a single controller.
> 
> Thus, it might not be a multiple-controller issue (I don't know how
> many other people have 3 Sil3132s in a system), but perhaps an issue
> with simultaneous activity on the 2 ports of a single controller.
> 
> Is there anything else I could do to help debug this problem?  Any additional
> debugging I can enable?
> 
> It would take me a while to clean the backups off the system and move
> it outside the firewall to allow remote access if someone wants access
> to that particular hardware, but it's just an expensive bit bucket at
> the moment, so ask if it would help...

Hello, there.

I'll soon try to tackle this one.  However, I currently have only one
3124 controller and one harddisk to hook to that controller, so I cannot
reproduce your setup over here.  Here are things that I think might help
in diagnosing the problem.

* Trying other drivers
	* Trying the original driver.  I'll port the original driver
	  from sii to the current tree and post the patch.
	* Performing similar test under Windows.

* Ruling out disk problem
	* Trying other harddisks.  All harddisk drives perform error
	  detection/correction when data are read from the media, but
	  ruling out the possibility would still be helpful.

* If you have a log of failed sectors, finding patterns will be helpful.
  If the errors occur at random places, it's likely that we have
  controller/driver issues.  If errors are localized over multiple runs,
  maybe the disk is at fault.
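
The last point can be made mechanical: collect the bad-block list from each pass and separate blocks that recur across passes from blocks that show up only once. A minimal Python sketch follows; the function name and the sample numbers in the usage note are invented for illustration and are not data from this thread.

```python
from collections import Counter


def split_by_recurrence(runs):
    """Split bad blocks into those seen in several passes and those seen once.

    runs is a list of iterables, one per pass, each holding the block
    numbers reported bad in that pass.
    """
    counts = Counter(block for run in runs for block in set(run))
    recurring = sorted(b for b, c in counts.items() if c > 1)
    one_off = sorted(b for b, c in counts.items() if c == 1)
    return recurring, one_off
```

For example, split_by_recurrence([[10, 20, 30], [20, 40], [20, 30]]) returns ([20, 30], [10, 40]). Mostly recurring blocks would point towards the disk, while mostly one-off blocks would point towards the controller or driver.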

-- 
tejun


* Re: sata_sil24 corruption details
  2005-11-10  9:01   ` Tejun Heo
@ 2005-11-10 14:15     ` Greg Freemyer
  2005-11-10 14:41       ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Greg Freemyer @ 2005-11-10 14:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux, linux-ide

On 11/10/05, Tejun Heo <htejun@gmail.com> wrote:
> Hello, there.
>
> I'll soon try to tackle this one.  However, I currently have only one
> 3124 controller and one harddisk to hook to that controller, so I cannot
> reproduce your setup over here.  Here are things that I think might help
> in diagnosing the problem.
>
> * Trying other drivers
>         * Trying the original driver.  I'll port the original driver
>           from sii to the current tree and post the patch.
>         * Performing similar test under Windows.
>
> * Ruling out disk problem
>         * Trying other harddisks.  All harddisk drives perform error
>           detection/correction when data are read from the media, but
>           ruling out the possibility would still be helpful.
>
> * If you have a log of failed sectors, finding patterns will be helpful.
>   If the errors occur at random places, it's likely that we have
>   controller/driver issues.  If errors are localized over multiple runs,
>   maybe the disk is at fault.
>
> --
> tejun

Tejun,

I assume you saw my e-mail that with a 3112 and a single SATA drive we
were seeing corruption as well.  That being the case I think you
should first verify that corruption is not occurring in the single SATA
drive case.

Our test was to create a bunch of 2 GB files on a PATA drive.

We simply used a drive with real data as the source of our test files.
i.e., IIRC: cd test_dir; dd if=/dev/hde conv=noerror,sync | split -b 2000m

Then we calculated the md5 of all the 2 GB pieces.  All of this done
in a pure PATA setup.

Then we connected a SATA drive to a 3112 and simply copied the files
from the PATA drive to the SATA drive and verified the md5 values.  We
found corruption in 1 - 3% of the files copied.
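
The verification step described above can be sketched in Python as follows. It assumes the original pieces and their copies sit in two directories under the same file names; the function names and that directory layout are assumptions made for illustration, not a description of the tooling we actually used.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(source_dir, copy_dir):
    """Return names of files whose copy is missing or has a different MD5."""
    bad = []
    for name in sorted(os.listdir(source_dir)):
        src = os.path.join(source_dir, name)
        if not os.path.isfile(src):
            continue  # skip subdirectories and special files
        dst = os.path.join(copy_dir, name)
        if not os.path.isfile(dst) or md5_of_file(src) != md5_of_file(dst):
            bad.append(name)
    return bad
```

Dividing the length of the returned list by the number of files copied gives the corruption rate per file.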

FYI: The above are all very common steps for a computer forensic
examination; thus we found this issue in our attempts to qualify the 3112
as part of our forensic equipment.  We have not tested since 2.6.11
and that was with a SUSE kernel.

Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century


* Re: sata_sil24 corruption details
  2005-11-10 14:15     ` Greg Freemyer
@ 2005-11-10 14:41       ` Tejun Heo
  2005-11-10 15:26         ` linux
                           ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Tejun Heo @ 2005-11-10 14:41 UTC (permalink / raw)
  To: Greg Freemyer, Jens Axboe; +Cc: linux, linux-ide

Greg Freemyer wrote:
> Tejun,
> 
> I assume you saw my e-mail that with a 3112 and a single SATA drive we
> were seeing corruption as well.  That being the case I think you
> should first verify that corruption is not occurring in the single SATA
> drive case.
> 
> Our test was to create a bunch of 2 GB files on a PATA drive.
> 
> We simply used a drive with real data as the source of our test files.
> i.e., IIRC: cd test_dir; dd if=/dev/hde conv=noerror,sync | split -b 2000m
> 
> Then we calculated the md5 of all the 2 GB pieces.  All of this done
> in a pure PATA setup.
> 
> Then we connected a SATA drive to a 3112 and simply copied the files
> from the PATA drive to the SATA drive and verified the md5 values.  We
> found corruption in 1 - 3% of the files copied.
> 
> FYI: The above are all very common steps for a computer forensic
> examination; thus we found this issue in our attempts to qualify the 3112
> as part of our forensic equipment.  We have not tested since 2.6.11
> and that was with a SUSE kernel.
> 

Hi,

I'll run single drive test on sil3112 tonight, but can you please try 
2.6.14?  IIRC, there has been a PCI FIFO setting change.  Hmmm.. 
oh.. it was the following commit.

---
$ git-cat-file commit e1dd23a0012c3929737798fda9fede0e783f4ff3
tree c7f808b6433ef1015f55418e7f11f432943bdefd
parent 5273a00d9c763108397658d440618f7ac3e40f83
author Jens Axboe <axboe@suse.de> 1118228545 +0200
committer Jeff Garzik <jgarzik@pobox.com> 1118300782 -0400

[PATCH] sata_sil: Fix FIFO PCI Bus Arbitration kernel oops

Correct this.
---

Jens, is it possible that above change fixes data corruption?

-- 
tejun


* Re: sata_sil24 corruption details
  2005-11-10 14:41       ` Tejun Heo
@ 2005-11-10 15:26         ` linux
  2005-11-10 17:32         ` Tejun Heo
  2005-11-10 17:39         ` Jens Axboe
  2 siblings, 0 replies; 23+ messages in thread
From: linux @ 2005-11-10 15:26 UTC (permalink / raw)
  To: axboe, greg.freemyer, htejun; +Cc: linux-ide, linux

> I'll run single drive test on sil3112 tonight, but can you please try 
> 2.6.14?  IIRC, there has been a PCI FIFO setting change.  Hmmm.. 
> oh.. it was the following commit.

I was running a post-2.6.14 libata-dev kernel, but my root file system
got corrupted, and the "emergency backup" boot image that I've
been using is 2.6.13 + libata patches.

You asked about bad block patterns...

I ran the test on three drives at once (sdb, sdc, sdd), and the first
had no problems, while the latter two reported the following
bad blocks.  Note that these are 4096-byte block numbers relative to
a partition starting at sector 91795410.

The whole partition has 687662325 sectors, or 85957790.625 blocks.

Of those, the first pass of "badblocks -b 4096 -c 32768 -w -t random -p1"
found:

11 "bad blocks" in one pass over /dev/sdc4:
 4265401 
23598860
31564978
33854103
35258513
44588559
45069578
59358213
59554821
70448351
73983236

and 13 "bad blocks" in one pass over /dev/sdd4:
  226244  
 6595957 
 7402436 
 9464777 
14395278
14862235
15085611
16072105
16796706
26095323
39376782
46807588
51765692
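
To relate these numbers back to the raw device, each 4096-byte block number can be converted to a 512-byte device sector. The Python sketch below uses the partition start, partition size, and block lists given above; the helper names are only illustrative.

```python
SECTOR_SIZE = 512
BLOCK_SIZE = 4096
SECTORS_PER_BLOCK = BLOCK_SIZE // SECTOR_SIZE  # 8 sectors per block

PARTITION_START = 91795410     # first device sector of the partition
PARTITION_SECTORS = 687662325  # partition length in sectors

SDC4_BAD = [4265401, 23598860, 31564978, 33854103, 35258513, 44588559,
            45069578, 59358213, 59554821, 70448351, 73983236]
SDD4_BAD = [226244, 6595957, 7402436, 9464777, 14395278, 14862235,
            15085611, 16072105, 16796706, 26095323, 39376782, 46807588,
            51765692]


def block_to_device_sector(block):
    """First raw-device sector covered by a partition-relative block."""
    return PARTITION_START + block * SECTORS_PER_BLOCK


def relative_position(block):
    """Position of a block within the partition, as a fraction in [0, 1)."""
    return block * SECTORS_PER_BLOCK / PARTITION_SECTORS


for name, blocks in (("sdc4", SDC4_BAD), ("sdd4", SDD4_BAD)):
    for block in blocks:
        print(name, block, block_to_device_sector(block),
              f"{relative_position(block):.3f}")

# Block numbers reported on both drives (none for the lists above).
print(sorted(set(SDC4_BAD) & set(SDD4_BAD)))
```

The two lists have no block number in common.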

It's doing a second write pass now.

It found no problems on /dev/sdb4 while there was no particular traffic on
/dev/sda.  I'm doing a run with sda and sdb in parallel.

Anyway, thanks a lot for the attention.


* Re: sata_sil24 corruption details
  2005-11-10 14:41       ` Tejun Heo
  2005-11-10 15:26         ` linux
@ 2005-11-10 17:32         ` Tejun Heo
  2005-11-10 20:34           ` Greg Freemyer
  2005-11-11  2:16           ` sata_sil24 corruption details linux
  2005-11-10 17:39         ` Jens Axboe
  2 siblings, 2 replies; 23+ messages in thread
From: Tejun Heo @ 2005-11-10 17:32 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Greg Freemyer, Jens Axboe, linux, linux-ide

Tejun Heo wrote:
> Hi,
> 
> I'll run single drive test on sil3112 tonight, but can you please try 
> 2.6.14?  IIRC, there has been a PCI FIFO setting change.  Hmmm.. 
> oh.. it was the following commit.
> 
> ---
> $ git-cat-file commit e1dd23a0012c3929737798fda9fede0e783f4ff3
> tree c7f808b6433ef1015f55418e7f11f432943bdefd
> parent 5273a00d9c763108397658d440618f7ac3e40f83
> author Jens Axboe <axboe@suse.de> 1118228545 +0200
> committer Jeff Garzik <jgarzik@pobox.com> 1118300782 -0400
> 
> [PATCH] sata_sil: Fix FIFO PCI Bus Arbitration kernel oops
> 
> Correct this.
> ---
> 
> Jens, is it possible that above change fixes data corruption?
> 

Greg, the first pass of 'badblocks -t random -v -w' on a 100G partition of a 160G
disk just finished without any error.  This is a Samsung HD160JJ drive on a
sil3112 controller.  I'll let badblocks run through the night and
perform the file copy & md5sum test tomorrow.  But my hunch is that there is
no common data corruption problem with sil3112.  It's just in too
widespread use to have such a data corruption problem with so few reports.

What exact controller/disk did you use?  Care to retest your setup with 
2.6.14?

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: sata_sil24 corruption details
  2005-11-10 14:41       ` Tejun Heo
  2005-11-10 15:26         ` linux
  2005-11-10 17:32         ` Tejun Heo
@ 2005-11-10 17:39         ` Jens Axboe
  2 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2005-11-10 17:39 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Greg Freemyer, linux, linux-ide

On Thu, Nov 10 2005, Tejun Heo wrote:
> Greg Freemyer wrote:
> >On 11/10/05, Tejun Heo <htejun@gmail.com> wrote:
> >
> >>linux@horizon.com wrote:
> >>
> >>>Three days ago, I wrote:
> >>>
> >>>
> >>>>I finished "badblocks -b 4096 -c 65536 -s -v -w -t random" run on 350
> >>>>G of one drive without seeing problems, and am working on the other 5.
> >>>>(In parallel, just to stress the driver.)
> >>>
> >>>
> >>>My parallel -p1 badblocks runs (I shrunk the chunk size to -c 16384)
> >>>finished on 3 of the 5 drives, but after 69 hours and I don't know how
> >>>many passes, it's still running on one pair of drives.  Interestingly,
> >>>the pair (sdc4 & sdd4) is connected to a single controller.
> >>>
> >>>Thus, it might not be a multiple-controller issue (I don't know how
> >>>many other people have 3 Sil3132s in a system), but perhaps an issue
> >>>with simultaneous activity on the 2 ports of a single controller.
> >>>
> >>>Is there anything else I could do to help debug this problem?  Any 
> >>>additional
> >>>debugging I can enable?
> >>>
> >>>It would take me a while to clean the backups off the system and move
> >>>it outside the firewall to allow remote access if someone wants access
> >>>to that particular hardware, but it's just an expensive bit bucket at
> >>>the moment, so ask if it would help...
> >>
> >>Hello, there.
> >>
> >>I'll soon try to tackle this one.  However, I currently have only one
> >>3124 controller and one harddisk to hook to that controller, so I cannot
> >>reproduce your setup over here.  Here are things that I think might help
> >>in diagnosing the problem.
> >>
> >>* Trying other drivers
> >>       * Trying the original driver.  I'll port the original driver
> >>         from sii to the current tree and post the patch.
> >>       * Performing similar test under Windows.
> >>
> >>* Ruling out disk problem
> >>       * Trying other harddisks.  All harddisk drives perform error
> >>         detection/correction when data are read from the media, but
> >>         ruling out the possibility would still be helpful.
> >>
> >>* If you have log of failed sectors, finding patterns will be helpful.
> >> If the errors occur at random places, it's likely that we have
> >> controller/driver issues.  If errors are localized over multiple runs,
> >> maybe the disk is at fault.
> >>
> >>--
> >>tejun
> >
> >
> >Tejun,
> >
> >I assume you saw my e-mail that with a 3112 and a single SATA drive we
> >were seeing corruption as well.  That being the case I think you
> >should first verify that corruption is not occurring in the single SATA
> >drive case.
> >
> >Our test was to create a bunch of 2 GB files on a PATA drive.
> >
> >We simply used a drive with real data as the source of our test files.
> >ie. IIRC: cd test_dir; dd if=/dev/hde conv=noerror,sync | split -b 2000m
> >
> >Then we calculated the md5 of all the 2 GB pieces.  All of this done
> >in a pure PATA setup.
> >
> >Then we connected a SATA drive to a 3112 and simply copied the files
> >from the PATA drive to the SATA drive and verified the md5 values.  We
> >found corruption in 1 - 3% of the files copied.
> >
> >FYI: The above are all very common steps for a computer forensic
> >examine, thus we found this issue in our attempts to qualify the 3112
> >as part of our forensic equipment.  We have not tested since 2.6.11
> >and that was with a SUSE kernel.
> >
> 
> Hi,
> 
> I'll run single drive test on sil3112 tonight, but can you please try 
> 2.6.14?  IIRC, there have been some PCI FIFO setting change.  Hmmm.. 
> oh.. it was the following commit.
> 
> ---
> $ git-cat-file commit e1dd23a0012c3929737798fda9fede0e783f4ff3
> tree c7f808b6433ef1015f55418e7f11f432943bdefd
> parent 5273a00d9c763108397658d440618f7ac3e40f83
> author Jens Axboe <axboe@suse.de> 1118228545 +0200
> committer Jeff Garzik <jgarzik@pobox.com> 1118300782 -0400
> 
> [PATCH] sata_sil: Fix FIFO PCI Bus Arbitration kernel oops
> 
> Correct this.
> ---
> 
> Jens, is it possible that above change fixes data corruption?

It could, but only on the 3114 (where it would oops before). The 3112
data corruption cache line fix predates it, so it probably isn't this
one.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: sata_sil24 corruption details
  2005-11-10  7:17 ` linux
  2005-11-10  9:01   ` Tejun Heo
@ 2005-11-10 20:27   ` Edward Falk
  1 sibling, 0 replies; 23+ messages in thread
From: Edward Falk @ 2005-11-10 20:27 UTC (permalink / raw)
  To: linux; +Cc: linux-ide


> Is there anything else I could do to help debug this problem?  Any additional
> debugging I can enable?

Check /var/log/kern.log for timeout messages; it would be interesting to 
see if you find any.
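
A minimal sketch of that log check, assuming the log is plain text and
that timeouts show up as lines containing "timeout" or "timed out".
The exact wording of the kernel messages varies between kernel
versions, so the pattern and the function name find_timeouts are
illustrative assumptions, not the driver's actual message format:

```python
import re
from typing import Iterable, List

# Assumed pattern: matches "timeout" and "timed out", case-insensitively.
# Adjust it to whatever your kernel actually logs.
TIMEOUT_RE = re.compile(r"time(?:d )?out", re.IGNORECASE)


def find_timeouts(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention a timeout, without newlines."""
    return [line.rstrip("\n") for line in lines if TIMEOUT_RE.search(line)]
```

To use it, pass an open file object, for example
open("/var/log/kern.log", errors="replace"), and print what comes back.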

	-ed falk

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: sata_sil24 corruption details
  2005-11-10 17:32         ` Tejun Heo
@ 2005-11-10 20:34           ` Greg Freemyer
  2005-11-12  0:49             ` Greg Freemyer
  2005-11-11  2:16           ` sata_sil24 corruption details linux
  1 sibling, 1 reply; 23+ messages in thread
From: Greg Freemyer @ 2005-11-10 20:34 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux, linux-ide

On 11/10/05, Tejun Heo <htejun@gmail.com> wrote:
> Tejun Heo wrote:
> > Greg Freemyer wrote:
> >
> >> On 11/10/05, Tejun Heo <htejun@gmail.com> wrote:
> >>
> >>> linux@horizon.com wrote:
> >>>
> >>>> Three days ago, I wrote:
> >>>>
> >>>>
> >>>>> I finished "badblocks -b 4096 -c 65536 -s -v -w -t random" run on 350
> >>>>> G of one drive without seeing problems, and am working on the other 5.
> >>>>> (In parallel, just to stress the driver.)
> >>>>
> >>>>
> >>>>
> >>>> My parallel -p1 badblocks runs (I shrunk the chunk size to -c 16384)
> >>>> finished on 3 of the 5 drives, but after 69 hours and I don't know how
> >>>> many passes, it's still running on one pair of drives.  Interestingly,
> >>>> the pair (sdc4 & sdd4) is connected to a single controller.
> >>>>
> >>>> Thus, it might not be a multiple-controller issue (I don't know how
> >>>> many other people have 3 Sil3132s in a system), but perhaps an issue
> >>>> with simultaneous activity on the 2 ports of a single controller.
> >>>>
> >>>> Is there anything else I could do to help debug this problem?  Any
> >>>> additional
> >>>> debugging I can enable?
> >>>>
> >>>> It would take me a while to clean the backups off the system and move
> >>>> it outside the firewall to allow remote access if someone wants access
> >>>> to that particular hardware, but it's just an expensive bit bucket at
> >>>> the moment, so ask if it would help...
> >>>
> >>>
> >>> Hello, there.
> >>>
> >>> I'll soon try to tackle this one.  However, I currently have only one
> >>> 3124 controller and one harddisk to hook to that controller, so I cannot
> >>> reproduce your setup over here.  Here are things that I think might help
> >>> in diagnosing the problem.
> >>>
> >>> * Trying other drivers
> >>>        * Trying the original driver.  I'll port the original driver
> >>>          from sii to the current tree and post the patch.
> >>>        * Performing similar test under Windows.
> >>>
> >>> * Ruling out disk problem
> >>>        * Trying other harddisks.  All harddisk drives perform error
> >>>          detection/correction when data are read from the media, but
> >>>          ruling out the possibility would still be helpful.
> >>>
> >>> * If you have log of failed sectors, finding patterns will be helpful.
> >>>  If the errors occur at random places, it's likely that we have
> >>>  controller/driver issues.  If errors are localized over multiple runs,
> >>>  maybe the disk is at fault.
> >>>
> >>> --
> >>> tejun
> >>
> >>
> >>
> >> Tejun,
> >>
> >> I assume you saw my e-mail that with a 3112 and a single SATA drive we
> >> were seeing corruption as well.  That being the case I think you
> >> should first verify that corruption is not occurring in the single SATA
> >> drive case.
> >>
> >> Our test was to create a bunch of 2 GB files on a PATA drive.
> >>
> >> We simply used a drive with real data as the source of our test files.
> >> ie. IIRC: cd test_dir; dd if=/dev/hde conv=noerror,sync | split -b 2000m
> >>
> >> Then we calculated the md5 of all the 2 GB pieces.  All of this done
> >> in a pure PATA setup.
> >>
> >> Then we connected a SATA drive to a 3112 and simply copied the files
> >> from the PATA drive to the SATA drive and verified the md5 values.  We
> >> found corruption in 1 - 3% of the files copied.
> >>
> >> FYI: The above are all very common steps for a computer forensic
> >> examine, thus we found this issue in our attempts to qualify the 3112
> >> as part of our forensic equipment.  We have not tested since 2.6.11
> >> and that was with a SUSE kernel.
> >>
> >
> > Hi,
> >
> > I'll run single drive test on sil3112 tonight, but can you please try
> > 2.6.14?  IIRC, there have been some PCI FIFO setting change.  Hmmm..
> > oh.. it was the following commit.
> >
> > ---
> > $ git-cat-file commit e1dd23a0012c3929737798fda9fede0e783f4ff3
> > tree c7f808b6433ef1015f55418e7f11f432943bdefd
> > parent 5273a00d9c763108397658d440618f7ac3e40f83
> > author Jens Axboe <axboe@suse.de> 1118228545 +0200
> > committer Jeff Garzik <jgarzik@pobox.com> 1118300782 -0400
> >
> > [PATCH] sata_sil: Fix FIFO PCI Bus Arbitration kernel oops
> >
> > Correct this.
> > ---
> >
> > Jens, is it possible that above change fixes data corruption?
> >
>
> Greg, first pass of 'badblocks -t random -v -w' on 100G partition of 160G
> disk just finished without any error.  This is samsung hd160jj drive on
> sil3112 controller.  I'll let badblocks run through the night and
> perform file copy & md5sum test tomorrow.  But my hunch is that there is
> no common data corruption problem with sil3112.  It's just in too
> wide-spread use to have such data corruption problem with so few reports.
>
> What exact controller/disk did you use?  Care to retest your setup with
> 2.6.14?
>
> --
> tejun
>
Tejun

The corruption I was seeing was on the order of a few bytes per 100
GB.  I'm not sure that most users would realize they were having
problems with that small of an error rate.

I'm not sure what the OP's error rate was, but maybe he can tell us.

I will attempt to retest with 2.6.14 vanilla.  Not sure if that will
be today or tomorrow.

I don't have the old disk any more, but I will report what I use this time.

I also have a CoolGear SATA to USB bridge, so if corruption is still
occurring I can retry the process with a USB connection to the
computer.  http://www.cooldrives.com/seatatousb20.html   If that works
it should rule out the Drive.

Thanks for taking the time.

Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: sata_sil24 corruption details
  2005-11-10 17:32         ` Tejun Heo
  2005-11-10 20:34           ` Greg Freemyer
@ 2005-11-11  2:16           ` linux
  2005-11-13  6:11             ` linux
  1 sibling, 1 reply; 23+ messages in thread
From: linux @ 2005-11-11  2:16 UTC (permalink / raw)
  To: htejun; +Cc: axboe, greg.freemyer, linux-ide, linux

>>> * Ruling out disk problem
>>>        * Trying other harddisks.  All harddisk drives perform error
>>>          detection/correction when data are read from the media, but
>>>          ruling out the possibility would still be helpful.
>>>
>>> * If you have log of failed sectors, finding patterns will be helpful.
>>>  If the errors occur at random places, it's likely that we have
>>>  controller/driver issues.  If errors are localized over multiple runs,
>>>  maybe the disk is at fault.

Well, I've been an idiot.  My problems are NOT solved, but the problems
are isolated to one of the three controller cards (sdc and sdd), and
by doing all my testing through the RAID layer which spread out those
errors across all the different logical volumes, I wasn't seeing it.

Damn it, I had the mirrors carefully shared across controllers for better
tolerance to hard errors, and that meant that these silent errors
got into everything.

So I went whining for help before I applied elementary debugging logic
to the problem and now have to publicly apologize for suspecting the
software.

There will now follow a period of drive, cable, and controller swapping
while I figure out WTF is going on here.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: sata_sil24 corruption details
  2005-11-10 20:34           ` Greg Freemyer
@ 2005-11-12  0:49             ` Greg Freemyer
  2005-11-12  2:59               ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Greg Freemyer @ 2005-11-12  0:49 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux, linux-ide

On 11/10/05, Greg Freemyer <greg.freemyer@gmail.com> wrote:
> On 11/10/05, Tejun Heo <htejun@gmail.com> wrote:
> > Tejun Heo wrote:
> > > Greg Freemyer wrote:
> > >
> > >> On 11/10/05, Tejun Heo <htejun@gmail.com> wrote:
> > >>
> > >>> Hello, there.
> > >>>
> > >>> I'll soon try to tackle this one.  However, I currently have only one
> > >>> 3124 controller and one harddisk to hook to that controller, so I cannot
> > >>> reproduce your setup over here.  Here are things that I think might help
> > >>> in diagnosing the problem.
> > >>>
> > >>> * Trying other drivers
> > >>>        * Trying the original driver.  I'll port the original driver
> > >>>          from sii to the current tree and post the patch.
> > >>>        * Performing similar test under Windows.
> > >>>
> > >>> * Ruling out disk problem
> > >>>        * Trying other harddisks.  All harddisk drives perform error
> > >>>          detection/correction when data are read from the media, but
> > >>>          ruling out the possibility would still be helpful.
> > >>>
> > >>> * If you have log of failed sectors, finding patterns will be helpful.
> > >>>  If the errors occur at random places, it's likely that we have
> > >>>  controller/driver issues.  If errors are localized over multiple runs,
> > >>>  maybe the disk is at fault.
> > >>>
> > >>> --
> > >>> tejun
> > >>
> > >> Tejun,
> > >>
> > >> I assume you saw my e-mail that with a 3112 and a single SATA drive we
> > >> were seeing corruption as well.  That being the case I think you
> > >> should first verify that corruption is not occurring in the single SATA
> > >> drive case.
> > >>
> > >> Our test was to create a bunch of 2 GB files on a PATA drive.
> > >>
> > >> We simply used a drive with real data as the source of our test files.
> > >> ie. IIRC: cd test_dir; dd if=/dev/hde conv=noerror,sync | split -b 2000m
> > >>
> > >> Then we calculated the md5 of all the 2 GB pieces.  All of this done
> > >> in a pure PATA setup.
> > >>
> > >> Then we connected a SATA drive to a 3112 and simply copied the files
> > >> from the PATA drive to the SATA drive and verified the md5 values.  We
> > >> found corruption in 1 - 3% of the files copied.
> > >>
> > >> FYI: The above are all very common steps for a computer forensic
> > >> examine, thus we found this issue in our attempts to qualify the 3112
> > >> as part of our forensic equipment.  We have not tested since 2.6.11
> > >> and that was with a SUSE kernel.
> > >>
> > >
> > > Hi,
> > >
> > > I'll run single drive test on sil3112 tonight, but can you please try
> > > 2.6.14?  IIRC, there have been some PCI FIFO setting change.  Hmmm..
> > > oh.. it was the following commit.
> > >
> > > ---
> > > $ git-cat-file commit e1dd23a0012c3929737798fda9fede0e783f4ff3
> > > tree c7f808b6433ef1015f55418e7f11f432943bdefd
> > > parent 5273a00d9c763108397658d440618f7ac3e40f83
> > > author Jens Axboe <axboe@suse.de> 1118228545 +0200
> > > committer Jeff Garzik <jgarzik@pobox.com> 1118300782 -0400
> > >
> > > [PATCH] sata_sil: Fix FIFO PCI Bus Arbitration kernel oops
> > >
> > > Correct this.
> > > ---
> > >
> > > Jens, is it possible that above change fixes data corruption?
> > >
> >
> > Greg, first pass of 'badblocks -t random -v -w' on 100G partition of 160G
> > disk just finished without any error.  This is samsung hd160jj drive on
> > sil3112 controller.  I'll let badblocks run through the night and
> > perform file copy & md5sum test tomorrow.  But my hunch is that there is
> > no common data corruption problem with sil3112.  It's just in too
> > wide-spread use to have such data corruption problem with so few reports.
> >
> > What exact controller/disk did you use?  Care to retest your setup with
> > 2.6.14?
> >
> > --
> > tejun
> >
> Tejun
>
> The corruption I was seeing was on the order of a few bytes per 100
> GB.  I'm not sure that most users would realize they were having
> problems with that small of an error rate.
>
> I'm not sure what the OP's error rate was, but maybe he can tell us.
>
> I will attempt to retest with 2.6.14 vanilla.  Not sure if that will
> be today or tomorrow.
>
> I don't have the old disk any more, but I will report what I use this time.
>
> I also have a CoolGear SATA to USB bridge, so if corruption is still
> occurring I can retry the process with a USB connection to the
> computer.  http://www.cooldrives.com/seatatousb20.html   If that works
> it should rule out the Drive.
>
> Thanks for taking the time.
>
> Greg
> --
> Greg Freemyer
> The Norcross Group
> Forensics for the 21st Century
>

Tejun,

Success report:

I did an 80 GB test copy with 2.6.14.1 and a Maxtor 80GB SATA drive and
a 3112A.  I had 3 drives connected to my server, one PATA for booting,
one PATA to hold the source data, and the SATA drive.  Let me know
if you want details about the setup.

I found no corruption.  Given that my error rate was very low it is
possible that the corruption simply did not happen, but for now I'm
assuming my earlier issues were either a disk problem or a kernel
issue that was resolved by the latest kernel.

I'm going to continue to test this setup.  I'll report any problems I find.

Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: sata_sil24 corruption details
  2005-11-12  0:49             ` Greg Freemyer
@ 2005-11-12  2:59               ` Tejun Heo
  2005-11-13 10:19                 ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2005-11-12  2:59 UTC (permalink / raw)
  To: Greg Freemyer; +Cc: Jens Axboe, linux, linux-ide

Greg Freemyer wrote:
> On 11/10/05, Greg Freemyer <greg.freemyer@gmail.com> wrote:
> 
>>On 11/10/05, Tejun Heo <htejun@gmail.com> wrote:
>>
>>>Tejun Heo wrote:
>>>
>>>>Greg Freemyer wrote:
>>>>
>>>>
>>>>>On 11/10/05, Tejun Heo <htejun@gmail.com> wrote:
>>>>>
>>>>>
>>>>>>Hello, there.
>>>>>>
>>>>>>I'll soon try to tackle this one.  However, I currently have only one
>>>>>>3124 controller and one harddisk to hook to that controller, so I cannot
>>>>>>reproduce your setup over here.  Here are things that I think might help
>>>>>>in diagnosing the problem.
>>>>>>
>>>>>>* Trying other drivers
>>>>>>       * Trying the original driver.  I'll port the original driver
>>>>>>         from sii to the current tree and post the patch.
>>>>>>       * Performing similar test under Windows.
>>>>>>
>>>>>>* Ruling out disk problem
>>>>>>       * Trying other harddisks.  All harddisk drives perform error
>>>>>>         detection/correction when data are read from the media, but
>>>>>>         ruling out the possibility would still be helpful.
>>>>>>
>>>>>>* If you have log of failed sectors, finding patterns will be helpful.
>>>>>> If the errors occur at random places, it's likely that we have
>>>>>> controller/driver issues.  If errors are localized over multiple runs,
>>>>>> maybe the disk is at fault.
>>>>>>
>>>>>>--
>>>>>>tejun
>>>>>
>>>>>Tejun,
>>>>>
>>>>>I assume you saw my e-mail that with a 3112 and a single SATA drive we
>>>>>were seeing corruption as well.  That being the case I think you
>>>>>should first verify that corruption is not occurring in the single SATA
>>>>>drive case.
>>>>>
>>>>>Our test was to create a bunch of 2 GB files on a PATA drive.
>>>>>
>>>>>We simply used a drive with real data as the source of our test files.
>>>>>ie. IIRC: cd test_dir; dd if=/dev/hde conv=noerror,sync | split -b 2000m
>>>>>
>>>>>Then we calculated the md5 of all the 2 GB pieces.  All of this done
>>>>>in a pure PATA setup.
>>>>>
>>>>>Then we connected a SATA drive to a 3112 and simply copied the files
>>>>>from the PATA drive to the SATA drive and verified the md5 values.  We
>>>>>found corruption in 1 - 3% of the files copied.
>>>>>
>>>>>FYI: The above are all very common steps for a computer forensic
>>>>>examine, thus we found this issue in our attempts to qualify the 3112
>>>>>as part of our forensic equipment.  We have not tested since 2.6.11
>>>>>and that was with a SUSE kernel.
>>>>>
>>>>
>>>>Hi,
>>>>
>>>>I'll run single drive test on sil3112 tonight, but can you please try
>>>>2.6.14?  IIRC, there have been some PCI FIFO setting change.  Hmmm..
>>>>oh.. it was the following commit.
>>>>
>>>>---
>>>>$ git-cat-file commit e1dd23a0012c3929737798fda9fede0e783f4ff3
>>>>tree c7f808b6433ef1015f55418e7f11f432943bdefd
>>>>parent 5273a00d9c763108397658d440618f7ac3e40f83
>>>>author Jens Axboe <axboe@suse.de> 1118228545 +0200
>>>>committer Jeff Garzik <jgarzik@pobox.com> 1118300782 -0400
>>>>
>>>>[PATCH] sata_sil: Fix FIFO PCI Bus Arbitration kernel oops
>>>>
>>>>Correct this.
>>>>---
>>>>
>>>>Jens, is it possible that above change fixes data corruption?
>>>>
>>>
>>>Greg, first pass of 'badblocks -t random -v -w' on 100G partition of 160G
>>>disk just finished without any error.  This is samsung hd160jj drive on
>>>sil3112 controller.  I'll let badblocks run through the night and
>>>perform file copy & md5sum test tomorrow.  But my hunch is that there is
>>>no common data corruption problem with sil3112.  It's just in too
>>>wide-spread use to have such data corruption problem with so few reports.
>>>
>>>What exact controller/disk did you use?  Care to retest your setup with
>>>2.6.14?
>>>
>>>--
>>>tejun
>>>
>>
>>Tejun
>>
>>The corruption I was seeing was on the order of a few bytes per 100
>>GB.  I'm not sure that most users would realize they were having
>>problems with that small of an error rate.
>>
>>I'm not sure what the OP's error rate was, but maybe he can tell us.
>>
>>I will attempt to retest with 2.6.14 vanilla.  Not sure if that will
>>be today or tomorrow.
>>
>>I don't have the old disk any more, but I will report what I use this time.
>>
>>I also have a CoolGear SATA to USB bridge, so if corruption is still
>>occurring I can retry the process with a USB connection to the
>>computer.  http://www.cooldrives.com/seatatousb20.html   If that works
>>it should rule out the Drive.
>>
>>Thanks for taking the time.
>>
>>Greg
>>--
>>Greg Freemyer
>>The Norcross Group
>>Forensics for the 21st Century
>>
> 
> 
> Tejun,
> 
> Success report:
> 
> I did an 80 GB test copy with 2.6.14.1 and a Maxtor 80GB SATA drive and
> a 3112A.  I had 3 drives connected to my server, one PATA for booting,
> one PATA to hold the source data, and the SATA drive.  Let me know
> if you want details about the setup.
> 
> I found no corruption.  Given that my error rate was very low it is
> possible that the corruption simply did not happen, but for now I'm
> assuming my earlier issues were either a disk problem or a kernel
> issue that was resolved by the latest kernel.
> 
> I'm going to continue to test this setup.  I'll report any problems I find.
> 

Hi, Greg.

I also have been continuing corruption test on 3112 during last two 
days.  It's being performed on 100GB partition of a 160GB harddisk 
(samsung hd160jj).  Nine passes of 'badblocks -t random -v -w /dev/sdb2' 
succeeded without any problem.  To replicate your test, I created a 4GB 
random file by dd'ing from /dev/urandom in a separate IDE disk and 
copied it to the partition 24 times (24 different files of course), then
I md5sum'd all copied files twice.  This test succeeded five times 
without any problem, and it's in the sixth run now.

Above badblocks and file copy tests amount to about 1.4TB of writes and 
1.9TB of reads without any data corruption.
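
A minimal sketch of the copy-and-verify procedure described above,
assuming Python's standard hashlib and shutil modules.  The function
names, the copy_NN.bin file naming, and the default counts are
illustrative assumptions rather than the commands that were actually
run:

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, read in chunk_size pieces."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, dest_dir, copies=24, verify_passes=2):
    """Copy source into dest_dir `copies` times, then hash every copy
    `verify_passes` times.  Return a list of (path, pass_number) for
    each hash that does not match the source."""
    expected = md5_of(source)
    targets = []
    for i in range(copies):
        target = Path(dest_dir) / f"copy_{i:02d}.bin"  # hypothetical naming
        shutil.copyfile(source, target)
        targets.append(target)

    mismatches = []
    for pass_number in range(1, verify_passes + 1):
        for target in targets:
            if md5_of(target) != expected:
                mismatches.append((str(target), pass_number))
    return mismatches
```

An empty return value means every copy hashed correctly on every
pass.  The total data set should be well above the amount of RAM,
otherwise the verify reads may be served from the page cache instead
of the disk.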

Let me know how your test turns out.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: sata_sil24 corruption details
  2005-11-11  2:16           ` sata_sil24 corruption details linux
@ 2005-11-13  6:11             ` linux
  0 siblings, 0 replies; 23+ messages in thread
From: linux @ 2005-11-13  6:11 UTC (permalink / raw)
  To: linux-ide; +Cc: htejun, linux

> There will now follow a period of drive, cable, and controller swapping
> while I figure out WTF is going on here.

In case anyone's following this saga...
The motherboard has 6 slots, starting from the CPU:

PCIe x1		(/dev/sde and /dev/sdf)
PCIe x16
PCIe x1		(/dev/sda and /dev/sdb)
PCIe x1		(/dev/sdc and /dev/sdd)	<=== THE CULPRIT
PCI
PCI

I've tried all three controller cards in the "culprit" slot, and
all read data from the disk unreliably.

Interestingly, I can still *write* data to the disk reliably.  I copied
350 GB to /dev/sdc4, swapped it to the sdf slot, and verified it was all
correct.  Then I swapped the sde+sdf controller card (and the attached
cables and drives) into the sdc+sdd slot and got tons of verify errors.
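
A minimal sketch of how such a verify pass can also record which
sectors differ, so the failures can be checked for patterns.  It
assumes 512-byte sectors and two readable paths (block devices or
image files).  The name mismatched_sectors and its parameters are
hypothetical:

```python
SECTOR_SIZE = 512  # assumed sector size


def mismatched_sectors(path_a, path_b, sector_size=SECTOR_SIZE,
                       chunk_sectors=2048):
    """Return the sector numbers at which two files or devices differ."""
    bad = []
    chunk_bytes = sector_size * chunk_sectors
    base = 0  # sector number of the first sector in the current chunk
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk_bytes)
            block_b = b.read(chunk_bytes)
            if not block_a and not block_b:
                break
            if block_a != block_b:
                # Only walk sector by sector inside chunks that differ.
                longest = max(len(block_a), len(block_b))
                for off in range(0, longest, sector_size):
                    end = off + sector_size
                    if block_a[off:end] != block_b[off:end]:
                        bad.append(base + off // sector_size)
            base += chunk_sectors
    return bad
```

Running it several times over the same data and comparing the lists
shows whether the bad sectors stay put or move around between runs.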

So right now, I'm blaming the motherboard.  (Albatron K8NF4U, if anyone cares)

The fact it's the slot furthest from the bridge seems suggestive, but
PCIe is packetized with link-level CRCs and retransmission, so
transmission problems shouldn't be capable of causing corruption.
Might I have a bad bridge chip?

I'm currently reading the PCIe spec looking for link error logging
registers.

This isn't hugely IDE software related, but I figured it was worth
sharing some progress, as I've been talking about it rather a lot.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: sata_sil24 corruption details
  2005-11-12  2:59               ` Tejun Heo
@ 2005-11-13 10:19                 ` Tejun Heo
  2005-11-14 23:30                   ` Greg Freemyer
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2005-11-13 10:19 UTC (permalink / raw)
  To: Greg Freemyer; +Cc: Jens Axboe, linux, linux-ide

Tejun Heo wrote:
> Greg Freemyer wrote:
>>
>> Tejun,
>>
>> Success report:
>>
>> I did an 80 GB test copy with 2.6.14.1 and a Maxtor 80GB SATA drive and
>> a 3112A.  I had 3 drives connected to my server, one PATA for booting,
>> one PATA to hold the source data, and the SATA drive.  Let me know
>> if you want details about the setup.
>>
>> I found no corruption.  Given that my error rate was very low it is
>> possible that the corruption simply did not happen, but for now I'm
>> assuming my earlier issues were either a disk problem or a kernel
>> issue that was resolved by the latest kernel.
>>
>> I'm going to continue to test this setup.  I'll report any problems I 
>> find.
>>
> 
> Hi, Greg.
> 
> I also have been continuing corruption test on 3112 during last two 
> days.  It's being performed on 100GB partition of a 160GB harddisk 
> (samsung hd160jj).  Nine passes of 'badblocks -t random -v -w /dev/sdb2' 
> succeeded without any problem.  To replicate your test, I created a 4GB 
> random file by dd'ing from /dev/urandom in a separate IDE disk and 
> copied it to the partition 24 times (24 different files of course), then
> I md5sum'd all copied files twice.  This test succeeded five times 
> without any problem, and it's in the sixth run now.
> 
> Above badblocks and file copy tests amount to about 1.4TB of writes and 
> 1.9TB of reads without any data corruption.
> 
> Let me know how your test turns out.
> 

The 4G file test has been running through last two days and it finished 
thirteen more runs without any hiccup.  So, my 3112 + hd160jj
combination finished 9 runs of 'badblocks -t random -v -w' and 22 runs of
4G file copy & verify twice test.  That's > 3TB of writes and > 5TB of 
reads without any error.
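
A rough check of those totals, as a sketch under stated assumptions:
each badblocks -w pass with a single pattern writes the 100GB
partition once and reads it back once, each copy run writes 24 files
of 4GB and reads them twice, and 1TB is taken as 1000GB.  The
function name traffic_tb is made up for illustration:

```python
PARTITION_GB = 100   # size of the tested partition
FILE_GB = 4          # size of the random test file
COPIES = 24          # copies written per run
VERIFY_PASSES = 2    # md5sum passes per run


def traffic_tb(badblocks_passes, copy_runs):
    """Return (written_tb, read_tb) on the SATA disk.

    Assumes one full write and one full read of the partition per
    badblocks pass; reads of the source file on the IDE disk are
    not counted."""
    copied_gb = copy_runs * COPIES * FILE_GB
    written_gb = badblocks_passes * PARTITION_GB + copied_gb
    read_gb = badblocks_passes * PARTITION_GB + copied_gb * VERIFY_PASSES
    return written_gb / 1000, read_gb / 1000
```

With 9 badblocks passes, 5 copy runs give about 1.38TB written and
1.86TB read, and 22 copy runs give about 3.01TB written and 5.12TB
read.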

I guess my setup doesn't have the problem you used to experience.  How's 
your testing going?

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: sata_sil24 corruption details
  2005-11-13 10:19                 ` Tejun Heo
@ 2005-11-14 23:30                   ` Greg Freemyer
  2005-11-18  2:23                     ` sata_sil24 corruption FIXED by motherboard swap linux
  0 siblings, 1 reply; 23+ messages in thread
From: Greg Freemyer @ 2005-11-14 23:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Jens Axboe, linux, linux-ide

On 11/13/05, Tejun Heo <htejun@gmail.com> wrote:
> Tejun Heo wrote:
> > Greg Freemyer wrote:
> >>
> >> Tejun,
> >>
> >> Success report:
> >>
> >> I did an 80 GB test copy with 2.6.14.1 and a Maxtor 80GB SATA drive and
> >> a 3112A.  I had 3 drives connected to my server, one PATA for booting,
> >> one PATA to hold the source data, and the SATA drive.  Let me know
> >> if you want details about the setup.
> >>
> >> I found no corruption.  Given that my error rate was very low it is
> >> possible that the corruption simply did not happen, but for now I'm
> >> assuming my earlier issues were either a disk problem or a kernel
> >> issue that was resolved by the latest kernel.
> >>
> >> I'm going to continue to test this setup.  I'll report any problems I
> >> find.
> >>
> >
> > Hi, Greg.
> >
> > I also have been continuing corruption test on 3112 during last two
> > days.  It's being performed on 100GB partition of a 160GB harddisk
> > (samsung hd160jj).  Nine passes of 'badblocks -t random -v -w /dev/sdb2'
> > succeeded without any problem.  To replicate your test, I created a 4GB
> > random file by dd'ing from /dev/urandom in a separate IDE disk and
> > copied it to the partition 24 times (24 different files of course), then
> > I md5sum'd all copied files twice.  This test succeeded five times
> > without any problem, and it's in the sixth run now.
> >
> > Above badblocks and file copy tests amount to about 1.4TB of writes and
> > 1.9TB of reads without any data corruption.
> >
> > Let me know how your test turns out.
> >
>
> The 4G file test has been running through last two days and it finished
> thirteen more runs without any hiccup.  So, my 3112 + hd160jj
> combination finished 9 runs of 'badblocks -t random -v -w' and 22 runs of
> 4G file copy & verify twice test.  That's > 3TB of writes and > 5TB of
> reads without any error.
>
> I guess my setup doesn't have the problem you used to experience.  How's
> your testing going?
>
> --
> tejun
>

Tejun,

We still have not experienced any problems with 2.6.14.1.  I'm going
to put the 3112 into limited production use for drives over 200GB.  We
verify md5's as part of our standard protocol, so if anything is out
of whack we should find it.  Unfortunately we don't process a lot of
200+ GB drives, but enough that we will likely exercise it a couple
times a month.

Greg
--
Greg Freemyer
The Norcross Group
Forensics for the 21st Century

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: sata_sil24 corruption details
@ 2005-11-15  9:30 SMALL, Timothy
  0 siblings, 0 replies; 23+ messages in thread
From: SMALL, Timothy @ 2005-11-15  9:30 UTC (permalink / raw)
  To: 'linux@horizon.com', linux-ide; +Cc: htejun

> The fact it's the slot furthest from the bridge seems 
> suggestive, but PCIe is packetized with link-level CRCs and 
> retransmission, so transmission problems shouldn't be capable 
> of causing corruption. Might I have a bad bridge chip?
> 
> I'm currently reading the PCIe spec looking for link error 
> logging registers.

This is already (at least partly) implemented in the PCI error reporting
stuff from:

http://bluesmoke.sourceforge.net/

Just to verify this...  The developers don't have many cases of real PCI
errors, so it'd certainly be useful for the bluesmoke/EDAC project if you
could see if this notices your data corruption trouble.

Thanks,

Tim.

* sata_sil24 corruption FIXED by motherboard swap
  2005-11-14 23:30                   ` Greg Freemyer
@ 2005-11-18  2:23                     ` linux
  2005-11-18 19:36                       ` sata_sil24 test support linux
  0 siblings, 1 reply; 23+ messages in thread
From: linux @ 2005-11-18  2:23 UTC (permalink / raw)
  To: htejun, linux-ide; +Cc: linux

For anybody who's been following my saga, what I thought were software
problems turned out to be a hardware problem with one PCI-express slot
on an Albatron K8NF4U motherboard.  (Whose support folks still haven't
gotten back to me...)

I swapped in a new motherboard this morning, and have been running disk
tests all afternoon with no problems.

My sincere apologies for the false alarm.

* Re: sata_sil24 test support
  2005-11-18  2:23                     ` sata_sil24 corruption FIXED by motherboard swap linux
@ 2005-11-18 19:36                       ` linux
  2005-11-22  0:23                         ` linux
  0 siblings, 1 reply; 23+ messages in thread
From: linux @ 2005-11-18 19:36 UTC (permalink / raw)
  To: htejun, linux-ide; +Cc: linux

> One thing I wanna verify on sil24 is data integrity with multiple disks 
> attached.  It would be very helpful if you can do some parallel data 
> stress testing with multiple disks.
> 
> * Parallel 'badblocks -w -t random' on all attached disks.  Maybe repeat 
> it for a few days and verify no corrupted IO occurs.

I only ran it for a day, but I can report success on exactly this
test on 6x Seagate 7200.8 drives (350G partition of 400G drives)
across 3x Sil3132.

That's how I found my problems, and how I verified that they were gone.
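
A rough sketch of how such a parallel run could be driven, for anyone wanting to repeat it. The device and log names are placeholders, and the helper names badblocks_command and run_parallel are made up for this example. The flags used (-w, -t random, -v, -o) are standard badblocks options, but note that -w DESTROYS all data on the target device.

```python
import subprocess


def badblocks_command(device, log_path):
    """Build the argv for one destructive random-pattern badblocks pass."""
    # -w: write-mode test (destroys data), -t random: random test pattern,
    # -v: verbose, -o: write the list of bad blocks found to log_path.
    return ["badblocks", "-w", "-t", "random", "-v", "-o", log_path, device]


def run_parallel(devices):
    """Start badblocks on every device at once and wait for all of them.

    `devices` maps a device node to the log file for that device.
    Returns a dict of device -> process exit status.
    """
    procs = {
        dev: subprocess.Popen(badblocks_command(dev, log))
        for dev, log in devices.items()
    }
    return {dev: proc.wait() for dev, proc in procs.items()}
```

After a run, a non-empty log file for a device is the thing to look at; this sketch does not try to interpret the exit status.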

(The only "interesting" finding was that one drive was noticeably slower
than the others.  Not 10%, but it finished most of an hour later.  I
checked the cables and all looked well, and its partner on the same
controller was fine.  I'm going to do a bit of swapping to experiment.)

This is with CONFIG_PCI_MSI=y.  It was run in single-user mode (all
file systems mounted read-only) because the question was whether
live file systems were safe.

One thing I'm thinking of as a *driver* test is to write a little utility
that uses O_DIRECT to do heavy I/O to the drive's cache.  That should
be able to exceed the 60 MB/sec media transfer rate limit.
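
A minimal sketch of what such a utility might look like, assuming Linux and Python 3.7 or later. The name hammer_cache, the 64 KiB block size and the iteration count are arbitrary choices for illustration. It also rests on an untested assumption: that rereading the same small block with O_DIRECT (which bypasses the page cache) gets answered from the drive's own cache after the first read.

```python
import mmap
import os
import time


def hammer_cache(path, block_size=64 * 1024, iterations=1000, direct=True):
    """Reread the first block of `path` many times; return the rate in MB/s."""
    flags = os.O_RDONLY
    if direct:
        flags |= os.O_DIRECT  # Linux-only: bypass the kernel page cache
    fd = os.open(path, flags)
    # An anonymous mmap is page-aligned, which satisfies the buffer
    # alignment that O_DIRECT requires.
    buf = mmap.mmap(-1, block_size)
    total = 0
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            n = os.preadv(fd, [buf], 0)  # always read from offset 0
            if n <= 0:
                break
            total += n
        elapsed = time.perf_counter() - start
    finally:
        buf.close()
        os.close(fd)
    return total / max(elapsed, 1e-9) / 1e6
```

Calling hammer_cache on a whole-disk or partition device node (as root) would give a figure to compare against the media rate; direct=False is there only so the function can be tried on ordinary files where O_DIRECT is not supported.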

* Re: sata_sil24 test support
  2005-11-18 19:36                       ` sata_sil24 test support linux
@ 2005-11-22  0:23                         ` linux
  2005-11-22  1:52                           ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: linux @ 2005-11-22  0:23 UTC (permalink / raw)
  To: htejun; +Cc: linux, linux-ide

> One thing I wanna verify on sil24 is data integrity with multiple disks 
> attached.  It would be very helpful if you can do some parallel data 
> stress testing with multiple disks.
> 
> * Parallel 'badblocks -w -t random' on all attached disks.  Maybe repeat 
> it for a few days and verify no corrupted IO occurs.

Just completed 6 passes x 6 drives x 350 GB = 12.6 TB of badblocks (10^14
bits) with no errors.  That's in addition to a previous 5 passes that
was interrupted by timeout problems on one drive, but that's an error
handling issue and not a data corruption problem, and it did resolve
itself eventually after I killed the badblocks run.
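
For anyone checking the arithmetic, here it is spelled out, using decimal units (1 GB = 10^9 bytes). The variable names are just for this example.

```python
passes = 6
drives = 6
gb_per_drive = 350  # size of the tested partition on each drive, in GB

total_bytes = passes * drives * gb_per_drive * 10**9
total_tb = total_bytes / 10**12  # terabytes covered by the test
total_bits = total_bytes * 8     # the same amount expressed in bits

print(f"{total_tb} TB, {total_bits:.3e} bits")
```

That comes to 12.6 TB, or just over 10^14 bits. As far as I understand badblocks -w, each pass writes the pattern and then reads it back, so that amount applies to writes and reads alike.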

That's several days of solid disk access at > 300 MB/sec.
(Some silly people asked me why I ignored the Sil3114 that came with
the motherboard...)

Thanks for a great driver!  I'll have even more fun testing NCQ one of
these days. :-)

Now rebooting to 2.6.14-rc2.  Now that it's stable, this system is going
into production Very Very Soon.  If you want any more testing, speak up!

* Re: sata_sil24 test support
  2005-11-22  0:23                         ` linux
@ 2005-11-22  1:52                           ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2005-11-22  1:52 UTC (permalink / raw)
  To: linux; +Cc: linux-ide

linux@horizon.com wrote:
>>One thing I wanna verify on sil24 is data integrity with multiple disks 
>>attached.  It would be very helpful if you can do some parallel data 
>>stress testing with multiple disks.
>>
>>* Parallel 'badblocks -w -t random' on all attached disks.  Maybe repeat 
>>it for a few days and verify no corrupted IO occurs.
> 
> 
> Just completed 6 passes x 6 drives x 350 GB = 12.6 TB of badblocks (10^14
> bits) with no errors.  That's in addition to a previous 5 passes that
> was interrupted by timeout problems on one drive, but that's an error
> handling issue and not a data corruption problem, and it did resolve
> itself eventually after I killed the badblocks run.
> 
> That's several days of solid disk access at > 300 MB/sec.
> (Some silly people asked me why I ignored the Sil3114 that came with
> the motherboard...)
> 
> Thanks for a great driver!  I'll have even more fun testing NCQ one of
> these days. :-)
> 
> Now rebooting to 2.6.14-rc2.  Now that it's stable, this system is going
> into production Very Very Soon.  If you want any more testing, speak up!

I'm very glad to hear the good news.  I'll let you know when more
testing is needed.  Thanks for doing this.

-- 
tejun

Thread overview: 23+ messages
2005-11-07  9:59 sata_sil24 corruption details linux
2005-11-07 16:15 ` Greg Freemyer
2005-11-10  7:17 ` linux
2005-11-10  9:01   ` Tejun Heo
2005-11-10 14:15     ` Greg Freemyer
2005-11-10 14:41       ` Tejun Heo
2005-11-10 15:26         ` linux
2005-11-10 17:32         ` Tejun Heo
2005-11-10 20:34           ` Greg Freemyer
2005-11-12  0:49             ` Greg Freemyer
2005-11-12  2:59               ` Tejun Heo
2005-11-13 10:19                 ` Tejun Heo
2005-11-14 23:30                   ` Greg Freemyer
2005-11-18  2:23                     ` sata_sil24 corruption FIXED by motherboard swap linux
2005-11-18 19:36                       ` sata_sil24 test support linux
2005-11-22  0:23                         ` linux
2005-11-22  1:52                           ` Tejun Heo
2005-11-11  2:16           ` sata_sil24 corruption details linux
2005-11-13  6:11             ` linux
2005-11-10 17:39         ` Jens Axboe
2005-11-10 20:27   ` Edward Falk
  -- strict thread matches above, loose matches on Subject: below --
2005-11-07 16:05 SMALL, Timothy
2005-11-15  9:30 SMALL, Timothy
