Errors when copying between drives on a SiI3114 controller under kernel 2.6.18

linux-ide.vger.kernel.org archive mirror

* Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
@ 2006-10-07 13:11 Jonathan Bell
  2006-10-08  4:33 ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Jonathan Bell @ 2006-10-07 13:11 UTC
  To: jgarzik; +Cc: linux-ide

Hello

I have been having input/output errors copying data between drives
attached to the same controller. I have two 3114 cards, a set of four
Seagate 250GB drives (Model: ST3250824NS  Rev: 3.AE) and a set of three Maxtor
300GB drives (Model: 6L300S0  Rev: BACE). This problem is reproducible
across all the drives and both controller cards.

The problem is that when copying a file off one drive on the controller to
another on the same controller, be it via dd or cp, the file that gets
written becomes corrupted along with the filesystem itself. Here is an
extract from dmesg:

[12689.451466] attempt to access beyond end of device
[12689.451475] sdb1: rw=0, want=2339438600, limit=488392002
[12689.451480] attempt to access beyond end of device
[12689.451484] sdb1: rw=0, want=18446744056529747976, limit=488392002
[12689.453822] attempt to access beyond end of device
[12689.453831] sdb1: rw=0, want=2339438600, limit=488392002
[12689.453834] Buffer I/O error on device sdb1, logical block 292429824
[12689.453935] attempt to access beyond end of device
[12689.453938] sdb1: rw=0, want=2339438600, limit=488392002
[12689.453941] Buffer I/O error on device sdb1, logical block 292429824
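
A side note on the numbers in the extract above: the following Python sketch only does arithmetic on values copied from the log. It assumes 512-byte sectors and a 4096-byte filesystem block size (the latter is a guess, chosen because it makes the figures line up), and the function names are purely illustrative.

```python
SECTOR_SIZE = 512        # bytes per sector (assumed)
FS_BLOCK_SIZE = 4096     # filesystem block size in bytes (assumed)


def as_signed64(value):
    """Reinterpret an unsigned 64-bit integer as two's-complement signed."""
    return value - (1 << 64) if value >= (1 << 63) else value


def end_sector_of_block(block):
    """Return the first sector past the given filesystem block."""
    sectors_per_block = FS_BLOCK_SIZE // SECTOR_SIZE
    return (block + 1) * sectors_per_block


# Values copied from the dmesg extract.
limit = 488392002                  # size of sdb1 in sectors
want_small = 2339438600
want_huge = 18446744056529747976
logical_block = 292429824

# True if the reported logical block and the "want" sector agree.
print(end_sector_of_block(logical_block) == want_small)
# True if the request lies past the end of the partition.
print(want_small > limit)
# The very large "want" value read as a signed 64-bit number.
print(as_signed64(want_huge))
```

Under these assumptions the logical block in the "Buffer I/O error" lines and the repeated want= value describe the same out-of-range request, and the very large want= value is negative when read as a signed 64-bit number. Both observations would be consistent with a damaged block pointer on disk rather than a genuine seek, though the arithmetic alone cannot prove that.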

The actual commands used were:

cp ~/hugefile /mnt/sda1
cp /mnt/sda1/hugefile /mnt/sdb1/
md5sum /mnt/sda1/hugefile /mnt/sdb1/hugefile

where hugefile is 4.9GB of output piped from "yes 0123456789" and stored in
~/ on a PATA drive used for the root filesystem and /home.
md5sum calculates the checksum of the first file fine and errors out on the
second file.

ccf5f9052aa1fac3062c3f1920abb1fc  /mnt/sda1/hugefile
md5sum: /mnt/sdb1/hugefile: Input/output error

The exact same problem happens when the drives are reversed, i.e. the file
is copied to sdb1 first then copied/dd'd to sda1, md5sum on sda1 bombs.
There is no problem copying the file individually to each drive from
~/hugefile and performing the above test on drives from different
controllers. All the drives have been rotated and the same test repeated with
exactly the same result. Each drive has had a complete "badblocks -w -s"
run on it with no problems.

I have performed the same test with ext2, ext3 and reiserfs 3.6 and all
exhibit the same behaviour: seeking beyond the end of the disk to
ludicrously high sectors.

I would like some help tracking down the cause of this problem as I have
practically exhausted the methods currently at my disposal - my best guess
at the moment is that data being written to another port is being trampled
on somehow but only when there is I/O active on another port. I will
continue testing to see if simultaneous writes to multiple drives on a
controller causes the same problem.

Thanks for any advice you can give,
Jonathan

[-- Attachment #2: lspci.txt.gz --]
[-- Type: application/x-gzip, Size: 2377 bytes --]

[-- Attachment #3: dmesg.txt.gz --]
[-- Type: application/x-gzip, Size: 8768 bytes --]

* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-07 13:11 Errors when copying between drives on a SiI3114 controller under kernel 2.6.18 Jonathan Bell
@ 2006-10-08  4:33 ` Tejun Heo
  2006-10-08 13:19   ` Jonathan Bell
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2006-10-08  4:33 UTC
  To: Jonathan Bell; +Cc: jgarzik, linux-ide

Hello.

Jonathan Bell wrote:
> The problem is that when copying a file off one drive on the controller to
> another on the same controller, be it via dd or cp, the file that gets
> written becomes corrupted along with the filesystem itself. Here is an
> extract from dmesg:

That's very weird.

> [12689.451466] attempt to access beyond end of device
> [12689.451475] sdb1: rw=0, want=2339438600, limit=488392002
> [12689.451480] attempt to access beyond end of device
> [12689.451484] sdb1: rw=0, want=18446744056529747976, limit=488392002
> [12689.453822] attempt to access beyond end of device
> [12689.453831] sdb1: rw=0, want=2339438600, limit=488392002
> [12689.453834] Buffer I/O error on device sdb1, logical block 292429824
> [12689.453935] attempt to access beyond end of device
> [12689.453938] sdb1: rw=0, want=2339438600, limit=488392002
> [12689.453941] Buffer I/O error on device sdb1, logical block 292429824
[--snip--]
> I would like some help tracking down the cause of this problem as I have
> practically exhausted the methods currently at my disposal - my best guess
> at the moment is that data being written to another port is being trampled
> on somehow but only when there is I/O active on another port. I will
> continue testing to see if simultaneous writes to multiple drives on a
> controller causes the same problem.

Can you repeat the test using raw devices - /dev/sdX?  I don't think the
filesystem is at fault, so let's rule it out.  Also, please post the
output of lspci -nvvvxxx.

Thanks.

-- 
tejun


* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-08  4:33 ` Tejun Heo
@ 2006-10-08 13:19   ` Jonathan Bell
  2006-10-09  8:38     ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Jonathan Bell @ 2006-10-08 13:19 UTC
  To: Tejun Heo; +Cc: linux-ide

On Sun, 08 Oct 2006 05:33:42 +0100, Tejun Heo <htejun@gmail.com> wrote:

> Hello.
>
> Jonathan Bell wrote:
>> The problem is that when copying a file off one drive on the controller  
>> to
>> another on the same controller, be it via dd or cp, the file that gets
>> written becomes corrupted along with the filesystem itself. Here is an
>> extract from dmesg:
>
> That's very weird.
>
>> [12689.451466] attempt to access beyond end of device
>> [12689.451475] sdb1: rw=0, want=2339438600, limit=488392002
>> [12689.451480] attempt to access beyond end of device
>> [12689.451484] sdb1: rw=0, want=18446744056529747976, limit=488392002
>> [12689.453822] attempt to access beyond end of device
>> [12689.453831] sdb1: rw=0, want=2339438600, limit=488392002
>> [12689.453834] Buffer I/O error on device sdb1, logical block 292429824
>> [12689.453935] attempt to access beyond end of device
>> [12689.453938] sdb1: rw=0, want=2339438600, limit=488392002
>> [12689.453941] Buffer I/O error on device sdb1, logical block 292429824
> [--snip--]
>> I would like some help tracking down the cause of this problem as I have
>> practically exhausted the methods currently at my disposal - my best  
>> guess
>> at the moment is that data being written to another port is being  
>> trampled
>> on somehow but only when there is I/O active on another port. I will
>> continue testing to see if simultaneous writes to multiple drives on a
>> controller causes the same problem.
>
> Can you repeat the test using raw devices - /dev/sdX?  I don't think  
> filesystem is at fault, so let's rule it out.  Also, please post the  
> result of lspci -nvvvxxx
>
> Thanks.
>


See attached for the lspci output.

I have confirmed the problem still happens with the following command:

yes 0123456789 | dd of=/dev/sda1 & dd if=/dev/sdb1 of=/dev/null &

I killed it after a while, then did "uniq /dev/sda1"

The results were... interesting - instead of just 0123456789 I ended up
with a whole load of variations on the theme of "0123456789". Attached is
an extract. While this proves the problem is still there, I don't really
know how to send you any useful information without sending you a ~256
megabyte dump of /dev/sda1 (even compressed it is still approximately 1.8MB).

From the looks of things the corruptions are few and far between - I
wouldn't know how to check how often they occur or what length they are  
though.
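
One possible way to measure how often the corruptions occur and how long they are is to compare the dump byte by byte against the repeating line that "yes 0123456789" produces. The Python sketch below is a rough illustration under stated assumptions: the pattern is taken to start at offset 0 of the device (as it would after dd of=/dev/sda1), only the region dd actually wrote is scanned (the limit argument), and the function names are made up for this example.

```python
PATTERN = b"0123456789\n"   # one line of `yes 0123456789` output


def find_corrupt_runs(data, pattern=PATTERN, start_offset=0):
    """Return (offset, length) for every run of bytes that deviates from
    the repeating pattern, assuming the pattern begins at offset 0."""
    runs = []
    run_start = None
    period = len(pattern)
    for i, byte in enumerate(data):
        offset = start_offset + i
        if byte != pattern[offset % period]:
            if run_start is None:
                run_start = offset          # a new corrupt run begins here
        elif run_start is not None:
            runs.append((run_start, offset - run_start))
            run_start = None
    if run_start is not None:               # run reaches the end of the data
        runs.append((run_start, start_offset + len(data) - run_start))
    return runs


def scan_file(path, limit, chunk_size=1 << 20):
    """Scan the first `limit` bytes of a dump file or device in chunks."""
    runs = []
    offset = 0
    with open(path, "rb") as f:
        while offset < limit:
            chunk = f.read(min(chunk_size, limit - offset))
            if not chunk:
                break
            for start, length in find_corrupt_runs(chunk, start_offset=offset):
                if runs and runs[-1][0] + runs[-1][1] == start:
                    # The run continues across a chunk boundary: merge it.
                    runs[-1] = (runs[-1][0], runs[-1][1] + length)
                else:
                    runs.append((start, length))
            offset += len(chunk)
    return runs


# Small self-check on synthetic data: three bytes overwritten at offset 13.
sample = bytearray(PATTERN * 4)
sample[13:16] = b"XYZ"
print(find_corrupt_runs(bytes(sample)))     # prints [(13, 3)]
```

A pure-Python loop like this is slow on a ~256 megabyte dump but should finish in a few minutes. Because the comparison assumes fixed alignment, inserted or dropped bytes would show up as one very long run rather than as a short one.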

Also, I probed the validity of the "Buffer I/O error" and found that the  
logical block wasn't actually corrupted - dd read it just fine - it was  
full of 0x00 (from badblocks I guess).


[-- Attachment #2: lspci2.txt.gz --]
[-- Type: application/x-gzip, Size: 4614 bytes --]

[-- Attachment #3: uniq.txt.gz --]
[-- Type: application/x-gzip, Size: 794 bytes --]

* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-08 13:19   ` Jonathan Bell
@ 2006-10-09  8:38     ` Tejun Heo
  2006-10-09  8:43       ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2006-10-09  8:38 UTC
  To: Jonathan Bell; +Cc: linux-ide, Carlos Pardo

[cc'ing Carlos Pardo]

Jonathan Bell wrote:
> On Sun, 08 Oct 2006 05:33:42 +0100, Tejun Heo <htejun@gmail.com> wrote:
> 
>> Hello.
>>
>> Jonathan Bell wrote:
>>> The problem is that when copying a file off one drive on the 
>>> controller to
>>> another on the same controller, be it via dd or cp, the file that gets
>>> written becomes corrupted along with the filesystem itself. Here is an
>>> extract from dmesg:
>>
>> That's very weird.
>>
>>> [12689.451466] attempt to access beyond end of device
>>> [12689.451475] sdb1: rw=0, want=2339438600, limit=488392002
>>> [12689.451480] attempt to access beyond end of device
>>> [12689.451484] sdb1: rw=0, want=18446744056529747976, limit=488392002
>>> [12689.453822] attempt to access beyond end of device
>>> [12689.453831] sdb1: rw=0, want=2339438600, limit=488392002
>>> [12689.453834] Buffer I/O error on device sdb1, logical block 292429824
>>> [12689.453935] attempt to access beyond end of device
>>> [12689.453938] sdb1: rw=0, want=2339438600, limit=488392002
>>> [12689.453941] Buffer I/O error on device sdb1, logical block 292429824
>> [--snip--]
>>> I would like some help tracking down the cause of this problem as I have
>>> practically exhausted the methods currently at my disposal - my best 
>>> guess
>>> at the moment is that data being written to another port is being 
>>> trampled
>>> on somehow but only when there is I/O active on another port. I will
>>> continue testing to see if simultaneous writes to multiple drives on a
>>> controller causes the same problem.
>>
>> Can you repeat the test using raw devices - /dev/sdX?  I don't think 
>> filesystem is at fault, so let's rule it out.  Also, please post the 
>> result of lspci -nvvvxxx
>>
>> Thanks.
>>
> 
> 
> See attached for the lspci output.
> 
> I have confirmed the problem still happens with the following command:
> 
> yes 0123456789 | dd of=/dev/sda1 & dd if=/dev/sdb1 of=/dev/null &
> 
> I killed it after a while, then did "uniq /dev/sda1"
> 
> The results were.... interesting - instead of just 0123456789 I ended up 
> with a whole load of variations on the theme of "0123456789". Attached 
> is an extract. While this proved the problem still is there I don't 
> really know how to send you any useful information without sending you a 
> ~256 megabyte dump of /dev/sda1 (compressed it is still approximately 
> 1.8MB)
> 
> From the looks of things the corruptions are few and far between - I
> wouldn't know how to check how often they occur or what length they are 
> though.
> 
> Also, I probed the validity of the "Buffer I/O error" and found that the 
> logical block wasn't actually corrupted - dd read it just fine - it was 
> full of 0x00 (from badblocks I guess).

I cannot reproduce your problem here.  Can you retest after running the 
following commands?

# setpci -s 01:07.0 0c.b=04
# setpci -s 01:08.0 0c.b=04

The above commands set the cache line size to 16 bytes.
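
For reference, the byte at offset 0x0c of the PCI configuration header is the Cache Line Size register, and it counts 32-bit words rather than bytes, which is why 04 corresponds to 16 bytes. Below is a minimal Python sketch of that conversion, plus a read-back of the register through the sysfs config file. The function names are illustrative, and the sysfs path in the final comment is an assumption derived from the 01:07.0 address above (PCI domain 0000 assumed).

```python
CACHE_LINE_SIZE_OFFSET = 0x0C   # offset in the standard PCI config header
DWORD_BYTES = 4                 # the register counts 32-bit words


def cache_line_bytes(register_value):
    """Convert a Cache Line Size register value to bytes."""
    return register_value * DWORD_BYTES


def read_cache_line_register(config_path):
    """Read the Cache Line Size byte from a PCI config-space file."""
    with open(config_path, "rb") as f:
        f.seek(CACHE_LINE_SIZE_OFFSET)
        data = f.read(1)
    if len(data) != 1:
        raise ValueError("config space is shorter than expected")
    return data[0]


for value in (0x00, 0x04, 0x08):
    print(f"0c.b={value:02x} -> {cache_line_bytes(value)} bytes")

# Hypothetical usage on the affected machine (path is an assumption):
# reg = read_cache_line_register("/sys/bus/pci/devices/0000:01:07.0/config")
# print(cache_line_bytes(reg))
```

Reading the register back after loading sata_sil is one way to confirm that the value written with setpci is still in place when the test runs.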

Carlos, the whole thread can be found at the following URL.  The
lspci -nvvvxxx output is there too.

http://thread.gmane.org/gmane.linux.ide/13381/focus=13381

-- 
tejun

* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-09  8:38     ` Tejun Heo
@ 2006-10-09  8:43       ` Tejun Heo
  2006-10-09 14:49         ` Jonathan Bell
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2006-10-09  8:43 UTC
  To: Jonathan Bell; +Cc: linux-ide, Carlos Pardo

Tejun Heo wrote:
> I cannot reproduce your problem here.  Can you retest after running the 
> following commands?
> 
> # setpci -s 01:07.0 0c.b=04
> # setpci -s 01:08.0 0c.b=04

I forgot something.

* You need to make sata_sil a module.  Boot, unload sata_sil if loaded,
run the above commands, load sata_sil and test.

* If the above commands don't work, try =00 instead of =04.

Thanks.

-- 
tejun

* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-09  8:43       ` Tejun Heo
@ 2006-10-09 14:49         ` Jonathan Bell
  2006-10-11 22:35           ` Jonathan Bell
  0 siblings, 1 reply; 14+ messages in thread
From: Jonathan Bell @ 2006-10-09 14:49 UTC
  To: Tejun Heo; +Cc: linux-ide

On Mon, 09 Oct 2006 09:43:18 +0100, Tejun Heo <htejun@gmail.com> wrote:

> Tejun Heo wrote:
>> I cannot reproduce your problem here.  Can you retest after running the  
>> following commands?
>>
>> # setpci -s 01:07.0 0c.b=04
>> # setpci -s 01:08.0 0c.b=04
>
> I forgot something.
>
> * You need to make sata_sil a module.  Boot, unload sata_sil if loaded,  
> run above commands, load sata_sil and test.
>
> * If above commands don't work, try =00 instead of =04.
>
> Thanks.
>


setpci -s 01:07/8.0 0c.b=04 performed, sata_sil inserted...

md5sum crapped out again, similar errors in dmesg as before.

setpci -s 01:07/8.0 0c.b=00 performed, sata_sil inserted...

It worked...
cp ~/hugefile /mnt/sda1 && cp /mnt/sda1/hugefile /mnt/sdb1 \
  && md5sum /mnt/sda1/hugefile /mnt/sdb1/hugefile

ccf5f9052aa1fac3062c3f1920abb1fc  /mnt/sda1/hugefile
ccf5f9052aa1fac3062c3f1920abb1fc  /mnt/sdb1/hugefile

What does this register do, out of interest? With 00 it took ages and made  
my load average shoot up to about 6.50!





* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-09 14:49         ` Jonathan Bell
@ 2006-10-11 22:35           ` Jonathan Bell
  2006-10-14 12:13             ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Jonathan Bell @ 2006-10-11 22:35 UTC
  To: Tejun Heo; +Cc: linux-ide@vger.kernel.org

On Mon, 09 Oct 2006 15:49:26 +0100, Jonathan Bell  
<doggs.lay.eggs@googlemail.com> wrote:

> On Mon, 09 Oct 2006 09:43:18 +0100, Tejun Heo <htejun@gmail.com> wrote:
>
>> Tejun Heo wrote:
>>> I cannot reproduce your problem here.  Can you retest after running  
>>> the following commands?
>>>
>>> # setpci -s 01:07.0 0c.b=04
>>> # setpci -s 01:08.0 0c.b=04
>>
>> I forgot something.
>>
>> * You need to make sata_sil a module.  Boot, unload sata_sil if loaded,  
>> run above commands, load sata_sil and test.
>>
>> * If above commands don't work, try =00 instead of =04.
>>
>> Thanks.
>>
>
>
> setpci -s 01:07/8.0 0c.b=04 performed, sata_sil inserted...
>
> md5sum crapped out again, similar errors in dmesg as before.
>
> setpci -s 01:07/8.0 0c.b=00 performed, sata_sil inserted...
>
> It worked...
> cp ~/hugefile /mnt/sda1 && cp /mnt/sda1/hugefile /mnt/sdb1
> && md5sum /mnt/sda1/hugefile /mnt/sdb1/hugefile
>
> ccf5f9052aa1fac3062c3f1920abb1fc  /mnt/sda1/hugefile
> ccf5f9052aa1fac3062c3f1920abb1fc  /mnt/sdb1/hugefile
>
> What does this register do, out of interest? With 00 it took ages and  
> made my load average shoot up to about 6.50!
>
>
>

Apologies for bumping this a mere 2 days later but I felt that progress  
was being made...

Ok, so it's the PCI cache line size register... 08 means a value of 64  
bits which corresponds to the line size of my L1/L2 cache, am I correct?

The fact that even a value of 01 (set for fun) still corrupts the file
seems to indicate that the fault is somewhere there, but why? Should
I just give up and buy a decent mainboard? :P (currently running an
A7N8X-Deluxe v2.0, latest 1008 BIOS)

I would like to know more about this since the only topics on PCI cache  
line sizes I can find are ones where people are having problems.

Regards
Jonathan

* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-11 22:35           ` Jonathan Bell
@ 2006-10-14 12:13             ` Tejun Heo
  2006-10-22 15:33               ` Jonathan Bell
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2006-10-14 12:13 UTC
  To: Jonathan Bell; +Cc: linux-ide@vger.kernel.org, Carlos Pardo

[cc'ing Carlos again.  please don't drop cc list]

Jonathan Bell wrote:
> On Mon, 09 Oct 2006 15:49:26 +0100, Jonathan Bell 
> <doggs.lay.eggs@googlemail.com> wrote:
> 
>> On Mon, 09 Oct 2006 09:43:18 +0100, Tejun Heo <htejun@gmail.com> wrote:
>>
>>> Tejun Heo wrote:
>>>> I cannot reproduce your problem here.  Can you retest after running 
>>>> the following commands?
>>>>
>>>> # setpci -s 01:07.0 0c.b=04
>>>> # setpci -s 01:08.0 0c.b=04
>>>
>>> I forgot something.
>>>
>>> * You need to make sata_sil a module.  Boot, unload sata_sil if 
>>> loaded, run above commands, load sata_sil and test.
>>>
>>> * If above commands don't work, try =00 instead of =04.
>>>
>>> Thanks.
>>
>> setpci -s 01:07/8.0 0c.b=04 performed, sata_sil inserted...
>>
>> md5sum crapped out again, similar errors in dmesg as before.
>>
>> setpci -s 01:07/8.0 0c.b=00 performed, sata_sil inserted...
>>
>> It worked...
>> cp ~/hugefile /mnt/sda1 && cp /mnt/sda1/hugefile /mnt/sdb1
>> && md5sum /mnt/sda1/hugefile /mnt/sdb1/hugefile
>>
>> ccf5f9052aa1fac3062c3f1920abb1fc  /mnt/sda1/hugefile
>> ccf5f9052aa1fac3062c3f1920abb1fc  /mnt/sdb1/hugefile
>>
>> What does this register do, out of interest? With 00 it took ages and 
>> made my load average shoot up to about 6.50!
> 
> Apologies for bumping this a mere 2 days later but I felt that progress 
> was being made...
> 
> Ok, so it's the PCI cache line size register... 08 means a value of 64 
> bits which corresponds to the line size of my L1/L2 cache, am I correct?

Yes, you're right.

> The fact that even with a value of 01 set (for fun) still corrupts the 
> file seems to indicate that the fault is somewhere there, but why? 
> Should I just give up and buy a decent mainboard? :P (currently running 
> A7N8X-Deluxe v2.0, latest 1008 BIOS)

I'm not sure whether the cache line size is the actual problem or whether
the slowdown caused by a cache line size of 0 (r/w optimizations based on
cache line size are turned off) merely hides the problem.  I was hoping the
BIOS had messed up while setting the cache line size and that adjusting it
to 4 would make things work.

> I would like to know more about this since the only topics on PCI cache 
> line sizes I can find are ones where people are having problems.

I don't know.  I think this can be best diagnosed by SIMG.  Carlos, does 
anything ring a bell?

Thanks.

-- 
tejun

* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-14 12:13             ` Tejun Heo
@ 2006-10-22 15:33               ` Jonathan Bell
  2006-10-23  2:22                 ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Jonathan Bell @ 2006-10-22 15:33 UTC
  To: Tejun Heo; +Cc: linux-ide

On Sat, 14 Oct 2006 13:13:40 +0100, Tejun Heo <htejun@gmail.com> wrote:

> [cc'ing Carlos again.  please don't drop cc list]
>
> Jonathan Bell wrote:
>> On Mon, 09 Oct 2006 15:49:26 +0100, Jonathan Bell  
>> <doggs.lay.eggs@googlemail.com> wrote:
>>
>>> On Mon, 09 Oct 2006 09:43:18 +0100, Tejun Heo <htejun@gmail.com> wrote:
>>>
>>>> Tejun Heo wrote:
>>>>> I cannot reproduce your problem here.  Can you retest after running  
>>>>> the following commands?
>>>>>
>>>>> # setpci -s 01:07.0 0c.b=04
>>>>> # setpci -s 01:08.0 0c.b=04
>>>>
>>>> I forgot something.
>>>>
>>>> * You need to make sata_sil a module.  Boot, unload sata_sil if  
>>>> loaded, run above commands, load sata_sil and test.
>>>>
>>>> * If above commands don't work, try =00 instead of =04.
>>>>
>>>> Thanks.
>>>
>>> setpci -s 01:07/8.0 0c.b=04 performed, sata_sil inserted...
>>>
>>> md5sum crapped out again, similar errors in dmesg as before.
>>>
>>> setpci -s 01:07/8.0 0c.b=00 performed, sata_sil inserted...
>>>
>>> It worked...
>>> cp ~/hugefile /mnt/sda1 && cp /mnt/sda1/hugefile /mnt/sdb1
>>> && md5sum /mnt/sda1/hugefile /mnt/sdb1/hugefile
>>>
>>> ccf5f9052aa1fac3062c3f1920abb1fc  /mnt/sda1/hugefile
>>> ccf5f9052aa1fac3062c3f1920abb1fc  /mnt/sdb1/hugefile
>>>
>>> What does this register do, out of interest? With 00 it took ages and  
>>> made my load average shoot up to about 6.50!
>>
>> Apologies for bumping this a mere 2 days later but I felt that
>> progress was being made...
>>
>> Ok, so it's the PCI cache line size register... 08 means a value of 64
>> bits which corresponds to the line size of my L1/L2 cache, am I correct?
>
> Yes, you're right.
>
>> The fact that even with a value of 01 set (for fun) still corrupts the  
>> file seems to indicate that the fault is somewhere there, but why?  
>> Should I just give up and buy a decent mainboard? :P (currently running  
>> A7N8X-Deluxe v2.0, latest 1008 BIOS)
>
> I'm not sure whether the cache line size is the actual problem or whether
> the slowdown caused by a cache line size of 0 (r/w optimizations based on
> cache line size are turned off) merely hides the problem.  I was hoping the
> BIOS had messed up while setting the cache line size and that adjusting it
> to 4 would make things work.
>
>> I would like to know more about this since the only topics on PCI cache  
>> line sizes I can find are ones where people are having problems.
>
> I don't know.  I think this can be best diagnosed by SIMG.  Carlos, does  
> anything ring a bell?
>
> Thanks.
>

This is where it gets weird... I may have uncovered a BIOS bug.

I changed the mainboard out as a last-ditch attempt to get this working  
and BEHOLD! The drives work perfectly. I swapped the A7N8X-D out for an  
Abit NF7-M (same nForce2 chipset, with the exception of onboard graphics)  
and used the same hardware as before.

This NF7-M is on loan to me so I cannot use it indefinitely. Any ideas,  
Tejun?

Worst comes to worst I can buy an old nForce2 board for a minor sum off  
eBay.

Jonathan




* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-22 15:33               ` Jonathan Bell
@ 2006-10-23  2:22                 ` Tejun Heo
  2006-10-23 10:13                   ` Alan Cox
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2006-10-23  2:22 UTC
  To: Jonathan Bell; +Cc: linux-ide

Jonathan Bell wrote:
> This is where it gets weird... I may have uncovered a BIOS bug.
> 
> I changed the mainboard out as a last-ditch attempt to get this working 
> and BEHOLD! The drives work perfectly. I swapped the A7N8X-D out for an 
> Abit NF7-M (same nForce2 chipset, with the exception of onboard 
> graphics) and used the same hardware as before.
> 
> This NF7-M is on loan to me so I cannot use it indefinitely. Any ideas, 
> Tejun?
> 
> Worst comes to worst I can buy an old nForce2 board for a minor sum off 
> eBay.

I guess it could be a PCI bus problem.  Maybe the controller and the PCI
bus on the board don't like each other and things get corrupted when
transactions occur at high speed.  I've seen data corruption over PCI
bus on some pilot embedded system board.  Not sure whether such things
are applicable to consumer products.

I dunno.  Simply changing the motherboard or the controller might be the
best solution for you.  Considering the large deployment of
3112/3512/3114 controllers, it's hard to believe your problem is a
software bug.

Thanks.

-- 
tejun

* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-23  2:22                 ` Tejun Heo
@ 2006-10-23 10:13                   ` Alan Cox
  2006-10-23 13:35                     ` Jonathan Bell
  0 siblings, 1 reply; 14+ messages in thread
From: Alan Cox @ 2006-10-23 10:13 UTC
  To: Tejun Heo; +Cc: Jonathan Bell, linux-ide

On Mon, 2006-10-23 at 11:22 +0900, Tejun Heo wrote:
> I guess it could be a PCI bus problem.  Maybe the controller and the PCI 
> bus on the board don't like each other and things get corrupted when
> transactions occur at high speed.  I've seen data corruption over PCI 
> bus on some pilot embedded system board.  Not sure whether such things 
> are applicable to consumer products.

From the IDE driver...

 *      If you have strange problems with nVidia chipset systems please
 *      see the SI support documentation and update your system BIOS
 *      if necessary


Alan


* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-23 10:13                   ` Alan Cox
@ 2006-10-23 13:35                     ` Jonathan Bell
  2006-10-23 14:09                       ` Alan Cox
  0 siblings, 1 reply; 14+ messages in thread
From: Jonathan Bell @ 2006-10-23 13:35 UTC
  To: Alan Cox; +Cc: linux-ide@vger.kernel.org

On Mon, 23 Oct 2006 11:13:57 +0100, Alan Cox <alan@lxorguk.ukuu.org.uk>  
wrote:

> On Mon, 2006-10-23 at 11:22 +0900, Tejun Heo wrote:
>> I guess it could be a PCI bus problem.  Maybe the controller and the PCI
>> bus on the board don't like each other and things get corrupted when
>> transactions occur at high speed.  I've seen data corruption over PCI
>> bus on some pilot embedded system board.  Not sure whether such things
>> are applicable to consumer products.
>
> From the IDE driver...
>
>  *      If you have strange problems with nVidia chipset systems please
>  *      see the SI support documentation and update your system BIOS
>  *      if necessary
>
>
> Alan
>


BIOS for the A7N8X is the latest 1008 version which overcomes a boot  
limitation - the board would not boot with more than 1 SATA controller  
installed.

This RAID corruption bug was supposedly fixed in 1005.

Since contacting Asus technical support is likely to be as productive as  
getting blood from a stone, I'm going to go ahead and scrounge another  
motherboard - since this is socket A stuff it should be dirt cheap by now.

Thanks for all the suggestions,
Jonathan

* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-23 13:35                     ` Jonathan Bell
@ 2006-10-23 14:09                       ` Alan Cox
  2006-10-30 20:53                         ` Jonathan Bell
  0 siblings, 1 reply; 14+ messages in thread
From: Alan Cox @ 2006-10-23 14:09 UTC
  To: Jonathan Bell; +Cc: linux-ide@vger.kernel.org

On Mon, 2006-10-23 at 14:35 +0100, Jonathan Bell wrote:
> BIOS for the A7N8X is the latest 1008 version which overcomes a boot  
> limitation - the board would not boot with more than 1 SATA controller  
> installed.
> 
> This RAID corruption bug was supposedly fixed in 1005.

But this is a BIOS so was it unfixed again in 1006 ?

Also for that matter, does anyone in Nvidia know and want to explain
what was wrong in those BIOSes and what we can do in Linux to
handle/correct the situation ?


* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
  2006-10-23 14:09                       ` Alan Cox
@ 2006-10-30 20:53                         ` Jonathan Bell
  0 siblings, 0 replies; 14+ messages in thread
From: Jonathan Bell @ 2006-10-30 20:53 UTC
  To: Alan Cox; +Cc: linux-ide@vger.kernel.org

On Mon, 23 Oct 2006 15:09:55 +0100, Alan Cox <alan@lxorguk.ukuu.org.uk>
wrote:

> On Mon, 2006-10-23 at 14:35 +0100, Jonathan Bell wrote:
>> BIOS for the A7N8X is the latest 1008 version which overcomes a boot
>> limitation - the board would not boot with more than 1 SATA controller
>> installed.
>>
>> This RAID corruption bug was supposedly fixed in 1005.
>
> But this is a BIOS so was it unfixed again in 1006 ?
>
> Also for that matter, does anyone in Nvidia know and want to explain
> what was wrong in those BIOSes and what we can do in Linux to
> handle/correct the situation ?
>

OK.... the problem has "fixed itself".

In contacting Asus technical support the nice guy at the other end told me
to increase the latency timer of the cards. In order to test the mainboard
I put it back in place of the temporary NF7-M. Suspiciously, the tests
that failed before now pass flawlessly, and the only difference was that two
PCI cards were not installed: the TV tuner card and a wireless LAN card.
Reinstalling these to duplicate the previous setup exactly still didn't
cause the error to appear.

Even so the board seems flaky so I'm hesitant to put it into full swing
just yet - I'll do a few stress test runs.
