* Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Jonathan Bell @ 2006-10-07 13:11 UTC
To: jgarzik; +Cc: linux-ide
[-- Attachment #1: Type: text/plain, Size: 2751 bytes --]
Hello
I have been getting input/output errors when copying data between drives
attached to the same controller. I have two SiI3114 cards, a set of four
Seagate 250GB drives (Model: ST3250824NS Rev: 3.AE) and a set of three Maxtor
300GB drives (Model: 6L300S0 Rev: BACE). The problem is reproducible
across all the drives and both controller cards.
The problem is that when copying a file off one drive on the controller to
another on the same controller, be it via dd or cp, the file that gets
written becomes corrupted along with the filesystem itself. Here is an
extract from dmesg:
[12689.451466] attempt to access beyond end of device
[12689.451475] sdb1: rw=0, want=2339438600, limit=488392002
[12689.451480] attempt to access beyond end of device
[12689.451484] sdb1: rw=0, want=18446744056529747976, limit=488392002
[12689.453822] attempt to access beyond end of device
[12689.453831] sdb1: rw=0, want=2339438600, limit=488392002
[12689.453834] Buffer I/O error on device sdb1, logical block 292429824
[12689.453935] attempt to access beyond end of device
[12689.453938] sdb1: rw=0, want=2339438600, limit=488392002
[12689.453941] Buffer I/O error on device sdb1, logical block 292429824
The actual commands used were:
cp ~/hugefile /mnt/sda1
cp /mnt/sda1/hugefile /mnt/sdb1/
md5sum /mnt/sda1/hugefile /mnt/sdb1/hugefile
where hugefile is a 4.9GB file created by piping the output of
"yes 0123456789" into ~/hugefile; ~/ is on a PATA drive that holds the
root filesystem and /home.
md5sum calculates the checksum of the first file fine but fails on the
second file:
ccf5f9052aa1fac3062c3f1920abb1fc /mnt/sda1/hugefile
md5sum: /mnt/sdb1/hugefile: Input/output error
The exact same problem happens when the drives are reversed, i.e. the file
is copied to sdb1 first and then copied/dd'd to sda1: md5sum then fails on
sda1.
There is no problem when the file is copied to each drive individually
from ~/hugefile, or when the above test is run on drives attached to
different controllers. All the drives have been rotated and the same test
repeated, with exactly the same result. Each drive has also been through a
complete "badblocks -w -s" run with no problems.
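The copy-and-compare test described above can be scripted roughly as follows. This is a minimal sketch rather than the exact procedure used: the function name `copy_compare`, its argument order and the size parameter are hypothetical, it assumes GNU coreutils (`head -c`, `md5sum`), and it works on ordinary files in two mounted directories such as /mnt/sda1 and /mnt/sdb1. Swapping the two directory arguments gives the reversed test.

```shell
#!/bin/sh
# Hypothetical helper: build a pattern file from "yes 0123456789", copy it
# to the first drive, copy it from there to the second drive, then compare
# checksums. Mirrors the cp / cp / md5sum sequence described above.
copy_compare() {
    src_dir=$1                    # e.g. /mnt/sda1
    dst_dir=$2                    # e.g. /mnt/sdb1
    size_mb=${3:-4900}            # roughly 4.9GB by default
    seed=${4:-$HOME/hugefile}     # where the pattern file is created

    # Create a pattern file of exactly size_mb MiB.
    yes 0123456789 | head -c "$((size_mb * 1024 * 1024))" > "$seed" || return 1

    cp "$seed" "$src_dir/hugefile" || return 1
    cp "$src_dir/hugefile" "$dst_dir/hugefile" || return 1

    # Keep only the hash field of each md5sum line.
    a=$(md5sum < "$src_dir/hugefile" | cut -d' ' -f1) || return 1
    b=$(md5sum < "$dst_dir/hugefile" | cut -d' ' -f1) || return 1
    if [ "$a" = "$b" ]; then
        echo "OK $a"
    else
        echo "MISMATCH $a $b"
        return 1
    fi
}
```

For example, `copy_compare /mnt/sda1 /mnt/sdb1` runs the original test and `copy_compare /mnt/sdb1 /mnt/sda1` the reversed one.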
I have performed the same test with ext2, ext3 and reiserfs 3.6 and all
exhibit the same behaviour: seeking beyond the end of the disk to
ludicrously high sectors.
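To put the numbers in the dmesg extract into perspective, here is a small arithmetic sketch. It assumes (the thread does not say so) that `want` and `limit` are counted in 512-byte sectors, that `want` is the end of the request (start sector plus length), and that the filesystem uses 4 KiB blocks, i.e. 8 sectors per logical block. Under those assumptions the partition is about 250 GB, the failing request ends about 1.2 TB in, and logical block 292429824 covers exactly the sectors ending at the reported `want` value, so the block number itself looks like garbage rather than a bad sector. The other value, 18446744056529747976, equals 2^64 - 17179803640, which is consistent with a negative 64-bit sector number printed as unsigned; it is too large for shell arithmetic and is left out. The helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: interpret a "want=... limit=..." pair from an
# "attempt to access beyond end of device" message. Assumes both values are
# in 512-byte sectors. Shell arithmetic is signed 64-bit, so values above
# 2^63 - 1 (such as 18446744056529747976) cannot be handled here.
decode_want() {
    want=$1
    limit=$2
    echo "want_bytes=$((want * 512))"
    echo "limit_bytes=$((limit * 512))"
    if [ "$want" -gt "$limit" ]; then
        echo "beyond_end_by_sectors=$((want - limit))"
    fi
}

# Hypothetical helper: sector range [start, end) covered by a
# "Buffer I/O error ... logical block N", assuming 8 sectors per block.
block_to_sectors() {
    block=$1
    sectors_per_block=${2:-8}
    start=$((block * sectors_per_block))
    echo "start=$start end=$((start + sectors_per_block))"
}
```

For example, `block_to_sectors 292429824` prints `start=2339438592 end=2339438600`.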
I would like some help tracking down the cause of this problem, as I have
practically exhausted the methods currently at my disposal. My best guess
at the moment is that data being written to one port is somehow being
trampled on, but only when there is I/O active on another port. I will
continue testing to see whether simultaneous writes to multiple drives on
one controller cause the same problem.
Thanks for any advice you can give,
Jonathan
[-- Attachment #2: lspci.txt.gz --]
[-- Type: application/x-gzip, Size: 2377 bytes --]
[-- Attachment #3: dmesg.txt.gz --]
[-- Type: application/x-gzip, Size: 8768 bytes --]
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Tejun Heo @ 2006-10-08 4:33 UTC
To: Jonathan Bell; +Cc: jgarzik, linux-ide
Hello.
Jonathan Bell wrote:
> The problem is that when copying a file off one drive on the controller to
> another on the same controller, be it via dd or cp, the file that gets
> written becomes corrupted along with the filesystem itself. Here is an
> extract from dmesg:
That's very weird.
> [12689.451466] attempt to access beyond end of device
> [12689.451475] sdb1: rw=0, want=2339438600, limit=488392002
> [12689.451480] attempt to access beyond end of device
> [12689.451484] sdb1: rw=0, want=18446744056529747976, limit=488392002
> [12689.453822] attempt to access beyond end of device
> [12689.453831] sdb1: rw=0, want=2339438600, limit=488392002
> [12689.453834] Buffer I/O error on device sdb1, logical block 292429824
> [12689.453935] attempt to access beyond end of device
> [12689.453938] sdb1: rw=0, want=2339438600, limit=488392002
> [12689.453941] Buffer I/O error on device sdb1, logical block 292429824
[--snip--]
> I would like some help tracking down the cause of this problem as I have
> practically exhausted the methods currently at my disposal - my best guess
> at the moment is that data being written to another port is being trampled
> on somehow but only when there is I/O active on another port. I will
> continue testing to see if simultaneous writes to multiple drives on a
> controller causes the same problem.
Can you repeat the test using raw devices - /dev/sdX? I don't think
filesystem is at fault, so let's rule it out. Also, please post the
result of lspci -nvvvxxx
Thanks.
--
tejun
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Jonathan Bell @ 2006-10-08 13:19 UTC
To: Tejun Heo; +Cc: linux-ide
[-- Attachment #1: Type: text/plain, Size: 2664 bytes --]
On Sun, 08 Oct 2006 05:33:42 +0100, Tejun Heo <htejun@gmail.com> wrote:
> Hello.
>
> Jonathan Bell wrote:
>> The problem is that when copying a file off one drive on the controller to
>> another on the same controller, be it via dd or cp, the file that gets
>> written becomes corrupted along with the filesystem itself. Here is an
>> extract from dmesg:
>
> That's very weird.
>
>> [12689.451466] attempt to access beyond end of device
>> [12689.451475] sdb1: rw=0, want=2339438600, limit=488392002
>> [12689.451480] attempt to access beyond end of device
>> [12689.451484] sdb1: rw=0, want=18446744056529747976, limit=488392002
>> [12689.453822] attempt to access beyond end of device
>> [12689.453831] sdb1: rw=0, want=2339438600, limit=488392002
>> [12689.453834] Buffer I/O error on device sdb1, logical block 292429824
>> [12689.453935] attempt to access beyond end of device
>> [12689.453938] sdb1: rw=0, want=2339438600, limit=488392002
>> [12689.453941] Buffer I/O error on device sdb1, logical block 292429824
> [--snip--]
>> I would like some help tracking down the cause of this problem as I have
>> practically exhausted the methods currently at my disposal - my best guess
>> at the moment is that data being written to another port is being trampled
>> on somehow but only when there is I/O active on another port. I will
>> continue testing to see if simultaneous writes to multiple drives on a
>> controller causes the same problem.
>
> Can you repeat the test using raw devices - /dev/sdX? I don't think
> filesystem is at fault, so let's rule it out. Also, please post the
> result of lspci -nvvvxxx
>
> Thanks.
>
See attached for the lspci output.
I have confirmed the problem still happens with the following command:
yes 0123456789 | dd of=/dev/sda1 & dd if=/dev/sdb1 of=/dev/null &
I killed it after a while, then did "uniq /dev/sda1"
The results were... interesting: instead of just 0123456789 I ended up
with a whole load of variations on the theme of "0123456789". Attached is
an extract. While this proves the problem is still there, I don't really
know how to send you any useful information without sending you a ~256
megabyte dump of /dev/sda1 (compressed, it is still approximately 1.8MB).
From the looks of things the corruptions are few and far between - I
wouldn't know how to check how often they occur or what length they are
though.
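One way to estimate how often the corruption occurs and how long each damaged run is would be to count the lines that differ from the expected pattern and print their byte offsets. A minimal sketch, assuming GNU grep; the function name `pattern_report` and the `dump.bin` file name are hypothetical. It is meant for a bounded image (for example one made with `dd if=/dev/sda1 of=dump.bin bs=1M count=256`) rather than the raw partition, because the unwritten part of the partition, likely zero-filled after the badblocks runs, contains no newlines and would be read as one enormous line. The last line of a truncated image will also show up as "bad".

```shell
#!/bin/sh
# Hypothetical helper: report lines in a dump that are not exactly the
# expected pattern. Assumes GNU grep (-a treats binary data as text).
pattern_report() {
    file=$1
    expected=${2:-0123456789}
    max_lines=${3:-20}

    # -F fixed string, -x whole line, -v non-matching, -c count only.
    # grep exits 1 when the count is 0, so ignore its status here.
    bad=$(grep -a -F -v -x -c -- "$expected" "$file" || true)
    echo "bad_lines=$bad"

    # -b prefixes each bad line with its byte offset in the file; the gaps
    # between offsets show how far apart the corruptions are.
    grep -a -F -v -x -b -- "$expected" "$file" | head -n "$max_lines"
}
```

Running `pattern_report dump.bin` prints the count first, followed by up to 20 `offset:line` entries.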
Also, I probed the validity of the "Buffer I/O error" and found that the
logical block wasn't actually corrupted - dd read it just fine - it was
full of 0x00 (from badblocks I guess).
[-- Attachment #2: lspci2.txt.gz --]
[-- Type: application/x-gzip, Size: 4614 bytes --]
[-- Attachment #3: uniq.txt.gz --]
[-- Type: application/x-gzip, Size: 794 bytes --]
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Tejun Heo @ 2006-10-09 8:38 UTC
To: Jonathan Bell; +Cc: linux-ide, Carlos Pardo
[cc'ing Carlos Pardo]
Jonathan Bell wrote:
> On Sun, 08 Oct 2006 05:33:42 +0100, Tejun Heo <htejun@gmail.com> wrote:
>
>> Hello.
>>
>> Jonathan Bell wrote:
>>> The problem is that when copying a file off one drive on the
>>> controller to
>>> another on the same controller, be it via dd or cp, the file that gets
>>> written becomes corrupted along with the filesystem itself. Here is an
>>> extract from dmesg:
>>
>> That's very weird.
>>
>>> [12689.451466] attempt to access beyond end of device
>>> [12689.451475] sdb1: rw=0, want=2339438600, limit=488392002
>>> [12689.451480] attempt to access beyond end of device
>>> [12689.451484] sdb1: rw=0, want=18446744056529747976, limit=488392002
>>> [12689.453822] attempt to access beyond end of device
>>> [12689.453831] sdb1: rw=0, want=2339438600, limit=488392002
>>> [12689.453834] Buffer I/O error on device sdb1, logical block 292429824
>>> [12689.453935] attempt to access beyond end of device
>>> [12689.453938] sdb1: rw=0, want=2339438600, limit=488392002
>>> [12689.453941] Buffer I/O error on device sdb1, logical block 292429824
>> [--snip--]
>>> I would like some help tracking down the cause of this problem as I have
>>> practically exhausted the methods currently at my disposal - my best guess
>>> at the moment is that data being written to another port is being trampled
>>> on somehow but only when there is I/O active on another port. I will
>>> continue testing to see if simultaneous writes to multiple drives on a
>>> controller causes the same problem.
>>
>> Can you repeat the test using raw devices - /dev/sdX? I don't think
>> filesystem is at fault, so let's rule it out. Also, please post the
>> result of lspci -nvvvxxx
>>
>> Thanks.
>>
>
>
> See attached for the lspci output.
>
> I have confirmed the problem still happens with the following command:
>
> yes 0123456789 | dd of=/dev/sda1 & dd if=/dev/sdb1 of=/dev/null &
>
> I killed it after a while, then did "uniq /dev/sda1"
>
> The results were.... interesting - instead of just 0123456789 I ended up
> with a whole load of variations on the theme of "0123456789". Attached
> is an extract. While this proved the problem still is there I don't
> really know how to send you any useful information without sending you a
> ~256 megabyte dump of /dev/sda1 (compressed it is still approximately
> 1.8MB)
>
> From the looks of things the corruptions are few and far between - I
> wouldn't know how to check how often they occur or what length they are
> though.
>
> Also, I probed the validity of the "Buffer I/O error" and found that the
> logical block wasn't actually corrupted - dd read it just fine - it was
> full of 0x00 (from badblocks I guess).
I cannot reproduce your problem here. Can you retest after running the
following commands?
# setpci -s 01:07.0 0c.b=04
# setpci -s 01:08.0 0c.b=04
The above commands set the cache line size to 16 bytes.
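For reference, the Cache Line Size register at config offset 0x0c counts 32-bit DWORDs, which is how a value of 04 works out to 16 bytes; 00 means no cache line size is programmed. The conversion is a one-liner in the shell. The helper name `cls_bytes` is hypothetical; the current value can be read beforehand with `setpci -s 01:07.0 0c.b` (no `=value`) so that it can be restored later.

```shell
#!/bin/sh
# Hypothetical helper: convert a PCI Cache Line Size register value
# (config offset 0x0c, one byte, in units of 32-bit DWORDs) to bytes.
# Accepts hex with or without a 0x prefix, e.g. "04", "08" or "0x10".
cls_bytes() {
    hex=${1#0x}
    echo $(( 0x$hex * 4 ))
}

# The values mentioned in this thread.
for v in 00 01 04 08; do
    echo "0c.b=$v -> $(cls_bytes "$v") bytes"
done
```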
Carlos, the whole thread can be found at the following URL. lspci
-nvvvxx result is there too.
http://thread.gmane.org/gmane.linux.ide/13381/focus=13381
--
tejun
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Tejun Heo @ 2006-10-09 8:43 UTC
To: Jonathan Bell; +Cc: linux-ide, Carlos Pardo
Tejun Heo wrote:
> I cannot reproduce your problem here. Can you retest after running the
> following commands?
>
> # setpci -s 01:07.0 0c.b=04
> # setpci -s 01:08.0 0c.b=04
I forgot something.
* You need to build sata_sil as a module. Boot, unload sata_sil if it is
loaded, run the above commands, then load sata_sil and test.
* If the above commands don't help, try =00 instead of =04.
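The sequence above (unload the driver, change the register on both controllers, reload, retest) can be wrapped in a small script. This is a sketch, not a tested procedure: the function name `retest_with_cls` and the `RUN` switch are hypothetical, and by default it only prints the commands, since unloading sata_sil takes every disk on those controllers offline. The PCI addresses 01:07.0 and 01:08.0 are the ones used above.

```shell
#!/bin/sh
# Hypothetical wrapper: reload sata_sil with a new Cache Line Size value.
# Dry run by default (commands are only printed); set RUN=command to
# execute them as root. Unmount filesystems on the affected drives first.
retest_with_cls() {
    value=$1
    shift                       # remaining arguments: PCI addresses
    run=${RUN:-echo}

    $run modprobe -r sata_sil || return 1
    for dev in "$@"; do
        $run setpci -s "$dev" "0c.b=$value" || return 1
    done
    $run modprobe sata_sil
}

# Try 04 first; if the copy test still fails, repeat with 00.
retest_with_cls 04 01:07.0 01:08.0
```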
Thanks.
--
tejun
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Jonathan Bell @ 2006-10-09 14:49 UTC
To: Tejun Heo; +Cc: linux-ide
On Mon, 09 Oct 2006 09:43:18 +0100, Tejun Heo <htejun@gmail.com> wrote:
> Tejun Heo wrote:
>> I cannot reproduce your problem here. Can you retest after running the
>> following commands?
>> # setpci -s 01:07.0 0c.b=04
>> # setpci -s 01:08.0 0c.b=04
>
> I forgot something.
>
> * You need to make sata_sil a module. Boot, unload sata_sil if loaded,
> run above commands, load sata_sil and test.
>
> * If above commands don't work, try =00 instead of =04.
>
> Thanks.
>
setpci -s 01:07.0 0c.b=04 and setpci -s 01:08.0 0c.b=04 performed,
sata_sil inserted...
md5sum crapped out again, with similar errors in dmesg as before.
setpci -s 01:07.0 0c.b=00 and setpci -s 01:08.0 0c.b=00 performed,
sata_sil inserted...
It worked...
cp ~/hugefile /mnt/sda1 && cp /mnt/sda1/hugefile /mnt/sdb1 && \
md5sum /mnt/sda1/hugefile /mnt/sdb1/hugefile
ccf5f9052aa1fac3062c3f1920abb1fc /mnt/sda1/hugefile
ccf5f9052aa1fac3062c3f1920abb1fc /mnt/sdb1/hugefile
What does this register do, out of interest? With 00 it took ages and made
my load average shoot up to about 6.50!
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Jonathan Bell @ 2006-10-11 22:35 UTC
To: Tejun Heo; +Cc: linux-ide@vger.kernel.org
On Mon, 09 Oct 2006 15:49:26 +0100, Jonathan Bell
<doggs.lay.eggs@googlemail.com> wrote:
> On Mon, 09 Oct 2006 09:43:18 +0100, Tejun Heo <htejun@gmail.com> wrote:
>
>> Tejun Heo wrote:
>>> I cannot reproduce your problem here. Can you retest after running
>>> the following commands?
>>> # setpci -s 01:07.0 0c.b=04
>>> # setpci -s 01:08.0 0c.b=04
>>
>> I forgot something.
>>
>> * You need to make sata_sil a module. Boot, unload sata_sil if loaded,
>> run above commands, load sata_sil and test.
>>
>> * If above commands don't work, try =00 instead of =04.
>>
>> Thanks.
>>
>
>
> setpci -s 01:07/8.0 0c.b=04 performed, sata_sil inserted...
>
> md5sum crapped out again, similar errors in dmesg as before.
>
> setpci -s 01:07/8.0 0c.b=00 performed, sata_sil inserted...
>
> It worked...
> cp ~/hugefile /mnt/sda1 && cp /mnt/sda1/hugefile /mnt/sdb1
> && md5sum /mnt/sda1/hugefile /mnt/sdb1/hugefile
>
> ccf5f9052aa1fac3062c3f1920abb1fc /mnt/sda1/hugefile
> ccf5f9052aa1fac3062c3f1920abb1fc /mnt/sdb1/hugefile
>
> What does this register do, out of interest? With 00 it took ages and
> made my load average shoot up to about 6.50!
>
>
>
Apologies for bumping this a mere 2 days later, but I felt that progress
was being made...
Ok, so it's the PCI cache line size register... 08 means a value of 64
bits which corresponds to the line size of my L1/L2 cache, am I correct?
The fact that the file is still corrupted even with a value of 01 set (for
fun) seems to indicate that the fault lies somewhere there, but why? Should
I just give up and buy a decent mainboard? :P (currently running an
A7N8X-Deluxe v2.0 with the latest 1008 BIOS)
I would like to know more about this since the only topics on PCI cache
line sizes I can find are ones where people are having problems.
Regards
Jonathan
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Tejun Heo @ 2006-10-14 12:13 UTC
To: Jonathan Bell; +Cc: linux-ide@vger.kernel.org, Carlos Pardo
[cc'ing Carlos again. Please don't drop the cc list.]
Jonathan Bell wrote:
> On Mon, 09 Oct 2006 15:49:26 +0100, Jonathan Bell
> <doggs.lay.eggs@googlemail.com> wrote:
>
>> On Mon, 09 Oct 2006 09:43:18 +0100, Tejun Heo <htejun@gmail.com> wrote:
>>
>>> Tejun Heo wrote:
>>>> I cannot reproduce your problem here. Can you retest after running
>>>> the following commands?
>>>> # setpci -s 01:07.0 0c.b=04
>>>> # setpci -s 01:08.0 0c.b=04
>>>
>>> I forgot something.
>>>
>>> * You need to make sata_sil a module. Boot, unload sata_sil if
>>> loaded, run above commands, load sata_sil and test.
>>>
>>> * If above commands don't work, try =00 instead of =04.
>>>
>>> Thanks.
>>
>> setpci -s 01:07/8.0 0c.b=04 performed, sata_sil inserted...
>>
>> md5sum crapped out again, similar errors in dmesg as before.
>>
>> setpci -s 01:07/8.0 0c.b=00 performed, sata_sil inserted...
>>
>> It worked...
>> cp ~/hugefile /mnt/sda1 && cp /mnt/sda1/hugefile /mnt/sdb1
>> && md5sum /mnt/sda1/hugefile /mnt/sdb1/hugefile
>>
>> ccf5f9052aa1fac3062c3f1920abb1fc /mnt/sda1/hugefile
>> ccf5f9052aa1fac3062c3f1920abb1fc /mnt/sdb1/hugefile
>>
>> What does this register do, out of interest? With 00 it took ages and
>> made my load average shoot up to about 6.50!
>
> Apologies for bumping this a mere 2 days later but I felt that progress
> was being made...
>
> Ok, so it's the PCI cache line size register... 08 means a value of 64
> bits which corresponds to the line size of my L1/L2 cache, am I correct?
Yes, you're right.
> The fact that even with a value of 01 set (for fun) still corrupts the
> file seems to indicate that the fault is somewhere there, but why?
> Should I just give up and buy a decent mainboard? :P (currently running
> A7N8X-Deluxe v2.0, latest 1008 BIOS)
I'm not sure whether the cache line size is the actual problem or whether
the slowdown caused by a cache line size of 0 (read/write optimizations
based on the cache line size are turned off) merely hides the problem. I
was hoping the BIOS had messed up while setting the cache line size and
that adjusting it to 4 would make things work.
> I would like to know more about this since the only topics on PCI cache
> line sizes I can find are ones where people are having problems.
I don't know. I think SIMG (Silicon Image) is best placed to diagnose
this. Carlos, does anything ring a bell?
Thanks.
--
tejun
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Jonathan Bell @ 2006-10-22 15:33 UTC
To: Tejun Heo; +Cc: linux-ide
On Sat, 14 Oct 2006 13:13:40 +0100, Tejun Heo <htejun@gmail.com> wrote:
> [cc'ing Carlos again. please don't drop cc list]
>
> Jonathan Bell wrote:
>> On Mon, 09 Oct 2006 15:49:26 +0100, Jonathan Bell
>> <doggs.lay.eggs@googlemail.com> wrote:
>>
>>> On Mon, 09 Oct 2006 09:43:18 +0100, Tejun Heo <htejun@gmail.com> wrote:
>>>
>>>> Tejun Heo wrote:
>>>>> I cannot reproduce your problem here. Can you retest after running
>>>>> the following commands?
>>>>> # setpci -s 01:07.0 0c.b=04
>>>>> # setpci -s 01:08.0 0c.b=04
>>>>
>>>> I forgot something.
>>>>
>>>> * You need to make sata_sil a module. Boot, unload sata_sil if
>>>> loaded, run above commands, load sata_sil and test.
>>>>
>>>> * If above commands don't work, try =00 instead of =04.
>>>>
>>>> Thanks.
>>>
>>> setpci -s 01:07/8.0 0c.b=04 performed, sata_sil inserted...
>>>
>>> md5sum crapped out again, similar errors in dmesg as before.
>>>
>>> setpci -s 01:07/8.0 0c.b=00 performed, sata_sil inserted...
>>>
>>> It worked...
>>> cp ~/hugefile /mnt/sda1 && cp /mnt/sda1/hugefile /mnt/sdb1
>>> && md5sum /mnt/sda1/hugefile /mnt/sdb1/hugefile
>>>
>>> ccf5f9052aa1fac3062c3f1920abb1fc /mnt/sda1/hugefile
>>> ccf5f9052aa1fac3062c3f1920abb1fc /mnt/sdb1/hugefile
>>>
>>> What does this register do, out of interest? With 00 it took ages and
>>> made my load average shoot up to about 6.50!
>> Apologies for bumping this a mere 2 days later but I felt that
>> progress was being made...
>> Ok, so it's the PCI cache line size register... 08 means a value of 64
>> bits which corresponds to the line size of my L1/L2 cache, am I correct?
>
> Yes, you're right.
>
>> The fact that even with a value of 01 set (for fun) still corrupts the
>> file seems to indicate that the fault is somewhere there, but why?
>> Should I just give up and buy a decent mainboard? :P (currently running
>> A7N8X-Deluxe v2.0, latest 1008 BIOS)
>
> I'm not sure whether the cache line size is the actual problem or the
> slowdown caused by 0 cacheline size (r/w optimizations based on
> cacheline size are turned off) hides the problem. I was hoping BIOS
> messed up while setting cache line size and adjusting it to 4 makes things
> work.
>
>> I would like to know more about this since the only topics on PCI cache
>> line sizes I can find are ones where people are having problems.
>
> I don't know. I think this can be best diagnosed by SIMG. Carlos, does
> anything ring a bell?
>
> Thanks.
>
This is where it gets weird... I may have uncovered a BIOS bug.
I changed the mainboard out as a last-ditch attempt to get this working
and BEHOLD! The drives work perfectly. I swapped the A7N8X-D out for an
Abit NF7-M (same nForce2 chipset, with the exception of onboard graphics)
and used the same hardware as before.
This NF7-M is on loan to me so I cannot use it indefinitely. Any ideas,
Tejun?
If worst comes to worst, I can buy an old nForce2 board for a small sum off
eBay.
Jonathan
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Tejun Heo @ 2006-10-23 2:22 UTC
To: Jonathan Bell; +Cc: linux-ide
Jonathan Bell wrote:
> This is where it gets weird... I may have uncovered a BIOS bug.
>
> I changed the mainboard out as a last-ditch attempt to get this working
> and BEHOLD! The drives work perfectly. I swapped the A7N8X-D out for an
> Abit NF7-M (same nForce2 chipset, with the exception of onboard
> graphics) and used the same hardware as before.
>
> This NF7-M is on loan to me so I cannot use it indefinitely. Any ideas,
> Tejun?
>
> Worst comes to worst I can buy an old nForce2 board for a minor sum off
> eBay.
I guess it could be a PCI bus problem. Maybe the controller and the PCI
bus on the board don't like each other and things get corrupted when
transactions occur at high speed. I've seen data corruption over the PCI
bus on a pilot embedded system board. Not sure whether such things
apply to consumer products.
I dunno. Simply changing the motherboard or the controller might be the
best solution for you. Considering the large deployment of
3112/3512/3114 controllers, it's hard to believe your problem is a
software bug.
Thanks.
--
tejun
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Alan Cox @ 2006-10-23 10:13 UTC
To: Tejun Heo; +Cc: Jonathan Bell, linux-ide
Ar Llu, 2006-10-23 am 11:22 +0900, ysgrifennodd Tejun Heo:
> I guess it could be a PCI bus problem. Maybe the controller and the PCI
> bus on the board don't like each other and things get corrupted when
> transactions occur at high speed. I've seen data corruption over PCI
> bus on some pilot embedded system board. Not sure whether such things
> are applicable to consumer products.
From the IDE driver...
* If you have strange problems with nVidia chipset systems please
* see the SI support documentation and update your system BIOS
* if neccessary
Alan
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Jonathan Bell @ 2006-10-23 13:35 UTC
To: Alan Cox; +Cc: linux-ide@vger.kernel.org
On Mon, 23 Oct 2006 11:13:57 +0100, Alan Cox <alan@lxorguk.ukuu.org.uk>
wrote:
> Ar Llu, 2006-10-23 am 11:22 +0900, ysgrifennodd Tejun Heo:
>> I guess it could be a PCI bus problem. Maybe the controller and the PCI
>> bus on the board don't like each other and thing get corrupt when
>> transactions occur at high speed. I've seen data corruption over PCI
>> bus on some pilot embedded system board. Not sure whether such things
>> are applicable to consumer products.
>
> From the IDE driver...
>
> * If you have strange problems with nVidia chipset systems please
> * see the SI support documentation and update your system BIOS
> * if neccessary
>
>
> Alan
>
BIOS for the A7N8X is the latest 1008 version which overcomes a boot
limitation - the board would not boot with more than 1 SATA controller
installed.
This RAID corruption bug was supposedly fixed in 1005.
Since contacting Asus technical support is likely to be as productive as
getting blood from a stone, I'm going to go ahead and scrounge another
motherboard - since this is socket A stuff it should be dirt cheap by now.
Thanks for all the suggestions,
Jonathan
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Alan Cox @ 2006-10-23 14:09 UTC
To: Jonathan Bell; +Cc: linux-ide@vger.kernel.org
Ar Llu, 2006-10-23 am 14:35 +0100, ysgrifennodd Jonathan Bell:
> BIOS for the A7N8X is the latest 1008 version which overcomes a boot
> limitation - the board would not boot with more than 1 SATA controller
> installed.
>
> This RAID corruption bug was supposedly fixed in 1005.
But this is a BIOS so was it unfixed again in 1006 ?
Also for that matter, does anyone in Nvidia know and want to explain
what was wrong in those BIOSes and what we can do in Linux to
handle/correct the situation ?
* Re: Errors when copying between drives on a SiI3114 controller under kernel 2.6.18
From: Jonathan Bell @ 2006-10-30 20:53 UTC
To: Alan Cox; +Cc: linux-ide@vger.kernel.org
On Mon, 23 Oct 2006 15:09:55 +0100, Alan Cox <alan@lxorguk.ukuu.org.uk>
wrote:
> Ar Llu, 2006-10-23 am 14:35 +0100, ysgrifennodd Jonathan Bell:
>> BIOS for the A7N8X is the latest 1008 version which overcomes a boot
>> limitation - the board would not boot with more than 1 SATA controller
>> installed.
>>
>> This RAID corruption bug was supposedly fixed in 1005.
>
> But this is a BIOS so was it unfixed again in 1006 ?
>
> Also for that matter, does anyone in Nvidia know and want to explain
> what was wrong in those BIOSes and what we can do in Linux to
> handle/correct the situation ?
>
OK.... the problem has "fixed itself".
When I contacted Asus technical support, the nice guy at the other end told
me to increase the latency timer of the cards. In order to test the
mainboard, I put it back in place of the temporary NF7-M. Suspiciously, the
tests that failed before are now flawless, and the only difference was that
two PCI cards were not installed: the TV tuner card and a wireless LAN card.
Installing them again to duplicate the previous setup exactly didn't make
the error reappear.
Even so, the board seems flaky, so I'm hesitant to put it into full service
just yet; I'll do a few stress test runs.
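For reference, the latency timer that Asus support suggested raising is the one-byte register at PCI config offset 0x0d, counted in PCI bus clocks (about 30 ns each on a 33 MHz bus). The thread does not say whether it was changed or to what value, so the helper names and the example value 40 (hex) below are hypothetical; reading the register first with `setpci -s 01:07.0 0d.b` keeps the original value available for restoring.

```shell
#!/bin/sh
# Hypothetical helpers for the PCI Latency Timer (config offset 0x0d).
# The register counts PCI bus clocks; at 33 MHz one clock is about 30 ns.
lat_clocks() {
    echo $(( 0x${1#0x} ))
}

lat_ns_33mhz() {
    echo $(( 0x${1#0x} * 30 ))
}

# Print (dry run) the command that would raise the timer on one card.
# Set RUN=command to actually execute it as root.
raise_latency() {
    dev=$1
    value=${2:-40}              # hex; 0x40 = 64 clocks, an example only
    ${RUN:-echo} setpci -s "$dev" "0d.b=$value"
}
```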