public inbox for linux-kernel@vger.kernel.org
* Via-Rhine NIC, Via SATA or reiserfs broken, how to tell??
@ 2005-08-11 23:43 Grant Coady
  2005-08-12 10:43 ` Vladimir V. Saveliev
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Grant Coady @ 2005-08-11 23:43 UTC (permalink / raw)
  To: linux-kernel

Greetings,

Situation is dataloss with no errors logged.

Test: unpack 2.6.12 tarball from NFS mount source, diff against 
previous attempt:

$ diff -Nrup linux-2.6.12.old linux-2.6.12
Binary files linux-2.6.12.old/include/asm-sparc/a.out.h and linux-2.6.12/include/asm-sparc/a.out.h differ
diff -Nrup linux-2.6.12.old/include/asm-sparc/apc.h linux-2.6.12/include/asm-sparc/apc.h
--- linux-2.6.12.old/include/asm-sparc/apc.h    2005-06-18 05:48:29.000000000 +1000
+++ linux-2.6.12/include/asm-sparc/apc.h        2005-06-18 05:48:29.000000000 +1000
@@ -31,7 +31,7 @@
 #define APC_BPORT_REG  0x30

 #define APC_REGMASK            0x01
-define APC_BPMASK              0x03
+#define APC_BPMASK             0x03

 /*
  * IDLE - CPU standby values (set to initiate standby)
diff -Nrup linux-2.6.12.old/include/asm-sparc/svr4.h linux-2.6.12/include/asm-sparc/svr4.h
--- linux-2.6.12.old/include/asm-sparc/svr4.h   2005-06-18 05:48:29.000000000 +1000
+++ linux-2.6.12/include/asm-sparc/svr4.h       2005-06-18 05:48:29.000000000 +1000
@@ -15,7 +15,7 @@ typedef struct {                /* signa

 /* Values for siginfo.code */
 #define SVR4_SINOINFO 32767
-/* Siginfo, sucker expects bunch of information on those paramEters */
+/* Siginfo, sucker expects bunch of information on those parameters */
 typedef union {
        char total_size [128];
        struct {


Seems like three bit errors for source tree.  Other times I've noted 
compile failures where unpacking source tree fresh would 'fix' error.
I'd previously assumed that I accidentally killed source tree with 
'cp -al ...' copies but I've had a segfault on that operation, hence 
I do not know if this be NIC or filesystem (reiserfs on via SATA).
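
One way to check the "bit error" reading of the diff above is to XOR the two versions of a damaged string and count the set bits. The sketch below is only an illustration; the function names are made up for this note, and it only applies where both versions have the same length.

```python
def differing_bits(a, b):
    """Return (byte_offset, xor_mask) for every position where a and b differ."""
    if len(a) != len(b):
        # A length change cannot be explained by flipping bits in place.
        raise ValueError("inputs must have equal length")
    return [(i, x ^ y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def flipped_bit_count(a, b):
    """Total number of bits that differ between a and b."""
    return sum(bin(mask).count("1") for _, mask in differing_bits(a, b))


old = b"paramEters"   # as found in linux-2.6.12.old
new = b"parameters"   # as found in the fresh extraction
print(differing_bits(old, new))     # [(5, 32)]
print(flipped_bit_count(old, new))  # 1
```

'E' is 0x45 and 'e' is 0x65, so the svr4.h difference is exactly one flipped bit (mask 0x20) in the extracted text.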


Today disabled onboard via-rhine and used Intel pro/100 + e100 driver, 
several source trees unpacked identically, running 2.6.12.4 or 2.4.31-hf3

The fault occurs on 2.4 latest or 2.6 latest only on particular target 
box, so problem is not the NFS server.

How can I test and isolate whether this error is in the NIC driver, SATA 
driver or filesystem?
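
A minimal sketch of the extract-twice-and-compare test, using content hashes instead of `diff -Nrup` so that only the names of differing files are reported. The function names are hypothetical, and the tree names in the usage note are simply the ones from the diff above.

```python
import hashlib
import os


def tree_digests(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # compare regular file contents only
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests


def compare_trees(old_root, new_root):
    """Return sorted relative paths that are missing from one tree or differ."""
    old, new = tree_digests(old_root), tree_digests(new_root)
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Calling `compare_trees("linux-2.6.12.old", "linux-2.6.12")` after each pair of extractions and logging the result would show whether the same files are hit every time or the damage moves around.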

Thanks,
Grant.



* Re: Via-Rhine NIC, Via SATA or reiserfs broken, how to tell??
  2005-08-11 23:43 Via-Rhine NIC, Via SATA or reiserfs broken, how to tell?? Grant Coady
@ 2005-08-12 10:43 ` Vladimir V. Saveliev
  2005-08-12 12:19   ` Grant Coady
  2005-08-14  9:12 ` Resolved?: " Grant Coady
  2005-08-14 12:12 ` Roger Luethi
  2 siblings, 1 reply; 8+ messages in thread
From: Vladimir V. Saveliev @ 2005-08-12 10:43 UTC (permalink / raw)
  To: Grant Coady; +Cc: linux-kernel

Hello

Grant Coady wrote:
> How to test and isolate this error is in NIC driver, SATA driver or 
> filesystem?  
> 

Could it be that the tarball on the NFS server changed?
It is not very likely that an error in kernel drivers fixed typos in source code.

> Thanks,
> Grant.
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 



* Re: Via-Rhine NIC, Via SATA or reiserfs broken, how to tell??
  2005-08-12 10:43 ` Vladimir V. Saveliev
@ 2005-08-12 12:19   ` Grant Coady
  2005-08-12 14:21     ` Masoud Sharbiani
  0 siblings, 1 reply; 8+ messages in thread
From: Grant Coady @ 2005-08-12 12:19 UTC (permalink / raw)
  To: Vladimir V. Saveliev; +Cc: linux-kernel

On Fri, 12 Aug 2005 14:43:42 +0400, "Vladimir V. Saveliev" <vs@namesys.com> wrote:
>> How to test and isolate this error is in NIC driver, SATA driver or 
>> filesystem?  
>> 
>
>Could it be that the tarball on the NFS server changed?
>It is not very likely that an error in kernel drivers fixed typos in source code.

The 'typos' are the observed errors from extracting kernel source tarball, 
renaming top level directory and extracting tarball again.  Other times 
extraction fails with corrupt tarball error.  Cached image of tarball is 
corrupted as box doesn't go back to server.
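
If the cached image of the tarball is the suspect, one check is to hash the tarball itself several times rather than extracting it. This is a rough sketch with hypothetical function names; it assumes nothing rewrites the file between reads.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of the file at path, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def repeated_read_check(path, rounds=5):
    """Read the file `rounds` times and return the set of distinct digests seen."""
    return {file_digest(path) for _ in range(rounds)}
```

More than one digest in the result means the bytes changed between reads. A single digest is not proof of health, since a cached copy that was damaged once will hash the same way every time, so it should also be compared with a digest taken on a box that is known to be good.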

Since first report I've changed to using ext2 target filesystem, still get 
errors, so not reiserfs specific either.  

Am in process of reducing options in kernel config, try to narrow down 
what problem is.  Nothing in logs, me have no idea ... yet.  

Not a memory error as box compiled many hundred kernels last week without 
choking.  Test just now was with 2.6.13-rc6-git3, very repeatable.

Same test on different box, no errors.  Other box has pro/100 NIC, 
reiserfs, unpack tarball from same server.  Never a problem.

Cheers,
Grant.



* Re: Via-Rhine NIC, Via SATA or reiserfs broken, how to tell??
  2005-08-12 12:19   ` Grant Coady
@ 2005-08-12 14:21     ` Masoud Sharbiani
  2005-08-12 22:13       ` Grant Coady
  0 siblings, 1 reply; 8+ messages in thread
From: Masoud Sharbiani @ 2005-08-12 14:21 UTC (permalink / raw)
  To: Grant Coady; +Cc: Vladimir V. Saveliev, linux-kernel

Can you turn on UDP checksums and try again? That would isolate the 
fault between the network or SATA.
cheers,
Masoud

* Re: Via-Rhine NIC, Via SATA or reiserfs broken, how to tell??
  2005-08-12 14:21     ` Masoud Sharbiani
@ 2005-08-12 22:13       ` Grant Coady
  0 siblings, 0 replies; 8+ messages in thread
From: Grant Coady @ 2005-08-12 22:13 UTC (permalink / raw)
  To: Masoud Sharbiani; +Cc: Vladimir V. Saveliev, linux-kernel

On Sat, 13 Aug 2005 00:21:30 +1000, Masoud Sharbiani <masouds@masoud.ir> wrote:

> Can you turn on UDP checksums and try again? That would isolate the
> fault between the network or SATA.

It is the second tarball extraction from cache that suffers data
corruption, not a network error.  I am in process of narrowing
down cause as I now have 2.4.32-pre3 and 2.6.13-rc6-git3 .configs
that work okay (5 tarball extracts, diff okay) on reiserfs and ext2.

Something to do with MTRR, highmem (box has 1GB), unwanted MP
detection in dmesg --> no longer network, filesystem and/or SATA
driver directly, dunno what yet, but tedious process of elimination
will take time.

How do I force fetching tarball from over NFS again?  At present
the repeat extractions are from memory, not from network.
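
One possible way to get a file re-read from its backing store is to ask the kernel to discard its cached pages with posix_fadvise(2). The sketch below is only that: the call is advisory, the helper name is made up for this note, and how an NFS client reacts to it is an assumption that would need testing. Unmounting and remounting the export is the blunter alternative.

```python
import os


def drop_cached_pages(path):
    """Ask the kernel to discard cached pages for path (advisory, Unix only)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Offset 0 with length 0 means "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
```

Calling this on the tarball between extractions, then hashing the file again, would help tell a bad cached copy apart from a bad transfer.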

Cheers,
Grant.


* Resolved?: Via-Rhine NIC, Via SATA or reiserfs broken, how to tell??
  2005-08-11 23:43 Via-Rhine NIC, Via SATA or reiserfs broken, how to tell?? Grant Coady
  2005-08-12 10:43 ` Vladimir V. Saveliev
@ 2005-08-14  9:12 ` Grant Coady
  2005-08-14 12:12 ` Roger Luethi
  2 siblings, 0 replies; 8+ messages in thread
From: Grant Coady @ 2005-08-14  9:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: annabellesgarden, jgarzik

On Fri, 12 Aug 2005 09:43:31 +1000, Grant Coady <Grant.Coady@gmail.com> wrote:

Hi there,

Problem was dataloss on extracting kernel source, sometimes only 
one character changed.  Details on 

  http://bugsplatter.mine.nu/test/boxen/sempro/

Not the NIC, not reiserfs, not the kernel config, not even the 
SATA data cable...  Not make sense :o)

Dataloss seemed to be in the buffered memory copy of the tarball, 
but this box also compiled several hundred kernels in a session 
without a problem.  It also locked up after 4 1/2 hours compiling; 
at that time I thought a kernel config change had fixed the issue.

Solution?

Set BIOS memory timing to manual, thinking perhaps the BIOS sometimes 
does not read the SPD EEPROM correctly, 'cos it was like I had bad 
memory only sometimes: reboot, memory okay, next day maybe something 
bad again.

I'll be extracting source tarballs twice and diffing them for some time 
to be sure.  Built the box in March, it sometimes locked up, I'd do 
some ad hoc kernel config adjustments and carry on.  This time I tried 
to methodically nail the issue and got nowhere with configuration 
changes.

Does BIOS not setting memory timing properly sometimes sound like a 
reasonable explanation for the fault?  Extracted about 100 tarballs 
without error.  Currently running 2.6.13-rc6-git5 which produced 
heaps of errors before I attacked the hardware: reseating memory 
modules and AGP card, and adjusting the BIOS settings.

Grant.



* Re: Via-Rhine NIC, Via SATA or reiserfs broken, how to tell??
  2005-08-11 23:43 Via-Rhine NIC, Via SATA or reiserfs broken, how to tell?? Grant Coady
  2005-08-12 10:43 ` Vladimir V. Saveliev
  2005-08-14  9:12 ` Resolved?: " Grant Coady
@ 2005-08-14 12:12 ` Roger Luethi
  2005-08-14 20:13   ` Grant Coady
  2 siblings, 1 reply; 8+ messages in thread
From: Roger Luethi @ 2005-08-14 12:12 UTC (permalink / raw)
  To: Grant Coady; +Cc: linux-kernel

> @@ -31,7 +31,7 @@
>  #define APC_BPORT_REG  0x30
> 
>  #define APC_REGMASK            0x01
> -define APC_BPMASK              0x03
> +#define APC_BPMASK             0x03

Color me skeptical. I've seen some weird bit flips and data corruption;
"parameters" to "paramEters" I could buy. But data corruption that
_inserts_ a hash mark at the beginning of a line of a header file? What
are the odds?

> Today disabled onboard via-rhine and used Intel pro/100 + e100 driver, 
> several source trees unpacked identically, running 2.6.12.4 or 2.4.31-hf3

While that seems to point to the Rhine as the possible cause, I can't
see how any driver could possibly be involved in this.

Roger



* Re: Via-Rhine NIC, Via SATA or reiserfs broken, how to tell??
  2005-08-14 12:12 ` Roger Luethi
@ 2005-08-14 20:13   ` Grant Coady
  0 siblings, 0 replies; 8+ messages in thread
From: Grant Coady @ 2005-08-14 20:13 UTC (permalink / raw)
  To: Roger Luethi; +Cc: linux-kernel

On Sun, 14 Aug 2005 22:12:55 +1000, Roger Luethi <rl@hellgate.ch> wrote:

>> @@ -31,7 +31,7 @@
>>  #define APC_BPORT_REG  0x30
>>
>>  #define APC_REGMASK            0x01
>> -define APC_BPMASK              0x03
>> +#define APC_BPMASK             0x03
>
> Color me skeptical. I've seen some weird bit flips and data corruption;
> "parameters" to "paramEters" I could buy. But data corruption that
> _inserts_ a hash mark at the beginning of a line of a header file? What
> are the odds?

A bitflip in the compressed image could expand the wrong token
resulting in dataloss just as easily as flip character case.
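
The point about compressed data can be tried out in a small experiment: compress some text, flip each bit of the compressed stream in turn, and see what the decompressor does. The sketch uses a raw DEFLATE stream (no header, no checksum) so that damage is not caught by a trailer check. The thread does not say which compressor the tarball used, so this shows the mechanism only and is not a reconstruction of the actual fault. Function names and the sample text are made up for this note.

```python
import zlib


def raw_deflate(data):
    """Compress to a raw DEFLATE stream (negative wbits: no header, no checksum)."""
    co = zlib.compressobj(9, zlib.DEFLATED, -15)
    return co.compress(data) + co.flush()


def classify_bit_flips(data):
    """Flip each bit of the compressed stream once and tally the decompressor's reaction."""
    packed = raw_deflate(data)
    tally = {"error": 0, "unchanged": 0, "silently_different": 0}
    for bit in range(len(packed) * 8):
        damaged = bytearray(packed)
        damaged[bit // 8] ^= 1 << (bit % 8)
        try:
            out = zlib.decompress(bytes(damaged), -15)
        except zlib.error:
            tally["error"] += 1
            continue
        tally["unchanged" if out == data else "silently_different"] += 1
    return tally


sample = b"#define APC_REGMASK 0x01\n#define APC_BPMASK 0x03\n" * 4
print(classify_bit_flips(sample))
```

Flips counted as "silently_different" are the ones where a single damaged bit in the compressed image decodes to different text without any error being raised. A gzip file also carries a CRC-32 in its trailer, but that is only checked once the data has already been decompressed.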

Since reporting this error I've eliminated filesystem and NIC
by substitution, fault occurs on ext2 and Intel pro/100.
>
>> Today disabled onboard via-rhine and used Intel pro/100 + e100 driver,
>> several source trees unpacked identically, running 2.6.12.4 or 2.4.31-hf3
>
> While that seems to point to the Rhine as the possible cause, I can't
> see how any driver could possibly be involved in this.

Neither can I, now testing outside of linux, eliminate OS as factor,
or, is it hardware or software?  I dunno...

Grant.
