* Automount/NFS issues causing executables to appear corrupted
@ 2004-04-18 21:23 Venkata Ravella
  2004-04-18 23:06 ` H. Peter Anvin
  2004-04-19  1:07 ` Ian Kent
  0 siblings, 2 replies; 15+ messages in thread
From: Venkata Ravella @ 2004-04-18 21:23 UTC (permalink / raw)
  To: linux-kernel; +Cc: Ramki Balasubramanian, ab, hpa


The kernel we currently use is the default 7.2 kernel with two modifications:
1) BM patch applied to extend the address space of a single process to 3.6GB
2) mnt patch applied to allow up to 1024 NFS mount points

uname -r output:
2.4.7-10mntBMsmp

Here is a detailed description of the problem, its symptoms, and a few
observations. Please let me know where I should look for solutions, and send
any pointers that could help debug this problem further.
Unfortunately, upgrading to a newer kernel is not an option for us at the
moment.


Problem Description:
The executables on a particular NFS directory appear corrupted. The problem
is limited to that one NFS filesystem. Analysis so far points to
automount/NFS on the local host as the culprit.  Until a permanent fix is
found, the NFS directory has to be unmounted and re-mounted, or the
automounter has to be restarted, to clear the problem.  We cannot reproduce
the problem on demand, but it shows up on our systems at random.


Symptoms:
The following are the symptoms of this problem. These symptoms may be very
misleading to the user.

Symptom 1
---------
The executable fails with one of the following errors:

error while loading shared libraries: unexpected PLT reloc type 0x00
or
error while loading shared libraries: unsupported version 0 of Verneed record
or
Memory fault, Segmentation fault or Illegal instruction

Symptom 2
---------
The executable fails with errors of the following kind:

/lib/libdl.so.2: version `' not found
/lib/i686/libm.so.6: version `' not found
/lib/i686/libpthread.so.0: version `' not found
/lib/i686/libc.so.6: version `' not found


Symptom 3
---------
SGE generated job output logs are truncated.



Detailed Analysis [Data Points Only]:

Sum produces wrong result
-------------------------
Example comparison of sum output of the same executable extracted from a good
system with the executable extracted from a bad one:

$ sum qqq*
50340  1147 qqq.bad
48019  1147 qqq.good
$


Executable dies at the relocation phase
---------------------------------------
The following is the tail output from the executable run with LD_DEBUG=all
setting:

24416:  symbol=stderr;  lookup in file=/lib/i686/libm.so.6
24416:  symbol=stderr;  lookup in file=/lib/i686/libc.so.6
24416:  binding file ./qqq.bad to /lib/i686/libc.so.6: normal symbol `stderr' [GLIBC_2.0]
24416:  symbol=__ctype_toupper;  lookup in file=/lib/i686/libm.so.6
24416:  symbol=__ctype_toupper;  lookup in file=/lib/i686/libc.so.6
24416:  binding file ./qqq.bad to /lib/i686/libc.so.6: normal symbol `__ctype_toupper' [GLIBC_2.0]
24416:  symbol=__ctype_b;  lookup in file=/lib/i686/libm.so.6
24416:  symbol=__ctype_b;  lookup in file=/lib/i686/libc.so.6
24416:  binding file ./qqq.bad to /lib/i686/libc.so.6: normal symbol `__ctype_b' [GLIBC_2.0]
./qqq.bad: error while loading shared libraries: unexpected PLT reloc type 0x00


cmp shows that the good and bad executables differ
--------------------------------------------------
$ cmp  qqq.bad qqq.good
qqq.bad qqq.good differ: char 12289, line 40
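
The offsets that cmp reports can be grouped into page-sized blocks to see
whether the damage follows page boundaries. The following is only a minimal
sketch, not part of the original analysis: the function name differing_pages
and the 4096-byte default block size (the i386 page size) are assumptions made
for illustration.

```shell
#!/bin/sh
# differing_pages FILE_A FILE_B [BLOCK_SIZE]
# Print the zero-based index of every BLOCK_SIZE-byte block in which the
# two files differ. BLOCK_SIZE defaults to 4096 (assumed page size).
differing_pages() {
    # cmp -l prints one line per differing byte: the 1-based byte number,
    # followed by the two byte values in octal. Only the byte number is used.
    cmp -l "$1" "$2" 2>/dev/null |
        awk -v bs="${3:-4096}" '
            { page = int(($1 - 1) / bs) }
            !(page in seen) { seen[page] = 1; print page }'
}

# Hypothetical usage with the files from this report:
#   differing_pages qqq.good qqq.bad
```

If the output is a short list of whole blocks rather than scattered bytes,
that would suggest the corruption happens at page granularity.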


Octal dump of the bad executable shows null bytes from byte 12289
-----------------------------------------------------------------
$ od -j 12250 qqq.bad | head -10
0027732 004023 142007 000000 037340 004023 142407 000000 037344
0027752 004023 143007 000000 037350 004023 143407 000000 037354
0027772 004023 144007 000000 000000 000000 000000 000000 000000
0030012 000000 000000 000000 000000 000000 000000 000000 000000
*
0037772 000000 000000 000000 175750 000316 164400 127560 000000
0040012 133215 000000 000000 106613 174324 177777 161676 010033
0040032 135410 015743 004020 010613 153611 003271 000000 176000
0040052 140061 123363 002164 140031 000414 140205 013564 164676
0040072 010033 104410 134727 000006 000000 124374 171400 007646

$ od -j 12250 qqq.good  | head -10
0027732 004023 142007 000000 037340 004023 142407 000000 037344
0027752 004023 143007 000000 037350 004023 143407 000000 037354
0027772 004023 144007 000000 037360 004023 144407 000000 037364
0030012 004023 145007 000000 037370 004023 145407 000000 037374
0030032 004023 146007 000000 037400 004023 146407 000000 037404
0030052 004023 147007 000000 037410 004023 147407 000000 037414
0030072 004023 151007 000000 037420 004023 151407 000000 037424
0030112 004023 152007 000000 037430 004023 152407 000000 037434
0030132 004023 153007 000000 037440 004023 153407 000000 037444
0030152 004023 154007 000000 037450 004023 156407 000000 037454
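
Read together, the two dumps suggest that the zero-filled run covers exactly
one 4096-byte block. The arithmetic can be sketched as follows; the page size
of 4096 bytes is an assumption (it is the i386 page size), and the function
names are made up for illustration.

```shell
#!/bin/sh
PAGE_SIZE=4096   # assumption: i386 page size

# cmp reports 1-based byte numbers; convert to a 0-based offset and split
# it into a page index and an offset within that page.
page_of_cmp_byte() {
    offset=$(( $1 - 1 ))
    echo "page=$(( offset / PAGE_SIZE )) offset_in_page=$(( offset % PAGE_SIZE ))"
}

# od prints octal addresses; a leading 0 makes shell arithmetic read the
# value as octal. Each word on an od line is 2 bytes wide, so word N
# (0-based) of a line starts at address + 2*N.
od_word_offset() {
    echo $(( 0$1 + 2 * $2 ))
}

page_of_cmp_byte 12289      # prints: page=3 offset_in_page=0
od_word_offset 0027772 3    # first word that differs from qqq.good: 12288
od_word_offset 0037772 3    # first non-zero word after the zero run: 16384
```

Under these assumptions the zeroed region runs from byte 12288 up to (but not
including) byte 16384, that is, one whole page. This might point at a single
bad page in the client's cache rather than at damage to the file on the
server, but the dumps alone do not prove that.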


Other Observations:

- The sum output for the same executable differs between two different
  systems experiencing this behaviour.

- This affects only executables. Text files seem to be fine.

- Copying any binary into the affected NFS partition gives an input/output
  error:

$ cp /tmp/ppp.good .
cp: writing `./ppp.good': Input/output error
cp: closing `./ppp.good': Input/output error

$ cp /usr/bin/archive .
cp: closing `./archive': Input/output error





* Re: Automount/NFS issues causing executables to appear corrupted
  2004-04-18 21:23 Venkata Ravella
@ 2004-04-18 23:06 ` H. Peter Anvin
  2004-04-19  1:07 ` Ian Kent
  1 sibling, 0 replies; 15+ messages in thread
From: H. Peter Anvin @ 2004-04-18 23:06 UTC (permalink / raw)
  To: Venkata Ravella; +Cc: linux-kernel, Ramki Balasubramanian, ab

Venkata Ravella wrote:
> The current kernel we use is default 7.2 kernel with two modifications:
> 1) BM patch applied to extend address space for a single process to 3.6GB
> 2) mnt patch applied to allow upto 1024 nfs mount points
> 
> uname -r output:
> 2.4.7-10mntBMsmp

In other words, you're using an ancient kernel with plenty of known 
problems, applied two additional patches to it, and are surprised you're 
having problems?

 > Unfortunately, upgrading to a newer kernel is not an option for us at 
 > the moment.

Sucks to be you.

	-hpa



* Re: Automount/NFS issues causing executables to appear corrupted
  2004-04-18 21:23 Venkata Ravella
  2004-04-18 23:06 ` H. Peter Anvin
@ 2004-04-19  1:07 ` Ian Kent
  1 sibling, 0 replies; 15+ messages in thread
From: Ian Kent @ 2004-04-19  1:07 UTC (permalink / raw)
  To: Venkata Ravella; +Cc: linux-kernel, Ramki Balasubramanian, ab, hpa


Please cc autofs questions to the list at autofs@linux.kernel.org.

On Sun, 18 Apr 2004, Venkata Ravella wrote:

> 
> The current kernel we use is default 7.2 kernel with two modifications:
> 1) BM patch applied to extend address space for a single process to 3.6GB
> 2) mnt patch applied to allow upto 1024 nfs mount points
> 
> uname -r output:
> 2.4.7-10mntBMsmp

What autofs version?

To be honest it's a bit hard to see how this is an autofs issue.
Mind, having said that, ....

Ian



* Automount/NFS issues causing executables to appear corrupted
@ 2004-04-19 15:56 Venkata Ravella
  0 siblings, 0 replies; 15+ messages in thread
From: Venkata Ravella @ 2004-04-19 15:56 UTC (permalink / raw)
  To: autofs


** This was already sent to linux-kernel@vger.kernel.org; if you **
** are subscribed to that list, this may be a duplicate for you  **



* Re: Automount/NFS issues causing executables to appear corrupted
@ 2004-04-20  0:08 Venkata Ravella
  2004-04-20  0:24   ` H. Peter Anvin
  2004-04-20 14:35 ` Todd Denniston
  0 siblings, 2 replies; 15+ messages in thread
From: Venkata Ravella @ 2004-04-20  0:08 UTC (permalink / raw)
  To: raven; +Cc: linux-kernel, Ramki.Balasubramanium, ab, hpa, autofs


The autofs version is autofs-3.1.7-21.

I also have one new update: we have started seeing a similar problem on
a system running kernel 2.4.18-e.12smp, which has the same version
(3.1.7-21) of autofs.

This may or may not be an autofs problem, but restarting autofs
fixes it temporarily.


>
>Please cc autofs questions to the list at autofs@linux.kernel.org.
>
>On Sun, 18 Apr 2004, Venkata Ravella wrote:
>
>> 
>> The current kernel we use is default 7.2 kernel with two modifications:
>> 1) BM patch applied to extend address space for a single process to 3.6GB
>> 2) mnt patch applied to allow upto 1024 nfs mount points
>> 
>> uname -r output:
>> 2.4.7-10mntBMsmp
>
>What autofs version?
>
>To be honest it's a bit hard to see how this is an autofs issue.
>Mind, having said that, ....
>
>Ian


* Re: Automount/NFS issues causing executables to appear corrupted
  2004-04-20  0:08 Automount/NFS issues causing executables to appear corrupted Venkata Ravella
@ 2004-04-20  0:24   ` H. Peter Anvin
  2004-04-20 14:35 ` Todd Denniston
  1 sibling, 0 replies; 15+ messages in thread
From: H. Peter Anvin @ 2004-04-20  0:24 UTC (permalink / raw)
  To: Venkata Ravella; +Cc: autofs, ab, Ramki.Balasubramanium, linux-kernel, raven

Venkata Ravella wrote:
> autofs version is autofs-3.1.7-21
> 
> I also have one new update. We started seeing similar problem on
> the system running the kernel 2.4.18-e.12smp which has the same
> version(3.1.7-21) of autofs as well.
> 
> This may or may not be an autofs problem but, restarting autofs
> fixes this problem temporarily.
> 

That will cause an NFS remount.  This really feels much more like an NFS
problem.

	-hpa



* Re: Automount/NFS issues causing executables to appear corrupted
  2004-04-20  0:24   ` H. Peter Anvin
@ 2004-04-20  1:27     ` Ian Kent
  -1 siblings, 0 replies; 15+ messages in thread
From: Ian Kent @ 2004-04-20  1:27 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: autofs, ab, Venkata Ravella, linux-kernel, Ramki.Balasubramanium

On Mon, 19 Apr 2004, H. Peter Anvin wrote:

> Venkata Ravella wrote:
> > autofs version is autofs-3.1.7-21
> > 
> > I also have one new update. We started seeing similar problem on
> > the system running the kernel 2.4.18-e.12smp which has the same
> > version(3.1.7-21) of autofs as well.
> > 
> > This may or may not be an autofs problem but, restarting autofs
> > fixes this problem temporarily.
> > 
> 
> That will cause an NFS remount.  This really feels much more like an NFS
> problem.

Certainly does.

Venkata,

Can you also forward this question to the nfs list at 
nfs@lists.sourceforge.net. Sorry to ask you to post all over the place.

Please investigate the NFS client patches maintained by Trond Myklebust. 
Check nfs.sourceforge.net. We found we had to use them in early 2.4 versions.

Ian



* Re: Automount/NFS issues causing executables to appear corrupted
       [not found]     ` <20040420042811.GE20474@rearview.synopsys.com>
@ 2004-04-20  5:24       ` Ian Kent
  0 siblings, 0 replies; 15+ messages in thread
From: Ian Kent @ 2004-04-20  5:24 UTC (permalink / raw)
  To: Venkata Ravella
  Cc: autofs, ab, Ramki.Balasubramanium, linux-kernel, H. Peter Anvin

On Mon, 19 Apr 2004, Venkata Ravella wrote:

> 
> Posted. I am very thankful to your pointers. Both those lists are closed
> lists and moderated. I do not think moderator got a chance to look at it
> and post. 
> 

I certainly accepted your post to the autofs list.

Hope someone on one of the lists has seen this in the past.

Ian


* Re: Re: Automount/NFS issues causing executables to appear  corrupted
  2004-04-20  0:08 Automount/NFS issues causing executables to appear corrupted Venkata Ravella
  2004-04-20  0:24   ` H. Peter Anvin
@ 2004-04-20 14:35 ` Todd Denniston
  2004-04-20 14:46   ` Todd Denniston
  2004-04-21  1:17   ` Jim Carter
  1 sibling, 2 replies; 15+ messages in thread
From: Todd Denniston @ 2004-04-20 14:35 UTC (permalink / raw)
  To: Venkata Ravella; +Cc: autofs

Question: is the file system mounted with the 'soft' option?
On the systems that are causing problems, try:
# any mount line that mentions "soft" is treated as a soft mount
if mount | grep -i soft >/dev/null 2>&1
then
  echo "we have soft mounts"
else
  echo "good, only normal mounts"
fi

We had a problem that took me 6 months of headaches to track down. One of
the other admins had chosen to mount all the file systems with the soft option
and propagated this to every machine he could (that is, to any I did not
control). Then people using his config started asking me why they were
getting IO errors transferring files to/from the file server I maintained.  If
the file was bigger than would fit in the normal [wr]size, which defaults to
1024 bytes {or 4096, depending on the kernel version, I believe}, the
probability of an IO error during normal operations went from 0 towards
certainty by the time the file was 650 MBytes, and an error would generally
happen by ~100 MBytes.

My server was a Sun Ultra 2 running Solaris 2.6; the clients were Linux
running 2.[02].X and a mix of autofs-3 and autofs-4 (whichever was installed
with the distros, RH6-9 & Slack7-9.1).

I now have a script being run from everyone's .profile to help me find any
remaining soft holdouts, because soft mounts caused so much trouble.
The alternative to 'soft' for us was 'hard,intr', which at least allows you to
break the applications when you really need to, but is much more robust in
normal operations.
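
A check of this kind could look roughly like the sketch below. It is not the
script referred to above; the function names are hypothetical, and it assumes
a Linux client where /proc/mounts lists one mount per line with the fields
device, mount point, filesystem type and comma-separated options.

```shell
#!/bin/sh
# soft_nfs_mounts [MOUNTS_FILE]
# Print the mount point of every NFS filesystem mounted with the 'soft'
# option. MOUNTS_FILE defaults to /proc/mounts (assumed Linux layout).
soft_nfs_mounts() {
    # $3 is the filesystem type, $4 the option list. 'soft' must be a
    # complete option, so a path or device that merely contains the
    # string "soft" is not reported.
    awk '$3 ~ /^nfs/ && $4 ~ /(^|,)soft(,|$)/ { print $2 }' "${1:-/proc/mounts}"
}

# Example for a login script: report any soft mounts that are present.
report_soft_mounts() {
    found=$(soft_nfs_mounts "$@")
    if [ -n "$found" ]; then
        echo "soft NFS mounts found:"
        echo "$found"
    else
        echo "no soft NFS mounts"
    fi
}
```

Matching on the option field avoids the false positives that a plain
"grep -i soft" over the whole mount line can produce.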


Venkata Ravella wrote:
> 
> autofs version is autofs-3.1.7-21
> 
> I also have one new update. We started seeing similar problem on
> the system running the kernel 2.4.18-e.12smp which has the same
> version(3.1.7-21) of autofs as well.
> 
> This may or may not be an autofs problem but, restarting autofs
> fixes this problem temporarily.
> 
> >
> >Please cc autofs questions to the list at autofs@linux.kernel.org.
> >
> >On Sun, 18 Apr 2004, Venkata Ravella wrote:
> >
> >>
> >> The current kernel we use is default 7.2 kernel with two modifications:
> >> 1) BM patch applied to extend address space for a single process to 3.6GB
> >> 2) mnt patch applied to allow upto 1024 nfs mount points
> >>
> >> uname -r output:
> >> 2.4.7-10mntBMsmp
> >
> >What autofs version?
> >
> >To be honest it's a bit hard to see how this is an autofs issue.
> >Mind, having said that, ....
> >
> >Ian

-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane) 
Harnessing the Power of Technology for the Warfighter


* Re: Re: Automount/NFS issues causing executables to appear  corrupted
  2004-04-20 14:35 ` Todd Denniston
@ 2004-04-20 14:46   ` Todd Denniston
  2004-04-21  1:17   ` Jim Carter
  1 sibling, 0 replies; 15+ messages in thread
From: Todd Denniston @ 2004-04-20 14:46 UTC (permalink / raw)
  To: Venkata Ravella, autofs

Todd Denniston wrote:
> 
> question,
> Is the file system mounted with the 'soft' option?
> i.e. on the systems that are causing problems try
> if mount | grep -i soft >>/dev/null 2>&1
> then
>   echo "we have soft mounts"
> else
>   echo "good, only normal mounts"
> fi
> 
> We had a problem that caused me headaches for 6 months to track down... One of
> the other admins had chosen to mount all the file systems with the soft option
> and propagated this to all machines he could, that is to any I did not
> control, and then people using his config started asking me why they were
> getting IO errors transferring files to/from the file server I maintained.  if
> the file was bigger than would fit in the normal [wr]size, which defaults to
> 1024 bytes {or 4096 dependent on which kernel version I believe} the
> probability of an IO error during normal operations went from 0 towards
> certainty by the time the file was 650 MBytes, generally would happen by
> ~100MBytes.
> 
> My server was a sun ultra 2 running solaris 2.6, the clients were Linux
> running 2.[02].X and a mix of autofs-3 and autofs-4 (which ever was installed
> with the distros, RH6-9 & Slack7-9.1).
<SNIP>

Typo: they were actually 2.[24].X systems, with maybe one 2.0.?? system hidden
amongst the bunch.

-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane) 
Harnessing the Power of Technology for the Warfighter


* Re: Re: Automount/NFS issues causing executables to appear corrupted
  2004-04-20 14:35 ` Todd Denniston
  2004-04-20 14:46   ` Todd Denniston
@ 2004-04-21  1:17   ` Jim Carter
  2004-04-21 14:03     ` Re: Automount/NFS issues causing executables to appearcorrupted Todd Denniston
  1 sibling, 1 reply; 15+ messages in thread
From: Jim Carter @ 2004-04-21  1:17 UTC (permalink / raw)
  To: Todd Denniston; +Cc: autofs

Sorry to continue a non-automount issue, but this is where it was posted...

On Tue, 20 Apr 2004, Todd Denniston wrote:

> question,
> Is the file system mounted with the 'soft' option?
> i.e. on the systems that are causing problems try
> mount | grep -i soft

> We had a problem that caused me headaches for 6 months to track down...

> ...probability of an IO error during normal operations went from 0 towards
> certainty by the time the file was 650 MBytes, generally would happen by
> ~100MBytes.
>
> My server was a sun ultra 2 running solaris 2.6, the clients were Linux
> running 2.[02].X and a mix of autofs-3 and autofs-4 (which ever was installed
> with the distros, RH6-9 & Slack7-9.1).

We have Solaris 2.6, Solaris 8 (not tested), SuSE 8.2 (kernel 2.4.20) and
SuSE 9.0 (kernel 2.4.21, not tested).  I just ran some tests as follows:
Write one file of 1.3 GB into the partner's NFS-exported filesystem.  Read
it back comparing bit-for-bit.  Delete the NFS file.  This was tried twice
with a Solaris 2.6 partner and twice with a Linux (2.4.20) partner.  The
local machine has Linux (2.4.20).  Both partners were on a different
subnet, but traffic was light and dropped UDP packets probably were very
few.  All NFS mounts were courtesy of the automounter.  All were soft,
specifically: -rsize=8192,wsize=8192,retry=1,soft.
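
A write, read-back and compare test of this kind can be sketched as below.
This is an illustration only, not the exact procedure used here: the function
name nfs_roundtrip, the temporary file naming and the 16 MB default size are
all assumptions.

```shell
#!/bin/sh
# nfs_roundtrip DIR [SIZE_MB]
# Write SIZE_MB megabytes of random data into DIR, read the file back,
# compare it byte for byte with a local reference copy, then delete both.
# Returns 0 if the copy is identical, non-zero otherwise.
nfs_roundtrip() {
    dir=$1
    size_mb=${2:-16}
    ref=$(mktemp) || return 2
    target="$dir/roundtrip.$$"
    status=0

    # Build the local reference file.
    dd if=/dev/urandom of="$ref" bs=1048576 count="$size_mb" 2>/dev/null || status=2
    if [ "$status" -eq 0 ]; then
        # cp reports write and close failures such as "Input/output error".
        cp "$ref" "$target" || status=1
    fi
    if [ "$status" -eq 0 ]; then
        # cmp exits non-zero at the first differing byte.
        cmp "$ref" "$target" || status=1
    fi

    rm -f "$ref" "$target"
    return "$status"
}

# Hypothetical usage against an automounted directory:
#   nfs_roundtrip /net/partner/export 1300 && echo "round trip OK"
```

Note that the read-back may be served from the client's own cache, so a
passing run does not by itself show that the data reached the server intact.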

There were no errors whatsoever.  Execution times were identical on
repeat trials (meaning no erratic network timeouts).  At Mathnet,
historically we do not see any of the described symptoms.

I wonder what's going on at your end.  If it's going to jump up and bite us
in the future...

James F. Carter          Voice 310 825 2897    FAX 310 206 6673
UCLA-Mathnet;  6115 MSA; 405 Hilgard Ave.; Los Angeles, CA, USA 90095-1555
Email: jimc@math.ucla.edu  http://www.math.ucla.edu/~jimc (q.v. for PGP key)


* Re: Re: Automount/NFS issues causing executables to  appearcorrupted
  2004-04-21  1:17   ` Jim Carter
@ 2004-04-21 14:03     ` Todd Denniston
  2004-04-21 17:47       ` Jim Carter
  0 siblings, 1 reply; 15+ messages in thread
From: Todd Denniston @ 2004-04-21 14:03 UTC (permalink / raw)
  To: Jim Carter; +Cc: autofs

Jim Carter wrote:
> 
> Sorry to continue a non-automount issue, but this is where it was posted...
This is the only NFS related list I am subscribed to.

> 
> On Tue, 20 Apr 2004, Todd Denniston wrote:
> 
> > question,
> > Is the file system mounted with the 'soft' option?
> > i.e. on the systems that are causing problems try
> > mount | grep -i soft
> 
> > We had a problem that caused me headaches for 6 months to track down...
> 
> > ...probability of an IO error during normal operations went from 0 towards
> > certainty by the time the file was 650 MBytes, generally would happen by
> > ~100MBytes.
> >
> > My server was a sun ultra 2 running solaris 2.6, the clients were Linux
> > running 2.[02].X and a mix of autofs-3 and autofs-4 (which ever was installed
> > with the distros, RH6-9 & Slack7-9.1).
> 
> We have Solaris 2.6, Solaris 8 (not tested), SuSE 8.2 (kernel 2.4.20) and
> SuSE 9.0 (kernel 2.4.21, not tested).  I just ran some tests as follows:
> Write one file of 1.3 Gb into the partner's NFS-exported filesystem.  Read
> it back comparing bit-for-bit.  Delete the NFS file.  This was tried twice
> with a Solaris 2.6 partner and twice with a Linux (2.4.20) partner.  The
> local machine has Linux (2.4.20).  Both partners were on a different
> subnet, but traffic was light and dropped UDP packets probably were very
> few.  
There are at least two differences:
1) You have light network traffic. At times we have a couple of video streams
going across our 100Mb net, and 50 users who have a bad habit of keeping
their Netscape caches on the network drives.
2) Our server has a Veritas-controlled 64-disk software RAID set which seems
to eat kernel time. NFS seems to use a lot of kernel time too, so there are
probably more dropped UDP packets.
3) Solaris NFS server -> Linux clients. I have heard that in olden days the
NFS servers and clients of different OSs handled things differently from one
another and this caused some lossage, which is probably more apparent in error
conditions like dropped UDP packets.
4) Oh, and all these disks were on Fibre Channel from back when Dot Hill was
Box Hill (97-98 time frame). Further investigation (when I did it a
long time ago) showed that when Linux was using fibre cards that reported the
same make+model+version as ours, they had to do several things to keep the
cards running right. It seems the cards did not quite work right, and we never
got updated drivers from Box Hill before our support contracts ran out (which
was before I took over the machine).

> All NFS mounts were courtesy of the automounter.  All were soft,
> specifically: -rsize=8192,wsize=8192,retry=1,soft.
> 
> There were no errors whatsoever.  Execution times were identical on
> repeat trials (meaning no erratic network timeouts).  At Mathnet,
> historically we do not see any of the described symptoms.

For me, copy times would vary by something on the order of 5 to 10 minutes
for a 650MB file.

> 
> I wonder what's going on at your end.  If it's going to jump up and bite us
> in the future...
> 
As our user load increased from 25 to 50, so did the frequency of IO errors.

From the Linux `man nfs`:
       soft           If an NFS file operation has a major timeout then report
                      an I/O error to the calling program.  The default is  to
                      continue retrying NFS file operations indefinitely.

       hard           If an NFS file operation has a major timeout then report
                      "server not responding"  on  the  console  and  continue
                      retrying indefinitely.  This is the default.

-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane) 
Harnessing the Power of Technology for the Warfighter


* Re: Re: Automount/NFS issues causing executables to  appearcorrupted
  2004-04-21 14:03     ` Re: Automount/NFS issues causing executables to appearcorrupted Todd Denniston
@ 2004-04-21 17:47       ` Jim Carter
  0 siblings, 0 replies; 15+ messages in thread
From: Jim Carter @ 2004-04-21 17:47 UTC (permalink / raw)
  To: Todd Denniston; +Cc: autofs

On Wed, 21 Apr 2004, Todd Denniston wrote:
> There are at least two differences
> 1) you have light network traffic, at times we have a couple of video streams
> going across our 100Mb net, and 50 users that have a bad habit of keeping
> there netscape caches on the network drives.

Thanks very much for your reply.  Your configuration is indeed very 
different from ours, and it's entirely believable that we can be reliable 
with soft mounts while you cannot.  I've reassured my management.  End 
panic mode :-)

James F. Carter          Voice 310 825 2897    FAX 310 206 6673
UCLA-Mathnet;  6115 MSA; 405 Hilgard Ave.; Los Angeles, CA, USA  90095-1555
Email: jimc@math.ucla.edu    http://www.math.ucla.edu/~jimc (q.v. for PGP key)
