public inbox for linux-kernel@vger.kernel.org
* 2.4.8 NFS Problems
@ 2001-09-05 11:56 Mike Black
  2001-09-07 11:49 ` Trond Myklebust
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Black @ 2001-09-05 11:56 UTC
  To: linux-kernel

I've been getting random NFS EIO errors for a few months but now it's
repeatable.
Trying to copy a large file from one 2.4.8 SMP box to another is
consistently failing (at different offsets each time).
This doesn't appear to be a network problem, as the last exchange between
the machines looks OK.
Judging by the timestamps, a read() appears to be taking too long and
causing a timeout?
I dropped the rsize and wsize on the mount from 8192 to 4096 and this solved
the problem (I had reproduced the failure at least a dozen times before
doing this).
With the 4096 wsize there was one 5-second read delay and three of approx.
2 seconds each.
So... it appears wsize 8192 was causing a timeout of some sort?

Here's a tail of the strace of the "cp" process with relative time stamps:
     0.000103 read(3, "\307\3173-\226k\252-/]VU\261o\227x)\211c\362\370ZR\340"..., 8192) = 8192
     0.000099 write(4, "\307\3173-\226k\252-/]VU\261o\227x)\211c\362\370ZR\340"..., 8192) = 8192
     0.000102 read(3, "bh\0\31]U\"\307Eh\302Qp\324\313\345i\350\17\261\330\376"..., 8192) = 8192
     0.000100 write(4, "bh\0\31]U\"\307Eh\302Qp\324\313\345i\350\17\261\330\376"..., 8192) = 8192
     0.000104 read(3, ",M\322\236h \335\34e;L\275\221\326e\324\306y\200\310uD"..., 8192) = 8192
     0.000100 write(4, ",M\322\236h \335\34e;L\275\221\326e\324\306y\200\310uD"..., 8192) = 8192
     0.000233 read(3, "\315\240)\324~\315\373gJ}\272\263~\200\306\374i\215\246"..., 8192) = 8192
     0.000100 write(4, "\315\240)\324~\315\373gJ}\272\263~\200\306\374i\215\246"..., 8192) = 8192
     0.000110 read(3, "\222\362\357\315\3072\352\367\316\304\376wL\304.\346\375"..., 8192) = 8192
     0.000099 write(4, "\222\362\357\315\3072\352\367\316\304\376wL\304.\346\375"..., 8192) = 8192
    10.535725 read(3, "\3371f}g\314\372w\207A\v\253q\353\371S\23?\221\2752D\360"..., 8192) = 8192
     0.000182 write(4, "\3371f}g\314\372w\207A\v\253q\353\371S\23?\221\2752D\360"..., 8192) = -1 EIO (Input/output error)
     0.000155 write(2, "cp: ", 4cp: )       = 4
     0.000046 write(2, "/picard/tmp/glibc.tgz", 21/picard/tmp/glibc.tgz) = 21
     0.000077 write(2, ": Input/output error", 20: Input/output error) = 20
     0.000054 write(2, "\n", 1
)         = 1
     0.000041 close(4)                  = 0
     0.001030 close(3)                  = 0
     0.000087 _exit(1)                  = ?
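
A rough way to pick the stall out of a long trace like the one above is to
scan the relative timestamps for large values. The sketch below is only an
illustration: the function name, the 1-second threshold and the shortened
sample lines are made up for the example. It also assumes the trace was
taken with "strace -r", where each stamp is the time since the previous
call began, so a large value may really belong to the call before the line
it is printed on.

```python
import re

# Matches a line that starts a syscall record: "<relative stamp> name(".
# Wrapped continuation lines do not match and are skipped.
STAMP = re.compile(r"^\s*(\d+\.\d+)\s+(\w+)\(")

def slow_gaps(lines, threshold=1.0):
    """Return (gap_seconds, previous_call, current_call) for gaps above threshold."""
    gaps = []
    prev = None
    for line in lines:
        m = STAMP.match(line)
        if not m:
            continue
        delta, call = float(m.group(1)), m.group(2)
        if delta > threshold:
            gaps.append((delta, prev, call))
        prev = call
    return gaps

# Sample lines abbreviated from the trace above (data strings shortened).
sample = [
    r'     0.000110 read(3, "\222\362"..., 8192) = 8192',
    r'     0.000099 write(4, "\222\362"..., 8192) = 8192',
    r'    10.535725 read(3, "\3371f"..., 8192) = 8192',
    r'     0.000182 write(4, "\3371f"..., 8192) = -1',
    'EIO (Input/output error)',
]

print(slow_gaps(sample))
```

If the "strace -r" assumption holds, the 10.5 second gap sits between the
start of the last successful write() and the start of the following read(),
which would point at the write rather than the read.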

And here's the tail of the network traffic:
07:01:57.048590 yeti.csihq.com.652632144 > picard.csihq.com.nfs: 1472 write [|nfs] (frag 28944:1480@0+)
07:01:57.048720 yeti.csihq.com > picard.csihq.com: (frag 28944:1480@1480+)
07:01:57.048841 yeti.csihq.com > picard.csihq.com: (frag 28944:1480@2960+)
07:01:57.048963 yeti.csihq.com > picard.csihq.com: (frag 28944:1480@4440+)
07:01:57.049090 yeti.csihq.com > picard.csihq.com: (frag 28944:1480@5920+)
07:01:57.049159 yeti.csihq.com > picard.csihq.com: (frag 28944:916@7400)
07:01:57.049520 picard.csihq.com.nfs > yeti.csihq.com.652632144: reply ok 136 write [|nfs] (DF)
07:02:01.910476 arp who-has picard.csihq.com tell yeti.csihq.com
07:02:01.910526 arp reply picard.csihq.com is-at 0:e0:29:2a:db:e9
07:02:07.480364 yeti.csihq.com.669409360 > picard.csihq.com.nfs: 108 commit [|nfs] (DF)
07:02:07.480568 picard.csihq.com.nfs > yeti.csihq.com.669409360: reply ok 128 commit (DF)
07:02:07.481323 yeti.csihq.com.686186576 > picard.csihq.com.nfs: 1472 write [|nfs] (frag 28948:1480@0+)
07:02:07.481446 yeti.csihq.com > picard.csihq.com: (frag 28948:1480@1480+)
07:02:07.481569 yeti.csihq.com > picard.csihq.com: (frag 28948:1480@2960+)
07:02:07.481692 yeti.csihq.com > picard.csihq.com: (frag 28948:1480@4440+)
07:02:07.481814 yeti.csihq.com > picard.csihq.com: (frag 28948:1480@5920+)
07:02:07.481886 yeti.csihq.com > picard.csihq.com: (frag 28948:916@7400)
07:02:07.482321 picard.csihq.com.nfs > yeti.csihq.com.686186576: reply ok 136 write [|nfs] (DF)
07:02:07.482511 yeti.csihq.com.702963792 > picard.csihq.com.nfs: 108 commit [|nfs] (DF)
07:02:07.482642 picard.csihq.com.nfs > yeti.csihq.com.702963792: reply ok 128 commit (DF)
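
The fragment sizes in the trace above can be reproduced with a little
arithmetic, which also shows why a smaller wsize puts fewer fragments on the
wire per write. This is only a sketch: the function name is made up, a
1500-byte Ethernet MTU and a 20-byte IP header without options are assumed,
and the per-write overhead of 124 bytes (UDP, RPC and NFS headers) is read
off this one trace and assumed to be the same for a 4096-byte write.

```python
def ip_fragments(ip_payload_len, mtu=1500, ip_header_len=20):
    """Split an IP payload into (size, offset) fragments for the given MTU."""
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_frag = (mtu - ip_header_len) // 8 * 8
    frags = []
    offset = 0
    while offset < ip_payload_len:
        size = min(per_frag, ip_payload_len - offset)
        frags.append((size, offset))
        offset += size
    return frags

# Taken from the trace: 5 fragments of 1480 bytes plus one of 916 bytes.
WRITE_8K_PAYLOAD = 5 * 1480 + 916       # 8316 bytes
OVERHEAD = WRITE_8K_PAYLOAD - 8192      # 124 bytes of headers (assumed constant)

print(ip_fragments(WRITE_8K_PAYLOAD))         # six fragments, as in the trace
print(len(ip_fragments(4096 + OVERHEAD)))     # fragments per write with wsize=4096
```

Under these assumptions an 8192-byte write needs six fragments and a
4096-byte write needs three. A UDP datagram is only delivered if every
fragment arrives, so fewer fragments per request means fewer chances of
losing the whole request and having to wait for an RPC retransmit.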


________________________________________
Michael D. Black   Principal Engineer
mblack@csihq.com  321-676-2923,x203
http://www.csihq.com  Computer Science Innovations
http://www.csihq.com/~mike  My home page
FAX 321-676-2355


* Re: 2.4.8 NFS Problems
@ 2001-12-20  9:44 Steffen Persvold
  2001-12-20 11:10 ` Trond Myklebust
  0 siblings, 1 reply; 14+ messages in thread
From: Steffen Persvold @ 2001-12-20  9:44 UTC
  To: lkml, nfs list, Neil Brown, Trond Myklebust

Hi guys,

I was searching Google for reports of the problem I'm seeing with our NFS server/clients and
found this thread. It looks somewhat similar (at least the result, the EIO, is the same).

Parts of old message :

>From: Mike Black (mblack@csihq.com)
>Date: Sep 05 2001 

>I've been getting random NFS EIO errors for a few months but
>now it's repeatable. 
>Trying to copy a large file from one 2.4.8 SMP box to another
>is consistently failing (at different offsets each time). 



Our setup is like this :

Server:
	RedHat 7.2 - kernel 2.4.9-13smp
        nfs-utils-0.3.1-13.7.2.1
	ext3 filesystem (73GB)


Clients:
	ia32 client - RedHat 6.2 - kernel 2.2.19-6.2.7enterprise
	mount-2.10r-0.6.x


	alpha client - RedHat 6.2 - kernel 2.2.19 (vanilla)
	mount-2.10r-5


	ia64 client - RedHat 7.1 - kernel 2.4.3-12smp
	mount-2.10r-5



I've seen the "Input/Output error" problem only on the Alpha and the IA64 clients, and the problem
occurs when making a static library (with 'ar'). The message looks like this :

ar: xxxxxx/libmpi.a: Input/output error


The mountpoints are mounted like this :

ia32 client:
huey:/export/home/mpitest /home/mpitest nfs rw,v3,rsize=8192,wsize=8192,addr=huey 0 0

alpha client:
huey:/export/home/mpitest /home/mpitest nfs rw,v3,rsize=8192,wsize=8192,addr=huey 0 0

ia64 client:
huey:/export/home/mpitest /home/mpitest nfs rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=huey 0 0
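
To make the difference between the entries above explicit, the option
fields can be compared as sets. A minimal sketch (the helper name is made
up for the example; the two lines are copied from the listings above, and
the ia32 and alpha entries are identical):

```python
def mount_options(proc_mounts_line):
    """Return the option set from one /proc/mounts line (4th whitespace-separated field)."""
    fields = proc_mounts_line.split()
    return set(fields[3].split(","))

ia32 = ("huey:/export/home/mpitest /home/mpitest nfs "
        "rw,v3,rsize=8192,wsize=8192,addr=huey 0 0")
ia64 = ("huey:/export/home/mpitest /home/mpitest nfs "
        "rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=huey 0 0")

# Options reported on the 2.4 client but not on the 2.2 clients.
only_on_ia64 = sorted(mount_options(ia64) - mount_options(ia32))
print(only_on_ia64)
```

This only compares what each kernel reports in /proc/mounts. It does not
show which options are actually in effect on the 2.2 clients.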


I don't know why the "hard" and "lock" options don't appear on ia32 and alpha, but this might be
related to the /proc/mounts interface of the running kernel (these clients are running 2.2.19 while
the ia64 client is running 2.4). The automount entry looks like this :

/home           auto_home       rsize=8192,wsize=8192

So according to the nfs man page the "hard" option should be the default :

       hard           If an NFS file operation has a major timeout then report "server not
                      responding" on the console and continue retrying indefinitely.  This
                      is the default.


So what could be the problem here ? Is it an NFS server bug, an NFS client bug or an NFS/ext3 bug ? We
used to run RedHat 7.0 on this server with the 2.2.19-enterprise kernel, nfs-utils-0.3.1-7 and
an ext2 filesystem. This problem did not occur back then.

Thanks,
-- 
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best   
 mailto:sp@scali.no  |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.12.2 -         
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >300MBytes/s and <4uS latency


end of thread, other threads:[~2001-12-20 20:28 UTC | newest]

Thread overview: 14+ messages
2001-09-05 11:56 2.4.8 NFS Problems Mike Black
2001-09-07 11:49 ` Trond Myklebust
2001-09-07 12:05   ` Peter T. Breuer
2001-09-07 12:27     ` Trond Myklebust
2001-09-07 12:36     ` Trond Myklebust
2001-09-07 13:13   ` Mike Black
2001-09-07 14:42     ` Trond Myklebust
2001-09-07 15:46       ` Mike Black
2001-09-08 10:53         ` Trond Myklebust
2001-09-08 11:53           ` Mike Black
  -- strict thread matches above, loose matches on Subject: below --
2001-12-20  9:44 Steffen Persvold
2001-12-20 11:10 ` Trond Myklebust
2001-12-20 14:40   ` Steffen Persvold
2001-12-20 20:27     ` Trond Myklebust
