2.4.8 NFS Problems

public inbox for linux-kernel@vger.kernel.org

* 2.4.8 NFS Problems
@ 2001-09-05 11:56 Mike Black
  2001-09-07 11:49 ` Trond Myklebust
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Black @ 2001-09-05 11:56 UTC (permalink / raw)
  To: linux-kernel

I've been getting random NFS EIO errors for a few months but now it's
repeatable.
Trying to copy a large file from one 2.4.8 SMP box to another is
consistently failing (at different offsets each time).
This doesn't appear to be a network problem as the last comm between the
machines looks OK.
By the timestamps it appears that a read() is taking too long and causing a
timeout?
I dropped the rsize and wsize on the mount from 8192 to 4096 and this solved
the problem (I had reproduced it at least a dozen times before doing
this).
With the 4096 wsize there was one 5 second read delay and 3 at approx 2
seconds each.
So...it appears wsize 8192 was causing a timeout of some sort?

Here's a tail of the strace of the "cp" process with relative time stamps:
     0.000103 read(3, "\307\3173-\226k\252-/]VU\261o\227x)\211c\362\370ZR\340"..., 8192) = 8192
     0.000099 write(4, "\307\3173-\226k\252-/]VU\261o\227x)\211c\362\370ZR\340"..., 8192) = 8192
     0.000102 read(3, "bh\0\31]U\"\307Eh\302Qp\324\313\345i\350\17\261\330\376"..., 8192) = 8192
     0.000100 write(4, "bh\0\31]U\"\307Eh\302Qp\324\313\345i\350\17\261\330\376"..., 8192) = 8192
     0.000104 read(3, ",M\322\236h \335\34e;L\275\221\326e\324\306y\200\310uD"..., 8192) = 8192
     0.000100 write(4, ",M\322\236h \335\34e;L\275\221\326e\324\306y\200\310uD"..., 8192) = 8192
     0.000233 read(3, "\315\240)\324~\315\373gJ}\272\263~\200\306\374i\215\246"..., 8192) = 8192
     0.000100 write(4, "\315\240)\324~\315\373gJ}\272\263~\200\306\374i\215\246"..., 8192) = 8192
     0.000110 read(3, "\222\362\357\315\3072\352\367\316\304\376wL\304.\346\375"..., 8192) = 8192
     0.000099 write(4, "\222\362\357\315\3072\352\367\316\304\376wL\304.\346\375"..., 8192) = 8192
    10.535725 read(3, "\3371f}g\314\372w\207A\v\253q\353\371S\23?\221\2752D\360"..., 8192) = 8192
     0.000182 write(4, "\3371f}g\314\372w\207A\v\253q\353\371S\23?\221\2752D\360"..., 8192) = -1 EIO (Input/output error)
     0.000155 write(2, "cp: ", 4cp: )       = 4
     0.000046 write(2, "/picard/tmp/glibc.tgz", 21/picard/tmp/glibc.tgz) = 21
     0.000077 write(2, ": Input/output error", 20: Input/output error) = 20
     0.000054 write(2, "\n", 1
)         = 1
     0.000041 close(4)                  = 0
     0.001030 close(3)                  = 0
     0.000087 _exit(1)                  = ?
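
A trace like the one above can be scanned for stalls mechanically. The following is a minimal sketch, not a tool used in the thread: the function name `slow_gaps`, the regular expression and the shortened sample lines are all made up for illustration. One caveat, assuming the trace was taken with `strace -r`: that option prints the time elapsed since the previous system call was entered, so a large number on a read() line may actually have been spent in the write() just before it.

```python
import re

# Matches lines such as "     0.000103 read(3, ...) = 8192"
LINE_RE = re.compile(r"^\s*(\d+\.\d+)\s+(\w+)\(")


def slow_gaps(lines, threshold=1.0):
    """Return (line_index, delta_seconds, syscall) for every delta above threshold."""
    hits = []
    for i, line in enumerate(lines):
        m = LINE_RE.match(line)
        if m and float(m.group(1)) > threshold:
            hits.append((i, float(m.group(1)), m.group(2)))
    return hits


# Shortened stand-ins for the last few trace lines (buffers elided).
sample = [
    '     0.000110 read(3, "..."..., 8192) = 8192',
    '     0.000099 write(4, "..."..., 8192) = 8192',
    '    10.535725 read(3, "..."..., 8192) = 8192',
    '     0.000182 write(4, "..."..., 8192) = -1 EIO (Input/output error)',
]

print(slow_gaps(sample))
```

Only the line carrying the 10.5 second delta is reported; lines that do not start with a timestamp (for example wrapped continuation lines) are skipped.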

And here's the tail of the network traffic:
07:01:57.048590 yeti.csihq.com.652632144 > picard.csihq.com.nfs: 1472 write [|nfs] (frag 28944:1480@0+)
07:01:57.048720 yeti.csihq.com > picard.csihq.com: (frag 28944:1480@1480+)
07:01:57.048841 yeti.csihq.com > picard.csihq.com: (frag 28944:1480@2960+)
07:01:57.048963 yeti.csihq.com > picard.csihq.com: (frag 28944:1480@4440+)
07:01:57.049090 yeti.csihq.com > picard.csihq.com: (frag 28944:1480@5920+)
07:01:57.049159 yeti.csihq.com > picard.csihq.com: (frag 28944:916@7400)
07:01:57.049520 picard.csihq.com.nfs > yeti.csihq.com.652632144: reply ok 136 write [|nfs] (DF)
07:02:01.910476 arp who-has picard.csihq.com tell yeti.csihq.com
07:02:01.910526 arp reply picard.csihq.com is-at 0:e0:29:2a:db:e9
07:02:07.480364 yeti.csihq.com.669409360 > picard.csihq.com.nfs: 108 commit [|nfs] (DF)
07:02:07.480568 picard.csihq.com.nfs > yeti.csihq.com.669409360: reply ok 128 commit (DF)
07:02:07.481323 yeti.csihq.com.686186576 > picard.csihq.com.nfs: 1472 write [|nfs] (frag 28948:1480@0+)
07:02:07.481446 yeti.csihq.com > picard.csihq.com: (frag 28948:1480@1480+)
07:02:07.481569 yeti.csihq.com > picard.csihq.com: (frag 28948:1480@2960+)
07:02:07.481692 yeti.csihq.com > picard.csihq.com: (frag 28948:1480@4440+)
07:02:07.481814 yeti.csihq.com > picard.csihq.com: (frag 28948:1480@5920+)
07:02:07.481886 yeti.csihq.com > picard.csihq.com: (frag 28948:916@7400)
07:02:07.482321 picard.csihq.com.nfs > yeti.csihq.com.686186576: reply ok 136 write [|nfs] (DF)
07:02:07.482511 yeti.csihq.com.702963792 > picard.csihq.com.nfs: 108 commit [|nfs] (DF)
07:02:07.482642 picard.csihq.com.nfs > yeti.csihq.com.702963792: reply ok 128 commit (DF)


________________________________________
Michael D. Black   Principal Engineer
mblack@csihq.com  321-676-2923,x203
http://www.csihq.com  Computer Science Innovations
http://www.csihq.com/~mike  My home page
FAX 321-676-2355



* Re: 2.4.8 NFS Problems
  2001-09-05 11:56 2.4.8 NFS Problems Mike Black
@ 2001-09-07 11:49 ` Trond Myklebust
  2001-09-07 12:05   ` Peter T. Breuer
  2001-09-07 13:13   ` Mike Black
  0 siblings, 2 replies; 14+ messages in thread
From: Trond Myklebust @ 2001-09-07 11:49 UTC (permalink / raw)
  To: Mike Black; +Cc: linux-kernel

>>>>> " " == Mike Black <mblack@csihq.com> writes:

     > I've been getting random NFS EIO errors for a few months but
     > now it's repeatable.  Trying to copy a large file from one
     > 2.4.8 SMP box to another is consistently failing (at different
     > offsets each time).  This doesn't appear to be a network
     > problem as the last comm between the machines looks OK.  By the
     > timestamps it appears that a read() is taking too long and
     > causing a timeout?

Moral: Don't use soft mounts: they are prone to these things. If you
insist on using them, then try playing around with the `timeo' and
`retrans' mount variables.

Soft mount timeouts are not only due to network problems, but can
equally well be due to internal congestion. The rate at which the
network can transmit requests is usually (unless you are using
Gigabit) way below the rate at which your machine can generate them.

Cheers,
   Trond


* Re: 2.4.8 NFS Problems
  2001-09-07 11:49 ` Trond Myklebust
@ 2001-09-07 12:05   ` Peter T. Breuer
  2001-09-07 12:27     ` Trond Myklebust
  2001-09-07 12:36     ` Trond Myklebust
  2001-09-07 13:13   ` Mike Black
  1 sibling, 2 replies; 14+ messages in thread
From: Peter T. Breuer @ 2001-09-07 12:05 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Mike Black, linux-kernel

"A month of sundays ago Trond Myklebust wrote:"
> >>>>> " " == Mike Black <mblack@csihq.com> writes:
> 
>      > I've been getting random NFS EIO errors for a few months but
>      > now it's repeatable.  Trying to copy a large file from one
>      > 2.4.8 SMP box to another is consistently failing (at different
>      > offsets each time).  This doesn't appear to be a network
>      > problem as the last comm between the machines looks OK.  By the
>      > timestamps it appears that a read() is taking too long and
>      > causing a timeout?
> 
> Moral: Don't use soft mounts: they are prone to these things. If you
> insist on using them, then try playing around with the `timeo' and

Unless you like having all your clients hang when the server happens to
be rebooted, and like having to go round hunting for them in dark
recesses in order to try and fool them into unmounting and remounting,
I'd recommend soft mounts every time!

> `retrans' mount variables.

It would be nice if nfs could do a remount automatically when the
nfs handle it has goes stale and it discovers it.  Is that part of v3
nfs or not?

> Soft mount timeouts are not only due to network problems, but can
> equally well be due to internal congestion. The rate at which the
> network can transmit requests is usually (unless you are using
> Gigabit) way below the rate at which your machine can generate them.

But soft mounts at least break nicely and automatically.  And since
failures are inevitable, I prefer them.

Come to think of it, why not have an option that does a hard,intr but
sends a ^C automatically to all referents when a stale handle is detected.

Peter


* Re: 2.4.8 NFS Problems
  2001-09-07 12:05   ` Peter T. Breuer
@ 2001-09-07 12:27     ` Trond Myklebust
  2001-09-07 12:36     ` Trond Myklebust
  1 sibling, 0 replies; 14+ messages in thread
From: Trond Myklebust @ 2001-09-07 12:27 UTC (permalink / raw)
  To: ptb; +Cc: Mike Black, linux-kernel

>>>>> " " == Peter T Breuer <ptb@it.uc3m.es> writes:

     > It would be nice if nfs could do a remount automatically
     > when the nfs handle it has goes stale and it discovers it.  Is
     > that part of v3 nfs or not?

The exact wording in RFC1813 is

   NFS3ERR_STALE
       Invalid file handle. The file handle given in the
       arguments was invalid. The file referred to by that file
       handle no longer exists or access to it has been
       revoked.

The problem is with the 'access to it has been revoked'. It says
nothing about whether or not that is permanent, hence you have to be
very careful about applying this concept to the mount point.

In any case, remounting automatically is a very bad idea unless you do
it cleanly (i.e. kill all existing processes, unmount the disk, and
then start afresh). If you just do it transparently and don't clean
out the (d|i)caches, you will see some pretty odd things happening if
filehandles on the new disk don't match the filehandles on the old
disk.
This is BTW the reason why unfsd is badly broken wrt. CDROMS.

     > But soft mounts at least break nicely and automatically.  And
     > since failures are inevitable, I prefer them.

     > Come to think of it, why not have an option that does a
     > hard,intr but sends a ^C automatically to all referents when a
     > stale handle is detected.

See above.

Cheers,
   Trond


* Re: 2.4.8 NFS Problems
  2001-09-07 12:05   ` Peter T. Breuer
  2001-09-07 12:27     ` Trond Myklebust
@ 2001-09-07 12:36     ` Trond Myklebust
  1 sibling, 0 replies; 14+ messages in thread
From: Trond Myklebust @ 2001-09-07 12:36 UTC (permalink / raw)
  To: ptb; +Cc: Mike Black, linux-kernel

>>>>> " " == Peter T Breuer <ptb@it.uc3m.es> writes:

    >> Soft mount timeouts are not only due to network problems, but
    >> can equally well be due to internal congestion. The rate at
    >> which the network can transmit requests is usually (unless you
    >> are using Gigabit) way below the rate at which your machine can
    >> generate them.

     > But soft mounts at least break nicely and automatically.  And
     > since failures are inevitable, I prefer them.

The problem is that they need careful tuning if they are to work at
all. They assume a perfect setup.

For instance most servers will drop UDP requests if they don't have a
free thread to serve them. They assume that you will automatically
retry. soft mounts do retry, but give up eventually. IOW even on an
otherwise working setup you will, every once in a blue moon, get an
EIO due to a soft timeout and you will lose data.
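
The difference described above can be sketched as a small simulation. This is an illustration only, not the kernel's RPC code: the function name `send_with_retries`, the scripted list of replies, and the rule that a soft mount gives up after `retrans` transmissions are assumptions made for the example.

```python
import errno


def send_with_retries(replies, retrans, soft=True):
    """Simulate one RPC against a server that may drop requests.

    replies: iterable of booleans, one per transmission attempt; True means
             the server answered, False means the request was dropped.
    retrans: number of transmissions a soft mount makes before giving up
             (an assumption of this sketch).
    Returns the number of transmissions used. Raises OSError(EIO) when a
    soft mount runs out of retries; a hard mount just keeps trying.
    """
    attempts = 0
    for answered in replies:
        attempts += 1
        if answered:
            return attempts
        if soft and attempts >= retrans:
            raise OSError(errno.EIO, "soft mount gave up")
    raise RuntimeError("simulation ran out of scripted replies")


# A busy server that drops four requests in a row, then answers.
busy_server = [False] * 4 + [True]

print(send_with_retries(busy_server, retrans=3, soft=False))  # hard mount
print(send_with_retries(busy_server, retrans=5, soft=True))   # soft, more retries
```

With the same scripted server, `send_with_retries(busy_server, retrans=3, soft=True)` raises `OSError` with `errno.EIO`: the server was never down, it was only too busy for longer than the soft mount was willing to wait.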

Cheers,
  Trond


* Re: 2.4.8 NFS Problems
  2001-09-07 11:49 ` Trond Myklebust
  2001-09-07 12:05   ` Peter T. Breuer
@ 2001-09-07 13:13   ` Mike Black
  2001-09-07 14:42     ` Trond Myklebust
  1 sibling, 1 reply; 14+ messages in thread
From: Mike Black @ 2001-09-07 13:13 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-kernel

But my timeouts were only 10 seconds -- well below the timeo and retrans
timeout periods.
And my network traffic shows that this is the client causing the problem NOT
the server.
It's the read() that pauses for 10 seconds and then the NFS write
immediately returns EIO.
So...I don't think soft mounts have anything to do with it.
Also...I've now seen this error once more even with the 4096 read/write
sizes.

* Re: 2.4.8 NFS Problems
  2001-09-07 13:13   ` Mike Black
@ 2001-09-07 14:42     ` Trond Myklebust
  2001-09-07 15:46       ` Mike Black
  0 siblings, 1 reply; 14+ messages in thread
From: Trond Myklebust @ 2001-09-07 14:42 UTC (permalink / raw)
  To: Mike Black; +Cc: linux-kernel

>>>>> " " == Mike Black <mblack@csihq.com> writes:

     > But my timeouts were only 10 seconds -- well below the timeo
     > and retrans timeout periods.  And my network traffic shows that

According to the 'nfs' manpage, the default timeo on the mount in
util-linux is usually 0.7 seconds. retrans is 3.

  0.7 + 1.4 + 2.8 = 4.9 seconds < 10...
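
The sum above can be reproduced with a short calculation. This is a sketch under stated assumptions, not the client's actual timer code: the function name is made up, the request is assumed to be transmitted `retrans` times in total with the timeout doubling after each miss (which is what the sum implies), and the 60 second cap is an assumed upper bound that does not come into play for these values. Times are kept in tenths of a second, the unit `timeo` is given in.

```python
def soft_mount_give_up_time(timeo_ds, retrans, max_timeout_ds=600):
    """Return the total wait, in tenths of a second, before a soft mount gives up.

    Assumptions of this sketch: `retrans` transmissions in total, the
    timeout doubles after every miss, and it is capped at max_timeout_ds.
    """
    total = 0
    timeout = timeo_ds
    for _ in range(retrans):
        total += timeout
        timeout = min(timeout * 2, max_timeout_ds)
    return total


for retrans in (3, 5):
    tenths = soft_mount_give_up_time(7, retrans)
    print(f"timeo=7 retrans={retrans}: {tenths / 10:.1f} s")
```

Under these assumptions the defaults give up after 4.9 seconds, well inside a 10 second stall, while retrans=5 would wait 21.7 seconds.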

     > this is the client causing the problem NOT the server.  It's
     > the read() that pauses for 10 seconds and then the NFS write
     > immediately returns EIO.  So...I don't think soft mounts have
     > anything to do with it.

I think it does.

Cheers,
  Trond


* Re: 2.4.8 NFS Problems
  2001-09-07 14:42     ` Trond Myklebust
@ 2001-09-07 15:46       ` Mike Black
  2001-09-08 10:53         ` Trond Myklebust
  0 siblings, 1 reply; 14+ messages in thread
From: Mike Black @ 2001-09-07 15:46 UTC (permalink / raw)
  To: trond.myklebust; +Cc: linux-kernel

But did you notice the network log:
07:02:07.481323 yeti.csihq.com.686186576 > picard.csihq.com.nfs: 1472 write [|nfs] (frag 28948:1480@0+)
07:02:07.481446 yeti.csihq.com > picard.csihq.com: (frag 28948:1480@1480+)
07:02:07.481569 yeti.csihq.com > picard.csihq.com: (frag 28948:1480@2960+)
07:02:07.481692 yeti.csihq.com > picard.csihq.com: (frag 28948:1480@4440+)
07:02:07.481814 yeti.csihq.com > picard.csihq.com: (frag 28948:1480@5920+)
07:02:07.481886 yeti.csihq.com > picard.csihq.com: (frag 28948:916@7400)
07:02:07.482321 picard.csihq.com.nfs > yeti.csihq.com.686186576: reply ok 136 write [|nfs] (DF)
07:02:07.482511 yeti.csihq.com.702963792 > picard.csihq.com.nfs: 108 commit [|nfs] (DF)
07:02:07.482642 picard.csihq.com.nfs > yeti.csihq.com.702963792: reply ok 128 commit (DF)

The file is being copied from yeti to picard.  Last packet seen is picard
telling yeti "OK" after the commit.
If soft timeouts were occurring shouldn't we be seeing packets from yeti
again with no response from picard?


* Re: 2.4.8 NFS Problems
  2001-09-07 15:46       ` Mike Black
@ 2001-09-08 10:53         ` Trond Myklebust
  2001-09-08 11:53           ` Mike Black
  0 siblings, 1 reply; 14+ messages in thread
From: Trond Myklebust @ 2001-09-08 10:53 UTC (permalink / raw)
  To: Mike Black; +Cc: linux-kernel

>>>>> " " == Mike Black <mblack@csihq.com> writes:

     > The file is being copied from yeti to picard.  Last packet seen
     > is picard telling yeti "OK" after the commit.  If soft timeouts
     > were occurring shouldn't we be seeing packets from yeti again
     > with no response from picard?

You are assuming that the last packet seen is the one that corresponds
to your read. In doing so, you are neglecting the fact that these are
asynchronous reads, and that file readahead can muddle the waters for
you.

Look, this is getting us nowhere. The bottom line is: if you are able
to reproduce the EIO on hard mounts it is a bug, and I'll be happy to
help you trace it. If it is occurring only on soft mounts, it is user
error...

Cheers,
  Trond


* Re: 2.4.8 NFS Problems
  2001-09-08 10:53         ` Trond Myklebust
@ 2001-09-08 11:53           ` Mike Black
  0 siblings, 0 replies; 14+ messages in thread
From: Mike Black @ 2001-09-08 11:53 UTC (permalink / raw)
  To: trond.myklebust; +Cc: linux-kernel

Did some more testing this A.M.

Soft Mount:
    Took me several tries -- had to run a tiobench on the server side to
make for some I/O contention.  Was able to get EIO error (since this is Sat
the system was pretty idle).
Hard Mount:
    Unable to reproduce even though 10 second timeouts could be seen.
Soft Mount (retrans=5)
    Unable to reproduce

Could be the interaction with ext3 where I/O gets bound up a while.  Just
long enough to trigger the timeouts for a soft mount.

* Re: 2.4.8 NFS Problems
@ 2001-12-20  9:44 Steffen Persvold
  2001-12-20 11:10 ` Trond Myklebust
  0 siblings, 1 reply; 14+ messages in thread
From: Steffen Persvold @ 2001-12-20  9:44 UTC (permalink / raw)
  To: lkml, nfs list, Neil Brown, Trond Myklebust

Hi guys,

I was searching on Google for some reports on the problem I'm seeing with our NFS server/clients and
found this thread. It looked somewhat the same (at least the result with the EIO is the same).

Parts of old message :

>From: Mike Black (mblack@csihq.com)
>Date: Sep 05 2001 

>I've been getting random NFS EIO errors for a few months but
>now it's repeatable. 
>Trying to copy a large file from one 2.4.8 SMP box to another
>is consistently failing (at different offsets each time).

Our setup is like this :

Server:
	RedHat 7.2 - kernel 2.4.9-13smp
        nfs-utils-0.3.1-13.7.2.1
	ext3 filesystem (73GB)

Clients:
	ia32 client - RedHat 6.2 - kernel 2.2.19-6.2.7enterprise
	mount-2.10r-0.6.x

	alpha client - RedHat 6.2 - kernel 2.2.19 (vanilla)
	mount-2.10r-5

	ia64 client - RedHat 7.1 - kernel 2.4.3-12smp
	mount-2.10r-5

I've seen the "Input/Output error" problem only on the Alpha and the IA64 clients and the problem is
occurring when making a static library (with 'ar'). The message is like this :

ar: xxxxxx/libmpi.a: Input/output error

The mountpoints are mounted like this :

ia32 client:
huey:/export/home/mpitest /home/mpitest nfs rw,v3,rsize=8192,wsize=8192,addr=huey 0 0

alpha client:
huey:/export/home/mpitest /home/mpitest nfs rw,v3,rsize=8192,wsize=8192,addr=huey 0 0

ia64 client:
huey:/export/home/mpitest /home/mpitest nfs rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=huey 0 0

I don't know why the "hard" and "lock" options don't appear on ia32 and alpha, but this might be
related to the /proc/mounts interface on the running kernel (these clients are running 2.2.19 while
the ia64 client is running 2.4). The automount entry looks like this :

/home           auto_home       rsize=8192,wsize=8192

So according to the nfs man pages the "hard" option should be default :

       hard           If an NFS file operation has a major timeout then report "server not
                      responding" on the console and continue retrying indefinitely.  This
                      is the default.

So what could be the problem here ? Is it an NFS server bug, an NFS client bug or an NFS/ext3 bug ? We
used to run RedHat 7.0 on this server with the 2.2.19-enterprise kernel, nfs-utils-0.3.1-7 and with
an ext2 filesystem. This problem did not occur back then.

Thanks,
-- 
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best   
 mailto:sp@scali.no  |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.12.2 -         
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >300MBytes/s and <4uS latency


* Re: 2.4.8 NFS Problems
  2001-12-20  9:44 Steffen Persvold
@ 2001-12-20 11:10 ` Trond Myklebust
  2001-12-20 14:40   ` Steffen Persvold
  0 siblings, 1 reply; 14+ messages in thread
From: Trond Myklebust @ 2001-12-20 11:10 UTC (permalink / raw)
  To: Steffen Persvold; +Cc: lkml, nfs list, Neil Brown

>>>>> " " == Steffen Persvold <sp@scali.no> writes:

    >> I've been getting random NFS EIO errors for a few months but
    >> now it's repeatable. Trying to copy a large file from one 2.4.8
    >> SMP box to another is consistently failing (at different
    >> offsets each time).

Please try the patch on

  http://www.fys.uio.no/~trondmy/src/2.4.17/linux-2.4.17-fattr.dif

that fixes at least 1 such EIO error which was discovered using fsx.

Cheers,
   Trond


* Re: 2.4.8 NFS Problems
  2001-12-20 11:10 ` Trond Myklebust
@ 2001-12-20 14:40   ` Steffen Persvold
  2001-12-20 20:27     ` Trond Myklebust
  0 siblings, 1 reply; 14+ messages in thread
From: Steffen Persvold @ 2001-12-20 14:40 UTC (permalink / raw)
  To: trond.myklebust; +Cc: lkml, nfs list, Neil Brown

Trond Myklebust wrote:
> 
> >>>>> " " == Steffen Persvold <sp@scali.no> writes:
> 
>     >> I've been getting random NFS EIO errors for a few months but
>     >> now it's repeatable. Trying to copy a large file from one 2.4.8
>     >> SMP box to another is consistently failing (at different
>     >> offsets each time).
> 
> Please try the patch on
> 
>   http://www.fys.uio.no/~trondmy/src/2.4.17/linux-2.4.17-fattr.dif
> 
> that fixes at least 1 such EIO error which was discovered using fsx.
> 

I can do that, but since one of the clients reporting this problem is an Alpha machine running
2.2.19 the patch won't do much good (not that the patch is architecture dependent, but it's only for
2.4.17). Has this patch been there since 2.2 or is it a new "feature" in the "stable" #:) 2.4
kernels.

Regards,
-- 
  Steffen Persvold   | Scalable Linux Systems |   Try out the world's best   
 mailto:sp@scali.no  |  http://www.scali.com  | performing MPI implementation:
Tel: (+47) 2262 8950 |   Olaf Helsets vei 6   |      - ScaMPI 1.12.2 -         
Fax: (+47) 2262 8951 |   N0621 Oslo, NORWAY   | >300MBytes/s and <4uS latency


* Re: 2.4.8 NFS Problems
  2001-12-20 14:40   ` Steffen Persvold
@ 2001-12-20 20:27     ` Trond Myklebust
  0 siblings, 0 replies; 14+ messages in thread
From: Trond Myklebust @ 2001-12-20 20:27 UTC (permalink / raw)
  To: Steffen Persvold; +Cc: lkml, nfs list, Neil Brown

>>>>> " " == Steffen Persvold <sp@scali.no> writes:

     > I can do that, but since one of the clients reporting this
     > problem is an Alpha machine running 2.2.19 the patch won't do
     > much good (not that the patch is architecture dependent, but
     > it's only for 2.4.17). Has this patch been there since 2.2 or
     > is it a new "feature" in the "stable" #:) 2.4 kernels.

All the problems fixed by the patch should be present in 2.2.19 too. I
don't really have time to backport the whole thing, but I've appended
a backport of the bit that is directly relevant to the EIO error.

Cheers,
   Trond

--- linux-2.2.19-up/fs/nfs/read.c.orig	Sun Mar 25 18:37:38 2001
+++ linux-2.2.19-up/fs/nfs/read.c	Thu Dec 20 21:25:13 2001
@@ -420,7 +420,7 @@
 {
 	struct nfs_read_data	*data = (struct nfs_read_data *) task->tk_calldata;
 	struct inode		*inode = data->inode;
-	int			count = data->res.count;
+	unsigned int		count = data->res.count;
 
 	dprintk("NFS: %4d nfs_readpage_result, (status %d)\n",
 		task->tk_pid, task->tk_status);
@@ -431,10 +431,15 @@
 		struct page *page = req->wb_page;
 		nfs_list_remove_request(req);
 
-		if (task->tk_status >= 0 && count >= 0) {
+		if (task->tk_status >= 0) {
+			char *p = page_address(page);
+			if (count < PAGE_CACHE_SIZE) {
+				memset(p + count, 0, PAGE_CACHE_SIZE - count);
+				count = 0;
+			} else
+				count -= PAGE_CACHE_SIZE;
 			flush_dcache_page(page_address(page)); /* Is this correct? */
 			set_bit(PG_uptodate, &page->flags);
-			count -= PAGE_CACHE_SIZE;
 		} else
 			set_bit(PG_error, &page->flags);
 		nfs_unlock_page(page);
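
The per-page accounting in the second hunk can be mirrored in a few lines of userspace code. This is only a sketch of the arithmetic, not kernel code: `split_into_pages` and the fixed `PAGE_SIZE` of 4096 are stand-ins chosen for the example. As the hunk reads, the old code kept subtracting PAGE_CACHE_SIZE from a signed count, so once a short reply ran out the count went negative and the remaining pages were flagged PG_error; the new code zero-fills whatever the reply did not cover and marks the page up to date.

```python
PAGE_SIZE = 4096  # stand-in for PAGE_CACHE_SIZE; the real value depends on the architecture


def split_into_pages(data, n_pages, page_size=PAGE_SIZE):
    """Spread a read reply over n_pages, zero-filling what the reply did not cover."""
    count = len(data)  # bytes actually returned by the server
    pages = []
    offset = 0
    for _ in range(n_pages):
        if count < page_size:
            # Short (or empty) page: keep what arrived, zero the tail.
            page = data[offset:offset + count] + bytes(page_size - count)
            count = 0
        else:
            page = data[offset:offset + page_size]
            count -= page_size
        pages.append(page)
        offset += page_size
    return pages


# A 5000-byte reply to a three-page request: one full page, one partial, one empty.
pages = split_into_pages(b"x" * 5000, 3)
print([len(p.rstrip(b"\0")) for p in pages])
```

Every page comes back full length; only the number of meaningful bytes in each differs.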

