[NFS] I/O Errors with hard mounts

* [NFS] I/O Errors with hard mounts
@ 2008-06-04 13:33 David Konerding
       [not found] ` <4f0f0cb0806040633x74fd0afbm94866cf85810f242-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: David Konerding @ 2008-06-04 13:33 UTC (permalink / raw)
  To: nfs

Hi,

We have a bunch of Linux clients (SLES 10 SP1) which mount a NetApp filer.

When the NetApp gets very, very busy (for example, one user is deleting
1 TByte of data while another user is running a 30-client throughput
test), it will stop responding to some requests.

Although we are using hard mounts, some users report that during the
hammering period, some of their file operations produce "I/O Error"
messages on their terminal.

We checked, and the hosts are indeed using hard mounting.  From our
reading, I/O Errors should only ever make it back to the user if we are
using soft mounting.

We're pretty sure the filer is not sending back an NFS_ERR response (and
we're pretty sure that wouldn't get reported to the user as an I/O
Error...)

At this point, we suspect there must be a path in the NFS implementation
that returns I/O Error to user space even with a hard mount.

Any ideas?

Dave

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
_______________________________________________
Please note that nfs@lists.sourceforge.net is being discontinued.
Please subscribe to linux-nfs@vger.kernel.org instead.
    http://vger.kernel.org/vger-lists.html#linux-nfs

* Re: [NFS] I/O Errors with hard mounts
       [not found] ` <4f0f0cb0806040633x74fd0afbm94866cf85810f242-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-06-04 15:20   ` Blake Golliher
  2008-06-04 16:17   ` Jeff Layton
  2008-06-04 18:19   ` Chuck Lever
  2 siblings, 0 replies; 13+ messages in thread
From: Blake Golliher @ 2008-06-04 15:20 UTC (permalink / raw)
  To: David Konerding, nfs

Can you take a trace on the client when the "i/o error" message shows
up?  I suppose that'd be hard, but that'd tell us pretty quick where the
message is coming from.  You could also take a trace from the filer if
that's easier using pktt.

-Blake


On 6/4/08 6:33 AM, "David Konerding" <dakoner@gmail.com> wrote:

> Hi,
>
> We have a bunch of Linux clients (SLES 10 SP1) which mount a NetApp filer.
>
> When the NetApp gets very, very busy, for example, one user is
> deleting 1Tbyte of data
> while another user is doing a 30 client throughput test, it will stop
> responding to some requests.
>
> Although we are using hard mounts, some users report that during the
> hammering period, some of their
> file operations produce "I/O Error" messages on their terminal.
>
> We checked, and the hosts are indeed using hard mounting.  From our
> reading, I/O Errors
> should only ever make it back to the user if we are using soft mounting.
>
> We're pretty sure the filer is not sending back an NFS_ERR response (and we're
> pretty sure that wouldn't get reported to the user as an I/O Error...)
>
> At this point, we suspect there must be a path in the NFS
> implementation that returns I/O Error to user
> space even with a hard mount.
>
> Any ideas?
>
> Dave



* Re: [NFS] I/O Errors with hard mounts
       [not found] ` <4f0f0cb0806040633x74fd0afbm94866cf85810f242-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2008-06-04 15:20   ` Blake Golliher
@ 2008-06-04 16:17   ` Jeff Layton
       [not found]     ` <20080604121723.5b6a53e6-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  2008-06-04 18:19   ` Chuck Lever
  2 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2008-06-04 16:17 UTC (permalink / raw)
  To: David Konerding; +Cc: nfs

On Wed, 4 Jun 2008 06:33:18 -0700
"David Konerding" <dakoner@gmail.com> wrote:

> Hi,
> 
> We have a bunch of Linux clients (SLES 10 SP1) which mount a NetApp filer.
> 
> When the NetApp gets very, very busy, for example, one user is
> deleting 1Tbyte of data
> while another user is doing a 30 client throughput test, it will stop
> responding to some requests.
> 
> Although we are using hard mounts, some users report that during the
> hammering period, some of their
> file operations produce "I/O Error" messages on their terminal.
> 
> We checked, and the hosts are indeed using hard mounting.  From our
> reading, I/O Errors
> should only ever make it back to the user if we are using soft mounting.
> 
> We're pretty sure the filer is not sending back an NFS_ERR response (and we're
> pretty sure that wouldn't get reported to the user as an I/O Error...)
> 
> At this point, we suspect there must be a path in the NFS
> implementation that returns I/O Error to user
> space even with a hard mount.
> 
> Any ideas?
> 

hard/soft only governs what happens when there is a major timeout (i.e.
the server doesn't respond within a given time). If there are other
errors (for instance, client side memory shortage, server starts
refusing connections, etc), then there can be errors returned to the
application.

EIO is pretty generic, and is often what you see when a more obscure
error is translated into what a syscall would expect. It can happen for
other reasons besides an RPC timeout.
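
As a rough illustration of the point above (not from the thread): from
the application's side, every one of these failure causes arrives as the
same errno, so the caller cannot tell a timeout from anything else. A
minimal Python sketch, where classify_os_error is a hypothetical helper
name:

```python
import errno
import os


def classify_os_error(exc: OSError) -> str:
    """Return a short label for the errno carried by an OSError."""
    if exc.errno is None:
        return "no errno set"
    if exc.errno == errno.EIO:
        # EIO is a catch-all; it says nothing about hard vs. soft mounts.
        return "EIO (generic I/O error; cause not visible from userspace)"
    if exc.errno == errno.ENOMEM:
        return "ENOMEM (memory shortage)"
    name = errno.errorcode.get(exc.errno, "unknown")
    return f"{name} ({os.strerror(exc.errno)})"


# Simulated failure; a real caller would catch the OSError raised by
# open(), os.listdir(), os.unlink() and so on.
print(classify_os_error(OSError(errno.EIO, "Input/output error")))
```

The label only records what the syscall reported; finding the actual
cause still needs traces from the client or the server.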

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [NFS] I/O Errors with hard mounts
       [not found]     ` <20080604121723.5b6a53e6-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-06-04 17:00       ` David Konerding
       [not found]         ` <4f0f0cb0806041000m7926d1e7m93f71ebaacd6c976-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: David Konerding @ 2008-06-04 17:00 UTC (permalink / raw)
  To: Jeff Layton; +Cc: nfs

>> Although we are using hard mounts, some users report that during the
>> hammering period, some of their
>> file operations produce "I/O Error" messages on their terminal.
>>
>> We checked, and the hosts are indeed using hard mounting.  From our
>> reading, I/O Errors
>> should only ever make it back to the user if we are using soft mounting.
>>
> hard/soft only governs what happens when there is a major timeout (i.e.
> the server doesn't respond within a given time). If there are other
> errors (for instance, client side memory shortage, server starts
> refusing connections, etc), then there can be errors returned to the
> application.
>

OK; we're already using TCP mounts, so I don't think that any new
client->server connections should occur after the mount is established.

Second, memory is not an issue; this happens on lightly loaded clients
with 64 GBytes of RAM, and the RAM is all cache and buffer.


> EIO is pretty generic, and is often what you see when a more obscure
> error is translated into what a syscall would expect. It can happen for
> other reasons besides an RPC timeout.


OK, so our best bet to debug this is to:
1) reproduce the problem
2) when the problem occurs, make sure the command that got the EIO was
running under strace, so we know what syscall was being made
3) once we know what syscall was being made, backtrack to the kernel
source for that syscall
4) inspect the source to see what paths generate EIO

Dave

* Re: [NFS] I/O Errors with hard mounts
       [not found]         ` <4f0f0cb0806041000m7926d1e7m93f71ebaacd6c976-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-06-04 17:58           ` Jeff Layton
       [not found]             ` <20080604135817.0608273a-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Layton @ 2008-06-04 17:58 UTC (permalink / raw)
  To: David Konerding; +Cc: nfs

On Wed, 4 Jun 2008 10:00:16 -0700
"David Konerding" <dakoner@gmail.com> wrote:

> >> Although we are using hard mounts, some users report that during the
> >> hammering period, some of their
> >> file operations produce "I/O Error" messages on their terminal.
> >>
> >> We checked, and the hosts are indeed using hard mounting.  From our
> >> reading, I/O Errors
> >> should only ever make it back to the user if we are using soft mounting.
> >>
> > hard/soft only governs what happens when there is a major timeout (i.e.
> > the server doesn't respond within a given time). If there are other
> > errors (for instance, client side memory shortage, server starts
> > refusing connections, etc), then there can be errors returned to the
> > application.
> >
> 
> OK; we're already using TCP mounts, so I don't think that any new
> client->server connections
> should occur after the mount is established.
> 

Unless the connection is broken for some reason and the socket has
to be reconnected.

> Second, memory is not an issue; this happens on lightly loaded clients
> with 64Gbytes RAM,
> and RAM is all cache and buffer.
> 

Yeah, you'd probably get a -ENOMEM or something if memory were short. I
was just offering up that as an obvious way to get errors even if
you're hard mounting.

> 
> > EIO is pretty generic, and is often what you see when a more obscure
> > error is translated into what a syscall would expect. It can happen for
> > other reasons besides an RPC timeout.
> 
> 
> OK, so, our best bet to debug this, is to:
> 1) reproduce the problem
> 2) when the problem occurs, make sure the command that got the EIO was
> running under strace, so we know what syscall was being made
> 3) when we know what syscall was being made, backtrack to the kernel
> source for that syscall
> 4) inspect the source to see what paths generate EIO
> 
> Dave

Getting straces of the apps failing might be helpful, particularly if
it's always in the same syscalls. I have a hunch though that you'll find
yourself in the twisty maze of RPC code. In that case, knowing the
particular syscalls might not be that informative.

Looking at network captures might also be helpful. If you can correlate
the straces with what's going over the wire, then you might be able to
determine whether this error is being generated as a result of a NFS
error from the server or something else entirely.

NFS/RPC debugging might also be helpful (see rpcdebug manpage and note
that it can have significant performance impact).

-- 
Jeff Layton <jlayton@redhat.com>


* Re: [NFS] I/O Errors with hard mounts
       [not found] ` <4f0f0cb0806040633x74fd0afbm94866cf85810f242-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2008-06-04 15:20   ` Blake Golliher
  2008-06-04 16:17   ` Jeff Layton
@ 2008-06-04 18:19   ` Chuck Lever
  2 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2008-06-04 18:19 UTC (permalink / raw)
  To: David Konerding; +Cc: nfs

On Jun 4, 2008, at 9:33 AM, David Konerding wrote:
> Hi,
>
> We have a bunch of Linux clients (SLES 10 SP1) which mount a NetApp  
> filer.
>
> When the NetApp gets very, very busy, for example, one user is
> deleting 1Tbyte of data
> while another user is doing a 30 client throughput test, it will stop
> responding to some requests.
>
> Although we are using hard mounts, some users report that during the
> hammering period, some of their
> file operations produce "I/O Error" messages on their terminal.
>
> We checked, and the hosts are indeed using hard mounting.  From our
> reading, I/O Errors
> should only ever make it back to the user if we are using soft mounting.
>
> We're pretty sure the filer is not sending back an NFS_ERR response  
> (and we're
> pretty sure that wouldn't get reported to the user as an I/O Error...)
>
> At this point, we suspect there must be a path in the NFS
> implementation that returns I/O Error to user
> space even with a hard mount.
>
> Any ideas?

One place where this can occur is if XDR encoding or decoding fails.   
This is not too likely though.  I would look at the RPC client's  
decoding logic first: call_decode() and friends.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





* Re: [NFS] I/O Errors with hard mounts
       [not found]             ` <20080604135817.0608273a-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2008-06-04 21:07               ` David Konerding
  0 siblings, 0 replies; 13+ messages in thread
From: David Konerding @ 2008-06-04 21:07 UTC (permalink / raw)
  To: Jeff Layton; +Cc: nfs

>
> Getting straces of the apps failing might be helpful, particularly if
> it's always in the same syscalls. I have a hunch though that you'll find
> yourself in the twisty maze of RPC code. In that case, knowing the
> particular syscalls might not be that informative.
>
> Looking at network captures might also be helpful. If you can correlate
> the straces with what's going over the wire, then you might be able to
> determine whether this error is being generated as a result of a NFS
> error from the server or something else entirely.
>

One hint is that if I run ls, and hit Control-C while it's trolling
through filer directories (but not local dirs), I get an I/O Error on
the command line.  This may not reproduce our rm problems (since those
don't have Control-C events), but here's the last part of the strace:

open("src/modules", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
getdents64(3, 0x534808, 32768)          = -1 EIO (Input/output error)
--- SIGINT (Interrupt) @ 0 (0) ---

We're trying to reproduce the problem with an uninterrupted rm under strace.
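
For longer strace logs, the failing calls can be picked out
mechanically. The following is only a sketch, assuming plain strace
output with no PID or timestamp prefixes; failed_syscalls and
FAILED_CALL are hypothetical names, and the sample lines are the ones
shown above:

```python
import re

# Matches strace lines such as:
#   getdents64(3, 0x534808, 32768)          = -1 EIO (Input/output error)
FAILED_CALL = re.compile(
    r"^(?P<syscall>\w+)\((?P<args>.*)\)\s+=\s+-1\s+(?P<errno>E[A-Z0-9]+)\b"
)


def failed_syscalls(lines, wanted="EIO"):
    """Yield (syscall, args) for every strace line that failed with `wanted`."""
    for line in lines:
        m = FAILED_CALL.match(line.strip())
        if m and m.group("errno") == wanted:
            yield m.group("syscall"), m.group("args")


sample = [
    'open("src/modules", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3',
    "fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0",
    "fcntl(3, F_SETFD, FD_CLOEXEC)           = 0",
    "getdents64(3, 0x534808, 32768)          = -1 EIO (Input/output error)",
    "--- SIGINT (Interrupt) @ 0 (0) ---",
]

for syscall, args in failed_syscalls(sample):
    print(syscall, args)
```

Run against the rm trace once it is captured, this would show whether
the EIO always comes from the same syscall.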

Dave


* Re: [NFS] I/O Errors with hard mounts
@ 2008-06-04 22:45 Ricardo Labiaga
       [not found] ` <927260.87785.qm-KtJlQ5K7SlOvuULXzWHTWIglqE1Y4D90QQ4Iyu8u01E@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Ricardo Labiaga @ 2008-06-04 22:45 UTC (permalink / raw)
  To: dakoner; +Cc: nfs

Does /var/log/messages show any errors around the same time?  In
addition to the network trace and rpcdebug on the client, take a look at
"nfsstat -d" on the filer.  Is the filer dropping the connection?  Look
for "dropped with EAGAIN" or "dropped from vol offline" in the output.
This will help narrow down the problem.

- ricardo

> -----Original Message-----
> From: David Konerding [mailto:dakoner@gmail.com]
> Sent: Wednesday, June 04, 2008 6:33 AM
> To: nfs@lists.sourceforge.net
> Subject: [NFS] I/O Errors with hard mounts
>
> Hi,
>
> We have a bunch of Linux clients (SLES 10 SP1) which mount a
> NetApp filer.
>
> When the NetApp gets very, very busy, for example, one user is
> deleting 1Tbyte of data
> while another user is doing a 30 client throughput test, it will stop
> responding to some requests.
>
> Although we are using hard mounts, some users report that during the
> hammering period, some of their
> file operations produce "I/O Error" messages on their terminal.
>
> We checked, and the hosts are indeed using hard mounting.  From our
> reading, I/O Errors
> should only ever make it back to the user if we are using soft mounting.
>
> We're pretty sure the filer is not sending back an NFS_ERR
> response (and we're
> pretty sure that wouldn't get reported to the user as an I/O Error...)
>
> At this point, we suspect there must be a path in the NFS
> implementation that returns I/O Error to user
> space even with a hard mount.
>
> Any ideas?
>
> Dave




* Re: [NFS] I/O Errors with hard mounts
       [not found] ` <927260.87785.qm-KtJlQ5K7SlOvuULXzWHTWIglqE1Y4D90QQ4Iyu8u01E@public.gmane.org>
@ 2008-06-04 22:56   ` David Konerding
  0 siblings, 0 replies; 13+ messages in thread
From: David Konerding @ 2008-06-04 22:56 UTC (permalink / raw)
  To: nfs

On Wed, Jun 4, 2008 at 3:45 PM, Ricardo Labiaga <labiaga@yahoo.com> wrote:
> Does /var/log/messages show any errors around the same time?  In addition to the network trace and rpcdebug on the client, take a look at "nfsstat -d" on the filer. Is the filer dropping the connection?  Look for "dropped with EAGAIN" or "dropped from vol offline" in the output.  This will help narrow down the problem.

So, sometimes when somebody deletes a lot of data (like the problem we
just observed), the deleting host, and often other hosts, do report
'filer not responding' in the logs.  However, operations that aren't
happening in the delete dir tend to work just fine (for example, iozone
could be running and doing pretty well).  Further, the most recent time
this happened, the host didn't report filer not responding.

This is the only EAGAIN reference I see:

assist queue (queued, split mbufs, drop for EAGAIN) = (0, 64478612, 94340)
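
To track this counter over time, the line can be split into named
values. This is a sketch only: the "label (names) = (values)" layout is
inferred from the single line above, not from any documented nfsstat
format, and parse_counter_line is a hypothetical helper name:

```python
import re

LINE = "assist queue (queued, split mbufs, drop for EAGAIN) = (0, 64478612, 94340)"

COUNTER = re.compile(
    r"^(?P<label>[^(]+)\((?P<names>[^)]*)\)\s*=\s*\((?P<values>[^)]*)\)\s*$"
)


def parse_counter_line(line):
    """Split 'label (n1, n2, ...) = (v1, v2, ...)' into a {name: int} dict."""
    m = COUNTER.match(line)
    if m is None:
        raise ValueError(f"unrecognised counter line: {line!r}")
    names = [n.strip() for n in m.group("names").split(",")]
    values = [int(v) for v in m.group("values").split(",")]
    if len(names) != len(values):
        raise ValueError("name/value count mismatch")
    return dict(zip(names, values))


counters = parse_counter_line(LINE)
print(counters["drop for EAGAIN"])
```

Sampling this before and after a heavy delete would show whether the
EAGAIN drop count grows during the period when users see errors.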


Dave


* Re: [NFS] I/O Errors with hard mounts
@ 2008-06-05  0:40 Ricardo Labiaga
  0 siblings, 0 replies; 13+ messages in thread
From: Ricardo Labiaga @ 2008-06-05  0:40 UTC (permalink / raw)
  To: dakoner; +Cc: nfs

Can you provide the entire nfsstat -d output on the filer?
(Apologies for the lack of subject line in previous reply)
- ricardo
> -----Original Message-----
> From: David Konerding [mailto:dakoner@gmail.com]
> Sent: Wednesday, June 04, 2008 3:56 PM
> To: nfs@lists.sourceforge.net
> Subject: Re: [NFS] I/O Errors with hard mounts
>
> On Wed, Jun 4, 2008 at 3:45 PM, Ricardo Labiaga
> <labiaga@yahoo.com> wrote:
> > Does /var/log/messages show any errors around the same
> > time?  In addition to the network trace and rpcdebug on the
> > client, take a look at "nfsstat -d" on the filer. Is the
> > filer dropping the connection?  Look for "dropped with
> > EAGAIN" or "dropped from vol offline" in the output.  This
> > will help narrow down the problem.
>
> So, sometimes when somebody deletes a lot of data (like the problem we
> just observed),
> the deleting host, and often other hosts, do report 'filer not
> responding' in the logs.
> However, operations that aren't happening in the delete dir tend to
> work just fine (for example, iozone could be running and doing pretty
> well).  Further, the most recent time this happened, the host didn't
> report filer not responding.
>
>
> This is the only EAGAIN reference I see:
>
> assist queue (queued, split mbufs, drop for EAGAIN) = (0,
> 64478612, 94340)
>
>
> Dave
>




* Re: [NFS] I/O Errors with hard mounts
@ 2008-06-06  1:00 Ricardo Labiaga
  0 siblings, 0 replies; 13+ messages in thread
From: Ricardo Labiaga @ 2008-06-06  1:00 UTC (permalink / raw)
  To: David Konerding; +Cc: nfs

You have a significant number of dropped connections, as indicated by
the high EAGAIN count.  I wouldn't be surprised if the 2.6.16 kernel
isn't handling the reconnection correctly and is propagating EIO to the
application.  There's been a fair amount of client-side work in the RPC
reconnection code recently.  Can you try with a recent kernel?

A network trace and rpcdebug output would be invaluable when you're able
to reproduce this.

- ricardo

On Wed, Jun 4, 2008 at 3:45 PM, Ricardo Labiaga <labiaga@yahoo.com> wrote:
>> Does /var/log/messages show any errors around the same time?
>> In addition to the network trace and rpcdebug on the client, take a
>> look at "nfsstat -d" on the filer.  Is the filer dropping the
>> connection?  Look for "dropped with EAGAIN" or "dropped from vol
>> offline" in the output.  This will help narrow down the problem.
>
> So, sometimes when somebody deletes a lot of data (like the problem we
> just observed),
> the deleting host, and often other hosts, do report 'filer not
> responding' in the logs.
> However, operations that aren't happening in the delete dir tend to
> work just fine (for example, iozone could be running and doing pretty
> well).  Further, the most recent time this happened, the host didn't
> report filer not responding.
>
> This is the only EAGAIN reference I see:
>
> assist queue (queued, split mbufs, drop for EAGAIN) = (0, 64478612, 94340)
>
> Dave



* Re: [NFS] I/O Errors with hard mounts
       [not found]   ` <4f0f0cb0806061638i35ae4f9bp423148d6acbb953b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-06-09 17:02     ` David Konerding
       [not found]       ` <4f0f0cb0806091002w7f0110fh17e40568c7eb5bb8-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: David Konerding @ 2008-06-09 17:02 UTC (permalink / raw)
  To: Ricardo Labiaga; +Cc: nfs

I collected some more information on the problem we are seeing.

Here's what I've got:

1) SuSE 10.1 (2.6.16 kernel): running ls -R, hit Control-C-- often see
an "I/O Error", for example:

/gne/home/aa/barfod.files/mac.backup/Avi's/TNFR-IgG/Mutants/mAbs:
11.15.91
/bin/ls: reading directory /gne/home/aa/barfod.files/mac.backup/Avi's/TNFR-IgG/Mutants/mAbs/11.15.91: Input/output error

Here's what I captured from RPC and NFS debugging.  No "disconnect"
message like I saw before, but:

Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 xprt_transmit(136)
Jun  9 09:50:30 lablnx01 kernel: RPC:      xs_tcp_send_request(136) = 136
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 xmit complete
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 sleep_on(queue "xprt_pending" time 4340153030)
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 added to queue ffff81046bca5d20 "xprt_pending"
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 setting alarm for 60000 ms
Jun  9 09:50:30 lablnx01 kernel: RPC:      wake_up_next(ffff81046bca5cd0 "xprt_resend")
Jun  9 09:50:30 lablnx01 kernel: RPC:      wake_up_next(ffff81046bca5c80 "xprt_sending")
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 sync task going to sleep
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 got signal
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 __rpc_wake_up_task (now 4340153035)
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 disabling timer
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 removed from queue ffff81046bca5d20 "xprt_pending"
Jun  9 09:50:30 lablnx01 kernel: RPC:      __rpc_wake_up_task done
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 sync task resuming
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 deleting timer
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206, return -512, status -512
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 release task
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 release request ffff81046f718000
Jun  9 09:50:30 lablnx01 kernel: RPC:      wake_up_next(ffff81046bca5d70 "xprt_backlog")
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 releasing UNIX cred ffff810c70dbba40
Jun  9 09:50:30 lablnx01 kernel: RPC:      rpc_release_client(ffff81046c15b800, 1)
Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 freeing task
Jun  9 09:50:30 lablnx01 kernel: NFS reply readdir: -512
Jun  9 09:50:30 lablnx01 kernel: NFS: find_dirent_page() returns -5
Jun  9 09:50:30 lablnx01 kernel: NFS: readdir_search_pagecache() returned -5
Jun  9 09:50:30 lablnx01 kernel: NFS: dentry_delete(mAbs/11.15.91, 8)


Note the "reply readdir: -512", is that consistent with what you guys
are saying?
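
For reading traces like this one, it can help to turn the numeric
statuses into names. The sketch below is not from the thread; it
assumes the standard Linux numbering, where 512 to 514 are
kernel-internal restart codes from include/linux/errno.h that are
absent from userspace errno tables, and decode/statuses are
hypothetical helper names. It only labels the codes and does not show
where in the kernel -512 turns into -5:

```python
import errno
import re

# Kernel-internal codes (include/linux/errno.h); these are not meant to
# reach userspace, so Python's errno module has no entry for them.
KERNEL_INTERNAL = {512: "ERESTARTSYS", 513: "ERESTARTNOINTR", 514: "ERESTARTNOHAND"}

# Picks up "status -512", "readdir: -512", "returns -5", "returned -5".
STATUS = re.compile(r"(?:status|returns|returned|readdir:)\s+(-\d+)")


def decode(code):
    """Map a negative kernel return code to its symbolic name."""
    n = abs(code)
    return KERNEL_INTERNAL.get(n) or errno.errorcode.get(n, "unknown")


def statuses(lines):
    """Yield (code, name) for each status value found in the log lines."""
    for line in lines:
        for raw in STATUS.findall(line):
            yield int(raw), decode(int(raw))


log = [
    "kernel: RPC: 46206, return -512, status -512",
    "kernel: NFS reply readdir: -512",
    "kernel: NFS: find_dirent_page() returns -5",
    "kernel: NFS: readdir_search_pagecache() returned -5",
]

for code, name in statuses(log):
    print(code, name)
```

On the excerpt above this labels the RPC and readdir statuses as
ERESTARTSYS and the two page-cache lookups as EIO.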

Notably, I cannot reproduce the error message on a host with a newer kernel.

Dave


* Re: [NFS] I/O Errors with hard mounts
       [not found]       ` <4f0f0cb0806091002w7f0110fh17e40568c7eb5bb8-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2008-06-09 23:20         ` Trond Myklebust
  0 siblings, 0 replies; 13+ messages in thread
From: Trond Myklebust @ 2008-06-09 23:20 UTC (permalink / raw)
  To: David Konerding; +Cc: nfs, Ricardo Labiaga

On Mon, 2008-06-09 at 10:02 -0700, David Konerding wrote:
> I collected some more information on the problem we are seeing.
> 
> Here's what I've got:
> 
> 1) SuSE 10.1 (2.6.16 kernel): running ls -R, hit Control-C-- often see
> an "I/O Error", for example:
> 
> /gne/home/aa/barfod.files/mac.backup/Avi's/TNFR-IgG/Mutants/mAbs:
> 11.15.91
> /bin/ls: reading directory /gne/home/aa/barfod.files/mac.backup/Avi's/TNFR-IgG/Mutants/mAbs/11.15.91: Input/output error
> 
> Here's what I captured from RPC and NFS debugging.  No "disconnect"
> message like I saw before, but:
> 
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 xprt_transmit(136)
> Jun  9 09:50:30 lablnx01 kernel: RPC:      xs_tcp_send_request(136) = 136
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 xmit complete
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 sleep_on(queue "xprt_pending" time 4340153030)
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 added to queue ffff81046bca5d20 "xprt_pending"
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 setting alarm for 60000 ms
> Jun  9 09:50:30 lablnx01 kernel: RPC:      wake_up_next(ffff81046bca5cd0 "xprt_resend")
> Jun  9 09:50:30 lablnx01 kernel: RPC:      wake_up_next(ffff81046bca5c80 "xprt_sending")
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 sync task going to sleep
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 got signal
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 __rpc_wake_up_task (now 4340153035)
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 disabling timer
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 removed from queue ffff81046bca5d20 "xprt_pending"
> Jun  9 09:50:30 lablnx01 kernel: RPC:      __rpc_wake_up_task done
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 sync task resuming
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206 deleting timer
> Jun  9 09:50:30 lablnx01 kernel: RPC: 46206, return -512, status -512

That would be ERESTARTSYS, in other words, a fatal signal. Just out of
interest, could you send us the results of

  cat /proc/mounts

please?
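
The message does not say what to look for in that output. One plausible
reading is that the effective mount options (hard or soft, intr) are the
point of interest. Under that assumption, here is a minimal sketch that
extracts the option names for NFS entries from /proc/mounts text. The
function names are hypothetical, and treating a mount without "soft" as
hard relies on hard being the NFS default.

```python
def nfs_mount_options(mounts_text):
    """Map each NFS mount point to the set of its option names.

    mounts_text is the content of /proc/mounts: one mount per line, with
    whitespace-separated fields (device, mount point, type, options, ...).
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        if fstype not in ("nfs", "nfs4"):
            continue
        # Keep only the option name, dropping any "=value" part.
        result[mountpoint] = {opt.split("=", 1)[0]
                              for opt in options.split(",")}
    return result


def describe(opts):
    """Summarise the options that matter for signal and timeout handling."""
    # Assumption: a mount without "soft" is hard, since hard is the default.
    mode = "soft" if "soft" in opts else "hard"
    interruptible = "intr" in opts
    return mode, interruptible
```

To try it on a live client, pass open("/proc/mounts").read() to
nfs_mount_options() and call describe() on each resulting option set.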

Cheers
  Trond



end of thread, other threads:[~2008-06-10  0:23 UTC | newest]

Thread overview: 13+ messages
     [not found] <505115.86554.qm@web31405.mail.mud.yahoo.com>
     [not found] ` <4f0f0cb0806061638i35ae4f9bp423148d6acbb953b@mail.gmail.com>
     [not found]   ` <4f0f0cb0806061638i35ae4f9bp423148d6acbb953b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-06-09 17:02     ` [NFS] I/O Errors with hard mounts David Konerding
     [not found]       ` <4f0f0cb0806091002w7f0110fh17e40568c7eb5bb8-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-06-09 23:20         ` Trond Myklebust
2008-06-06  1:00 Ricardo Labiaga
  -- strict thread matches above, loose matches on Subject: below --
2008-06-05  0:40 Ricardo Labiaga
2008-06-04 22:45 Ricardo Labiaga
     [not found] ` <927260.87785.qm-KtJlQ5K7SlOvuULXzWHTWIglqE1Y4D90QQ4Iyu8u01E@public.gmane.org>
2008-06-04 22:56   ` David Konerding
2008-06-04 13:33 David Konerding
     [not found] ` <4f0f0cb0806040633x74fd0afbm94866cf85810f242-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-06-04 15:20   ` Blake Golliher
2008-06-04 16:17   ` Jeff Layton
     [not found]     ` <20080604121723.5b6a53e6-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-06-04 17:00       ` David Konerding
     [not found]         ` <4f0f0cb0806041000m7926d1e7m93f71ebaacd6c976-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2008-06-04 17:58           ` Jeff Layton
     [not found]             ` <20080604135817.0608273a-RtJpwOs3+0O+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2008-06-04 21:07               ` David Konerding
2008-06-04 18:19   ` Chuck Lever
