* 100% repeatable NFS v3 client hang
@ 2007-06-14 16:55 John McCorquodale
2007-06-14 18:03 ` Trond Myklebust
0 siblings, 1 reply; 5+ messages in thread
From: John McCorquodale @ 2007-06-14 16:55 UTC (permalink / raw)
To: nfs
Guys,
I've been having a reliable problem with NFS v3 since 2.6.19 (possibly earlier;
I didn't test), and just confirmed still present in 2.6.21.5 (x86_64). The
problem is that issuing the command:
$ dd if=/dev/zero of=biggy bs=1G count=50
100% reliably results in a hung client and completely idle server after a
few (different each time, usually between 5 and 10) GB have been written.
No output via dmesg at all. Sometimes, however, when not writing large files,
I'll see an occasional dmesg "NFS: desynchronized value of nfs_i.ncommit."
It proceeds at about 30MB/s until the client eats it, and then it never wakes
up again. The root fs is mounted via nfs v3, so once it eats it I can't log
on or anything. Serial console outputs no information during/after the eat-it
condition, so it's not a panic. Pings continue. Feels deadlocky.
I saw an old discussion/patch about something that sounded like my problem
(but the old kernel+patch didn't fix my problem) here:
http://lkml.org/lkml/2007/4/21/112
This problem has been reliable and present for months for me, and I assumed
that it was so obvious that anybody would see it immediately, and thus
concluded that 'nobody was working on v3 anymore'. Enough eyebrows were
raised by that comment by my post on the v4 list this morning:
http://linux-nfs.org/pipermail/nfsv4/2007-June/006183.html
that I am forced to re-evaluate that conclusion -- it appears likely that
this may be something unique to my hardware.
The only thing at all weird about my platform is that the servers (not the
clients) are nvidia forcedeth gigE, which has never inspired my confidence.
The forcedeth in 2.6.21 seems fine, 'tho, as it regularly sends >800GB via
nc for backups without hiccup or incident.
Anyway, last bit of data is that I just switched to nfs4 and the problem is
gone. I have successfully dd'd four 50G files-of-zeros simultaneously with
no ill effects.
Anyway, I have a shadow server/client that I could conceivably bring v3 up on
and run tests with if you guys would like to use me as a debugging agent or
would like something to log in to.
I was going to ignore this, but when Trond's eyebrows went up on the v4 list,
I figured I ought to get off my duff and report it.
Cheers,
-mcq
-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
* 100% repeatable NFS v3 client hang
@ 2007-06-14 17:04 John McCorquodale
2007-06-14 17:11 ` John McCorquodale
0 siblings, 1 reply; 5+ messages in thread
From: John McCorquodale @ 2007-06-14 17:04 UTC (permalink / raw)
To: nfs
Guys,
I've been having a reliable problem with NFS v3 since 2.6.19 (possibly earlier;
I didn't test), and just confirmed still present in 2.6.21.5 (x86_64). The
problem is that issuing the command:
$ dd if=/dev/zero of=biggy bs=1G count=50
100% reliably results in a hung client and completely idle server after a
few (different each time, usually between 5 and 10) GB have been written.
No output via dmesg at all. Sometimes, however, when not writing large files,
I'll see an occasional dmesg "NFS: desynchronized value of nfs_i.ncommit."
It proceeds at about 30MB/s until the client eats it, and then it never wakes
up again. The root fs is mounted via nfs v3, so once it eats it I can't log
on or anything. Serial console outputs no information during/after the eat-it
condition, so it's not a panic. Pings continue. Feels deadlocky.
I saw an old discussion/patch about something that sounded like my problem here:
http://lkml.org/lkml/2007/4/21/112
but this old kernel+patch did not fix my problem.
This problem has been reliable and present for months for me, and I assumed
that it was so obvious that anybody would see it immediately, and thus
concluded that 'nobody was working on v3 anymore'. Enough eyebrows were
raised by that comment by my post on the v4 list this morning:
http://linux-nfs.org/pipermail/nfsv4/2007-June/006183.html
that I am forced to re-evaluate that conclusion -- it appears likely that
this may be something unique to my configuration or personal luck.
The only thing at all weird about my platform is that the servers (not the
clients) are nvidia forcedeth gigE, which has never inspired my confidence.
The forcedeth in 2.6.21 seems fine, 'tho, as it regularly sends >800GB via
nc for backups without hiccup or incident.
Anyway, last bit of data is that I just switched to nfs4 and the problem is
gone. I have successfully dd'd four 50G files-of-zeros simultaneously with
no ill effects.
Anyway, I have a shadow server/client that I could conceivably bring v3 up on
and run tests with if you guys would like to use me as a debugging agent or
would like something to log in to.
I was going to ignore this, but when Trond's eyebrows went up on the v4 list,
I figured I ought to get off my duff and report it.
Cheers,
-mcq
P.S. If this comes thru twice, my apologies. I wasn't subscribed the first
time I sent it and I assume it got blackholed.
* Re: 100% repeatable NFS v3 client hang
2007-06-14 17:04 100% repeatable NFS v3 client hang John McCorquodale
@ 2007-06-14 17:11 ` John McCorquodale
0 siblings, 0 replies; 5+ messages in thread
From: John McCorquodale @ 2007-06-14 17:11 UTC (permalink / raw)
To: nfs
> Serial console outputs no information during/after the eat-it condition
Whoops, checked my notes. This is a lie. The server says nothing on its
console during eat-it. The client says:
server not responding
server not responding
server not responding
OK
OK
server not responding
server not responding
server not responding
server not responding
server not responding
server not responding
server not responding
server not responding
server not responding
(hang)
-mcq
* Re: 100% repeatable NFS v3 client hang
2007-06-14 16:55 John McCorquodale
@ 2007-06-14 18:03 ` Trond Myklebust
2007-06-14 18:09 ` John McCorquodale
0 siblings, 1 reply; 5+ messages in thread
From: Trond Myklebust @ 2007-06-14 18:03 UTC (permalink / raw)
To: John McCorquodale; +Cc: nfs
On Thu, 2007-06-14 at 09:55 -0700, John McCorquodale wrote:
> Guys,
>
> I've been having a reliable problem with NFS v3 since 2.6.19 (possibly earlier;
> I didn't test), and just confirmed still present in 2.6.21.5 (x86_64). The
> problem is that issuing the command:
>
> $ dd if=/dev/zero of=biggy bs=1G count=50
>
> 100% reliably results in a hung client and completely idle server after a
> few (different each time, usually between 5 and 10) GB have been written.
> No output via dmesg at all. Sometimes, however, when not writing large files,
> I'll see an occasional dmesg "NFS: desynchronized value of nfs_i.ncommit."
> It proceeds at about 30MB/s until the client eats it, and then it never wakes
> up again. The root fs is mounted via nfs v3, so once it eats it I can't log
> on or anything. Serial console outputs no information during/after the eat-it
> condition, so it's not a panic. Pings continue. Feels deadlocky.
>
> I saw an old discussion/patch about something that sounded like my problem
> (but the old kernel+patch didn't fix my problem) here:
> http://lkml.org/lkml/2007/4/21/112
>
> This problem has been reliable and present for months for me, and I assumed
> that it was so obvious that anybody would see it immediately, and thus
> concluded that 'nobody was working on v3 anymore'. Enough eyebrows were
> raised by that comment by my post on the v4 list this morning:
> http://linux-nfs.org/pipermail/nfsv4/2007-June/006183.html
> that I am forced to re-evaluate that conclusion -- it appears likely that
> this may be something unique to my hardware.
>
> The only thing at all weird about my platform is that the servers (not the
> clients) are nvidia forcedeth gigE, which has never inspired my confidence.
> The forcedeth in 2.6.21 seems fine, 'tho, as it regularly sends >800GB via
> nc for backups without hiccup or incident.
>
> Anyway, last bit of data is that I just switched to nfs4 and the problem is
> gone. I have successfully dd'd four 50G files-of-zeros simultaneously with
> no ill effects.
>
> Anyway, I have a shadow server/client that I could conceivably bring v3 up on
> and run tests with if you guys would like to use me as a debugging agent or
> would like something to log in to.
>
> I was going to ignore this, but when Trond's eyebrows went up on the v4 list,
> I figured I ought to get off my duff and report it.
The write() code had to go through emergency surgery in the 2.6.20
timeframe, since the VM/MM code was becoming increasingly intolerant of
our aging asynchronous write design. This didn't affect just NFSv3, it
affected NFSv2 and v4 too since it was in generic code.
2.6.21 still had some problems, but 2.6.22-rcX has fixes for all
reported bugs and should be stable. At least I haven't heard of any
further issues.
If you can reproduce the hang, and send me the output from an 'echo t
>/proc/sysrq-trigger', then I'd be happy to look at the resulting dump.
Don't forget to specify a large buffer size when you use dmesg: I tend
to use 'dmesg -s 90000'.
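For anyone following along, the two commands above can be combined into a
small script. This is only a sketch under stated assumptions: the function
name capture_task_dump, the output file name sysrq-t.txt, and the DMESG
override variable are hypothetical and do not come from this thread; only
the 'echo t >/proc/sysrq-trigger' and 'dmesg -s 90000' steps are taken from
the message. It needs to run as root. On a client whose root fs sits on the
hung NFS mount, the dmesg step may itself block, in which case the serial
console is probably the only place the dump will be visible.

```shell
#!/bin/sh
# Sketch: trigger a SysRq 't' task dump and save the kernel log to a file.
# capture_task_dump, sysrq-t.txt and DMESG are illustrative names (assumptions).

capture_task_dump() {
    trigger="${1:-/proc/sysrq-trigger}"   # file that accepts the SysRq command
    outfile="${2:-sysrq-t.txt}"           # where the kernel log is saved
    bufsize="${3:-90000}"                 # buffer size passed to dmesg -s

    # Ask the kernel to dump the state of every task into its log buffer.
    echo t > "$trigger" || return 1

    # Read the log back with a large buffer so the dump is not cut short.
    "${DMESG:-dmesg}" -s "$bufsize" > "$outfile"
}

# Usage (as root, once the hang has been reproduced):
#   capture_task_dump
```

The arguments exist only so the function can be tried against scratch files
first; called with no arguments it uses the real trigger file and dmesg.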
It would also be nice if you could try to reproduce the problem on a
2.6.22-rc4 kernel, to see if it is still there.
Cheers
Trond
* Re: 100% repeatable NFS v3 client hang
2007-06-14 18:03 ` Trond Myklebust
@ 2007-06-14 18:09 ` John McCorquodale
0 siblings, 0 replies; 5+ messages in thread
From: John McCorquodale @ 2007-06-14 18:09 UTC (permalink / raw)
To: Trond Myklebust; +Cc: nfs
> 2.6.21 still had some problems, but 2.6.22-rcX has fixes for all
> reported bugs and should be stable. At least I haven't heard of any
> further issues.
>
> If you can reproduce the hang, and send me the output from an 'echo t
> >/proc/sysrq-trigger', then I'd be happy to look at the resulting dump.
> Don't forget to specify a large buffer size when you use dmesg: I tend
> to use 'dmesg -s 90000'.
> It would also be nice if you could try to reproduce the problem on a
> 2.6.22-rc4 kernel, to see if it is still there.
Will do as you instruct. It'll probably be the weekend before I can get the
shadow systems running for our experiments. I was running 2.6.22-rc3 at
one point and saw the problem, so we'll probably get something interesting
from the sysrq.
Thanks for the precise instructions; that's about the right level of detail
for my inexperience doing this sort of thing. :)
Cheers,
-mcq