* 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0
@ 2002-10-31 5:06 Andrew Ryan
2002-11-01 17:24 ` Trond Myklebust
0 siblings, 1 reply; 12+ messages in thread
From: Andrew Ryan @ 2002-10-31 5:06 UTC (permalink / raw)
To: nfs
2.4.20-rc1 still exhibits the same problem as 2.4.19 with the kmap1 patch,
which I reported a few weeks ago. I guess that patch was applied to
2.4.20, since it doesn't appear on Trond's patch page for 2.4.20.
I am still available to help duplicate or test fixes to this problem, if
anyone has any ideas. I can still use UDP but I'd really like to start
getting the speed of TCP.
thanks,
andrew
---------- Forwarded message ----------
Date: Mon, 14 Oct 2002 16:29:55 -0700 (PDT)
From: Andrew Ryan <andrewr@nam-shub.com>
To: nfs@lists.sourceforge.net
Subject: 2.4.19+RPC_ALL hangs running dbench 2.0
I've been running tests on the 2.4.19_NFS_ALL (the one from Oct 5) kernel
and seeing an easily reproducible hang on my machine (2x1.4 GHz PIII,
Compaq DL380G2, 4GB RAM), mounting a Netapp (F820 running 6.2R2) with the
mount options:
rw,tcp,nfsvers=3,rsize=32768,wsize=32768,intr,hard
The symptom is, I start a dbench run, and it starts up and runs for a
bit...
$ ~/dbench-2.0/dbench 16
clients started
16 23801 21.45 MB/sec
Then it hangs: the dbench process is still running, but the MB/sec
number keeps dropping rapidly, approaching 0.
At this point:
* Any commands in other shells that are currently running (e.g. 'top') are
hung.
* My other shells are not hung, but if I try to execute any commands, the
commands hang forever.
* I can kill the dbench process with Ctrl-C, but that just gives me a
shell that cannot execute any commands (they all hang, like the other
shells).
* The nmi_watchdog is never triggered, even though the system is
completely unresponsive from a user level.
When I Ctrl-C the hung dbench process, sometimes the kernel generates
an oops, but other times not. If I have kdb on, I can get a backtrace, but
I was hoping there was an easier way to figure out what is causing this
bug. The one oops I get says something about 'kernel BUG at
highmem.c:159!'
Note I do *NOT* get this error if I run without the NFS_ALL patch. I also tested
this with just the RPC_ALL patch and I get the same error. So it definitely has
to be something in the RPC_ALL patchset. I'm confused though, because this is
the patchset which claims to have specific fixes for HIGHMEM.
All I really want is a fast, stable client for my 4GB, 2 CPU boxes. I'd
use the stock 2.4.19 but the RPC_ALL patchset leads me to believe that
there are HIGHMEM bugs in the stock 2.4.19 NFS client.
I'm willing to do some testing to chase this down, if it helps.
andrew
-------------------------------------------------------
This sf.net email is sponsored by: Influence the future
of Java(TM) technology. Join the Java Community
Process(SM) (JCP(SM)) program now.
http://ads.sourceforge.net/cgi-bin/redirect.pl?sunm0004en
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
* Re: 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0
2002-10-31 5:06 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0 Andrew Ryan
@ 2002-11-01 17:24 ` Trond Myklebust
0 siblings, 0 replies; 12+ messages in thread
From: Trond Myklebust @ 2002-11-01 17:24 UTC (permalink / raw)
To: Andrew Ryan; +Cc: nfs
>>>>> " " == Andrew Ryan <andrewr@nam-shub.com> writes:
> 2.4.20rc1 still exhibits the same identical problem as
> 2.4.19+kmap1 patch which I identified a few weeks ago. I guess
> that patch was applied to 2.4.20, since it doesn't appear on
> Trond's patch page for 2.4.20.
Does the following patch work for you?
Cheers,
Trond
diff -u --recursive --new-file linux-2.4.20-rc1/net/sunrpc/xdr.c linux-2.4.20-fix/net/sunrpc/xdr.c
--- linux-2.4.20-rc1/net/sunrpc/xdr.c 2002-08-14 09:00:35.000000000 -0400
+++ linux-2.4.20-fix/net/sunrpc/xdr.c 2002-11-01 12:08:52.000000000 -0500
@@ -244,6 +244,7 @@
pglen -= base;
base += xdr->page_base;
ppage += base >> PAGE_CACHE_SHIFT;
+ pglen += base & ~PAGE_CACHE_MASK;
}
for (;;) {
flush_dcache_page(*ppage);
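
A rough way to see what the added line changes, as a sketch rather than the kernel code itself: the hunk skips `base` bytes into the page array, and the first page touched then starts at in-page offset `base & ~PAGE_CACHE_MASK`. The Python model below assumes 4 KiB pages, and assumes that the loop following the hunk handles one page per pass and subtracts one page size from `pglen` each time (the posted hunk does not show the loop body). The function names are made up for illustration.

```python
PAGE_SHIFT = 12                  # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT
PAGE_MASK = ~(PAGE_SIZE - 1)     # kernel-style mask that clears the in-page offset bits


def pages_spanned(page_base, base, pglen):
    """Reference answer: pages touched by bytes [page_base + base, page_base + pglen)."""
    start = page_base + base
    end = page_base + pglen      # exclusive
    return ((end - 1) >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1


def pages_walked(page_base, base, pglen, with_fix):
    """Count loop passes using the hunk's arithmetic (requires base < pglen)."""
    if base or page_base:
        pglen -= base
        base += page_base
        # 'ppage += base >> PAGE_SHIFT' only picks the starting page;
        # it does not affect how many pages the loop visits.
        if with_fix:
            pglen += base & ~PAGE_MASK   # the added line: offset inside the first page
    visited = 0
    while True:                  # assumed loop shape: one page per pass
        visited += 1
        if pglen <= PAGE_SIZE:
            break
        pglen -= PAGE_SIZE
    return visited


if __name__ == "__main__":
    args = (2048, 0, 4096)       # 4096 bytes starting 2048 bytes into a page
    print("pages spanned:     ", pages_spanned(*args))
    print("walked without fix:", pages_walked(*args, with_fix=False))
    print("walked with fix:   ", pages_walked(*args, with_fix=True))
```

Under those assumptions, a range that starts partway into a page can cross one more page boundary than its length alone suggests, and the unpatched arithmetic stops one page short.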
* Re: 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0
[not found] <3DC2E7B6.6090400@collab.net>
@ 2002-11-01 21:18 ` Andrew Ryan
2002-11-01 21:58 ` Trond Myklebust
0 siblings, 1 reply; 12+ messages in thread
From: Andrew Ryan @ 2002-11-01 21:18 UTC (permalink / raw)
To: Trond Myklebust; +Cc: nfs
On Friday 01 November 2002 12:44 pm, you wrote:
> " " == Andrew Ryan <andrewr@nam-shub.com> writes:
> > 2.4.20rc1 still exhibits the same identical problem as
> > 2.4.19+kmap1 patch which I identified a few weeks ago. I guess
> > that patch was applied to 2.4.20, since it doesn't appear on
> > Trond's patch page for 2.4.20.
>
> Does the following patch work for you?
So far so good on the crashes. I'm able to get through a complete run of
dbench using TCP mounts on 2.4.20rc1, which I haven't been able to do before
this.
However, according to nfsstat -c's "Client rpc stats", I'm getting around a
10:1 ratio of retransmits to calls. That's bad. I get 0 retransmits when
using 2.4.19+kmap1, and any previous kernel I've ever tried TCP mounting
with. I checked 2.4.20rc1-vanilla and it does the same thing, so it's not the
latest patch.
andrewr
* Re: 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0
2002-11-01 21:18 ` Andrew Ryan
@ 2002-11-01 21:58 ` Trond Myklebust
2002-11-03 5:21 ` Andrew Ryan
0 siblings, 1 reply; 12+ messages in thread
From: Trond Myklebust @ 2002-11-01 21:58 UTC (permalink / raw)
To: Andrew Ryan; +Cc: nfs
>>>>> " " == Andrew Ryan <andrewr@nam-shub.com> writes:
> However, according to nfsstat -c's "Client rpc stats", I'm
> getting around a 10:1 ratio of retransmits to calls. That's
> bad. I get 0 retransmits when using 2.4.19+kmap1, and any
> previous kernel I've ever tried TCP mounting with. I checked
> 2.4.20rc1-vanilla and it does the same thing, so it's not the
> latest patch.
It is probably just the (known) accounting error in the 'retransmit'
statistics. Try applying the patch
http://www.fys.uio.no/~trondmy/src/2.4.20-pre9/linux-2.4.20-01-call_start.dif
if it bothers you.
Note: the TCP timeout + retransmits will (unlike UDP) be of the order
of 60 seconds, so if these were real retransmits, you'd be seeing a
*very* noticeable slowdown...
Cheers,
Trond
* RE: 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0
@ 2002-11-02 14:57 Lever, Charles
2002-11-02 21:19 ` Trond Myklebust
0 siblings, 1 reply; 12+ messages in thread
From: Lever, Charles @ 2002-11-02 14:57 UTC (permalink / raw)
To: 'trond.myklebust@fys.uio.no'; +Cc: nfs, Andrew Ryan
> >>>>> " " == Andrew Ryan <andrewr@nam-shub.com> writes:
>
> > However, according to nfsstat -c's "Client rpc stats", I'm
> > getting around a 10:1 ratio of retransmits to calls. That's
> > bad. I get 0 retransmits when using 2.4.19+kmap1, and any
> > previous kernel I've ever tried TCP mounting with. I checked
> > 2.4.20rc1-vanilla and it does the same thing, so it's not the
> > latest patch.
>
> It is probably just the (known) accounting error in the
> 'retransmit' statistics. Try applying the patch
>
> http://www.fys.uio.no/~trondmy/src/2.4.20-pre9/linux-2.4.20-01-call_start.dif
> if it bothers you.
call_start changes the way RPCs are counted, not how retransmits are counted.
i suspect andrew is seeing partial transmits -- in 2.4.20-pre, partial
TCP transmissions are counted as retransmits. in fact, it looks like the
patch i sent in august for 2.4.20-pre5 to fix xprt_write_space was never
applied to 2.4. trond, can you ping marcelo on this? it's a pretty
important performance enhancement that should be in 2.4.20.
> Note: the TCP timeout + retransmits will (unlike UDP) be of the order
> of 60 seconds, so if these were real retransmits, you'd be seeing a
> *very* noticeable slowdown...
except that mount is broken and sets a 6 second TCP retransmission default
rather than the standard value of 60 seconds.
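
If the default is the concern, one option may be to pass the timeout explicitly at mount time. The NFS `timeo` mount option is expressed in tenths of a second, so 60 seconds corresponds to `timeo=600`. The helper below is only a sketch (the function name is made up), and whether an explicit value takes precedence over the default in the mount version discussed here is an assumption that would need checking.

```python
def with_timeo(options, seconds):
    """Return a mount option string carrying an explicit timeo for `seconds`.

    timeo is given in tenths of a second; any existing timeo= entry is replaced.
    """
    kept = [opt for opt in options.split(",") if opt and not opt.startswith("timeo=")]
    kept.append("timeo=%d" % round(seconds * 10))
    return ",".join(kept)


if __name__ == "__main__":
    # Mount options quoted earlier in the thread.
    base_opts = "rw,tcp,nfsvers=3,rsize=32768,wsize=32768,intr,hard"
    print(with_timeo(base_opts, 60))
```

The resulting string can be passed to `mount -o` in place of the original option list.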
* RE: 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0
[not found] <6440EA1A6AA1D5118C6900902745938E07D5503C@black.eng.netapp.com>
@ 2002-11-02 16:49 ` Andrew Ryan
0 siblings, 0 replies; 12+ messages in thread
From: Andrew Ryan @ 2002-11-02 16:49 UTC (permalink / raw)
To: Lever, Charles, 'trond.myklebust@fys.uio.no'; +Cc: nfs
At 06:57 AM 11/2/02 -0800, Lever, Charles wrote:
>i suspect andrew is seeing partial transmits -- in 2.4.20-pre, partial
>TCP transmissions are counted as retransmits. in fact, it looks like the
>patch i sent in august for 2.4.20-pre5 to fix xprt_write_space was never
>applied to 2.4. trond, can you ping marcelo on this? it's a pretty
>important performance enhancement that should be in 2.4.20.
It also seems that this patch was never included in Trond's
linux-2.4.20-NFS_ALL.dif patchset, although I found it in my email (sent
Aug 30, 2002). I'll test it out this weekend and post with/without results.
> > Note: the TCP timeout + retransmits will (unlike UDP) be of the order
> > of 60 seconds, so if these were real retransmits, you'd be seeing a
> > *very* noticeable slowdown...
>
>except that mount is broken and sets a 6 second TCP retransmission default
>rather than the standard value of 60 seconds.
Is this a big problem? Should we be upgrading/patching mount if we are
going to use TCP mounts?
thanks,
andrew
* RE: 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0
2002-11-02 14:57 Lever, Charles
@ 2002-11-02 21:19 ` Trond Myklebust
0 siblings, 0 replies; 12+ messages in thread
From: Trond Myklebust @ 2002-11-02 21:19 UTC (permalink / raw)
To: Lever, Charles; +Cc: nfs, Andrew Ryan
>>>>> " " == Charles Lever <Lever> writes:
> call_start changes the way RPCs are counted, not how
> retransmits are counted.
Wrong. It fixes the following bug too:
@@ -645,7 +663,6 @@
case -ENOMEM:
case -EAGAIN:
task->tk_action = call_transmit;
- clnt->cl_stats->rpcretrans++;
break;
default:
if (clnt->cl_chatty)
Cheers,
Trond
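
To illustrate why counting on the -EAGAIN path can inflate the statistic, here is a toy model, not the RPC client code. It assumes a request is pushed through a send buffer that accepts a fixed number of bytes per attempt, and that every short write sends the task back to the transmit step, as in the hunk above. The function name, the buffer sizes, and the flag are hypothetical.

```python
def send_request(size, accepted_per_attempt, count_partial_as_retrans):
    """Push `size` bytes through a buffer that takes a fixed amount per attempt.

    Returns the number of 'retransmits' recorded for this single request.
    No data is ever lost or resent in this model.
    """
    retrans = 0
    sent = 0
    while sent < size:
        sent += min(accepted_per_attempt, size - sent)
        if sent < size:
            # Short write: the transmit step is retried (the -EAGAIN case).
            if count_partial_as_retrans:
                retrans += 1     # the line the call_start hunk removes
    return retrans


if __name__ == "__main__":
    calls = 1000
    for old_accounting in (True, False):
        total = sum(send_request(32768, 4096, old_accounting) for _ in range(calls))
        print("old accounting" if old_accounting else "new accounting",
              "calls:", calls, "retrans:", total)
```

In this model every byte is sent exactly once, so any non-zero count under the old accounting reflects partial sends rather than real retransmissions.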
* RE: 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0
@ 2002-11-03 0:33 Lever, Charles
0 siblings, 0 replies; 12+ messages in thread
From: Lever, Charles @ 2002-11-03 0:33 UTC (permalink / raw)
To: 'trond.myklebust@fys.uio.no'; +Cc: nfs
> >>>>> " " == Charles Lever <Lever> writes:
>
> > call_start changes the way RPCs are counted, not how
> > retransmits are counted.
>
> Wrong. It fixes the following bug too:
>
> @@ -645,7 +663,6 @@
> case -ENOMEM:
> case -EAGAIN:
> task->tk_action = call_transmit;
> - clnt->cl_stats->rpcretrans++;
> break;
> default:
> if (clnt->cl_chatty)
oops, i forgot about that. it doesn't obviate the need for the
xprt_write_space fix, however.
* Re: 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0
2002-11-01 21:58 ` Trond Myklebust
@ 2002-11-03 5:21 ` Andrew Ryan
2002-11-03 14:16 ` Quentin Fennessy
0 siblings, 1 reply; 12+ messages in thread
From: Andrew Ryan @ 2002-11-03 5:21 UTC (permalink / raw)
To: trond.myklebust; +Cc: nfs
At 10:58 PM 11/1/02 +0100, Trond Myklebust wrote:
>It is probably just the (known) accounting error in the 'retransmit'
>statistics. Try applying the patch
>
>
>http://www.fys.uio.no/~trondmy/src/2.4.20-pre9/linux-2.4.20-01-call_start.dif
>
>if it bothers you.
Yes, you are right, it seems the counter is broken, and performance is
fine. I did bonnie++ runs with 2.4.20rc1 kernels with and without the
call_start patch, and didn't see any significant difference.
But, it does bother me since I want to be able to diagnose network errors
if I ever have them, and with a busted counter it will be impossible to
tell what my real error rate is. Can the call_start patch go to Marcelo for
2.4.20rc2?
After a bonnie run with the call_start patch, I get this, which is much
better :)
$ nfsstat -c
Client rpc stats:
calls retrans authrefrsh
1648050 0 0
thanks,
andrew
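
For anyone tracking this counter over time, a small parser along these lines may help. It is a sketch that assumes the three-line "Client rpc stats" layout shown above; output from other nfs-utils versions may contain extra sections or different spacing. The function names are made up, and the sample text is the output quoted in this message.

```python
def parse_client_rpc_stats(text):
    """Return {'calls': ..., 'retrans': ..., ...} from `nfsstat -c` style text."""
    lines = [line.strip() for line in text.splitlines()]
    for i, line in enumerate(lines):
        if line.lower().startswith("client rpc stats"):
            names = lines[i + 1].split()                     # column headings
            values = [int(v) for v in lines[i + 2].split()]  # counter values
            return dict(zip(names, values))
    raise ValueError("no 'Client rpc stats' section found")


def retrans_ratio(stats):
    """Retransmits per call; 0.0 when no calls have been made."""
    return stats["retrans"] / stats["calls"] if stats["calls"] else 0.0


sample = """Client rpc stats:
calls retrans authrefrsh
1648050 0 0
"""

stats = parse_client_rpc_stats(sample)
print(stats, retrans_ratio(stats))
```

A ratio near the 10:1 reported earlier in the thread would show up here as a value around 10.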
* Re: 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0
2002-11-03 5:21 ` Andrew Ryan
@ 2002-11-03 14:16 ` Quentin Fennessy
0 siblings, 0 replies; 12+ messages in thread
From: Quentin Fennessy @ 2002-11-03 14:16 UTC (permalink / raw)
To: Andrew Ryan; +Cc: trond.myklebust, nfs
>>>>> Andrew Ryan writes:
> At 10:58 PM 11/1/02 +0100, Trond Myklebust wrote:
>> It is probably just the (known) accounting error in the 'retransmit'
>> statistics. Try applying the patch
>>
>> http://www.fys.uio.no/~trondmy/src/2.4.20-pre9/linux-2.4.20-01-call_start.dif
>>
>> if it bothers you.
> Yes, you are right, it seems the counter is broken, and performance is
> fine. I did bonnie++ runs with 2.4.20rc1 kernels with and without the
> call_start patch, and didn't see any significant difference.
> But, it does bother me since I want to be able to diagnose network errors if
> I ever have them, and with a busted counter it will be impossible to tell
> what my real error rate is. Can the call_start patch go to Marcelo for
> 2.4.20rc2?
I'm an end user with approximately 3000 Linux NFS clients.
I will really appreciate correct counters when I'm debugging
nfs server, client and network problems. I've had to ignore
spurious counts of nfs retrans while debugging network issues.
I hope this patch will be in 2.4.20rc2. Thanks to all of you for
providing an excellent client-side implementation. And I am
serious.
> After a bonnie run with the call_start patch, I get this, which is much
> better :)
> $ nfsstat -c
> Client rpc stats:
> calls retrans authrefrsh
> 1648050 0 0
--
Quentin Fennessy Quentin.Fennessy@amd.com
Office: 512.602.3873
Cell: 512.694.7489
* RE: 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0
@ 2002-11-03 17:05 Lever, Charles
0 siblings, 0 replies; 12+ messages in thread
From: Lever, Charles @ 2002-11-03 17:05 UTC (permalink / raw)
To: 'Andrew Ryan'; +Cc: nfs
> > >It is probably just the (known) accounting error in the 'retransmit'
> > >statistics. Try applying the patch
> > >
> > >
> > >http://www.fys.uio.no/~trondmy/src/2.4.20-pre9/linux-2.4.20-01-call_start.dif
> >
> >if it bothers you.
>
> Yes, you are right, it seems the counter is broken, and performance is
> fine. I did bonnie++ runs with 2.4.20rc1 kernels with and without the
> call_start patch, and didn't see any significant difference.
the call_start patch only changes RPC accounting. it should have no
effects (either positive or negative) on performance.
did you try the xprt_write_space patch? that fixes a real problem,
and addresses the root cause of the high retransmission counts you see.
> But, it does bother me since I want to be able to diagnose network errors
> if I ever have them, and with a busted counter it will be impossible to
> tell what my real error rate is.
the retransmit counter is supposed to count only whole RPC retransmissions,
not TCP partial transmissions, which are not due to network problems. if
the xprt_write_space patch was included, the false retransmit rate would
be much lower anyway (although not zero).
> Can the call_start patch go to Marcelo for 2.4.20rc2?
call_start is available in Trond's NFS_ALL patch for now. i think 2.4.20
is closed for changes like this.
-------------------------------------------------------
This SF.net email is sponsored by: ApacheCon, November 18-21 in
Las Vegas (supported by COMDEX), the only Apache event to be
fully supported by the ASF. http://www.apachecon.com
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs
^ permalink raw reply [flat|nested] 12+ messages in thread
* RE: 2.4.20-rc1 NFS/TCP client (still) hangs running dbench 2.0
[not found] <6440EA1A6AA1D5118C6900902745938E07D55047@black.eng.netapp.com>
@ 2002-11-04 16:51 ` Andrew Ryan
0 siblings, 0 replies; 12+ messages in thread
From: Andrew Ryan @ 2002-11-04 16:51 UTC (permalink / raw)
To: Lever, Charles; +Cc: nfs
At 09:05 AM 11/3/02 -0800, Lever, Charles wrote:
>did you try the xprt_write_space patch? that fixes a real problem,
>and addresses the root cause of the high retransmission counts you see.
Has this patch been posted on Trond's site? If so can you refer me to it?
It's not marked under the 2.4.20 directory. If it hasn't been posted, can
it be?
>call_start is available in Trond's NFS_ALL patch for now. i think 2.4.20
>is closed for changes like this.
OK, fair enough.